CN111722954A - Server abnormity positioning method and device, storage medium and server

Info

Publication number: CN111722954A
Application number: CN202010623604.7A
Authority: CN
Inventors: 余新来
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2020-06-30
Filing date: 2020-06-30
Publication date: 2020-09-29

Abstract

The application provides a server abnormity positioning method, a device, a storage medium and a server, wherein the method comprises the following steps: when the server fails, inquiring a system event log stored in the mainboard manager, wherein the system event log comprises a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restarting type parameter, the count value of the counter is used for metering the restarting times of the server, and the restarting type parameter is used for representing the latest starting type of the server; acquiring restarting state information of the server according to the system event log; and judging the fault position of the server according to the restarting state information. According to the method and the device, the fault position of the server is judged according to the restarting state information, so that fault positioning is realized, the fault position can be found conveniently and rapidly, and the maintenance efficiency of the server is improved.

Description

Server abnormity positioning method and device, storage medium and server

Technical Field

The present disclosure relates to the field of server maintenance technologies, and in particular, to a method and an apparatus for locating server abnormality, a storage medium, and a server.

Background

With the rise of cloud computing, the number of X86 servers deployed in data centers has multiplied. Monitoring and diagnosing abnormal server behavior, particularly abnormal downtime and restarts, is an important task for server research and development departments and for operation and maintenance departments. The server's motherboard manager is responsible for monitoring such failures and abnormal restarts.

In the currently used technology, the motherboard manager relies on recording the SEL (system event log) events sent by the BIOS, and judges from these event records which stage the server start-up has reached and whether an abnormal restart has occurred. However, in some actual failures the BIOS has not yet executed its first instruction when the server fails. In that case it is difficult to determine why the system shows a black screen, and it cannot be determined whether the system has restarted, so the fault cannot be located.

In view of the above problems, no effective technical solution exists at present.

Disclosure of Invention

An object of the embodiments of the present application is to provide a method and an apparatus for server exception location, a storage medium, and a server, so as to improve server maintenance efficiency.

In a first aspect, an embodiment of the present application provides a server exception location method, where the server includes a processor, a complex programmable logic device, a motherboard manager, a BIOS, and a south bridge chip, where the complex programmable logic device is connected to the motherboard manager, the south bridge chip, and the processor, the south bridge chip is connected to the BIOS and the processor, and the BIOS is connected to the motherboard manager; the method is applied to the motherboard manager and comprises the following steps:

when the server fails, inquiring a system event log stored in the mainboard manager, wherein the system event log comprises a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restarting type parameter, the count value of the counter is used for metering the restarting times of the server, and the restarting type parameter is used for representing the latest starting type of the server;

acquiring restarting state information of the server according to the system event log;

and judging the fault position of the server according to the restarting state information.

Optionally, in the method for locating an exception of a server according to the embodiment of the present application, before the step of querying a system event log stored in the motherboard manager, the method further includes:

when an EventTrigger interrupt signal of the complex programmable logic device is detected, reading a count value of a counter and a restart type parameter which are stored in the complex programmable logic device;

when the count value of the counter changes relative to the count value of the counter read last time, judging that the server is restarted, and generating a corresponding restart event record according to a restart type parameter;

and updating a first system event log in the system event logs according to the restart event record.

Optionally, in the server anomaly positioning method according to the embodiment of the present application, the system event log further includes a second system event log; the second system event log is generated based on a plurality of running event records of the BIOS and is used for judging the stage reached after the system restarts and enters the BIOS.

Optionally, in the server abnormal location method according to the embodiment of the present application, the restart state information includes: a reboot type of the server and a phase to which the server reboots.

Optionally, in the server exception location method according to the embodiment of the present application, the plurality of running event records include a BIOS start event record;

the method further comprises the steps of:

receiving a BIOS starting event record sent by the BIOS, wherein the BIOS starting event record is generated when the BIOS starts to start;

and updating the second system event log according to the BIOS starting event record.

Optionally, in the server exception positioning method according to the embodiment of the present application, the plurality of running event records further include a display initialization completion event record;

the method further comprises the steps of:

receiving a display parameter initialization completion event record sent by the BIOS, wherein the display parameter initialization completion event record is generated after the BIOS completes initialization operation on display parameters;

and updating the second system event log according to the display parameter initialization completion event record.

Optionally, in the server exception location method according to the embodiment of the present application, the plurality of running event records further include a BIOS start completion event record;

the method further comprises the steps of:

receiving a BIOS start-up completion event record sent by the BIOS, wherein the BIOS start-up completion event record is generated after the BIOS finishes start-up and transmits a control right to an operating system of the server;

and updating the second system event log according to the BIOS starting completion event record.

Optionally, in the method for locating an abnormality of a server according to the embodiment of the present application, the determining a fault location of the server according to the restart status information includes:

preliminarily screening out a server module with higher fault probability according to the restart type and the stage of restarting the server;

and confirming the fault position of the server from the screened server module with higher fault probability.

In a second aspect, an embodiment of the present application further provides a server exception locating device, where the server includes a processor, a complex programmable logic device, a motherboard manager, a BIOS, and a south bridge chip, where the complex programmable logic device is connected to the motherboard manager, the south bridge chip, and the processor, the south bridge chip is connected to the BIOS and the processor, and the BIOS is connected to the motherboard manager; the method is applied to the mainboard manager; the device comprises:

the query module is used for querying, when the server fails, a system event log stored in the motherboard manager, wherein the system event log comprises a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restart type parameter, the count value of the counter is used for metering the restart times of the server, and the restart type parameter is used for representing the latest starting type of the server;

the acquisition module is used for acquiring the restarting state information of the server according to the system event log;

and the judging module is used for judging the fault position of the server according to the restarting state information.

In a third aspect, the present application further provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to execute the method according to any one of the above descriptions.

In a fourth aspect, an embodiment of the present application further provides a server, including a processor, a complex programmable logic device, a motherboard manager, a BIOS, and a south bridge chip, where the complex programmable logic device is connected to the motherboard manager, the south bridge chip, and the processor, respectively, the south bridge chip is connected to the BIOS and the processor, respectively, and the BIOS is connected to the motherboard manager;

the mainboard manager is used for executing the method of any one of the above items.

As can be seen from the above, the server exception location method, the apparatus, the storage medium, and the server provided in the embodiments of the present application query a system event log stored in the motherboard manager when the server fails, where the system event log includes a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restart type parameter, the count value of the counter is used to measure the number of times of restarting the server, and the restart type parameter is used to characterize the latest start type of the server; acquiring restarting state information of the server according to the system event log; and judging the fault position of the server according to the restarting state information, thereby realizing fault positioning, being convenient for finding out the fault position quickly and improving the maintenance efficiency of the server.

Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart of a server anomaly positioning method according to an embodiment of the present application.

Fig. 2 is a schematic structural diagram of a server according to an embodiment of the present application.

Fig. 3 is a schematic structural diagram of a server anomaly positioning device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Referring to fig. 1, fig. 1 is a flowchart illustrating a server anomaly locating method according to some embodiments of the present disclosure. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server in the embodiment of the present application, where the server includes a processor 11, a complex programmable logic device 12, a motherboard manager 13, a BIOS (Basic Input Output System) 14, and a south bridge chip 15, where the complex programmable logic device 12 is connected to the motherboard manager 13, the south bridge chip 15, and the processor 11, the south bridge chip 15 is connected to the BIOS 14 and the processor 11, and the BIOS 14 is connected to the motherboard manager 13. The server anomaly locating method is applied to the motherboard manager 13.

The server abnormity positioning method comprises the following steps:

S101, when the server fails, inquiring a system event log stored in the mainboard manager, wherein the system event log comprises a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restarting type parameter, the count value of the counter is used for metering the restarting times of the server, and the restarting type parameter is used for representing the latest starting type of the server.

S102, acquiring the restarting state information of the server according to the system event log.

S103, judging the fault position of the server according to the restarting state information.

In step S101, the system event log includes a first system event log and a second system event log. The first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restart type parameter, wherein the count value of the counter is used for metering the restart times of the server, and the restart type parameter is used for representing the latest start type of the server. The first system event log is used for judging the restart type and the stage reached before the restarted system enters the BIOS. The second system event log is generated based on event records sent by the BIOS and is used for judging the stage reached after the restarted system enters the BIOS.

It is understood that, in some embodiments, before executing the step S101, the following steps are further included:

S1001, when an EventTrigger interrupt signal of the complex programmable logic device is detected, reading a count value of a counter and a restart type parameter stored in the complex programmable logic device.

S1002, when the count value of the counter changes relative to the count value of the counter read last time, judging that the server is restarted, and generating a corresponding restart event record according to the restart type parameter.

S1003, updating a first system event log in the system event logs according to the restart event record.
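Steps S1001 to S1003 can be sketched from the motherboard manager's side as follows. This is only an illustrative sketch: the `RestartTracker` class, the `read_cpld` callable, the numeric encodings of the restart type parameter, and the dictionary layout of a restart event record are all assumptions that the source does not specify. The sketch also assumes the tracker is created with a baseline count value that was read earlier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical encodings of the restart type parameter (not defined in the source).
HOT_RESTART = 0
COLD_RESTART = 1


@dataclass
class RestartTracker:
    """Motherboard-manager-side state for steps S1001 to S1003 (illustrative)."""
    last_count: int                                       # count value read last time
    first_sel: List[dict] = field(default_factory=list)   # first system event log

    def on_event_trigger(self, read_cpld: Callable[[], Tuple[int, int]]) -> bool:
        """Handle one EventTrigger interrupt; return True if a restart was logged."""
        count, restart_type = read_cpld()      # S1001: read counter and restart type
        if count == self.last_count:           # unchanged count: no restart occurred
            return False
        self.last_count = count
        kind = "cold" if restart_type == COLD_RESTART else "hot"
        # S1002 and S1003: build the restart event record and append it to the log.
        self.first_sel.append({"event": "restart", "type": kind, "count": count})
        return True
```

Comparing for inequality rather than for "greater than" keeps the check valid even if the hardware counter wraps around.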

The complex programmable logic device monitors the PLTRST# signal and the SLP_SX# signal of the south bridge chip. When the server host system is restarted, these two signals of the X86 system change according to the restart type: when a hot restart occurs, only the PLTRST# signal is active, and when a cold restart occurs, the PLTRST# signal and the SLP_SX# signal are active at the same time. Based on this behavior of the PLTRST# signal and the SLP_SX# signal for the different restart types, the complex programmable logic device records the corresponding restart type in an internal register and, at the same time, adds 1 to the count value of the counter; the count value of the counter represents the number of times the server has started. Then, the complex programmable logic device notifies the motherboard manager by an interrupt through the EventTrigger# GPIO signal. When the motherboard manager detects the EventTrigger interrupt signal of the complex programmable logic device, it reads the count value of the counter and the restart type parameter stored in the complex programmable logic device, and compares the read count value with the count value read last time. If the two count values are the same, no restart has occurred; if they differ, a restart has occurred. If a restart has occurred, a restart event record is generated and then the first system event log is updated.
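The behavior of the complex programmable logic device described above can be modeled in software as follows. This is a sketch under stated assumptions, not the actual device logic: the `CpldModel` class, the string encoding of the restart type, and the 8-bit wrapping counter width are hypothetical, and the boolean `event_trigger` field merely stands in for the EventTrigger# GPIO line.

```python
from dataclasses import dataclass


@dataclass
class CpldModel:
    """Software model of the CPLD registers described in the text (illustrative)."""
    counter: int = 0             # number of server starts
    restart_type: str = ""       # "hot" or "cold"; real register encoding not given
    event_trigger: bool = False  # stands in for the EventTrigger# GPIO line

    def observe(self, pltrst_active: bool, slp_sx_active: bool) -> None:
        """Latch a restart whenever PLTRST# is active."""
        if not pltrst_active:
            return  # no platform reset, nothing to record
        # Hot restart: only PLTRST# active. Cold restart: both signals active.
        self.restart_type = "cold" if slp_sx_active else "hot"
        self.counter = (self.counter + 1) & 0xFF  # assumed 8-bit counter that wraps
        self.event_trigger = True                 # notify the motherboard manager
```

The text does not describe the case where only SLP_SX# is active, so the sketch ignores it.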

It is to be understood that the system event log further includes a second system event log; the second system event log is generated based on a plurality of operational event records of the BIOS. The plurality of operational event records includes, but is not limited to: the BIOS start event record, the display parameter initialization completion event record, and the BIOS start completion event record.

Specifically, in some embodiments, before executing step S101, the following steps are further included:

S1004, receiving a BIOS starting event record sent by the BIOS, wherein the BIOS starting event record is generated when the BIOS starts to start.

S1005, updating the second system event log according to the BIOS starting event record.

S1006, receiving a display parameter initialization completion event record sent by the BIOS, wherein the display parameter initialization completion event record is generated after the BIOS completes initialization operation on the display parameters.

S1007, updating the second system event log according to the display parameter initialization completion event record.

S1008, receiving a BIOS start-up completion event record sent by the BIOS, wherein the BIOS start-up completion event record is generated after the BIOS finishes start-up and transmits the control right to an operating system of the server.

S1009, updating the second system event log according to the BIOS starting completion event record.

The event records in the system event log are sorted according to the occurrence time of the event, so that the fault node is convenient to find.
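The time ordering of event records can be illustrated with a short sketch that merges the first and second system event logs into one list sorted by occurrence time. The `SelRecord` type, its fields, and the event names are hypothetical and only serve the illustration; the source does not define a record layout.

```python
from typing import List, NamedTuple


class SelRecord(NamedTuple):
    timestamp: float  # time of occurrence, in seconds (unit is an assumption)
    source: str       # "cpld" for the first log, "bios" for the second
    event: str        # hypothetical event name


def merged_log(first_sel: List[SelRecord], second_sel: List[SelRecord]) -> List[SelRecord]:
    """Return all records ordered by occurrence time, oldest first."""
    # sorted() is stable, so records with equal timestamps keep their input order.
    return sorted(first_sel + second_sel, key=lambda r: r.timestamp)
```

With the records in time order, the last record before the server stopped making progress marks the node at which the fault occurred.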

In step S102, the restart status information includes: a restart type of the server and the stage reached by the server restart. The stage reached by the restart may be one of the following: a restart initial stage, a BIOS start-up beginning stage, a display parameter initialization completion stage, and a BIOS start-up completion stage. Of course, it is not limited thereto. If the server shows a black screen and the system event log in the motherboard manager is not updated, the black screen is caused by an operating system fault of the server and no restart has occurred. If only the restart event record is updated in the system event log of the motherboard manager, the server has restarted but the restart did not reach the initial start-up stage of the BIOS. If only the restart event record and the BIOS start event record are updated in the system event log of the motherboard manager, the system hung after restarting, before the display parameter initialization completion stage.
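Deriving the stage reached from the newly added event records might look like the sketch below. The milestone names and the `stage_reached` function are hypothetical; the sketch assumes the caller passes only the event names added to the system event log since the fault was noticed, oldest first.

```python
from typing import List, Optional

# Boot milestones in the order the text lists them; the string names are hypothetical.
MILESTONES = ["restart", "bios_start", "display_init_done", "bios_done"]


def stage_reached(new_events: List[str]) -> Optional[str]:
    """Return the last milestone reached, or None if no restart was logged."""
    if "restart" not in new_events:
        return None  # log not updated with a restart record: no restart occurred
    # Only consider events recorded from the most recent restart record onward.
    last_restart = len(new_events) - 1 - new_events[::-1].index("restart")
    seen = set(new_events[last_restart:])
    reached = "restart"
    for name in MILESTONES[1:]:
        if name not in seen:
            break  # the boot stopped before this milestone
        reached = name
    return reached
```

For instance, `stage_reached(["restart", "bios_start"])` yields `"bios_start"`, which corresponds to the case where the system hung before the display parameter initialization completed.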

In step S103, when the fault location is determined from the restart status information, a preliminary determination is made based on the stage to which the restart has proceeded.

For example, if the server shows a black screen and the system event log in the motherboard manager is not updated, the black screen occurred while the operating system of the server was running and no restart occurred, which indicates a fault in the display screen or the display driver.

For example, if only the restart event record is updated in the system event log of the motherboard manager, the server has restarted but the restart did not reach the initial start-up stage of the BIOS, which indicates a fault in the BIOS or the processor.

For example, if only the restart event record and the BIOS start event record are updated in the system event log of the motherboard manager, the system hung after restarting, before the display parameter initialization completion node, which indicates a fault in the display screen or the graphics card.

Of course, the position where the specific fault occurs can be judged by combining other parameters, so that the accuracy of fault positioning is improved.

For example, in some embodiments, step S103 includes: S1031, preliminarily screening out the server modules with a higher fault probability according to the restart type and the stage reached by the server restart; S1032, confirming the fault location of the server from the screened server modules with a higher fault probability. For example, if only the restart event record and the BIOS start event record are updated in the system event log of the motherboard manager, the system hung after restarting, before the display parameter initialization completion stage, and therefore the server modules that may have failed can be preliminarily screened as: the display and the graphics card. Maintenance personnel can then obtain status information of the display and the graphics card to determine the specific location of the fault.
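The two-step screening of S1031 and S1032 can be sketched as a lookup followed by a confirmation pass. The stage names, the `CANDIDATES` table, and the `is_faulty` health check are hypothetical; the table only encodes the three examples given in the text, and stages for which the text gives no example yield an empty candidate list.

```python
from typing import Callable, Dict, List, Optional

# Candidate modules per stage reached, taken from the three examples in the text.
# The stage names are hypothetical; None means the log was not updated (no restart).
CANDIDATES: Dict[Optional[str], List[str]] = {
    None: ["display", "display driver"],
    "restart": ["BIOS", "processor"],            # restarted, BIOS never started
    "bios_start": ["display", "graphics card"],  # hung before display init completed
}


def screen_modules(stage: Optional[str]) -> List[str]:
    """S1031: preliminarily screen the modules with a higher fault probability."""
    return list(CANDIDATES.get(stage, []))


def confirm_fault(candidates: List[str], is_faulty: Callable[[str], bool]) -> List[str]:
    """S1032: keep only the candidates that a further check confirms as faulty."""
    return [module for module in candidates if is_faulty(module)]
```

In practice the confirmation step would rely on status information gathered by maintenance personnel; the callable is just a placeholder for that check.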

As can be seen from the above, in the embodiment of the present application, when the server fails, a system event log stored in the motherboard manager is queried, where the system event log includes a first system event log, the first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restart type parameter, the count value of the counter is used to measure the number of times of restarting the server, and the restart type parameter is used to characterize a last start type of the server; acquiring restarting state information of the server according to the system event log; and judging the fault position of the server according to the restarting state information, thereby realizing fault positioning, being convenient for finding out the fault position quickly and improving the maintenance efficiency of the server.

Referring to fig. 3, fig. 3 is a structural diagram of a server anomaly locating device according to some embodiments of the present application.

Wherein, this server anomaly positioner includes: a query module 201, an acquisition module 202 and a judgment module 203.

The query module 201 is configured to query a system event log stored in the motherboard manager when the server fails. The system event log includes a first system event log and a second system event log. The first system event log is generated by reading a count value of a counter stored in a complex programmable logic device and a restart type parameter, wherein the count value of the counter is used for metering the restart times of the server, and the restart type parameter is used for representing the latest start type of the server. The second system event log is generated based on event records sent by the BIOS.

It will be appreciated that in some embodiments the query module 201 is further operable to: when an EventTrigger interrupt signal of the complex programmable logic device is detected, read a count value of a counter and a restart type parameter which are stored in the complex programmable logic device; when the count value of the counter changes relative to the count value of the counter read last time, judge that the server is restarted, and generate a corresponding restart event record according to the restart type parameter; and update a first system event log in the system event logs according to the restart event record. The complex programmable logic device monitors the PLTRST# signal and the SLP_SX# signal of the south bridge chip. When the server host system is restarted, these two signals of the X86 system change according to the restart type: when a hot restart occurs, only the PLTRST# signal is active, and when a cold restart occurs, the PLTRST# signal and the SLP_SX# signal are active at the same time. Based on this behavior of the PLTRST# signal and the SLP_SX# signal for the different restart types, the complex programmable logic device records the corresponding restart type in an internal register and, at the same time, adds 1 to the count value of the counter; the count value of the counter represents the number of times the server has started. Then, the complex programmable logic device notifies the motherboard manager by an interrupt through the EventTrigger# GPIO signal. When the motherboard manager detects the EventTrigger interrupt signal of the complex programmable logic device, it reads the count value of the counter and the restart type parameter stored in the complex programmable logic device, and compares the read count value with the count value read last time. If the two count values are the same, no restart has occurred; if they differ, a restart has occurred. If a restart has occurred, a restart event record is generated and then the first system event log is updated.

It is to be understood that the system event log further includes a second system event log; the second system event log is generated based on a plurality of operational event records of the BIOS 14. The plurality of operational event records includes, but is not limited to: the BIOS start event record, the display parameter initialization completion event record, and the BIOS start completion event record.

Wherein the query module is further configured to: receive a BIOS starting event record sent by the BIOS, wherein the BIOS starting event record is generated when the BIOS starts to start; update the second system event log according to the BIOS starting event record; receive a display parameter initialization completion event record sent by the BIOS, wherein the display parameter initialization completion event record is generated after the BIOS completes initialization operation on display parameters; update the second system event log according to the display parameter initialization completion event record; receive a BIOS start-up completion event record sent by the BIOS, wherein the BIOS start-up completion event record is generated after the BIOS finishes start-up and transmits a control right to an operating system of the server; and update the second system event log according to the BIOS starting completion event record. The event records in the system event log are sorted according to the occurrence time of the event, so that the fault node is convenient to find.

The obtaining module 202 is configured to obtain the restart status information of the server according to the system event log. The restart status information includes: a restart type of the server and the stage reached by the server restart. The stage reached by the restart may be one of the following: a restart initial stage, a BIOS start-up beginning stage, a display parameter initialization completion stage, and a BIOS start-up completion stage. Of course, it is not limited thereto. If the server shows a black screen and the system event log in the motherboard manager is not updated, the black screen is caused by an operating system fault of the server and no restart has occurred. If only the restart event record is updated in the system event log of the motherboard manager, the server has restarted but the restart did not reach the initial start-up stage of the BIOS. If only the restart event record and the BIOS start event record are updated in the system event log of the motherboard manager, the system hung after restarting, before the display parameter initialization completion stage.

The judging module 203 is configured to judge a fault location of the server according to the restart status information. When the fault location is judged according to the restart status information, preliminary judgment is performed based on the stage to which the restart has proceeded.
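The division of the device into a query module 201, an obtaining module 202 and a judging module 203 can be pictured as three callables chained in sequence. The `AnomalyLocator` class and the type signatures below are purely illustrative assumptions; the source describes the modules functionally and does not prescribe any software interface.

```python
from typing import Callable, List, Optional


class AnomalyLocator:
    """Chains the query (201), obtaining (202) and judging (203) modules."""

    def __init__(
        self,
        query: Callable[[], List[str]],                 # returns the system event log
        acquire: Callable[[List[str]], Optional[str]],  # derives restart status info
        judge: Callable[[Optional[str]], List[str]],    # maps status to fault location
    ) -> None:
        self.query = query
        self.acquire = acquire
        self.judge = judge

    def locate(self) -> List[str]:
        """Run the three modules in order and return the suspected fault locations."""
        log = self.query()
        state = self.acquire(log)
        return self.judge(state)
```

Keeping the modules as separate callables mirrors the statement that the division into units is only a logical one and that modules may be combined or split in an actual implementation.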

The embodiment of the present application provides a storage medium storing a computer program which, when executed by a processor, performs the method in any optional implementation manner of the above embodiment. The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A server exception positioning method is disclosed, wherein the server comprises a processor, a complex programmable logic device, a mainboard manager, a BIOS and a south bridge chip, wherein the complex programmable logic device is respectively connected with the mainboard manager, the south bridge chip and the processor, the south bridge chip is respectively connected with the BIOS and the processor, and the BIOS is connected with the mainboard manager; the method is applied to the mainboard manager, and is characterized in that the method comprises the following steps:

2. The server exception location method of claim 1, wherein the step of querying a system event log stored in the motherboard manager is preceded by the step of:

3. The server abnormal positioning method according to claim 1 or 2, wherein the restart status information comprises: a restart type of the server and a stage reached by the restart of the server.

4. The server anomaly location method according to claim 2, wherein said system event log further comprises a second system event log; the second system event log is used for determining the stage reached after the system restarts and enters the BIOS, and the second system event log is generated based on a plurality of running event records of the BIOS.

5. The server exception location method of claim 4, wherein the plurality of run event records comprises a BIOS start event record;

the method further comprises the steps of:

6. The server anomaly locating method according to claim 4, wherein said plurality of running event records further comprises a display parameter initialization completion event record;

the method further comprises the steps of:

updating the second system event log according to the display parameter initialization completion event record;

or, the plurality of running event records further include a BIOS start completion event record;

the method further comprises the steps of:

7. The method for locating server abnormality according to claim 4, wherein said determining a fault location of the server according to the restart status information includes:

8. A server exception positioning device comprises a processor, a complex programmable logic device, a mainboard manager, a BIOS (basic input output System), and a south bridge chip, wherein the complex programmable logic device is respectively connected with the mainboard manager, the south bridge chip and the processor; the method is applied to the mainboard manager; characterized in that the device comprises:

9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the method according to any of claims 1-7.

10. A server is characterized by comprising a processor, a complex programmable logic device, a mainboard manager, a BIOS and a south bridge chip, wherein the complex programmable logic device is respectively connected with the mainboard manager, the south bridge chip and the processor, and the BIOS is sequentially connected with the mainboard manager, the BIOS, the south bridge chip and the processor;

the motherboard manager is configured to perform the method of any of claims 1-7.