CN112445640A - Server downtime fault positioning and isolating system and method - Google Patents
Server downtime fault positioning and isolating system and method
- Publication number
- CN112445640A (application number CN202011116419.5A)
- Authority
- CN
- China
- Prior art keywords
- loaded
- equipment
- loading
- current
- fru
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
Abstract
The invention provides a server downtime fault positioning and isolating system. A BIOS acquires the information stored in an FRU mounted under a BMC (baseboard management controller) and compares the current device to be loaded against the information on all non-failed loading devices, updated in real time, that is stored in the FRU. If that information includes the current device to be loaded, the BIOS loads the device. The BMC acquires from a timing module the loading time of the device the BIOS is currently loading and judges from that loading time whether the server system is down. If the server system is down, the BMC removes the information of the current device to be loaded from the FRU, thereby locating and isolating the faulty device.
Description
Technical Field
The invention relates to the field of server faults, in particular to a system and a method for positioning and isolating a server downtime fault.
Background
With the development of information technology, server configurations have become increasingly rich and can meet a wide range of requirements. Because servers typically run critical application software, the reliability requirements on the system are very high.
However, as server configurations grow richer, more kinds of devices can be attached to the system. This steadily increases the complexity of the service applications and the probability that the system becomes unstable. When a server goes down in a machine room, the running service applications are seriously affected.
In current designs, downtime has to be judged and confirmed manually. To locate the faulty device, an engineer must reproduce the phenomenon and verify it repeatedly, relying on personal experience. This consumes a large amount of time and labor and hinders efficient judgment and location of server downtime faults.
Disclosure of Invention
To solve the above problems in the prior art, the invention provides a system and a method for positioning and isolating server downtime faults. The invention removes the large cost in time and labor caused by manual judgment and verification, and improves the efficiency of judging and locating server downtime faults.
A first aspect of the invention provides a server downtime fault positioning and isolating system, comprising a BIOS, a BMC, an FRU, a PCH and a timing module. The FRU is mounted under the BMC and stores information on all non-failed loading devices, updated in real time. The BIOS is communicatively connected to the BMC through the PCH, acquires the information stored in the FRU mounted under the BMC, and compares the current device to be loaded against the non-failed loading device information stored in the FRU. If that information includes the current device to be loaded, the BIOS loads the current device to be loaded; if it does not, the BIOS does not load it and continues to load the next device to be loaded. The BMC is communicatively connected to the timing module, acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down. If the server system is down, the BMC removes the information of the current device to be loaded from the FRU, thereby locating and isolating the faulty device.
Optionally, the information of the loading device includes a loading time preset threshold corresponding to each loading device that has not failed.
Further, judging whether the server system is down according to the loading time of the current device to be loaded specifically comprises:
judging whether the loading time of the current device to be loaded is greater than the preset loading-time threshold corresponding to that device; if so, the server system is judged to be down, and if not, the server system is judged not to be down.
Optionally, the timing module is a CPLD.
The second aspect of the present invention provides a method for locating and isolating a server downtime fault, which is implemented based on the system for locating and isolating a server downtime fault according to the first aspect of the present invention, and comprises:
after the system is powered on, the BIOS acquires the information on all non-failed loading devices, updated in real time, that is stored in the FRU mounted under the BMC, and compares the current device to be loaded against that information; if the information includes the current device to be loaded, the BIOS loads the current device to be loaded;
the BMC acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down; if the server system is down, the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device;
if the information stored in the FRU does not include the current device to be loaded, the BIOS continues to load the next device to be loaded.
Optionally, the method further comprises: when all devices to be loaded have finished loading and no downtime has occurred during loading, starting the server system.
Optionally, after the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device, the method further comprises:
the BMC sets prompt information and records the current downtime event and the current device to be loaded.
Optionally, judging whether the server system is down according to the loading time of the current device to be loaded specifically comprises:
judging whether the loading time of the current device to be loaded is greater than the preset loading-time threshold corresponding to that device; if so, the server system is judged to be down, and if not, the server system is judged not to be down.
Optionally, the timing module records the loading time of the current device to be loaded, and after the recording of the loading time of the current device to be loaded is completed, the timing module is cleared to zero to record the loading time of the next device to be loaded.
Further, the BIOS loads the devices to be loaded in sequence according to a stored HOB list, and the total number of devices to be loaded stored in the HOB list is not less than the number of non-failed loading devices stored in the FRU after real-time update.
The technical scheme adopted by the invention comprises the following technical effects:
1. The invention removes the large cost in time and labor caused by manual judgment and verification, automatically locates and isolates the device being loaded when the server goes down, and improves the efficiency of judging and locating server downtime faults.
2. In the technical scheme, the loading device information includes a preset loading-time threshold for each loading device, and downtime is judged by whether the loading time of the current device to be loaded exceeds that device's threshold. Because the loading of each device is judged against its own threshold, different thresholds can be set according to the actual characteristics of each device, which makes the downtime judgment more flexible across different devices.
3. In the technical scheme, after the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device, it sets prompt information and records the current downtime event and the current device to be loaded. This makes it easier to locate the faulty device and to test and analyze the downtime later, and avoids having to reproduce the downtime repeatedly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without any creative effort.
FIG. 1 is a schematic structural diagram of the system according to Example one of the present invention;
FIG. 2 is a schematic flow diagram of the method according to Example two of the present invention;
FIG. 3 is a schematic flow diagram of the method according to Example three of the present invention;
FIG. 4 is a schematic flow diagram of the method according to Example four of the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
Example one
As shown in FIG. 1, the present invention provides a server downtime fault positioning and isolating system, comprising a BIOS1, a BMC2, an FRU3, a PCH4 and a timing module 5. The FRU3 is mounted under the BMC2 and stores information on all non-failed loading devices, updated in real time. The BIOS1 is communicatively connected to the BMC2 through the PCH4, acquires the information stored in the FRU3 mounted under the BMC2, and compares the current device to be loaded against the non-failed loading device information stored in the FRU3. If that information includes the current device to be loaded, the BIOS1 loads the current device to be loaded; if it does not, the BIOS1 does not load it and continues to load the next device to be loaded. The BMC2 is communicatively connected to the timing module 5, acquires from the timing module 5 the loading time of the device the BIOS1 is currently loading, and judges from that loading time whether the server system is down. If the server system is down, the BMC2 removes the information of the current device to be loaded from the FRU3, thereby locating and isolating the faulty device.
The BIOS1 (Basic Input Output System) and the BMC2 (Baseboard Management Controller) are communicatively connected through the PCH4 (Platform Controller Hub, an integrated south bridge). Specifically, the BIOS1 and the PCH4 are connected through a Serial Peripheral Interface (SPI) bus, the PCH4 is communicatively connected to the BMC2 through an LPC (Low Pin Count) bus, and the BMC2 controls the enable signal (FLASH_CS) of the BIOS1 flash through the PCH4. The BMC2 is communicatively connected to the FRU3 (Field Replaceable Unit) and the timing module 5 via the I2C bus.
The loading device information includes a preset loading-time threshold for each non-failed loading device. Whether the system is down is judged by whether the loading time of the current device to be loaded exceeds that device's preset threshold. Because each device is judged against its own threshold, different thresholds can be set according to the actual characteristics of each device, which makes the downtime judgment more flexible across different devices. Specifically, the threshold for each device to be loaded is read from the real-time-updated information stored in the FRU3, and the information on all non-failed loading devices in the FRU3 may be stored as a data list or a database, for example: the name of each non-failed loading device, followed by the preset loading-time threshold corresponding to that device … …
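As an illustration of the data list described above, the following minimal sketch models the FRU contents as a mapping from device name to preset loading-time threshold. The device names, the threshold values, the choice of seconds as the unit, and the helper `threshold_for` are all assumptions made for this example; the source does not specify a storage layout beyond "name, threshold".

```python
# Hypothetical in-memory view of the FRU data list:
# device name -> preset loading-time threshold (seconds, assumed unit).
# Names and values are invented for illustration only.
fru_whitelist = {
    "raid_card": 5.0,
    "nic_0": 3.0,
    "gpu_0": 8.0,
}


def threshold_for(device, whitelist):
    """Return the preset threshold of a listed device, or None if it is not listed."""
    return whitelist.get(device)


listed = threshold_for("nic_0", fru_whitelist)      # device recorded as non-failed
unlisted = threshold_for("hba_1", fru_whitelist)    # device absent from the list
```

A device that is absent from the mapping has no threshold, which corresponds to the case in which the BIOS does not load it and moves on to the next device.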
Judging whether the server system is down according to the loading time of the current device to be loaded is specifically as follows: judge whether the loading time of the current device to be loaded is greater than the preset loading-time threshold of that device; if so, the server system is judged to be down, and if not, the server system is judged not to be down.
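The judgment above reduces to a single strict comparison. The sketch below is only an illustration; the function name `is_down` and the use of seconds are assumptions, not part of the source.

```python
def is_down(load_time_s, threshold_s):
    """Judge downtime: True only when the measured loading time exceeds the preset threshold."""
    return load_time_s > threshold_s


over = is_down(6.0, 5.0)     # loading took longer than allowed
equal = is_down(5.0, 5.0)    # exactly at the threshold is not "greater than"
under = is_down(2.5, 5.0)    # loading finished in time
```

Because the text says "greater than", a loading time equal to the threshold is not treated as downtime in this sketch.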
Specifically, the timing module 5 may be a CPLD (Complex Programmable Logic Device) that records the loading time of the device to be loaded. After the loading time of the current device to be loaded has been recorded, the timing module 5 is cleared to zero so that it can record the loading time of the next device to be loaded.
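The record-then-clear behavior of the timing module can be modeled in software as follows. This is a toy model under stated assumptions: the class `LoadTimer`, its methods, and the abstract "tick" unit are hypothetical, and a real CPLD would count hardware clock cycles rather than receive explicit `tick` calls.

```python
class LoadTimer:
    """Toy model of the timing module: counts ticks for one device, then is cleared."""

    def __init__(self):
        self.ticks = 0
        self.running = False

    def start(self):
        """Begin timing a new device from zero."""
        self.ticks = 0
        self.running = True

    def tick(self, n=1):
        """Advance the counter; ignored while the timer is stopped."""
        if self.running:
            self.ticks += n

    def stop(self):
        """Stop counting and return the recorded loading time."""
        self.running = False
        return self.ticks

    def clear(self):
        """Reset to zero so the next device starts from a clean count."""
        self.ticks = 0


timer = LoadTimer()
recorded = []
for device, cost in [("nic_0", 3), ("gpu_0", 7)]:   # invented devices and costs
    timer.start()
    timer.tick(cost)                  # stands in for the time spent loading
    recorded.append((device, timer.stop()))
    timer.clear()                     # cleared before the next device is timed
```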
Further, if the server system is down, the BMC2 removes the information of the current device to be loaded from the FRU3 to locate and isolate the faulty device, and the BMC2 sets prompt information to record the current downtime event and the current device to be loaded. The record may take the form of a log; the present invention is not limited in this respect.
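A minimal sketch of the isolation and recording step, assuming the FRU contents are held as a dictionary and the record is a list of log entries. The function `isolate` and the fields of the log entry are hypothetical; the source only says that the record may take the form of a log.

```python
def isolate(device, whitelist, log):
    """Remove the device from the non-failed list and append a record of the event."""
    threshold = whitelist.pop(device, None)   # no error if the device is already gone
    log.append({"event": "downtime", "device": device, "threshold_s": threshold})
    return whitelist


whitelist = {"nic_0": 3.0, "gpu_0": 8.0}      # invented example contents
log = []
isolate("gpu_0", whitelist, log)
```

After the call, `gpu_0` no longer appears in `whitelist`, so a later comparison by the BIOS would not find it, and `log` holds one entry naming the device.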
The BIOS1 loads the devices to be loaded in sequence according to a stored HOB list (a list giving the order in which the BIOS loads the devices to be loaded). The total number of devices to be loaded stored in the HOB list is not less than the number of non-failed loading devices stored in the FRU3 after real-time update, where the latter number counts every non-failed loading device currently recorded in the FRU3.
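The relationship between the HOB list and the FRU contents can be sketched as below. The list contents and the helpers `hob_covers_fru` and `load_order` are assumptions for illustration; the source states only the ordering role of the HOB list and the count constraint.

```python
# Invented example data: the HOB list fixes the order, the FRU set says what is non-failed.
hob_list = ["nic_0", "raid_card", "gpu_0", "hba_1"]
fru_devices = {"nic_0", "raid_card", "gpu_0"}


def hob_covers_fru(hob, fru):
    """Check the stated constraint: HOB entries are not fewer than FRU entries."""
    return len(hob) >= len(fru)


def load_order(hob, fru):
    """Devices that would actually be loaded, in HOB order, skipping unlisted ones."""
    return [device for device in hob if device in fru]


constraint_ok = hob_covers_fru(hob_list, fru_devices)
order = load_order(hob_list, fru_devices)
```

Here `hba_1` appears in the HOB list but not in the FRU set, so it is passed over while the remaining devices keep their HOB order.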
The invention removes the large cost in time and labor caused by manual judgment and verification, automatically locates and isolates the device being loaded when the server goes down, and improves the efficiency of judging and locating server downtime faults.
In the technical scheme, the loading device information includes a preset loading-time threshold for each loading device, and downtime is judged by whether the loading time of the current device to be loaded exceeds that device's threshold. Because the loading of each device is judged against its own threshold, different thresholds can be set according to the actual characteristics of each device, which makes the downtime judgment more flexible across different devices.
Example two
As shown in FIG. 2, the technical solution of the present invention further provides a method for locating and isolating a server downtime fault, which is implemented based on Example one of the present invention and includes:
S1, after the system is powered on, the BIOS acquires the information on all non-failed loading devices, updated in real time, that is stored in the FRU mounted under the BMC, and compares the current device to be loaded against that information;
S2, judging whether the non-failed loading device information stored in the FRU includes the current device to be loaded; if yes, executing step S3; if no, executing step S5;
S3, the BIOS loads the current device to be loaded;
S4, the BMC acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down; if yes, executing step S6; if no, executing step S5;
S5, the BIOS continues to load the next device to be loaded;
S6, the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device.
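Steps S1 to S6 can be sketched as a single pass over the device list. This is a simulation under stated assumptions, not firmware: `boot_pass`, the `measure` callable that stands in for "load the device and read the timing module", and all device names and times are hypothetical.

```python
def boot_pass(hob_list, whitelist, measure):
    """Run one pass of steps S1-S6; return the isolated device name, or None."""
    snapshot = dict(whitelist)               # S1: BIOS reads the FRU contents
    for device in hob_list:
        if device not in snapshot:           # S2: not listed, go to S5
            continue                         # S5: move on to the next device
        load_time = measure(device)          # S3: load; S4: BMC reads the loading time
        if load_time > snapshot[device]:     # S4: judged down
            del whitelist[device]            # S6: BMC removes the entry from the FRU
            return device
    return None                              # every listed device loaded in time


whitelist = {"nic_0": 3.0, "gpu_0": 8.0}     # invented thresholds (seconds)
times = {"nic_0": 1.0, "gpu_0": 20.0}        # invented measured loading times
faulty = boot_pass(["nic_0", "gpu_0", "hba_1"], whitelist, times.get)
```

In this run `gpu_0` exceeds its threshold, so it is removed from `whitelist` and returned; `hba_1` would have been skipped at S2 because it is not listed.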
In step S1, after the system is powered on, the BIOS communicates with the BMC through the PCH to acquire the information on all non-failed loading devices, updated in real time, that is stored in the FRU mounted under the BMC, and compares the current device to be loaded against that information. The loading device information includes a preset loading-time threshold for each non-failed loading device.
In step S4, judging whether the server system is down according to the loading time of the current device to be loaded is specifically as follows: judge whether the loading time of the current device to be loaded is greater than the preset loading-time threshold corresponding to that device; if so, the server system is judged to be down, and if not, the server system is judged not to be down. The preset loading-time threshold of a device can be adjusted flexibly according to actual conditions such as the type of the device, so different thresholds can be set for different devices, which makes the downtime judgment more flexible across different devices. Specifically, the threshold for each device to be loaded is read from the real-time-updated information stored in the FRU3, and the information on all non-failed loading devices in the FRU3 may be stored as a data list or a database, for example: the name of each non-failed loading device, followed by the preset loading-time threshold corresponding to that device … …
The timing module records the loading time of the current device to be loaded. After that loading time has been recorded, the timing module is cleared to zero so that it can record the loading time of the next device to be loaded.
Specifically, the timing module may be a CPLD or another type of timing module; it can be chosen flexibly in practical applications, and the invention is not limited in this respect.
The BIOS loads the devices to be loaded in sequence according to a stored HOB list (a list giving the order in which the BIOS loads the devices to be loaded). The total number of devices to be loaded stored in the HOB list is not less than the number of non-failed loading devices stored in the FRU after real-time update, where the latter number counts every non-failed loading device currently recorded in the FRU.
The invention removes the large cost in time and labor caused by manual judgment and verification, automatically locates and isolates the device being loaded when the server goes down, and improves the efficiency of judging and locating server downtime faults.
In the technical scheme, the loading device information includes a preset loading-time threshold for each loading device, and downtime is judged by whether the loading time of the current device to be loaded exceeds that device's threshold. Because the loading of each device is judged against its own threshold, different thresholds can be set according to the actual characteristics of each device, which makes the downtime judgment more flexible across different devices.
Example three
As shown in FIG. 3, the technical solution of the present invention further provides a method for locating and isolating a server downtime fault, which is implemented based on Example one of the present invention and includes:
S1, after the system is powered on, the BIOS acquires the information on all non-failed loading devices, updated in real time, that is stored in the FRU mounted under the BMC, and compares the current device to be loaded against that information;
S2, judging whether the non-failed loading device information stored in the FRU includes the current device to be loaded; if yes, executing step S3; if no, executing step S5;
S3, the BIOS loads the current device to be loaded;
S4, the BMC acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down; if yes, executing step S6; if no, executing step S5;
S5, the BIOS continues to load the next device to be loaded;
S6, the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device;
S7, when all devices to be loaded have finished loading and no downtime has occurred during loading, starting the server system.
In step S7, once the BIOS has finished loading all the devices to be loaded and the server system has not gone down, the server system starts normally.
Example four
As shown in FIG. 4, the technical solution of the present invention further provides a method for locating and isolating a server downtime fault, which is implemented based on Example one of the present invention and includes:
S1, after the system is powered on, the BIOS acquires the information on all non-failed loading devices, updated in real time, that is stored in the FRU mounted under the BMC, and compares the current device to be loaded against that information;
S2, judging whether the non-failed loading device information stored in the FRU includes the current device to be loaded; if yes, executing step S3; if no, executing step S5;
S3, the BIOS loads the current device to be loaded;
S4, the BMC acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down; if yes, executing step S6; if no, executing step S5;
S5, the BIOS continues to load the next device to be loaded;
S6, the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device;
S7, the BMC sets prompt information and records the current downtime event and the current device to be loaded;
S8, when all devices to be loaded have finished loading and no downtime has occurred during loading, starting the server system.
In the technical scheme, after the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device, it sets prompt information and records the current downtime event and the current device to be loaded. This makes it easier to locate the faulty device and to test and analyze the downtime later, and avoids having to reproduce the downtime repeatedly.
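The flow of Example four, including the record of step S7 and the normal start of step S8, can be simulated as repeated boot attempts. The sketch assumes that the system is restarted after each isolation, which the source does not state explicitly; `boot_until_clean`, the log message format, and all device names and times are hypothetical.

```python
def boot_until_clean(hob_list, whitelist, load_times):
    """Repeat boot attempts until none of the listed devices exceeds its threshold.

    Assumption: the system restarts after each isolation. Returns the list of
    recorded downtime messages, one per isolated device.
    """
    log = []
    while True:
        faulty = None
        for device in hob_list:                                   # S1-S5
            if device not in whitelist:
                continue                                          # not listed: skip
            if load_times.get(device, 0.0) > whitelist[device]:   # S4: judged down
                faulty = device
                break
        if faulty is None:
            return log                                            # S8: normal start
        del whitelist[faulty]                                     # S6: isolate
        log.append(f"downtime while loading {faulty}")            # S7: record


whitelist = {"a": 1.0, "b": 1.0, "c": 1.0}      # invented thresholds (seconds)
load_times = {"a": 0.5, "b": 2.0, "c": 3.0}     # invented measured loading times
log = boot_until_clean(["a", "b", "c"], whitelist, load_times)
```

The loop always terminates because each failed attempt removes one entry from `whitelist`, so at most as many attempts are needed as there are listed devices plus one.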
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention remain within its scope.
Claims (10)
1. A server downtime fault positioning and isolating system is characterized by comprising: the system comprises a BIOS, a BMC, an FRU, a PCH and a timing module, wherein the FRU is mounted on the BMC and stores all non-fault loading equipment information which is updated in real time; the BIOS is in communication connection with the BMC through the PCH, acquires FRU storage information mounted under the BMC, compares current equipment to be loaded with all real-time updated loading equipment information which is stored in the FRU and does not have faults, and loads the current equipment to be loaded if all real-time updated loading equipment information which is stored in the FRU and does not have faults comprises the current equipment to be loaded; if all the updated information of the loading equipment which is not failed in real time and stored by the FRU does not comprise the current equipment to be loaded, the BIOS continues to load the next equipment to be loaded; the BMC is in communication connection with the timing module, acquires the loading time of the BIOS current device to be loaded in the timing module, and judges whether the server system is down according to the loading time of the current device to be loaded; and if the server system is down, the BMC removes the current equipment information to be loaded in the FRU so as to realize the positioning isolation of the fault equipment.
2. The server downtime fault locating and isolating system according to claim 1, wherein the loading device information comprises a preset loading-time threshold corresponding to each loading device that has not failed.
3. The server downtime fault locating and isolating system according to claim 2, wherein judging whether the server system is down according to the loading time of the current device to be loaded specifically comprises:
judging whether the loading time of the current device to be loaded is greater than the preset loading-time threshold corresponding to the current device to be loaded; if so, the server system is down; if not, the server system is not down.
4. The server downtime fault locating and isolating system according to any one of claims 1 to 3, wherein the timing module is a CPLD.
5. A server downtime fault locating and isolating method, implemented on the basis of the server downtime fault locating and isolating system according to any one of claims 1 to 4, comprising the following steps:
after the system is powered on, the BIOS acquires the information, updated in real time, on all non-faulty loading devices stored in the FRU mounted under the BMC, and compares the current device to be loaded with that information; if the information includes the current device to be loaded, the BIOS loads the current device to be loaded;
the BMC acquires from the timing module the loading time of the device the BIOS is currently loading, and judges from that loading time whether the server system is down; if the server system is down, the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device;
and if the information on all non-faulty loading devices stored in the FRU does not include the current device to be loaded, the BIOS proceeds to load the next device to be loaded.
6. The server downtime fault locating and isolating method according to claim 5, further comprising: starting the server system once all devices to be loaded have finished loading and no downtime has occurred during loading.
7. The server downtime fault locating and isolating method according to claim 5, wherein after the BMC removes the information of the current device to be loaded from the FRU to isolate the faulty device, the method further comprises:
the BMC sets prompt information and records the current downtime event and the current device to be loaded.
8. The server downtime fault locating and isolating method according to claim 5, wherein judging whether the server system is down according to the loading time of the current device to be loaded specifically comprises:
judging whether the loading time of the current device to be loaded is greater than the preset loading-time threshold corresponding to the current device to be loaded; if so, the server system is down; if not, the server system is not down.
9. The server downtime fault locating and isolating method according to any one of claims 5 to 8, wherein the timing module records the loading time of the current device to be loaded, and after that recording is completed, the timing module is reset to zero so that it can record the loading time of the next device to be loaded.
10. The server downtime fault locating and isolating method according to any one of claims 5 to 8, wherein the BIOS loads the devices to be loaded in sequence according to a stored HOB list, and the number of devices to be loaded stored in the HOB list is not less than the number of non-faulty loading devices stored in the FRU and updated in real time.
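The claimed flow can be illustrated with a small simulation. This is a sketch under stated assumptions, not the patented implementation: the function name `boot`, the `load_device` callback (standing in for the BIOS load plus the timing-module reading), the dictionary used for the FRU record, and the assumption that the server is power-cycled after a detected downtime are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def boot(
    hob_list: List[str],
    fru_healthy: Dict[str, float],
    load_device: Callable[[str], float],
) -> Tuple[bool, List[str]]:
    """Simulate one power-on pass over the devices in HOB-list order.

    fru_healthy maps device -> preset loading-time threshold (seconds, assumed).
    load_device returns the measured loading time for one device; each call
    models a timer that starts from zero for that device.
    Returns (booted, loaded_devices).
    """
    loaded: List[str] = []
    for device in hob_list:
        if device not in fru_healthy:
            # Not recorded as non-faulty in the FRU: skip to the next device.
            continue
        elapsed = load_device(device)
        if elapsed > fru_healthy[device]:
            # Loading time exceeds the threshold: the system is judged down,
            # and the device record is removed from the FRU to isolate it.
            del fru_healthy[device]
            return False, loaded
        loaded.append(device)
    # Every listed device loaded without downtime: the system can start.
    return True, loaded


# Usage with made-up devices: raid0 hangs on the first pass.
times = {"nic0": 1.0, "raid0": 30.0, "gpu0": 2.0}
fru = {"nic0": 2.0, "raid0": 5.0, "gpu0": 4.0}
hob = ["nic0", "raid0", "gpu0"]
first = boot(hob, fru, times.__getitem__)   # (False, ['nic0'])
second = boot(hob, fru, times.__getitem__)  # (True, ['nic0', 'gpu0'])
```

On the second pass `raid0` is no longer in the FRU record, so it is skipped and the remaining devices load normally, which is the isolation effect the claims describe.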
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116419.5A CN112445640A (en) | 2020-10-19 | 2020-10-19 | Server downtime fault positioning and isolating system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112445640A true CN112445640A (en) | 2021-03-05 |
Family
ID=74735548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011116419.5A (Withdrawn) CN112445640A (en) | Server downtime fault positioning and isolating system and method | 2020-10-19 | 2020-10-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112445640A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800025A (en) * | 2018-12-13 | 2019-05-24 | 平安普惠企业管理有限公司 | Page loading method, device, equipment and storage medium |
CN109947586A (en) * | 2019-03-20 | 2019-06-28 | 浪潮商用机器有限公司 | A kind of method, apparatus and medium of isolated fault equipment |
CN111526207A (en) * | 2020-05-06 | 2020-08-11 | 金蝶软件(中国)有限公司 | Data transmission method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210305 |