WO2016082523A1 - Apparatus and method for handling fault

Info

Publication number: WO2016082523A1
Application number: PCT/CN2015/081355
Authority: WO
Inventors: 宋刚
Original assignee: 华为技术有限公司
Priority date: 2014-11-28
Filing date: 2015-06-12
Publication date: 2016-06-02
Also published as: CN104486100B; CN104486100A

Abstract

Disclosed are an apparatus and a method for handling a fault, which belong to the technical field of computers. The method comprises: counting the frequency at which a hardware module in a server generates correctable fault interrupts within a predetermined time period; detecting whether the frequency is greater than a disable threshold; and, when the frequency is detected to be greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state. This solves the problem that, when a hardware module produces a large number of correctable faults within a short time, the operating system remains in a continuous fault handling state, a large share of its processing resources is occupied, and it cannot run normally. As a result, fewer correctable fault interrupts are generated during such a burst, the operating system can run normally, and its running efficiency is improved.

Description

Fault handling device and method

The present application claims the priority of the Chinese Patent Application, the entire disclosure of which is hereby incorporated by reference.

Technical field

The present invention relates to the field of computer technologies, and in particular, to a fault processing apparatus and method.

Background art

A correctable fault is a common hardware failure that occurs when the server is running.

When a correctable fault occurs, the hardware module generates a Corrected Machine-Check Error Interrupt (CMCI) according to the correctable fault and notifies the operating system to enter the interrupt handler to process the correctable fault interrupt. The operating system identifies the hardware module from the correctable fault interrupt and performs the corresponding fault handling. Taking a correctable fault that occurs in memory as an example, the interrupt handler in the operating system processes the correctable fault interrupt as follows:

1. The interrupt handler collects the fault data corresponding to the correctable fault;

2. The interrupt handler translates the fault physical address in the collected fault data into a fault logical address under the corresponding operating system;

3. The interrupt handler counts the correctable faults on the memory page to which the fault logical address belongs;

4. The interrupt handler performs a fault handling operation on the correctable fault.
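
By way of illustration only, the four steps above can be sketched in Python as follows. This is a toy model and not the interrupt handler of any real operating system: the 4 KiB page size, the fixed physical-to-logical offset, and the names `translate`, `handle_correctable_fault` and `page_fault_counts` are assumptions introduced for the sketch.

```python
from collections import Counter

PAGE_SIZE = 4096                      # assumed page size, not given in the source
PHYS_TO_LOGICAL_OFFSET = 0x1000_0000  # hypothetical fixed mapping for this sketch

page_fault_counts = Counter()         # correctable-fault count per memory page


def translate(physical_address):
    """Step 2: map a fault physical address to a logical address (toy mapping)."""
    return physical_address + PHYS_TO_LOGICAL_OFFSET


def handle_correctable_fault(fault_record):
    """Run steps 1 to 4 for one correctable fault interrupt."""
    # Step 1: collect the fault data (here it is simply passed in).
    physical_address = fault_record["physical_address"]
    # Step 2: translate the physical address into a logical address.
    logical_address = translate(physical_address)
    # Step 3: count the faults on the page containing the logical address.
    page = logical_address // PAGE_SIZE
    page_fault_counts[page] += 1
    # Step 4: fault handling operation (this sketch only reports the page and count).
    return page, page_fault_counts[page]
```

Because every interrupt runs all four steps, the work done by the handler grows in direct proportion to the number of correctable fault interrupts delivered.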

In the process of implementing the present invention, the inventors found that the prior art has at least the following problem: when a hardware module generates a large number of correctable faults in a short time, that is, when a correctable fault storm occurs, the hardware module generates a large number of correctable fault interrupts and repeatedly notifies the operating system to enter the interrupt handler. Because the operating system must perform the above fault handling for each correctable fault, it remains in a continuous fault handling state, which occupies a large amount of its processing resources and may even prevent it from operating normally.

Summary of the invention

In the prior art, when a hardware module generates a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state, which occupies a large amount of its processing resources and may even prevent it from operating normally. To solve this problem, embodiments of the present invention provide a fault processing apparatus and method. The technical solutions are as follows:

In a first aspect, a fault processing apparatus is provided for use in a server including at least one hardware module, the apparatus comprising:

a statistics module, configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

a detecting module, configured to detect whether the frequency is greater than a disable threshold;

a first switching module, configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when it is detected that the frequency is greater than the disable threshold.

In a first possible implementation manner of the first aspect, the statistics module includes:

a reading module, configured to read, by an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, where the interrupt handler is the interrupt handler used to process the correctable fault, and the MCE memory is the MCE memory corresponding to the hardware module;

a calculating module, configured to calculate, by the interrupt handler, the frequency according to the predetermined time period and the number of correctable fault interrupts;

the detecting module is configured to detect, by the interrupt handler, whether the frequency is greater than the disable threshold.

In a second possible implementation manner of the first aspect, the device further includes:

a startup module, configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;

a second switching module, configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.

In a third possible implementation manner of the first aspect, the device further includes:

a first search module, configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a first relationship table for the corresponding disable threshold according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;

or,

a second search module, configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a second relationship table for the corresponding disable threshold according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.

In conjunction with the second possible implementation of the first aspect, in a fourth possible implementation manner of the first aspect, the device further includes:

a third search module, configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a third relationship table for the corresponding predetermined timer duration according to the level, where the third relationship table stores at least one level and a predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;

or,

a fourth search module, configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a fourth relationship table for the corresponding predetermined timer duration according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.

In conjunction with the second possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the first switching module is configured to set the identification value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;

the second switching module is configured to set the identification value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.

In a second aspect, a fault processing method is provided for a server including at least one hardware module, the method comprising:

counting the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

detecting whether the frequency is greater than a disable threshold;

when it is detected that the frequency is greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state.

In a first possible implementation manner of the second aspect, the counting of the frequency at which the hardware module in the server generates correctable fault interrupts within a predetermined time period includes:

reading, by an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, where the interrupt handler is the interrupt handler used to process the correctable fault, and the MCE memory is the MCE memory corresponding to the hardware module;

calculating, by the interrupt handler, the frequency according to the predetermined time period and the number of correctable fault interrupts;

the detecting of whether the frequency is greater than a disable threshold includes:

detecting, by the interrupt handler, whether the frequency is greater than the disable threshold.
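
By way of illustration only, the reading, calculating and detecting steps of this implementation manner can be sketched in Python as follows. The dictionary `mce_memory` merely stands in for the MCE memory of each hardware module; its contents, the unit of interrupts per second, and all function names are assumptions made for the sketch and are not taken from the source.

```python
# Simulated MCE memory: number of correctable fault interrupts recorded for
# each hardware module during the predetermined time period (hypothetical data).
mce_memory = {"memory": 1200, "cpu": 30}


def read_interrupt_count(module_name):
    """Read the interrupt count for one hardware module from the simulated MCE memory."""
    return mce_memory[module_name]


def interrupt_frequency(interrupt_count, period_seconds):
    """Frequency = number of interrupts / length of the predetermined time period."""
    if period_seconds <= 0:
        raise ValueError("the predetermined time period must be positive")
    return interrupt_count / period_seconds


def exceeds_disable_threshold(module_name, period_seconds, disable_threshold):
    """Detect whether the frequency is strictly greater than the disable threshold."""
    count = read_interrupt_count(module_name)
    return interrupt_frequency(count, period_seconds) > disable_threshold
```

With a 60-second period and a disable threshold of 10 interrupts per second, the hypothetical "memory" module above (20 interrupts per second) exceeds the threshold, while the "cpu" module (0.5 interrupts per second) does not.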

In a second possible implementation manner of the second aspect, the method further includes:

starting a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;

when the timer reaches a predetermined duration, switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
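
By way of illustration only, the timer-based return to the enabled state can be sketched in Python as follows. The class name `CorrectableFaultInterruptGate`, the explicit `now` argument (used in place of a real hardware or operating system timer), and the use of seconds as the unit are assumptions made for the sketch.

```python
class CorrectableFaultInterruptGate:
    """Tracks the enabled/disabled state of one hardware module's interrupt."""

    def __init__(self, reenable_after_seconds):
        self.enabled = True
        self.reenable_after_seconds = reenable_after_seconds
        self._disabled_at = None  # time at which the timer was started

    def disable(self, now):
        """Switch to the disabled state and start the timer."""
        self.enabled = False
        self._disabled_at = now

    def tick(self, now):
        """Switch back to the enabled state once the timer reaches its duration."""
        if not self.enabled and now - self._disabled_at >= self.reenable_after_seconds:
            self.enabled = True
            self._disabled_at = None
        return self.enabled
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would rely on whatever timer facility the platform provides.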

In a third possible implementation manner of the second aspect, before the detecting whether the frequency is greater than a disable threshold, the method further includes:

acquiring a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server; and searching a first relationship table for the corresponding disable threshold according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;

or,

acquiring a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and searching a second relationship table for the corresponding disable threshold according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.
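
By way of illustration only, the two relationship tables can be modelled in Python as plain mappings from a level to a disable threshold. The level names, the numeric thresholds and the function name `lookup_disable_threshold` are arbitrary placeholders chosen for the sketch; they are not taken from the source and do not imply any particular relationship between a level and its threshold.

```python
# First relationship table: real-time requirement level -> disable threshold.
FIRST_RELATIONSHIP_TABLE = {"high": 5.0, "medium": 20.0, "low": 50.0}

# Second relationship table: service processing capability level -> disable threshold.
SECOND_RELATIONSHIP_TABLE = {1: 5.0, 2: 20.0, 3: 50.0}


def lookup_disable_threshold(relationship_table, level):
    """Return the disable threshold stored for the acquired level.

    The table is required to contain the acquired level, so a missing
    level is reported as an error instead of being silently defaulted.
    """
    try:
        return relationship_table[level]
    except KeyError:
        raise ValueError(f"level {level!r} is not in the relationship table") from None
```

Either table may be used on its own: the first is indexed by the real-time requirement level of the service, the second by the service processing capability level of the server.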

In conjunction with the second possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, before the starting of the timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, the method further includes:

acquiring a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server; and searching a third relationship table for the corresponding predetermined timer duration according to the level, where the third relationship table stores at least one level and a predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;

or,

acquiring a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and searching a fourth relationship table for the corresponding predetermined timer duration according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.

In conjunction with the second possible implementation manner of the second aspect, in a fifth possible implementation manner of the second aspect, the switching of the correctable fault interrupt of the hardware module from the enabled state to the disabled state includes:

setting the identification value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;

the switching of the correctable fault interrupt of the hardware module from the disabled state to the enabled state includes:

setting the identification value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
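
By way of illustration only, setting the identification value can be sketched in Python by treating the enable register as an integer in which one bit is the identification value. The source does not specify the register layout, so the bit position `ENABLE_BIT`, the convention that 1 means enabled, and the function names are assumptions made for the sketch.

```python
ENABLE_BIT = 1 << 30  # assumed position of the identification value in the register


def set_disable_value(register_value):
    """Clear the identification bit: the interrupt enters the disabled state."""
    return register_value & ~ENABLE_BIT


def set_enable_value(register_value):
    """Set the identification bit: the interrupt enters the enabled state."""
    return register_value | ENABLE_BIT


def is_enabled(register_value):
    """Report whether the identification bit currently holds the enable value."""
    return bool(register_value & ENABLE_BIT)
```

Only the identification bit is modified, so any other fields held in the same register keep their values across a switch.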

The beneficial effects brought by the technical solutions provided by the embodiments of the present invention are:

The frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period is counted; whether the frequency is greater than the disable threshold is detected; and when the frequency is detected to be greater than the disable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state. This solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state, which occupies a large amount of its processing resources and may even prevent it from operating normally. It achieves the effect that, when a hardware module generates a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can operate normally, and the operating efficiency of the operating system is improved.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a structural block diagram of a fault processing apparatus according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of a fault processing apparatus according to another embodiment of the present invention;

FIG. 3A is a block diagram of a fault processing apparatus according to an embodiment of the present invention;

FIG. 3B is a block diagram of a fault processing apparatus according to another embodiment of the present invention;

FIG. 4 is a flowchart of a fault processing method according to an embodiment of the present invention;

FIG. 5A is a flowchart of a fault processing method according to another embodiment of the present invention;

FIG. 5B is a schematic diagram of an implementation of a fault processing method according to another embodiment of the present invention;

FIG. 6 is a flowchart of a fault processing method according to still another embodiment of the present invention.

Detailed description

The embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

For ease of understanding, some terms appearing in the embodiments of the present invention are first explained:

Disabled state: a state in which a hardware module cannot generate a correctable fault interrupt when a correctable fault occurs, that is, a state in which the operating system cannot receive correctable fault interrupts generated by the hardware module. The mechanisms by which the individual hardware modules generate correctable fault interrupts are usually independent of one another.

Enabled state: a state in which a hardware module can generate a correctable fault interrupt when a correctable fault occurs, that is, a state in which the operating system can receive correctable fault interrupts generated by the hardware module.

Positive correlation: two variables change in the same direction. When one variable increases, the other also increases; when one variable decreases, the other also decreases. The two may be linearly or nonlinearly related.

Negative correlation: two variables change in opposite directions. When one variable increases, the other decreases; when one variable decreases, the other increases. The two may be linearly or nonlinearly related.

Correctable fault interrupt enable register: the correctable fault interrupt of a hardware module is switched between the enabled state and the disabled state by setting the identification value in the correctable fault interrupt enable register. Each hardware module corresponds to its own correctable fault interrupt enable register.

Please refer to FIG. 1, which is a structural block diagram of a fault processing apparatus according to an embodiment of the present invention. The fault processing apparatus includes:

The statistics module 110 is configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

The detecting module 120 is configured to detect whether the frequency is greater than a disable threshold;

The first switching module 130 is configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when it is detected that the frequency is greater than the disable threshold.

In summary, the fault processing apparatus provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state, which occupies a large amount of its processing resources and may even prevent it from operating normally. It achieves the effect that, when a hardware module generates a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can operate normally, and the operating efficiency of the operating system is improved.

Referring to FIG. 2, it is a structural block diagram of a fault processing apparatus according to another embodiment of the present invention. The fault processing apparatus includes:

The statistics module 210 is configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

The detecting module 220 is configured to detect whether the frequency is greater than a disable threshold;

The first switching module 230 is configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when it is detected that the frequency is greater than the disable threshold.

Optionally, the statistics module 210 includes:

The reading module 211 is configured to read, by an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, where the interrupt handler is the interrupt handler used to process the correctable fault, and the MCE memory is the MCE memory corresponding to the hardware module;

The calculating module 212 is configured to calculate, by the interrupt handler, the frequency according to the predetermined time period and the number of correctable fault interrupts;

The detecting module 220 is configured to detect, by the interrupt handler, whether the frequency is greater than the disable threshold.

Optionally, the device further includes:

The startup module 240 is configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;

The second switching module 250 is configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.

Optionally, the device further includes:

The first search module 260 is configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a first relationship table for the corresponding disable threshold according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;

or,

The second search module 270 is configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a second relationship table for the corresponding disable threshold according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.

Optionally, the device further includes:

The third search module 280 is configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a third relationship table for the corresponding predetermined timer duration according to the level, where the third relationship table stores at least one level and a predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;

or,

The fourth search module 290 is configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a fourth relationship table for the corresponding predetermined timer duration according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.

Optionally, the first switching module 230 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;

The second switching module 250 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.

In this embodiment, a timer is also started when the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state, and the enabled state is maintained as long as the frequency at which the hardware module generates correctable fault interrupts does not exceed the disable threshold, so that correctable fault interrupts generated after the correctable fault storm ends are processed in a timely manner.

Please refer to FIG. 3A, which shows a block diagram of a fault handling apparatus according to an embodiment of the present invention. The fault processing apparatus may include a processor 310 and at least one hardware module 320, wherein the processor 310 and the at least one hardware module 320 are electrically connected. This embodiment is described with at least one hardware module 320 including a hardware module 321 and a hardware module 322.

The processor 310 is configured to count the frequency at which a hardware module of the at least one hardware module 320 in the server generates correctable fault interrupts within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

The processor 310 is configured to detect whether the frequency is greater than a disable threshold;

The processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state when it is detected that the frequency is greater than the disable threshold.

Based on FIG. 3A, the fault processing apparatus may further include an MCE memory and a correctable fault interrupt enable register corresponding to each hardware module, and a memory for storing one or more programs, including the interrupt handler for processing correctable faults. Taking the case in which the at least one hardware module 320 includes the hardware module 321 and the hardware module 322 as an example, as shown in FIG. 3B, the fault processing apparatus 300 includes the processor 310, the hardware module 321, the hardware module 322, the MCE memory 331 corresponding to the hardware module 321, the correctable fault interrupt enable register 341 corresponding to the hardware module 321, the MCE memory 332 corresponding to the hardware module 322, the correctable fault interrupt enable register 342 corresponding to the hardware module 322, and the memory 350. The processor 310 is electrically connected to the at least one hardware module 320, the memory 350, and the MCE memory and the correctable fault interrupt enable register corresponding to each hardware module.

Specifically, when counting the frequency at which the hardware module in the server generates correctable fault interrupts within a predetermined time period, the processor 310 is configured to read, by the interrupt handler, the number of correctable fault interrupts generated by the hardware module 320 within the predetermined time period from the MCE memory, where the interrupt handler is the interrupt handler for processing correctable faults, and the MCE memory is the MCE memory corresponding to the hardware module 320;

The processor 310 is configured to calculate, by the interrupt handler, the frequency according to the predetermined time period and the number of correctable fault interrupts;

The processor 310 is configured to detect, by the interrupt handler, whether the frequency is greater than the disable threshold.

Specifically, the processor 310 is configured to start a timer when the correctable fault interrupt of the hardware module 320 is switched from the enabled state to the disabled state;

The processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the disabled state to the enabled state when the timer reaches a predetermined duration.

Specifically, when determining the disable threshold, the processor 310 is configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a first relationship table for the corresponding disable threshold according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;

or,

to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a second relationship table for the corresponding disable threshold according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.

Specifically, when determining the predetermined timer duration, the processor 310 is configured to acquire a level of the real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to search a third relationship table for the corresponding predetermined timer duration according to the level, where the third relationship table stores at least one level and a predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;

or,

to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to search a fourth relationship table for the corresponding predetermined timer duration according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.

The processor 310 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module 320 to a disable value;

The processor 310 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module 320 to an enable value.

Please refer to FIG. 4, which is a flowchart of a fault processing method according to an embodiment of the present invention. The method is applicable to a server that includes at least one hardware module, and the fault processing method includes:

Step 402: Count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

The Corrected Machine-Check Error Interrupt (CMCI) refers to the interrupt generated by the hardware module when a correctable fault occurs. The interrupt is used to notify the operating system to enter the interrupt handler to process the correctable fault.

Step 404: Detect whether the frequency is greater than a disable threshold;

Step 406: When it is detected that the frequency is greater than the disable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.

In summary, the fault processing method provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when the hardware module has a large number of correctable faults in a short time, the operating system is in a continuous fault handling state, which occupies a large amount of the operating system's processing resources and may even prevent the operating system from running normally. The effect is that, when the hardware module has a large number of correctable faults in a short time, the number of correctable fault interrupts is reduced, the operating system can run normally, and the operating efficiency of the operating system is improved.

Please refer to FIG. 5A, which is a flowchart of a fault processing method according to another embodiment of the present invention. The method is applicable to a server that includes at least one hardware module, and the fault processing method includes:

Step 501: Count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs.

The server may be an X86-based device. Since most existing servers use the X86 architecture, this embodiment is described using an X86 architecture device as an example, which does not limit the present invention.

A hardware module is a hardware processing device with its own processing function in the X86 architecture device, and an X86 architecture device includes at least one hardware module. In the X86 architecture device, each hardware module corresponds to its own MCE memory, which is used to store the correctable fault interrupts generated by that hardware module. The interrupt handler can obtain the frequency of correctable fault interrupts by reading, from the MCE memory corresponding to the hardware module, the number of correctable fault interrupts generated within the predetermined time period. This step may include the following sub-steps:

1. The X86 architecture device reads, through an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from the MCE memory, where the interrupt handler is the handler for processing correctable faults and the MCE memory is the MCE memory corresponding to the hardware module.

When a correctable fault occurs in the hardware module, the hardware module generates a correctable fault interrupt according to the correctable fault and notifies the operating system to enter the interrupt handler to process the correctable fault interrupt. The interrupt handler determines, according to the correctable fault interrupt, the hardware module in which the fault occurred, and reads from the MCE memory corresponding to that hardware module the number of correctable fault interrupts generated by the hardware module within the predetermined time period. The predetermined time period is preset by the operating system and may be 5 seconds.

For example, the interrupt handler receives a correctable fault interrupt notification, determines that the hardware module in which the correctable fault occurred is hardware module A, and reads from the MCE memory A corresponding to hardware module A that the number of correctable fault interrupts generated in the last 5 seconds is 10.

2. The X86 architecture device calculates, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts.

The interrupt handler calculates the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period from the number of correctable fault interrupts read for that period and the length of the period.

For example, if the number of correctable fault interrupts read by the interrupt handler for the predetermined time period is 10 and the predetermined time period is 5 seconds, the calculated frequency at which the hardware module generates correctable fault interrupts is 10 times/5 seconds.

It should be noted that, while the X86 architecture device is running, multiple hardware modules may have correctable faults at the same time, so the interrupt handler needs to count the frequency of correctable fault interrupts separately for each hardware module. This embodiment describes only the case in which the interrupt handler counts the frequency for one hardware module, which does not limit the present invention.
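The per-module counting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name InterruptCounter, the module labels and the use of in-memory timestamps instead of the MCE memory are all assumptions made for the example.

```python
from collections import defaultdict, deque


class InterruptCounter:
    """Hypothetical per-module counter over a sliding time window."""

    def __init__(self, period_seconds: float = 5.0):
        # The source gives 5 seconds as a possible predetermined time period.
        self.period_seconds = period_seconds
        self._events = defaultdict(deque)  # module name -> interrupt timestamps

    def record(self, module: str, timestamp: float) -> None:
        """Record one correctable fault interrupt for the given module."""
        self._events[module].append(timestamp)

    def count(self, module: str, now: float) -> int:
        """Return the number of interrupts in the window (now - period, now]."""
        events = self._events[module]
        while events and events[0] <= now - self.period_seconds:
            events.popleft()  # drop interrupts older than the window
        return len(events)
```

Each module keeps its own queue, so a storm on hardware module A does not affect the count read for hardware module B.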

Step 502: Detect whether the frequency is greater than a disable threshold.

The X86 architecture device detects, through the interrupt handler, whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, it can be determined that a correctable fault storm has occurred in the hardware module; when the frequency is less than the disable threshold, it can be determined that no correctable fault storm has occurred in the hardware module.
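As a rough sketch of steps 501 and 502, the frequency can be held as an exact ratio and compared with the disable threshold. The function names are hypothetical, and expressing both values as interrupts per second is an assumption made so that thresholds quoted over different periods remain comparable.

```python
from fractions import Fraction


def interrupt_frequency(count: int, period_seconds: int) -> Fraction:
    """Return the interrupt frequency in interrupts per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return Fraction(count, period_seconds)


def is_fault_storm(count: int, period_seconds: int,
                   disable_threshold: Fraction) -> bool:
    """Step 502: a storm is assumed only when the frequency exceeds the threshold."""
    return interrupt_frequency(count, period_seconds) > disable_threshold
```

With the example values from the text, 10 interrupts in 5 seconds exceed a threshold of 8 times/5 seconds but do not exceed a threshold of 10 times/5 seconds.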

The disable threshold can be set in advance, or can be set in real time according to the real-time requirement of the service processed in the X86 architecture device or according to the service processing capability of the X86 architecture device. Setting the disable threshold can be implemented in the following two possible ways:

In the first possible implementation, the X86 architecture device acquires the level of the real-time requirement of the service processed in the X86 architecture device, where the service is a task run on the basis of at least one hardware module in the X86 architecture device, and looks up, according to the level, the corresponding disable threshold in the first relationship table. The first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level.

When the service processed by the X86 architecture device has a high real-time requirement, the service cannot be processed in time if the operating system frequently enters the interrupt handler to process correctable fault interrupts, so a smaller disable threshold can be set so that the operating system can process the current service in time. When the service processed by the X86 architecture device has a low real-time requirement, a larger disable threshold can be set.

The first relationship table in the operating system pre-stores the correspondence between each level of the service's real-time requirement and the corresponding disable threshold. Each level is negatively correlated with the corresponding disable threshold: the higher the level of the service's real-time requirement, the smaller the corresponding disable threshold, and the lower the level, the larger the corresponding disable threshold. The structure of the first relationship table is shown by way of example in Table 1:

Table 1

Level of the service's real-time requirement	Disable threshold
1	10 times/5 seconds
2	8 times/5 seconds
3	5 times/5 seconds

A higher level indicates that the real-time requirement of the service is higher, and a lower level indicates that the real-time requirement of the service is lower.

The operating system acquires the level of the real-time requirement of the service processed by the X86 architecture device, looks up the corresponding disable threshold in the first relationship table, and uses the value found as the disable threshold for the service.
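The lookup in the first relationship table can be sketched as a dictionary keyed by level. The values are those of Table 1; the dictionary name, the function name and the (count, seconds) tuple representation are assumptions made for illustration.

```python
# Table 1: level of the real-time requirement -> (interrupt count, period in seconds)
DISABLE_THRESHOLD_BY_REALTIME_LEVEL = {
    1: (10, 5),
    2: (8, 5),
    3: (5, 5),
}


def disable_threshold_for_level(level: int) -> tuple:
    """Return the disable threshold stored for the acquired level."""
    try:
        return DISABLE_THRESHOLD_BY_REALTIME_LEVEL[level]
    except KeyError:
        # The source states that the table includes the acquired level,
        # so a missing level is treated as a configuration error here.
        raise ValueError(f"level {level} is not in the first relationship table")
```

A service with level 3 (the strictest real-time requirement in Table 1) therefore gets the smallest threshold, 5 times/5 seconds.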

In the second possible implementation, the X86 architecture device acquires the service processing capability level of the X86 architecture device, where the service processing capability level is determined based on the at least one hardware module, and looks up, according to the service processing capability level, the corresponding disable threshold in the second relationship table. The second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.

X86 architecture devices differ in service processing capability, so the processing resources and time consumed when the operating system enters the interrupt handler for fault processing also differ. The operating system can therefore set the disable threshold according to the service processing capability of the X86 architecture device.

The second relationship table in the operating system pre-stores the correspondence between each service processing capability level of the X86 architecture device and the corresponding disable threshold. Each service processing capability level is positively correlated with the corresponding disable threshold: the higher the service processing capability level, the larger the corresponding disable threshold, and the lower the service processing capability level, the smaller the corresponding disable threshold. The structure of the second relationship table is shown by way of example in Table 2:

Table 2

Service processing capability level	Disable threshold
1	5 times/5 seconds
2	8 times/5 seconds
3	10 times/5 seconds

A higher service processing capability level indicates that the service processing capability of the X86 architecture device is stronger, and a lower level indicates that it is weaker. The service processing capability level can be assigned according to the hardware score of the X86 architecture device.

The operating system acquires the service processing capability level of the X86 architecture device, looks up the corresponding disable threshold in the second relationship table, and uses the value found as the disable threshold of the X86 architecture device.

It should be noted that the operating system may also set the disable threshold by taking into account both the level of the service's real-time requirement and the service processing capability level of the X86 architecture device, which does not limit the present invention.
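The source does not say how the two criteria are combined. One possible reading, shown here purely as an assumption, is to look up both tables and keep the stricter (smaller) of the two thresholds. The table values are those of Table 1 and Table 2, expressed as interrupts per 5 seconds; all names are hypothetical.

```python
# Values from Table 1 and Table 2, in interrupts per 5 seconds.
THRESHOLD_BY_REALTIME_LEVEL = {1: 10, 2: 8, 3: 5}
THRESHOLD_BY_CAPABILITY_LEVEL = {1: 5, 2: 8, 3: 10}


def combined_disable_threshold(realtime_level: int, capability_level: int) -> int:
    """Assumed policy: use the smaller threshold so that neither criterion is violated."""
    by_realtime = THRESHOLD_BY_REALTIME_LEVEL[realtime_level]
    by_capability = THRESHOLD_BY_CAPABILITY_LEVEL[capability_level]
    return min(by_realtime, by_capability)
```

Other combinations, such as a weighted average or a dedicated two-dimensional table, would fit the text equally well.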

Step 503: When it is detected that the frequency is greater than the disable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.

When it is detected that the frequency is greater than the disable threshold, the X86 architecture device knows that a correctable fault storm has occurred in the hardware module, which means that the hardware module will generate a large number of correctable fault interrupts in a short time. To prevent the operating system from being in a continuous fault handling state during the correctable fault storm and therefore failing to run normally, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.

When no correctable fault storm has occurred in the hardware module, the identifier value in the correctable fault interrupt enable register corresponding to the hardware module is the enable value, that is, the correctable fault interrupt of the hardware module is enabled. When it is detected that a correctable fault storm has occurred in the hardware module, the interrupt handler sets the identifier value in the correctable fault interrupt enable register corresponding to the hardware module to the disable value, that is, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state. While the correctable fault interrupt is disabled, the hardware module cannot generate correctable fault interrupts for correctable faults, and the operating system does not frequently enter the interrupt handler for fault processing.
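The enable and disable values can be pictured as a single flag bit in the register. The source does not name the register or its layout; on Intel processors the CMCI enable flag is documented as bit 30 of the IA32_MCi_CTL2 register, and the sketch below assumes that layout while modelling the register as a plain integer rather than performing any real register access.

```python
CMCI_EN_BIT = 30  # assumed position of the enable flag (Intel IA32_MCi_CTL2 layout)
CMCI_EN_MASK = 1 << CMCI_EN_BIT


def enable_interrupt(register_value: int) -> int:
    """Return the register value with the enable flag set."""
    return register_value | CMCI_EN_MASK


def disable_interrupt(register_value: int) -> int:
    """Return the register value with the enable flag cleared; other bits are kept."""
    return register_value & ~CMCI_EN_MASK


def is_enabled(register_value: int) -> bool:
    """Report whether the correctable fault interrupt is enabled."""
    return bool(register_value & CMCI_EN_MASK)
```

Clearing only the flag bit leaves any other fields of the register untouched, which matters if the same register also holds an error-count threshold.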

Step 504: Start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.

When the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state, it starts the preset timer. Until the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module remains disabled and the operating system does not enter the interrupt handler for fault processing.

It should be noted that there is no strict ordering between step 503 and step 504, and the two can be executed at the same time. This embodiment describes step 503 before step 504 only as an example, which does not limit the present invention.

Step 505: When the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state.

If the correctable fault interrupt of the hardware module stayed disabled after the correctable fault storm, the operating system could no longer receive correctable fault interrupts and handle the faults. To prevent this, when the timer reaches the predetermined duration, the interrupt handler sets the identifier value in the correctable fault interrupt enable register corresponding to the hardware module to the enable value, that is, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state. From then on, the hardware module can again generate a correctable fault interrupt for a correctable fault and notify the operating system to enter the interrupt handler for fault processing. The steps performed by the interrupt handler for fault processing are similar to those of the prior art and are not described here again.

The predetermined duration of the timer can be set in advance, or can be set in real time according to the real-time requirement of the service processed by the X86 architecture device or according to the service processing capability of the X86 architecture device. Setting the predetermined duration of the timer can be implemented in the following two possible ways:

In the first possible implementation, the X86 architecture device acquires the level of the real-time requirement of the service processed in the X86 architecture device, where the service is a task run on the basis of at least one hardware module in the X86 architecture device, and looks up, according to the level, the corresponding predetermined timer duration in the third relationship table. The third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level.

The third relationship table in the operating system pre-stores the correspondence between each level of the service's real-time requirement and the predetermined timer duration. Each level is positively correlated with the corresponding predetermined timer duration: the higher the level of the service's real-time requirement, the longer the corresponding predetermined timer duration, and the lower the level, the shorter the corresponding predetermined timer duration. The structure of the third relationship table is shown by way of example in Table 3:

Table 3

Level of the service's real-time requirement	Predetermined timer duration
1	100 seconds
2	120 seconds
3	150 seconds

A higher level indicates that the real-time requirement of the service is higher, and a lower level indicates that the real-time requirement of the service is lower. The operating system acquires the level of the real-time requirement of the service processed in the X86 architecture device, looks up the corresponding predetermined timer duration in the third relationship table, and uses the value found as the predetermined duration of the current timer.

In the second possible implementation, the X86 architecture device acquires the service processing capability level of the X86 architecture device, where the service processing capability level is determined based on the at least one hardware module, and looks up, according to the service processing capability level, the corresponding predetermined timer duration in the fourth relationship table. The fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.

The fourth relationship table in the operating system pre-stores the correspondence between each service processing capability level of the X86 architecture device and the predetermined timer duration. Each service processing capability level is negatively correlated with the corresponding predetermined timer duration: the higher the service processing capability level, the shorter the corresponding predetermined timer duration, and the lower the service processing capability level, the longer the corresponding predetermined timer duration. The structure of the fourth relationship table is shown by way of example in Table 4:

Table 4

Service processing capability level	Predetermined timer duration
1	150 seconds
2	120 seconds
3	100 seconds

A higher service processing capability level indicates that the service processing capability of the X86 architecture device is stronger, and a lower level indicates that it is weaker. The service processing capability level can be assigned according to the hardware score of the X86 architecture device.

The operating system acquires the service processing capability level of the X86 architecture device, looks up the corresponding predetermined timer duration in the fourth relationship table, and uses the value found as the predetermined duration of the current timer.
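The stated correlations can be checked mechanically against the example tables. The sketch below holds the values of Table 3 and Table 4 (in seconds) in dictionaries and verifies that the first increases with the level and the second decreases; the names and the helper function are assumptions made for illustration.

```python
# Predetermined timer durations in seconds, from Table 3 and Table 4.
TIMER_BY_REALTIME_LEVEL = {1: 100, 2: 120, 3: 150}
TIMER_BY_CAPABILITY_LEVEL = {1: 150, 2: 120, 3: 100}


def is_monotonic(table: dict, increasing: bool) -> bool:
    """Check that durations strictly rise (or fall) as the level rises."""
    values = [table[level] for level in sorted(table)]
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)


def timer_duration(table: dict, level: int) -> int:
    """Look up the predetermined timer duration for the acquired level."""
    return table[level]
```

A check of this kind could be run when the tables are configured, so that an entry that breaks the intended correlation is noticed before it is used.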

It should be noted that the operating system may also set the predetermined duration of the timer by taking into account both the level of the service's real-time requirement and the service processing capability level of the X86 architecture device, which does not limit the present invention.

Clearly, when the service processed in the X86 architecture device has a higher real-time requirement or the service processing capability of the X86 architecture device is weaker, the corresponding predetermined timer duration is longer, which ensures that the operating system processes the service in time. It should be noted that the timer is reset when it reaches the predetermined duration. In addition, so that the operating system knows approximately how many correctable faults may have occurred in the hardware module during the correctable fault storm, the timer calculates an estimate of that number. The estimate may be the product of the predetermined duration set for the timer and the frequency, obtained in step 501, at which the hardware module generates correctable fault interrupts.

For example, if the predetermined duration set for the timer is 100 seconds and the calculated frequency at which the hardware module generates correctable fault interrupts is 10 times/5 seconds, the estimated number of correctable faults occurring during the correctable fault storm is 200. This estimate is mainly used for statistics on the number of correctable faults.
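The estimate in this example is a single multiplication, sketched below with exact rational arithmetic. The function name is hypothetical, and rounding the result down to a whole number is an assumption; the source only describes the product.

```python
from fractions import Fraction


def estimated_fault_count(timer_seconds: int, interrupts: int,
                          period_seconds: int) -> int:
    """Estimate faults during the disabled interval as duration x frequency."""
    frequency = Fraction(interrupts, period_seconds)  # interrupts per second
    return int(timer_seconds * frequency)  # rounded down (assumed)
```

With the values from the text, `estimated_fault_count(100, 10, 5)` evaluates to 200.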

Step 506: Detect again whether the frequency at which the hardware module generates correctable fault interrupts is less than an enable threshold.

After the interrupt handler switches the correctable fault interrupt of the hardware module from the disabled state to the enabled state, it counts the correctable fault interrupts received within the predetermined time period again and calculates the frequency of correctable fault interrupts generated within that period.

The interrupt handler detects whether the calculated frequency is less than a preset enable threshold. The enable threshold is a preset threshold used to detect whether the correctable fault storm has ended, and may be 1 time/5 seconds.

Step 507: When it is detected that the frequency at which the hardware module generates the correctable fault interrupt is less than the enable threshold, the correctable fault interrupt of the hardware module is maintained as an enabled state.

When it is detected that the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold, the interrupt handler can determine that the correctable fault storm has ended and that the correctable fault interrupts subsequently generated by the hardware module will not put the operating system into a continuous fault handling state, that is, the operating system can run normally. Accordingly, the correctable fault interrupt of the hardware module remains enabled.

It should be noted that, to prevent a recurring correctable fault storm in the hardware module from putting the operating system into continuous fault handling, the interrupt handler continues to detect whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state and restarts the timer.

Step 508: When it is detected that the frequency at which the hardware module generates the correctable fault interrupt is greater than the enable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, and the timer is restarted.

When it is detected that the frequency of the correctable fault interrupt generated by the hardware module is greater than the enable threshold, the interrupt handler considers that the correctable fault storm has not ended, and re-switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state, and Restart the timer.

When the timer again reaches the predetermined duration, the interrupt handler performs steps 506 to 508 again.
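Steps 502 to 508 together form a small state machine, sketched below. It is an illustration only: the class and method names are hypothetical, counts are compared per sampling window instead of as frequencies, the default values are the example figures from the text, and a count exactly equal to the enable threshold is treated as "storm not ended", a case the source leaves open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StormController:
    disable_threshold: int = 10   # interrupts per window (example value)
    enable_threshold: int = 1     # interrupts per window (example value)
    timer_seconds: float = 100.0  # predetermined timer duration
    enabled: bool = True
    rechecking: bool = False      # True for the first window after re-enabling
    deadline: Optional[float] = None

    def _disable(self, now: float) -> None:
        # Steps 503/504 and 508: disable the interrupt and (re)start the timer.
        self.enabled = False
        self.rechecking = False
        self.deadline = now + self.timer_seconds

    def on_window(self, count: int, now: float) -> None:
        """Evaluate the interrupt count of one completed sampling window."""
        if not self.enabled:
            return  # no interrupts are delivered while disabled
        if self.rechecking:
            if count < self.enable_threshold:
                self.rechecking = False  # step 507: storm over, stay enabled
            else:
                self._disable(now)       # step 508: storm continues
        elif count > self.disable_threshold:
            self._disable(now)           # step 503: storm detected

    def on_tick(self, now: float) -> None:
        """Step 505: re-enable the interrupt once the timer has expired."""
        if not self.enabled and self.deadline is not None and now >= self.deadline:
            self.enabled = True
            self.rechecking = True
            self.deadline = None
```

A run in which 12 interrupts arrive in the first window, 3 in the first window after re-enabling and 0 in the next one passes through the states disabled, enabled (rechecking), disabled, enabled (rechecking), enabled.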

Clearly, by adding a mechanism for detecting the frequency of correctable fault interrupts to the interrupt handler of the operating system, the correctable fault interrupt of the hardware module is disabled when a correctable fault storm occurs in the hardware module. The operating system does not enter continuous fault handling and can therefore run normally, which greatly improves the stability of the operating system.

FIG. 5B shows a schematic implementation of the fault processing method provided by this embodiment. During the time period T1, the interrupt handler detects whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold; when the frequency is detected to be greater than the disable threshold, it switches the correctable fault interrupt of the hardware module to the disabled state and starts the timer. During the predetermined duration T2 set by the timer, the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the interrupt handler switches the correctable fault interrupt of the hardware module to the enabled state and detects whether the frequency at which the hardware module generates correctable fault interrupts during the time period T3 is less than the enable threshold; when the frequency is detected to be greater than the enable threshold, the interrupt handler switches the correctable fault interrupt of the hardware module to the disabled state and restarts the timer. During the predetermined duration T4 set by the timer, the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the interrupt handler again switches the correctable fault interrupt of the hardware module to the enabled state and detects whether the frequency at which the hardware module generates correctable fault interrupts during the time period T5 is less than the enable threshold; when the frequency is detected to be less than the enable threshold, the correctable fault interrupt of the hardware module remains enabled.

The correctable fault interrupt in FIG. 5A is the CMCI, and the interrupt handler is the interrupt handler in the operating system. As another possible implementation, the basic input/output system (BIOS) can be set to convert the correctable fault interrupt generated when a correctable fault occurs into a system management interrupt (SMI), and the system management interrupt is then processed by the system management interrupt handler in the basic input/output system. This implementation is described below using an embodiment.

Please refer to FIG. 6, which is a flowchart of a method for processing a fault according to still another embodiment of the present invention. The method comprises:

Step 601: Convert the correctable fault interrupt generated by the hardware module in the server into a system management interrupt.

When the operating system starts initialization, the basic input/output system is configured so that, when the hardware module generates a correctable fault interrupt, the correctable fault interrupt is converted into a system management interrupt. Accordingly, the hardware module notifies the basic input/output system to enter the system management interrupt handler to process the system management interrupt.

Step 602: Count the frequency at which the hardware module in the server generates system management interrupts within a predetermined time period.

When a correctable fault occurs, the correctable fault interrupt generated by the hardware module is converted into a system management interrupt, so the system management interrupt handler counts the system management interrupts generated within the predetermined time period and calculates the frequency of system management interrupts generated within that period. It should be noted that, while the device is running, multiple hardware modules may have correctable faults at the same time, so the system management interrupt handler needs to count the frequency of system management interrupts separately for each hardware module. This embodiment describes only the case in which the system management interrupt handler counts the frequency for one hardware module, which does not limit the present invention.

Step 603: Detect whether the frequency is greater than a disable threshold.

The system management interrupt handler detects whether the frequency at which the hardware module generates system management interrupts within the predetermined time period is greater than the disable threshold. Because the system management interrupts are converted from correctable fault interrupts, the system management interrupt handler can determine that a correctable fault storm has occurred in the hardware module when the frequency is greater than the disable threshold, and can determine that no correctable fault storm has occurred when the frequency is less than the disable threshold. The disable threshold is a preset threshold for detecting whether a correctable fault storm has occurred, and may be 10 times/5 seconds.

It should be noted that the method for setting the disable threshold is similar to the method for setting the disable threshold in step 502, and details are not described here again.

Step 604: When it is detected that the frequency is greater than the disable threshold, the system management interrupt of the hardware module is switched from the enabled state to the disabled state.

When the system management interrupt handler detects that the frequency is greater than the disable threshold, it knows that a correctable fault storm has occurred in the hardware module, which means that the hardware module will generate a large number of correctable fault interrupts in a short time. The system management interrupt handler therefore switches the system management interrupt of the hardware module from the enabled state to the disabled state.

When no correctable fault storm has occurred in the hardware module, the identifier value in the system management interrupt enable register corresponding to the hardware module is the enable value, that is, the system management interrupt of the hardware module is enabled. When the system management interrupt handler detects that a correctable fault storm has occurred in the hardware module, it sets the identifier value in the system management interrupt enable register corresponding to the hardware module to the disable value, that is, the system management interrupt of the hardware module is switched from the enabled state to the disabled state. While the system management interrupt is disabled, the hardware module cannot generate system management interrupts.

Step 605: Start a timer when the system management interrupt of the hardware module is switched from the enabled state to the disabled state.

Similar to the interrupt handler, the system management interrupt handler will also start the preset timer while switching the system management interrupt of the hardware module from the enabled state to the disabled state.

It should be noted that there is no strict order between steps 604 and 605, and the two can be executed at the same time. This embodiment is described with step 604 preceding step 605 only as an example, which does not limit the present invention.

Step 606: When the timer is timed to a predetermined duration, the system management interrupt of the hardware module is switched from the disabled state to the enabled state.

If the system management interrupt of the hardware module remained disabled after the correctable fault storm ended, the basic input/output system would be unable to receive and process system management interrupts. To prevent this, when the timer reaches the predetermined duration, the system management interrupt handler sets the identification value in the system management interrupt enable register corresponding to the hardware module to the enable value, that is, the system management interrupt of the hardware module is switched from the disabled state to the enabled state. At this point, the hardware module can again notify the basic input/output system to enter the system management interrupt handler for processing.

It should be noted that the method for setting the predetermined duration of the timer is similar to the method for setting the predetermined duration of the timer in step 505, and details are not described herein again.
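Steps 604 to 606 can be sketched as follows. The sketch makes several assumptions: the timer is driven by an explicit `now` value so that it behaves deterministically, the names `ReenableTimer`, `disable_and_start_timer` and `on_tick` are hypothetical, and the 30-second duration in the usage note is an illustrative value that does not come from the text.

```python
class ReenableTimer:
    """Hypothetical timer driven by an explicit 'now' value (in seconds)."""

    def __init__(self, predetermined_duration):
        self.predetermined_duration = predetermined_duration
        self.started_at = None

    def start(self, now):
        self.started_at = now

    def reached(self, now):
        """True once the timer has run for the predetermined duration."""
        return (self.started_at is not None
                and now - self.started_at >= self.predetermined_duration)


def disable_and_start_timer(state, timer, now):
    """Steps 604 and 605: disable the interrupt and start the timer together."""
    state["smi_enabled"] = False
    timer.start(now)


def on_tick(state, timer, now):
    """Step 606: re-enable the interrupt once the timer reaches its duration."""
    if not state["smi_enabled"] and timer.reached(now):
        state["smi_enabled"] = True
        timer.started_at = None  # the timer is idle until the next storm
```

For example, with an assumed duration of 30 seconds and the interrupt disabled at time 100, `on_tick` leaves the interrupt disabled at time 129 and re-enables it at time 130.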

Step 607: Detect again whether the frequency at which the hardware module generates the system management interrupt is less than the enable threshold.

After the system management interrupt handler switches the system management interrupt of the hardware module from the disabled state to the enabled state, it again counts the system management interrupts received within a predetermined time period and calculates the frequency at which system management interrupts are generated within that period.

The system management interrupt handler detects whether the calculated frequency is less than a preset enable threshold. The enable threshold is a preset threshold for detecting whether the correctable fault storm has ended; for example, the enable threshold may be 1 time per 5 seconds.

Step 608: When it is detected that the frequency at which the hardware module generates the system management interrupt is less than the enable threshold, the system management interrupt of the hardware module is kept enabled.

When it is detected that the frequency of generating the system management interrupt is less than the enable threshold, the system management interrupt handler can determine that the correctable fault storm has ended. Subsequent correctable fault interrupts generated by the hardware module will be converted into system management interrupts and processed by the system management interrupt handler. Correspondingly, the system management interrupt of the hardware module remains enabled.

It should be noted that the system management interrupt handler will continue to detect the frequency at which the system management interrupt is generated, and switch the system management interrupt from the enabled state to the disabled state when the frequency is greater than the disable threshold.

Step 609: When it is detected that the frequency of the system management interrupt generated by the hardware module is greater than the enable threshold, the system management interrupt of the hardware module is switched from the enabled state to the disabled state, and the timer is restarted.

When it is detected that the frequency of generating the system management interrupt is greater than the enable threshold, the system management interrupt handler considers that the correctable fault storm has not yet ended, switches the system management interrupt of the hardware module from the enabled state to the disabled state again, and restarts the timer.

When the timer again reaches the predetermined duration, the system management interrupt handler performs steps 607 through 609 again.
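The re-check loop of steps 607 to 609 can be sketched as below. The sketch uses the example enable threshold of 1 time per 5 seconds from the text; the function names and the returned labels are hypothetical. The text does not say what happens when the frequency is exactly equal to the enable threshold, so the sketch assumes that this case is treated as a storm that has not yet ended.

```python
from fractions import Fraction

# Example value from the text: 1 interrupt per 5 seconds.
ENABLE_THRESHOLD = Fraction(1, 5)  # interrupts per second


def recheck(count, period_seconds, enable_threshold=ENABLE_THRESHOLD):
    """Steps 607 to 609 for one counting window after the interrupt is re-enabled."""
    frequency = Fraction(count, period_seconds)
    if frequency < enable_threshold:
        return "keep_enabled"               # step 608: the storm has ended
    # Assumption: a frequency equal to the threshold is handled like step 609.
    return "disable_and_restart_timer"      # step 609: the storm continues


def run_rechecks(counts_per_window, period_seconds=5):
    """Apply recheck to successive windows until the storm is found to have ended."""
    decisions = []
    for count in counts_per_window:
        decision = recheck(count, period_seconds)
        decisions.append(decision)
        if decision == "keep_enabled":
            break
    return decisions
```

Because the disable threshold (10 times per 5 seconds in the example) is higher than the enable threshold (1 time per 5 seconds), the two thresholds together act as a hysteresis band: the interrupt is only left enabled once the fault rate has dropped well below the rate that triggered the storm handling.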

In summary, the fault processing method provided in this embodiment converts the correctable fault interrupts generated by a hardware module in the server into system management interrupts; counts the frequency at which the hardware module generates system management interrupts within a predetermined time period; detects whether the frequency is greater than the disable threshold; and, when the frequency is detected to be greater than the disable threshold, switches the system management interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module produces a large number of correctable faults in a short time, the operating system stays in a continuous fault handling state, which occupies a large amount of the operating system's processing resources and may even prevent the operating system from running normally. As a result, when the hardware module has a large number of correctable faults in a short period of time, fewer correctable fault interrupts are generated, the operating system can run normally, and the operating efficiency of the operating system is improved.

In this embodiment, the correctable fault interrupt generated by the hardware is converted into a system management interrupt by the basic input/output system and is processed by the system management interrupt handler of the basic input/output system, which further reduces the pressure on the operating system and helps ensure that the operating system runs stably.

The serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware, and the program may be stored in a computer readable storage medium. The storage medium mentioned may be a read only memory, a magnetic disk, an optical disk, or the like.

The above are only the preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

A fault processing apparatus, characterized in that it is used in a server including at least one hardware module, and the apparatus comprises:

a statistics module, configured to count a frequency at which a hardware module in the server generates a correctable fault interrupt within a predetermined time period, where the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;

a detecting module, configured to detect whether the frequency is greater than a disable threshold;

a first switching module, configured to switch the correctable fault interrupt of the hardware module from an enabled state to a disabled state when it is detected that the frequency is greater than the disable threshold.
The apparatus according to claim 1, wherein the statistics module comprises:

a reading module, configured to read, by an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, wherein the interrupt handler is an interrupt handler for processing the correctable fault interrupt, and the MCE memory is the MCE memory corresponding to the hardware module;

a calculating module, configured to calculate, by the interrupt handler, the frequency according to the predetermined time period and the number of the correctable fault interrupts;

the detecting module is configured to detect, by the interrupt handler, whether the frequency is greater than the disable threshold.
The apparatus according to claim 1, wherein the apparatus further comprises:

a startup module, configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;

a second switching module, configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
The apparatus according to claim 1, wherein the apparatus further comprises:

a first search module, configured to acquire a level of the real-time requirement of the service processed in the server, where the service is based on a task run by at least one hardware module in the server; and search a first relationship table for the corresponding disable threshold according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each of the levels, and the at least one level in the first relationship table includes the acquired level;

or,

a second search module, configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and search a second relationship table for the corresponding disable threshold according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each of the service processing capability levels, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.
The apparatus according to claim 3, wherein the apparatus further comprises:

a third search module, configured to acquire a level of the real-time requirement of the service processed in the server, where the service is based on a task run by at least one hardware module in the server; and search a third relationship table for the corresponding predetermined duration of the timer according to the level, where the third relationship table stores at least one level and a predetermined duration of the timer corresponding to each of the levels, and the at least one level in the third relationship table includes the acquired level;

or,

a fourth search module, configured to acquire a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and search a fourth relationship table for the corresponding predetermined duration of the timer according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a predetermined duration of the timer corresponding to each of the service processing capability levels, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.
The apparatus according to claim 3, wherein the first switching module is configured to set an identification value in a correctable fault interrupt enable register corresponding to the hardware module to a disable value;

the second switching module is configured to set the identification value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
A fault processing method, characterized in that it is used in a server including at least one hardware module, and the method comprises:

counting a frequency at which a hardware module in the server generates a correctable fault interrupt within a predetermined time period, the correctable fault interrupt being an interrupt generated by the hardware module when a correctable fault occurs;

detecting whether the frequency is greater than a disable threshold;

when it is detected that the frequency is greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state.
The method according to claim 7, wherein the counting a frequency at which a hardware module in the server generates a correctable fault interrupt within a predetermined time period comprises:

reading, by an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, the interrupt handler being an interrupt handler for processing the correctable fault interrupt, and the MCE memory being the MCE memory corresponding to the hardware module;

calculating, by the interrupt handler, the frequency according to the predetermined time period and the number of the correctable fault interrupts;

the detecting whether the frequency is greater than a disable threshold comprises:

detecting, by the interrupt handler, whether the frequency is greater than the disable threshold.
The method according to claim 7, wherein the method further comprises:

starting a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;

when the timer reaches a predetermined duration, switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
The method according to claim 7, wherein before the detecting whether the frequency is greater than a disable threshold, the method further comprises:

acquiring a level of the real-time requirement of the service processed in the server, where the service is based on a task run by at least one hardware module in the server; and searching a first relationship table for the corresponding disable threshold according to the level, the first relationship table storing at least one level and a disable threshold corresponding to each of the levels, and the at least one level in the first relationship table including the acquired level;

or,

acquiring a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and searching a second relationship table for the corresponding disable threshold according to the service processing capability level, the second relationship table storing at least one service processing capability level and a disable threshold corresponding to each of the service processing capability levels, and the at least one service processing capability level in the second relationship table including the acquired service processing capability level.
The method according to claim 9, wherein before the starting a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, the method further comprises:

acquiring a level of the real-time requirement of the service processed in the server, where the service is based on a task run by at least one hardware module in the server; and searching a third relationship table for the corresponding predetermined duration of the timer according to the level, the third relationship table storing at least one level and a predetermined duration of the timer corresponding to each of the levels, and the at least one level in the third relationship table including the acquired level;

or,

acquiring a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; and searching a fourth relationship table for the corresponding predetermined duration of the timer according to the service processing capability level, the fourth relationship table storing at least one service processing capability level and a predetermined duration of the timer corresponding to each of the service processing capability levels, and the at least one service processing capability level in the fourth relationship table including the acquired service processing capability level.
The method according to claim 9, wherein the switching the correctable fault interrupt of the hardware module from the enabled state to the disabled state comprises:

setting an identification value in a correctable fault interrupt enable register corresponding to the hardware module to a disable value;

the switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state comprises:

setting the identification value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.