WO2016082523A1 - Apparatus and method for handling fault - Google Patents

Info

Publication number
WO2016082523A1
Authority
WO
WIPO (PCT)
Prior art keywords
hardware module
interrupt
level
fault
correctable
Prior art date
Application number
PCT/CN2015/081355
Other languages
French (fr)
Chinese (zh)
Inventor
宋刚
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2016082523A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a fault processing apparatus and method.
  • a correctable fault is a common hardware failure that occurs when the server is running.
  • when a correctable fault occurs, the hardware module generates a Corrected Machine-Check Error Interrupt (CMCI) for the correctable fault and notifies the operating system to enter the interrupt handler to process the correctable fault interrupt.
  • the operating system determines the hardware module based on the correctable fault interrupt and performs the corresponding fault handling. When the correctable fault occurs in memory, the interrupt handler in the operating system processes the correctable fault interrupt in the following steps:
  • the interrupt handler collects the fault data corresponding to the correctable fault;
  • the interrupt handler translates the fault physical address in the collected fault data into the corresponding fault logical address under the operating system;
  • the interrupt handler counts the number of correctable faults on the memory page to which the fault logical address belongs;
  • the interrupt handler performs a fault handling operation on the correctable fault.
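  • The four prior-art steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 4 KiB page size, the record layout, the fixed-offset address translation, and the names `handle_memory_cmci` and `translate` are all assumptions made for the example.

```python
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4 KiB memory pages


def translate(phys_addr):
    """Hypothetical physical-to-logical translation (fixed offset, for illustration)."""
    return phys_addr + 0x1000_0000


def handle_memory_cmci(fault_record, phys_to_logical, page_counts):
    """Run the four steps for one correctable memory fault."""
    # Step 1: collect the fault data (here only the faulting physical address).
    phys_addr = fault_record["phys_addr"]
    # Step 2: translate the fault physical address into a fault logical address.
    logical_addr = phys_to_logical(phys_addr)
    # Step 3: count correctable faults on the page the logical address belongs to.
    page = logical_addr >> PAGE_SHIFT
    page_counts[page] += 1
    # Step 4: fault handling operation (placeholder: report the page and its count).
    return page, page_counts[page]


counts = Counter()
for pa in (0x2000, 0x2010, 0x3008):
    handle_memory_cmci({"phys_addr": pa}, translate, counts)
```

  Because every interrupt runs all four steps, the cost of handling grows in direct proportion to the number of correctable faults.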
  • the prior art has at least the following problem: when a hardware module generates a large number of correctable faults in a short time, that is, when a correctable fault storm occurs, the hardware module generates a large number of correctable fault interrupts and notifies the operating system to enter the interrupt handler for each of them. The operating system must perform the above fault handling for every correctable fault, so it remains in a continuous fault handling state that occupies a large amount of its processing resources and may even prevent it from operating normally.
  • Embodiments provide a fault processing apparatus and method. The technical solution is as follows:
  • a fault processing apparatus for use in a server including at least one hardware module, the apparatus comprising:
  • a statistics module configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • a detecting module configured to detect whether the frequency is greater than a disable threshold;
  • a first switching module configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when it is detected that the frequency is greater than the disable threshold.
  • the statistics module includes:
  • a reading module configured to read, through an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, where the interrupt handler is the interrupt handler used to process correctable faults and the MCE memory is the MCE memory corresponding to the hardware module;
  • a calculating module configured to calculate, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
  • the detecting module is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
  • the device further includes:
  • a startup module configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state
  • a second switching module configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
  • the device further includes:
  • a first search module configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding disable threshold in a first relationship table according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;
  • a second search module configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding disable threshold in a second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.
  • the device further includes:
  • a third search module configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding timer predetermined duration in a third relationship table according to the level, where the third relationship table stores at least one level and a timer predetermined duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;
  • a fourth search module configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding timer predetermined duration in a fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a timer predetermined duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.
  • the first switching module is configured to set the identification value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
  • the second switching module is configured to set the identification value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
  • a fault processing method for a server including at least one hardware module, the method comprising:
  • the correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs
  • the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.
  • the counting of the frequency at which the hardware module in the server generates correctable fault interrupts within a predetermined time period includes:
  • the MCE memory is the MCE memory corresponding to the hardware module
  • the detecting whether the frequency is greater than a disable threshold includes:
  • the method further includes:
  • the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state.
  • before the detecting whether the frequency is greater than a disable threshold, the method further includes:
  • when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, before the timer is started, the method further includes:
  • obtaining a service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module; searching for the corresponding timer predetermined duration in the fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a timer predetermined duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.
  • the switching of the correctable fault interrupt of the hardware module from the enabled state to the disabled state includes:
  • the switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state includes:
  • the identification value in the correctable fault interrupt enable register corresponding to the hardware module is set to an enable value.
  • the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period is counted; whether the frequency is greater than the disable threshold is detected; and when the frequency is detected to be greater than the disable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state. This solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state, which occupies a large amount of its processing resources and can even prevent it from operating normally.
  • when a hardware module generates a large number of correctable faults in a short time, this reduces the number of correctable fault interrupts generated, allows the operating system to operate normally, and improves the operating efficiency of the operating system.
  • FIG. 1 is a block diagram showing the structure of a fault processing apparatus according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing the structure of a fault processing apparatus according to another embodiment of the present invention.
  • FIG. 3A is a block diagram of a fault processing apparatus according to an embodiment of the present invention.
  • FIG. 3B is a block diagram of a fault processing apparatus according to another embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for processing a fault according to an embodiment of the present invention.
  • FIG. 5A is a flowchart of a method for processing a fault according to another embodiment of the present invention.
  • FIG. 5B is a schematic diagram of an implementation of a fault processing method according to another embodiment of the present invention.
  • FIG. 6 is a flowchart of a method for processing a fault according to still another embodiment of the present invention.
  • Disabled state: a state in which a hardware module cannot generate a correctable fault interrupt based on a correctable fault, that is, a state in which the operating system cannot receive correctable fault interrupts generated by the hardware module.
  • the mechanisms by which the hardware modules generate correctable fault interrupts are usually independent of one another.
  • Enabled state: a state in which a hardware module can generate a correctable fault interrupt based on a correctable fault, that is, a state in which the operating system can receive correctable fault interrupts generated by the hardware module.
  • Positive correlation: the two variables change in the same direction. When one variable increases, the other also increases; when one variable decreases, the other also decreases. The relationship between the two may be linear or nonlinear.
  • Negative correlation: the two variables change in opposite directions. When one variable increases, the other decreases; when one variable decreases, the other increases. The relationship between the two may be linear or nonlinear.
  • Correctable fault interrupt enable register: a register whose identification value is set in order to switch the correctable fault interrupt of a hardware module between the enabled state and the disabled state. Each hardware module has its own correctable fault interrupt enable register.
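  • As a rough illustration of the enable register definition above, switching between the two states amounts to setting or clearing one flag in a per-module register value. The text does not name a concrete register or bit; the sketch below assumes bit 30, which follows the position of the CMCI_EN flag in Intel's IA32_MCi_CTL2 registers, and operates on a plain integer rather than real hardware. The function names are hypothetical.

```python
CMCI_EN_BIT = 30  # assumption: flag position, modelled on CMCI_EN in IA32_MCi_CTL2


def set_cmci_enabled(reg_value, enabled):
    """Return the register value with the enable flag set or cleared."""
    mask = 1 << CMCI_EN_BIT
    if enabled:
        return reg_value | mask   # write the enable value
    return reg_value & ~mask      # write the disable value, keep other bits


def is_cmci_enabled(reg_value):
    """Report whether the register value is in the enabled state."""
    return bool(reg_value & (1 << CMCI_EN_BIT))


# One simulated enable register per hardware module (names are hypothetical).
enable_registers = {"module_a": 1 << CMCI_EN_BIT, "module_b": 1 << CMCI_EN_BIT}
enable_registers["module_a"] = set_cmci_enabled(enable_registers["module_a"], False)
```

  Keeping one register value per module mirrors the statement that each hardware module has its own enable register, so disabling module A leaves module B untouched.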
  • FIG. 1 is a structural block diagram of a fault processing apparatus according to an embodiment of the present invention.
  • the fault processing apparatus includes:
  • the statistics module 110 is configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • the detecting module 120 is configured to detect whether the frequency is greater than a disable threshold
  • the first switching module 130 is configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the detected frequency is greater than the disable threshold.
  • the fault processing apparatus counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • this solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state that occupies its processing resources.
  • FIG. 2 is a structural block diagram of a fault processing apparatus according to another embodiment of the present invention.
  • the fault processing apparatus includes:
  • the statistics module 210 is configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • the detecting module 220 is configured to detect whether the frequency is greater than a disable threshold
  • the first switching module 230 is configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the detected frequency is greater than the disable threshold.
  • the statistics module 210 includes:
  • the reading module 211 is configured to read, through an interrupt handler, the number of correctable fault interrupts generated by the hardware module within the predetermined time period from a machine check exception (MCE) memory, where the interrupt handler is the interrupt handler used to process correctable faults and the MCE memory is the MCE memory corresponding to the hardware module;
  • the calculating module 212 is configured to calculate, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
  • the detecting module 220 is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
  • the device further includes:
  • the startup module 240 is configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;
  • the second switching module 250 is configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
  • the device further includes:
  • the first search module 260 is configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding disable threshold in the first relationship table according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;
  • the second search module 270 is configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding disable threshold in the second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.
  • the device further includes:
  • the third search module 280 is configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding timer predetermined duration in the third relationship table according to the level, where the third relationship table stores at least one level and a timer predetermined duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;
  • the fourth search module 290 is configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding timer predetermined duration in the fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a timer predetermined duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.
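  • The four relationship tables used by the search modules can be pictured as simple key-to-value mappings. The sketch below is only an illustration: the level names, thresholds, and durations are invented for the example, their ordering is arbitrary and implies nothing about the direction of any correlation, and `look_up` is a hypothetical helper.

```python
# All keys and values below are illustrative assumptions, not values from the source.
FIRST_TABLE = {"high": 5.0, "medium": 10.0, "low": 20.0}    # real-time level -> disable threshold (interrupts/s)
SECOND_TABLE = {1: 5.0, 2: 10.0, 3: 20.0}                   # capability level -> disable threshold (interrupts/s)
THIRD_TABLE = {"high": 30.0, "medium": 60.0, "low": 120.0}  # real-time level -> timer duration (s)
FOURTH_TABLE = {1: 30.0, 2: 60.0, 3: 120.0}                 # capability level -> timer duration (s)


def look_up(table, level):
    """Return the value stored for `level`.

    The text requires that the table contain the acquired level, so a
    missing level is treated as a configuration error.
    """
    try:
        return table[level]
    except KeyError:
        raise ValueError(f"level {level!r} is not in the relationship table") from None


disable_threshold = look_up(FIRST_TABLE, "medium")
timer_duration = look_up(FOURTH_TABLE, 2)
```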
  • the first switching module 230 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
  • the second switching module 250 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
  • the fault processing apparatus counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • this solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state that occupies its processing resources.
  • in addition, a timer is started when the correctable fault interrupt of the hardware module is disabled, and when the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state back to the enabled state. The enabled state is maintained while the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold, so that correctable fault interrupts generated after the fault storm ends are processed in a timely manner.
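  • The disable-then-re-enable behaviour described for this embodiment can be sketched as a small state machine. This is a simplified model under stated assumptions: time is passed in explicitly instead of using a real timer, the class and method names are hypothetical, and the threshold and duration values in the usage lines are arbitrary.

```python
class CmciGate:
    """Tracks whether one hardware module's correctable fault interrupt is enabled."""

    def __init__(self, disable_threshold, reenable_after):
        self.disable_threshold = disable_threshold  # interrupts per second
        self.reenable_after = reenable_after        # timer predetermined duration, seconds
        self.enabled = True
        self.disabled_at = None

    def on_frequency(self, frequency, now):
        """Called by the interrupt handler with the measured frequency."""
        if self.enabled and frequency > self.disable_threshold:
            self.enabled = False     # switch to the disabled state
            self.disabled_at = now   # start the timer

    def tick(self, now):
        """Called periodically; re-enables once the timer reaches its duration."""
        if not self.enabled and now - self.disabled_at >= self.reenable_after:
            self.enabled = True      # switch back to the enabled state
            self.disabled_at = None


gate = CmciGate(disable_threshold=1.0, reenable_after=60.0)
gate.on_frequency(2.0, now=0.0)   # storm detected: interrupt is disabled
gate.tick(now=30.0)               # timer still running: stays disabled
```

  If the storm is still in progress when the interrupt is re-enabled, the next frequency check disables it again, so the gate cycles until the fault rate drops.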
  • FIG. 3A shows a block diagram of a fault handling apparatus according to an embodiment of the present invention.
  • the fault processing apparatus may include a processor 310 and at least one hardware module 320, wherein the processor 310 and the at least one hardware module 320 are electrically connected.
  • This embodiment is described with at least one hardware module 320 including a hardware module 321 and a hardware module 322.
  • the processor 310 is configured to count the frequency at which a hardware module among the at least one hardware module 320 in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • the processor 310 is configured to detect whether the frequency is greater than a disable threshold
  • the processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state when the detected frequency is greater than the disable threshold.
  • the fault processing apparatus counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • this solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state that occupies its processing resources.
  • the fault processing apparatus may further include an MCE memory and a correctable fault interrupt enable register corresponding to each hardware module, and a memory for storing one or more programs, including the interrupt handler for processing correctable faults.
  • the embodiment of the present invention includes the hardware module 321 and the hardware module 322.
  • the fault processing apparatus 300 includes a processor 310, a hardware module 321, and a hardware module 322.
  • the processor 310 is electrically connected to the at least one hardware module 320, the memory 350, the MCE memory corresponding to each hardware module, and the correctable fault interrupt enable register.
  • the processor 310 is configured to count the frequency at which a hardware module among the at least one hardware module 320 in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • the processor 310 is configured to detect whether the frequency is greater than a disable threshold
  • the processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state when the detected frequency is greater than the disable threshold.
  • when counting the frequency at which the hardware module in the server generates correctable fault interrupts within a predetermined time period, the processor 310 is configured to read, through an interrupt handler, the number of correctable fault interrupts generated by the hardware module 320 within the predetermined time period from the MCE memory, where the interrupt handler is the interrupt handler used to process correctable faults and the MCE memory is the MCE memory corresponding to the hardware module 320;
  • the processor 310 is configured to calculate, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
  • the processor 310 is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
  • the processor 310 is configured to start a timer when the correctable fault interrupt of the hardware module 320 is switched from the enabled state to the disabled state;
  • the processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the disabled state to the enabled state when the timer reaches a predetermined duration.
  • when determining the disable threshold, the processor 310 is configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding disable threshold in the first relationship table according to the level, where the first relationship table stores at least one level and a disable threshold corresponding to each level, and the at least one level in the first relationship table includes the acquired level;
  • the processor 310 is configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding disable threshold in the second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and a disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the acquired service processing capability level.
  • when determining the predetermined duration of the timer, the processor 310 is configured to acquire the level of real-time requirement of the service processed in the server, where the service is a task run based on at least one hardware module in the server, and to look up the corresponding timer predetermined duration in the third relationship table according to the level, where the third relationship table stores at least one level and a timer predetermined duration corresponding to each level, and the at least one level in the third relationship table includes the acquired level;
  • the processor 310 is configured to acquire the service processing capability level of the server, where the service processing capability level is determined based on the at least one hardware module, and to look up the corresponding timer predetermined duration in the fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and a timer predetermined duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the acquired service processing capability level.
  • the processor 310 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module 320 to a disable value;
  • the processor 310 is configured to set an identifier value in the correctable fault interrupt enable register corresponding to the hardware module 320 to an enable value.
  • the fault processing apparatus counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • this solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state that occupies its processing resources.
  • in addition, a timer is started when the correctable fault interrupt of the hardware module is disabled, and when the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state back to the enabled state. The enabled state is maintained while the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold, so that correctable fault interrupts generated after the fault storm ends are processed in a timely manner.
  • FIG. 4 is a flowchart of a method for processing a fault according to an embodiment of the present invention.
  • the method is applicable to a server including at least one hardware module, and the fault processing method includes:
  • Step 402: Count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
  • CMCI: Corrected Machine-Check Error Interrupt
  • Step 404 detecting whether the frequency is greater than a disable threshold
  • Step 406 When it is detected that the frequency is greater than the disable threshold, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.
  • the fault processing method counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • this solves the problem that, when a hardware module generates a large number of correctable faults in a short time, the operating system is kept in a continuous fault handling state that occupies its processing resources.
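  • Steps 402 to 406 can be condensed into a few lines. The sketch below is a minimal illustration, assuming the interrupt count has already been read for the period; the dictionary standing in for a hardware module, the `cmci_enabled` key, and the function name are hypothetical.

```python
def process_fault(interrupt_count, period_seconds, disable_threshold, module):
    """Apply steps 402, 404 and 406 to one hardware module."""
    # Step 402: frequency of correctable fault interrupts in the period.
    frequency = interrupt_count / period_seconds
    # Step 404: detect whether the frequency is greater than the disable threshold.
    if frequency > disable_threshold:
        # Step 406: switch the interrupt from the enabled to the disabled state.
        module["cmci_enabled"] = False
    return frequency


module_a = {"name": "A", "cmci_enabled": True}
frequency_a = process_fault(10, 5.0, disable_threshold=1.0, module=module_a)
```

  The comparison is strict, so a frequency exactly equal to the threshold leaves the interrupt enabled.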
  • FIG. 5A is a flowchart of a method for processing a fault according to another embodiment of the present invention.
  • the method is applicable to a server including at least one hardware module, and the fault processing method includes:
  • Step 501: Count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs.
  • the server may be an X86 architecture device. Since most existing servers use the X86 architecture, this embodiment is described using an X86 architecture device as an example, but the present invention is not limited thereto.
  • a hardware module is a hardware processing device with a particular processing function in an X86 architecture device, and an X86 architecture device includes at least one hardware module.
  • each hardware module corresponds to its own MCE memory, which is used to store the correctable fault interrupt generated by the hardware module.
  • The interrupt handler can obtain the frequency of correctable fault interrupts by reading, from the MCE memory corresponding to the hardware module, the number of correctable fault interrupts generated within the predetermined time period. This step may include the following sub-steps:
  • The X86 architecture device reads, through an interrupt handler, the number of correctable fault interrupts generated by the hardware module within a predetermined time period from the MCE memory, where the interrupt handler is the interrupt handler for processing correctable faults and the MCE memory is the MCE memory corresponding to the hardware module.
  • When a correctable fault occurs in the hardware module, the hardware module generates a correctable fault interrupt according to the correctable fault and notifies the operating system to enter the interrupt handler to process it. The interrupt handler determines, according to the correctable fault interrupt, which hardware module is faulty and reads, from the MCE memory corresponding to that hardware module, the number of correctable fault interrupts generated by the hardware module within a predetermined time period. The predetermined time period is preset by the operating system and may be 5 seconds.
  • For example, the interrupt handler receives a correctable fault interrupt notification, determines that the hardware module in which the correctable fault occurred is hardware module A, and reads from MCE memory A, which corresponds to hardware module A, that the number of correctable fault interrupts generated in the last 5 seconds is 10.
  • The X86 architecture device calculates the frequency, through the interrupt handler, from the predetermined time period and the number of correctable fault interrupts.
  • The interrupt handler calculates the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period from the number of correctable fault interrupts read for that period and the length of the period.
  • For example, if the number of correctable fault interrupts read by the interrupt handler for the predetermined time period is 10 and the predetermined time period is 5 seconds, the calculated frequency at which the hardware module generates correctable fault interrupts is 10 times/5 seconds.
  • The interrupt handler needs to count the frequency of correctable fault interrupts separately for each hardware module. This embodiment describes only the case in which the interrupt handler counts the frequency for one hardware module, which does not limit the invention.
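  • As an illustration of the per-module counting described above, the following Python sketch records correctable fault interrupts for each hardware module and derives the frequency over the predetermined time period. The class name `InterruptCounter`, its methods, and the use of in-memory timestamps in place of the MCE memory are assumptions made for this example; the source does not specify a data structure.

```python
from collections import defaultdict, deque


class InterruptCounter:
    """Hypothetical per-module counter of correctable fault interrupts."""

    def __init__(self, period_s=5.0):
        # Predetermined time period; the text gives 5 seconds as an example.
        self.period_s = period_s
        # One timestamp queue per hardware module, so modules are counted separately.
        self._events = defaultdict(deque)

    def record(self, module, now):
        """Record one correctable fault interrupt from `module` at time `now`."""
        self._events[module].append(now)

    def count(self, module, now):
        """Number of interrupts from `module` in the window (now - period, now]."""
        events = self._events[module]
        while events and events[0] <= now - self.period_s:
            events.popleft()
        return len(events)

    def frequency(self, module, now):
        """Interrupts per second over the predetermined time period."""
        return self.count(module, now) / self.period_s


counter = InterruptCounter(period_s=5.0)
for i in range(10):
    counter.record("A", 0.5 * i)  # ten interrupts from module A within 5 seconds
print(counter.count("A", 4.5), "times /", counter.period_s, "seconds")
```

  • With the example values from the text (10 interrupts in 5 seconds), the sketch reports a count of 10 for hardware module A, while a module with no recorded interrupts reports 0.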
  • Step 502: Detect whether the frequency is greater than a disable threshold.
  • The X86 architecture device detects, through the interrupt handler, whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, it can be determined that a correctable fault storm has occurred in the hardware module; when the frequency is less than the disable threshold, it can be determined that no correctable fault storm has occurred in the hardware module.
  • The disable threshold can be set in advance, or set in real time according to the real-time requirements of the service processed by the X86 architecture device or according to the service processing capability of the X86 architecture device. Setting the disable threshold can be implemented in the following two possible ways:
  • In one possible way, the X86 architecture device acquires the level of the real-time requirement of the service processed in the X86 architecture device, where the service is a task that runs on at least one hardware module in the X86 architecture device, and searches the first relationship table for the disable threshold corresponding to that level. The first relationship table stores at least one level and the disable threshold corresponding to each level, and the levels in the first relationship table include the acquired level.
  • When the service processed by the X86 architecture device has high real-time requirements, the disable threshold can be set smaller so that the operating system can process the current service in time; when the service processed by the X86 architecture device has low real-time requirements, the disable threshold can be set larger.
  • The first relationship table in the operating system pre-stores the correspondence between each level of the service's real-time requirement and the corresponding disable threshold. Each level is negatively correlated with its disable threshold: the higher the level of the real-time requirement, the smaller the corresponding disable threshold, and the lower the level, the larger the corresponding disable threshold.
  • The table structure of the first relationship table can be exemplarily shown in Table 1:
  • Table 1
    Level of the service's real-time requirement | Disable threshold
    1 | 10 times/5 seconds
    2 | 8 times/5 seconds
    3 | 5 times/5 seconds
  • The operating system obtains the level of the real-time requirement of the service processed by the X86 architecture device, searches the first relationship table for the corresponding disable threshold, and uses it as the disable threshold for this service.
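  • The lookup in the first relationship table can be sketched in Python as follows. The table values are those given in Table 1; the names `FIRST_RELATION_TABLE`, `disable_threshold_for_level`, and `exceeds_threshold`, and the choice to store a threshold as a (count, period) pair, are assumptions made for this example.

```python
# Table 1 from the text: level of the service's real-time requirement -> disable
# threshold, stored here as (interrupt count, period in seconds).
FIRST_RELATION_TABLE = {
    1: (10, 5),
    2: (8, 5),
    3: (5, 5),
}


def disable_threshold_for_level(level):
    """Look up the disable threshold for a real-time requirement level."""
    try:
        return FIRST_RELATION_TABLE[level]
    except KeyError:
        raise ValueError(f"level {level!r} is not in the first relationship table") from None


def exceeds_threshold(count, period_s, threshold):
    """Return True if count/period_s is strictly greater than the threshold rate."""
    threshold_count, threshold_period_s = threshold
    # Cross-multiply so that rates measured over different periods compare exactly.
    return count * threshold_period_s > threshold_count * period_s


threshold = disable_threshold_for_level(3)
print(threshold, exceeds_threshold(10, 5, threshold))
```

  • Storing the count and the period together keeps the "times/5 seconds" unit of the table explicit instead of collapsing it into a floating-point rate.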
  • In another possible way, the X86 architecture device obtains its service processing capability level, where the service processing capability level is determined based on the at least one hardware module, and searches the second relationship table for the disable threshold corresponding to that service processing capability level. The second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the service processing capability levels in the second relationship table include the acquired service processing capability level.
  • When the service processing capability of X86 architecture devices differs, the processing resources and time that the operating system spends in the interrupt handler for fault handling also differ. Therefore, the operating system can set the disable threshold according to the service processing capability of the X86 architecture device.
  • The second relationship table in the operating system pre-stores the correspondence between the service processing capability levels of the X86 architecture device and the disable thresholds.
  • the corresponding disable threshold is also smaller.
  • Table 2 The table structure of the second relational table can be exemplarily shown in Table 2:
  • the capability level can be divided according to the hardware score of the X86 architecture device.
  • The operating system obtains the service processing capability level of the X86 architecture device, searches the second relationship table for the corresponding disable threshold, and uses it as the disable threshold of the X86 architecture device.
  • The operating system may also set the disable threshold by considering both the level of the service's real-time requirement and the service processing capability level of the X86 architecture device; this does not limit the present invention.
  • Step 503: When it is detected that the frequency is greater than the disable threshold, switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • When it is detected that the frequency is greater than the disable threshold, the X86 architecture device knows that a correctable fault storm has occurred in the hardware module; a correctable fault storm means that the hardware module will generate a large number of correctable fault interrupts in a short time. To prevent the operating system from staying in a continuous fault handling state during a correctable fault storm and therefore failing to operate normally, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
  • The identifier value in the correctable fault interrupt enable register corresponding to the hardware module is an enable value, that is, the correctable fault interrupt of the hardware module is enabled. When it is detected that a correctable fault storm has occurred in the hardware module, the interrupt handler sets the identifier value in the correctable fault interrupt enable register corresponding to the hardware module to the disable value, that is, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state. When the correctable fault interrupt is disabled, the hardware module cannot generate correctable fault interrupts based on correctable faults, and the operating system does not frequently enter the interrupt handler for fault handling.
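  • The enable and disable values of the register can be modeled with a single bit, as in the Python sketch below. The source does not name a register layout; the bit position is borrowed, only for illustration, from the CMCI_EN bit (bit 30) of Intel's IA32_MCi_CTL2 machine-check register, and the class `EnableRegister` is a hypothetical software model rather than real register access.

```python
# Illustrative bit position, modeled on CMCI_EN (bit 30) of IA32_MCi_CTL2.
# The source only says "identifier value"; this layout is an assumption.
CMCI_EN_BIT = 1 << 30


class EnableRegister:
    """Hypothetical software model of one module's interrupt enable register."""

    def __init__(self, value=CMCI_EN_BIT):
        # The correctable fault interrupt starts in the enabled state.
        self.value = value

    def disable(self):
        """Write the disable value: clear the enable bit, keep all other bits."""
        self.value &= ~CMCI_EN_BIT

    def enable(self):
        """Write the enable value: set the enable bit, keep all other bits."""
        self.value |= CMCI_EN_BIT

    @property
    def enabled(self):
        return bool(self.value & CMCI_EN_BIT)


reg = EnableRegister()
reg.disable()   # storm detected: enabled state -> disabled state
print(reg.enabled)
reg.enable()    # timer reached its predetermined duration
print(reg.enabled)
```

  • Clearing and setting only the one bit leaves any other fields of the modeled register untouched, which is why the sketch uses masks instead of assigning whole values.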
  • Step 504: Start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state.
  • When the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state, it also starts the preset timer. Until the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module remains disabled, and the operating system does not enter the interrupt handler for fault handling.
  • There is no strict order between step 503 and step 504, and the two can be executed at the same time. This embodiment is described with step 503 before step 504 only as an example, which does not limit the present invention.
  • Step 505: When the timer reaches the predetermined duration, switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
  • While the correctable fault interrupt is disabled, the operating system cannot receive the correctable fault interrupt and handle the fault. When the timer reaches the predetermined duration, the interrupt handler sets the identifier value in the correctable fault interrupt enable register corresponding to the hardware module to the enable value, that is, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state. At this time, the hardware module can again generate correctable fault interrupts based on correctable faults.
  • the steps of the interrupt handler for fault processing are similar to those of the prior art, and are not described herein again.
  • The predetermined duration of the timer can be set in advance, or set in real time according to the real-time requirements of the service processed by the X86 architecture device or according to the service processing capability of the X86 architecture device. Setting the predetermined duration of the timer can be implemented in the following two possible ways:
  • In one possible way, the X86 architecture device acquires the level of the real-time requirement of the service processed in the X86 architecture device, where the service is a task that runs on at least one hardware module in the X86 architecture device, and searches the third relationship table for the predetermined timer duration corresponding to that level. The third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the levels in the third relationship table include the acquired level.
  • The third relationship table in the operating system pre-stores the correspondence between each level of the service's real-time requirement and the predetermined duration of the timer. Each level is positively correlated with the corresponding predetermined duration: the higher the level of the real-time requirement, the longer the predetermined duration of the timer, and the lower the level, the shorter the predetermined duration of the timer.
  • The table structure of the third relationship table can be exemplarily shown in Table 3:
  • Table 3
    Level of the service's real-time requirement | Timer predetermined duration
    1 | 100 seconds
    2 | 120 seconds
    3 | 150 seconds
  • the operating system acquires the level of real-time requirements of the services processed in the X86 architecture device, searches for a predetermined timer duration in the third relationship table, and sets a predetermined duration of the current timer.
  • In another possible way, the X86 architecture device obtains its service processing capability level, where the service processing capability level is determined based on the at least one hardware module, and searches the fourth relationship table for the predetermined timer duration corresponding to that service processing capability level. The fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the service processing capability levels in the fourth relationship table include the acquired service processing capability level.
  • The fourth relationship table in the operating system pre-stores the correspondence between the service processing capability levels of the X86 architecture device and the predetermined duration of the timer. Each service processing capability level is negatively correlated with the predetermined timer duration: the higher the service processing capability level, the shorter the predetermined timer duration, and the lower the service processing capability level, the longer the predetermined timer duration.
  • The table structure of the fourth relationship table can be exemplarily shown in Table 4:
  • Table 4
    Service processing capability level | Timer predetermined duration
    1 | 150 seconds
    2 | 120 seconds
    3 | 100 seconds
  • The capability level can be divided according to the hardware score of the X86 architecture device.
  • the operating system acquires the service processing capability level of the X86 architecture device, searches for the corresponding timer predetermined duration in the fourth relationship table, and sets the current timer predetermined duration.
  • The operating system may also set the predetermined duration of the timer by considering both the level of the service's real-time requirement and the service processing capability level of the X86 architecture device; this does not limit the present invention.
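  • The two timer lookups can be sketched in Python as follows, using the values given in Table 3 and Table 4. The dictionary and function names are hypothetical, and reading the first column of Table 4 as the service processing capability level is an assumption based on the surrounding text. How the two levels would be combined is not specified in the source, so the sketch keeps the lookups separate.

```python
# Table 3: level of the service's real-time requirement -> timer duration (seconds).
THIRD_RELATION_TABLE = {1: 100, 2: 120, 3: 150}

# Table 4: service processing capability level -> timer duration (seconds).
FOURTH_RELATION_TABLE = {1: 150, 2: 120, 3: 100}


def timer_duration_by_realtime_level(level):
    """Positive correlation: a higher level gives a longer predetermined duration."""
    return THIRD_RELATION_TABLE[level]


def timer_duration_by_capability_level(level):
    """Negative correlation: a higher level gives a shorter predetermined duration."""
    return FOURTH_RELATION_TABLE[level]


print(timer_duration_by_realtime_level(3), timer_duration_by_capability_level(3))
```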
  • When the service processed in the X86 architecture device has higher real-time requirements, or the service processing capability of the X86 architecture device is weaker, the corresponding predetermined duration of the timer is longer, which ensures that the operating system processes the service in time. It should be noted that when the timer reaches the predetermined duration, the timer is reset. In addition, so that the operating system can know an estimate of the number of correctable faults that occurred in the hardware module during the correctable fault storm, the timer calculates an estimate of the number of correctable faults that may have occurred during the storm. The estimate may be the product of the predetermined duration set for the timer and the frequency, obtained in step 501, at which the hardware module generates correctable fault interrupts.
  • For example, if the predetermined duration of the timer is 100 seconds and the calculated frequency at which the hardware module generates correctable fault interrupts is 10 times/5 seconds, the estimated number of correctable faults occurring during the correctable fault storm is 200. This estimate is mainly used for statistics on the number of correctable faults.
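  • The estimate described above is a single multiplication. The Python sketch below reproduces the worked example from the text; the function name `estimated_fault_count` is hypothetical, and exact fractions are used here only to avoid floating-point rounding.

```python
from fractions import Fraction


def estimated_fault_count(timer_s, count, period_s):
    """Estimate = predetermined timer duration x interrupt frequency (count/period)."""
    return Fraction(count, period_s) * timer_s


# Worked example from the text: 100 seconds x (10 times / 5 seconds).
print(estimated_fault_count(100, 10, 5))
```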
  • Step 506: Detect again whether the frequency at which the hardware module generates correctable fault interrupts is less than an enable threshold.
  • After the interrupt handler switches the correctable fault interrupt of the hardware module from the disabled state to the enabled state, it again counts the correctable fault interrupts received within a predetermined time period and calculates the frequency of correctable fault interrupts generated within that period.
  • The interrupt handler detects whether the calculated frequency is less than a preset enable threshold. The enable threshold is a preset threshold used to detect whether the correctable fault storm has ended, and may be 1 time/5 seconds.
  • Step 507: When it is detected that the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold, keep the correctable fault interrupt of the hardware module in the enabled state.
  • When the frequency is less than the enable threshold, the interrupt handler can determine that the correctable fault storm has ended and that subsequent correctable fault interrupts generated by the hardware module will not keep the operating system in a continuous fault handling state, that is, the operating system can run normally. Correspondingly, the correctable fault interrupt of the hardware module remains enabled.
  • To prevent another correctable fault storm from keeping the operating system in continuous fault handling, the interrupt handler continues to detect whether the frequency at which the hardware module generates correctable fault interrupts within a predetermined time period is greater than the disable threshold. When it is, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state and restarts the timer.
  • Step 508: When it is detected that the frequency at which the hardware module generates correctable fault interrupts is greater than the enable threshold, switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state and restart the timer.
  • When it is detected that the frequency of correctable fault interrupts generated by the hardware module is greater than the enable threshold, the interrupt handler considers that the correctable fault storm has not ended, switches the correctable fault interrupt of the hardware module from the enabled state back to the disabled state, and restarts the timer.
  • FIG. 5B shows a schematic implementation of the fault processing method provided by this embodiment.
  • During the T1 time period, the interrupt handler detects whether the frequency of correctable fault interrupts generated by the hardware module within the predetermined time period is greater than the disable threshold; when it detects that the frequency is greater than the disable threshold, it switches the correctable fault interrupt of the hardware module to the disabled state and starts the timer. During the predetermined duration T2 set by the timer, the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the interrupt handler switches the correctable fault interrupt of the hardware module to the enabled state.
  • When the frequency detected after re-enabling is greater than the enable threshold, the interrupt handler switches the correctable fault interrupt of the hardware module to the disabled state and restarts the timer. During the predetermined duration T4 set by the timer, the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the interrupt handler again switches the correctable fault interrupt of the hardware module to the enabled state and detects whether the frequency at which the hardware module generates correctable fault interrupts during the T5 time period is less than the enable threshold; when the frequency is detected to be less than the enable threshold, the correctable fault interrupt of the hardware module remains enabled.
  • In summary, the fault processing method provided by this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This addresses the problem that, when a large number of correctable faults occur in the hardware module in a short time, the operating system stays in a continuous fault handling state, which occupies a large amount of the operating system's processing resources.
  • A timer is also started when the correctable fault interrupt of the hardware module is disabled. When the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state, and the enabled state is maintained when the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold, so that correctable fault interrupts generated after the correctable fault storm ends can be processed in time.
  • The correctable fault interrupt in FIG. 5A refers to the CMCI, and the interrupt handler refers to the interrupt handler in the operating system.
  • The basic input/output system (BIOS) can also be used to convert the correctable fault interrupt generated when a correctable fault occurs into a system management interrupt (SMI), and the system management interrupt is then processed by the system management interrupt handler in the basic input/output system.
  • FIG. 6 is a flowchart of a method for processing a fault according to still another embodiment of the present invention.
  • the method comprises:
  • Step 601: Convert the correctable fault interrupt generated by the hardware module in the server into a system management interrupt.
  • The server may be an X86-based device. Because most existing servers use the X86 architecture, this embodiment is described using an X86 architecture device as an example; the present invention is not limited thereto.
  • When the operating system starts initialization, the basic input/output system is configured so that, when the hardware module generates a correctable fault interrupt, the correctable fault interrupt is converted into a system management interrupt. Correspondingly, the hardware module notifies the basic input/output system to enter the system management interrupt handler to process the system management interrupt.
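  • The routing decision in step 601 can be pictured with the short Python sketch below. It only models which handler receives the interrupt; the function name `dispatch`, the string labels, and the `smi_conversion_enabled` flag are hypothetical stand-ins for the BIOS setting described above.

```python
def dispatch(interrupt, smi_conversion_enabled):
    """Return (delivered interrupt, handler) for a correctable fault interrupt."""
    if interrupt != "CMCI":
        raise ValueError(f"unsupported interrupt: {interrupt!r}")
    if smi_conversion_enabled:
        # BIOS is configured to convert the correctable fault interrupt into an SMI.
        return ("SMI", "bios_smi_handler")
    # Without conversion, the operating system's interrupt handler processes it.
    return ("CMCI", "os_interrupt_handler")


print(dispatch("CMCI", smi_conversion_enabled=True))
print(dispatch("CMCI", smi_conversion_enabled=False))
```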
  • Step 602: Count the frequency at which the hardware module in the server generates system management interrupts within a predetermined time period.
  • When a correctable fault occurs, because the correctable fault interrupt generated by the hardware module is converted into a system management interrupt, the system management interrupt handler counts the system management interrupts generated within the predetermined time period and calculates the frequency of system management interrupts generated within that period. It should be noted that, because multiple hardware modules may generate correctable faults at the same time while the device is running, the system management interrupt handler needs to count the frequency of system management interrupts separately for each hardware module. This embodiment describes only the case in which the system management interrupt handler counts the frequency for one hardware module, which does not limit the invention.
  • Step 603: Detect whether the frequency is greater than a disable threshold.
  • The system management interrupt handler detects whether the frequency of system management interrupts generated by the hardware module within the predetermined time period is greater than the disable threshold. Because the system management interrupts are obtained by converting correctable fault interrupts, when the frequency is greater than the disable threshold, the system management interrupt handler can determine that a correctable fault storm has occurred in the hardware module; when the frequency is less than the disable threshold, the system management interrupt handler can determine that no correctable fault storm has occurred.
  • The disable threshold is a preset threshold for detecting whether a correctable fault storm occurs, and may be 10 times/5 seconds.
  • The method for setting the disable threshold is similar to the method for setting the disable threshold in step 502, and details are not described herein again.
  • Step 604: When it is detected that the frequency is greater than the disable threshold, switch the system management interrupt of the hardware module from the enabled state to the disabled state.
  • When the system management interrupt handler detects that the frequency is greater than the disable threshold, it knows that a correctable fault storm has occurred in the hardware module. A correctable fault storm means that the hardware module will generate a large number of correctable fault interrupts in a short period of time. The system management interrupt handler therefore switches the system management interrupt of the hardware module from the enabled state to the disabled state.
  • The identifier value in the system management interrupt enable register corresponding to the hardware module is an enable value, that is, the system management interrupt of the hardware module is enabled. When the system management interrupt handler detects that a correctable fault storm has occurred in the hardware module, it sets the identifier value in the system management interrupt enable register corresponding to the hardware module to the disable value, that is, the system management interrupt of the hardware module is switched from the enabled state to the disabled state. When the system management interrupt is disabled, the hardware module cannot generate system management interrupts.
  • Step 605: Start a timer when the system management interrupt of the hardware module is switched from the enabled state to the disabled state.
  • The system management interrupt handler also starts the preset timer when it switches the system management interrupt of the hardware module from the enabled state to the disabled state.
  • Step 606: When the timer reaches the predetermined duration, switch the system management interrupt of the hardware module from the disabled state to the enabled state.
  • While the system management interrupt is disabled, the basic input/output system cannot receive the system management interrupt and process it. When the timer reaches the predetermined duration, the system management interrupt handler sets the identifier value in the system management interrupt enable register corresponding to the hardware module to the enable value, that is, the system management interrupt of the hardware module is switched from the disabled state to the enabled state. At this time, the hardware module can notify the basic input/output system to enter the system management interrupt handler for processing.
  • the method for setting the predetermined duration of the timer is similar to the method for setting the predetermined duration of the timer in step 505, and details are not described herein again.
  • Step 607: Detect again whether the frequency at which the hardware module generates system management interrupts is less than the enable threshold.
  • After the system management interrupt handler switches the system management interrupt of the hardware module from the disabled state to the enabled state, it again counts the system management interrupts received within a predetermined time period and calculates the frequency of system management interrupts generated within that period.
  • The system management interrupt handler detects whether the calculated frequency is less than a preset enable threshold. The enable threshold is a preset threshold for detecting whether the correctable fault storm has ended, and may be 1 time/5 seconds.
  • Step 608: When it is detected that the frequency at which the hardware module generates system management interrupts is less than the enable threshold, keep the system management interrupt of the hardware module enabled.
  • When the frequency is less than the enable threshold, the system management interrupt handler can determine that the correctable fault storm has ended. Subsequent correctable fault interrupts generated by the hardware module will be converted into system management interrupts and processed by the system management interrupt handler. Correspondingly, the system management interrupt of the hardware module remains enabled.
  • The system management interrupt handler continues to detect the frequency at which system management interrupts are generated, and switches the system management interrupt from the enabled state to the disabled state when the frequency is greater than the disable threshold.
  • Step 609: When it is detected that the frequency of system management interrupts generated by the hardware module is greater than the enable threshold, switch the system management interrupt of the hardware module from the enabled state to the disabled state and restart the timer.
  • When it is detected that the frequency of system management interrupts is greater than the enable threshold, the system management interrupt handler considers that the correctable fault storm has not ended, switches the system management interrupt of the hardware module from the enabled state back to the disabled state, and restarts the timer.
  • In summary, the fault processing method provided by this embodiment converts the correctable fault interrupt generated by the hardware module in the server into a system management interrupt, counts the frequency at which the hardware module in the server generates system management interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is greater than the disable threshold, switches the system management interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a large number of correctable faults occur in the hardware module in a short time, the operating system stays in a continuous fault handling state, which occupies a large amount of the operating system's processing resources and can even prevent the operating system from operating normally. It achieves the effect of reducing the correctable fault interrupts generated when a large number of correctable faults occur in the hardware module in a short time, so that the operating system can run normally and its operating efficiency improves.
  • The correctable fault interrupt generated by the hardware module is also converted into a system management interrupt by the basic input/output system and processed by the system management interrupt handler of the basic input/output system, which further reduces the pressure on the operating system and achieves the effect of ensuring stable operation of the operating system.
  • a person skilled in the art may understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Disclosed are an apparatus and method for handling a fault, which belong to the technical field of computers. The method comprises: counting the frequency at which a hardware module in a server generates correctable fault interrupts within a predetermined time period; detecting whether the frequency is greater than a disable threshold; and, when the frequency is detected to be greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state that occupies a large amount of its processing resources and may prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.

Description

Fault handling apparatus and method
This application claims priority to Chinese Patent Application No. 201410712709.4, filed with the Chinese Patent Office on November 28, 2014 and entitled "Fault handling apparatus and method", which is incorporated herein by reference in its entirety.
Technical field
The present invention relates to the field of computer technologies, and in particular, to a fault handling apparatus and method.
Background
A correctable fault is a common type of hardware fault that occurs while a server is running.
When a correctable fault occurs, the hardware module generates a correctable fault interrupt (Corrected Machine-Check Error Interrupt, CMCI) according to the correctable fault and notifies the operating system to enter an interrupt handler to process the correctable fault interrupt. The operating system identifies the hardware module from the correctable fault interrupt and performs the corresponding fault handling. Taking a correctable fault that occurs in memory as an example, the interrupt handler in the operating system processes the correctable fault interrupt as follows:
1. The interrupt handler collects the fault data corresponding to the correctable fault;
2. The interrupt handler translates the faulty physical address in the collected fault data into the corresponding faulty logical address under the operating system;
3. The interrupt handler updates the count of correctable faults for the memory page to which the faulty logical address belongs;
4. The interrupt handler performs a fault handling operation for the correctable fault.
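The four steps above can be sketched as follows. This is a minimal illustration rather than the handler of any real operating system: the names `FaultRecord` and `CorrectableFaultHandler`, the 4096-byte page size, the fixed address offset used for translation, and the choice of "retire the page after a limit" as the fault handling operation are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class FaultRecord:
    """Step 1: fault data collected for one correctable memory fault."""
    physical_address: int


class CorrectableFaultHandler:
    """Toy model of the per-interrupt work done by the OS handler."""

    def __init__(self, offset=0, page_retire_limit=3):
        self.offset = offset                    # hypothetical physical-to-logical offset
        self.page_retire_limit = page_retire_limit
        self.page_counts = Counter()            # correctable faults seen per page
        self.retired_pages = set()

    def handle(self, record):
        # Step 2: translate the faulty physical address to a logical address.
        logical = record.physical_address + self.offset
        # Step 3: update the fault count of the page that owns the address.
        page = logical // PAGE_SIZE
        self.page_counts[page] += 1
        # Step 4: fault handling operation (assumed here: retire a noisy page).
        if self.page_counts[page] >= self.page_retire_limit:
            self.retired_pages.add(page)
        return page
```

Because every interrupt runs all four steps, a burst of interrupts translates directly into a burst of handler work, which is the cost the rest of the description sets out to limit.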
In the course of implementing the present invention, the inventor found that the prior art has at least the following problem: when a hardware module experiences a large number of correctable faults in a short time, that is, when a correctable fault storm occurs, the hardware module generates a large number of correctable fault interrupts and notifies the operating system to enter the interrupt handler. The operating system has to perform the fault handling described above for every correctable fault, so it remains in a continuous fault handling state that occupies a large amount of its processing resources and may even prevent it from running normally.
Summary of the invention
To solve the problem described in the background, namely that when a hardware module experiences a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state that occupies a large amount of its processing resources and may even prevent it from running normally, embodiments of the present invention provide a fault handling apparatus and method. The technical solutions are as follows:
In a first aspect, a fault handling apparatus is provided for use in a server that includes at least one hardware module, the apparatus comprising:
a statistics module, configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, the correctable fault interrupt being an interrupt generated by the hardware module when a correctable fault occurs;
a detection module, configured to detect whether the frequency is greater than a disable threshold;
a first switching module, configured to switch the correctable fault interrupt of the hardware module from an enabled state to a disabled state when the frequency is detected to be greater than the disable threshold.
In a first possible implementation of the first aspect, the statistics module comprises:
a reading module, configured to read, through an interrupt handler, from a machine check exception (MCE) memory, the number of correctable fault interrupts generated by the hardware module within the predetermined time period, the interrupt handler being the interrupt handler for processing the correctable fault, and the MCE memory being the MCE memory corresponding to the hardware module;
a calculation module, configured to calculate, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
the detection module is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
In a second possible implementation of the first aspect, the apparatus further comprises:
a starting module, configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;
a second switching module, configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
In a third possible implementation of the first aspect, the apparatus further comprises:
a first lookup module, configured to obtain the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and to look up the corresponding disable threshold in a first relationship table according to the level, where the first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the obtained level;
or,
a second lookup module, configured to obtain the service processing capability level of the server, the service processing capability level being determined based on the at least one hardware module, and to look up the corresponding disable threshold in a second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the obtained service processing capability level.
With reference to the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the apparatus further comprises:
a third lookup module, configured to obtain the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and to look up the corresponding predetermined timer duration in a third relationship table according to the level, where the third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the obtained level;
or,
a fourth lookup module, configured to obtain the service processing capability level of the server, the service processing capability level being determined based on the at least one hardware module, and to look up the corresponding predetermined timer duration in a fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the obtained service processing capability level.
With reference to the second possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the first switching module is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
the second switching module is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
In a second aspect, a fault handling method is provided for use in a server that includes at least one hardware module, the method comprising:
counting the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, the correctable fault interrupt being an interrupt generated by the hardware module when a correctable fault occurs;
detecting whether the frequency is greater than a disable threshold;
when the frequency is detected to be greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state.
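A minimal sketch of these three steps is given below, assuming the interrupt count for the period is already available. The names `HardwareModule`, `interrupt_frequency`, and `check_and_disable`, and the choice of interrupts per second as the unit of frequency, are illustrative assumptions rather than anything the text specifies.

```python
from dataclasses import dataclass


@dataclass
class HardwareModule:
    """Hypothetical stand-in for one hardware module and its interrupt state."""
    name: str
    cmci_enabled: bool = True


def interrupt_frequency(interrupt_count, period_seconds):
    """Frequency of correctable fault interrupts over the predetermined period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return interrupt_count / period_seconds


def check_and_disable(module, interrupt_count, period_seconds, disable_threshold):
    """Disable the module's interrupt when the frequency exceeds the threshold."""
    frequency = interrupt_frequency(interrupt_count, period_seconds)
    # The text says "greater than", so equality leaves the interrupt enabled.
    if frequency > disable_threshold:
        module.cmci_enabled = False
    return frequency
```

The state is kept per module because each hardware module is switched independently; one noisy module does not silence the others.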
In a first possible implementation of the second aspect, counting the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period comprises:
reading, through an interrupt handler, from a machine check exception (MCE) memory, the number of correctable fault interrupts generated by the hardware module within the predetermined time period, the interrupt handler being the interrupt handler for processing the correctable fault, and the MCE memory being the MCE memory corresponding to the hardware module;
calculating, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
detecting whether the frequency is greater than the disable threshold comprises:
detecting, through the interrupt handler, whether the frequency is greater than the disable threshold.
In a second possible implementation of the second aspect, the method further comprises:
starting a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;
when the timer reaches a predetermined duration, switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
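The timer behaviour can be sketched as below. The class name `CmciGate`, the `tick` polling method, and passing the current time in as a plain number are assumptions chosen so the example runs without real timers or sleeping.

```python
class CmciGate:
    """Toy model: disabling starts a timer, and expiry re-enables the interrupt."""

    def __init__(self, reenable_after):
        self.enabled = True
        self.reenable_after = reenable_after  # predetermined duration, in seconds
        self._disabled_at = None              # time at which the timer was started

    def disable(self, now):
        # Switching to the disabled state starts the timer at the same moment.
        self.enabled = False
        self._disabled_at = now

    def tick(self, now):
        # Called periodically; re-enables once the predetermined duration elapsed.
        if not self.enabled and now - self._disabled_at >= self.reenable_after:
            self.enabled = True
            self._disabled_at = None
        return self.enabled
```

For example, with `reenable_after=30.0`, a gate disabled at time 100.0 stays disabled at 110.0 and is enabled again from 130.0 onward.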
In a third possible implementation of the second aspect, before detecting whether the frequency is greater than the disable threshold, the method further comprises:
obtaining the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and looking up the corresponding disable threshold in a first relationship table according to the level, where the first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the obtained level;
or,
obtaining the service processing capability level of the server, the service processing capability level being determined based on the at least one hardware module, and looking up the corresponding disable threshold in a second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the obtained service processing capability level.
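A relationship table of this kind can be modelled as a plain mapping from level to threshold. In the sketch below the table contents, the level numbering, and the direction in which the threshold varies with the level are placeholders invented for the example; only the lookup itself reflects the text.

```python
# Hypothetical first relationship table: real-time requirement level -> disable threshold.
REALTIME_LEVEL_TO_DISABLE_THRESHOLD = {1: 100.0, 2: 50.0, 3: 10.0}

# Hypothetical second relationship table: service processing capability level -> disable threshold.
CAPABILITY_LEVEL_TO_DISABLE_THRESHOLD = {1: 10.0, 2: 50.0, 3: 100.0}


def lookup_disable_threshold(table, level):
    """Return the disable threshold stored for `level` in the given table."""
    try:
        return table[level]
    except KeyError:
        # The text requires the table to contain the obtained level.
        raise ValueError(f"level {level!r} is not in the relationship table") from None
```

Either table can feed the same comparison step, which matches the "or" in the text: only one of the two lookups is needed to obtain a threshold.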
With reference to the second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, before starting the timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, the method further comprises:
obtaining the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and looking up the corresponding predetermined timer duration in a third relationship table according to the level, where the third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the obtained level;
or,
obtaining the service processing capability level of the server, the service processing capability level being determined based on the at least one hardware module, and looking up the corresponding predetermined timer duration in a fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the obtained service processing capability level.
With reference to the second possible implementation of the second aspect, in a fifth possible implementation of the second aspect, switching the correctable fault interrupt of the hardware module from the enabled state to the disabled state comprises:
setting the flag value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state comprises:
setting the flag value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
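Setting a flag value in an enable register usually comes down to setting or clearing one bit while leaving the rest of the register untouched. The sketch below models the register as a Python integer; the bit position `1 << 30` and the function names are assumptions for illustration, since the text does not say where the flag lives.

```python
CMCI_ENABLE_BIT = 1 << 30  # assumed position of the enable flag in the register


def set_disabled(register_value):
    """Return the register value with the enable flag cleared (disable value)."""
    return register_value & ~CMCI_ENABLE_BIT


def set_enabled(register_value):
    """Return the register value with the enable flag set (enable value)."""
    return register_value | CMCI_ENABLE_BIT


def is_enabled(register_value):
    """Report whether the enable flag is currently set."""
    return bool(register_value & CMCI_ENABLE_BIT)
```

Using read-modify-write on a single bit keeps any other fields in the same register intact, which matters if the register also carries unrelated configuration.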
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects:
By counting the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detecting whether the frequency is greater than the disable threshold, and switching the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the frequency is detected to be greater than the disable threshold, the solution addresses the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state that occupies a large amount of its processing resources and may even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a structural block diagram of a fault handling apparatus according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a fault handling apparatus according to another embodiment of the present invention;
FIG. 3A is a block diagram of a fault handling apparatus according to an embodiment of the present invention;
FIG. 3B is a block diagram of a fault handling apparatus according to another embodiment of the present invention;
FIG. 4 is a flowchart of a fault handling method according to an embodiment of the present invention;
FIG. 5A is a flowchart of a fault handling method according to another embodiment of the present invention;
FIG. 5B is a schematic diagram of an implementation of a fault handling method according to another embodiment of the present invention;
FIG. 6 is a flowchart of a fault handling method according to yet another embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
For ease of understanding, some terms used in the embodiments of the present invention are explained first:
Disabled state: the state in which a hardware module cannot generate a correctable fault interrupt in response to a correctable fault, that is, the state in which the operating system does not receive correctable fault interrupts generated by that hardware module. The mechanisms by which the individual hardware modules generate correctable fault interrupts are usually independent of one another.
Enabled state: the state in which a hardware module can generate a correctable fault interrupt in response to a correctable fault, that is, the state in which the operating system can receive correctable fault interrupts generated by that hardware module.
Positive correlation: two variables change in the same direction, that is, when one variable increases, the other also increases, and when one variable decreases, the other also decreases; the relationship may be linear or nonlinear.
Negative correlation: two variables change in opposite directions, that is, when one variable increases, the other decreases, and when one variable decreases, the other increases; the relationship may be linear or nonlinear.
Correctable fault interrupt enable register: setting the flag value in the correctable fault interrupt enable register switches the correctable fault interrupt of a hardware module between the enabled state and the disabled state. Each hardware module has its own correctable fault interrupt enable register.
Refer to FIG. 1, which shows a structural block diagram of a fault handling apparatus according to an embodiment of the present invention. The fault handling apparatus includes:
a statistics module 110, configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, the correctable fault interrupt being an interrupt generated by the hardware module when a correctable fault occurs;
a detection module 120, configured to detect whether the frequency is greater than a disable threshold;
a first switching module 130, configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the frequency is detected to be greater than the disable threshold.
In summary, the fault handling apparatus provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the frequency is detected to be greater than the disable threshold. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system remains in a continuous fault handling state that occupies a large amount of its processing resources and may even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
Refer to FIG. 2, which shows a structural block diagram of a fault handling apparatus according to another embodiment of the present invention. The fault handling apparatus includes:
a statistics module 210, configured to count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, the correctable fault interrupt being an interrupt generated by the hardware module when a correctable fault occurs;
a detection module 220, configured to detect whether the frequency is greater than a disable threshold;
a first switching module 230, configured to switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state when the frequency is detected to be greater than the disable threshold.
Optionally, the statistics module 210 includes:
a reading module 211, configured to read, through an interrupt handler, from a machine check exception (Machine Check Exception, MCE) memory, the number of correctable fault interrupts generated by the hardware module within the predetermined time period, the interrupt handler being the interrupt handler for processing the correctable fault, and the MCE memory being the MCE memory corresponding to the hardware module;
a calculation module 212, configured to calculate, through the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
the detection module 220 is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
Optionally, the apparatus further includes:
a starting module 240, configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state;
a second switching module 250, configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
Optionally, the apparatus further includes:
a first lookup module 260, configured to obtain the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and to look up the corresponding disable threshold in a first relationship table according to the level, where the first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the obtained level;
or,
a second lookup module 270, configured to obtain the service processing capability level of the server, the service processing capability level being determined based on at least one hardware module, and to look up the corresponding disable threshold in a second relationship table according to the service processing capability level, where the second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the obtained service processing capability level.
Optionally, the apparatus further includes:
a third lookup module 280, configured to obtain the level of the real-time requirement of the service processed by the server, the service being a task that runs on at least one hardware module in the server, and to look up the corresponding predetermined timer duration in a third relationship table according to the level, where the third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the obtained level;
or,
a fourth lookup module 290, configured to obtain the service processing capability level of the server, the service processing capability level being determined based on at least one hardware module, and to look up the corresponding predetermined timer duration in a fourth relationship table according to the service processing capability level, where the fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the obtained service processing capability level.
Optionally, the first switching module 230 is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
the second switching module 250 is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
In summary, the fault handling apparatus provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when it is, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system stays in a continuous fault handling state, which consumes a large share of its processing resources and can even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
This embodiment also sets a timer when the correctable fault interrupt of the hardware module is switched to the disabled state. When the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state back to the enabled state, and the enabled state is kept when the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold. This ensures that correctable fault interrupts generated after a correctable fault storm ends are handled in time.
Refer to FIG. 3A, which shows a block diagram of a fault handling apparatus provided by an embodiment of the present invention. The fault handling apparatus may include a processor 310 and at least one hardware module 320, where the processor 310 is electrically connected to the at least one hardware module 320. This embodiment is described using an example in which the at least one hardware module 320 includes a hardware module 321 and a hardware module 322.
The processor 310 is configured to count the frequency at which the at least one hardware module 320 in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by a hardware module when a correctable fault occurs;
The processor 310 is configured to detect whether the frequency is greater than the disable threshold;
The processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state when it detects that the frequency is greater than the disable threshold.
In summary, the fault handling apparatus provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when it is, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system stays in a continuous fault handling state, which consumes a large share of its processing resources and can even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
On the basis of FIG. 3A, the fault handling apparatus may further include an MCE memory and a correctable fault interrupt enable register corresponding to each hardware module, as well as a memory that stores one or more programs, including the interrupt handler used to process correctable faults. This embodiment is described using an example in which the at least one hardware module 320 includes a hardware module 321 and a hardware module 322. As shown in FIG. 3B, the fault handling apparatus 300 includes: a processor 310, the hardware module 321, the hardware module 322, an MCE memory 331 corresponding to the hardware module 321, a correctable fault interrupt enable register 341 corresponding to the hardware module 321, an MCE memory 332 corresponding to the hardware module 322, a correctable fault interrupt enable register 342 corresponding to the hardware module 322, and a memory 350. The processor 310 is electrically connected to the at least one hardware module 320, the memory 350, and the MCE memory and correctable fault interrupt enable register corresponding to each hardware module.
The processor 310 is configured to count the frequency at which the at least one hardware module 320 in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by a hardware module when a correctable fault occurs;
The processor 310 is configured to detect whether the frequency is greater than the disable threshold;
The processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state when it detects that the frequency is greater than the disable threshold.
Specifically, when counting the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, the processor 310 is configured to read from the MCE memory, through the interrupt handler, the number of correctable fault interrupts generated by the hardware module 320 within the predetermined time period. The interrupt handler is the interrupt handler used to process correctable faults, and the MCE memory is the MCE memory corresponding to the hardware module 320;
The processor 310 is configured to calculate the frequency, through the interrupt handler, from the predetermined time period and the number of correctable fault interrupts;
The processor 310 is configured to detect, through the interrupt handler, whether the frequency is greater than the disable threshold.
Specifically, the processor 310 is configured to start a timer when switching the correctable fault interrupt of the hardware module 320 from the enabled state to the disabled state;
The processor 310 is configured to switch the correctable fault interrupt of the hardware module 320 from the disabled state to the enabled state when the timer reaches the predetermined duration.
Specifically, when determining the disable threshold, the processor 310 is configured to obtain the level of the real-time requirement of the service processed in the server, where the service is a task run on at least one hardware module in the server, and to look up the corresponding disable threshold in a first relationship table according to that level. The first relationship table stores at least one level and the disable threshold corresponding to each level, and the levels stored in the table include the obtained level;
or,
the processor 310 is configured to obtain the service processing capability level of the server, where the level is determined based on the at least one hardware module, and to look up the corresponding disable threshold in a second relationship table according to that level. The second relationship table stores at least one service processing capability level and the disable threshold corresponding to each level, and the levels stored in the table include the obtained level.
Specifically, when determining the predetermined timer duration, the processor 310 is configured to obtain the level of the real-time requirement of the service processed in the server, where the service is a task run on at least one hardware module in the server, and to look up the corresponding predetermined timer duration in a third relationship table according to that level. The third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the levels stored in the table include the obtained level;
or,
the processor 310 is configured to obtain the service processing capability level of the server, where the level is determined based on the at least one hardware module, and to look up the corresponding predetermined timer duration in a fourth relationship table according to that level. The fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each level, and the levels stored in the table include the obtained level.
The processor 310 is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module 320 to the disable value;
The processor 310 is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module 320 to the enable value.
In summary, the fault handling apparatus provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when it is, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system stays in a continuous fault handling state, which consumes a large share of its processing resources and can even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
This embodiment also sets a timer when the correctable fault interrupt of the hardware module is switched to the disabled state. When the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module is switched from the disabled state back to the enabled state, and the enabled state is kept when the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold. This ensures that correctable fault interrupts generated after a correctable fault storm ends are handled in time.
Refer to FIG. 4, which shows a flowchart of a fault handling method provided by an embodiment of the present invention. The method may be used in a server that includes at least one hardware module, and includes:
Step 402: count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
A correctable fault interrupt (Corrected Machine-Check Error Interrupt, CMCI for short) is an interrupt generated by a hardware module when a correctable fault occurs. The interrupt notifies the operating system to enter the interrupt handler and process the correctable fault.
Step 404: detect whether the frequency is greater than the disable threshold;
Step 406: when the frequency is detected to be greater than the disable threshold, switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
In summary, the fault handling method provided in this embodiment counts the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when it is, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a hardware module experiences a large number of correctable faults in a short time, the operating system stays in a continuous fault handling state, which consumes a large share of its processing resources and can even prevent it from running normally. As a result, when a hardware module experiences a large number of correctable faults in a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and its operating efficiency improves.
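The three steps of FIG. 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HardwareModule` class, its `cmci_enabled` flag, and the function name are hypothetical stand-ins, and the frequency is expressed as interrupts per second.

```python
from dataclasses import dataclass


@dataclass
class HardwareModule:
    """Hypothetical stand-in for one hardware module and its CMCI state."""
    name: str
    cmci_enabled: bool = True


def handle_fault_rate(module, interrupt_count, window_seconds, disable_threshold):
    """Apply steps 402-406 once and return the computed frequency (per second)."""
    frequency = interrupt_count / window_seconds   # step 402: count the frequency
    if frequency > disable_threshold:              # step 404: compare with threshold
        module.cmci_enabled = False                # step 406: switch to disabled
    return frequency


module = HardwareModule("A")
handle_fault_rate(module, interrupt_count=10, window_seconds=5, disable_threshold=8 / 5)
print(module.cmci_enabled)  # False: 10 per 5 s exceeds 8 per 5 s
```

A frequency exactly equal to the threshold leaves the interrupt enabled, since the method only acts when the frequency is greater than the disable threshold.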
Refer to FIG. 5A, which shows a flowchart of a fault handling method provided by another embodiment of the present invention. The method may be used in a server that includes at least one hardware module, and includes:
Step 501: count the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, where a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs.
The server may be a device based on the X86 architecture. Because most existing servers use the X86 architecture, this embodiment is described using an X86-architecture device as the server, which does not limit the present invention.
A hardware module is a hardware processing device with a distinct processing function in an X86-architecture device, and an X86-architecture device contains at least one hardware module. In an X86-architecture device, each hardware module has its own MCE memory, which stores the correctable fault interrupts generated by that hardware module. The interrupt handler can obtain, from the MCE memory corresponding to a hardware module, the number of correctable fault interrupts generated within the predetermined time period, and from it calculate the frequency at which correctable fault interrupts are generated. This step may include the following sub-steps:
1. Through the interrupt handler, the X86-architecture device reads from the MCE memory the number of correctable fault interrupts generated by the hardware module within the predetermined time period. The interrupt handler is the interrupt handler used to process correctable faults, and the MCE memory is the MCE memory corresponding to the hardware module.
When a correctable fault occurs in a hardware module, the hardware module generates a correctable fault interrupt and notifies the operating system to enter the interrupt handler to process it. Based on the correctable fault interrupt, the interrupt handler determines which hardware module has failed, and reads from the MCE memory corresponding to that hardware module the number of correctable fault interrupts the module generated within the predetermined time period. The predetermined time period is preset by the operating system and may be, for example, 5 seconds.
For example, the interrupt handler receives a correctable fault interrupt notification, determines that the hardware module in which the correctable fault occurred is hardware module A, and reads from MCE memory A, which corresponds to hardware module A, that 10 correctable fault interrupts were generated in the last 5 seconds.
2. Through the interrupt handler, the X86-architecture device calculates the frequency from the predetermined time period and the number of correctable fault interrupts.
From the number of correctable fault interrupts it read and the length of the predetermined time period, the interrupt handler calculates the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period.
For example, if the interrupt handler reads that 10 correctable fault interrupts were generated within the predetermined time period, and the predetermined time period is 5 seconds, the calculated frequency at which the hardware module generates correctable fault interrupts is 10 times per 5 seconds.
It should be noted that, because multiple hardware modules may experience correctable faults at the same time while the X86-architecture device is running, the interrupt handler needs to count the frequency of correctable fault interrupts separately for each hardware module. This embodiment describes only the case in which the interrupt handler counts the frequency for one hardware module, which does not limit the invention.
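The sub-steps of step 501, including the per-module bookkeeping noted above, can be sketched as follows. This is a minimal illustration, not the patented implementation: the `InterruptLog` class is a hypothetical model of the per-module MCE memory as a list of timestamps, and the 5-second window is the example value from the text.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5.0  # example window length taken from the text


class InterruptLog:
    """Hypothetical model of per-module MCE memory as a log of timestamps."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self._events = defaultdict(deque)  # module name -> timestamps (seconds)

    def record(self, module, timestamp):
        """Store one correctable fault interrupt for the given module."""
        self._events[module].append(timestamp)

    def count_recent(self, module, now):
        """Sub-step 1: number of interrupts in the last `window` seconds."""
        events = self._events[module]
        while events and events[0] <= now - self.window:
            events.popleft()  # drop entries that fell out of the window
        return len(events)

    def frequency(self, module, now):
        """Sub-step 2: interrupts per second over the window."""
        return self.count_recent(module, now) / self.window


log = InterruptLog()
for i in range(10):
    log.record("A", 10.0 + 0.4 * i)  # ten interrupts between t=10.0 and t=13.6
log.record("B", 12.0)                # module B is tracked independently
```

Keeping one queue per module means a storm in module A does not affect the count for module B.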
Step 502: detect whether the frequency is greater than the disable threshold.
Through the interrupt handler, the X86-architecture device detects whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, it can be determined that a correctable fault storm has occurred in the hardware module; when the frequency is less than the disable threshold, it can be determined that no correctable fault storm has occurred in the hardware module.
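The storm check of step 502 can be sketched for several modules at once. The function name and the sample counts are hypothetical; the counts and the threshold are both taken over the same window length, so they can be compared directly as integers.

```python
def modules_in_storm(counts_per_window, disable_threshold):
    """Return the modules whose interrupt count in the window exceeds the threshold."""
    return sorted(name for name, count in counts_per_window.items()
                  if count > disable_threshold)


counts = {"A": 10, "B": 8, "C": 3}  # hypothetical counts over one 5-second window
print(modules_in_storm(counts, disable_threshold=8))  # ['A']
```

Module B sits exactly at the threshold and is not reported, because the check is strictly "greater than".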
The disable threshold may be set in advance, or set in real time according to the real-time requirement of the service processed in the X86-architecture device or according to the service processing capability of the X86-architecture device. Setting the disable threshold may include the following two possible implementations:
In the first possible implementation, the X86-architecture device obtains the level of the real-time requirement of the service it processes, where the service is a task run on at least one hardware module in the X86-architecture device, and looks up the corresponding disable threshold in a first relationship table according to that level. The first relationship table stores at least one level and the disable threshold corresponding to each level, and the levels stored in the table include the obtained level.
When the service processed by the X86-architecture device has a high real-time requirement, frequent entry of the operating system into the interrupt handler to process correctable fault interrupts would keep the service from being processed in time, so a smaller disable threshold can be set to let the operating system process the current service promptly. When the service processed by the X86-architecture device has a low real-time requirement, the disable threshold can be set larger.
The first relationship table in the operating system pre-stores the correspondence between each level of the service's real-time requirement and its disable threshold. The levels are negatively correlated with the disable thresholds: the higher the level of the service's real-time requirement, the smaller the corresponding disable threshold, and the lower the level, the larger the corresponding disable threshold. The structure of the first relationship table may be, for example, as shown in Table 1:
Table 1
Level of real-time requirement    Disable threshold
1                                 10 times / 5 seconds
2                                 8 times / 5 seconds
3                                 5 times / 5 seconds
Here, a higher level indicates that the service has a higher real-time requirement, and a lower level indicates that the service has a lower real-time requirement.
The operating system obtains the level of the real-time requirement of the service processed in the X86-architecture device, looks up the corresponding disable threshold in the first relationship table, and uses it as the disable threshold suited to that service.
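The first implementation amounts to a table lookup, sketched below. The dictionary holds the example values of Table 1 as counts per 5-second window; the names and the error handling for an unknown level are assumptions added for illustration.

```python
# Table 1 from the text: real-time requirement level -> disable threshold,
# expressed as correctable fault interrupts per 5-second window.
DISABLE_THRESHOLD_BY_REALTIME_LEVEL = {1: 10, 2: 8, 3: 5}


def disable_threshold_for_realtime_level(level):
    """Look up the disable threshold for a service's real-time requirement level."""
    try:
        return DISABLE_THRESHOLD_BY_REALTIME_LEVEL[level]
    except KeyError:
        # Assumed behavior: the text only covers levels present in the table.
        raise ValueError(f"no disable threshold configured for level {level}") from None


print(disable_threshold_for_realtime_level(3))  # 5
```

The values decrease as the level rises, which reflects the negative correlation between the real-time requirement and the disable threshold.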
In the second possible implementation, the X86-architecture device obtains its own service processing capability level, where the level is determined based on the at least one hardware module, and looks up the corresponding disable threshold in a second relationship table according to that level. The second relationship table stores at least one service processing capability level and the disable threshold corresponding to each level, and the levels stored in the table include the obtained level.
X86-architecture devices with different service processing capabilities spend different amounts of processing resources and time when the operating system enters the interrupt handler for fault handling, so the operating system can set the disable threshold according to the service processing capability of the X86-architecture device.
The second relationship table in the operating system pre-stores the correspondence between the service processing capability levels of the X86-architecture device and the disable thresholds. The levels are positively correlated with the disable thresholds: the higher the service processing capability level, the larger the corresponding disable threshold, and the lower the level, the smaller the corresponding disable threshold. The structure of the second relationship table may be, for example, as shown in Table 2:
Table 2
Service processing capability level    Disable threshold
1                                      5 times / 5 seconds
2                                      8 times / 5 seconds
3                                      10 times / 5 seconds
Here, a higher service processing capability level indicates that the X86-architecture device has a stronger service processing capability, and a lower level indicates a weaker capability. The service processing capability levels can be divided according to the hardware score of the X86-architecture device.
The operating system obtains the service processing capability level of the X86-architecture device, looks up the corresponding disable threshold in the second relationship table, and uses it as the disable threshold suited to that X86-architecture device.
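The second implementation can be sketched as a two-stage lookup: hardware score to capability level, then level to threshold. The thresholds are the example values of Table 2. The score boundaries are purely hypothetical, since the text says only that levels can be divided according to a hardware score and gives no concrete values.

```python
from bisect import bisect_right

# Hypothetical boundaries: scores below 40 map to level 1, 40 to 69 map to
# level 2, and 70 or above map to level 3. The text gives no concrete values.
SCORE_BOUNDARIES = [40, 70]

# Table 2 from the text: capability level -> disable threshold,
# expressed as correctable fault interrupts per 5-second window.
DISABLE_THRESHOLD_BY_CAPABILITY_LEVEL = {1: 5, 2: 8, 3: 10}


def capability_level(hardware_score):
    """Map a hardware score to a service processing capability level."""
    return bisect_right(SCORE_BOUNDARIES, hardware_score) + 1


def disable_threshold_for_score(hardware_score):
    """Look up the disable threshold for a device with the given hardware score."""
    return DISABLE_THRESHOLD_BY_CAPABILITY_LEVEL[capability_level(hardware_score)]


print(disable_threshold_for_score(75))  # 10
```

A more capable device tolerates more interrupts before the threshold is reached, which matches the positive correlation described for Table 2.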
It should be noted that the operating system may also set the disable threshold by considering both the level of the service's real-time requirement and the service processing capability level of the X86-architecture device, which does not limit the present invention.
Step 503: when the frequency is detected to be greater than the disable threshold, switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
When the frequency is detected to be greater than the disable threshold, the X86-architecture device learns that a correctable fault storm has occurred in the hardware module, which means that the hardware module will generate a large number of correctable fault interrupts in a short time. To keep the operating system from staying in a continuous fault handling state during the storm, which would prevent it from running normally, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
When no correctable fault storm has occurred in the hardware module, the flag value in the correctable fault interrupt enable register corresponding to the hardware module is the enable value, that is, the correctable fault interrupt of the hardware module is in the enabled state. When a correctable fault storm is detected in the hardware module, the interrupt handler sets the flag value in the correctable fault interrupt enable register corresponding to the hardware module to the disable value, that is, the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state. While the correctable fault interrupt is in the disabled state, the hardware module cannot generate correctable fault interrupts in response to correctable faults, so the operating system does not frequently enter the interrupt handler for fault handling.
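Setting the flag value in the enable register can be sketched as a single-bit update on an integer. This is a simplified model: the text does not specify a register layout, and the bit position below is borrowed from the CMCI_EN flag (bit 30) of Intel's IA32_MCi_CTL2 register purely as an illustration. The function names are hypothetical.

```python
# Assumed flag position; the text does not specify the register layout.
CMCI_ENABLE_BIT = 1 << 30


def set_cmci_enabled(register_value, enabled):
    """Return the register value with the enable flag set or cleared.

    All other bits are preserved, so unrelated fields stay untouched.
    """
    if enabled:
        return register_value | CMCI_ENABLE_BIT
    return register_value & ~CMCI_ENABLE_BIT


def is_cmci_enabled(register_value):
    """Report whether the enable flag is set."""
    return bool(register_value & CMCI_ENABLE_BIT)


reg = set_cmci_enabled(0x1, enabled=True)    # enabled state
reg = set_cmci_enabled(reg, enabled=False)   # storm detected: write disable value
print(is_cmci_enabled(reg))  # False
```

On real hardware the write would go through a privileged register access rather than a plain integer, which this sketch does not attempt to model.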
Step 504: start a timer when switching the correctable fault interrupt of the hardware module from the enabled state to the disabled state.
At the same time as the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state, it starts a preset timer. Until the timer reaches the predetermined duration, the correctable fault interrupt of the hardware module stays in the disabled state, and the operating system does not enter the interrupt handler for fault handling.
It should be noted that there is no strict ordering between step 503 and step 504, and the two may be performed at the same time. This embodiment only uses the case in which step 503 is performed before step 504 as an example, which does not limit the present invention.
Step 505: when the timer reaches the predetermined duration, switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
If the correctable fault interrupt of the hardware module remained in the disabled state after the correctable fault storm had passed, the operating system could not receive correctable fault interrupts or handle the faults. To prevent this, when the timer reaches the predetermined duration, the interrupt handler sets the flag value in the correctable fault interrupt enable register corresponding to the hardware module to the enable value, that is, the correctable fault interrupt of the hardware module is switched from the disabled state to the enabled state. The hardware module can then again generate a correctable fault interrupt in response to a correctable fault and notify the operating system to enter the interrupt handler for fault handling. The fault-handling steps performed by the interrupt handler are similar to those in the prior art and are not described again here.
The predetermined duration of the timer may be set in advance, or it may be set in real time according to the real-time requirement of the services processed by the X86 architecture device or according to the service processing capability of the X86 architecture device. Setting the predetermined duration may include the following two possible implementations:
In a first possible implementation, the X86 architecture device obtains the level of the real-time requirement of the service it processes, where the service is a task run on at least one hardware module of the X86 architecture device. It then looks up the corresponding predetermined timer duration in a third relationship table according to that level. The third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the levels in the table include the obtained level.
The third relationship table in the operating system stores in advance the correspondence between the level of the service's real-time requirement and the predetermined timer duration. Each level is positively correlated with its predetermined timer duration: the higher the level, the longer the duration, and the lower the level, the shorter the duration. The structure of the third relationship table may, by way of example, be as shown in Table 3:
Table 3
Level of the service's real-time requirement    Predetermined timer duration
1                                               100 seconds
2                                               120 seconds
3                                               150 seconds
A higher level indicates that the service has a higher real-time requirement, and a lower level indicates that the service has a lower real-time requirement. The operating system obtains the level of the real-time requirement of the service processed by the X86 architecture device, looks up the corresponding predetermined timer duration in the third relationship table, and sets the current predetermined timer duration accordingly.
In a second possible implementation, the X86 architecture device obtains its service processing capability level, which is determined based on at least one hardware module. It then looks up the corresponding predetermined timer duration in a fourth relationship table according to that level. The fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each level, and the levels in the table include the obtained service processing capability level.
The fourth relationship table in the operating system stores in advance the correspondence between the service processing capability level of the X86 architecture device and the predetermined timer duration. Each service processing capability level is negatively correlated with its predetermined timer duration: the higher the level, the shorter the duration, and the lower the level, the longer the duration. The structure of the fourth relationship table may, by way of example, be as shown in Table 4:
Table 4
Service processing capability level    Predetermined timer duration
1                                      150 seconds
2                                      120 seconds
3                                      100 seconds
A higher service processing capability level indicates that the X86 architecture device has a stronger service processing capability, and a lower level indicates a weaker capability. The service processing capability levels may be assigned according to a hardware score of the X86 architecture device.
The operating system obtains the service processing capability level of the X86 architecture device, looks up the corresponding predetermined timer duration in the fourth relationship table, and sets the current predetermined timer duration accordingly.
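The two lookups described above can be sketched as plain dictionary lookups. The values are the example values of Table 3 and Table 4; the names `TABLE_3`, `TABLE_4`, and the two function names are hypothetical, and this embodiment does not prescribe how the tables are stored.

```python
# Table 3: level of the service's real-time requirement -> duration (seconds)
TABLE_3 = {1: 100, 2: 120, 3: 150}

# Table 4: service processing capability level -> duration (seconds)
TABLE_4 = {1: 150, 2: 120, 3: 100}


def duration_from_realtime_level(level: int) -> int:
    """First implementation: look up the duration by real-time level."""
    return TABLE_3[level]


def duration_from_capability_level(level: int) -> int:
    """Second implementation: look up the duration by capability level."""
    return TABLE_4[level]


# Example values only: a level-3 real-time requirement and a level-3
# capability level, looked up independently of each other.
realtime_duration = duration_from_realtime_level(3)
capability_duration = duration_from_capability_level(3)
```

A level that is missing from a table raises `KeyError` here. The text states that each table contains the obtained level, so no fallback is sketched.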
It should be noted that the operating system may also set the predetermined timer duration by combining the level of the service's real-time requirement with the service processing capability level of the X86 architecture device, which does not limit the present invention.
Clearly, the higher the real-time requirement of the service processed by the X86 architecture device, or the weaker the service processing capability of the X86 architecture device, the longer the corresponding predetermined timer duration, which ensures that the operating system processes the service in a timely manner. It should be noted that when the timer reaches the predetermined duration, the timer is reset. In addition, so that the operating system knows approximately how many correctable faults occurred in the hardware module during the correctable fault storm, the timer calculates an estimate of that number. The estimate may be the product of the predetermined duration set for the timer and the frequency, obtained in step 501, at which the hardware module generates correctable fault interrupts.
For example, if the predetermined duration set for the timer is 100 seconds and the measured frequency at which the hardware module generates correctable fault interrupts is 10 times per 5 seconds, the estimated number of correctable faults occurring during the correctable fault storm is 200. The estimate is mainly used for counting the number of correctable faults.
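The arithmetic of this example can be checked with a short sketch. The function name `estimated_fault_count` and its parameters are hypothetical; only the product of duration and frequency comes from the text.

```python
from fractions import Fraction


def estimated_fault_count(duration_s: int, interrupts: int, window_s: int) -> Fraction:
    """Predetermined duration multiplied by the measured interrupt frequency."""
    frequency = Fraction(interrupts, window_s)  # interrupts per second
    return duration_s * frequency


# 100 seconds at 10 interrupts per 5 seconds
estimate = estimated_fault_count(100, 10, 5)
print(estimate)  # prints 200
```

`Fraction` is used so that a frequency such as 10 per 5 seconds is held exactly rather than as a rounded float.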
Step 506: detect again whether the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold.
After the interrupt handler switches the correctable fault interrupt of the hardware module from the disabled state to the enabled state, it again counts the correctable fault interrupts received within a predetermined time period and calculates the frequency at which correctable fault interrupts are generated within that period.
The interrupt handler detects whether the calculated frequency is less than the preset enable threshold. The enable threshold is a preset threshold used to detect whether the correctable fault storm has ended, and may be 1 time per 5 seconds.
Step 507: when the frequency at which the hardware module generates correctable fault interrupts is detected to be less than the enable threshold, keep the correctable fault interrupt of the hardware module in the enabled state.
When the frequency at which the hardware module generates correctable fault interrupts is detected to be less than the enable threshold, the interrupt handler can determine that the correctable fault storm has ended. Subsequent correctable fault interrupts generated by the hardware module will not keep the operating system in a continuous fault-handling state, that is, the operating system can run normally. Accordingly, the correctable fault interrupt of the hardware module remains in the enabled state.
It should be noted that, to prevent another correctable fault storm in the hardware module from putting the operating system into continuous fault handling, the interrupt handler continues to detect whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, the interrupt handler switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state and restarts the timer.
Step 508: when the frequency at which the hardware module generates correctable fault interrupts is detected to be greater than the enable threshold, switch the correctable fault interrupt of the hardware module from the enabled state to the disabled state, and restart the timer.
When the frequency at which the hardware module generates correctable fault interrupts is detected to be greater than the enable threshold, the interrupt handler considers that the correctable fault storm has not yet ended. It again switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state and restarts the timer.
When the timer again reaches the predetermined duration, the interrupt handler performs step 506 to step 508 again.
Clearly, because a mechanism for detecting the frequency at which the hardware module generates correctable fault interrupts is added to the interrupt handler of the operating system, the correctable fault interrupt of the hardware module is placed in the disabled state when a correctable fault storm occurs in the hardware module. The operating system therefore does not enter continuous fault handling and can run normally, which greatly improves the stability of the operating system.
FIG. 5B is a schematic diagram of an implementation of the fault handling method provided in this embodiment. During time period T1, the interrupt handler detects whether the frequency at which the hardware module generates correctable fault interrupts within the predetermined time period is greater than the disable threshold; when the frequency is detected to be greater than the disable threshold, it switches the correctable fault interrupt of the hardware module to the disabled state and starts the timer. During the predetermined duration T2 set for the timer, the correctable fault interrupt of the hardware module is in the disabled state. When the timer reaches the predetermined duration, the interrupt handler switches the correctable fault interrupt of the hardware module to the enabled state and detects whether the frequency at which the hardware module generates correctable fault interrupts during time period T3 is less than the enable threshold; when the frequency is detected to be greater than the enable threshold, the interrupt handler switches the correctable fault interrupt of the hardware module to the disabled state and restarts the timer. During the predetermined duration T4 set for the timer, the correctable fault interrupt of the hardware module is in the disabled state. When the timer reaches the predetermined duration, the interrupt handler again switches the correctable fault interrupt of the hardware module to the enabled state and detects whether the frequency at which the hardware module generates correctable fault interrupts during time period T5 is less than the enable threshold; when the frequency is detected to be less than the enable threshold, the correctable fault interrupt of the hardware module remains in the enabled state.
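The T1 to T5 sequence of FIG. 5B can be traced with a small state machine. This is a sketch under stated assumptions, not the handler of this embodiment: the class `StormGate`, its method names, the simulated clock values passed as `now`, and the choice to treat a frequency exactly equal to the enable threshold as "storm continuing" are all assumptions. Thresholds are expressed per second, so the example thresholds of 10 per 5 seconds and 1 per 5 seconds become 2.0 and 0.2.

```python
from enum import Enum


class State(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class StormGate:
    """Hysteresis between a disable threshold and an enable threshold."""

    def __init__(self, disable_thr: float, enable_thr: float, hold_s: float):
        self.disable_thr = disable_thr  # interrupts per second
        self.enable_thr = enable_thr    # interrupts per second
        self.hold_s = hold_s            # predetermined timer duration
        self.state = State.ENABLED
        self.rechecking = False         # True right after a timer expiry
        self.timer_end = None

    def _disable(self, now: float) -> None:
        self.state = State.DISABLED
        self.timer_end = now + self.hold_s  # start or restart the timer
        self.rechecking = False

    def on_timer(self, now: float) -> None:
        """Step 505: re-enable once the timer has run its full duration."""
        if self.state is State.DISABLED and now >= self.timer_end:
            self.state = State.ENABLED
            self.rechecking = True

    def on_frequency(self, freq: float, now: float) -> None:
        """Feed one measured frequency for the latest time period."""
        if self.state is not State.ENABLED:
            return
        if self.rechecking:
            # Steps 506 to 508: compare against the enable threshold.
            if freq < self.enable_thr:
                self.rechecking = False  # storm over, stay enabled
            else:
                self._disable(now)       # storm continues (assumed for ==)
        elif freq > self.disable_thr:
            self._disable(now)           # steps 502 to 504


gate = StormGate(disable_thr=2.0, enable_thr=0.2, hold_s=100)
trace = []
gate.on_frequency(3.0, now=5)     # T1: storm detected
trace.append(gate.state.value)
gate.on_timer(now=105)            # T2 elapsed: re-enabled
gate.on_frequency(1.0, now=110)   # T3: still above the enable threshold
trace.append(gate.state.value)
gate.on_timer(now=210)            # T4 elapsed: re-enabled again
gate.on_frequency(0.1, now=215)   # T5: below the enable threshold
trace.append(gate.state.value)
print(trace)  # prints ['disabled', 'disabled', 'enabled']
```

Using two thresholds rather than one keeps the interrupt from toggling when the frequency settles between the two values after a storm.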
In summary, the fault handling method provided in this embodiment determines the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the correctable fault interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a large number of correctable faults occur in a hardware module within a short time, the operating system remains in a continuous fault-handling state, which consumes a large amount of the operating system's processing resources and may even prevent the operating system from running normally. The effect is that, when a large number of correctable faults occur in a hardware module within a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and the operating efficiency of the operating system is improved.
This embodiment also starts a timer when the correctable fault interrupt of the hardware module is switched to the disabled state, switches the correctable fault interrupt from the disabled state to the enabled state when the timer reaches the predetermined duration, and keeps the enabled state when the frequency at which the hardware module generates correctable fault interrupts is less than the enable threshold. The effect is that correctable fault interrupts generated after the correctable fault storm has ended are handled in a timely manner.
The correctable fault interrupt in FIG. 5A refers to the CMCI, and the interrupt handler refers to the interrupt handler in the operating system. In another possible implementation, the basic input/output system (BIOS) may convert the correctable fault interrupt generated when a correctable fault occurs into a system management interrupt (SMI), and the system management interrupt handler in the BIOS handles the system management interrupt. This is described below using an embodiment.
Refer to FIG. 6, which shows a flowchart of a fault handling method provided by yet another embodiment of the present invention. The method includes:
Step 601: convert the correctable fault interrupt generated by a hardware module in the server into a system management interrupt.
The server may be a device that uses the X86 architecture. Because most existing servers use the X86 architecture, this embodiment is described using the example in which the server is an X86 architecture device, which does not limit the present invention.
During initialization at operating system startup, the basic input/output system is configured so that when a hardware module generates a correctable fault interrupt, the correctable fault interrupt is converted into a system management interrupt. Accordingly, the hardware module notifies the basic input/output system to enter the system management interrupt handler to handle the system management interrupt.
Step 602: determine the frequency at which the hardware module in the server generates system management interrupts within a predetermined time period.
When a correctable fault occurs, the correctable fault interrupt generated by the hardware module is converted into a system management interrupt, so the system management interrupt handler counts the system management interrupts generated within the predetermined time period and calculates the frequency at which system management interrupts are generated within that period. It should be noted that, because multiple hardware modules may experience correctable faults at the same time while the device is running, the system management interrupt handler needs to determine separately the frequency at which each hardware module generates system management interrupts. This embodiment is described using only the example in which the system management interrupt handler determines the frequency for one hardware module, which does not limit the present invention.
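Counting separately for each hardware module can be sketched as follows. The class `FrequencyCounter`, the fixed-window approach, and the module identifiers are assumptions for illustration; this embodiment only states that the frequency is determined per module over a predetermined time period.

```python
from collections import Counter


class FrequencyCounter:
    """Counts interrupts per hardware module over one fixed time period."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.counts = Counter()

    def record(self, module_id: str) -> None:
        """Called once for every interrupt attributed to module_id."""
        self.counts[module_id] += 1

    def close_window(self) -> dict:
        """Return interrupts per second for each module, then reset counts."""
        freqs = {m: n / self.window_s for m, n in self.counts.items()}
        self.counts.clear()
        return freqs


counter = FrequencyCounter(window_s=5)
for _ in range(12):
    counter.record("memory-0")   # hypothetical module identifier
counter.record("pcie-1")         # hypothetical module identifier
freqs = counter.close_window()
```

Here `memory-0` ends the period at 12 interrupts per 5 seconds, which is above the example disable threshold of 10 per 5 seconds, while `pcie-1` stays at 1 per 5 seconds, so only the first module would be treated as being in a storm.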
Step 603: detect whether the frequency is greater than the disable threshold.
The system management interrupt handler detects whether the frequency at which the hardware module generates system management interrupts within the predetermined time period is greater than the disable threshold. When the frequency is greater than the disable threshold, because the system management interrupts are converted from correctable fault interrupts, the system management interrupt handler can determine that a correctable fault storm has occurred in the hardware module. When the frequency is less than the disable threshold, the system management interrupt handler can determine that no correctable fault storm has occurred. The disable threshold is a preset threshold used to detect whether a correctable fault storm has occurred, and may be 10 times per 5 seconds.
It should be noted that the method for setting the disable threshold is similar to the method for setting the disable threshold in step 502 and is not described again here.
Step 604: when the frequency is detected to be greater than the disable threshold, switch the system management interrupt of the hardware module from the enabled state to the disabled state.
When the system management interrupt handler detects that the frequency is greater than the disable threshold, it knows that a correctable fault storm has occurred in the hardware module, meaning that the hardware module will generate a large number of correctable fault interrupts within a short time. The system management interrupt handler therefore switches the system management interrupt of the hardware module from the enabled state to the disabled state.
When no correctable fault storm has occurred in the hardware module, the flag value in the system management interrupt enable register corresponding to the hardware module is the enable value, that is, the system management interrupt of the hardware module is in the enabled state. When the system management interrupt handler detects that a correctable fault storm has occurred in the hardware module, it sets the flag value in that register to the disable value, that is, the system management interrupt of the hardware module is switched from the enabled state to the disabled state. While the system management interrupt is in the disabled state, the hardware module cannot generate system management interrupts.
Step 605: start a timer when the system management interrupt of the hardware module is switched from the enabled state to the disabled state.
As with the interrupt handler, at the same time as the system management interrupt handler switches the system management interrupt of the hardware module from the enabled state to the disabled state, it starts a preset timer.
It should be noted that there is no strict order between step 604 and step 605, and the two may be performed simultaneously. This embodiment describes step 604 as being performed before step 605 only as an example, which does not limit the present invention.
Step 606: when the timer reaches the predetermined duration, switch the system management interrupt of the hardware module from the disabled state to the enabled state.
If the system management interrupt of the hardware module remained in the disabled state after the correctable fault storm had passed, the basic input/output system could not receive or handle system management interrupts. To prevent this, when the timer reaches the predetermined duration, the system management interrupt handler sets the flag value in the system management interrupt enable register corresponding to the hardware module to the enable value, that is, the system management interrupt of the hardware module is switched from the disabled state to the enabled state. The hardware module can then notify the basic input/output system to enter the system management interrupt handler for handling.
It should be noted that the method for setting the predetermined timer duration is similar to the method for setting the predetermined timer duration in step 505 and is not described again here.
Step 607: detect again whether the frequency at which the hardware module generates system management interrupts is less than the enable threshold.
After the system management interrupt handler switches the system management interrupt of the hardware module from the disabled state to the enabled state, it again counts the system management interrupts received within a predetermined time period and calculates the frequency at which system management interrupts are generated within that period.
The system management interrupt handler detects whether the calculated frequency is less than the preset enable threshold. The enable threshold is a preset threshold used to detect whether the correctable fault storm has ended, and may be 1 time per 5 seconds.
Step 608: when the frequency at which the hardware module generates system management interrupts is detected to be less than the enable threshold, keep the system management interrupt of the hardware module in the enabled state.
When the frequency at which system management interrupts are generated is detected to be less than the enable threshold, the system management interrupt handler can determine that the correctable fault storm has ended. Subsequent correctable fault interrupts generated by the hardware module are converted into system management interrupts and handled by the system management interrupt handler. Accordingly, the system management interrupt of the hardware module remains in the enabled state.
It should be noted that the system management interrupt handler continues to detect the frequency at which system management interrupts are generated, and switches the system management interrupt from the enabled state to the disabled state when the frequency is greater than the disable threshold.
Step 609: when the frequency at which the hardware module generates system management interrupts is detected to be greater than the enable threshold, switch the system management interrupt of the hardware module from the enabled state to the disabled state, and restart the timer.
When the frequency at which system management interrupts are generated is detected to be greater than the enable threshold, the system management interrupt handler considers that the correctable fault storm has not yet ended. It again switches the system management interrupt of the hardware module from the enabled state to the disabled state and restarts the timer.
When the timer again reaches the predetermined duration, the system management interrupt handler performs step 607 to step 609 again.
In summary, the fault handling method provided in this embodiment converts the correctable fault interrupt generated by a hardware module in the server into a system management interrupt, determines the frequency at which the hardware module generates system management interrupts within a predetermined time period, detects whether the frequency is greater than the disable threshold, and, when the frequency is detected to be greater than the disable threshold, switches the system management interrupt of the hardware module from the enabled state to the disabled state. This solves the problem that, when a large number of correctable faults occur in a hardware module within a short time, the operating system remains in a continuous fault-handling state, which consumes a large amount of the operating system's processing resources and may even prevent the operating system from running normally. The effect is that, when a large number of correctable faults occur in a hardware module within a short time, fewer correctable fault interrupts are generated, the operating system can run normally, and the operating efficiency of the operating system is improved.
This embodiment also uses the basic input/output system to convert the correctable fault interrupt generated by the hardware into a system management interrupt, which is handled by the system management interrupt handler of the basic input/output system. This further reduces the load on the operating system and helps ensure that the operating system runs stably.
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the embodiments of the present invention are merely for the description, and do not represent the advantages and disadvantages of the embodiments.
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。A person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium. The storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

  1. A fault processing apparatus, for use in a server comprising at least one hardware module, wherein the apparatus comprises:
    a statistics module, configured to determine the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, wherein a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
    a detection module, configured to detect whether the frequency is greater than a disable threshold; and
    a first switching module, configured to switch the correctable fault interrupt of the hardware module from an enabled state to a disabled state when the frequency is detected to be greater than the disable threshold.
  2. The apparatus according to claim 1, wherein the statistics module comprises:
    a reading module, configured to read, by means of an interrupt handler, from a machine check exception (MCE) memory, the number of correctable fault interrupts generated by the hardware module within the predetermined time period, wherein the interrupt handler is the interrupt handler used to process the correctable fault, and the MCE memory is the MCE memory corresponding to the hardware module; and
    a calculation module, configured to calculate, by means of the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
    and wherein the detection module is configured to detect, by means of the interrupt handler, whether the frequency is greater than the disable threshold.
  3. The apparatus according to claim 1, wherein the apparatus further comprises:
    a starting module, configured to start a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state; and
    a second switching module, configured to switch the correctable fault interrupt of the hardware module from the disabled state to the enabled state when the timer reaches a predetermined duration.
  4. The apparatus according to claim 1, wherein the apparatus further comprises:
    a first lookup module, configured to obtain the level of the real-time requirement of a service processed in the server, wherein the service is a task run on the basis of at least one hardware module in the server; and to look up the corresponding disable threshold in a first relationship table according to the level, wherein the first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the obtained level;
    or,
    a second lookup module, configured to obtain a service processing capability level of the server, wherein the service processing capability level is determined on the basis of the at least one hardware module; and to look up the corresponding disable threshold in a second relationship table according to the service processing capability level, wherein the second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the obtained service processing capability level.
  5. The apparatus according to claim 3, wherein the apparatus further comprises:
    a third lookup module, configured to obtain the level of the real-time requirement of a service processed in the server, wherein the service is a task run on the basis of at least one hardware module in the server; and to look up the corresponding predetermined timer duration in a third relationship table according to the level, wherein the third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the obtained level;
    or,
    a fourth lookup module, configured to obtain a service processing capability level of the server, wherein the service processing capability level is determined on the basis of the at least one hardware module; and to look up the corresponding predetermined timer duration in a fourth relationship table according to the service processing capability level, wherein the fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the obtained service processing capability level.
  6. The apparatus according to claim 3, wherein the first switching module is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value; and
    the second switching module is configured to set the flag value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
  7. A fault processing method, for use in a server comprising at least one hardware module, wherein the method comprises:
    determining the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period, wherein a correctable fault interrupt is an interrupt generated by the hardware module when a correctable fault occurs;
    detecting whether the frequency is greater than a disable threshold; and
    when the frequency is detected to be greater than the disable threshold, switching the correctable fault interrupt of the hardware module from an enabled state to a disabled state.
  8. The method according to claim 7, wherein the determining of the frequency at which a hardware module in the server generates correctable fault interrupts within a predetermined time period comprises:
    reading, by means of an interrupt handler, from a machine check exception (MCE) memory, the number of correctable fault interrupts generated by the hardware module within the predetermined time period, wherein the interrupt handler is the interrupt handler used to process the correctable fault, and the MCE memory is the MCE memory corresponding to the hardware module; and
    calculating, by means of the interrupt handler, the frequency from the predetermined time period and the number of correctable fault interrupts;
    and wherein the detecting of whether the frequency is greater than a disable threshold comprises:
    detecting, by means of the interrupt handler, whether the frequency is greater than the disable threshold.
  9. The method according to claim 7, wherein the method further comprises:
    starting a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state; and
    when the timer reaches a predetermined duration, switching the correctable fault interrupt of the hardware module from the disabled state to the enabled state.
  10. The method according to claim 7, wherein, before the detecting of whether the frequency is greater than a disable threshold, the method further comprises:
    obtaining the level of the real-time requirement of a service processed in the server, wherein the service is a task run on the basis of at least one hardware module in the server; and looking up the corresponding disable threshold in a first relationship table according to the level, wherein the first relationship table stores at least one level and the disable threshold corresponding to each level, and the at least one level in the first relationship table includes the obtained level;
    or,
    obtaining a service processing capability level of the server, wherein the service processing capability level is determined on the basis of the at least one hardware module; and looking up the corresponding disable threshold in a second relationship table according to the service processing capability level, wherein the second relationship table stores at least one service processing capability level and the disable threshold corresponding to each service processing capability level, and the at least one service processing capability level in the second relationship table includes the obtained service processing capability level.
  11. The method according to claim 9, wherein, before the starting of a timer when the correctable fault interrupt of the hardware module is switched from the enabled state to the disabled state, the method further comprises:
    obtaining the level of the real-time requirement of a service processed in the server, wherein the service is a task run on the basis of at least one hardware module in the server; and looking up the corresponding predetermined timer duration in a third relationship table according to the level, wherein the third relationship table stores at least one level and the predetermined timer duration corresponding to each level, and the at least one level in the third relationship table includes the obtained level;
    or,
    obtaining a service processing capability level of the server, wherein the service processing capability level is determined on the basis of the at least one hardware module; and looking up the corresponding predetermined timer duration in a fourth relationship table according to the service processing capability level, wherein the fourth relationship table stores at least one service processing capability level and the predetermined timer duration corresponding to each service processing capability level, and the at least one service processing capability level in the fourth relationship table includes the obtained service processing capability level.
  12. The method according to claim 9, wherein the switching of the correctable fault interrupt of the hardware module from the enabled state to the disabled state comprises:
    setting the flag value in the correctable fault interrupt enable register corresponding to the hardware module to a disable value;
    and wherein the switching of the correctable fault interrupt of the hardware module from the disabled state to the enabled state comprises:
    setting the flag value in the correctable fault interrupt enable register corresponding to the hardware module to an enable value.
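
The table lookup of claim 10 and the timed re-enabling of claims 9 and 11 could be modeled as in the sketch below. It is only an illustration under stated assumptions: the level names, every numeric value in the two tables, and the class and method names are invented here (the claims give no concrete values), and a boolean stands in for the flag value in the enable register.

```python
# Hypothetical relationship tables: real-time requirement level -> setting.
# The levels and numbers are invented; the claims give no concrete values.
DISABLE_THRESHOLD_BY_LEVEL = {"high": 10.0, "medium": 50.0, "low": 200.0}
REENABLE_DELAY_BY_LEVEL = {"high": 60.0, "medium": 30.0, "low": 10.0}


class CorrectableInterruptControl:
    """Models the enable flag plus the timer; not a real register interface."""

    def __init__(self, level):
        # Look up both settings for the obtained level.
        self.disable_threshold = DISABLE_THRESHOLD_BY_LEVEL[level]
        self.reenable_delay = REENABLE_DELAY_BY_LEVEL[level]
        self.enabled = True
        self._disabled_at = None

    def check(self, frequency, now):
        """Disable when the measured frequency exceeds the threshold."""
        if self.enabled and frequency > self.disable_threshold:
            self.enabled = False
            self._disabled_at = now  # start the timer at the moment of disabling

    def tick(self, now):
        """Re-enable once the timer has run for the predetermined duration."""
        if not self.enabled and now - self._disabled_at >= self.reenable_delay:
            self.enabled = True
            self._disabled_at = None
```

Keeping the threshold and the delay in separate tables mirrors the separate relationship tables in the claims, so either setting can be tuned per level without touching the other.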
PCT/CN2015/081355 2014-11-28 2015-06-12 Apparatus and method for handling fault WO2016082523A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410712709.4A CN104486100B (en) 2014-11-28 2014-11-28 Fault treating apparatus and method
CN201410712709.4 2014-11-28

Publications (1)

Publication Number Publication Date
WO2016082523A1 true WO2016082523A1 (en) 2016-06-02

Family

ID=52760608

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/081355 WO2016082523A1 (en) 2014-11-28 2015-06-12 Apparatus and method for handling fault

Country Status (2)

Country Link
CN (1) CN104486100B (en)
WO (1) WO2016082523A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110333938A (en) * 2019-05-31 2019-10-15 苏州简约纳电子有限公司 A method of improving embedded timer efficiency

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104486100B (en) * 2014-11-28 2018-07-13 华为技术有限公司 Fault treating apparatus and method
CN106155826B (en) * 2015-04-16 2019-10-18 伊姆西公司 For the method and system of mistake to be detected and handled in bus structures
CN105468497A (en) * 2015-12-15 2016-04-06 迈普通信技术股份有限公司 Interruption exception monitoring method and apparatus
CN105589789A (en) * 2015-12-25 2016-05-18 浪潮电子信息产业股份有限公司 Method for dynamically adjusting memory monitoring threshold value
CN107544838B (en) * 2016-06-24 2024-02-23 中兴通讯股份有限公司 Interrupt processing method and device
CN106326049B (en) * 2016-08-16 2019-07-19 Oppo广东移动通信有限公司 A kind of Fault Locating Method and terminal
CN106341291B (en) * 2016-09-08 2019-11-15 北京小米移动软件有限公司 It is connected to the network the test method and device of stability
CN107077408A (en) 2016-12-05 2017-08-18 华为技术有限公司 Method, computer system, baseboard management controller and the system of troubleshooting
CN113407391A (en) * 2016-12-05 2021-09-17 华为技术有限公司 Fault processing method, computer system, substrate management controller and system
CN107608331A (en) * 2017-08-24 2018-01-19 北京龙鼎源科技股份有限公司 The diagnostic method and device of nonrandom interruption
CN111625387B (en) * 2020-05-27 2024-03-29 北京金山云网络技术有限公司 Memory error processing method, device and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1567277A (en) * 2003-07-09 2005-01-19 明基电通股份有限公司 Control device and method for reducing interruption frequency of processor
CN101276295A (en) * 2008-03-19 2008-10-01 北京星网锐捷网络技术有限公司 Method for real-time operating system to avoid interrupt occupying excess CPU resources
CN102135912A (en) * 2011-04-02 2011-07-27 大唐移动通信设备有限公司 Interruption jitter processing method and equipment
CN104486100A (en) * 2014-11-28 2015-04-01 华为技术有限公司 Device and method for treating faults

Also Published As

Publication number Publication date
CN104486100B (en) 2018-07-13
CN104486100A (en) 2015-04-01

Similar Documents

Publication Publication Date Title
WO2016082523A1 (en) Apparatus and method for handling fault
US11360842B2 (en) Fault processing method, related apparatus, and computer
EP3322125B1 (en) Fault management in a virtualized infrastructure
US10095576B2 (en) Anomaly recovery method for virtual machine in distributed environment
US8977905B2 (en) Method and system for detecting abnormality of network processor
US11144416B2 (en) Device fault processing method, apparatus, and system
WO2010114611A1 (en) Execution of a plugin according to plugin stability level
CN111008091A (en) Fault processing method, system and related device for memory CE
US11853150B2 (en) Method and device for detecting memory downgrade error
US20130132741A1 (en) Power supply apparatus of computer system and method for controlling power sequence thereof
CN109597719A (en) A kind of monitoring method of multiple nucleus system, system, device and readable storage medium storing program for executing
US20160026459A1 (en) Device and method for updating firmware of a rackmount server system
WO2022111048A1 (en) Power supply control method and apparatus, and server and non-volatile storage medium
WO2015057353A1 (en) Determine when an error log was created
CN103823708A (en) Virtual machine read-write request processing method and device
CN116049249A (en) Error information processing method, device, system, equipment and storage medium
WO2018103185A1 (en) Fault processing method, computer system, baseboard management controller and system
CN106844082A (en) Processor predictive failure analysis method and device
CN112306732A (en) Automatic error correction control method, device, equipment and medium in server
CN116820828B (en) Method and device for setting correctable error threshold, electronic equipment and storage medium
CN113849336B (en) BMC time management method, system, device and computer medium
US8230286B1 (en) Processor reliability improvement using automatic hardware disablement
CN110532160B (en) Method for BMC to record server system hot restart event
CN107179911A (en) A kind of method and apparatus for restarting management engine
US20110185161A1 (en) Electronic device and method for detecting operative states of components in the electronic device

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15863808; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 15863808; Country of ref document: EP; Kind code of ref document: A1)