CN110471814B - Control method for error reporting function of server device - Google Patents


Info

Publication number
CN110471814B
CN110471814B CN201810446197.XA CN201810446197A CN110471814B CN 110471814 B CN110471814 B CN 110471814B CN 201810446197 A CN201810446197 A CN 201810446197A CN 110471814 B CN110471814 B CN 110471814B
Authority
CN
China
Prior art keywords
error
control unit
hardware
reporting function
hardware element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810446197.XA
Other languages
Chinese (zh)
Other versions
CN110471814A (en)
Inventor
黄佳仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Original Assignee
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitac Computer Shunde Ltd, Mitac Computing Technology Corp
Priority to CN201810446197.XA
Publication of CN110471814A
Application granted
Publication of CN110471814B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a method for controlling an error reporting function of a server device, comprising: a control unit receives a plurality of pieces of first error information sent by a first hardware element, among a plurality of hardware elements, in which a plurality of correctable errors have occurred; the control unit determines the error type of each of the errors occurring in the first hardware element according to the first error information; the control unit determines whether the number of occurrences of each error type of the errors occurring in the first hardware element reaches a preset number within a first preset time length; and if the control unit determines that the number of occurrences of errors of a first error type in the first hardware element reaches the preset number within the first preset time length, the control unit controls the first hardware element to stop executing, after sending the first error information, the error reporting function corresponding to the first error type.

Description

Control method for error reporting function of server device
Technical Field
The present invention relates to a method for controlling an error reporting function of a server device, and more particularly to a method that can selectively turn off the error reporting function of a hardware element according to the type of correctable error occurring in that hardware element.
Background
In conventional servers, the hardware elements of the server may encounter errors during operation. Taking a Peripheral Component Interconnect Express (PCIE) interface as an example, the errors that may occur in the PCIE interface fall into two categories: correctable errors and uncorrectable errors. An uncorrectable error may prevent the PCIE interface from operating normally. A correctable error does not prevent the PCIE interface from operating normally, but it may still affect the performance of the PCIE interface. If a correctable error occurs, the hardware of the server can correct it without software having to trigger a processor interrupt to do so. Furthermore, the BIOS setup menu of the server includes a setting option for whether to record correctable errors, and the administrator of the server can enable this option so that the server records the correctable errors and they can be debugged further from the records.
However, although correctable errors can be corrected by the hardware of the server, when a large number of them occur the hardware must handle all of them, which increases the processing load, lowers the performance of the server, and in severe cases may even crash the server. Furthermore, if the administrator has enabled the above setting option, the server must also record the information of the large number of errors, which further increases the processing burden of the server and greatly increases the risk of a crash.
Disclosure of Invention
The invention provides a method for controlling an error reporting function of a server device that can selectively turn off the error reporting function of a hardware element according to the type of correctable error occurring in that hardware element.
To solve the above-mentioned problems, a method for controlling an error reporting function of a server device includes: a control unit receives a plurality of pieces of first error information sent by a first hardware element, among a plurality of hardware elements, in which a plurality of correctable errors have occurred; the control unit determines the error type of each of the errors occurring in the first hardware element according to the first error information; the control unit determines whether the number of occurrences of each error type of the errors occurring in the first hardware element reaches a preset number within a first preset time length; and if the control unit determines that the number of occurrences of errors of a first error type in the first hardware element reaches the preset number within the first preset time length, the control unit controls the first hardware element to stop executing, after sending the first error information, the error reporting function corresponding to the first error type.
Compared with the prior art, according to an embodiment of the method for controlling the error reporting function of the server device of the present invention, the control unit can selectively turn off the error reporting function of a hardware element according to the type of correctable error occurring in that hardware element, without turning off all error reporting functions. Only the specific hardware element is prevented from sending a large amount of error information of the specific error type, which maintains the performance of the server device, while the error reporting functions for error types whose error counts have not reached the preset number remain active, which maintains the stability of the system.
Description of the Drawings
FIG. 1 is a block diagram of a server device according to an embodiment of the invention.
FIG. 2 is a flowchart of an embodiment of a method for controlling an error reporting function according to the present invention.
FIG. 3 is a flowchart of a portion of another embodiment of a method for controlling an error reporting function according to the present invention.
FIG. 4 is a flowchart of the remaining portion of the embodiment of FIG. 3.
Detailed Description
FIG. 1 is a block diagram of an embodiment of a server device according to the present invention. As shown in FIG. 1, the server device 10 includes a plurality of hardware elements 11-18 and a control unit 19, and each of the hardware elements 11-18 is coupled to the control unit 19. FIG. 1 shows eight hardware elements 11-18 as an example, but the invention is not limited thereto; the server device 10 may include fewer or more than eight hardware elements. Each hardware element 11-18 has error reporting functions corresponding to a predefined number of preset error types. If a correctable error occurs in one of the hardware elements 11-18, that hardware element executes the error reporting function corresponding to the correctable error and sends error information corresponding to the preset error type of the error. For example, there may be 16 predefined preset error types, namely a first error type, a second error type, ..., and a sixteenth error type. If an error of the first error type occurs in the hardware element 11, the hardware element 11 sends error information corresponding to the first error type; if an error of the second error type occurs in the hardware element 11, the hardware element 11 sends error information corresponding to the second error type. The same applies to the other error types and the other hardware elements 12-18, and the details are not repeated here.
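As a minimal sketch of the arrangement described above, the per-type error reporting functions of one hardware element can be modeled as a 16-bit enable mask, with one bit per preset error type. The class name `HardwareElement`, the mask layout, and the method names are illustrative assumptions and are not taken from the patent.

```python
NUM_ERROR_TYPES = 16  # example count of preset error types given in the text


class HardwareElement:
    """Hypothetical model of one hardware element with per-type reporting."""

    def __init__(self, element_id):
        self.element_id = element_id
        # Assumption: bit i set means reporting for error type i + 1 is enabled.
        self.report_mask = (1 << NUM_ERROR_TYPES) - 1

    def set_reporting(self, error_type, enabled):
        """Turn the reporting function for one error type on or off."""
        bit = 1 << (error_type - 1)
        if enabled:
            self.report_mask |= bit
        else:
            self.report_mask &= ~bit

    def raise_correctable_error(self, error_type):
        """Return error information if reporting is enabled, else None."""
        if self.report_mask & (1 << (error_type - 1)):
            return {"element": self.element_id, "error_type": error_type}
        return None


element_11 = HardwareElement(11)
first = element_11.raise_correctable_error(1)       # reported
element_11.set_reporting(1, False)                  # turn off type 1 only
suppressed = element_11.raise_correctable_error(1)  # no longer reported
second_type = element_11.raise_correctable_error(2) # other types unaffected
```

Turning off one bit leaves the other fifteen reporting functions of the same element untouched, which is the selective behavior the embodiment relies on.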
The control unit 19 is coupled to the hardware elements 11-18 and is configured to receive and record the error information from the hardware elements 11-18, and to perform a corresponding debugging operation according to the hardware element that sent the error information and the error type. Moreover, when the control unit 19 receives error information from the hardware elements 11-18, the control unit 19 starts a mechanism for monitoring the number of errors, so as to prevent excessive error information from the hardware elements 11-18 from overloading the control unit 19, which would lower the performance of the server device 10 or even cause the server device 10 to crash.
FIG. 2 is a flowchart of an embodiment of the method for controlling the error reporting function according to the present invention. Referring to FIG. 1 and FIG. 2 together, in operation, after the control unit 19 receives error information from the hardware elements 11-18 (step S11), the control unit 19 determines the error type of each correctable error occurring in the hardware elements 11-18 according to the predefined preset error types (step S12). In step S12, the control unit 19 compares the bits representing the error type in the error information with the preset error types, and identifies the error type of each correctable error by checking whether the error type represented by those bits matches a preset error type. Next, based on the error types so determined, the control unit 19 determines, for each error type, whether the number of occurrences of correctable errors of that type in the hardware elements 11-18 reaches a preset number within a preset time length (hereinafter referred to as the first preset time length) (step S13).
For example, with a first preset time length of one hour and a preset number of three, the control unit 19 determines whether the number of pieces of error information corresponding to the first error type sent by each hardware element 11-18 within one hour reaches three, then makes the same determination for the second error type, and so on, until it finally determines whether the number of pieces of error information corresponding to the sixteenth error type sent by each hardware element 11-18 within one hour reaches three. In step S13, if the control unit 19 determines that the number of pieces of error information of one error type (for example, the first error type) sent by one hardware element (for example, the hardware element 11) within the first preset time length reaches three (determined as "yes"), three correctable errors of the first error type have occurred in the hardware element 11 within one hour, and the control unit 19 turns off the error reporting function of the hardware element 11 corresponding to the first error type (step S14). That is, the control unit 19 controls the hardware element 11 to stop executing the error reporting function corresponding to the first error type, so that the hardware element 11 no longer sends error information corresponding to the first error type even if an error of the first error type occurs again. The control unit 19 therefore no longer receives error information corresponding to the first error type from the hardware element 11, which keeps the control unit 19 from being overloaded by recording too much error information of the first error type or by debugging correctable errors of the first error type too frequently.
In this embodiment, taking the preset number of three as an example, in step S13 the control unit 19 may use the times at which the error information is received to calculate the time interval between receiving the first and the third pieces of error information of the same error type, and determine whether that interval is less than or equal to the first preset time length. For example, if errors of the fifth error type occur in the hardware element 11 at 10:22, 10:23, and 10:25, the control unit 19 receives three pieces of error information from the hardware element 11 at those times and calculates that the interval between the first and the third errors of the fifth error type is three minutes, which is less than the first preset time length of one hour. If errors of the tenth error type occur in the hardware element 17 at 10:31, 10:32, and 11:50, the control unit 19 calculates that the interval between the first and the third errors of the tenth error type is 79 minutes, which is more than the first preset time length of one hour.
Thus, the control unit 19 turns off only the error reporting function of the hardware element 11 corresponding to the fifth error type. It does not turn off the error reporting functions of the hardware element 11 corresponding to the first to fourth and the sixth to sixteenth error types, nor the error reporting functions of the other hardware elements 12-18 corresponding to the first to sixteenth error types. If an error of the fifth error type occurs again in the hardware element 11, the hardware element 11 does not send error information corresponding to the fifth error type; it sends the control unit 19 only error information corresponding to other error types, such as the third error type or the seventh error type.
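The interval check of step S13 can be sketched as follows, reading the times in the example as 10:22, 10:23, 10:25 for the hardware element 11 and 10:31, 10:32, 11:50 for the hardware element 17. The function name `threshold_reached`, the helper `at`, and the calendar date are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def threshold_reached(timestamps, preset_count, window):
    """Return True if any `preset_count` consecutive errors fall within `window`."""
    times = sorted(timestamps)
    for i in range(len(times) - preset_count + 1):
        # Interval between the first and the last error of this group.
        if times[i + preset_count - 1] - times[i] <= window:
            return True
    return False


def at(hour, minute):
    # Arbitrary date; only the time of day matters in this example.
    return datetime(2018, 5, 11, hour, minute)


window = timedelta(hours=1)  # first preset time length from the example
element_11_type_5 = [at(10, 22), at(10, 23), at(10, 25)]
element_17_type_10 = [at(10, 31), at(10, 32), at(11, 50)]

turn_off_11 = threshold_reached(element_11_type_5, 3, window)
turn_off_17 = threshold_reached(element_17_type_10, 3, window)
```

Under these assumptions only the fifth error type of the hardware element 11 meets the condition, because the three errors of the hardware element 17 span more than one hour.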
In the present embodiment, the control unit 19 may include a chipset 191 and a central processing unit 192. The chipset 191 is coupled to the hardware elements 11-18 and to the central processing unit 192. The chipset 191 is configured to receive error information from the hardware elements 11-18 and to perform debugging operations on the hardware elements 11-18. If the chipset 191 determines that an error has occurred in one of the hardware elements 11-18, the chipset 191 sends a System Management Interrupt (SMI) signal to the central processing unit 192 to put the central processing unit 192 into System Management Mode (SMM), and the central processing unit 192 performs steps S12 to S13 in SMM to carry out the detection mechanism described above.
Furthermore, the hardware elements 11-18 may be PCIE interface cards. The chipset 191 has a plurality of root ports coupled one-to-one to the hardware elements 11-18. When error information is received, the control unit 19 can scan each root port to detect which root port received the error information and thereby determine which of the hardware elements 11-18 sent it (for example, the hardware element 11). In this way the control unit 19 determines, among the hardware elements 11-18, that the number of occurrences of correctable errors of the fifth error type in the hardware element 11 has reached the preset number within the first preset time length. The control unit 19 can then individually control the root port coupled to the hardware element 11 so that error information corresponding to the fifth error type is no longer sent.
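The root port scan can be illustrated with a small sketch. The port numbering, the mapping from root port to hardware element, and the status encoding (one bit per error type, nonzero meaning pending error information) are all assumptions made for this example; the patent does not specify them.

```python
def find_reporting_elements(root_port_status, port_to_element):
    """Scan every root port and return (element, status) for ports with errors."""
    reporting = []
    for port, status in sorted(root_port_status.items()):
        if status != 0:  # assumption: nonzero means error information is pending
            reporting.append((port_to_element[port], status))
    return reporting


# Hypothetical wiring: root port n is coupled to hardware element 11 + n.
port_to_element = {port: 11 + port for port in range(8)}

# Hypothetical snapshot: only root port 0 holds error information,
# with bit 4 set to stand for the fifth error type.
root_port_status = {port: 0 for port in range(8)}
root_port_status[0] = 1 << 4

found = find_reporting_elements(root_port_status, port_to_element)
```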
In other embodiments, the hardware elements 11-18 may also be memory units; that is, each of the hardware elements 11-18 includes a plurality of memory channels, and each memory channel includes at least one dual in-line memory module (DIMM). After receiving error information from the hardware elements 11-18, the control unit 19 can scan each DIMM of each memory channel to determine which DIMM has had correctable errors of a specific error type reach the preset number within the first preset time length. The control unit 19 can also individually control any one DIMM of any of the hardware elements 11-18 to stop executing the error reporting function for a specific preset error type.
In this embodiment, continuing the example in which the control unit 19 turns off the error reporting function of the hardware element 11 corresponding to the fifth error type, after turning the function off the control unit 19 further calculates how long the function has been turned off (hereinafter referred to as the first turn-off time length; labeled simply as the turn-off time length in the drawings) (step S15), and determines whether the first turn-off time length reaches another preset time length (hereinafter referred to as the second preset time length) (step S16). If the first turn-off time length reaches the second preset time length (determined as "yes"), the control unit 19 determines whether the turned-off error reporting function needs to be turned back on, that is, whether the hardware element 11 should again execute the error reporting function corresponding to the fifth error type and send the corresponding error information when an error of the fifth error type occurs.
In detail, the control unit 19 decides whether to have the hardware element 11 send error information corresponding to the fifth error type again by determining whether an error of the same error type, namely the fifth error type, occurred again in the hardware element 11 while the error reporting function was turned off (step S17). If the determination is "no", no correctable error of the fifth error type occurred in the hardware element 11 while the function was off, and the control unit 19 turns the error reporting function of the hardware element 11 corresponding to the fifth error type back on (step S18), so that the hardware element 11 executes the function and sends the corresponding error information if an error of the fifth error type occurs later. On the other hand, if the determination in step S17 is "yes", an error of the fifth error type occurred again in the hardware element 11 while the function was off. To avoid the number of fifth-error-type errors in the hardware element 11 again reaching the preset number within the first preset time length, the control unit 19 does not turn the function back on. Instead, the control unit 19 recalculates the turn-off time length of the turned-off function (hereinafter referred to as the second turn-off time length) (step S19) and returns to step S16 to determine whether the second turn-off time length reaches the second preset time length, in order to decide again whether to turn the function back on.
In practice, the control unit 19 executes steps S11 to S19 while running the operating system. Furthermore, in step S15 the control unit 19 may record the time at which the error reporting function of the hardware element 11 corresponding to the fifth error type was turned off, calculate the first and second turn-off time lengths in real time from the current time and the recorded turn-off time, and determine in step S16 whether the first or second turn-off time length reaches the second preset time length. In this embodiment, the second preset time length may be three days.
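Steps S15 to S19 can be sketched as a small piece of bookkeeping for one turned-off error reporting function. The class `DisabledReport` and its fields are hypothetical. How a recurrence is detected while reporting is off is not described in this passage, so the `error_seen` field is simply set by hand here, and restarting the measurement from the current time in step S19 is an assumed reading of "recalculates".

```python
from datetime import datetime, timedelta

SECOND_PRESET = timedelta(days=3)  # example second preset time length


class DisabledReport:
    """Hypothetical bookkeeping for one turned-off error reporting function."""

    def __init__(self, turned_off_at):
        self.turned_off_at = turned_off_at  # S15: recorded turn-off time
        self.error_seen = False             # did the same error type recur?

    def check(self, now):
        """Steps S16 to S19: return True if reporting should be turned back on."""
        if now - self.turned_off_at < SECOND_PRESET:  # S16: not yet reached
            return False
        if not self.error_seen:                       # S17: no recurrence
            return True                               # S18: turn back on
        # S19: the error recurred, so start measuring a new turn-off period.
        self.turned_off_at = now
        self.error_seen = False
        return False


t0 = datetime(2018, 5, 11, 10, 25)  # arbitrary turn-off time
state = DisabledReport(t0)
state.error_seen = True             # assumed: the error recurred while off

after_3_days = state.check(t0 + timedelta(days=3))  # recurrence, stay off
after_4_days = state.check(t0 + timedelta(days=4))  # second period not over
after_6_days = state.check(t0 + timedelta(days=6))  # quiet period, turn on
```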
FIG. 3 and FIG. 4 are flowcharts of another embodiment of the method for controlling the error reporting function according to the present invention; refer to FIG. 1, FIG. 3, and FIG. 4 together. In this embodiment, when the control unit 19 receives error information from the hardware elements 11-18 (step S11), the control unit 19 further sets a flag signal stored in a non-volatile memory to a first logic level (step S04) to indicate that a correctable error has occurred in the hardware elements 11-18. Furthermore, when the control unit 19 determines in step S13 that the number of occurrences of an error type reaches the preset number within the first preset time length, the control unit 19 further handles the large number of correctable errors (step S05). For example, in step S05 the control unit 19 may record the error information corresponding to the fifth error type sent by the hardware element 11 and generate a log file containing that error information, and may debug the errors according to the recorded error information. After handling the large number of correctable errors, the control unit 19 resets the flag signal to a second logic level different from the first logic level (step S06), so that the logic level of the flag signal indicates whether the control unit 19 has handled the large number of correctable errors. In practice, the first logic level may be "1" and the second logic level may be "0".
Accordingly, taking as an example the case in which, in step S05, the control unit 19 debugs the fifth-error-type errors that occurred in the hardware element 11 within the first preset time length, the control unit 19 may, each time it executes the power-on self-test (POST) procedure, determine from the logic level of the flag signal whether the large number of correctable errors has been handled, and thereby decide whether to extend the first preset time length. As shown in FIG. 3, when executing the POST procedure the control unit 19 determines whether the flag signal stored in the non-volatile memory is at the first logic level (step S01). If so, a correctable error occurred in one of the hardware elements 11-18 (for example, the hardware element 11) during the operation period of the server device 10 preceding the current POST procedure (hereinafter referred to as the previous operation period), and the control unit 19 neither handled the correctable errors that occurred during the previous operation period nor set the flag signal to the second logic level. This happens, for example, when the server device 10 crashes after the control unit 19 has set the flag signal to the first logic level (step S04) but before the control unit 19 has handled the errors (step S05), so that the flag signal remains at the first logic level. For instance, when a large number of correctable errors occur in the hardware element 11 within the first preset time length, the processing load of the control unit 19 becomes excessive and the server device 10 crashes.
In other words, if the control unit 19 determines during the POST procedure that the flag signal is at the first logic level, it is likely that a large number of correctable errors occurred during the previous operation period and caused a crash, so that the errors were never handled and the flag signal is still at the first logic level during POST.
In this embodiment, each time the control unit 19 determines during the POST procedure that the flag signal is at the first logic level, the control unit 19 extends the first preset time length during the POST procedure (step S02). For example, the control unit 19 may multiply the first preset time length by a preset multiple (two is used as an example below), doubling it. The control unit 19 then sets the flag signal to the second logic level during the POST procedure (step S03). When the operating system subsequently runs, the control unit 19 uses the extended first preset time length in step S13 to determine whether the number of errors of an error type occurring in the hardware elements 11-18 reaches the preset number, and decides, according to the result of that determination, whether to set the flag signal so that it changes from the second logic level to the first logic level. Furthermore, if the server device 10 crashes again after the control unit 19 has set the flag signal to the first logic level in step S04 under the extended first preset time length, the control unit 19 starts again from step S01 in the POST procedure following the restart of the server device 10 and extends the first preset time length once more according to the flag signal at the first logic level.
After the first preset time length is extended, the control unit 19 has more time, while running the operating system, to handle a large number of correctable errors occurring in the hardware elements 11-18. If the server device 10 still crashes after the extension, the control unit 19 can keep extending the first preset time length through step S02 to gain more handling time, until the server device 10 no longer crashes, thereby improving the system stability of the server device 10.
On the other hand, if the control unit 19 determines in the POST procedure that the flag signal is at the second logic level rather than the first logic level (determined as "no"), the control unit 19 does not extend the first preset time length. After the server device 10 completes the POST procedure, the control unit 19 executes the detection mechanism according to the error information from the hardware elements 11-18. In this embodiment, the control unit 19 executes steps S04 to S06 while running the operating system.
For example, suppose the server device 10 crashes because of a large number of correctable errors and therefore executes the POST procedure at two different time points (hereinafter, the POST procedure executed before the crash is called the first POST procedure, and the POST procedure executed after the crash is called the second POST procedure). After the first POST procedure, if the control unit 19 receives error information from any of the hardware elements 11-18 while running the operating system, the control unit 19 sets the flag signal to the first logic level in step S04 to indicate that a correctable error has occurred in one of the hardware elements 11-18. If the server device 10 then crashes because of a large number of errors in the hardware elements 11-18 before the control unit 19 has executed steps S05 and S06, and executes the second POST procedure after restarting, the control unit 19 executes step S01 in the second POST procedure and determines that the flag signal is at the first logic level. This means that, while running the operating system after the first POST procedure, the control unit 19 was unable to handle the large number of correctable errors in time within the first preset time length. The control unit 19 therefore extends the first preset time length in step S02 of the second POST procedure and resets the flag signal to the second logic level in step S03 of the second POST procedure.
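The two-boot scenario above can be traced with a minimal sketch of steps S01 to S04 and S06. The dictionary `nvram` stands in for the non-volatile memory; storing the first preset time length there, the key names, and the function names are assumptions made for illustration.

```python
FIRST_LEVEL, SECOND_LEVEL = 1, 0  # example logic levels from the text
PRESET_MULTIPLE = 2               # example preset multiple from the text


def power_on_self_test(nvram):
    """Steps S01 to S03, run on every boot."""
    if nvram["flag"] == FIRST_LEVEL:                # S01: errors left unhandled?
        nvram["window_minutes"] *= PRESET_MULTIPLE  # S02: extend the window
        nvram["flag"] = SECOND_LEVEL                # S03: reset the flag


def on_error_received(nvram):
    nvram["flag"] = FIRST_LEVEL                     # S04


def on_errors_handled(nvram):
    nvram["flag"] = SECOND_LEVEL                    # S06, after S05


nvram = {"flag": SECOND_LEVEL, "window_minutes": 60}

power_on_self_test(nvram)  # first POST: flag clear, window unchanged
window_after_first_post = nvram["window_minutes"]

on_error_received(nvram)   # S04, then the server goes down before S05 and S06

power_on_self_test(nvram)  # second POST: window doubled, flag reset
window_after_second_post = nvram["window_minutes"]
```

Had `on_errors_handled` run before the second boot, the flag would already be at the second logic level and the window would stay at 60 minutes.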
In practice, the control unit 19 executes BIOS code to perform steps S01-S06 and S12-S19. Furthermore, taking the case in which the hardware elements 11-18 are PCIE interface cards as an example, the preset error types may be a receiver error status (Receiver Error Status), a bad transaction layer packet status (Bad TLP Status), a bad data link layer packet status (Bad DLLP Status), a replay timer timeout status (Replay Timer Timeout Status), an advisory non-fatal error status (Advisory Non-Fatal Error Status), a header log overflow status (Header Log Overflow Status), and so on.
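For PCIE interface cards, matching status bits against the listed error types might look like the sketch below. The bit positions are those commonly documented for the Correctable Error Status register of the PCI Express Advanced Error Reporting capability and should be checked against the specification; the patent itself does not give them, and the function name is hypothetical.

```python
# Assumed bit positions in the PCIe AER Correctable Error Status register
# (verify against the PCI Express Base Specification before relying on them).
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error Status",
    6: "Bad TLP Status",
    7: "Bad DLLP Status",
    12: "Replay Timer Timeout Status",
    13: "Advisory Non-Fatal Error Status",
    15: "Header Log Overflow Status",
}


def decode_correctable_status(status):
    """Return the names of the listed error types whose bits are set."""
    return [name for bit, name in sorted(CORRECTABLE_ERROR_BITS.items())
            if status & (1 << bit)]


decoded = decode_correctable_status(0x1040)  # bits 6 and 12 set
```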
In summary, according to an embodiment of the method for controlling an error reporting function of a server device of the present invention, the control unit can select to turn off the error reporting function of the corresponding hardware device according to the type of error that the hardware device has a correctable error, and the control unit does not need to turn off all the error reporting functions, so that only the specific hardware device is prevented from sending a lot of error information of the specific error type, the performance of the server device is maintained, and the error reporting function corresponding to the error type with the number of errors less than the preset number of errors can be maintained, thereby maintaining the stability of the system.
The foregoing is merely illustrative of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto; any variation or substitution readily conceivable by a person skilled in the art falls within the protection scope of the present invention. Therefore, the protection scope of the present invention is subject to the protection scope of the claims.

Claims (8)

1. A method for controlling an error reporting function of a server apparatus, comprising:
a control unit receives a plurality of first error messages sent by a first hardware element, wherein a plurality of correctable errors occur in the first hardware element;
the control unit judges each error type of the errors generated by the first hardware element according to the first error information;
the control unit judges whether the number of occurrences of each of the different error types of the errors generated by the first hardware element reaches a preset number of times within a first preset time length; and
if the control unit determines that the number of occurrences, within the first preset time length, of a first error type among the different error types of the errors of the first hardware element reaches the preset number of times, the control unit controls the first hardware element to stop executing, among a plurality of error reporting functions performed after sending the first error information, the error reporting function corresponding to the first error type, so as to close the error reporting function corresponding to the first error type,
after the control unit controls the first hardware element to stop executing the error reporting function corresponding to the first error type, the control method of the error reporting function of the server device further includes:
the control unit calculates a first closing time length of the first hardware element for stopping executing the error reporting function corresponding to the first error type;
the control unit judges whether the first closing time length reaches a second preset time length; and
when the first closing time length reaches the second preset time length, the control unit judges whether the first hardware element needs to be controlled to restart the error reporting function corresponding to the first error type,
the step of determining whether to control the first hardware element to restart the error reporting function corresponding to the first error type includes:
the control unit judges whether another error corresponding to the first error type occurs in the first hardware element during the closing period of the error reporting function; and
if the control unit judges that no other error of the first error type occurred in the first hardware element during the closing period, the control unit controls the first hardware element to restart the error reporting function corresponding to the first error type.
2. The method according to claim 1, wherein the hardware components are PCIE interface cards, the control unit determines, by scanning a plurality of root ports coupled to the hardware components, that the first hardware component is the one among the hardware components in which the number of occurrences of errors of the first error type within the first predetermined time period reaches the predetermined number, and the control unit controls, via the root ports, the first hardware component among the hardware components to stop executing the error reporting function corresponding to the first error type.
3. The method according to claim 1, wherein the hardware components are memory units, each of the hardware components includes a plurality of memory channels, each of the memory channels is provided with at least one dual in-line memory module (DIMM), the control unit scans each DIMM of each of the memory channels to determine which DIMM of the first hardware component has had errors of the first error type occur the predetermined number of times within the first predetermined time period, and the control unit further controls one of the DIMMs included in the first hardware component to stop executing the error reporting function corresponding to the first error type.
4. The method for controlling an error reporting function of a server apparatus according to claim 1, further comprising:
if the control unit judges that another error corresponding to the first error type occurred in the first hardware element during the closing period of the error reporting function, the control unit, when the first closing time length reaches the second preset time length, starts calculating anew a second closing time length for which the error reporting function remains closed; and
the control unit determines whether the second closing time length reaches the second preset time length, so as to determine whether to control the first hardware element to restart the error reporting function corresponding to the first error type.
5. The method for controlling an error reporting function of a server apparatus according to claim 1, further comprising:
when the control unit, while executing an operating system after the server device executes a first power-on self-test program, judges that the number of occurrences of errors of the first error type within the first preset time length reaches the preset number of times, the control unit extends the first preset time length when the server device executes a second power-on self-test program after the operating system.
6. The method according to claim 1, wherein in the step of the control unit determining whether the number of occurrences of each of the error types of the errors of the first hardware device reaches the predetermined number within the first predetermined time period, the control unit decodes the errors of the first hardware device that occur within the first predetermined time period.
7. The method according to claim 5, wherein in the step of extending the first predetermined time period by the control unit, the control unit extends the first predetermined time period according to a predetermined multiple in the second power-on self-test procedure.
8. The method according to claim 1, wherein the control unit receives the first error information from the first hardware device when an operating system is executed between a first power-on self-test program and a second power-on self-test program executed by the server device in succession at different time points, and the control unit determines whether the occurrence count of each of the error types of the errors occurring in the first hardware device reaches the preset count within the first preset time period when the operating system is executed between the first power-on self-test program and the second power-on self-test program, the method further comprising:
if the first hardware element generates the correctable errors, the control unit sets a flag as a first logic level according to the first error information when executing the operating system;
if the control unit judges, when executing the operating system, that the number of occurrences of the errors of the first hardware element within the first preset time length reaches the preset number of times, the control unit handles the correctable errors accordingly when executing the operating system, and resets the flag to a second logic level different from the first logic level after the handling;
the control unit judges whether the flag is at the first logic level when executing the second power-on self-test program, so as to judge whether the correctable errors were handled accordingly when executing the operating system; and
when the control unit judges that the flag is at the first logic level in the second power-on self-test program, the control unit extends the first preset time length according to a preset multiple and resets the flag to the second logic level in the second power-on self-test program.
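The re-enable flow recited in claims 1 and 4 can be sketched as a small timer object. This is an illustrative model only: the class name, the `note_error` hook, and the way a further error is detected while reporting is closed are assumptions, since the claims do not specify the detection mechanism.

```python
class ReenableTimer:
    """Tracks how long one error reporting function has been closed."""

    def __init__(self, closed_seconds: float) -> None:
        self.closed_seconds = closed_seconds  # second preset time length
        self.closed_at = None                 # None means reporting is open
        self.error_seen = False

    def close(self, now: float) -> None:
        """Reporting for this error type was just turned off."""
        self.closed_at = now
        self.error_seen = False

    def note_error(self) -> None:
        """Another error of the same type was observed while closed."""
        if self.closed_at is not None:
            self.error_seen = True

    def poll(self, now: float) -> bool:
        """Return True when reporting should be restarted at time now."""
        if self.closed_at is None:
            return False
        if now - self.closed_at < self.closed_seconds:
            return False  # closing time has not reached the preset length
        if self.error_seen:
            # Claim 4: start measuring a second closing time length.
            self.closed_at = now
            self.error_seen = False
            return False
        self.closed_at = None  # quiet for the whole period: restart reporting
        return True
```

A device that keeps producing the same error type therefore stays closed for one further period at a time, while a device that has gone quiet is reopened as soon as the period elapses.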
CN201810446197.XA 2018-05-11 2018-05-11 Control method for error reporting function of server device Active CN110471814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810446197.XA CN110471814B (en) 2018-05-11 2018-05-11 Control method for error reporting function of server device

Publications (2)

Publication Number Publication Date
CN110471814A CN110471814A (en) 2019-11-19
CN110471814B true CN110471814B (en) 2023-11-07

Family

ID=68504686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810446197.XA Active CN110471814B (en) 2018-05-11 2018-05-11 Control method for error reporting function of server device

Country Status (1)

Country Link
CN (1) CN110471814B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306732B (en) * 2020-11-19 2023-02-28 山东云海国创云计算装备产业创新中心有限公司 Automatic error correction control method, device, equipment and medium in server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077408A (en) * 2016-12-05 2017-08-18 华为技术有限公司 Method, computer system, baseboard management controller and the system of troubleshooting
CN107122321A (en) * 2016-02-24 2017-09-01 广达电脑股份有限公司 Hardware restorative procedure, hardware repair system and embodied on computer readable storage device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739551B2 (en) * 2007-06-20 2010-06-15 Microsoft Corporation Web page error reporting

Also Published As

Publication number Publication date
CN110471814A (en) 2019-11-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant