CN117687848A - Processor and method for detecting processor errors

Processor and method for detecting processor errors

Info

Publication number
CN117687848A
Authority
CN
China
Prior art keywords
processor
instruction
module
error
internal detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211080632.4A
Other languages
Chinese (zh)
Inventor
刘辉
俞洲
杨肖
邹文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202211080632.4A
Priority to PCT/CN2023/098504 (WO2024051231A1)
Publication of CN117687848A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing

Abstract

The application provides a processor and a method for detecting processor errors, and relates to the technical field of computers. The processor includes an error detection module and an instruction execution module. The error detection module is configured to send an internal detection instruction to the instruction execution module. The instruction execution module is configured to execute the internal detection instruction, obtain a first execution result, and send the first execution result to the error detection module. The error detection module is further configured to determine whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction. Because processor errors are detected without an external inspection tool, the processor does not need to decode program instructions from the inspection tool or send their execution results back to it, which reduces the processing load of the processor.

Description

Processor and method for detecting processor errors
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a processor and a method for detecting a processor error.
Background
A processor is one of the main components of various types of computing devices, and the processor may be used to execute instructions. The processor may have errors in executing the instructions.
At present, one way to detect instruction errors of a processor is to use an inspection tool to test whether the processor has an instruction execution error. In this approach, the inspection tool sends program instructions to the processor, and the processor decodes and executes the program instructions and feeds back the execution results to the inspection tool. The inspection tool compares each execution result with a pre-stored expected result of the program instruction. If the execution result does not match the expected result, the inspection tool determines that the processor has an instruction execution error. However, this approach requires the processor to receive the program instructions from the inspection tool, decode them, and feed back their execution results to the inspection tool, so the processing load on the processor is large.
Disclosure of Invention
The embodiments of the present application provide a processor and a method for detecting processor errors, which are used to reduce the processing load incurred when detecting processor errors.
In a first aspect, embodiments of the present application provide a processor, including an error detection module and an instruction execution module, wherein: the error detection module is configured to send an internal detection instruction to the instruction execution module; the instruction execution module is configured to execute the internal detection instruction, obtain a first execution result, and send the first execution result to the error detection module; and the error detection module is configured to determine whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
In this embodiment of the present application, the instruction execution module may directly execute the internal detection instruction from the error detection module to obtain a first execution result, and the error detection module compares the first execution result with the expected result corresponding to the internal detection instruction to determine whether the processor has an error. In this way, the processor executes an instruction that originates inside the processor and uses it to determine whether the processor has an error. The processor does not need to decode the instruction or send the first execution result to an external inspection tool, which helps reduce the processing load of the processor and saves its computing resources. In addition, no external inspection tool is needed, which reduces the cost of detecting processor errors. Furthermore, because the instructions executed during detection are held inside the processor, and the processor can hold many types of instructions, the approach helps broaden the range of instruction types that can be used for detection.
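The interaction between the two modules can be illustrated with a minimal software sketch. The patent describes hardware modules; the Python classes below, the `DetectionInstruction` record, and the `fault` hook used to simulate a defective execution unit are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


# Hypothetical model of an internal detection instruction: an operation,
# its operands, and the expected result configured in advance.
@dataclass(frozen=True)
class DetectionInstruction:
    name: str
    operation: Callable[[int, int], int]
    operands: Tuple[int, int]
    expected: int


class InstructionExecutionModule:
    """Stands in for the execution hardware; `fault` optionally corrupts results."""

    def __init__(self, fault: Callable[[int], int] = lambda value: value):
        self._fault = fault

    def execute(self, instruction: DetectionInstruction) -> int:
        result = instruction.operation(*instruction.operands)
        return self._fault(result)


class ErrorDetectionModule:
    def __init__(self, executor: InstructionExecutionModule):
        self._executor = executor

    def processor_has_error(self, instruction: DetectionInstruction) -> bool:
        # Send the instruction, receive the first execution result,
        # and compare it with the expected result.
        first_result = self._executor.execute(instruction)
        return first_result != instruction.expected


add_check = DetectionInstruction("add", lambda a, b: a + b, (2, 3), 5)

healthy = ErrorDetectionModule(InstructionExecutionModule())
# Simulated defect: the lowest bit of every result is flipped.
faulty = ErrorDetectionModule(InstructionExecutionModule(fault=lambda v: v ^ 1))
```

In this sketch, `healthy.processor_has_error(add_check)` reports no error, while the simulated bit flip makes `faulty.processor_has_error(add_check)` report one. No decoding step and no external tool appear anywhere in the path.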
In one possible implementation, the instruction execution module is further configured to: read an external program instruction from an external storage module, decode the external program instruction to obtain a decoded result, execute the decoded result to obtain a second execution result, and write the second execution result into the external storage module.
In the above embodiment, the instruction execution module in the processor may execute the external program instruction in the external storage module in addition to the internal detection instruction of the processor, in other words, the processor may execute the internal detection instruction while not affecting the execution of the external program instruction.
In one possible implementation, the error detection module is further configured to: before sending the internal detection instruction to the instruction execution module, determine that the remaining space in a buffer queue of the processor is greater than or equal to a threshold, wherein the buffer queue is used for buffering instructions to be processed by the processor.
In the above embodiment, the error detection module may send the internal detection instruction to the instruction execution module when the load of the processor is relatively small, so that the instruction execution module executes the internal detection instruction under a relatively small load. This reduces the impact of executing the internal detection instruction on the execution of external program instructions and facilitates a reasonable allocation of the computing resources of the processor.
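The gating condition can be sketched as follows. The `BufferQueue` class, its capacity of 8, and the threshold of 4 are hypothetical values chosen for the example; the patent does not specify the queue size or the threshold.

```python
from collections import deque


class BufferQueue:
    """Hypothetical bounded queue holding instructions the processor has yet to process."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._pending = deque()

    def push(self, instruction: str) -> None:
        if len(self._pending) >= self.capacity:
            raise OverflowError("buffer queue is full")
        self._pending.append(instruction)

    def remaining_space(self) -> int:
        return self.capacity - len(self._pending)


def may_send_detection_instruction(queue: BufferQueue, threshold: int) -> bool:
    # Send only when remaining space >= threshold, i.e. the processor is lightly loaded.
    return queue.remaining_space() >= threshold


queue = BufferQueue(capacity=8)
for i in range(3):
    queue.push(f"external-{i}")
allowed_when_light = may_send_detection_instruction(queue, threshold=4)  # 5 free slots

for i in range(3, 6):
    queue.push(f"external-{i}")
allowed_when_busy = may_send_detection_instruction(queue, threshold=4)  # 2 free slots
```

With three pending instructions the check passes; once six are pending, only two slots remain and the detection instruction is held back.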
In a possible implementation, the error detection module is further configured to add a tag to the internal detection instruction, where the tag indicates that the instruction is used for detecting processor errors; the instruction execution module is further configured to add the tag to the first execution result; and the error detection module is further configured to identify, according to the tag in the first execution result, the first execution result corresponding to the internal detection instruction.
In the above embodiment, the error detection module adds a tag to the internal detection instruction, so that the error detection module can later pick out the first execution result that corresponds to the instruction used for detecting processor errors.
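One way to picture the tag is as a single boolean carried from the instruction to its result. The `Instruction` and `Result` records and the two-opcode instruction set below are assumptions for illustration; the patent does not define how the tag is encoded.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instruction:
    opcode: str
    operands: Tuple[int, int]
    is_detection: bool = False  # the tag added by the error detection module


@dataclass(frozen=True)
class Result:
    value: int
    is_detection: bool  # the tag copied onto the result by the execution module


def execute(instruction: Instruction) -> Result:
    operations = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    value = operations[instruction.opcode](*instruction.operands)
    return Result(value, instruction.is_detection)


# A mixed stream: two external program instructions and one detection instruction.
stream = [
    Instruction("add", (1, 2)),
    Instruction("mul", (3, 4), is_detection=True),
    Instruction("add", (5, 6)),
]
results = [execute(instruction) for instruction in stream]

# The error detection module keeps only the results that carry the tag.
detection_results = [result for result in results if result.is_detection]
```

Because the tag travels with the result, the detection result can be separated from ordinary results even when both kinds of instruction share the same execution path.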
In one possible implementation, the processor further includes a register, where the register stores an expected result corresponding to the internal detection instruction; the error detection module is further configured to read an expected result corresponding to the internal detection instruction from the register.
In the above embodiment, the expected result corresponding to the internal detection instruction may be configured in the register manually by a user or by the processor, so that the error detection module can quickly obtain the expected result, which in turn makes it convenient to quickly detect whether the processor has an error.
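A register-backed lookup could be modelled as below. Keying the expected result by an instruction identifier is an assumption of this sketch; the patent only states that the register stores the expected result corresponding to the internal detection instruction.

```python
class ExpectedResultRegister:
    """Hypothetical register file mapping a detection instruction ID to its expected result."""

    def __init__(self):
        self._expected = {}

    def configure(self, instruction_id: int, expected: int) -> None:
        # Written in advance, for example by a user or by the processor itself.
        self._expected[instruction_id] = expected

    def read(self, instruction_id: int) -> int:
        return self._expected[instruction_id]


def has_error(register: ExpectedResultRegister, instruction_id: int, first_result: int) -> bool:
    # The error detection module reads the expected result and compares.
    return first_result != register.read(instruction_id)


register = ExpectedResultRegister()
register.configure(instruction_id=7, expected=0x2A)
```

Calling `has_error(register, 7, 0x2A)` reports no error, and any other first execution result for instruction 7 reports one.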
In a possible implementation manner, the error detection module is further configured to obtain the internal detection instruction from an instruction that has been executed by the processor; or, the processor further includes a first storage module storing the internal detection instruction, the first storage module allowing the internal detection instruction to be read by the processor, the error detection module further configured to read the internal detection instruction from the first storage module; or, the processor further includes a second storage module storing the internal detection instruction, the second storage module allowing the processor to read and write, and the error detection module is further configured to read the internal detection instruction from the second storage module.
In the above embodiment, three ways for the error detection module to obtain the internal detection instruction are provided, which broadens the options available for obtaining instructions used to detect processor errors. In the first way, the error detection module samples the internal detection instruction from the instructions already executed by the processor, which is simple and direct. In the second way, the error detection module reads the internal detection instruction from the first storage module, where the instructions may be configured manually. In the third way, the error detection module reads the internal detection instruction from the second storage module. Unlike the first storage module, the second storage module can also be written by the processor, which allows the processor, or an external device acting through the processor, to add instructions to the second storage module.
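The three sources can be contrasted in a short sketch. The instruction strings, the use of random sampling, and the modelling of the first storage module as a read-only mapping are all assumptions; the patent does not state how sampling is performed or how the storage modules are built.

```python
import random
from types import MappingProxyType

# Way 1: sample from instructions the processor has already executed.
executed_instructions = ["add r1, r2", "mul r3, r4", "xor r5, r6"]
rng = random.Random(0)
sampled = rng.choice(executed_instructions)

# Way 2: a first storage module the processor can only read (modelled as a read-only view).
first_storage_module = MappingProxyType({0: "add r1, r2", 1: "xor r3, r4"})
from_first = first_storage_module[0]

# Way 3: a second storage module the processor can read and write,
# so new detection instructions can be added later.
second_storage_module = {0: "add r1, r2"}
second_storage_module[1] = "mul r5, r6"  # added through the processor
from_second = second_storage_module[1]


def try_write_first_storage() -> bool:
    """Return True if a write to the read-only module is rejected."""
    try:
        first_storage_module[2] = "sub r7, r8"
    except TypeError:
        return True
    return False
```

The only behavioural difference between the second and third ways in this sketch is writability: `try_write_first_storage()` shows the write being rejected, whereas the second storage module accepts the new entry.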
In one possible implementation, the processor further includes at least one processor core, one of the at least one processor core corresponding to the instruction execution module; the error detection module is specifically configured to determine whether the processor core corresponding to the instruction execution module has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
The above embodiment applies to a processor that includes one or more processor cores. If the processor includes a plurality of processor cores, the instruction execution module that produced the first execution result identifies which processor core has the error, so that the faulty processor core can be located precisely. In addition, in this embodiment one error detection module can perform error detection on a plurality of processor cores, which helps reduce the cost of detecting processor errors.
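Locating a faulty core with a single shared error detection module might look like the sketch below. Representing each core as a Python callable and injecting an off-by-one defect into core 1 are assumptions made purely to exercise the comparison.

```python
from typing import Callable, Dict, List, Tuple


def find_faulty_cores(
    cores: Dict[int, Callable[[int, int], int]],
    operands: Tuple[int, int],
    expected: int,
) -> List[int]:
    # One error detection module sends the same detection instruction to each
    # core's execution module and records which cores return a mismatch.
    return [
        core_id
        for core_id, execute in cores.items()
        if execute(*operands) != expected
    ]


# Hypothetical three-core processor; core 1 has a simulated defect in its adder.
cores = {
    0: lambda a, b: a + b,
    1: lambda a, b: a + b + 1,
    2: lambda a, b: a + b,
}
faulty_cores = find_faulty_cores(cores, operands=(2, 3), expected=5)
```

Here only core 1 is reported, because each first execution result is attributed to the core whose execution module produced it.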
In one possible implementation, the error detection module is further configured to determine that a detection switch in a management module is in an on state, wherein the detection switch is used for indicating whether to detect errors of the processor, and the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor, and an operating system.
In the above embodiment, both the management module and the processor may be disposed in the computing device, where the management module may be configured with a detection switch, and the error detection module performs a process of detecting a processor error when it is determined that the detection switch in the management module is in an on state, which provides a method for flexibly starting up the detection of the processor error.
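The effect of the detection switch can be sketched as a simple guard. The `ManagementModule` class and its `detection_enabled` attribute are hypothetical names; how a baseboard management controller, firmware, or operating system actually exposes the switch is not described here.

```python
from typing import Callable, Optional


class ManagementModule:
    """Hypothetical stand-in for a BMC, firmware system, or operating system."""

    def __init__(self, detection_enabled: bool = False):
        self.detection_enabled = detection_enabled


def run_detection_if_enabled(
    management: ManagementModule, detect: Callable[[], bool]
) -> Optional[bool]:
    # Detection runs only when the switch is on; None means it was skipped.
    if not management.detection_enabled:
        return None
    return detect()


management = ManagementModule(detection_enabled=False)
skipped = run_detection_if_enabled(management, detect=lambda: False)

management.detection_enabled = True  # switch turned on
outcome = run_detection_if_enabled(management, detect=lambda: False)
```

With the switch off the detection routine is never called; after the switch is turned on, the same call returns the detection outcome.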
In one possible embodiment, the processor further comprises a control module; the error detection module is further configured to provide alarm information to a management module when determining that an error exists in the processor, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, and an operating system, and the alarm information is used to indicate that the processor has the error; and/or the control module is used for controlling to shut down the processor when the processor is determined to have errors.
In the above embodiment, when the error detection module determines that the processor has an error, the error detection module may provide the alarm information to the management module, and the management module presents the alarm information so that a user learns of the processor error in time. In addition, the control module in the processor can shut down the processor, which prevents the processor from continuing to execute instructions incorrectly.
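The two reactions to a detected error, alarm and shutdown, can be sketched together. The class names, the alarm text, and the `shut_down_on_error` parameter are assumptions; the "and/or" in the text is modelled by making the shutdown optional.

```python
class ManagementModule:
    """Hypothetical receiver of alarm information."""

    def __init__(self):
        self.alarms = []

    def receive_alarm(self, message: str) -> None:
        self.alarms.append(message)


class ControlModule:
    """Hypothetical control module that can shut the processor down."""

    def __init__(self):
        self.processor_running = True

    def shut_down(self) -> None:
        self.processor_running = False


def handle_detection_outcome(
    has_error: bool,
    management: ManagementModule,
    control: ControlModule,
    shut_down_on_error: bool = True,
) -> None:
    if not has_error:
        return
    management.receive_alarm("processor error detected")
    if shut_down_on_error:
        control.shut_down()


management = ManagementModule()
control = ControlModule()
handle_detection_outcome(False, management, control)
state_after_clean_check = (list(management.alarms), control.processor_running)

handle_detection_outcome(True, management, control)
state_after_error = (list(management.alarms), control.processor_running)
```

A clean check leaves both modules untouched, while a detected error records one alarm and stops the simulated processor.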
In a second aspect, embodiments of the present application provide a method of detecting a processor error, the method being executable by a processor or by a computing device comprising a processor. For ease of description, the following takes the processor performing the method as an example. The processor executes an internal detection instruction to obtain a first execution result; the processor determines whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
In one possible embodiment, the method further comprises: reading an external program instruction from an external storage module, decoding the external program instruction to obtain a decoded result, executing the decoded result to obtain a second execution result, and writing the second execution result into the external storage module.
In one possible embodiment, the method further comprises: the processor determines that the remaining space in a buffer queue is greater than or equal to a threshold, the buffer queue being configured to buffer instructions to be processed by the processor.
In one possible embodiment, the method further comprises: the processor adds a tag to the internal detection instruction, the tag indicating that the instruction is used for detecting processor errors; the processor adds the tag to the first execution result; and the processor identifies the first execution result corresponding to the internal detection instruction according to the tag in the first execution result.
In one possible embodiment, the processor comprises a register, wherein the register stores an expected result corresponding to the internal detection instruction; the method further comprises: the processor reads the expected result corresponding to the internal detection instruction from the register.
In one possible embodiment, the method further comprises: the processor obtains the internal detection instruction from the instructions executed by the processor; or, the processor further includes a first storage module storing the internal detection instruction, the first storage module allowing the processor to read the internal detection instruction, and the processor reads the internal detection instruction from the first storage module; or, the processor further includes a second storage module storing the internal detection instruction, the second storage module allowing the processor to read and write, and the processor reads the internal detection instruction from the second storage module.
In one possible implementation, the processor includes at least one processor core; the processor determining whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction includes: the processor determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core that produced the first execution result has an error.
In one possible embodiment, the method further comprises: the processor determines that a detection switch in a management module is in an on state, the detection switch is used for indicating whether to detect errors of the processor, and the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor and an operating system.
In one possible embodiment, the method further comprises: when the processor determines that the processor has an error, the processor provides alarm information to a management module, wherein the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor and an operating system, and the alarm information is used for indicating that the processor has the error; and/or, upon determining that the processor has an error, the processor shuts down the processor.
In a third aspect, embodiments of the present application provide a method of detecting a processor error, the method being executable by a processor or by a computing device comprising a processor. For ease of description, the following takes the processor performing the method as an example. The processor comprises an instruction execution module and an error detection module. The method comprises the following steps: the error detection module sends an internal detection instruction to the instruction execution module; the instruction execution module executes the internal detection instruction to obtain a first execution result, and sends the first execution result to the error detection module; the error detection module determines whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
In one possible embodiment, before sending the internal detection instruction to the instruction execution module, the method further comprises:
the error detection module determines that the remaining space in a buffer queue of the processor is greater than or equal to a threshold, where the buffer queue is used to buffer instructions to be processed by the processor.
In one possible embodiment, the method further comprises: the error detection module adds a tag to the internal detection instruction, the tag indicating that the instruction is used for detecting processor errors; and the error detection module identifies the first execution result corresponding to the internal detection instruction according to the tag.
In one possible implementation, the processor further includes a register, where the register stores an expected result corresponding to the internal detection instruction; the method further comprises: the error detection module reads the expected result corresponding to the internal detection instruction from the register.
In one possible embodiment, the method further comprises: the error detection module obtains the internal detection instruction from the instructions executed by the processor; or, the error detection module reads the internal detection instruction from a first storage module in the processor, the first storage module allowing the internal detection instruction to be read by the processor; or, the error detection module reads the internal detection instruction from a second storage module in the processor, the second storage module allowing reading and writing by the processor.
In one possible implementation, the processor includes at least one processor core, one of the at least one processor core corresponding to the instruction execution module; the error detection module determining whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction includes: the error detection module determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core corresponding to the instruction execution module has an error.
In one possible embodiment, the method further comprises: the error detection module determines that a detection switch in a management module is in an on state, the detection switch is used for indicating whether to detect errors of the processor, and the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor and an operating system.
In a possible implementation manner, the error detection module provides alarm information to a management module when determining that the processor has an error, wherein the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor or an operating system, and the alarm information is used for indicating that the processor has the error; and/or, when determining that the processor has an error, a control module in the processor controls the processor to be shut down.
In a fourth aspect, embodiments of the present application provide a computing device that may include a processor of any of the first aspects.
In a fifth aspect, embodiments of the present application provide a computing device comprising a processor and power supply circuitry to power the processor, the processor to implement the method of any of the second or third aspects.
In a sixth aspect, embodiments of the present application provide a cluster of computing devices, including at least one computing device, each computing device being configured to perform the method of any of the second or third aspects described above.
Optionally, each computing device in the computing device cluster may be a computing device of any of the fourth aspect or the fifth aspect described above.
In a seventh aspect, there is provided a computer program product comprising instructions which, when run on a computer, implement the method of any of the above second or third aspects.
In an eighth aspect, embodiments of the present application provide a computer readable storage medium for storing a computer program or instructions which, when executed, implement the method of the second aspect or any of the third aspects described above.
For the advantageous effects of the second aspect to the eighth aspect, refer to the advantageous effects discussed for the first aspect; they are not repeated here.
Drawings
FIG. 1 is a schematic deployment diagram of an inspection tool;
FIG. 2 is a schematic diagram of a processor according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a structure of another processor according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an error detection module according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 6 is a schematic deployment diagram of a cloud data center according to an embodiment of the present application;
FIG. 7 is a schematic architecture diagram of a cloud data center according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for detecting a processor error according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a detection switch of a management module according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a process for processing internal detection instructions and external program instructions according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating another method for detecting a processor error according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a process for processing internal detection instructions and external program instructions according to an embodiment of the present application;
FIG. 13 is a schematic architecture diagram of a computing device cluster according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
In the following, some terms in the embodiments of the present application are explained for easy understanding by those skilled in the art.
1. Silent data corruption (SDC), which may also be referred to as a silent data error, refers to an error that occurs while a processor executes instructions but is not perceived by the operating system of the device in which the processor resides, so that an erroneous result of the instruction execution is stored.
2. The processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general purpose processor may be a microprocessor or any conventional processor.
In the embodiments of the present application, unless otherwise indicated, a noun covers both its singular and plural forms, that is, "one or more". "At least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. For example, "A/B" means A or B. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, or c" represents: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b, and c may each be single or plural.
To reduce silent data errors that may occur in a processor, inspection tools are currently used to test whether a processor has an error. FIG. 1 is a schematic deployment diagram of an inspection tool. Alternatively, FIG. 1 may be understood as a schematic diagram of the architecture of a device. As shown in FIG. 1, the device includes a processor, a plurality of running applications (APPs), and an operating system. The plurality of applications include application 1, the inspection tool, application 2, and the like, as shown in FIG. 1.
The inspection tool may have built-in program instructions and expected results corresponding to the program instructions. When the inspection tool detects a processor error, the inspection tool may load program instructions into the external memory. The external memory and the processor are provided independently of each other. The processor reads the program instructions from the external memory and decodes the program instructions. The processor executes the decoded program instruction and sends the execution result of the program instruction to the inspection tool.
The inspection tool compares the execution result of the program instruction with the expected result. If the execution result matches the expected result, the inspection tool determines that the processor is free of errors. If the execution result and the expected result do not match, the inspection tool determines that the processor is in error. In the event that the inspection tool determines that the processor is in error, the inspection tool may feed back to the operating system that the processor is in error.
Therefore, in the conventional method for detecting processor errors, the processor needs to read the program instruction from the external memory, decode the program instruction, and send the execution result to the inspection tool, which results in a large processing load on the processor.
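The conventional flow can be sketched to make the extra steps visible. The text encoding `"opcode a b"`, the two supported opcodes, and the function names are hypothetical; the sketch only mirrors the sequence described above (load into external memory, read, decode, execute, report, compare).

```python
from typing import Callable, List, Tuple

external_memory: List[str] = []  # stands in for memory outside the processor


def decode(raw: str) -> Tuple[str, int, int]:
    # Hypothetical text encoding: "opcode a b".
    opcode, a, b = raw.split()
    return opcode, int(a), int(b)


def processor_run(memory: List[str]) -> int:
    raw = memory.pop(0)          # step 1: read the program instruction
    opcode, a, b = decode(raw)   # step 2: decode it
    if opcode == "add":          # step 3: execute it
        return a + b
    if opcode == "mul":
        return a * b
    raise ValueError(f"unknown opcode: {opcode}")


def inspection_tool_check(
    program: str,
    expected: int,
    run: Callable[[List[str]], int] = processor_run,
) -> bool:
    external_memory.append(program)  # the tool loads the instruction into external memory
    result = run(external_memory)    # the processor reports its execution result
    return result == expected        # True means no error was found


no_error_found = inspection_tool_check("add 2 3", expected=5)
# Simulated defect: the processor's result is off by one.
error_missed = inspection_tool_check(
    "add 2 3", expected=5, run=lambda memory: processor_run(memory) + 1
)
```

Every check in this flow passes through the read and decode steps and returns its result to the tool, which is the overhead the internal detection approach is intended to avoid.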
In view of this, embodiments of the present application provide a processor. The processor includes an error detection module and an instruction execution module. The error detection module may send (or pass) instructions for detecting processor errors (e.g., internal detection instructions) to the instruction execution module. The instruction execution module executes the internal detection instruction, obtains a first execution result of the internal detection instruction, and sends the first execution result to the error detection module. The error detection module determines whether the processor has an error according to the first execution result and the expected result of the internal detection instruction. Therefore, the processor can detect by itself whether it has an error, without an external inspection tool. The processor does not need to decode external program instructions or send their execution results to the inspection tool, so the processing load of the processor can be reduced.
The processor in the embodiments of the present application may be any type of processor, including, for example, a single-core processor or a multi-core processor, etc.
Fig. 2 is a schematic structural diagram of a processor according to an embodiment of the present application. Fig. 2 is a schematic diagram of a single-core processor. As shown in fig. 2, processor 200 includes an error detection module 210 and an instruction execution module 220. Wherein the error detection module 210 and the instruction execution module 220 may communicate with each other.
The error detection module 210 and the instruction execution module 220 may be implemented in hardware, for example, in logic circuits, and the specific structures of the error detection module 210 and the instruction execution module 220 are not limited in this application. The instruction execution module 220 is, for example, an arithmetic logic unit (arithmetic logic unit, ALU).
Specifically, the error detection module 210 sends instructions (e.g., internal detection instructions) for detecting processor errors to the instruction execution module 220. The instruction execution module 220 is configured to execute the internal detection instruction to obtain a first execution result. The instruction execution module 220 sends the first execution result to the error detection module 210. The error detection module 210 is configured to determine whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
For example, if the first execution result matches the expected result of the internal detection instruction, the error detection module 210 determines that the processor has no error. If the first execution result does not match the expected result of the internal detection instruction, the error detection module 210 determines that the processor has an error.
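The comparison described above can be illustrated with a minimal software sketch. This is not the hardware implementation of the application; the names `DetectionInstruction`, `execute` and `processor_has_error`, the three sample operations, and the modelling of a fault as a flipped low bit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionInstruction:
    """Hypothetical internal detection instruction with a preconfigured expected result."""
    op: str
    a: int
    b: int
    expected: int


def execute(instr: DetectionInstruction, faulty: bool = False) -> int:
    """Stand-in for the instruction execution module (for example, an ALU)."""
    ops = {
        "add": lambda x, y: x + y,
        "mul": lambda x, y: x * y,
        "xor": lambda x, y: x ^ y,
    }
    result = ops[instr.op](instr.a, instr.b)
    # Assumption: a hardware fault is modelled as one flipped low bit.
    return result ^ 1 if faulty else result


def processor_has_error(instr: DetectionInstruction, faulty: bool = False) -> bool:
    """Stand-in for the error detection module: compare first result with expected result."""
    first_result = execute(instr, faulty)
    return first_result != instr.expected


probe = DetectionInstruction("mul", 6, 7, expected=42)
print(processor_has_error(probe))               # healthy execution path
print(processor_has_error(probe, faulty=True))  # simulated fault
```

In this sketch a mismatch is the only signal used; how the real modules exchange the result is not specified here.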
In this embodiment, the error detection module 210 and the instruction execution module 220 in the processor 200 can cooperatively implement error detection of the processor 200 without using an external inspection tool, that is, without decoding a program instruction by the processor 200 and sending an execution result to the inspection tool, which is beneficial to reducing the processing amount of the processor and saving the resource overhead of the processor.
In one possible implementation, the error detection module 210 sends the internal detection instruction to the instruction execution module 220 in the event that the remaining space of the buffer queue of the processor 200 is greater than or equal to a threshold. The buffer queue is used to store instructions to be processed (or unprocessed) by the processor 200. The threshold may be preconfigured in the error detection module 210.
The smaller the remaining space of the buffer queue, the more instructions the processor 200 still needs to process and the heavier the load on the processor 200; the larger the remaining space of the buffer queue, the fewer instructions the processor 200 needs to process and the lighter the load on the processor 200.
In this embodiment, no internal detection instruction is sent to the instruction execution module 220 while the processor 200 is under a heavy load, which avoids aggravating the processing load of the processor 200 and favors reasonable utilization of the resources of the processor 200.
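The load condition can be sketched as a simple gate on the remaining space of the queue. The class `BufferQueue`, the function `should_send_detection`, and the capacity and threshold values are hypothetical and only illustrate the "greater than or equal to a threshold" rule.

```python
from collections import deque


class BufferQueue:
    """Hypothetical model of the queue holding instructions still to be processed."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pending = deque()

    def remaining_space(self) -> int:
        return self.capacity - len(self.pending)


def should_send_detection(queue: BufferQueue, threshold: int) -> bool:
    """Send an internal detection instruction only when enough space remains."""
    return queue.remaining_space() >= threshold


queue = BufferQueue(capacity=8)
queue.pending.extend(["ld", "add", "st"])  # 3 pending, 5 slots free
print(should_send_detection(queue, threshold=4))
```

With more pending instructions the remaining space drops below the threshold and the gate closes, so detection traffic is deferred while the processor is busy.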
Since the instruction execution module 220 may execute external program instructions in addition to instructions from the error detection module 210, the instruction execution module 220 may produce both the execution result of the internal detection instruction and execution results corresponding to external program instructions. External program instructions refer to instructions that do not reside within processor 200. Thus, to help the error detection module 210 identify the first execution result corresponding to the internal detection instruction, in one possible implementation, the error detection module 210 may add a flag to the internal detection instruction. The error detection module 210 then uses the flag to obtain, from the instruction execution module 220, the first execution result of the internal detection instruction carrying that flag.
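A minimal sketch of the flag mechanism follows, assuming the flag is a single boolean carried from the instruction to its result. The types `Instruction` and `Result` and the helper names are hypothetical; the application does not specify the encoding of the flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instruction:
    op: str
    a: int
    b: int
    flag: bool = False  # assumption: set only by the error detection module


@dataclass
class Result:
    value: int
    flag: bool  # copied from the instruction that produced this result


def execute_all(instructions: List[Instruction]) -> List[Result]:
    """Stand-in for the instruction execution module handling a mixed stream."""
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    return [Result(ops[i.op](i.a, i.b), i.flag) for i in instructions]


def detection_results(results: List[Result]) -> List[int]:
    """Keep only the results that belong to internal detection instructions."""
    return [r.value for r in results if r.flag]


stream = [
    Instruction("add", 1, 2),              # external program instruction
    Instruction("mul", 6, 7, flag=True),   # internal detection instruction
    Instruction("add", 10, 5),             # external program instruction
]
flagged = detection_results(execute_all(stream))
print(flagged)
```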
The expected results corresponding to the internal detection instructions may be preconfigured in the error detection module 210. Alternatively, the expected results corresponding to the internal detection instruction may be preconfigured in the register 260 of the processor 200. Subsequently, the error detection module 210 may read the expected result corresponding to the internal detection instruction from the register 260.
As one example, the processor 200 also includes a control module 250. The control module 250 is used to control the various modules of the processor 200 (including the error detection module 210 and the instruction execution module 220, etc.). The control module 250 is, for example, a Control Unit (CU).
As one example, the processor 200 further includes a first storage module 230 and/or a second storage module 240. The first storage module 230 and the second storage module 240 may be understood as internal storage modules of the processor 200. The first storage module 230 is, for example, a read-only memory (ROM) or a memory, and the second storage module 240 is, for example, a memory. The specific implementation forms of the first storage module 230 and the second storage module 240 are not limited in this embodiment.
Specifically, the first storage module 230 is configured to store instructions. The first storage module 230 can be read by the processor 200 (specifically, by the error detection module 210 in the processor 200). For example, the error detection module 210 is configured to read the internal detection instruction from the first storage module 230.
The second storage module 240 is configured to store instructions. The second storage module 240 can be both read and written by the processor 200 (specifically, by the error detection module 210 in the processor 200). For example, an external device may write instructions to the second storage module 240 through the processor 200, and the error detection module 210 may read internal detection instructions from the second storage module 240. This makes it possible to expand the number and types of instructions used for detecting errors in the processor 200.
As one example, the error detection module 210 may also read internal detection instructions from instructions that have been executed by the instruction execution module 220.
In the embodiment of the present application, the error detection module 210 may read the internal detection instruction from the first storage module 230, the second storage module 240, or the instruction executed by the instruction execution module 220 in the processor 200, which provides various ways to obtain the internal detection instruction.
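The three sources can be pictured with a small model in which the first storage module is immutable, the second is writable, and executed instructions are recorded as they complete. The class `DetectionInstructionSources`, its method names, and the textual instruction format are hypothetical.

```python
from typing import List, Tuple


class DetectionInstructionSources:
    """Hypothetical model of where internal detection instructions may come from."""

    def __init__(self, preconfigured: Tuple[str, ...]):
        self._first_storage = tuple(preconfigured)  # read-only, ROM-like
        self._second_storage: List[str] = []        # readable and writable
        self._executed: List[str] = []              # instructions already executed

    def write_second_storage(self, instruction: str) -> None:
        self._second_storage.append(instruction)

    def record_executed(self, instruction: str) -> None:
        self._executed.append(instruction)

    def candidates(self) -> List[str]:
        """All instructions the error detection module could pick from."""
        return [*self._first_storage, *self._second_storage, *self._executed]


sources = DetectionInstructionSources(("mul 6 7",))
sources.write_second_storage("xor 5 3")  # e.g. written later by an external device
sources.record_executed("add 1 2")       # e.g. a previously executed instruction
print(sources.candidates())
```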
In one possible implementation, the instruction execution module 220 may also be used to process external program instructions, that is, instructions that originate outside the processor. For example, the instruction execution module 220 may decode the external program instruction to obtain a decoded result, and execute the decoded result to obtain a second execution result. Decoding external program instructions may be understood as converting external program instructions into instructions that the instruction execution module 220 can operate on directly.
The external program instructions are, for example, instructions stored in an external memory. For example, the external program instructions may be the instructions formed after an application is loaded by the operating system.
Fig. 3 is a schematic structural diagram of another processor according to an embodiment of the present application. FIG. 3 is a schematic diagram of a multi-core processor. As shown in fig. 3, processor 300 includes a plurality of processor cores, an error detection module 301, and an instruction execution module 302. In fig. 3, a plurality of processor cores including a first processor core 310 and a second processor core 320 are illustrated. Processor cores may also be referred to as physical processor cores, or the like.
Wherein the function and implementation of the error detection module 301 may refer to the content of the error detection module 210 discussed in fig. 2 above, and the function and implementation of the instruction execution module 302 may refer to the content of the instruction execution module 220 discussed in fig. 2 above.
The first processor core 310 and the second processor core 320 may be identical in structure, and the first processor core 310 is taken as an example for description below.
The first processor core 310 includes an instruction execution module 302. For details of the instruction execution module 302, refer to the foregoing description. The first processor core 310 further includes a control module 307. For details of the control module 307, refer to the foregoing description of the control module 250.
In the processor 300 according to the embodiment of fig. 3, the error detection module 301 may perform error detection on a plurality of processor cores of the processor 300, to determine which processor core has an error.
For example, the error detection module 301 may send an internal detection instruction to the instruction execution module 302 in the first processor core 310, and obtain an execution result of the internal detection instruction sent by the instruction execution module 302 in the first processor core 310. If the execution result of the internal detection instruction does not match the expected result, the error detection module 301 determines that an error exists in the first processor core 310. If the execution result of the internal detection instruction matches the expected result, the error detection module 301 determines that the first processor core 310 is free of errors. Similarly, the error detection module 301 may detect whether an error exists in the second processor core 320.
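Per-core detection can be sketched by running the same internal detection instruction on each core and collecting the cores whose result deviates. The function names, the use of a multiplication as the detection instruction, and the way a fault is simulated are assumptions for illustration.

```python
from typing import FrozenSet, Iterable, List


def run_on_core(core_id: int, a: int, b: int, faulty_cores: FrozenSet[int]) -> int:
    """Stand-in for the instruction execution module of one processor core."""
    result = a * b
    # Assumption: a faulty core returns an off-by-one result.
    return result + 1 if core_id in faulty_cores else result


def find_faulty_cores(
    core_ids: Iterable[int],
    a: int,
    b: int,
    expected: int,
    faulty_cores: FrozenSet[int] = frozenset(),
) -> List[int]:
    """Return the cores whose execution result does not match the expected result."""
    return [c for c in core_ids if run_on_core(c, a, b, faulty_cores) != expected]


# Core numbers reuse the reference numerals 310 and 320 purely as labels.
print(find_faulty_cores([310, 320], 6, 7, expected=42, faulty_cores=frozenset({320})))
```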
In this embodiment, the processor 300 may perform error detection on the processor cores in the processor 300, which is equivalent to performing more accurate error detection on the processor 300.
Optionally, the processor 300 further comprises one or more of a register 304, a first memory module 305 and a second memory module 306.
For the function of the register 304, refer to the description of the register 260 in fig. 2 above. For the function and implementation of the first storage module 305, refer to the description of the first storage module 230 in fig. 2 above. For the function and implementation of the second storage module 306, refer to the description of the second storage module 240 in fig. 2 above.
Fig. 4 is a schematic structural diagram of another processor according to an embodiment of the present application. As shown in fig. 4, the processor 400 includes an instruction execution module 410 and an error detection module 420. The error detection module 420 includes an instruction acquisition sub-module 421, an instruction identification sub-module 422, and an error determination sub-module 423.
Optionally, the instruction acquisition sub-module 421, the instruction identification sub-module 422 and the error determination sub-module 423 may be implemented in hardware. For example, the instruction acquisition sub-module 421 may be implemented by a register or a memory, the instruction identification sub-module 422 may be implemented by a comparator, and the error determination sub-module 423 may be implemented by a comparator.
Specifically, the instruction acquisition sub-module 421 is configured to provide internal detection instructions to the instruction execution module 410. For how the instruction acquisition sub-module 421 obtains the internal detection instruction, refer to the foregoing description of how the error detection module obtains the internal detection instruction; details are not repeated here. The instruction identification sub-module 422 identifies the execution result corresponding to the internal detection instruction from among the execution results produced by the instruction execution module 410. The error determination sub-module 423 determines whether the execution result corresponding to the internal detection instruction is the same as the expected result of the internal detection instruction, thereby determining whether the processor 400 has an error.
Optionally, the processor 400 further includes a load determination sub-module 424, illustrated in fig. 4 as a dashed box. The load determination sub-module 424 may be used to determine whether the remaining space of the buffer queue of the processor 400 is greater than or equal to a threshold. For example, the load determination sub-module 424 determines that the remaining space of the buffer queue of the processor 400 is greater than or equal to a threshold, and triggers the instruction acquisition sub-module 421 to send an internal detection instruction to the instruction execution module 410.
The processors referred to in any of the various embodiments of the present application may be provided in any type of computing device. Computing devices generally refer to devices having processing capabilities, including, for example, servers or terminal devices. Terminal devices include, for example, cell phones, tablet computers, computers with wireless transceiver functions, wearable devices, vehicles, unmanned aerial vehicles, helicopters, airplanes, ships, robots, mechanical arms, and smart home devices.
Fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present application. As shown in fig. 5, computing device 500 includes software layers and hardware layers.
The software layers include an operating system 510, a firmware system 520, and a baseboard management controller (BMC) 530, among others. Baseboard management controller 530 is, for example, a small operating system that runs independently of the rest of computing device 500. The operating system 510, firmware system 520, and baseboard management controller 530 can all be used to manage the computing device 500. One or more of the operating system 510, firmware system 520, and baseboard management controller 530 in embodiments of the present application can be considered a management module.
The hardware layer includes an external memory module 540, a processor 550, a network card 560, and the like. The external storage module 540 is, for example, the memory of the computing device 500. The structure of the processor 550 may refer to the structure shown in fig. 2, 3 or 4 previously. Only one processor 550 is illustrated in fig. 5, and in practice the number of processors 550 may be one or more.
Processor 550 may be used to process requests from outside computing device 500, and/or requests generated inside computing device 500. The network card 560 is used to communicate the computing device 500 with other devices.
In addition, computing device 500 may also include a bus that may be used for communication among the components of computing device 500.
Optionally, computing device 500 may also include power supply circuitry for powering processor 550.
In one possible implementation, when the processor 550 detects that an error exists in the processor 550, it may provide alarm information to the management module. The alarm information indicates that an error exists in the processor 550, so that a user can learn about the state of the processor 550 in time.
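The alarm path can be sketched as a callback that the processor invokes when a mismatch is found. The function `check_and_alarm`, the `notify` callback, and the fields of the alarm record are hypothetical; the application does not define the format of the alarm information.

```python
from typing import Callable, Dict, List


def check_and_alarm(
    first_result: int,
    expected: int,
    notify: Callable[[Dict[str, object]], None],
) -> bool:
    """Report an error to the management module when the results do not match."""
    if first_result == expected:
        return False
    # Assumption: the alarm is a small record; the real format is unspecified.
    notify({"component": "processor", "got": first_result, "expected": expected})
    return True


alarms: List[Dict[str, object]] = []  # stands in for the management module's inbox
check_and_alarm(42, 42, alarms.append)  # no alarm
check_and_alarm(43, 42, alarms.append)  # alarm raised
print(alarms)
```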
In one possible implementation, the processor 550 is configured to execute external program instructions in the external storage module 540 in addition to the internal detection instructions.
Specifically, the processor 550 reads external program instructions from the external storage module 540 and decodes the external program instructions to obtain decoded results. The processor 550 executes the decoded result to obtain a second execution result. The processor 550 may write the second execution result to the external storage module 540 in order for the operating system or the like of the computing device 500 to obtain the second execution result.
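The read, decode, execute and write-back sequence for external program instructions can be illustrated as follows. The textual instruction format, the `decode` and `execute` helpers, and the dictionary standing in for the external storage module are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


def decode(text: str) -> Tuple[str, int, int]:
    """Convert a textual external program instruction into an executable form."""
    op, a, b = text.split()
    return op, int(a), int(b)


def execute(decoded: Tuple[str, int, int]) -> int:
    """Execute a decoded instruction and return the second execution result."""
    op, a, b = decoded
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    return ops[op](a, b)


# Hypothetical external storage module holding a program and a result area.
external_storage: Dict[str, List] = {"program": ["add 2 3", "mul 4 5"], "results": []}

for line in external_storage["program"]:
    external_storage["results"].append(execute(decode(line)))  # write back

print(external_storage["results"])
```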
The computing device according to the embodiment of the application can be applied to any scene, for example, a cloud data center. The cloud data center may be used to provide business services to users, including storage services and/or computing services.
Fig. 6 is a schematic deployment diagram of a cloud data center according to an embodiment of the present application. Alternatively, fig. 6 may be understood as a schematic diagram of an application scenario of the error detection method provided in the embodiments of the present application. As shown in fig. 6, the scenario includes a plurality of terminal devices 610, a plurality of clients 611, and a cloud data center 620. Each of the plurality of terminal devices 610 runs one of the plurality of clients 611. Each client 611 may be a software module, an application, or the like. Cloud data center 620 includes one or more computing devices.
Illustratively, the client 611 may remotely access the cloud data center 620, thereby using business services provided by the cloud data center 620.
The following describes the structure of the cloud data center with reference to the architecture schematic diagram shown in fig. 7. In fig. 7, a server is taken as an example of a computing device included in the cloud data center.
As shown in fig. 7, cloud data center 700 includes cloud management platform 710 and at least one server 730. Fig. 7 illustrates two servers 730 as an example. Cloud management platform 710 communicates with each server 730 of the at least one server 730 over cloud data center intranet 720.
Cloud management platform 710 is used to provide access interfaces (e.g., interfaces or application program interfaces (application programming interface, APIs)). For example, a tenant may operate a client to remotely access the API, register a cloud account and password with cloud management platform 710, and log into cloud management platform 710. The client is, for example, client 611 in fig. 6.
After the cloud management platform 710 successfully authenticates the cloud account and the password, the tenant can select and pay for a virtual machine with a specific specification (processor, memory, disk) on the cloud management platform 710. A virtual machine may also be referred to as a cloud server (elastic compute service, ECS), an elastic instance, or the like. After the tenant's purchase succeeds, the cloud management platform 710 provides the tenant with the remote login account and password of the purchased virtual machine. The client can then remotely log in to the virtual machine, and install and run the tenant's applications in the virtual machine.
Logic functions of cloud management platform 710 may include user consoles, computing management services, network management services, storage management services, authentication services, and image management services, among others. Wherein the user console provides an interface or API to interact with the tenant. The computing management service is used for managing servers running virtual machines and containers and bare metal servers. The network management service is used to manage network services (e.g., gateway, firewall, etc.). The storage management service is used to manage storage services (e.g., a data bucket service). The authentication service is used for managing account passwords of tenants. The image management service is used for managing virtual machine images.
Here, the structure of any two servers 730 of the at least one server 730 may be the same, and a structure of one server 730 will be described as an example.
Server 730 includes a hardware layer and a software layer. The hardware layers of the server 730 include a memory 734, a processor 735, a network card 736, and a disk 737. The contents of the memory 734, the processor 735, and the network card 736 may be as discussed above with reference to fig. 5. Optionally, server 730 may also include power supply circuitry for powering processor 735. The memory 734 may be considered an example of an external storage module.
The software layer of the server 730 includes an operating system (which may be referred to as a host operating system with respect to the operating system of the virtual machine) installed and running on the server 730, in which a virtual machine manager (virtual machine manager, VMM) 732 and a plurality of virtual machines 731 are disposed. Virtual machine manager 732 is operable to implement computing virtualization, network virtualization, storage virtualization of virtual machines, and to manage virtual machines 731. Wherein computing virtualization refers to providing portions of processor 735 and memory 734 of server 730 to virtual machines; network virtualization refers to providing a portion of the functionality (e.g., bandwidth) of the network card 736 to a virtual machine; storage virtualization refers to providing portions of disk 737 to virtual machines; managing virtual machines 731 is, for example, creating virtual machines 731, emulating virtual hardware for virtual machines according to a hardware layer (hardware emulation function), deleting virtual machines 731, forwarding and/or handling network messages between all virtual machines 731 running on the server 730 or between virtual machines 731 on the server 730 and external networks (virtual switching function), handling input/output (I/O) generated by virtual machines 731, etc.
The operating environments (e.g., virtual machine applications, operating systems, and virtual hardware) of different virtual machines 731 are completely isolated from each other, and different virtual machines 731 may communicate through the virtual machine manager 732. Each virtual machine 731 may have an operating system, a firmware system, a baseboard management controller, and the like running therein.
The virtual machine manager 732 further includes a cloud platform management client 733. The cloud platform management client 733 is configured to receive control plane commands sent by the cloud management platform 710, and to create the virtual machines on the server 730 and perform full life cycle management on them according to the control plane commands, so that a tenant may create, manage, log in to, and operate the virtual machine 731 in the cloud data center 700 through the cloud management platform 710.
The method provided by the embodiments of the present application is described below with reference to the accompanying drawings.
In the drawings corresponding to the embodiments of the present application, all steps indicated by dotted lines are optional steps.
Fig. 8 is a flowchart of a method for detecting a processor error according to an embodiment of the present application. The structure of the processor involved in the embodiment shown in fig. 8 is, for example, the processor 200 in fig. 2, the processor 300 in fig. 3, or the processor 400 in fig. 4, and the processor involved in the embodiment shown in fig. 8 is, for example, applicable in the computing device shown in fig. 5. The embodiment shown in fig. 8 relates to a processor comprising an instruction execution module and an error detection module.
S801, the error detection module adds a flag to the internal detection instruction. The error detection module according to the embodiment of the present application is, for example, the error detection module 210 in fig. 2, the error detection module 301 in fig. 3, or the error detection module 420 in fig. 4.
The error detection module may determine an instruction for detecting a processor error; in the embodiments of the present application, such an instruction is referred to as an internal detection instruction.
When the processor is applied to a computing device, such as the computing device of fig. 5, the instruction execution module executes instructions from the error detection module and may also execute external program instructions from the external storage module. The instruction execution module is, for example, instruction execution module 220 of FIG. 2, instruction execution module 302 of FIG. 3, or instruction execution module 410 of FIG. 4. In order to later distinguish the instructions sent by the error detection module from other instructions executed by the instruction execution module, in embodiments of the present application, the error detection module may add a flag to the internal detection instruction, the flag indicating that the internal detection instruction is an instruction for detecting a processor error. The flag may take the specific form of a label.
The external storage module is not disposed in the processor. The external storage module is, for example, a ROM or a memory, such as the external storage module 540 in fig. 5. The instructions in the external storage module are external program instructions, whose meaning is described above; an external program instruction is, for example, an instruction from an external application.
The error detection module may obtain the internal detection instruction in various ways, which are described below.
In a first mode, the error detection module determines an internal detection instruction from among the instructions that have been executed by the processor.
An instruction execution module in the processor executes a plurality of instructions, such as instructions from the error detection module and instructions from the external storage module. For convenience of description, in this embodiment, an instruction that has been executed by the instruction execution module is referred to as a history instruction, and all instructions that have been executed by the instruction execution module are referred to as a history instruction set. The error detection module may select at least one history instruction from the history instruction set for detecting whether the processor has an error. The error detection module may determine one of the at least one history instruction as the internal detection instruction. Correspondingly, in the first mode, the internal detection instruction is an instruction already executed by the processor, namely, a history instruction.
For example, the error detection module may randomly determine an instruction from at least one historical instruction as an internal detection instruction.
For another example, the error detection module may take, as the internal detection instruction, the instruction having the greatest execution complexity among the at least one historical instruction. The execution complexity may be characterized by the length of time the instruction execution module previously required to execute the instruction: the longer the time, the higher the execution complexity. Because the processor is more prone to errors when executing high-complexity instructions, selecting a high-complexity instruction makes processor errors easier to detect.
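Both selection strategies, random choice and choice by longest previous execution time, can be sketched as follows. The record type `HistoryInstruction`, its fields, and the sample durations are hypothetical values chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HistoryInstruction:
    text: str
    duration_ns: int  # time previously needed to execute the instruction
    result: int       # previously observed result, reusable as the expected result


def pick_most_complex(history: List[HistoryInstruction]) -> HistoryInstruction:
    """Use execution time as a proxy for execution complexity."""
    return max(history, key=lambda h: h.duration_ns)


def pick_random(history: List[HistoryInstruction], rng: random.Random) -> HistoryInstruction:
    """Pick any history instruction with equal probability."""
    return rng.choice(history)


history = [
    HistoryInstruction("add r1, r2", duration_ns=1, result=3),
    HistoryInstruction("div r1, r2", duration_ns=20, result=2),
    HistoryInstruction("mul r1, r2", duration_ns=4, result=8),
]
print(pick_most_complex(history).text)
print(pick_random(history, random.Random(0)).text)
```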
Alternatively, if the processor is a multi-core processor, such as processor 300 of FIG. 3, supra, the error detection module may sample the historical instruction set from a plurality of instruction execution modules corresponding to the plurality of processor cores. In other words, the historical instruction set includes instructions that have been executed by the plurality of instruction execution modules.
In one possible implementation manner, in a case that the error detection module obtains at least one historical instruction from the instruction execution module, the error detection module may further write a result corresponding to the execution of the at least one historical instruction by the instruction execution module into a preset register in the processor. The result corresponding to the at least one historical instruction may be considered an expected result corresponding to the at least one historical instruction. Wherein the register is, for example, register 260 in fig. 2 or register 304 in fig. 3.
To distinguish the historical instructions from one another, the error detection module may also generate an identifier for each of the at least one historical instruction, and store the identifier of each historical instruction in the register in association with the expected result corresponding to that instruction.
For example, referring to table 1 below, the register stores the identifier of each historical instruction in association with the expected result corresponding to that instruction.
TABLE 1
Instruction identifier | Expected result of the instruction
0 | 11
1 | 00
As shown in table 1, the expected result of the instruction identified by 0 is 11, and the expected result of the instruction identified by 1 is 00.
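The association in table 1 can be modelled as a lookup from instruction identifier to expected result. The dictionary `expected_results` and the helper names are hypothetical, and the values 11 and 00 are treated here as two-bit binary patterns, which is an assumption about how the table is meant to be read.

```python
from typing import Dict

# Identifier -> expected result, mirroring table 1 (values read as binary).
expected_results: Dict[int, int] = {0: 0b11, 1: 0b00}


def record_expected(table: Dict[int, int], instruction_id: int, result: int) -> None:
    """Store the result of a history instruction as its expected result."""
    table[instruction_id] = result


def has_error(table: Dict[int, int], instruction_id: int, first_result: int) -> bool:
    """Compare a new execution result with the stored expected result."""
    return first_result != table[instruction_id]


record_expected(expected_results, 2, 0b10)        # a newly sampled history instruction
print(has_error(expected_results, 0, 0b11))       # matches the stored value
print(has_error(expected_results, 1, 0b11))       # deviates from the stored value
```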
In a second mode, the error detection module reads the internal detection instruction from the first storage module. The first storage module in the embodiment of the present application is, for example, the first storage module 230 in fig. 2, or is, for example, the first storage module 305 in fig. 3.
The first storage module may have at least one instruction pre-stored therein. For example, the at least one instruction may be manually configured in the first storage module by a worker before the processor is shipped. Correspondingly, in the second mode, the internal detection instruction is one of the instructions in the first storage module.
The processor may perform a read operation on the first storage module. For example, the error detection module in the processor may read one instruction from the at least one instruction stored in the first storage module as the internal detection instruction.
For example, the error detection module may randomly read an instruction from the at least one instruction as an internal detection instruction.
As an example, the at least one instruction pre-stored in the first storage module may be an instruction for which the probability that the processor executes it incorrectly is greater than or equal to a first probability. In other words, the at least one instruction pre-stored in the first storage module may be an instruction that the processor is prone to executing incorrectly. This probability may be determined empirically or obtained by performing multiple tests on the processor.
In one possible implementation, the first storage module may further store an expected result corresponding to the at least one instruction. For example, when the at least one instruction is configured in the first storage module, the worker may manually configure the expected result corresponding to the at least one instruction in the first storage module. The expected result corresponding to the at least one instruction may be obtained by having another processor, different from the processor under detection, execute the at least one instruction.
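Selecting instructions whose error probability reaches the first probability can be sketched by estimating that probability from repeated tests. The test counts, the instruction names, and the threshold value below are invented for illustration and are not measurements.

```python
from typing import Dict, List, Tuple


def error_probability(failures: int, trials: int) -> float:
    """Estimate the probability of incorrect execution from repeated tests."""
    return failures / trials


def select_detection_instructions(
    test_counts: Dict[str, Tuple[int, int]],
    first_probability: float,
) -> List[str]:
    """Keep instructions whose estimated error probability reaches the threshold."""
    return [
        name
        for name, (failures, trials) in test_counts.items()
        if error_probability(failures, trials) >= first_probability
    ]


# Hypothetical (failures, trials) per instruction from testing a processor.
test_counts = {"fma": (5, 1000), "add": (0, 1000), "div": (2, 1000)}
print(select_detection_instructions(test_counts, first_probability=0.001))
```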
Alternatively, the identification of the at least one instruction, and the expected result corresponding to the at least one instruction, may be pre-stored in a register, such as that discussed above. For example, a worker may manually configure the expected result corresponding to the at least one instruction in a register. The expected result corresponding to the at least one instruction may be obtained in a manner as described above.
In a third mode, the error detection module reads the internal detection instruction from the second storage module. The second storage module in the embodiment of the present application is, for example, the second storage module 240 in fig. 2, or is, for example, the second storage module 306 in fig. 3.
The processor may perform both write operations and read operations on the second storage module.
For example, at least one instruction in the second storage module may be written by the processor. For example, an external device may access the processor and write instructions into the second storage module through the processor. Correspondingly, in the third mode, the internal detection instruction is one of the instructions in the second storage module. Further, the error detection module in the processor may read an instruction from the second storage module as the internal detection instruction.
In this third mode, the second storage module supports external writing, which makes it easier to add new instructions for detecting processor errors later.
In one possible implementation, the first storage module may further store an expected result corresponding to the at least one instruction. For example, when the external device writes at least one instruction into the first storage module, the expected result corresponding to the at least one instruction may also be written into the first storage module. Wherein the expected result for the at least one instruction may be obtained by execution by the other processor.
Alternatively, the identity of the at least one instruction, and the expected result corresponding to the at least one instruction, are pre-stored in a register, such as that discussed above. For example, the external device may write the expected result corresponding to the at least one instruction to the register. The expected result corresponding to the at least one instruction may be obtained in a manner as described above.
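The difference between the second and third modes can be sketched as follows. This is a minimal Python model under stated assumptions: the class names mirror the terms used in the text, but the `read` and `write` methods and the representation of an entry as an (instruction, expected result) pair are illustrative choices, not the application's design.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (instruction, expected result), assumed representation


class FirstStorageModule:
    """Readable only: contents are fixed when the module is configured."""

    def __init__(self, entries: List[Entry]) -> None:
        self._entries = tuple(entries)

    def read(self, index: int) -> Entry:
        return self._entries[index]


class SecondStorageModule(FirstStorageModule):
    """Readable and writable: new detection instructions can be added later."""

    def __init__(self, entries: List[Entry]) -> None:
        super().__init__(entries)
        self._entries = list(entries)

    def write(self, entry: Entry) -> int:
        # Append the new entry and return the index it was stored at.
        self._entries.append(entry)
        return len(self._entries) - 1
```

In this sketch, an external device would call `write` on the second storage module to extend the set of detection instructions, whereas the first storage module offers no such method.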
If the instruction execution module only needs to execute instructions from the error detection module, the error detection module may not need to perform the steps of S801, i.e., S801 is an optional step, illustrated in FIG. 8 as a dashed line.
S802, the error detection module sends an internal detection instruction to the instruction execution module. Accordingly, the instruction execution module receives the internal detection instruction from the error detection module.
In the case where the processor in the embodiment of fig. 8 is a multi-core processor, the error detection module sends an internal detection instruction to an instruction execution module corresponding to one processor core (e.g., a first processor core) in the multi-core processor. The first processor core is, for example, the first processor core 310 shown in fig. 3.
For example, the error detection module sends instructions for detecting processor errors to the instruction execution module in a first cycle. The duration of the first period may be preconfigured in the error detection module, for example, the duration of the first period is 5 hours, etc.
For another example, the error detection module may send instructions for detecting processor errors to the instruction execution module at random.
Or, for example, in the case where the load of the processor is small, the error detection module sends an internal detection instruction to the instruction execution module.
For example, the error detection module may characterize the load of the processor by the remaining space of the buffer queue in the processor. This prevents processor error detection from occupying processor resources when the load of the processor is large, so that the processor can smoothly execute the instructions from the external storage module.
Specifically, if the remaining space of the buffer queue in the processor is greater than or equal to the threshold, the error detection module determines that the load of the processor is small; if the remaining space of the buffer queue of the processor is less than the threshold, the error detection module determines that the load of the processor is large. Wherein the threshold value may be preconfigured in the error detection module, the threshold value being for example 1M.
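The threshold comparison above can be written as a one-line predicate. The following Python sketch assumes that the example threshold "1M" means one mebibyte of queue space; the text does not state the unit, and the names `THRESHOLD_BYTES` and `load_is_small` are hypothetical.

```python
THRESHOLD_BYTES = 1 * 1024 * 1024  # assumed reading of the "1M" example threshold


def load_is_small(remaining_queue_space: int, threshold: int = THRESHOLD_BYTES) -> bool:
    """Return True when the remaining buffer-queue space is greater than or
    equal to the threshold, i.e. when the processor load counts as small."""
    return remaining_queue_space >= threshold
```

Note that the boundary case (remaining space exactly equal to the threshold) counts as a small load, matching the "greater than or equal to" wording.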
As an example, if the processor in the embodiment of fig. 8 is a multi-core processor, the error detection module may send an internal detection instruction to the instruction execution module corresponding to the first processor core if the load of the first processor core is small. The manner of determining that the load of the first processor core is small may refer to the manner of determining the load of the processor described above.
In the case where the processor in fig. 8 is applied to a computing device, in order to let the user flexibly control the process of performing processor error detection, a management module in the computing device may include a detection switch. The detection switch is used for indicating whether error detection is performed on the processor. The management module determines whether the detection switch is in an on state or an off state according to the operation of a user. In one possible embodiment, the error detection module may first determine that the detection switch in the management module is in an on state and then send an internal detection instruction to the instruction execution module. The detection switch being in an on state indicates that error detection is performed on the processor; the detection switch being in an off state indicates that no error detection is performed on the processor.
The management module includes one or more of an operating system, a baseboard management controller, or a firmware system in the computing device. When the management module includes two or more of the operating system, the baseboard management controller, or the firmware system, the error detection module need only determine that the detection switch of any one of them is in an on state, which is equivalent to determining that the detection switch in the management module is in the on state.
For example, if the error detection module determines that the detection switch is in an on state and the load of the processor is small, an internal detection instruction is sent to the instruction execution module; or if the error detection module determines that the detection switch is in an on state, an internal detection instruction is sent to the instruction execution module.
For example, please refer to fig. 9, which is a schematic diagram of a detection switch of a management module according to an embodiment of the present application. As shown in fig. 9, the detection switch in the management module (specifically, the key for detecting processor errors in fig. 9) is shown as "v", which indicates that the detection switch in the management module is in an on state.
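The switch and load conditions discussed above can be combined as in the following Python sketch. It assumes each management component (operating system, baseboard management controller, firmware system) exposes its switch as a boolean; the function names, the dictionary keys, and the `require_small_load` parameter are hypothetical and only illustrate the two variants described in the text.

```python
from typing import Mapping


def detection_switch_on(switches: Mapping[str, bool]) -> bool:
    """The switch counts as on if any one management component reports it on."""
    return any(switches.values())


def should_send_detection_instruction(
    switches: Mapping[str, bool],
    load_small: bool,
    require_small_load: bool = True,
) -> bool:
    """Decide whether to send the internal detection instruction.

    With require_small_load=True, both the switch and a small load are needed;
    with require_small_load=False, the switch alone is sufficient.
    """
    if not detection_switch_on(switches):
        return False
    return load_small or not require_small_load
```

For instance, with the switch on only in the baseboard management controller, the instruction would be sent under a small load, and under a large load only when the load condition is not required.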
S803, the instruction execution module executes the internal detection instruction to obtain a first execution result.
Because the internal detection instruction is sent to the instruction execution module by the error detection module in the processor, the instruction execution module can execute the internal detection instruction directly, without decoding it, to obtain an execution result. For convenience of description, the embodiment of the present application refers to the execution result of the internal detection instruction as the first execution result.
Alternatively, in the case where the error detection module performs S801 (i.e., adds a mark to the internal detection instruction), the instruction execution module may cache the mark of the internal detection instruction and add the mark to the first execution result.
S804, the error detection module determines a first execution result according to the mark.
For example, in the case where the error detection module performs S801, the error detection module may obtain execution results (including, for example, a first execution result and a second execution result) of all instructions from the instruction execution module. The error detection module may identify the first execution result according to the flag, which is equivalent to the error detection module determining the first execution result.
For another example, the error detection module may send a first request to the instruction execution module requesting the execution result corresponding to the internal detection instruction, and the first request may include (or indicate) the mark of the internal detection instruction. The instruction execution module receives the first request and feeds back the first execution result to the error detection module, which is equivalent to the error detection module determining the first execution result.
For another example, the error detection module may obtain, from the instruction execution module, a first execution result corresponding to the internal detection instruction according to the identifier of the internal detection instruction. Wherein the content of the identification of the internal detection instruction can refer to the foregoing. In this case, the error detection module does not need to determine the first execution result according to the flag, so S804 is an optional step.
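The first of the alternatives above, in which the error detection module receives all execution results and picks out the marked one, might look like the following Python sketch. The `ExecutionResult` type, the `DETECTION_MARK` value, and the function name are assumptions made for illustration; the application does not specify how the mark is encoded.

```python
from dataclasses import dataclass
from typing import List, Optional

DETECTION_MARK = "DETECT"  # hypothetical mark value


@dataclass
class ExecutionResult:
    value: int
    mark: Optional[str] = None  # None for results of external program instructions


def find_first_execution_result(results: List[ExecutionResult]) -> Optional[ExecutionResult]:
    """Return the result that carries the detection mark, if any."""
    for result in results:
        if result.mark == DETECTION_MARK:
            return result
    return None
```

Results without the mark (second execution results) are simply skipped, so they remain untouched by the detection path.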
S805, the error detection module determines whether the processor has an error according to a matching result of the first execution result and the expected result of the internal detection instruction.
In the case where the expected result of the internal detection instruction is pre-stored in a register of the processor, the error detection module may read the expected result of the internal detection instruction from the register. Alternatively, in the case where the expected result of the internal detection instruction is pre-stored in the first memory module of the processor, the error detection module may read the expected result of the internal detection instruction from the first memory module. Alternatively, in the case where the expected result of the internal detection instruction is pre-stored in the second memory module of the processor, the error detection module may read the expected result of the internal detection instruction from the second memory module.
The error detection module determines whether the first execution result matches the expected result of the internal detection instruction. If the first execution result matches the expected result of the internal detection instruction, the error detection module determines that the processor has no error and may discard the first execution result; the processor may then execute the embodiment shown in FIG. 8 again to continue error detection. If the first execution result does not match the expected result of the internal detection instruction, the error detection module determines that the processor has an error.
For example, the error detection module determines whether the first execution result is the same as the expected result of the internal detection instruction, and if the first execution result is the same as the expected result of the internal detection instruction, indicates that the first execution result matches the expected result of the internal detection instruction; if the first execution result is not the same as the expected result of the internal detection instruction, the first execution result is not matched with the expected result of the internal detection instruction.
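The comparison in S805 can be illustrated with a small Python sketch. The tuple encoding of an instruction, the `execute` and `faulty_execute` stand-ins, and the injected single-bit fault are all assumptions introduced for the example; they are not the application's instruction format or fault model.

```python
from typing import Callable, Tuple

Instruction = Tuple[str, int, int]  # (operation, operand_a, operand_b), assumed encoding


def execute(instruction: Instruction) -> int:
    """Stand-in for a healthy instruction execution module."""
    op, a, b = instruction
    if op == "add":
        return a + b
    if op == "mul":
        return a * b
    raise ValueError(f"unsupported operation: {op}")


def faulty_execute(instruction: Instruction) -> int:
    """Stand-in for a defective module: flips the lowest bit of every result."""
    return execute(instruction) ^ 1


def processor_has_error(
    run: Callable[[Instruction], int],
    instruction: Instruction,
    expected_result: int,
) -> bool:
    """S805: an error exists when the first execution result differs from the expected result."""
    first_execution_result = run(instruction)
    return first_execution_result != expected_result
```

Here "matching" is taken to mean exact equality, which is the example given in the text; other matching rules are not covered by this sketch.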
S806, the instruction execution module reads the external program instruction from the external storage module.
The external storage module is, for example, the external storage module 540 shown in fig. 5. The meaning of the external program instructions may be referred to in the foregoing discussion and will not be described in detail herein.
S807, the instruction execution module decodes the external program instruction to obtain a decoded result, and executes the decoded result to obtain a second execution result.
S808, the instruction execution module writes the second execution result into the external storage module.
Optionally, the operating system may obtain the second execution result from the external storage module, and further present the second execution result to the user.
As one example, S806 to S808 are optional steps.
For example, referring to fig. 10, a schematic process diagram of processing an internal detection instruction and an external program instruction according to an embodiment of the present application is provided.
As shown in fig. 10, the path for processing the internal detection instruction includes: error detection module → instruction execution module → error detection module. The path for processing the external program instructions includes: application → external storage module → instruction execution module → external storage module → application. It can be seen that the path for processing the internal detection instruction in the embodiment of the present application is different from, and simpler than, the path for processing the external program instruction.
In order to prevent the processor from continuing to produce errors, optionally, when it is determined that the processor has an error, a control module in the processor may shut down the processor. Alternatively, in the case where the processor in FIG. 8 is applied to a computing device, the operating system of the computing device may shut down the processor.
In the case where the processor of fig. 8 is applied to a computing device, in order to make it easy for the user to view errors of the processor, optionally, the error detection module may provide alarm information to a management module in the computing device when it determines that the processor has an error. The alarm information is used to indicate that the processor has an error. Furthermore, the management module can display the alarm information. Thus, the user can learn in time that the processor has an error.
In one possible implementation, where the processor in fig. 8 is a multi-core processor, the internal detection instruction may be executed by an instruction execution module corresponding to one of the processor cores (e.g., the first processor core). Accordingly, the error detection module may determine that the first processor core has an error. In this way, the processor may pinpoint which processor core is in particular faulty.
Alternatively, in the case where the error detection module determines that the first processor core has an error, the error detection module may generate alarm information and send the alarm information to the management module. The alert information is used to indicate that the processor is in error.
Optionally, in the case that the error detection module determines that the first processor core has an error, the control module corresponding to the first processor core may control to shut down the first processor core. Alternatively, in the case where the processor in fig. 8 is applied to a computing device, the operating system of the computing device may control the first processor core to be turned off. In this manner, the other processor cores of the processor may still function properly.
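The per-core handling described above can be sketched in Python as follows. Each core is modeled as a callable that executes the same detection instruction; the core names, the `find_faulty_cores` and `shut_down` helpers, and the use of a two-operand function are hypothetical simplifications.

```python
from typing import Callable, Dict, List


def find_faulty_cores(
    cores: Dict[str, Callable[[int, int], int]],
    a: int,
    b: int,
    expected: int,
) -> List[str]:
    """Run the same detection instruction on every core and list the cores
    whose result differs from the expected result."""
    return [name for name, run in cores.items() if run(a, b) != expected]


def shut_down(active: List[str], faulty: List[str]) -> List[str]:
    """Return the cores that stay active after the faulty ones are shut down."""
    return [core for core in active if core not in faulty]
```

Because the result of each core is checked separately, only the faulty core is removed from service and the remaining cores keep running.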
In the embodiment of the application, the processor can detect whether it has an error according to an instruction inside the processor. The processor does not need to decode a program instruction or send the execution result of the program instruction to an external inspection tool, so the work generated for the processor is reduced, which helps reduce the processing load of the processor. In addition, the processor can perform error detection when its load is small, which reduces the impact of the error detection process on the execution of external program instructions. Moreover, when the processor is a multi-core processor, the processor can determine which specific processor core has an error. In addition, when the processor has an error, alarm information can be reported and/or the processor can be shut down, so that the error is handled in time and a larger impact is avoided. Furthermore, executing the internal detection instruction does not require an operating system to load an application, which reduces the processing load of the operating system. That is, a computing device can perform processor error detection without an operating system installed. Because a computing device in a cloud scenario may not have an operating system installed, the processor error detection method in the embodiment of the application is well suited to cloud scenarios such as cloud computing and/or cloud storage, for example the cloud data center described above.
Fig. 11 is a flowchart of a method for detecting a processor error according to an embodiment of the present application. In the embodiment shown in fig. 11, the processor in fig. 8 is, specifically, the processor 400 shown in fig. 4. The processor in the embodiment shown in fig. 11 includes an error detection module and an instruction execution module, and the error detection module includes an instruction acquisition sub-module, an instruction identification sub-module, and an error judgment sub-module.
S1101, the instruction acquisition sub-module adds a mark to the internal detection instruction.
Wherein the specific content of the mark and the added mark can be referred to the content discussed above.
Alternatively, the instruction acquisition sub-module may send the tag to the instruction identification sub-module so that the instruction identification sub-module subsequently identifies the instruction for detecting the processor error based on the tag.
S1102, the instruction obtaining sub-module sends an internal detection instruction to the instruction execution module. Correspondingly, the instruction execution module receives the internal detection instruction from the instruction acquisition sub-module.
For example, the instruction acquisition sub-module may send an internal detection instruction to the instruction execution module if it is determined that the detection switch is in an on state and/or if it is determined that the load of the processor is small.
Wherein the detection switch and the determination that the detection switch is in the on state are discussed above.
As an example, the processor in the embodiment shown in fig. 11 further includes a load judgment sub-module, such as load determination sub-module 424 in fig. 4. When the load judgment sub-module determines that the load of the processor is small, it sends first indication information to the instruction acquisition sub-module. The first indication information is used for indicating that the load of the processor is small. The instruction acquisition sub-module receives the first indication information and performs the step of S1102.
In the case that the load of the processor is large, the load judgment sub-module may send second indication information to the instruction acquisition sub-module. The second indication information is used for indicating that the remaining space in the buffer queue of the processor is smaller than the threshold. The instruction acquisition sub-module receives the second indication information and does not execute the step of S1102. Alternatively, in the case that the load of the processor is large, the load judgment sub-module does not need to send any indication information to the instruction acquisition sub-module, and the instruction acquisition sub-module by default executes the step of S1102 only when triggered by the first indication information.
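The two signalling variants above (sending second indication information, or staying silent under a large load) can be modeled as in the following Python sketch. The `Indication` enum, the `silent_when_large` parameter, and the function names are hypothetical and only illustrate the handshake between the load judgment sub-module and the instruction acquisition sub-module.

```python
from enum import Enum
from typing import Optional


class Indication(Enum):
    LOAD_SMALL = 1  # stands for the "first indication information"
    LOAD_LARGE = 2  # stands for the "second indication information"


def load_judgment(
    remaining_space: int,
    threshold: int,
    silent_when_large: bool = False,
) -> Optional[Indication]:
    """Load judgment sub-module: report the load, or stay silent when it is large."""
    if remaining_space >= threshold:
        return Indication.LOAD_SMALL
    return None if silent_when_large else Indication.LOAD_LARGE


def should_run_s1102(indication: Optional[Indication]) -> bool:
    """Instruction acquisition sub-module: S1102 is triggered only by the first indication."""
    return indication is Indication.LOAD_SMALL
```

In both variants the outcome is the same: S1102 runs only after the first indication information arrives.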
S1103, the instruction execution module executes the internal detection instruction to obtain a first execution result.
The content of the first execution result may be referred to the content discussed previously.
In the case where the instruction obtaining submodule executes S1101, the instruction execution module may add a flag to the first execution result.
S1104, the instruction execution module sends a first execution result to the instruction identification sub-module. Correspondingly, the instruction identification sub-module receives a first execution result from the instruction execution module.
S1105, the instruction recognition submodule determines a first execution result according to the mark.
S1106, the instruction identification sub-module sends the first execution result to the error judgment sub-module. Correspondingly, the error judgment sub-module receives the first execution result from the instruction identification sub-module.
For example, the instruction execution module sends all execution results to the instruction recognition sub-module, which recognizes the first execution result according to the flag of the first execution result.
For another example, the instruction execution module may directly send the first execution result to the instruction recognition sub-module according to the flag of the first execution result, which is also equivalent to the instruction recognition sub-module determining the first execution result.
S1107, the error judgment sub-module determines whether the processor has an error according to a matching result of the first execution result and an expected result of the internal detection instruction.
The manner in which the error judgment sub-module determines whether the processor has an error may refer to the foregoing.
Optionally, if the error judgment sub-module determines that the processor has an error, it may provide alarm information to the management module. The content of the management module and the alarm information may refer to the content discussed above.
S1108, the instruction execution module acquires external program instructions from the external storage module.
The meaning of the external program instructions may be referred to in the foregoing discussion.
S1109, the instruction execution module decodes the external program instruction to obtain a decoded result, and executes the decoded result to obtain a second execution result.
S1110, the instruction execution module sends a second execution result to the external storage module.
As one example, S1108-S1110 are optional steps.
For example, referring to fig. 12, a process diagram of processing an internal detection instruction and an external program instruction according to an embodiment of the present application is provided.
As shown in fig. 12, the path for processing the internal detection instruction includes: instruction acquisition sub-module → instruction execution module → instruction identification sub-module → error judgment sub-module. The path for processing the external program instructions includes: application → external storage module → instruction execution module → application. It can be seen that the path for processing the internal detection instruction in the embodiment of the present application is different from, and simpler than, the path for processing the external program instruction.
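The internal detection path of fig. 11 can be traced end to end with the following Python sketch. Every name here (`MARK`, `Result`, the four functions named after the sub-modules, and `detect`) is an illustrative assumption, and the arithmetic function passed as `alu` merely stands in for whatever the instruction execution module computes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MARK = "detect"  # hypothetical mark value


@dataclass
class Result:
    value: int
    mark: Optional[str] = None


def instruction_acquisition(operands: Tuple[int, int]) -> Dict:
    """S1101/S1102: attach the mark and hand the instruction to execution."""
    return {"operands": operands, "mark": MARK}


def instruction_execution(instruction: Dict, alu: Callable[[int, int], int]) -> Result:
    """S1103/S1104: execute the instruction and copy its mark onto the result."""
    a, b = instruction["operands"]
    return Result(alu(a, b), instruction.get("mark"))


def instruction_identification(results: List[Result]) -> Optional[Result]:
    """S1105/S1106: pick out the result that carries the mark."""
    return next((r for r in results if r.mark == MARK), None)


def error_judgment(result: Result, expected: int) -> bool:
    """S1107: a mismatch with the expected result means the processor has an error."""
    return result.value != expected


def detect(alu: Callable[[int, int], int], operands: Tuple[int, int], expected: int) -> bool:
    marked = instruction_acquisition(operands)
    # Result(7) stands for an unmarked result of an external program instruction.
    results = [Result(7), instruction_execution(marked, alu)]
    first = instruction_identification(results)
    return error_judgment(first, expected)
```

Calling `detect` with a correct multiplier and the expected product reports no error, while a multiplier that is off by one reports an error.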
The embodiment of the application can be applied to the processor shown in fig. 4. In this embodiment, the error detection module may include an instruction acquisition sub-module, a load judgment sub-module, an instruction identification sub-module and an error judgment sub-module, which provides a scheme for detecting processor errors. In addition, in this embodiment, the instruction acquisition sub-module, the load judgment sub-module, the instruction identification sub-module, the error judgment sub-module and the instruction execution module in the processor cooperate to detect an error of the processor, so the processor does not need to decode a program instruction or send the execution result of the program instruction to an external inspection tool, which helps reduce the processing load of the processor. Moreover, the instruction acquisition sub-module can send the internal detection instruction to the instruction execution module when the load of the processor is small, which avoids adding to the processing load when the load of the processor is already large.
The embodiment of the application provides a computing device cluster. Referring to fig. 13, a computing device cluster according to an embodiment of the present application includes at least one computing device 1300, where any two computing devices 1300 communicate through a communication network.
As shown in fig. 13, computing device 1300 includes a processor 1301 and a power supply circuit 1302. The power supply circuit 1302 is used to power the processor 1301. The processor 1301 in the computing device 1300 may be used to implement any of the foregoing methods of detecting a processor error, e.g., the method of detecting a processor error in the embodiment shown in fig. 8 or 11, and may also implement the functions of any of the foregoing processors. The structure of the processor 1301 may refer to the structure of the processor in fig. 2, 3 or 4.
Optionally, the computing device 1300 further includes a memory 1303 and a communication interface 1304, the memory 1303 and the communication interface 1304 being illustrated in dashed boxes in fig. 13.
Wherein the processor 1301 and the communication interface 1304 are coupled to each other. It is understood that the communication interface 1304 may be a transceiver or an input-output interface.
The memory 1303 may be used to store external program instructions executed by the processor 1301, or to store input data required by the processor 1301 to execute the external program instructions, or to store data generated after the processor 1301 executes the instructions.
As one example, the cluster of computing devices shown in fig. 13 may be used to implement the functionality of the cloud data center in fig. 6 or fig. 7.
The embodiment of the application provides a chip system, which comprises: a processor and an interface. Wherein the processor is configured to invoke and execute instructions from the interface, and when the processor executes the instructions, implement any of the methods of detecting processor errors described above, such as the methods of detecting processor errors in the embodiments shown in fig. 8 or 11.
Embodiments of the present application provide a computer readable storage medium storing a computer program or instructions that, when executed, implement a method of detecting a processor error of any of the foregoing, for example, the method of detecting a processor error in the embodiment shown in fig. 8 or 11.
Embodiments of the present application provide a computer program product comprising instructions which, when executed on a computer, implement a method of detecting a processor error of any of the foregoing, for example, the method of detecting a processor error in the embodiments shown in fig. 8 or 11.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may be comprised of corresponding software modules that may be stored in random access memory, flash memory, read only memory, programmable read only memory, erasable programmable read only memory, electrically erasable programmable read only memory, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a base station or terminal. The processor and the storage medium may also reside as discrete components in a base station or terminal.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user device, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium, e.g., a floppy disk, hard disk, or tape; an optical medium, e.g., a digital video disc; or a semiconductor medium, e.g., a solid state disk. The computer readable storage medium may be a volatile or nonvolatile storage medium, or may include both volatile and nonvolatile types of storage medium.
In the various embodiments of the application, if there is no specific description or logical conflict, terms and/or descriptions between the various embodiments are consistent and may reference each other, and features of the various embodiments may be combined to form new embodiments according to their inherent logical relationships.
It will be appreciated that the various numerals referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application. The sequence numbers of the processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic.

Claims (23)

1. A processor comprising an error detection module and an instruction execution module, wherein:
the error detection module is used for sending an internal detection instruction to the instruction execution module;
the instruction execution module is used for executing the internal detection instruction, obtaining a first execution result and sending the first execution result to the error detection module;
and the error detection module is used for determining whether the processor has an error according to the first execution result and an expected result corresponding to the internal detection instruction.
2. The processor of claim 1, wherein the instruction execution module is further to:
reading an external program instruction from an external storage module;
decoding the external program instruction to obtain a decoded result;
executing the decoded result to obtain a second execution result;
and writing the second execution result into the external storage module.
3. The processor of claim 1 or 2, wherein the error detection module is further configured to:
before sending an internal detection instruction to the instruction execution module, determining that the remaining space in a buffer queue of the processor is greater than or equal to a threshold value, wherein the buffer queue is used for buffering the instruction to be processed by the processor.
4. A processor according to any one of claims 1-3, wherein,
the error detection module is further configured to add a flag to the internal detection instruction, where the flag indicates an instruction for detecting the processor error;
the instruction execution module is further configured to add the tag to the first execution result;
the error detection module is further configured to identify, according to the flag in the first execution result, the first execution result corresponding to the internal detection instruction.
5. The processor of any one of claims 1-4, further comprising a register storing an expected result for the internal detection instruction;
the error detection module is further configured to read an expected result corresponding to the internal detection instruction from the register.
6. The processor of any one of claims 1-5, wherein the error detection module is further configured to obtain the internal detection instruction from instructions that have been executed by the processor; or
the processor further comprises a first storage module storing the internal detection instruction, the first storage module allowing the internal detection instruction to be read by the processor, and the error detection module is further configured to read the internal detection instruction from the first storage module; or
the processor further comprises a second storage module storing the internal detection instruction, the second storage module allowing the internal detection instruction to be read and written by the processor, and the error detection module is further configured to read the internal detection instruction from the second storage module.
7. The processor of any one of claims 1-6, further comprising at least one processor core, one of the at least one processor core corresponding to the instruction execution module; the error detection module is specifically configured to:
determine whether the processor core corresponding to the instruction execution module has an error according to the first execution result and an expected result corresponding to the internal detection instruction.
8. The processor of any one of claims 1-7, wherein the error detection module is further configured to:
determine that a detection switch in a management module is in an on state, wherein the detection switch is used for indicating whether to detect errors of the processor, and the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor, and an operating system.
9. The processor of any one of claims 1-8, wherein the error detection module is further configured to provide, when determining that the processor has an error, alarm information to a management module, the management module including one or more of a baseboard management controller, a firmware system corresponding to the processor, and an operating system, the alarm information being used for indicating that the processor has an error; and/or
the processor further comprises a control module configured to shut down the processor when the error detection module determines that the processor has an error.
10. A method of detecting processor errors, comprising:
executing an internal detection instruction by the processor to obtain a first execution result;
the processor determines whether the processor has an error according to the first execution result and an expected result corresponding to the internal detection instruction.
11. The method according to claim 10, wherein the method further comprises:
the processor reads external program instructions from an external storage module;
the processor decodes the external program instruction to obtain a decoded result;
the processor executes the decoded result to obtain a second execution result;
the processor writes the second execution result into the external storage module.
12. The method according to claim 10 or 11, characterized in that the method further comprises:
the processor determines that the remaining space in a buffer queue is greater than or equal to a threshold, the buffer queue being configured to buffer instructions to be processed by the processor.
13. The method according to any one of claims 10-12, further comprising:
the processor adds a tag to the internal detection instruction, the tag indicating that the instruction is used for detecting errors of the processor;
the processor adds the tag to the first execution result;
and the processor identifies the first execution result corresponding to the internal detection instruction according to the tag in the first execution result.
14. The method of any of claims 10-13, wherein the processor includes a register storing an expected result corresponding to the internal detection instruction; the method further comprises the steps of:
and the processor reads the expected result corresponding to the internal detection instruction from the register.
15. The method according to any one of claims 10-14, further comprising:
the processor obtains the internal detection instruction from instructions that have been executed by the processor; or
the processor further comprises a first storage module storing the internal detection instruction, the first storage module allowing the internal detection instruction to be read by the processor, and the processor reads the internal detection instruction from the first storage module; or
the processor further comprises a second storage module storing the internal detection instruction, the second storage module allowing the internal detection instruction to be read and written by the processor, and the processor reads the internal detection instruction from the second storage module.
16. The method of any of claims 10-15, wherein the processor comprises at least one processor core; the processor determining whether the processor has an error according to the first execution result and an expected result corresponding to the internal detection instruction, including:
the processor determines whether a processor core used for obtaining the first execution result has an error according to the first execution result and the expected result corresponding to the internal detection instruction.
17. The method according to any one of claims 10-16, further comprising:
the processor determines that a detection switch in a management module is in an on state, the detection switch is used for indicating whether to detect errors of the processor, and the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor and an operating system.
18. The method according to any one of claims 10-17, further comprising:
when the processor determines that the processor has an error, the processor provides alarm information to a management module, wherein the management module comprises one or more of a baseboard management controller, a firmware system corresponding to the processor, and an operating system, and the alarm information is used for indicating that the processor has an error; and/or
the processor shuts down the processor when it is determined that the processor has an error.
19. A computing device comprising the processor of any of claims 1-9.
20. A computing device comprising a processor and a power supply circuit, the power supply circuit supplying power to the processor, and the processor being configured to perform the method of any of claims 10-18.
21. A cluster of computing devices, comprising at least one computing device, each computing device performing the method of any of claims 10-18.
22. A computer program product containing instructions that, when executed by a computing device, cause the computing device to perform the method of any of claims 10-18.
23. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program or instructions which, when executed, implement the method of any of claims 10-18.
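
By way of non-limiting illustration, the detection flow recited in claims 1, 3 to 5 and 8 may be modelled in software roughly as follows. This is a simplified Python sketch under stated assumptions and is not the claimed hardware: every name in it (Instruction, Result, InstructionExecutionModule, ErrorDetectionModule, detect, switch_on, and so on) is hypothetical, the register holding expected results is modelled as a dictionary, the buffer queue is modelled as a deque with an assumed capacity, and the fault is injected artificially for demonstration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    op: str
    a: int
    b: int
    is_detection: bool = False  # hypothetical tag marking an internal detection instruction


@dataclass(frozen=True)
class Result:
    value: int
    is_detection: bool  # tag copied from the instruction onto its execution result


class InstructionExecutionModule:
    """Toy execution unit; `faulty` injects an artificial multiplier defect."""

    def __init__(self, faulty: bool = False):
        self.faulty = faulty

    def execute(self, instr: Instruction) -> Result:
        if instr.op == "add":
            value = instr.a + instr.b
        elif instr.op == "mul":
            value = instr.a * instr.b
            if self.faulty:
                value ^= 1  # assumed fault model: lowest result bit is flipped
        elif instr.op == "xor":
            value = instr.a ^ instr.b
        else:
            raise ValueError(f"unknown op: {instr.op}")
        return Result(value, instr.is_detection)


class ErrorDetectionModule:
    # Stand-in for the register that stores the expected results.
    EXPECTED = {("add", 2, 3): 5, ("mul", 6, 7): 42, ("xor", 10, 6): 12}

    def __init__(self, executor, queue, capacity, threshold, switch_on=True):
        self.executor = executor
        self.queue = queue          # pending instructions of the processor
        self.capacity = capacity    # assumed total size of the buffer queue
        self.threshold = threshold  # minimum free space required before testing
        self.switch_on = switch_on  # detection switch set by a management module

    def detect(self):
        """Return True (error found), False (no error) or None (detection skipped)."""
        if not self.switch_on:
            return None
        if self.capacity - len(self.queue) < self.threshold:
            return None  # too little free space; avoid adding load to a busy processor
        for (op, a, b), expected in self.EXPECTED.items():
            result = self.executor.execute(Instruction(op, a, b, is_detection=True))
            # Only results carrying the tag are compared with the expected result.
            if result.is_detection and result.value != expected:
                return True
        return False


if __name__ == "__main__":
    idle_queue = deque()
    healthy = ErrorDetectionModule(InstructionExecutionModule(), idle_queue, 8, 2)
    broken = ErrorDetectionModule(InstructionExecutionModule(faulty=True), idle_queue, 8, 2)
    print(healthy.detect(), broken.detect())
```

In this sketch a skipped detection is reported as None rather than False, so that a caller can distinguish "no error found" from "no test was run"; that distinction is a design choice of the illustration and is not stated in the claims.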
CN202211080632.4A 2022-09-05 2022-09-05 Processor and method for detecting processor errors Pending CN117687848A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211080632.4A CN117687848A (en) 2022-09-05 2022-09-05 Processor and method for detecting processor errors
PCT/CN2023/098504 WO2024051231A1 (en) 2022-09-05 2023-06-06 Processor and processor error detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211080632.4A CN117687848A (en) 2022-09-05 2022-09-05 Processor and method for detecting processor errors

Publications (1)

Publication Number Publication Date
CN117687848A true CN117687848A (en) 2024-03-12

Family

ID=90127038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211080632.4A Pending CN117687848A (en) 2022-09-05 2022-09-05 Processor and method for detecting processor errors

Country Status (2)

Country Link
CN (1) CN117687848A (en)
WO (1) WO2024051231A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07113898B2 (en) * 1989-05-09 1995-12-06 株式会社日立製作所 Failure detection method
US7463992B2 (en) * 2006-09-29 2008-12-09 Intel Corporation Method and system to self-test single and multi-core CPU systems
US8122312B2 (en) * 2009-04-14 2012-02-21 International Business Machines Corporation Internally controlling and enhancing logic built-in self test in a multiple core microprocessor
GB2549280B (en) * 2016-04-11 2020-03-11 Advanced Risc Mach Ltd Self-testing in a processor core

Also Published As

Publication number Publication date
WO2024051231A1 (en) 2024-03-14

Legal Events

Date Code Title Description
PB01 Publication