WO2024051231A1 - Processor and processor error detection method - Google Patents

Processor and processor error detection method

Publication number
WO2024051231A1
WO2024051231A1 PCT/CN2023/098504 CN2023098504W WO2024051231A1 WO 2024051231 A1 WO2024051231 A1 WO 2024051231A1 CN 2023098504 W CN2023098504 W CN 2023098504W WO 2024051231 A1 WO2024051231 A1 WO 2024051231A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
module
instruction
instructions
error
Prior art date
Application number
PCT/CN2023/098504
Other languages
French (fr)
Chinese (zh)
Inventor
刘辉
俞洲
杨肖
邹文
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2024051231A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular, to a processor and a method for detecting processor errors.
  • the processor is one of the main components of various computing devices and is used to execute instructions.
  • the processor may make errors while executing instructions.
  • one way to detect errors in the execution of instructions by a processor is to use an inspection tool to test whether there are errors in the execution of instructions in the processor.
  • the inspection tool can send program instructions to the processor, the processor decodes and executes the program instructions, and feeds back the execution results of the program instructions to the inspection tool.
  • the inspection tool compares the execution result with the pre-stored expected result of the program instruction. If the execution result does not match the expected result of the program instruction, it is determined that the processor has an instruction execution error.
  • the processor needs to receive program instructions from the inspection tool, decode the program instructions, and feed back the execution results of the program instructions to the inspection tool, so the processing load on the processor is large.
  • Embodiments of the present application provide a processor and a method for detecting processor errors, which are used to reduce the processing volume in the process of detecting processor errors.
  • In a first aspect, embodiments of the present application provide a processor, including an error detection module and an instruction execution module, wherein: the error detection module is used to send internal detection instructions to the instruction execution module; the instruction execution module is used to execute the internal detection instruction, obtain a first execution result, and send the first execution result to the error detection module; and the error detection module is further used to determine whether there is an error in the processor according to the first execution result and the expected result corresponding to the internal detection instruction.
  • the instruction execution module can directly execute the internal detection instruction from the error detection module to obtain the first execution result.
  • the error detection module compares the first execution result with the expected result corresponding to the internal detection instruction to determine whether the processor has an error. In effect, the processor executes instructions held within the processor and then determines whether it has an error.
  • there is no need for the processor to decode the instructions or to send the first execution result to an external inspection tool, which helps reduce the processing time of the processor and save its computing resources.
  • there is no need to use external inspection tools which helps reduce the cost of detecting processor errors.
  • when the processor detects processor errors, it executes instructions held in the processor, and the processor can accommodate various types of instructions, which helps widen the range of instruction types to which this method applies.
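  • The self-check described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design: the class names, the `(opcode, a, b)` instruction format, and the `stuck_bit` fault injection are all hypothetical.

```python
import operator

# Hypothetical instruction format: (opcode, operand_a, operand_b).
OPS = {"add": operator.add, "mul": operator.mul, "xor": operator.xor}


class InstructionExecutionModule:
    """Toy stand-in for the execution unit; `stuck_bit` injects a fault."""

    def __init__(self, stuck_bit=None):
        self.stuck_bit = stuck_bit

    def execute(self, instruction):
        op, a, b = instruction
        result = OPS[op](a, b)
        if self.stuck_bit is not None:
            result |= 1 << self.stuck_bit  # simulate a result bit stuck at 1
        return result


class ErrorDetectionModule:
    """Sends internal detection instructions and checks each result."""

    def __init__(self, executor, checks):
        self.executor = executor
        self.checks = checks  # list of (instruction, expected_result)

    def processor_has_error(self):
        for instruction, expected in self.checks:
            first_result = self.executor.execute(instruction)
            if first_result != expected:
                return True
        return False


CHECKS = [(("add", 2, 3), 5), (("mul", 4, 4), 16), (("xor", 6, 6), 0)]
healthy = ErrorDetectionModule(InstructionExecutionModule(), CHECKS)
faulty = ErrorDetectionModule(InstructionExecutionModule(stuck_bit=1), CHECKS)
```

  • In this model no instruction is decoded and no result leaves the processor object, which mirrors the saving the text attributes to the internal check.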
  • the instruction execution module is also used to: read external program instructions from an external storage module, decode the external program instructions to obtain a decoded result, execute the decoded result to obtain a second execution result, and write the second execution result into the external storage module.
  • the instruction execution module in the processor can also execute external program instructions in the external storage module.
  • when the processor executes the internal detection instructions, this can also affect the execution of external program instructions.
  • the error detection module is further configured to: before sending an internal detection instruction to the instruction execution module, determine that the remaining space in the buffer queue of the processor is greater than or equal to a threshold, where the buffer queue is used to cache instructions to be processed by the processor.
  • the error detection module can send internal detection instructions to the instruction execution module when the load on the processor is relatively small, so that the instruction execution module executes the internal detection instructions when the load is relatively small. In this way, the impact of executing internal detection instructions on the execution of external program instructions is reduced, which helps allocate the computing resources of the processor reasonably.
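  • The buffer-queue condition can be illustrated with a bounded queue. This is a sketch under assumptions: the `BufferQueue` class, its `capacity`, and the `try_inject` helper are hypothetical names, and the source does not say how the remaining space is measured.

```python
from collections import deque


class BufferQueue:
    """Bounded queue of pending instructions (the capacity is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def remaining_space(self):
        return self.capacity - len(self.pending)

    def push(self, instruction):
        if self.remaining_space() <= 0:
            raise OverflowError("buffer queue is full")
        self.pending.append(instruction)


def try_inject(queue, detection_instruction, threshold):
    """Enqueue the detection instruction only if enough space remains."""
    if queue.remaining_space() >= threshold:
        queue.push(detection_instruction)
        return True
    return False
```

  • With a capacity of 4 and a threshold of 2, a detection instruction is accepted while the queue is nearly empty and refused once only one slot is left, so external program instructions keep priority under load.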
  • the error detection module is also configured to add a mark to the internal detection instruction, where the mark indicates an instruction used to detect errors of the processor; the instruction execution module is also configured to add the mark to the first execution result; the error detection module is further configured to identify the first execution result corresponding to the internal detection instruction based on the mark in the first execution result.
  • the error detection module can add a mark to the internal detection instruction, so that the error detection module can subsequently distinguish the first execution result corresponding to the instruction used to detect processor errors.
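  • The marking scheme can be sketched as follows. The field name `is_detection` and the two-operand instruction shape are assumptions made for illustration; the source only says that a mark is added to the instruction and carried into the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str
    operands: tuple
    is_detection: bool = False  # the "mark" (hypothetical field name)


@dataclass(frozen=True)
class ExecutionResult:
    value: int
    is_detection: bool  # mark copied over from the instruction


def mark(instruction):
    """Error detection module: tag an instruction as a detection instruction."""
    return Instruction(instruction.opcode, instruction.operands, True)


def execute(instruction):
    """Execution module: compute the result and carry the mark across."""
    a, b = instruction.operands
    value = a + b if instruction.opcode == "add" else a - b
    return ExecutionResult(value, instruction.is_detection)


def detection_results(results):
    """Error detection module: keep only the results that carry the mark."""
    return [r for r in results if r.is_detection]
```

  • When marked and unmarked instructions share one execution module, filtering on the mark separates first execution results from the results of external program instructions.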
  • the processor further includes a register, which stores the expected results corresponding to the internal detection instructions; the error detection module is also configured to read the expected result corresponding to the internal detection instruction from the register.
  • the expected result corresponding to the internal detection instruction can be configured in the register manually or by the processor, so that the error detection module can quickly obtain the expected result and then quickly detect whether the processor has an error.
  • the error detection module is further configured to obtain the internal detection instructions from instructions that have been executed by the processor; or,
  • the processor further includes a first storage module that stores the internal detection instructions, the first storage module is allowed to be read by the processor, and the error detection module is also used to read the internal detection instructions from the first storage module; or,
  • the processor further includes a second storage module that stores the internal detection instructions, the second storage module is allowed to be read and written by the processor, and the error detection module is also used to read the internal detection instructions from the second storage module.
  • in the first way, the error detection module can sample the internal detection instructions from instructions that have been executed by the processor.
  • this way of obtaining the internal detection instructions is simple and direct.
  • in the second way, the error detection module can read the internal detection instructions from the first storage module, and the instructions in the first storage module can be manually configured.
  • in the third way, the error detection module can read the internal detection instructions from the second storage module. Unlike the first storage module in the second way, the second storage module in the third way supports writes by the processor, which makes it easy for the processor, or an external device acting through the processor, to add instructions to the second storage module.
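  • The three ways of obtaining internal detection instructions can be contrasted in a short sketch. The class and function names are hypothetical, and random sampling is only one assumed way to pick from already-executed instructions.

```python
import random


def from_executed(executed_instructions, rng):
    """Way 1: sample an instruction the processor has already executed."""
    return rng.choice(executed_instructions)


class ReadOnlyStore:
    """Way 2: first storage module, readable but not writable."""

    def __init__(self, instructions):
        self._instructions = tuple(instructions)

    def read(self, index):
        return self._instructions[index]


class ReadWriteStore(ReadOnlyStore):
    """Way 3: second storage module, which also accepts new instructions."""

    def __init__(self, instructions=()):
        self._instructions = list(instructions)

    def write(self, instruction):
        self._instructions.append(instruction)
```

  • The only difference between the two stores is the `write` method, which is what lets the set of detection instructions grow after deployment in the third way.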
  • the processor further includes at least one processor core, and one of the at least one processor core corresponds to the instruction execution module; the error detection module is specifically used to: determine, according to the first execution result and the expected result corresponding to the internal detection instruction, whether there is an error in the processor core corresponding to the instruction execution module.
  • the above embodiments may be applied when the processor includes one or more processor cores. If the processor includes multiple processor cores, the processor core that has the error can be determined from the instruction execution module that produced the first execution result. In this way, the faulty processor core in the processor can be accurately located. Moreover, in this embodiment, one error detection module can be used to detect errors on multiple processor cores, which helps reduce the cost of detecting processor errors.
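  • Locating a faulty core with a single error detection module can be sketched as below. Modelling each core as a callable keyed by a core id is an assumption for illustration, as is the injected off-by-one fault.

```python
def locate_faulty_cores(cores, instruction, expected):
    """Run one detection instruction on every core and list the mismatches.

    `cores` maps a core id to a callable standing in for that core's
    instruction execution module (a hypothetical modelling choice).
    """
    return [
        core_id
        for core_id, execute in cores.items()
        if execute(instruction) != expected
    ]


def good_core(instruction):
    a, b = instruction
    return a * b


def bad_core(instruction):
    a, b = instruction
    return a * b + 1  # injected fault: off-by-one result


cores = {0: good_core, 1: bad_core, 2: good_core}
faulty = locate_faulty_cores(cores, (6, 7), 42)
```

  • Because each result is tied to the core that produced it, a mismatch points at a specific core rather than at the processor as a whole.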
  • the error detection module is further configured to: determine that a detection switch in the management module is in an on state, and the detection switch is used to indicate whether to detect errors of the processor.
  • the management module includes one or more of a baseboard management controller, a firmware system or an operating system corresponding to the processor.
  • both the management module and the processor can be provided in the computing device, and the management module can be configured with a detection switch.
  • the error detection module performs the processor error detection process when it determines that the detection switch in the management module is in the on state, which provides a flexible way to enable detection of processor errors.
  • the processor further includes a control module; the error detection module is further configured to provide alarm information to a management module when it is determined that there is an error in the processor, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information is used to indicate that the processor has an error; and/or the control module is used to control the processor to shut down when it is determined that there is an error in the processor.
  • when the error detection module determines that there is an error in the processor, it can provide alarm information to the management module, so that the management module can present the alarm information and the user can learn in time that there is an error in the processor.
  • the control module in the processor can also control the shutdown of the processor to prevent the processor from continuing to execute instructions incorrectly.
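  • The detection switch, the alarm, and the shutdown can be combined in one sketch. All names here are hypothetical, and the `shut_down_on_error` flag is an assumption standing in for the "and/or" choice between alarming and shutting down.

```python
class ManagementModule:
    """Stand-in for a BMC, firmware system, or operating system."""

    def __init__(self, detection_enabled):
        self.detection_enabled = detection_enabled  # the detection switch
        self.alarms = []

    def raise_alarm(self, message):
        self.alarms.append(message)


class Processor:
    def __init__(self):
        self.running = True

    def shut_down(self):
        self.running = False


def run_detection(manager, processor, first_result, expected, shut_down_on_error):
    """Return True if a check ran and found an error, else False."""
    if not manager.detection_enabled:
        return False  # switch is off: skip detection entirely
    if first_result == expected:
        return False
    manager.raise_alarm("processor error detected")
    if shut_down_on_error:
        processor.shut_down()
    return True
```

  • With the switch off, even a mismatching result raises no alarm, which matches the role the text gives the switch as the gate for the whole detection process.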
  • In a second aspect, embodiments of the present application provide a method for detecting processor errors.
  • the method may be executed by a processor or by a computing device including the processor.
  • the processor executes the internal detection instruction to obtain a first execution result; the processor determines whether there is an error in the processor based on the first execution result and the expected result corresponding to the internal detection instruction.
  • the method further includes: reading external program instructions from the external storage module, decoding the external program instructions to obtain a decoded result, executing the decoded result to obtain a second execution result, and writing the second execution result into the external storage module.
  • the method further includes: the processor determining that the remaining space in the buffer queue is greater than or equal to a threshold, and the buffer queue is used to cache instructions to be processed by the processor.
  • the method further includes: the processor adds a mark to the internal detection instruction, the mark indicating an instruction for detecting an error of the processor; the processor adds the mark to the first execution result; the processor identifies the first execution result corresponding to the internal detection instruction based on the mark in the first execution result.
  • the processor includes a register, and the register stores expected results corresponding to the internal detection instructions; the method further includes: the processor reads the expected result corresponding to the internal detection instruction from the register.
  • the method further includes: the processor obtains the internal detection instruction from instructions that have been executed by the processor; or, the processor further includes a first storage module that stores the internal detection instructions, the first storage module is allowed to be read by the processor, and the processor reads the internal detection instructions from the first storage module; or, the processor further includes a second storage module that stores the internal detection instructions, the second storage module is allowed to be read and written by the processor, and the processor reads the internal detection instructions from the second storage module.
  • the processor includes at least one processor core; the processor determining whether there is an error in the processor based on the first execution result and the expected result corresponding to the internal detection instruction includes: the processor determines, based on the first execution result and the expected result corresponding to the internal detection instruction, whether there is an error in the processor core used to obtain the first execution result.
  • the method further includes: the processor determines that a detection switch in the management module is in an on state, where the detection switch is used to indicate whether to detect errors of the processor, and the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
  • the method further includes: when the processor determines that there is an error in the processor, providing alarm information to a management module, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information is used to indicate that the processor has an error; and/or, when it is determined that the processor has an error, the processor is shut down.
  • In a third aspect, embodiments of the present application provide a method for detecting processor errors.
  • the method may be executed by a processor or by a computing device including a processor.
  • the processor includes: instruction execution module and error detection module.
  • the method includes: the error detection module sends an internal detection instruction to the instruction execution module; the instruction execution module executes the internal detection instruction, obtains a first execution result, and sends the first execution result to the error detection module; the error detection module determines whether there is an error in the processor based on the first execution result and the expected result corresponding to the internal detection instruction.
  • the method before sending an internal detection instruction to the instruction execution module, the method further includes:
  • the error detection module determines that the remaining space in a buffer queue of the processor, which is used to cache instructions to be processed by the processor, is greater than or equal to a threshold.
  • the method further includes: the error detection module adds a mark to the internal detection instruction, the mark indicating an instruction for detecting an error of the processor; the error detection module identifies the first execution result corresponding to the internal detection instruction based on the mark.
  • the processor further includes a register that stores expected results corresponding to the internal detection instructions; the method further includes: the error detection module reads the expected result corresponding to the internal detection instruction from the register.
  • the method further includes: the error detection module obtains the internal detection instructions from instructions that have been executed by the processor; or, the error detection module reads the internal detection instructions from the first storage module in the processor, where the first storage module is allowed to be read by the processor; or, the error detection module reads the internal detection instructions from the second storage module in the processor, where the second storage module is allowed to be read and written by the processor.
  • the processor includes at least one processor core, and one of the at least one processor core corresponds to the instruction execution module; the error detection module determining whether there is an error in the processor based on the first execution result and the expected result corresponding to the internal detection instruction includes: the error detection module determines, based on the first execution result and the expected result corresponding to the internal detection instruction, whether there is an error in the processor core corresponding to the instruction execution module.
  • the method further includes: the error detection module determines that a detection switch in the management module is in an on state, where the detection switch is used to indicate whether to detect errors of the processor, and the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
  • when the error detection module determines that there is an error in the processor, the error detection module provides alarm information to the management module, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information is used to indicate that there is an error in the processor; and/or, when it is determined that there is an error in the processor, the control module in the processor controls the processor to shut down.
  • In a fourth aspect, embodiments of the present application provide a computing device, which may include any of the processors in the first aspect.
  • In a fifth aspect, embodiments of the present application provide a computing device.
  • the computing device includes a processor and a power supply circuit.
  • the power supply circuit is used to supply power to the processor.
  • the processor is used to implement any method of the second aspect or the third aspect.
  • In a sixth aspect, embodiments of the present application provide a computing device cluster, including at least one computing device.
  • Each computing device can execute the method in any one of the above-mentioned second aspect or the above-mentioned third aspect.
  • each computing device in the computing device cluster may be the computing device of any one of the above-mentioned fourth aspect or the above-mentioned fifth aspect.
  • a seventh aspect provides a computer program product containing instructions that, when run on a computer, implements the method of any one of the above second or third aspects.
  • In an eighth aspect, embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium is used to store computer programs or instructions. When executed, the computer programs or instructions implement the method of any one of the above-mentioned second aspect or third aspect.
  • Figure 1 is a schematic diagram of the deployment of an inspection tool
  • Figure 2 is a schematic structural diagram of a processor provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of another processor provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of an error detection module provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the deployment of a cloud data center provided by an embodiment of the present application.
  • Figure 7 is a schematic architectural diagram of a cloud data center provided by an embodiment of the present application.
  • Figure 8 is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a detection switch of a management module provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of a process for processing internal detection instructions and external program instructions provided by an embodiment of the present application
  • Figure 11 is a schematic flowchart of yet another method for detecting processor errors provided by an embodiment of the present application.
  • Figure 12 is a schematic diagram of another process for processing internal detection instructions and external program instructions provided by the embodiment of the present application.
  • Figure 13 is a schematic architectural diagram of a computing device cluster provided by an embodiment of the present application.
  • Silent data corruption (SDC)
  • silent data corruption refers to an error that occurs while the processor executes instructions but that the operating system of the device containing the processor does not perceive, so that the erroneous result of the instruction executed by the processor is stored.
  • the processor can be a central processing unit (CPU), or other general-purpose processor, digital signal processor (DSP), application specific integrated circuit (ASIC), Field programmable gate array (FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • a general-purpose processor can be a microprocessor or any conventional processor.
  • unless otherwise specified, a noun in the embodiments of the present application covers both its singular and plural forms, that is, "one or more".
  • At least one means one or more
  • plural means two or more.
  • “And/or” describes the relationship between associated objects, indicating that there can be three relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the related objects are in an "or” relationship.
  • A/B means: A or B.
  • "At least one of the following" or similar expressions refer to any combination of these items, including a single item or any combination of a plurality of items.
  • for example, at least one of a, b, or c means: a, b, c, a and b, a and c, b and c, or a and b and c, where each of a, b, and c can be single or multiple.
  • Figure 1 is a schematic diagram of the deployment of an inspection tool.
  • Figure 1 can be understood as an architectural schematic diagram of the device.
  • the device includes a processor, multiple running applications (APPs), and an operating system.
  • multiple applications include application 1, inspection tools, application 2, etc. as shown in Figure 1.
  • the inspection tool can have built-in program instructions and expected results corresponding to the program instructions.
  • the inspection tool can load program instructions into external memory.
  • the external memory and processor are configured independently of each other.
  • the processor reads program instructions from external memory and decodes the program instructions.
  • the processor executes the decoded program instructions and sends the execution results of the program instructions to the inspection tool.
  • the inspection tool compares the execution results of program instructions with the expected results. If the execution results match the expected results, the inspection tool determines that there is no processor error. If the execution results do not match the expected results, the inspection tool determines that there is a processor error. When the inspection tool determines that there is an error in the processor, the inspection tool can feedback to the operating system that the processor has an error.
  • the processor needs to read program instructions from external memory, decode the program instructions, and subsequently send execution results to the inspection tool, which places a large processing load on the processor.
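  • For comparison, the external inspection flow of Figure 1 can be sketched as below. The text instruction format (`"add 1 2"`) and the function names are assumptions; the point is only that every instruction passes through a load, decode, execute, and report cycle before the tool can compare it.

```python
def decode(raw):
    """Turn a text program instruction into (opcode, operands)."""
    opcode, *operands = raw.split()
    return opcode, [int(value) for value in operands]


def execute(decoded):
    opcode, operands = decoded
    if opcode == "add":
        return sum(operands)
    if opcode == "neg":
        return -operands[0]
    raise ValueError(f"unknown opcode: {opcode}")


def external_inspection(program, expected_results):
    """Inspection-tool flow: load, fetch, decode, execute, report, compare."""
    external_memory = list(program)  # the tool loads instructions into external memory
    for raw, expected in zip(external_memory, expected_results):
        result = execute(decode(raw))  # the processor fetches, decodes, and executes
        if result != expected:  # the tool compares the reported result
            return False
    return True
```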
  • the processor includes an error detection module and an instruction execution module.
  • the error detection module may send (or transfer) instructions for detecting processor errors (such as internal detection instructions) to the instruction execution module.
  • the instruction execution module executes the internal detection instruction, obtains the first execution result of the internal detection instruction, and sends the first execution result of the internal detection instruction to the error detection module.
  • the error detection module determines whether there is an error in the processor based on the first execution result and the expected result of the internal detection instruction. In this way, the processor can detect whether it has errors by itself without resorting to external inspection tools. Since there is no need for the processor to decode external program instructions or send the execution results of external program instructions to the inspection tool, the processing load of the processor is reduced.
  • the processor in the embodiment of the present application may be any type of processor, including, for example, a single-core processor or a multi-core processor.
  • FIG. 2 is a schematic structural diagram of a processor provided by an embodiment of the present application.
  • Figure 2 shows an example structural diagram of a single-core processor.
  • the processor 200 includes an error detection module 210 and an instruction execution module 220 .
  • the error detection module 210 and the instruction execution module 220 can communicate with each other.
  • the error detection module 210 and the instruction execution module 220 can be implemented by hardware, such as logic circuits. This application does not limit the specific structures of the error detection module 210 and the instruction execution module 220 .
  • the instruction execution module 220 is, for example, an arithmetic logic unit (arithmetic logic unit, ALU).
  • the error detection module 210 sends instructions for detecting processor errors (such as internal detection instructions) to the instruction execution module 220 .
  • the instruction execution module 220 is used to execute internal detection instructions to obtain the first execution result.
  • the instruction execution module 220 sends the first execution result to the error detection module 210 .
  • the error detection module 210 is configured to determine whether there is an error in the processor based on the first execution result and the expected result corresponding to the internal detection instruction.
  • if the first execution result matches the expected result of the internal detection instruction, the error detection module 210 determines that there is no processor error. If the first execution result does not match the expected result of the internal detection instruction, the error detection module 210 determines that there is an error in the processor.
  • the error detection module 210 and the instruction execution module 220 in the processor 200 can cooperate to implement error detection of the processor 200 without resorting to external inspection tools, and there is no need for the processor 200 to decode program instructions or send execution results to the inspection tool, which helps reduce the processing load and the resource overhead of the processor.
  • when the remaining space of the buffer queue of the processor 200 is greater than or equal to the threshold, the error detection module 210 sends an internal detection instruction to the instruction execution module 220.
  • the buffer queue is used to store instructions to be processed (or not processed) by the processor 200 .
  • the threshold may be preconfigured in the error detection module 210.
  • the instruction execution module 220 may also execute instructions from external programs. Accordingly, the instruction execution module 220 obtains not only the execution results of the internal detection instructions but also the execution results corresponding to the external program instructions. External program instructions refer to instructions that do not belong to the processor 200. Therefore, to help the error detection module 210 identify the first execution result corresponding to the internal detection instruction, in a possible implementation, the error detection module 210 may add a mark to the internal detection instruction. The error detection module 210 then obtains, according to the mark, the first execution result of the marked internal detection instruction from the instruction execution module 220.
  • the expected results corresponding to the internal detection instructions may be preconfigured in the error detection module 210 .
  • the expected results corresponding to the internal detection instructions may be preconfigured in the register 260 of the processor 200 . Subsequently, the error detection module 210 can read the expected result corresponding to the internal detection instruction from the register 260 .
  • processor 200 also includes control module 250.
  • the control module 250 is used to control various modules of the processor 200 (including the error detection module 210 and the instruction execution module 220, etc.).
  • the control module 250 is, for example, a control unit (CU).
  • the processor 200 further includes a first storage module 230 and/or a second storage module 240.
  • the first storage module 230 and the second storage module 240 can be understood as internal storage modules of the processor 200 .
  • the first storage module 230 is, for example, a read only memory (ROM) or a memory.
  • the embodiment of the present application does not limit the specific implementation forms of the first storage module 230 and the second storage module 240 .
  • the second storage module 240 is, for example, a memory.
  • the first storage module 230 is used to store instructions.
  • the first storage module 230 may be allowed to be read by the processor 200 (specifically, the error detection module 210 in the processor 200).
  • the error detection module 210 is used to read internal detection instructions from the first storage module 230 .
  • the second storage module 240 is used to store instructions.
  • the second storage module 240 may be allowed to be read and written by the processor 200 (specifically, the error detection module 210 in the processor 200).
  • the external device can write instructions to the second storage module 240 through the processor 200, and the error detection module 210 can also read internal detection instructions from the second storage module 240. In this way, it is helpful to expand the number and type of instructions used to detect errors in the processor 200 .
  • the error detection module 210 may also read internal detection instructions from instructions that have been executed by the instruction execution module 220 .
  • the error detection module 210 can read the internal detection instructions from the first storage module 230, from the second storage module 240, or from the instructions that have been executed by the instruction execution module 220 inside the processor 200, which provides multiple ways to obtain internal detection instructions.
  • the instruction execution module 220 can also be used to process external program instructions.
  • the external program instructions refer to program instructions from devices other than the processor.
  • the instruction execution module 220 can decode the external program instruction, obtain the decoded result, and execute the decoded result to obtain the second execution result. Decoding the external program instructions can be understood as converting the external program instructions into instructions that the instruction execution module 220 can directly operate.
  • the external program instructions are, for example, instructions stored in an external memory.
  • the external program instructions may be instructions formed after the operating system loads the application.
  • FIG. 3 is a schematic structural diagram of another processor provided by an embodiment of the present application.
  • FIG. 3 is, for example, a schematic structural diagram of a multi-core processor.
  • the processor 300 includes multiple processor cores, an error detection module 301 and an instruction execution module 302 .
  • multiple processor cores including a first processor core 310 and a second processor core 320 are taken as an example.
  • a processor core can also be called a physical processor core or a physical core.
  • the function and implementation form of the error detection module 301 can refer to the content of the error detection module 210 discussed in Figure 2 above.
  • the function and implementation form of the instruction execution module 302 can refer to the content of the instruction execution module 220 discussed in Figure 2 above.
  • the structures of the first processor core 310 and the second processor core 320 may be the same.
  • the first processor core 310 is taken as an example for introduction below.
  • the first processor core 310 includes an instruction execution module 302 .
  • the content of the instruction execution module 302 may refer to the content discussed above.
  • the first processor core 310 also includes a control module 307 .
  • the content of the control module 307 may refer to the content discussed above.
  • the error detection module 301 can perform error detection on multiple processor cores of the processor 300 to determine which processor core specifically has an error.
  • the error detection module 301 can send an internal detection instruction to the instruction execution module 302 in the first processor core 310, and obtain the execution result of the internal detection instruction sent by the instruction execution module 302 in the first processor core 310. If the execution result of the internal detection instruction does not match the expected result, the error detection module 301 determines that there is an error in the first processor core 310 . If the execution result of the internal detection instruction matches the expected result, the error detection module 301 determines that there is no error in the first processor core 310 . Similarly, the error detection module 301 can detect whether there is an error in the second processor core 320 .
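  • The per-core check described above can be sketched in Python as follows. This is only an illustrative software model under stated assumptions, not the hardware implementation: each core is modeled as a hypothetical callable, and the names `find_faulty_cores`, `healthy_core` and `broken_core` are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a core is a callable that executes one internal
# detection instruction (here, an addition of two operands).
Core = Callable[[int, int], int]


def find_faulty_cores(cores: Dict[str, Core],
                      operands: Tuple[int, int],
                      expected: int) -> List[str]:
    """Return the names of the cores whose result differs from the expected result."""
    faulty = []
    for name, execute in cores.items():
        if execute(*operands) != expected:
            faulty.append(name)
    return faulty


def healthy_core(a: int, b: int) -> int:
    return a + b


def broken_core(a: int, b: int) -> int:
    # Simulated fault: the lowest bit of every sum is flipped.
    return (a + b) ^ 1


cores = {"core_310": healthy_core, "core_320": broken_core}
print(find_faulty_cores(cores, (2, 3), expected=5))
```

  • Because every core receives the same instruction and the same expected result, a mismatch points to a specific core rather than to the processor as a whole.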
  • the processor 300 can perform error detection on the processor core in the processor 300 , which is equivalent to performing more accurate error detection on the processor 300 .
  • the processor 300 also includes one or more of a register 304, a first storage module 305, and a second storage module 306.
  • the function of the register 304 may refer to the content of the register 260 discussed in FIG. 2 above.
  • the function and implementation form of the first storage module 305 may refer to the content of the first storage module 230 discussed in FIG. 2 above.
  • the function and implementation form of the second storage module 306 may refer to the content of the second storage module 240 discussed in FIG. 2 above.
  • FIG. 4 is a schematic structural diagram of another processor provided by an embodiment of the present application.
  • the processor 400 includes an instruction execution module 410 and an error detection module 420 .
  • the error detection module 420 includes an instruction acquisition sub-module 421, an instruction identification sub-module 422 and an error judgment sub-module 423.
  • the instruction acquisition sub-module 421, the instruction identification sub-module 422 and the error judgment sub-module 423 can all be implemented by hardware.
  • the instruction acquisition sub-module 421 can be implemented by a register or a memory.
  • the instruction identification sub-module 422 can be implemented by a comparator.
  • the error judgment sub-module 423 can be implemented through a comparator.
  • the instruction acquisition sub-module 421 is used to provide internal detection instructions for the instruction execution module 410.
  • the content of the internal detection instructions obtained by the instruction acquisition sub-module 421 may refer to the content of the internal detection instructions obtained by the error detection module mentioned above, which will not be described again here.
  • the instruction identification sub-module 422 identifies the execution results corresponding to the internal detection instructions from the execution results executed by the instruction execution module 410 .
  • the error judgment sub-module 423 determines whether the execution result corresponding to the internal detection instruction is the same as the expected result of the internal detection instruction, thereby determining whether there is an error in the processor 400 .
  • the processor 400 also includes a load determination sub-module 424, which is illustrated by a dotted box in FIG. 4 .
  • the load determination sub-module 424 may be used to determine whether the remaining space of the buffer queue of the processor 400 is greater than or equal to the threshold. For example, when the load determination sub-module 424 determines that the remaining space of the buffer queue of the processor 400 is greater than or equal to the threshold, it triggers the instruction acquisition sub-module 421 to send an internal detection instruction to the instruction execution module 410.
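  • As a rough illustration of this load check, the following Python sketch models the buffer queue in software. The class `BufferQueue`, the function `may_send_detection`, and the capacity and threshold values are assumptions made for the example; the text does not specify them.

```python
from collections import deque


class BufferQueue:
    """Hypothetical software model of the processor's buffer queue."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending = deque()

    def push(self, instruction: str) -> None:
        if len(self.pending) >= self.capacity:
            raise OverflowError("buffer queue is full")
        self.pending.append(instruction)

    def remaining_space(self) -> int:
        return self.capacity - len(self.pending)


def may_send_detection(queue: BufferQueue, threshold: int) -> bool:
    """The load counts as small when the remaining space reaches the threshold."""
    return queue.remaining_space() >= threshold


queue = BufferQueue(capacity=8)
for i in range(3):
    queue.push(f"external_{i}")

# 5 slots remain, which is at least the threshold of 4.
print(may_send_detection(queue, threshold=4))
```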
  • Computing equipment generally refers to equipment with processing capabilities, such as servers or terminal equipment.
  • Terminal devices are, for example, mobile phones, tablets, computers with wireless transceiver functions, wearable devices, vehicles, drones, helicopters, airplanes, ships, robots, robotic arms or smart home devices.
  • FIG. 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • computing device 500 includes software layers and hardware layers.
  • the software layer includes an operating system 510, a firmware system 520, a baseboard management controller (BMC) 530, etc.
  • the baseboard management controller 530 is, for example, a small operating system independent of the computing device 500 .
  • Operating system 510, firmware system 520, and baseboard management controller 530 may all be used to manage computing device 500.
  • one or more of the operating system 510, the firmware system 520, and the baseboard management controller 530 can be regarded as a management module.
  • the hardware layer includes external storage module 540, processor 550, network card 560, etc.
  • the external storage module 540 is, for example, the memory of the computing device 500 .
  • the structure of the processor 550 may refer to the structure shown in FIG. 2, FIG. 3 or FIG. 4. Only one processor 550 is illustrated in FIG. 5 . In fact, the number of processors 550 may be one or more.
  • Processor 550 may be used to process requests from outside computing device 500 and/or requests generated internally by computing device 500 .
  • Network card 560 is used by computing device 500 to communicate with other devices.
  • computing device 500 may also include a bus, which may be used for communication between components of computing device 500 .
  • the computing device 500 may also include a power supply circuit, which is used to power the processor 550 .
  • when the processor 550 detects that there is an error in the processor 550, it can provide alarm information to the management module.
  • the alarm information is used to indicate that there is an error in the processor 550 so that the user can learn the status of the processor 550 in a timely manner.
  • the processor 550 in addition to executing internal detection instructions, is also configured to execute external program instructions in the external storage module 540 .
  • the processor 550 reads the external program instructions from the external storage module 540 and decodes the external program instructions to obtain the decoded results.
  • the processor 550 executes the decoded result to obtain a second execution result.
  • the processor 550 may write the second execution result into the external storage module 540 to facilitate the operating system of the computing device 500 and the like to obtain the second execution result.
  • Cloud data centers can be used to provide users with business services, including storage services and/or computing services.
  • FIG. 6 is a schematic diagram of the deployment of a cloud data center provided by an embodiment of the present application.
  • FIG. 6 can be understood as a schematic diagram of an application scenario of an error detection method provided by an embodiment of the present application.
  • this scenario includes multiple terminal devices 610, multiple clients 611 running on the terminal devices 610, and a cloud data center 620.
  • One client 611 of the plurality of clients 611 is run in each terminal device 610 of the plurality of terminal devices 610 .
  • Each client 611 in the plurality of clients 611 may be a software module or application.
  • Cloud data center 620 includes one or more computing devices.
  • the client 611 can remotely access the cloud data center 620, and then use the business services provided by the cloud data center 620.
  • the structure of the cloud data center is introduced below with reference to the schematic diagram of a cloud data center architecture shown in Figure 7.
  • the case where the computing device included in the cloud data center is a server is taken as an example.
  • the cloud data center 700 includes a cloud management platform 710 and at least one server 730.
  • for example, the number of the at least one server 730 is two.
  • the cloud management platform 710 communicates with each of the at least one server 730 through the cloud data center internal network 720.
  • the cloud management platform 710 is used to provide an access interface (such as an interface or an application programming interface (API)).
  • the tenant can operate the client to remotely access the access interface, register a cloud account and password on the cloud management platform 710, and log in to the cloud management platform 710.
  • the client is, for example, client 611 in Figure 6 .
  • the tenant can further select and purchase a virtual machine with specific specifications (processor, memory, disk) on the cloud management platform 710 for a fee.
  • virtual machines can also be called cloud servers (elastic compute service, ECS) or elastic instances.
  • the cloud management platform 710 provides the tenant with the remote login account and password of the purchased virtual machine. The client can log in to the virtual machine remotely and install and run tenant applications in the virtual machine.
  • the logical functions of the cloud management platform 710 may include user console, computing management service, network management service, storage management service, authentication service, image management service, etc.
  • the user console provides an interface or API to interact with tenants.
  • Compute management services are used to manage servers running virtual machines and containers, as well as bare metal servers.
  • Network management services are used to manage network services (such as gateways, firewalls, etc.).
  • the storage management service is used to manage storage services (such as data bucket services).
  • the authentication service is used to manage tenant accounts and passwords.
  • Image management service is used to manage virtual machine images.
  • any two servers 730 in the at least one server 730 may be the same.
  • the structure of one server 730 is taken as an example for description below.
  • Server 730 includes hardware layers and software layers.
  • the hardware layer of server 730 includes memory 734, processor 735, network card 736 and disk 737.
  • the contents of the memory 734, the processor 735 and the network card 736 can refer to the contents discussed in Figure 5 above.
  • the server 730 may also include a power supply circuit, which is used to power the processor 735 .
  • the memory 734 can be regarded as an example of an external storage module.
  • the software layer of the server 730 includes an operating system installed and running on the server 730 (the operating system relative to the virtual machine can be called a host operating system).
  • the operating system is provided with a virtual machine manager (VMM) 732 and multiple virtual machines 731.
  • the virtual machine manager 732 may be used to implement computing virtualization, network virtualization, and storage virtualization of the virtual machine, as well as manage the virtual machine 731 .
  • computing virtualization refers to providing part of the processor 735 and memory 734 of the server 730 to the virtual machine.
  • network virtualization refers to providing part of the functions (such as bandwidth) of the network card 736 to the virtual machine.
  • storage virtualization refers to providing part of the disk 737 to the virtual machine.
  • managing the virtual machine 731 includes, for example, creating a virtual machine 731, simulating virtual hardware for the virtual machine according to the hardware layer (hardware emulation function), deleting the virtual machine 731, and forwarding and/or processing the data running on the server 730.
  • the running environments (such as virtual machine applications, operating systems, and virtual hardware) in different virtual machines 731 are completely isolated, and different virtual machines 731 can communicate with each other through the virtual machine manager 732 .
  • Each virtual machine 731 may run an operating system, a firmware system, a baseboard management controller, etc.
  • the virtual machine manager 732 also includes a cloud platform management client 733.
  • the cloud platform management client 733 is used to receive control plane commands sent by the cloud management platform 710, create virtual machines on the server 730 according to the control plane commands, and perform full life cycle management of the virtual machines, so that tenants can create, manage, log in to and operate virtual machines 731 in the cloud data center 700 through the cloud management platform 710.
  • FIG. 8 is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application.
  • the structure of the processor involved in the embodiment shown in Figure 8 is, for example, that of the processor 200 in Figure 2, the processor 300 in Figure 3, or the processor 400 in Figure 4.
  • the processor involved in the embodiment shown in Figure 8 may be applied, for example, in the computing device shown in FIG. 5.
  • the processor involved in the embodiment shown in Figure 8 includes an instruction execution module and an error detection module.
  • the error detection module adds a mark to the internal detection instruction.
  • the error detection module involved in the embodiment of the present application is, for example, the error detection module 210 in Figure 2, the error detection module 301 in Figure 3, or the error detection module 420 in Figure 4.
  • the error detection module can determine the instruction used to detect processor errors.
  • the case where the instruction used to detect processor errors is an internal detection instruction is taken as an example.
  • when the processor is applied to a computing device, such as the computing device in FIG. 5, the instruction execution module executes instructions from the error detection module and may also execute external program instructions from the external storage module.
  • the instruction execution module is, for example, the instruction execution module 220 in FIG. 2 , the instruction execution module 302 in FIG. 3 , or the instruction execution module 410 in FIG. 4 .
  • the error detection module can add a mark to the internal detection instruction.
  • the mark is used to indicate that the internal detection instruction is an instruction used to detect processor errors.
  • the specific form of this mark can be a label.
  • the external storage module is not provided in the processor.
  • the external storage module is, for example, a ROM or a memory.
  • the external storage module is, for example, the external storage module 540 in FIG. 5. The instructions in the external storage module are external program instructions; for the meaning of external program instructions, refer to the content discussed above.
  • the external program instructions are, for example, instructions from an external application.
  • the error detection module determines the internal detection instructions from the instructions that have been executed by the processor.
  • the instruction execution module in the processor executes multiple instructions, such as instructions from the error detection module and instructions from the external storage module.
  • an instruction that has been executed by the instruction execution module is called a historical instruction.
  • all instructions that have been executed by the instruction execution module are called a historical instruction set.
  • the error detection module may select at least one historical instruction from the historical instruction set for detecting whether there is an error in the processor.
  • the error detection module may determine one of the at least one historical instruction as an internal detection instruction.
  • the internal detection instructions belong to instructions that have been executed by the processor, which are historical instructions.
  • the error detection module may randomly determine an instruction from at least one historical instruction as an internal detection instruction.
  • the error detection module may use an instruction with the greatest execution complexity among at least one historical instruction as an internal detection instruction.
  • Execution complexity can be characterized by the time that the instruction execution module previously required to execute the instruction. The longer the time, the higher the execution complexity. Since the processor is more likely to make errors when executing instructions with high complexity, selecting instructions with high complexity makes it easier to detect whether there are errors in the processor.
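  • A minimal Python sketch of this selection rule is shown below. It assumes that each historical instruction is recorded together with its previous execution time; the record type `HistoricalInstruction`, the function `pick_most_complex`, and the sample timings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HistoricalInstruction:
    identifier: str
    execution_time_ns: int  # time previously needed to execute the instruction


def pick_most_complex(history: List[HistoricalInstruction]) -> HistoricalInstruction:
    """Pick the historical instruction with the longest recorded execution time."""
    if not history:
        raise ValueError("historical instruction set is empty")
    return max(history, key=lambda ins: ins.execution_time_ns)


history = [
    HistoricalInstruction("add", 1),
    HistoricalInstruction("div", 20),
    HistoricalInstruction("mul", 3),
]
print(pick_most_complex(history).identifier)
```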
  • the error detection module may sample historical instruction sets from multiple instruction execution modules corresponding to multiple processor cores.
  • the historical instruction set includes instructions that have been executed by multiple instruction execution modules.
  • when the error detection module obtains at least one historical instruction from the instruction execution module, the error detection module can also write the result obtained when the instruction execution module executed the at least one historical instruction into a preset register in the processor.
  • the result corresponding to at least one historical instruction can be regarded as the expected result corresponding to at least one historical instruction.
  • the register is, for example, the register 260 in Figure 2 or the register 304 in Figure 3 .
  • the error detection module can also generate an identifier of each historical instruction in the at least one historical instruction, and store the identifier of the at least one historical instruction and the expected result corresponding to the at least one historical instruction in the register.
  • for example, the expected result of the instruction identified as instruction 0 is 11, and the expected result of the instruction identified as instruction 1 is 00.
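  • The mapping between instruction identifiers and expected results can be pictured as a small lookup table. The Python sketch below is an assumption-laden illustration: the class `ExpectedResultRegister` and its methods are invented, and the values 11 and 00 from the example above are kept as strings because the text does not say how they are encoded.

```python
from typing import Dict


class ExpectedResultRegister:
    """Hypothetical model of the register that holds identifier/expected-result pairs."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def store(self, identifier: str, expected: str) -> None:
        self._table[identifier] = expected

    def expected_for(self, identifier: str) -> str:
        return self._table[identifier]

    def matches(self, identifier: str, result: str) -> bool:
        return self._table[identifier] == result


register = ExpectedResultRegister()
register.store("instruction 0", "11")
register.store("instruction 1", "00")

print(register.expected_for("instruction 0"))
print(register.matches("instruction 1", "11"))
```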
  • the error detection module reads the internal detection instructions from the first storage module.
  • the first storage module in the embodiment of the present application is, for example, the first storage module 230 in FIG. 2 or the first storage module 305 in FIG. 3.
  • the first storage module may be pre-stored with at least one instruction.
  • the at least one instruction may be manually configured in the first storage module.
  • for example, the at least one instruction is manually configured in the first storage module before the processor leaves the factory.
  • the internal detection instruction belongs to an instruction in the first storage module.
  • the processor may perform a read operation on the first storage module.
  • the error detection module in the processor may read an instruction from at least one instruction stored in the first storage module as an internal detection instruction.
  • the error detection module may randomly read an instruction from at least one instruction as an internal detection instruction.
  • At least one instruction prestored in the first storage module may be an instruction with a probability of an error executed by the processor greater than or equal to the first probability.
  • at least one instruction prestored in the first storage module may be an instruction that is prone to errors in execution by the processor.
  • the probability of a processor execution error can be determined based on experience, or obtained through multiple tests of the processor.
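  • The rule for choosing which instructions to prestore can be sketched as a simple filter. In the Python example below, the function `select_error_prone`, the instruction names, and all probability values are made up for illustration; real values would come from experience or testing as described above.

```python
from typing import Dict, List


def select_error_prone(error_probabilities: Dict[str, float],
                       first_probability: float) -> List[str]:
    """Keep the instructions whose error probability reaches the first probability."""
    return sorted(
        name
        for name, probability in error_probabilities.items()
        if probability >= first_probability
    )


# Hypothetical per-instruction error probabilities.
error_probabilities = {"add": 0.001, "fdiv": 0.02, "vmul": 0.05}
print(select_error_prone(error_probabilities, first_probability=0.02))
```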
  • the first storage module may also store an expected result corresponding to at least one instruction.
  • the staff can manually configure the expected result corresponding to the at least one instruction in the first storage module.
  • the expected result corresponding to at least one instruction may be obtained by executing the at least one instruction on other processors.
  • the other processors are different from the processor.
  • the identifier of at least one instruction and the expected result corresponding to the at least one instruction may be pre-stored in a register, and the register is as discussed above.
  • the staff can manually configure the expected result corresponding to at least one instruction in the register.
  • the method for obtaining the expected result corresponding to at least one instruction may refer to the content discussed above.
  • the error detection module reads the internal detection instructions from the second storage module.
  • the second storage module in the embodiment of the present application is, for example, the second storage module 240 in Figure 2, or, for example, the second storage module 306 in Figure 3.
  • the processor can perform write operations and read operations on the second storage module.
  • At least one instruction in the second storage module may be written by the processor.
  • the external device may access the processor, and the processor may write instructions in the second storage module.
  • the internal detection instructions belong to instructions in the second storage module.
  • the error detection module in the processor can read an instruction from the second storage module as an internal detection instruction.
  • the second storage module can support external writing, which facilitates subsequent addition of instructions for detecting processor errors.
  • the second storage module may also store an expected result corresponding to at least one instruction.
  • when the external device writes at least one instruction into the second storage module, it can also write the expected result corresponding to the at least one instruction into the second storage module.
  • the expected result corresponding to at least one instruction may be obtained through execution by another processor.
  • the identifier of at least one instruction and the expected result corresponding to at least one instruction are pre-stored in a register, and the register is as discussed above.
  • the external device may write the expected result corresponding to at least one instruction into a register.
  • the method for obtaining the expected result corresponding to at least one instruction may refer to the content discussed above.
  • step S801 is an optional step, which is illustrated by a dotted line in Figure 8 .
  • the error detection module sends an internal detection instruction to the instruction execution module.
  • the instruction execution module receives internal detection instructions from the error detection module.
  • the error detection module sends an internal detection instruction to the instruction execution module corresponding to one processor core (for example, the first processor core) in the multi-core processor.
  • the first processor core is, for example, the first processor core 310 shown in FIG. 3 .
  • the error detection module sends instructions for detecting processor errors to the instruction execution module according to the first cycle.
  • the duration of the first period may be pre-configured in the error detection module, and the duration of the first period may be, for example, 5 hours.
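  • The periodic trigger can be illustrated with a small Python model that works on explicit timestamps instead of a real clock. The class `PeriodicTrigger` and its method `should_send` are hypothetical names; the 5-hour period is the example value given above.

```python
class PeriodicTrigger:
    """Hypothetical model: signals when one full period has elapsed."""

    def __init__(self, period_s: float, start_s: float = 0.0) -> None:
        self.period_s = period_s
        self.next_due_s = start_s + period_s

    def should_send(self, now_s: float) -> bool:
        """Return True once per period and schedule the next send time."""
        if now_s < self.next_due_s:
            return False
        self.next_due_s = now_s + self.period_s
        return True


FIVE_HOURS_S = 5 * 3600
trigger = PeriodicTrigger(period_s=FIVE_HOURS_S)
print(trigger.should_send(now_s=3600))          # one hour in: not yet due
print(trigger.should_send(now_s=FIVE_HOURS_S))  # five hours in: due
```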
  • the error detection module sends instructions for detecting processor errors to the instruction execution module from time to time.
  • the error detection module determines that the load of the processor is small and sends an internal detection instruction to the instruction execution module.
  • the error detection module may characterize the load of the processor in terms of the remaining space in a buffer queue in the processor. In this way, the process of detecting processor errors does not occupy processor resources when the processor load is heavy, which helps the processor smoothly execute instructions from the external storage module.
  • if the remaining space of the buffer queue of the processor is greater than or equal to the threshold, the error detection module determines that the load of the processor is small; if the remaining space of the buffer queue of the processor is less than the threshold, the error detection module determines that the load of the processor is large.
  • the threshold value may be pre-configured in the error detection module, and the threshold value is, for example, 1M.
  • the error detection module can send an internal detection instruction to the instruction execution module corresponding to the first processor core when the load of the first processor core is small.
  • the method of determining that the load of the first processor core is small may refer to the previous content of determining that the load of the processor is small.
  • the management module includes a detection switch. The detection switch is used to indicate whether to perform error detection on the processor. The management module sets the detection switch to on or off according to the user's operation. In a possible embodiment, the error detection module may determine that the detection switch in the management module is on and then send an internal detection instruction to the instruction execution module. When the detection switch is on, error detection is performed on the processor; when the detection switch is off, error detection is not performed on the processor.
  • the management module includes one or more of the operating system, the baseboard management controller, or the firmware system in the computing device.
  • the error detection module only needs to determine that the detection switch of one of the operating system, the baseboard management controller or the firmware system is on, which is equivalent to determining that the detection switch in the management module is on.
  • if the error detection module determines that the detection switch is on and the processor load is small, it sends an internal detection instruction to the instruction execution module; or, if the error detection module determines that the detection switch is on, it sends an internal detection instruction to the instruction execution module.
  • FIG. 9 is a schematic diagram of a detection switch of a management module provided by the present application.
  • the detection switch in the management module (specifically the button where the processor error is detected in Figure 9) is displayed as " ⁇ ", indicating that the detection switch in the management module is on.
  • the instruction execution module executes the internal detection instruction and obtains the first execution result.
  • the instruction execution module can directly execute the internal detection instruction to obtain the first execution result.
  • the embodiment of the present application refers to the execution result of the internal detection instruction as the first execution result.
  • the error detection module when the error detection module performs S801 (ie, adds a mark to the internal detection instruction), the error detection module can cache the mark of the internal detection instruction and add the mark to the first execution result.
  • the error detection module determines the first execution result according to the mark.
  • the error detection module may obtain the execution results of all instructions (for example, including the first execution result and the second execution result) from the instruction execution module.
  • the error detection module can identify the first execution result according to the mark, which is equivalent to the error detection module determining the first execution result.
  • the error detection module may send a first request to the instruction execution module.
  • the first request is used to request an execution result corresponding to the internal detection instruction, and the first request may include (or indicate) a flag of the internal detection instruction.
  • the instruction execution module receives the first request and feeds back the first execution result to the error detection module, which is equivalent to the error detection module determining the first execution result.
  • the error detection module may obtain the first execution result corresponding to the internal detection instruction from the instruction execution module according to the identification of the internal detection instruction.
  • the content of the identification of the internal detection instruction can be referred to the previous article.
  • the error detection module does not need to determine the first execution result according to the flag, so S804 is an optional step.
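As a rough sketch of the mark-based identification in S804, the snippet below filters the first execution result out of all execution results. The `ExecutionResult` type, the `DETECTION_MARK` value, and the function name are assumptions made for illustration; the source does not specify how the mark is encoded.

```python
from dataclasses import dataclass
from typing import List, Optional

DETECTION_MARK = "DETECT"  # hypothetical encoding of the mark


@dataclass(frozen=True)
class ExecutionResult:
    value: int
    mark: Optional[str] = None  # None for results of external program instructions


def identify_first_execution_results(all_results: List[ExecutionResult],
                                     mark: str = DETECTION_MARK) -> List[ExecutionResult]:
    """Keep only the results that carry the detection mark."""
    return [result for result in all_results if result.mark == mark]
```

Results of external program instructions carry no mark in this sketch, so they pass through untouched and only the marked result is compared against the expected result.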
  • the error detection module determines whether there is an error in the processor based on the matching result between the first execution result and the expected result of the internal detection instruction.
  • the error detection module may read the expected result of the internal detection instruction from the register.
  • the error detection module may read the expected result of the internal detection instruction from the first storage module.
  • the error detection module may read the expected result of the internal detection instruction from the second storage module.
  • The error detection module determines whether the first execution result matches the expected result of the internal detection instruction. If it matches, the error detection module determines that there is no error in the processor and can discard the first execution result, and the processor can execute the embodiment shown in Figure 8 again to perform error detection on the processor. If the first execution result does not match the expected result of the internal detection instruction, the error detection module determines that there is an error in the processor.
  • The error detection module determines whether the first execution result is the same as the expected result of the internal detection instruction. If the two are the same, the first execution result matches the expected result of the internal detection instruction; if the two are different, the first execution result does not match the expected result of the internal detection instruction.
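The comparison in S805 can be illustrated with a few lines of code. This is only a sketch: `EXPECTED_RESULTS` is a hypothetical stand-in for the register or storage module that holds expected results, and the instruction identifiers are invented for the example.

```python
# Hypothetical stand-in for the register (or storage module) that holds the
# expected result of each internal detection instruction, keyed by an identifier.
EXPECTED_RESULTS = {"add_2_3": 5, "xor_6_3": 5}


def processor_has_error(instruction_id: str, first_execution_result: int) -> bool:
    """Return True when the first execution result differs from the expected result."""
    expected_result = EXPECTED_RESULTS[instruction_id]
    return first_execution_result != expected_result
```

Here "matching" is modelled as strict equality, which corresponds to the second description above; other matching rules would need a different comparison.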
  • the instruction execution module reads external program instructions from the external storage module.
  • the external storage module is, for example, the external storage module 540 shown in FIG. 5 .
  • the meaning of external program instructions can refer to the content discussed above and will not be repeated here.
  • the instruction execution module decodes the external program instruction, obtains the decoded result, and executes the decoded result to obtain the second execution result.
  • the instruction execution module writes the second execution result into the external storage module.
  • the operating system can obtain the second execution result from the external storage module, and then present the second execution result to the user.
  • S806 to S808 are optional steps.
  • FIG. 10 is a schematic diagram of a process for processing internal detection instructions and external program instructions according to an embodiment of the present application.
  • the path for processing internal detection instructions includes: error detection module → instruction execution module → error detection module.
  • the path for processing external program instructions includes: application → external storage module → instruction execution module → external storage module → application. It can be seen that the path for processing internal detection instructions in the embodiment of the present application is different from the path for processing external program instructions, and is simpler than the processing path for external program instructions.
  • the control module in the processor can shut down the processor.
  • the operating system of the computing device may control turning off the processor.
  • The error detection module can provide alarm information to the management module in the computing device when it determines that there is an error in the processor. This alarm information is used to indicate that there is an error in the processor. Furthermore, the management module can display the alarm information. In this way, users can be notified of processor errors in time.
  • The internal detection instruction may be executed by the instruction execution module corresponding to one processor core (such as the first processor core) in the processor.
  • The error detection module may determine that there is an error in the first processor core. In this way, the processor can pinpoint exactly which processor core failed.
  • the error detection module when the error detection module determines that there is an error in the first processor core, the error detection module can generate alarm information and send the alarm information to the management module. This alarm information is used to indicate that there is an error in the processor.
  • the control module corresponding to the first processor core may shut down the first processor core.
  • the operating system of the computing device may control turning off the first processor core. In this way, other processor cores of the processor can still work normally.
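The per-core localization and shutdown described above might look like the following sketch. The function names and the dictionary-based bookkeeping are hypothetical; the source only states that the failing core is identified and that the other cores can keep working.

```python
from typing import Dict, List, Set


def find_faulty_cores(results_by_core: Dict[int, int], expected_result: int) -> List[int]:
    """Return the IDs of cores whose first execution result does not match."""
    return sorted(core for core, result in results_by_core.items()
                  if result != expected_result)


def shut_down(active_cores: Set[int], faulty_cores: List[int]) -> Set[int]:
    """Turn off only the faulty cores; the remaining cores stay active."""
    return active_cores - set(faulty_cores)
```

Each core runs the same internal detection instruction in this sketch, so a single expected result is enough to tell the cores apart.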
  • The processor can detect whether there is an error in the processor according to the instructions in the processor. Since there is no need for the processor to decode program instructions or to send the execution results of program instructions to an external inspection tool, fewer instructions are generated by the processor, which helps reduce the processing load of the processor. Moreover, in the embodiment of the present application, the processor can perform error detection when the load of the processor is small, which reduces the impact of the error detection process on the execution of external program instructions. In addition, when the processor is a multi-core processor, the processor can detect which processor core has an error, and thus accurately determine the processor core in which the error occurred.
  • an alarm message can be reported and/or the processor can be shut down to handle the error in a timely manner to avoid greater impact.
  • applications can be loaded without the help of the operating system, thus reducing the processing load of the operating system.
  • a computing device can perform processor error detection without installing an operating system.
  • the computing device in the cloud scenario may not have an operating system installed.
  • The processor error detection method in the embodiment of the present application can be better applied to cloud scenarios, such as the cloud computing and/or cloud storage scenarios mentioned above. A specific example of a cloud scenario is the cloud data center mentioned above.
  • FIG. 11 is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application.
  • the processor in FIG. 8 is specifically the processor 400 shown in FIG. 4 as an example.
  • the processor in the embodiment shown in Figure 11 includes an error detection module and an instruction execution module.
  • the error detection module includes an instruction acquisition sub-module, an instruction identification sub-module and an instruction judgment sub-module.
  • the instruction acquisition submodule adds a mark to the internal detection instruction.
  • tags and adding tags can refer to the content discussed above.
  • the instruction acquisition sub-module may send the tag to the instruction identification sub-module, so that the instruction identification sub-module subsequently identifies instructions for detecting processor errors based on the tag.
  • the instruction acquisition sub-module sends an internal detection instruction to the instruction execution module.
  • the instruction execution module receives internal detection instructions from the instruction acquisition sub-module.
  • the instruction acquisition sub-module may send an internal detection instruction to the instruction execution module when it is determined that the detection switch is in an on state and/or the load on the processor is small.
  • the content of detecting the switch and determining that the detection switch is in the on state may refer to the above discussion.
  • the processor in the embodiment shown in FIG. 11 also includes a load judgment sub-module, which is, for example, the load judgment sub-module 424 in FIG. 4 .
  • When the load judgment sub-module determines that the load of the processor is small, it sends the first indication information to the instruction acquisition sub-module.
  • the first indication information is used to indicate that the load of the processor is small.
  • the instruction acquisition sub-module receives the first indication information and executes step S1102.
  • the load judgment sub-module may send second indication information to the instruction acquisition sub-module.
  • the second indication information is used to indicate that the remaining space in the buffer queue of the processor is less than the threshold.
  • The instruction acquisition sub-module receives the second indication information and does not execute step S1102; alternatively, when the load of the processor is large, the load judgment sub-module need not send any indication information to the instruction acquisition sub-module.
  • In this case, by default the instruction acquisition sub-module executes step S1102 only when triggered by the first indication information.
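The exchange between the load judgment sub-module and the instruction acquisition sub-module can be sketched as below. The `Indication` enum and both function names are hypothetical. The threshold rule assumes the reading given elsewhere in the application, where a small load means the remaining space in the buffer queue is greater than or equal to a threshold.

```python
from enum import Enum
from typing import Optional


class Indication(Enum):
    FIRST = 1   # load is small
    SECOND = 2  # remaining buffer-queue space is below the threshold


def judge_load(remaining_space: int, threshold: int) -> Indication:
    """Load judgment sub-module: classify the load from the buffer queue's free space."""
    if remaining_space >= threshold:
        return Indication.FIRST
    return Indication.SECOND


def should_execute_s1102(indication: Optional[Indication]) -> bool:
    """Instruction acquisition sub-module: only the first indication triggers S1102."""
    # None models the variant in which no indication is sent under heavy load.
    return indication is Indication.FIRST
```

Treating a missing indication the same as the second indication keeps the default conservative: no detection instruction is sent unless the first indication arrives.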
  • the instruction execution module executes the internal detection instruction and obtains the first execution result.
  • the content of the first execution result may refer to the content discussed above.
  • the instruction execution module may add a mark to the first execution result.
  • the instruction execution module sends the first execution result to the instruction identification sub-module.
  • the instruction identification sub-module receives the first execution result from the instruction execution module.
  • the instruction identification submodule determines the first execution result according to the mark.
  • the instruction identification sub-module sends the first execution result to the instruction judgment sub-module.
  • the instruction judgment sub-module receives the first execution result from the instruction identification sub-module.
  • the instruction execution module sends all execution results to the instruction identification sub-module, and the instruction identification sub-module identifies the first execution result according to the mark of the first execution result.
  • the instruction execution module can directly send the first execution result to the instruction identification sub-module according to the mark of the first execution result, which is equivalent to the instruction identification sub-module determining the first execution result.
  • the instruction judgment submodule determines whether there is an error in the processor based on the matching result between the first execution result and the expected result of the internal detection instruction.
  • For how the instruction judgment sub-module determines whether there is an error in the processor, refer to the content discussed above.
  • the instruction judgment sub-module determines that there is an error in the processor, it can provide alarm information to the management module.
  • the content of the management module and alarm information can refer to the content discussed above.
  • the instruction execution module obtains external program instructions from the external storage module.
  • the instruction execution module decodes the external program instruction, obtains the decoded result, and executes the decoded result to obtain the second execution result.
  • the instruction execution module sends the second execution result to the external storage module.
  • S1108-S1110 are optional steps.
  • FIG. 12 is a schematic diagram of a process for processing internal detection instructions and external program instructions according to an embodiment of the present application.
  • the path for processing internal detection instructions includes: instruction acquisition sub-module → instruction execution module → instruction identification sub-module → error judgment sub-module.
  • the path for processing external program instructions includes: application → external storage module → instruction execution module → application. It can be seen from this that the path for processing internal detection instructions in the embodiment of the present application is different from the path for processing external program instructions, and the path for processing the internal detection instructions in the embodiment of the present application is simpler.
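To make the internal path above concrete, the following toy simulation walks one marked instruction through acquisition, execution, identification, and judgment. Everything in it is an assumption made for illustration: the `Instruction` and `Result` types, the `"DETECT"` mark, the use of integer addition as the detection instruction, and the injectable `adder` that stands in for a possibly faulty execution unit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Instruction:
    a: int
    b: int
    mark: Optional[str] = None


@dataclass(frozen=True)
class Result:
    value: int
    mark: Optional[str] = None


def execute(instruction: Instruction, adder: Callable[[int, int], int]) -> Result:
    """Instruction execution module: run the instruction and copy its mark to the result."""
    return Result(adder(instruction.a, instruction.b), instruction.mark)


def detect_error(adder: Callable[[int, int], int], expected_result: int = 5) -> bool:
    """Return True if the marked result disagrees with the expected result."""
    probe = Instruction(2, 3, mark="DETECT")            # acquisition: add the mark
    results = [execute(Instruction(10, 20), adder),     # an unmarked external instruction
               execute(probe, adder)]                   # the internal detection instruction
    first = next(r for r in results if r.mark == "DETECT")  # identification by mark
    return first.value != expected_result               # judgment


healthy = lambda a, b: a + b
faulty = lambda a, b: (a + b) ^ 1  # flips the lowest bit of every sum
```

Calling `detect_error(healthy)` reports no error, while `detect_error(faulty)` reports one, because the corrupted sum of 2 and 3 no longer equals the expected result of 5.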
  • The error detection module can include an instruction acquisition sub-module, a load judgment sub-module, an instruction identification sub-module and an error judgment sub-module, which provides a scheme for detecting processor errors.
  • The instruction acquisition sub-module, load judgment sub-module, instruction identification sub-module, error judgment sub-module and instruction execution unit in the processor can work together to detect processor errors, without the processor decoding program instructions or sending execution results of program instructions to external inspection tools, which helps reduce the processing load of the processor.
  • the instruction acquisition sub-module can send internal detection instructions to the instruction execution module when the load of the processor is small, so as to avoid increasing the processing load of the processor when the load of the processor is large.
  • An embodiment of the present application provides a computing device cluster. Please refer to Figure 13, which is a schematic diagram of a computing device cluster provided by an embodiment of the present application.
  • the computing device cluster includes at least one computing device 1300, and any two computing devices 1300 communicate through a communication network.
  • computing device 1300 includes a processor 1301 and a power supply circuit 1302.
  • the power supply circuit 1302 is used to provide power to the processor 1301.
  • the processor 1301 in the computing device 1300 may be used to implement any of the above methods for detecting processor errors, for example, the method for detecting processor errors in the embodiment shown in FIG. 8 or FIG. 11 . It can also implement the functions of any of the previous processors.
  • the structure of the processor 1301 may refer to the structure of the processor in Figure 2, Figure 3 or Figure 4 mentioned above.
  • the computing device 1300 also includes a memory 1303 and a communication interface 1304.
  • the memory 1303 and the communication interface 1304 are shown as dotted boxes in FIG. 13 .
  • the processor 1301 and the communication interface 1304 are coupled to each other. It can be understood that the communication interface 1304 may be a transceiver or an input-output interface.
  • the memory 1303 may be used to store external program instructions executed by the processor 1301 or input data required by the processor 1301 to run external program instructions or data generated after the processor 1301 executes the instructions.
  • the computing device cluster shown in Figure 13 can be used to implement the functions of the cloud data center in Figure 6 or Figure 7.
  • Embodiments of the present application provide a chip system, which includes: a processor and an interface.
  • The processor is used to call and run instructions from the interface; when the processor executes the instructions, any of the foregoing methods for detecting processor errors is implemented, for example, the method for detecting processor errors in the embodiment shown in Figure 8 or Figure 11.
  • Embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium is used to store computer programs or instructions.
  • When the computer programs or instructions are executed, any one of the above methods for detecting processor errors is implemented, for example, the method for detecting processor errors in the embodiment shown in FIG. 8 or FIG. 11.
  • Embodiments of the present application provide a computer program product containing instructions that, when run on a computer, implement any of the foregoing methods for detecting processor errors, for example, the method for detecting processor errors in the embodiment shown in Figure 8 or Figure 11.
  • the method steps in the embodiments of the present application can be implemented by hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC. Additionally, the ASIC can be located in the base station or terminal. Of course, the processor and the storage medium may also exist as discrete components in the base station or terminal.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user equipment, or other programmable device.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media.
  • the available media can be magnetic media, such as floppy disks, hard disks, and magnetic tapes; they can also be optical media, such as digital video discs; they can also be semiconductor media, such as solid-state drives.
  • the computer-readable storage medium may be volatile or nonvolatile storage media, or may include both volatile and nonvolatile types of storage media.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application relates to the technical field of computers. Provided are a processor and a processor error detection method. The processor comprises an error detection module and an instruction execution module. The error detection module is used for sending an internal detection instruction to the instruction execution module; the instruction execution module is used for executing the internal detection instruction to obtain a first execution result, and sending the first execution result to the error detection module; and the error detection module is used for determining, according to the first execution result and an expected result corresponding to the internal detection instruction, whether the processor has an error. The present invention does not require an external inspection tool to detect processor errors, such that processors decoding program instructions and sending execution results of the program instructions to inspection tools, etc. can be avoided, thus reducing the processing amount of the processors.

Description

A processor and a method for detecting processor errors
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202211080632.4, filed with the China Patent Office on September 5, 2022 and entitled "A processor and a method for detecting processor errors", the entire content of which is incorporated herein by reference.
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular, to a processor and a method for detecting processor errors.
Background
The processor is one of the main components of various computing devices and can be used to execute instructions. The processor may make errors while executing instructions.
Currently, one way to detect instruction execution errors in a processor is to use an inspection tool to test whether the processor executes instructions incorrectly. In this approach, the inspection tool sends program instructions to the processor, the processor decodes and executes the program instructions, and the processor feeds back the execution result of the program instructions to the inspection tool. The inspection tool compares the execution result with a pre-stored expected result of the program instructions. If the execution result does not match the expected result, the inspection tool determines that the processor has an instruction execution error. However, in this approach the processor needs to receive program instructions from the inspection tool, decode the program instructions, and feed back the execution result to the inspection tool, so the processing load of the processor is large.
Summary of the invention
Embodiments of the present application provide a processor and a method for detecting processor errors, which are used to reduce the amount of processing involved in detecting processor errors.
In a first aspect, embodiments of the present application provide a processor, including an error detection module and an instruction execution module, wherein: the error detection module is used to send an internal detection instruction to the instruction execution module; the instruction execution module is used to execute the internal detection instruction, obtain a first execution result, and send the first execution result to the error detection module; and the error detection module is used to determine whether there is an error in the processor according to the first execution result and the expected result corresponding to the internal detection instruction.
In the embodiments of the present application, the instruction execution module can directly execute the internal detection instruction from the error detection module to obtain the first execution result, and the error detection module compares the first execution result with the expected result corresponding to the internal detection instruction to determine whether there is an error in the processor. In this way, the processor executes an instruction that is already inside the processor and then determines whether there is an error in the processor. The processor does not need to decode the instruction or send the first execution result to an external inspection tool, which helps reduce the processing load of the processor and saves its computing resources. Moreover, no external inspection tool is needed, which helps reduce the cost of detecting processor errors. In addition, because the processor executes instructions held inside the processor when detecting errors, and the processor can hold various types of instructions, the method can be applied to a wider range of instruction types.
In a possible implementation, the instruction execution module is also used to: read an external program instruction from an external storage module, decode the external program instruction to obtain a decoded result, execute the decoded result to obtain a second execution result, and write the second execution result into the external storage module.
In the above implementation, in addition to executing the internal detection instruction of the processor, the instruction execution module in the processor can also execute external program instructions in the external storage module. In other words, executing the internal detection instruction does not affect the execution of external program instructions.
In a possible implementation, the error detection module is further used to: before sending the internal detection instruction to the instruction execution module, determine that the remaining space in the buffer queue of the processor is greater than or equal to a threshold, where the buffer queue is used to cache instructions to be processed by the processor.
In the above implementation, the error detection module can send the internal detection instruction to the instruction execution module when the load of the processor is relatively small, so that the instruction execution module can execute external program instructions when the load is relatively small. This reduces the impact of executing the internal detection instruction on the execution of external program instructions and helps allocate the computing resources of the processor reasonably.
In a possible implementation, the error detection module is further used to add a mark to the internal detection instruction, where the mark indicates an instruction used to detect errors of the processor; the instruction execution module is further used to add the mark to the first execution result; and the error detection module is further used to identify, according to the mark in the first execution result, the first execution result corresponding to the internal detection instruction.
In the above implementation, the error detection module can add a mark to the internal detection instruction, so that the error detection module can subsequently distinguish the first execution result that corresponds to the instruction used to detect processor errors.
In a possible implementation, the processor further includes a register that stores the expected result corresponding to the internal detection instruction; the error detection module is further used to read the expected result corresponding to the internal detection instruction from the register.
In the above implementation, the expected result corresponding to the internal detection instruction can be configured in the register, either manually by a user or by the processor, so that the error detection module can quickly obtain the expected result and subsequently detect quickly whether there is an error in the processor.
In a possible implementation, the error detection module is further used to obtain the internal detection instruction from instructions that the processor has already executed; or, the processor further includes a first storage module that stores the internal detection instruction and can be read by the processor, and the error detection module is further used to read the internal detection instruction from the first storage module; or, the processor further includes a second storage module that stores the internal detection instruction and can be read and written by the processor, and the error detection module is further used to read the internal detection instruction from the second storage module.
The above implementation provides three ways for the error detection module to obtain the internal detection instruction, which broadens the ways in which the error detection module can obtain instructions for detecting processor errors. In the first way, the error detection module samples the internal detection instruction from instructions that the processor has already executed, which is simple and direct. In the second way, the error detection module reads the internal detection instruction from the first storage module, and the instructions in the first storage module can be configured manually. In the third way, the error detection module reads the internal detection instruction from the second storage module. Unlike the second way, the second storage module supports writes by the processor, which makes it easier for the processor, or for an external device acting through the processor, to add instructions to the second storage module.
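The three ways of obtaining the internal detection instruction can be contrasted in a small sketch. The class names `FirstStorageModule` and `SecondStorageModule` mirror the terms used above, but their interfaces, the string representation of instructions, and the `sample_from_executed` helper are assumptions made for illustration only.

```python
import random
from typing import Iterable, Sequence


def sample_from_executed(executed: Sequence[str], rng: random.Random) -> str:
    """Way 1: sample a detection instruction from instructions already executed."""
    return rng.choice(list(executed))


class FirstStorageModule:
    """Way 2: readable by the processor, with contents configured in advance."""

    def __init__(self, instructions: Iterable[str]):
        self._instructions = tuple(instructions)

    def read(self, index: int) -> str:
        return self._instructions[index]


class SecondStorageModule:
    """Way 3: readable and writable, so instructions can be added later."""

    def __init__(self, instructions: Iterable[str] = ()):
        self._instructions = list(instructions)

    def read(self, index: int) -> str:
        return self._instructions[index]

    def write(self, instruction: str) -> None:
        self._instructions.append(instruction)
```

The only structural difference between the two storage classes in this sketch is the `write` method, which reflects the distinction the text draws between the second and third ways.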
In a possible implementation, the processor further includes at least one processor core, and one of the at least one processor core corresponds to the instruction execution module; the error detection module is specifically configured to determine, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor has an error.
The above implementation applies to a processor that includes one or more processor cores. If the processor includes multiple processor cores, the instruction execution module identifies which processor core has the error, so the faulty processor core can be located precisely. In addition, in this implementation one error detection module can detect errors on multiple processor cores, which helps reduce the cost of detecting processor errors.
In a possible implementation, the error detection module is further configured to determine that a detection switch in a management module is on, where the detection switch indicates whether errors of the processor are to be detected, and the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
In the above implementation, both the management module and the processor may be provided in a computing device, and a detection switch may be configured in the management module. The error detection module carries out the process of detecting processor errors when it determines that the detection switch in the management module is on, which provides a flexible way to enable processor error detection.
In a possible implementation, the processor further includes a control module; the error detection module is further configured to provide alarm information to a management module when it determines that the processor has an error, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information indicates that the processor has an error; and/or the control module is configured to shut down the processor when it is determined that the processor has an error.
In the above implementation, when the error detection module determines that the processor has an error, it can provide alarm information to the management module so that the management module can present the alarm and the user learns of the processor error promptly. In addition, the control module in the processor can shut down the processor, which prevents the processor from continuing to execute instructions incorrectly.
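The detection switch and the two reactions to a detected error (alarm and shutdown) can be sketched in software as follows. This is only an illustrative model: the class names `ManagementModule` and `Processor`, the function `handle_check`, and the `shutdown_on_error` flag are hypothetical and do not come from the application, which describes hardware modules rather than Python objects.

```python
from typing import List


class ManagementModule:
    """Hypothetical stand-in for a BMC, firmware system, or operating system."""

    def __init__(self, detection_enabled: bool) -> None:
        self.detection_enabled = detection_enabled  # the detection switch
        self.alarms: List[str] = []

    def report(self, message: str) -> None:
        # Store the alarm information so that it can be presented to the user.
        self.alarms.append(message)


class Processor:
    """Hypothetical processor model that can only be running or shut down."""

    def __init__(self) -> None:
        self.running = True

    def shut_down(self) -> None:
        self.running = False


def handle_check(mgmt: ManagementModule, cpu: Processor,
                 result: int, expected: int,
                 shutdown_on_error: bool = True) -> bool:
    """Return True if an error was detected and handled."""
    if not mgmt.detection_enabled:
        return False  # switch is off: no detection is performed
    if result == expected:
        return False  # results match: no error
    mgmt.report(f"processor error: got {result}, expected {expected}")
    if shutdown_on_error:
        cpu.shut_down()  # role of the control module in the text
    return True
```

Because the text joins the alarm and the shutdown with "and/or", the sketch makes the shutdown optional while always raising the alarm; other combinations are equally consistent with the description.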
In a second aspect, an embodiment of the present application provides a method for detecting processor errors. The method may be performed by a processor or by a computing device that includes a processor. For ease of description, the following takes the processor performing the method as an example. The processor executes an internal detection instruction to obtain a first execution result; the processor determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor has an error.
In a possible implementation, the method further includes: reading an external program instruction from an external storage module, decoding the external program instruction to obtain a decoded result, executing the decoded result to obtain a second execution result, and writing the second execution result to the external storage module.
In a possible implementation, the method further includes: the processor determines that the remaining space in a buffer queue is greater than or equal to a threshold, where the buffer queue is used to buffer instructions to be processed by the processor.
In a possible implementation, the method further includes: the processor adds a mark to the internal detection instruction, where the mark indicates an instruction used to detect errors of the processor; the processor adds the mark to the first execution result; and the processor identifies, according to the mark in the first execution result, the first execution result corresponding to the internal detection instruction.
In a possible implementation, the processor includes a register that stores the expected result corresponding to the internal detection instruction, and the method further includes: the processor reads the expected result corresponding to the internal detection instruction from the register.
In a possible implementation, the method further includes: the processor obtains the internal detection instruction from instructions that the processor has already executed; or the processor further includes a first storage module that stores the internal detection instruction and can be read by the processor, and the processor reads the internal detection instruction from the first storage module; or the processor further includes a second storage module that stores the internal detection instruction and can be read and written by the processor, and the processor reads the internal detection instruction from the second storage module.
In a possible implementation, the processor includes at least one processor core; the processor determining, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor has an error includes: the processor determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core used to obtain the first execution result has an error.
In a possible implementation, the method further includes: the processor determines that a detection switch in a management module is on, where the detection switch indicates whether errors of the processor are to be detected, and the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
In a possible implementation, the method further includes: when the processor determines that the processor has an error, the processor provides alarm information to a management module, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information indicates that the processor has an error; and/or, when it is determined that the processor has an error, the processor shuts itself down.
In a third aspect, an embodiment of the present application provides a method for detecting processor errors. The method may be performed by a processor or by a computing device that includes a processor. For ease of description, the following takes the processor performing the method as an example. The processor includes an instruction execution module and an error detection module. The method includes: the error detection module sends an internal detection instruction to the instruction execution module; the instruction execution module executes the internal detection instruction to obtain a first execution result and sends the first execution result to the error detection module; the error detection module determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor has an error.
In a possible implementation, before the internal detection instruction is sent to the instruction execution module, the method further includes:
the error detection module determines that the remaining space in a buffer queue of the processor is greater than or equal to a threshold, where the buffer queue is used to buffer instructions to be processed by the processor.
In a possible implementation, the method further includes: the error detection module adds a mark to the internal detection instruction, where the mark indicates an instruction used to detect errors of the processor; the error detection module identifies, according to the mark, the first execution result corresponding to the internal detection instruction.
In a possible implementation, the processor further includes a register that stores the expected result corresponding to the internal detection instruction; the method further includes: the error detection module reads the expected result corresponding to the internal detection instruction from the register.
In a possible implementation, the method further includes: the error detection module obtains the internal detection instruction from instructions that the processor has already executed; or the error detection module reads the internal detection instruction from a first storage module in the processor, where the first storage module can be read by the processor; or the error detection module reads the internal detection instruction from a second storage module in the processor, where the second storage module can be read and written by the processor.
In a possible implementation, the processor includes at least one processor core, and one of the at least one processor core corresponds to the instruction execution module; the error detection module determining, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core has an error includes: the error detection module determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core corresponding to the instruction execution module has an error.
In a possible implementation, the method further includes: the error detection module determines that a detection switch in a management module is on, where the detection switch indicates whether errors of the processor are to be detected, and the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
In a possible implementation, when the error detection module determines that the processor has an error, it provides alarm information to a management module, where the management module includes one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information indicates that the processor has an error; and/or, when it is determined that the processor has an error, a control module in the processor shuts down the processor.
In a fourth aspect, an embodiment of the present application provides a computing device, which may include the processor of any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computing device that includes a processor and a power supply circuit. The power supply circuit is configured to supply power to the processor, and the processor is configured to implement the method of any implementation of the second aspect or the third aspect.
In a sixth aspect, an embodiment of the present application provides a computing device cluster that includes at least one computing device, where each computing device can perform the method of any implementation of the second aspect or the third aspect.
Optionally, each computing device in the computing device cluster may be the computing device of the fourth aspect or the fifth aspect.
In a seventh aspect, a computer program product containing instructions is provided; when run on a computer, it implements the method of any implementation of the second aspect or the third aspect.
In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium for storing a computer program or instructions that, when run, implement the method of any implementation of the second aspect or the third aspect.
For the beneficial effects of the second aspect to the eighth aspect, refer to the beneficial effects discussed for the first aspect; they are not repeated here.
Description of the Drawings
Figure 1 is a schematic diagram of the deployment of an inspection tool;
Figure 2 is a schematic structural diagram of a processor provided by an embodiment of the present application;
Figure 3 is a schematic structural diagram of another processor provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of an error detection module provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
Figure 6 is a schematic diagram of the deployment of a cloud data center provided by an embodiment of the present application;
Figure 7 is a schematic architectural diagram of a cloud data center provided by an embodiment of the present application;
Figure 8 is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application;
Figure 9 is a schematic diagram of a detection switch of a management module provided by an embodiment of the present application;
Figure 10 is a schematic diagram of a process for handling internal detection instructions and external program instructions provided by an embodiment of the present application;
Figure 11 is a schematic flowchart of another method for detecting processor errors provided by an embodiment of the present application;
Figure 12 is a schematic diagram of another process for handling internal detection instructions and external program instructions provided by an embodiment of the present application;
Figure 13 is a schematic architectural diagram of a computing device cluster provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in further detail below with reference to the accompanying drawings.
Some terms used in the embodiments of the present application are explained below to aid understanding by those skilled in the art.
1. Silent data corruption (SDC), also called a silent data error, means that an error occurs while the processor executes an instruction, but the operating system of the device in which the processor resides does not perceive the error, so the erroneous result of the instruction is stored.
2. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.
In the embodiments of the present application, unless otherwise specified, a noun covers both its singular and plural forms, that is, "one or more". "At least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between objects and covers three cases; for example, "A and/or B" can mean A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it; for example, A/B means A or B. "At least one of the following items" or a similar expression refers to any combination of the items, including a single item or any combination of multiple items. For example, at least one of a, b, or c means: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or multiple.
To reduce the silent data corruption that may occur in a processor, an inspection tool can currently be used to test whether the processor has errors. Figure 1 is a schematic diagram of the deployment of an inspection tool; it can also be understood as an architectural diagram of a device. As shown in Figure 1, the device includes a processor, multiple running applications (APPs), and an operating system. The applications include application 1, the inspection tool, and application 2 shown in Figure 1.
The inspection tool may have built-in program instructions and the expected results corresponding to those program instructions. When it checks the processor for errors, the inspection tool loads the program instructions into an external memory, which is separate from the processor. The processor reads the program instructions from the external memory and decodes them. The processor then executes the decoded program instructions and sends the execution result to the inspection tool.
The inspection tool compares the execution result of the program instructions with the expected result. If the two match, the inspection tool determines that the processor has no error. If they do not match, the inspection tool determines that the processor has an error, and in that case it can report the processor error to the operating system.
It can be seen that, in the current way of detecting processor errors, the processor has to read the program instructions from the external memory, decode them, and later send the execution result to the inspection tool, which gives the processor a large amount of work.
In view of this, an embodiment of the present application provides a processor that includes an error detection module and an instruction execution module. The error detection module sends (or passes) an instruction used to detect processor errors, such as an internal detection instruction, to the instruction execution module. The instruction execution module executes the internal detection instruction, obtains a first execution result, and sends the first execution result to the error detection module. The error detection module determines, according to the first execution result and the expected result of the internal detection instruction, whether the processor has an error. In this way, the processor can check itself for errors without an external inspection tool. Because the processor does not need to decode external program instructions or send their execution results to an inspection tool, its processing load is reduced.
The processor in the embodiments of the present application may be any type of processor, for example a single-core processor or a multi-core processor.
Figure 2 is a schematic structural diagram of a processor provided by an embodiment of the present application; it shows, for example, the structure of a single-core processor. As shown in Figure 2, the processor 200 includes an error detection module 210 and an instruction execution module 220, which can communicate with each other.
Both the error detection module 210 and the instruction execution module 220 may be implemented in hardware, for example as logic circuits. The present application does not limit their specific structures. The instruction execution module 220 is, for example, an arithmetic logic unit (ALU).
Specifically, the error detection module 210 sends an instruction used to detect processor errors (such as an internal detection instruction) to the instruction execution module 220. The instruction execution module 220 executes the internal detection instruction to obtain a first execution result and sends the first execution result to the error detection module 210. The error detection module 210 determines, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor has an error.
For example, if the first execution result matches the expected result of the internal detection instruction, the error detection module 210 determines that the processor has no error. If the first execution result does not match the expected result of the internal detection instruction, the error detection module 210 determines that the processor has an error.
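The comparison step can be illustrated with a small software model. The sketch below is not the hardware design described in the application; `DetectionInstruction`, `has_error`, `healthy_execute`, and `faulty_execute` are hypothetical names, and the fault is simulated by flipping one bit of an addition result.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class DetectionInstruction:
    """Hypothetical internal detection instruction with its expected result."""
    name: str
    operands: Tuple[int, int]
    expected: int  # preconfigured expected result


def has_error(instr: DetectionInstruction,
              execute: Callable[[DetectionInstruction], int]) -> bool:
    """Model of the error detection module: run the instruction and compare."""
    first_result = execute(instr)  # first execution result
    return first_result != instr.expected


def healthy_execute(instr: DetectionInstruction) -> int:
    """Model of a correctly working instruction execution module."""
    a, b = instr.operands
    return a + b


def faulty_execute(instr: DetectionInstruction) -> int:
    """Model of a defective execution module: the lowest result bit flips."""
    a, b = instr.operands
    return (a + b) ^ 1
```

With `DetectionInstruction("add", (2, 3), 5)`, `has_error` reports no error for `healthy_execute` and an error for `faulty_execute`, which mirrors the match and mismatch cases above.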
In the embodiments of the present application, the error detection module 210 and the instruction execution module 220 in the processor 200 cooperate to detect errors of the processor 200 without an external inspection tool. The processor 200 therefore does not need to decode program instructions or send execution results to an inspection tool, which reduces its processing load and saves processor resources.
In a possible implementation, the error detection module 210 sends the internal detection instruction to the instruction execution module 220 when the remaining space in a buffer queue of the processor 200 is greater than or equal to a threshold. The buffer queue stores instructions that the processor 200 has yet to process. The threshold may be preconfigured in the error detection module 210.
The smaller the remaining space in the buffer queue, the more instructions the processor 200 has to process and the heavier its load; the larger the remaining space, the fewer instructions the processor 200 has to process and the lighter its load.
This implementation avoids sending internal detection instructions to the instruction execution module 220 while the processor 200 is heavily loaded, which would add to its processing load, and so helps use the resources of the processor 200 sensibly.
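The threshold check on the buffer queue can be sketched as follows. This is a minimal model under stated assumptions: the `BufferQueue` class, its fixed `capacity`, and the function `should_send_detection` are hypothetical, and the text does not say how the queue is implemented in hardware.

```python
from collections import deque
from typing import Deque


class BufferQueue:
    """Hypothetical fixed-capacity queue of instructions awaiting processing."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: Deque[str] = deque()

    def push(self, instruction: str) -> None:
        if self.remaining_space() == 0:
            raise OverflowError("buffer queue is full")
        self.pending.append(instruction)

    def remaining_space(self) -> int:
        # Less remaining space means more pending work, i.e. a heavier load.
        return self.capacity - len(self.pending)


def should_send_detection(queue: BufferQueue, threshold: int) -> bool:
    """Send a detection instruction only if enough of the queue is free."""
    return queue.remaining_space() >= threshold
```

For a queue with capacity 8 and a threshold of 4, two pending instructions leave 6 free slots, so detection may proceed; six pending instructions leave 2 free slots, so detection is deferred.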
In addition to instructions from the error detection module 210, the instruction execution module 220 may also have to execute external program instructions, that is, instructions that do not originate inside the processor 200. Accordingly, the instruction execution module 220 produces execution results for external program instructions as well as for internal detection instructions. To make it easy for the error detection module 210 to identify the first execution result corresponding to the internal detection instruction, in a possible implementation the error detection module 210 adds a mark to the internal detection instruction. Based on the mark, the error detection module 210 obtains from the instruction execution module 220 the first execution result of the marked internal detection instruction.
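The role of the mark can be illustrated with a short model in which the mark is a boolean field that travels from the instruction to its result. The names `Instr`, `Result`, `execute`, and `detection_results` are hypothetical, and the encoding of the mark in real hardware is not specified in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instr:
    op: str                     # "add" or "mul" in this sketch
    a: int
    b: int
    is_detection: bool = False  # the mark added by the error detection module


@dataclass(frozen=True)
class Result:
    value: int
    is_detection: bool          # the mark is carried over to the result


def execute(instr: Instr) -> Result:
    """Model of the execution module: compute and propagate the mark."""
    value = instr.a + instr.b if instr.op == "add" else instr.a * instr.b
    return Result(value, instr.is_detection)


def detection_results(results: List[Result]) -> List[Result]:
    """Model of the error detection module picking out marked results."""
    return [r for r in results if r.is_detection]
```

If a stream mixes two unmarked program instructions with one marked detection instruction, `detection_results` returns only the result of the marked one, so results of external program instructions are never compared against an expected value.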
The expected result corresponding to the internal detection instruction may be preconfigured in the error detection module 210. Alternatively, it may be preconfigured in a register 260 of the processor 200, from which the error detection module 210 can later read it.
As an example, the processor 200 further includes a control module 250. The control module 250 controls the modules of the processor 200 (including the error detection module 210 and the instruction execution module 220). The control module 250 is, for example, a control unit (CU).
As an example, the processor 200 further includes a first storage module 230 and/or a second storage module 240, which can be understood as internal storage modules of the processor 200. The first storage module 230 is, for example, a read-only memory (ROM) or a memory, and the second storage module 240 is, for example, a memory. The embodiments of the present application do not limit the specific implementation forms of the first storage module 230 and the second storage module 240.
具体的,第一存储模块230用于存储有指令。第一存储模块230可允许被处理器200(具体如处理器200中的错误检测模块210)读取。例如,错误检测模块210用于从第一存储模块230中读取内部检测指令。Specifically, the first storage module 230 is used to store instructions. The first storage module 230 may be allowed to be read by the processor 200 (specifically, the error detection module 210 in the processor 200). For example, the error detection module 210 is used to read internal detection instructions from the first storage module 230 .
第二存储模块240用于存储有指令。第二存储模块240可允许被处理器200(具体如处理器200中的错误检测模块210)读取和写入。例如,外部设备可通过处理器200向第二存储模块240写入指令,以及错误检测模块210还可从第二存储模块240中读取内部检测指令。如此,有利于拓展用于检测处理器200错误的指令的数量以及类型等。The second storage module 240 is used to store instructions. The second storage module 240 may be allowed to be read and written by the processor 200 (specifically, the error detection module 210 in the processor 200). For example, the external device can write instructions to the second storage module 240 through the processor 200, and the error detection module 210 can also read internal detection instructions from the second storage module 240. In this way, it is helpful to expand the number and type of instructions used to detect errors in the processor 200 .
作为一个示例,错误检测模块210还可从指令执行模块220已执行过的指令中读取内部检测指令。As an example, the error detection module 210 may also read internal detection instructions from instructions that have been executed by the instruction execution module 220 .
在本申请实施例中,错误检测模块210可以从处理器200内部的第一存储模块230、第二存储模块240或指令执行模块220已经执行过的指令中读取内部检测指令,提供了获取内部检测指令的多种方式。In the embodiment of the present application, the error detection module 210 can read the internal detection instructions from the instructions that have been executed by the first storage module 230, the second storage module 240 or the instruction execution module 220 inside the processor 200, providing the ability to obtain internal Multiple ways to detect instructions.
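As a rough software analogy for the three sources named above, the sketch below gathers candidate internal detection instructions from a read-only store, a writable store and an execution history. The function name, the source labels and the sample instruction strings are all assumptions made for illustration.

```python
def candidate_instructions(first_storage, second_storage, history):
    """Collect candidates from all three sources, remembering where each came from."""
    candidates = []
    for source, items in (
        ("first_storage", first_storage),    # read-only, e.g. ROM
        ("second_storage", second_storage),  # readable and writable
        ("history", history),                # instructions already executed
    ):
        candidates.extend((source, item) for item in items)
    return candidates


first_storage = ("mul r1, r2",)      # tuple: contents fixed in advance
second_storage = ["div r3, r4"]      # list: an external device may append more
history = ["add r5, r6"]

second_storage.append("sqrt r7")     # extending the writable store at run time
candidates = candidate_instructions(first_storage, second_storage, history)
```

Only the writable store grows after deployment in this model, which mirrors the point that the second storage module helps expand the set of detection instructions.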
In a possible implementation, the instruction execution module 220 may also be used to process external program instructions, which are instructions of programs from devices other than the processor. For example, the instruction execution module 220 may decode an external program instruction to obtain a decoded result, and execute the decoded result to obtain a second execution result. Decoding an external program instruction can be understood as converting it into instructions that the instruction execution module 220 can operate on directly.
An external program instruction is, for example, an instruction stored in an external memory. For example, external program instructions may be the instructions formed after the operating system loads an application.
Refer to FIG. 3, which is a schematic structural diagram of another processor provided by an embodiment of the present application. FIG. 3 is, for example, a schematic structural diagram of a multi-core processor. As shown in FIG. 3, the processor 300 includes multiple processor cores, an error detection module 301 and an instruction execution module 302. In FIG. 3, the multiple processor cores are exemplified by a first processor core 310 and a second processor core 320. A processor core may also be called a physical processor core.
For the function and implementation form of the error detection module 301, refer to the description of the error detection module 210 in FIG. 2 above; for the function and implementation form of the instruction execution module 302, refer to the description of the instruction execution module 220 in FIG. 2 above.
The first processor core 310 and the second processor core 320 may have the same structure. The first processor core 310 is used as an example below.
The first processor core 310 includes an instruction execution module 302; for its details, refer to the foregoing description. The first processor core 310 also includes a control module 307; for its details, refer to the foregoing description.
In the processor 300 of the embodiment of FIG. 3, the error detection module 301 can perform error detection on each of the multiple processor cores of the processor 300 separately, to determine which specific processor core has an error.
For example, the error detection module 301 may send an internal detection instruction to the instruction execution module 302 in the first processor core 310, and obtain the execution result of the internal detection instruction returned by that instruction execution module 302. If the execution result does not match the expected result, the error detection module 301 determines that the first processor core 310 has an error. If the execution result matches the expected result, the error detection module 301 determines that the first processor core 310 has no error. In the same way, the error detection module 301 can detect whether the second processor core 320 has an error.
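The per-core check can be sketched as a loop that sends the same internal detection instruction to each core and compares each result with the expected result. The core functions below are hypothetical stand-ins, and the faulty core is simulated by flipping the lowest bit of its result.

```python
def detect_faulty_cores(cores, instruction, expected):
    """Return the names of the cores whose result differs from the expected result."""
    faulty = []
    for name, run in cores.items():
        if run(instruction) != expected:
            faulty.append(name)
    return faulty


def healthy_core(instruction):
    a, b = instruction
    return a + b


def faulty_core(instruction):
    a, b = instruction
    return (a + b) ^ 1  # simulated error: lowest bit of the sum is flipped


cores = {"core_310": healthy_core, "core_320": faulty_core}
faulty = detect_faulty_cores(cores, (2, 3), expected=5)
```

Running the check core by core is what lets the result name a specific core instead of only reporting that the processor as a whole misbehaved.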
In this implementation, the processor 300 can perform error detection on the individual processor cores in the processor 300, which amounts to more precise error detection for the processor 300.
Optionally, the processor 300 further includes one or more of a register 304, a first storage module 305 and a second storage module 306.
For the function of the register 304, refer to the description of the register 260 in FIG. 2 above. For the function and implementation form of the first storage module 305, refer to the description of the first storage module 230 in FIG. 2 above. For the function and implementation form of the second storage module 306, refer to the description of the second storage module 240 in FIG. 2 above.
Refer to FIG. 4, which is a schematic structural diagram of yet another processor provided by an embodiment of the present application. As shown in FIG. 4, the processor 400 includes an instruction execution module 410 and an error detection module 420. The error detection module 420 includes an instruction acquisition sub-module 421, an instruction identification sub-module 422 and an error judgment sub-module 423.
Optionally, the instruction acquisition sub-module 421, the instruction identification sub-module 422 and the error judgment sub-module 423 can all be implemented in hardware. For example, the instruction acquisition sub-module 421 can be implemented by a register or a memory, and the instruction identification sub-module 422 and the error judgment sub-module 423 can each be implemented by a comparator.
Specifically, the instruction acquisition sub-module 421 is used to provide internal detection instructions to the instruction execution module 410. For how the instruction acquisition sub-module 421 obtains internal detection instructions, refer to the foregoing description of how the error detection module obtains them, which is not repeated here. The instruction identification sub-module 422 identifies, among the execution results produced by the instruction execution module 410, the execution result corresponding to the internal detection instruction. The error judgment sub-module 423 determines whether the execution result corresponding to the internal detection instruction is the same as its expected result, and thereby determines whether the processor 400 has an error.
Optionally, the processor 400 further includes a load determination sub-module 424, shown as a dashed box in FIG. 4. The load determination sub-module 424 can be used to determine whether the remaining space of the buffer queue of the processor 400 is greater than or equal to a threshold. For example, when the load determination sub-module 424 determines that the remaining space of the buffer queue of the processor 400 is greater than or equal to the threshold, it triggers the instruction acquisition sub-module 421 to send an internal detection instruction to the instruction execution module 410.
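The load check reduces to a single comparison. The sketch below assumes, for illustration only, that the buffer queue is described by a capacity and a current occupancy; the parameter names are hypothetical.

```python
def should_send_detection(queue_capacity, queue_used, threshold):
    """True when the remaining buffer-queue space is at least the threshold."""
    remaining = queue_capacity - queue_used
    return remaining >= threshold


lightly_loaded = should_send_detection(queue_capacity=16, queue_used=4, threshold=8)
heavily_loaded = should_send_detection(queue_capacity=16, queue_used=12, threshold=8)
at_threshold = should_send_detection(queue_capacity=16, queue_used=8, threshold=8)
```

Gating detection on spare queue space means detection instructions are issued mainly when the processor is lightly loaded, so they compete less with external program instructions.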
Any of the processors involved in the embodiments of the present application may be disposed in any type of computing device. A computing device refers generally to a device with processing capability, for example a server or a terminal device. Terminal devices include, for example, mobile phones, tablets, computers with wireless transceiver functions, wearable devices, vehicles, drones, helicopters, airplanes, ships, robots, robotic arms and smart home devices.
Refer to FIG. 5, which is a schematic structural diagram of a computing device provided by an embodiment of the present application. As shown in FIG. 5, the computing device 500 includes a software layer and a hardware layer.
The software layer includes an operating system 510, a firmware system 520, a baseboard management controller (BMC) 530, and the like. The baseboard management controller 530 is, for example, a small operating system independent of the computing device 500. The operating system 510, the firmware system 520 and the baseboard management controller 530 can all be used to manage the computing device 500. In the embodiments of the present application, one or more of the operating system 510, the firmware system 520 and the baseboard management controller 530 can be regarded as a management module.
The hardware layer includes an external storage module 540, a processor 550, a network card 560, and the like. The external storage module 540 is, for example, the memory of the computing device 500. For the structure of the processor 550, refer to the structure shown in FIG. 2, FIG. 3 or FIG. 4 above. FIG. 5 shows only one processor 550; in practice there may be one or more processors 550.
The processor 550 can be used to process requests from outside the computing device 500 and/or requests generated inside the computing device 500. The network card 560 is used by the computing device 500 to communicate with other devices.
In addition, the computing device 500 may include a bus, which can be used for communication between the components of the computing device 500.
Optionally, the computing device 500 may also include a power supply circuit, which is used to supply power to the processor 550.
In a possible implementation, when the processor 550 detects that it has an error, it may provide alarm information to the management module. The alarm information indicates that the processor 550 has an error, so that the user can learn of the state of the processor 550 in time.
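The alarm path can be modeled as a callback into the management module. In the sketch below, the `notify` callback and the fields of the alarm record are assumptions made for illustration; the application does not define the format of the alarm information.

```python
def check_and_alert(first_result, expected_result, notify):
    """Compare the results; on a mismatch, hand alarm information to the management module."""
    if first_result == expected_result:
        return False
    notify({
        "source": "processor 550",
        "error": True,
        "first_result": first_result,
        "expected_result": expected_result,
    })
    return True


alarms = []  # stands in for the management module's alarm log
ok_case = check_and_alert(12, 12, alarms.append)
bad_case = check_and_alert(13, 12, alarms.append)
```

Here a plain list plays the role of the management module, which in the text above could be the operating system, the firmware system or the baseboard management controller.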
In a possible implementation, in addition to executing internal detection instructions, the processor 550 is also used to execute external program instructions in the external storage module 540.
Specifically, the processor 550 reads an external program instruction from the external storage module 540 and decodes it to obtain a decoded result. The processor 550 executes the decoded result to obtain a second execution result. The processor 550 may write the second execution result into the external storage module 540, so that the operating system of the computing device 500 and other components can obtain the second execution result.
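The read, decode, execute and write-back sequence for external program instructions can be sketched as follows. This is a toy model: the textual instruction format, the dictionary used as external storage and the two supported opcodes are assumptions made purely for illustration.

```python
def decode(raw):
    """Turn a textual instruction such as 'add 2 3' into (opcode, operands)."""
    opcode, *operands = raw.split()
    return opcode, [int(x) for x in operands]


def execute(opcode, operands):
    """Execute a decoded instruction and return its result."""
    if opcode == "add":
        return sum(operands)
    if opcode == "mul":
        result = 1
        for value in operands:
            result *= value
        return result
    raise ValueError(f"unknown opcode: {opcode}")


def run_external_instruction(external_storage, instr_addr, result_addr):
    """Read, decode and execute one instruction, then write the result back."""
    raw = external_storage[instr_addr]
    opcode, operands = decode(raw)
    second_result = execute(opcode, operands)
    external_storage[result_addr] = second_result
    return second_result


external_storage = {"instr": "add 2 3", "result": None}
second_execution_result = run_external_instruction(external_storage, "instr", "result")
```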
The computing device involved in the embodiments of the present application can be applied in any scenario, for example in a cloud data center. A cloud data center can be used to provide business services to users, including storage services and/or computing services.
Refer to FIG. 6, which is a schematic deployment diagram of a cloud data center provided by an embodiment of the present application. Alternatively, FIG. 6 can be understood as a schematic diagram of an application scenario of an error detection method provided by an embodiment of the present application. As shown in FIG. 6, the scenario includes multiple terminal devices 610, multiple clients 611 and a cloud data center 620. Each of the terminal devices 610 runs one of the clients 611. Each client 611 may be a software module, an application, or the like. The cloud data center 620 includes one or more computing devices.
For example, a client 611 can remotely access the cloud data center 620 and use the business services it provides.
The structure of a cloud data center is described below with reference to the schematic architecture diagram shown in FIG. 7. FIG. 7 uses the example in which the computing devices in the cloud data center are servers.
As shown in FIG. 7, the cloud data center 700 includes a cloud management platform 710 and at least one server 730. FIG. 7 shows two servers 730 as an example. The cloud management platform 710 communicates with each of the at least one server 730 through the cloud data center internal network 720.
The cloud management platform 710 is used to provide an access interface (such as a user interface or an application programming interface (API)). For example, a tenant can operate a client to remotely access the API, register a cloud account and password on the cloud management platform 710, and log in to the cloud management platform 710. The client is, for example, the client 611 in FIG. 6.
After the cloud management platform 710 successfully authenticates the cloud account and password, the tenant can select and pay for a virtual machine of a specific specification (processor, memory, disk) on the cloud management platform 710. A virtual machine may also be called a cloud server (elastic compute service, ECS) or an elastic instance. After the purchase succeeds, the cloud management platform 710 provides the tenant with the remote login account and password of the purchased virtual machine. The client can log in to the virtual machine remotely, and install and run the tenant's applications in it.
The logical functions of the cloud management platform 710 may include a user console, a computing management service, a network management service, a storage management service, an authentication service, an image management service, and the like. The user console provides an interface or API for interacting with tenants. The computing management service is used to manage servers that run virtual machines and containers, as well as bare metal servers. The network management service is used to manage network services (such as gateways and firewalls). The storage management service is used to manage storage services (such as data bucket services). The authentication service is used to manage tenant accounts and passwords. The image management service is used to manage virtual machine images.
Any two of the at least one server 730 may have the same structure. The structure of one server 730 is described below as an example.
The server 730 includes a hardware layer and a software layer. The hardware layer of the server 730 includes a memory 734, a processor 735, a network card 736 and a disk 737. For the memory 734, the processor 735 and the network card 736, refer to the description of FIG. 5 above. The memory 734 can be regarded as an example of an external storage module. Optionally, the server 730 may also include a power supply circuit, which is used to supply power to the processor 735.
The software layer of the server 730 includes an operating system installed and running on the server 730 (which, relative to the operating system of a virtual machine, can be called the host operating system). The operating system is provided with a virtual machine manager (VMM) 732 and multiple virtual machines 731. The virtual machine manager 732 can be used to implement computing virtualization, network virtualization and storage virtualization for the virtual machines, and to manage the virtual machines 731. Computing virtualization means providing part of the processor 735 and the memory 734 of the server 730 to a virtual machine. Network virtualization means providing part of the functions (such as bandwidth) of the network card 736 to a virtual machine. Storage virtualization means providing part of the disk 737 to a virtual machine. Managing the virtual machines 731 includes, for example, creating a virtual machine 731, emulating virtual hardware for a virtual machine based on the hardware layer (hardware emulation function), deleting a virtual machine 731, forwarding and/or processing network packets between the virtual machines 731 running on the server 730 or forwarding network packets between a virtual machine 731 on the server 730 and an external network (virtual switching function), and processing input/output (I/O) generated by the virtual machines 731.
The running environments of different virtual machines 731 (such as virtual machine applications, operating systems and virtual hardware) are completely isolated from one another, and different virtual machines 731 can communicate through the virtual machine manager 732. Each virtual machine 731 may run an operating system, a firmware system, a baseboard management controller, and the like.
The virtual machine manager 732 also includes a cloud platform management client 733. The cloud platform management client 733 is used to receive control plane commands sent by the cloud management platform 710, and to create virtual machines on the server 730 and manage their full life cycle according to those commands, so that tenants can create, manage, log in to and operate virtual machines 731 in the cloud data center 700 through the cloud management platform 710.
The methods provided by the embodiments of the present application are described below with reference to the accompanying drawings.
In the drawings corresponding to the embodiments of the present application, all steps shown with dashed lines are optional steps.
Refer to FIG. 8, which is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application. The processor involved in the embodiment shown in FIG. 8 has, for example, the structure of the processor 200 in FIG. 2, the processor 300 in FIG. 3 or the processor 400 in FIG. 4, and can be applied, for example, in the computing device shown in FIG. 5. The processor involved in the embodiment shown in FIG. 8 includes an instruction execution module and an error detection module.
S801. The error detection module adds a mark to the internal detection instruction. The error detection module involved in the embodiments of the present application is, for example, the error detection module 210 in FIG. 2, the error detection module 301 in FIG. 3, or the error detection module 420 in FIG. 4.
The error detection module may determine the instruction used to detect processor errors. The embodiments of the present application are described using the example in which the internal detection instruction is the instruction used to detect processor errors.
When the processor is applied in a computing device, such as the computing device in FIG. 5, the instruction execution module executes instructions from the error detection module and may also execute external program instructions from the external storage module. The instruction execution module is, for example, the instruction execution module 220 in FIG. 2, the instruction execution module 302 in FIG. 3, or the instruction execution module 410 in FIG. 4. To make it easier to subsequently distinguish the instructions that the error detection module sends to the instruction execution module, in the embodiments of the present application the error detection module may add a mark to the internal detection instruction. The mark indicates that the internal detection instruction is an instruction used to detect processor errors. The mark may specifically take the form of a label.
The external storage module is not located in the processor. The external storage module is, for example, a ROM or a memory, such as the external storage module 540 in FIG. 5. The instructions in the external storage module are external program instructions; for their meaning, refer to the foregoing description. External program instructions are, for example, instructions from an external application.
The error detection module can obtain the internal detection instruction in several ways, which are described separately below.
Manner 1: the error detection module determines the internal detection instruction from the instructions that the processor has already executed.
The instruction execution module in the processor executes multiple instructions, for example instructions from the error detection module and instructions from the external storage module. For ease of description, in the embodiments of the present application an instruction that the instruction execution module has already executed is called a historical instruction, and all the instructions that the instruction execution module has already executed are called the historical instruction set. The error detection module may select from the historical instruction set at least one historical instruction to be used for detecting whether the processor has an error, and may determine one of these historical instructions as the internal detection instruction. Accordingly, in Manner 1 the internal detection instruction is an instruction that the processor has already executed, that is, a historical instruction.
For example, the error detection module may randomly select one instruction from the at least one historical instruction as the internal detection instruction.
As another example, the error detection module may use the instruction with the highest execution complexity among the at least one historical instruction as the internal detection instruction. Execution complexity can be characterized by the time the instruction execution module previously took to execute the instruction: the longer the time, the higher the execution complexity. Because a processor is more likely to make errors when executing instructions of high complexity, using such instructions to check for errors makes processor errors easier to detect.
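The two selection policies described above (random choice, and choice by highest execution complexity) can be sketched as follows. The history record layout and the `duration_ns` field are hypothetical; the text only states that execution complexity can be characterized by how long the instruction previously took to execute.

```python
import random

# Hypothetical historical instructions with the time each previously took to execute.
history = [
    {"instruction": "add", "duration_ns": 1},
    {"instruction": "div", "duration_ns": 20},
    {"instruction": "mul", "duration_ns": 4},
]


def pick_random(candidates, rng):
    """Policy 1: choose any historical instruction at random."""
    return rng.choice(candidates)


def pick_most_complex(candidates):
    """Policy 2: choose the instruction that previously took longest to execute."""
    return max(candidates, key=lambda entry: entry["duration_ns"])


rng = random.Random(0)  # seeded only so that the sketch is repeatable
random_choice = pick_random(history, rng)
complex_choice = pick_most_complex(history)
```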
Optionally, if the processor is a multi-core processor, for example the processor 300 in FIG. 3 above, the error detection module may obtain the historical instruction set by sampling from the multiple instruction execution modules corresponding to the multiple processor cores. In other words, the historical instruction set includes instructions that multiple instruction execution modules have already executed.
In a possible implementation, when the error detection module obtains the at least one historical instruction from the instruction execution module, the error detection module may also write the results that the instruction execution module produced when executing the at least one historical instruction into a preset register in the processor. These results can be regarded as the expected results corresponding to the at least one historical instruction. The register is, for example, the register 260 in FIG. 2 or the register 304 in FIG. 3.
To make the at least one historical instruction easy to distinguish, the error detection module may also generate an identifier for each historical instruction, and store the identifiers in the register in association with the corresponding expected results.
For example, Table 1 below shows the identifiers of the at least one historical instruction and the corresponding expected results as stored in association in the register.
Table 1
As shown in Table 1 above, the expected result of the instruction identified as instruction 0 is 11, and the expected result of the instruction identified as instruction 1 is 00.
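The association between identifiers and expected results can be modeled as a simple lookup. The sketch below uses the two entries stated in the text (instruction 0 with expected result 11, instruction 1 with expected result 00) and keeps the results as strings, since the text does not say how they are encoded; the dictionary and the function name are illustrative assumptions.

```python
# Identifier -> expected result, using the two entries given in the text.
expected_results = {
    "instruction 0": "11",
    "instruction 1": "00",
}


def processor_has_error(instruction_id, first_execution_result, register):
    """An error is reported when the new result differs from the stored expected result."""
    return register[instruction_id] != first_execution_result


match_case = processor_has_error("instruction 0", "11", expected_results)
mismatch_case = processor_has_error("instruction 1", "01", expected_results)
```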
Method 2: the error detection module reads the internal detection instruction from the first storage module. The first storage module in the embodiments of the present application is, for example, the first storage module 230 in FIG. 2 or the first storage module 305 in FIG. 3.
The first storage module may pre-store at least one instruction. For example, the at least one instruction may be manually configured in the first storage module, for instance by staff before the processor leaves the factory. Correspondingly, in Method 2, the internal detection instruction is one of the instructions in the first storage module.
The processor may perform read operations on the first storage module. For example, the error detection module in the processor may read one instruction, as the internal detection instruction, from the at least one instruction stored in the first storage module.
For example, the error detection module may randomly read one of the at least one instruction as the internal detection instruction.
As an example, the at least one instruction pre-stored in the first storage module may be instructions for which the probability of an execution error by the processor is greater than or equal to a first probability. In other words, the at least one instruction pre-stored in the first storage module may be instructions that the processor is prone to executing incorrectly. The probability of an execution error may be determined empirically or obtained by testing the processor multiple times.
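The selection in Method 2 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: the value of `FIRST_PROBABILITY`, the instruction names, and both function names are hypothetical.

```python
import random

# Assumed value of the "first probability"; the application gives no number.
FIRST_PROBABILITY = 0.01


def select_error_prone(error_probabilities, first_probability=FIRST_PROBABILITY):
    """Keep instructions whose error probability is >= the first probability."""
    return [
        name
        for name, probability in error_probabilities.items()
        if probability >= first_probability
    ]


def read_internal_detection_instruction(first_storage, rng=random):
    """Randomly read one pre-stored instruction as the internal detection instruction."""
    return rng.choice(first_storage)


# Hypothetical measured error probabilities per instruction.
first_storage = select_error_prone({"fdiv": 0.02, "add": 0.001, "fmul": 0.01})
```

Passing a seeded `random.Random` instance as `rng` makes the random read reproducible in tests.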
In a possible implementation, the first storage module may also store the expected result corresponding to the at least one instruction. For example, when configuring the at least one instruction in the first storage module, staff may also manually configure the corresponding expected result in the first storage module. The expected result corresponding to the at least one instruction may be obtained by having another processor execute each of the at least one instruction, where the other processor is different from the processor.
Alternatively, the identifier of the at least one instruction and the corresponding expected result may be pre-stored in a register, where the register is as described above. For example, staff may manually configure the expected result corresponding to the at least one instruction in the register. For how the expected result is obtained, refer to the content discussed above.
Method 3: the error detection module reads the internal detection instruction from the second storage module. The second storage module in the embodiments of the present application is, for example, the second storage module 240 in FIG. 2 or the second storage module 306 in FIG. 3.
The processor can perform both write operations and read operations on the second storage module.
For example, the at least one instruction in the second storage module may be written by the processor. For example, an external device may access the processor and write instructions into the second storage module through the processor. Correspondingly, in Method 3, the internal detection instruction is one of the instructions in the second storage module. The error detection module in the processor can then read one instruction from the second storage module as the internal detection instruction.
In Method 3, the second storage module supports external writes, which makes it easier to later add new instructions for detecting processor errors.
In a possible implementation, the second storage module may also store the expected result corresponding to the at least one instruction. For example, when writing the at least one instruction into the second storage module, the external device may also write the corresponding expected result into the second storage module. The expected result corresponding to the at least one instruction may be obtained through execution by another processor.
Alternatively, the identifier of the at least one instruction and the corresponding expected result are pre-stored in a register, where the register is as described above. For example, the external device may write the expected result corresponding to the at least one instruction into the register. For how the expected result is obtained, refer to the content discussed above.
If the instruction execution module only needs to execute instructions from the error detection module, the error detection module need not perform S801. That is, S801 is an optional step, shown with a dashed line in FIG. 8.
S802. The error detection module sends the internal detection instruction to the instruction execution module. Correspondingly, the instruction execution module receives the internal detection instruction from the error detection module.
When the processor in the embodiment of FIG. 8 is a multi-core processor, the error detection module sends the internal detection instruction to the instruction execution module corresponding to one processor core (for example, a first processor core) of the multi-core processor. The first processor core is, for example, the first processor core 310 shown in FIG. 3.
For example, the error detection module sends the instruction for detecting processor errors to the instruction execution module once every first period. The duration of the first period may be preconfigured in the error detection module and is, for example, 5 hours.
As another example, the error detection module sends the instruction for detecting processor errors to the instruction execution module at irregular intervals.
As another example, the error detection module sends the internal detection instruction to the instruction execution module when the load of the processor is small.
For example, the error detection module may use the remaining space of a buffer queue in the processor to represent the load of the processor. Sending the internal detection instruction only when the load is small prevents the error detection process from occupying processor resources when the processor is heavily loaded, which helps the processor smoothly execute instructions from the external storage module.
Specifically, if the remaining space of the buffer queue in the processor is greater than or equal to a threshold, the error detection module determines that the load of the processor is small; if the remaining space of the buffer queue is less than the threshold, the error detection module determines that the load of the processor is large. The threshold may be preconfigured in the error detection module and is, for example, 1M.
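The load determination above amounts to a single threshold comparison, sketched below. The sketch assumes that "1M" means 1 MiB of remaining buffer space, which the application does not state explicitly; the names `THRESHOLD_BYTES` and `load_is_small` are hypothetical.

```python
# Assumption: the example threshold "1M" is read as 1 MiB of remaining space.
THRESHOLD_BYTES = 1 * 1024 * 1024


def load_is_small(remaining_space, threshold=THRESHOLD_BYTES):
    """Return True if the buffer queue's remaining space is >= the threshold.

    Remaining space below the threshold means the load is large.
    """
    return remaining_space >= threshold
```

Note that the boundary case (remaining space exactly equal to the threshold) counts as a small load, matching the "greater than or equal to" wording.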
As an example, if the processor in the embodiment of FIG. 8 is a multi-core processor, the error detection module may send the internal detection instruction to the instruction execution module corresponding to the first processor core when the load of the first processor core is small. For how to determine that the load of the first processor core is small, refer to the foregoing description of determining that the load of the processor is small.
When the processor in FIG. 8 is applied to a computing device, the management module includes a detection switch so that the user can flexibly control the processor error detection process. The detection switch indicates whether error detection is to be performed on the processor. The management module determines, according to the user's operation, whether the detection switch is on or off. In a possible embodiment, the error detection module may also determine that the detection switch in the management module is on before sending the internal detection instruction to the instruction execution module. The detection switch being on indicates that error detection is performed on the processor; the detection switch being off indicates that error detection is not performed on the processor.
The management module includes one or more of the operating system, the baseboard management controller, or the firmware system in the computing device. When the management module includes two or more of these, it suffices for the error detection module to determine that the detection switch of any one of the operating system, the baseboard management controller, or the firmware system is on; this is equivalent to determining that the detection switch in the management module is on.
For example, if the error detection module determines that the detection switch is on and that the load of the processor is small, it sends the internal detection instruction to the instruction execution module. Alternatively, if the error detection module determines that the detection switch is on, it sends the internal detection instruction to the instruction execution module.
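The two sending conditions above can be sketched as follows. The component names in the dictionary and both function names are hypothetical; the sketch assumes that one enabled switch among the management components is enough, as described above.

```python
def detection_switch_is_on(component_switches):
    """Return True if the switch of any management component is on.

    component_switches maps a component name (for example operating system,
    baseboard management controller, firmware system) to its switch state.
    """
    return any(component_switches.values())


def should_send_internal_detection(component_switches, load_is_small=None):
    """Decide whether to send the internal detection instruction.

    With load_is_small=None only the switch is checked (second variant);
    otherwise the switch must be on and the load must be small (first variant).
    """
    if not detection_switch_is_on(component_switches):
        return False
    if load_is_small is None:
        return True
    return bool(load_is_small)
```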
For example, refer to FIG. 9, which is a schematic diagram of a detection switch of a management module provided by an embodiment of the present application. As shown in FIG. 9, the detection switch in the management module (specifically, the button labeled for detecting processor errors in FIG. 9) is displayed as "√", indicating that the detection switch in the management module is on.
S803. The instruction execution module executes the internal detection instruction and obtains a first execution result.
Because the internal detection instruction is sent to the instruction execution module by the error detection module inside the processor, the internal detection instruction does not need to be decoded. The instruction execution module can execute the internal detection instruction directly and obtain its execution result. For ease of description, the embodiments of the present application refer to the execution result of the internal detection instruction as the first execution result.
Optionally, when the error detection module performs S801 (that is, adds a mark to the internal detection instruction), the error detection module may cache the mark of the internal detection instruction and add the mark to the first execution result.
S804. The error detection module determines the first execution result according to the mark.
For example, when the error detection module performs S801, the error detection module may obtain the execution results of all instructions (for example, including the first execution result and a second execution result) from the instruction execution module. The error detection module can identify the first execution result according to the mark, which is equivalent to the error detection module determining the first execution result.
As another example, the error detection module may send a first request to the instruction execution module. The first request is used to request the execution result corresponding to the internal detection instruction, and may include (or indicate) the mark of the internal detection instruction. The instruction execution module receives the first request and feeds back the first execution result to the error detection module, which is equivalent to the error detection module determining the first execution result.
As another example, the error detection module may obtain the first execution result corresponding to the internal detection instruction from the instruction execution module according to the identifier of the internal detection instruction. For the identifier of the internal detection instruction, refer to the foregoing content. In this case, the error detection module does not need to determine the first execution result according to the mark, so S804 is an optional step.
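The first variant of S804, in which the error detection module receives all execution results and picks out the marked one, can be sketched as follows. The `ExecutionResult` type, its fields, and the mark value are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionResult:
    value: str
    # Mark copied from the internal detection instruction; None is used here
    # for results of external program instructions, which carry no mark.
    mark: Optional[str] = None


def find_first_execution_result(results, mark):
    """Return the result carrying the given mark, or None if there is none."""
    for result in results:
        if result.mark == mark:
            return result
    return None


all_results = [
    ExecutionResult(value="1010"),                    # second execution result
    ExecutionResult(value="11", mark="detect-mark"),  # first execution result
]
```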
S805. The error detection module determines whether the processor has an error based on whether the first execution result matches the expected result of the internal detection instruction.
When the expected result of the internal detection instruction is pre-stored in a register of the processor, the error detection module may read the expected result from the register. Alternatively, when the expected result is pre-stored in the first storage module of the processor, the error detection module may read the expected result from the first storage module. Alternatively, when the expected result is pre-stored in the second storage module of the processor, the error detection module may read the expected result from the second storage module.
The error detection module determines whether the first execution result matches the expected result of the internal detection instruction. If they match, the error detection module determines that the processor has no error; the error detection module may discard the first execution result, and the processor may perform the embodiment shown in FIG. 8 again to carry out error detection on the processor. If they do not match, the error detection module determines that the processor has an error.
For example, the error detection module determines whether the first execution result is the same as the expected result of the internal detection instruction. If the two are the same, the first execution result matches the expected result; if the two are different, the first execution result does not match the expected result.
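S805 can be sketched as a lookup followed by an equality check. The sketch models the register and the two storage modules as dictionaries keyed by instruction identifier; the lookup order and all names are assumptions made for illustration.

```python
def read_expected_result(instruction_id, register, first_storage, second_storage):
    """Read the expected result from whichever location pre-stores it.

    The three locations are modeled as dictionaries; searching them in this
    order is an assumption, not something the application specifies.
    """
    for store in (register, first_storage, second_storage):
        if instruction_id in store:
            return store[instruction_id]
    raise KeyError(instruction_id)


def processor_has_error(first_execution_result, expected_result):
    """Return True when the results do not match, that is, are not the same."""
    return first_execution_result != expected_result
```

For instance, with an expected result of "11" stored in the register, a first execution result of "11" indicates no error, while "10" indicates an error.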
S806. The instruction execution module reads an external program instruction from the external storage module.
The external storage module is, for example, the external storage module 540 shown in FIG. 5. For the meaning of the external program instruction, refer to the content discussed above; details are not repeated here.
S807. The instruction execution module decodes the external program instruction to obtain a decoded result, and executes the decoded result to obtain a second execution result.
S808. The instruction execution module writes the second execution result into the external storage module.
Optionally, the operating system may obtain the second execution result from the external storage module and present it to the user.
As an example, S806 to S808 are optional steps.
For example, refer to FIG. 10, which is a schematic diagram of a process for handling internal detection instructions and external program instructions according to an embodiment of the present application.
As shown in FIG. 10, the path for handling an internal detection instruction is: error detection module → instruction execution module → error detection module. The path for handling an external program instruction is: application → external storage module → instruction execution module → external storage module → application. The path for handling internal detection instructions in the embodiments of the present application therefore differs from the path for handling external program instructions, and is simpler.
To prevent the processor from continuing to produce errors, optionally, a control module in the processor may shut down the processor. Alternatively, when the processor in FIG. 8 is applied to a computing device, the operating system of the computing device may shut down the processor.
When the processor in FIG. 8 is applied to a computing device, to let the user view processor errors, optionally, the error detection module may provide alarm information to the management module in the computing device when it determines that the processor has an error. The alarm information indicates that the processor has an error. The management module may then display the alarm information, so that the user learns of the processor error in time.
In a possible implementation, when the processor in FIG. 8 is a multi-core processor, the internal detection instruction may be executed by the instruction execution module corresponding to one processor core (for example, the first processor core) of the processor. Correspondingly, the error detection module may determine that the first processor core has an error. In this way, the processor can pinpoint exactly which processor core is faulty.
Optionally, when the error detection module determines that the first processor core has an error, the error detection module may generate alarm information and send it to the management module. The alarm information indicates that the processor has an error.
Optionally, when the error detection module determines that the first processor core has an error, the control module corresponding to the first processor core may shut down the first processor core. Alternatively, when the processor in FIG. 8 is applied to a computing device, the operating system of the computing device may shut down the first processor core. In this way, the other processor cores of the processor can continue to work normally.
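The per-core handling above can be sketched as follows. Each core's instruction execution module is modeled as a callable that returns an execution result; this modeling, the core names, and both function names are hypothetical.

```python
def find_faulty_cores(core_executors, instruction, expected_result):
    """Return the cores whose execution result differs from the expected result.

    core_executors maps a core identifier to a callable standing in for that
    core's instruction execution module.
    """
    return [
        core_id
        for core_id, execute in core_executors.items()
        if execute(instruction) != expected_result
    ]


def shut_down_cores(active_cores, faulty_cores):
    """Return the cores that remain active after the faulty ones are shut down."""
    return [core for core in active_cores if core not in faulty_cores]


# Toy example: core1 returns a wrong result for the same instruction.
cores = {"core0": lambda instruction: 5, "core1": lambda instruction: 6}
faulty = find_faulty_cores(cores, ("add", 2, 3), 5)
remaining = shut_down_cores(list(cores), faulty)
```

Shutting down only the faulty core leaves the remaining cores available, which is the benefit described above.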
In the embodiments of the present application, the processor can detect whether it has an error by using instructions inside the processor. Because the processor does not need to decode program instructions or send the execution results of program instructions to an external inspection tool, fewer instructions are generated by the processor, which helps reduce the processing load of the processor. In addition, the processor can perform error detection when its load is small, which reduces the impact of the error detection process on the execution of external program instructions. When the processor is a multi-core processor, the processor can detect which specific processor core has an error, so that the faulty core is accurately identified. When an error occurs in the processor, alarm information can be reported and/or the processor can be shut down, so that the error is handled in time and a larger impact is avoided. Moreover, executing the internal detection instruction does not require the operating system to load an application, which reduces the processing load of the operating system. That is, a computing device can perform processor error detection even without an operating system installed. Because computing devices in cloud scenarios may not have an operating system installed, the processor error detection method in the embodiments of the present application is well suited to cloud scenarios, such as the cloud computing and/or cloud storage mentioned above; a specific example of a cloud scenario is the cloud data center mentioned above.
Refer to FIG. 11, which is a schematic flowchart of a method for detecting processor errors provided by an embodiment of the present application. The embodiment shown in FIG. 11 takes as an example the case where the processor in FIG. 8 is the processor 400 shown in FIG. 4. The processor in the embodiment shown in FIG. 11 includes an error detection module and an instruction execution module, and the error detection module includes an instruction acquisition sub-module, an instruction identification sub-module, and an instruction judgment sub-module.
S1101. The instruction acquisition sub-module adds a mark to the internal detection instruction.
For the mark and how it is added, refer to the content discussed above.
Optionally, the instruction acquisition sub-module may send the mark to the instruction identification sub-module, so that the instruction identification sub-module can later identify, according to the mark, the instruction used for detecting processor errors.
S1102. The instruction acquisition sub-module sends the internal detection instruction to the instruction execution module. Correspondingly, the instruction execution module receives the internal detection instruction from the instruction acquisition sub-module.
For example, the instruction acquisition sub-module may send the internal detection instruction to the instruction execution module when it determines that the detection switch is on and/or that the load of the processor is small.
For the detection switch and how to determine that it is on, refer to the content discussed above.
As an example, the processor in the embodiment shown in FIG. 11 further includes a load judgment sub-module, for example the load judgment sub-module 424 in FIG. 4. When the load judgment sub-module determines that the load of the processor is small, it sends first indication information to the instruction acquisition sub-module. The first indication information indicates that the load of the processor is small. The instruction acquisition sub-module receives the first indication information and performs S1102.
When the load of the processor is large, the load judgment sub-module may send second indication information to the instruction acquisition sub-module. The second indication information indicates that the remaining space in the buffer queue of the processor is less than the threshold. The instruction acquisition sub-module receives the second indication information and does not perform S1102. Alternatively, when the load of the processor is large, the load judgment sub-module does not need to send any indication information to the instruction acquisition sub-module; by default, the instruction acquisition sub-module performs S1102 only when triggered by the first indication information.
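The signaling between the load judgment sub-module and the instruction acquisition sub-module can be sketched as follows. The indication values and function names are hypothetical; the `send_second_indication` flag models the two alternatives described above for the large-load case.

```python
FIRST_INDICATION = "load small"   # hypothetical encoding of the first indication
SECOND_INDICATION = "load large"  # hypothetical encoding of the second indication


def load_judgment(remaining_space, threshold, send_second_indication=True):
    """Return the indication the load judgment sub-module would send, if any."""
    if remaining_space >= threshold:
        return FIRST_INDICATION
    # Under a large load, either send the second indication or send nothing.
    return SECOND_INDICATION if send_second_indication else None


def performs_s1102(indication):
    """The instruction acquisition sub-module acts only on the first indication."""
    return indication == FIRST_INDICATION
```

In both large-load alternatives the outcome is the same: S1102 is not performed.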
S1103. The instruction execution module executes the internal detection instruction and obtains the first execution result.
For the first execution result, refer to the content discussed above.
When the instruction acquisition sub-module performs S1101, the instruction execution module may add the mark to the first execution result.
S1104. The instruction execution module sends the first execution result to the instruction identification sub-module. Correspondingly, the instruction identification sub-module receives the first execution result from the instruction execution module.
S1105. The instruction identification sub-module determines the first execution result according to the mark.
S1106. The instruction identification sub-module sends the first execution result to the instruction judgment sub-module. Correspondingly, the instruction judgment sub-module receives the first execution result from the instruction identification sub-module.
For example, the instruction execution module sends all execution results to the instruction identification sub-module, and the instruction identification sub-module identifies the first execution result according to the mark of the first execution result.
As another example, the instruction execution module may send the first execution result directly to the instruction identification sub-module according to the mark of the first execution result, which is equivalent to the instruction identification sub-module determining the first execution result.
S1107. The instruction judgment sub-module determines whether the processor has an error based on whether the first execution result matches the expected result of the internal detection instruction.
For how the instruction judgment sub-module determines whether the processor has an error, refer to the content discussed above.
Optionally, if the instruction judgment sub-module determines that the processor has an error, it may provide alarm information to the management module. For the management module and the alarm information, refer to the content discussed above.
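Steps S1101 to S1107 can be sketched end to end as follows. This is a toy model under stated assumptions, not the implementation of the application: instructions are modeled as `(operation, a, b)` tuples, the instruction execution module as a table of Python callables named `alu`, and the mark as the string `MARK`. All of these names are hypothetical.

```python
import operator

MARK = "internal-detect"  # hypothetical mark value


def instruction_acquisition(instruction):
    """S1101: add a mark to the internal detection instruction."""
    return {"instruction": instruction, "mark": MARK}


def instruction_execution(marked, alu):
    """S1103: execute the instruction and carry the mark into the result."""
    operation, a, b = marked["instruction"]
    return {"value": alu[operation](a, b), "mark": marked["mark"]}


def instruction_identification(results, mark):
    """S1105: pick out the execution results that carry the mark."""
    return [result for result in results if result.get("mark") == mark]


def instruction_judgment(result, expected_result):
    """S1107: return True if the processor has an error (results differ)."""
    return result["value"] != expected_result


def run_detection(instruction, expected_result, alu, other_results=()):
    marked = instruction_acquisition(instruction)                # S1101, S1102
    first = instruction_execution(marked, alu)                   # S1103
    all_results = list(other_results) + [first]                  # S1104
    identified = instruction_identification(all_results, MARK)   # S1105
    return instruction_judgment(identified[0], expected_result)  # S1106, S1107


healthy_alu = {"add": operator.add}
faulty_alu = {"add": lambda a, b: a + b + 1}  # deliberately wrong result
```

With the expected result 5 for `("add", 2, 3)`, the healthy table yields no error and the faulty table yields an error.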
S1108. The instruction execution module obtains an external program instruction from the external storage module.
For the meaning of the external program instruction, refer to the content discussed above.
S1109. The instruction execution module decodes the external program instruction to obtain a decoded result, and executes the decoded result to obtain a second execution result.
S1110. The instruction execution module sends the second execution result to the external storage module.
As an example, S1108 to S1110 are optional steps.
For example, refer to FIG. 12, which is a schematic diagram of a process for handling internal detection instructions and external program instructions according to an embodiment of the present application.
As shown in FIG. 12, the path for handling an internal detection instruction is: instruction acquisition sub-module → instruction execution module → instruction identification sub-module → error judgment sub-module. The path for handling an external program instruction is: application → external storage module → instruction execution module → application. The path for handling internal detection instructions in the embodiments of the present application therefore differs from the path for handling external program instructions, and is simpler.
The embodiment of the present application is applicable to the processor shown in FIG. 4. In this embodiment, the error detection module may include an instruction acquisition sub-module, a load judgment sub-module, an instruction identification sub-module, and an error judgment sub-module, which provides a solution for detecting processor errors. The instruction acquisition sub-module, load judgment sub-module, instruction identification sub-module, error judgment sub-module, and instruction execution unit in the processor can work together to detect processor errors, without requiring the processor to compile program instructions or send the execution results of program instructions to an external inspection tool, which helps reduce the processing load of the processor. In addition, the instruction acquisition sub-module can send the internal detection instruction to the instruction execution module when the load of the processor is small, which avoids adding to the processing burden when the processor is heavily loaded.
An embodiment of the present application provides a computing device cluster. Figure 13 shows a computing device cluster provided by an embodiment of the present application. The cluster includes at least one computing device 1300, and any two computing devices 1300 communicate with each other through a communication network.
As shown in Figure 13, the computing device 1300 includes a processor 1301 and a power supply circuit 1302. The power supply circuit 1302 supplies power to the processor 1301. The processor 1301 in the computing device 1300 can be used to implement any of the foregoing methods for detecting processor errors, for example, the method in the embodiment shown in Figure 8 or Figure 11, and can also implement the functions of any of the foregoing processors. For the structure of the processor 1301, refer to the structure of the processor in Figure 2, Figure 3, or Figure 4 above.
Optionally, the computing device 1300 further includes a memory 1303 and a communication interface 1304, which are shown as dashed boxes in Figure 13.
The processor 1301 and the communication interface 1304 are coupled to each other. It can be understood that the communication interface 1304 may be a transceiver or an input/output interface.
The memory 1303 may be used to store external program instructions executed by the processor 1301, input data that the processor 1301 needs in order to run external program instructions, or data generated after the processor 1301 runs instructions.
As an example, the computing device cluster shown in Figure 13 can be used to implement the functions of the cloud data center in Figure 6 or Figure 7.
An embodiment of the present application provides a chip system that includes a processor and an interface. The processor is configured to call and run instructions from the interface. When the processor executes the instructions, any of the foregoing methods for detecting processor errors is implemented, for example, the method in the embodiment shown in Figure 8 or Figure 11.
An embodiment of the present application provides a computer-readable storage medium for storing a computer program or instructions. When the computer program or instructions are run, any of the foregoing methods for detecting processor errors is implemented, for example, the method in the embodiment shown in Figure 8 or Figure 11.
An embodiment of the present application provides a computer program product containing instructions. When the computer program product runs on a computer, any of the foregoing methods for detecting processor errors is implemented, for example, the method in the embodiment shown in Figure 8 or Figure 11.
The method steps in the embodiments of the present application can be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a base station or a terminal. The processor and the storage medium may also exist as discrete components in a base station or a terminal.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer programs or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc; or a semiconductor medium, for example, a solid-state drive. The computer-readable storage medium may be a volatile or non-volatile storage medium, or may include both volatile and non-volatile types of storage media.
In the embodiments of the present application, unless otherwise specified or logically conflicting, the terms and descriptions in different embodiments are consistent and may be referenced by one another, and the technical features of different embodiments may be combined according to their inherent logical relationships to form new embodiments.
It can be understood that the various numerical designations used in the embodiments of the present application are only distinctions made for convenience of description and are not intended to limit the scope of the embodiments. The sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic.

Claims (23)

  1. A processor, characterized by comprising an error detection module and an instruction execution module, wherein:
    the error detection module is configured to send an internal detection instruction to the instruction execution module;
    the instruction execution module is configured to execute the internal detection instruction, obtain a first execution result, and send the first execution result to the error detection module; and
    the error detection module is configured to determine whether the processor has an error according to the first execution result and an expected result corresponding to the internal detection instruction.
  2. The processor according to claim 1, characterized in that the instruction execution module is further configured to:
    read an external program instruction from an external storage module;
    decode the external program instruction to obtain a decoded result;
    execute the decoded result to obtain a second execution result; and
    write the second execution result into the external storage module.
  3. The processor according to claim 1 or 2, characterized in that the error detection module is further configured to:
    before sending the internal detection instruction to the instruction execution module, determine that the remaining space in a buffer queue of the processor is greater than or equal to a threshold, the buffer queue being used to buffer instructions to be processed by the processor.
  4. The processor according to any one of claims 1-3, characterized in that:
    the error detection module is further configured to add a mark to the internal detection instruction, the mark indicating an instruction used to detect errors of the processor;
    the instruction execution module is further configured to add the mark to the first execution result; and
    the error detection module is further configured to identify, according to the mark in the first execution result, the first execution result corresponding to the internal detection instruction.
  5. The processor according to any one of claims 1-4, characterized in that the processor further comprises a register, and the register stores the expected result corresponding to the internal detection instruction;
    the error detection module is further configured to read the expected result corresponding to the internal detection instruction from the register.
  6. The processor according to any one of claims 1-5, characterized in that the error detection module is further configured to obtain the internal detection instruction from instructions that the processor has already executed; or,
    the processor further comprises a first storage module storing the internal detection instruction, the first storage module is allowed to be read by the processor, and the error detection module is further configured to read the internal detection instruction from the first storage module; or,
    the processor further comprises a second storage module storing the internal detection instruction, the second storage module is allowed to be read and written by the processor, and the error detection module is further configured to read the internal detection instruction from the second storage module.
  7. The processor according to any one of claims 1-6, characterized in that the processor further comprises at least one processor core, and one processor core of the at least one processor core corresponds to the instruction execution module; the error detection module is specifically configured to:
    determine, according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core corresponding to the instruction execution module has an error.
  8. The processor according to any one of claims 1-7, characterized in that the error detection module is further configured to:
    determine that a detection switch in a management module is in an on state, the detection switch indicating whether to detect errors of the processor, and the management module comprising one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
  9. The processor according to any one of claims 1-8, characterized in that the error detection module is further configured to provide alarm information to a management module when determining that the processor has an error, the management module comprising one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information indicating that the processor has an error; and/or,
    the processor further comprises the control module, and the control module is configured to shut down the processor when the error detection module determines that the processor has an error.
  10. A method for detecting processor errors, characterized by comprising:
    executing, by a processor, an internal detection instruction to obtain a first execution result; and
    determining, by the processor, whether the processor has an error according to the first execution result and an expected result corresponding to the internal detection instruction.
  11. The method according to claim 10, characterized in that the method further comprises:
    reading, by the processor, an external program instruction from an external storage module;
    decoding, by the processor, the external program instruction to obtain a decoded result;
    executing, by the processor, the decoded result to obtain a second execution result; and
    writing, by the processor, the second execution result into the external storage module.
  12. The method according to claim 10 or 11, characterized in that the method further comprises:
    determining, by the processor, that the remaining space in a buffer queue is greater than or equal to a threshold, the buffer queue being used to buffer instructions to be processed by the processor.
  13. The method according to any one of claims 10-12, characterized in that the method further comprises:
    adding, by the processor, a mark to the internal detection instruction, the mark indicating an instruction used to detect errors of the processor;
    adding, by the processor, the mark to the first execution result; and
    identifying, by the processor according to the mark in the first execution result, the first execution result corresponding to the internal detection instruction.
  14. The method according to any one of claims 10-13, characterized in that the processor comprises a register, and the register stores the expected result corresponding to the internal detection instruction; the method further comprises:
    reading, by the processor, the expected result corresponding to the internal detection instruction from the register.
  15. The method according to any one of claims 10-14, characterized in that the method further comprises:
    obtaining, by the processor, the internal detection instruction from instructions that the processor has already executed; or,
    the processor further comprises a first storage module storing the internal detection instruction, the first storage module is allowed to be read by the processor, and the processor reads the internal detection instruction from the first storage module; or,
    the processor further comprises a second storage module storing the internal detection instruction, the second storage module is allowed to be read and written by the processor, and the processor reads the internal detection instruction from the second storage module.
  16. The method according to any one of claims 10-15, characterized in that the processor comprises at least one processor core; and the determining, by the processor, whether the processor has an error according to the first execution result and the expected result corresponding to the internal detection instruction comprises:
    determining, by the processor according to the first execution result and the expected result corresponding to the internal detection instruction, whether the processor core used to obtain the first execution result has an error.
  17. The method according to any one of claims 10-16, characterized in that the method further comprises:
    determining, by the processor, that a detection switch in a management module is in an on state, the detection switch indicating whether to detect errors of the processor, and the management module comprising one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system.
  18. The method according to any one of claims 10-17, characterized in that the method further comprises:
    providing, by the processor, alarm information to a management module when determining that the processor has an error, the management module comprising one or more of a baseboard management controller, a firmware system corresponding to the processor, or an operating system, and the alarm information indicating that the processor has an error; and/or,
    shutting down the processor, by the processor, when determining that the processor has an error.
  19. A computing device, characterized by comprising the processor according to any one of claims 1-9.
  20. A computing device, characterized by comprising a processor and a power supply circuit, wherein the power supply circuit supplies power to the processor, and the processor is configured to perform the method according to any one of claims 10-18.
  21. A computing device cluster, characterized by comprising at least one computing device, each computing device performing the method according to any one of claims 10-18.
  22. A computer program product containing instructions, characterized in that, when the instructions are run by a computing device, the computing device is caused to perform the method according to any one of claims 10-18.
  23. A computer-readable storage medium, characterized in that the storage medium stores a computer program or instructions, and when the computer program or instructions are executed, the method according to any one of claims 10-18 is implemented.
PCT/CN2023/098504 2022-09-05 2023-06-06 Processor and processor error detection method WO2024051231A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211080632.4 2022-09-05
CN202211080632.4A CN117687848A (en) 2022-09-05 2022-09-05 Processor and method for detecting processor errors

Publications (1)

Publication Number Publication Date
WO2024051231A1 true WO2024051231A1 (en) 2024-03-14

Family

ID=90127038

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098504 WO2024051231A1 (en) 2022-09-05 2023-06-06 Processor and processor error detection method

Country Status (2)

Country Link
CN (1) CN117687848A (en)
WO (1) WO2024051231A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5172378A (en) * 1989-05-09 1992-12-15 Hitachi, Ltd. Error detection method and apparatus for processor having main storage
US20080082285A1 (en) * 2006-09-29 2008-04-03 Samaan Samie B Method and system to self-test single and multi-core CPU systems
US20100262879A1 (en) * 2009-04-14 2010-10-14 International Business Machines Corporation Internally Controlling and Enhancing Logic Built-In Self Test in a Multiple Core Microprocessor
CN107451019A (en) * 2016-04-11 2017-12-08 Arm Limited Self-test in processor cores

Also Published As

Publication number Publication date
CN117687848A (en) 2024-03-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861934

Country of ref document: EP

Kind code of ref document: A1