CN109240481B

CN109240481B - Multi-core microprocessor and power saving method using same

Info

Publication number: CN109240481B
Application number: CN201810985884.9A
Authority: CN
Inventors: G·葛兰·亨利; 泰瑞·派克斯; 布兰特·比恩; 史蒂芬·嘉斯金斯
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2013-08-28
Filing date: 2014-08-28
Publication date: 2020-08-11
Anticipated expiration: 2034-08-28
Also published as: CN104360727B; CN104360727A; CN109240481A

Abstract

The invention provides a multi-core microprocessor and a power saving method using the same. The microprocessor includes a plurality of processing cores, a cache memory shared by the cores, and a control unit that puts each core to sleep by stopping a clock signal to that core. Each processing core executes a sleep instruction and, in response, makes a request to the control unit to put the core to sleep. The control unit puts each processing core to sleep in response to its request and, upon detecting that all of the cores have made their respective requests, wakes up only the last processing core that made the request. The last processing core writes back and invalidates the cache memory, indicates that the cache memory has been invalidated, and makes a request to the control unit to put the last processing core back to sleep. The control unit puts the last processing core back to sleep and keeps the other processing cores asleep throughout the time the last processing core writes back and invalidates the cache memory, indicates the invalidation, and returns to sleep. The invention thereby reduces power consumption.

Description

Multi-core microprocessor and power saving method using same

The present application is a divisional application of the application filed on August 28, 2014, with application number 201410430949.5, entitled "Microprocessor and method for saving power using the same".

Technical Field

The present invention relates to a microprocessor, and more particularly to a single core wake multi-core synchronization mechanism.

Background

The proliferation of multi-core microprocessors is largely due to the performance advantages they provide. It has been made possible primarily by the rapid shrinking of semiconductor device geometries, which has increased transistor density. The presence of multiple cores in a microprocessor has created a need for one core to communicate with the other cores in order to perform various functions, such as power management, cache management, debugging, and configuration, that involve more than one core.

Traditionally, architectural programs (e.g., operating systems or applications) running on a multi-core processor have communicated using semaphores located in a system memory that is architecturally addressable by all of the cores. This may be sufficient for many purposes, but it may not provide the speed, precision, and/or system-level transparency required for others.

Disclosure of Invention

The invention provides a microprocessor. The microprocessor includes a plurality of processing cores, a cache memory shared by the plurality of processing cores, and a control unit configured to put each processing core into a sleep state individually by stopping a clock signal to that processing core. Each processing core is configured to execute a sleep instruction and, in response, to make a request to the control unit to put the processing core into the sleep state. The control unit is configured to put each processing core into the sleep state in response to its request, to detect when all of the plurality of processing cores have made their respective requests, and in response to wake up only the last processing core of the plurality of processing cores to make the request. The last processing core is configured to write back and invalidate the cache memory, indicate that the cache memory has been invalidated, and make a request to the control unit to return the last processing core to the sleep state. The control unit is configured to return the last processing core to the sleep state and to keep the other processing cores asleep continuously while the last processing core writes back and invalidates the cache memory, indicates that the cache memory has been invalidated, and returns to the sleep state.

The present invention also provides a method for saving power in a microprocessor having a plurality of processing cores, a cache memory shared by the plurality of processing cores, and a control unit configured to put each processing core into a sleep state individually by stopping a clock signal to that processing core. The method includes executing, by each processing core, a sleep instruction and, in response, making a request to the control unit to put the processing core into the sleep state. The method also includes putting, by the control unit, each processing core into the sleep state in response to its request, detecting when all of the plurality of processing cores have made their respective requests, and in response waking up only the last processing core of the plurality of processing cores to make the request. The method also includes writing back and invalidating the cache memory by the last processing core, indicating that the cache memory has been invalidated, and making a request to the control unit to return the last processing core to the sleep state. The method further includes returning the last processing core to the sleep state and keeping the other processing cores asleep continuously while the last processing core writes back and invalidates the cache memory, indicates that the cache memory has been invalidated, and returns to the sleep state.

The invention further provides a method for saving power in a microprocessor having a plurality of processing cores and a control unit. The method comprises: (a) in a first instance, putting all of the plurality of processing cores into a sleep state and blocking wake events for all of the plurality of processing cores except one processing core, wherein putting all of the plurality of processing cores into the sleep state comprises stopping the supply of a clock signal and power to all of the plurality of processing cores; (b) in response to detecting a wake event, waking up the one processing core to process the wake event; (c) unblocking wake events for all of the processing cores other than the one processing core; (d) in a second instance, returning the one processing core to the sleep state and blocking wake events for all of the plurality of processing cores other than the one processing core; and (e) after returning the one processing core to the sleep state, keeping the other processing cores in the sleep state until a wake event is directed to one or more of the other processing cores; wherein steps (a) through (e) are performed by the control unit of the microprocessor.

Drawings

FIG. 1 is a block diagram illustrating a multi-core microprocessor.

FIG. 2 is a block diagram showing a control word, a status word, and a configuration word.

FIG. 3 is a flow chart showing the operation of a control unit.

FIG. 4 is a block diagram of a microprocessor according to another embodiment.

FIG. 5 is a flowchart illustrating operation of a microprocessor to dump debug information.

FIG. 6 is a timing diagram illustrating an exemplary operation of the microprocessor according to the flowchart of FIG. 5.

FIGS. 7A-7B are flow diagrams illustrating a microprocessor performing cross-core cache control operations.

FIG. 8 is a timing diagram illustrating an example of the operation of the microprocessor according to the flowchart of FIGS. 7A-7B.

FIG. 9 is a flowchart illustrating operation of the microprocessor to enter a low power package C-state.

FIG. 10 is a timing diagram illustrating an example of the operation of a microprocessor according to the flowchart of FIG. 9.

FIG. 11 is a flowchart illustrating operation of a microprocessor to enter a low power package C-state according to another embodiment of the present invention.

FIG. 12 is a timing diagram illustrating an example of the operation of the microprocessor according to the flowchart of FIG. 11.

FIG. 13 is a timing diagram illustrating another example of operation of the microprocessor according to the flowchart of FIG. 11.

FIG. 14 is a flow chart showing dynamic reconfiguration of a microprocessor.

FIG. 15 is a flow chart showing dynamic reconfiguration of a microprocessor according to another embodiment.

FIG. 16 is a timing diagram illustrating an example of the operation of the microprocessor according to the flowchart of FIG. 15.

FIG. 17 is a block diagram of the hardware semaphore 118 shown in FIG. 1.

FIG. 18 is a flowchart illustrating the operation of a core 102 when reading the hardware semaphore 118.

FIG. 19 is a flowchart illustrating operations when a core writes a hardware semaphore.

FIG. 20 is a flowchart illustrating operations performed by a microprocessor when using hardware semaphores to perform an operation requiring exclusive ownership of a resource.

FIG. 21 is a timing diagram illustrating an example of an operation in which a core issues a non-sleep synchronization request according to the flowchart of FIG. 3.

FIG. 22 is a flowchart illustrating a process for configuring a microprocessor.

FIG. 23 is a flowchart illustrating a process for configuring a microprocessor according to another embodiment.

FIG. 24 is a block diagram illustrating a multi-core microprocessor according to another embodiment.

FIG. 25 is a block diagram illustrating a microcode patching architecture.

FIGS. 26A-26B are a flowchart illustrating an operation of the microprocessor of FIG. 24 to propagate the microcode patch of FIG. 25 to multiple cores of the microprocessor.

FIG. 27 is a timing diagram illustrating an example of the operation of a microprocessor according to the flowchart of FIGS. 26A-26B.

FIG. 28 is a block diagram illustrating a multi-core microprocessor according to another embodiment.

FIGS. 29A-29B are a flowchart illustrating an operation of the microprocessor of FIG. 28 to propagate a microcode patch to a plurality of cores of the microprocessor according to another embodiment.

FIG. 30 is a flowchart illustrating a method for patching a service processor code in the microprocessor of FIG. 24.

FIG. 31 is a block diagram illustrating a multi-core microprocessor according to another embodiment.

FIG. 32 is a flowchart illustrating operation of the microprocessor of FIG. 31 to propagate an MTRR update to the cores of the microprocessor.

Wherein the symbols in the drawings are briefly described as follows:

100: multi-core microprocessor; 102A, 102B, 102N: core A, core B, core N; 103: uncore; 104: control unit; 106: status register; 108A, 108B, 108C, 108D, 108N: synchronization register; 108E, 108F, 108G, 108H: shadow synchronization register; 114: fuse; 116: private random access memory; 118: hardware semaphore; 119: shared cache memory; 122A, 122B, 122N: clock signal; 124A, 124B, 124N: interrupt signal; 126A, 126B, 126N: data signal; 128A, 128B, 128N: power control signal; 202: control word; 204: wake event; 206: synchronization control; 208: power gate; 212: sleep; 214: selective wake; 222: S; 224: C; 226: synchronization condition or C-state; 228: core set; 232: force synchronization; 234: selective synchronization kill; 236: disable core; 242: status word; 244: wake event; 246: lowest common C-state; 248: error code; 252: configuration word; 254-0 to 254-7: enable; 256: local core number; 258: crystal number; 302, 304, 305, 306, 312, 314, 316, 318, 322, 326, 328, 332, 334, 336: steps; 402A, 402B: inter-crystal bus unit A, inter-crystal bus unit B; 404: inter-crystal bus; 406A, 406B: crystal A, crystal B; 502, 504, 505, 508, 514, 516, 518, 524, 526, 528, 532: steps; 702, 704, 706, 708, 714, 716, 717, 718, 724, 726, 727, 728, 744, 746, 747, 748, 749, 752: steps; 902, 904, 906, 907, 908, 909, 914, 916, 919, 921, 924: steps; 1102, 1104, 1106, 1108, 1109, 1121, 1124, 1132, 1134, 1136, 1137: steps; 1402, 1404, 1406, 1408, 1412, 1414, 1416, 1417, 1418, 1422, 1424, 1426: steps; 1502, 1504, 1506, 1508, 1517, 1518, 1522, 1524, 1526, 1532: steps; 1702: owned bit; 1704: owner bit; 1706: state machine; 1802, 1804, 1806, 1808: steps; 1902, 1904, 1906, 1908, 1912, 1914, 1916, 1918: steps; 2002, 2004, 2006, 2008: steps; 2202, 2203, 2204, 2205, 2206, 2208, 2212, 2214, 2216, 2218, 2222, 2224: steps; 2302, 2304, 2305, 2306, 2312, 2315, 2318, 2324: steps; 2404: core microcode read-only memory; 2408: uncore microcode patch random access memory; 2423: service processing unit; 2425: uncore microcode read-only memory; 2439: patch content-addressable memory; 2497: service processing unit start address register; 2499: core random access memory; 2500: microcode patch; 2502: header; 2504: immediate patch; 2506: checksum; 2508: CAM data; 2512: core PRAM patch; 2514: checksum; 2516: RAM patch; 2518: uncore PRAM patch; 2522: checksum; 2602, 2604, 2606, 2608, 2611, 2612, 2614, 2616, 2618, 2621, 2622, 2624, 2626, 2628, 2631, 2632, 2634, 2652: steps; 2808: core patch RAM; 2912, 2916, 2922, 2932: steps; 3002, 3004, 3006: steps; 3102: memory type range register; 3202, 3204, 3206, 3208, 3211, 3212, 3214, 3216, 3218, 3252: steps.

Detailed Description

The following is a description of the preferred embodiments of the present invention. The examples are provided to illustrate the principles of the present invention and are not intended to limit the invention. The scope of the invention is to be determined by the following claims.

Referring to FIG. 1, a block diagram of a multi-core microprocessor 100 is shown. Microprocessor 100 includes a plurality of processing cores, designated 102A, 102B through 102N, which are referred to collectively as processing cores 102, or simply cores 102, and individually as processing core 102, or simply core 102. Preferably, each core 102 includes one or more pipelines of functional units (not shown), including an instruction cache, an instruction translation unit or instruction decoder (preferably including a microcode unit), a register renaming unit, reservation stations, a cache, execution units, a memory subsystem, and a retire unit including a reorder buffer. Preferably, the cores 102 have a superscalar, out-of-order execution microarchitecture. In one embodiment, microprocessor 100 is an x86 architecture microprocessor, although in other embodiments microprocessor 100 conforms to another instruction set architecture.

Microprocessor 100 also includes an uncore 103 that is coupled to, and distinct from, the cores 102. The uncore 103 includes a control unit 104, fuses 114, a private random access memory (PRAM) 116, and a shared cache memory 119, e.g., a level-2 (L2) and/or level-3 (L3) cache shared by the cores 102. Each core 102 is configured to read data from and write data to the uncore 103 via a respective address/data bus 126, which provides a non-architectural address space (also referred to as a private or micro-architectural address space) for accessing the shared resources of the uncore 103. The PRAM 116 is private, or non-architectural, in the sense that it is not in the architectural user program address space of microprocessor 100. In one embodiment, the uncore 103 includes arbitration logic that arbitrates among requests by the cores 102 for access to resources of the uncore 103.

Each fuse 114 is an electronic device that may be blown or unblown. An unblown fuse 114 has low impedance and readily conducts current; a blown fuse 114 has high impedance and does not readily conduct current. A sense circuit is associated with each fuse 114 to evaluate the fuse 114, e.g., to sense whether the fuse 114 conducts a high current or low voltage (unblown, e.g., logical zero, or clear) or a low current or high voltage (blown, e.g., logical one, or set). A fuse 114 may be blown during manufacture of microprocessor 100 and, in some embodiments, an unblown fuse 114 may be blown after manufacture of microprocessor 100. Preferably, blowing a fuse 114 is irreversible. One example of a fuse 114 is a polysilicon fuse that is blown by applying a sufficiently high voltage across the device. Another example of a fuse 114 is a nickel-chromium fuse that is blown using a laser. Preferably, the sense circuits sense the fuses 114 at power-up and provide their evaluations to corresponding bits in a holding register of microprocessor 100. When microprocessor 100 is released from reset, the cores 102 (e.g., microcode) read the holding register to determine the sensed values of the fuses 114. In one embodiment, updated values may be scanned into the holding register via a boundary scan input, such as a Joint Test Action Group (JTAG) input, before microprocessor 100 is released from reset, effectively updating the values of the fuses 114. This is useful for testing and/or debugging, particularly in the embodiments described below with respect to FIGS. 22 and 23.
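As a minimal illustrative sketch, the relationship between the sensed fuses, the holding register, and the boundary scan override may be modeled in software as follows. The class name `HoldingRegister` and its methods are hypothetical and chosen only for illustration; the description does not specify any software interface for the holding register.

```python
class HoldingRegister:
    """Hypothetical model of the holding register that latches sensed fuse values."""

    def __init__(self, fuses_blown):
        # At power-up each fuse is sensed: blown -> 1 (set), unblown -> 0 (clear).
        self._bits = [1 if blown else 0 for blown in fuses_blown]
        self._in_reset = True

    def scan_in(self, values):
        """Override the sensed values, e.g., via a boundary scan (JTAG) input."""
        if not self._in_reset:
            raise RuntimeError("values may only be scanned in before reset is released")
        if len(values) != len(self._bits):
            raise ValueError("scan data must cover every holding register bit")
        self._bits = [1 if v else 0 for v in values]

    def release_reset(self):
        """Model the microprocessor being released from reset."""
        self._in_reset = False

    def read(self):
        """Return the fuse values as seen by core microcode after reset release."""
        if self._in_reset:
            raise RuntimeError("cores do not run until reset is released")
        return tuple(self._bits)
```

In this model, a test or debug flow would call `scan_in` before `release_reset`, so that microcode reading the register observes the scanned values rather than the physically sensed ones.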

In addition, in one embodiment, microprocessor 100 includes a distinct local Advanced Programmable Interrupt Controller (APIC) (not shown) associated with each core 102. In one embodiment, the local APICs conform architecturally to the description of a local APIC in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, May 2012, by Intel Corporation of Santa Clara, California, particularly in section 10.4. In particular, the local APIC includes an APIC ID and an APIC base register that includes a bootstrap processor (BSP) flag, whose generation and use are described in more detail below, particularly with respect to the embodiments of FIGS. 14-16.

The control unit 104 includes hardware, software, or a combination of hardware and software. The control unit 104 includes a hardware semaphore 118 (described in detail below with reference to FIGS. 17-20), a status register 106, a configuration register 112, and a respective synchronization register 108 associated with each core 102. Preferably, each of these entities of the uncore 103 is addressable by each core 102 at a distinct address in the non-architectural address space, which enables microcode of the cores 102 to read and write them.

Each synchronization register 108 is writable by its respective core 102. The status register 106 is readable by each core 102. The configuration register 112 is readable and indirectly writable (via the disable core bit 236 of FIG. 2, described below) by each core 102. Control unit 104 may also include interrupt logic (not shown) that generates a respective interrupt signal (INTR) 124 to each core 102, by which control unit 104 interrupts the corresponding core 102. The interrupt sources in response to which control unit 104 generates an interrupt signal 124 to a core 102 may include external interrupt sources (e.g., the x86 architecture INTR, SMI, and NMI interrupt sources) and bus events (e.g., assertion or de-assertion of the x86 architecture STPCLK bus signal). In addition, each core 102 may send an inter-core interrupt 124 to each other core 102 by writing to control unit 104. Preferably, unless otherwise indicated, the inter-core interrupts described herein are non-architectural inter-core interrupts requested by microcode of a core 102 via a microinstruction, as distinct from conventional architectural inter-core interrupts requested by system software via an architectural instruction. Finally, control unit 104 may generate an interrupt signal 124 to a core 102 (a synchronization interrupt) when a synchronization condition has occurred, as described below (e.g., see FIG. 21 and block 334 of FIG. 3). Control unit 104 also generates a respective clock signal (CLOCK) 122 to each core 102, which control unit 104 may selectively turn off to effectively put the corresponding core 102 to sleep and turn back on to wake the core 102 up. Control unit 104 also generates a respective power control signal (PWR) 128 to each core 102, which selectively controls whether the corresponding core 102 receives power. Accordingly, control unit 104 may selectively put a core 102 into a deeper sleep state by turning off power to the core 102 via the corresponding power control signal 128, and may turn power to the core 102 back on to wake the core 102 up.

A core 102 may write to its respective synchronization register 108 with the synchronization bit set (see S bit 222 of FIG. 2); such a write is referred to as a synchronization request. More specifically, as described below, in one embodiment the synchronization request requests control unit 104 to put the core 102 to sleep and to wake the core 102 up when a synchronization condition occurs and/or when a specified wake event occurs. A synchronization condition occurs when all of the enabled cores 102 (see enable bits 254 of FIG. 2) of microprocessor 100, or a specified subset of the enabled cores 102 (see core set field 228 of FIG. 2), have written the same synchronization condition to their respective synchronization registers 108 (the synchronization condition is specified by a combination of the C bit 224, the synchronization condition or C-state field 226, and the core set field 228 of FIG. 2, as described in more detail below with respect to the S bit 222). In response to the occurrence of a synchronization condition, control unit 104 simultaneously wakes up all cores 102 that are waiting on the synchronization condition, i.e., that have requested it. In another embodiment described below, a core 102 may request that only the core 102 that last wrote the synchronization request be woken up (see the selective wake bit 214 of FIG. 2). In another embodiment, the synchronization request does not request that the core 102 be put to sleep; rather, it requests control unit 104 to interrupt the core 102 when the synchronization condition occurs, as described in more detail below, particularly with respect to FIGS. 3 and 21.
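As a minimal sketch, the test for the occurrence of a synchronization condition described above may be expressed as follows. The sketch is a software model only; the names `SyncRequest` and `sync_condition_occurred` are hypothetical, and the C bit 224 is assumed clear so that all participating cores must write the same condition value.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class SyncRequest:
    """Hypothetical model of one core's write to its synchronization register 108."""
    s_bit: bool      # S bit 222: a synchronization is requested
    condition: int   # synchronization condition value (field 226, C bit clear)


def sync_condition_occurred(requests: Dict[int, Optional[SyncRequest]],
                            enabled: Set[int],
                            core_set: Set[int]) -> bool:
    """Return True when every enabled core in core_set has written the same condition."""
    participants = enabled & core_set
    if not participants:
        return False
    values = set()
    for core in participants:
        req = requests.get(core)
        if req is None or not req.s_bit:
            return False  # at least one core has not yet requested synchronization
        values.add(req.condition)
    return len(values) == 1
```

Note that a disabled core is excluded from the participants, so it cannot prevent the remaining cores from synchronizing.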

Preferably, when control unit 104 detects that a synchronization condition has occurred (because the last core 102 has written its synchronization request to its synchronization register 108), control unit 104 puts the last-writing core 102 to sleep, e.g., turns off the clock signal 122 to the last-writing core 102, and then simultaneously wakes up all of the cores 102, e.g., turns on the clock signals 122 to all of the cores 102. In this manner, all of the cores 102 are woken up, i.e., have their clock signals 122 turned on, on exactly the same clock cycle. This may be particularly beneficial for certain operations, such as debugging (see the embodiment of FIG. 5), in which it is advantageous for the cores 102 to wake up on precisely the same clock cycle. In one embodiment, uncore 103 includes a single phase-locked loop (PLL) that generates the clock signals 122 provided to the cores 102. In other embodiments, microprocessor 100 includes a plurality of phase-locked loops that generate the clock signals 122 provided to the cores 102.

Control, status and configuration words

Referring to FIG. 2, a block diagram of a control word 202, a status word 242, and a configuration word 252 is shown. A core 102 writes a value of the control word 202 to its synchronization register 108 in control unit 104 of FIG. 1 to make an atomic request to sleep and/or to synchronize with all of the other cores 102, or a specified subset of the other cores 102, of microprocessor 100. A core 102 reads a value of the status word 242 from the status register 106 in control unit 104 to determine the status information described herein. A core 102 reads a value of the configuration word 252 from the configuration register 112 in control unit 104 and uses the value as described below.

The control word 202 includes a wake event field 204, a synchronization control field 206, and a power gate (PG) bit 208. The synchronization control field 206 includes various bits or subfields that control putting the core 102 to sleep and/or synchronizing the core 102 with other cores 102. The synchronization control field 206 includes a sleep bit 212, a selective wake (SEL WAKE) bit 214, an S bit 222, a C bit 224, a synchronization condition or C-state field 226, a core set field 228, a force synchronization bit 232, a selective synchronization kill bit 234, and a disable core bit 236. The status word 242 includes a wake event field 244, a lowest common C-state field 246, and an error code field 248. The configuration word 252 includes an enable bit 254 for each core 102 of microprocessor 100, a local core number field 256, and a crystal number field 258.
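As an illustrative sketch, the fields of the control word 202 may be packed into and unpacked from a single integer as shown below. The description names the fields but does not specify their bit positions or, other than the 4-bit width of field 226 in one embodiment, their widths; every position and width in the sketch, as well as the name `ControlWord`, is therefore an assumption made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical bit positions: the description names the fields but not their layout.
SLEEP_BIT, SEL_WAKE_BIT, S_BIT, C_BIT = 0, 1, 2, 3
COND_SHIFT, COND_MASK = 4, 0xF          # field 226 is 4 bits in one embodiment
CORE_SET_SHIFT, CORE_SET_MASK = 8, 0x3  # assumed 2-bit encoding of field 228
FORCE_SYNC_BIT, SEL_SYNC_KILL_BIT, DISABLE_CORE_BIT, PG_BIT = 10, 11, 12, 13
WAKE_SHIFT, WAKE_MASK = 16, 0xFFFF      # assumed 16 wake event bits (field 204)


@dataclass(frozen=True)
class ControlWord:
    wake_events: int = 0
    sleep: bool = False
    sel_wake: bool = False
    s: bool = False
    c: bool = False
    condition: int = 0
    core_set: int = 0
    force_sync: bool = False
    sel_sync_kill: bool = False
    disable_core: bool = False
    pg: bool = False

    def pack(self) -> int:
        """Encode the fields into one integer suitable for a single atomic write."""
        if self.sleep and self.sel_wake:
            raise ValueError("sleep and selective wake are mutually exclusive")
        if not 0 <= self.condition <= COND_MASK:
            raise ValueError("condition does not fit in field 226")
        if not 0 <= self.core_set <= CORE_SET_MASK:
            raise ValueError("core set does not fit in field 228")
        if not 0 <= self.wake_events <= WAKE_MASK:
            raise ValueError("wake events do not fit in field 204")
        word = ((self.wake_events << WAKE_SHIFT)
                | (self.core_set << CORE_SET_SHIFT)
                | (self.condition << COND_SHIFT))
        flags = ((SLEEP_BIT, self.sleep), (SEL_WAKE_BIT, self.sel_wake),
                 (S_BIT, self.s), (C_BIT, self.c),
                 (FORCE_SYNC_BIT, self.force_sync),
                 (SEL_SYNC_KILL_BIT, self.sel_sync_kill),
                 (DISABLE_CORE_BIT, self.disable_core), (PG_BIT, self.pg))
        for bit, value in flags:
            if value:
                word |= 1 << bit
        return word

    @classmethod
    def unpack(cls, word: int) -> "ControlWord":
        """Decode an integer produced by pack() back into its fields."""
        def flag(bit: int) -> bool:
            return bool((word >> bit) & 1)
        return cls(wake_events=(word >> WAKE_SHIFT) & WAKE_MASK,
                   sleep=flag(SLEEP_BIT), sel_wake=flag(SEL_WAKE_BIT),
                   s=flag(S_BIT), c=flag(C_BIT),
                   condition=(word >> COND_SHIFT) & COND_MASK,
                   core_set=(word >> CORE_SET_SHIFT) & CORE_SET_MASK,
                   force_sync=flag(FORCE_SYNC_BIT),
                   sel_sync_kill=flag(SEL_SYNC_KILL_BIT),
                   disable_core=flag(DISABLE_CORE_BIT), pg=flag(PG_BIT))
```

Packing all fields into one value reflects the point that a core makes its sleep and/or synchronization request with a single write to its synchronization register 108.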

The wake event field 204 of the control word 202 includes a plurality of bits corresponding to different events. If core 102 sets a bit in the wake event field 204, control unit 104 wakes up the core 102 (e.g., turns on the clock signal 122 to the core 102) when the event corresponding to that bit occurs. One wake event occurs when the core 102 has synchronized with all of the other cores specified in the core set field 228. In one embodiment, the core set field 228 may specify all of the cores 102 in microprocessor 100; all of the cores 102 that share a cache memory (e.g., a level-2 (L2) cache and/or level-3 (L3) cache) with the instant core 102, i.e., the core 102 making the request; all of the cores 102 on the same semiconductor crystal as the instant core 102 (see FIG. 4 for an example of a multi-crystal, multi-core embodiment of microprocessor 100); or all of the cores 102 on the other semiconductor crystal from the instant core 102. A set of cores 102 that share a cache memory may be referred to as a slice. Other examples of wake events include, but are not limited to, an x86 INTR, SMI, or NMI, an assertion or de-assertion of STPCLK, and an inter-core interrupt. When a core 102 wakes up, it may read the wake event field 244 in the status word 242 to determine which wake event is active.
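As a minimal sketch, the matching of occurred events against the bits a core has set in the wake event field 204 may be modeled as a bit mask test. The enumeration `WakeEvent` and its bit assignments are hypothetical; the description lists the kinds of events but not their encoding.

```python
from enum import IntFlag
from typing import Tuple


class WakeEvent(IntFlag):
    """Hypothetical bit assignments for the wake event fields 204 and 244."""
    NONE = 0
    SYNC = 1 << 0        # synchronization condition occurred
    INTR = 1 << 1
    SMI = 1 << 2
    NMI = 1 << 3
    STPCLK = 1 << 4      # assertion or de-assertion of STPCLK
    INTER_CORE = 1 << 5  # inter-core interrupt


def check_wake(requested: WakeEvent, occurred: WakeEvent) -> Tuple[bool, WakeEvent]:
    """Return (wake the core?, active events to report in wake event field 244)."""
    active = requested & occurred
    return bool(active), active
```

For example, a core that requested `WakeEvent.SYNC | WakeEvent.NMI` is woken by an NMI but not by an SMI, and the reported active events contain only the NMI bit.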

If core 102 sets the PG bit 208, control unit 104 turns off power to the core 102 (e.g., via the power control signal 128) after the core 102 enters the sleep state. When control unit 104 subsequently restores power to the core 102, control unit 104 clears the PG bit 208. The use of the PG bit 208 is described in more detail below with respect to FIGS. 11-13.
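As an illustrative sketch, the effect of the PG bit 208 on the clock signal 122 and the power control signal 128 may be modeled as follows. The names `CoreState`, `enter_sleep`, and `wake` are hypothetical, and the model treats each signal as a simple on/off flag.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    """Hypothetical per-core view of the signals driven by control unit 104."""
    clock_on: bool = True   # clock signal 122
    power_on: bool = True   # power control signal 128
    pg_bit: bool = False    # PG bit 208 as written by the core


def enter_sleep(core: CoreState) -> None:
    """Put the core to sleep; remove power as well if the core set the PG bit."""
    core.clock_on = False
    if core.pg_bit:
        core.power_on = False  # deeper sleep state


def wake(core: CoreState) -> None:
    """Wake the core; the PG bit is cleared when power is restored."""
    if not core.power_on:
        core.power_on = True
        core.pg_bit = False
    core.clock_on = True
```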

If core 102 sets the sleep bit 212 or the selective wake bit 214, control unit 104 puts the core 102 to sleep after the core 102 writes the synchronization register 108, using the wake events specified in the wake event field 204. The sleep bit 212 and the selective wake bit 214 are mutually exclusive. They differ in the action control unit 104 takes when a synchronization condition occurs. If a core 102 sets the sleep bit 212, control unit 104 wakes up all of the cores 102 when the synchronization condition occurs. Conversely, if a core 102 sets the selective wake bit 214, control unit 104 wakes up only the core 102 that last wrote the synchronization condition to its synchronization register 108 when the synchronization condition occurs.
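As a minimal sketch, the difference between the sleep bit 212 and the selective wake bit 214 may be expressed as a choice of which cores to wake once the synchronization condition has occurred. The function name `cores_to_wake` is hypothetical, and the sketch assumes that all participating cores chose the same one of the two bits.

```python
from typing import List, Set


def cores_to_wake(write_order: List[int], selective_wake: bool) -> Set[int]:
    """Select the cores to wake when the synchronization condition occurs.

    write_order lists the participating cores in the order in which they wrote
    their synchronization requests; the last entry is the last-writing core.
    """
    if not write_order:
        return set()
    if selective_wake:
        return {write_order[-1]}  # selective wake bit 214: only the last writer
    return set(write_order)       # sleep bit 212: all waiting cores
```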

If core 102 sets neither the sleep bit 212 nor the selective wake bit 214, control unit 104 does not put the core 102 to sleep and therefore does not wake the core 102 up when a synchronization condition occurs. Control unit 104 nevertheless sets the bit of the wake event field 204 that indicates a synchronization condition is active, so that the core 102 can detect that the synchronization condition has occurred. Many of the wake events that may be specified in the wake event field 204 may also be sources of an interrupt that control unit 104 generates to the core 102. However, microcode of the core 102 may mask the interrupt sources if desired. Thus, when the core 102 wakes up, microcode may read the status register 106 to determine whether a synchronization condition, a wake event, or both have occurred.

If core 102 sets the S bit 222, it requests control unit 104 to synchronize on a synchronization condition. The synchronization condition is specified by some combination of the C bit 224, the synchronization condition or C-state field 226, and the core set field 228. If the C bit 224 is set, the C-state field 226 specifies a C-state value; if the C bit 224 is clear, the synchronization condition field 226 specifies a non-C-state synchronization condition. Preferably, the values of the synchronization condition or C-state field 226 comprise a bounded set of non-negative integers. In one embodiment, the synchronization condition or C-state field 226 is 4 bits. When the C bit 224 is clear, a synchronization condition occurs when all of the cores 102 specified in the core set field 228 have written to their respective synchronization registers 108 with the S bit 222 set and the same value of the synchronization condition field 226. In one embodiment, each value of the synchronization condition field 226 corresponds to a unique synchronization condition, such as the various synchronization conditions in the exemplary embodiments described below. When the C bit 224 is set, a synchronization condition occurs when all of the cores 102 specified in the core set field 228 have written to their respective synchronization registers 108 with the S bit 222 set, regardless of whether they have written the same value of the C-state field 226. In this case, control unit 104 posts the lowest written value of the C-state field 226 to the lowest common C-state field 246 of the status register 106, where it can be read by a core 102, e.g., by the primary core 102 at block 908 or by the last-writing, selectively awakened core 102 at block 1108. In one embodiment, if a core 102 specifies a predetermined value (e.g., all bits set) in the synchronization condition field 226, this instructs control unit 104 to match the instant core 102 with any synchronization condition field 226 value specified by the other cores 102.
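As an illustrative sketch, the two ways in which a synchronization condition may occur, depending on the C bit 224, may be modeled as follows. The function name `evaluate_sync` and the constant `MATCH_ANY` are hypothetical; `MATCH_ANY` assumes the 4-bit embodiment of field 226 with the all-bits-set value used as the predetermined match-any value. Writes with mixed C bits are simply reported as not synchronized here.

```python
from typing import Dict, Optional, Set, Tuple

MATCH_ANY = 0xF  # assumed "all bits set" predetermined value in a 4-bit field 226


def evaluate_sync(core_set: Set[int],
                  writes: Dict[int, Tuple[bool, int]]) -> Tuple[bool, Optional[int]]:
    """Evaluate whether a synchronization condition has occurred.

    writes maps a core to (C bit 224, field 226 value) for each core that has
    written its synchronization register 108 with the S bit 222 set.
    Returns (condition occurred, value posted to lowest common C-state field 246).
    """
    if not core_set or any(core not in writes for core in core_set):
        return False, None                 # some core has not requested yet
    c_bits = {writes[core][0] for core in core_set}
    values = [writes[core][1] for core in core_set]
    if c_bits == {True}:
        return True, min(values)           # C set: values need not match
    if c_bits == {False}:
        distinct = {v for v in values if v != MATCH_ANY}
        return len(distinct) <= 1, None    # C clear: values must match
    return False, None                     # mixed C bits never synchronize
```

For example, if two cores write C-state values 4 and 2 with the C bit set, the condition occurs and the value 2 is posted as the lowest common C-state.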

If the core 102 sets the force sync bit 232, the control unit 104 will force all ongoing sync requests to be matched immediately.

Generally, if any core 102 is woken up by a wake event specified in the wake event field 204, control unit 104 kills all pending synchronization requests by clearing the S bits 222 in the synchronization registers 108. However, if the core 102 sets the selective synchronization kill bit 234, control unit 104 kills the pending synchronization request of only the core 102 that is woken up by the wake event (i.e., a wake event other than the occurrence of the synchronization condition).
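As a minimal sketch, the killing of pending synchronization requests upon a wake event may be modeled as follows. The function name `on_wake_event` is hypothetical, and the sketch assumes that it is the selective synchronization kill bit 234 written by the woken core that determines whether only that core's request is killed.

```python
from typing import Dict


def on_wake_event(s_bits: Dict[int, bool],
                  sel_sync_kill: Dict[int, bool],
                  woken: int) -> Dict[int, bool]:
    """Return the S bits 222 after core `woken` is woken by a wake event.

    s_bits maps each core to its pending S bit; sel_sync_kill maps each core to
    the selective synchronization kill bit 234 it wrote (absent means clear).
    """
    if sel_sync_kill.get(woken, False):
        # Kill only the request of the woken core.
        return {core: (s and core != woken) for core, s in s_bits.items()}
    # Default behavior: kill all pending synchronization requests.
    return {core: False for core in s_bits}
```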

If two or more cores 102 request synchronization under different synchronization conditions, the control unit 104 treats this as a deadlock (stall) condition. Two or more cores 102 request synchronization under different synchronization conditions if each writes to its respective synchronization register 108 a value with the S bit 222 set and the C bit 224 clear, but with different values in the synchronization condition field 226. For example, if one core 102 writes to its synchronization register 108 a value with the S bit 222 set, the C bit 224 clear, and a synchronization condition 226 value of 7, and another core 102 writes to its synchronization register 108 a value with the S bit 222 set, the C bit 224 clear, and a synchronization condition 226 value of 9, the control unit 104 recognizes this as a deadlock condition. In addition, if one core 102 writes a value with the C bit 224 clear to its synchronization register 108 and another core 102 writes a value with the C bit 224 set to its synchronization register 108, the control unit 104 recognizes this as a deadlock condition. In response to a deadlock condition, the control unit 104 aborts all pending synchronization requests and wakes all cores 102 that are in a sleep state. The control unit 104 also posts a value in the error code field 248 of the status register 106, which the cores 102 can read to determine the cause of the deadlock and take appropriate action. In one embodiment, the error code 248 indicates the synchronization condition written by each core 102, which allows each core to decide whether to continue with its intended course of action or to defer to another core 102.
For example, if one core 102 writes a synchronization condition to perform a power management operation (e.g., to execute an x86 MWAIT instruction) and another core 102 writes a synchronization condition to perform a cache management operation (e.g., an x86 WBINVD instruction), the core 102 that intended to execute the MWAIT instruction defers to the core 102 executing the WBINVD instruction by cancelling the MWAIT instruction, because MWAIT is an optional operation whereas WBINVD is a mandatory operation. As another example, if one core 102 writes a synchronization condition to perform a debug operation (e.g., dumping debug state) and another core 102 writes a synchronization condition to perform a cache management operation (e.g., a WBINVD instruction), the core 102 that intended to execute WBINVD defers to the core 102 performing the debug dump by saving the WBINVD state, waiting for the debug dump to occur, and then restoring the WBINVD state and executing the WBINVD instruction.
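The two conflict rules described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the circuitry of the control unit 104: the `SyncRequest` type and `is_deadlock` function are hypothetical names, and the sketch assumes that only requests with the S bit 222 set take part in the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncRequest:
    s_bit: bool      # S bit 222: synchronization requested
    c_bit: bool      # C bit 224
    condition: int   # value of field 226


def is_deadlock(requests):
    """Return True if the pending requests conflict under the two rules above."""
    pending = [r for r in requests if r.s_bit]  # assumption: S clear is ignored
    # Rule: one core wrote the C bit clear while another wrote it set.
    if len({r.c_bit for r in pending}) > 1:
        return True
    # Rule: S set and C clear, but different synchronization condition values.
    conditions = {r.condition for r in pending if not r.c_bit}
    return len(conditions) > 1
```

For instance, requests carrying condition values 7 and 9 (both with S set and C clear) are reported as conflicting, while two requests that both carry the value 7 are not.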

The crystal number field 258 is zero in a single-crystal embodiment. In a multi-crystal embodiment (e.g., in fig. 4), the crystal number field 258 indicates the crystal on which the core 102 reading the configuration register 112 resides. For example, in a two-crystal embodiment, the crystals are designated 0 and 1, and the crystal number field 258 has a value of 0 or 1. In one embodiment, for example, the fuses 114 are selectively blown to designate a crystal as either 0 or 1.

The local core number field 256 indicates the core number, local to its crystal, of the core 102 that is reading the configuration register 112. Preferably, although there is a single configuration register 112 shared by all cores 102, the control unit 104 knows which core 102 is reading the configuration register 112 and provides the correct value in the local core number field 256 based on the reader. This allows the microcode of a core 102 to know its local core number among the other cores 102 located in the same crystal. In one embodiment, a multiplexer in the uncore 103 portion of the microprocessor 100 selects the appropriate value to be returned in the local core number field 256 of the configuration word 252 based on the core 102 reading the configuration register 112. In one embodiment, selectively blown fuses 114 operate in conjunction with the multiplexer to return the value of the local core number field 256. Preferably, the value of the local core number field 256 is fixed and independent of which cores 102 in the crystal are enabled, as indicated by the enable bits 254 described below. That is, the value of the local core number field 256 remains fixed even when one or more cores 102 of the crystal are disabled. In addition, the microcode of a core 102 calculates the global core number of the core 102, which is a configuration-related value whose purpose is described in detail below. The global core number indicates the core number of the core 102 global to the microprocessor 100. The core 102 calculates its global core number by using the value of the crystal number field 258.
For example, in an embodiment in which the microprocessor 100 includes eight cores 102 equally divided between two crystals having crystal values of 0 and 1, the local core number field 256 returns a value of 0, 1, 2, or 3 in each crystal; the cores in the crystal having a crystal value of 1 add 4 to the value returned in the local core number field 256 to calculate their global core number.
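The arithmetic in the eight-core example can be written out as a small Python sketch. The function name and the `cores_per_crystal` parameter are assumptions introduced for illustration; the text itself only states that cores in crystal 1 add 4 to their local core number.

```python
CORES_PER_CRYSTAL = 4  # assumption: taken from the eight-core, two-crystal example


def global_core_number(crystal_number, local_core_number,
                       cores_per_crystal=CORES_PER_CRYSTAL):
    """Combine crystal number field 258 and local core number field 256."""
    return crystal_number * cores_per_crystal + local_core_number
```

Under this model, local core 3 in crystal 0 has global core number 3, and local core 0 in crystal 1 has global core number 4.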

Each core 102 of microprocessor 100 has a corresponding enable bit 254 in the configuration word 252 that indicates whether the core 102 is enabled or disabled. In FIG. 2, the enable bits 254 are individually denoted as enable bits 254-x, where x is the global core number of the corresponding core 102. The example of FIG. 2 assumes eight cores 102 in microprocessor 100. In the examples of FIGS. 2 and 4, enable bit 254-0 indicates whether the core 102 with a global core number of 0 (e.g., core A) is enabled, enable bit 254-1 indicates whether the core 102 with a global core number of 1 (e.g., core B) is enabled, enable bit 254-2 indicates whether the core 102 with a global core number of 2 (e.g., core C) is enabled, and so on. Thus, by knowing its global core number, the microcode of a core 102 may determine from the configuration word 252 which cores 102 of microprocessor 100 are disabled and which cores 102 are enabled. Preferably, an enable bit 254 is set if the core 102 is enabled and cleared if the core 102 is disabled. When the microprocessor 100 is reset, hardware automatically populates the enable bits 254. Preferably, the hardware populates the enable bits 254 based on fuses 114 that are selectively blown when the microprocessor 100 is manufactured to indicate whether a given core 102 is enabled or disabled. For example, if a given core 102 is tested and found to be faulty, a fuse 114 may be blown to clear the enable bit 254 of the core 102. In one embodiment, a blown fuse 114 indicates that a core 102 is disabled and prevents clock signals from being provided to the disabled core 102. Each core 102 may write the disable core bit 236 to its synchronization register 108 to clear its enable bit 254, as described in more detail below with respect to FIGS. 14-16.
Preferably, clearing the enable bit 254 does not prevent the core 102 from executing instructions but merely updates the configuration register 112, and the core 102 must set a different bit (not shown) to prevent itself from executing instructions, e.g., to have its power removed and/or its clock signal turned off. For a multi-crystal microprocessor 100 (e.g., FIG. 4), the configuration register 112 includes an enable bit 254 for all of the cores 102 in the microprocessor 100, i.e., not only the cores 102 of the local crystal but also the cores 102 of the remote crystal. Preferably, in a multi-crystal microprocessor 100, when a core 102 writes to its synchronization register 108, the value of the synchronization register 108 is propagated to the corresponding shadow synchronization register 108 of the core 102 in the other crystal (see FIG. 4); if the disable core bit 236 is set, this causes an update of the remote crystal configuration register 112, so that both the local and remote crystal configuration registers 112 have the same value.
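A microcode-style check of the enable bits can be sketched in Python as below. The sketch assumes that enable bit 254-x occupies bit position x of an integer; the patent text does not specify the actual bit layout of the configuration word 252, so this layout and both function names are hypothetical.

```python
def is_core_enabled(enable_bits, global_core_number):
    """True if the enable bit for the given global core number is set."""
    return (enable_bits >> global_core_number) & 1 == 1


def clear_enable_bit(enable_bits, global_core_number):
    """Model the effect of a core writing the disable core bit 236."""
    return enable_bits & ~(1 << global_core_number)
```

Starting from eight enabled cores (`0xFF`), clearing the bit for global core number 2 yields `0xFB`, after which core 2 reads as disabled and core 3 still reads as enabled.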

In one embodiment, configuration register 112 cannot be written directly by a core 102. However, writing to the configuration register 112 by a core 102 will cause the value of the local enable bit 254 to be propagated to configuration registers 112 of other crystals in a multi-crystal microprocessor 100, e.g., as depicted in block 1406 of FIG. 14.

Control unit

Referring to fig. 3, a flow chart depicting the operation of the control unit 104 is shown. Flow begins at block 302. At block 302, a core 102 writes a synchronization request, e.g., a control word 202, to its synchronization register 108, and the synchronization request is received by the control unit 104. In the case of a multi-crystal microprocessor 100 (see, e.g., FIG. 4), when a shadow synchronization register 108 of a control unit 104 receives the propagated synchronization register 108 value transmitted by the other crystal 406, the control unit 104 effectively operates according to FIG. 3, i.e., as if the control unit 104 had received a synchronization request from one of its local cores 102 (block 302), except that the control unit 104 only puts to sleep (e.g., block 314), wakes (at blocks 306, 328, or 336), interrupts (at block 334), or blocks the wake events of (block 326) the cores 102 in its local crystal 406, and only populates its local status register 106 (block 318). Flow proceeds to block 304.

In block 304, the control unit 104 checks the synchronization condition specified in block 302 to determine whether a deadlock (stall) condition has occurred, as described above with respect to FIG. 2. If so, flow proceeds to block 306; otherwise, flow proceeds to decision block 312.

At block 305, the control unit 104 detects the occurrence of a wake event specified in the wake event field 204 of one of the synchronization registers 108 (other than the occurrence of a synchronization condition, which is detected at block 316). As described with respect to block 326 below, the control unit 104 may automatically block wake events. The control unit 104 may detect the occurrence of the wake event as an event that is asynchronous to the writing of a synchronization request in block 302. Flow also proceeds from block 305 to block 306.

At block 306, the control unit 104 populates the status register 106, aborts the pending synchronization requests, and wakes up any sleeping core 102. As described above, waking a sleeping core 102 may include restoring its power. The core 102 may then read the status register 106, and in particular the error code 248, to determine the cause of the deadlock (stall) and handle it according to the relative priority of the conflicting synchronization requests, as described above. In addition, the control unit 104 aborts all pending synchronization requests (e.g., clears the S bit 222 in the synchronization register 108 of each core 102) unless block 306 is reached from block 305 and the selective sync disable bit 234 is set, in which case the control unit 104 aborts only the pending synchronization request of the core 102 that is awakened by the wake event. If block 306 is reached from block 305, the core 102 may read the wake event field 244 to determine the wake event that occurred. In addition, if the wake event is an unmasked interrupt source, control unit 104 generates an interrupt request to core 102 via interrupt signal 124. Flow ends at block 306.

At decision block 312, the control unit 104 determines whether the sleep bit 212 or the selective wake-up bit 214 is set. If so, flow proceeds to block 314; otherwise, flow proceeds to decision block 316.

In block 314, the control unit 104 places the core 102 into a sleep state. As described above, putting a core 102 into a sleep state may include removing its power supply. In an exemplary embodiment, even if the PG bit 208 is set, the control unit 104 does not remove power from the core 102 at block 314 if this is the last core 102 to write (i.e., the write that will cause the synchronization condition to occur) and the selective wake bit 214 is set, because the control unit 104 will immediately wake the last-writing core 102 back up at block 328. In one embodiment, the control unit 104 includes synchronization logic and sleep logic, which are separate from, but in communication with, each other; in addition, each of the synchronization logic and the sleep logic comprises a portion of the synchronization register 108. Advantageously, the write of the synchronization logic portion of the synchronization register 108 and the write of the sleep logic portion of the synchronization register 108 are atomic, i.e., indivisible. That is, if one of the partial writes occurs, both are guaranteed to occur. Preferably, the pipeline of the core 102 stalls, not allowing any further writes to occur, until both portions of the write to the synchronization register 108 are guaranteed to have occurred. An advantage of writing a synchronization request and immediately entering a sleep state is that the core 102 (e.g., its microcode) need not keep running to determine whether the synchronization condition has occurred. This is advantageous because power is saved and other resources, such as bus and/or memory bandwidth, are not consumed.
It is noted that, to enter the sleep state without requesting synchronization with other cores 102 (e.g., blocks 924 and 1124), the core 102 may write a value with the S bit 222 clear and the sleep bit 212 set, referred to herein as a sleep request, into the synchronization register 108; in this case, the control unit 104 wakes the core 102 (e.g., block 306) if an unmasked wake event specified in the wake event field 204 occurs (e.g., block 305), but does not look for the occurrence of a synchronization condition for the core 102 (e.g., block 316). Flow proceeds to decision block 316.

At decision block 316, the control unit 104 determines whether a synchronization condition has occurred. If so, flow proceeds to block 318. As described above, a synchronization condition may only occur when the S bit 222 is set. In one embodiment, the control unit 104 uses the enable bits 254 of FIG. 2, which indicate which cores 102 of the microprocessor 100 are enabled and which cores 102 are disabled. The control unit 104 only looks at enabled cores 102 to determine whether a synchronization condition has occurred. A core 102 may be disabled because it was tested and found to be defective at production time; in that case, a fuse is blown to render the core 102 inoperable and to indicate that the core 102 is disabled. A core 102 may also be disabled by software request (see, e.g., fig. 15). For example, at a user's request, BIOS writes a Model Specific Register (MSR) to request that the core 102 be disabled; in response, the core 102 disables itself (e.g., via the disable core bit 236) and notifies the other cores 102 to read the configuration register 112, from which they determine that the core 102 is disabled. A core 102 may also be disabled via a microcode patch (e.g., see fig. 14), which may be effected by blown fuses 114 and/or loaded from system memory (e.g., a FLASH memory). In addition to determining whether a synchronization condition has occurred, the control unit 104 checks the force sync bit 232. If it is set, flow proceeds to block 318. If the force sync bit 232 is clear and a synchronization condition has not occurred, flow ends at block 316.
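The decision at block 316 can be approximated by the following Python sketch. It assumes, for illustration only, that a synchronization condition occurs when every enabled core has its S bit 222 set and that enable bit 254-x is bit x of an integer mask; the function name is hypothetical, and the matching of synchronization condition values is left out to keep the example short.

```python
def proceeds_to_block_318(s_bits, enable_bits, force_sync=False):
    """Decide whether flow continues from decision block 316 to block 318.

    s_bits: dict mapping global core number -> S bit 222 (True if set).
    enable_bits: integer mask; bit x models enable bit 254-x (assumed layout).
    force_sync: models the force sync bit 232.
    """
    if force_sync:
        return True
    # Disabled cores are ignored when looking for the synchronization condition.
    enabled = [core for core in s_bits if (enable_bits >> core) & 1]
    # Assumption: with no enabled cores, no condition is considered to occur.
    return bool(enabled) and all(s_bits[core] for core in enabled)
```

In this model a disabled core that never writes a request does not hold up the remaining cores, which is the point of consulting the enable bits.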

At block 318, the control unit 104 fills the status register 106. Specifically, if a synchronization event occurs in which all of the cores 102 request synchronization of a C-state, the control unit 104 populates the least frequently used C-state field 246, as described above. Flow proceeds to decision block 322.

At decision block 322, the control unit 104 checks the Selective Wake (SEL WAKE) bit 214. If the bit is set, flow proceeds to block 326; otherwise, flow proceeds to decision block 332.

At block 326, the control unit 104 blocks all wake events of all cores 102 except the immediate core, i.e., the core 102 that was the last to write a synchronization request to its synchronization register 108 at block 302, thus causing the synchronization condition to occur. In one embodiment, the logic of the control unit 104 simply performs a Boolean AND of the wake conditions with a signal that is false if the wake events are to be blocked and true otherwise. The use of blocking all wake events for all cores is described in more detail below, particularly with respect to figs. 11-13. Flow proceeds to block 328.

In block 328, the control unit 104 wakes up only the immediate core 102, but not the other cores that requested the synchronization. In addition, the control unit 104 aborts the pending synchronization request of the immediate core 102 by clearing its S bit 222, but does not abort the pending synchronization requests of the other cores 102, i.e., it leaves the S bits 222 of the other cores 102 set. Advantageously, the immediate core 102 can therefore cause the synchronization condition to occur again when it writes another synchronization request after it is awakened (assuming that the synchronization requests of the other cores 102 have not been aborted), an example of which is described below with respect to figs. 12 and 13. Flow ends at block 328.
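Blocks 326 and 328 can be modeled together in a few lines of Python. This is a hedged sketch, not the control unit's implementation: `selective_wake` and `gated_wake` are hypothetical names, and per-core state is represented as plain lists of booleans.

```python
def selective_wake(s_bits, last_writer):
    """Model blocks 326 and 328 when the selective wake bit 214 is set.

    Returns (new_s_bits, wake_allowed, awake), each a list with one entry per core.
    """
    n = len(s_bits)
    # Block 326: wake events stay unblocked only for the last-writing core.
    wake_allowed = [core == last_writer for core in range(n)]
    # Block 328: clear only the last writer's S bit; other requests stay pending.
    new_s_bits = [pending and core != last_writer
                  for core, pending in enumerate(s_bits)]
    # Block 328: only the last-writing core is awakened.
    awake = [core == last_writer for core in range(n)]
    return new_s_bits, wake_allowed, awake


def gated_wake(wake_condition, wake_allowed):
    """Boolean AND of a raw wake condition with the 'not blocked' signal."""
    return wake_condition and wake_allowed
```

Because the other cores keep their S bits set, a later synchronization request from the awakened core completes the synchronization condition again without the other cores having to rewrite their requests.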

At decision block 332, the control unit 104 checks the sleep bit 212. If the bit is set, flow proceeds to block 336; otherwise, flow proceeds to block 334.

At block 334, the control unit 104 sends an interrupt signal (sync interrupt) to all cores 102. FIG. 21 is a timing diagram illustrating an example of a non-sleep sync request. Each core 102 may read the wake event field 244 and detect that the occurrence of a synchronization condition was the cause of the interrupt. Flow proceeds to block 334 in the case where the core 102 chose not to enter the sleep state when it wrote its synchronization request. While this case does not allow the core 102 to obtain the same benefits as when entering a sleep state (e.g., simultaneous wake-up), it has the potential advantage of allowing the core 102 to continue processing instructions while waiting for the last core 102 to write its synchronization request, in cases where simultaneous wake-up is not required. Flow ends at block 334.

In block 336, the control unit 104 wakes up all cores 102 simultaneously. In one embodiment, the control unit 104 turns on the clock signal 122 to all cores 102 on exactly the same clock cycle. In another embodiment, the control unit 104 turns on the clock signal 122 to the cores 102 in a staggered manner. That is, the control unit 104 introduces a delay of a predetermined number of clock cycles (e.g., on the order of ten or one hundred clock cycles) between turning on the clock signal 122 to each core. However, the staggered turning on of the clock signal 122 is still considered simultaneous in the present invention. Staggering the turn-on of the clock signal 122 is beneficial for reducing the likelihood of a power consumption spike when all of the cores 102 wake up. In yet another embodiment, to reduce the likelihood of a power consumption spike, the control unit 104 turns on the clock signal 122 to all cores 102 on the same clock cycle, but does so in a stuttering, or throttled, manner by initially providing the clock signal 122 at a reduced frequency and ramping the frequency up to the target frequency. In one embodiment, the synchronization request is issued as a result of execution of a microcode instruction of the core 102, and the microcode is designed such that, for at least some synchronization condition values, the microcode location that specifies the synchronization condition value is unique. For example, a sync x request is included in only one place in the microcode, a sync y request is included in only one place in the microcode, and so on. In these cases, simultaneous wake-up is beneficial because all cores 102 wake up at exactly the same place, which may allow microcode designers to write more efficient and defect-free code. Furthermore, simultaneous wake-up may be particularly beneficial for debugging purposes when attempting to reproduce and fix errors that appear only due to multi-core interactions and do not appear when a single core is running. Figs. 5 and 6 show such an example.
In addition, the control unit 104 aborts all ongoing synchronization requests (e.g., clears the S bit 222 in the synchronization register 108 of each core 102). Flow ends at block 336.
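The three clock turn-on variants described for block 336 can be illustrated with the Python sketch below. The function names, the linear shape of the frequency ramp, and the example numbers are all assumptions for illustration; the text only states that a predetermined stagger delay or a reduced initial frequency may be used.

```python
def clock_enable_cycles(num_cores, stagger_cycles=0):
    """Cycle offset at which clock signal 122 is turned on for each core.

    stagger_cycles=0 models the same-cycle embodiment; a positive value
    models the staggered embodiment.
    """
    return [core * stagger_cycles for core in range(num_cores)]


def ramp_frequencies(target_mhz, steps):
    """Evenly spaced frequencies ending at the target (assumed linear ramp)."""
    return [target_mhz * (i + 1) / steps for i in range(steps)]
```

For four cores, a stagger of ten cycles gives turn-on offsets of 0, 10, 20, and 30 cycles, spreading out the inrush of switching activity that a same-cycle wake-up would concentrate.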

An advantage of the embodiments described herein is that they can significantly reduce the amount of microcode in a microprocessor because, rather than looping or performing other checks to synchronize operations between cores, the microcode in each core can simply write a synchronization request, enter a sleep state, and know that when it wakes up all cores are at the same place in the microcode. The microcode usage of the synchronization request mechanism will be described below.

Multi-chip microprocessor

Referring now to FIG. 4, a block diagram illustrating another embodiment of microprocessor 100 is shown. The microprocessor 100 of FIG. 4 is similar in many respects to the microprocessor 100 of FIG. 1 in that it is a multi-core processor and its cores 102 are similar. However, the embodiment of FIG. 4 is a multi-crystal configuration. That is, the microprocessor 100 includes multiple semiconductor crystals 406 mounted in a common package and communicating with one another via an inter-crystal bus 404. The embodiment of FIG. 4 includes two crystals 406, labeled crystal A 406A and crystal B 406B, coupled by the inter-crystal bus 404. In addition, each crystal 406 includes an inter-crystal bus unit 402 that links the respective crystal 406 to the inter-crystal bus 404. Further, each crystal 406 includes a control unit 104 in its uncore 103 coupled to the respective cores 102 and the inter-crystal bus unit 402. In the embodiment of FIG. 4, crystal A 406A includes four cores 102 (core A 102A, core B 102B, core C 102C, and core D 102D), which are coupled to a control unit A 104A that is coupled to an inter-crystal bus unit A 402A; similarly, crystal B 406B includes four cores 102 (core E 102E, core F 102F, core G 102G, and core H 102H), which are coupled to a control unit B 104B that is coupled to an inter-crystal bus unit B 402B. Finally, each control unit 104 includes not only a synchronization register 108 for each core in its own crystal 406, but also a synchronization register 108 for each core in the other crystal 406, the latter being shadow registers as shown in FIG. 4. Thus, each control unit in the embodiment shown in FIG. 4 includes eight synchronization registers 108, denoted 108A, 108B, 108C, 108D, 108E, 108F, 108G, and 108H.
In control unit A 104A, synchronization registers 108E, 108F, 108G, and 108H are shadow registers, and in control unit B 104B, synchronization registers 108A, 108B, 108C, and 108D are shadow registers.

When a core 102 writes a value to its synchronization register 108, the control unit 104 in the crystal 406 of the core 102 writes the value to the corresponding shadow register 108 in the other crystal 406 via the inter-crystal bus unit 402 and the inter-crystal bus 404. In addition, if the disable core bit 236 is set in the value propagated to the shadow synchronization register 108, the control unit 104 also updates the corresponding enable bit 254 in the configuration register 112. In this manner, the occurrence of a synchronization condition, including a cross-crystal synchronization condition, may be detected even in situations where the core configuration of the microprocessor 100 changes dynamically (e.g., FIGS. 14-16). In one embodiment, the inter-crystal bus 404 is a relatively low-speed bus, and the propagation may take on the order of 100 clock cycles, which is a predictable amount; each control unit 104 includes a state machine that takes a predetermined amount of time to detect the occurrence of the synchronization condition and turn on the clock signal to all cores 102 in its respective crystal 406. Preferably, the control unit 104 in the local crystal 406 (i.e., the crystal 406 that includes the writing core 102) is configured to delay updating the local synchronization register 108 until a predetermined amount of time (e.g., the sum of the propagation time and the time the state machine takes to detect the occurrence of the synchronization condition) after it begins writing the value to the other crystal 406 (e.g., after being granted the inter-crystal bus 404). In this manner, the control units 104 in both crystals detect the occurrence of a synchronization condition simultaneously and simultaneously turn on the clock signals to all cores 102 in both crystals 406.
This may be particularly beneficial for debugging purposes when attempting to reproduce and fix errors that appear only due to multi-core interactions and do not appear while a single core is running. Figs. 5 and 6 depict embodiments that may take advantage of this feature.

Debugging operations

The cores 102 of the microprocessor 100 are configured to perform individual debug operations, such as breakpoints on instruction execution and data accesses. Furthermore, the microprocessor 100 is configured to perform cross-core (trans-core) debug operations, i.e., debug operations that involve more than one core 102 of the microprocessor 100.

Referring now to FIG. 5, a flowchart illustrating operation of microprocessor 100 to dump (dump) debug (debug) information is shown. The operation is described from the perspective of a single core, but each core 102 in microprocessor 100 collectively dumps the state of microprocessor 100 according to its described operation. More specifically, FIG. 5 illustrates the operation of one core receiving a request to dump debug information, flow beginning at block 502 and the operation of the other cores 102 beginning at block 532.

In block 502, one of the cores 102 receives a request to dump debug information. Preferably, the debug information includes the state of the core 102 or a subset thereof. Preferably, the debug information is dumped to system memory or to an external bus that can be monitored by a debugging device, such as a logic analyzer. In response to the request, core 102 sends a debug dump message to the other cores 102 and sends an inter-core interrupt signal to the other cores 102. Preferably, core 102 traps to microcode in response to the request to dump debug information (at block 502) or in response to the interrupt signal (at block 532), and remains in microcode, during which time interrupts are disabled (i.e., the microcode does not allow itself to be interrupted), until block 528. In one embodiment, core 102 takes interrupts only when it is in a sleep state and at architectural instruction boundaries. In one embodiment, the various inter-core messages described herein (such as the message at block 502 and the messages at blocks 702, 1502, 2606, and 3206) are transmitted and received via the synchronization condition or C-state field 226 of the control word of the synchronization register 108. In other embodiments, inter-core messages are transmitted and received via the uncore private random access memory 116. Flow proceeds from block 502 to block 504.

In block 532, one of the other cores 102 (i.e., a core 102 other than the core 102 that received the debug dump request in block 502) is interrupted and receives the debug dump message as a result of the inter-core interrupt signal and message sent in block 502. As described above, although the flow in block 532 is described from the perspective of a single core 102, each of the other cores 102 (i.e., each core 102 not in block 502) is interrupted at block 532, receives the message, and performs the steps of blocks 504-528. Flow proceeds from block 532 to block 504.

At block 504, core 102 writes a synchronization request for synchronization case 1 (labeled SYNC 1 in FIG. 5) into its synchronization register 108. Thus, the control unit 104 puts the core 102 into a sleep state. Flow proceeds to block 506.

In block 506, the core 102 is awakened by the control unit 104 when all cores have written SYNC 1. Flow proceeds to block 508.

In block 508, core 102 dumps its state to memory. Flow proceeds to block 514.

At block 514, core 102 writes a SYNC 2, which causes control unit 104 to put core 102 into a sleep state. Flow proceeds to block 516.

In block 516, the core 102 is awakened by the control unit 104 when all cores have written SYNC 2. Flow proceeds to block 518.

At block 518, core 102 saves the memory address to which the debug information was dumped at block 508, sets a flag that persists through a reset, and then resets itself. The reset microcode of core 102 detects the flag and reloads the state of the core from the saved memory address. Flow proceeds to block 524.

At block 524, core 102 writes a SYNC 3, which causes control unit 104 to put core 102 into a sleep state. Flow proceeds to block 526.

In block 526, core 102 is awakened by control unit 104 when all cores have written SYNC 3. Flow proceeds to block 528.

At block 528, the core 102 comes out of reset and begins fetching architectural (e.g., x86) instructions based on the state reloaded at block 518. Flow ends at block 528.

Referring now to FIG. 6, a timing diagram illustrating an exemplary operation of microprocessor 100 according to the flowchart of FIG. 5 is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, microprocessor 100 may include a different number of cores 102. In this timing diagram, events proceed as follows.

Core 0 receives a debug dump request and transmits a debug dump message and interrupt message to core 1 and core 2 (per block 502) in response. Core 0 then writes a SYNC 1 and enters a sleep state (per block 504).

Each of core 1 and core 2 is eventually interrupted from its current task and reads the message (per block 532). In response, each of core 1 and core 2 writes a SYNC 1 and enters a sleep state (per block 504). As shown, the time at which each core writes SYNC 1 may differ, for example, depending on the instruction being executed when the interrupt is asserted.

When all cores have written SYNC 1, the control unit 104 wakes up all cores at the same time (per block 506). Each core then dumps its state to memory (per block 508), writes a SYNC 2, and enters a sleep state (per block 514). The amount of time needed to dump the state may vary; thus, the time at which each core writes SYNC 2 may differ, as shown.

When all cores have written SYNC 2, the control unit 104 wakes up all cores at the same time (per block 516). Each core then resets itself and reloads its state from memory (per block 518), writes SYNC 3 and goes to sleep (per block 524). As shown, the amount of time to reset and reload the state may vary; therefore, the time to write SYNC 3 may be different at each core.

When all cores have written SYNC 3, the control unit 104 wakes up all cores simultaneously (per block 526). Each core then begins fetching architectural instructions at the interrupted point in time (per block 528).
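The three-phase sequence of FIGS. 5 and 6 can be imitated in software with a reusable barrier, as in the Python sketch below. This is only an analogy under stated assumptions: threads stand in for cores, `threading.Barrier` stands in for the control unit 104, and the function and step names are hypothetical. A software barrier orders the phases but does not provide the clock-level simultaneity of the hardware mechanism.

```python
import threading


def run_debug_dump(num_cores=3):
    """Simulate SYNC 1, SYNC 2, and SYNC 3 with one thread per core."""
    barrier = threading.Barrier(num_cores)  # stands in for control unit 104
    log = []
    lock = threading.Lock()

    def record(core, step):
        with lock:
            log.append((core, step))

    def core_flow(core):
        barrier.wait()                 # SYNC 1: blocks 504 and 506
        record(core, "dump")           # block 508
        barrier.wait()                 # SYNC 2: blocks 514 and 516
        record(core, "reset_reload")   # block 518
        barrier.wait()                 # SYNC 3: blocks 524 and 526
        record(core, "fetch")          # block 528

    threads = [threading.Thread(target=core_flow, args=(core,))
               for core in range(num_cores)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

In the returned log, every "dump" entry precedes every "reset_reload" entry, which in turn precedes every "fetch" entry, mirroring the property that no core advances to the next phase until all cores have written the current synchronization request.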

A conventional solution for synchronizing operations between multiple processors is to use a software semaphore. However, the conventional solution has the disadvantage that it cannot provide clock-level synchronization. An advantage of the embodiments described herein is that control unit 104 may turn on clock signal 122 to all cores 102 simultaneously.

With the method described above, an engineer debugging the microprocessor 100 may configure one of the cores 102 to periodically generate checkpoints at which it generates a debug dump request, for example, after a predetermined number of instructions have been executed. While the microprocessor 100 is running, the engineer captures all activity on the external bus of the microprocessor 100 in a log file. The portion of the log near the time at which an error is suspected to have occurred may then be provided to a software simulator that simulates the microprocessor 100 to aid the engineer in debugging. The simulator simulates the execution of instructions by each core 102 and simulates the transactions on the external bus of the microprocessor 100 using the recorded information. In one embodiment, the simulators for all cores 102 are started from reset at the same time. Therefore, it is highly beneficial that all cores 102 of the microprocessor 100 actually come out of reset at the same time (e.g., after SYNC 2). Furthermore, because each core 102 waits to dump its state until all other cores 102 have stopped their current tasks (e.g., after SYNC 1), the dumping of state by one core 102 does not interfere with the code and/or hardware activity of the other cores that is being debugged (e.g., shared memory bus or cache interaction), which may increase the likelihood of reproducing the error and determining its cause. Similarly, because each core 102 waits to begin fetching architectural instructions until all cores 102 have completed reloading their states (e.g., after SYNC 3), the reloading of state by one core 102 does not interfere with the code and/or hardware activity of the other cores that is being debugged, which increases the likelihood of reproducing the error and determining its cause.

These benefits provide further advantages over prior approaches, such as that of U.S. Patent No. 8,370,684, which is incorporated herein by reference in its entirety for all purposes, and which did not enjoy the benefit of having a synchronization request capability available to the cores.

Cache control operations

The cores 102 of the microprocessor 100 are configured to perform independent cache control operations on local caches, i.e., caches not shared by two or more cores 102. Furthermore, the microprocessor 100 is configured to perform cross-core cache control operations, i.e., operations that involve more than one core 102 of the microprocessor 100, for example because they involve the shared cache 119.

FIGS. 7A-7B are flow charts illustrating operations performed by microprocessor 100 to perform cross-core cache control. The embodiment of FIGS. 7A-7B illustrates how microprocessor 100 executes an x86 architecture Write Back and Invalidate Cache (WBINVD) instruction. A WBINVD instruction directs the core 102 executing the instruction to write back all modified lines in the cache memories of the microprocessor 100 to system memory and to invalidate, or flush, the cache memories. The WBINVD instruction also instructs the core 102 to issue special bus cycles to direct any cache memory external to the microprocessor 100 to write back its modified data and invalidate the data. The operations are described in terms of a single core, but each core 102 of the microprocessor 100 operates according to this description so that the cores collectively write back modified cache lines and invalidate the caches of the microprocessor 100. More specifically, FIGS. 7A-7B depict the operation of the core that encounters the WBINVD instruction, whose flow begins at block 702, and of the other cores 102, whose flow begins at block 752.

At block 702, one of the cores 102 encounters a WBINVD instruction. In response, core 102 sends a WBINVD instruction message to the other cores 102 and sends an inter-core interrupt signal to the other cores 102. Preferably, until flow proceeds to block 748/749, core 102 remains in the microcode invoked in response to the WBINVD instruction (at block 702) or in response to the interrupt signal (at block 752), during which time interrupts are disabled (i.e., the microcode does not allow itself to be interrupted). Flow proceeds from block 702 to block 704.

At block 752, one of the other cores 102 (i.e., a core other than the core 102 that encountered the WBINVD instruction at block 702) is interrupted by the inter-core interrupt signal sent at block 702 and receives the WBINVD instruction message. As described above, although flow at block 752 is described from the perspective of a single core 102, each of the other cores 102 (i.e., each core 102 not at block 702) is interrupted at block 752, receives the message, and performs the steps of blocks 704 through 749. Flow proceeds from block 752 to block 704.

At block 704, the core 102 writes a synchronization request for synchronization case 4 (labeled SYNC4 in FIGS. 7A-7B) into its synchronization register 108. Thus, the control unit 104 puts the core 102 into a sleep state. Flow proceeds to block 706.

In block 706, when all cores 102 have written SYNC4, the core 102 is awakened by the control unit 104. Flow proceeds to block 708.

In block 708, the core 102 writes back and invalidates the local cache, e.g., a Level 1 (L1) cache that is not shared by the core 102 with other cores 102. Flow proceeds to block 714.

At block 714, core 102 writes a SYNC 5, which causes control unit 104 to put core 102 into a sleep state. Flow proceeds to block 716.

In block 716, when all cores 102 have written SYNC 5, the cores 102 are awakened by the control unit 104. Flow proceeds to decision block 717.

At decision block 717, the core 102 determines whether it is the core 102 that encountered the WBINVD instruction at block 702 (as opposed to a core 102 that received the WBINVD instruction message at block 752). If so, flow proceeds to block 718; otherwise, flow proceeds to block 724.

At block 718, core 102 writes back and invalidates shared cache 119. In one embodiment, microprocessor 100 includes multiple dies, and a cache memory is shared by multiple cores 102 of a die but not by all cores 102 of microprocessor 100, as described above. In this embodiment, intermediate operations (not shown) similar to those of blocks 717 through 726 are performed, in which one of the cores 102 of the die writes back and invalidates the cache shared within the die while the other core(s) of the die return to a sleep state, similar to block 724, to wait until that cache has been invalidated. Flow proceeds to block 724.

At block 724, core 102 writes a SYNC 6, which causes control unit 104 to put core 102 into a sleep state. Flow proceeds to block 726.

In block 726, when all cores 102 have written SYNC 6, the cores 102 are awakened by the control unit 104. Flow proceeds to decision block 727.

At decision block 727, core 102 determines whether it is the core 102 that encountered the WBINVD instruction at block 702 (as opposed to a core 102 that received the WBINVD instruction message at block 752). If so, flow proceeds to block 728; otherwise, flow proceeds to block 744.

In block 728, the core 102 issues special bus cycles to cause any external cache to write back its modified data and to be invalidated. Flow proceeds to block 744.

At block 744, core 102 writes a SYNC 13, which causes the control unit 104 to put the core 102 into a sleep state. Flow proceeds to block 746.

In block 746, when all cores 102 have written SYNC 13, the cores 102 are awakened by the control unit 104. Flow proceeds to decision block 747.

At decision block 747, core 102 determines whether it is the core 102 that encountered the WBINVD instruction at block 702 (as opposed to a core 102 that received the WBINVD instruction message at block 752). If so, flow proceeds to block 748; otherwise, flow proceeds to block 749.

At block 748, the core 102 completes the WBINVD instruction, which includes retiring the WBINVD instruction and may include relinquishing ownership of a hardware semaphore (see FIG. 20). Flow ends at block 748.

At block 749, the core 102 resumes the task it was executing before it was interrupted at block 752. Flow ends at block 749.
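The flow of blocks 702 through 749 can be illustrated with a small software simulation. The following is only a sketch under stated assumptions: each core is modeled as a Python thread, each synchronization case (SYNC 4, 5, 6 and 13) is modeled as a `threading.Barrier` standing in for the control unit 104 waking all cores once each has written the request, and the names `run_wbinvd`, `initiator` and the action strings are hypothetical and do not appear in the described hardware.

```python
import threading


def run_wbinvd(num_cores=3, initiator=0):
    """Simulate the cross-core WBINVD flow; returns the ordered action log."""
    # One barrier per synchronization case. A barrier releases every waiter
    # together, which stands in for the control unit waking all cores at once.
    syncs = {n: threading.Barrier(num_cores) for n in (4, 5, 6, 13)}
    log = []
    lock = threading.Lock()

    def record(core_id, action):
        with lock:
            log.append((core_id, action))

    def core(core_id):
        syncs[4].wait()                              # blocks 704 and 706
        record(core_id, "flush local cache")         # block 708
        syncs[5].wait()                              # blocks 714 and 716
        if core_id == initiator:
            record(core_id, "flush shared cache")    # block 718
        syncs[6].wait()                              # blocks 724 and 726
        if core_id == initiator:
            record(core_id, "special bus cycle")     # block 728
        syncs[13].wait()                             # blocks 744 and 746
        if core_id == initiator:
            record(core_id, "retire WBINVD")         # block 748
        else:
            record(core_id, "resume task")           # block 749

    threads = [threading.Thread(target=core, args=(i,)) for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this sketch the barriers guarantee that every local cache is flushed before the shared cache, and that the shared cache is flushed before the special bus cycle is issued, regardless of how the threads are scheduled.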

Referring now to FIG. 8, a timing diagram illustrating operation of microprocessor 100 according to the flowchart of FIGS. 7A-7B is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, microprocessor 100 may include a different number of cores 102.

Core 0 encounters a WBINVD instruction and in response passes a WBINVD instruction message, and interrupts core 1 and core 2 (per block 702). Core 0 then writes a SYNC4 and enters the sleep state (per block 704).

Each of core 1 and core 2 is eventually interrupted from its current task and reads the message (per block 752). In response, each of core 1 and core 2 writes a SYNC4 and enters a sleep state (per block 704). As shown, the time to write SYNC4 may be different for each core.

When all cores have written SYNC4, the control unit 104 wakes up all cores simultaneously (per block 706). Each core then writes back and invalidates its local cache (per block 708), writes SYNC 5 and enters a sleep state (per block 714). The amount of time needed to write back and invalidate the cache may differ, and thus the time to write SYNC 5 may be different for each core, as shown.

When all cores have written SYNC 5, the control unit 104 wakes up all cores simultaneously (per block 716). Only the core that encountered the WBINVD instruction writes back and invalidates the shared cache 119 (per block 718), and all cores write SYNC 6 and enter a sleep state (per block 724). Since only one core writes back and invalidates the shared cache 119, the time for each core to write SYNC 6 may be different.

When all cores have written SYNC 6, the control unit 104 wakes up all cores simultaneously (per block 726). Only the core that encountered the WBINVD instruction completes the WBINVD instruction (per block 748), and all other cores resume the processing they were performing prior to the interrupt.

It should be appreciated that although embodiments have been described in which the cache control instruction is an x86 WBINVD instruction, other embodiments are contemplated in which synchronization requests are used to execute other cache control instructions. For example, the microprocessor 100 may perform similar actions to execute an x86 INVD instruction, simply invalidating the caches without writing back the cache data (at blocks 708 and 718). As another example, the cache control instructions may be from an instruction set architecture other than the x86 architecture.

Power management operation

The cores 102 in the microprocessor 100 are configured to perform various power reduction operations such as, but not limited to, halting execution of instructions, requesting the control unit 104 to stop sending clock signals to the core 102, requesting the control unit 104 to remove power from the core 102, writing back and invalidating the local (i.e., non-shared) caches of the core 102, and saving the state of the core 102 to an external memory, such as the dedicated random access memory 116. When a core 102 has performed one or more core-specific power reduction operations, it has entered a "core" C-state (also referred to as a core idle state or core sleep state). In one embodiment, the C-state values may correspond approximately to the well-known Advanced Configuration and Power Interface (ACPI) specification processor states, but may also provide finer granularity. Generally, a core 102 enters a core C-state in response to a request from the operating system. For example, the x86 architecture Monitor Wait (MWAIT) instruction is a power management instruction that provides a hint, i.e., a target C-state, to the core 102 that executes the instruction to allow the microprocessor 100 to enter an optimized state, such as a lower power consumption state. In the case of an MWAIT instruction, the target C-state may be proprietary rather than an ACPI C-state. Core C-state 0 (C0) corresponds to the running state of core 102, and increasing C-state values (e.g., C1, C2, C3, and so on) correspond to progressively less active or less responsive states. A less responsive or less active state is a configuration or operating state that saves more power than a more active or responsive state, or that is for some reason relatively less responsive (e.g., has a longer wake-up latency or is less fully enabled). Examples of possible power-saving operations for a core 102 are stopping execution of instructions, stopping clock signals, reducing voltages, and/or removing power from portions of the core (e.g., functional units and/or local caches) or from the entire core.

Furthermore, the microprocessor 100 is configured to perform cross-core power reduction operations. A cross-core power reduction operation involves or affects multiple cores 102 of the microprocessor 100. For example, the shared cache 119 may be large and consume a relatively large amount of power. Thus, significant power savings may be achieved by removing clock signals and/or power from the shared cache 119. However, in order to remove clock signals and/or power from the shared cache 119 while maintaining data coherency, all cores 102 that share the cache must agree. Embodiments are contemplated in which microprocessor 100 includes other shared power-related resources, such as shared clock and power sources. In one embodiment, microprocessor 100 is coupled to a system chipset that includes a memory controller, peripheral controller and/or power management controller. In other embodiments, one or more of the controllers are integrated into microprocessor 100. System power savings can be achieved by the microprocessor 100 notifying the controllers so that they can take power saving actions. For example, the microprocessor 100 may notify the controllers that its cache memories have been invalidated and shut down, so that the controllers need not snoop them.

In addition to the concept of a core C-state, microprocessor 100 typically has a "package" C-state (also referred to as a package idle state or package sleep state). The package C-state corresponds to the lowest (i.e., highest power consumption) common core C-state of the cores 102 (e.g., see field 246 of FIG. 2 and block 318 of FIG. 3). However, in addition to core-specific power reduction operations, the package C-state involves the microprocessor 100 performing one or more cross-core power reduction operations. Examples of cross-core power-save operations associated with the package C-state include shutting down a phase-locked loop (PLL) that generates clock signals, and flushing the shared cache 119 and stopping its clock and/or power, which allows the memory/peripheral controller to avoid snooping the local and shared caches of microprocessor 100. Other examples are changing the voltage, frequency and/or bus clock ratio, reducing the size of cache memories such as shared cache 119, and running shared cache 119 at half speed.
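The relationship between core C-states and the package C-state can be expressed as a one-line computation. The following is a minimal sketch, assuming C-states are represented as non-negative integers; the function name `package_c_state` is hypothetical and the input values are illustrative only.

```python
def package_c_state(core_c_states):
    """Return the lowest common C-state among the cores' requested C-states.

    The package can go no deeper than its shallowest (lowest-numbered,
    highest power consumption) core, so the result is the minimum.
    """
    if not core_c_states:
        raise ValueError("at least one core C-state is required")
    if any(c < 0 for c in core_c_states):
        raise ValueError("C-states are non-negative integers")
    return min(core_c_states)
```

For example, if three cores request C-states 4, 3 and 2, the package can enter at most C-state 2, because one core has only agreed to the shallower state.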

In many cases, the operating system can only execute instructions on individual cores 102 and can thus put an individual core into a sleep state (e.g., a core C-state), but it has no means of directly putting the microprocessor 100 as a whole into a sleep state (e.g., a package C-state). Advantageously, the cores 102 of the described embodiments of the microprocessor 100 cooperate with each other, with the aid of the control unit 104, to detect when all of the cores 102 have entered a core C-state and cross-core power save operations may be performed.

Referring now to FIG. 9, a flowchart illustrating operation of microprocessor 100 to enter a low power package C-state is shown. The embodiment of FIG. 9 is described using an example in which microprocessor 100 is coupled to a chipset and the operating system uses MWAIT instructions. However, it should be understood that in other embodiments, the operating system employs other power management instructions and the primary core 102 communicates with a controller integrated into the microprocessor 100 using a handshake protocol different from the one described.

This operation is described in terms of a single core, but each core 102 of the microprocessor 100 may encounter an MWAIT instruction and operate in accordance with the present disclosure to collectively bring the microprocessor 100 into an optimal state. Flow begins at block 902.

In block 902, a core 102 encounters an MWAIT instruction specifying a target C-state, denoted Cx in FIG. 9, where x is a non-negative integer value. Flow proceeds to block 904.

At block 904, core 102 writes a synchronization request with the C bit 224 set and a C-state field 226 value of x (labeled SYNC Cx in FIG. 9) to its synchronization register 108. In addition, the synchronization request specifies in its wake event field 204 that the core 102 is to be woken up on all wake events. Thus, control unit 104 causes core 102 to enter a sleep state. Preferably, core 102 writes back and invalidates its local caches before writing SYNC Cx. Flow proceeds to block 906.

In block 906, when all cores 102 have written a SYNC Cx, the cores 102 are awakened by the control unit 104. As described above, the x values written by the cores 102 may differ, and the control unit 104 posts the lowest common C-state value to the lowest common C-state field 246 of the status word 242 of the status register 106 (per block 318). Prior to block 906, while core 102 is in the sleep state, it may be awakened by a wake event, such as an interrupt signal (e.g., blocks 305 and 306). More specifically, there is no guarantee that the operating system will execute an MWAIT instruction on all cores 102, which would allow microprocessor 100 to perform the power-save operations associated with the package C-state, before a wake event (e.g., an interrupt) directed to one of the cores 102 occurs and effectively cancels the MWAIT instruction. However, once core 102 is awakened in block 906, core 102 (in fact, all cores 102) is still executing the microcode invoked by the MWAIT instruction (in block 902) and remains in the microcode, with interrupts disabled (i.e., the microcode does not allow itself to be interrupted), until block 924. In other words, as long as fewer than all of the cores 102 have received an MWAIT instruction to enter a sleep state, individual cores 102 may sleep, but microprocessor 100 as a package does not indicate to the chipset that it is ready to enter a package sleep state. However, once all cores 102 have agreed to enter a package sleep state, which is effectively indicated by the occurrence of the synchronization condition in block 906, primary core 102 is allowed to complete the package sleep state handshake protocol with the chipset (e.g., blocks 908, 909 and 921 below) without being interrupted and without any other core 102 being interrupted. Flow proceeds to decision block 907.

At decision block 907, core 102 determines whether it is the primary core 102 of microprocessor 100. Preferably, a core 102 is the primary core 102 if it was determined to be the bootstrap processor (BSP) at reset time. If the core is the primary core, flow proceeds to block 908; otherwise, flow proceeds to block 914.

At block 908, the primary core 102 writes back and invalidates the shared cache 119, and then communicates with the chipset, which can take appropriate action to reduce power consumption. For example, since the caches remain invalid while the microprocessor 100 is in the package C-state, the memory controller and/or the peripheral controller may avoid snooping the local and shared caches of the microprocessor 100. As another example, the chipset may transmit signals to microprocessor 100 to cause microprocessor 100 to take power saving actions (e.g., assert x86-style STPCLK, SLP, DPSLP, NAP, VRDSLP signals as described below). Preferably, core 102 communicates power management information based on the value of the lowest common C-state field 246. In one embodiment, core 102 issues an I/O read bus cycle to an I/O address that provides the chipset with the relevant power management information, such as the package C-state value. Flow proceeds to block 909.

In block 909, primary core 102 waits for the chipset to assert the STPCLK signal. Preferably, if the STPCLK signal is not asserted after a predetermined number of clock cycles, the control unit 104 detects this and, after terminating its ongoing synchronization request, wakes up all of the cores 102 and indicates the error in the error code field 248. Flow proceeds to block 914.
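The timeout behavior of block 909 can be sketched in software as a bounded wait. This is an illustrative model only: the STPCLK signal is represented as a hypothetical list of per-cycle samples, and the function name `wait_for_stpclk`, the result keys, and the error value `STPCLK_TIMEOUT` are assumptions rather than values defined for the error code field 248.

```python
STPCLK_TIMEOUT = "STPCLK_TIMEOUT"  # hypothetical stand-in for an error code in field 248


def wait_for_stpclk(stpclk_samples, timeout_cycles):
    """Wait up to timeout_cycles for STPCLK to be asserted.

    stpclk_samples[i] is True if STPCLK is asserted in clock cycle i.
    On timeout, the result tells the caller to wake all cores and
    report an error, mirroring the behavior described for block 909.
    """
    for cycle, asserted in enumerate(stpclk_samples[:timeout_cycles]):
        if asserted:
            return {"asserted": True, "cycle": cycle,
                    "error_code": None, "wake_all": False}
    return {"asserted": False, "cycle": None,
            "error_code": STPCLK_TIMEOUT, "wake_all": True}
```

The bounded wait keeps a chipset that never responds from leaving every core asleep indefinitely.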

At block 914, the core 102 writes a SYNC 14. In one embodiment, the synchronization request specifies in its wake event field 204 that the core 102 is not to be woken up by any wake event. Thus, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 916.

In block 916, when all cores 102 have written a SYNC 14, cores 102 are awakened by control unit 104. Flow proceeds to decision block 919.

At decision block 919, core 102 determines whether it is the primary core 102 of microprocessor 100. If so, flow proceeds to block 921; otherwise, flow proceeds to block 924.

At block 921, primary core 102 issues a stop grant cycle on the microprocessor 100 bus to inform the chipset that it may take cross-core (i.e., package-wide) power saving actions associated with microprocessor 100 as a whole, such as avoiding snooping the caches of microprocessor 100, removing the bus clock (e.g., x86-style BCLK) to microprocessor 100, and asserting other signals on the bus (e.g., x86-style SLP, DPSLP, NAP, VRDSLP) to cause microprocessor 100 to remove clocks and/or power from various portions of microprocessor 100. Although the embodiments described herein involve a handshake protocol between the microprocessor 100 and a chipset comprising an I/O read (at block 908), assertion of STPCLK (at block 909), and issuance of a stop grant cycle (at block 921), which is historically associated with x86 architecture-based systems, it should be appreciated that other embodiments are contemplated involving systems based on other instruction set architectures with different protocols, which may also save power, improve performance, and/or reduce complexity. Flow proceeds to block 924.

At block 924, core 102 writes a sleep request (e.g., a request with sleep bit 212 set and S bit 222 clear) to synchronization register 108. Further, the request specifies in its wake event field 204 that the core 102 is to be woken only by the deassertion of STPCLK. Thus, control unit 104 causes core 102 to enter a sleep state. Flow ends at block 924.
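The sequence of blocks 902 through 924 can be traced with a toy model of the control unit 104. This is a single-threaded sketch under stated assumptions: the class `ControlUnit`, its methods, the string labels, and the function `enter_package_sleep` are all hypothetical, and the model ignores wake events that arrive before the synchronization condition occurs.

```python
class ControlUnit:
    """Toy control unit: wakes all cores once every core has written a sync."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sync_waiters = {}  # sync label -> set of cores that wrote it
        self.sleepers = {}      # core id -> wake event the core sleeps on

    def write_sync(self, core, label):
        """Core writes a sync request and sleeps; returns the cores woken."""
        waiters = self.sync_waiters.setdefault(label, set())
        waiters.add(core)
        if len(waiters) == self.num_cores:
            del self.sync_waiters[label]
            return sorted(waiters)   # synchronization condition: wake all
        return []

    def write_sleep(self, core, wake_on):
        """Core writes a sleep request naming its only wake event."""
        self.sleepers[core] = wake_on

    def event(self, name):
        """Deliver a wake event; returns the cores woken by it."""
        woken = sorted(c for c, ev in self.sleepers.items() if ev == name)
        for c in woken:
            del self.sleepers[c]
        return woken


def enter_package_sleep(num_cores=3, primary=0):
    cu = ControlUnit(num_cores)
    trace = []
    woken = []
    for core in range(num_cores):                     # blocks 902 and 904
        woken = cu.write_sync(core, "SYNC Cx")
    trace.append(("wake all", woken))                 # block 906
    trace.append(("I/O read, wait STPCLK", primary))  # blocks 908 and 909
    for core in range(num_cores):                     # block 914
        woken = cu.write_sync(core, "SYNC 14")
    trace.append(("wake all", woken))                 # block 916
    trace.append(("stop grant", primary))             # block 921
    for core in range(num_cores):                     # block 924
        cu.write_sleep(core, "STPCLK deasserted")
    trace.append(("wake on deassert", cu.event("STPCLK deasserted")))
    return trace
```

Only the primary core performs the chipset handshake in this trace, while every core participates in each synchronization step.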

Referring now to FIG. 10, a timing diagram illustrating one embodiment of the operation of microprocessor 100 according to the flowchart of FIG. 9 is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, microprocessor 100 may include a different number of cores 102.

Core 0 encounters an MWAIT instruction specifying C-state 4 (MWAIT C4) (per block 902). Core 0 then writes a SYNC C4 and enters a sleep state (per block 904). Core 1 encounters an MWAIT instruction specifying C-state 3 (MWAIT C3) (per block 902). Core 1 then writes a SYNC C3 and enters a sleep state (per block 904). Core 2 encounters an MWAIT instruction specifying C-state 2 (MWAIT C2) (per block 902). Core 2 then writes a SYNC C2 and enters a sleep state (per block 904). As shown, the time to write SYNC Cx may be different at each core. In fact, one or more cores may not encounter an MWAIT instruction until some other event occurs, such as an interrupt.

When all cores have written SYNC Cx, the control unit 104 wakes up all cores at the same time (per block 906). The primary core then issues an I/O read bus cycle (per block 908) and waits for the assertion of STPCLK (per block 909). All cores write a SYNC 14 and go to sleep (per block 914). Since only the primary core flushes the shared cache 119, issues the I/O read bus cycle and waits for STPCLK to be asserted, the time to write SYNC 14 may be different for each core, as shown. In fact, the primary core may write SYNC 14 on the order of hundreds of microseconds after the other cores.

When all cores have written SYNC 14, control unit 104 wakes up all cores simultaneously (per block 916). Only the primary core issues a stop grant cycle (per block 921). All cores write a sleep request that waits on the deassertion of the STPCLK signal and enter a sleep state (per block 924). Since only the primary core issues the stop grant cycle, the time for each core to write the sleep request may be different, as shown.

When the STPCLK signal is deasserted, the control unit 104 wakes up all cores.

It can be observed from FIG. 10 that core 1 and core 2 may advantageously sleep for a significant period of time while core 0 performs the handshake protocol. It should be noted, however, that the time required to wake up the microprocessor 100 from a package sleep state is generally proportional to the depth of the sleep (i.e., how much power is saved in the sleep state). Thus, where the package sleep state is relatively deep (or even where an individual core 102 sleep state is relatively deep), it may be desirable to further reduce the occurrence of wakeups and/or the time required for wakeups in connection with the handshake protocol. FIG. 11 depicts an embodiment in which the handshake protocol is handled by a single core 102 while the other cores 102 continue to sleep. In addition, according to the embodiment of FIG. 11, further power savings may be achieved by reducing the number of cores 102 that are woken up in response to a wake event.

Referring now to FIG. 11, a flowchart illustrating operation of microprocessor 100 to enter a low power package C-state according to another embodiment of the present invention is shown. The embodiment of FIG. 11 is described using an example in which microprocessor 100 is coupled to a chipset and the operating system uses MWAIT instructions. However, it should be appreciated that in other embodiments, the operating system employs other power management instructions, and the last synchronized core 102 communicates with a controller that is integrated into the microprocessor 100 using a handshake protocol different from the one described.

The embodiment of FIG. 11 is similar in some respects to the embodiment of FIG. 9. However, the embodiment of FIG. 11 is designed to save potentially more power in environments where the operating system requests that microprocessor 100 enter a very low power state and tolerates the associated latency. More specifically, the embodiment of FIG. 11 facilitates removing power from the cores and waking only one of the cores when necessary, such as when handling interrupts. Embodiments are contemplated in which the microprocessor 100 supports both the mode of operation of FIG. 9 and that of FIG. 11. Furthermore, the mode may be configurable, whether at manufacture (e.g., through fuse 114), via software control, and/or automatically by microprocessor 100 depending on the particular C-state specified by the MWAIT instruction. Flow begins at block 1102.

In block 1102, core 102 encounters an MWAIT instruction (MWAIT Cx) specifying a target C-state, which is denoted Cx in FIG. 11. Flow proceeds to block 1104.

At block 1104, core 102 writes a synchronization request with C bit 224 set and a C-state field 226 value of x (labeled SYNC Cx in FIG. 11) into its synchronization register 108. The synchronization request also sets the selective wake (SELWAKE) bit 214 and the PG bit 208. Further, the synchronization request specifies in its wake event field 204 that the core 102 is to be woken on all wake events except the assertion of STPCLK and the deassertion of STPCLK. (Preferably, the synchronization request specifies that core 102 is also not to be woken upon certain other wake events, such as AP startup.) Thus, control unit 104 places core 102 into a sleep state, which includes removing power from core 102 because PG bit 208 is set. In addition, before writing the synchronization request, core 102 writes back and invalidates its local caches and saves its state (preferably to the dedicated random access memory (PRAM) 116). When core 102 is subsequently awakened (e.g., at blocks 1137, 1132, or 1106), core 102 restores its state (e.g., from PRAM 116). As described above, and with particular reference to FIG. 3, when the last core 102 writes a synchronization request with the selective wake bit 214 set, the control unit 104 automatically blocks all wake events for all cores 102 except the last written core 102 (per block 326). Flow proceeds to block 1106.
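The fields of the synchronization request can be summarized as a small record type. The following is a sketch only: the description names the fields but this section does not give their bit positions or encodings, so the record uses plain attributes rather than a packed register layout. The wake event names, the helper function names, and the assumption that the S bit 222 is set in a SYNC Cx request are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical wake event names; the real wake event field 204 encoding is not given here.
WAKE_EVENTS = frozenset({"interrupt", "stpclk_assert", "stpclk_deassert"})


@dataclass(frozen=True)
class SyncRequest:
    wake_events: frozenset   # wake event field 204
    pg: bool = False         # PG bit 208: remove power while asleep
    sleep: bool = False      # sleep bit 212
    selwake: bool = False    # selective wake (SELWAKE) bit 214
    s: bool = False          # S bit 222 (assumed set for sync requests)
    c: bool = False          # C bit 224
    c_state: int = 0         # C-state field 226


def sync_cx_selective(x):
    """SYNC Cx as written at block 1104: wake on all events except STPCLK changes."""
    events = WAKE_EVENTS - {"stpclk_assert", "stpclk_deassert"}
    return SyncRequest(wake_events=events, pg=True, selwake=True,
                       s=True, c=True, c_state=x)


def final_sleep_request():
    """Sleep request as written at block 1124: wake only on STPCLK deassertion."""
    return SyncRequest(wake_events=frozenset({"stpclk_deassert"}),
                       pg=True, sleep=True, s=False)
```

Keeping the two requests as separate constructors makes the difference visible: the first participates in synchronization and excludes STPCLK events, while the second is a plain sleep that waits only for STPCLK to be deasserted.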

In block 1106, when all cores 102 have written a SYNC Cx, the control unit 104 wakes up the last written core 102. As described above, although control unit 104 wakes up the last written core 102 and clears its S bit 222, control unit 104 keeps the S bits 222 of the other cores 102 set. Before block 1106, while core 102 is in a sleep state, it may be awakened by a wake event, such as an interrupt. However, once core 102 wakes up in block 1106, core 102 is still executing the microcode invoked by the MWAIT instruction (block 1102) and remains in the microcode, with interrupts disabled (i.e., the microcode does not allow itself to be interrupted), until block 1124. In other words, as long as fewer than all of the cores 102 have received an MWAIT instruction to enter a sleep state, individual cores 102 may sleep, but microprocessor 100 as a package does not indicate to the chipset that it is ready to enter a package sleep state. However, once all cores 102 have agreed to enter a package sleep state, as indicated by the occurrence of the synchronization condition at block 1106, the core 102 that was awakened at block 1106 (the last written core 102, which caused the synchronization condition to occur) is allowed to complete the package sleep state handshake protocol with the chipset (e.g., blocks 1108, 1109, and 1121 below) without being interrupted and without any other core 102 being interrupted. Flow proceeds to block 1108.

At block 1108, core 102 writes back and invalidates shared cache 119, and then communicates with the chipset, which may take appropriate action to reduce power consumption. Flow proceeds to block 1109.

In block 1109, core 102 waits for the chipset to assert the STPCLK signal. Preferably, if the STPCLK signal is not asserted after a predetermined number of clock cycles, the control unit 104 detects this and, after terminating its ongoing synchronization request, wakes up all of the cores 102 and indicates the error in the error code field 248. Flow proceeds to block 1121.

At block 1121, core 102 issues a stop grant cycle to the chipset on the bus. Flow proceeds to block 1124.

At block 1124, core 102 writes a sleep request, e.g., having sleep bit 212 set and S bit 222 clear and PG bit 208 set, to synchronization register 108. In addition, the synchronization request specifies in its wake event field 204 that the core 102 is only woken up in the wake event that deasserts STPCLK. Thus, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 1132.

At block 1132, control unit 104 detects that STPCLK has been deasserted and wakes up core 102. It should be noted that, before waking up the core 102, the control unit 104 restores power to the core 102. Beneficially, core 102 is the only core that is running at this time, which gives core 102 the opportunity to perform any actions that must be performed while no other cores 102 are running. Flow proceeds to block 1134.

At block 1134, the core 102 writes to a register (not shown) of the control unit 104 to unblock, for each of the other cores 102, the wake events specified in the wake event field 204 of its corresponding synchronization register 108. Flow proceeds to block 1136.

In block 1136, the core 102 handles any pending wake events directed to it. For example, in one embodiment, a system including microprocessor 100 allows both directed interrupts (i.e., interrupts directed to a particular core of microprocessor 100) and non-directed interrupts (i.e., interrupts that may be handled by any core 102 of microprocessor 100, as microprocessor 100 chooses). An example of a non-directed interrupt is what is commonly referred to as a "low priority interrupt". In one embodiment, microprocessor 100 preferably directs non-directed interrupts to the single core 102 that was awakened upon the deassertion of STPCLK at block 1132, since it is already awake and able to handle the interrupt, in the hope that the other cores 102 have no pending wake events and may therefore continue to sleep with power removed. Flow returns to block 1104.
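The routing preference for directed and non-directed interrupts can be sketched as a short function. This is an illustrative model with hypothetical names: each pending interrupt is represented by its target core id, or by `None` for a non-directed interrupt, and `cores_to_wake` returns the additional cores that would need to be woken.

```python
def cores_to_wake(pending_targets, awake_core):
    """Return the sleeping cores that must be woken to handle pending interrupts.

    pending_targets: one entry per pending interrupt, either a core id
    (directed interrupt) or None (non-directed interrupt).
    Non-directed interrupts are steered to the core that is already awake,
    so they never cause another core to be woken.
    """
    handlers = {awake_core if target is None else target
                for target in pending_targets}
    return sorted(handlers - {awake_core})
```

With core 2 awake, two non-directed interrupts wake no other core, whereas an interrupt directed at core 1 requires core 1 to be woken.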

When the wake events are unblocked in block 1134, if the cores 102 other than the core 102 that was woken in block 1132 have no pending wake events directed to them, they advantageously continue to sleep with power removed, per block 1104. However, when the wake events are unblocked in block 1134, if a wake event directed to one of those cores 102 is pending, control unit 104 restores power to that core and wakes it. In this case, a different flow begins at block 1137 in FIG. 11.

In block 1137, after the wake events are unblocked in block 1134, another core 102 (e.g., a core 102 other than the core 102 that unblocked the wake events in block 1134) is woken up. The other core 102 handles any pending wake events directed to it, such as an interrupt. Flow proceeds from block 1137 to block 1104.

Referring now to FIG. 12, a timing diagram illustrating an example of operation of microprocessor 100 according to the flowchart of FIG. 11 is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, microprocessor 100 may include a different number of cores 102.

Core 0 encounters an MWAIT instruction specifying C-state 7 (MWAIT C7) (per block 1102). In this example, C-state 7 allows power to be limited. Core 0 then writes a SYNC C7 with the selective wake bit 214 set (shown as "selective wake" in FIG. 12) and the PG bit 208 set, enters a sleep state, and limits power (per block 1104). Core 1 encounters an MWAIT instruction specifying C-state 7 (per block 1102). Core 1 then writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104). Core 2 encounters an MWAIT instruction specifying C-state 7 (per block 1102). Core 2 then writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104). (However, in the preferred embodiment described with respect to block 314, the last-writing core is not power limited.) As shown, the time at which each core writes its SYNC C7 may differ.

When the last-writing core writes a SYNC C7 with the selective wake bit 214 set, the control unit 104 blocks all wake events for all cores except the last-writing core (per block 326), which in the example of FIG. 12 is core 2. In addition, control unit 104 wakes up only the last-writing core (per block 1106), which saves power because the other cores remain asleep and power limited while core 2 performs the handshake protocol with the chipset. Core 2 then issues an I/O read bus cycle (per block 1108) and waits for the assertion of STPCLK (per block 1109). In response to STPCLK, core 2 issues a stall permission cycle (per block 1121), writes a sleep request with the PG bit 208 set to wait for the deassertion of STPCLK, enters a sleep state, and limits power (per block 1124). The core may sleep and limit power for a relatively long period of time.

When STPCLK is deasserted, control unit 104 wakes up only core 2 (per block 1132). In the example of FIG. 12, the chipset deasserts STPCLK in response to receiving a non-directed interrupt, which it forwards to microprocessor 100. Microprocessor 100 directs the non-directed interrupt to core 2, which saves power because the other cores continue to sleep and limit power. Core 2 unblocks the wake events of the other cores (per block 1134) and services the non-directed interrupt (per block 1136). Core 2 then again writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104).

When core 2 writes SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, the control unit 104 blocks wake events for all cores except core 2, i.e., the last-writing core, since the synchronization requests of the other cores are still pending; that is, the S bits 222 of the other cores were not cleared by the waking of core 2 (per block 326). In addition, control unit 104 wakes only core 2 (per block 1106). Core 2 then issues an I/O read bus cycle (per block 1108) and waits for the assertion of STPCLK (per block 1109). In response to STPCLK, core 2 issues a stall permission cycle (per block 1121), writes a sleep request with the PG bit 208 set to wait for the deassertion of STPCLK, enters a sleep state, and limits power (per block 1124).

When STPCLK is deasserted, control unit 104 wakes up only core 2 (per block 1132). In the example of FIG. 12, STPCLK is deasserted because of another non-directed interrupt. Thus, microprocessor 100 directs the interrupt to core 2, which may save power. Core 2 then unblocks the wake events of the other cores (per block 1134) and services the non-directed interrupt (per block 1136). Core 2 then again writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104).

This cycle may continue for a considerable time, namely for as long as only non-directed interrupts are generated. FIG. 13 shows an example in which an interrupt is directed to a core other than the last-writing core.

As can be appreciated by comparing FIGS. 10 and 12, the embodiment of FIG. 12 is advantageous in that once the cores 102 begin to enter a sleep state (after writing SYNC C7 in the example of FIG. 12), only one core 102 is awakened again to perform the handshake protocol with the chipset, and the other cores 102 remain asleep, which can be a significant advantage if the cores 102 remain asleep for a relatively long time. The power savings can be significant, especially if the operating system identifies that the processing workload for a single core 102 in the system is very small.

Further, advantageously, only one core 102 is awakened (to service non-directed events, such as a low priority interrupt) as long as no wake events are directed to the other cores 102. Again, this can be a significant advantage if the cores 102 remain asleep for a relatively long time. The power savings can be significant, especially when the system has little workload other than relatively infrequent non-directed interrupts, such as USB interrupts. Further, even when a wake event directed to another core 102 occurs (e.g., an interrupt that the operating system directs to a single core 102, such as an operating system timer interrupt), embodiments may advantageously dynamically switch the single core 102 that performs the encapsulating sleep state protocol and services non-directed wake events, as shown in FIG. 13, in order to enjoy the benefits of waking only a single core 102.

Referring now to FIG. 13, a timing diagram illustrating an example of operation of microprocessor 100 according to the flowchart of FIG. 11 is shown. The example of FIG. 13 is similar in many respects to the example of FIG. 12. However, the first time that STPCLK is deasserted, the interrupt is directed to core 1 (rather than being a non-directed interrupt as in the example of FIG. 12). Thus, control unit 104 wakes core 2 (per block 1132), and then wakes core 1 after core 2 unblocks the wake events (per block 1134). Core 2 then again writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104).

Core 1 services the directed interrupt (per block 1137). Core 1 then again writes SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104). In this example, core 2 writes its SYNC C7 before core 1 writes its SYNC C7. Thus, although core 0 still has its S bit 222 set from its initial SYNC C7 write, the S bit 222 of core 1 was cleared when core 1 was awakened. Consequently, when core 2 writes SYNC C7 after unblocking the wake events, it is not the last core to write a SYNC C7 request; instead, core 1 becomes the last core to write a SYNC C7 request.

When core 1 writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, the control unit 104 blocks wake events for all cores except core 1, i.e., the last-writing core (per block 326), since the synchronization request of core 0 is still pending (i.e., it was not cleared by the waking of core 1 and core 2), and core 2 (in this example) has already written a SYNC 14 request. In addition, control unit 104 wakes only core 1 (per block 1106). Core 1 then issues an I/O read bus cycle (per block 1108) and waits for the assertion of STPCLK (per block 1109). In response to STPCLK, core 1 issues a stall permission cycle (per block 1121), writes a sleep request with the PG bit 208 set to wait for the deassertion of STPCLK, enters a sleep state, and limits power (per block 1124).

When STPCLK is deasserted, control unit 104 wakes up only core 1 (per block 1132). In the example of FIG. 13, STPCLK is deasserted because of a non-directed interrupt; thus, microprocessor 100 directs the non-directed interrupt to core 1, which may save power. The cycle in which non-directed interrupts are handled by core 1 may continue for a considerable time, namely for as long as only non-directed interrupts are generated. In this manner, microprocessor 100 may advantageously save power by directing non-directed interrupts to the core 102 to which the most recent directed interrupt was directed, which may involve switching to a different core, as shown in the example of FIG. 13. Core 1 again unblocks the wake events of the other cores (per block 1134) and services the non-directed interrupt (per block 1136). Core 1 then again writes a SYNC C7 with the selective wake bit 214 set and the PG bit 208 set, enters a sleep state, and limits power (per block 1104).
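
The way the last-writing core changes between FIG. 12 and FIG. 13 can be illustrated with a small software model. The following Python sketch is only an illustration under simplifying assumptions, not the hardware of control unit 104: the class and method names are hypothetical, the chipset handshake of blocks 1108 through 1124 is omitted, and the only state modeled is one S bit and one wake-block flag per core.

```python
class SelectiveWakeModel:
    """Toy model of last-writer detection with the selective wake bit set."""

    def __init__(self, num_cores):
        self.s_bit = [False] * num_cores         # pending SYNC C7 requests
        self.wake_blocked = [False] * num_cores  # wake events blocked per core

    def wake(self, core):
        # Waking a core clears its S bit; the other S bits stay pending.
        self.s_bit[core] = False

    def write_sync_c7(self, core):
        """Record a SYNC C7 write; return the core woken, or None."""
        self.s_bit[core] = True
        if not all(self.s_bit):
            return None  # some core has not written yet: no sync condition
        # The writer that completes the set is the last-writing core.
        self.wake_blocked = [c != core for c in range(len(self.s_bit))]
        self.wake(core)
        return core

    def unblock_wake_events(self):
        # Corresponds loosely to block 1134.
        self.wake_blocked = [False] * len(self.s_bit)


model = SelectiveWakeModel(3)
model.write_sync_c7(0)
model.write_sync_c7(1)
first = model.write_sync_c7(2)    # core 2 is the last writer (FIG. 12)
model.unblock_wake_events()
model.wake(1)                     # directed interrupt wakes core 1 (FIG. 13)
second = model.write_sync_c7(2)   # core 1 is now missing, so no wake
third = model.write_sync_c7(1)    # core 1 becomes the last writer
```

In this model, core 0 never has to write again because its S bit remains set from its initial write, which is why only the cores that were actually awakened take part in deciding the next last writer.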

It should be appreciated that although an embodiment has been described in which the power management instruction is an x86 MWAIT instruction, other embodiments are contemplated in which synchronization requests are used to carry out other power management instructions. For example, the microprocessor 100 may perform similar operations in response to a read from one of a set of predetermined I/O port addresses associated with different C-states. In another example, the power management instruction may belong to an instruction set architecture other than the x86 architecture.

Dynamic reconfiguration of multi-core processors

Each core 102 of the microprocessor 100 generates configuration-related values based on the configuration of the cores 102 of the microprocessor 100. Preferably, the microcode of each core 102 generates, stores, and uses the configuration-related values. Embodiments are described below in which the generation of the configuration-related values may advantageously be dynamic. Examples of configuration-related values include, but are not limited to, the following.

Each core 102 generates a global core number, as described above with respect to FIG. 2. The global core number is the number of the core 102 relative to all cores 102 of the microprocessor 100, in contrast to the local core number 256, which is the number of the core 102 relative only to the cores 102 of the crystal 406 on which the core 102 resides. In one embodiment, the core 102 generates its global core number as the sum of its local core number 256 and the product of its crystal number 258 and the number of cores 102 per crystal, as follows:

global core number = (crystal number × number of cores per crystal) + local core number.

Each core 102 also generates a virtual core number. The virtual core number is the global core number minus the number of disabled cores 102 whose global core number is lower than the global core number of the instant core 102. Thus, where all of the cores 102 of the microprocessor 100 are available, the global core number is the same as the virtual core number. However, if one or more cores 102 are disabled, for example because they are defective, the virtual core number of a core 102 may differ from its global core number. In one embodiment, each core 102 populates the APIC ID field of its corresponding APIC ID register with its virtual core number. However, according to other embodiments (e.g., FIGS. 22 and 23), this is not the case. Additionally, in one embodiment, the operating system may update the APIC ID in the APIC ID register.
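
As a worked illustration of the two formulas above, the following Python sketch computes a global core number and a virtual core number. It is a minimal sketch, not the microcode of core 102; the function names and the representation of the enable bits as a list of booleans indexed by global core number are assumptions made for the example.

```python
def global_core_number(crystal_number, cores_per_crystal, local_core_number):
    # (crystal number x number of cores per crystal) + local core number
    return crystal_number * cores_per_crystal + local_core_number


def virtual_core_number(global_number, enabled):
    # Subtract the disabled cores whose global core number is lower.
    # `enabled` is a list of booleans indexed by global core number.
    disabled_below = sum(1 for g in range(global_number) if not enabled[g])
    return global_number - disabled_below


# Two crystals of four cores each; the cores with global numbers 1 and 4
# are disabled.
enabled = [True, False, True, True, False, True, True, True]
g = global_core_number(crystal_number=1, cores_per_crystal=4,
                       local_core_number=2)
v = virtual_core_number(g, enabled)
```

Here the core with local core number 2 on crystal number 1 has global core number 6 and, because two lower-numbered cores are disabled, virtual core number 4.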

Each core 102 also generates a BSP flag that indicates whether the core 102 is the BSP. In one embodiment, in the normal case (e.g., when the "all cores BSP" function of FIG. 23 is disabled), one core 102 designates itself as the bootstrap processor (BSP) and each other core 102 designates itself as an application processor (AP). After reset, an AP core 102 initializes and then enters a sleep state, waiting for the BSP to notify it to begin fetching and executing instructions. In contrast, immediately after initializing, the BSP core 102 begins fetching and executing instructions of system firmware, such as BIOS boot code, which initializes the system (e.g., verifies that system memory and peripherals are functioning properly and initializes and/or configures them) and boots the operating system, i.e., loads the operating system (e.g., from disk) and transfers control to it. Before booting the operating system, the BSP determines the system configuration (e.g., the number of cores 102 or logical processors in the system) and stores it in memory so that the operating system can read it after it is booted. After the operating system is booted, the AP cores 102 are instructed to begin fetching and executing operating system instructions. In one embodiment, in the normal case (e.g., when the "modify BSP" and "all cores BSP" functions of FIGS. 22 and 23, respectively, are disabled), a core 102 designates itself as the BSP if its virtual core number is 0, and all other cores 102 designate themselves as AP cores 102. Preferably, a core 102 writes its BSP flag configuration-related value into the BSP flag bit of the APIC base address register of its corresponding APIC. In one embodiment, the BSP is the primary core 102 of blocks 907 and 919, which performs the encapsulating sleep state handshake protocol of FIG. 9, as described above.
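
The normal-case rule that the core with virtual core number 0 designates itself as the BSP can be sketched as follows. This Python example is illustrative only; the function name and the list-of-booleans representation of the enable bits are assumptions, and the "modify BSP" and "all cores BSP" functions of FIGS. 22 and 23 are not modeled.

```python
def assign_roles(enabled):
    """Map each enabled core's global core number to "BSP" or "AP".

    `enabled` is a list of booleans indexed by global core number.
    """
    roles = {}
    virtual_number = 0  # counts enabled cores seen so far
    for global_number, is_enabled in enumerate(enabled):
        if not is_enabled:
            continue  # a disabled core takes no role
        roles[global_number] = "BSP" if virtual_number == 0 else "AP"
        virtual_number += 1
    return roles


# The core with global core number 0 is disabled, so the core with
# global core number 1 has virtual core number 0 and becomes the BSP.
roles = assign_roles([False, True, True, False])
```

Counting enabled cores in ascending order gives the same result as subtracting the number of lower-numbered disabled cores from the global core number, so the lowest-numbered enabled core is always the BSP in this sketch.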

Each core 102 also generates an APIC base address value with which to populate the APIC base address register. The APIC base address is generated based on the APIC ID of the core 102. In one embodiment, the operating system may update the APIC base address in the APIC base address register.

Each core 102 also generates a crystal primary indication that indicates whether the core 102 is the primary core 102 of the crystal 406 that includes the core 102.

Each core 102 also generates a chip primary indicator that indicates whether the core 102 is the primary core of the chip that includes the instant core 102, assuming the microprocessor 100 is configured with chips, as described above.

Each core 102 calculates the configuration-related values and operates using them so that the system including the microprocessor 100 operates properly. For example, the system directs interrupt requests to a core 102 based on its associated APIC ID. The APIC ID determines which interrupt requests the core 102 should respond to. More specifically, each interrupt request includes a destination identifier, and a core 102 responds to an interrupt request only if the destination identifier matches the APIC ID of the core 102 (or if the destination identifier is a special value indicating that the request is directed to all cores 102). As another example, each core 102 must know whether it is the BSP so that it executes the initial BIOS code and boots the operating system and, in one embodiment, performs the encapsulating sleep state handshake protocol described with respect to FIG. 9. Embodiments are described below (see FIGS. 22 and 23) in which the BSP flag and APIC ID may be modified from their normal values for specific purposes, such as testing and/or debugging.
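
The destination matching rule described above can be sketched in a few lines of Python. This is a simplified illustration rather than a description of any particular APIC implementation: the function names are hypothetical, and the value 0xFF used as the special "all cores" identifier is an assumption chosen for the example.

```python
ALL_CORES_ID = 0xFF  # assumed special destination value meaning "all cores"


def core_responds(apic_id, destination_id):
    # A core responds if the request targets every core or matches its APIC ID.
    return destination_id == ALL_CORES_ID or destination_id == apic_id


def responders(apic_ids, destination_id):
    """Return the indices of the cores that respond to an interrupt request."""
    return [core for core, apic_id in enumerate(apic_ids)
            if core_responds(apic_id, destination_id)]


apic_ids = [0, 1, 2]            # e.g., APIC IDs taken from virtual core numbers
directed = responders(apic_ids, 1)
broadcast = responders(apic_ids, ALL_CORES_ID)
```

Because the APIC ID may be derived from the virtual core number, disabling a core can change which core answers a given destination identifier, which is why the configuration-related values must be regenerated after reconfiguration.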

Referring now to FIG. 14, a flowchart illustrating dynamic reconfiguration of microprocessor 100 is shown. The description of FIG. 14 refers to the multi-die microprocessor 100 of FIG. 4, which includes two crystals 406 and eight cores 102. However, it should be understood that the described dynamic reconfiguration may be used with a microprocessor 100 having a different configuration, i.e., having more than two crystals or a single crystal, and more or fewer than eight cores 102 but at least two cores 102. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description so that the cores collectively reconfigure the microprocessor 100 dynamically. Flow begins at block 1402.

At block 1402, the microprocessor 100 is reset and the hardware of the microprocessor 100 populates the configuration register 112 of each core 102 with the appropriate values based on the number of available cores 102 and the crystal number of the crystal 406 on which the core 102 resides. In one embodiment, the local core number 256 and the crystal number 258 are hardwired. As described above, the hardware may determine whether a core 102 is enabled or disabled from the blown or unblown state of fuses 114. Flow proceeds to block 1404.

At block 1404, core 102 reads configuration word 252 from configuration register 112. The core 102 then generates its configuration-related values based on the value of the configuration word 252 just read. In the case of a multi-die microprocessor 100 configuration, the configuration-related values generated in block 1404 do not take into account the cores 102 of the other crystals 406. However, the configuration-related values generated in blocks 1414 and 1424 (and block 1524 in FIG. 15) do take into account the cores 102 of the other crystals 406, as described below. Flow proceeds to block 1406.

At block 1406, core 102 causes the values of the enable bits 254 of the local cores 102 in the local configuration register 112 to be propagated to the corresponding enable bits 254 of the configuration register 112 of the remote crystal 406. For example, referring to the configuration of FIG. 4, a core 102 in crystal A 406A causes the enable bits 254 associated with cores A, B, C and D (local cores) in configuration register 112 of crystal A 406A (local crystal) to be propagated to the enable bits 254 associated with cores A, B, C and D in configuration register 112 of crystal B 406B (remote crystal). Conversely, a core 102 in crystal B 406B causes the enable bits 254 associated with cores E, F, G and H (local cores) in configuration register 112 of crystal B 406B (local crystal) to be propagated to the enable bits 254 associated with cores E, F, G and H in configuration register 112 of crystal A 406A (remote crystal). In one embodiment, core 102 propagates the bits to the other crystals 406 by writing to the local configuration register 112. Preferably, the write to the local configuration register 112 by the core 102 leaves the local configuration register 112 unchanged, but causes the local control unit 104 to propagate the local enable bit 254 values to the remote crystal 406. Flow proceeds to block 1408.
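
The propagation of local enable bits to the remote crystal can be pictured as a masked copy between two configuration words. The Python sketch below is a simplified model under stated assumptions: the function name, the 8-bit word layout (bit 0 for core A through bit 7 for core H), and the mask constants are hypothetical and are not taken from the layout of configuration word 252.

```python
CRYSTAL_A_MASK = 0x0F  # assumed: bits 0-3 hold the enable bits of cores A-D
CRYSTAL_B_MASK = 0xF0  # assumed: bits 4-7 hold the enable bits of cores E-H


def propagate_enable_bits(local_word, remote_word, local_mask):
    """Return the remote word after receiving the local crystal's enable bits.

    Only the bits owned by the local crystal are overwritten; the remote
    crystal's knowledge of its own cores is preserved.
    """
    return (remote_word & ~local_mask) | (local_word & local_mask)


# Crystal A knows core C (bit 2) is disabled; crystal B knows core H
# (bit 7) is disabled. Each starts out assuming the other's cores are enabled.
word_a = 0b11111011
word_b = 0b01111111
new_b = propagate_enable_bits(word_a, word_b, CRYSTAL_A_MASK)
new_a = propagate_enable_bits(word_b, word_a, CRYSTAL_B_MASK)
```

After both propagations the two words agree, which is the property the later read of the configuration register relies on.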

At block 1408, core 102 writes a synchronization request for synchronization case 8 (labeled SYNC 8 in FIG. 14) into its synchronization register 108. In response, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 1412.

At block 1412, the control unit 104 wakes up the core 102 when all available cores 102 in the core set specified by the core set field 228 have written a SYNC 8. It is noted that in the case of a multi-die microprocessor 100 configuration, the synchronization event may be a multi-die synchronization event. That is, control unit 104 waits to wake up the core 102 (or to interrupt it, if the core 102 did not set the sleep bit 212 to request sleep) until all the cores 102 specified in the core set field 228 (which may include cores 102 in other crystals 406) have written their synchronization requests. Flow proceeds to block 1414.

At block 1414, the core 102 again reads the configuration register 112 and generates its configuration-related values based on the new value of the configuration word 252, which now includes the correct values of the enable bits 254 propagated from the remote crystal. Flow proceeds to decision block 1416.

In decision block 1416, core 102 determines whether it should disable itself. In one embodiment, the core 102 determines that it should disable itself because its microcode, during its reset processing (prior to decision block 1416), read a blown fuse 114 indicating that the core 102 should disable itself. The fuse 114 may be blown during or after manufacture of the microprocessor 100. In another embodiment, an updated fuse 114 value may be scanned into a holding register, as described above, and the scanned value indicates that the core 102 should be disabled. FIG. 15 is a flowchart describing another embodiment in which the core 102 determines that it should be disabled in a different manner. If core 102 determines that it should be disabled, flow proceeds to block 1417; otherwise, flow proceeds to block 1418.

At block 1417, core 102 writes disable core bit 236 to remove itself from the list of available cores 102, e.g., clears its corresponding enable bit 254 in configuration word 252 of configuration register 112. Thereafter, core 102 may prevent itself from executing any more instructions, preferably by setting one or more bits to turn off its clock signal and remove its power. Flow ends at block 1417.

At block 1418, core 102 writes a synchronization request for synchronization case 9 (labeled SYNC 9 in FIG. 14) into synchronization register 108. Thus, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 1422.

At block 1422, when all enabled cores 102 have written a SYNC 9, the core 102 is awakened by the control unit 104. Additionally, in the case of a multi-die microprocessor 100 configuration, the synchronization event may be a multi-die synchronization event based on the updated values in the configuration register 112. Furthermore, when the control unit 104 determines whether a synchronization condition has occurred, the control unit 104 excludes from consideration any core 102 that disabled itself at block 1417. More specifically, in the case where all the other cores 102 (other than the core 102 that disables itself) write a SYNC 9 before the self-disabling core 102 writes the synchronization register 108 at block 1417, the control unit 104 detects the occurrence of the synchronization condition (per block 316) when the self-disabling core 102 writes the synchronization register 108 with the disable core bit 236 set at block 1417. The control unit 104 determines that the synchronization condition has occurred because it no longer considers the disabled core 102, whose enable bit 254 is now clear. That is, since SYNC 9 has been written by all enabled cores 102, the control unit 104 determines that the synchronization condition has occurred regardless of whether the disabled core 102 has written SYNC 9. Flow proceeds to block 1424.
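
The rule that a disabled core is left out of the synchronization check can be expressed compactly with bitmasks. The following Python sketch is an illustration only: the function name and the one-bit-per-core encoding of the S bits, enable bits, and core set are assumptions, as is the choice to report no synchronization condition when no enabled core is in the core set.

```python
def sync_condition_occurred(s_bits, enable_bits, core_set):
    """Return True when every enabled core in the core set has its S bit set.

    Each argument is an integer bitmask with one bit per core.
    """
    required = enable_bits & core_set  # disabled cores are not considered
    # Assumption: with no enabled core in the set there is nothing to wake.
    return required != 0 and (s_bits & required) == required


CORE_SET = 0b111          # three cores take part in the synchronization
s_bits = 0b101            # cores 0 and 2 have written SYNC 9; core 1 has not
before = sync_condition_occurred(s_bits, 0b111, CORE_SET)
# Core 1 disables itself, which clears its enable bit.
after = sync_condition_occurred(s_bits, 0b101, CORE_SET)
```

Re-evaluating the same S bits after the enable bit is cleared changes the result from no synchronization condition to a synchronization condition, which matches the case described above in which the self-disabling core writes last.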

At block 1424, the core 102 again reads the configuration register 112; if another core 102 disabled itself at block 1417, the new value of the configuration word 252 reflects the disabled core 102. The core 102 then again generates its configuration-related values based on the new value of the configuration word 252, in a manner similar to that in block 1414. The presence of a disabled core 102 may cause some of the new configuration-related values to differ from those generated in block 1414. For example, as described above, the virtual core number, APIC ID, BSP flag, APIC base address, crystal primary indication, and chip primary indicator may change due to the presence of the disabled core 102. In a further embodiment, after generating the configuration-related values, one of the cores 102 (e.g., the BSP) writes some of the configuration-related values that are global to all of the cores 102 of the microprocessor 100 into the uncore dedicated random access memory 116 so that they can subsequently be read by all of the cores 102. For example, in one embodiment, the global configuration-related values are read by the cores 102 to execute an architectural instruction (e.g., an x86 CPUID instruction) that requests global information about the microprocessor 100, such as the number of cores 102 of the microprocessor 100. Flow proceeds to block 1426.

At block 1426, core 102 comes out of reset and begins fetching architectural instructions. Flow ends at block 1426.

Referring now to FIG. 15, a flowchart illustrating dynamic reconfiguration of microprocessor 100 according to another embodiment is shown. The description of FIG. 15 refers to the multi-die microprocessor 100 of FIG. 4, which includes two crystals 406 and eight cores 102. However, it should be understood that the described dynamic reconfiguration may be used with a microprocessor 100 having a different configuration, i.e., having more than two crystals or a single crystal, and more or fewer than eight cores 102 but at least two cores 102. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description so that the cores collectively reconfigure the microprocessor 100 dynamically. More specifically, FIG. 15 depicts the operation of the one core 102 that encounters a core disable instruction, whose flow begins at block 1502, and the operation of the other cores 102, whose flow begins at block 1532.

At block 1502, one of the cores 102 encounters an instruction that instructs the core 102 to disable itself. In one embodiment, the instruction is an x86 WRMSR instruction. In response, the core 102 sends a reconfiguration message to the other cores 102 and sends them an inter-core interrupt. Preferably, the core 102 traps to microcode in response to the instruction to disable itself (at block 1502) or in response to the interrupt (at block 1532), and remains in microcode, during which time interrupts are disabled (i.e., the microcode does not allow itself to be interrupted), until block 1526. Flow proceeds from block 1502 to block 1504.

In block 1532, one of the other cores 102 (e.g., a core other than the core 102 that encountered the disable instruction in block 1502) is interrupted and receives the reconfiguration message as a result of the inter-core interrupt sent in block 1502. As described above, although the flow at block 1532 is described from the perspective of a single core 102, each of the other cores 102 (e.g., each core 102 not at block 1502) is interrupted at block 1532, receives the message, and performs the steps of blocks 1504 through 1526. Flow proceeds from block 1532 to block 1504.

At block 1504, core 102 writes a synchronization request for synchronization case 10 (denoted as SYNC 10 in FIG. 15) to its synchronization register 108. In response, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 1506.

In block 1506, when all available cores 102 have written a SYNC 10, core 102 is awakened by control unit 104. It is noted that in the case of a multi-die microprocessor 100 configuration, the synchronization event may be a multi-die synchronization event. That is, control unit 104 waits to wake the core 102 (or to interrupt it, if the core 102 did not request to enter a sleep state) until all the cores 102 that are specified in the core set field 228 (which may include cores 102 in other crystals 406) and enabled (as indicated by their enable bits 254) have written their synchronization requests. Flow proceeds to decision block 1508.

In decision block 1508, the core 102 determines whether it is a core 102 that was instructed to deactivate itself in block 1502. If so, flow proceeds to block 1517; otherwise, flow proceeds to block 1518.

At block 1517, core 102 writes the disable core bit 236 to remove itself from the list of available cores 102, e.g., clears its corresponding enable bit 254 in the configuration word 252 of configuration register 112. Thereafter, core 102 may prevent itself from executing any more instructions, preferably by setting one or more bits to turn off its clock signal and remove its power. Flow ends at block 1517.

At block 1518, core 102 writes a synchronization request for a synchronization case 11 (denoted as SYNC 11 in FIG. 15) into synchronization register 108. Thus, control unit 104 causes core 102 to enter a sleep state. Flow proceeds to block 1522.

In block 1522, when all enabled cores 102 have written a SYNC 11, the core 102 is awakened by the control unit 104. Additionally, in the case of a multi-die microprocessor 100 configuration, the synchronization event may be a multi-die synchronization event based on the updated values in the configuration register 112. Furthermore, when control unit 104 determines whether a synchronization condition has occurred, control unit 104 excludes from consideration the core 102 that disabled itself at block 1517. More specifically, in the case where all the other cores 102 (other than the core 102 that disables itself) write a SYNC 11 before the self-disabling core 102 writes the synchronization register 108 at block 1517, the control unit 104 detects the occurrence of the synchronization condition (per block 316) when the self-disabling core 102 writes the synchronization register 108 at block 1517, because the enable bit 254 of the disabled core 102 is now clear and the control unit 104 therefore no longer considers the disabled core 102 (see FIG. 16). That is, since all enabled cores 102 have written a SYNC 11, the control unit 104 determines that the synchronization condition has occurred regardless of whether the disabled core 102 has written a SYNC 11. Flow proceeds to block 1524.

At block 1524, core 102 reads configuration register 112, whose configuration word 252 reflects the core 102 that was disabled at block 1517. The core 102 then generates its configuration-related values based on the new value of the configuration word 252. Preferably, the disable instruction of block 1502 is executed by system firmware (e.g., BIOS setup), and after the core 102 is disabled, the system firmware reboots the system, for example after block 1526. During the reboot, microprocessor 100 may operate differently than it did before the configuration-related values were generated at block 1524. For example, the BSP during the reboot may be a different core 102 than before the configuration-related values were generated. As another example, the system configuration information (e.g., the number of cores 102 and logical processors in the system) that the BSP determines prior to booting the operating system and stores to memory for the operating system to read may be different. As another example, the APIC IDs of the cores 102 still in use may differ from their APIC IDs before the configuration-related values were generated, in which case the operating system will direct interrupt requests, and the cores 102 will respond to them, differently than before the configuration-related values were generated. As another example, the primary core 102 that performs the encapsulating sleep state handshake protocol of FIG. 9 in blocks 907 and 919 may be a different core 102 than before the configuration-related values were generated. Flow proceeds to block 1526.

At block 1526, core 102 resumes the task it was performing before it was interrupted at block 1532. Flow ends at block 1526.

Dynamically reconfiguring the microprocessor 100 as described herein may be useful in a variety of applications. For example, dynamic reconfiguration may be used during development of microprocessor 100 for testing and/or simulation, and/or in field testing. Additionally, a user may want to know the performance and/or power consumption of the system when running a particular application using only a subset of the cores 102. In one embodiment, when a core 102 is disabled, it may have its clocks stopped and/or its power removed so that it consumes substantially no power. Furthermore, in a high reliability system, each core 102 may periodically check whether the other cores 102 have failed, and if the cores 102 determine by vote that a particular core 102 has failed, the non-failed cores may disable the failed core 102 and cause the remaining cores 102 to perform dynamic reconfiguration as described above. In such an embodiment, control word 202 may include an additional field that enables the writing core 102 to specify the core 102 to be disabled, and the operation depicted in FIG. 15 is modified such that a core may disable a core 102 other than itself at block 1517.

Referring now to FIG. 16, a timing diagram illustrating an example of operation of microprocessor 100 according to the flowchart of FIG. 15 is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, microprocessor 100 may include a different number of cores 102 and may be a single crystal or multi-crystal microprocessor 100. In this timing diagram, the timing of events advances downward.

Core 1 encounters an instruction instructing it to disable itself and, in response, sends a reconfiguration message and interrupts core 0 and core 2 (per block 1502). Core 1 then writes SYNC 10 and enters a sleep state (per block 1504).

Core 0 and core 2 are each eventually interrupted from their current tasks and read the message (per block 1532). In response, core 0 and core 2 each write SYNC 10 and enter a sleep state (per block 1504). As shown, the time at which each core writes SYNC 10 may differ, for example, because of the latency of the instruction that is executing when the interrupt is asserted.

When all cores 102 have written SYNC 10, control unit 104 wakes up all cores simultaneously (per block 1506). Core 0 and core 2 then determine that they are not disabling themselves (per decision block 1508), write SYNC 11, and enter a sleep state (per block 1518). Core 1, however, determines that it is disabling itself, so it writes its disable core bit 236 (per block 1517). In this example, core 1 writes its disable core bit 236 after core 0 and core 2 have written their respective SYNC 11, as shown. Nevertheless, control unit 104 detects that a synchronization condition has occurred, because it determines that S bit 222 is set for each core 102 whose enable bit 254 is set. That is, even though S bit 222 of core 1 is not set, its enable bit 254 is cleared when the synchronization register 108 of core 1 is written at block 1517.

When all available cores have written SYNC 11, the control unit 104 wakes up all of them simultaneously (per block 1522). As described above, in the case of a multi-die microprocessor 100, when core 1 writes its disable core bit 236 and the local control unit 104 in response clears core 1's local enable bit 254, the local control unit 104 also propagates the local enable bit 254 to the remote crystal 406. Thus, the remote control unit 104 also detects the occurrence of the synchronization condition and simultaneously wakes up all available cores of its crystal 406. Core 0 and core 2 then generate their configuration-related values based on the updated value of configuration register 112 (per block 1524) and resume their previous activity (per block 1526).
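By way of illustration only, the synchronization condition used in this example (S bit 222 set for every core 102 whose enable bit 254 is set) may be sketched in Python as follows. The class name `CoreSyncState`, the function name `sync_condition`, and the use of booleans to model the bits are assumptions made for the sketch and do not describe the actual logic of control unit 104.

```python
from dataclasses import dataclass


@dataclass
class CoreSyncState:
    """Hypothetical per-core view held by the control unit."""
    enabled: bool  # models enable bit 254
    s_bit: bool    # models S bit 222


def sync_condition(cores):
    """Return True when every enabled core has its S bit set."""
    return all(core.s_bit for core in cores if core.enabled)


# FIG. 16 example: core 0 and core 2 have written SYNC 11, core 1 has not.
cores = [
    CoreSyncState(enabled=True, s_bit=True),   # core 0
    CoreSyncState(enabled=True, s_bit=False),  # core 1
    CoreSyncState(enabled=True, s_bit=True),   # core 2
]
before_disable = sync_condition(cores)  # core 1 is still enabled and unsynchronized

# Core 1 writes its disable core bit 236, which clears its enable bit 254.
cores[1].enabled = False
after_disable = sync_condition(cores)   # only enabled cores are considered now
```

In this sketch the condition becomes true at the moment core 1 is disabled, without core 1 ever setting its own S bit, which mirrors the ordering shown in the timing diagram.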

Hardware semaphore

Referring now to FIG. 17, a block diagram illustrating the hardware semaphore 118 of FIG. 1 is shown. The hardware semaphore 118 includes an owned bit 1702, owner bits 1704, and a state machine 1706 that updates the owned bit 1702 and the owner bits 1704 in response to reads and writes of the hardware semaphore 118 by the cores 102. Preferably, the number of owner bits 1704 is the base-2 logarithm of the number of cores 102 with which the microprocessor 100 is configured, so that the owner bits 1704 identify the core 102 that currently owns the hardware semaphore 118. In another embodiment, the owner bits 1704 include a corresponding bit for each core 102 of the microprocessor 100. It is noted that although a single set of owned bit 1702, owner bits 1704, and state machine 1706 is depicted as implementing one hardware semaphore 118, microprocessor 100 may include multiple hardware semaphores 118, each of which includes a set of hardware as described above. Preferably, to perform operations that require exclusive access to a shared resource, microcode running in each core 102 reads and writes the hardware semaphore 118 to take ownership of a resource shared by the cores 102, as described in more detail in the examples below. The microcode may associate each of the plurality of hardware semaphores 118 with ownership of a different shared resource of microprocessor 100. Preferably, hardware semaphore 118 is read and written by core 102 at a predetermined address in a non-architectural address space of core 102. The non-architectural address space is accessible only by microcode of a core 102, and not directly by user code (e.g., x86 architectural program instructions). The operation of the state machine 1706 to update the owned bit 1702 and owner bits 1704 of the hardware semaphore 118 is described with respect to FIGS. 18 and 19, and uses of the hardware semaphore 118 are described thereafter.
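As a small worked example of the two owner-bit encodings described above, the following Python sketch computes how many owner bits 1704 would be needed. The function name `owner_bit_count` and the rounding up for core counts that are not powers of two are assumptions of the sketch.

```python
def owner_bit_count(num_cores, one_bit_per_core=False):
    """Number of owner bits needed to identify the owning core.

    With the encoded embodiment this is the base-2 logarithm of the core
    count (rounded up here, as an assumption, for non-power-of-two counts).
    With the alternative embodiment there is one bit per core.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if one_bit_per_core:
        return num_cores
    return (num_cores - 1).bit_length()


encoded_bits_for_8_cores = owner_bit_count(8)
one_hot_bits_for_8_cores = owner_bit_count(8, one_bit_per_core=True)
```

For an eight-core configuration the encoded form needs three bits, while the one-bit-per-core form needs eight.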

Please refer to fig. 18, which is a flowchart illustrating the operation when a core 102 reads the hardware semaphore 118. Flow begins at block 1802.

In block 1802, a core 102, designated core x, reads the hardware semaphore 118. As described above, preferably, the microcode of the core 102 reads a predetermined address in the non-architectural address space where the hardware semaphore 118 resides. Flow proceeds to decision block 1804.

At decision block 1804, state machine 1706 examines owner bit 1704 to determine whether core 102 is the owner of hardware semaphore 118. If so, flow proceeds to block 1808; otherwise, flow proceeds to block 1806.

At block 1806, the hardware semaphore 118 returns a value of zero to the reading core 102 to indicate that the core 102 does not own the hardware semaphore 118. Flow ends at block 1806.

At block 1808, the hardware semaphore 118 returns a value of one to the reading core 102 to indicate that the core 102 owns the hardware semaphore 118. Flow ends at block 1808.

As described above, microprocessor 100 may include a plurality of hardware semaphores 118. In one embodiment, the microprocessor 100 includes 16 hardware semaphores 118, and when a core 102 reads a predetermined address, it receives a 16-bit data value, each bit of which corresponds to a different hardware semaphore 118 of the 16 hardware semaphores 118, and indicates whether the core 102 that reads the predetermined address owns the corresponding hardware semaphore 118.

Referring now to FIG. 19, therein is shown a flow chart of the operation of a core 102 when writing hardware semaphore 118. Flow begins at block 1902.

At block 1902, a core 102, designated core x, writes the hardware semaphore 118, e.g., at the predetermined non-architectural address described above. Flow proceeds to decision block 1904.

At decision block 1904, the state machine 1706 examines the owned bit 1702 to determine whether the hardware semaphore 118 is owned or not owned (free) by any core 102. If owned, flow proceeds to decision block 1914; otherwise, flow proceeds to decision block 1906.

At decision block 1906, the state machine 1706 checks the written value. If the value is 1, indicating that the core 102 is to acquire ownership of the hardware semaphore 118, flow proceeds to block 1908. If, however, the value is 0, indicating that the core 102 is to relinquish ownership of the hardware semaphore 118, flow proceeds to block 1912.

At block 1908, state machine 1706 updates the owned bit 1702 to one and sets the owner bits 1704 to indicate that core x now owns the hardware semaphore 118. Flow ends at block 1908.

At block 1912, the state machine 1706 does not perform an update of the owned bit 1702 nor the owner bit 1704 and flow ends at block 1912.

At decision block 1914, the state machine 1706 examines the owner bit 1704 to determine whether core x is the owner of the hardware semaphore 118. If so, flow proceeds to decision block 1916; otherwise, flow proceeds to block 1912.

At decision block 1916, the state machine 1706 checks the written value. If the value is 1, indicating that the core 102 is to acquire ownership of the hardware semaphore 118, flow proceeds to block 1912 (no update occurs because the core 102 already owns the hardware semaphore 118, as determined at decision block 1914). If, however, the value is 0, indicating that the core 102 is to relinquish ownership of the hardware semaphore 118, flow proceeds to block 1918.

At block 1918, the state machine 1706 updates the owned bit 1702 to zero to indicate that no core 102 now owns the hardware semaphore 118. Flow ends at block 1918.
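The read and write flows of FIGS. 18 and 19 may be summarized, purely for illustration, by the following Python model of a single hardware semaphore 118. The class name `HardwareSemaphore`, the use of `None` for an undefined owner, and the check of the owned bit during a read are assumptions of the sketch, not a description of the actual state machine 1706.

```python
class HardwareSemaphore:
    """Behavioral sketch of one hardware semaphore (FIGS. 18 and 19)."""

    def __init__(self):
        self.owned = False  # models owned bit 1702
        self.owner = None   # models owner bits 1704; meaningful only when owned

    def read(self, core):
        """FIG. 18: return 1 if `core` owns the semaphore, otherwise 0."""
        return 1 if self.owned and self.owner == core else 0

    def write(self, core, value):
        """FIG. 19: value 1 requests ownership, value 0 relinquishes it."""
        if not self.owned:
            if value == 1:
                # Block 1908: the semaphore is free, so the writer takes it.
                self.owned = True
                self.owner = core
            # Block 1912: a release request on a free semaphore is ignored.
        elif self.owner == core and value == 0:
            # Block 1918: the owner relinquishes the semaphore.
            self.owned = False
            self.owner = None
        # Block 1912: any other write (non-owner, or owner writing 1) is ignored.


sem = HardwareSemaphore()
sem.write(0, 1)            # core 0 acquires
sem.write(1, 1)            # core 1 requests, ignored because core 0 owns it
core0_owns = sem.read(0)
core1_owns = sem.read(1)
sem.write(1, 0)            # a non-owner cannot release
still_owned = sem.owned
sem.write(0, 0)            # the owner releases
released = not sem.owned
```

Because a core learns the outcome only by reading the semaphore after writing it, the write itself never needs to return a status in this model.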

As described above, in one embodiment, microprocessor 100 includes 16 hardware semaphores 118. When a core 102 writes to the predetermined address, it writes a 16-bit data value, each bit of which corresponds to a different one of the 16 hardware semaphores 118 and indicates whether the core 102 writing to the predetermined address is requesting ownership (a value of one) or relinquishing ownership (a value of zero) of the corresponding hardware semaphore 118.
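The 16-semaphore embodiment may be pictured with the following Python sketch, in which a single 16-bit write or read touches all semaphores at once. The class name `SemaphoreBank` and the representation of each semaphore as an owner entry (with `None` meaning free) are assumptions of the sketch.

```python
NUM_SEMAPHORES = 16


class SemaphoreBank:
    """Sketch of 16 hardware semaphores accessed as one 16-bit value."""

    def __init__(self):
        # owner[i] is the owning core number, or None when semaphore i is free.
        self.owner = [None] * NUM_SEMAPHORES

    def write(self, core, value):
        """Bit i of `value`: 1 requests semaphore i, 0 relinquishes it."""
        for i in range(NUM_SEMAPHORES):
            wants = (value >> i) & 1
            if self.owner[i] is None and wants:
                self.owner[i] = core
            elif self.owner[i] == core and not wants:
                self.owner[i] = None
            # Writes to semaphores owned by another core have no effect.

    def read(self, core):
        """Return a 16-bit mask of the semaphores owned by `core`."""
        return sum(1 << i for i in range(NUM_SEMAPHORES) if self.owner[i] == core)


bank = SemaphoreBank()
bank.write(0, 0b0101)          # core 0 requests semaphores 0 and 2
core0_mask = bank.read(0)
bank.write(1, 0b0111)          # core 1 requests 0, 1 and 2; only 1 is free
core1_mask = bank.read(1)
```

One consequence in this sketch is that a core which wishes to keep the semaphores it already owns while requesting another would write back its current mask with the additional bit set, since a zero bit is treated as a release.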

In one embodiment, arbitration logic arbitrates among the requests of the cores 102 to access the hardware semaphore 118, such that the reads and writes of the hardware semaphore 118 by the cores 102 are serialized. In one embodiment, the arbitration logic uses a round-robin fairness algorithm among the cores 102 to grant access to the hardware semaphore 118.
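A round-robin arbiter of the general kind mentioned above may be sketched in Python as follows. The class name `RoundRobinArbiter` and the choice to give core 0 priority on the first grant are assumptions of the sketch; the actual arbitration logic is not described at this level of detail.

```python
class RoundRobinArbiter:
    """Grants one requesting core per call, rotating priority for fairness."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # Start so that core 0 has the highest priority on the first grant.
        self.last_granted = num_cores - 1

    def grant(self, requesting):
        """Return the next requesting core after the last one granted, or None."""
        for offset in range(1, self.num_cores + 1):
            candidate = (self.last_granted + offset) % self.num_cores
            if candidate in requesting:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(3)
# Core 0 and core 2 request continuously; the grants alternate between them.
grants = [arbiter.grant({0, 2}) for _ in range(4)]
```

Granting exactly one core per call is what serializes the accesses; the rotation of priority prevents a continuously requesting core from starving the others.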

Turning now to FIG. 20, a flowchart illustrating operation of the microprocessor 100 when using the hardware semaphore 118 to perform an operation requiring exclusive ownership of a resource is shown. More specifically, the hardware semaphore 118 is used to ensure that only one core 102 at a time performs a write back and invalidate of the shared cache 119 when two or more cores 102 have each encountered an instruction to write back and invalidate the shared cache 119. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description, so that collectively the cores 102 ensure that while one core 102 is performing a write back and invalidate operation, no other core 102 is doing so. That is, the operation of FIG. 20 ensures that the processing of WBINVD instructions is serialized. In one embodiment, the operation of FIG. 20 may be performed in a microprocessor 100 that executes a WBINVD instruction according to the embodiment of FIGS. 7A-7B. Flow begins at block 2002.

At block 2002, a core 102 encounters a cache control instruction, such as a WBINVD instruction. Flow proceeds to block 2004.

At block 2004, the core 102 writes a one to the WBINVD hardware semaphore 118. In one embodiment, the microcode has allocated one of the hardware semaphores 118 to the WBINVD operation. The core 102 then reads the WBINVD hardware semaphore 118 to determine whether it has ownership. Flow proceeds to decision block 2006.

At decision block 2006, if the core 102 determines that it has ownership of the WBINVD hardware semaphore 118, flow proceeds to block 2008; otherwise, flow returns to block 2004 to again attempt to acquire ownership. It should be noted that while the microcode of the instant core 102 loops through blocks 2004-2006, it is eventually interrupted by the core 102 that owns the WBINVD hardware semaphore 118, because that core 102 is executing a WBINVD instruction and sends an interrupt to the instant core 102 at block 702 of FIGS. 7A-7B. Preferably, on each pass through the loop, the microcode of the instant core 102 checks an interrupt status register to see whether one of the other cores 102 (e.g., the core 102 that owns the WBINVD hardware semaphore 118) has sent an interrupt to the instant core 102. The instant core 102 will then perform the operation of FIGS. 7A-7B and, at block 749, resume operation according to FIG. 20 in an attempt to obtain ownership of the hardware semaphore 118 in order to execute its own WBINVD instruction.

At block 2008, the core 102 has obtained ownership, and flow proceeds to block 702 of FIGS. 7A-7B to execute the WBINVD instruction. As part of the WBINVD instruction operation, the core 102 writes a zero to the WBINVD hardware semaphore 118 at block 748 of FIGS. 7A-7B to relinquish its ownership. Flow ends at block 2008.
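The acquire loop of FIG. 20 may be sketched in Python as follows, purely for illustration. The names `Semaphore`, `run_wbinvd`, `service_pending_interrupt`, and `do_wbinvd` are hypothetical; the two callbacks stand in for the interrupt handling and the write back and invalidate work of FIGS. 7A-7B, which the sketch does not model.

```python
class Semaphore:
    """Minimal stand-in for one hardware semaphore (see FIGS. 18 and 19)."""

    def __init__(self):
        self.owner = None  # None means the semaphore is free

    def write(self, core, value):
        if self.owner is None and value == 1:
            self.owner = core
        elif self.owner == core and value == 0:
            self.owner = None

    def read(self, core):
        return 1 if self.owner == core else 0


def run_wbinvd(sem, core, service_pending_interrupt, do_wbinvd):
    """Serialize a WBINVD-like operation; return the number of acquire attempts."""
    attempts = 0
    while True:
        attempts += 1
        sem.write(core, 1)            # block 2004: request ownership
        if sem.read(core) == 1:       # decision block 2006: do we own it?
            break
        service_pending_interrupt()   # poll for an interrupt from the owner
    do_wbinvd()                       # block 2008: perform the operation
    sem.write(core, 0)                # release ownership when done
    return attempts


# Scenario: core 1 already owns the semaphore when core 0 starts.
sem = Semaphore()
sem.write(1, 1)
log = []


def service():
    # Hypothetical: servicing the owner's interrupt lets the owner finish and release.
    log.append("serviced interrupt")
    sem.write(1, 0)


attempts = run_wbinvd(sem, 0, service, lambda: log.append("wbinvd"))
```

In this scenario core 0 fails its first attempt, services the owner's interrupt, and succeeds on its second attempt, after which the semaphore is free again.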

An operation similar to that described with respect to FIG. 20 may be performed by the microcode to obtain exclusive ownership of other shared resources. Other resources of which a core 102 may obtain exclusive ownership by using a hardware semaphore 118 are registers of uncore 103 that are shared by the cores 102. In one embodiment, the uncore 103 registers include a control register that includes a respective field for each core 102. Each field controls operational aspects of its respective core 102. Since the fields are located in the same register, when a core 102 wants to update its own field without changing the fields of the other cores 102, the core 102 must read the control register, modify the value read, and then write the modified value back to the control register. For example, microprocessor 100 may include a Performance Control Register (PCR), which may reside in the uncore 103, for controlling the bus clock ratio of each core 102. To update its bus clock ratio, a particular core 102 must read, modify, and write back the PCR. Thus, in one embodiment, the microcode is configured to perform an effectively atomic read/modify/write of the PCR by doing so only while the core 102 owns the hardware semaphore 118 associated with the PCR. The bus clock ratio determines the clock frequency of an individual core 102 as a multiple of the frequency of the clock supplied to the microprocessor 100 via an external bus.
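The read/modify/write described above may be illustrated with the following Python sketch of updating one core's field in a shared register value. The 8-bit field width, the field layout (core n at bits 8n to 8n+7), and the function name `update_core_field` are assumptions made only for the sketch; the actual PCR layout is not specified here.

```python
FIELD_BITS = 8  # hypothetical width of each per-core field


def update_core_field(register_value, core, new_field):
    """Return the register value with only `core`'s field replaced."""
    shift = core * FIELD_BITS
    mask = ((1 << FIELD_BITS) - 1) << shift
    # Clear the core's own field, then insert the new value; other fields are kept.
    return (register_value & ~mask) | ((new_field << shift) & mask)


# Read (0x00112233), modify core 1's field to 0xAB, and write back the result.
old_value = 0x00112233
new_value = update_core_field(old_value, core=1, new_field=0xAB)
```

If two cores performed this sequence concurrently, the later write could discard the earlier core's change, which is why the sequence would be bracketed by acquiring and releasing the associated hardware semaphore 118.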

Another resource is a Trusted Platform Module (TPM). In one embodiment, microprocessor 100 implements a TPM in microcode that runs on the cores 102. At a given instant in time, microcode running on one of the cores 102 implements the TPM. However, the core 102 implementing the TPM may change over time. By using the hardware semaphore 118 associated with the TPM, the microcode of the cores 102 may ensure that only one core 102 implements the TPM at a time. More specifically, the core 102 that is currently implementing the TPM writes the TPM state to the private random access memory 116 before it stops implementing the TPM, and the core 102 that takes over implementing the TPM reads the TPM state from the private random access memory 116. The microcode in each core 102 is configured such that when the core 102 is to become the core 102 that implements the TPM, the core 102 first takes ownership of the TPM hardware semaphore 118 before reading the TPM state from the private random access memory 116 and beginning to implement the TPM. In one embodiment, the TPM substantially conforms to the TPM specification promulgated by the Trusted Computing Group, such as the ISO/IEC 11889 specification.

As described above, the conventional solution to resource contention among multiple processors is to use software semaphores in system memory. A potential advantage of the hardware semaphore 118 described herein is that it avoids generating additional traffic on the memory bus and is faster to access than system memory.

Interrupt, non-sleep synchronization request

Referring now to FIG. 21, a timing diagram illustrating an example of an operation of the core 102 issuing a non-sleep synchronization request according to the flowchart of FIG. 3 is shown. In this example, microprocessor 100 is configured with three cores 102, labeled core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments, the microprocessor 100 may include a different number of cores 102.

Core 0 writes a SYNC 14 in which neither sleep bit 212 nor selective wake bit 214 is set (i.e., a non-sleep SYNC request). Accordingly, control unit 104 allows core 0 to continue running (per the "no" branch of decision block 312).

Core 1 eventually also writes a non-sleep SYNC 14, and control unit 104 allows core 1 to continue running. Finally, core 2 writes a non-sleep SYNC 14. As shown, the time at which each core writes SYNC 14 may differ.

When all cores have written the non-sleep SYNC 14, control unit 104 simultaneously sends a sync interrupt to each of core 0, core 1, and core 2. Each core then receives and services the sync interrupt (unless the sync interrupt is masked, in which case the microcode typically polls for the sync interrupt).
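The behavior shown in FIG. 21 may be sketched in Python as follows, as an illustration only. The class name `ControlUnit`, the method name `write_nonsleep_sync`, and the `interrupted` list are hypothetical; the sketch models only the point that the cores keep running and that the sync interrupt is sent once every core has written its request.

```python
class ControlUnit:
    """Sketch of non-sleep synchronization across a fixed set of cores."""

    def __init__(self, num_cores):
        self.pending = [False] * num_cores  # which cores have written the SYNC
        self.interrupted = []               # cores sent a sync interrupt

    def write_nonsleep_sync(self, core):
        """Record a non-sleep SYNC; return True if it completed the set."""
        self.pending[core] = True           # the core is left running
        if all(self.pending):
            # All cores have synchronized: interrupt every core at once.
            self.interrupted = list(range(len(self.pending)))
            self.pending = [False] * len(self.pending)
            return True
        return False


unit = ControlUnit(3)
first = unit.write_nonsleep_sync(0)    # core 0 writes first and keeps running
second = unit.write_nonsleep_sync(1)   # core 1 writes later
third = unit.write_nonsleep_sync(2)    # core 2 completes the condition
```

Only the third write triggers the interrupt in this sketch, and it is delivered to all three cores together.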

Designation of the boot processor

In one embodiment, as described above, one core 102 normally (e.g., when the "all core BSP" function of FIG. 23 is disabled) designates itself as the boot processor (BSP) and performs designated tasks, such as booting the operating system. In one embodiment, the core 102 whose virtual core number is 0 is the BSP by default (e.g., when the "modify BSP" and "all core BSP" functions of FIGS. 22 and 23, respectively, are disabled).

However, the inventors have observed that it may be advantageous for the BSP to be designated in a different way, embodiments of which are described below. For example, many tests of a microprocessor 100 part, particularly during manufacturing test, are performed by booting an operating system and running program code to ensure that the microprocessor 100 part is functioning properly. Because the BSP core 102 performs system initialization and boots the operating system, the BSP core 102 may be exercised in ways in which the AP cores are not. Furthermore, it has been observed that even in a multi-threaded operating environment, the BSP generally bears a larger portion of the processing load than the APs; therefore, an AP core 102 may not be tested as thoroughly as the BSP core 102. Finally, there may be some actions that are performed only by the BSP core 102 on behalf of the microprocessor 100 as a whole, such as the encapsulating sleep state handshake protocol described with respect to FIG. 9.

Thus, embodiments are described in which any core 102 may be designated as the BSP. In one embodiment, during testing of the microprocessor 100, the tests are run N times, where N is the number of cores 102 of the microprocessor 100, and for each run of the tests the microprocessor 100 is reconfigured so that a different core 102 is the BSP. This may advantageously provide better test coverage during manufacturing and may also advantageously reveal errors in the microprocessor 100 during its design. Another advantage is that each core 102 may have a different APIC ID in different runs and thus respond to different interrupt requests, which may provide more extensive test coverage.

Referring now to FIG. 22, a flowchart illustrating a process for configuring microprocessor 100 is shown. The description of FIG. 22 refers to the multi-die microprocessor 100 of FIG. 4, which includes two crystals 406 and eight cores 102. However, it should be understood that the dynamic reconfiguration described herein may be used with a microprocessor 100 having a different configuration, i.e., having more than two crystals or a single crystal, and having more or fewer than eight cores 102 but at least two cores 102. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description so that the cores 102 collectively reconfigure the microprocessor 100 as a whole. Flow begins at block 2202.

At block 2202, the microprocessor 100 is reset and performs an initial portion of its initialization, preferably in a manner similar to that described above with respect to FIG. 14. However, generation of configuration related values, such as block 1424 in FIG. 14, and in particular the APIC ID and BSP flags, is performed in the manner described in blocks 2203-2204. Flow proceeds to block 2203.

At block 2203, core 102 generates its virtual core number, preferably as described with respect to FIG. 14. Flow proceeds to decision block 2204.

At decision block 2204, core 102 samples an indication to determine whether a function is enabled. This function is referred to herein as the "modify BSP" function. In one embodiment, blowing a fuse 114 enables the modify BSP function. More preferably, rather than blowing the modify BSP function fuse 114, a true value is scanned during testing into the holding register bit associated with the modify BSP function fuse 114, as described above with respect to FIG. 1, to enable the modify BSP function. In this manner, the modify BSP function is not permanently enabled in the microprocessor 100 part, but is instead disabled after the next power-up. Preferably, the operations at blocks 2203-2214 are performed by microcode of core 102. If the modify BSP function is enabled, flow proceeds to block 2205. Otherwise, flow proceeds to block 2206.

At block 2205, core 102 modifies the virtual core number generated at block 2203. In one embodiment, core 102 modifies the virtual core number to be the result of a rotate function of the virtual core number generated at block 2203 and a rotate amount, as follows:

virtual core number = rotate(rotate amount, virtual core number).

The rotate function, in one embodiment, rotates the virtual core numbers among the cores 102 by the rotate amount. The rotate amount is a value that is blown into fuses 114 or, more preferably, is scanned into the holding register during testing. Table 1 shows the virtual core number of each core 102 for an exemplary configuration in which the number of crystals 406 is two, the number of cores 102 per crystal 406 is four, and all cores 102 are enabled; the ordered pair (crystal number 258, local core number 256) of each core 102 is shown in the left column, and each rotate amount is shown in the top row. In this manner, a tester is able to cause a core 102 to generate any valid value for its virtual core number and, consequently, its APIC ID. Although one embodiment for modifying the virtual core number is described, other embodiments are contemplated. For example, the rotate direction may be opposite that shown in Table 1. Flow proceeds to block 2206.
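One possible reading of the rotate operation described above is a modular addition over the number of enabled cores, sketched below in Python for illustration. The function names, the extra `num_cores` parameter, the rotate direction, and the default numbering (crystal number times four plus local core number) are all assumptions of the sketch; they are not taken from Table 1.

```python
CORES_PER_CRYSTAL = 4
NUM_CORES = 8


def default_virtual_core_number(crystal_number, local_core_number):
    """Assumed default numbering when every core is enabled."""
    return crystal_number * CORES_PER_CRYSTAL + local_core_number


def rotate(rotate_amount, virtual_core_number, num_cores=NUM_CORES):
    """Rotate a virtual core number by `rotate_amount`, wrapping at num_cores."""
    return (virtual_core_number + rotate_amount) % num_cores


# With a rotate amount of 1, the core at (crystal 1, local core 3) wraps to 0.
rotated = rotate(1, default_virtual_core_number(1, 3))

# For any rotate amount, the rotated numbers remain a permutation of 0..7.
all_rotated = sorted(rotate(3, n) for n in range(NUM_CORES))
```

Because the result is always a permutation, exactly one core ends up with virtual core number zero for each rotate amount, so stepping the rotate amount through its range selects each core in turn.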

TABLE 1

At block 2206, core 102 populates its local APIC ID register with the virtual core number generated at block 2203 or the modified value generated at block 2205. In one embodiment, the APIC ID register may be read by the core 102 itself (e.g., by BIOS and/or the operating system) at memory address 0x0FEE00020. However, in another embodiment, the APIC ID register may be read by core 102 at MSR address 0x802. Flow proceeds to decision block 2208.

At decision block 2208, core 102 determines whether the APIC ID it populated at block 2206 is zero. If so, flow proceeds to block 2212; otherwise, flow proceeds to block 2214.

At block 2212, core 102 sets its BSP flag to true to indicate that core 102 is the BSP. In one embodiment, the BSP flag is a bit of the x86 APIC base register (IA32_APIC_BASE MSR) of the core 102. Flow proceeds to decision block 2216.

At block 2214, core 102 sets its BSP flag to false to indicate that core 102 is not the BSP, i.e., that it is an AP. Flow proceeds to decision block 2216.

At decision block 2216, core 102 determines whether it is a BSP, e.g., whether it is designated itself as BSP core 102 at block 2212, rather than as AP core 102 at block 2214. If so, flow proceeds to block 2218; otherwise, flow proceeds to block 2222.

At block 2218, core 102 begins fetching and executing system initialization firmware (e.g., BSP BIOS bootstrap code). This may include instructions related to the BSP flag and the APIC ID, such as an instruction that reads the APIC ID register or the APIC base register, in which case core 102 returns the values written at blocks 2206 and 2212/2214. It may also include acting as the sole core 102 of the microprocessor 100 that performs operations on behalf of the microprocessor 100 as a whole, such as the encapsulating sleep state handshake protocol described with respect to FIG. 9. Preferably, the BSP core 102 begins fetching and executing the system initialization firmware at an architecturally defined reset vector. For example, in the x86 architecture, the reset vector points to address 0xFFFFFFF0. Preferably, executing the system initialization firmware includes booting the operating system, e.g., loading the operating system and transferring control to it. Flow proceeds to block 2224.

At block 2222, the core 102 halts itself and waits for a startup sequence from the BSP before it begins fetching and executing instructions. In one embodiment, the startup sequence received from the BSP includes an interrupt vector to the AP system initialization firmware (e.g., AP BIOS program code). This may include instructions related to the BSP flag and the APIC ID, in which case core 102 returns the values written at blocks 2206 and 2212/2214. Flow proceeds to block 2224.

At block 2224, when the core 102 executes the instruction, the core 102 receives and responds to the interrupt request based on the APIC ID written to its APIC ID register at block 2206. Flow ends at block 2224.
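The per-core decisions of blocks 2203 through 2214 may be condensed, for illustration, into the following Python sketch. The function name `configure_core`, its parameters, and the modular-addition form of the rotate step are assumptions of the sketch rather than a description of the microcode.

```python
def configure_core(default_virtual_core_number, num_cores,
                   modify_bsp_enabled=False, rotate_amount=0):
    """Return (apic_id, is_bsp) for one core, following FIG. 22."""
    virtual_core_number = default_virtual_core_number       # block 2203
    if modify_bsp_enabled:                                   # decision block 2204
        # Block 2205: rotate the number (direction is an assumption here).
        virtual_core_number = (virtual_core_number + rotate_amount) % num_cores
    apic_id = virtual_core_number                            # block 2206
    is_bsp = apic_id == 0                                    # blocks 2208-2214
    return apic_id, is_bsp


# Default behavior: the core numbered 0 is the BSP.
default_config = [configure_core(n, 8) for n in range(8)]

# With the modify BSP function enabled and a rotate amount of 3,
# the core whose default number is 5 becomes the BSP instead.
rotated_config = [configure_core(n, 8, True, 3) for n in range(8)]
bsp_cores = [n for n, (_, is_bsp) in enumerate(rotated_config) if is_bsp]
```

Since the BSP flag is derived solely from the APIC ID being zero, changing which core receives APIC ID zero is sufficient to change which core boots the system.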

As described above, according to one embodiment, the core 102 whose virtual core number is zero is the BSP by default. However, the inventors have observed that there may be circumstances in which it is advantageous for all cores 102 to be designated as BSPs, and embodiments are described below. For example, microprocessor 100 developers may have invested significant time and cost in developing a large body of tests designed to run single-threaded on a single core, and the developers may want to use these single-core tests to test the multi-core microprocessor 100. For example, the tests may run under the old and well-known DOS operating system in x86 real mode.

Running these tests on each core 102 could be done sequentially, using the modify BSP function described with respect to FIG. 22 and/or modifying the fuse values, by blowing fuses or scanning into the holding register, to disable all cores 102 except the one core 102 under test. However, the inventors have appreciated that this would require more time than running the tests on all cores 102 simultaneously (e.g., about 4 times as long in the case of a 4-core microprocessor 100). Furthermore, the time required to test each individual microprocessor 100 part is valuable, especially when hundreds of thousands or more microprocessor 100 parts are manufactured, and especially when many of the tests are run on very expensive test equipment.

Additionally, a speed path in the microprocessor 100 logic may be stressed more when more than one core 102 (or all cores 102) are running at the same time, because the microprocessor 100 then generates more heat and/or draws more power. Tests run in the sequential manner may not generate this additional stress and therefore may not reveal the speed path.

Thus, embodiments are described in which all of the cores 102 may be dynamically designated as BSP cores 102 so that all of the cores 102 may run a test simultaneously.

Referring now to FIG. 23, a flowchart illustrating a process for configuring microprocessor 100 according to another embodiment is shown. The description of FIG. 23 refers to the multi-die microprocessor 100 of FIG. 4, which includes two crystals 406 and eight cores 102. However, it should be understood that the dynamic reconfiguration described herein may be used with a microprocessor 100 having a different configuration, i.e., having more than two crystals or a single crystal, and having more or fewer than eight cores 102 but at least two cores 102. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description so that the cores 102 collectively reconfigure the microprocessor 100 as a whole. Flow begins at block 2302.

At block 2302, microprocessor 100 is reset and performs an initial portion of its initialization, preferably in a manner similar to that described above with respect to FIG. 14. However, generation of configuration related values, such as block 1424 in FIG. 14, and in particular the APIC ID and BSP flags, is performed in the manner described in blocks 2304-2312. Flow proceeds to decision block 2304.

At decision block 2304, core 102 determines whether a function is enabled. This function is referred to herein as the "all core BSP" function. Preferably, blowing a fuse 114 enables the all core BSP function. More preferably, rather than blowing the all core BSP function fuse 114, a true value is scanned during testing into the holding register bit associated with the all core BSP function fuse 114, as described above with respect to FIG. 1, to enable the all core BSP function. In this manner, the all core BSP function is not permanently enabled in the microprocessor 100 part, but is instead disabled after the next power-up. Preferably, the operations at blocks 2304-2312 are performed by microcode of core 102. If the all core BSP function is enabled, flow proceeds to block 2305. Otherwise, flow proceeds to block 2203 of FIG. 22.

At block 2305, core 102 sets its virtual core number to zero regardless of the local core number 256 and the crystal number 258 of core 102. Flow proceeds to block 2306.

At block 2306, core 102 populates its local APIC ID register with the virtual core number that was set to zero at block 2305. Flow proceeds to block 2312.

At block 2312, core 102 sets its BSP flag to true to indicate that core 102 is the BSP, regardless of the local core number 256 and the crystal number 258 of core 102. Flow proceeds to block 2315.

At block 2315, each time a core 102 performs a memory access request, the microprocessor 100 modifies the upper address bits of the memory access request address so that each core 102 accesses its own separate memory space. That is, the microprocessor 100 modifies the upper address bits such that they have a unique value per core 102, depending on which core 102 generated the memory access request. In one embodiment, microprocessor 100 modifies the upper address bits as indicated by values blown into fuses 114. In another embodiment, microprocessor 100 modifies the upper address bits based on the local core number 256 and the crystal number 258 of the core 102. For example, in an embodiment in which the number of cores 102 in the microprocessor 100 is four, the microprocessor 100 modifies the upper two bits of the memory address and generates a unique value in the upper two bits for each core 102. In effect, the memory space addressable by microprocessor 100 is divided into N subspaces, where N is the number of cores 102. The test programs are developed such that they restrict themselves to addresses in the lowest of the N subspaces. For example, assume that microprocessor 100 is capable of addressing 64 GB of memory and that microprocessor 100 includes four cores 102. The tests are developed to access only the lowest 8 GB of memory. When core 0 executes an instruction that accesses memory address A (in the lowest 8 GB of memory), microprocessor 100 generates address A (unmodified) on the memory bus; when core 1 executes an instruction that accesses the same memory address A, the microprocessor 100 generates address A + 8 GB on the memory bus; when core 2 executes an instruction that accesses the same memory address A, the microprocessor 100 generates address A + 16 GB on the memory bus; and when core 3 executes an instruction that accesses the same memory address A, the microprocessor 100 generates address A + 24 GB on the memory bus. In this manner, advantageously, the cores 102 do not conflict in their memory accesses, which enables the tests to execute correctly. Preferably, the single-threaded tests are run on a stand-alone test machine that is capable of testing the microprocessor 100 individually. The microprocessor 100 developer develops test data that the tester provides to the microprocessor 100; conversely, the microprocessor 100 developer develops result data with which the tester compares the data written by the microprocessor 100 during memory write accesses, to ensure that the microprocessor 100 writes the correct data. In one embodiment, the shared cache 119 (e.g., the highest level cache, which generates the addresses used in external bus transactions) is the portion of the microprocessor 100 that is configured to modify the upper address bits when the all core BSP function is enabled. Flow proceeds to block 2318.
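The per-core address modification may be illustrated with the following Python sketch, which uses the 8 GB spacing of the example above. The function name `bus_address`, the `subspace_size` parameter, and the use of addition (equivalent to setting upper bits when the test address lies within the lowest subspace and the spacing is a power of two) are assumptions of the sketch.

```python
GB = 1 << 30


def bus_address(core, address, subspace_size=8 * GB):
    """Map a test address in the lowest subspace to the core's own subspace."""
    if not 0 <= address < subspace_size:
        raise ValueError("test address must lie in the lowest subspace")
    # Each core's accesses are offset by a whole number of subspaces.
    return core * subspace_size + address


test_address = 0x1000
core0_address = bus_address(0, test_address)  # unmodified
core1_address = bus_address(1, test_address)  # offset by 8 GB
core2_address = bus_address(2, test_address)  # offset by 16 GB
```

The same test program, issuing the same addresses on every core, therefore touches disjoint regions of memory on the bus.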

At block 2318, the core 102 begins fetching and executing system initialization firmware (e.g., BIOS bootstrap code). This may include instructions associated with the BSP flag and APIC ID, such as an instruction that reads the APIC ID register or the APIC base register, in which case the core 102 returns the zero value written at block 2306. Preferably, the BSP core 102 begins fetching and executing the system initialization firmware at an architecturally-defined reset vector. For example, in the x86 architecture, the reset vector points to address 0xFFFFFFF0. Preferably, executing the system initialization firmware includes booting an operating system, e.g., loading the operating system and transferring control to it. Flow proceeds to block 2324.

At block 2324, as the core 102 executes instructions, it receives and responds to interrupt requests based on the APIC ID value of zero written to its APIC ID register at block 2306. Flow ends at block 2324.

Although FIG. 23 describes an embodiment in which all of the cores 102 are designated as the BSP, other embodiments are contemplated in which more than one, but fewer than all, of the cores 102 are designated as the BSP.

Although embodiments are described in the context of an x86-type system in which each core 102 uses a local APIC and in which there is an association between the local APIC ID and the BSP designation, it should be understood that the designation of boot processors is not limited to x86 embodiments, but may be used in systems having different system architectures.

Propagation of a Microcode Patch to Multiple Cores

As observed previously, many important functions may be performed primarily by the microcode of a microprocessor, and in particular, these functions require proper communication and coordination among the microcode instances executing on the multiple cores of the microprocessor. Because of the complexity of the microcode, there is a significant probability that the microcode will contain errors that need to be corrected. This may be accomplished via a microcode patch, in which new microcode instructions replace the old microcode instructions that caused the error. To that end, the microprocessor includes specific hardware that facilitates microcode patching. In general, it is desirable to apply the microcode patch to all cores of the microprocessor. Traditionally, patching has been performed by executing an architectural instruction separately on each core. However, the conventional method may have problems.

First, the patch may relate to inter-core communication among the microcode instances (e.g., core synchronization, hardware semaphore usage) or to functions that require inter-core communication by the microcode (e.g., cross-core debug requests, cache control operations, power management, or dynamic multi-core microprocessor configuration). Executing an architectural patch-application instruction separately on each core may create a window of time in which the microcode patch has been applied to some cores but not to others (or in which a previous patch is applied to some cores and a new patch to others). This may cause communication failures between cores and improper operation of the microprocessor. Other problems, both foreseeable and unforeseeable, may also arise if the cores of the microprocessor do not all have the same microcode patch applied.

Second, the architecture of the microprocessor specifies a number of features that may be supported by some instances of the microprocessor and not by others. During operation, the microprocessor is able to communicate to system software which particular features it supports. For example, in the case of an x86 architecture microprocessor, the x86 CPUID instruction may be executed by system software to determine the supported feature set. However, the instruction that reports the feature set (e.g., CPUID) is executed separately on each core of the microprocessor. In some cases, a feature may be disabled because of an error that existed at the time the microprocessor was released. A microcode patch may subsequently be developed to fix the error so that the feature may be enabled after the patch is applied. However, if the patch is applied in the conventional manner (i.e., applied separately on each core by a separate execution of the patch instruction on each core), different cores may report different feature sets at a given point in time, depending on whether the patch has yet been applied on that core. This can be problematic, especially when system software (such as an operating system that, for example, migrates threads between cores) expects all cores of the microprocessor to have the same feature set. In particular, it has been observed that some system software obtains the feature set of only one core and assumes that the other cores have the same feature set.

Further, the microcode instance of each core controls and/or communicates with uncore resources shared by the cores (e.g., synchronization-related hardware, hardware semaphores, the shared PRAM, the shared cache, or the service processing unit). Thus, in general, it can be problematic for the microcode of two different cores to control or communicate with an uncore resource in two different ways at the same time because one of the cores has a microcode patch applied and the other does not (or because the two cores have different microcode patches applied).

Finally, the microcode patch hardware in the microprocessor may be such that applying a patch in the conventional manner on one core could interfere with the application and operation of the patch on other cores, for example, if part of the patch hardware is shared among the cores.

Preferably, embodiments that apply a microcode patch to a multi-core microprocessor in a manner that is atomic at the architectural instruction level address the problems described above. First, the patch is applied to the entire microprocessor 100 in response to execution of an architectural instruction on a single core 102. That is, embodiments do not require system software to execute an apply microcode patch instruction (described below) on each core 102. More specifically, the single core 102 that encounters the apply microcode patch instruction sends messages to, and interrupts, the other cores 102 to cause their microcode instances to apply their portions of the patch, and all the microcode instances cooperate with one another such that the microcode patch is applied to the microcode patch hardware of each core 102 and to the shared patch hardware of the microprocessor 100 while interrupts are disabled on all cores 102. Second, the microcode instances running on all cores 102 that implement the atomic patch application mechanism cooperate with one another to refrain from executing any architectural instruction (other than the one apply microcode patch instruction) once all cores 102 of the microprocessor 100 have agreed to apply the patch and until the patch has been applied to all cores 102. That is, while any core 102 is applying the microcode patch, no core 102 executes an architectural instruction. Furthermore, in a preferred embodiment, all cores 102 arrive, with interrupts disabled, at the same point in the microcode that performs the patch application, and thereafter the cores 102 execute only the microcode instructions that apply the patch until all cores 102 of the microprocessor 100 confirm that the patch has been applied. That is, while any core 102 of the microprocessor 100 is applying the patch, no core 102 executes microcode instructions other than those that apply the microcode patch.

Referring now to FIG. 24, a block diagram of a multi-core microprocessor 100 according to another embodiment is shown. The microprocessor 100 is similar in many respects to the microprocessor 100 of FIG. 1. However, the microprocessor 100 of FIG. 24 further includes, in its uncore 103, a service processing unit (SPU) 2423, an SPU start address register 2497, an uncore microcode read-only memory (ROM) 2425, and an uncore microcode patch random access memory (RAM) 2408. In addition, each core 102 includes a core PRAM 2499, a patch content-addressable memory (CAM) 2439, and a core microcode ROM 2404.

Microcode includes microcode instructions. The microcode instructions are non-architectural instructions stored in one or more memories of the microprocessor 100 (e.g., the uncore microcode ROM 2425, the uncore microcode patch RAM 2408, and/or the core microcode ROM 2404); they are fetched by a core 102 based on a fetch address held in a non-architectural micro-program counter (micro-PC) and are used by the core 102 to implement instructions of the instruction set architecture (ISA) of the microprocessor 100. Preferably, the microcode instructions are translated by a microtranslator into microinstructions that are executed by the execution units of the core 102; in an alternate embodiment, the microcode instructions are executed directly by the execution units, in which case the microcode instructions are microinstructions. That the microcode instructions are non-architectural means that they are not instructions of the ISA of the microprocessor 100, but are instead encoded according to an instruction set distinct from the architectural instruction set. The non-architectural micro-PC is not defined by the ISA of the microprocessor 100 and is distinct from the architecturally-defined program counter of the core 102. The microcode is used to implement some or all of the instructions of the ISA of the microprocessor 100 as follows. In response to decoding a microcode-implemented ISA instruction, the core 102 transfers control to a microcode routine associated with the ISA instruction. The microcode routine includes microcode instructions. The execution units execute the microcode instructions or, according to the preferred embodiment, the microcode instructions are further translated into microinstructions that are executed by the execution units. 
The results of the execution of the microcode instructions (or of the microinstructions into which they are translated) by the execution units are the results defined by the ISA instruction. Thus, the collective execution by the execution units of the microcode routine associated with the ISA instruction (or of the microinstructions translated from the instructions of the microcode routine) "implements" the ISA instruction. That is, the collective execution by the execution units of the microcode instructions (or of the microinstructions translated from them) performs the operation specified by the ISA instruction on the inputs it specifies, producing the result defined by the ISA instruction. In addition, microcode instructions may be executed (or translated into microinstructions that are executed) when the microprocessor is reset, in order to configure the microprocessor.

The core microcode ROM 2404 holds microcode executed by the particular core 102 that includes it. The uncore microcode ROM 2425 also holds microcode executed by the cores 102. However, in contrast to the core microcode ROM 2404, the uncore ROM 2425 is shared by the cores 102. Preferably, the uncore ROM 2425 holds microcode routines that require less performance and/or are executed less frequently, because the access time of the uncore ROM 2425 is greater than that of the core ROM 2404. In addition, the uncore ROM 2425 holds program code that is fetched and executed by the SPU 2423.

The uncore microcode patch RAM 2408 is also shared by the cores 102. The uncore microcode patch RAM 2408 holds microcode instructions that are executed by the cores 102. The patch CAM 2439 holds patch addresses; in response to a microcode fetch address, the patch CAM 2439 outputs a patch address to a microsequencer when the fetch address matches the contents of one of the entries in the patch CAM 2439. In this case, the microsequencer outputs the patch address as the microcode fetch address, rather than the next sequential fetch address (or the target address in the case of a branch-type instruction), in response to which the uncore patch RAM 2408 outputs a patch microcode instruction. Thus, a patch microcode instruction is fetched from the uncore patch RAM 2408 and executed instead of a microcode instruction fetched from the uncore ROM 2425 or the core ROM 2404 that is undesirable, for example, because that microcode instruction and/or the microcode instructions following it are the source of an error. The patch microcode instruction therefore effectively replaces, or patches, the undesirable microcode instruction residing in the core ROM 2404 or the uncore microcode ROM 2425 at the original microcode fetch address. Preferably, the patch CAM 2439 and the patch RAM 2408 are loaded in response to architectural instructions contained in system software, such as a BIOS or operating system running on the microprocessor 100.
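
The fetch redirection performed by the patch CAM can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `PatchCam`, the function `fetch_microcode`, and the use of dictionaries to stand in for the ROM and the patch RAM are hypothetical, and the addresses and instruction names are invented for illustration.

```python
from typing import Dict, Optional


class PatchCam:
    """Toy model of a patch CAM: maps a matched fetch address to a patch address."""

    def __init__(self, entries: Dict[int, int]):
        # Each entry pairs a microcode fetch address with a patch RAM address.
        self.entries = dict(entries)

    def lookup(self, fetch_address: int) -> Optional[int]:
        # Return the patch address on a match, or None when there is no hit.
        return self.entries.get(fetch_address)


def fetch_microcode(fetch_address: int, cam: PatchCam,
                    rom: Dict[int, str], patch_ram: Dict[int, str]) -> str:
    """Return the instruction actually executed for a given fetch address."""
    patch_address = cam.lookup(fetch_address)
    if patch_address is not None:
        # CAM hit: the patch address replaces the fetch address,
        # so the instruction comes from the patch RAM.
        return patch_ram[patch_address]
    # No hit: the instruction comes from the (core or uncore) ROM as usual.
    return rom[fetch_address]


# Invented contents: the ROM instruction at 0x101 is the one being patched.
rom = {0x100: "rom_instr_a", 0x101: "rom_instr_buggy", 0x102: "rom_instr_c"}
patch_ram = {0x10: "patch_instr"}
cam = PatchCam({0x101: 0x10})
executed = [fetch_microcode(a, cam, rom, patch_ram) for a in (0x100, 0x101, 0x102)]
```

In this model the ROM contents are never modified; only the fetch at the matched address is redirected, which mirrors the way a patch replaces a ROM instruction without changing the ROM.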

Among other things, the uncore PRAM 116 is used by the microcode to store values used by the microcode. Some of these values effectively function as constants: they are not immediate values stored in the core microcode ROM 2404 or the uncore microcode ROM 2425, nor values blown into the fuses 114 when the microprocessor 100 is manufactured, but are instead written to the uncore PRAM 116 by the microcode when the microprocessor 100 is reset and are not modified during operation of the microprocessor 100, except perhaps via a patch or in response to execution of an instruction that explicitly modifies the value, such as a WRMSR instruction. Advantageously, these values may be modified via the patch mechanism described herein without changing the core microcode ROM 2404 or the uncore microcode ROM 2425, which may be very expensive, and without requiring one or more unblown fuses 114.

In addition, the uncore PRAM 116 is used to store patch code that is fetched and executed by the SPU 2423, as described herein.

The core PRAM 2499, like the uncore PRAM 116, is private, or non-architectural, meaning that the core PRAM 2499 is not located in the architectural user program address space of the microprocessor 100. However, unlike the uncore PRAM 116, each core PRAM 2499 is accessed only by its respective core 102 and is not shared with the other cores 102. Like the uncore PRAM 116, the core PRAM 2499 is also used by the microcode to store values used by the microcode. Advantageously, these values may be modified via the patch mechanism described herein without changing the core microcode ROM 2404 or the uncore microcode ROM 2425.

The SPU 2423 includes a stored-program processor that is an adjunct to, and distinct from, each of the cores 102. Whereas the cores 102 are architecturally capable of executing instructions of the ISA of the cores 102 (e.g., x86 ISA instructions), the SPU 2423 is not architecturally capable of doing so. Thus, for example, the operating system cannot run on the SPU 2423, nor can the operating system schedule programs of the ISA of the cores 102 (e.g., x86 ISA programs) to run on the SPU 2423. In other words, the SPU 2423 is not a system resource managed by the operating system. Rather, the SPU 2423 performs operations used to debug the microprocessor 100. In addition, the SPU 2423 may assist in measuring the performance of the cores 102, among other functions. Preferably, the SPU 2423 is smaller, less complex, and less power-consuming than the cores 102 (e.g., in one embodiment, the SPU 2423 includes built-in clock gating). In one embodiment, the SPU 2423 comprises a FORTH CPU core.

Asynchronous events may occur, relative to the instructions executed by the cores 102, that debug microcode running on a core 102 cannot handle well. Advantageously, however, the SPU 2423 may be commanded by a core 102 to detect such an event and, in response to detecting it, to perform actions such as creating a log or modifying aspects of the behavior of the core 102 and/or of the external bus interface of the microprocessor 100. The SPU 2423 may provide the log information to the user, and it may also interact with the tracker to request that the tracker provide the log information or perform other actions. In one embodiment, the SPU 2423 has access to the control registers of the memory subsystem and of the programmable interrupt controller of each core 102, as well as to the control registers of the shared cache 119.

Examples of events that the SPU 2423 can detect include the following: (1) a core 102 is hung, i.e., the core 102 has not retired any instruction for a programmable number of clock cycles; (2) a core 102 loads data from a non-cacheable region of memory; (3) a temperature change occurs in the microprocessor 100; (4) the operating system requests a change in the bus clock ratio of the microprocessor 100 and/or a change in the voltage level of the microprocessor 100; (5) the microprocessor 100, of its own accord, changes its voltage level and/or bus clock ratio, e.g., to save power or improve performance; (6) an internal timer of a core 102 expires; (7) a cache snoop hits a modified cache line, causing the cache line to be written back to memory; (8) the temperature, voltage, or bus clock ratio of the microprocessor 100 goes outside a respective range; (9) an external trigger signal is asserted by a user on an external pin of the microprocessor 100.

Advantageously, because the SPU 2423 runs its program code 132 independently of the cores 102, it does not have the same limitations as tracker microcode executing on a core 102. Thus, the SPU 2423 can detect, or be notified of, events independently of the instruction execution boundaries of the cores 102 and without disrupting the state of the cores 102.

The SPU 2423 has its own program code that it executes. The SPU 2423 may fetch its program code from the uncore microcode ROM 2425 or from the uncore PRAM 116. That is, the SPU 2423 preferably shares the uncore ROM 2425 and the uncore PRAM 116 with the microcode running on the cores 102. The SPU 2423 uses the uncore PRAM 116 to store its data, including the log. In one embodiment, the SPU 2423 also includes its own serial port interface, through which it can transmit the log to an external device. Advantageously, the SPU 2423 may also instruct the tracker running on a core 102 to store the log information from the uncore PRAM 116 into system memory.

The SPU 2423 communicates with the cores 102 through status and control registers. The SPU status register includes a bit corresponding to each of the events described above that the SPU 2423 can detect. To notify the SPU 2423 of an event, a core 102 sets the bit in the SPU status register that corresponds to the event. Some event bits are set by the hardware of the microprocessor 100 and some are set by the microcode of the cores 102. The SPU 2423 reads the status register to determine the list of events that have occurred. The control register includes bits corresponding to each action that the SPU 2423 is to take in response to detecting one of the events specified in the status register. That is, for each possible event in the status register, a set of action bits is present in the control register. In one embodiment, there are 16 action bits per event. In one embodiment, a write to the status register that indicates an event causes an interrupt to the SPU 2423, in response to which the SPU 2423 reads the status register to determine which events have occurred. Advantageously, this saves power by reducing the need for the SPU 2423 to poll the status register. The status and control registers may also be read and written by user programs that execute instructions such as the RDMSR and WRMSR instructions.
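
The relationship between event bits in the status register and action bits in the control register can be sketched as follows. The bit positions, the packing of the action fields, and the names `set_event`, `pending_events`, and `actions_for` are assumptions made for illustration; the text only states that each event has a status bit and, in one embodiment, 16 action bits.

```python
ACTION_BITS_PER_EVENT = 16  # per the embodiment described above
ACTION_MASK = (1 << ACTION_BITS_PER_EVENT) - 1


def set_event(status: int, event: int) -> int:
    """Return the status register value with the bit for `event` set."""
    return status | (1 << event)


def pending_events(status: int) -> list:
    """List the event numbers whose status bits are set, lowest first."""
    return [bit for bit in range(status.bit_length()) if (status >> bit) & 1]


def actions_for(control: int, event: int) -> int:
    """Extract the 16-bit action field for `event` from the control register.

    Assumed layout: the field for event n occupies bits [16n, 16n + 15].
    """
    return (control >> (event * ACTION_BITS_PER_EVENT)) & ACTION_MASK


# Hypothetical usage: events 1 and 6 are signalled; only event 6 has actions.
status = set_event(set_event(0, 1), 6)
control = 0b101 << (6 * ACTION_BITS_PER_EVENT)
todo = {event: actions_for(control, event) for event in pending_events(status)}
```

An interrupt-driven SPU would run the equivalent of `pending_events` once per interrupt, instead of polling the status register in a loop.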

The set of actions that the SPU 2423 may perform in response to detecting an event includes the following. (1) Write log information to the uncore PRAM 116. For each log-write action, multiple action bits exist to allow the programmer to specify that only particular subsets of the log information should be written. (2) Write the log information from the uncore PRAM 116 to the serial port interface. (3) Write to one of the control registers to set an event for the tracker. That is, the SPU 2423 may interrupt a core 102 and cause the tracker microcode to perform a set of actions associated with the event. The actions may have been specified previously by the user. In one embodiment, when the SPU 2423 writes to the control register to set the event, this causes the core 102 to take a machine check exception, and the machine check exception handler checks whether the tracker is enabled. If so, the machine check exception handler transfers control to the tracker. The tracker reads the control register and, if the event set in the control register is one that the user has enabled for the tracker, performs the actions previously specified by the user for that event. For example, the SPU 2423 may set an event that causes the tracker to write the log information stored in the uncore PRAM 116 to system memory. (4) Write to a control register to cause the microcode to branch to a microcode address specified by the SPU 2423. This is particularly helpful if the microcode is in an infinite loop such that the tracker cannot perform any meaningful action, yet the core 102 is still executing and retiring instructions, which means that the hung-core event will not occur. (5) Write to a control register to reset a core 102. As mentioned above, the SPU 2423 can detect a hung core 102 (e.g., one that has not retired any instruction for a programmable amount of time) and reset the core. 
The reset microcode checks whether the reset was initiated by the SPU 2423 and, if so, advantageously writes the log information out to system memory before clearing it during initialization of the core 102. (6) Log events continuously. In this mode, rather than waiting to be interrupted by an event, the SPU 2423 spins in a loop that checks the status register and continuously writes the information associated with the indicated events to the uncore PRAM 116, optionally also writing the log information to the serial port interface. (7) Write to a control register to stop a core 102 from issuing requests to the shared cache 119 and/or to stop the shared cache 119 from acknowledging requests to the core 102. This is particularly useful for debugging design errors related to the memory subsystem, such as page tablewalk hardware errors, and even for working around such errors during operation of the microprocessor 100, for example by modifying the SPU 2423 code via a patch, as described below. (8) Write to a control register of an external bus interface controller of the microprocessor 100 to perform transactions on the external system bus, such as special cycles or memory read/write cycles. (9) Write to a control register of the programmable interrupt controller of a core 102, for example to generate an interrupt to another core 102, to emulate an I/O device to the core 102, or to fix a fault in the interrupt controller. (10) Write to a control register of the shared cache 119 to control its sizing, e.g., to disable or enable different ways of the associative shared cache 119. (11) Write to control registers of the various functional units of a core 102 to configure different performance features, such as the branch prediction and data prefetch algorithms. 
As described below, the SPU 2423 code may advantageously be patched, so that even after the design of the microprocessor 100 has been completed and the microprocessor 100 has been fabricated, the SPU 2423 can be made to perform actions that work around design defects or to perform the other functions described herein.

The SPU start address register 2497 holds the address at which the SPU 2423 begins fetching instructions when it is released from reset. The SPU start address register 2497 is written by a core 102. The address may be located in the uncore PRAM 116 or in the uncore microcode ROM 2425.

Referring now to FIG. 25, a block diagram is shown illustrating the structure of a microcode patch 2500 according to an embodiment of the present invention. In the embodiment of FIG. 25, the microcode patch 2500 includes the following portions: a header 2502; an immediate patch 2504; a checksum 2506 of the immediate patch 2504; CAM data 2508; a core PRAM patch 2512; a checksum 2514 of the CAM data 2508 and the core PRAM patch 2512; a RAM patch 2516; an uncore PRAM patch 2518; and a checksum 2522 of the RAM patch 2516 and the uncore PRAM patch 2518. The checksums 2506/2514/2522 allow the microprocessor 100 to verify the integrity of the respective portions of the patch after they have been loaded into the microprocessor 100. Preferably, the microcode patch 2500 is read from system memory and/or from a non-volatile memory of the system, such as a ROM or FLASH memory that holds a system BIOS or extensible firmware. The header 2502 describes each portion of the patch 2500, such as its size, the location in its respective patch-related memory at which the portion is to be loaded, and a valid flag indicating whether the portion contains a valid patch that is to be applied to the microprocessor 100.
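
The per-portion integrity check can be illustrated with a short sketch. The text does not specify the checksum algorithm, so the 32-bit additive byte sum used here is purely an assumption, as are the function names `checksum32` and `verify_portions` and the sample byte values.

```python
def checksum32(*portions: bytes) -> int:
    """Assumed checksum: sum of all bytes of the given portions, modulo 2**32."""
    total = 0
    for portion in portions:
        total = (total + sum(portion)) & 0xFFFFFFFF
    return total


def verify_portions(expected: int, *portions: bytes) -> bool:
    """Return True when the loaded portions match the checksum stored in the patch."""
    return checksum32(*portions) == expected


# Invented sample data standing in for portions of a patch.
immediate_patch = bytes([0x01, 0x02, 0x03])   # plays the role of portion 2504
cam_data = bytes([0x10, 0x20])                # plays the role of portion 2508
core_pram_patch = bytes([0x01])               # plays the role of portion 2512

stored_2506 = checksum32(immediate_patch)
stored_2514 = checksum32(cam_data, core_pram_patch)

immediate_ok = verify_portions(stored_2506, immediate_patch)
group_ok = verify_portions(stored_2514, cam_data, core_pram_patch)
corrupted_ok = verify_portions(stored_2514, cam_data, bytes([0x02]))
```

Grouping several portions under one checksum, as with the CAM data and the core PRAM patch, lets a core verify everything it loads in one step before the patch takes effect.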

The immediate patch 2504 includes program code (i.e., instructions, preferably microcode instructions) that is loaded into the uncore microcode patch RAM 2408 of FIG. 24 (e.g., at block 2612 of FIGS. 26A-26B) and then executed by each core 102 (e.g., at block 2616 of FIGS. 26A-26B). The patch 2500 also specifies the address at which the immediate patch 2504 is loaded into the patch RAM 2408. Preferably, the code of the immediate patch 2504 modifies default values written by the reset microcode, such as values written to configuration registers that affect the configuration of the microprocessor 100. After the immediate patch 2504 has been executed by each core 102 out of the patch RAM 2408, it is not executed again. Moreover, the subsequent loading of the RAM patch 2516 into the patch RAM 2408 (e.g., at block 2632 of FIGS. 26A-26B) may overwrite the immediate patch 2504 in the patch RAM 2408.

The RAM patch 2516 includes patch microcode instructions that replace the microcode instructions in the core ROM 2404 or the uncore ROM 2425 that need to be patched. The RAM patch 2516 also includes the address of the location in the patch RAM 2408 to which each patch microcode instruction is written when the patch 2500 is applied (e.g., at block 2632 of FIGS. 26A-26B). The CAM data 2508 is loaded into the patch CAM 2439 of each core 102 (e.g., at block 2626 of FIGS. 26A-26B). Consistent with the operation of the patch CAM 2439 described above, the CAM data 2508 includes one or more entries, each of which includes a pair of microcode fetch addresses. The first address is that of the microcode instruction to be patched, and is the content matched against the fetch address. The second address points to the location in the patch RAM 2408 that holds the patch microcode instruction to be executed in place of the patched microcode instruction.

Unlike the immediate patch 2504, the RAM patch 2516 remains in the patch RAM 2408 and continues to operate (together with the patch CAM 2439 operating according to the CAM data 2508) to patch the core microcode ROM 2404 and/or the uncore microcode ROM 2425 until it is replaced by another patch 2500 or the microprocessor 100 is reset.

The core PRAM patch 2512 includes the data to be written to the core PRAM 2499 of each core 102 and the address in the core PRAM 2499 at which each item of the data is to be written (e.g., at block 2626 of FIGS. 26A-26B). The uncore PRAM patch 2518 includes the data to be written to the uncore PRAM 116 and the address in the uncore PRAM 116 at which each item of the data is to be written (e.g., at block 2632 of FIGS. 26A-26B).

Referring now to FIGS. 26A-26B, a flowchart is shown illustrating operation of the microprocessor 100 of FIG. 24 to propagate a microcode patch 2500 of FIG. 25 to the plurality of cores 102 of the microprocessor 100. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to this description so that the cores collectively propagate the microcode patch to all cores 102 of the microprocessor 100. More specifically, FIGS. 26A-26B depict the operation of the core that encounters the apply microcode patch instruction, whose flow begins at block 2602, and the operation of the other cores 102, whose flow begins at block 2652. It should be understood that multiple patches 2500 may be applied to the microprocessor 100 at different times during operation of the microprocessor 100. For example, a first patch 2500 may be applied according to the embodiments described herein when the system that includes the microprocessor 100 is booted, such as during BIOS initialization, and a second patch 2500 may be applied after the operating system is running, which is particularly useful for debugging the microprocessor 100.

At block 2602, one of the cores 102 encounters an instruction that instructs it to apply a microcode patch to the microprocessor 100. Preferably, the microcode patch is similar to the microcode patch 2500 described above. In one embodiment, the apply microcode patch instruction is an x86 WRMSR instruction. In response to the apply microcode patch instruction, the core 102 disables interrupts and traps to the microcode that implements the apply microcode patch instruction. It should be understood that the system software that includes the apply microcode patch instruction may include a sequence of multiple instructions that prepare for application of the microcode patch. Preferably, however, the microcode patch is propagated to all of the cores atomically at the architectural instruction level in response to a single architectural instruction of the sequence. That is, once interrupts are disabled on the first core 102 (i.e., the core 102 that encounters the apply microcode patch instruction at block 2602), they remain disabled while the implementing microcode propagates the microcode patch and applies it to all of the cores 102 of the microprocessor 100 (e.g., until after block 2652); furthermore, once interrupts are disabled on the other cores 102 (e.g., at block 2652), they remain disabled until the microcode patch has been applied to all cores 102 of the microprocessor 100 (e.g., until after block 2634). Thus, advantageously, the microcode patch is propagated and applied to all of the cores 102 of the microprocessor 100 atomically at the architectural instruction level. Flow proceeds to block 2604.

At block 2604, the core 102 obtains ownership of the hardware semaphore 118 of FIG. 1. Preferably, the microprocessor 100 includes a hardware semaphore 118 associated with microcode patching. Preferably, the core 102 obtains ownership of the hardware semaphore 118 in a manner similar to that described above with respect to FIG. 20, and more particularly blocks 2004 and 2006. The hardware semaphore 118 is used because it is possible that, while one core 102 is applying a patch 2500 in response to encountering an apply microcode patch instruction, a second core 102 encounters an apply microcode patch instruction, in response to which the second core would begin applying a second patch 2500; this could cause incorrect execution, for example, by corrupting the application of the first patch 2500. Flow proceeds to block 2606.

At block 2606, the core 102 sends a patch message to the other cores 102 and sends an inter-core interrupt to the other cores 102. Preferably, a core 102 traps to microcode in response to the apply microcode patch instruction (at block 2602) or in response to the interrupt (at block 2652) and remains in microcode, with interrupts disabled (i.e., the microcode does not allow itself to be interrupted), until block 2634. Flow proceeds from block 2606 to block 2608.

At block 2652, one of the other cores 102 (i.e., a core other than the core 102 that encountered the apply microcode patch instruction at block 2602) is interrupted by the inter-core interrupt sent at block 2606 and receives the patch message. In one embodiment, the core 102 takes the interrupt at the next architectural instruction boundary (e.g., at the next x86 instruction boundary). In response to the interrupt, the core 102 disables interrupts and traps to the microcode that processes the patch message. Although the flow at block 2652 is described in terms of a single core 102, each of the other cores 102 (i.e., each core 102 other than the one at block 2602) is interrupted and receives the message at block 2652 and performs the steps of blocks 2608 through 2634. Flow proceeds from block 2652 to block 2608.

At block 2608, the core 102 writes a synchronization request for synchronization case 21 (denoted SYNC 21 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 21. Flow proceeds to decision block 2611.
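The sleep-until-all-have-written behavior of block 2608 can be sketched as a small single-threaded model. The class `ControlUnit` and its method `write_sync` are hypothetical names, and the model assumes, for illustration only, that a core is put to sleep as soon as it writes its request and that all cores are woken together once every synchronization register holds the same case number.

```python
class ControlUnit:
    """Toy model: sleep a core on a SYNC write, wake all once every core has written."""

    def __init__(self, num_cores):
        self.sync_regs = [None] * num_cores  # one synchronization register per core
        self.asleep = [False] * num_cores

    def write_sync(self, core_id, case):
        """Record a sync request; return the list of cores woken (empty while waiting)."""
        self.sync_regs[core_id] = case
        self.asleep[core_id] = True
        if all(reg == case for reg in self.sync_regs):
            woken = list(range(len(self.sync_regs)))
            self.sync_regs = [None] * len(self.sync_regs)  # requests are consumed
            self.asleep = [False] * len(self.asleep)
            return woken
        return []


cu = ControlUnit(3)
r0 = cu.write_sync(0, 21)    # core 0 writes SYNC 21 and sleeps
r1 = cu.write_sync(2, 21)    # core 2 writes SYNC 21 and sleeps
sleeping = list(cu.asleep)   # core 1 is still running its interrupted task
r2 = cu.write_sync(1, 21)    # last writer: all three cores are woken together
```

The same pattern repeats for each later synchronization case in the flow, with only the case number changing.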

At decision block 2611, the core 102 determines whether it is the core 102 that encountered the apply microcode patch instruction at block 2602 (as opposed to a core 102 that received the patch message at block 2652). If so, flow proceeds to block 2612; otherwise, flow proceeds to block 2614.

At block 2612, the core 102 loads the immediate patch 2504 portion of the microcode patch 2500 into the uncore patch RAM 2408. In addition, the core 102 generates a checksum of the loaded immediate patch 2504 and verifies that it matches the checksum 2506. Preferably, the core 102 also sends the other cores 102 a message indicating the length of the immediate patch 2504 and the location in the uncore patch RAM 2408 at which the immediate patch 2504 was loaded. Advantageously, because all cores 102 are known to be executing the same microcode that implements the apply microcode patch instruction, it is safe to overwrite the uncore patch RAM 2408 with the new patch even if a previous RAM patch 2516 is present in the uncore patch RAM 2408, since there will be no hit in the patch CAM 2439 during this period (assuming the microcode that implements the apply microcode patch instruction is not itself patched). In another embodiment, the core 102 loads the immediate patch 2504 into the uncore PRAM 116 and, before the immediate patch 2504 is executed at block 2616, copies the immediate patch 2504 from the uncore PRAM 116 to the uncore patch RAM 2408. Preferably, the core 102 loads the immediate patch 2504 into a portion of the uncore PRAM 116 reserved for this purpose, i.e., a portion of the uncore PRAM 116 that is not used for another purpose, such as holding values used by the microcode (e.g., the core 102 state, TPM state, or effective microcode constants described above), and that may itself be patched (e.g., at block 2632), so that any previous uncore PRAM patch 2518 is not clobbered. In one embodiment, the loading into and copying from the reserved portion of the uncore PRAM 116 is performed in multiple stages to reduce the size required for the reserved portion. Flow proceeds to block 2614.
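The load-then-verify step of block 2612 can be sketched as follows. The text does not name the checksum algorithm, so the function `checksum32` (a byte sum modulo 2**32) is purely an assumption for illustration; `load_and_verify`, `uncore_patch_ram`, and the sample bytes are likewise hypothetical.

```python
def checksum32(data: bytes) -> int:
    """Hypothetical checksum: byte sum modulo 2**32 (algorithm not given in the text)."""
    return sum(data) & 0xFFFFFFFF


def load_and_verify(patch_ram: bytearray, offset: int, section: bytes, expected: int) -> bool:
    """Copy a patch section into patch_ram at offset, then checksum what was loaded."""
    patch_ram[offset:offset + len(section)] = section
    loaded = bytes(patch_ram[offset:offset + len(section)])
    return checksum32(loaded) == expected


uncore_patch_ram = bytearray(64)                   # stand-in for a small patch RAM
immediate_patch = bytes([0x10, 0x20, 0x30, 0x40])  # stand-in for the immediate patch
good = load_and_verify(uncore_patch_ram, 8, immediate_patch, checksum32(immediate_patch))
bad = load_and_verify(uncore_patch_ram, 8, immediate_patch, 0)  # mismatching checksum
```

Computing the checksum over the loaded copy, rather than over the source bytes, is what lets the check catch a corrupted load.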

At block 2614, the core 102 writes a synchronization request for synchronization case 22 (denoted SYNC 22 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 22. Flow proceeds to block 2616.

At block 2616, the core 102 executes the immediate patch 2504 from the uncore patch RAM 2408. As described above, in one embodiment, the core 102 copies the immediate patch 2504 from the uncore PRAM 116 to the uncore patch RAM 2408 before executing the immediate patch 2504. Flow proceeds to block 2618.

At block 2618, the core 102 writes a synchronization request for synchronization case 23 (denoted SYNC 23 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 23. Flow proceeds to decision block 2621.

At decision block 2621, the core 102 determines whether it is the core 102 that encountered the apply microcode patch instruction at block 2602 (as opposed to a core 102 that received the patch message at block 2652). If so, flow proceeds to block 2622; otherwise, flow proceeds to block 2624.

At block 2622, the core 102 loads the CAM data 2508 and the core PRAM patch 2512 into the uncore PRAM 116. In addition, the core 102 generates a checksum of the loaded CAM data 2508 and core PRAM patch 2512 and verifies that it matches the checksum 2514. Preferably, the core 102 also sends the other cores 102 a message indicating the length of the CAM data 2508 and core PRAM patch 2512 and the location within the uncore PRAM 116 at which they were loaded. Preferably, the core 102 loads the CAM data 2508 and core PRAM patch 2512 into a reserved portion of the uncore PRAM 116, in a manner similar to that described for block 2612, so that any previous uncore PRAM patch 2518 is not clobbered. Flow proceeds to block 2624.

At block 2624, the core 102 writes a synchronization request for synchronization case 24 (denoted SYNC 24 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 24. Flow proceeds to block 2626.

At block 2626, the core 102 loads the CAM data 2508 from the uncore PRAM 116 into its patch CAM 2439. In addition, the core 102 loads the core PRAM patch 2512 from the uncore PRAM 116 into its core PRAM 2499. Advantageously, because all cores 102 are known to be executing the same microcode that implements the apply microcode patch instruction, it is safe to load the patch CAM 2439 with the CAM data 2508 even though the corresponding RAM patch 2516 has not yet been written to the uncore patch RAM 2408 (which occurs at block 2632), since there will be no hit in the patch CAM 2439 during this period (assuming the microcode that implements the apply microcode patch instruction is not itself patched). Furthermore, because all cores 102 are known to be executing that same microcode, and interrupts are not enabled on any core 102 until the patch 2500 has been propagated to all cores 102, any update that the core PRAM patch 2512 makes to the core PRAM 2499, including an update that changes a value affecting the operation of the core 102 (e.g., a feature setting), is guaranteed not to be architecturally visible until the patch 2500 has been propagated to all cores 102. Flow proceeds to block 2628.

At block 2628, the core 102 writes a synchronization request for synchronization case 25 (denoted SYNC 25 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 25. Flow proceeds to decision block 2631.

At decision block 2631, the core 102 determines whether it is the core 102 that encountered the apply microcode patch instruction at block 2602 (as opposed to a core 102 that received the patch message at block 2652). If so, flow proceeds to block 2632; otherwise, flow proceeds to block 2634.

At block 2632, the core 102 loads the RAM patch 2516 into the uncore patch RAM 2408. In addition, the core 102 loads the uncore PRAM patch 2518 into the uncore PRAM 116. In one embodiment, the uncore PRAM patch 2518 comprises program code to be executed by the SPU 2423. In one embodiment, the uncore PRAM patch 2518 comprises updates to values used by the microcode, as described above. In one embodiment, the uncore PRAM patch 2518 comprises both SPU 2423 program code and updates to values used by the microcode. Advantageously, because all cores 102 are known to be executing the same microcode that implements the apply microcode patch instruction, and more specifically because the patch CAM 2439 of every core 102 has already been loaded with the new CAM data 2508 (e.g., at block 2626), there will be no hit in the patch CAM 2439 during this period (assuming the microcode that implements the apply microcode patch instruction is not itself patched), so it is safe to load the RAM patch 2516 into the uncore patch RAM 2408. Furthermore, because all cores 102 are known to be executing that same microcode, and interrupts are not enabled on any core 102 until the patch 2500 has been propagated to all cores 102, any update that the uncore PRAM patch 2518 makes to the uncore PRAM 116, including an update that changes a value affecting the operation of the cores 102 (e.g., a feature setting), is guaranteed not to be architecturally visible until the patch 2500 has been propagated to all cores 102. Flow proceeds to block 2634.

At block 2634, the core 102 writes a synchronization request for synchronization case 26 (denoted SYNC 26 in FIGS. 26A-26B) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 26. Flow ends at block 2634.

Following block 2634, if program code for the SPU 2423 was loaded into the uncore PRAM 116, the patching core 102 also causes the SPU 2423 to begin executing that code, as described with respect to FIG. 30. In addition, after block 2634, the patching core 102 releases the hardware semaphore 118 obtained at block 2604. Further, after block 2634, each core 102 re-enables interrupts.

FIG. 27 is a timing diagram illustrating an example of the operation of a microprocessor according to the flowchart of FIGS. 26A-26B. In this example, a microprocessor 100 is configured with three cores 102, denoted core 0, core 1, and core 2, as shown. However, it should be understood that in other embodiments the microprocessor 100 may include a different number of cores 102. In the timing diagram, events proceed in the following order.

Core 0 receives a request to apply a microcode patch (per block 2602) and in response acquires the hardware semaphore 118 (per block 2604). Core 0 then sends a microcode patch message and an interrupt to core 1 and core 2 (per block 2606). Core 0 then writes a SYNC 21 and goes to sleep (per block 2608).

Each of core 1 and core 2 is eventually interrupted from its current task and reads the message (per block 2652). In response, each of core 1 and core 2 writes a SYNC 21 and goes to sleep (per block 2608). As shown, the time at which each core writes SYNC 21 may differ, for example because of the latency of the instruction that is executing when the interrupt is asserted.

When all cores have written SYNC 21, the control unit 104 wakes all of the cores at the same time (per block 2608). Core 0 then loads the immediate patch 2504 into the uncore PRAM 116 (per block 2612), writes a SYNC 22, and goes to sleep (per block 2614). Each of core 1 and core 2 writes a SYNC 22 and goes to sleep (per block 2614).

When all cores have written SYNC 22, the control unit 104 wakes all of the cores at the same time (per block 2614). Each core executes the immediate patch 2504 (per block 2616), writes a SYNC 23, and goes to sleep (per block 2618).

When all cores have written SYNC 23, the control unit 104 wakes all of the cores at the same time (per block 2618). Core 0 then loads the CAM data 2508 and the core PRAM patch 2512 into the uncore PRAM 116 (per block 2622), writes a SYNC 24, and goes to sleep (per block 2624).

When all cores have written SYNC 24, the control unit 104 wakes all of the cores at the same time (per block 2624). Each core then loads its patch CAM 2439 with the CAM data 2508 and its core PRAM 2499 with the core PRAM patch 2512 (per block 2626), writes a SYNC 25, and goes to sleep (per block 2628).

When all cores have written SYNC 25, the control unit 104 wakes all of the cores at the same time (per block 2628). Core 0 then loads the RAM patch 2516 into the uncore patch RAM 2408 and the uncore PRAM patch 2518 into the uncore PRAM 116 (per block 2632), writes a SYNC 26, and goes to sleep (per block 2634).

When all cores have written SYNC 26, the control unit 104 wakes all of the cores at the same time (per block 2634). As described above, if program code for the SPU 2423 was loaded into the uncore PRAM 116 at block 2632, core 0 also then causes the SPU 2423 to begin executing that code, as described below with respect to FIG. 30.
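The sequence of events above can be modeled with three threads and a barrier. This is a minimal sketch under stated assumptions: `threading.Barrier` stands in for the SYNC writes and the simultaneous wake-up by the control unit 104, the dictionaries `uncore`, `cores`, and `patch` are hypothetical stand-ins for the memories and patch sections, and the immediate patch is loaded into the uncore patch RAM as in block 2612.

```python
import threading

NUM_CORES = 3
barrier = threading.Barrier(NUM_CORES)  # models SYNC n plus the common wake-up
log_lock = threading.Lock()
log = []                                # (core id, event) in order of occurrence
uncore = {"patch_ram": None, "pram": {}}
cores = [{"cam": None, "pram": None} for _ in range(NUM_CORES)]
patch = {"immediate": "imm", "cam": "cam-data", "core_pram": "core-pram",
         "ram": "ram-patch", "uncore_pram": "uncore-pram"}


def record(core_id, event):
    with log_lock:
        log.append((core_id, event))


def run_core(core_id, is_initiator):
    barrier.wait()                                     # SYNC 21
    if is_initiator:                                   # block 2612
        uncore["patch_ram"] = patch["immediate"]
    barrier.wait()                                     # SYNC 22
    record(core_id, "exec " + uncore["patch_ram"])     # block 2616
    barrier.wait()                                     # SYNC 23
    if is_initiator:                                   # block 2622
        uncore["pram"]["cam"] = patch["cam"]
        uncore["pram"]["core_pram"] = patch["core_pram"]
    barrier.wait()                                     # SYNC 24
    cores[core_id]["cam"] = uncore["pram"]["cam"]      # block 2626
    cores[core_id]["pram"] = uncore["pram"]["core_pram"]
    barrier.wait()                                     # SYNC 25
    if is_initiator:                                   # block 2632
        uncore["patch_ram"] = patch["ram"]
        uncore["pram"]["uncore"] = patch["uncore_pram"]
    barrier.wait()                                     # SYNC 26
    record(core_id, "done")


threads = [threading.Thread(target=run_core, args=(i, i == 0)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every stage is fenced by a barrier, no core reads a shared memory before core 0 has finished writing it, and core 0 does not overwrite the uncore patch RAM with the RAM patch until every core has executed the immediate patch.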

Referring now to FIG. 28, a block diagram of a multi-core microprocessor 100 according to another embodiment is shown. The microprocessor 100 is similar in many respects to the microprocessor 100 of FIG. 24. However, the microprocessor 100 of FIG. 28 does not include an uncore patch RAM; instead, each core 102 includes a core patch RAM 2808 that serves a function similar to that of the uncore patch RAM 2408 of FIG. 24. The core patch RAM 2808 in each core 102 is private to its respective core 102 and is not shared with the other cores 102.

Referring now to FIGS. 29A-29B, a flowchart illustrating operation of the microprocessor 100 of FIG. 28 to propagate a microcode patch to the cores 102 of the microprocessor 100 according to another embodiment is shown. In the embodiment of FIGS. 28 and 29A-29B, the patch 2500 of FIG. 25 may be modified such that the checksum 2514 follows the RAM patch 2516 rather than the core PRAM patch 2512, which enables the microprocessor 100 to verify the integrity of the CAM data 2508, the core PRAM patch 2512, and the RAM patch 2516 after they are loaded into the microprocessor 100 (e.g., at block 2922 of FIGS. 29A-29B). The flowchart of FIGS. 29A-29B is similar in many respects to the flowchart of FIGS. 26A-26B, and like-numbered blocks are similar. However, block 2912 replaces block 2612, block 2916 replaces block 2616, block 2922 replaces block 2622, block 2926 replaces block 2626, and block 2932 replaces block 2632. At block 2912, the core 102 loads the immediate patch 2504 into the uncore PRAM 116 (rather than into an uncore patch RAM). At block 2916, the core 102 copies the immediate patch 2504 from the uncore PRAM 116 to its core patch RAM 2808 before executing the immediate patch 2504. At block 2922, the core 102 loads the RAM patch 2516 into the uncore PRAM 116 in addition to the CAM data 2508 and the core PRAM patch 2512. At block 2926, in addition to loading the CAM data 2508 from the uncore PRAM 116 into its patch CAM 2439 and the core PRAM patch 2512 from the uncore PRAM 116 into its core PRAM 2499, the core 102 loads the RAM patch 2516 from the uncore PRAM 116 into its core patch RAM 2808. At block 2932, unlike block 2632 of FIGS. 26A-26B, the core 102 does not load the RAM patch 2516 into an uncore patch RAM.

It can be observed from the above embodiments that the microcode patch 2500 is advantageously propagated atomically to each of the relevant memories 2439/2499/2808 of the cores 102 and to the relevant uncore memories 2408/116 of the microprocessor 100 in a manner that ensures the integrity and validity of the patch 2500, even though multiple cores 102 that may share resources are executing concurrently. If the patch were instead applied in a conventional manner, one core 102 could clobber portions of another core's patch.

Patching service processor code

Referring now to FIG. 30, a flowchart illustrating operation of the microprocessor 100 of FIG. 24 to patch service processor code is shown. Flow begins at block 3002.

At block 3002, the core 102 loads program code to be executed by the SPU 2423 into the uncore PRAM 116 at a patch address specified by the patch, as described above with respect to block 2632 of FIGS. 26A-26B. Flow proceeds to block 3004.

At block 3004, the core 102 controls the SPU 2423 to execute program code at the patch address, i.e., the address in the uncore PRAM 116 to which the SPU 2423 program code was written at block 3002. In one embodiment, the SPU 2423 is configured to fetch its reset vector (i.e., the address at which the SPU 2423 begins fetching instructions after coming out of reset) from the start address register 2497, and the core 102 writes the patch address to the start address register 2497 and then writes to a control register that causes the SPU 2423 to be reset. Flow proceeds to block 3006.

At block 3006, the SPU 2423 begins fetching program code (i.e., fetches its first instruction) at the patch address, i.e., the address in the uncore PRAM 116 to which the SPU 2423 program code was written at block 3002. Typically, the SPU 2423 patch code residing in the uncore PRAM 116 will perform a jump to the SPU 2423 code residing in the uncore ROM 2425. Flow ends at block 3006.
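The flow of FIG. 30 can be sketched as a toy interpreter. Everything here is an assumption for illustration: the addresses `ROM_ENTRY` and `PATCH_ADDR`, the tuple-based instruction encoding, and the class `ServiceProcessor`, whose attribute `start_address` models the start address register 2497 and whose method `reset` models the control-register write that resets the SPU.

```python
ROM_ENTRY = 0x0000    # hypothetical address of the ROM-resident SPU code
PATCH_ADDR = 0x8000   # hypothetical patch address within the uncore PRAM

# Shared address space: address -> instruction tuple (toy encoding).
memory = {ROM_ENTRY: ("halt",)}   # stand-in for the SPU code in the uncore ROM


class ServiceProcessor:
    """Toy SPU that fetches its first instruction from a start address register."""

    def __init__(self):
        self.start_address = ROM_ENTRY  # models the start address register
        self.trace = []                 # (address, opcode) of each fetched instruction

    def reset(self):
        """Begin fetching at start_address and run until a halt instruction."""
        pc = self.start_address
        while True:
            op = memory[pc]
            self.trace.append((pc, op[0]))
            if op[0] == "jump":
                pc = op[1]
            elif op[0] == "halt":
                return
            else:
                pc += 1


spu = ServiceProcessor()
# Block 3002: the core writes the SPU patch code at the patch address.
memory[PATCH_ADDR] = ("patched_work",)
memory[PATCH_ADDR + 1] = ("jump", ROM_ENTRY)  # patch code jumps back to ROM code
# Block 3004: the core points the start address at the patch and resets the SPU.
spu.start_address = PATCH_ADDR
spu.reset()                                   # block 3006: fetch begins at the patch
```

The trace shows the patch code running first and then handing control to the ROM-resident code, which is the typical arrangement described for block 3006.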

The ability to patch the SPU 2423 program code may be particularly useful. For example, the SPU 2423 may be used for performance testing that is essentially transient in nature, such that it is not desirable to make the performance-testing SPU 2423 code a permanent part of the microprocessor 100, but rather only a part of development parts, as distinct from production (manufacturing) parts. In another example, the SPU 2423 may be used to find and/or fix bugs. In another example, the SPU 2423 may be used to configure the microprocessor 100.

Atomic propagation of updates to per-core-instantiated architecturally visible storage resources

Referring now to FIG. 31, a block diagram of a multi-core microprocessor 100 according to another embodiment is shown. The microprocessor 100 is similar in many respects to the microprocessor 100 of FIG. 24. However, the microprocessor 100 of FIG. 31 also includes, in each core 102, architecturally visible memory type range registers (MTRRs) 3102. That is, each core 102 instantiates the architecturally visible MTRRs 3102, even though system software expects the MTRRs 3102 to be consistent across all cores 102 (as described in more detail below). The MTRRs 3102 are an example of a per-core-instantiated architecturally visible storage resource; other embodiments of per-core-instantiated architecturally visible storage resources are described below. (Although not shown, each core 102 also includes the core PRAM 2499, core microcode ROM 2404, and patch CAM 2439 of FIG. 24 and, in one embodiment, the core microcode patch RAM 2808 of FIG. 28.)

The MTRRs 3102 provide a way for system software to associate a memory type with each of a plurality of physical address ranges in the system memory address space of the microprocessor 100. Examples of memory types include strong uncacheable, uncacheable, write-combining, write-through, write-back, and write-protected. Each MTRR 3102 specifies (explicitly or implicitly) a memory range and its memory type. The collective values of the MTRRs 3102 define a memory map that specifies the memory types of the different memory ranges. In one embodiment, the MTRRs 3102 are similar to those described in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide, September 2013, particularly in section 11.11, which is incorporated herein by reference and forms a part of this specification.
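How a single base/mask register pair specifies a memory range and its type can be sketched as follows. The field layout (type in bits 7:0 of the base register, valid flag in bit 11 of the mask register, address fields from bit 12 upward) and the type encodings follow the Intel manual cited above; the 36-bit physical address width, the function name `mtrr_type_for`, and the sample register values are assumptions for illustration. Overlapping ranges, fixed ranges, and the default memory type are deliberately ignored.

```python
MEMORY_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}  # x86 type encodings
VALID_BIT = 1 << 11                    # valid flag in the mask register
ADDR_FIELD = ((1 << 36) - 1) & ~0xFFF  # bits 35:12, assuming 36-bit physical addresses


def mtrr_type_for(addr, physbase, physmask):
    """Return the memory type name if addr lies in this range, else None."""
    if not physmask & VALID_BIT:
        return None                    # register pair is disabled
    mask = physmask & ADDR_FIELD
    if (addr & mask) != (physbase & mask):
        return None                    # address is outside the range
    return MEMORY_TYPES[physbase & 0xFF]


# A 1 MiB write-back range starting at physical address 0x00100000.
physbase = 0x00100000 | 6
physmask = 0xFFFF00000 | VALID_BIT

inside = mtrr_type_for(0x00180000, physbase, physmask)
outside = mtrr_type_for(0x00200000, physbase, physmask)
disabled = mtrr_type_for(0x00180000, physbase, physmask & ~VALID_BIT)
```

Because the memory map is the combination of all such pairs, two cores whose register values differ would classify the same address differently, which is the inconsistency the embodiments below are meant to avoid.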

It is desirable that the memory map defined by the MTRRs 3102 be the same for all cores 102 of the microprocessor 100, so that software running on the microprocessor 100 has a consistent view of memory. However, conventional processors provide no hardware support for keeping the MTRRs consistent among the cores of a multi-core processor. As stated on page 11-20 of Volume 3 of the Intel manual cited above, the P6 and more recent processor families provide no hardware support for maintaining consistency of MTRR values. Consequently, system software is responsible for maintaining MTRR consistency across the cores. Section 11.11.8 of the Intel manual cited above describes an algorithm by which system software maintains consistency when updating the MTRRs of a multi-core processor, namely, every core executes instructions that update its own MTRRs.

In contrast, embodiments are described herein in which system software may update the respective instance of an MTRR 3102 in one of the cores 102, and that core 102 advantageously propagates the update to the respective instances of the MTRR 3102 in all of the cores 102 of the microprocessor 100 in an atomic manner (similar to the manner in which a microcode patch is propagated in the embodiments of FIGS. 24-30 described above). This provides a means of maintaining consistency among the MTRRs 3102 of the different cores 102 at the architectural instruction level.

Referring now to FIG. 32, a flowchart illustrating operation of the microprocessor 100 of FIG. 31 to propagate an MTRR 3102 update to the cores 102 of the microprocessor 100 is shown. The operation is described from the perspective of a single core, but each core 102 of the microprocessor 100 operates according to the description, so that the cores collectively propagate the MTRR 3102 update to all cores 102 of the microprocessor 100. More specifically, FIG. 32 describes the operation of the core that encounters the MTRR 3102 update instruction, whose flow begins at block 3202, and the operation of the other cores 102, whose flow begins at block 3252.

At block 3202, one of the cores 102 encounters an instruction that instructs the core to update its MTRR 3102. The MTRR update instruction includes an MTRR 3102 identifier and an update value to be written to the MTRR 3102. In one embodiment, the MTRR update instruction is an x86 WRMSR instruction that specifies the update value in the EDX:EAX registers and the MTRR 3102 identifier in the ECX register, the identifier being an MSR address within the MSR address space of the core 102. In response to the MTRR update instruction, the core 102 disables interrupts and traps to the microcode that implements the MTRR update instruction. It should be appreciated that the system software that includes the MTRR update instruction may include a multi-instruction sequence that prepares for the update of the MTRR 3102. Preferably, however, the MTRRs 3102 of all cores 102 are updated atomically at the architectural instruction level in response to a single architectural instruction of that sequence. That is, once interrupts are disabled in the first core 102 (i.e., the core 102 that encounters the MTRR update instruction at block 3202), they remain disabled while the implementing microcode propagates the new MTRR 3102 value to all of the cores 102 of the microprocessor 100 (e.g., until after block 3218). Further, once interrupts are disabled in the other cores 102 (e.g., at block 3252), they remain disabled until the MTRRs 3102 of all cores 102 of the microprocessor 100 have been updated (e.g., until after block 3218). Thus, advantageously, the new MTRR 3102 value is propagated to all of the cores 102 of the microprocessor 100 atomically at the architectural instruction level. Flow proceeds to block 3204.

At block 3204, the core 102 obtains ownership of the hardware semaphore 118 of FIG. 1. Preferably, the microprocessor 100 includes a hardware semaphore 118 associated with the MTRRs 3102. Preferably, the core 102 obtains ownership of the hardware semaphore 118 in a manner similar to that described above with respect to FIG. 20, and more particularly blocks 2004 and 2006. The hardware semaphore 118 is used because it is possible that one core 102 is performing an MTRR 3102 update in response to encountering an MTRR update instruction while a second core 102 encounters an MTRR update instruction. If the second core 102 then began its own MTRR 3102 update, incorrect execution could result. Flow proceeds to block 3206.

At block 3206, the core 102 sends an MTRR update message to the other cores 102 and sends an inter-core interrupt to the other cores 102. Preferably, a core 102 that traps to microcode in response to the MTRR update instruction (at block 3202) or in response to the interrupt (at block 3252) remains in microcode, during which time interrupts are disabled (i.e., the microcode does not allow itself to be interrupted), until block 3218. Flow proceeds to block 3208.

At block 3252, one of the other cores 102 (i.e., a core other than the core 102 that encountered the MTRR update instruction at block 3202) is interrupted by the inter-core interrupt sent at block 3206 and receives the MTRR update message. In one embodiment, the core 102 takes the interrupt at the next architectural instruction boundary (e.g., at the next x86 instruction boundary). In response to the interrupt, the core 102 disables interrupts and traps to the microcode that processes the MTRR update message. Although the flow at block 3252 is described in terms of a single core 102, each of the other cores 102 (i.e., each core 102 other than the one at block 3202) is interrupted and receives the message at block 3252 and performs the steps of blocks 3208 through 3218. Flow proceeds from block 3252 to block 3208.

At block 3208, the core 102 writes a synchronization request for synchronization case 31 (denoted SYNC 31 in FIG. 32) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 31. Flow proceeds to decision block 3211.

At decision block 3211, the core 102 determines whether it is the core 102 that encountered the MTRR update instruction at block 3202 (as opposed to a core 102 that received the MTRR update message at block 3252). If so, flow proceeds to block 3212; otherwise, flow proceeds to block 3214.

At block 3212, the core 102 loads into the uncore PRAM 116, where they are visible to all of the other cores 102, the MTRR identifier specified by the MTRR update instruction and an MTRR update value with which the MTRR is to be updated. In the case of an x86 embodiment, the MTRRs 3102 include: (1) fixed range MTRRs, each of which comprises a single 64-bit MSR that is updated by a single WRMSR instruction, and (2) variable range MTRRs, each of which comprises two 64-bit MSRs, each written by a different WRMSR instruction, i.e., one that specifies a different MSR address. For a variable range MTRR, one of the MSRs (the PHYSBASE register) includes the base address of the memory range and a type field that specifies the memory type, and the other MSR (the PHYSMASK register) includes a valid bit and a mask field that sets the range mask. Preferably, the MTRR update value that the core 102 loads into the uncore PRAM 116 is as follows.

1. If the specified MSR is the PHYSMASK register, the core 102 loads into the uncore PRAM 116 a 128-bit update value that includes the new 64-bit value specified by the WRMSR instruction (which includes the valid bit and the mask value) and the current value of the PHYSBASE register (which includes the base value and the type value).

2. If the specified MSR is the PHYSBASE register:

a. If the valid bit in the PHYSMASK register is set, the core 102 loads into the uncore PRAM 116 a 128-bit update value that includes the new 64-bit value specified by the WRMSR instruction (which includes the base value and the type value) and the current value of the PHYSMASK register (which includes the valid bit and the mask value).

b. If the valid bit in the PHYSMASK register is clear, the core 102 loads into the uncore PRAM 116 a 64-bit update value that includes only the new 64-bit value specified by the WRMSR instruction (which includes the base value and the type value).

In addition, if the written update value is a 128-bit value, the core 102 sets a flag in the uncore PRAM 116, and if the update value is a 64-bit value, the core 102 clears the flag. Flow proceeds from block 3212 to block 3214.
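The selection of the update value for a variable range MTRR can be sketched as follows. This is an illustrative reading rather than the microcode of the embodiments: it assumes that case 2b applies when the valid bit of the PHYSMASK register is clear (the only reading under which cases 2a and 2b are distinct), it represents the 128-bit value as a pair of 64-bit values because the text does not give their ordering, and the name `build_update` is hypothetical.

```python
VALID_BIT = 1 << 11   # valid flag of the PHYSMASK register (x86 layout)


def build_update(msr_kind, new_value, current_physbase, current_physmask):
    """Return (flag, payload) for the shared memory.

    flag is True for a 128-bit update (two 64-bit values) and False for a
    64-bit update (one value). msr_kind is "PHYSMASK" or "PHYSBASE".
    """
    if msr_kind == "PHYSMASK":
        # Case 1: new mask value plus the current base value.
        return True, (new_value, current_physbase)
    if msr_kind == "PHYSBASE":
        if current_physmask & VALID_BIT:
            # Case 2a: new base value plus the current mask value.
            return True, (new_value, current_physmask)
        # Case 2b: the range is not valid, so only the new base value is sent.
        return False, (new_value,)
    raise ValueError("only variable range MTRR registers are modeled")


case1 = build_update("PHYSMASK", 0xFFFF00800, 0x00100006, 0)
case2a = build_update("PHYSBASE", 0x00200006, 0, 0xFFFF00800)
case2b = build_update("PHYSBASE", 0x00200006, 0, 0xFFFF00000)
```

Sending both halves whenever the range is, or is about to become, valid means that the receiving cores never combine a new register with a stale partner register.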

At block 3214, the core 102 writes a synchronization request for synchronization case 32 (denoted SYNC 32 in FIG. 32) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 32. Flow proceeds to block 3216.

At block 3216, the core 102 reads from the uncore PRAM 116 the MTRR 3102 identifier and the MTRR update value written at block 3212, and updates its identified MTRR 3102 with the update value. Advantageously, the MTRR update is propagated in an atomic manner: because all cores 102 are known to be executing the same microcode that implements the MTRR update instruction, and interrupts are not enabled on any core 102 until the update has been propagated to the respective MTRRs 3102 of all cores 102, any update to the MTRRs 3102 that affects the operation of the respective cores 102 is guaranteed not to be architecturally visible until the update has been propagated to the MTRRs 3102 of all cores 102. In the embodiment described above with respect to block 3212, if the flag was set at block 3212, the core 102 also updates the PHYSMASK or PHYSBASE register (in addition to the specified MSR); otherwise, if the flag is clear, the core 102 updates only the specified MSR. Flow proceeds to block 3218.

At block 3218, the core 102 writes a synchronization request for synchronization case 33 (denoted SYNC 33 in FIG. 32) to its synchronization register 108. The control unit 104 puts the core 102 to sleep and subsequently wakes it when all cores 102 have written SYNC 33. Flow ends at block 3218.

Following block 3218, the core 102 that encountered the MTRR update instruction releases the hardware semaphore 118 obtained at block 3204. Further, after block 3218, each core 102 re-enables interrupts.
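The three-stage flow of FIG. 32 can be modeled with threads and a barrier. As with any such sketch, the names are assumptions: `threading.Barrier` stands in for SYNC 31 through SYNC 33 and the common wake-up, the dictionary `uncore_pram` stands in for the uncore PRAM 116, and only the simple 64-bit update case is modeled.

```python
import threading

NUM_CORES = 3
barrier = threading.Barrier(NUM_CORES)  # models SYNC 31..33 and the common wake-up
uncore_pram = {}                        # shared memory visible to every core
core_mtrrs = [{"PHYSBASE0": 0x0, "PHYSMASK0": 0x0} for _ in range(NUM_CORES)]


def propagate(core_id, update=None):
    """update is (msr_name, new_value) on the initiating core and None elsewhere."""
    barrier.wait()                                  # SYNC 31 (block 3208)
    if update is not None:                          # block 3212, 64-bit case only
        uncore_pram["msr"], uncore_pram["value"] = update
    barrier.wait()                                  # SYNC 32 (block 3214)
    # Block 3216: every core applies the same identifier and value locally.
    core_mtrrs[core_id][uncore_pram["msr"]] = uncore_pram["value"]
    barrier.wait()                                  # SYNC 33 (block 3218)


threads = [
    threading.Thread(
        target=propagate,
        args=(i, ("PHYSBASE0", 0x00100006) if i == 0 else None),
    )
    for i in range(NUM_CORES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No core leaves the final barrier until every core has applied the value, so software resuming on any core after the flow observes the same register contents everywhere.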

It will be appreciated from FIGS. 31 and 32 that system software running on the microprocessor 100 of FIG. 31 may advantageously execute an MTRR update instruction on a single core 102 of the microprocessor 100 to accomplish the update of the specified MTRR 3102 in all cores 102 of the microprocessor 100, rather than executing an MTRR update instruction on each core 102 individually, which may improve system integrity.

A particular MTRR 3102 instantiated in each core 102 is a system management range register (SMRR) 3102. The SMRR 3102 specifies a memory range that holds program code and data associated with system management mode (SMM) operation, such as a system management interrupt (SMI) handler; this memory range is referred to as the SMRAM region. When code running on a core 102 attempts to access the SMRAM region, the core 102 allows the access only if the core 102 is running in SMM; otherwise, the core 102 ignores writes to the SMRAM region and returns a fixed value for each byte read from the SMRAM region. Additionally, if a core 102 running in SMM attempts to execute code outside the SMRAM region, the core 102 asserts a machine check exception. Further, the core 102 allows code to write the SMRR 3102 only when the core 102 is running in SMM. This helps protect the SMM program code and data in the SMRAM region. In one embodiment, the SMRR 3102 is similar to that described in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide, September 2013, particularly in sections 11.11.2.4 and 34.4.2.1, which are incorporated herein by reference and form a part of this specification.

Generally, each core 102 has its own instance of SMM program code and data in memory. It is desirable that the SMM code and data of each core 102 be protected not only from code running on that core 102, but also from code running on the other cores 102. To accomplish this using the SMRRs 3102, system software typically places the multiple SMM program code and data instances in adjacent blocks of memory. That is, the SMRAM region is a single contiguous memory region that includes all of the SMM program code and data instances. If the SMRRs 3102 of all cores 102 of the microprocessor 100 hold a value that specifies the entirety of this single contiguous memory region, code running in non-SMM on one core 102 is prevented from updating the SMM code and data instance of another core 102. If there is a window of time in which the values of the SMRRs 3102 of the cores 102 are not the same, i.e., the SMRRs 3102 in different cores 102 of the microprocessor 100 have different values, any of which specifies less than the entirety of the single contiguous memory region that includes all of the SMM program code and data instances, the system may be vulnerable to a security attack, which may be serious given the nature of SMM. Thus, the embodiments that atomically propagate updates to the SMRRs 3102 may be particularly advantageous.
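The exposure created by inconsistent SMRR values can be illustrated with a toy access check. The class `Core`, the address values, and the constant `FIXED_READ_VALUE` are hypothetical; the model only captures the rule stated above that a core not running in SMM ignores writes to, and returns a fixed value for reads from, the range its own SMRR specifies.

```python
FIXED_READ_VALUE = 0xFF   # hypothetical value returned for blocked reads


class Core:
    """Toy core whose SMRR is given as (base, size); in_smm gates SMRAM access."""

    def __init__(self, smrr_base, smrr_size):
        self.smrr_base = smrr_base
        self.smrr_size = smrr_size
        self.in_smm = False

    def _blocked(self, addr):
        in_smram = self.smrr_base <= addr < self.smrr_base + self.smrr_size
        return in_smram and not self.in_smm

    def write(self, memory, addr, value):
        """Return True if the write took effect, False if it was ignored."""
        if self._blocked(addr):
            return False
        memory[addr] = value
        return True

    def read(self, memory, addr):
        if self._blocked(addr):
            return FIXED_READ_VALUE
        return memory.get(addr, 0)


# SMRAM region: one SMM instance at 0x1000 and an adjacent one at 0x2000.
SMRAM_BASE, SMRAM_SIZE = 0x1000, 0x2000
memory = {0x1800: 0xAA}                     # a byte of the first instance's SMM data

consistent = Core(SMRAM_BASE, SMRAM_SIZE)   # SMRR covers the whole region
blocked_write = consistent.write(memory, 0x1800, 0x00)
blocked_read = consistent.read(memory, 0x1800)

stale = Core(0x2000, 0x1000)                # SMRR covers only the second instance
leaked_write = stale.write(memory, 0x1800, 0x00)
```

In the second case non-SMM code on the core with the narrower SMRR overwrites another core's SMM data, which is the window that atomic propagation is intended to close.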

Moreover, other embodiments are contemplated in which updates to other architecturally visible storage resources instantiated per core in the microprocessor 100 are propagated atomically in a manner similar to that described above. For example, in one embodiment, each core 102 instantiates certain bit fields of the x86 IA32_MISC_ENABLE MSR, and a WRMSR executed on one core 102 is propagated to all cores 102 of the microprocessor 100 in a manner similar to that described above. Furthermore, embodiments are contemplated in which execution on one core 102 of a WRMSR to other MSRs instantiated on all cores 102 of the microprocessor 100, whether architectural or proprietary and whether current or future, is propagated to all cores 102 of the microprocessor 100 in a manner similar to that described above.
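As a rough software analogy of propagating a single MSR write to every core, the following sketch models each core as a thread that updates its own copy of the register between two rendezvous points. It assumes a barrier-style rendezvous purely for illustration and does not reproduce the hardware or microcode mechanism of the embodiments; the function name `propagate_wrmsr` and the per-core dictionaries are hypothetical. The value `0x1A0` is the architectural address of IA32_MISC_ENABLE.

```python
import threading
from typing import Dict, List

IA32_MISC_ENABLE = 0x1A0  # architectural MSR address


def propagate_wrmsr(num_cores: int, msr: int, value: int) -> List[Dict[int, int]]:
    """Apply one MSR write to every core between two rendezvous points."""
    msrs: List[Dict[int, int]] = [dict() for _ in range(num_cores)]
    enter = threading.Barrier(num_cores)  # all cores have stopped normal execution
    leave = threading.Barrier(num_cores)  # all cores have applied the update

    def core(core_id: int) -> None:
        enter.wait()                 # no core updates until every core has arrived
        msrs[core_id][msr] = value   # each core writes its own instance
        leave.wait()                 # no core resumes until every core has updated

    threads = [threading.Thread(target=core, args=(i,)) for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return msrs
```

Because no modeled core resumes normal execution until the second barrier is passed, code running after the rendezvous never sees some cores holding the old value and others the new one.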

Furthermore, although embodiments are described in which the per-core instantiated, architecturally visible storage resources are MTRRs, other embodiments are contemplated in which the per-core instantiated resources belong to instruction set architectures other than the x86 ISA, and in which they are resources other than MTRRs. For example, resources other than MTRRs include CPUID values and MSRs that report capabilities, such as Virtual Machine Extensions (VMX) capabilities.

Although the present invention has been described with reference to preferred embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. This can be accomplished using general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so forth. Such software may be embodied in the form of program code in a tangible medium, such as a semiconductor, magnetic disk, hard disk, optical disc (e.g., CD-ROM, DVD-ROM, etc.), or any other machine-readable (e.g., computer-readable) storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The methods and apparatus of the present invention may also be embodied in the form of program code transmitted over some transmission medium, such as electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates analogously to specific logic circuits. The apparatus and method of the present invention may be embodied in a semiconductor intellectual property core, such as a microprocessor core (embodied in HDL), and transformed into hardware in the production of an integrated circuit. Furthermore, the apparatus and methods described herein may be embodied as a combination of hardware and software.
Therefore, the scope of protection of the present invention is determined by the claims of the present application. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for putting a majority of cores of a multi-core microprocessor into a sleep state, wherein a designated one of the cores services a non-directed wake event that is not directed to any particular core, the method comprising:

causing all of the cores to enter a sleep state and blocking wake events for all of the cores other than the designated core, wherein causing all of the cores to enter a sleep state comprises: preventing the provision of clock signals and power to all of the cores;

in response to detecting a wake event, waking the designated core to process the detected wake event;

unblocking wake events for the cores other than the designated core, regardless of whether any pending wake event is directed to a non-designated core, such that if a directed wake event is directed to a non-designated core, the non-designated core is able to react to the directed wake event, and if no directed wake event is directed to a non-designated core, all of the non-designated cores remain in a sleep state;

returning the designated core to a sleep state after the designated core services the wake event;

blocking wake events for all cores except the designated core; and

after all of the cores are put into a sleep state, keeping the other cores except the designated core in the sleep state until a wake event is directed to one or more of the other cores.

2. The method of claim 1, further comprising the steps of: designating a last core requesting to enter a sleep state as the designated core.

3. The method of claim 2, wherein a control unit of the multi-core microprocessor automatically blocks all wake events for the other cores if the designated core is the last core to write a synchronization request.

4. The method of claim 2, further comprising the steps of:

waking a second one of the cores other than the designated core if a pending directed wake event is directed to the second core;

the second core servicing the directed wake event;

the designated core and the second core each requesting to enter a sleep state;

designating the second core as a newly designated core and de-designating the previously designated core in a case where the second core is the last core to request entry into a sleep state; and

repeating the act of putting all the cores into a sleep state by keeping the other cores in a sleep state until a wake event is directed to one or more of the other cores, wherein the other cores are defined with reference to the newly designated core.

5. The method of claim 1, wherein a control unit of the multi-core microprocessor acts to put all the cores into a sleep state by keeping the other cores in a sleep state.

6. The method of claim 1, wherein the blocking of wake events for all cores other than the designated core is performed automatically by a control unit of the multi-core microprocessor in response to detecting that the designated core requests to enter a sleep state after all of the other cores have requested to enter a sleep state.

7. The method of claim 1, wherein the unblocking is performed by a control unit of the multi-core microprocessor in response to the designated core requesting unblocking of wake events for the cores other than the designated core.

8. The method of claim 1, wherein, after each of the cores executes an instruction specifying a target power-saving idle state, all of the cores are caused to enter a sleep state and wake events are blocked for all cores other than the designated core in response to the execution.

9. The method of claim 1, wherein the designated core writes a synchronization request that wakes the designated core only by a de-assertion of STPCLK.

10. The method of claim 9, wherein, in the event that a control unit of the multi-core microprocessor detects the deassertion of the STPCLK, the control unit wakes up the designated core and does not limit power to the designated core.

11. A multi-core microprocessor in which a majority of cores of the multi-core microprocessor enter a sleep state and a designated one of the cores services non-directed wake events that are not directed to any particular core, the multi-core microprocessor characterized by being configured to:

blocking wake events for all cores except the designated core; and

12. The multi-core microprocessor of claim 11, wherein the multi-core microprocessor is configured to designate a last core requesting to enter a sleep state as the designated core.

13. The multi-core microprocessor of claim 12, wherein, in the event that the designated core is the last core to write a synchronization request, a control unit of the multi-core microprocessor is configured to automatically block all wake events for the other cores.

14. The multi-core microprocessor of claim 12, wherein the multi-core microprocessor is further configured to:

causing the second core to service the directed wake event;

causing the designated core and the second core to each request to enter a sleep state;

15. The multi-core microprocessor of claim 11, wherein the control unit of the multi-core microprocessor is configured to perform an action of bringing all the cores into a sleep state by keeping the other cores in the sleep state.

16. The multi-core microprocessor of claim 11, wherein the control unit of the multi-core microprocessor is configured to automatically block the wake events for all of the cores other than the designated core in response to detecting that the designated core requests to enter a sleep state after all of the other cores have requested to enter a sleep state.

17. The multi-core microprocessor of claim 11, wherein the control unit of the multi-core microprocessor is configured to perform the unblocking in response to the designated core requesting unblocking of the wake events for the cores other than the designated core.

18. The multi-core microprocessor of claim 11, wherein, after each of the cores executes an instruction specifying a target power-saving idle state, all of the cores are put into a sleep state and wake events for all cores other than the designated core are blocked in response to the execution.

19. The multi-core microprocessor of claim 11, wherein the designated core is configured to write a synchronization request that wakes the designated core only by a de-assertion of STPCLK.

20. The multi-core microprocessor of claim 19, wherein, in the event that a control unit of the multi-core microprocessor is configured to detect the deassertion of the STPCLK, the control unit wakes up the designated core and does not limit power to the designated core.
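For illustration only, the sequence recited in the method claims above (designating the last core to request sleep, blocking wake events for the other cores, waking the designated core, unblocking, and returning to sleep) may be modeled as a small state machine. This sketch is not part of the claims and does not limit them. The `ControlUnit` class and its method names are hypothetical, clock and power gating are not modeled, and holding a directed wake event as pending while its target core is blocked is an assumption of the model.

```python
from typing import List, Optional


class ControlUnit:
    """Toy model of sleep requests, wake-event blocking, and designation."""

    def __init__(self, num_cores: int) -> None:
        self.asleep = [False] * num_cores
        self.blocked = [False] * num_cores
        self.pending = [False] * num_cores  # directed events held while blocked
        self.designated: Optional[int] = None

    def request_sleep(self, core: int) -> None:
        self.asleep[core] = True
        if all(self.asleep):
            # The last core to request sleep becomes the designated core,
            # and wake events are blocked for every other core.
            self.designated = core
            self.blocked = [c != core for c in range(len(self.asleep))]

    def wake_event(self, target: Optional[int] = None) -> None:
        """Deliver a wake event; target=None means a non-directed event."""
        if target is not None and not self.blocked[target]:
            self.asleep[target] = False  # delivered directly to an unblocked core
            return
        if target is not None:
            self.pending[target] = True  # assumption: held until unblocked
        if self.designated is not None:
            self.asleep[self.designated] = False  # designated core services it

    def unblock(self) -> List[int]:
        """Unblock all cores; only cores with a pending directed event wake."""
        woken = []
        for c in range(len(self.asleep)):
            self.blocked[c] = False
            if self.pending[c]:
                self.pending[c] = False
                self.asleep[c] = False
                woken.append(c)
        return woken
```

In this model, a non-directed event wakes only the designated core, and a subsequent `unblock()` leaves every other core asleep unless a directed event was pending for it. When the awake cores request sleep again, whichever one requests last becomes the newly designated core.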