GB2525473A

GB2525473A - Modeless instruction execution with 64/32-bit addressing

Info

Publication number: GB2525473A
Application number: GB1503097.6A
Authority: GB
Inventors: Ranjit Rozario; Ranganathan Sudhakar
Original assignee: Imagination Technologies Ltd
Current assignee: Imagination Technologies Ltd
Priority date: 2014-02-25
Filing date: 2015-02-24
Publication date: 2015-10-28
Anticipated expiration: 2035-02-24
Also published as: US20150242212A1; US10671391B2; GB201503097D0; GB2525473B

Abstract

A method comprises receiving a value expressed by a number of bits equal to a general-purpose register width (e.g. 64 bits); determining whether the value is within a pre-determined range, and using only a least significant portion of the value having a pre-determined number of bits (e.g. 32 bits) to perform an arithmetic operation; and sign-extending a result of the arithmetic operation to use as an effective address for instruction execution. An immediate value may be decoded from the instruction, and the operation may be adding the immediate to the least significant portion of the value. The pre-determined ranges may be defined by most-significant bits (MSB) being all binary 1 (one) or all binary 0 (zero), where the MSBs are bits from a general-purpose register not included within the least significant portion. An address calculation unit may be provided in an instruction fetch unit, with the effective address used as a location from which to fetch instructions. If the value is not within a pre-determined range, the value may be treated as a double-word sized value, with the arithmetic operation performed using the double-word sized value and without sign-extending a result of the operation.

Description

MODELESS INSTRUCTION EXECUTION WITH 64/32-BIT ADDRESSING

BACKGROUND

Field:

[0001] In one aspect, the following relates to microprocessor architecture, and in a more particular aspect, to implementations of disclosed features.

Related Art:

[0002] An architecture of a microprocessor pertains to a set of instructions that can be handled by the microprocessor, and what these instructions cause the microprocessor to do. Architectures of microprocessors can be categorized according to a variety of characteristics. One major characteristic is whether the instruction set is considered "complex" or of "reduced complexity". Traditionally, the terms Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC) respectively were used to refer to such architectures. Now, many modern processor architectures have characteristics that were traditionally associated with only CISC or RISC architectures. In practice, a major distinction of meaning between RISC and CISC architecture is whether arithmetic instructions perform memory operations.

[0003] Processor architectures can be characterized according to a variety of parameters. One parameter is a number of bits used to address memory, a number of bits available in general purpose registers, and/or a number of bits used to represent instructions. Some architectures may not use the same number of bits for all of these purposes. For example, some processors may use a different number of bits for representing instructions than for a number of bits used to address memory, or a memory word size. In general, however, a number of bits used for all these purposes has increased throughout the years on current-generation processors (even though a wide range of processor architectures continues to exist). For example, some processor architectures originally had 4 or 8 bit memory word sizes, and have gradually increased to 16-bit, 32-bit, and now 64-bit addressing. A transition from 32-bit to 64-bit has been comparatively recent on a variety of different architectures.

SUMMARY

[0004] A question arises as to how a given family of processor architectures that transitioned from 32-bit only to supporting 64-bit addressing will or will not continue to provide support for binaries written for the 32-bit architecture. If it is desired that the 64-bit architecture continue to support existing 32-bit binaries, a suitable implementation of such 64-bit architecture must be realized.

[0005] An aspect of the disclosure relates to a computing system that comprises a memory, and a processor. The processor comprises a set of general purpose registers, being at least double-word size (word size and double word size being relative measures). The processor is configurable to execute instructions in a privileged mode and an unprivileged mode, which is controlled by a setting, which is itself maintained by a privileged resource, such as an OS kernel, or a hypervisor. The processor is capable of executing instructions in the unprivileged mode; for example, user code can be executed in unprivileged mode. The instructions comprise both arithmetic instructions and load/store instructions that load data from and store data to the memory. Some implementations may provide for indirect addressing, in which a value stored in a general purpose register can be used as an immediate value in order to calculate a target address for the load or store. Loads can be of instructions or data (and hence address calculations can be performed by one or more of a fetch unit and a load/store unit).

[0006] The load/store instructions do not differentiate between single-word and double-word values, which means that data defining the load/store instructions does not itself indicate that a value stored in a register is to be used as a single word or a double word sized immediate value. However, the processor is capable of executing load/store instructions that use general purpose registers as either single-word sized immediate values or as double-word sized immediate values, without executing privileged code at boundaries between single word code and double word code, in order to change an execution mode bit.

[0007] An aspect of the disclosure relates to a processor for executing machine executable code having different memory addressability ranges. The processor comprises a plurality of registers, each being of a register size. The processor has an instruction decoding unit configured to decode an instruction that accesses memory to obtain a register identifier. The instruction uses one of a smaller address range and a larger address range, and the larger address range is addressable using a number of bits equal to the register size. The processor also has a load store unit configured to receive a value from the register identified by the register identifier and to determine whether the value is within either of two pre-determined ranges of values addressable using a part of the bits of the value from the register, and to calculate an effective address of a memory transaction in dependence on an outcome of the determination.

[0008] An aspect of the disclosure relates to a method of machine readable code execution in a processor. The method provides for processing arithmetic instructions by decoding each arithmetic instruction to identify one or more registers identified in that instruction. The registers are from a set of registers that physically have a double-word size. The method includes determining whether the instruction specifies that the value in each of the one or more registers is to be interpreted as a single-word-sized value or a double-word-sized value. The method includes processing load/store instructions by decoding each load/store instruction to identify one or more registers, from a set of registers, identified in that load/store instruction. The method also provides for evaluating the respective data stored in each of the one or more registers to determine whether a single-word-sized portion of that register is to be used in calculating an effective address for that load/store instruction and calling a code module that uses single-word sized values for arithmetic instructions and for load/store instructions, without first executing privileged code to change an operating mode of the processor.

BRIEF DESCRIPTION OF THE DRAWING

[0009] FIGs. 1A and 1B depict block diagrams pertaining to an example processor which can implement aspects of the disclosure;

[0010] FIG. 2 depicts an example address space mapping for mapping a 32-bit byte addressable space into a 64-bit addressable space;

[0011] FIG. 3 depicts an example configuration of an implementation of a processor according to the disclosure, in which an Address Generation Unit determines how to interpret operands for a memory access instruction according to aspects of the disclosure;

[0012] FIG. 4 depicts further details of an example Address Generation Unit according to the disclosure;

[0013] FIG. 5 depicts an example process according to the disclosure;

[0014] FIG. 6 depicts example functional elements of a machine implementing the disclosure; and

[0015] FIG. 7 depicts an example machine in which aspects of the disclosure may be implemented.

DETAILED DESCRIPTION

[0016] The following disclosure uses examples principally pertaining to a RISC instruction set, and more particularly, to aspects of a MIPS processor architecture. Using such examples does not restrict the applicability of the disclosure to other processor architectures, and implementations thereof.

[0017] Currently, the family of MIPS processor architecture includes a 64-bit memory addressability architecture. MIPS 64-bit architectures can execute MIPS 32-bit binaries. However, implementations of MIPS 64-bit architecture require a mode bit to be set in a status register that indicates whether each instruction is to be processed according to a MIPS 32-bit architecture or according to the MIPS 64-bit architecture. Applicants have recognized that it is desirable to avoid having a mode bit for this purpose. One way to avoid having such a mode bit would be to provide entirely new instructions for all 64-bit memory access instructions, including loads, stores, and instructions that may modify the program counter. Applicants have realized however that providing separate 64-bit and 32-bit versions of each instruction uses a great deal of operation code space within an available code space. For example, in MIPS architecture, all instructions are 32 bits, and only 6 bits are allocated to op code. Although there are additional bits available to specify a function, in some addressing modes in MIPS, some instructions affected by memory addressability do not have any allocation of bits for function specification. Also, it is problematic from a programmer's perspective, as well as from development environment creation and maintenance, to provide different instructions between 32-bit and 64-bit architecture machines. As such, Applicants have found that another solution to supporting 32-bit code in a 64-bit machine is desired.

[0018] FIG. 1A depicts an example diagram of functional elements of a processor 50 that supports 64-bit memory addressing according to aspects of the disclosure. The example elements of processor 50 will be introduced first, and then addressed in more detail, as appropriate. This example is of a processor that is capable of out of order execution; however, disclosed aspects can be used in an in-order processor implementation. As such, FIG. 1A depicts functional elements of a microarchitectural implementation of the disclosure, but other implementations are possible. Also, different processor architectures can implement aspects of the disclosure. The names given to some of the functional elements depicted in FIG. 1A may be different among existing processor architectures, but those of ordinary skill would understand from this disclosure how to implement the disclosure on different processor architectures, including those architectures based on pre-existing architectures and even on a completely new architecture.

[0019] Processor 50 includes a fetch unit 52, which is coupled with an instruction cache 54. Instruction cache 54 is coupled with a decode and rename unit 56. Decode and rename unit 56 is coupled with an instruction queue 58 and also with a branch predictor that includes an instruction Translation Lookaside Buffer (iTLB) 60. Instruction queue 58 is coupled with a ReOrder Buffer (ROB) 62 which is coupled with a commit unit 64. ROB 62 is coupled with reservation station(s) 68 and a Load/Store Unit (LSU) 66. Reservation station(s) 68 are coupled with Out of Order (OO) execution pipeline(s) 70. Execution pipeline(s) 70 and LSU 66 each couple with a register file 72.

[0020] Register file 72 couples with an L1 data cache(s) 74. L1 cache(s) 74 couple with L2 cache(s) 76. Processor 50 may also have access to further memory hierarchy elements 78. Fetch unit 52 obtains instructions from a memory (e.g., L2 cache 76, which can be a unified cache for data and instructions). Fetch unit 52 can receive directives from branch predictor 60 as to which instructions should be fetched.

[0021] Functional elements of processor 50 depicted in FIG. 1A may be sized and arranged differently in different implementations. For example, instruction fetch 52 may fetch 1, 2, 4, 8 or more instructions at a time. Decode and rename 56 may support different numbers of rename registers and queue 58 may support different maximum numbers of entries among implementations. ROB 62 may support different sizes of instruction windows, while reservation station(s) 68 may be able to hold different numbers of instructions waiting for operands and similarly LSU 66 may be able to support different numbers of outstanding reads and writes. Instruction cache 54 may employ different cache replacement algorithms and may employ multiple algorithms simultaneously, for different parts of the cache 54. Defining the capabilities of different microarchitecture elements involves a variety of tradeoffs beyond the scope of the present disclosure.

[0022] Implementations of processor 50 may be single threaded or support multiple threads. Implementations also may have Single Instruction Multiple Data (SIMD) execution units. Execution units may support integer operations, floating point operations or both. Additional functional units can be provided for different purposes. For example, encryption offload engines may be provided. FIG. 1A is provided to give context for aspects of the disclosure that follow and not by way of exclusion of any such additional functional elements. This is a non-exhaustive enumeration of examples of design choices that can be made for a particular implementation of processor 50.

[0023] FIG. 1B depicts that register file 72 of processor 50 may include 32 registers. Each of these registers contains 64 bits in an example. Each register may be identified by a binary code associated with that register. In a simple example, 00000b identifies Register 0, 11111b identifies Register 31, and registers in between are numbered accordingly. Processor 50 performs computation according to specific configuration information provided by a stream of instructions. These instructions are in a format specified by the architecture of the processor. An instruction may specify one or more source registers, and one or more destination registers for a given operation. The binary codes for the registers are used within the instructions to identify different registers. The registers that can be identified by instructions can be known as "architectural registers", which present a large portion, but not necessarily all, of the state of the machine available to executing code. Implementations of a particular processor architecture may support a larger number of physical registers. Having a larger number of physical registers allows speculative execution of instructions that refer to the same architectural registers.

Register file 72 may have different numbers and kinds of ports in different implementations. For example, some implementations may supply two ports, while others may supply more. Some implementations may have designated read ports and write ports. In an example, the registers have more than 64 bits (e.g., 128 or 256 bits). In some implementations, registers may have more than 64 bits, and be logically divided into multiple logical general purpose registers. For example, a 128-bit register may be divided into two 64-bit logical registers. Therefore, according to some implementations, general purpose registers can be implemented by one or more of physical registers of a given size and logical registers of a given size.

[0024] FIG. 2 depicts an example mapping of a memory space addressable using only 32 bits to a memory space addressable using 64 bits (called a "64-bit memory space" for ease of reference). This example mapping provides that a lower 2 GigaBytes (GB) of address space (under byte addressing, although implementations according to the disclosure are not limited to byte addressing) is mapped to a bottom of the 64-bit memory space and an upper 2 GB of the 32-bit memory space is mapped to a top of the 64-bit memory space. In this disclosure, the upper and lower portions of memory are used in the context of the examples presented, which are generally in accordance with little endian addressing. However, aspects of the disclosure are not limited to little endian architectures, and in fact, MIPS architecture processors may operate in big or little endian modes. Those of ordinary skill would be able to implement these disclosures according to the specific circumstances presented in that implementation.

[0025] When the 64-bit processor of FIG. 1A is executing 32-bit code, an upper 32 bits of a 64-bit register is unused for memory access operations. In the context of the mapping depicted in FIG. 2, the upper 32 bits are either all binary zeros, for all memory addresses in the lower 2GB of memory, or all ones, for all memory addresses in the upper 2GB of memory. It is essential for a processor to be able to load data from memory and store data to memory.
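The mapping of FIG. 2 can be illustrated with a minimal sketch, assuming byte addressing and the all-zeros / all-ones upper halves described above. The function name `map32_to_64` is hypothetical and is used only for illustration; it is not part of the disclosure.

```python
def map32_to_64(addr32: int) -> int:
    """Place a 32-bit address into the 64-bit space by sign extension."""
    addr32 &= 0xFFFF_FFFF
    if addr32 & 0x8000_0000:
        # Upper 2 GB of the 32-bit space: upper 32 bits become all ones,
        # so the address lands at the top of the 64-bit space.
        return addr32 | 0xFFFF_FFFF_0000_0000
    # Lower 2 GB: upper 32 bits stay all zeros (bottom of the 64-bit space).
    return addr32

print(hex(map32_to_64(0x7FFF_FFFF)))  # 0x7fffffff
print(hex(map32_to_64(0x8000_0000)))  # 0xffffffff80000000
```

In this sketch the last byte of the lower 2 GB and the first byte of the upper 2 GB are adjacent in the 32-bit space but far apart in the 64-bit space, which is why address arithmetic near the boundary needs care.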

[0026] Load instructions and store instructions are provided for such purposes. One approach to addressing memory for loads and stores is to calculate the memory address and store the memory address in a register, using one instruction, and then refer to that register using a load or store instruction (a register-based addressing mode).

[0027] Some load and store instructions provide an indirect addressing mode, in which a memory address to be accessed is determined according to data in a register (a base address) and an immediate (constant) value (an offset) supplied with the load or store instruction itself. For loads or stores using indirect addressing modes, LSU 66 calculates an address using the contents of the register identified in the instruction and the supplied immediate value. However, in the absence of a mode bit indicating whether the register stores a 32-bit or 64-bit quantity, or a different opcode to distinguish 32-bit from 64-bit instructions, LSU 66 cannot properly calculate the address.

[0028] In particular, if the instruction were from 32-bit code, then the upper part of the 64-bit register would be sign-extended data. For example, for 32-bit code, when a base address is in the lower 2GB of space, and adding the immediate to the base address would transition across the 2GB boundary, the desired 32-bit address is in the upper 2GB, which is mapped to a top of the 64-bit address space, and not contiguously to the lower 2GB. Therefore, the appropriate physical address in such a situation would retain the lower 32 bits of the addition, but sign-extend the result, which in this example means that the upper 32 bits would be set to binary 1. However, if the instruction were from 64-bit code, then the appropriate physical address is contiguous with the lower 2GB, which means that the full 64 bits resulting from the addition should be maintained.
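The two interpretations described in this paragraph can be contrasted in a short sketch. The function names and the sample base and offset values are assumptions chosen for illustration only; the sketch simply shows that the same base and offset yield different addresses depending on whether the sum is treated as a 32-bit or a 64-bit quantity.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def sign_extend32(value: int) -> int:
    """Copy bit 31 of the low word into bits 32..63."""
    low = value & MASK32
    if low & 0x8000_0000:
        return low | 0xFFFF_FFFF_0000_0000
    return low

def address_for_32bit_code(base: int, offset: int) -> int:
    # Keep only the low 32 bits of the sum, then sign-extend the result.
    return sign_extend32((base & MASK32) + offset)

def address_for_64bit_code(base: int, offset: int) -> int:
    # Keep the full 64-bit sum, so the result stays contiguous with the base.
    return (base + offset) & MASK64

base, offset = 0x0000_0000_7FFF_FFFC, 8  # hypothetical values just below 2 GB
print(hex(address_for_32bit_code(base, offset)))  # 0xffffffff80000004
print(hex(address_for_64bit_code(base, offset)))  # 0x80000004
```

Without some way to choose between the two functions, an address generator cannot know which of the two results is intended.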

[0029] MIPS® 64 supports register-based 32-bit addressing on a 64-bit architecture by supplying separate instructions for 32-bit arithmetic instructions and for 64-bit arithmetic instructions. For example, a 32-bit add performs the sign extension discussed above, while the 64-bit add does not, and in each case stores the result in a destination register that is 64 bits. Then, an instruction can directly use the contents of the destination register without any concern whether the contents represent a sign-extended 32-bit quantity or a 64-bit quantity, because in each case the contents are interpreted the same. This is not the case for indirect addressing.

[0030] Focusing on a specific example for clarity, the load word (LW) instruction does not have a different version for 32 and 64 bit code. Turning to FIG. 3, there is depicted further example details of an example processor, in which an instruction unit 159 can decode a LW instruction that specifies a destination register (Rd), a source register (Rt), and an immediate (imm16). Instruction unit 159 includes PC update logic 161. A register file 161 is accessed to retrieve contents of Rt ($Rt), which are provided to LSU 66. An ALU 169 also couples with register file 161, in order to be able to access register contents for arithmetic instructions, but which would not participate in processing the LW instruction currently being addressed. An address generation unit (AGU 175) is located in Load Store Unit 66 and couples with a memory 158. AGU 175 produces an effective address, based on the contents of Rt and the imm16, and LSU 66 obtains data from memory 158 stored at that effective address. Details concerning how memory 158 may be implemented are abstracted from the present disclosure, and a wide variety of memory architectures may be supported by different implementations of the disclosure. For example, memory 158 may be implemented as including one or more layers of cache hierarchy, in addition to a main memory. LSU 66 stores the contents retrieved from the effective address in the register identified by Rd. Some implementations also may always store the retrieved data in an L1 cache 169. Other addressing modes may be supported by LSU 66, which include program counter relative loads. For such purpose, a value of program counter 170 also may be provided to LSU 66. A value of program counter 170 may be processed the same way as contents from register Rt. Different implementations may provide an incremented program counter, and this disclosure is not to be interpreted as requiring any particular approach to calculating an effective address relative to a program counter value.
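The decoding step can be sketched as follows, assuming the conventional 32-bit MIPS I-type field layout (6-bit opcode, two 5-bit register fields, 16-bit immediate). The function name `decode_lw` and the dictionary keys are hypothetical; the key `base` corresponds to the source register called Rt in the text and `dest` to the register called Rd.

```python
def decode_lw(word: int) -> dict:
    """Split a 32-bit I-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # 6-bit operation code
        "base": (word >> 21) & 0x1F,    # register holding the base address
        "dest": (word >> 16) & 0x1F,    # register that receives the loaded word
        "imm16": word & 0xFFFF,         # 16-bit immediate, not yet sign-extended
    }

# 0x8FA80004 is a load word with base register 29, destination register 8,
# and an offset of 4.
fields = decode_lw(0x8FA8_0004)
print(fields)  # {'opcode': 35, 'base': 29, 'dest': 8, 'imm16': 4}
```

Nothing in these fields says whether the base register holds a 32-bit or a 64-bit quantity, which is the gap that the address generation unit has to fill.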

[0031] As such, AGU 175 receives contents of the source register specified by the LW instruction (Rt), as well as the immediate value contained in the LW. AGU 175 then must generate the 64-bit address to be used to address the correct memory location for the LW instruction. However, AGU 175 does not have any a priori knowledge as to whether the LW instruction is from 32-bit or 64-bit code, and there is no explicit indication within the instruction data itself. Currently, a MIPS 64 machine uses a mode bit to determine whether the instruction is operating under a 32-bit mode or a 64-bit mode.

[0032] FIG. 4 presents an example implementation of AGU 175 that can determine the correct address for the LW instruction, regardless of whether the LW is from 64 bit or 32 bit code, and without using a mode bit. FIG. 5 depicts an example process that can be implemented by AGU 175. FIG. 4 depicts that AGU 175 includes comparator circuitry 210 that accepts an operand 1, an operand 2, and a comparison value. In this example, operand 1 is a value from register Rt, which holds a base address for the LW instruction, and operand 2 is a 16-bit immediate from the LW instruction. The comparison value is a definition of address ranges to be compared with the values in portions of one or more of operand 1 and operand 2. Comparator circuitry 210 outputs an indicator 211 that indicates whether the instruction should be interpreted as a 64 bit or a 32 bit instruction. If the instruction is to be interpreted as a 32-bit instruction, then AGU 175 adds a lower portion of the value of operand 1 to a sign-extended operand 2 to produce an effective memory address. If the instruction is to be interpreted as a 64-bit instruction, then both the upper and lower portions of operand 1 are used as a single value, and a sign-extended version of operand 2 is added to the value of operand 1. Concerning operand 2, this example is of a 16-bit immediate. However, operand 2 can be any of a variety of sizes. In some implementations, if immediate values are never outside of pre-determined ranges, then an explicit check for these values may be dispensed with. Comparator circuitry 210 can be implemented as a digital comparator between an upper 32 bits of Rt and each of the values 0 and 0xFFFF FFFFh.

[0033] The condition that a defined set of bits are either all one or all zero is referred to as "canonical" herein. If all arguments have canonical upper 32-bit portions, then the instruction is executed as a 32-bit instruction. If any of the arguments do not have canonical upper 32-bit portions, then the instruction is executed as a 64-bit instruction. As an example, consider a "load byte" instruction that references a base address of 0x0000 0000 7FFF FFFEh, and includes an immediate value of 4 (base 10). The base address has an upper 32 bits that are canonical and is within the lower 2GB portion. Similarly, decimal 4 is represented by 0x0000 0000 0000 0004h in a 64 bit register, such that this value also is canonical in the upper 32 bits. Thus, the load byte is treated as a 32-bit instruction, and the addition is performed on the lower 32 bits, with the result of the addition sign-extended across the full register width, resulting in a final value of 0xFFFF FFFF 8000 0002h. By contrast, if the base address in the load byte instruction were 0x0000 0001 0000 0001h, then this address is not canonical in the upper 32 bits, and hence the load byte would not be interpreted as a 32-bit instruction but rather as a 64-bit instruction, resulting in an effective address of 0x0000 0001 0000 0005h.
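The behaviour attributed to AGU 175 and comparator circuitry 210 can be modelled in software as a minimal sketch. The function names are hypothetical, the operands are assumed to be a 64-bit base and a 16-bit immediate, and the sketch is only one possible reading of the description, not the circuit itself.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a 64-bit pattern."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return ((value ^ sign) - sign) & MASK64

def is_canonical(reg_value: int) -> bool:
    """True when the upper 32 bits are all zeros or all ones."""
    upper = (reg_value >> 32) & MASK32
    return upper in (0, MASK32)

def effective_address(base: int, imm16: int) -> int:
    offset = sign_extend(imm16, 16)
    # A sign-extended 16-bit immediate is always canonical, so the second
    # check is redundant here; it is kept to mirror the "all arguments" rule.
    if is_canonical(base) and is_canonical(offset):
        # 32-bit interpretation: add the low words, then sign-extend the sum.
        return sign_extend((base & MASK32) + offset, 32)
    # 64-bit interpretation: full-width add, no sign extension of the result.
    return (base + offset) & MASK64

print(hex(effective_address(0x0000_0000_7FFF_FFFE, 4)))  # 0xffffffff80000002
print(hex(effective_address(0x0000_0001_0000_0001, 4)))  # 0x100000005
```

The first call crosses the 2GB boundary and lands in the region mapped to the top of the 64-bit space; the second call has a non-canonical base, so the sum is kept at full width.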

[0034] Instruction fetch address calculation logic (e.g., PC update logic 161 of FIG. 3) also may implement these address calculation aspects. For example, when a program counter needs to be updated to fetch a next instruction, or when a branch or jump target address needs to be calculated, the instruction fetch address calculation logic performs similar operations. By particular example, if a program counter is at an upper boundary of the lower 2GB segment, then incrementing by 4 (32 bit instructions, byte addressability) is performed to obtain the next instruction. However, to observe the address mapping of FIG. 2, this increment actually needs to be mapped to the beginning of the upper 2GB. In terms of hex addresses, if the program counter is at 0x0000 0000 7FFF FFFFh, an increment by 4 for a 32-bit program should result in an effective address that begins at 0xFFFF FFFF 8000 0003h, and not 0x0000 0000 8000 0003h. The converse calculation also can be demonstrated, in that if the base address were 0xFFFF FFFF 8000 0003h, and the offset was -4 (subtracting 4 from the base address), then both of these values are canonical in their upper 32 bits (either all ones or all zeros). Thus, the address is calculated as a 32 bit value, which means that the result of the subtraction is calculated to be 0x0000 0000 7FFF FFFFh.
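A program counter update following the same rule can be sketched as below, assuming the FIG. 2 mapping in which the upper 2GB of the 32-bit space begins at 0xFFFF FFFF 8000 0000h. The function name `next_pc` and its `delta` parameter are hypothetical.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def next_pc(pc: int, delta: int = 4) -> int:
    """Advance a 64-bit program counter, wrapping 32-bit code across 2 GB."""
    upper = (pc >> 32) & MASK32
    if upper in (0, MASK32):
        # Canonical upper half: compute in 32 bits and sign-extend the sum.
        low = (pc + delta) & MASK32
        if low & 0x8000_0000:
            return low | 0xFFFF_FFFF_0000_0000
        return low
    # Otherwise treat the counter as a full 64-bit value.
    return (pc + delta) & MASK64

print(hex(next_pc(0x0000_0000_7FFF_FFFF)))      # 0xffffffff80000003
print(hex(next_pc(0xFFFF_FFFF_8000_0003, -4)))  # 0x7fffffff
```

The two calls show the forward crossing of the 2GB boundary and the converse crossing with a negative offset.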

[0035] In some examples, these address calculation disclosures can be implemented for each address generation situation in which there is not a separate instruction for 32-bit versus 64-bit usage situations. These situations may arise in calculating effective addresses for loads and stores of data, as well as in incrementing a program counter, or determining a jump or branch target address. Therefore, a processor may be designed to implement arithmetic instructions that specify whether they are for 32 bit or 64 bit operands, while instructions that operate on memory may not specify whether operands are to be treated as 32 bit or 64 bit numbers. The combination of these approaches thus may allow dispensing with a mode bit or other condition code that indicates whether a given instruction is to be interpreted as a 64 bit or a 32 bit instruction. Rather, either that information comes from an instruction itself, and thus can be propagated from the instruction decoder, or else can be inferred from values of the operands themselves.

[0036] The examples herein primarily focus on 64-bit operands and 32-bit operands. However, this is for clarity. More generally, aspects of the disclosure apply to any processor implementation in which a sub-portion of a register of a given physical size is to be used for effective address calculation, for example. Such portions can be the same proportion, or different, e.g., a processor with a 64-bit physical register could provide for different address modes for 32-bit and 16-bit code (e.g., double or quad word addressing), or a processor with a 128-bit register could also function as such. Although 2:1 ratios are expected to be most common, that also is not a requirement.
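The generalization to other register and sub-portion sizes can be sketched by making both widths parameters. The function name and the parameter names `reg_bits` and `sub_bits` are hypothetical; the sketch assumes the same all-zeros / all-ones test applied to whichever bits lie above the sub-portion.

```python
def effective_address(base: int, offset: int,
                      reg_bits: int = 64, sub_bits: int = 32) -> int:
    """Address calculation for an arbitrary register width and sub-portion."""
    reg_mask = (1 << reg_bits) - 1
    sub_mask = (1 << sub_bits) - 1
    upper = (base & reg_mask) >> sub_bits
    all_ones = (1 << (reg_bits - sub_bits)) - 1
    if upper in (0, all_ones):
        # Narrow interpretation: add within sub_bits, then sign-extend.
        low = (base + offset) & sub_mask
        sign = 1 << (sub_bits - 1)
        return ((low ^ sign) - sign) & reg_mask
    # Wide interpretation: keep the full register-width sum.
    return (base + offset) & reg_mask

# 16-bit code on a 64-bit register (a 4:1 ratio).
print(hex(effective_address(0x7FFF, 1, reg_bits=64, sub_bits=16)))
# 0xffffffffffff8000
```

With the default arguments the function reduces to the 64/32-bit case used in the examples above.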

[0037] FIG. 5 depicts an example process according to the disclosure. At 289, an instruction is decoded to identify a register. At 290, contents of the register are accessed. At 291, it is determined whether a value in a portion (e.g., an upper portion) of the register contents is within any of one or more pre-defined ranges. If not, then the instruction is processed as a 64-bit instruction. Optionally, at 292, it can be determined whether another operand (e.g., an immediate) also is within one or more pre-defined ranges. If so, then at 293, the instruction is processed as a 32-bit instruction. If 292 is not implemented, then 293 may be implemented directly after 291.
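The decision flow of FIG. 5 can be sketched as follows, with comments keyed to the step numerals in the text. The instruction is modelled as a dictionary with hypothetical keys `base_reg` and `operand2`, and the pre-defined ranges are assumed to be the single values 0 and 0xFFFF FFFFh for the upper 32 bits.

```python
UPPER_RANGES = ((0x0000_0000, 0x0000_0000), (0xFFFF_FFFF, 0xFFFF_FFFF))

def upper_in_ranges(value: int, ranges=UPPER_RANGES) -> bool:
    """Check whether the upper 32 bits fall in any pre-defined range."""
    upper = (value >> 32) & 0xFFFF_FFFF
    return any(lo <= upper <= hi for lo, hi in ranges)

def classify(instr: dict, regfile: list, check_operand2: bool = False) -> int:
    reg_id = instr["base_reg"]            # 289: decode to identify a register
    value = regfile[reg_id]               # 290: access the register contents
    if not upper_in_ranges(value):        # 291: test the upper portion
        return 64
    if check_operand2 and not upper_in_ranges(instr["operand2"]):  # 292
        return 64
    return 32                             # 293: process as 32-bit

regfile = [0] * 32
regfile[4] = 0xFFFF_FFFF_8000_0000        # canonical (upper 2 GB)
regfile[5] = 0x0000_0001_0000_0000        # not canonical
print(classify({"base_reg": 4, "operand2": 8}, regfile))  # 32
print(classify({"base_reg": 5, "operand2": 8}, regfile))  # 64
```

Setting `check_operand2` to its default of False corresponds to proceeding from 291 directly to 293.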

[0038] It would be appreciated that a variety of logical equivalences can be used to express the operation of implementations of the disclosure. For example, rather than determining whether register contents are within a given range or ranges, it also can be determined whether or not the value is outside those range(s). Also, a variety of addressing modes and sources of operands may be provided, and the example of decoding an instruction to identify a register that sources a value is one example.

[0039] FIG. 6 depicts a block diagram of an example machine 439 in which aspects of the disclosure may be employed. A set of applications are available to be executed on machine 439. These applications are encoded in bytecode 440. Applications also can be represented in native machine code; these applications are represented by applications 441. Applications encoded in bytecode are executed within virtual machine 450. Virtual machine 450 can include an interpreter and/or a Just In Time (JIT) compiler 452. Virtual machine 450 may maintain a store 454 of compiled bytecode, which can be reused for application execution. Virtual machine 450 may use libraries from native code libraries 442. These libraries are object code libraries that are compiled for physical execution units 462. A Hardware Abstraction Layer 455 provides abstracted interfaces to various different hardware elements, collectively identified as devices 464. HAL 455 can be executed in user mode. Machine 439 also executes an operating system kernel 455. In implementations of the disclosure, code libraries 442 can be 32-bit and/or 64-bit libraries. Calls may be made from 64-bit code into 32-bit code libraries without trapping through an operating system or other privileged code to update a mode bit. Where an implementation conforms to a processor architecture that provides different 64 bit and 32 bit arithmetic instructions, and the 64 bit instructions are a superset of the 32-bit instructions, it may also be the case that the 32 bit libraries do not need to be recompiled, since the processor would be able to process those instructions as a subset of the instruction set architecture supported.

[0040] Devices 464 may include IO devices and sensors, which are to be made available for use by applications. For example, HAL 455 may provide an interface for a Global Positioning System, a compass, a gyroscope, an accelerometer, temperature sensors, network, short range communication resources, such as Bluetooth or Near Field Communication, an RFID subsystem, a camera, and so on.

[0041] Machine 439 has a set of execution units 462 which consume machine code which configures the execution units 462 to perform computation. Such machine code thus executes in order to execute applications originating as bytecode, as native code libraries, as object code from user applications, and code for kernel 455. Any of these different components of machine 439 can be implemented using the virtualized instruction encoding disclosures herein.

[0042] Implementations of the disclosure may be used to implement execution of intermingled 32-bit and 64-bit user-mode code, without executing privileged mode code to change an execution mode. For example, a processor according to the disclosure has registers that are double-word sized registers. The processor may be capable of decoding an arithmetic instruction that explicitly indicates whether register(s) identified by that instruction are to be interpreted as single-word sized values or double-word sized values. However, an instruction set capable of being decoded by the processor may not have different instructions for single-word and for double-word memory access operations. In such circumstances, the processor uses a value in one or more of the registers identified in each instruction to determine an effective address for that instruction. Such a processor, in one implementation, does not provide a mode bit indicating whether a given instruction is to be interpreted as using single-word or double-word sized operand values. Such a processor may be executing code that uses double-word sized operands, and which calls into a library that uses single-word sized operands, and does not require updating a mode bit in conjunction with such a library call. Such a processor may avoid a substantial delay and execution of additional instructions required to trap to a privileged mode code section (e.g., in a hypervisor or operating system) to change an operating mode of the processor.
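
The modeless effective-address determination described above can be sketched in software as follows. This is a behavioural model only, written in Python under stated assumptions: registers are 64 bits wide, the least significant portion is 32 bits, and the pre-determined ranges are those in which the upper 32 bits of the base register are all 0 or all 1. The names `effective_address` and `sign_extend_32` are hypothetical and do not appear in the disclosure.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def sign_extend_32(x):
    """Sign-extend the low 32 bits of x to a 64-bit value."""
    x &= MASK32
    return x | (MASK64 ^ MASK32) if x & 0x80000000 else x

def effective_address(base, imm):
    """Compute an effective address from a 64-bit base register and a signed immediate.

    No mode bit is consulted: the base register contents alone select
    between 32-bit and 64-bit address arithmetic.
    """
    upper = (base & MASK64) >> 32
    if upper == 0 or upper == MASK32:
        # In range: add using only the low 32 bits, then sign-extend the result.
        return sign_extend_32((base + imm) & MASK32)
    # Out of range: treat the base as a double-word value, with no sign extension.
    return (base + imm) & MASK64
```

For instance, under these assumptions a base of `0x7FFFFFFC` with an immediate of 8 wraps within the 32-bit space and yields `0xFFFFFFFF80000004`, whereas a base of `0x100000000` with the same immediate yields `0x100000008`.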

[0043] FIG. 7 depicts an example of a machine 505 that implements execution elements and other aspects disclosed herein. FIG. 7 depicts that different implementations of machine 505 can have different levels of integration. In one example, a single semiconductor element can implement a processor module 558, which includes cores 515-517, a coherence manager 520 that interfaces cores 515-517 with an L2 cache 525, an IO controller unit 530 and an interrupt controller 510. A system memory 564 interfaces with L2 cache 525. Coherence manager 520 can include a memory management unit and operates to manage data coherency among data that is being operated on by cores 515-517. Cores may also have access to L1 caches that are not separately depicted. In another implementation, an IO Memory Management Unit (IOMMU) 532 is provided. IOMMU 532 may be provided on the same semiconductor element as the processor module 558, denoted as module 559. Module 559 also may interface with IO devices 575-577 through an interconnect 580. A collection of processor module 558, which is included in module 559, interconnect 580, and IO devices 575-577 can be formed on one or more semiconductor elements. In the example machine 505 of FIG. 7, cores 515-517 may each support one or more threads of computation, and may be architected according to the disclosures herein.

[0044] In various parts of the disclosure, determining values relative to a program counter was disclosed. For example, some disclosed aspects relate to adding a quantity to a program counter value, or otherwise determining a target branch address. It would be understood that these disclosures include adding a quantity to another quantity determined from the program counter value (e.g., the program counter value incremented by a value indicative of an instruction size, such as 4, in a situation where instructions are 32 bits and memory is byte-addressable). As such, these disclosures are not to be interpreted to exclude implementations in which certain details may be varied according to specifics of the processor architecture or microarchitecture.
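
A minimal sketch of the program-counter-relative case mentioned above, in Python: the offset is added to the program counter incremented by the instruction size rather than to the program counter itself. The function name `branch_target`, the default size of 4 bytes, and the 64-bit wrap-around are illustrative assumptions, not details fixed by the disclosure.

```python
MASK64 = (1 << 64) - 1

def branch_target(pc, offset, instr_size=4):
    """Target address = (pc + instruction size) + signed offset, modulo 2**64."""
    next_pc = (pc + instr_size) & MASK64   # address of the following instruction
    return (next_pc + offset) & MASK64     # offset may be negative (backward branch)
```

An architecture with a different instruction size, or one that adds the offset to the branch's own address, would change only the `instr_size` term.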

[0045] Also, these address calculations can be made for any self-consistent environment. Addresses of instructions generated for 32-bit and for 64-bit code may both appear to be physical addresses, but can still be mapped or translated by a memory management unit to other addresses. Therefore, the disclosure does not imply a requirement that addresses in memory that are depicted as being contiguous are in fact physically contiguous in actual memory.

[0046] Also, the example showed a situation where a 32-bit address space was mapped in two parts to portions of a 64-bit address space. However, the disclosures can be applied to situations where a 32-bit address space is mapped to more than two portions of a 64-bit address space. Also, the mapped portions do not necessarily need to be as depicted in FIG. 2, although such variations would present complications in terms of complexity of logic required to implement the disclosed aspects.
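
One two-part mapping that is consistent with the sign-extension described in this disclosure is sketched below in Python. It is an assumption made for illustration, since FIG. 2 is not reproduced here: the lower half of the 32-bit space lands at the bottom of the 64-bit space, and the upper half lands at the top. The name `map32_to_64` is hypothetical.

```python
MASK32 = (1 << 32) - 1

def map32_to_64(addr32):
    """Map a 32-bit address into a 64-bit space by sign extension (assumed mapping)."""
    addr32 &= MASK32
    if addr32 & 0x80000000:
        # Upper half of the 32-bit space maps to the top of the 64-bit space.
        return addr32 | 0xFFFFFFFF00000000
    # Lower half maps to the bottom of the 64-bit space, unchanged.
    return addr32
```

A mapping into more than two portions would need a wider comparison than a single sign bit, which is one way to read the added logic complexity noted above.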

[0047] Modern general purpose processors regularly require in excess of two billion transistors to be implemented, while graphics processing units may have in excess of five billion transistors. Such transistor counts are likely to increase. Such processors have used these transistors to implement increasingly complex operation reordering, prediction, more parallelism, larger memories (including more and bigger caches) and so on. As such, it becomes necessary to be able to describe or discuss technical subject matter concerning such processors, whether general purpose or application specific, at a level of detail appropriate to the technology being addressed. In general, a hierarchy of concepts is applied to allow those of ordinary skill to focus on details of the matter being addressed.

[0048] For example, high level features, such as what instructions a processor supports, convey architectural-level detail. When describing high-level technology, such as a programming model, such a level of abstraction is appropriate. Microarchitectural detail describes high level detail concerning an implementation of an architecture (even as the same microarchitecture may be able to execute different ISAs). Yet, microarchitectural detail typically describes different functional units and their interrelationship, such as how and when data moves among these different functional units. As such, referencing these units by their functionality is also an appropriate level of abstraction, rather than addressing implementations of these functional units, since each of these functional units may themselves comprise hundreds of thousands or millions of gates. When addressing some particular feature of these functional units, it may be appropriate to identify constituent functions of these units, and abstract those, while addressing in more detail the relevant part of that functional unit.

[0049] Eventually, a precise logical arrangement of the gates and interconnect (a netlist) implementing these functional units (in the context of the entire processor) can be specified. However, how such logical arrangement is physically realized in a particular chip (how that logic and interconnect is laid out in a particular design) still may differ in different process technology and for a variety of other reasons. Many of the details concerning producing netlists for functional units as well as actual layout are determined using design automation, proceeding from a high level logical description of the logic to be implemented (e.g., a "hardware description language").

[0050] The term "circuitry" does not imply a single electrically connected set of circuits. Circuitry may be fixed function, configurable, or programmable. In general, circuitry implementing a functional unit is more likely to be configurable, or may be more configurable, than circuitry implementing a specific portion of a functional unit. For example, an Arithmetic Logic Unit (ALU) of a processor may reuse the same portion of circuitry differently when performing different arithmetic or logic operations. As such, that portion of circuitry is effectively circuitry or part of circuitry for each different operation, when configured to perform or otherwise interconnected to perform each different operation. Such configuration may come from or be based on instructions, or microcode, for example.

[0051] In all these cases, describing portions of a processor in terms of its functionality conveys structure to a person of ordinary skill in the art. In the context of this disclosure, the term "unit" refers, in some implementations, to a class or group of circuitry that implements the function or functions attributed to that unit. Such circuitry may implement additional functions, and so identification of circuitry performing one function does not mean that the same circuitry, or a portion thereof, cannot also perform other functions. In some circumstances, the functional unit may be identified, and then functional description of circuitry that performs a certain feature differently, or implements a new feature may be described. For example, a "decode unit" refers to circuitry implementing decoding of processor instructions. The description explicates that in some aspects, such decode unit, and hence circuitry implementing such decode unit, supports decoding of specified instruction types. Decoding of instructions differs across different architectures and microarchitectures, and the term makes no exclusion thereof, except for the explicit requirements of the claims. For example, different microarchitectures may implement instruction decoding and instruction scheduling somewhat differently, in accordance with design goals of that implementation. Similarly, there are situations in which structures have taken their names from the functions that they perform. For example, a "decoder" of program instructions, that behaves in a prescribed manner, describes structure that supports that behavior. In some cases, the structure may have permanent physical differences or adaptations from decoders that do not support such behavior. However, such structure also may be produced by a temporary adaptation or configuration, such as one caused under program control, microcode, or other source of configuration.

[0052] Different approaches to design of circuitry exist; for example, circuitry may be synchronous or asynchronous with respect to a clock. Circuitry may be designed to be static or be dynamic. Different circuit design philosophies may be used to implement different functional units or parts thereof. Absent some context-specific basis, "circuitry" encompasses all such design approaches.

[0053] Although circuitry or functional units described herein may be most frequently implemented by electrical circuitry, and more particularly, by circuitry that primarily relies on a transistor implemented in a semiconductor as a primary switch element, this term is to be understood in relation to the technology being disclosed. For example, different physical processes may be used in circuitry implementing aspects of the disclosure, such as optical, nanotubes, micro-electrical mechanical elements, quantum switches or memory storage, magnetoresistive logic elements, and so on. Although a choice of technology used to construct circuitry or functional units according to the technology may change over time, this choice is an implementation decision to be made in accordance with the then-current state of technology. This is exemplified by the transitions from using vacuum tubes as switching elements to using circuits with discrete transistors, to using integrated circuits, and advances in memory technologies, in that while there were many inventions in each of these areas, these inventions did not necessarily change how computers fundamentally worked. For example, the use of stored programs having a sequence of instructions selected from an instruction set architecture was an important change from a computer that required physical rewiring to change the program, but subsequently, many advances were made to various functional units within such a stored-program computer.

[0054] Functional modules may be composed of circuitry, where such circuitry may be fixed function, configurable under program control or under other configuration information, or some combination thereof. Functional modules themselves thus may be described by the functions that they perform, to helpfully abstract how some of the constituent portions of such functions may be implemented.

[0055] In some situations, circuitry and functional modules may be described partially in functional terms, and partially in structural terms. In some situations, the structural portion of such a description may be described in terms of a configuration applied to circuitry or to functional modules, or both.

[0056] Although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, a given structural feature may be subsumed within another structural element, or such feature may be split among or distributed to distinct components. Similarly, an example portion of a process may be achieved as a by-product or concurrently with performance of another act or process, or may be performed as multiple separate acts in some implementations. As such, implementations according to this disclosure are not limited to those that have a 1:1 correspondence to the examples depicted and/or described.

[0057] Above, various examples of computing hardware and/or software programming were explained, as well as examples of how such hardware/software can intercommunicate. These examples of hardware or hardware configured with software and such communications interfaces provide means for accomplishing the functions attributed to each of them. For example, a means for performing implementations of software processes described herein includes machine executable code used to configure a machine to perform such process. Some aspects of the disclosure pertain to processes carried out by limited configurability or fixed function circuits and in such situations, means for performing such processes include one or more of special purpose and limited-programmability hardware. Such hardware can be controlled or invoked by software executing on a general purpose computer.

[0058] Implementations of the disclosure may be provided for use in embedded systems, such as televisions, appliances, vehicles, or personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, game consoles, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets and the like.

[0059] In addition to hardware embodiments (e.g., within or coupled to a Central Processing Unit ("CPU"), microprocessor, microcontroller, digital signal processor, processor core, System on Chip ("SOC"), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL) and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in computer usable medium including non-transitory memories such as memories using semiconductor, magnetic disk, optical disk, ferrous, resistive memory, and so on.

[0060] As specific examples, it is understood that implementations of disclosed apparatuses and methods may be implemented in a semiconductor intellectual property core, such as a microprocessor core, or a portion thereof, embodied in a Hardware Description Language (HDL), that can be used to produce a specific integrated circuit implementation. A computer readable medium may embody or store such description language data, and thus constitute an article of manufacture. A non-transitory machine readable medium is an example of computer readable media. Examples of other embodiments include computer readable media storing Register Transfer Language (RTL) description that may be adapted for use in a specific architecture or microarchitecture implementation. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software that configures or programs hardware.

[0061] Also, in some cases, terminology has been used herein because it is considered to more reasonably convey salient points to a person of ordinary skill, but such terminology should not be considered to impliedly limit a range of implementations encompassed by disclosed examples and other aspects. A number of examples have been illustrated and described in the preceding disclosure. By necessity, not every example can illustrate every aspect, and the examples do not illustrate exclusive compositions of such aspects. Instead, aspects illustrated and described with respect to one figure or example can be used or combined with aspects illustrated and described with respect to other figures. As such, a person of ordinary skill would understand from these disclosures that the above disclosure is not limiting as to constituency of embodiments according to the claims, and rather the scope of the claims defines the breadth and scope of inventive embodiments herein. The summary and abstract sections may set forth one or more but not all exemplary embodiments and aspects of the invention within the scope of the claims.

Claims

CLAIMS

I claim:

1. A method implemented in a processor, comprising: receiving, in an address calculation unit, a value expressed by a number of bits equal to a width of general-purpose registers in the processor; determining whether the value is within any of a set of pre-determined numerical ranges, and responsively performing an arithmetic operation using only a least significant portion of the value, the least significant portion having a pre-determined number of bits, and sign-extending a result of the arithmetic operation to the width of the general-purpose registers; and using the sign-extended result as an effective address for executing the instruction.
2. The method implemented in a processor of Claim 1, further comprising accessing the value from a register that is identified during decoding of an instruction.
3. The method implemented in a processor of Claim 2, further comprising decoding an immediate value from the instruction, and the performing of the arithmetic operation comprises adding the immediate value to the least significant portion of the value.
4. The method implemented in a processor of Claim 1, wherein the general purpose registers have a width of 64 bits, and the pre-determined number of bits in the least significant portion of the value is 32 bits.
5. The method implemented in a processor of Claim 1, further comprising decoding the instruction to determine that the instruction is one of a load of a value from memory and a store to memory, and to determine a register identified by the instruction, and the receiving comprises receiving, at an address generation unit of a Load/Store Unit (LSU) of the processor, the value from the register identified by the instruction.
6. The method implemented in a processor of Claim 1, wherein the set of pre-determined numerical ranges are defined by most-significant bits either being all binary 1 or all binary 0, wherein the most-significant bits are the bits from a general-purpose register not included within the least significant portion.
7. The method implemented in a processor of Claim 6, wherein the general purpose registers have 64 bits, and the least significant portion is 32 bits.
8. The method implemented in a processor of Claim 1, wherein the address calculation unit is provided in an instruction fetch unit and the effective address is used as a location from which to fetch one or more instructions to be executed by the processor.
9. The method implemented in a processor of Claim 1, wherein if the value is not within any range of the pre-determined set of ranges, then determining to treat the value as a double-word sized value, performing an arithmetic operation using the double-word sized value and not sign-extending a result of the arithmetic operation.
10. The method implemented in a processor of Claim 1, further comprising using the effective address as a memory address from which to retrieve one or more bytes of data.
11. The method implemented in a processor of Claim 1, further comprising using the effective address as a memory address at which to store one or more bytes of data.
12. The method implemented in a processor of Claim 1, further comprising using the effective address as a memory address locating one or more instructions to be fetched.
13. A processor for executing machine executable code having different memory addressability ranges, comprising: a plurality of registers, each being of a register size; an instruction decoding unit configured to decode arithmetic instructions that each identify one or more source registers, and specify whether the instruction is to be executed using the entirety of each source register or a portion thereof, and to decode memory access instructions that specify a register, but do not specify whether to use the entirety of that register or a portion thereof; and a load store unit configured to process a decoded memory access instruction by calculating an effective address for the memory access instruction based either on a portion of the contents of the register identified by that instruction or an entirety of the contents of the register identified by that instruction, in dependence on whether the contents of the register are within any of a pre-determined set of ranges.
14. The processor for executing machine executable code having different memory addressability ranges of Claim 13, wherein the load store unit is configured to calculate the effective address by adding an immediate value from the decoded memory address instruction to a least significant portion of the contents of the register and sign-extending a result of the addition, if the contents of the register are within any range of the pre-determined set of ranges.
15. The processor for executing machine executable code having different memory addressability ranges of Claim 14, wherein the general purpose registers have a width of 64 bits, and the pre-determined number of bits in the least significant portion of the value is 32 bits.
16. The processor for executing machine executable code having different memory addressability ranges of Claim 13, further comprising an address generation unit located in the load store unit configured to calculate the effective address by adding an immediate value from the decoded memory address instruction to a least significant portion of the contents of the register and sign-extending a result of the addition, if the contents of the register are within any range of the pre-determined set of ranges.
17. The processor for executing machine executable code having different memory addressability ranges of Claim 13, further comprising an address generation unit located in a fetch unit and configured to calculate the effective address by adding an immediate value from the decoded memory address instruction to a least significant portion of the contents of the register and sign-extending a result of the addition, if the contents of the register are within any range of the pre-determined set of ranges.
18. The processor for executing machine executable code having different memory addressability ranges of Claim 13, wherein the set of pre-determined numerical ranges are defined by most-significant bits either being all binary 1 or all binary 0, wherein the most-significant bits are the bits from a general-purpose register not included within the least significant portion.
19. The processor for executing machine executable code having different memory addressability ranges of Claim 13, wherein the registers have 64 bits, and a least significant portion of each register is 32 bits.
20. The processor for executing machine executable code having different memory addressability ranges of Claim 13, wherein the load store unit is configured, if the value is not within any range of the pre-determined set of ranges, to treat the value as a double-word sized value, perform an arithmetic operation using the double-word sized value and not sign-extending a result of the arithmetic operation.
21. The processor for executing machine executable code having different memory addressability ranges of Claim 13, further comprising using the effective address as a memory address from which to retrieve or at which to store one or more bytes of data.