WO2009136402A2 - Register file system and method thereof for enabling a substantially direct memory access - Google Patents

Info

Publication number
WO2009136402A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
address
register file
processing
Prior art date
Application number
PCT/IL2009/000472
Other languages
French (fr)
Other versions
WO2009136402A3 (en)
Inventor
Yoav Peleg
Original Assignee
Cosmologic Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cosmologic Ltd.
Publication of WO2009136402A2
Publication of WO2009136402A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Definitions

  • the present invention relates to data processing. More particularly, the present invention relates to providing a register file system and a method thereof for enabling a substantially direct access to memory means that are coupled to a processing unit, such as a CPU (Central Processing Unit), microprocessor, and the like.
  • Fetching means retrieving an instruction from the program memory, wherein the instruction is represented by a number or by a sequence of numbers.
  • Instruction Register: a register that stores a current instruction to be executed.
  • the instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as ALU (Arithmetic Logic Unit).
  • ALU: Arithmetic Logic Unit
  • Opcode: an opcode (operation code) is the portion of a machine-language instruction that specifies an operation to be performed (e.g., addition, subtraction, and the like).
  • an instruction operand is data/value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
  • Register File: a storage unit located within the processing unit, such as the CPU. Generally, the register file is a combination of registers and combinatorial logic.

Background of the Invention

  • a conventional central processing unit operates by four steps: a) fetching; b) decoding (that involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing.
  • the first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). Instruction location in the program memory is determined by a program counter, which keeps track of the CPU processing in the current program.
  • the value of the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly.
  • the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned back to said CPU.
  • the instruction that the CPU fetches from the memory is used to determine what the CPU has to do, thus the CPU cannot proceed processing until the instruction is fetched from the memory.
  • the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU).
  • the way in which the numerical instruction value is interpreted is defined by the CPU instruction set architecture (ISA).
  • ISA: CPU instruction set architecture
  • a group of numbers in the instruction called an opcode (operation code) indicates which operation has to be performed.
  • the remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation).
  • operands may be given as a constant value (called an "immediate" value).
  • operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
  • the executing step is performed.
  • the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs.
  • the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations.
  • the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
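The four steps described above can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the tuple instruction format `(opcode, dst, src1, src2)`, the opcode names, and the function name `run` are hypothetical and are not taken from the patent text.

```python
# Toy model of the four-step cycle: fetch, decode, execute, write back.
# The tuple instruction format and the opcode names are illustrative assumptions.

def run(program, registers):
    pc = 0  # program counter: index of the next instruction in program memory
    while pc < len(program):
        instruction = program[pc]               # a) fetch
        opcode, dst, src1, src2 = instruction   # b) decode
        pc += 1                                 # default: next-in-sequence instruction
        if opcode == "JUMP":
            pc = dst                            # a JUMP overrides the incremented value
            continue
        a, b = registers[src1], registers[src2]  # decoding also reads the register file
        if opcode == "ADD":                     # c) execute
            result = a + b
        elif opcode == "SUB":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dst] = result                 # d) write back
    return registers
```

Note that every operand is read from, and every result is written to, the small local register list; this is the constraint the rest of the background section discusses.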
  • the above four CPU steps have to be performed relatively fast.
  • non-local memory means such as cache, on-board memory (e.g., DRAM (Dynamic Random Access Memory)), secondary memory and the like
  • the access time is greatly increased leading to significant delays and to a waste of the valuable CPU processing resources. In turn, it greatly decreases CPU performance and consumes most of the CPU processing time.
  • a conventional processing unit e.g., CPU
  • uses a limited set of registers e.g., 32 or 64 registers
  • the register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like.
  • this register file is a portion of the CPU, and it is located in physical proximity to the ALU (Arithmetic Logic Unit) of said CPU.
  • One of the reasons for having a limited local register file is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via the "source 1" input of the local register file) storing the first value to be processed by the ALU, another register (accessed via the "source 2" input of the local register file) storing the second value to be processed by said ALU, and the last register (the destination register, accessed via the "destination" input of the local register file) storing the result value of the ALU processing (e.g., the sum of the values stored within said sources 1 and 2). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low.
  • VLIW: Very Long Instruction Word
  • the CPU register file relatively rarely reaches a capacity of 256 registers.
  • Another reason for having a limited number of registers in the CPU register file is hardware limitations related to fast memory access and to the capability of using a relatively large number of ports.
  • a conventional ALU requires providing at least two read ports and one write port.
  • a conventional system that implements a CPU also usually contains a memory controller (that comprises an MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, DMA engine, communication controllers (e.g., a USB controller) and the like.
  • the memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
  • the CPU local register file is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file (e.g., cache, secondary memory, etc.).
  • the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped registers (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file.
  • the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral.
  • a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers.
  • the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing.
  • LOAD or multi-LOAD command for loading data from an external register/memory means (device register, cache memory, etc.);
  • ALU data processing command for executing various operations (e.g., addition, subtraction);
  • STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means - for this, even if working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
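The LOAD, ALU and STORE sequence listed above can be made concrete with a small model. This is a minimal sketch under stated assumptions: memory is modeled as a dictionary, and the function name, register numbers and addresses are hypothetical. It uses two separate LOAD commands; a multi-LOAD, where available, would merge them into one.

```python
# Conventional sequence for "memory[dst] = memory[a] + memory[b]" when the operands
# live outside the local register file. Addresses and register numbers are hypothetical.

def add_via_local_registers(memory, addr_a, addr_b, addr_dst):
    regs = [0] * 32   # local register file (32 registers, as in the text's example)
    issued = []       # record of the commands the CPU had to generate

    regs[1] = memory[addr_a]        # LOAD the first operand into a local register
    issued.append("LOAD")
    regs[2] = memory[addr_b]        # LOAD the second operand
    issued.append("LOAD")
    regs[3] = regs[1] + regs[2]     # the ALU operates on local registers only
    issued.append("ADD")
    memory[addr_dst] = regs[3]      # STORE the result back to the mapped location
    issued.append("STORE")
    return issued
```

Counting the entries of the returned list shows why a single arithmetic operation on memory mapped data costs several commands in the conventional scheme.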
  • DMA: Direct Memory Access
  • DMA operations can be conducted in parallel with CPU operations.
  • dedicated hardware is required and the DMA engines need to be configured and enabled by the CPU; further, it is applicable only when no data processing (or substantially negligible data processing) is required.
  • Fig. 1A is a schematic block-diagram 100 of a conventional processing unit, according to the prior art.
  • an instruction that comprises an opcode and one or more operands
  • the program memory e.g., RAM 140
  • instruction register 105 is 32 bits long (bits [0...31]), wherein the first six bits [0...5] of the instruction provided within said instruction register 105 are the opcode (that defines the operation to be performed, e.g., addition, subtraction, etc.); bits [11...15] are an address of the destination register within a register file 106 (the address of a register in which the result of ALU processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated (processed); and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (with
  • the addresses of the above Sources 1 and 2 are inputted into decoder(s) 120' of register file 106, and as a result, the data of corresponding registers (to which said addresses are related) of said register file 106 is outputted over data bus (one or more lines) 141.
  • the next step is based on the specific instruction to be processed, and can be, for example: a) reading data from on-CPU-chip (inside the CPU chip) memory/peripherals, or off-CPU-chip (outside the CPU chip) memory/peripherals (by establishing a LOAD command); b) storing data within said memory/peripherals (by establishing a STORE command); and/or c) activating execution unit 130 (e.g., ALU) for performing a mathematical operation, such as addition, subtraction, multiplication, division: in this case, the operands for the execution unit processing are determined by means of control unit 115.
  • the result is written back into the destination register within said register file 106 (the destination register address is defined by bits [11...15] of the executed instruction). Further, the result can be written back into the CPU memory means/peripherals 160 by means of Input/Output Control Unit 150 over bus 108 (by accomplishing a STORE command). Then, the cycle is started over with the next instruction to be further fetched, decoded and executed. Since a program counter 110 holds an address of the current instruction to be executed (and points to a corresponding RAM 140 memory address by means of address bus 119), the CPU always "knows" where within said RAM 140 the next instruction can be found. Each time the instruction is completed, program counter 110 is incremented by at least one memory address location; also, for example, when the instruction is a conventional JUMP or BRANCH command, the program counter is changed accordingly.
  • CPU register file 106 is local, and it is a portion of CPU chip (core).
  • I/O control unit 150 e.g., comprising memory controller or memory management unit (MMU)
  • MMU: memory management unit
  • ALU operations are not performed directly on the data stored within CPU mapped peripherals/memory means, and these peripherals/memory means are not accessed directly by means of said ALU 130: the data inputted into the ALU is incoming from local register file 106, to which it is loaded from corresponding memory/peripherals by means of Input/Output (I/O) control unit 150, for example. Therefore, for performing manipulation on data stored outside local register file 106, the data has first to be loaded into said local register file 106 by means of I/O control unit 150, thereby executing a LOAD command, and loading the data into the CPU local register file over load/store bus 108.
  • control unit 115 (over control bus 121), which can comprise a controller 126, multiplexers 125, decoders 120" and the like.
  • Control unit 115 receives data to be processed from local register file 106 over data bus 141, and it controls execution unit 130 processing by sending to said execution unit 130 a control signal over bus 121 in accordance with the instruction opcode.
  • execution unit 130 receives the corresponding instruction operands to be processed from said control unit 115, and outputs a result of said processing over bus 108.
  • Fig. 1B is a schematic illustration of a conventional (local) register file 106, according to the prior art.
  • instruction register (IR) 105 (Fig. 1A) is 32 bits long [0...31] bits, wherein bits [11...15] are an address of the destination register within register file 106 (the address of a register in which the result of ALU 130 (Fig.
  • bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated; and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register).
  • the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as an "immediate" value, auto-increment, etc.
  • the addresses of the above Sources 1 and 2 are inputted into register file 106 (IR 16-20 and IR 21-25, respectively) and conveyed to decoders 120 (Fig. 1A). Decoders 120 decode the addresses and enable outputting data of corresponding registers of register file 106 towards the ALU for further processing of said data (e.g., addition, subtraction of the data and the like). Thus, the data is outputted through the "Source 1 Data" and "Source 2 Data" outputs, each having a length of 32 bits. After ALU 130 processes said data, it stores the result (32 bits long) in a destination register within register file 106 (the destination register is defined by the IR 11-15 address).
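The field layout of the 32-bit instruction described above can be illustrated by extracting each field with shifts and masks. This is a sketch only: the text gives the bit ranges but not the bit ordering, so treating bit 0 as the least significant bit is an assumption, and the helper names `bits` and `decode` are hypothetical.

```python
# Field extraction for the 32-bit instruction word of the prior-art example.
# Assumption: bit 0 is the least significant bit (the text does not state the ordering).

def bits(word, first, last):
    """Return bits first..last (inclusive) of word as an unsigned integer."""
    width = last - first + 1
    return (word >> first) & ((1 << width) - 1)

def decode(word):
    return {
        "opcode": bits(word, 0, 5),    # six bits: the operation to perform
        "dest": bits(word, 11, 15),    # five bits: destination register address
        "src1": bits(word, 16, 20),    # five bits: Source 1 register address
        "src2": bits(word, 21, 25),    # five bits: Source 2 register address
    }
```

Because each register address field is five bits wide, such an instruction can name at most 2^5 = 32 registers, which matches the 32-register local register file mentioned in the text.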
  • US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and in turn, accepting destination operands for instruction execution.
  • the different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures.
  • the cache memories may be directly accessed by using cache addresses.
  • US 6,178,482 presents a data processing apparatus which uses a register file to provide a faster alternative to indirect memory addressing.
  • a functional unit is connected to a data register file which comprises a plurality of registers, each of which is accessed by a corresponding register number.
  • the functional unit of US 6,178,482 can execute at least one indirect register access instruction that comprises an operand register number field.
  • Instruction decode circuitry connected to the register file and the functional unit, is responsive to the indirect register access instruction to recall data stored in an operand register specified by the operand register number in the instruction, identify the recalled data as a register access number, and recall operand data from a data register corresponding to the register access number for use as an operand by the functional unit.
  • the present invention has many advantages over the prior art.
  • one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data by providing a substantially direct memory means access for one or more CPU execution units (for processing the data).
  • the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle.
  • Another advantage of the present invention is that it can significantly expand the conventional CPU register file to the entire (complete) CPU memory map, thereby providing a novel CPU architecture and enabling substantially direct memory access.
  • the expanded register file of said CPU can be further shared with other CPUs, or with other internal/external (on-chip/off-chip) peripherals or devices.
  • Still another advantage of the present invention is that it provides a method and system in which there is substantially no need to change the structure of the conventional CPU program word in order to reduce the number of instructions and CPU clock cycles required for manipulating/processing memory mapped register data.
  • Still another advantage of the present invention is that it eliminates the need to use conventional DMA engines.
  • a further advantage of the present invention is that it provides a method and system, in which the size of external memory means of conventional processing devices (such as conventional cache or tightly coupled memories, as used in the prior art architectures) can be significantly reduced and/or the need for using the external memory means can be eliminated.
  • Still a further advantage of the present invention is that it provides a method and system, in which CPU stalls (delays) are substantially prevented.
  • the present invention relates to providing a register file system and a method thereof for enabling a substantially direct access to memory means that are coupled to a processing unit, such as a CPU (Central Processing Unit), microprocessor, and the like.
  • the register file system comprises: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b) at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c) at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address.
  • the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
  • the one or more mapped addresses are provided within an instruction to be processed.
  • the register file system further comprises at least one address generator for generating the at least one mapped address.
  • an instruction to be processed comprises data based on which the at least one mapped address is generated.
  • the register file system further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
  • each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
  • each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
  • the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
  • At least a portion of the register file system is incorporated within a processing unit.
  • the register file system is used by means of at least one processing unit.
  • the register file system further comprises one or more execution units for processing at least a portion of the data outputted from said register file system.
  • the one or more execution units process the data outputted from said register file system according to an instruction opcode.
  • the execution unit is an Arithmetic Logic Unit.
  • the register file system further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
  • the register file system further comprises a program counter for providing an address of the next instruction to be processed.
  • At least one data unit is shared between two or more processing units.
  • register file system comprises: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1. receive at least one mapped address; a.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and a.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and b) at least one output port for outputting said data to be processed from said one or more memory cells.
  • the processing unit device comprises: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; a.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and a.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
  • the processing unit device further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
  • the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
  • the one or more operands are the mapped addresses.
  • the processing unit device further comprises at least one address generator for generating the one or more mapped address.
  • one or more mapped addresses are generated according to data provided within an instruction to be processed.
  • the processing unit device further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
  • each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
  • each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
  • the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
  • the at least one execution unit processes the data outputted from the register file system according to the instruction opcode.
  • the execution unit is an Arithmetic Logic Unit.
  • the processing unit device further comprises a program counter for providing an address of the next instruction to be processed.
  • a processing unit device comprises: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1.1. receive at least one mapped address; a.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and a.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and a.2. at least one output port for outputting said data to be processed from said one or more memory cells; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
  • the method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands, wherein at least one of said operands is a PU mapped address; d) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; e) enabling reading the data stored in the at least one data unit addresses; f) processing the read data; and g) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
  • method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands; d) generating a corresponding PU mapped address for the at least one operand; e) performing second decoding or converting each generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the data unit addresses; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
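The method steps above (first decoding, mapped address generation, second decoding, read, process, write back) can be walked through in a short sketch. Everything concrete here is an assumption made for illustration: the two data units and their base addresses, the base-plus-offset address generator, the opcode names, and all function names are hypothetical and are not specified by the patent text.

```python
import operator

# Each data unit owns a range of the PU memory map: name -> (base address, cells).
# The units and base addresses are illustrative assumptions.
DATA_UNITS = {
    "local_regs": (0x0000, [0] * 16),
    "peripheral": (0x8000, [0] * 16),
}
OPERATIONS = {"ADD": operator.add, "SUB": operator.sub}

def generate_mapped_address(operand):
    """Address generation: build a PU mapped address from operand data (base + offset)."""
    base, offset = operand
    return base + offset

def convert(mapped_address):
    """Second decoding: mapped address -> (cells of the owning data unit, cell index)."""
    for base, cells in DATA_UNITS.values():
        if base <= mapped_address < base + len(cells):
            return cells, mapped_address - base
    raise ValueError(f"unmapped address: {mapped_address:#x}")

def process_instruction(instruction):
    opcode, dst, src1, src2 = instruction            # first decoding: opcode and operands
    addresses = [generate_mapped_address(op) for op in (dst, src1, src2)]
    (d_cells, d_i), (a_cells, a_i), (b_cells, b_i) = map(convert, addresses)
    a, b = a_cells[a_i], b_cells[b_i]                # read the data unit cells
    result = OPERATIONS[opcode](a, b)                # process the read data
    d_cells[d_i] = result                            # write back into the data unit
    return result
```

In this model a single instruction reads one operand from the local registers and one from the peripheral, and writes the result directly into the peripheral, with no separate LOAD or STORE step.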
  • Fig. 1A is a schematic block-diagram of a conventional processing unit, according to the prior art
  • Fig. 1B is a schematic illustration of a conventional (local) register file, according to the prior art
  • Fig. 2A is a schematic illustration of connecting a spread register file system to an instruction register and to an execution unit (such as ALU), according to an embodiment of the present invention
  • Fig. 2B is another schematic illustration of connecting a spread register file system to an instruction register and to an execution unit (such as ALU), according to another embodiment of the present invention
  • Fig. 3 is a schematic illustration of a spread register file system, according to an embodiment of the present invention.
  • Fig. 4 is a pipeline representation of operating with a spread register file system, according to an embodiment of the present invention.
  • spread register file system or "SRF" system
  • SRF: spread register file
  • LRF: local register file
  • the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means.
  • CPU: processing unit
  • processing: a data operation such as data manipulation, data transfer, addition or subtraction of data and the like.
  • Fig. 2A is a schematic illustration of connecting a spread register file system 206 to instruction register 105 and to execution unit 130 (such as ALU), according to an embodiment of the present invention.
  • spread register file system 206 relates to the entire (complete) CPU mapped memories: cache memories, on-chip peripherals (e.g., RAM, SRAM), tightly coupled memories, on-board memories (e.g., DRAM), secondary memories (e.g., off-chip peripherals, hard disks, etc.), and any other memory means (e.g., CDs (Compact Discs), DVDs (Digital Versatile Discs), etc.).
  • spread register file system 206 comprises conventional peripheral address converters, as presented in Fig. 3.
  • Each peripheral address converter is used for converting a CPU memory mapped address to corresponding peripheral (device) address (e.g., the peripheral device can be a USB device, cache memory, RAM, SRAM, tightly coupled memory, DRAM, hard disks, CD, DVD, etc.).
  • one peripheral address converter converts a CPU mapped address of "source 1" register/memory means to corresponding address of said register/memory means within the corresponding peripheral that actually stores said data (to be processed by means of execution unit 130, such as ALU); another peripheral address converter converts a CPU mapped address of "source 2" register/memory means that stores additional data to be processed by means of said execution unit 130; and still another one peripheral address converter - converts a CPU mapped address of "destination" register/memory means, in which a result of the above execution unit 130 processing (e.g., addition, subtraction) will be stored.
  • the peripheral address converter can be implemented either in hardware and/or in software.
  • instruction register 105 contains a VLIW program word, which can be, for example, 128 or 256 bits long.
  • VLIW program word is 128 bits long, wherein the length of each one of the followings: opcode 221', CPU mapped "source 1" address 222', CPU mapped "source 2" address 223' and CPU mapped destination address 224' can be, for example, 32 bits long.
  • the VLIW program word is 256 bits long.
  • Each of the above addresses relates to a specific address within the entire CPU memory map, and is represented, for example, by a 2^32 or 2^64 binary number, respectively.
  • the command for performing such an operation is provided into said execution unit 130 via control bus 234. Then, after accomplishing the operation, the corresponding result is written back into the destination register/memory means (for example, located within the corresponding peripheral device, such as a USB device) over data bus 233, whose address is defined by the CPU mapped destination address 224' of the VLIW program word.
  • the destination register/memory means for example, located within the corresponding peripheral device, such as a USB device
  • I/O control units e.g., memory management units
  • CPU mapped memory means such as cache memories, off-CPU-chip memories and other memory means.
  • executing unit 130 is enabled to operate substantially directly on each of said CPU mapped memory means provided within spread register file system 206, i.e. is enabled to execute instructions without the need for generating and performing LOAD commands (loading data into said spread register file system 206 from external memory means) and corresponding additional STORE commands for storing the result of executing unit 130 operation externally to said spread register file system 206.
  • spread register file system 206 can operate with more than one executing unit 130. Further, instruction register 105 and/or executing unit 130 can be provided within said spread register file system 206.
  • spread register file system 206 can be provided on-CPU-chip (incorporated within a CPU) or off-CPU-chip. Further, according to still another embodiment of the present invention, a portion of spread register file system 206 can be provided on-CPU-chip and another portion off-CPU-chip.
  • Fig. 2B is another schematic illustration of connecting spread register file system 206 to instruction register 105 and to execution unit 130 (such as ALU), according to another embodiment of the present invention.
  • "source 1" address 221", “source 2" address 222" and "destination” address 223" are inputted from instruction register 105 into address generator 250 for generating corresponding addresses being related to the entire CPU mapped memory.
  • address 221", 222" and 223" can be for example each 5 bits long, and the CPU mapped "source 1" address, CPU mapped "source 2" address and CPU mapped "destination" address are each 32 bits long (if implemented for MIPS32 CPU), since each of these mapped addresses relates to the entire CPU memory map that is represented by 2^32 addresses. Similarly, for the MIPS64 CPU implementation, each of these mapped addresses is 64 bits long. It should be noted that address generator 250 can generate addresses in various ways based on different address generating functions.
  • address generator 250 can receive in its input a 5 bits long address (represented by a 2^5 binary number) from instruction register 105, and then it can add this address to another 2^32 number, thereby generating a new CPU mapped address that is 32 bits long.
  • the above 2^32 number can be a predefined number, random number or a number that is calculated (generated) by means of address generator 250 according to some predefined function(s)/expressions.
  • said new CPU mapped address can be generated according to opcode 221" that can be inputted from instruction register 105 into said address generator 250.
  • each of the addresses outputted from instruction register 105 over lines 251, 252 and 253 can be further related to corresponding registers within address generator 250 (e.g., said each of the addresses outputted from said instruction register 105 can be related to a different base address of 32-bits (for MIPS32 technology), or 64-bits (for MIPS64 technology) provided within said address generator 250, based on which a CPU mapped address can be generated). Thus, each CPU mapped address (to be outputted from address generator 250) can be generated according to values stored within these corresponding registers.
  • the "source 1" generated CPU mapped address (provided from address generator 250 over line 212) can be the sum of: a value of operand 222" and the corresponding "source 1" base address stored within said address generator 250.
  • the "source 2" (or “destination”) generated CPU mapped address can be the sum of: a value of operand 223" (or 224") and the corresponding "source 2" (or “destination”) base address stored within said address generator 250.
  • address generator 250 stores CPU mapped addresses (e.g., 32 or 64 bits long), and for each operand 222", 223" and 224", said address generator 250 outputs a corresponding CPU mapped address.
  • Fig. 3 is a schematic illustration of spread register file system 206, according to an embodiment of the present invention.
  • Spread register file system 206 receives as inputs: CPU mapped "source 1" address (MS1 address) over bus (line) 212, CPU mapped "source 2" address (MS2 address) over bus 213 and CPU mapped "destination" address (MD address) over bus 214 (each 32 bits long, for example).
  • MS1 address CPU mapped "source 1" address
  • MS2 address CPU mapped "source 2" address
  • MD address CPU mapped "destination" address
  • addresses are converted by means of address converters 320, 321 and 322, respectively to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302, 303, ..., 310 (e.g., cache memories, tightly coupled memories, secondary memories, SRAM, DRAM, disk-on-keys, hard disks, CDs (Compact Discs), DVDs, or any other peripheral/memory means).
  • peripheral/memory means such as peripheral/memory means 301, 302, 303, ..., 310
  • said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
  • the above address can be converted according to opcode 221' or 221" (Figs. 2A or 2B) that can be inputted from instruction register 105 (Fig.
  • control unit 350 which generates corresponding control signals to address converters 320, 321 and 322 and to executing unit 130: for example, if opcode 221' or 221" relates to moving "source 1" data to the "destination" register, then only address converters 320 and 322 can be activated.
  • the converted "source 1" and "source 2" addresses are inputted into corresponding peripheral/memory means (e.g., peripheral/memory means 301, 302, 303,..., N), which in turn output corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232.
  • peripheral/memory means e.g., peripheral/memory means 301, 302, 303,..., N
  • said data is processed (executed) by means of one or more execution units 130 (such as ALUs).
  • the processing result is provided over write back bus 233 to one or more peripheral/memory means to be stored in corresponding converted destination addresses (CD addresses) within said one or more peripheral/memory means.
  • the "source 1", “source 2", and “destination” memory cells can be physically located within the same or within different peripheral/memory means (such as peripheral/memory means 301, 302, 303,..., N).
  • address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary "0" or "1") to each of said peripheral/memory means 301, 302, 303, ...,N for enabling reading or writing from or to said peripheral/memory means (data units) 301, 302, 303, ..., N.
  • WE/CS commands can be provided to each of said peripheral/memory means (data units) 301, 302, 303, ..., N when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302, 303, ..., N.
  • CS read command
  • WE write command
  • address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", “source 2" and “destination” addresses into corresponding peripheral/memory means addresses.
  • the address decoding (or the address conversion) is performed within one or more peripherals/memory means 301, 302, 303, ..., N.
  • Peripherals/memory means 301, 302, 303, ..., N can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses within said peripherals/memory means 301, 302, 303, ..., N, in which the required data is stored (or to be stored).
  • blocks 320, 321 and 322 can provide WE/CS commands to peripherals/memory means 301, 302, 303, ..., N, and do not perform the address conversion. Further, it should be noted that WE/CS commands can be generated by means of control unit 350.
  • an address converter (such as address converter 320, 321 or 322) can be incorporated (integrated) within each (or one or more) peripheral/memory means 301, 302, 303, ..., or N.
  • said peripheral/memory means receives a CPU mapped address and determines by means of the integrated address converter (according to predefined base-addresses of said peripheral/memory means), whether the received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. It should be noted that the base-addresses of each peripherals/memory means can be further dynamically changed upon the need.
  • Fig. 4 is a pipeline representation 400 of operating with spread register file system 206 (Fig. 2B), according to an embodiment of the present invention.
  • pipeline has 12 stages (T0 to T11), each of which can correspond to a single CPU clock cycle.
  • the address of the instruction (program word) to be fetched is conveyed from the Program Counter (PC) to the CPU program memory (e.g., RAM (not shown)).
  • the instruction (program word) is fetched from said program memory into instruction register 105 (Fig. 2A).
  • the fetched instruction decoding is performed by means of control unit 350 (Fig. 3), which provides control signals during pipeline stages.
  • address generator 250 Fig.
  • the next program counter address can be determined, based on the decoded instruction. Also, for example, when a conventional JUMP or BRANCH command is issued, then the next program counter address is calculated in accordance with a pointer of said JUMP/BRANCH command.
  • the CPU mapped addresses are converted by means of address converters 320, 321 and 322 (Fig. 3), respectively to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302 and 303 (Fig. 3) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means).
  • peripheral/memory means such as peripheral/memory means 301, 302 and 303 (Fig. 3) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means).
  • said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
  • the above address can be converted according to opcode 221' or 221" (Figs. 2A or 2B) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350 (Fig.
  • said peripherals/memory means 301, 302, 303, ..., N generate a READ request to their internal memory, thereby enabling reading corresponding data stored within them at the received converted addresses.
  • said corresponding data is read and ready, and then at stage T7, said data is latched and conveyed to the "source 1" and "source 2" read buses 231 and 232, respectively.
  • the data is provided over said read buses 231 and 232 into execution unit 130 (e.g., ALU).
  • execution unit 130 e.g., ALU
  • T9 and T10 the data is processed by means of said execution unit 130. It should be noted that the data can be processed only in stage T9, and no further processing can be required.
  • the pipeline can have, for example, only 11 stages (T0 to T10).
  • the processing result is written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302, 303, ..., N. It should be noted that the writing back operation can take more than a single CPU clock cycle.
  • CPU control unit 350 controls the pipeline process by generating required control signals during the pipeline stages.
  • CPU stalls are reduced or substantially eliminated. For example, there can be substantially no CPU stalls if the access latency to SRF system 206 is 6 (or less) CPU clock cycles and the pipeline is relatively deep (e.g., 12 stages).
  • a number of instructions and CPU clock cycles required for manipulating/processing is significantly reduced, compared to the prior art.
  • the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle, enabling providing a substantially direct access between execution unit 130 and peripherals/memory means 301, 302, 303,...N.
  • spread register file system 206 can be shared between two or more processing units (e.g., CPU, microprocessor, and the like), and/or between other internal/external (on-chip/off-chip) peripherals or devices.
  • processing unit e.g., CPU, microprocessor, and the like
  • the structure of a conventional CPU program word, compared to the prior art, is not changed.
  • the need for using conventional DMA engines is eliminated.
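
As a non-limiting illustration of the address generation and address conversion described above, the following minimal sketch models an address generator that adds a short operand to a stored base address, an address converter that maps a CPU mapped address to a data unit and a local address, and an execution step that operates on the data units directly, without LOAD or STORE commands. All class names, function names, base addresses and sizes are assumptions introduced for this example and do not appear in the specification.

```python
# Sketch of the spread register file (SRF) idea: operands in the program word
# are turned into CPU mapped addresses, each mapped address is converted to a
# (data unit, local address) pair, and the execution unit works on that data
# directly, with no LOAD or STORE commands.
# All names, base addresses, and sizes here are hypothetical.

class DataUnit:
    """A peripheral/memory means with its own base address in the CPU map."""
    def __init__(self, name, base, size):
        self.name, self.base, self.size = name, base, size
        self.cells = [0] * size

    def owns(self, mapped_addr):
        return self.base <= mapped_addr < self.base + self.size


def generate_address(operand, base):
    """Address generator: short operand plus a stored base address."""
    return (base + operand) & 0xFFFFFFFF   # 32-bit CPU mapped address


def convert_address(mapped_addr, units):
    """Address converter: CPU mapped address -> (data unit, local address)."""
    for unit in units:
        if unit.owns(mapped_addr):
            return unit, mapped_addr - unit.base
    raise LookupError(f"unmapped address {mapped_addr:#x}")


def execute(op, src1_addr, src2_addr, dest_addr, units):
    """Read both sources, process them, and write the result back."""
    u1, a1 = convert_address(src1_addr, units)
    u2, a2 = convert_address(src2_addr, units)
    ud, ad = convert_address(dest_addr, units)
    result = op(u1.cells[a1], u2.cells[a2])
    ud.cells[ad] = result
    return result


sram = DataUnit("sram", base=0x1000_0000, size=64)
dram = DataUnit("dram", base=0x8000_0000, size=64)
units = [sram, dram]
sram.cells[3] = 40
dram.cells[5] = 2

src1 = generate_address(3, base=0x1000_0000)   # 5-bit operand + base
src2 = generate_address(5, base=0x8000_0000)
dest = generate_address(7, base=0x8000_0000)
print(execute(lambda a, b: a + b, src1, src2, dest, units), dram.cells[7])
```

In this sketch the two sources reside in different data units and the result is written to a third location, mirroring the statement that the "source 1", "source 2" and "destination" memory cells can be located within the same or within different peripheral/memory means.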

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present invention relates to a register file system, comprising a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to receive at least one mapped address, decode the received mapped address and determine corresponding at least one memory data unit address, output data to be processed from one or more memory cells that correspond to said at least one memory data unit address, and store data within one or more memory cells that correspond to said at least one memory data unit address, and at least one output port for outputting said data to be processed from said one or more memory cells. The register file system is incorporated within a processing unit that further comprises at least one execution unit for receiving the data outputted from said register file system and processing it.

Description

REGISTER FILE SYSTEM AND METHOD THEREOF FOR ENABLING A SUBSTANTIALLY DIRECT MEMORY ACCESS
Field of the Invention
The present invention relates to data processing. More particularly, the present invention relates to providing a register file system and a method thereof for enabling a substantially direct access to memory means that are coupled to a processing unit, such as a CPU (Central Processing Unit), microprocessor, and the like.
Definitions, Acronyms and Abbreviations
Throughout this specification, the following definitions are employed:
Fetching: means retrieving an instruction from the program memory, wherein the instruction is represented by a number or by a sequence of numbers.
Instruction Register: a register that stores a current instruction to be executed. The instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as ALU (Arithmetic Logic Unit).
Opcode: an opcode (operation code) is the portion of a machine-language instruction that specifies an operation to be performed (e.g., addition, subtraction, and the like).
Operand: an instruction operand is data/value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
Register File: is a storage unit located within the processing unit, such as the CPU. Generally, the register file is a combination of registers and combinatorial logic.
Background of the Invention
The past decade is characterized by dramatic developments in the field of computers. For executing and processing most recently developed computer applications, fast and powerful computer processing units are required. In general, according to the prior art, a conventional central processing unit (CPU) operates in four steps: a) fetching; b) decoding (that involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing. The first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). Instruction location in the program memory is determined by a program counter, which keeps track of the CPU processing in the current program. After the instruction is fetched from the memory, the value of the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly. Often, the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned to said CPU. The instruction that the CPU fetches from the memory is used to determine what the CPU has to do, thus the CPU cannot proceed processing until the instruction is fetched from the memory. After that, at the decoding step, the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU). The way in which the numerical instruction value is interpreted is defined by the CPU instruction set architecture (ISA). Often, a group of numbers in the instruction, called an opcode (operation code), indicates which operation has to be performed.
The remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation). Such operands may be given as a constant value (called an "immediate" value). Alternatively, operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
After the fetching and decoding steps, the executing step is performed. During this step, the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs. Generally, the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations. Finally, at the write back step, the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
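For illustration only, the four conventional steps described above (fetching, decoding, executing and writing back) can be sketched as a toy interpreter. The tuple instruction format, the function name `run` and the two opcode names are assumptions made for this example; they are not taken from any particular CPU.

```python
# Toy model of the conventional fetch / decode / execute / write-back cycle.
# The instruction format (opcode, dest, src1, src2) and the opcode names are
# hypothetical; they only illustrate the four steps described in the text.

def run(program, registers):
    """Execute `program` (a list of instruction tuples) on a register list."""
    alu = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
    }
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instruction = program[pc]                 # a) fetch from program memory
        pc += 1                                   #    PC advances by one instruction
        opcode, dest, src1, src2 = instruction    # b) decode into fields ...
        a, b = registers[src1], registers[src2]   #    ... and read the register file
        result = alu[opcode](a, b)                # c) execute in the ALU
        registers[dest] = result                  # d) write back to the register file
    return registers


regs = run([("ADD", 2, 0, 1), ("SUB", 3, 2, 0)], [5, 7, 0, 0])
print(regs)  # [5, 7, 12, 7]
```

A JUMP or BRANCH command would, in such a model, overwrite `pc` instead of letting it advance by one; this is omitted here for brevity.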
For achieving good performance, the above four CPU steps have to be performed relatively fast. However, when working with non-local memory means, such as cache, on-board memory (e.g., DRAM (Dynamic Random Access Memory)), secondary memory and the like, the access time is greatly increased leading to significant delays and to a waste of the valuable CPU processing resources. In turn, it greatly decreases CPU performance and consumes most of the CPU processing time.
According to the prior art, a conventional processing unit (e.g., CPU) uses a limited set of registers (e.g., 32 or 64 registers) in its register file. The register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like. However, in most cases, this register file is a portion of the CPU, and it is located in physical proximity to the ALU (Arithmetic Logic Unit) of said CPU. Such a register file can be named a "local register file". One of the reasons for having a limited local register file is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via "source 1" input of the local register file) storing the first value to be processed by the ALU, another register (accessed via "source 2" input of the local register file) storing the second value to be processed by said ALU, and the last register (destination register, accessed via "destination" input of the local register file) storing the result value of the ALU processing (e.g., the sum of the values stored within said sources 1 and 2). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low. For example, a CPU that has 64 registers in its register file requires a 6-bit pointer per register (2^6 = 64), and such a register file is considered to be relatively large, according to the prior art. In addition, even by using a CPU that is capable of receiving instructions as the Very Long Instruction Word (VLIW), the CPU register file relatively rarely reaches a capacity of 256 registers.
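The relation between the number of registers and the pointer width within the program word can be checked with a short calculation. The helper name `pointer_bits` is an assumption introduced for this example.

```python
# Number of instruction bits needed to address one of `register_count` registers.

def pointer_bits(register_count: int) -> int:
    """Smallest n such that 2**n >= register_count."""
    if register_count < 1:
        raise ValueError("register_count must be positive")
    return (register_count - 1).bit_length()


for count in (32, 64, 256):
    # Three pointers per program word: source 1, source 2, destination.
    print(count, pointer_bits(count), 3 * pointer_bits(count))
```

For a register file of 64 registers this gives a 6-bit pointer, i.e. 18 bits of the program word for the three pointers, which shows how quickly a larger register file consumes a 32-bit program word.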
Another reason for having a limited number of registers in the CPU register file is due to hardware limitations related to fast memory access and to capability of using a relatively large number of ports. Usually, a conventional ALU requires providing at least two read ports and one write port.
A conventional system that implements a CPU also usually contains a memory controller (that comprises a MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, DMA engine, communication controllers (e.g., a USB controller) and the like. The memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
It should be noted that according to the prior art, the CPU local register file is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file (e.g., cache, secondary memory, etc.). Thus, in order that the CPU will be able to perform data manipulation on any of its memory mapped registers, the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped registers (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file. After the data is loaded, the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral. In addition, usually a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers. Therefore, the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing. In some more complex (VLIW) CPUs, a multi-LOAD request can be generated in a single CPU clock cycle and then, in an additional clock cycle, one or more (destination) registers within the local register file can be updated with new data values (the results of ALU processing).
Generally, according to the prior art, for processing (manipulating) data by means of the CPU, at least three commands have to be generated: LOAD (or multi-LOAD) command for loading data from an external register/memory means (device register, cache memory, etc. that is located externally to CPU (off-CPU-chip)) into the local register file; ALU data processing command for executing various operations (e.g., addition, subtraction); STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means - for this, even if working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
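The prior-art command sequence described above can be sketched as follows. The memory map contents, the addresses and the function name `add_via_local_register_file` are hypothetical; the sketch merely lists the commands the CPU has to issue when the operands must pass through the local register file.

```python
# Prior-art style processing of memory-mapped data: the operands must pass
# through the local register file, so an addition costs LOAD, LOAD, ADD, STORE.
# Addresses and values below are made up for illustration.

def add_via_local_register_file(memory, src1_addr, src2_addr, dest_addr):
    """Add two memory-mapped values and return the list of commands issued."""
    local_regs = [0] * 32      # limited local register file
    commands = []

    local_regs[1] = memory[src1_addr]              # LOAD r1 <- [src1]
    commands.append("LOAD")
    local_regs[2] = memory[src2_addr]              # LOAD r2 <- [src2]
    commands.append("LOAD")
    local_regs[3] = local_regs[1] + local_regs[2]  # ALU: r3 <- r1 + r2
    commands.append("ADD")
    memory[dest_addr] = local_regs[3]              # STORE [dest] <- r3
    commands.append("STORE")
    return commands


memory = {0x1000: 40, 0x2000: 2, 0x3000: 0}
issued = add_via_local_register_file(memory, 0x1000, 0x2000, 0x3000)
print(issued, memory[0x3000])
```

With a multi-LOAD command the two loads collapse into one, which gives the minimum of three commands stated above.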
When, for example, the data needs to be moved from one CPU memory mapped register (or from another external memory means, such as cache, secondary memory, etc.) to another CPU memory mapped register/memory means without performing ALU data processing, then the DMA (Direct Memory Access) engines, which are CPU peripherals, can be used for conducting such data movements, thereby reading the data from said one CPU memory mapped register/memory means and writing the data to said another CPU memory mapped register/memory means. By using the DMA engines, CPU is not required to generate LOAD and STORE commands. In addition, DMA operations can be conducted in parallel with CPU operations. However, for using the DMA engines, dedicated hardware is required and the DMA engines need to be configured and enabled by the CPU; further, it is applicable only when no data processing (or substantially negligible data processing) is required.
Fig. 1A is a schematic block-diagram 100 of a conventional processing unit, according to the prior art. First, an instruction (that comprises an opcode and one or more operands) is fetched from the program memory (e.g., RAM 140) and is transferred via a data bus (one or more lines) 107 to an instruction register (IR) 105 that stores the current instruction to be decoded and executed. For example, it is supposed that instruction register 105 is 32 bits long [0...31] bits, wherein the first six bits [0...5] of the instruction provided within said instruction register 105 are opcode (that defines the operation to be performed, e.g., addition, subtraction, etc.); bits [11...15] are an address of the destination register within a register file 106 (the address of a register in which the result of ALU processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated (processed); and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as to an "immediate" value (some constant value) or to the auto-increment, etc.
Each of the above addresses is five bits long, thus referring to one of 32 (2^5 = 32) registers of register file 106. The addresses of the above Sources 1 and 2 are inputted into decoder(s) 120' of register file 106, and as a result, the data of corresponding registers (to which said addresses are related) of said register file 106 is outputted over data bus (one or more lines) 141. The next step is based on the specific instruction to be processed, and can be, for example: a) reading data from on-CPU-chip (inside the CPU chip) memory/peripherals, or off-CPU-chip (outside the CPU chip) memory/peripherals (by establishing a LOAD command); b) storing data within said memory/peripherals (by establishing a STORE command); and/or c) activating execution unit 130 (e.g., ALU) for performing a mathematical operation, such as addition, subtraction, multiplication, division: in this case, the operands for the execution unit processing are determined by means of control unit 115. Once the processing is completed, the result is written back into the destination register within said register file 106 (the destination register address is defined by bits [11...15] of the executed instruction). Further, the result can be written back into the CPU memory means/peripherals 160 by means of Input/Output Control Unit 150 over bus 108 (by accomplishing a STORE command). Then, the cycle is started over with the next instruction to be further fetched, decoded and executed. Since a program counter 110 holds an address of the current instruction to be executed (and points to a corresponding RAM 140 memory address by means of address bus 119), the CPU always "knows" where within said RAM 140 the next instruction can be found. Each time the instruction is completed, program counter 110 is incremented by at least one memory address location; also, for example, when the instruction is a conventional JUMP or BRANCH command, the program counter is changed accordingly.
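The field layout of the 32-bit instruction word described for instruction register 105 can be sketched as below. The text does not state whether bit 0 is the least or the most significant bit; this sketch assumes that bit 0 is the least significant bit. The function names `field`, `decode` and `encode` are likewise assumptions made for illustration.

```python
# Field extraction for the 32-bit instruction word described in the text:
# opcode in bits [0..5], destination in [11..15], source 1 in [16..20],
# source 2 in [21..25]. ASSUMPTION: bit 0 is the least significant bit.

def field(word: int, low: int, high: int) -> int:
    """Return bits low..high (inclusive) of word."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode(word: int) -> dict:
    """Split an instruction word into the four fields named in the text."""
    return {
        "opcode": field(word, 0, 5),
        "dest": field(word, 11, 15),
        "src1": field(word, 16, 20),
        "src2": field(word, 21, 25),
    }


def encode(opcode: int, dest: int, src1: int, src2: int) -> int:
    """Inverse of decode for the four fields; other bits are left at zero."""
    return opcode | (dest << 11) | (src1 << 16) | (src2 << 21)


word = encode(opcode=0b100000, dest=3, src1=1, src2=2)
print(decode(word))
```

Each 5-bit field can take the values 0 to 31, which is why each address refers to one of 32 registers.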
As seen from Fig. 1A, CPU register file 106 is local, and it is a portion of the CPU chip (core). Input/Output (I/O) control unit 150 (e.g., comprising memory controller or memory management unit (MMU)) is used for loading and writing back data from or to CPU mapped peripherals/memory means to be further processed by ALU 130. Thus, according to the prior art, ALU operations (e.g., addition) are not performed directly on the data stored within CPU mapped peripherals/memory means, and these peripherals/memory means are not accessed directly by means of said ALU 130: the data inputted into the ALU is incoming from local register file 106, to which it is loaded from corresponding memory/peripherals by means of Input/Output (I/O) control unit 150, for example. Therefore, for performing manipulation on data stored outside local register file 106, the data has first to be loaded into said local register file 106 by means of I/O control unit 150, thereby executing a LOAD command, and loading the data into the CPU local register file over load/store bus 108. After the LOAD command is accomplished, then the CPU can execute (perform) the corresponding operation (processing) on the loaded data by means of its ALU 130. Then, the result of the ALU processing is written back to the destination register of local register file 106. Further, for updating corresponding register/memory means outside said local register file 106, a STORE command is generated for storing said result, over load/store bus 108, from said destination register into the outside register/memory means, such as cache. It should be noted that providing data to be processed into said ALU 130 and providing the processed data from said ALU, is controlled by means of control unit 115 (over control bus 121), which can comprise a controller 126, multiplexers 125, decoders 120" and the like.
Control unit 115 receives data to be processed from local register file 106 over data bus 141, and it controls execution unit 130 processing by sending to said execution unit 130 a control signal over bus 121 in accordance with the instruction opcode. In turn, execution unit 130 receives the corresponding instruction operands to be processed from said control unit 115, and outputs a result of said processing over bus 108.
Fig. 1B is a schematic illustration of a conventional (local) register file 106, according to the prior art. For example, it is supposed that instruction register (IR) 105 (Fig. 1A) is 32 bits long (bits [0...31]), wherein bits [11...15] are an address of the destination register within register file 106 (the address of a register in which the result of ALU 130 (Fig. 1A) processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated; and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as an "immediate" value, auto-increment, etc.
Each of the above addresses is five bits long, thus referring to one of 32 (2^5 = 32) registers of register file 106. The addresses of the above Sources 1 and 2 are inputted into register file 106 (IR16-20 and IR21-25, respectively) and conveyed to decoders 120 (Fig. 1A). Decoders 120 decode the addresses and enable outputting data of corresponding registers of register file 106 towards ALU for further processing said data (e.g., addition, subtraction of the data and the like). Thus, the data is outputted through "Source 1 Data" and "Source 2 Data" outputs, having a length of 32 bits. After ALU 130 processes said data, it stores the result (32 bits long) in a destination register within register file 106 (the destination register is defined by the IR11-15 address).
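For illustration only, the extraction of the three five-bit register addresses from a 32-bit instruction word may be sketched as follows. The sketch assumes that bit 0 is the least significant bit of the word, which the text does not state; the helper names are likewise hypothetical.

```python
def field(word, lo, hi):
    """Return bits lo..hi (inclusive) of a word, bit 0 being the least significant (assumption)."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def decode_register_addresses(ir):
    """Split a 32-bit instruction register value into its three register addresses."""
    return {
        "dest": field(ir, 11, 15),   # IR11-15: destination register
        "src1": field(ir, 16, 20),   # IR16-20: Source 1 register
        "src2": field(ir, 21, 25),   # IR21-25: Source 2 register
    }


# Build a sample word with dest = 3, src1 = 17, src2 = 31 and decode it again.
ir = (3 << 11) | (17 << 16) | (31 << 21)
regs = decode_register_addresses(ir)
print(regs)
```

Each field is five bits wide, so every decoded value lies between 0 and 31 and selects one of the 32 registers.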
According to the prior art, for loading the data from registers, processing the loaded data and storing the result in the memory (i.e., performing LOAD, "ALU processing" and STORE commands) with minimal CPU stalls (delays), conventional processing devices use cache memories or Tightly Coupled Memories (TCM), such as SRAM, etc. These memories are located in physical proximity to the CPU core, and therefore accessing such memories (e.g., performing LOAD and STORE commands) is done with a relatively low latency compared to accessing other memory means, such as a hard disk, for example. Usually, when the CPU needs to operate on a large chunk of data, which in turn involves performing relatively long loops of ALU commands, the data is first copied by the MMU to the cache memory or by the DMA engine to the tightly coupled memory. Only then does the CPU execute ALU commands within said loops of commands. Thus, this leads to a significant latency between the time when the data is first conveyed to the memory mapped register/memory means and the time when the CPU can process it. In particular, this leads to a significant latency until the processed data is written back into the memory mapped register/memory means.
The above problems related to achieving fast data access and performing fast data processing have been recognized in the prior art, and several solutions have been proposed. For example, US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and in turn, accepting destination operands for instruction execution. The different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures. The cache memories may be directly accessed by using cache addresses.
Further, US 6,178,482 presents a data processing apparatus which uses a register file to provide a faster alternative to indirect memory addressing. A functional unit is connected to a data register file which comprises a plurality of registers, each of which is accessed by a corresponding register number. The functional unit of US 6,178,482 can execute at least one indirect register access instruction that comprises an operand register number field. Instruction decode circuitry, connected to the register file and the functional unit, is responsive to the indirect register access instruction to recall data stored in an operand register specified by the operand register number in the instruction, identify the recalled data as a register access number, and recall operand data from a data register corresponding to the register access number for use as an operand by the functional unit.
The present invention has many advantages over the prior art. For example, one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data by providing a substantially direct memory means access for one or more CPU execution units (for processing the data). Thus, the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle.
Another advantage of the present invention is that it can significantly expand the conventional CPU register file to the entire (complete) CPU memory map, thereby providing novel CPU architecture and enabling substantially direct memory access. The expanded register file of said CPU can be further shared with other CPUs, or with other internal/external (on-chip/off-chip) peripherals or devices.
Still another advantage of the present invention is that it provides a method and system, in which, for reducing the number of instructions and CPU clock cycles required for manipulating/processing memory mapped register data, there is substantially no need to change the structure of the conventional CPU program word.
Still another advantage of the present invention is that it eliminates the need for using conventional DMA engines.
A further advantage of the present invention is that it provides a method and system, in which the size of external memory means of conventional processing devices (such as conventional cache or tightly coupled memories, as used in the prior art architectures) can be significantly reduced and/or the need for using the external memory means can be eliminated.
Still a further advantage of the present invention is that it provides a method and system, in which CPU stalls (delays) are substantially prevented. Other advantages of the present invention will become apparent as the description proceeds.
Summary of the Invention
The present invention relates to providing a register file system and a method thereof for enabling a substantially direct access to memory means that are coupled to a processing unit, such as a CPU (Central Processing Unit), microprocessor, and the like.
According to an embodiment of the present invention, the register file system comprises: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b) at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c) at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address.
According to another embodiment of the present invention, the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
According to still another embodiment of the present invention, the one or more mapped addresses are provided within an instruction to be processed.
According to a further embodiment of the present invention, the register file system further comprises at least one address generator for generating the at least one mapped address. According to an embodiment of the present invention, an instruction to be processed comprises data based on which the at least one mapped address is generated.
According to another embodiment of the present invention, the register file system further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
According to still another embodiment of the present invention, each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
According to still another embodiment of the present invention, each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
According to a further embodiment of the present invention, the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
According to still a further embodiment of the present invention, at least a portion of the register file system is incorporated within a processing unit.
According to still a further embodiment of the present invention, the register file system is used by means of at least one processing unit.
According to an embodiment of the present invention, the register file system further comprises one or more execution units for processing at least a portion of the data outputted from said register file system. According to another embodiment of the present invention, the one or more execution units process the data outputted from said register file system according to an instruction opcode.
According to still another embodiment of the present invention, the execution unit is an Arithmetic Logic Unit.
According to still another embodiment of the present invention, the register file system further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
According to still another embodiment of the present invention, the register file system further comprises a program counter for providing an address of the next instruction to be processed.
According to a further embodiment of the present invention, at least one data unit is shared between two or more processing units.
According to another embodiment of the present invention, the register file system comprises: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1. receive at least one mapped address; a.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and a.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and b) at least one output port for outputting said data to be processed from said one or more memory cells.
According to an embodiment of the present invention, the processing unit device comprises: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; a.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and a.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
According to another embodiment of the present invention, the processing unit device further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
According to still another embodiment of the present invention, the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
According to still another embodiment of the present invention, the one or more operands are the mapped addresses. According to a further embodiment of the present invention, the processing unit device further comprises at least one address generator for generating the one or more mapped address.
According to still a further embodiment of the present invention, one or more mapped addresses are generated according to data provided within an instruction to be processed.
According to still a further embodiment of the present invention, the processing unit device further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
According to still a further embodiment of the present invention, each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
According to still a further embodiment of the present invention, each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
According to an embodiment of the present invention, the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
According to another embodiment of the present invention, the at least one execution unit processes the data outputted from the register file system according to the instruction opcode. According to still another embodiment of the present invention, the execution unit is an Arithmetic Logic Unit.
According to still another embodiment of the present invention, the processing unit device further comprises a program counter for providing an address of the next instruction to be processed.
According to another embodiment of the present invention, a processing unit device comprises: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1.1. receive at least one mapped address; a.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and a.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and a.2. at least one output port for outputting said data to be processed from said one or more memory cells; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
According to an embodiment of the present invention, the method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands, wherein at least one of said operands is a PU mapped address; d) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; e) enabling reading the data stored in the at least one data unit addresses; f) processing the read data; and g) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
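Purely as a non-limiting sketch, steps a) to g) of the above method may be modelled in Python as follows. The tuple layout of the instruction, the two sample data units, the conversion rule and the operation table are all assumptions introduced for the illustration; they are not taken from the described embodiments.

```python
def process_instruction(pc, program_memory, data_units, convert, alu):
    """Run steps a) to g) for one instruction and return the written-back result."""
    instruction = program_memory[pc]            # a), b): convey the address and fetch
    opcode, src1, src2, dest = instruction      # c): first decoding (opcode and operands)
    s1_unit, s1_cell = convert(src1)            # d): second decoding / address conversion
    s2_unit, s2_cell = convert(src2)
    d_unit, d_cell = convert(dest)
    a = data_units[s1_unit][s1_cell]            # e): read the stored data
    b = data_units[s2_unit][s2_cell]
    result = alu[opcode](a, b)                  # f): process the read data
    data_units[d_unit][d_cell] = result         # g): write the result back
    return result


def convert(mapped):
    """Hypothetical conversion: addresses below 0x100 map to 'sram', the rest to 'usb'."""
    if mapped < 0x100:
        return "sram", mapped
    return "usb", mapped - 0x100


data_units = {"sram": [7, 5, 0, 0], "usb": [0, 0]}
alu = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b}
program_memory = {0: ("ADD", 0x000, 0x001, 0x101)}

result = process_instruction(0, program_memory, data_units, convert, alu)
print(result, data_units["usb"])
```

In this sketch the sources and the destination reside in different data units, and a single instruction reads, processes and writes back without separate LOAD or STORE steps.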
According to another embodiment of the present invention, the method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands; d) generating a corresponding PU mapped address for the at least one operand; e) performing second decoding or converting each generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the data unit addresses; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
Brief Description of the Drawings
In order to understand the invention and to see how it may be carried out in practice, a preferred embodiment will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
Fig. 1A is a schematic block-diagram of a conventional processing unit, according to the prior art;
Fig. 1B is a schematic illustration of a conventional (local) register file, according to the prior art;
Fig. 2A is a schematic illustration of connecting a spread register file system to an instruction register and to an execution unit (such as ALU), according to an embodiment of the present invention;
Fig. 2B is another schematic illustration of connecting a spread register file system to an instruction register and to an execution unit (such as ALU), according to another embodiment of the present invention;
Fig. 3 is a schematic illustration of a spread register file system, according to an embodiment of the present invention; and
Fig. 4 is a pipeline representation of operating with a spread register file system, according to an embodiment of the present invention.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Detailed Description of the Preferred Embodiments
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, systems, procedures, components, circuits and the like have not been described in detail so as not to obscure the present invention.
Hereinafter, where the term "spread register file" system or "SRF" system is mentioned, it should be noted that it refers to the expanded (spread) register file according to the present invention, which can be related to the entire CPU memory map, thereby enabling substantially direct memory/peripheral access for one or more CPU execution units (for processing the data). Further, where the term "local register file" or "LRF" is mentioned, it refers to the conventional CPU local register file 106 (Fig. 1A). It should be also noted that according to an embodiment of the present invention, the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means. Also, where the term CPU is mentioned, it refers to any processing unit (PU), such as a microprocessor and the like. In addition, where the term "processing" (or a similar term) is mentioned, it should be noted that it refers to any data operation, such as data manipulation, data transfer, addition or subtraction of data and the like.
Fig. 2A is a schematic illustration of connecting a spread register file system 206 to instruction register 105 and to execution unit 130 (such as ALU), according to an embodiment of the present invention. According to an embodiment of the present invention, spread register file system 206 relates to the entire (complete) CPU mapped memories: cache memories, on-chip peripherals (e.g., RAM, SRAM), tightly coupled memories, on-board memories (e.g., DRAM), secondary memories (e.g., off-chip peripherals, hard disks, etc.), and any other memory means (e.g., CDs (Compact Discs), DVDs (Digital Versatile Discs), etc.).
According to an embodiment of the present invention, spread register file system 206 comprises conventional peripheral address converters, as presented in Fig. 3. Each peripheral address converter is used for converting a CPU memory mapped address to a corresponding peripheral (device) address (e.g., the peripheral device can be a USB device, cache memory, RAM, SRAM, tightly coupled memory, DRAM, hard disks, CD, DVD, etc.). For example, one peripheral address converter converts a CPU mapped address of "source 1" register/memory means to the corresponding address of said register/memory means within the corresponding peripheral that actually stores said data (to be processed by means of execution unit 130, such as ALU); another peripheral address converter converts a CPU mapped address of "source 2" register/memory means that stores additional data to be processed by means of said execution unit 130; and still another peripheral address converter converts a CPU mapped address of "destination" register/memory means, in which a result of the above execution unit 130 processing (e.g., addition, subtraction) will be stored. It should be noted that the peripheral address converter can be implemented in hardware and/or in software.
According to an embodiment of the present invention, instruction register 105 contains a VLIW program word, which can be, for example, 128 or 256 bits long. For example, for the MIPS32 (Microprocessor without Interlocked Pipelined Stages) CPU technology, the VLIW program word is 128 bits long, wherein each of the following: opcode 221', CPU mapped "source 1" address 222', CPU mapped "source 2" address 223' and CPU mapped destination address 224' can be, for example, 32 bits long. Similarly, for the MIPS64 CPU technology, the VLIW program word is 256 bits long. Each of the above addresses relates to a specific address within the entire CPU memory map, and is represented, for example, by a 2^32 or 2^64 binary number, respectively. After "source 1", "source 2" and "destination" CPU memory mapped addresses are inputted into spread register file system 206, they are converted to corresponding peripheral device addresses (e.g., USB device addresses). Then, the data stored within registers/memory means related to "source 1" and "source 2" peripheral device addresses is outputted from said spread register file system 206 over buses 231 and 232, respectively. Then, the outputted data is provided into execution unit 130 and processed in accordance with opcode 221' of the VLIW program word. For example, the data of "source 1" can be added to the data of "source 2", or the data of "source 1" can be subtracted from the data of "source 2", etc. The command for performing such an operation is provided into said execution unit 130 via control bus 234. Then, after accomplishing the operation, the corresponding result is written back over data bus 233 into the destination register/memory means (for example, located within the corresponding peripheral device, such as a USB device), whose address is defined by the CPU mapped destination address 224' of the VLIW program word.
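As a hedged illustration of the 128-bit case, a program word made of four 32-bit fields may be packed and unpacked as sketched below. The order of the fields within the word (opcode in the most significant position, destination in the least significant) and the sample values are assumptions; the text only states the field lengths.

```python
FIELD_BITS = 32
MASK = (1 << FIELD_BITS) - 1


def pack_vliw(opcode, src1, src2, dest):
    """Pack four 32-bit fields into one 128-bit program word (field order is an assumption)."""
    word = 0
    for value in (opcode, src1, src2, dest):
        if not 0 <= value <= MASK:
            raise ValueError("field does not fit in 32 bits")
        word = (word << FIELD_BITS) | value
    return word


def unpack_vliw(word):
    """Recover (opcode, src1, src2, dest) from a word built by pack_vliw."""
    fields = []
    for _ in range(4):
        fields.append(word & MASK)   # lowest field first: dest, src2, src1, opcode
        word >>= FIELD_BITS
    dest, src2, src1, opcode = fields
    return opcode, src1, src2, dest


word = pack_vliw(0x1, 0x1000_0000, 0x2000_0004, 0x3000_0008)
print(hex(word), unpack_vliw(word))
```

With 64-bit fields the same scheme would yield the 256-bit program word mentioned for the 64-bit case.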
It should be noted that according to an embodiment of the present invention, conventional I/O control units (e.g., memory management units), which enable reading/writing back data from/to various means (e.g., cache memories, secondary memories, etc.), are incorporated within spread register file system 206 along with CPU mapped memory means, such as cache memories, off-CPU-chip memories and other memory means. Thus, executing unit 130 is enabled to operate substantially directly on each of said CPU mapped memory means provided within spread register file system 206, i.e. is enabled to execute instructions without the need for generating and performing LOAD commands (loading data into said spread register file system 206 from external memory means) and corresponding additional STORE commands for storing the result of executing unit 130 operation externally to said spread register file system 206.
In addition, it should be noted that according to another embodiment of the present invention, spread register file system 206 can operate with more than one executing unit 130. Further, instruction register 105 and/or executing unit 130 can be provided within said spread register file system 206.
According to another embodiment of the present invention, spread register file system 206 can be provided on-CPU-chip (incorporated within a CPU) or off-CPU-chip. Further, according to still another embodiment of the present invention, a portion of spread register file system 206 can be provided on-CPU-chip and another portion can be provided off-CPU-chip.
Fig. 2B is another schematic illustration of connecting spread register file system 206 to instruction register 105 and to execution unit 130 (such as ALU), according to another embodiment of the present invention. According to this embodiment, "source 1" address 222", "source 2" address 223" and "destination" address 224" (which are portions of a conventional (non-VLIW) program word) are inputted from instruction register 105 into address generator 250 for generating corresponding addresses being related to the entire CPU mapped memory. Thus, addresses 222", 223" and 224" can each be, for example, 5 bits long, and each of the CPU mapped "source 1", "source 2" and "destination" addresses is 32 bits long (if implemented for a MIPS32 CPU), since each of these mapped addresses relates to the entire CPU memory map that is represented by 2^32 addresses. Similarly, for the MIPS64 CPU implementation, each of these mapped addresses is 64 bits long. It should be noted that address generator 250 can generate addresses in various ways based on different address generating functions. For example, address generator 250 can receive at its input a 5 bits long address (represented by a 2^5 binary number) from instruction register 105, and then it can add this address to another 2^32 number, thereby generating a new CPU mapped address that is 32 bits long. The above 2^32 number can be a predefined number, a random number or a number that is calculated (generated) by means of address generator 250 according to some predefined function(s)/expressions. In addition, said new CPU mapped address can be generated according to opcode 221" that can be inputted from instruction register 105 into said address generator 250.
Thus, for example, if opcode 221" relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within a corresponding peripheral (data unit) of SRF system 206, then only "source 1" CPU mapped address and CPU mapped "destination" addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated).
According to another embodiment of the present invention, each of the addresses outputted from instruction register 105 over lines 251, 252 and 253 can be further related to corresponding registers within address generator 250 (e.g., each of said addresses outputted from said instruction register 105 can be related to a different base address of 32 bits (for MIPS32 technology), or 64 bits (for MIPS64 technology) provided within said address generator 250, based on which a CPU mapped address can be generated). Thus, each CPU mapped address (to be outputted from address generator 250) can be generated according to values stored within these corresponding registers. For example, the "source 1" generated CPU mapped address (provided from address generator 250 over line 212) can be the sum of: a value of operand 222" and the corresponding "source 1" base address stored within said address generator 250. Similarly, the "source 2" (or "destination") generated CPU mapped address can be the sum of: a value of operand 223" (or 224") and the corresponding "source 2" (or "destination") base address stored within said address generator 250. According to another embodiment of the present invention, address generator 250 stores CPU mapped addresses (e.g., 32 or 64 bits long), and for each operand 222", 223" and 224", said address generator 250 outputs a corresponding CPU mapped address.
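The base-address variant described above, in which each generated CPU mapped address is the sum of a 5-bit operand and a per-role base address, may be sketched as follows. This is an illustration under stated assumptions only: the class name, the role labels, the sample base values and the wrap-around to 32 bits are all hypothetical choices, not details given in the description.

```python
class AddressGenerator:
    """Sketch of an address generator holding one base address per operand role."""

    def __init__(self, src1_base, src2_base, dest_base, width=32):
        self.bases = {"src1": src1_base, "src2": src2_base, "dest": dest_base}
        self.mask = (1 << width) - 1     # keep results within the mapped address width

    def generate(self, role, operand):
        """Return base[role] + operand as a CPU mapped address."""
        if not 0 <= operand < 32:
            raise ValueError("operand must fit in 5 bits")
        return (self.bases[role] + operand) & self.mask


gen = AddressGenerator(0x1000_0000, 0x2000_0000, 0x3000_0000)
mapped_src1 = gen.generate("src1", 5)
mapped_dest = gen.generate("dest", 31)
print(hex(mapped_src1), hex(mapped_dest))
```

Because only the base addresses are wide, the program word can keep its conventional 5-bit operand fields while still reaching the entire memory map.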
Fig. 3 is a schematic illustration of spread register file system 206, according to an embodiment of the present invention. Spread register file system 206 receives as inputs: CPU mapped "source 1" address (MS1 address) over bus (line) 212, CPU mapped "source 2" address (MS2 address) over bus 213 and CPU mapped "destination" address (MD address) over bus 214 (each 32 bits long, for example). These addresses are converted by means of address converters 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302, 303, ..., 310 (e.g., cache memories, tightly coupled memories, secondary memories, SRAM, DRAM, disk-on-keys, hard disks, CDs (Compact Discs), DVDs, or any other peripheral/memory means). It should be noted that said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions. In addition, the above address can be converted according to opcode 221' or 221" (Figs. 2A or 2B) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350, which generates corresponding control signals to address converters 320, 321 and 322 and to executing unit 130: for example, if opcode 221' or 221" relates to moving "source 1" data to the "destination" register, then only address converters 320 and 322 can be activated.
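One possible converting function, given here only as a sketch, is a range lookup over a memory map, combined with a control rule that activates the "source 2" converter only when the opcode needs it. The memory map entries, the unit names and the opcode strings are assumptions made for the illustration.

```python
# Hypothetical memory map: (first mapped address, size in cells, data unit name).
MEMORY_MAP = [
    (0x0000_0000, 0x100, "cache"),
    (0x1000_0000, 0x400, "sram"),
    (0x2000_0000, 0x040, "usb"),
]


def convert(mapped_address):
    """Return (data unit name, address inside that unit) for a CPU mapped address."""
    for base, size, unit in MEMORY_MAP:
        if base <= mapped_address < base + size:
            return unit, mapped_address - base
    raise ValueError(f"unmapped address {mapped_address:#x}")


def active_converters(opcode):
    """Control-unit sketch: a move needs only the source 1 and destination converters."""
    if opcode == "MOVE":
        return ("src1", "dest")
    return ("src1", "src2", "dest")


def convert_operands(opcode, ms1, ms2, md):
    """Convert only the mapped addresses whose converters are activated for this opcode."""
    mapped = {"src1": ms1, "src2": ms2, "dest": md}
    return {role: convert(mapped[role]) for role in active_converters(opcode)}


print(convert_operands("ADD", 0x10, 0x1000_0010, 0x2000_0001))
print(convert_operands("MOVE", 0x10, 0xFFFF_FFFF, 0x2000_0001))
```

In the second call the "source 2" address is never converted, so its (unmapped) value is simply ignored.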
According to an embodiment of the present invention, the converted "source 1" and "source 2" addresses (CS1 and CS2 addresses, respectively) are inputted into corresponding peripheral/memory means (e.g., peripheral/memory means 301, 302, 303, ..., N), which in turn output corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232. Then, said data is processed (executed) by means of one or more execution units 130 (such as ALUs). After that, the processing result is provided over write back bus 233 to one or more peripheral/memory means to be stored in corresponding converted destination addresses (CD addresses) within said one or more peripheral/memory means. It should be noted that according to an embodiment of the present invention, the "source 1", "source 2", and "destination" memory cells can be physically located within the same or within different peripheral/memory means (such as peripheral/memory means 301, 302, 303, ..., N).
According to another embodiment of the present invention, address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary "0" or "1") to each of said peripheral/memory means 301, 302, 303, ...,N for enabling reading or writing from or to said peripheral/memory means (data units) 301, 302, 303, ..., N. The corresponding WE/CS commands can be provided to each of said peripheral/memory means (data units) 301, 302, 303, ..., N when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302, 303, ..., N. For example, for reading data from the converted "source 1" address (e.g., the address of a register within the corresponding peripheral), CS (read command) and WE (write command) signals provided to said corresponding peripheral are "1" and "0", respectively; in turn, for writing the data into the converted "destination" address of said corresponding peripheral, the WE signal is
It should be noted that according to another embodiment of the present invention, address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", "source 2" and "destination" addresses into corresponding peripheral/memory means addresses.
According to another embodiment of the present invention, instead of providing address converters 320, 321 and 322, the address decoding (or the address conversion) is performed within one or more peripherals/memory means 301, 302, 303, ..., N. Thus, according to this embodiment of the present invention, the need for providing said address converters 320, 321 and 322 is substantially eliminated. Peripherals/memory means 301, 302, 303, ..., N can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses within said peripherals/memory means 301, 302, 303, ..., N, in which the required data is stored (or is to be stored). According to still another embodiment of the present invention, blocks 320, 321 and 322 can provide WE/CS commands to peripherals/memory means 301, 302, 303, ..., N, and do not perform the address conversion. Further, it should be noted that WE/CS commands can be generated by means of control unit 350. According to a further embodiment of the present invention, an address converter (such as address converter 320, 321 or 322) can be incorporated (integrated) within each (or one or more) peripheral/memory means 301, 302, 303, ..., or N. In such case, said peripheral/memory means receives a CPU mapped address and determines by means of the integrated address converter (according to predefined base-addresses of said peripheral/memory means), whether the received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. It should be noted that the base-addresses of each peripheral/memory means can be further dynamically changed as needed.
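The embodiment in which the address converter is integrated within each peripheral/memory means can be illustrated by the following Python sketch, given by way of example only. The names Peripheral, decode, rebase and broadcast_write, as well as the base addresses and sizes used with them, are hypothetical and are not taken from the present description; a contiguous address range per peripheral is likewise an assumption.

```python
class Peripheral:
    """Peripheral/memory means with an integrated address converter (sketch)."""

    def __init__(self, name, base, size):
        self.name = name
        self.base = base            # predefined base-address (assumed contiguous range)
        self.size = size
        self.cells = [0] * size

    def decode(self, mapped_addr):
        """Return the local cell index, or None if the address belongs elsewhere."""
        offset = mapped_addr - self.base
        return offset if 0 <= offset < self.size else None

    def rebase(self, new_base):
        """Dynamically change the base-address of this peripheral."""
        self.base = new_base


def broadcast_write(peripherals, mapped_addr, value):
    """Present a CPU mapped address to every peripheral; the owner stores the value."""
    claimed = []
    for p in peripherals:
        index = p.decode(mapped_addr)
        if index is not None:
            p.cells[index] = value
            claimed.append(p.name)
    return claimed
```

Each peripheral here decides on its own whether a received CPU mapped address relates to its memory cells, so no separate converter block is modelled.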
Fig. 4 is a pipeline representation 400 of operating with spread register file system 206 (Fig. 2B), according to an embodiment of the present invention. According to this embodiment, the pipeline has 12 stages (T0 to T11), each of which can correspond to a single CPU clock cycle. At the first stage T0, the address of the instruction (program word) to be fetched is conveyed from the Program Counter (PC) to the CPU program memory (e.g., RAM (not shown)). Then, at stage T1, the instruction (program word) is fetched from said program memory into instruction register 105 (Fig. 2A). After that, at stage T2, the fetched instruction is decoded by means of control unit 350 (Fig. 3), which provides control signals during the pipeline stages. At stage T3, address generator 250 (Fig. 2B) generates CPU mapped addresses according to the decoded instruction. Thus, for example, if opcode 221" relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within a corresponding peripheral of SRF system 206, then only the "source 1" and "destination" CPU mapped addresses are generated by means of said address generator 250 (the "source 2" CPU mapped address is not generated). In addition, at stage T3, the next program counter address can be determined, based on the decoded instruction. Also, for example, when a conventional JUMP or BRANCH command is issued, the next program counter address is calculated in accordance with a pointer of said JUMP/BRANCH command. Then, at stage T4, the CPU mapped addresses are converted by means of address converters 320, 321 and 322 (Fig. 3), respectively, to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302 and 303 (Fig. 3) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means).
It should be noted that said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions. In addition, the above addresses can be converted according to opcode 221' or 221" (Figs. 2A or 2B) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350 (Fig. 3), which generates corresponding control signals to address converters 320, 321 and 322 and to execution unit 130 (Fig. 3): for example, if opcode 221' or 221" relates to moving "source 1" data to the "destination" register, then only address converters 320 and 322 can be activated. The converted "source 1" and "source 2" addresses are inputted (along with corresponding WE/CS signals, which are set to data "READ") from said address converters 320, 321 and 322 to peripherals/memory means 301, 302, 303, ..., N. Then, at the next stage T5, said peripherals/memory means 301, 302, 303, ..., N generate a READ request to their internal memory, thereby enabling reading corresponding data stored within them at the received converted addresses. In the next stage T6, said corresponding data is read and ready, and then at stage T7, said data is latched and conveyed to the "source 1" and "source 2" read buses 231 and 232, respectively. After that, at stage T8, the data is provided over said read buses 231 and 232 into execution unit 130 (e.g., ALU). At the next two stages, T9 and T10, the data is processed by means of said execution unit 130. It should be noted that the data can be fully processed in stage T9 alone, so that no further processing stage is required; in that case, the pipeline can have, for example, only 11 stages (T0 to T10). At stage T11, the processing result is written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302, 303, ..., N. It should be noted that the writing back operation can take more than a single CPU clock cycle.
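The twelve pipeline stages described above can be summarized, by way of a non-limiting illustration, in the following Python sketch. The stage descriptions paraphrase the text; the functions stage_of and writeback_cycle are hypothetical helpers, and they assume that one instruction enters the pipeline per CPU clock cycle, that each stage takes exactly one cycle, and that no stalls occur.

```python
# Stage descriptions paraphrased from the pipeline of Fig. 4 (T0 to T11).
STAGES = [
    "T0: instruction address conveyed from PC to program memory",
    "T1: instruction fetched into the instruction register",
    "T2: instruction decoded by the control unit",
    "T3: CPU mapped addresses generated by the address generator",
    "T4: CPU mapped addresses converted by the address converters",
    "T5: READ request generated inside the peripherals/memory means",
    "T6: data read and ready",
    "T7: data latched and conveyed to the read buses",
    "T8: data provided to the execution unit",
    "T9: data processed (first processing stage)",
    "T10: data processed (second processing stage)",
    "T11: processing result written back to the destination",
]


def stage_of(instr_index, cycle):
    """Stage index occupied by instruction `instr_index` at clock `cycle`.

    Assumes instruction i enters T0 at cycle i and advances one stage per
    cycle; returns None if the instruction is not in the pipeline.
    """
    stage = cycle - instr_index
    return stage if 0 <= stage < len(STAGES) else None


def writeback_cycle(instr_index):
    """Clock cycle at which instruction `instr_index` reaches the last stage."""
    return instr_index + len(STAGES) - 1
```

For example, under these assumptions stage_of(3, 7) returns 4, meaning that the fourth instruction is in the address conversion stage while earlier instructions occupy later stages.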
It should be noted that CPU control unit 350 controls the pipeline process by generating required control signals during the pipeline stages.
According to an embodiment of the present invention, CPU stalls (delays) are reduced or substantially eliminated. For example, there can be substantially no CPU stalls if the access latency to SRF system 206 is 6 (or less) CPU clock cycles and the pipeline is relatively deep (e.g., 12 stages).
It should be further noted that according to an embodiment of the present invention, the number of instructions and CPU clock cycles required for manipulating/processing (e.g., moving, shifting) memory mapped data is significantly reduced, compared to the prior art. Thus, the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle, thereby providing a substantially direct access between execution unit 130 and peripherals/memory means 301, 302, 303, ..., N.
According to another embodiment of the present invention, spread register file system 206 (Fig. 2A) can be shared between two or more processing units (e.g., CPUs, microprocessors, and the like), and/or between other internal/external (on-chip/off-chip) peripherals or devices.
According to still another embodiment of the present invention, the structure of a conventional CPU program word is not changed, compared to the prior art. According to a further embodiment of the present invention, the need for using conventional DMA engines is eliminated.
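The reduction in the number of instructions can be illustrated by the following Python sketch, which is given as an assumption-laden example only. It compares a conventional load/store sequence (assumed here to be two loads, one ALU operation and one store) with a single instruction whose operands are memory mapped addresses. The tuple instruction encoding, the register names "r1" to "r3", the function names and the addresses used are all hypothetical; the sketch counts instructions and does not model clock cycles.

```python
def load_store_program(src1, src2, dst):
    """Assumed conventional sequence: two loads, one ALU operation, one store."""
    return [
        ("LOAD", "r1", src1),
        ("LOAD", "r2", src2),
        ("ADD", "r3", "r1", "r2"),
        ("STORE", "r3", dst),
    ]


def srf_program(src1, src2, dst):
    """Single instruction whose operands are memory mapped addresses."""
    return [("ADD", dst, src1, src2)]


def run(program, memory):
    """Tiny interpreter: string operands name registers, integers name addresses."""
    regs = {}

    def read(operand):
        return regs[operand] if isinstance(operand, str) else memory[operand]

    for ins in program:
        op = ins[0]
        if op == "LOAD":
            regs[ins[1]] = memory[ins[2]]
        elif op == "STORE":
            memory[ins[2]] = regs[ins[1]]
        elif op == "ADD":
            _, dst, a, b = ins
            result = read(a) + read(b)
            if isinstance(dst, str):
                regs[dst] = result
            else:
                memory[dst] = result
        else:
            raise ValueError(f"unsupported opcode {op}")
    return memory
```

Both programs leave the same result in the destination cell, while the second consists of one instruction instead of four.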
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be put into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims
1. A register file system, comprising: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b) at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c) at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address.
2. The register file system according to claim 1, further comprising at least one control input port configured to receive an opcode of an instruction to be processed.
3. The register file system according to claim 1, wherein the one or more mapped addresses are provided within an instruction to be processed.
4. The register file system according to claim 1, further comprising at least one address generator for generating the at least one mapped address.
5. The register file system according to claim 4, wherein an instruction to be processed comprises data based on which the at least one mapped address is generated.
6. The register file system according to claim 2, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
7. The register file system according to claim 1, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
8. The register file system according to claim 1, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
9. The register file system according to claim 1, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
10. The register file system according to claim 1, wherein at least a portion of said register file system is incorporated within a processing unit.
11. The register file system according to claim 1, wherein said register file system is used by means of at least one processing unit.
12. The register file system according to claim 1, further comprising one or more execution units for processing at least a portion of the data outputted from said register file system.
13. The register file system according to claim 12, wherein the one or more execution units process the data outputted from said register file system according to an instruction opcode.
14. The register file system according to claim 12, wherein the execution unit is an Arithmetic Logic Unit.
15. The register file system according to claim 1, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
16. The register file system according to claim 1, further comprising a program counter for providing an address of the next instruction to be processed.
17. The register file system according to claim 1, wherein at least one data unit is shared between two or more processing units.
18. A register file system, comprising: a) a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1. receive at least one mapped address; a.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and a.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and b) at least one output port for outputting said data to be processed from said one or more memory cells.
19. The register file system according to claim 18, further comprising at least one control input port configured to receive an opcode of an instruction to be processed.
20. The register file system according to claim 18, wherein the at least one mapped address is provided within an instruction to be processed.
21. The register file system according to claim 18, further comprising at least one address generator for generating the at least one mapped address.
22. The register file system according to claim 21, wherein an instruction to be processed comprises data based on which the at least one mapped address is generated.
23. The register file system according to claim 19, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
24. The register file system according to claim 18, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
25. The register file system according to claim 18, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
26. The register file system according to claim 18, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
27. The register file system according to claim 18, wherein at least a portion of said register file system is incorporated within a processing unit.
28. The register file system according to claim 18, wherein said register file system is used by means of at least one processing unit.
29. The register file system according to claim 18, further comprising one or more execution units for processing at least a portion of the data outputted from said register file system.
30. The register file system according to claim 18, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
31. The register file system according to claim 18, further comprising a program counter for providing an address of the next instruction to be processed.
32. The register file system according to claim 18, wherein at least one data unit is shared between two or more processing units.
33. A processing unit device, comprising: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; a.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and a.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
34. The processing unit device according to claim 33, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
35. The processing unit device according to claim 33, wherein the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
36. The processing unit device according to claim 34, wherein the one or more operands are the mapped addresses.
37. The processing unit device according to claim 33, further comprising at least one address generator for generating the one or more mapped addresses.
38. The processing unit device according to claim 37, wherein one or more mapped addresses are generated according to data provided within an instruction to be processed.
39. The processing unit device according to claim 35, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
40. The processing unit device according to claim 33, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
41. The processing unit device according to claim 33, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
42. The processing unit device according to claim 33, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
43. The processing unit device according to claim 34, wherein the at least one execution unit processes the data outputted from the register file system according to the instruction opcode.
44. The processing unit device according to claim 33, wherein the execution unit is an Arithmetic Logic Unit.
45. The processing unit device according to claim 33, further comprising a program counter for providing an address of the next instruction to be processed.
46. A processing unit device, comprising: a) a register file system, comprising: a.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: a.1.1. receive at least one mapped address; a.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; a.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; a.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and a.2. at least one output port for outputting said data to be processed from said one or more memory cells; and b) at least one execution unit for receiving said data outputted from said register file system and processing it.
47. The processing unit device according to claim 46, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
48. The processing unit device according to claim 46, wherein the register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
49. The processing unit device according to claim 47, wherein the one or more operands are the mapped addresses.
50. The processing unit device according to claim 46, further comprising at least one address generator for generating the one or more mapped addresses.
51. The processing unit device according to claim 50, wherein one or more mapped addresses are generated according to data provided within an instruction to be processed.
52. The processing unit device according to claim 48, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
53. The processing unit device according to claim 46, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
54. The processing unit device according to claim 46, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
55. The processing unit device according to claim 46, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
56. The processing unit device according to claim 47, wherein the at least one execution unit processes the data outputted from the register file system according to the instruction opcode.
57. The processing unit device according to claim 46, wherein the execution unit is an Arithmetic Logic Unit.
58. The processing unit device according to claim 46, further comprising a program counter for providing an address of the next instruction to be processed.
59. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands, wherein at least one of said operands is a PU mapped address; d) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; e) enabling reading the data stored in the at least one data unit addresses; f) processing the read data; and g) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
60. The method of processing a PU instruction according to claim 59, further comprising providing at least one control input port configured to receive an opcode of the instruction to be processed.
61. The method of processing a PU instruction according to claim 59, further comprising providing at least one address generator for generating the at least one PU mapped address.
62. The method of processing a PU instruction according to claim 59, further comprising providing, within the instruction, the data based on which the at least one PU mapped address is to be generated.
63. The method of processing a PU instruction according to claim 60, further comprising providing a control unit for receiving the opcode and enabling processing the instruction according to said opcode.
64. The method of processing a PU instruction according to claim 59, further comprising configuring each PU data unit to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
65. The method of processing a PU instruction according to claim 59, further comprising configuring each PU data unit to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
66. The method of processing a PU instruction according to claim 59, further comprising selecting the PU data units from one or more of the following: a) peripherals; b) memory means; and c) registers.
67. The method of processing a PU instruction according to claim 59, further comprising providing a program counter for providing an address of the next instruction to be processed.
68. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and one or more operands; d) generating a corresponding PU mapped address for the at least one operand; e) performing second decoding or converting each generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the data unit addresses; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit.
69. The method of processing a PU instruction according to claim 68, further comprising providing at least one control input port configured to receive an opcode of the instruction to be processed.
70. The method of processing a PU instruction according to claim 69, further comprising providing a control unit for receiving the opcode and enabling processing the instruction according to said opcode.
71. The method of processing a PU instruction according to claim 68, further comprising providing, within the instruction, the data based on which one or more PU mapped addresses are generated.
72. The method of processing a PU instruction according to claim 68, further comprising configuring each PU data unit to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
73. The method of processing a PU instruction according to claim 68, further comprising configuring each PU data unit to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
74. The method of processing a PU instruction according to claim 68, further comprising selecting the PU data units from one or more of the following: a) peripherals; b) memory means; and c) registers.
75. The method of processing a PU instruction according to claim 68, further comprising providing a program counter for providing an address of the next instruction to be processed.
PCT/IL2009/000472 2008-05-07 2009-05-07 Register file system and method thereof for enabling a substantially direct memory access WO2009136402A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US7158408P 2008-05-07 2008-05-07
US61/071,584 2008-05-07

Publications (2)

Publication Number Publication Date
WO2009136402A2 true WO2009136402A2 (en) 2009-11-12
WO2009136402A3 WO2009136402A3 (en) 2010-03-11

Family

ID=41265110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2009/000472 WO2009136402A2 (en) 2008-05-07 2009-05-07 Register file system and method thereof for enabling a substantially direct memory access

Country Status (1)

Country Link
WO (1) WO2009136402A2 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6081783A (en) * 1997-11-14 2000-06-27 Cirrus Logic, Inc. Dual processor digital audio decoder with shared memory data transfer and task partitioning for decompressing compressed audio data, and systems and methods using the same
US6269436B1 (en) * 1995-12-11 2001-07-31 Advanced Micro Devices, Inc. Superscalar microprocessor configured to predict return addresses from a return stack storage
US20030200339A1 (en) * 2001-07-02 2003-10-23 Globespanvirata Incorporated Communications system using rings architecture
US20070006150A9 (en) * 2002-12-02 2007-01-04 Walmsley Simon R Multi-level boot hierarchy for software development on an integrated circuit

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150089111A1 (en) * 2011-12-22 2015-03-26 Intel Corporation Accessing data stored in a command/address register device
US9436632B2 (en) * 2011-12-22 2016-09-06 Intel Corporation Accessing data stored in a command/address register device
US9442871B2 (en) 2011-12-22 2016-09-13 Intel Corporation Accessing data stored in a command/address register device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09742573

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

WPC Withdrawal of priority claims after completion of the technical preparations for international publication

Ref document number: 61/071,584

Country of ref document: US

Date of ref document: 20101025

Free format text: WITHDRAWN AFTER TECHNICAL PREPARATION FINISHED

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24/01/2012)

122 Ep: pct application non-entry in european phase

Ref document number: 09742573

Country of ref document: EP

Kind code of ref document: A2