WO2009136401A2 - Improved processing unit implementing both a local register file system and spread register file system, and a method thereof - Google Patents

Improved processing unit implementing both a local register file system and spread register file system, and a method thereof

Info

Publication number
WO2009136401A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
address
register file
file system
processing
Prior art date
Application number
PCT/IL2009/000471
Other languages
French (fr)
Other versions
WO2009136401A3 (en)
Inventor
Yoav Peleg
Original Assignee
Cosmologic Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cosmologic Ltd.
Publication of WO2009136401A2
Publication of WO2009136401A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0223: User address space allocation, e.g. contiguous or non contiguous base addressing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to processing units. More particularly, the present invention relates to providing an improved processing unit (e.g., CPU (Central Processing Unit), microprocessor and the like), and a method thereof, that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access to memory means that are coupled to said improved processing unit.
  • Fetching means retrieving an instruction from the program memory, wherein the instruction is represented by a number or by a sequence of numbers.
  • Instruction Register: a register that stores a current instruction to be executed.
  • the instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as an ALU (Arithmetic Logic Unit).
  • ALU: Arithmetic Logic Unit
  • Opcode: an opcode (operation code) is the portion of a machine-language instruction that specifies an operation to be performed (e.g., addition, subtraction, and the like).
  • an instruction operand is data/value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
  • Register File is a storage unit located within the processing unit, such as the CPU. Generally, the register file is a combination of registers and combinatorial logic.

Background of the Invention

  • a conventional central processing unit operates in four steps: a) fetching; b) decoding (which involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing.
  • the first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). The instruction location in the program memory is determined by a program counter, which keeps track of the CPU processing in the current program.
  • the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly.
  • the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned back to said CPU.
  • the instruction that the CPU fetches from the memory is used to determine what the CPU has to do, thus the CPU cannot proceed processing until the instruction is fetched from the memory.
  • the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU).
  • ISA: CPU instruction set architecture
  • the remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation).
  • operands may be given as a constant value (called an immediate value).
  • operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
  • the executing step is performed.
  • the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs.
  • the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations.
  • the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
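
The four steps described above can be sketched as a small loop. This is only an illustrative model, not the circuitry of any real CPU: the tuple instruction format, the opcode names `ADD` and `SUB`, and the function name `run` are all assumptions made for the example.

```python
# Minimal model of the fetch / decode / execute / write-back cycle.
# The tuple instruction format and the opcode names are hypothetical.

def run(program, registers):
    """Execute a list of (opcode, dest, src1, src2) tuples on a register list."""
    alu_ops = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
    }
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instruction = program[pc]                # a) fetch
        opcode, dest, src1, src2 = instruction   # b) decode ...
        a, b = registers[src1], registers[src2]  # ... and read the register file
        result = alu_ops[opcode](a, b)           # c) execute in the "ALU"
        registers[dest] = result                 # d) write back
        pc += 1                                  # move to the next-in-sequence instruction
    return registers


# r2 = r0 + r1, then r3 = r2 - r0
final_registers = run([("ADD", 2, 0, 1), ("SUB", 3, 2, 0)], [5, 7, 0, 0])
print(final_registers)
```

A JUMP or BRANCH instruction would change `pc` directly instead of incrementing it; that case is left out to keep the sketch short.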
  • the above four CPU steps have to be performed relatively fast.
  • non-local memory means such as cache, on-board memory (e.g., DRAM (Dynamic Random Access Memory)), secondary memory and the like
  • the access time is greatly increased leading to significant delays and to a waste of the valuable CPU processing resources. In turn, it greatly decreases CPU performance and consumes most of the CPU processing time.
  • a conventional processing unit (e.g., CPU) uses a limited set of registers (e.g., 32 or 64 registers)
  • the register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like.
  • this register file is a portion of the CPU, and it is located in the physical proximity to the ALU (Arithmetic Logic Unit) of said CPU.
  • Such a register file can be named a "local register file system".
  • One of the reasons for having a limited local register file system is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via "source 1" input of the local register file system) storing the first value to be processed by the ALU, another register (accessed via "source 2" input of the local register file system) storing the second value to be processed by said ALU, and the last register (destination register, accessed via "destination" input of the local register file system) storing the result value of the ALU processing (e.g., the sum of the values stored within said sources 1 and 2). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low.
  • VLIW: Very Long Instruction Word
  • the CPU register file relatively rarely reaches a capacity of 256 registers.
  • Another reason for having a limited number of registers in the CPU register file is due to hardware limitations related to fast memory access and to capability of using a relatively large number of ports.
  • a conventional ALU requires providing at least two read ports and one write port.
  • a conventional system that implements a CPU also usually contains a memory controller (that comprises an MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, DMA engine, communication controllers (e.g., a USB controller), and the like.
  • the memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
  • the CPU local register file system is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file system (e.g., cache, secondary memory, etc.).
  • the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped register (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file system.
  • the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file system, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral.
  • a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers.
  • the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing.
  • a multi-LOAD request can be generated in a single CPU clock cycle and then, in an additional clock cycle, one or more (destination) registers within the local register file system can be updated with new data values (the results of ALU processing).
  • LOAD or multi-LOAD command for loading data from an external register/memory means (device register, cache memory, etc.);
  • ALU data processing command for executing various operations (e.g., addition, subtraction);
  • STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means - for this, even if working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
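
The LOAD / ALU / STORE sequence described above can be traced with a toy model. The sketch assumes one command per step and uses a plain dictionary as the memory-mapped space; the function name `add_memory_mapped` and the addresses are hypothetical.

```python
# Toy trace of a conventional add on memory-mapped data.
# Data must pass through local registers before and after the ALU step.

def add_memory_mapped(memory, addr_a, addr_b, addr_result):
    """Add two memory-mapped values and return the list of commands issued."""
    trace = []
    r1 = memory[addr_a]        # LOAD first operand into a local register
    trace.append("LOAD")
    r2 = memory[addr_b]        # LOAD second operand into a local register
    trace.append("LOAD")
    r3 = r1 + r2               # ALU command on local registers only
    trace.append("ALU")
    memory[addr_result] = r3   # STORE the result back to the mapped location
    trace.append("STORE")
    return trace


mapped_memory = {0x10: 3, 0x14: 4, 0x18: 0}
commands = add_memory_mapped(mapped_memory, 0x10, 0x14, 0x18)
print(commands, mapped_memory[0x18])
```

With a multi-LOAD the two LOAD entries would collapse into one, which matches the stated minimum of three steps.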
  • CPU peripherals, such as DMA (Direct Memory Access) engines, can be used for conducting such data movements, thereby reading the data from said one CPU memory mapped register/memory means and writing the data to said another off-CPU-chip register/memory means.
  • CPU is not required to generate LOAD and STORE commands.
  • DMA operations can be conducted in parallel with CPU operations.
  • Fig. 1A is a schematic block-diagram 100 of a conventional processing unit 100, according to the prior art.
  • an instruction that comprises an opcode and one or more operands
  • the program memory e.g., RAM 140
  • IR instruction register
  • instruction register 105 is 32 bits long [0...31] bits, wherein the first six bits [0...5] of the instruction provided within said instruction register 105 are opcode (that defines the operation to be performed, e.g., addition, subtraction, etc.); bits [11...15] are an address of the destination register within a register file 106 (the address of a register in which the result of ALU processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated (processed); and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as the "immediate" value (some constant value) or auto-increment, etc.
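
The field layout above can be illustrated with a short encode/decode sketch. One assumption is made loudly here: bit 0 is treated as the most significant bit of the 32-bit word, since the text calls bits [0...5] the "first" bits but does not state the numbering direction. The helper names `field`, `encode` and `decode` are hypothetical.

```python
# Extracting the opcode and register addresses from a 32-bit instruction word.
# ASSUMPTION: bit 0 is the most significant bit (the text does not say).

WORD_BITS = 32

def field(word, first, last):
    """Return bits [first...last] of a 32-bit word, with bit 0 as the MSB."""
    width = last - first + 1
    shift = (WORD_BITS - 1) - last
    return (word >> shift) & ((1 << width) - 1)

def encode(opcode, dest, src1, src2):
    """Pack the four fields; the remaining bits are left as zero."""
    return (opcode << 26) | (dest << 16) | (src1 << 11) | (src2 << 6)

def decode(word):
    return {
        "opcode": field(word, 0, 5),
        "dest": field(word, 11, 15),
        "src1": field(word, 16, 20),
        "src2": field(word, 21, 25),
    }


word = encode(opcode=1, dest=3, src1=1, src2=2)
print(decode(word))
```

Each register field is five bits wide, so it can address at most 32 registers, which is the size limit of the local register file mentioned in the text.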
  • the addresses of the above Sources 1 and 2 are inputted into decoder(s) 120' of register file 106, and as a result, the data of corresponding registers (to which said addresses are related) of said register file 106 is outputted over data bus (one or more lines) 141.
  • the next step is based on the specific instruction to be processed, and can be, for example: a) reading additional data from on-CPU-chip (inside the CPU chip) memory/peripherals or off-CPU-chip (outside the CPU chip) memory/peripherals (by establishing a LOAD command); b) storing data within said memory/peripherals (by establishing a STORE command); and/or c) activating execution unit 130 (e.g., ALU) for performing a mathematical operation, such as addition, subtraction, multiplication, division: in this case, the operands for the execution unit processing are determined by means of control unit 115.
  • the result is written back into the destination register within said register file 106 (the destination register address is defined by bits [11...15] of the executed instruction). Further, the result can be written back into the CPU memory means/peripherals 160 by means of Input/Output Control Unit 150 over bus 108 (by accomplishing a STORE command). Then, the cycle is started over with the next instruction to be further fetched, decoded and executed. Since a program counter 110 holds an address of the current instruction to be executed (and points to a corresponding RAM 140 memory address by means of address bus 119), the CPU always "knows" wherein within said RAM 140 the next instruction can be found. Each time the instruction is completed, program counter 110 is incremented by at least one memory address location: also, for example, when the instruction is a conventional JUMP or BRANCH command, the program counter is changed accordingly.
  • CPU register file 106 is local, and it is a portion of CPU chip (core).
  • I/O control unit 150 (e.g., comprising a memory controller or memory management unit (MMU))
  • ALU operations are not performed directly on the data stored within CPU mapped peripherals/memory means, and these peripherals/memory means are not accessed directly by means of said ALU 130: the data inputted into the ALU is incoming from local register file system 106, to which it is loaded from corresponding memory/peripherals by means of Input/Output (I/O) control unit 150, for example. Therefore, for performing manipulation on data stored outside local register file system 106, the data has first to be loaded into said local register file system 106 by means of I/O control unit 150, thereby executing a LOAD command, and loading the data into the CPU local register file 106 over load/store bus 108.
  • control unit 115 (over control bus 121), which can comprise a controller 126, multiplexers 125, decoders 120" and the like.
  • Control unit 115 receives data to be processed from local register file 106 over data bus 141, and it controls execution unit 130 processing by sending to said execution unit 130 a control signal over bus 121 in accordance with the instruction opcode.
  • execution unit 130 receives the corresponding instruction operands to be processed from said control unit 115, and outputs a result of said processing over bus 108.
  • Fig. 1B is a schematic illustration of a conventional (local) register file 106, according to the prior art.
  • instruction register (IR) 105 (Fig. 1A) is 32 bits long [0...31] bits, wherein bits [11...15] are an address of the destination register within register file 106 (the address of a register in which the result of ALU 130 (Fig. 1A) processing will be stored);
  • bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated; and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register).
  • the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as an immediate value, auto-increment, etc.
  • the addresses of the above Sources 1 and 2 are inputted into register file 106 (IR 16-20 and IR 21-25, respectively) and conveyed to decoders 120 (Fig. 1A). Decoders 120 decode the addresses and enable outputting data of corresponding registers of register file 106 towards ALU for further processing said data (e.g., addition, subtraction of the data and the like). Thus, the data is outputted through "Source 1 Data" and "Source 2 Data" outputs, having a length of 32 bits. After ALU 130 processes said data, it stores the result (32 bits long) in a destination register within register file 106 (the destination register is defined by the IR 11-15 address).
  • US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and in turn, accepting destination operands for instruction execution.
  • the different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures.
  • the cache memories may be directly accessed by using cache addresses.
  • US 6,178,482 presents a data processing apparatus which uses a register file to provide a faster alternative to indirect memory addressing.
  • a functional unit is connected to a data register file which comprises a plurality of registers, each of which is accessed by a corresponding register number.
  • the functional unit of US 6,178,482 can execute at least one indirect register access instruction that comprises an operand register number field.
  • Instruction decode circuitry connected to the register file and the functional unit, is responsive to the indirect register access instruction to recall data stored in an operand register specified by the operand register number in the instruction, identify the recalled data as a register access number, and recall operand data from a data register corresponding to the register access number for use as an operand by the functional unit.
  • one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data by providing a substantially direct memory means access for one or more CPU execution units (for processing the data).
  • the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle.
  • Another advantage of the present invention is that it provides an improved processing unit (e.g., CPU (Central Processing Unit), microprocessor and the like) that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access.
  • the spread register file system can be further shared with other CPUs, or with other internal/external (on-chip/off-chip) peripherals or devices.
  • Still another advantage of the present invention is that it provides a method and system in which, for reducing the number of instructions and CPU clock cycles required for manipulating/processing memory mapped register data, there is substantially no need to change the structure of the conventional CPU program word.
  • Still another advantage of the present invention is that it eliminates the need to use conventional DMA engines.
  • the present invention relates to providing an improved processing unit (e.g., CPU, microprocessor and the like), and a method thereof, that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access to memory means that are coupled to said improved processing unit.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing at least one mapped address, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter connected to one or more data units, for receiving one or more mapped addresses from said local register file system and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and b.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing the data outputted from said spread register file system.
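
The arrangement in this embodiment can be modelled roughly as follows. This is a behavioural sketch only, under several assumptions that the text does not specify: the address converter is modelled as a base-address range lookup, each data unit is a plain list of cells, and all class, function and variable names (`DataUnit`, `SpreadRegisterFile`, `execute_add`, the `0x100`/`0x200` bases) are hypothetical.

```python
# Behavioural sketch: local registers hold mapped addresses, an address
# converter resolves them to cells of data units, and the execution unit
# operates on those cells directly, with no LOAD or STORE commands.

class DataUnit:
    """A peripheral, memory or register block made of addressable cells."""
    def __init__(self, size):
        self.cells = [0] * size

class SpreadRegisterFile:
    def __init__(self, units):
        # units: list of (base_mapped_address, DataUnit) pairs (assumed layout)
        self.units = units

    def convert(self, mapped_address):
        """Address converter: mapped address -> (data unit, cell index)."""
        for base, unit in self.units:
            if base <= mapped_address < base + len(unit.cells):
                return unit, mapped_address - base
        raise ValueError(f"unmapped address {mapped_address:#x}")

    def read(self, mapped_address):
        unit, cell = self.convert(mapped_address)
        return unit.cells[cell]

    def write(self, mapped_address, value):
        unit, cell = self.convert(mapped_address)
        unit.cells[cell] = value

def execute_add(local_registers, spread, dest, src1, src2):
    """One instruction: operands name local registers that store mapped addresses."""
    a = spread.read(local_registers[src1])
    b = spread.read(local_registers[src2])
    spread.write(local_registers[dest], a + b)


timer, uart = DataUnit(4), DataUnit(4)
spread = SpreadRegisterFile([(0x100, timer), (0x200, uart)])
timer.cells[1] = 10
uart.cells[0] = 32
local_registers = [0x101, 0x200, 0x203]  # mapped addresses, not data
execute_add(local_registers, spread, dest=2, src1=0, src2=1)
print(uart.cells[3])
```

The instruction operands stay the same width as in a conventional program word, because they still only select local registers; the wider mapped address lives inside the register.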
  • the processing unit device further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
  • the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
  • the processing unit device further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
  • each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
  • each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
  • the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
  • a result of the data processing is stored within the one or more memory cells of the corresponding data unit of the spread register file system or within the local register file system.
  • the local register file system further comprises at least one write address input port, into which is inputted an address of one or more memory cells of said local register file system to be auto-incremented.
  • the local register file system further comprises at least one write data input port, into which is provided a command to auto-increment an address of one or more memory cells of said local register file system.
  • At least one data unit of the spread register file system is shared between two or more processing units.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing at least one mapped address, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing data based on which at least one mapped address is to be generated, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; and c) a spread register file system, comprising: c.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; c.2. at least one address converter connected to one or more data units, for receiving one or more mapped addresses from said at least one address generator and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and d) at least one execution unit for processing the data outputted from said spread register file system.
  • the local register file comprises one or more registers for storing one or more mapped addresses.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing data based on which at least one mapped address is to be generated, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; c) a spread register file system, comprising: c.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: c.1.1. receive at least one mapped address; c.1.2. decode the received generated at least one mapped address and determine corresponding at least one memory data unit address; c.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and c.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and c.2. at least one output port for outputting said data to be processed from said one or more memory cells; and d) at least one execution unit for processing the data outputted from said spread register file system.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing first data to be processed, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the second data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the second data processing; and b.3. at least one output port for outputting said second data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing said first data outputted from said local register file system and for processing said second data outputted from said spread register file system.
  • the local register file further comprises one or more registers for storing third data based on which the at least one mapped address is to be generated.
  • the processing unit device further comprises at least one address generator for generating the at least one mapped address based on the third data.
  • the local register file system further comprises one or more write address input ports, into which are inputted the addresses of memory cells of said local register file system to be auto-incremented.
  • the local register file system further comprises one or more write data input ports, into which are provided commands to auto-increment the addresses of memory cells of said local register file system.
  • a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing first data to be processed, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; b.1.3.
  • the local register file further comprises one or more registers for storing fourth data based on which the at least one mapped address is to be generated.
  • the processing unit device further comprises at least one address generator for generating the at least one mapped address based on the fourth data.
  • a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store at least one PU mapped address; d) reading said at least one PU mapped address from said local register file system; e) performing second decoding or converting said at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the at least one data unit address; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit or into one or more corresponding register
  • a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store first data to be used for generation of at least one PU mapped address; d) reading said first data stored within said one or more registers of said local register file system; e) generating said at least one PU mapped address based on said first data; f) performing second decoding or converting the generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one data unit address; h) processing the read second data; and i
  • a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system and at least one of said operands is a PU mapped address, and wherein at least one of said registers stores first data to be processed; d) reading said first data to be processed from said local register file system; e) processing the read first data and writing back a result of said processing into the one or more corresponding registers of said local register file system; f) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one
  • a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and at least one of said operands is first data, based on which at least one PU mapped address to be generated, and wherein at least one of said registers stores second data to be processed; d) reading said first data to be used for generation of said at least one PU mapped address and reading said second data to be processed; e) generating said at least one PU mapped address based on said first data; f) processing the second data and writing back a result of said processing into one or more corresponding registers of said local register file system; g) performing second decoding or converting the generated at least one
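  • The first method variant above (a PU mapped address is read from a local register and then converted into a data unit address) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `DataUnit` class, the base addresses, the pre-decoded tuple standing in for a fetched program word, and the single "ADD" opcode are not taken from the claims.

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """Hypothetical PU data unit: a block of memory cells with a base mapped address."""
    base: int    # first PU mapped address served by this data unit (assumed)
    cells: list  # memory cells of the data unit

    def contains(self, mapped):
        return self.base <= mapped < self.base + len(self.cells)

    def convert(self, mapped):
        # "Second decoding": PU mapped address -> data unit address.
        return mapped - self.base


def locate(mapped, data_units):
    """Find the data unit that serves a mapped address and the address inside it."""
    for unit in data_units:
        if unit.contains(mapped):
            return unit, unit.convert(mapped)
    raise LookupError(f"no data unit serves mapped address {mapped:#x}")


def execute_instruction(program_memory, pc, lrf, data_units):
    # Steps a) to c): fetch and first decoding, collapsed into a tuple lookup here.
    opcode, src1_reg, src2_reg, dst_reg = program_memory[pc]
    if opcode != "ADD":
        raise ValueError("only ADD is modelled in this sketch")
    # Step d): read the PU mapped addresses from the local register file.
    # Step e): convert each mapped address into a data unit address.
    src1_unit, src1_addr = locate(lrf[src1_reg], data_units)
    src2_unit, src2_addr = locate(lrf[src2_reg], data_units)
    dst_unit, dst_addr = locate(lrf[dst_reg], data_units)
    # Steps f) and g): read the data and process it.
    result = src1_unit.cells[src1_addr] + src2_unit.cells[src2_addr]
    # Step h): write the result back into the destination data unit.
    dst_unit.cells[dst_addr] = result
    return result


units = [DataUnit(0x1000, [0] * 16), DataUnit(0x2000, [0] * 16)]
units[0].cells[3] = 40
units[1].cells[5] = 2
lrf = [0] * 32
lrf[1], lrf[2], lrf[3] = 0x1003, 0x2005, 0x1004  # registers hold mapped addresses
program_memory = [("ADD", 1, 2, 3)]
result = execute_instruction(program_memory, 0, lrf, units)
```

  • In this sketch the local registers hold only addresses, so one instruction reads two operands from different data units and writes the sum into a third location without separate load and store instructions.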
  • Fig. 1A is a schematic block-diagram of a conventional processing unit, according to the prior art
  • Fig. 1B is a schematic illustration of a conventional (local) register file, according to the prior art
  • Fig. 2A is a schematic illustration of connecting a spread register file system to a local register file system, to an instruction register (storing a program word - a current instruction to be executed) and to an execution unit (such as ALU), according to an embodiment of the present invention
  • Fig. 2B is another schematic illustration of connecting a spread register file system to a local register file system, to an instruction register and to execution units, according to another embodiment of the present invention
  • Fig. 3A is a schematic illustration of a spread register file system, according to an embodiment of the present invention.
  • Fig. 3B is another schematic illustration of a spread register file system, according to another embodiment of the present invention.
  • Fig. 4A is a pipeline representation of operating with a local register file system and with a spread register file system, according to an embodiment of the present invention.
  • Fig. 4B is another pipeline representation of operating with a local register file system and with a spread register file system, according to another embodiment of the present invention.
  • spread register file system or "SRF" system
  • SRF spread register file
  • LRF local register file system
  • the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means.
  • CPU processing unit
  • processing processing
  • data operation such as data manipulation, data transfer, addition or subtraction of data and the like.
  • Fig. 2A is a schematic illustration of connecting a spread register file (SRF) system 206 to local register file system (LRF system) 106, to instruction register 105 (storing a program word — a current instruction to be executed) and to execution unit 130 (such as ALU), according to an embodiment of the present invention.
  • SRF spread register file
  • LRF system local register file system
  • instruction register 105 storing a program word — a current instruction to be executed
  • execution unit 130 such as ALU
  • SRF system 206 relates to the entire (complete) CPU mapped memories: cache memories, on-chip peripherals (e.g., RAM), on-board memories (e.g., DRAM, SRAM), secondary memories (e.g., off-chip peripherals, hard disks, etc.), and any other peripherals/memory means (e.g., CDs (Compact Discs), DVDs (Digital Versatile Discs), etc.).
  • LRF system 106 is a conventional local register file system that comprises a number of registers for storing data to be processed, or data being a result of processing.
  • spread register file system 206 comprises conventional peripheral address converters, as presented in Fig. 3.
  • Each peripheral address converter is used for converting a CPU memory mapped address to corresponding peripheral (device) address (e.g., the peripheral device can be a USB device, cache memory, RAM, SRAM, tightly coupled memory, DRAM, hard disks, CDs, DVD, etc.).
  • one peripheral address converter converts a CPU mapped address of "source 1" register/memory means to corresponding address of said register/memory means within the corresponding peripheral that actually stores said data (to be processed by means of execution unit 130, such as ALU); another peripheral address converter converts a CPU mapped address of "source 2" register/memory means that stores additional data to be processed by means of said execution unit 130; and still another one peripheral address converter — converts a CPU mapped address of "destination" register/memory means, in which a result of the above execution unit 130 processing (e.g., addition, subtraction) will be stored.
  • the peripheral address converter can be implemented either in hardware and/or in software.
  • instruction register 105 contains a program word, which can be, for example, 32 or 64 bits long.
  • the program word is 32 bits long, wherein each of "source 1" address 222', "source 2" address 223' and destination address 224' can be, for example, 5 bits long.
  • the program word is 64 bits long.
  • Each of the above addresses relates to a specific address within the local register file system 106, said specific address represented, for example, by a 2^5 binary number (5 bits).
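  • As a rough illustration of a 32-bit program word carrying three 5-bit register addresses, the fields can be extracted with shifts and masks. The bit positions below (opcode in the upper 17 bits, then "source 1", "source 2" and "destination") are a hypothetical layout chosen for the sketch; the text does not specify where each field sits.

```python
FIELD_BITS = 5
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0b11111: selects one of 2^5 = 32 registers


def decode_program_word(word):
    """Split a 32-bit program word into opcode and three 5-bit register addresses.

    Assumed layout: [31:15] opcode, [14:10] source 1, [9:5] source 2,
    [4:0] destination.
    """
    if not 0 <= word < (1 << 32):
        raise ValueError("program word must fit in 32 bits")
    return {
        "opcode": word >> (3 * FIELD_BITS),
        "source1": (word >> (2 * FIELD_BITS)) & FIELD_MASK,
        "source2": (word >> FIELD_BITS) & FIELD_MASK,
        "destination": word & FIELD_MASK,
    }


word = (0x7 << 15) | (3 << 10) | (17 << 5) | 31
fields = decode_program_word(word)
```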
  • the data outputted from LRF system 106 are CPU mapped addresses related to corresponding memory cells of peripherals/memory means (e.g., USB device) provided within spread register file system 206.
  • CPU mapped addresses are related to memory cells within the entire CPU memory map.
  • the entire CPU memory map can be defined by 2^32 addresses (if implemented for MIPS32 CPU), or by 2^64 addresses (if implemented for MIPS64 CPU).
  • the data outputted from LRF system 106 is inputted into address generator 250 that further generates said CPU mapped addresses.
  • addresses 222', 223' and 224' inputted into LRF system 106 can be, for example, each 5 bits long, and CPU mapped "source 1" address, CPU mapped "source 2" address, CPU mapped "destination" address outputted from address generator 250 over lines 212, 213 and 214, respectively, can be each 32 bits long (if implemented for MIPS32 CPU). Similarly, for the MIPS64 CPU implementation, each of these CPU mapped addresses is 64 bits long. It should be noted that address generator 250 can generate addresses in various ways based on different address generating functions.
  • address generator 250 can receive in its input an 8-bit address (represented by a 2^8 binary number) from LRF system 106, and then it can add this address to another 2^32 number, thereby generating a new CPU mapped address that is 32 bits long.
  • the above 2^32 number can be a predefined number, random number or a number that is calculated by means of address generator 250 according to some predefined function(s)/expression(s).
  • said new CPU mapped address can be generated according to opcode 221' that can be inputted from instruction register 105 into said address generator 250.
  • opcode 221' relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only "source 1" CPU mapped address and CPU mapped destination addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated).
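  • The addition-based address generation described above, together with the opcode-dependent case in which no "source 2" address is produced, might look like the following sketch. The function names, the "MOVE" opcode string, the shared base value and the wrap-around to 32 bits are all assumptions, not details given in the text.

```python
ADDRESS_BITS = 32
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def generate_mapped_address(lrf_value, base):
    """Add an 8-bit value from the local register file to a 32-bit base number."""
    if not 0 <= lrf_value < (1 << 8):
        raise ValueError("value from the local register file must fit in 8 bits")
    # Wrapping to 32 bits is an assumption; the text only says the result is 32 bits long.
    return (base + lrf_value) & ADDRESS_MASK


def generate_for_opcode(opcode, lrf_values, base):
    """Generate only the mapped addresses that the opcode needs."""
    if opcode == "MOVE":
        needed = ("source1", "destination")  # "source 2" is not generated
    else:
        needed = ("source1", "source2", "destination")
    return {name: generate_mapped_address(lrf_values[name], base) for name in needed}


values = {"source1": 0x2A, "source2": 0x10, "destination": 0x01}
move_addresses = generate_for_opcode("MOVE", values, 0x4000_0000)
add_addresses = generate_for_opcode("ADD", values, 0x4000_0000)
```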
  • the CPU mapped "source 1" and "source 2" addresses generated by means of address generator 250 are inputted into SRF system 206 over lines 212 and 213, respectively, and then the data stored within corresponding registers/memory means of said SRF system 206 (said data being related to said CPU mapped "source 1" and "source 2" addresses) is outputted from said SRF system 206 over buses (lines) 231 and
  • the outputted data is provided into execution unit 130 and processed in accordance with opcode 221' of the program word (instruction) provided within instruction register 105.
  • the data of "source 1" can be added to the data of "source 2", or the data of "source 1” can be subtracted from the data of "source 2", etc.
  • the command for performing such an operation is provided into said execution unit 130 via control bus 234.
  • the corresponding result is written back into the "destination" register/memory means (for example, located within the corresponding peripheral device, such as a USB device) over data bus
  • said corresponding result can be written back into LRF system 106, according to a "destination" address provided in operand 224' of the program word.
  • said corresponding result is inputted into multiplexer 270 along with data provided over line 263" from LRF system 106, said data being a CPU mapped address, or related to said CPU mapped address (the data based on which said CPU mapped address is generated by means of address generator 250 and then auto-incremented).
  • According to operand 224' (the "destination" address) provided within instruction register 105, multiplexer 270 outputs either data provided over line 233 (from execution unit 130') or data provided over line 263" (from LRF system 106). The data provided over line 263" is outputted from said multiplexer 270 if "destination" address 224' corresponds to a SRF system 206 address. On the other hand, the data provided over line 233 is outputted from said multiplexer 270 if "destination" address 224' does not correspond to said SRF system 206 address and corresponds to an address of a specific register (within LRF system 106) that stores data to be processed.
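  • The selection rule of multiplexer 270 reduces to a two-way choice. In the sketch below the function and parameter names are hypothetical, and the test for whether the "destination" address corresponds to an SRF system address is passed in as a boolean because the text does not say how that test is made.

```python
def multiplexer_270(destination_is_srf, execution_result, lrf_address_data):
    """Choose what is passed on toward the local register file.

    destination_is_srf: True if the "destination" address corresponds to an
        SRF system address (how this is decided is not modelled here).
    execution_result: data arriving from the execution unit.
    lrf_address_data: data arriving from the local register file, i.e. a
        mapped address or data related to one.
    """
    return lrf_address_data if destination_is_srf else execution_result


to_lrf_when_srf = multiplexer_270(True, execution_result=42, lrf_address_data=0x1004)
to_lrf_when_local = multiplexer_270(False, execution_result=42, lrf_address_data=0x1004)
```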
  • I/O control units e.g., memory management units (MMUs), north and south bridges, etc.
  • MMUs memory management units
  • CPU mapped memory means such as cache memories, off-CPU-chip memories and other memory means.
  • executing unit 130 is enabled to operate substantially directly on each of said CPU mapped memory means provided within spread register file system 206 and local register file system 106, i.e.
  • spread register file system 206 can operate with more than one executing unit 130. Further, instruction register 105 and/or executing unit 130 can be incorporated within said spread register file system 206.
  • spread register file system 206 can be provided on-CPU-chip (incorporated within a CPU) or off-CPU-chip. Further, according to still another embodiment of the present invention, a portion of spread register file system 206 can be provided on-CPU-chip and another portion off-CPU-chip. According to still another embodiment of the present invention, the CPU mapped addresses (or data related to said CPU mapped addresses) outputted from LRF system 106 over lines 261', 262' and 263', can be auto-incremented for enabling accessing other registers/memory cells of peripherals (data units) of SRF system 206, which correspond to the auto-incremented CPU mapped addresses.
  • the auto-incremented "source 1", “source 2” and “destination” CPU mapped addresses are inputted into LRF system 106 over lines 261", 262" and 263'". Then, such addresses are written into corresponding registers (of LRF system 106), which are assigned by addresses provided over lines 281, 282 and 283 from instruction register 105.
  • auto-increment commands can be provided within operands 222', 223' and 224' of the program word (for example, setting a specific bit in each of said operands to "0" can indicate that the auto-increment of said corresponding operand is enabled, and setting said specific bit to "1" can indicate that the auto-increment command is disabled).
  • Fig. 2B is another schematic illustration of connecting spread register file system 206 to local register file system 106, to instruction register 105 and to execution units 130' and 130", according to another embodiment of the present invention.
  • instruction register 105 contains a VLIW (Very Long Instruction Word) program word (instruction), that comprises: two opcodes 221' and 221"; two "source 1" addresses 222' and 222"; two "source 2" addresses 223' and 223" (wherein "source 1" and "source 2" addresses 222' and 223', respectively, can be related to CPU mapped addresses that can be provided in corresponding registers within LRF system 106 (or to be further generated by means of address generator 250), and wherein "source 1" and "source 2" addresses 222" and 223", respectively, can be related to specific registers within LRF system 106 that store data to be processed); and two "destination" addresses 224' and 224" (wherein "destination" address 224' relates to a corresponding peripheral
  • all operands of the VLIW program word ("source 1", "source 2" and "destination" addresses) are inputted into LRF system 106 over lines 251'/251", 252'/252" and 253'/253", respectively. Then, the data that is stored in corresponding registers within said LRF system 106 (the registers that are assigned with the above addresses) is outputted from said LRF system 106. It should be noted that the data that corresponds to "source 1" address 222', "source 2" address 223' and "destination" address 224' can be CPU mapped addresses, each of which is related to a memory cell(s) or register(s) of a peripheral/memory means of SRF system 206.
  • CPU mapped addresses can be generated by means of address generator 250, according to the data (related to said "source 1" address 222', "source 2" address 223' and “destination” address 224') outputted from LRF system 106 over lines 261', 262' and 263', respectively.
  • the data that corresponds to "source 1" address 222" and "source 2" address 223" is data to be processed by means of execution unit 130". Said data is outputted from LRF system 106 over lines 231" and 232", and then is provided into execution unit 130" (e.g., ALU).
  • execution unit 130" processes the received data in accordance with opcode 221" of the VLIW program word provided within instruction register 105.
  • the data provided over line 231' can be added to the data provided over line 231", and the like.
  • a command for performing such an operation is provided into said execution unit 130" via control bus 234".
  • the result of processing is then provided over line 233" to LRF system 106 and stored within a specific register (of said LRF system 106), which corresponds to "destination" address 224" provided in the VLIW program word.
  • address generator 250 generates addresses being related to the entire CPU mapped memory.
  • addresses 222', 223' and 224' can be, for example, each 5 bits long, and CPU mapped "source 1" address provided over line (bus) 212, CPU mapped "source 2" address provided over line 213 and CPU mapped "destination" address provided over line 214, can be each 32 bits long (if implemented for MIPS32 CPU), since each of these mapped addresses relates to the entire CPU memory map that is represented by 2^32 addresses.
  • each of these mapped addresses is 64 bits long.
  • address generator 250 can generate addresses in various ways based on different address generating functions.
  • address generator 250 can receive in its input an 8-bit address (represented by a 2^8 binary number) from LRF system 106, and then it can add this address to another 2^32 number, thereby generating a new CPU mapped address that is 32 bits long.
  • the above 2^32 number can be a predefined number, random number or a number that is calculated by means of address generator 250 according to some predefined function(s)/expression(s).
  • said new CPU mapped address can be generated according to opcode 221" that can be inputted from instruction register 105 into said address generator 250.
  • opcode 221' relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only "source 1" CPU mapped address and CPU mapped destination addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated).
  • opcode 221' (or operands 222', 223' and 224' of the program word) can comprise a command (e.g., a predefined bit within said opcode 221' can be set to "0" or to "1", accordingly) for instructing address generator 250 to generate the corresponding CPU mapped addresses.
  • if said bit is "1", then address generator 250 passes the data (in this case, CPU mapped addresses) inputted over lines 261', 262' and 263' as it is, and if said bit is "0", then address generator 250 adds the data (in this case, non-CPU mapped addresses) inputted over lines 261', 262' and 263', to additional data (base addresses) stored within its corresponding registers, thereby generating corresponding CPU mapped addresses related to peripherals/memory means of SRF 206.
  • the data (addresses) outputted from LRF system 106 over lines 261', 262' and 263' can be further related to corresponding registers within address generator 250: each of said addresses outputted from LRF system 106 can be related to a different base address of 32 bits (for MIPS32 technology), or 64 bits (for MIPS64 technology) provided within said address generator 250, based on which a CPU mapped address can be generated.
  • each CPU mapped address can be generated according to values stored within these corresponding registers.
  • the "source 1" generated CPU mapped address (provided from address generator 250 over line 212) can be the sum of: a value outputted from LRF system 106 over line 261' (which can be also a value of operand 222') and the corresponding "source 1" base address stored within said address generator 250.
  • the "source 2" (or “destination”) generated CPU mapped address can be the sum of: a value outputted from LRF system 106 (which can be also a value of operand 223' (or 224')) and the corresponding "source 2" (or “destination”) base address stored within said address generator 250.
  • LRF system 106 is not required to store CPU mapped addresses (e.g., 32 or 64 bits long), and for each operand 222', 223' and 224', said address generator 250 outputs a corresponding CPU mapped address.
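  • The two modes of address generator 250 (pass the incoming value through unchanged, or add it to a per-operand base address) can be sketched as below. The class name, the dictionary of base registers and the base values are hypothetical. The text states the "0" case explicitly (add to the base address); treating "1" as the pass-through case is assumed here as its complement.

```python
class AddressGenerator:
    """Hypothetical model of an address generator with per-operand base registers."""

    def __init__(self, bases, address_bits=32):
        self.bases = dict(bases)  # one base address per operand (assumed)
        self.mask = (1 << address_bits) - 1

    def generate(self, operand_name, value, control_bit):
        if control_bit == 1:
            # The value is already a CPU mapped address: pass it as it is.
            return value & self.mask
        # The value is a short, non-mapped address: add the operand's base.
        return (self.bases[operand_name] + value) & self.mask


generator = AddressGenerator({
    "source1": 0x1000_0000,
    "source2": 0x2000_0000,
    "destination": 0x3000_0000,
})
passed = generator.generate("source1", 0x1000_0040, control_bit=1)
summed = generator.generate("source2", 0x08, control_bit=0)
```

  • With the base addresses held inside the generator, the local registers (or the operands themselves) only need to carry short offsets rather than full 32-bit or 64-bit mapped addresses.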
  • the CPU mapped addresses (or data related to said CPU mapped addresses) outputted from LRF system 106 over lines 261', 262' and 263' can be auto-incremented for enabling accessing other registers/memory cells of peripherals of SRF system 206, which correspond to the auto-incremented CPU mapped addresses.
  • the auto-incremented "source 1", "source 2" and "destination" CPU mapped addresses are inputted into LRF system 106 over lines 261", 262" and 263'". Then, such addresses are written into corresponding registers (of LRF system 106), which are assigned by addresses provided over lines 281, 282 and 283 from instruction register 105.
  • auto-increment commands can be provided within operands 222', 223' and 224' of the program word (for example, setting a specific bit in each of said operands to "0" can indicate that the auto-increment of said corresponding operand is enabled, and setting said specific bit to "1" can indicate that the auto-increment command is disabled).
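  • A minimal sketch of the auto-increment behaviour follows. It assumes a 6-bit operand whose bit 5 is the auto-increment flag ("0" enabled, "1" disabled, as in the example above) and whose low 5 bits select the local register. The flag position, the increment step of 1 and the choice to return the value before incrementing are all assumptions.

```python
AUTO_INCREMENT_FLAG = 1 << 5  # assumed position of the flag bit within the operand
REGISTER_MASK = 0x1F          # low 5 bits: local register address
ADDRESS_MASK = 0xFFFF_FFFF    # 32-bit mapped addresses


def read_mapped_address(lrf, operand, step=1):
    """Read a mapped address from the local register file, auto-incrementing if enabled."""
    register = operand & REGISTER_MASK
    mapped = lrf[register]
    if operand & AUTO_INCREMENT_FLAG == 0:  # flag bit "0": auto-increment enabled
        # The incremented address is written back into the same local register.
        lrf[register] = (mapped + step) & ADDRESS_MASK
    return mapped


lrf = [0] * 32
lrf[4] = 0x1000
first = read_mapped_address(lrf, 4)                         # flag "0": increments
second = read_mapped_address(lrf, 4 | AUTO_INCREMENT_FLAG)  # flag "1": no increment
```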
  • the result of data processing (performed by means of execution unit 130'), said result provided over line 233', is inputted into multiplexer 270 along with data provided over line 263" from LRF system 106, said data being a CPU mapped address or related to said CPU mapped address.
  • According to operand 224' ("destination" address) provided within instruction register 105, multiplexer 270 outputs either data provided over line 233' (from execution unit 130') or data provided over line 263" (from LRF system 106).
  • the data provided over line 263" is outputted from said multiplexer 270 if "destination" address 224' corresponds to a SRF system 206 address.
  • the data provided over line 233' is outputted from said multiplexer 270 if "destination" address 224' does not correspond to said SRF system 206 address and corresponds to an address of a specific register (within LRF system 106) that stores data to be processed. It should be noted that when inputting (writing) the data into LRF system 106/SRF 206, corresponding write-enable commands are provided to said LRF system 106/SRF 206. Similarly, when outputting data from said LRF system 106/SRF 206, corresponding read-enable commands are provided.
  • one or more operands of instruction register 105 are CPU mapped addresses or data, based on which one or more CPU mapped addresses are generated by means of address generator 250.
  • Fig. 3A is a schematic illustration of spread register file system 206, according to an embodiment of the present invention.
  • Spread register file system 206 receives as inputs: CPU mapped "source 1" address (MS1 address) over bus 212, CPU mapped "source 2" address (MS2 address) over bus 213 and CPU mapped "destination" address (MD address) over bus 214 (each 32 bits long, for example).
  • MS1 address CPU mapped "source 1" address
  • MS2 address CPU mapped "source 2" address
  • MD address CPU mapped "destination" address
  • addresses are converted by means of address converters 320, 321 and 322, respectively to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302, 303, ..., 310 (e.g., cache memories, tightly coupled memories, secondary memories, SRAM, DRAM, disk-on-keys, hard disks, CDs (Compact Discs), DVDs, or any other peripheral/memory means).
  • peripheral/memory means such as peripheral/memory means 301, 302, 303, ..., 310
  • said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
  • the above address can be converted according to opcode 221' (Fig. 2A) that can be inputted from instruction register 105 (Fig.
  • the converted "source 1" and "source 2" addresses are inputted into corresponding peripheral/memory means 301, 302, 303, ..., N, which in turn output corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232.
  • said data is processed (executed) by means of one or more execution units 130' (such as ALUs).
  • execution units 130' such as ALUs
  • the processing result is provided over write back bus 233' to corresponding peripheral/memory means, such as peripherals 301, 302, 303, ..., N, to be stored in corresponding converted destination addresses (CD addresses) within said peripheral/memory means 301, 302, 303, ..., N.
  • the processing result is provided into LRF system 106 (Fig. 2A).
  • the "source 1", “source 2", and “destination" memory cells can be physically located within the same or within different peripheral/memory means (such as peripheral/memory means 301, 302, 303,..., N).
  • address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary "0" or "1") to each of said peripheral/memory means (data units) 301, 302, 303, ..., N, for enabling reading or writing from or to said peripheral/memory means.
  • WE/CS commands can be provided to each of said peripheral/memory means 301, 302, 303, ..., N, when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302 and 303.
  • CS read command
  • WE write command
  • address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", “source 2" and "destination" addresses into corresponding peripheral/memory means addresses.
  • the address decoding (or the address conversion) is performed within one or more peripherals/memory means 301, 302, 303, ..., N.
  • the need to provide said address converters 320, 321 and 322 is substantially eliminated.
  • Peripherals/memory means 301, 302, 303, ..., N can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses within said peripherals/memory means 301, 302, 303, ..., N, in which the required data is stored (or to be stored).
  • blocks 320, 321 and 322 can provide WE/CS commands to peripherals/memory means 301, 302, 303, ..., N, and do not perform the address conversion. Further, it should be noted that WE/CS commands can be generated by means of control unit 350.
  • an address converter (such as address converter 320, 321 or 322) can be incorporated (integrated) within each (or one or more) peripheral/memory means 301, 302, 303, ..., or N.
  • said peripheral/memory means receives a CPU mapped address and determines by means of the integrated address converter (according to predefined base-addresses of said peripheral/memory means), whether the received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means.
  • the base-addresses of each peripheral/memory means can further be changed dynamically as needed.
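  • The integrated address converter described above, in which each peripheral/memory means checks a received CPU mapped address against its own base address, can be sketched as follows. The `Peripheral` class, its `size` field and the example base addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Peripheral:
    """Hypothetical peripheral/memory means with an integrated address converter."""
    name: str
    base: int  # base CPU mapped address; can be changed at run time
    size: int  # number of addressable cells (assumed)

    def decode(self, mapped):
        """Return the internal address, or None if the mapped address
        belongs to another peripheral/memory means."""
        if self.base <= mapped < self.base + self.size:
            return mapped - self.base
        return None


sram = Peripheral("sram", 0x1000_0000, 0x1000)
usb = Peripheral("usb", 0x2000_0000, 0x100)

in_sram = sram.decode(0x1000_0010)  # this address falls inside the SRAM window
in_usb = usb.decode(0x1000_0010)    # the USB device ignores it

usb.base = 0x3000_0000              # base address changed dynamically
after_move = usb.decode(0x3000_0004)
```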
  • Fig. 3B is another schematic illustration of spread register file system 206, according to another embodiment of the present invention.
  • a single execution unit 130' is used for both LRF system 106 (Fig. 2A) and SRF system 206.
  • two multiplexers 371 and 372 are provided, each of which selects the data to be processed (either data outputted from SRF system 206 or data outputted from LRF system 106) according to a control signal provided from control unit 350.
  • the control signal is provided according to opcode 221' (Fig. 2A).
  • an opcode related to processing of LRF system data differs from an opcode related to processing of SRF system data.
  • opcode 221' can comprise a control bit (that can be preset to "0" or "1"), which indicates whether SRF system or LRF system data has to be processed by means of execution unit 130': for example, if the bit is set to "1", then the data from SRF system is processed and if the bit is set to "0", then the data from LRF system is processed.
  • multiplexers 371 and 372 can output the data into execution unit 130', according to said control bit.
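  • The operand selection performed by multiplexers 371 and 372 in front of a single execution unit can be sketched as below, using the example encoding from the text ("1" selects SRF system data, "0" selects LRF system data). The function names and the use of addition as the operation are assumptions.

```python
def multiplexer(control_bit, srf_data, lrf_data):
    """Two-input multiplexer: "1" selects SRF system data, "0" selects LRF system data."""
    if control_bit not in (0, 1):
        raise ValueError("control bit must be 0 or 1")
    return srf_data if control_bit == 1 else lrf_data


def execution_unit_add(control_bit, srf_operands, lrf_operands):
    """Single execution unit fed by two multiplexers sharing one control bit."""
    a = multiplexer(control_bit, srf_operands[0], lrf_operands[0])  # multiplexer 371
    b = multiplexer(control_bit, srf_operands[1], lrf_operands[1])  # multiplexer 372
    return a + b


srf_sum = execution_unit_add(1, srf_operands=(10, 20), lrf_operands=(1, 2))
lrf_sum = execution_unit_add(0, srf_operands=(10, 20), lrf_operands=(1, 2))
```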
  • Fig. 4A is a pipeline representation 400 of operating with local register file system 106 (Fig. 2A) and with spread register file system 206 (Fig. 2A), according to an embodiment of the present invention.
  • the pipeline has 12 stages (T0 to T11), each of which can correspond to a single CPU clock cycle.
  • the address of the instruction (program word) to be fetched is conveyed from the Program Counter (PC) to the CPU program memory (e.g., RAM (not shown)).
  • the instruction (program word) is fetched from said program memory into instruction register 105 (Fig. 2A).
  • the fetched instruction decoding is performed by means of control unit 350 (Fig.
  • the decoded opcode relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only "source 1" CPU mapped address and CPU mapped destination addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated).
  • an address of the next program counter instruction is determined, based on the decoded instruction. Also, for example, when a conventional JUMP or BRANCH command is issued, then the next program counter address is calculated in accordance with a pointer of said JUMP/BRANCH command.
  • the CPU mapped addresses are converted by means of said address converters 320, 321 and 322, respectively to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302 and 303 (Fig. 3A) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means).
  • peripheral/memory means such as peripheral/memory means 301, 302 and 303 (Fig. 3A) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means).
  • said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
  • the above address can be converted according to opcode 221' (Fig. 2A) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350 (Fig.
  • said peripherals/memory means 301, 302 and 303 generate a READ request to their internal memory, thereby enabling reading corresponding data stored within them at the received converted addresses. In the next stage T6, said corresponding data is read and ready, and then at stage T7, said data is latched and conveyed to the "source 1" and "source 2" read buses 231 and 232, respectively.
  • the data is provided over said read buses 231 and 232 into execution unit 130' (e.g., ALU).
  • T9 and T10 the data is processed by means of said execution unit 130. It should be noted that the data can be processed only in stage T9, and no further processing can be required.
  • the pipeline can have, for example, only 11 stages (T0 to T10).
  • the processing result is written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302, 303, ..., N, or within local register file system 106. It should be noted that the writing back operation can take more than a single CPU clock cycle.
  • CPU control unit 350 controls the pipeline process by generating required control signals during the pipeline stages.
  • CPU stalls are reduced or substantially eliminated. For example, there can be substantially no CPU stalls if the access latency to SRF system 206 is 6 (or less) CPU clock cycles and the pipeline is relatively deep (e.g., 12 stages).
  • a number of instructions and CPU clock cycles required for manipulating/processing is significantly reduced, compared to the prior art.
  • the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle, enabling providing a substantially direct access between execution unit 130 and peripherals/memory means 301, 302, 303, ..., N.
  • spread register file system 206 can be shared between two or more processing units (e.g., CPU, microprocessor, and the like) and between other internal/external (on-chip/off-chip) peripherals or devices.
  • Fig. 4B is another pipeline representation 401 of operating with local register file system 106 (Fig. 2B) and with spread register file system 206 (Fig. 2B), according to another embodiment of the present invention.
  • two execution units 130' and 130" (Fig. 2B) are provided: execution unit 130" for processing data outputted from LRF system 106 (Fig. 2B), and execution unit 130' for processing data outputted from SRF system 206.
  • execution unit 130" processes data outputted from LRF system 106 during stage T3 and, optionally, also during stage T4. Then, the result of data processing is written back into LRF system 106 at stage T5.
  • the result of processing by means of execution unit 130' can be written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302 and/or 303 (Fig. 3A) or within local register file system 106. It should be noted that the writing back operation can take more than a single CPU clock cycle.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

A processing unit device comprising a local register file system having registers for storing mapped addresses, wherein each register is assigned with an address specified by an instruction operand, and a data output port from which the mapped address stored within the registers is outputted. Included is a spread register file system comprising data units comprising memory cells that are assigned with memory data unit addresses. Each data unit is configured to receive and decode a mapped address, determine a corresponding memory data unit address, output data to be processed from memory cells that correspond to the memory data unit address, and store data within memory cells that correspond to the memory data unit address. The processing unit device includes an output port for outputting the data to be processed from the memory cells, and an execution unit for processing data outputted from the spread register file system.

Description

IMPROVED PROCESSING UNIT IMPLEMENTING BOTH A LOCAL REGISTER FILE SYSTEM AND SPREAD REGISTER FILE SYSTEM, AND A METHOD THEREOF
Field of the Invention
The present invention relates to processing units. More particularly, the present invention relates to providing an improved processing unit (e.g., CPU (Central Processing Unit), microprocessor and the like), and a method thereof, that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access to memory means that are coupled to said improved processing unit.
Definitions, Acronyms and Abbreviations
Throughout this specification, the following definitions are employed:
Fetching: means retrieving an instruction from the program memory, wherein the instruction is represented by a number or by a sequence of numbers.
Instruction Register: a register that stores a current instruction to be executed. The instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as ALU (Arithmetic Logic Unit).
Opcode: an opcode (operation code) is the portion of a machine-language instruction that specifies an operation to be performed (e.g., addition, subtraction, and the like).
Operand: an instruction operand is data/value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
Register File: a storage unit located within the processing unit, such as the CPU. Generally, the register file is a combination of registers and combinatorial logic.
Background of the Invention
The past decade is characterized by dramatic developments in the field of computers. For executing and processing most recently developed computer applications, fast and powerful computer processing units are required. In general, according to the prior art, a conventional central processing unit (CPU) operates by four steps: a) fetching; b) decoding (that involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing. The first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). Instruction location in the program memory is determined by a program counter, which keeps track of the CPU processing in the current program. After the instruction is fetched from the memory, the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly. Often, the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned back to said CPU. The instruction that the CPU fetches from the memory is used to determine what the CPU has to do, thus the CPU cannot proceed processing until the instruction is fetched from the memory. After that, at the decoding step, the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU). The way in which the numerical instruction value is interpreted is defined by the CPU instruction set architecture (ISA). Often, a group of numbers in the instruction, called an opcode (operation code), indicates which operation has to be performed.
The remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation). Such operands may be given as a constant value (called an immediate value). Alternatively, operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
After the fetching and decoding steps, the executing step is performed. During this step, the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs. Generally, the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations. Finally, at the write back step, the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
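The four steps described above can be illustrated with a minimal sketch. It assumes a hypothetical toy machine: the tuple instruction format, the `run` function and the opcode names `ADD`, `SUB` and `JUMP` are illustrative assumptions and are not taken from any particular prior-art CPU.

```python
# Toy model of the four steps: fetch, decode, execute, write back.
# The tuple instruction format and the opcode names are assumptions
# made for this sketch only.

def run(program, registers, max_steps=100):
    pc = 0  # program counter: index of the next instruction to fetch
    for _ in range(max_steps):
        if pc >= len(program):
            break
        instruction = program[pc]        # a) fetch
        pc += 1                          # advance by one instruction word
        opcode, *operands = instruction  # b) decode into opcode and operands
        if opcode == "JUMP":
            pc = operands[0]             # JUMP overrides the incremented value
            continue
        dest, src1, src2 = operands
        a, b = registers[src1], registers[src2]  # read the register file
        if opcode == "ADD":              # c) execute in the ALU
            result = a + b
        elif opcode == "SUB":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result         # d) write back
    return registers


program = [
    ("ADD", 2, 0, 1),   # r2 = r0 + r1
    ("JUMP", 3),        # skip the next instruction
    ("SUB", 2, 0, 1),   # never executed
    ("SUB", 3, 2, 0),   # r3 = r2 - r0
]
final_registers = run(program, [5, 7, 0, 0])
```

In this sketch the program counter is incremented right after the fetch, and a JUMP simply replaces that incremented value, which mirrors the behavior described for the program counter.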
For achieving good performance, the above four CPU steps have to be performed relatively fast. However, when working with non-local memory means, such as cache, on-board memory (e.g., DRAM (Dynamic Random Access Memory)), secondary memory and the like, the access time is greatly increased leading to significant delays and to a waste of the valuable CPU processing resources. In turn, it greatly decreases CPU performance and consumes most of the CPU processing time.
According to the prior art, a conventional processing unit (e.g., CPU) uses a limited set of registers (e.g., 32 or 64 registers) in its register file. The register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like. However, in most cases, this register file is a portion of the CPU, and it is located in the physical proximity to the ALU (Arithmetic Logic Unit) of said CPU. Such a register file can be named a "local register file system". One of the reasons for having a limited local register file system is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via "source 1" input of the local register file system) storing the first value to be processed by the ALU, another register (accessed via "source 2" input of the local register file system) storing the second value to be processed by said ALU, and the last register (destination register, accessed via "destination" input of the local register file system) storing the result value of the ALU processing (e.g., the sum of the values stored within said sources 1 and 2). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low. For example, a CPU that has 64 registers in its register file requires a 6-bit pointer per register (2^6=64), and such a register file is considered to be relatively large, according to the prior art. In addition, even by using a CPU that is capable of receiving instructions as the Very Large Instruction Word (VLIW), the CPU register file relatively rarely reaches a capacity of 256 registers.
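The relationship between register count and pointer width can be checked with a short calculation. The sketch below is illustrative only: the helper names `pointer_bits` and `operand_bits` and the 32-bit word size are assumptions (the 32-bit width is borrowed from the instruction register example given later in this description).

```python
WORD_BITS = 32  # assumed program word size for this illustration


def pointer_bits(num_registers):
    """Number of bits needed to select one of num_registers registers."""
    return max(1, (num_registers - 1).bit_length())


def operand_bits(num_registers, num_pointers=3):
    """Bits consumed by the source 1, source 2 and destination pointers."""
    return num_pointers * pointer_bits(num_registers)


# Bits left for the opcode, immediate value, etc. in a 32-bit word.
remaining = {n: WORD_BITS - operand_bits(n) for n in (32, 64, 256)}
```

With three pointers per program word, 64 registers consume 18 of the assumed 32 bits and 256 registers consume 24, which shows why the register count is constrained by the word size.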
Another reason for having a limited number of registers in the CPU register file is due to hardware limitations related to fast memory access and to capability of using a relatively large number of ports. Usually, a conventional ALU requires providing at least two read ports and one write port.
A conventional system that implements a CPU also usually contains a memory controller (that comprises a MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, DMA engine, communication controllers (e.g., a USB controller), and the like. The memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
It should be noted that according to the prior art, the CPU local register file system is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file system (e.g., cache, secondary memory, etc.). Thus, in order that the CPU will be able to perform data manipulation on any of its memory mapped registers, the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped registers (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file system. After the data is loaded, the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file system, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral. In addition, usually a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers. Therefore, the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing. In some more complex VLIW CPUs, a multi-LOAD request can be generated in a single CPU clock cycle and then, in an additional clock cycle, one or more (destination) registers within the local register file system can be updated with new data values (the results of ALU processing).
Generally, according to the prior art, for processing (manipulating) data by means of the CPU, at least three commands have to be generated: LOAD (or multi-LOAD) command for loading data from an external register/memory means (device register, cache memory, etc. that is located externally to CPU (off-CPU-chip)) into the local register file system; ALU data processing command for executing various operations (e.g., addition, subtraction); STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means - for this, even if working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
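The prior-art command sequence described above can be sketched as follows. This is a simplified model under stated assumptions: the dictionary used as memory, the register names and the function `prior_art_add` are hypothetical, and each command is simply recorded in a trace rather than timed.

```python
def prior_art_add(memory, src1_addr, src2_addr, dest_addr):
    """Add two memory mapped values the conventional way.

    Returns the list of commands the CPU had to issue. The local
    register names r1..r3 are illustrative only.
    """
    local_regs = {}
    trace = []

    local_regs["r1"] = memory[src1_addr]      # LOAD first operand
    trace.append("LOAD")
    local_regs["r2"] = memory[src2_addr]      # LOAD second operand
    trace.append("LOAD")

    local_regs["r3"] = local_regs["r1"] + local_regs["r2"]  # ALU command
    trace.append("ADD")

    memory[dest_addr] = local_regs["r3"]      # STORE result to non-local memory
    trace.append("STORE")
    return trace


memory = {0x100: 4, 0x104: 6, 0x108: 0}
commands = prior_art_add(memory, 0x100, 0x104, 0x108)
```

With separate LOAD commands the sequence is four commands long; a multi-LOAD, where available, would merge the first two, giving the minimum of three commands mentioned above.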
When, for example, the data needs to be moved from one CPU memory mapped register (or from another external memory means, such as cache, secondary memory, etc.) to another CPU memory mapped register/memory means without performing ALU data processing, then the DMA (Direct Memory Access) engines, which are CPU peripherals, can be used for conducting such data movements, thereby reading the data from said one CPU memory mapped register/memory means and writing the data to said another off-CPU-chip register/memory means. By using the DMA engines, the CPU is not required to generate LOAD and STORE commands. In addition, DMA operations can be conducted in parallel with CPU operations. However, for using the DMA engines, dedicated hardware is required and the DMA engines need to be configured and enabled by the CPU; further, it is applicable only when no data processing (or substantially negligible data processing) is required.
Fig. 1A is a schematic block-diagram 100 of a conventional processing unit 100, according to the prior art. First, an instruction (that comprises an opcode and one or more operands) is fetched from the program memory (e.g., RAM 140) and is transferred via a data bus (one or more lines) 107 to an instruction register (IR) 105 that stores the current instruction to be decoded and executed.
For example, it is supposed that instruction register 105 is 32 bits long [0...31] bits, wherein the first six bits [0...5] of the instruction provided within said instruction register 105 are opcode (that defines the operation to be performed, e.g., addition, subtraction, etc.); bits [11...15] are an address of the destination register within a register file 106 (the address of a register in which the result of ALU processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated (processed); and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as the "immediate" value (some constant value) or auto-increment, etc.
Each of the above addresses is five bits long, thus referring to one of 32 (2^5=32) registers of register file 106. The addresses of the above Sources 1 and 2 are inputted into decoder(s) 120' of register file 106, and as a result, the data of corresponding registers (to which said addresses are related) of said register file 106 is outputted over data bus (one or more lines) 141. The next step is based on the specific instruction to be processed, and can be, for example: a) reading additional data from on-CPU-chip (inside the CPU chip) memory/peripherals or off-CPU-chip (outside the CPU chip) memory/peripherals (by establishing a LOAD command); b) storing data within said memory/peripherals (by establishing a STORE command); and/or c) activating execution unit 130 (e.g., ALU) for performing a mathematical operation, such as addition, subtraction, multiplication, division: in this case, the operands for the execution unit processing are determined by means of control unit 115. Once the processing is completed, the result is written back into the destination register within said register file 106 (the destination register address is defined by bits [11...15] of the executed instruction). Further, the result can be written back into the CPU memory means/peripherals 160 by means of Input/Output Control Unit 150 over bus 108 (by accomplishing a STORE command). Then, the cycle is started over with the next instruction to be further fetched, decoded and executed. Since a program counter 110 holds an address of the current instruction to be executed (and points to a corresponding RAM 140 memory address by means of address bus 119), the CPU always "knows" where within said RAM 140 the next instruction can be found. Each time the instruction is completed, program counter 110 is incremented by at least one memory address location: also, for example, when the instruction is a conventional JUMP or BRANCH command, the program counter is changed accordingly.
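The example instruction layout described above (opcode in bits [0...5], destination in bits [11...15], Source 1 in bits [16...20], Source 2 in bits [21...25]) can be decoded with simple shift-and-mask operations. The sketch below assumes that bit 0 is the least significant bit, which the description does not state; the helper names `field`, `encode` and `decode` are hypothetical.

```python
def field(word, low, high):
    """Return bits low..high (inclusive) of word, with bit 0 as the LSB."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode(word):
    """Split a 32-bit instruction word into the fields of the example."""
    return {
        "opcode": field(word, 0, 5),    # 6 bits: operation to perform
        "dest": field(word, 11, 15),    # 5 bits: destination register
        "src1": field(word, 16, 20),    # 5 bits: Source 1 register
        "src2": field(word, 21, 25),    # 5 bits: Source 2 register
    }


def encode(opcode, dest, src1, src2):
    """Inverse of decode, used here only to build a test word."""
    return opcode | (dest << 11) | (src1 << 16) | (src2 << 21)


word = encode(opcode=1, dest=3, src1=1, src2=2)
fields = decode(word)
```

Bits [6...10] and [26...31] are left untouched here; per the description they may carry other data such as an immediate value.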
As seen from Fig. 1A, CPU register file 106 is local, and it is a portion of CPU chip (core). Input/Output (I/O) control unit 150 (e.g., comprising memory controller or memory management unit (MMU)) is used for loading and writing back data from or to CPU mapped peripherals/memory means to be further processed by ALU 130. Thus, according to the prior art, ALU operations (e.g., addition) are not performed directly on the data stored within CPU mapped peripherals/memory means, and these peripherals/memory means are not accessed directly by means of said ALU 130: the data inputted into the ALU is incoming from local register file system 106, to which it is loaded from corresponding memory/peripherals by means of Input/Output (I/O) control unit 150, for example. Therefore, for performing manipulation on data stored outside local register file system 106, the data has first to be loaded into said local register file system 106 by means of I/O control unit 150, thereby executing a LOAD command, and loading the data into the CPU local register file 106 over load/store bus 108. After the LOAD command is accomplished, then the CPU can execute (perform) the corresponding operation (processing) on the loaded data by means of its ALU 130. Then, the result of the ALU processing is written back to the destination register of local register file system 106. Further, for updating corresponding register/memory means outside said local register file system 106, a STORE command is generated for storing said result, over load/store bus 108, from said destination register into the outside register/memory means, such as cache. It should be noted that providing data to be processed into said ALU 130 and providing the processed data from said ALU, is controlled by means of control unit 115 (over control bus 121), which can comprise a controller 126, multiplexers 125, decoders 120" and the like.
Control unit 115 receives data to be processed from local register file 106 over data bus 141, and it controls execution unit 130 processing by sending to said execution unit 130 a control signal over bus 121 in accordance with the instruction opcode. In turn, execution unit 130 receives the corresponding instruction operands to be processed from said control unit 115, and outputs a result of said processing over bus 108.
Fig. 1B is a schematic illustration of a conventional (local) register file 106, according to the prior art. For example, it is supposed that instruction register (IR) 105 (Fig. 1A) is 32 bits long [0...31] bits, wherein bits [11...15] are an address of the destination register within register file 106 (the address of a register in which the result of ALU 130 (Fig. 1A) processing will be stored); bits [16...20] are an address of the "first" (Source 1) register (within register file 106), the value of which has to be manipulated; and bits [21...25] are an address of the "second" (Source 2) register (within register file 106), the value of which has also to be manipulated (for example, has to be added to the value of said "first" register). It should be noted that the rest of the instruction bits (within 32 bits of said instruction) can be related to various data, such as an immediate value, auto-increment, etc.
Each above address is five bits long, thus referring to one of 32 (2^5=32) registers of register file 106. The addresses of the above Sources 1 and 2 are inputted into register file 106 (IR16-20 and IR21-25, respectively) and conveyed to decoders 120 (Fig. 1A). Decoders 120 decode the addresses and enable outputting data of corresponding registers of register file 106 towards ALU for further processing said data (e.g., addition, subtraction of the data and the like). Thus, the data is outputted through "Source 1 Data" and "Source 2 Data" outputs, having a length of 32 bits. After ALU 130 processes said data, it stores the result (32 bits long) in a destination register within register file 106 (the destination register is defined by the IR11-15 address).
According to the prior art, for loading the data from registers, processing the loaded data and storing the result within the memory (i.e., performing LOAD, "ALU processing" and STORE commands) with minimal CPU stalls (delays), conventional processing devices use cache memories or Tightly Coupled Memories (TCM), such as SRAM, etc. These memories are located in the physical proximity to the CPU core, and therefore accessing such memories (e.g., performing LOAD and STORE commands) is done with a relatively low latency compared to accessing other memory means, such as a hard disk, for example. Usually, when the CPU needs to operate on a large chunk of data, which in turn involves performing relatively long loops of ALU commands, the data is first copied by the MMU to the cache memory or by the DMA engine to the tightly coupled memory. Only then, the CPU executes ALU commands within said loops of commands. Thus, this leads to a significant latency between the time when the data is first conveyed to the memory mapped register/memory means and the time when the CPU can process it. Especially, this leads to a significant latency until the processed data is written back into the memory mapped register/memory means.
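The behavior of the conventional register file described above (two read outputs, one write input, 32 registers of 32 bits each) can be modeled in a few lines. The class name `RegisterFile` and its method names are assumptions for illustration; the sketch models only the data path, not the decoders or timing.

```python
class RegisterFile:
    """Minimal model of a local register file: 2 read ports, 1 write port."""

    def __init__(self, num_registers=32, width=32):
        self.mask = (1 << width) - 1      # keeps stored values to `width` bits
        self.regs = [0] * num_registers

    def read(self, src1, src2):
        """Return the 'Source 1 Data' and 'Source 2 Data' outputs."""
        return self.regs[src1], self.regs[src2]

    def write(self, dest, value):
        """Store a result in the destination register, truncated to width."""
        self.regs[dest] = value & self.mask


rf = RegisterFile()
rf.write(1, 40)
rf.write(2, 2)
a, b = rf.read(1, 2)   # addresses as taken from IR16-20 and IR21-25
rf.write(3, a + b)     # ALU result stored at the address from IR11-15
```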
The above problems related to achieving fast data access and performing fast data processing have been recognized in the prior art, and several solutions have been proposed. For example, US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and in turn, accepting destination operands for instruction execution. The different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures. The cache memories may be directly accessed by using cache addresses.
Further, US 6,178,482 presents a data processing apparatus which uses a register file to provide a faster alternative to indirect memory addressing. A functional unit is connected to a data register file which comprises a plurality of registers, each of which is accessed by a corresponding register number. The functional unit of US 6,178,482 can execute at least one indirect register access instruction that comprises an operand register number field. Instruction decode circuitry, connected to the register file and the functional unit, is responsive to the indirect register access instruction to recall data stored in an operand register specified by the operand register number in the instruction, identify the recalled data as a register access number, and recall operand data from a data register corresponding to the register access number for use as an operand by the functional unit.
The present invention has many advantages over the prior art. For example, one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data by providing a substantially direct memory means access for one or more CPU execution units (for processing the data). Thus, the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle.
Another advantage of the present invention is that it provides an improved processing unit (e.g., CPU (Central Processing Unit), microprocessor and the like) that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access. The spread register file system can be further shared with other CPUs, or with other internal/external (on-chip/off-chip) peripherals or devices.
Still another advantage of the present invention is that it provides a method and system in which the number of instructions and CPU clock cycles required for manipulating/processing memory mapped register data is reduced with substantially no need to change the structure of the conventional CPU program word.
Still another advantage of the present invention is that it eliminates the need to use conventional DMA engines.
A further advantage of the present invention is that it provides a method and system, in which the size of external memory means of conventional processing devices (such as conventional cache or tightly coupled memories, as used in the prior art architectures) can be significantly reduced and/or the need for using the external memory means can be eliminated.
Still a further advantage of the present invention is that it provides a method and system, in which CPU stalls (delays) are substantially prevented.
Other advantages of the present invention will become apparent as the description proceeds.
Summary of the Invention
The present invention relates to providing an improved processing unit (e.g., CPU, microprocessor and the like), and a method thereof, that implements both a conventional local register file system and spread register file system, which enables substantially direct memory access to memory means that are coupled to said improved processing unit.
According to an embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing at least one mapped address, each said register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses from said local register file system and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and b.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing the data outputted from said spread register file system.
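The arrangement of this embodiment can be sketched in software as a rough behavioral model. Everything named below is an assumption made for illustration: the classes `DataUnit` and `AddressConverter`, the function `execute_add`, the base-address layout and the choice of addition as the operation. The description itself states only that the address converters may use various converting functions.

```python
class DataUnit:
    """A peripheral/memory means with its own cells, addressed from 0."""

    def __init__(self, size):
        self.cells = [0] * size


class AddressConverter:
    """Converts a CPU mapped address into (data unit, cell address).

    The base-plus-offset mapping is an assumed conversion function.
    """

    def __init__(self, units_with_bases):
        self.units_with_bases = units_with_bases  # list of (base, DataUnit)

    def convert(self, mapped_address):
        for base, unit in self.units_with_bases:
            if base <= mapped_address < base + len(unit.cells):
                return unit, mapped_address - base
        raise ValueError(f"unmapped address: {mapped_address:#x}")


def execute_add(lrf, converter, dest_reg, src1_reg, src2_reg):
    """One instruction: operands name LRF registers holding mapped addresses."""
    unit1, cell1 = converter.convert(lrf[src1_reg])   # first data unit address
    unit2, cell2 = converter.convert(lrf[src2_reg])
    unit_d, cell_d = converter.convert(lrf[dest_reg])  # second (result) address
    result = unit1.cells[cell1] + unit2.cells[cell2]   # execution unit
    unit_d.cells[cell_d] = result                      # result stored in the SRF
    return result


unit_a, unit_b = DataUnit(16), DataUnit(16)
converter = AddressConverter([(0x1000, unit_a), (0x2000, unit_b)])
unit_a.cells[4] = 30
unit_b.cells[0] = 12
lrf = [0x1004, 0x2000, 0x2001]  # local registers store mapped addresses
total = execute_add(lrf, converter, dest_reg=2, src1_reg=0, src2_reg=1)
```

Note that no LOAD or STORE step appears in `execute_add`: the operands are read from, and the result is written to, the data units directly, which is the point of the spread register file system.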
According to another embodiment of the present invention, the processing unit device further comprises an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
According to still another embodiment of the present invention, the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
According to still another embodiment of the present invention, the processing unit device further comprises a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
According to still another embodiment of the present invention, each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
According to still another embodiment of the present invention, each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
According to a further embodiment of the present invention, the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
According to still a further embodiment of the present invention, a result of the data processing is stored within the one or more memory cells of the corresponding data unit of the spread register file system or within the local register file system.
According to still a further embodiment of the present invention, the local register file system further comprises at least one write address input port, into which is inputted an address of one or more memory cells of said local register file system to be auto-incremented.
According to still a further embodiment of the present invention, the local register file system further comprises at least one write data input port, into which is provided a command to auto-increment an address of one or more memory cells of said local register file system.
According to still a further embodiment of the present invention, at least one data unit of the spread register file system is shared between two or more processing units.
According to another embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing at least one mapped address, each said register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; b.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and b.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and b.2. at least one output port for outputting said data to be processed from said one or more memory cells; and c) at least one execution unit for processing the data outputted from said spread register file system.
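In this embodiment each data unit decodes the mapped address itself rather than relying on a separate address converter. A minimal sketch of that idea follows; the class `DecodingDataUnit`, the functions `bus_read` and `bus_write`, and the base-address decoding rule are assumptions made for illustration only.

```python
class DecodingDataUnit:
    """Data unit that decodes mapped addresses on its own."""

    def __init__(self, base, size):
        self.base = base            # assumed: each unit owns one address range
        self.cells = [0] * size

    def decode(self, mapped_address):
        """Return the local cell address, or None if the address is not ours."""
        offset = mapped_address - self.base
        return offset if 0 <= offset < len(self.cells) else None

    def read(self, mapped_address):
        local = self.decode(mapped_address)
        return None if local is None else self.cells[local]

    def write(self, mapped_address, value):
        local = self.decode(mapped_address)
        if local is None:
            return False            # address belongs to another unit
        self.cells[local] = value
        return True


def bus_read(units, mapped_address):
    """Present the address to every unit; exactly one should respond."""
    responses = [v for v in (u.read(mapped_address) for u in units) if v is not None]
    if len(responses) != 1:
        raise ValueError(f"expected one responder for {mapped_address:#x}")
    return responses[0]


def bus_write(units, mapped_address, value):
    """Present the address and data to every unit; exactly one should accept."""
    accepted = sum(u.write(mapped_address, value) for u in units)
    if accepted != 1:
        raise ValueError(f"expected one responder for {mapped_address:#x}")


units = [DecodingDataUnit(0x1000, 8), DecodingDataUnit(0x2000, 8)]
bus_write(units, 0x2003, 99)
value = bus_read(units, 0x2003)
```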
According to still another embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.l. one or more registers for storing data based on which at least one mapped address to be generated, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; and c) a spread register file system, comprising: c.l. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; c.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses from said at least one address generator and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and d) at least one execution unit for processing the data outputted from said spread register file system.
According to an embodiment of the present invention, the local register file comprises one or more registers for storing one or more mapped addresses.
According to a further embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing data based on which at least one mapped address is to be generated, each said register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; c) a spread register file system, comprising: c.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: c.1.1. receive at least one mapped address; c.1.2. decode the received at least one generated mapped address and determine corresponding at least one memory data unit address; c.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and c.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and c.2. at least one output port for outputting said data to be processed from said one or more memory cells; and d) at least one execution unit for processing the data outputted from said spread register file system.
According to still a further embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing first data to be processed, each said register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the second data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the second data processing; and b.3. at least one output port for outputting said second data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing said first data outputted from said local register file system and for processing said second data outputted from said spread register file system.
According to an embodiment of the present invention, the local register file further comprises one or more registers for storing third data based on which the at least one mapped address is to be generated.
According to another embodiment of the present invention, the processing unit device further comprises at least one address generator for generating the at least one mapped address based on the third data.
According to still another embodiment of the present invention, the local register file system further comprises one or more write address input ports, into which are inputted the addresses of memory cells of said local register file system to be auto-incremented.
According to still another embodiment of the present invention, the local register file system further comprises one or more write data input ports, into which are provided commands to auto-increment the addresses of memory cells of said local register file system.
According to still a further embodiment of the present invention, a processing unit device comprises: a) a local register file system having: a.1. one or more registers for storing first data to be processed, each said register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; b.1.3. output second data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and b.1.4. store third data within one or more memory cells that correspond to said at least one memory data unit address; and b.2. at least one output port for outputting said second data to be processed from said one or more memory cells; and c) at least one execution unit for processing said first data outputted from said local register file system and for processing said second data outputted from said spread register file system.
According to an embodiment of the present invention, the local register file further comprises one or more registers for storing fourth data based on which the at least one mapped address is to be generated.
According to another embodiment of the present invention, the processing unit device further comprises at least one address generator for generating the at least one mapped address based on the fourth data.
According to an embodiment of the present invention, a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store at least one PU mapped address; d) reading said at least one PU mapped address from said local register file system; e) performing second decoding or converting said at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the at least one data unit address; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
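By way of non-limiting illustration only, steps a) to h) of the above method can be sketched in Python as follows. The `Instruction` record, the `convert` function, the fixed data unit size of 0x100 cells and the two supported opcodes are hypothetical assumptions introduced for this sketch; they are not taken from the specification, which leaves the decoding/conversion function open.

```python
from dataclasses import dataclass

# Hypothetical model of steps a) to h); all names and sizes are assumptions.


@dataclass
class Instruction:
    opcode: str  # e.g. "add" or "sub"
    src1: int    # address of an LRF register holding a PU mapped address
    src2: int
    dst: int


def convert(mapped_address, unit_size=0x100):
    """Step e): split a PU mapped address into (data unit index, cell address)."""
    return divmod(mapped_address, unit_size)


def run(program, pc, lrf, data_units):
    instr = program[pc]                                    # steps a), b): fetch
    ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}
    operation = ops[instr.opcode]                          # step c): first decoding
    mapped = [lrf[r] for r in (instr.src1, instr.src2, instr.dst)]  # step d)
    (u1, c1), (u2, c2), (ud, cd) = (convert(m) for m in mapped)     # step e)
    a, b = data_units[u1][c1], data_units[u2][c2]          # step f): read data
    result = operation(a, b)                               # step g): process
    data_units[ud][cd] = result                            # step h): write back
    return result
```

In this sketch the result is always written back into a data unit; writing back into a register of the local register file system, which the method also permits, is omitted for brevity.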
According to another embodiment of the present invention, a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store first data to be used for generation of at least one PU mapped address; d) reading said first data stored within said one or more registers of said local register file system; e) generating said at least one PU mapped address based on said first data; f) performing second decoding or converting the generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one data unit address; h) processing the read second data; and i) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
According to still another embodiment of the present invention, a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system and at least one of said operands is a PU mapped address, and wherein at least one of said registers stores first data to be processed; d) reading said first data to be processed from said local register file system; e) processing the read first data and writing back a result of said processing into the one or more corresponding registers of said local register file system; f) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one data unit address; h) processing said second data stored in said at least one data unit address; and i) writing back the result of said processing of said second data into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system. 
According to a further embodiment of the present invention, a method of processing a processing unit (PU) instruction comprises: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and at least one of said operands is first data, based on which at least one PU mapped address is to be generated, and wherein at least one of said registers stores second data to be processed; d) reading said first data to be used for generation of said at least one PU mapped address and reading said second data to be processed; e) generating said at least one PU mapped address based on said first data; f) processing the second data and writing back a result of said processing into one or more corresponding registers of said local register file system; g) performing second decoding or converting the generated at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; h) enabling reading the data stored in the at least one data unit address; i) processing said data stored in said at least one data unit address; and j) writing back the result of said processing of said data stored in said at least one data unit address into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
Brief Description of the Drawings
In order to understand the invention and to see how it may be carried out in practice, a preferred embodiment will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
- Fig. 1A is a schematic block-diagram of a conventional processing unit, according to the prior art;
- Fig. 1B is a schematic illustration of a conventional (local) register file, according to the prior art;
- Fig. 2A is a schematic illustration of connecting a spread register file system to a local register file system, to an instruction register (storing a program word - a current instruction to be executed) and to an execution unit (such as ALU), according to an embodiment of the present invention;
- Fig. 2B is another schematic illustration of connecting a spread register file system to a local register file system, to an instruction register and to execution units, according to another embodiment of the present invention;
- Fig. 3A is a schematic illustration of a spread register file system, according to an embodiment of the present invention;
- Fig. 3B is another schematic illustration of a spread register file system, according to another embodiment of the present invention;
- Fig. 4A is a pipeline representation of operating with a local register file system and with a spread register file system, according to an embodiment of the present invention; and
- Fig. 4B is another pipeline representation of operating with a local register file system and with a spread register file system, according to another embodiment of the present invention.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Detailed Description of the Preferred Embodiments
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, systems, procedures, components, circuits and the like have not been described in detail so as not to obscure the present invention.
Hereinafter, wherever the term "spread register file" system or "SRF" system is mentioned, it should be noted that it refers to the expanded (spread) register file according to the present invention, which can be related to the entire CPU memory map, thereby enabling substantially direct memory/peripheral access for one or more CPU execution units (for processing the data). Further, wherever the term "local register file system" or "LRF system" is mentioned, it refers to the conventional CPU local register file system 106 (Fig. 1A). It should be also noted that according to an embodiment of the present invention, the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means. Also, wherever the term CPU is mentioned, it refers to any processing unit (PU), such as a microprocessor and the like. In addition, wherever the term "processing" (or a similar term) is mentioned, it should be noted that it refers to any data operation, such as data manipulation, data transfer, addition or subtraction of data and the like.
It should be noted that co-pending US provisional patent application (Attorney Docket No. 1819762), titled "Register File System and Method Thereof for Enabling a Substantially Direct Memory Access", discloses a processing unit (e.g., CPU, microprocessor and the like) that implements a SRF system and enables substantially direct memory access for one or more (CPU) execution units (for processing the data). The present invention teaches implementing both a conventional local register file system (LRF system) and spread register file system (SRF system) to be used by one or more processing units.
Fig. 2A is a schematic illustration of connecting a spread register file (SRF) system 206 to local register file system (LRF system) 106, to instruction register 105 (storing a program word — a current instruction to be executed) and to execution unit 130 (such as ALU), according to an embodiment of the present invention. According to an embodiment of the present invention, SRF system 206 relates to the entire (complete) CPU mapped memories: cache memories, on-chip peripherals (e.g., RAM), on-board memories (e.g., DRAM, SRAM), secondary memories (e.g., off-chip peripherals, hard disks, etc.), and any other peripherals/memory means (e.g., CDs (Compact Discs), DVDs (Digital Versatile Discs), etc.). LRF system 106 is a conventional local register file system that comprises a number of registers for storing data to be processed, or data being a result of processing.
According to an embodiment of the present invention, spread register file system 206 comprises conventional peripheral address converters, as presented in Fig. 3. Each peripheral address converter is used for converting a CPU memory mapped address to a corresponding peripheral (device) address (e.g., the peripheral device can be a USB device, cache memory, RAM, SRAM, tightly coupled memory, DRAM, hard disk, CD, DVD, etc.). For example, one peripheral address converter converts a CPU mapped address of "source 1" register/memory means to the corresponding address of said register/memory means within the corresponding peripheral that actually stores said data (to be processed by means of execution unit 130, such as ALU); another peripheral address converter converts a CPU mapped address of "source 2" register/memory means that stores additional data to be processed by means of said execution unit 130; and still another peripheral address converter converts a CPU mapped address of "destination" register/memory means, in which a result of the above execution unit 130 processing (e.g., addition, subtraction) will be stored. It should be noted that the peripheral address converter can be implemented in hardware and/or in software. According to an embodiment of the present invention, instruction register 105 contains a program word, which can be, for example, 32 or 64 bits long. For example, for the MIPS32 (Microprocessor without Interlocked Pipelined Stages) CPU technology, the program word is 32 bits long, wherein each of the following: "source 1" address 222', "source 2" address 223' and destination address 224' can be, for example, 5 bits long. Similarly, for the MIPS64 CPU technology, the program word is 64 bits long. Each of the above addresses relates to a specific address within the local register file system 106, said specific address represented, for example, by a 2^5 binary number.
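By way of non-limiting illustration only, a software implementation of a peripheral address converter could look like the following Python sketch. The memory map entries (base addresses, sizes and peripheral names) and the function name `convert_address` are hypothetical assumptions chosen for the example; the specification does not define a particular map.

```python
# Hypothetical CPU memory map: (base address, size in bytes, peripheral name).
MEMORY_MAP = [
    (0x0000_0000, 0x0001_0000, "cache"),
    (0x1000_0000, 0x0100_0000, "dram"),
    (0x2000_0000, 0x0000_1000, "usb"),
]


def convert_address(mapped_address):
    """Convert a CPU mapped address to (peripheral name, peripheral address)."""
    for base, size, name in MEMORY_MAP:
        if base <= mapped_address < base + size:
            # The peripheral address is the offset from the start of its range.
            return name, mapped_address - base
    raise ValueError(f"unmapped address: {mapped_address:#010x}")
```

For instance, under the assumed map, the CPU mapped address 0x20000010 would be converted to address 0x10 within the "usb" peripheral.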
After "source 1" and "source 2" addresses are inputted into local register file system 106, the data stored within corresponding registers within said local register file system 106, that are assigned with said "source 1" and "source 2" addresses, is outputted from said local register file system 106.
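By way of non-limiting illustration only, the extraction of the operand addresses from a 32-bit program word and the subsequent reading of the local register file system can be sketched in Python as follows. The bit positions of the fields are an assumption made for this sketch (the specification states only that each operand address can be 5 bits long), and the names `decode` and `read_sources` are hypothetical.

```python
def decode(word):
    """Split a 32-bit program word into an opcode and three 5-bit operands.

    Assumed layout: opcode in bits 31..26, "source 1" in bits 25..21,
    "source 2" in bits 20..16, "destination" in bits 15..11.
    """
    return {
        "opcode": (word >> 26) & 0x3F,
        "src1": (word >> 21) & 0x1F,
        "src2": (word >> 16) & 0x1F,
        "dst": (word >> 11) & 0x1F,
    }


def read_sources(lrf, word):
    """Output the contents of the registers addressed by "source 1"/"source 2"."""
    fields = decode(word)
    return lrf[fields["src1"]], lrf[fields["src2"]]
```

Here `lrf` is modelled as a list of 32 register values, matching the 2^5 addresses that a 5-bit operand can select.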
According to an embodiment of the present invention, the data outputted from LRF system 106 are CPU mapped addresses related to corresponding memory cells of peripherals/memory means (e.g., USB device) provided within spread register file system 206. Generally, it should be noted that CPU mapped addresses are related to memory cells within the entire CPU memory map. For example, the entire CPU memory map can be defined by 2^32 addresses (if implemented for MIPS32 CPU), or by 2^64 addresses (if implemented for MIPS64 CPU).
According to another embodiment of the present invention, the data outputted from LRF system 106 is inputted into address generator 250 that further generates said CPU mapped addresses. Thus, addresses 222', 223' and 224' inputted into LRF system 106 can be, for example, each 5 bits long, and CPU mapped "source 1" address, CPU mapped "source 2" address, CPU mapped "destination" address outputted from address generator 250 over lines 212, 213 and 214, respectively, can be each 32 bits long (if implemented for MIPS32 CPU). Similarly, for the MIPS64 CPU implementation, each of these CPU mapped addresses is 64 bits long. It should be noted that address generator 250 can generate addresses in various ways based on different address generating functions. For example, address generator 250 can receive in its input an 8-bit address (represented by a 2^8 binary number) from LRF system 106, and then it can add this address to another 2^32 number, thereby generating a new CPU mapped address that is 32 bits long. The above 2^32 number can be a predefined number, random number or a number that is calculated by means of address generator 250 according to some predefined function(s)/expression(s). In addition, said new CPU mapped address can be generated according to opcode 221' that can be inputted from instruction register 105 into said address generator 250. Thus, for example, if opcode 221' relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only "source 1" CPU mapped address and CPU mapped destination addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated).
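By way of non-limiting illustration only, one possible address generating function of address generator 250 can be sketched in Python as follows: an 8-bit value received from LRF system 106 is added to a 32-bit number to form a 32-bit CPU mapped address, and no "source 2" address is generated for a move opcode. The function names, the opcode string "move" and the wrap-around at 32 bits are assumptions made for this sketch.

```python
MASK32 = 0xFFFF_FFFF


def generate(value8, base32):
    """Add an 8-bit value to a 32-bit number; wrapping to 32 bits is assumed."""
    return (base32 + (value8 & 0xFF)) & MASK32


def generate_for_opcode(opcode, values, base32):
    """Generate CPU mapped addresses for the (src1, src2, dst) values.

    For a "move" opcode (hypothetical name) only the "source 1" and
    "destination" addresses are generated; "source 2" is left as None.
    """
    src1, src2, dst = values
    return {
        "src1": generate(src1, base32),
        "src2": None if opcode == "move" else generate(src2, base32),
        "dst": generate(dst, base32),
    }
```

A single shared 32-bit number is used here for simplicity; as noted above, that number may equally be predefined, random or calculated by the address generator.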
The CPU mapped "source 1" and "source 2" addresses generated by means of address generator 250 are inputted into SRF system 206 over lines 212 and 213, respectively, and then the data stored within corresponding registers/memory means of said SRF system 206 (said data being related to said CPU mapped "source 1" and "source 2" addresses) is outputted from said SRF system 206 over buses (lines) 231 and 232, respectively. Then, the outputted data is provided into execution unit 130 and processed in accordance with opcode 221' of the program word (instruction) provided within instruction register 105. For example, the data of "source 1" can be added to the data of "source 2", or the data of "source 1" can be subtracted from the data of "source 2", etc. The command for performing such an operation is provided into said execution unit 130 via control bus 234. Then, after accomplishing the operation, the corresponding result is written back over data bus 233 into the "destination" register/memory means (for example, located within the corresponding peripheral device, such as a USB device), whose address is defined by the CPU mapped destination address provided over line 214. Alternatively, said corresponding result can be written back into LRF system 106, according to a "destination" address provided in operand 224' of the program word. For this, said corresponding result is inputted into multiplexer 270 along with data provided over line 263" from LRF system 106, said data being a CPU mapped address, or related to said CPU mapped address (the data based on which said CPU mapped address is generated by means of address generator 250 and then auto-incremented). According to operand 224' (the "destination" address) provided within instruction register 105, multiplexer 270 outputs either data provided over line 233 (from execution unit 130') or data provided over line 263" (from LRF system 106). The data provided over line 263" is outputted from said multiplexer 270 if "destination" address 224' corresponds to a SRF system 206 address. On the other hand, the data provided over line 233 is outputted from said multiplexer 270 if "destination" address 224' does not correspond to said SRF system 206 address and corresponds to an address of a specific register (within LRF system 106) that stores data to be processed.
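By way of non-limiting illustration only, the selection performed by multiplexer 270 can be sketched in Python as follows. The function names, the boolean flag `dest_is_srf` and the modelling of SRF system 206 as a dictionary keyed by CPU mapped address are assumptions made for this sketch.

```python
def mux_270(dest_is_srf, result, address_data):
    """Select the value to be written into LRF system 106.

    If the "destination" address corresponds to an SRF address, the address
    data from line 263" is selected; otherwise the result from line 233.
    """
    return address_data if dest_is_srf else result


def write_back(lrf, srf, dest_reg, dest_is_srf, result, mapped_dst, address_data):
    """Write the result to the SRF or the LRF, depending on the destination."""
    if dest_is_srf:
        srf[mapped_dst] = result  # result goes to the SRF over data bus 233
    # The LRF register receives whichever input the multiplexer selects.
    lrf[dest_reg] = mux_270(dest_is_srf, result, address_data)
```

When the destination lies in the SRF, the addressed LRF register thus keeps holding address data (possibly auto-incremented) rather than the processing result.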
It should be noted that according to an embodiment of the present invention, conventional I/O control units (e.g., memory management units (MMUs), north and south bridges, etc.), which enable reading/writing back data from/to various means (e.g., cache memories, secondary memories, etc.), are incorporated within spread register file system 206 along with CPU mapped memory means, such as cache memories, off-CPU-chip memories and other memory means. Thus, execution unit 130 is enabled to operate substantially directly on each of said CPU mapped memory means provided within spread register file system 206 and local register file system 106, i.e., it is enabled to execute instructions without the need for generating and performing LOAD commands (loading data into said spread register file system 206 from external memory means) and corresponding additional STORE commands for storing the result of the execution unit 130 operation externally to said spread register file system 206.
In addition, it should be noted that according to another embodiment of the present invention, spread register file system 206 can operate with more than one execution unit 130. Further, instruction register 105 and/or execution unit 130 can be incorporated within said spread register file system 206.
According to another embodiment of the present invention, spread register file system 206 can be provided on-CPU-chip (incorporated within a CPU) or off-CPU-chip. Further, according to still another embodiment of the present invention, a portion of spread register file system 206 can be provided on-CPU-chip and another portion off-CPU-chip. According to still another embodiment of the present invention, the CPU mapped addresses (or data related to said CPU mapped addresses) outputted from LRF system 106 over lines 261', 262' and 263', can be auto-incremented for enabling accessing other registers/memory cells of peripherals (data units) of SRF system 206, which correspond to the auto-incremented CPU mapped addresses. For this, the auto-incremented "source 1", "source 2" and "destination" CPU mapped addresses are inputted into LRF system 106 over lines 261", 262" and 263". Then, such addresses are written into corresponding registers (of LRF system 106), which are assigned by addresses provided over lines 281, 282 and 283 from instruction register 105. It should be noted that auto-increment commands can be provided within operands 222', 223' and 224' of the program word (for example, setting a specific bit in each of said operands to "0" can indicate that the auto-increment of said corresponding operand is enabled, and setting said specific bit to "1" can indicate that the auto-increment command is disabled). It should be noted that when inputting (writing) the data into LRF system 106/SRF 206, corresponding write-enable commands are provided to said LRF system 106/SRF 206. Similarly, when outputting data from said LRF system 106/SRF 206, corresponding read-enable commands are provided.
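By way of non-limiting illustration only, the auto-increment of a CPU mapped address stored in LRF system 106 can be sketched in Python as follows. The position of the enable bit (`AUTO_INC_BIT`), the increment step of 4 and the choice to output the address before incrementing it are assumptions made for this sketch; only the convention that "0" enables and "1" disables the auto-increment follows the example given above.

```python
AUTO_INC_BIT = 0x20  # hypothetical flag bit just above a 5-bit register address
MASK32 = 0xFFFF_FFFF


def read_with_auto_increment(lrf, operand, step=4):
    """Output the mapped address held in the addressed register.

    If the flag bit of the operand is "0", the incremented address is
    written back into the same LRF register for the next access.
    """
    reg = operand & 0x1F
    address = lrf[reg]
    if (operand & AUTO_INC_BIT) == 0:  # "0" means auto-increment enabled
        lrf[reg] = (address + step) & MASK32
    return address
```

Repeated execution of the same instruction would then walk through consecutive memory cells of a data unit without any additional address arithmetic instructions.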
Fig. 2B is another schematic illustration of connecting spread register file system 206 to local register file system 106, to instruction register 105 and to execution units 130' and 130", according to another embodiment of the present invention. According to this embodiment, instruction register 105 contains a VLIW (Very Long Instruction Word) program word (instruction), that comprises: two opcodes 221' and 221"; two "source 1" addresses 222' and 222"; two "source 2" addresses 223' and 223" (wherein "source 1" and "source 2" addresses 222' and 223', respectively, can be related to CPU mapped addresses that can be provided in corresponding registers within LRF system 106 (or to be further generated by means of address generator 250), and wherein "source 1" and "source 2" addresses 222" and 223", respectively, can be related to specific registers within LRF system 106 that store data to be processed); and two "destination" addresses 224' and 224" (wherein "destination" address 224' relates to a corresponding peripheral/memory means (data unit) of SRF system 206, and "destination" address 224" relates to a specific register within LRF system 106 to store processed data). According to an embodiment of the present invention, all operands of the VLIW program word ("source 1", "source 2" and "destination" addresses) are inputted into LRF system 106 over lines 251'/251", 252'/252" and 253'/253", respectively. Then, the data that is stored in corresponding registers within said LRF system 106 (the registers that are assigned with the above addresses) is outputted from said LRF system 106. It should be noted that the data that corresponds to "source 1" address 222', "source 2" address 223' and "destination" address 224' can be CPU mapped addresses, each of which is related to a memory cell(s) or register(s) of a peripheral/memory means of SRF system 206.
Alternatively, such CPU mapped addresses can be generated by means of address generator 250, according to the data (related to said "source 1" address 222', "source 2" address 223' and "destination" address 224') outputted from LRF system 106 over lines 261', 262' and 263', respectively. Further, it should be noted that the data that corresponds to "source 1" address 222" and "source 2" address 223" is data to be processed by means of execution unit 130". Said data is outputted from LRF system 106 over lines 231" and 232", and then is provided into execution unit 130" (e.g., ALU). In turn, execution unit 130" processes the received data in accordance with opcode 221" of the VLIW program word provided within instruction register 105. For example, the data provided over line 231" can be added to the data provided over line 232", and the like. It should be noted that a command for performing such an operation (e.g., addition or subtraction) is provided into said execution unit 130" via control bus 234".
The result of processing is then provided over line 233" to LRF system 106 and stored within a specific register (of said LRF system 106), which corresponds to "destination" address 224" provided in the VLIW program word.
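By way of non-limiting illustration only, the execution of a two-slot VLIW program word according to Fig. 2B can be sketched in Python as follows. The tuple layout of the program word, the dictionary model of SRF system 206 (keyed by CPU mapped address) and the supported opcodes are assumptions made for this sketch; address generator 250 and the address converters are omitted, so the LRF registers of the first slot are assumed to hold CPU mapped addresses directly.

```python
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}


def execute_vliw(word, lrf, srf):
    """Execute one VLIW word made of an SRF slot and an LRF slot.

    word = ((op1, s1, s2, d1), (op2, t1, t2, d2)); every operand is the
    address of a register within the LRF.
    """
    (op1, s1, s2, d1), (op2, t1, t2, d2) = word
    # Read phase: both slots read their operands before anything is written.
    a, b = srf[lrf[s1]], srf[lrf[s2]]  # first slot: LRF holds mapped addresses
    c, d = lrf[t1], lrf[t2]            # second slot: LRF holds the data itself
    dst_address = lrf[d1]
    # Execute and write back: first slot to the SRF, second slot to the LRF.
    srf[dst_address] = OPS[op1](a, b)
    lrf[d2] = OPS[op2](c, d)
```

For example, with mapped addresses 0x100, 0x104 and 0x108 held in LRF registers 0 to 2, the first slot adds two SRF cells into a third one while the second slot subtracts two LRF registers in the same step.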
According to an embodiment of the present invention, address generator 250 generates addresses being related to the entire CPU mapped memory. Thus, addresses 221', 222' and 223' can be for example each 5 bits long, and CPU mapped "source 1" address provided over line (bus) 212, CPU mapped "source 2" address provided over line 213 and CPU mapped "destination" address provided over line 214, can be each 32 bits long (if implemented for MIPS32 CPU), since each of these mapped addresses relates to the entire CPU memory map that is represented by 232 addresses. Similarly, for the MIPS64 CPU implementation, each of these mapped addresses is 64 bits long. It should be noted that address generator 250 can generate addresses in various ways based on different address generating functions. For example, address generator 250 can receive in its input a 8 bits long address (represented by a 28 binary number) from LRF system 106, and then it can add this address to another 2 number, thereby generating a new CPU mapped address that is 32 bits long. The above 2 number can be a predefined number, random number or a number that is calculated by means of address generator 250 according to some predefined function(s)/expression(s). In addition, said new CPU mapped address can be generated according to opcode 221" that can be inputted from instruction register 105 into said address generator 250. Thus, for example, if opcode 221' relates to moving "source 1" data (to which "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only "source 1" CPU mapped address and CPU mapped destination addresses are generated by means of said address generator 250 ("source 2" CPU mapped address is not generated). 
Further, opcode 221' (or operands 222', 223' and 224' of the program word) can comprise a command (e.g., a predefined bit within said opcode 221' can be set to "0" or to "1", accordingly) for instructing address generator 250 to generate the corresponding CPU mapped addresses. For example, if said bit is "0", then address generator 250 passes the data (in this case, CPU mapped addresses) inputted over lines 261', 262' and 263' as it is, and if said bit is "1", then address generator 250 adds the data (in this case, non-CPU mapped addresses) inputted over lines 261', 262' and 263' to additional data (base addresses) stored within its corresponding registers, thereby generating corresponding CPU mapped addresses related to peripherals/memory means of SRF 206. According to another embodiment of the present invention, the data (addresses) outputted from LRF system 106 over lines 261', 262' and 263' can be further related to corresponding registers within address generator 250: each of said addresses outputted from LRF system 106 can be related to a different base address of 32 bits (for MIPS32 technology) or 64 bits (for MIPS64 technology) provided within said address generator 250, based on which a CPU mapped address can be generated. Thus, each CPU mapped address can be generated according to values stored within these corresponding registers. For example, the "source 1" generated CPU mapped address (provided from address generator 250 over line 212) can be the sum of: a value outputted from LRF system 106 over line 261' (which can also be a value of operand 222') and the corresponding "source 1" base address stored within said address generator 250. Similarly, the "source 2" (or "destination") generated CPU mapped address can be the sum of: a value outputted from LRF system 106 (which can also be a value of operand 223' (or 224')) and the corresponding "source 2" (or "destination") base address stored within said address generator 250.
According to this embodiment of the present invention, LRF system 106 is not required to store CPU mapped addresses (e.g., 32 or 64 bits long), and for each operand 222', 223' and 224', said address generator 250 outputs a corresponding CPU mapped address.
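The base-plus-offset address generation described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes that control bit "0" passes the input through unchanged and control bit "1" adds a per-operand base address, and that addresses wrap at 32 bits as for a MIPS32 memory map. The class name, the operand keys and the base values are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # assumed 32-bit CPU memory map (MIPS32-style)


@dataclass
class AddressGenerator:
    """Hypothetical software model of address generator 250."""

    # One base register per operand; the values are illustrative only.
    base: dict = field(default_factory=lambda: {
        "source1": 0x1000_0000,
        "source2": 0x2000_0000,
        "destination": 0x3000_0000,
    })

    def generate(self, operand: str, value: int, add_base: bool) -> int:
        """Return a CPU mapped address for one operand."""
        if not add_base:
            # Control bit "0": the input is already a CPU mapped address.
            return value & MASK32
        # Control bit "1": short value from the LRF plus the operand's base.
        return (self.base[operand] + value) & MASK32


gen = AddressGenerator()
src1 = gen.generate("source1", 0x2A, add_base=True)
passthrough = gen.generate("destination", 0x3000_0040, add_base=False)
```

With these assumed base values, an 8-bit value of 0x2A on the "source 1" path yields the 32-bit mapped address 0x1000002A, so the LRF only needs to hold the short value.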
According to still another embodiment of the present invention, the CPU mapped addresses (or data related to said CPU mapped addresses) outputted from LRF system 106 over lines 261', 262' and 263' can be auto-incremented for enabling access to other registers/memory cells of peripherals of SRF system 206, which correspond to the auto-incremented CPU mapped addresses. For this, the auto-incremented "source 1", "source 2" and "destination" CPU mapped addresses are inputted into LRF system 106 over lines 261", 262" and 263". Then, such addresses are written into corresponding registers (of LRF system 106), which are assigned by addresses provided over lines 281, 282 and 283 from instruction register 105. It should be noted that auto-increment commands can be provided within operands 222', 223' and 224' of the program word (for example, setting a specific bit in each of said operands to "0" can indicate that the auto-increment of said corresponding operand is enabled, and setting said specific bit to "1" can indicate that the auto-increment command is disabled).
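The auto-increment behaviour can be sketched as follows. This is only an illustration under stated assumptions: the LRF is modelled as a Python list of 32 registers, the increment step is taken to be 4 bytes (the text does not fix a step size), and the function name is hypothetical. The bit polarity follows the example in the text, where "0" enables auto-increment and "1" disables it.

```python
MASK32 = 0xFFFFFFFF
WORD = 4  # assumed increment step in bytes; not specified in the text


def read_with_auto_increment(lrf, reg, inc_bit):
    """Return the CPU mapped address held in lrf[reg].

    When the operand's auto-increment bit is 0 (enabled), the incremented
    address is written back into the same LRF register.
    """
    address = lrf[reg]
    if inc_bit == 0:
        lrf[reg] = (address + WORD) & MASK32
    return address


lrf = [0] * 32               # hypothetical 32-entry local register file
lrf[3] = 0x1000_0000         # register 3 holds a CPU mapped address

first = read_with_auto_increment(lrf, 3, inc_bit=0)
second = read_with_auto_increment(lrf, 3, inc_bit=0)
third = read_with_auto_increment(lrf, 3, inc_bit=1)   # increment disabled
```

Successive accesses through the same register therefore walk through consecutive memory cells without a separate add instruction.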
It should be noted that the result of data processing (performed by means of execution unit 130'), said result provided over line 233', is inputted into multiplexer 270 along with data provided over line 263" from LRF system 106, said data being a CPU mapped address or related to said CPU mapped address. According to operand 224' ("destination" address) provided within instruction register 105, multiplexer 270 outputs either data provided over line 233' (from execution unit 130') or data provided over line 263" (from LRF system 106). The data provided over line 263" is outputted from said multiplexer 270 if "destination" address 224' corresponds to a SRF system 206 address. On the other hand, the data provided over line 233' is outputted from said multiplexer 270 if "destination" address 224' does not correspond to said SRF system 206 address and corresponds to an address of a specific register (within LRF system 106) that stores data to be processed. It should be noted that when inputting (writing) the data into LRF system 106/SRF 206, corresponding write-enable commands are provided to said LRF system 106/SRF 206. Similarly, when outputting data from said LRF system 106/SRF 206, corresponding read-enable commands are provided.
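The selection performed by multiplexer 270 can be modelled in a few lines of Python. The sketch assumes a hypothetical rule for deciding whether a "destination" operand corresponds to an SRF system address (here, registers 16 to 31 of a 32-entry LRF are taken to hold SRF pointers); the specification does not define such a split, and the names are illustrative.

```python
# Hypothetical split: these LRF registers are assumed to hold SRF addresses.
SRF_POINTER_REGS = frozenset(range(16, 32))


def mux_270(dest_operand, result_233, address_263):
    """Select the value that is written into the LRF destination register."""
    if dest_operand in SRF_POINTER_REGS:
        # Destination corresponds to an SRF address: forward the address data.
        return address_263
    # Destination is an ordinary LRF data register: forward the result.
    return result_233


to_pointer_reg = mux_270(20, result_233=7, address_263=0x1000_0004)
to_data_reg = mux_270(2, result_233=7, address_263=0x1000_0004)
```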
According to another embodiment of the present invention, one or more operands of instruction register 105 (such as operands 222', 223' and 224') are CPU mapped addresses or data, based on which one or more CPU mapped addresses are generated by means of address generator 250.
Fig. 3A is a schematic illustration of spread register file system 206, according to an embodiment of the present invention. Spread register file system 206 receives as inputs: CPU mapped "source 1" address (MS1 address) over bus 212, CPU mapped "source 2" address (MS2 address) over bus 213 and CPU mapped "destination" address (MD address) over bus 214 (each 32 bits long, for example). These addresses are converted by means of address converters 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302, 303, ..., 310 (e.g., cache memories, tightly coupled memories, secondary memories, SRAM, DRAM, disk-on-keys, hard disks, CDs (Compact Discs), DVDs, or any other peripheral/memory means). It should be noted that said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions. In addition, the above addresses can be converted according to opcode 221' (Fig. 2A) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350, which generates corresponding control signals to address converters 320, 321 and 322 and to execution unit 130': for example, if opcode 221' relates to moving "source 1" data to the "destination" register, then only address converters 320 and 322 can be activated. According to an embodiment of the present invention, the converted "source 1" and "source 2" addresses (CS1 and CS2 addresses, respectively) are inputted into corresponding peripheral/memory means 301, 302, 303, ..., N, which in turn output the corresponding data stored at said addresses over "source 1" read bus 231 and "source 2" read bus 232. Then, said data is processed (executed) by means of one or more execution units 130' (such as ALUs).
After that, the processing result is provided over write back bus 233' to corresponding peripheral/memory means, such as peripherals 301, 302, 303, ..., N, to be stored in corresponding converted destination addresses (CD addresses) within said peripheral/memory means 301, 302, 303, ..., N. Alternatively, the processing result is provided into LRF system 106 (Fig. 2A). It should be noted that according to an embodiment of the present invention, the "source 1", "source 2", and "destination" memory cells can be physically located within the same or within different peripheral/memory means (such as peripheral/memory means 301, 302, 303,..., N).
According to another embodiment of the present invention, address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary "0" or "1") to each of said peripheral/memory means (data units) 301, 302, 303, ..., N, for enabling reading or writing from or to said peripheral/memory means. The corresponding WE/CS commands can be provided to each of said peripheral/memory means 301, 302, 303, ..., N, when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302 and 303. For example, for reading data from the converted "source 1" address (e.g., the address of a register within the corresponding peripheral), CS (read command) and WE (write command) signals provided to said corresponding peripheral are "1" and "0", respectively; in turn, for writing the data into the converted "destination" address of said corresponding peripheral, the WE signal is "1".
It should be noted that according to another embodiment of the present invention, address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", "source 2" and "destination" addresses into corresponding peripheral/memory means addresses. According to another embodiment of the present invention, instead of providing address converters 320, 321 and 322, the address decoding (or the address conversion) is performed within one or more peripherals/memory means 301, 302, 303, ..., N. Thus, according to this embodiment of the present invention, the need in providing said address converters 320, 321 and 322 is substantially eliminated. Peripherals/memory means 301, 302, 303, ..., N can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses within said peripherals/memory means 301, 302, 303, ..., N, in which the required data is stored (or to be stored). According to still another embodiment of the present invention, blocks 320, 321 and 322 can provide WE/CS commands to peripherals/memory means 301, 302, 303, ..., N, and do not perform the address conversion. Further, it should be noted that WE/CS commands can be generated by means of control unit 350. According to a further embodiment of the present invention, an address converter (such as address converter 320, 321 or 322) can be incorporated (integrated) within each (or one or more) peripheral/memory means 301, 302, 303, ..., or N. In such case, said peripheral/memory means receives a CPU mapped address and determines by means of the integrated address converter (according to predefined base-addresses of said peripheral/memory means), whether the received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. 
It should be noted that the base addresses of each peripheral/memory means can further be changed dynamically, as needed.
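The address conversion and the WE/CS signalling described above can be sketched in Python as follows. The sketch assumes that each data unit occupies a contiguous range of the CPU memory map starting at a base address, and that conversion subtracts that base; the specification leaves the converting function open. The class, the function, the base addresses and the sizes are hypothetical, and asserting CS together with WE on a write is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DataUnit:
    """Hypothetical model of one peripheral/memory means (e.g. 301, 302)."""

    name: str
    base: int    # assumed CPU mapped base address of this unit
    size: int    # number of addressable cells in this unit
    cells: dict = field(default_factory=dict)

    def access(self, local_addr, cs, we, data=None):
        """Read (CS=1, WE=0) or write (WE=1) one cell of the unit."""
        if not cs:
            return None                  # unit not selected
        if we:
            self.cells[local_addr] = data
            return None
        return self.cells.get(local_addr, 0)


def convert(units, mapped_addr):
    """Address converter: CPU mapped address -> (data unit, local address)."""
    for unit in units:
        if unit.base <= mapped_addr < unit.base + unit.size:
            return unit, mapped_addr - unit.base
    raise ValueError(f"unmapped address {mapped_addr:#x}")


units = [
    DataUnit("301", base=0x1000_0000, size=0x100),
    DataUnit("302", base=0x2000_0000, size=0x100),
]

unit, local = convert(units, 0x2000_0010)
unit.access(local, cs=1, we=1, data=99)    # write to the converted address
value = unit.access(local, cs=1, we=0)     # read it back
```

Changing a unit's `base` field at run time corresponds to the dynamically changeable base addresses mentioned above; the same range check could equally live inside each data unit when the converter is integrated into it.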
Fig. 3B is another schematic illustration of spread register file system 206, according to another embodiment of the present invention. According to this embodiment of the present invention, only a single execution unit 130' is used for both LRF system 106 (Fig. 2A) and SRF system 206. For this, two multiplexers 371 and 372 are provided, each of which selects the data to be processed (either data outputted from SRF system 206 or data outputted from LRF system 106) according to a control signal provided from control unit 350. In turn, the control signal is provided according to opcode 221' (Fig. 2A). Thus, if said opcode relates to processing of SRF system data, then the data from SRF system 206 is inputted into "source 1" and "source 2" inputs of execution unit 130' (over buses 231' and 232', respectively). On the other hand, if said opcode relates to processing of LRF system data, then the data from LRF system 106 is inputted into "source 1" and "source 2" inputs of execution unit 130' (over buses 231" and 232", respectively). It should be noted that according to an embodiment of the present invention, an opcode related to processing of LRF system data (e.g., adding "source 1" LRF system data to "source 2" LRF system data) differs from an opcode related to processing of SRF system data. According to another embodiment of the present invention, opcode 221' can comprise a control bit (that can be preset to "0" or "1"), which indicates whether SRF system or LRF system data has to be processed by means of execution unit 130': for example, if the bit is set to "1", then the data from SRF system is processed and if the bit is set to "0", then the data from LRF system is processed. Thus, multiplexers 371 and 372 can output the data into execution unit 130', according to said control bit.
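The operand selection of Fig. 3B can be illustrated with a short Python sketch. It assumes the control-bit variant described above ("1" selects SRF system data, "0" selects LRF system data) and uses a 32-bit addition as a stand-in for whatever operation execution unit 130' performs; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def select_sources(control_bit, srf_pair, lrf_pair):
    """Multiplexers 371 and 372: choose both operands from SRF or LRF."""
    return srf_pair if control_bit == 1 else lrf_pair


def execute_add(control_bit, srf_pair, lrf_pair):
    """Single shared execution unit, modelled here as a 32-bit adder."""
    source1, source2 = select_sources(control_bit, srf_pair, lrf_pair)
    return (source1 + source2) & MASK32


srf_result = execute_add(1, srf_pair=(100, 23), lrf_pair=(1, 2))
lrf_result = execute_add(0, srf_pair=(100, 23), lrf_pair=(1, 2))
```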
Fig. 4A is a pipeline representation 400 of operating with local register file system 106 (Fig. 2A) and with spread register file system 206 (Fig. 2A), according to an embodiment of the present invention. According to this embodiment, the pipeline has 12 stages (T0 to T11), each of which can correspond to a single CPU clock cycle. At the first stage T0, the address of the instruction (program word) to be fetched is conveyed from the Program Counter (PC) to the CPU program memory (e.g., RAM (not shown)). Then, at stage T1, the instruction (program word) is fetched from said program memory into instruction register 105 (Fig. 2A). After that, at stage T2, the fetched instruction is decoded by means of control unit 350 (Fig. 3A), which provides control signals during the pipeline stages. Also, the data from local register file system 106 (Fig. 2A) is read. If the read data are CPU mapped addresses, then such data is conveyed to the address converters 320, 321 and 322 (Fig. 3A). On the other hand, if the read data relates to corresponding CPU mapped addresses, and, based on such data, said CPU mapped addresses have to be generated, then at stage T3, address generator 250 (Fig. 2B) generates such CPU mapped addresses accordingly. Thus, for example, if the decoded opcode relates to moving "source 1" data (to which the "source 1" CPU mapped address is related) to the "destination" register that is provided, for example, within LRF system 106, then only the "source 1" CPU mapped address and the CPU mapped destination address are generated by means of said address generator 250 (the "source 2" CPU mapped address is not generated). In addition, at stage T3, the address of the next program counter instruction is determined, based on the decoded instruction. Also, for example, when a conventional JUMP or BRANCH command is issued, the next program counter address is calculated in accordance with a pointer of said JUMP/BRANCH command.
Then, at stage T4, the CPU mapped addresses are converted by means of said address converters 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripheral/memory means 301, 302 and 303 (Fig. 3A) (e.g., cache memories, secondary memories, disk-on-keys, hard disks, or any other peripheral/memory means). It should be noted that said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions. In addition, the above addresses can be converted according to opcode 221' (Fig. 2A) that can be inputted from instruction register 105 (Fig. 2A) into a control unit 350 (Fig. 3A), which generates corresponding control signals to address converters 320, 321 and 322 and to execution unit 130' (Fig. 3A): for example, if opcode 221' relates to moving "source 1" data to the "destination" register, then only address converters 320 and 322 can be activated. The converted "source 1" and "source 2" addresses are inputted (along with corresponding WE/CS signals, which are set to data "READ") from said address converters 320, 321 and 322 to peripherals/memory means 301, 302 and 303. Then, at the next stage T5, said peripherals/memory means 301, 302 and 303 generate a READ request to their internal memory, thereby enabling reading of the corresponding data stored within them at the received converted addresses. In the next stage T6, said corresponding data is read and ready, and then at stage T7, said data is latched and conveyed to the "source 1" and "source 2" read buses 231 and 232, respectively. After that, at stage T8, the data is provided over said read buses 231 and 232 into execution unit 130' (e.g., ALU). At the next two stages, T9 and T10, the data is processed by means of said execution unit 130'. It should be noted that the data can be fully processed in stage T9 alone, with no further processing required. Thus, the pipeline can have, for example, only 11 stages (T0 to T10).
At stage T11, the processing result is written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302, 303, ..., N, or within local register file system 106. It should be noted that the writing back operation can take more than a single CPU clock cycle. It should be noted that CPU control unit 350 controls the pipeline process by generating required control signals during the pipeline stages.
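The twelve stages of Fig. 4A can be summarised in a small Python table, together with a helper that shows which stage an instruction occupies in a given clock cycle. The stage descriptions paraphrase the text; the helper functions are hypothetical and assume an idealised pipeline in which one instruction is issued per clock cycle, every stage takes exactly one cycle and no stalls occur.

```python
# Stage summaries paraphrased from the description of Fig. 4A.
STAGES = [
    "T0: PC address sent to program memory",
    "T1: instruction fetched into instruction register 105",
    "T2: instruction decoded; LRF system 106 read",
    "T3: CPU mapped addresses generated; next PC determined",
    "T4: addresses converted and sent with WE/CS to data units",
    "T5: data units issue internal READ request",
    "T6: data read and ready",
    "T7: data latched onto read buses 231 and 232",
    "T8: data delivered to execution unit",
    "T9: data processed",
    "T10: data processed (optional second execution stage)",
    "T11: result written back",
]


def stage_at(instr, cycle):
    """Stage index of instruction `instr` (issued at cycle `instr`), or None."""
    stage = cycle - instr
    return stage if 0 <= stage < len(STAGES) else None


def writeback_cycle(instr):
    """Clock cycle in which instruction `instr` reaches the last stage."""
    return instr + len(STAGES) - 1


for i in range(3):
    print(f"instruction {i} is written back in cycle {writeback_cycle(i)}")
```

Under these assumptions, consecutive instructions complete one cycle apart once the pipeline is full, which is the behaviour the deep pipeline is meant to preserve despite the SRF access latency.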
According to an embodiment of the present invention, CPU stalls (delays) are reduced or substantially eliminated. For example, there can be substantially no CPU stalls if the access latency to SRF system 206 is 6 (or less) CPU clock cycles and the pipeline is relatively deep (e.g., 12 stages).
It should be further noted that according to an embodiment of the present invention, a number of instructions and CPU clock cycles required for manipulating/processing (e.g., moving, shifting) memory mapped data is significantly reduced, compared to the prior art. Thus, the number of instructions and corresponding CPU clock cycles for processing the data can be reduced, for example, to a single instruction that takes a single CPU clock cycle, enabling providing a substantially direct access between execution unit 130 and peripherals/memory means 301, 302, 303, ..., N.
According to another embodiment of the present invention, spread register file system 206 (Fig. 2A) can be shared between two or more processing units (e.g., CPU, microprocessor, and the like) and between other internal/external (on-chip/off-chip) peripherals or devices.
According to still another embodiment of the present invention, the structure of a conventional CPU program word, compared to the prior art, is not changed.
According to a further embodiment of the present invention, the need for using conventional DMA engines is eliminated.
Fig. 4B is another pipeline representation 401 of operating with local register file system 106 (Fig. 2B) and with spread register file system 206 (Fig. 2B), according to another embodiment of the present invention. According to this embodiment of the present invention, two execution units 130' and 130" (Fig. 2B) are provided: execution unit 130" for processing data outputted from LRF system 106 (Fig. 2B), and execution unit 130' for processing data outputted from SRF system 206. According to an embodiment of the present invention, execution unit 130" processes data outputted from LRF system 106 during stage T3 and, optionally, also during stage T4. Then, the result of data processing is written back into LRF system 106 at stage T5. Further, at stage T11, the result of processing by means of execution unit 130' can be written back into the "destination" register that is provided, for example, within peripherals/memory means 301, 302 and/or 303 (Fig. 3A) or within local register file system 106. It should be noted that the writing back operation can take more than a single CPU clock cycle.
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be put into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims

1. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing at least one mapped addresses, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses from said local register file system and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and b.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing the data outputted from said spread register file system.
2. The processing unit device according to claim 1, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
3. The processing unit device according to claim 1, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
4. The processing unit device according to claim 3, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
5. The processing unit device according to claim 1, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
6. The processing unit device according to claim 1, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
7. The processing unit device according to claim 1, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
8. The processing unit device according to claim 1, wherein a result of the data processing is stored within the one or more memory cells of the corresponding data unit of the spread register file system or within the local register file system.
9. The processing device according to claim 1, wherein the local register file system further comprises at least one write address input port, into which is inputted an address of one or more memory cells of said local register file system to be auto-incremented.
10. The processing device according to claim 1, wherein the local register file system further comprises at least one write data input port, into which is provided a command to auto-increment an address of one or more memory cells of said local register file system.
11. The processing unit device according to claim 1, wherein at least one data unit of the spread register file system is shared between two or more processing units.
12. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing at least one mapped addresses, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said at least one mapped address stored within said one or more registers is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; b.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and b.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and b.2. at least one output port for outputting said data to be processed from said one or more memory cells; and c) at least one execution unit for processing the data outputted from said spread register file system.
13. The processing unit device according to claim 12, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
14. The processing unit device according to claim 12, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
15. The processing unit device according to claim 14, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
16. The processing unit device according to claim 14, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
17. The processing unit device according to claim 14, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
18. The processing unit device according to claim 14, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
19. The processing unit device according to claim 14, wherein a result of the data processing is stored within the one or more memory cells of the corresponding data unit of the spread register file system or within the local register file system.
20. The processing device according to claim 14, wherein the local register file system further comprises at least one write address input port, into which is inputted an address of one or more memory cells of said local register file system to be auto-incremented.
21. The processing device according to claim 14, wherein the local register file system further comprises at least one write data input port, into which is provided a command to auto-increment an address of one or more memory cells of said local register file system.
22. The processing unit device according to claim 14, wherein at least one data unit of the spread register file system is shared between two or more processing units.
23. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing data based on which at least one mapped address to be generated, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; and c) a spread register file system, comprising: c.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; c.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses from said at least one address generator and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the data processing; and c.3. at least one output port for outputting said data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and d) at least one execution unit for processing the data outputted from said spread register file system.
24. The processing unit device according to claim 23, wherein the local register file further comprises one or more registers for storing one or more mapped addresses.
25. The processing unit device according to claim 23, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
26. The processing unit device according to claim 23, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
27. The processing unit device according to claim 26, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
28. The processing unit device according to claim 23, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
29. The processing unit device according to claim 23, wherein a result of the data processing is stored within the at least one memory cell of the corresponding data unit of the spread register file system or within the local register file system.
30. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing data based on which at least one mapped address to be generated, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which said data stored within said one or more registers is outputted; and b) at least one address generator for generating said at least one mapped address from said data stored within said one or more registers of said local register file system; c) a spread register file system, comprising: c.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: c.1.1. receive at least one mapped address; c.1.2. decode the received generated at least one mapped address and determine corresponding at least one memory data unit address; c.1.3. output data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and c.1.4. store data within one or more memory cells that correspond to said at least one memory data unit address; and c.2. at least one output port for outputting said data to be processed from said one or more memory cells; and d) at least one execution unit for processing the data outputted from said spread register file system.
31. The processing unit device according to claim 30, wherein the local register file further comprises one or more registers for storing one or more mapped addresses.
32. The processing unit device according to claim 30, further comprising an instruction register for storing at least one instruction to be processed, said at least one instruction comprising an opcode and one or more operands.
33. The processing unit device according to claim 30, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
34. The processing unit device according to claim 33, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
35. The processing unit device according to claim 30, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
36. The processing unit device according to claim 30, wherein a result of the data processing is stored within the at least one memory cell of the corresponding data unit of the spread register file system or within the local register file system.
37. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing first data to be processed, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses; b.2. at least one address converter, connected to one or more data units, for receiving one or more mapped addresses and converting them into the memory data unit addresses, wherein at least one first memory data unit address is of one or more memory cells that store the second data to be processed and at least one second memory data unit address is of one or more memory cells for storing a result of the second data processing; and b.3. at least one output port for outputting said second data to be processed from said one or more memory cells that correspond to said at least one first memory data unit address; and c) at least one execution unit for processing said first data outputted from said local register file system and for processing said second data outputted from said spread register file system.
38. The processing unit device according to claim 37, wherein the local register file further comprises one or more registers for storing one or more mapped addresses.
39. The processing unit device according to claim 37, wherein the local register file further comprises one or more registers for storing third data based on which the at least one mapped address is to be generated.
40. The processing unit device according to claim 39, further comprising at least one address generator for generating the at least one mapped address based on the third data.
41. The processing unit device according to claim 37, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
42. The processing unit device according to claim 41, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
43. The processing unit device according to claim 37, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
44. The processing unit device according to claim 37, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
45. The processing unit device according to claim 37, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
46. The processing device according to claim 37, wherein the local register file system further comprises one or more write address input ports, into which are inputted the addresses of memory cells of said local register file system to be auto-incremented.
47. The processing device according to claim 37, wherein the local register file system further comprises one or more write data input ports, into which are provided commands to auto-increment the addresses of memory cells of said local register file system.
48. The processing unit device according to claim 37, wherein a result of the second data processing is stored within the at least one memory cell of the corresponding data unit of the spread register file system or within the local register file system.
49. A processing unit device, comprising: a) a local register file system having: a.1. one or more registers for storing first data to be processed, said each register being assigned with a certain address that is specified in an instruction operand; and a.2. at least one data output port, from which the stored first data to be processed is outputted; and b) a spread register file system, comprising: b.1. a plurality of data units, each comprising a plurality of memory cells that are assigned with memory data unit addresses, each data unit configured to: b.1.1. receive at least one mapped address; b.1.2. decode the received at least one mapped address and determine corresponding at least one memory data unit address; b.1.3. output second data to be processed from one or more memory cells that correspond to said at least one memory data unit address; and b.1.4. store third data within one or more memory cells that correspond to said at least one memory data unit address; and b.2. at least one output port for outputting said second data to be processed from said one or more memory cells; and c) at least one execution unit for processing said first data outputted from said local register file system and for processing said second data outputted from said spread register file system.
50. The processing unit device according to claim 49, wherein the local register file further comprises one or more registers for storing one or more mapped addresses.
51. The processing unit device according to claim 49, wherein the local register file further comprises one or more registers for storing fourth data based on which the at least one mapped address is to be generated.
52. The processing unit device according to claim 51, further comprising at least one address generator for generating the at least one mapped address based on the fourth data.
53. The processing unit device according to claim 49, wherein the spread register file system further comprises at least one control input port configured to receive an opcode of an instruction to be processed.
54. The processing unit device according to claim 53, further comprising a control unit connected to the at least one control input port for receiving the opcode and enabling processing the instruction according to said opcode.
55. The processing unit device according to claim 49, wherein each data unit is configured to receive a read-enable command for enabling reading data from its one or more corresponding memory cells.
56. The processing unit device according to claim 49, wherein each data unit is configured to receive a write-enable command for enabling writing data into its one or more corresponding memory cells.
57. The processing unit device according to claim 49, wherein the data units are selected from one or more of the following: a) peripherals; b) memory means; and c) registers.
58. The processing device according to claim 49, wherein the local register file system further comprises one or more write address input ports, into which are inputted the addresses of memory cells of said local register file system to be auto-incremented.
59. The processing device according to claim 49, wherein the local register file system further comprises one or more write data input ports, into which are provided commands to auto-increment the addresses of memory cells of said local register file system.
60. The processing unit device according to claim 49, wherein a result of the second data processing is stored within the one or more memory cells of the corresponding data unit of the spread register file system or within the local register file system.
61. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store at least one PU mapped address; d) reading said at least one PU mapped address from said local register file system; e) performing second decoding or converting said at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; f) enabling reading the data stored in the at least one data unit address; g) processing the read data; and h) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
62. The method of processing a PU instruction according to claim 61, further comprising providing in the addresses of the one or more registers of the local register file system, the data based on which the at least one PU mapped address is to be generated.
63. The method of processing a PU instruction according to claim 62, further comprising providing at least one address generator for generating the at least one PU mapped address.
64. The method of processing a PU instruction according to claim 61, wherein one or more of the registers of the local register file store other data to be processed.
65. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and wherein one or more of said registers store first data to be used for generation of at least one PU mapped address; d) reading said first data stored within said one or more registers of said local register file system; e) generating said at least one PU mapped address based on said first data; f) performing second decoding or converting the generated PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one data unit address; h) processing the read second data; and i) writing back the result of said processing into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
66. The method of processing a PU instruction according to claim 65, wherein one or more of the registers of the local register file store other data to be processed.
67. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system and at least one of said operands is a PU mapped address, and wherein at least one of said registers stores first data to be processed; d) reading said first data to be processed from said local register file system; e) processing the read first data and writing back a result of said processing into the one or more corresponding registers of said local register file system; f) performing second decoding or converting the at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; g) enabling reading the second data stored in the at least one data unit address; h) processing said second data stored in said at least one data unit address; and i) writing back the result of said processing of said second data into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
68. A method of processing a processing unit (PU) instruction, comprising: a) conveying an address of a PU instruction to be processed into a program memory; b) fetching said PU instruction from said program memory; c) performing first decoding of said PU instruction and determining its opcode and operands, wherein one or more of said operands are addresses of registers of a local register file system, and at least one of said operands is first data, based on which at least one PU mapped address is to be generated, and wherein at least one of said registers stores second data to be processed; d) reading said first data to be used for generation of said at least one PU mapped address and reading said second data to be processed; e) generating said at least one PU mapped address based on said first data; f) processing the second data and writing back a result of said processing into one or more corresponding registers of said local register file system; g) performing second decoding or converting the generated at least one PU mapped address into an address of one or more memory cells of a corresponding PU data unit, giving rise to a data unit address; h) enabling reading the data stored in the at least one data unit address; i) processing said data stored in said at least one data unit address; and j) writing back the result of said processing of said data stored in said at least one data unit address into the one or more memory cells within the corresponding PU data unit or into one or more corresponding registers of said local register file system.
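By way of non-limiting illustration only, the method steps recited in claim 61 may be sketched in software as follows. The sketch is not part of the claims and does not describe any particular hardware embodiment. The names `SpreadRegisterFile`, `step`, `UNIT_SIZE`, the two opcodes, and the rule that splits a mapped address into a data unit index and a cell index are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field

UNIT_SIZE = 16  # assumed number of memory cells per data unit (illustrative)
NUM_UNITS = 4   # assumed number of data units (illustrative)


@dataclass
class SpreadRegisterFile:
    """Hypothetical model: several data units, each a list of memory cells."""
    units: list = field(
        default_factory=lambda: [[0] * UNIT_SIZE for _ in range(NUM_UNITS)]
    )

    def convert(self, mapped_addr):
        # "Second decoding": turn a mapped address into a data unit address.
        # The divmod split is an assumed mapping, not taken from the claims.
        unit, cell = divmod(mapped_addr, UNIT_SIZE)
        if not 0 <= unit < NUM_UNITS:
            raise IndexError(f"mapped address {mapped_addr} is out of range")
        return unit, cell

    def read(self, mapped_addr):
        unit, cell = self.convert(mapped_addr)
        return self.units[unit][cell]

    def write(self, mapped_addr, value):
        unit, cell = self.convert(mapped_addr)
        self.units[unit][cell] = value


def step(pc, program_memory, local_regs, spread):
    """Process one instruction along the lines of steps a) to h) of claim 61."""
    instr = program_memory[pc]          # a), b) convey the address and fetch
    opcode, reg = instr                 # c) first decoding: opcode and operand
    mapped_addr = local_regs[reg]       # d) read the mapped address
    value = spread.read(mapped_addr)    # e), f) convert the address, then read
    if opcode == "INC":                 # g) process the read data
        result = value + 1
    elif opcode == "NEG":
        result = -value
    else:
        raise ValueError(f"unknown opcode {opcode!r}")
    spread.write(mapped_addr, result)   # h) write back into the data unit
    return result
```

For example, with local register 5 holding the mapped address 35 and cell 3 of data unit 2 holding 41, executing the instruction `("INC", 5)` reads 41 through the address conversion and writes 42 back into the same cell. Writing the result into a local register instead, as claim 61 alternatively permits, would replace only the final write.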
PCT/IL2009/000471 2008-05-07 2009-05-07 Improved processing unit implementing both a local register file system and spread register file system, and a method thereof WO2009136401A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US7158308P 2008-05-07 2008-05-07
US61/071,583 2008-05-07

Publications (2)

Publication Number Publication Date
WO2009136401A2 (en) 2009-11-12
WO2009136401A3 (en) 2010-03-11

Family

ID=41265109

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2009/000471 WO2009136401A2 (en) 2008-05-07 2009-05-07 Improved processing unit implementing both a local register file system and spread register file system, and a method thereof

Country Status (1)

Country Link
WO (1) WO2009136401A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240745A1 (en) * 2003-12-18 2005-10-27 Sundar Iyer High speed memory control and I/O processor system
US20080065862A1 (en) * 1995-08-16 2008-03-13 Microunity Systems Engineering, Inc. Method and Apparatus for Performing Data Handling Operations

Also Published As

Publication number Publication date
WO2009136401A3 (en) 2010-03-11

Similar Documents

Publication Publication Date Title
TWI567646B (en) Inter-architecture compatability module to allow code module of one architecture to use library module of another architecture
KR101121606B1 (en) Thread optimized multiprocessor architecture
RU2636675C2 (en) Commands, processors, methods and systems of multiple registers access to memory
US7473293B2 (en) Processor for executing instructions containing either single operation or packed plurality of operations dependent upon instruction status indicator
RU2638641C2 (en) Partial width loading depending on regime, in processors with registers with large number of discharges, methods and systems
KR100462951B1 (en) Eight-bit microcontroller having a risc architecture
EP0199173B1 (en) Data processing system
JPH05502125A (en) Microprocessor with last-in, first-out stack, microprocessor system, and method of operating a last-in, first-out stack
TW200403583A (en) Controlling compatibility levels of binary translations between instruction set architectures
RU2639695C2 (en) Processors, methods and systems for gaining access to register set either as to number of small registers, or as to integrated big register
US20130297918A1 (en) Apparatus for Predicate Calculation in Processor Instruction Set
KR100465388B1 (en) Eight-bit microcontroller having a risc architecture
US20110072170A1 (en) Systems and Methods for Transferring Data to Maintain Preferred Slot Positions in a Bi-endian Processor
WO2019172987A1 (en) Geometric 64-bit capability pointer
US9639362B2 (en) Integrated circuit device and methods of performing bit manipulation therefor
US6012138A (en) Dynamically variable length CPU pipeline for efficiently executing two instruction sets
CN113900710A (en) Expansion memory assembly
CN114945984A (en) Extended memory communication
US20030196072A1 (en) Digital signal processor architecture for high computation speed
WO2012061416A1 (en) Methods and apparatus for a read, merge, and write register file
WO2009136402A2 (en) Register file system and method thereof for enabling a substantially direct memory access
WO2009136401A2 (en) Improved processing unit implementing both a local register file system and spread register file system, and a method thereof
US9021238B2 (en) System for accessing a register file using an address retrieved from the register file
US11775310B2 (en) Data processing system having distrubuted registers
JP3105110B2 (en) Arithmetic unit

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09742572

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

WPC Withdrawal of priority claims after completion of the technical preparations for international publication

Ref document number: 61/071,583

Country of ref document: US

Date of ref document: 20101025

Free format text: WITHDRAWN AFTER TECHNICAL PREPARATION FINISHED

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24/01/2012)

122 Ep: pct application non-entry in european phase

Ref document number: 09742572

Country of ref document: EP

Kind code of ref document: A2