WO2010044242A1 - Data processing device - Google Patents
- Publication number
- WO2010044242A1 PCT/JP2009/005306
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- unit
- register
- register file
- arithmetic
- units
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30134—Register stacks; shift registers
- G06F9/34—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
Definitions
- the present invention relates to a data processing apparatus having a plurality of arithmetic units and capable of performing arithmetic processing by each arithmetic unit synchronously.
- The arithmetic unit array method cannot execute existing machine language instructions. It therefore requires a dedicated means for generating the machine language instructions peculiar to the arithmetic unit array method, and so lacks versatility.
- a superscalar method, a vector method, and a VLIW method are known as methods that execute general machine language instructions and that can execute machine language instructions in parallel.
- the superscalar method is a method in which hardware dynamically detects machine language instructions that can be executed simultaneously from a machine language instruction sequence and executes them in parallel.
- The superscalar method has the advantage that existing software assets can be used as they are, but it has recently tended to be avoided because of the complexity of its mechanism and its high power consumption.
- The vector method repeatedly applies basic operations such as load, arithmetic, and store using vector registers, in which a large number of registers are arranged one-dimensionally, and achieves high speed with high power efficiency. Furthermore, since no cache memory is required, the data transfer speed between the main memory and the vector registers is guaranteed, and as a result stable high speed is realized.
- the VLIW method is a method in which a plurality of operations and the like are designated in one instruction and are executed simultaneously.
- In this VLIW method, for example, four instructions are fetched simultaneously and decoded simultaneously, the necessary data is read from the general-purpose registers, the operations are performed simultaneously by a plurality of arithmetic units, and each operation result is stored in the operation result storage means attached to the arithmetic unit.
- In the next cycle, the contents are read from the operation result storage means and written to the general-purpose registers.
- The operation result held in the storage means can also be bypassed to the input of an arithmetic unit.
- For a load, the LD/ST unit refers to the cache memory and stores the load result in the load result storage means attached to the LD/ST unit; in the next cycle, the LD/ST unit performs the same operation as the arithmetic units.
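The VLIW flow described above (simultaneous register reads, simultaneous execution, results latched in per-unit storage, write-back in the following cycle) can be sketched as follows. This is only an illustrative model: the function name `execute_bundle`, the tuple layout `(op, dst, src_a, src_b)`, and the four operation names are assumptions made for the example and do not come from the patent.

```python
import operator

# Hypothetical operation table; the patent only mentions arithmetic and logical operations.
OPS = {"add": operator.add, "sub": operator.sub,
       "and": operator.and_, "or": operator.or_}

def execute_bundle(regs, bundle):
    """Execute one bundle of simultaneously issued instructions.

    regs   : list of general-purpose register values
    bundle : list of (op, dst, src_a, src_b) tuples, one per arithmetic unit
    """
    # Cycle 1: every unit reads the same register snapshot and latches its
    # result in its own result storage (modelled here as a list of pairs).
    result_storage = [(dst, OPS[op](regs[a], regs[b]))
                      for op, dst, a, b in bundle]
    # Cycle 2: the stored results are written back to the general-purpose registers.
    new_regs = list(regs)
    for dst, value in result_storage:
        new_regs[dst] = value
    return new_regs

regs = [0, 1, 2, 3, 4, 5, 6, 7]
bundle = [("add", 0, 1, 2), ("sub", 1, 3, 2), ("and", 2, 6, 3), ("or", 3, 4, 1)]
result = execute_bundle(regs, bundle)
```

Because all four slots read `regs` before any write-back happens, the fourth slot sees the old value of register 1 even though the second slot overwrites it in the same bundle.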
- This is because each port of the register file needs a circuit that selects one signal line out of as many signal lines as there are registers in the register file, so that the contents of any register can be supplied to any arithmetic unit.
- the conventional VLIW method has a feature that a large number of program resources can be used, but only a few instructions can be executed simultaneously.
- an object of the present invention is to provide a data processing apparatus capable of executing more instructions in parallel.
- The data processing apparatus executes an instruction code composed of a plurality of machine language instructions, and includes an instruction memory unit that holds the instruction code and an instruction fetch/decode unit that fetches the instruction code from the instruction memory unit and decodes it.
- It further includes a first register file unit including a plurality of first registers, which correspond one-to-one to the register numbers described in the decoded instruction code and temporarily hold data corresponding to those register numbers, and a second register file unit including a plurality of second registers, which correspond one-to-one to the first registers of the first register file unit.
- The data of each first register of the first register file unit is transferred to the corresponding second register of the second register file unit.
- The second arithmetic unit can therefore read the data from the second registers of the second register file unit and use it to execute operations.
- The calculation result of the first arithmetic unit is transferred to the second arithmetic unit.
- The second arithmetic unit can therefore use the calculation result of the first arithmetic unit immediately after the calculation by the first arithmetic unit.
- The data processing apparatus of the present invention includes: an instruction memory unit that holds the instruction code; an instruction fetch/decode unit that fetches and decodes the instruction code from the instruction memory unit;
- n (n is an integer greater than or equal to 1) register files including a first register file unit, which includes a plurality of first registers that correspond one-to-one to the register numbers described in the decoded instruction code and temporarily hold data corresponding to those register numbers, and a second register file unit, which includes a plurality of second registers corresponding one-to-one to the first registers of the first register file unit;
- n arithmetic units including a first arithmetic unit, which performs an operation using read data of the first registers of the first register file unit, and a second arithmetic unit; and n holding units including a first holding unit that temporarily holds the calculation result of the first arithmetic unit. When a first register holds data, the first register file unit transfers that data to the corresponding second register of the second register file unit.
- The first holding unit can transfer the calculation result to the second arithmetic unit, and the second arithmetic unit executes its operation using at least one of the read data of the second registers of the second register file unit and the calculation result transferred by the first holding unit.
- Embodiments 1 to 10 describe the configuration of the data processing apparatus of the present invention, and Embodiment 11 then describes the processing procedure of the data processing method of the present invention.
- FIG. 1 is a diagram showing a configuration of a data processing apparatus according to Embodiment 1 of the present invention.
- The data processing apparatus 101 in this embodiment includes an instruction memory unit 10, an instruction fetch unit (instruction fetch/decode unit) 20, an instruction decode unit (instruction fetch/decode unit) 30, a first register file unit 110, a second register file unit 210, a first arithmetic unit (first arithmetic unit, first holding unit) 120, and a second arithmetic unit (second arithmetic unit, second holding unit) 220.
- the instruction memory unit 10 can be appropriately selected from known magnetic disk devices such as hard disk drives and known storage devices such as semiconductor memories.
- the instruction memory unit 10 holds a program composed of a plurality of instructions, and may be a partial area of the main memory, or may be an instruction buffer that holds a part of the main memory.
- the instruction fetch unit 20 fetches a necessary instruction from the instruction memory unit 10, and the instruction decoding unit 30 decodes the fetched instruction.
- the processing contents in the first and second arithmetic units 120 and 220 are determined by the decoding result by the instruction decoding unit 30.
- the data processing apparatus 101 is premised on a known VLIW processor architecture.
- It is assumed that the instruction fetch unit 20 fetches, for example, four 32-bit-wide instructions simultaneously and that the instruction decode unit 30 decodes them simultaneously.
- the first register file unit 110 holds data necessary for arithmetic processing in the first arithmetic unit 120.
- The first register file unit 110 includes a register group 111 consisting of a plurality of registers (first registers) r0 to r11, and a transfer unit 112 for transferring read data of the registers r0 to r11 of the register group 111 to the outside of the first register file unit 110.
- Reading and writing to the registers r0 to r11 of the register group 111 are executed based on the decoding result by the instruction decoding unit 30.
- Each register r0 to r11 of the register group 111 is read or written using its own register number 0 to 11 as an access key.
- the transfer unit 112 transfers the data held in the register with the specified number to the outside of the first register file unit 110.
- the second register file unit 210 holds data necessary for arithmetic processing in the second arithmetic unit 220.
- The second register file unit 210 includes a register group 211 consisting of a plurality of registers (second registers) r0 to r11, and a transfer unit 212 for transferring read data of the registers r0 to r11 of the register group 211 to the outside of the second register file unit 210.
- Reading and writing to each of the registers r0 to r11 in the register group 211 is executed based on the decoding result by the instruction decoding unit 30.
- Each register r0 to r11 of the register group 211 is read or written using its own register number 0 to 11 as an access key.
- The registers r0 to r11 of the register group 211 correspond one-to-one to the registers r0 to r11 of the register group 111 of the first register file unit 110, and the registers of the two groups are associated with each other by register number. The transfer unit 112 of the first register file unit 110 can therefore transfer the read data of each register r0 to r11 of the register group 111 to the register with the same register number in the register group 211 of the second register file unit 210.
- the transfer unit 112 of the first register file unit 110 can transfer the read data of the register r3 of the register group 111 to the register r3 of the register group 211 of the second register file unit 210.
- the transfer unit 112 of the first register file unit 110 can transfer the read data of the register r9 of the register group 111 to the register r9 of the register group 211 of the second register file unit 210.
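The same-number transfer between register groups 111 and 211 can be modelled with a small sketch. The class `RegisterFile` and its method names are hypothetical, chosen only for illustration; the twelve registers r0 to r11 follow the description, and `None` is used here (as an assumption) to mark a register that holds no data.

```python
class RegisterFile:
    """Minimal model of one register file unit with registers r0..r11."""

    def __init__(self, size=12):
        # None stands for "this register currently holds no data" (assumption).
        self.regs = [None] * size

    def write(self, number, value):
        self.regs[number] = value

    def read(self, number):
        return self.regs[number]

    def transfer_to(self, other, numbers):
        """Copy the listed registers into the same-numbered registers of `other`."""
        for n in numbers:
            other.write(n, self.read(n))

rf1 = RegisterFile()   # stands in for first register file unit 110
rf2 = RegisterFile()   # stands in for second register file unit 210
rf1.write(3, 42)
rf1.write(9, 7)
rf1.transfer_to(rf2, [3, 9])   # r3 -> r3 and r9 -> r9
```

Since the destination is always the register with the same number, the transfer needs no routing information beyond the register number itself.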
- The transfer unit 212 transfers the data held in the register with the specified number to the outside of the second register file unit 210.
- the first arithmetic unit 120 performs substantial processing in the data processing apparatus 101.
- The first arithmetic unit 120 includes an arithmetic unit group 121 consisting of arithmetic units 1-1 to 1-4, a holder group 122 consisting of holders 1-1 to 1-4, and a transfer unit 123.
- The first arithmetic unit 120 constitutes a first data processing stage together with the first register file unit 110, and the transfer unit 112 of the first register file unit 110 can transfer the read data of the registers r0 to r11 of the register group 111 to the first arithmetic unit 120.
- Each of the arithmetic units 1-1 to 1-4 of the arithmetic unit group 121 obtains two read data from the registers r0 to r11 of the first register file unit 110 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic units 1-1 to 1-4 execute their processing simultaneously.
- The holders 1-1 to 1-4 of the holder group 122 store the calculation results of the corresponding arithmetic units 1-1 to 1-4.
- The holders 1-1 to 1-4 correspond one-to-one to the arithmetic units 1-1 to 1-4.
- The transfer unit 123 transfers the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the outside of the first arithmetic unit 120.
- the second arithmetic unit 220 performs substantial processing in the data processing apparatus 101.
- The second arithmetic unit 220 includes an arithmetic unit group 221 consisting of arithmetic units 2-1 to 2-4, a holder group 222 consisting of holders 2-1 to 2-4, and a transfer unit 223.
- The second arithmetic unit 220 constitutes a second data processing stage together with the second register file unit 210, and the transfer unit 212 of the second register file unit 210 can transfer the read data of the registers r0 to r11 of the register group 211 to the second arithmetic unit 220.
- Each of the arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 obtains two read data from the registers r0 to r11 of the second register file unit 210 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic units 2-1 to 2-4 execute their processing simultaneously.
- The arithmetic units 2-1 to 2-4 of the second arithmetic unit 220 can also acquire the calculation results stored in the holders 1-1 to 1-4 of the holder group 122 of the first arithmetic unit 120.
- That is, the transfer unit 123 of the first arithmetic unit 120 can transfer the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic unit 220.
- The arithmetic units 2-1 to 2-4 of the second arithmetic unit 220 can then execute arithmetic processing using these calculation results instead of the read data of the registers r0 to r11 of the second register file unit 210.
- The holders 2-1 to 2-4 of the holder group 222 store the calculation results of the corresponding arithmetic units 2-1 to 2-4.
- The holders 2-1 to 2-4 correspond one-to-one to the arithmetic units 2-1 to 2-4.
- The transfer unit 223 transfers the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the outside of the second arithmetic unit 220.
- the arithmetic processing by the first arithmetic device 120 is performed using the read data of the registers r0 to r11 of the register group 111.
- the read data of the registers r0 to r11 of the register group 111 that is not the target of the arithmetic processing by the first arithmetic device 120 is transferred to the second register file unit 210.
- the arithmetic processing by the second arithmetic unit 220 is performed using the data transferred to the registers r0 to r11 of the register group 211 of the second register file unit 210.
- the arithmetic processing by the first arithmetic device 120 is performed using the read data of the registers r0 to r11 of the register group 111.
- The transfer unit 123 of the first arithmetic unit 120 transfers the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic unit 220.
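The two data paths into the second stage (register data passed through the second register file unit, and a first-stage result forwarded from a holder) can be illustrated with a deliberately simplified model. It uses one arithmetic unit per stage instead of four, copies every register to the second stage rather than only the unused ones, and the function name `run_two_stages` and the `"h"` marker for "take the holder value" are assumptions made for this sketch.

```python
import operator

def run_two_stages(rf1, stage1_op, stage2_op):
    """Model two data processing stages with one arithmetic unit each.

    rf1       : dict mapping register number -> value (first register file)
    stage1_op : (function, src_a, src_b), sources are register numbers
    stage2_op : (function, src_a, src_b), a source of "h" selects the
                stage-1 holder instead of a second-stage register
    """
    # Cycle 1: stage 1 computes and latches its result in the holder, while
    # the register contents move to the same-numbered stage-2 registers.
    f1, a1, b1 = stage1_op
    holder1 = f1(rf1[a1], rf1[b1])
    rf2 = dict(rf1)

    # Cycle 2: stage 2 picks each operand from the holder or from its registers.
    def pick(source):
        return holder1 if source == "h" else rf2[source]

    f2, a2, b2 = stage2_op
    return f2(pick(a2), pick(b2))

rf1 = {1: 2, 2: 3, 3: 10}
# Register path: stage 2 uses only data passed through the register files.
via_registers = run_two_stages(rf1, (operator.add, 1, 2), (operator.sub, 3, 1))
# Forwarding path: stage 2 consumes the stage-1 result (r1 + r2) directly.
via_holder = run_two_stages(rf1, (operator.add, 1, 2), (operator.mul, "h", 3))
```

In the forwarding case the stage-1 sum never has to be written to a register before the second stage multiplies it, which is the point of transferring holder contents directly to the next arithmetic unit.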
- FIG. 2 is a diagram showing the configuration of the data processing apparatus according to Embodiment 2 of the present invention.
- the same reference numerals are given to the same parts as those of the first embodiment of the present invention, and the detailed description thereof is omitted.
- The data processing apparatus 102 in the present embodiment differs from the data processing apparatus 101 in the first embodiment in that it further includes a third register file unit 310 and a third arithmetic unit (third arithmetic unit, third holding unit) 320.
- Processing using the third register file unit 310 and the third arithmetic unit (third arithmetic unit, third holding unit) 320 is also executed simultaneously.
- the third register file unit 310 holds data necessary for arithmetic processing in the third arithmetic unit 320.
- The third register file unit 310 includes a register group 311 consisting of a plurality of registers (third registers) r0 to r11, and a transfer unit 312 for transferring read data of the registers r0 to r11 of the register group 311 to the outside of the third register file unit 310.
- Reading and writing to the registers r0 to r11 of the register group 311 are executed based on the decoding result by the instruction decoding unit 30.
- Each register r0 to r11 of the register group 311 is read or written using its own register number 0 to 11 as an access key.
- The registers r0 to r11 of the register group 311 correspond one-to-one to the registers r0 to r11 of the register group 211 of the second register file unit 210, and the registers of the two groups are associated with each other by register number. The transfer unit 212 of the second register file unit 210 can therefore transfer the read data of each register r0 to r11 of the register group 211 to the register with the same register number in the register group 311 of the third register file unit 310.
- the transfer unit 312 transfers the data held in the register with the designated number to the outside of the third register file unit 310.
- The third register file unit 310 can acquire, via the transfer unit 123 of the first arithmetic unit 120, the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 of the first arithmetic unit 120.
- the third arithmetic device 320 performs substantial processing in the data processing device 102.
- The third arithmetic unit 320 includes an arithmetic unit group 321 consisting of arithmetic units 3-1 to 3-4, a holder group 322 consisting of holders 3-1 to 3-4, and a transfer unit 323.
- The third arithmetic unit 320 constitutes a third data processing stage together with the third register file unit 310, and the transfer unit 312 of the third register file unit 310 can transfer the read data of the registers r0 to r11 of the register group 311 to the third arithmetic unit 320. Each of the arithmetic units 3-1 to 3-4 of the arithmetic unit group 321 obtains two read data from the registers r0 to r11 of the third register file unit 310 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic units 3-1 to 3-4 execute their processing simultaneously.
- The holders 3-1 to 3-4 of the holder group 322 store the calculation results of the corresponding arithmetic units 3-1 to 3-4.
- The holders 3-1 to 3-4 correspond one-to-one to the arithmetic units 3-1 to 3-4.
- The transfer unit 323 transfers the calculation results of the arithmetic units 3-1 to 3-4 stored in the holders 3-1 to 3-4 to the outside of the third arithmetic unit 320.
- The third arithmetic unit 320 can acquire, via the transfer unit 223 of the second arithmetic unit 220, the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 of the second arithmetic unit 220.
- the arithmetic processing by the second arithmetic device 220 is performed using the read data of the registers r0 to r11 of the register group 211.
- the read data of the registers r0 to r11 of the register group 211 that is not subject to the arithmetic processing by the second arithmetic device 220 is transferred to the third register file unit 310.
- the arithmetic processing by the third arithmetic unit 320 is performed using the data transferred to the registers r0 to r11 of the register group 311 of the third register file unit 310.
- the arithmetic processing by the second arithmetic device 220 is performed using the read data of the registers r0 to r11 of the register group 211.
- The transfer unit 223 of the second arithmetic unit 220 transfers the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the third arithmetic unit 320.
- When the second arithmetic unit 220 does not need the calculation result of the first arithmetic unit 120 but the third arithmetic unit 320 does, the result of the first arithmetic unit 120 is stored in the third register file unit 310, so that it can be input to the third arithmetic unit 320 indirectly.
- More generally, let N be an integer of 1 or more. When the calculation result of the arithmetic unit constituting the Nth data processing stage is used by arithmetic units at or after the (N + 2)th data processing stage, the result is written into the register file unit of the (N + 2)th data processing stage.
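The routing rule for a stage-N result can be written down as a small decision function. This is a sketch under assumptions: the function name `route_result` and the returned labels are invented for illustration, and the sketch assumes that a value written into the register file unit of stage N + 2 reaches later stages through the normal register-to-register transfer.

```python
def route_result(producer_stage, consumer_stage):
    """Decide how a result produced in one stage reaches a later stage.

    Returns a (path, stage) pair:
      ("forward", N + 1)        holder of stage N feeds the next arithmetic unit
      ("register_file", N + 2)  result is written into the register file unit
                                of stage N + 2 and read from there (or passed
                                on to later register files)
    """
    distance = consumer_stage - producer_stage
    if distance < 1:
        raise ValueError("the consumer must be in a later stage than the producer")
    if distance == 1:
        return ("forward", consumer_stage)
    return ("register_file", producer_stage + 2)

next_stage = route_result(1, 2)
skip_one = route_result(1, 3)
far_away = route_result(1, 5)
```

For a producer in stage 1, a consumer in stage 3 and a consumer in stage 5 both lead to a write into the stage-3 register file unit; only the immediately following stage takes the value straight from the holder.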
- FIG. 3 is a diagram showing the configuration of the data processing apparatus according to Embodiment 4 of the present invention.
- parts similar to those of the second embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- The data processing apparatus 103 in the present embodiment differs from the data processing apparatus 102 in the second embodiment in that it further includes a first load/store unit (load unit, store unit) 130 and a first cache memory 140.
- the first load / store unit 130 and the first cache memory 140 together with the first arithmetic unit 120 and the first register file unit 110 constitute a first data processing stage.
- The first load/store unit 130 has a load unit group 131 consisting of load units (LD) 1-1 and 1-2, and a store unit group 132 consisting of store units (ST) 1-1 and 1-2.
- the first cache memory 140 is connected to the first load / store unit 130, and reading and writing are executed at high speed according to the load and store operations by the first load / store unit 130.
- It is configured using a small-capacity cache memory different from the large-capacity cache memory used during non-array operation.
- Dirty lines necessary for the array operation are temporarily saved to an external memory (not shown), after which the apparatus shifts to the array operation using the first cache memory 140. This maintains consistency between the contents of the cache memory used during the non-array operation and the contents of the first cache memory 140.
- the data processing apparatus is premised on a known VLIW processor architecture.
- Machine language instructions in the VLIW format are normally executed by the first register file unit 110, the first arithmetic unit 120, the first load/store unit 130, and the first cache memory 140.
- Arithmetic processing by the VLIW method is thus carried out by the first register file unit 110, the first arithmetic unit 120, the first load/store unit 130, and the first cache memory 140.
- The register information necessary for starting the simultaneous operation of the arithmetic processing by the plurality of arithmetic units in the first to third embodiments described above (hereinafter sometimes referred to as “array operation”) is always stored in the first register file unit 110.
- Control information A is set for each data processing stage. It consists of source register numbers, the operation type of the arithmetic processing by each arithmetic unit, and destination register numbers, that is, the numbers of the registers in which the calculation results of each arithmetic unit are stored.
- the control information A may be arranged as additional information for the array operation start command. In this case, the control information A can be obtained at a time when the array operation start instruction is decoded.
- the control information A may be supplied as a subsequent VLIW instruction sequence itself.
- The subsequent VLIW instructions are decoded one after another until the backward branch instruction that means loop repetition, that is, the instruction corresponding to the final stage of the array operation, is decoded.
- A forward branch instruction that means exit from the loop, that is, an instruction corresponding to an array operation termination condition (operation termination condition), can be detected and set as a stop condition. For this reason, the control information that has to be added to the existing instruction sequence can be reduced.
- control information arrives at the same time as the first data arrives at the computing device in FIG.
- An array operation termination condition for stopping the array operation of the arithmetic unit in each stage is added to the control information A, so that when the pre-designated condition is satisfied during the array operation, the apparatus automatically returns to non-array operation.
- the array operation termination condition is specifically the number of execution cycles of the arithmetic unit in each data processing stage.
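One possible way to represent control information A together with a cycle-count termination condition is sketched below. The class names `SlotControl` and `StageControl`, their field names, and the helper `array_operation_active` are all hypothetical; the patent specifies only which kinds of information are carried (source register numbers, operation type, destination register number, termination condition), not a concrete layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotControl:
    """Control for one arithmetic unit within a stage (assumed layout)."""
    op: str      # operation type, e.g. "add"
    src_a: int   # first source register number
    src_b: int   # second source register number
    dst: int     # destination register number

@dataclass(frozen=True)
class StageControl:
    """Control information A for one data processing stage (assumed layout)."""
    slots: tuple      # one SlotControl per arithmetic unit in the stage
    max_cycles: int   # termination condition: number of execution cycles

def array_operation_active(control, cycles_executed):
    """True while the stage has not yet reached its execution-cycle limit."""
    return cycles_executed < control.max_cycles

stage1 = StageControl(
    slots=(SlotControl("add", 1, 2, 4), SlotControl("sub", 3, 1, 5)),
    max_cycles=3,
)
active = [array_operation_active(stage1, c) for c in range(5)]
```

With `max_cycles=3`, the stage stays in array operation for cycles 0 to 2 and would return to non-array operation from cycle 3 onward.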
- FIG. 4 is a diagram showing the configuration of the data processing apparatus according to the fifth embodiment of the present invention.
- parts similar to those of the fourth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- the difference between the data processing apparatus 104 in the present embodiment and the data processing apparatus 103 in the fourth embodiment is that an external memory 150 is further provided.
- the external memory 150 is connected only to the first cache memory 140 held by the first load / store unit 130. From the second stage onward, data in the first cache memory 140 is sequentially propagated. This simplifies the connection between the external memory 150 and the cache memory at each data processing stage.
- A load instruction refers to the first cache memory 140 at the address obtained when the first arithmetic unit 120 adds to or subtracts from the address information stored in the first register file unit 110, and the obtained data is stored in the store units 1-1 and 1-2 of the store unit group 132 of the first load/store unit 130.
- the data stored in the store units 1-1 and 1-2 are input to the subsequent arithmetic unit or register file unit in the next cycle.
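The load path described above (address formed by adding to or subtracting from a register value, then a lookup in the first cache memory) can be sketched as follows. The function `load`, the dictionary-based cache, and in particular the refill-on-miss behaviour are assumptions for illustration; the description states only that the external memory 150 is connected to the first cache memory 140, not how misses are handled.

```python
def load(regs, base_reg, offset, cache, external_memory):
    """Compute the load address and read the data through the first cache.

    regs            : dict register number -> value (first register file)
    base_reg        : register holding the address information
    offset          : signed value added to (or subtracted from) the base
    cache           : dict address -> data, stands in for first cache memory 140
    external_memory : dict address -> data, stands in for external memory 150
    """
    address = regs[base_reg] + offset   # address arithmetic in the first stage
    if address not in cache:
        # Assumed policy: on a miss, the line is fetched from external memory.
        cache[address] = external_memory[address]
    return cache[address]

regs = {0: 0x100}
external_memory = {0x0FC: 11, 0x104: 22}
cache = {}
first = load(regs, 0, +4, cache, external_memory)   # address 0x104
second = load(regs, 0, -4, cache, external_memory)  # address 0x0FC
```

After both loads, the cache holds the two addresses that were touched, so a repeated access to either one no longer needs the external memory.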
- FIG. 5 is a diagram showing the configuration of the data processing apparatus according to Embodiment 6 of the present invention.
- parts similar to those of the fifth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- the difference between the data processing apparatus 105 in the present embodiment and the data processing apparatus 104 in the fifth embodiment is that a second load / store unit 230, a third load / store unit 330, a second cache memory 240, and a third cache memory 340 are further provided.
- the second load / store unit 230 and the second cache memory 240 together with the second arithmetic unit 220 and the second register file unit 210 constitute a second data processing stage.
- the third load / store unit 330 and the third cache memory 340 together with the third arithmetic device 320 and the third register file unit 310 constitute a third data processing stage.
- the second load / store unit 230 has a load unit group 231 composed of load units (LD) 2-1 and 2-2, and a store unit group 232 composed of store units (ST) 2-1 and 2-2.
- the third load / store unit 330 has a load unit group 331 composed of load units (LD) 3-1 and 3-2, and a store unit group 332 composed of store units (ST) 3-1 and 3-2.
- the second cache memory 240 is connected to the second load / store unit 230, and reading and writing are executed at high speed according to the load and store operations by the second load / store unit 230.
- the third cache memory 340 is connected to the third load / store unit 330, and reading and writing are executed at high speed according to the load and store operations by the third load / store unit 330.
- the second and third cache memories 240 and 340 need an extremely small capacity so that their entire contents can be propagated to the next and subsequent stages. They are therefore configured as small-capacity cache memories separate from the large-capacity cache memory used during non-array operation.
- the second and third cache memories 240 and 340 do not have an interface for directly transferring data to the external memory 150. Therefore, data is indirectly supplied from the first cache memory 140 via the previous cache memory.
- FIG. 6 is a diagram showing the configuration of the data processing apparatus according to Embodiment 7 of the present invention.
- parts similar to those in the sixth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- control information is set in advance in each arithmetic unit and the contents of the registers flow in from the previous stage, so that the same result as the operation originally intended by a machine language instruction in the VLIW format can be obtained continuously by the array operation.
- the load instruction in each stage must be configured to autonomously increment or decrement the address information without waiting for the result of the subsequent instruction, and to reference the cache memory continuously.
- each stage of the arithmetic device for calculating the load address calculates the next address using the previous arithmetic result.
- the load address is obtained by adding an offset to the base register, and for this addition, it must pass through one stage of the arithmetic unit.
- the next address to be used is not a value obtained by adding an offset, but a value obtained by adding only the data width such as 4 bytes.
- an ordinary program executes an instruction for increasing the value of the base register by 4 after execution of the load instruction.
- the previous stage autonomously updates the base address.
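The address sequence described above can be sketched in Python as follows: the first address is the base register plus an offset, and each later address advances by the data width only. The function name and parameters are hypothetical, chosen for illustration.

```python
def load_addresses(base, offset, data_width, count):
    """Yield the addresses one stage would reference for `count` consecutive loads."""
    address = base + offset      # first address: one pass through the address adder
    for _ in range(count):
        yield address
        address += data_width    # later addresses: advance by the data width only


# Example: base 0x2000, offset -1284, 4-byte data
first_three = list(load_addresses(0x2000, -1284, 4, 3))
```

Because the update depends only on the stage's own previous address, the stage does not have to wait for a separate instruction that increments the base register.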
- FIG. 7 is a diagram showing the configuration of the data processing apparatus according to the eighth embodiment of the present invention.
- parts similar to those of the seventh embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- the first cache memory 140 is connected to the external memory 150.
- the operation of the data processing apparatus 107 in the present embodiment will be described by taking an outline extraction process which is one of image processes as an example.
- the contour extraction process, which is one of the image processes, obtains, for example, the differences between diagonal pixels in a 3 × 3 pixel region and generates a contour at the center pixel position when their sum exceeds a threshold value.
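A minimal Python sketch of this per-pixel decision is shown below. It assumes an image stored as a list of rows of integer pixel values and uses only the two diagonal pixel pairs of the 3 × 3 region; the function name and the choice of pairs are assumptions for illustration, not the exact operation of the apparatus.

```python
def contour_at(image, x, y, threshold):
    """Return 1 if a contour is generated at (x, y), else 0.

    image: list of rows of integer pixel values (assumed layout).
    (x, y) must not lie on the image border.
    """
    # Diagonal pairs around the centre pixel, as (dx, dy) offsets.
    pairs = [((-1, -1), (1, 1)), ((1, -1), (-1, 1))]
    total = 0
    for (dx1, dy1), (dx2, dy2) in pairs:
        total += abs(image[y + dy1][x + dx1] - image[y + dy2][x + dx2])
    return 1 if total > threshold else 0
```

For a flat region every difference is zero, so no contour is produced; a horizontal edge between the top and bottom rows makes both diagonal differences large.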
- image data is transferred from an external I / O device to an external memory so as to have continuous addresses in the horizontal direction.
- continuous pixels in the vertical direction are discrete as memory addresses.
- the conventional cache memory is a technology that saves a continuous address area of about 16 words in a memory that is faster than an external memory, and is expected to be reused many times. Therefore, the above effect cannot be expected when the vertical discrete addresses are frequently referred to.
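To make the addressing problem concrete, the following Python sketch computes the byte offsets of the eight neighbours of a pixel in a row-major image. The function name is hypothetical, and the row size of 1280 bytes with 4-byte data is an assumption chosen because it reproduces the constants (±1284, ±1280, ±1276, ±4) that appear in the load instructions of this document.

```python
def neighbour_offsets(row_bytes, pixel_bytes):
    """Byte offsets of the 8 neighbours of a pixel, relative to the centre pixel."""
    offsets = []
    for dy in (-1, 0, 1):          # row above, same row, row below
        for dx in (-1, 0, 1):      # left, centre, right
            if dx == 0 and dy == 0:
                continue           # skip the centre pixel itself
            offsets.append(dy * row_bytes + dx * pixel_bytes)
    return offsets
```

Horizontally adjacent pixels are only a few bytes apart, while vertically adjacent pixels are a full row apart, which is why a cache line of about 16 consecutive words does not help with vertical references.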
- in the data processing device 107, when pixel data is stored in the external memory from the external I / O device, burst transfer is used to obtain high throughput: the data is transferred so that the addresses are continuous, based on the transfer information F consisting of the base address of the external memory that is the write destination and the number of transfer words (1024 when the image width is 1024 words).
- in order to supply the pixel data adjacent in the vertical direction to the arithmetic device every cycle, a plurality of short data belonging to different banks of the external memory (for example, 3 words every cycle) can be transferred to a plurality of lines of the cache memory every cycle, based on the transfer information G consisting of the base address of the first cache memory that is the write destination and the transfer data length (when the image width is 1024 words, the upper, middle, and lower 3 words are transferred 1024 times).
- Such transfer information is associated with the array operation start command, and is read when the array operation start command is detected.
- FIG. 8 is a diagram showing the configuration of the data processing apparatus according to the ninth embodiment of the present invention.
- the same parts as those in the eighth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof is omitted.
- a mechanism for synchronizing the first cache memory and the first arithmetic unit is necessary.
- the purpose is to operate the arithmetic units all at once when all the necessary data is present in the first cache memory, that is, when it is confirmed that all the data required by the load / store units in the second and subsequent stages exists in the first cache memory, and to stop all the arithmetic units at once while the necessary data does not exist in the first cache memory.
- the number of loads (SKIP information) to be operated in advance before the start of the operation of the arithmetic device is added to the data transfer information described above.
- the operation of the subsequent arithmetic unit is started.
- the subsequent arithmetic unit is operated.
- the subsequent stage is stopped.
- the array operation is temporarily stopped to wait for data.
- the first stage load operation stops and only the subsequent stage operation continues.
- the array operation can be accurately controlled by stopping the array operation when the operation number counter in the final stage arithmetic unit reaches a specified value.
- the final result of the array operation is stored in an external memory or an external I / O device by a store instruction.
- a plurality of types of image processing can be continuously performed by storing the data in another external memory and using the input of another N-stage configuration connected to the external memory.
- an external memory is used as an interface to connect another array structure as a subordinate connection or directly connect to a first cache memory of another array structure. Accordingly, the number of stages of the arithmetic device can be expanded to cope with it.
- the instruction decode function interprets the machine language instruction as usual and stops the conventional operation for controlling the first arithmetic unit.
- the network between the arithmetic units necessary for array operation is set based on the machine language instructions in the loop structure.
- in FIG. 10, the machine language instruction in the first line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the first row is indicated by “S1”.
- the instruction code on the first line describes two load instructions (ld) and one subtraction instruction (subicc).
- the first load instruction (ld) adds the contents of the register (gr4) of the first register file unit 110 and a constant (-1284) (const.), refers to the first cache memory 140 using the result as the main memory address, and stores the read value in the register (fr1) of the fourth register file unit 410.
- the register (gr4) is read from the first register file unit 110 and set to be input to the arithmetic unit (first EAG) belonging to the first arithmetic unit 120.
- This setting is the same as the selection signal setting for a general selection circuit.
- the addition result of the first EAG is stored in the holder of the first EAG, and then transferred to the first load / store unit 130 in the next cycle.
- the input selection procedure in the first load / store section 130 is not necessary.
- the network is set so that the result of the first load / store unit 130 is written to the register (fr1) of the fourth register file unit 410.
- the second load instruction (ld) adds the contents of the register (gr4) of the first register file unit 110 and a constant (1284) (const.), refers to the first cache memory 140 using the result as the main memory address, and stores the read value in the register (fr2) of the fourth register file unit 410.
- the register (gr4) is read from the first register file unit 110 and set to be input to the arithmetic unit (second EAG) belonging to the first arithmetic unit 120.
- the addition result of the second EAG is stored in the second EAG holder, and then transferred to the first load / store unit 130 in the next cycle.
- the input selection procedure in the first load / store unit 130 is not necessary.
- the network is set so that the result of the first load / store unit 130 is written to the register (fr2) of the fourth register file unit 410.
- the third subtraction instruction (subicc) is an instruction to subtract 1 from the register (gr7) of the first register file unit 110 and store the result in the same register (gr7).
- a network up to the first computing device 120 is set so as to subtract 1 from the contents of the register (gr7) of the first register file unit 110.
- the network is set so that, from the next cycle onward, the output of the arithmetic unit (subicc) belonging to the first arithmetic unit 120 is used as the input instead of the register (gr7) of the first register file unit 110.
- condition code described later can be used as a termination condition for the array operation.
- the network is set so that the condition code accompanying the subtraction result is transferred to the register (icc0) of the third register file unit 310.
- in FIG. 11, the machine language instruction in the second line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the second row is indicated by “S2”.
- the instruction code on the second line describes two load instructions (ld) and one conditional branch instruction (beq).
- the first load instruction (ld) adds the contents of the register (gr4) of the second register file unit 210 and the constant (-1280) (const.), refers to the second cache memory 240 using the result as the main memory address, and stores the read value in the register (fr3) of the fifth register file unit 510.
- a setting for transferring the register (gr4) of the first register file unit 110 to the register (gr4) of the second register file unit 210 is performed.
- the register (gr4) is read from the second register file unit 210 and set to be input to the arithmetic unit (first EAG) belonging to the second arithmetic unit 220.
- the addition result of the first EAG is stored in the holder of the first EAG, and then transferred to the second load / store unit 230 in the next cycle.
- the network is set so that the result of the second load / store unit 230 is written to the register (fr3) of the fifth register file unit 510.
- the second load instruction (ld) adds the contents of the register (gr4) of the second register file unit 210 and a constant (1280) (const.), refers to the second cache memory 240 using the result as the main memory address, and stores the read value in the register (fr4) of the fifth register file unit 510.
- the register (gr4) is read from the second register file unit 210 and set to be input to the computing unit (second EAG) belonging to the second computing unit 220.
- the addition result of the second EAG is stored in the second EAG holder, and then transferred to the second load / store unit 230 in the next cycle.
- the network is set so that the result of the second load / store unit 230 is written to the register (fr4) of the fifth register file unit 510.
- the conditional branch instruction (beq) is a machine language instruction that branches to edge ⁇ exit when the condition code (icc0) accompanying the result of the subtraction instruction (subicc) described in the first line of the instruction code shown in FIG. 9 is 0.
- during normal operation (non-array operation), it is executed as a conventional conditional branch instruction.
- during array operation, this condition code (icc0) is used as the above-mentioned “array operation termination condition”: when the subtraction result is 0, it becomes the trigger (ARRAY-ABORT signal) for terminating the array operation and returning to normal operation (non-array operation).
- the condition code (icc0) accompanying the result of the subtraction instruction by the first arithmetic unit 120 can be bypassed from the first arithmetic unit 120 to the second arithmetic unit 220, so that, ultimately, the register (icc0) need not be written.
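The interplay of the subtraction instruction (subicc) and the condition code (icc0) can be pictured with the following Python sketch, in which a counter is decremented once per iteration and a zero result ends the loop. The function name and the `body` callback are hypothetical; the sketch models only the counting behaviour, not the hardware signalling.

```python
def run_until_abort(gr7, body):
    """Run body once per iteration until the decremented counter reaches 0.

    gr7: initial loop counter (must be at least 1).
    Returns the number of iterations executed.
    """
    if gr7 < 1:
        raise ValueError("gr7 must be at least 1")
    iterations = 0
    while True:
        body(iterations)
        iterations += 1
        gr7 -= 1                  # subicc: subtract 1 from gr7
        icc0_is_zero = gr7 == 0   # condition code accompanying the result
        if icc0_is_zero:          # would raise the ARRAY-ABORT trigger
            return iterations
```

With an initial counter of 3 the body runs three times, which matches how a conventional countdown loop with a conditional branch would behave.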
- in FIG. 12, the machine language instruction in the third line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the third row is indicated by “S3”.
- the instruction code on the third line describes two load instructions (ld) and one SAD instruction (sad).
- the first load instruction (ld) adds the contents of the register (gr4) and a constant (-1276), refers to the third cache memory 340 using the result as the main memory address, and stores the read value in the register (fr5).
- the register (gr4) of the third register file unit 310 is read and set to be input to the arithmetic unit (first EAG) belonging to the third arithmetic unit 320.
- the addition result of the first EAG is stored in the holder of the first EAG, and then transferred to the third load / store unit 330 in the next cycle.
- the network is set so that the result of the third load / store unit 330 is written to the register (fr5) of the sixth register file unit 610.
- the second load instruction (ld) adds the contents of the register (gr4) of the third register file unit 310 and a constant (1276) (const.), refers to the third cache memory 340 using the result as the main memory address, and stores the read value in the register (fr6) of the sixth register file unit 610.
- the register (gr4) is read from the third register file unit 310 and set to be input to the computing unit (second EAG) belonging to the third computing unit 320.
- the addition result of the second EAG is stored in the second EAG holder, and then transferred to the third load / store unit 330 in the next cycle.
- the network is set so that the result of the third load / store unit 330 is written to the register (fr6) of the sixth register file unit 610.
- the SAD instruction (sad) is a machine language instruction that obtains the sum of absolute differences for each byte of the register (fr1) and the register (fr2) of the fourth register file unit 410, previously loaded by the first load / store unit 130, and writes the result to the register (fr1) of the fifth register file unit 510.
- the first load / store unit 130 writes the register (fr1) and the register (fr2) of the fourth register file unit 410.
- an input (ld-bypass) can be made to the third arithmetic unit 320. Therefore, finally, reading from the register (fr1) and the register (fr2) of the fourth register file unit 410 becomes unnecessary.
- the computing unit network is set so that the result of the SAD instruction (sad) is written to the register (fr1) of the fifth register file unit 510.
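As an illustration of a byte-wise sum of absolute differences, the Python sketch below treats each register as a 32-bit value holding four 8-bit fields and returns a single total. The register width, the field layout, and the function name are assumptions; the source suggests that the hardware keeps partial sums that a later instruction merges, which this sketch does not model.

```python
def sad_bytes(a, b):
    """Sum of absolute differences over the four bytes of two 32-bit values."""
    total = 0
    for shift in (0, 8, 16, 24):
        byte_a = (a >> shift) & 0xFF
        byte_b = (b >> shift) & 0xFF
        total += abs(byte_a - byte_b)
    return total
```

For example, comparing 0x01020304 with 0x04030201 gives byte differences of 3, 1, 1, and 3.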
- in FIG. 13, the machine language instruction in the fourth line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the fourth row is indicated by “S4”.
- the instruction code on the fourth line describes two load instructions (ld), one addi instruction, and one SAD instruction (sad).
- the first load instruction (ld) adds the contents of the register (gr4) and the constant (-4) (const.), refers to the fourth cache memory 440 using the result as the main memory address, and stores the read value in the register (fr7) of the seventh register file unit 710.
- the register (gr4) of the fourth register file unit 410 is read and set to be input to the arithmetic unit (first EAG) belonging to the fourth arithmetic unit 420.
- the addition result of the first EAG is stored in the holder of the first EAG, and then transferred to the fourth load / store unit 430 in the next cycle.
- the network is set so that the result of the fourth load / store unit 430 is written to the register (fr7) of the seventh register file unit 710.
- the second load instruction (ld) adds the contents of the register (gr4) of the fourth register file unit 410 and the constant (4) (const.), refers to the fourth cache memory 440 using the result as the main memory address, and stores the read value in the register (fr8) of the seventh register file unit 710.
- the register (gr4) of the fourth register file unit 410 is read and set to be input to the computing unit (second EAG) belonging to the fourth computing unit 420.
- the addition result of the second EAG is stored in the second EAG holder, and then transferred to the fourth load / store unit 430 in the next cycle.
- the network is set so that the result of the fourth load / store unit 430 is written to the register (fr8) of the seventh register file unit 710.
- the SAD instruction (sad) is a machine language instruction that obtains the sum of absolute differences for each byte of the register (fr3) and the register (fr4) of the fifth register file unit 510, previously loaded by the second load / store unit 230, and writes the result to the register (fr3) of the sixth register file unit 610.
- the second load / store unit 230 writes the register (fr3) and the register (fr4) of the fifth register file unit 510.
- an input (ld-bypass) can be made to the fourth arithmetic unit 420. Therefore, finally, reading from the register (fr3) and the register (fr4) of the fifth register file unit 510 becomes unnecessary.
- the computing unit network is set so that the result of the SAD instruction (sad) is written to the register (fr3) of the sixth register file unit 610.
- the addi instruction is a machine language instruction for updating the address of the register (gr4).
- a feedback loop is generated for the first to fourth arithmetic devices 120 to 420 which are arithmetic devices using the register (gr4). By generating these feedback loops, the load addresses of the first to fourth arithmetic devices 120 to 420 are automatically updated.
- in FIG. 14, the machine language instruction in the fifth line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the fifth line is indicated by “S5”.
- the instruction code on the fifth line describes one SAD instruction (sad) and one addition instruction (madd).
- the SAD instruction (sad) is a machine language instruction that obtains the sum of absolute differences for each byte of the register (fr5) and the register (fr6) of the sixth register file unit 610, previously loaded by the third load / store unit 330, and writes the result to the register (fr5) of the seventh register file unit 710.
- the third load / store unit 330 writes the register (fr5) and the register (fr6) of the sixth register file unit 610.
- an input (ld-bypass) can be made to the fifth arithmetic unit 520. Therefore, finally, reading from the register (fr5) and the register (fr6) of the sixth register file unit 610 becomes unnecessary.
- the computing unit network is set so that the result of the SAD instruction (sad) is written to the register (fr5) of the seventh register file unit 710.
- the addition instruction (madd) is a machine language instruction that accumulates the result of the previous SAD instruction (sad) in the register (fr1) of the seventh register file unit 710.
- the register (fr1) is read from the fifth register file unit 510 and read out from the sixth register file unit 610.
- the calculation result immediately before by the fourth arithmetic unit 420 can be bypassed (fr3-bypass) and input to the fifth arithmetic unit 520.
- the calculator network is set so that the calculation result of the fifth calculation device 520 is stored in the register (fr1) of the seventh register file unit 710.
- in FIG. 15, the machine language instruction in the sixth line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the sixth line is indicated by “S6”.
- the instruction code on the sixth line describes one SAD instruction (sad) and one addition instruction (madd).
- the SAD instruction (sad) is a machine language instruction that obtains the sum of absolute differences for each byte of the register (fr7) and the register (fr8) of the seventh register file unit 710, previously loaded by the fourth load / store unit 430, and writes the result to the register (fr7) of the eighth register file unit 810.
- the fourth load / store unit 430 writes the register (fr7) and the register (fr8) of the seventh register file unit 710.
- an input (ld-bypass) can be made to the sixth arithmetic unit 620. Therefore, finally, reading from the register (fr7) and the register (fr8) of the seventh register file unit 710 becomes unnecessary.
- the computing unit network is set so that the result of the SAD instruction (sad) is written to the register (fr7) of the eighth register file unit 810.
- the addition instruction (madd) is a machine language instruction that accumulates the result of the previous SAD instruction (sad) in the register (fr1) of the eighth register file unit 810.
- the calculation result immediately before by the fifth arithmetic unit 520 can be bypassed (fr5, 1-bypass) and input to the sixth arithmetic unit 620.
- the calculator network is set so that the calculation result of the sixth calculation device 620 is stored in the register (fr1) of the eighth register file unit 810.
- in FIG. 16, the machine language instruction in the seventh line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the seventh row is indicated by “S7”.
- the addition instruction (madd) is a machine language instruction that accumulates the result of the previous SAD instruction (sad) in the register (fr1) of the ninth register file unit 910.
- the calculation result immediately before by the sixth arithmetic unit 620 can be bypassed (fr7, 1-bypass) and input to the seventh arithmetic unit 720.
- the calculator network is set so that the calculation result of the seventh calculation device 720 is stored in the register (fr1) of the ninth register file unit 910.
- in FIG. 17, the machine language instruction in the eighth line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the eighth row is indicated by “S8”.
- the correction instruction (msum) is an instruction for merging the results that are divided into a plurality of partial sums, such as upper and lower, in the register (fr1) into one (adding up the partial sums to obtain the total).
- the final SAD total is obtained by this instruction.
- the register (fr1) necessary for the calculation is input from the seventh calculation device 720 in the previous stage to the eighth calculation device 820 by bypass (fr1-bypass), and the computing unit network is set so that the calculation result of the eighth calculation device 820 is stored in the register (fr1) of the tenth register file unit 1010.
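The merging of partial sums performed by the correction instruction (msum) can be sketched in Python as below. The layout, a 32-bit register holding one 16-bit partial sum in its upper half and one in its lower half, is an assumption made for illustration; the source says only that the partial sums are divided into parts such as upper and lower.

```python
def msum(reg):
    """Add the upper and lower 16-bit partial sums of a 32-bit register value."""
    upper = (reg >> 16) & 0xFFFF
    lower = reg & 0xFFFF
    return upper + lower
```

With an upper partial sum of 3 and a lower partial sum of 5, the merged total is 8.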
- in FIG. 18, the machine language instruction in the ninth line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the ninth line is indicated by “S9”.
- the conditional set instruction (cset) stores “11” in the register (fr1) of the eleventh register file unit 1110 when the sum obtained by the correction instruction (msum) is less than the threshold given by the register (fr9), and “1” otherwise.
- the register (fr1) necessary for the calculation is input from the eighth arithmetic unit 820 in the previous stage to the ninth arithmetic unit 920 by bypass (fr1-bypass), the threshold value is sequentially transferred from the first register file unit 110 to the ninth register file unit 910 and read out from the register (fr9) of the ninth register file unit 910, and the arithmetic unit network is set so that the calculation result of the ninth arithmetic unit 920 is stored in the register (fr1) of the eleventh register file unit 1110.
- in FIG. 19, the machine language instruction on the 10th line of the instruction code shown in FIG. 9 is interpreted.
- the setting location based on the instruction code in the 10th row is indicated by “S10”.
- the store instruction (stb) adds the contents of the register (gr5) and the constant (0) (const.), And stores the data in the store buffer (STBF) using the result as the main memory address.
- the register (gr5) of the tenth register file unit 1010 is read and set to be input to the arithmetic unit (EAG) belonging to the tenth arithmetic unit 1020.
- after the EAG addition result is stored in the EAG holder, it is transferred to the tenth load / store unit 1030 in the next cycle.
- the contents of the store buffer (STBF) are sequentially output to the external memory.
- the addi instruction is a machine language instruction for updating the address of the register (gr5).
- a feedback loop is generated for the tenth arithmetic unit 1020, which is an arithmetic unit that uses the register (gr5).
- the store address of the tenth arithmetic unit 1020 is automatically updated.
- even when the execution result of the preceding arithmetic unit is bypassed to the next arithmetic unit, it may also need to be written to the next register file unit. This is because the execution result of an arithmetic device is not limited to use by the arithmetic device in the next stage and may be used by a later-stage arithmetic device.
- the x portion in FIG. 20 is a write path to a register that is finally determined to be unnecessary.
- the register (gr4), the register (gr5), and the register (fr9) remain to be propagated between the register file portions.
- propagation of the contents of the cache memory can be interrupted in the middle.
- a plurality of sets are arranged in cascade while maintaining one set of basic configuration including a register file unit, an arithmetic unit, and a load / store unit.
- a necessary register value is propagated between adjacent register file portions.
- the load / store unit has a configuration in which a plurality of sets are arranged in tandem and necessary data is propagated between adjacent small cache memories.
- as the register value propagation function between adjacent register files, for example, a configuration in which each stage has the same number of physical registers can be used. Alternatively, a configuration with a smaller number of physical registers and a table that holds the correspondence with the register numbers can be used.
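The second alternative, fewer physical registers plus a correspondence table, might look like the following Python sketch. The class name, its methods, and the first-come allocation policy are hypothetical and serve only to illustrate the idea of mapping register numbers onto a small physical store.

```python
class MappedRegisterFile:
    """A small physical register store with a register-number lookup table."""

    def __init__(self, physical_count):
        self.values = [0] * physical_count
        self.table = {}  # register number (e.g. "gr4") -> physical index

    def write(self, name, value):
        if name not in self.table:
            if len(self.table) >= len(self.values):
                raise RuntimeError("no free physical register")
            self.table[name] = len(self.table)  # allocate the next free slot
        self.values[self.table[name]] = value

    def read(self, name):
        return self.values[self.table[name]]
```

A stage that only needs to pass on a few registers, such as gr4, gr5, and fr9, could then be built with three physical registers instead of a full-size register file.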
- a data processing apparatus according to the present invention interprets and executes machine language instructions, and includes: a first-stage register file device having a plurality of registers that temporarily hold data corresponding to a plurality of register numbers described in the machine language instruction; a first-stage arithmetic device that performs an operation using one or more data read from the first-stage register file device as inputs; first-stage operation result holding means for temporarily holding the operation result of the first-stage arithmetic device; a second-stage register file device that holds the same amount of data as the first-stage register file device or more; a second-stage arithmetic device that performs an operation using one or more data read from the second-stage register file device as inputs; and second-stage operation result holding means for temporarily holding the operation result of the second-stage arithmetic device.
- the second-stage register file device receives the contents of the first-stage register file device as input, and the second-stage arithmetic device receives the contents of the first-stage operation result holding means or the contents of the second-stage register file device as input.
- it is preferable to further include a third-stage register file device that holds the same amount of data as the first-stage register file device or more, a third-stage arithmetic device that performs an operation using one or more data read from the third-stage register file device as inputs, and third-stage operation result holding means for temporarily holding the operation result of the third-stage arithmetic device.
- in this case, the third-stage register file device receives the contents of the first-stage operation result holding means or the contents of the second-stage register file device as input, the third-stage arithmetic device receives the contents of the second-stage operation result holding means or the contents of the third-stage register file device as input, and the second-stage arithmetic device and the third-stage arithmetic device operate simultaneously.
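The cascaded arrangement can be pictured with the following Python sketch, in which every stage computes from its current input during the same clock and each result becomes the next stage's input on the following clock. The function name and the single value per stage are simplifying assumptions; real stages hold whole register files and result holders.

```python
def step_pipeline(stage_inputs, new_input, ops):
    """Advance a cascade of stages by one clock.

    stage_inputs: current input value of each stage (stage 0 first).
    new_input: value entering stage 0 on the next clock.
    ops: one function per stage, applied to that stage's current input.
    Returns (next_stage_inputs, output_of_last_stage).
    """
    # All stages compute simultaneously from the state at the start of the clock.
    outputs = [op(value) for op, value in zip(ops, stage_inputs)]
    # Each result is handed to the following stage; stage 0 takes the new input.
    next_inputs = [new_input] + outputs[:-1]
    return next_inputs, outputs[-1]


# Example: stage 0 adds 1, stage 1 doubles.
state, out = step_pipeline([1, 10], 5, [lambda v: v + 1, lambda v: v * 2])
```

Because each stage reads only the state captured at the start of the clock, adding more stages lengthens the pipeline without changing how any individual stage operates.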
- N-th stage register file device in which N is an integer equal to or greater than 1, an N-th stage arithmetic device, and an N-th stage arithmetic result holding means are provided.
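- it is preferable that an operation result used by the (N + 2)-th stage arithmetic device is written to the (N + 2)-th stage register file device, whereas an operation result used only by the (N + 1)-th stage arithmetic device is input to the (N + 1)-th stage arithmetic device without being written to a register file device.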
- it is preferable that, during execution of machine language instructions, only the first-stage register file device, the first-stage arithmetic device, and the first-stage operation result holding means are operated until an array operation start instruction is detected, the array operation start instruction being an instruction for starting the operations of the N-th stage register file device, the N-th stage arithmetic device, and the N-th stage operation result holding means, where N is an integer of 2 or more. When the array operation start instruction is detected, the arithmetic unit control information associated with the instruction is set in the N-th stage arithmetic device, and the operations of the N-th stage register file device, the N-th stage arithmetic device, and the N-th stage operation result holding means are started. These operations are then stopped in accordance with the array operation termination condition indicated by the array operation start instruction.
- the first-stage arithmetic unit is preferably provided with a cache memory that temporarily holds the contents of the external memory, means for reading the cache memory using address information associated with a load instruction, and a first-stage load result holding unit that temporarily stores the read data; the data read from the load result holding unit is used as an input to a subsequent-stage arithmetic unit or register file unit.
- the means for reading the cache memory provided in the arithmetic unit retains the address information associated with the load instruction and, each time a load operation is completed, increases or decreases the retained address information by the load data width, so that data is loaded autonomously from consecutive addresses.
- the first-stage cache memory is provided with connection means to an external memory composed of a plurality of banks, and with transfer means for transferring data based on transfer information consisting of the base address of the write-destination cache memory and the transfer data length associated with the array operation start instruction; a plurality of data items are transferred continuously to the first-stage cache memory, simultaneously from a plurality of different addresses on the external memory.
- the external memory composed of a plurality of banks is preferably provided with connection means to an external I/O device, and with data transfer means for transferring data based on transfer information consisting of the base address of the write-destination external memory and the number of transfer words associated with the array operation start instruction; a plurality of data items are transferred continuously from the external I/O device to the oldest bank of the external memory.
- the array operation termination condition is that the subsequent-stage arithmetic units have operated a number of times corresponding to the number of transfer words.
- it is preferable that the operation result be stored by a store instruction in the external memory or the external I/O device, stored in another external memory, or input to the first cache memory of another N-stage array configuration.
- the data processing apparatus is a data processing apparatus for executing an instruction code including a plurality of machine language instructions, and includes: an instruction memory unit that holds the instruction code; an instruction fetch/decode unit that extracts the instruction code from the instruction memory unit and decodes it; n register file units (n is an integer of 1 or more) including a first register file unit, which includes a plurality of first registers that correspond one-to-one to the register numbers described in the decoded instruction code and temporarily hold the data corresponding to each register number, and a second register file unit, which includes a plurality of second registers corresponding one-to-one to the first registers of the first register file unit; n arithmetic units including a first arithmetic unit, which executes a calculation using the read data of the first registers, and a second arithmetic unit; and n holding units including a first holding unit that temporarily holds the calculation result of the first arithmetic unit.
- when any of its own first registers holds data, the first register file unit transfers that data to the second register of the second register file unit corresponding to the first register holding the data; the first holding unit can transfer the calculation result held by itself to the second arithmetic unit; and the second arithmetic unit performs a calculation using at least one of the read data of the second registers of the second register file unit and the calculation result transferred by the first holding unit.
- the data of each first register of the first register file unit is transferred to each second register of the second register file unit corresponding to each first register of the first register file unit.
- the second arithmetic unit can read the data from the second register of the second register file unit and use it to execute calculations.
- the calculation result of the first arithmetic unit is transferred to the second arithmetic unit.
- the second arithmetic unit can use the calculation result of the first arithmetic unit for the execution of a calculation immediately after the calculation by the first arithmetic unit.
- the n register file units further include a third register file unit including a plurality of third registers corresponding one-to-one to the second registers of the second register file unit, the n arithmetic units further include a third arithmetic unit, and the n holding units further include a second holding unit that temporarily holds the calculation result of the second arithmetic unit. When any of its own second registers holds data, the second register file unit transfers that data to the third register of the third register file unit corresponding to the second register holding the data, and the second holding unit can transfer the calculation result held by itself to the third arithmetic unit. It is preferable that the third arithmetic unit perform a calculation using at least one of the read data of the third registers of the third register file unit and the calculation result transferred by the second holding unit.
- the data of each second register in the second register file unit is transferred to the third register in the third register file unit corresponding to that second register.
- the third arithmetic unit can read the data from the third register of the third register file unit and use it to execute calculations even when the data of the second register of the second register file unit is being used for the execution of a calculation by the second arithmetic unit.
- the calculation result of the second arithmetic unit is transferred to the third arithmetic unit.
- the third arithmetic unit can use the calculation result of the second arithmetic unit for the execution of a calculation immediately after the calculation by the second arithmetic unit.
- it is preferable that the N-th holding unit included in the n holding units (N is an integer of 1 or more and not more than n) transfer the calculation result held by itself to the (N+2)th register file unit included in the n register file units when that result is used for execution by the (N+2)th or subsequent arithmetic units, and transfer the calculation result to the (N+1)th arithmetic unit included in the n arithmetic units when it is not used for execution by the (N+2)th or subsequent arithmetic units.
- it is preferable that, when the instruction fetch/decode unit decodes an operation instruction that is included in the instruction code and is written so as to simultaneously operate a plurality of register file units included in the n register file units, a plurality of arithmetic units included in the n arithmetic units, and a plurality of holding units included in the n holding units, the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units be operated simultaneously based on the decoding result of the operation instruction, and the operation of the instruction fetch/decode unit be stopped.
- an "array operation", which simultaneously operates a plurality of register file units, a plurality of arithmetic units, and a plurality of holding units, can be started based on the decoding result of the operation instruction.
- the operation instruction includes setting information to be set in each of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units to be operated simultaneously, and an operation termination condition for stopping the simultaneous operation of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units. Until the instruction fetch/decode unit decodes the operation instruction, the instruction fetch/decode unit, the first register file unit, the first arithmetic unit, and the first holding unit are operated simultaneously. When the operation instruction is decoded, the operation of the instruction fetch/decode unit is stopped, and the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units are operated simultaneously. When the operation termination condition included in the operation instruction is satisfied, it is preferable that the operations of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units, excluding the first register file unit, the first arithmetic unit, and the first holding unit, be stopped, and that the instruction fetch/decode unit, the first register file unit, the first arithmetic unit, and the first holding unit be operated simultaneously.
- the simultaneous operation of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units started based on the decoding result of the operation instruction can be stopped depending on whether the operation termination condition is satisfied. For this reason, even after the operation of the instruction fetch/decode unit has been stopped, it is possible to return to the "non-array operation", which operates the instruction fetch/decode unit, the first register file unit, the first arithmetic unit, and the first holding unit simultaneously.
- each of the n arithmetic units has a cache memory that temporarily holds the contents of an external memory arranged outside the data processing device, a load unit that reads the cache memory using address information associated with the load instruction included in the instruction code, and a store unit that temporarily holds the data read by the load unit. It is preferable that the Nth arithmetic unit included in the n arithmetic units be able to transfer the data held in its own store unit to the (N+1)th and subsequent arithmetic units included in the n arithmetic units and to the (N+1)th and subsequent register file units included in the n register file units.
- the (N+1)th and subsequent arithmetic units can start calculations using the data read by the Nth arithmetic unit early, and as a result, the calculation by each arithmetic unit can be further sped up.
- it is preferable that the Nth arithmetic unit included in the n arithmetic units, when its own cache memory holds data, be able to transfer that data to the cache memory of the (N+1)th arithmetic unit included in the n arithmetic units.
- since the data in the cache memory of the Nth arithmetic unit can be transferred to the cache memory of the (N+1)th arithmetic unit, the (N+1)th arithmetic unit can start calculations using the data held in the cache memory of the Nth arithmetic unit at an early stage, and as a result, the calculation by each arithmetic unit can be further sped up.
- it is preferable that each of the n arithmetic units hold the address information associated with the load instruction when the cache memory is read by its own load unit and, each time a read by the load unit is completed, increase or decrease the held address information by the read data width to generate the address information for the next read by the load unit.
- since each arithmetic unit can generate the address information for the next read by itself, each arithmetic unit can execute the next calculation without acquiring new address information, which makes it possible to speed up the calculation by each arithmetic unit.
- the first arithmetic unit includes a cache memory directly connected to an external memory disposed outside the data processing device, and the cache memory has data transfer means for transferring data based on transfer information consisting of a write destination address and a transfer data length associated with the operation instruction. It is preferable that the data transfer means continuously transfer a plurality of data items simultaneously from a plurality of different addresses on the external memory.
- the external memory has data transfer means for transferring data based on transfer information consisting of a write destination address and a number of transfer words associated with the operation instruction. It is preferable that the data transfer means continuously transfer a plurality of data items from an external I/O device to the oldest bank of the external memory.
- the first arithmetic unit waits for data transfer from the external memory when no area corresponding to the address information associated with the load instruction exists in its own cache memory. It is preferable that the operation termination condition be that the second and subsequent arithmetic units have operated a number of times corresponding to the number of transfer words associated with the operation instruction.
- the present invention can be suitably used for a data processing apparatus that simultaneously executes a plurality of machine language instructions at a high speed.
- 10 Instruction memory unit
- 20 Instruction fetch unit (instruction fetch/decode unit)
- 30 Instruction decode unit (instruction fetch/decode unit)
- 101, 102, 103, 104, 105, 106, 107, 108 Data processing device
- 110, 210, 310, 410, 510, 610, 710, 810, 910, 1010, 1110 Register file unit
- 120, 220, 320, 420, 520, 620, 720, 820, 920, 1020 Arithmetic device (arithmetic unit, holding unit)
- 130, 230, 330, 430, 1030 Load/store unit (load unit, store unit)
- 130, 230, 330, 430 Cache memory
- 150 External memory
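The list above describes a load unit that keeps the address information from the load instruction and, after every completed read, advances it by the read data width so that the next read needs no new address. A minimal Python sketch of that behaviour follows. The class name `LoadUnit`, the `descending` flag, the byte-addressed `bytes` cache, and the little-endian word layout are all assumptions made for illustration; the patent text does not specify them.

```python
class LoadUnit:
    """Hypothetical model: reads fixed-width words and advances its own address."""

    def __init__(self, cache: bytes, base_addr: int, width: int,
                 descending: bool = False):
        self.cache = cache
        self.addr = base_addr      # address information kept from the load instruction
        self.width = width         # read data width in bytes
        self.step = -width if descending else width

    def load(self) -> int:
        word = self.cache[self.addr:self.addr + self.width]
        if len(word) != self.width:
            raise IndexError("address outside the cached region")
        # After the read completes, generate the address for the next read.
        self.addr += self.step
        return int.from_bytes(word, "little")  # byte order is an assumption


cache = bytes(range(16))
unit = LoadUnit(cache, base_addr=4, width=4)
first = unit.load()    # reads bytes 4..7
second = unit.load()   # reads bytes 8..11 without a new address being supplied
```

Only the first address comes from outside; every later address is derived locally, which is the property the text credits with speeding up each arithmetic unit.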
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
- Executing Machine-Instructions (AREA)
Abstract
Description
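FIG. 1 is a diagram showing the configuration of a data processing device according to Embodiment 1 of the present invention. As shown in FIG. 1, the data processing device 101 of this embodiment includes an instruction memory unit 10, an instruction fetch unit (instruction fetch/decode unit) 20, an instruction decode unit (instruction fetch/decode unit) 30, a first register file unit 110, a second register file unit 210, a first arithmetic device (first arithmetic unit, first holding unit) 120, and a second arithmetic device (second arithmetic unit, second holding unit) 220.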
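To make the data flow between the two stages of Embodiment 1 concrete, the following Python sketch models each stage as a register file, an arithmetic operation, and a holding latch. It is an illustrative model only: the `Stage` class, the `propagate` helper, and the chosen operations are hypothetical names, and cycle timing is ignored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Stage:
    """One data-processing stage: register file, arithmetic unit, holding unit."""
    op: Callable[[int, int], int]
    regs: Dict[int, int] = field(default_factory=dict)
    held: Optional[int] = None   # holding unit: latest calculation result

    def compute(self, reg_a: int, reg_b: Optional[int] = None,
                forwarded: Optional[int] = None) -> int:
        lhs = self.regs[reg_a]
        if forwarded is not None:
            rhs = forwarded              # result handed over by the previous holding unit
        elif reg_b is not None:
            rhs = self.regs[reg_b]       # ordinary register read
        else:
            raise ValueError("need a second register or a forwarded result")
        self.held = self.op(lhs, rhs)
        return self.held


def propagate(src: Stage, dst: Stage) -> None:
    # One-to-one transfer: register k of src is copied to register k of dst.
    dst.regs.update(src.regs)


stage1 = Stage(op=lambda a, b: a + b, regs={1: 3, 2: 4, 3: 10})
stage2 = Stage(op=lambda a, b: a * b)

result1 = stage1.compute(1, 2)                       # first stage: r1 + r2
propagate(stage1, stage2)                            # register file 1 -> register file 2
result2 = stage2.compute(3, forwarded=stage1.held)   # second stage: r3 * forwarded result
```

The second stage never reads the first register file: it sees its own copy of the registers plus the forwarded result, which is what lets both stages work at the same time.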
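Next, Embodiment 2 of the present invention will be described. FIG. 2 is a diagram showing the configuration of a data processing device according to Embodiment 2 of the present invention. In the following, parts that are the same as in Embodiment 1 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 3 of the present invention will be described. This embodiment extends the three-stage configuration of the data processing device 102 of Embodiment 2 above, which consists of the first to third data processing stages, to a configuration of N data processing stages.
Next, Embodiment 4 of the present invention will be described. FIG. 3 is a diagram showing the configuration of a data processing device according to Embodiment 4 of the present invention. In the following, parts that are the same as in Embodiment 2 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 5 of the present invention will be described. FIG. 4 is a diagram showing the configuration of a data processing device according to Embodiment 5 of the present invention. In the following, parts that are the same as in Embodiment 4 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 6 of the present invention will be described. FIG. 5 is a diagram showing the configuration of a data processing device according to Embodiment 6 of the present invention. In the following, parts that are the same as in Embodiment 5 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 7 of the present invention will be described. FIG. 6 is a diagram showing the configuration of a data processing device according to Embodiment 7 of the present invention. In the following, parts that are the same as in Embodiment 6 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 8 of the present invention will be described. FIG. 7 is a diagram showing the configuration of a data processing device according to Embodiment 8 of the present invention. In the following, parts that are the same as in Embodiment 7 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 9 of the present invention will be described. FIG. 8 is a diagram showing the configuration of a data processing device according to Embodiment 9 of the present invention. In the following, parts that are the same as in Embodiment 8 of the present invention are given the same reference numerals, and their detailed description is omitted.
Next, Embodiment 10 of the present invention will be described.
Next, Embodiment 11 of the present invention will be described. In this embodiment, the processing procedure of the data processing method according to the present invention is described.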
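Similarly, for the function of propagating data between adjacent small-scale cache memories, a configuration that copies the entire cache memory at once can be used. Alternatively, a configuration may be used in which only the differential data flowing in from the preceding-stage cache memory is propagated to the next stage, so that substantially the same contents are replicated in the next stage.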
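The two propagation options for adjacent caches (copy the whole cache at once, or pass along only the incoming differential data) can be contrasted with a short Python sketch. The dictionary-based cache, the function names `copy_whole` and `propagate_delta`, and the (address, word) delta format are assumptions made for illustration; timing between stages is not modelled.

```python
from typing import Dict, Iterable, List, Tuple

Cache = Dict[int, int]                 # address -> word (hypothetical representation)


def copy_whole(prev: Cache) -> Cache:
    """Option 1: replicate the entire preceding-stage cache in one step."""
    return dict(prev)


def propagate_delta(stages: List[Cache],
                    delta: Iterable[Tuple[int, int]]) -> None:
    """Option 2: hand only the newly arrived (address, word) pairs down the chain."""
    entries = list(delta)
    for cache in stages:
        for addr, word in entries:
            cache[addr] = word         # each stage applies the same differences


stages: List[Cache] = [{}, {}, {}]
propagate_delta(stages, [(0x10, 5), (0x14, 6)])
propagate_delta(stages, [(0x10, 9)])   # a later update touches a single word
snapshot = copy_whole(stages[0])
```

Both options leave every stage with the same contents; the differential variant moves only the words that changed, at the cost of every stage having to see every update.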
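The present invention is not limited to the embodiments described above; various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention.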
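20 Instruction fetch unit (instruction fetch/decode unit)
30 Instruction decode unit (instruction fetch/decode unit)
101, 102, 103, 104, 105, 106, 107, 108 Data processing device
110, 210, 310, 410, 510, 610, 710, 810, 910, 1010, 1110 Register file unit
120, 220, 320, 420, 520, 620, 720, 820, 920, 1020 Arithmetic device (arithmetic unit, holding unit)
130, 230, 330, 430, 1030 Load/store unit (load unit, store unit)
130, 230, 330, 430 Cache memory
150 External memory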
Claims (11)
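- A data processing device for executing an instruction code composed of a plurality of machine language instructions, comprising:
an instruction memory unit that holds the instruction code;
an instruction fetch/decode unit that fetches the instruction code from the instruction memory unit and decodes it;
n register file units (n being an integer of 1 or more) including a first register file unit, which includes a plurality of first registers that correspond one-to-one to the respective register numbers described in the instruction code decoded by the instruction fetch/decode unit and that temporarily hold the data corresponding to the respective register numbers, and a second register file unit, which includes a plurality of second registers corresponding one-to-one to the respective first registers of the first register file unit;
n arithmetic units including a first arithmetic unit, which executes an operation using the data read from the first registers of the first register file unit, and a second arithmetic unit; and
n holding units including a first holding unit that temporarily holds the calculation result of the first arithmetic unit,
wherein, when any of its own first registers holds data, the first register file unit transfers that data to the second register of the second register file unit corresponding to the first register holding the data,
the first holding unit is capable of transferring the calculation result it holds to the second arithmetic unit, and
the second arithmetic unit executes an operation using at least one of the data read from the second registers of the second register file unit and the calculation result transferred by the first holding unit.
- The data processing device according to claim 1, wherein the n register file units further include a third register file unit including a plurality of third registers corresponding one-to-one to the respective second registers of the second register file unit,
the n arithmetic units further include a third arithmetic unit,
the n holding units further include a second holding unit that temporarily holds the calculation result of the second arithmetic unit,
when any of its own second registers holds data, the second register file unit transfers that data to the third register of the third register file unit corresponding to the second register holding the data,
the second holding unit is capable of transferring the calculation result it holds to the third arithmetic unit, and
the third arithmetic unit executes an operation using at least one of the data read from the third registers of the third register file unit and the calculation result transferred by the second holding unit.
- The data processing device according to claim 1 or 2, wherein the N-th holding unit (N being an integer of 1 or more and not more than n) included in the n holding units
transfers the calculation result it holds to the (N+2)-th register file unit included in the n register file units when that result is used for operations executed by the (N+2)-th or later arithmetic units included in the n arithmetic units, and
transfers the calculation result to the (N+1)-th arithmetic unit included in the n arithmetic units when that result is not used for operations executed by the (N+2)-th or later arithmetic units.
- The data processing device according to any one of claims 1 to 3, wherein, when the instruction fetch/decode unit decodes an operation instruction that is included in the instruction code and is written so as to simultaneously operate a plurality of register file units included in the n register file units, a plurality of arithmetic units included in the n arithmetic units, and a plurality of holding units included in the n holding units, the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units are operated simultaneously based on the decoding result of the operation instruction, and the operation of the instruction fetch/decode unit is stopped.
- The data processing device according to claim 4, wherein the operation instruction includes setting information to be set in each of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units to be operated simultaneously, in order to control their respective operations, and an operation termination condition under which the simultaneous operation of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units is to be stopped,
until the instruction fetch/decode unit decodes the operation instruction, the instruction fetch/decode unit, the first register file unit, the first arithmetic unit, and the first holding unit are operated simultaneously,
when the operation instruction is decoded, the operation of the instruction fetch/decode unit is stopped and the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units are operated simultaneously, based on the decoding result of the operation instruction, and
when the operation termination condition included in the operation instruction is satisfied, the operations of the plurality of register file units, the plurality of arithmetic units, and the plurality of holding units other than the first register file unit, the first arithmetic unit, and the first holding unit are stopped, and the instruction fetch/decode unit, the first register file unit, the first arithmetic unit, and the first holding unit are operated simultaneously.
- The data processing device according to any one of claims 1 to 5, wherein each of the n arithmetic units has
a cache memory that temporarily holds the contents of an external memory disposed outside the data processing device,
a load unit that reads the cache memory using address information accompanying a load instruction included in the instruction code, and
a store unit that temporarily holds the data read by the load unit,
and the N-th arithmetic unit included in the n arithmetic units is capable of transferring the data held by its own store unit to the (N+1)-th and subsequent arithmetic units included in the n arithmetic units and to the (N+1)-th and subsequent register file units included in the n register file units.
- The data processing device according to claim 6, wherein the N-th arithmetic unit included in the n arithmetic units, when its own cache memory holds data, is capable of transferring that data to the cache memory of the (N+1)-th arithmetic unit included in the n arithmetic units.
- The data processing device according to claim 6 or 7, wherein each of the n arithmetic units, when its own load unit reads the cache memory, holds the address information accompanying the load instruction and, each time a read by the load unit is completed, increases or decreases the held address information by the width of the data read, thereby generating the address information for the next read by the load unit.
- The data processing device according to claim 4 or 5, wherein the first arithmetic unit has a cache memory directly connected to an external memory disposed outside the data processing device,
the cache memory has data transfer means that transfers data based on transfer information consisting of a write-destination address and a transfer data length associated with the operation instruction, and
the data transfer means continuously transfers a plurality of data items simultaneously from a plurality of mutually different addresses on the external memory.
- The data processing device according to claim 9, wherein the external memory has data transfer means that transfers data based on transfer information consisting of a write-destination address and a number of transfer words associated with the operation instruction, and
the data transfer means continuously transfers a plurality of data items from an external I/O device to the oldest bank of the external memory.
- The data processing device according to claim 5, wherein the first arithmetic unit waits for a data transfer from the external memory when no area corresponding to the address information accompanying the load instruction exists in its own cache memory, and the operation termination condition is that the second and subsequent arithmetic units have operated a number of times corresponding to the number of transfer words associated with the operation instruction.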
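The switch between non-array operation and array operation recited in the claims on the operation instruction can be sketched as a small state machine in Python. Everything named here (`OperationInstruction`, `Controller`, `stage_settings`, `transfer_words`, `tick`) is hypothetical, and the termination condition is simplified to a countdown over the number of transfer words.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OperationInstruction:
    stage_settings: List[str]   # hypothetical per-stage setting information
    transfer_words: int         # termination: later stages operate this many times


@dataclass
class Controller:
    n_stages: int
    fetch_decode_running: bool = True                               # non-array operation
    active_stages: List[int] = field(default_factory=lambda: [1])   # only stage 1 runs
    settings: List[Optional[str]] = field(default_factory=list)
    remaining: int = 0

    def decode(self, insn: OperationInstruction) -> None:
        """Decoding the operation instruction starts array operation."""
        if len(insn.stage_settings) != self.n_stages:
            raise ValueError("one setting per stage is expected")
        self.settings = list(insn.stage_settings)
        self.remaining = insn.transfer_words
        self.fetch_decode_running = False                  # fetch/decode stops
        self.active_stages = list(range(1, self.n_stages + 1))

    def tick(self) -> None:
        """One operation of the later stages during array operation."""
        if self.fetch_decode_running:
            return
        self.remaining -= 1
        if self.remaining <= 0:                            # termination condition met
            self.active_stages = [1]                       # later stages stop
            self.fetch_decode_running = True               # back to non-array operation


ctrl = Controller(n_stages=3)
ctrl.decode(OperationInstruction(["add", "mul", "sub"], transfer_words=2))
during = (ctrl.fetch_decode_running, list(ctrl.active_stages))
ctrl.tick()
ctrl.tick()
after = (ctrl.fetch_decode_running, list(ctrl.active_stages))
```

Here `during` captures array operation (fetch/decode stopped, all stages active) and `after` captures the return to non-array operation once the count is exhausted.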
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09820420.9A EP2352082B1 (en) | 2008-10-14 | 2009-10-13 | Data processing device for performing a plurality of calculation processes in parallel |
JP2010533819A JP5279046B2 (ja) | 2008-10-14 | 2009-10-13 | データ処理装置 |
US12/998,349 US20110264892A1 (en) | 2008-10-14 | 2009-10-13 | Data processing device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-265312 | 2008-10-14 | ||
JP2008265312 | 2008-10-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010044242A1 (ja) | 2010-04-22 |
Family
ID=42106414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/005306 WO2010044242A1 (ja) | 2008-10-14 | 2009-10-13 | データ処理装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20110264892A1 (ja) |
EP (1) | EP2352082B1 (ja) |
JP (1) | JP5279046B2 (ja) |
KR (1) | KR101586770B1 (ja) |
WO (1) | WO2010044242A1 (ja) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2519108A (en) * | 2013-10-09 | 2015-04-15 | Advanced Risc Mach Ltd | A data processing apparatus and method for controlling performance of speculative vector operations |
GB2576572B (en) * | 2018-08-24 | 2020-12-30 | Advanced Risc Mach Ltd | Processing of temporary-register-using instruction |
US11237827B2 (en) | 2019-11-26 | 2022-02-01 | Advanced Micro Devices, Inc. | Arithemetic logic unit register sequencing |
US11862289B2 (en) | 2021-06-11 | 2024-01-02 | International Business Machines Corporation | Sum address memory decoded dual-read select register file |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0883264A (ja) | 1994-09-12 | 1996-03-26 | Nippon Telegr & Teleph Corp <Ntt> | 1次元シストリックアレイ型演算器とそれを用いたdct/idct演算装置 |
WO1996029646A1 (fr) * | 1995-03-17 | 1996-09-26 | Hitachi, Ltd. | Processeur |
JP2001147799A (ja) * | 1999-10-01 | 2001-05-29 | Hitachi Ltd | データ移動方法および条件付転送論理ならびにデータの配列換え方法およびデータのコピー方法 |
JP2001312481A (ja) | 2000-02-25 | 2001-11-09 | Nec Corp | アレイ型プロセッサ |
JP2003076668A (ja) | 2001-08-31 | 2003-03-14 | Nec Corp | アレイ型プロセッサ、データ処理システム |
JP2003099249A (ja) * | 2001-07-17 | 2003-04-04 | Sanyo Electric Co Ltd | データ処理装置 |
JP2005539293A (ja) * | 2002-08-16 | 2005-12-22 | カーネギー−メロン ユニバーシティ | 部分的にグローバルなコンフィギュレーションバスを用いたプログラマブルパイプラインファブリック |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6665792B1 (en) * | 1996-11-13 | 2003-12-16 | Intel Corporation | Interface to a memory system for a processor having a replay system |
JP2000259609A (ja) * | 1999-03-12 | 2000-09-22 | Hitachi Ltd | データ処理プロセッサおよびシステム |
US7308559B2 (en) * | 2000-02-29 | 2007-12-11 | International Business Machines Corporation | Digital signal processor with cascaded SIMD organization |
US7069372B1 (en) * | 2001-07-30 | 2006-06-27 | Cisco Technology, Inc. | Processor having systolic array pipeline for processing data packets |
US20040128482A1 (en) * | 2002-12-26 | 2004-07-01 | Sheaffer Gad S. | Eliminating register reads and writes in a scheduled instruction cache |
US8024549B2 (en) * | 2005-03-04 | 2011-09-20 | Mtekvision Co., Ltd. | Two-dimensional processor array of processing elements |
2009
- 2009-10-13 WO PCT/JP2009/005306 patent/WO2010044242A1/ja active Application Filing
- 2009-10-13 EP EP09820420.9A patent/EP2352082B1/en active Active
- 2009-10-13 US US12/998,349 patent/US20110264892A1/en not_active Abandoned
- 2009-10-13 JP JP2010533819A patent/JP5279046B2/ja active Active
- 2009-10-13 KR KR1020117010698A patent/KR101586770B1/ko active IP Right Grant
Non-Patent Citations (1)
Title |
---|
SCHMIT, H ET AL.: "PipeRench: A virtualized programmable datapath in 0.18 micron technology", PROCEEDINGS OF THE IEEE CUSTOM INTEGRATED CIRCUITS CONFERENCE 2002, 15 May 2002 (2002-05-15), pages 63 - 66, XP008147180 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013137459A1 (ja) * | 2012-03-16 | 2013-09-19 | 国立大学法人奈良先端科学技術大学院大学 | データ供給装置及びデータ処理装置 |
JPWO2013137459A1 (ja) * | 2012-03-16 | 2015-08-03 | 国立大学法人 奈良先端科学技術大学院大学 | データ供給装置及びデータ処理装置 |
US9292425B2 (en) | 2012-09-11 | 2016-03-22 | Samsung Electronics Co., Ltd. | Semiconductor memory device with operation functions to be used during a modified read or write mode |
KR20200083123A (ko) * | 2018-12-31 | 2020-07-08 | 그래프코어 리미티드 | 로드-저장 명령 |
KR102201935B1 (ko) | 2018-12-31 | 2021-01-12 | 그래프코어 리미티드 | 로드-저장 명령 |
Also Published As
Publication number | Publication date |
---|---|
KR101586770B1 (ko) | 2016-01-19 |
JP5279046B2 (ja) | 2013-09-04 |
KR20110084915A (ko) | 2011-07-26 |
EP2352082A4 (en) | 2012-04-11 |
JPWO2010044242A1 (ja) | 2012-03-15 |
EP2352082A1 (en) | 2011-08-03 |
EP2352082B1 (en) | 2018-11-28 |
US20110264892A1 (en) | 2011-10-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5279046B2 (ja) | データ処理装置 | |
US5333280A (en) | Parallel pipelined instruction processing system for very long instruction word | |
US5903769A (en) | Conditional vector processing | |
US20110145543A1 (en) | Execution of variable width vector processing instructions | |
KR102379894B1 (ko) | 벡터 연산들 수행시의 어드레스 충돌 관리 장치 및 방법 | |
JPH0628184A (ja) | ブランチ予測方法及びブランチプロセッサ | |
US9141386B2 (en) | Vector logical reduction operation implemented using swizzling on a semiconductor chip | |
KR102279200B1 (ko) | 에뮬레이티드 공유 메모리 아키텍쳐를 위한 부동-소수점 지원가능한 파이프라인 | |
WO2015114305A1 (en) | A data processing apparatus and method for executing a vector scan instruction | |
JPH06103068A (ja) | データ処理装置 | |
JP2023527227A (ja) | プロセッサ、処理方法、および関連デバイス | |
US8055883B2 (en) | Pipe scheduling for pipelines based on destination register number | |
JP2014215624A (ja) | 演算処理装置 | |
JP4444305B2 (ja) | 半導体装置 | |
US6981130B2 (en) | Forwarding the results of operations to dependent instructions more quickly via multiplexers working in parallel | |
US20080222392A1 (en) | Method and arrangements for pipeline processing of instructions | |
WO2015155894A1 (ja) | プロセッサーおよび方法 | |
US7107478B2 (en) | Data processing system having a Cartesian Controller | |
US11416261B2 (en) | Group load register of a graph streaming processor | |
JP3771682B2 (ja) | ベクトル処理装置 | |
JP2017059273A (ja) | 演算処理装置 | |
JP7141401B2 (ja) | プロセッサおよび情報処理システム | |
WO2013137459A1 (ja) | データ供給装置及びデータ処理装置 | |
JP2002318689A (ja) | 資源使用サイクルの遅延指定付き命令を実行するvliwプロセッサおよび遅延指定命令の生成方法 | |
US20130298129A1 (en) | Controlling a sequence of parallel executions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09820420; Country of ref document: EP; Kind code of ref document: A1 |
| DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| ENP | Entry into the national phase | Ref document number: 2010533819; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 20117010698; Country of ref document: KR; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 2009820420; Country of ref document: EP |
| WWE | Wipo information: entry into national phase | Ref document number: 12998349; Country of ref document: US |