WO2013137459A1 - Data providing device and data processing device - Google Patents

Info

Publication number
WO2013137459A1
WO2013137459A1 PCT/JP2013/057503 JP2013057503W WO2013137459A1 WO 2013137459 A1 WO2013137459 A1 WO 2013137459A1 JP 2013057503 W JP2013057503 W JP 2013057503W WO 2013137459 A1 WO2013137459 A1 WO 2013137459A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
data
register
arithmetic
calculation
Prior art date
Application number
PCT/JP2013/057503
Other languages
French (fr)
Japanese (ja)
Inventor
康彦 中島
駿 姚
Original Assignee
国立大学法人奈良先端科学技術大学院大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人奈良先端科学技術大学院大学 filed Critical 国立大学法人奈良先端科学技術大学院大学
Priority to JP2014505037A priority Critical patent/JP6164616B2/en
Publication of WO2013137459A1 publication Critical patent/WO2013137459A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
    • G06F9/3897Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros with adaptable data path

Definitions

  • The present invention relates to a data processing apparatus that has a plurality of arithmetic units and can perform arithmetic processing with the arithmetic units operating synchronously, and more particularly to a data supply method suitable for supplying data to such a data processing apparatus.
  • an arithmetic unit array method is known as a method for processing such a large number of instructions in parallel.
  • This arithmetic unit array method is a method in which an arithmetic unit network is fixed in accordance with target data processing, and input data is poured into the fixed arithmetic unit network (see, for example, Patent Documents 1 to 3).
  • The arithmetic unit array method cannot execute existing machine language instructions. It therefore needs dedicated means for generating machine language instructions specific to the arithmetic unit array system, and it lacks versatility.
  • For example, the superscalar method, the vector method, and the VLIW (Very Long Instruction Word) method are known as methods that can execute general machine language instructions and execute them in parallel.
  • the superscalar method is a method in which hardware dynamically detects machine language instructions that can be executed simultaneously from a machine language instruction sequence and executes them in parallel.
  • This superscalar method has the advantage of being able to use existing software assets as they are, but recently tends to be avoided due to the complexity of the mechanism and the large amount of power consumption.
  • The vector method repeatedly applies basic operations such as load, operation, and store using a vector register in which a large number of registers are arranged in a one-dimensional direction, and it can achieve high speed with high power efficiency. Furthermore, since no cache memory is required, the data transfer speed between the main memory and the vector register is guaranteed, and as a result, stable high speed is realized.
  • the vector method can only perform operations between the same element numbers of different vector registers, and is not suitable for a program for performing operations while referring to adjacent elements in the same vector register.
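  • As a hedged illustration of this limitation (not taken from the patent), the Python sketch below contrasts an element-wise operation between two vectors, which matches the vector method, with an operation that needs neighbouring elements of the same vector. The function names `vector_add` and `neighbor_sum` are hypothetical.

```python
# Illustrative sketch only; these helpers are assumptions, not part of the patent.

def vector_add(a, b):
    """Element-wise operation: element i of `a` only ever meets element i of `b`."""
    return [x + y for x, y in zip(a, b)]


def neighbor_sum(v):
    """Stencil-style operation: element i needs elements i-1 and i+1 of the SAME vector."""
    return [v[i - 1] + v[i] + v[i + 1] for i in range(1, len(v) - 1)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(vector_add(a, b))   # [11, 22, 33, 44]
print(neighbor_sum(a))    # [6, 9]
```

  • The second function is the kind of access pattern that, according to the text above, does not map well onto operations between identical element numbers of different vector registers.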
  • the VLIW method is a method in which a plurality of operations and the like are designated in one instruction and are executed simultaneously.
  • In this VLIW method, for example, four instructions are fetched simultaneously, the four instructions are decoded simultaneously, the necessary data is read from a general-purpose register, the operations are performed simultaneously by a plurality of arithmetic units, and each operation result is stored in the operation result storage means attached to the arithmetic unit.
  • The contents are then read from the operation result storage means and written to the general-purpose register.
  • The operation result is also bypassed to the input of the arithmetic unit.
  • the cache memory is referred to in the LD / ST unit
  • the load result is stored in the load result storage means associated with the LD / ST unit, and then the arithmetic unit operates in the next cycle.
  • the adoption of the cache system greatly contributes to the performance improvement.
  • As one such cache system, there is a system in which primary caches are incorporated at various places in the arithmetic unit network and, at the same time, a secondary cache is provided between the primary caches and the external main memory. In this method, the hit rate of the secondary cache is increased, access to the main memory is reduced, and the performance of each arithmetic unit is improved.
  • This is a mechanism in which data read from the primary cache passes through all the small-scale buffers attached to the arithmetic units, and each small-scale buffer stores a certain amount of that data.
  • In this mechanism, a large number of wirings for reading data from the primary cache every cycle are connected to many small-scale buffers. That is, there has been a problem that no consideration has been given to efficiently propagating the contents of the primary cache to the arithmetic units.
  • An object of the present invention is to provide a data supply device that, for a data processing apparatus having a plurality of arithmetic units and capable of performing arithmetic processing by each arithmetic unit synchronously, supplies data to the data processing apparatus efficiently and can thereby reduce the power consumption of each arithmetic unit.
  • A primary cache in which a plurality of ways are aggregated was arranged at various places in the arithmetic unit network, and data was supplied through small-scale buffers to arithmetic units to which the primary cache is not directly connected.
  • This method has an advantage that the degree of freedom regarding the arrangement of the load instruction in the machine language instruction sequence is large, but has a disadvantage that the wiring for connecting the small buffers becomes large.
  • Each way of the primary cache is distributed uniformly in the vicinity of the arithmetic units, and the connections between small-scale buffers are eliminated. Although this restricts the arrangement of load instructions in the machine language instruction sequence, the same instruction execution capability is retained by changing the instruction mapping position according to the content of the data stored in the primary cache. That is, the problem is solved by reducing the number of wirings without reducing the capacity.
  • A data supply apparatus supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages. It includes a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a row, and the shift register unit writes the data read from the memory unit to a register at the head or in the middle of the shift register unit.
  • Each of the memory unit and the shift register unit outputs the contents of each address position corresponding to each piece of address information by referring to the plurality of pieces of address information input to the data supply apparatus.
  • one memory unit is divided into a plurality of blocks, and the data read from each block can be written to the head or middle register of the shift register unit.
  • Each of the memory section and the shift register section is referred to based on a plurality of address information input to the data supply device, and can output the contents of each address position corresponding to each address information.
  • A data processing apparatus is a data processing apparatus in which a plurality of the arithmetic unit bundles are configured in multiple stages. When the next high-speed execution is started after a certain series of high-speed executions, and the contents of the memory unit of the data supply device that supplies data to the arithmetic units can be used by another operation instruction, the mapping of the operation instructions to the arithmetic units constituting the arithmetic unit bundle is changed.
  • The data supply device of the present invention is a data supply device that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages. It includes a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a line. The shift register unit writes data read from the memory unit to a register at the head or in the middle of the shift register unit, and each of the memory unit and the shift register unit outputs the contents of each address position corresponding to each piece of address information by referring to the plurality of pieces of address information input to the data supply device.
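  • The structure described above can be pictured with a small software model. The following Python sketch is only an assumption-laden toy: the class `DataSupplyModel`, its method names, the word-addressed blocks, and the rule that a lookup prefers the shift register over the memory are all hypothetical and are not specified in the patent text.

```python
# Toy model only; names, sizes, and lookup policy are assumptions, not the patent's design.

class DataSupplyModel:
    """A memory unit divided into blocks plus a shift register of (address, value) pairs."""

    def __init__(self, num_blocks, block_words, num_regs):
        self.block_words = block_words
        self.blocks = [[0] * block_words for _ in range(num_blocks)]
        # Each register holds an (address, value) pair, or None when empty.
        self.regs = [None] * num_regs

    def store(self, addr, value):
        blk, off = divmod(addr, self.block_words)
        self.blocks[blk][off] = value

    def read_memory(self, addr):
        blk, off = divmod(addr, self.block_words)
        return self.blocks[blk][off]

    def shift_in(self, addr, position=0):
        """Write the word read from memory at `position` (0 = head, >0 = middle),
        shifting the registers from `position` onward by one place."""
        word = (addr, self.read_memory(addr))
        self.regs[position + 1:] = self.regs[position:-1]
        self.regs[position] = word

    def lookup(self, addrs):
        """Return one value per input address, checking the shift register first."""
        held = {a: v for a, v in (r for r in self.regs if r is not None)}
        return [held[a] if a in held else self.read_memory(a) for a in addrs]


m = DataSupplyModel(num_blocks=4, block_words=4, num_regs=3)
for addr in range(16):
    m.store(addr, addr * 10)
m.shift_in(5)               # write at the head
m.shift_in(9, position=1)   # write in the middle
print(m.lookup([5, 9, 2]))  # [50, 90, 20]
```

  • Several addresses are presented at once and one value is returned per address, which mirrors the statement that the contents of each address position are output for each piece of address information.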
  • FIG. 3 is a diagram showing a configuration of a LAPP in which the configuration of three data processing stages including first to third data processing stages in the LAPP is expanded to a configuration of N data processing stages. It is a schematic diagram for demonstrating the data supply from the cache memory in the said LAPP. It is a schematic diagram for demonstrating the structure which arrange
  • the present invention relates to a data supply method in a computer configuration system in which a large number of arithmetic units are juxtaposed.
  • the present invention is particularly relevant to the memory reference mechanism corresponding to the memory reference patterns shown in Table 1.
  • LAPP: Linear Array Pipeline Processor
  • CGRA: coarse-grained reconfigurable array
  • FIG. 1 is a diagram showing the configuration of the LAPP described above.
  • The LAPP 101 includes a configuration memory 10, a first register file unit 110, a second register file unit 210, a first arithmetic device (first arithmetic unit, first holding unit) 120, and a second arithmetic device (second arithmetic unit, second holding unit) 220.
  • the configuration memory 10 constitutes a known CGRA and stores configuration data.
  • the configuration data is data that defines processing contents in the first arithmetic device 120 and the second arithmetic device 220.
  • the configuration memory 10 transfers such configuration data to the first register file unit 110 and the second register file unit 210.
  • the first register file unit 110 holds data necessary for arithmetic processing in the first arithmetic unit 120.
  • The first register file unit 110 includes a register group 111 consisting of a plurality of registers (first registers) r0 to r11, and a transfer unit 112 for transferring the read data of the registers r0 to r11 of the register group 111 to the outside of the first register file unit 110.
  • Reading and writing to each of the registers r0 to r11 of the register group 111 is executed based on configuration data stored in the configuration memory 10.
  • Each register r0 to r11 of the register group 111 is read or written using its own register number 0 to 11 as an access key.
  • the transfer unit 112 transfers the data held in the register with the specified number to the outside of the first register file unit 110.
  • the second register file unit 210 holds data necessary for arithmetic processing in the second arithmetic unit 220.
  • The second register file unit 210 includes a register group 211 consisting of a plurality of registers (second registers) r0 to r11, and a transfer unit 212 for transferring the read data of the registers r0 to r11 of the register group 211 to the outside of the second register file unit 210.
  • Reading and writing to each of the registers r0 to r11 in the register group 211 is executed based on configuration data stored in the configuration memory 10.
  • Each register r0 to r11 of the register group 211 is read or written using its own register number 0 to 11 as an access key.
  • The registers r0 to r11 of the register group 211 correspond one-to-one with the registers r0 to r11 of the register group 111 of the first register file unit 110, and registers of the register group 111 and the register group 211 that have the same register number are associated with each other. The transfer unit 112 of the first register file unit 110 can therefore transfer the read data of each of the registers r0 to r11 of the register group 111 to the register of the register group 211 of the second register file unit 210 that has the same register number.
  • the transfer unit 112 of the first register file unit 110 can transfer the read data of the register r3 of the register group 111 to the register r3 of the register group 211 of the second register file unit 210.
  • the transfer unit 112 of the first register file unit 110 can transfer read data of the register r9 of the register group 111 to the register r9 of the register group 211 of the second register file unit 210.
  • the transfer device 212 transfers the data held in the register with the specified number to the outside of the second register file unit 210.
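  • The same-number transfer between register groups described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function `transfer_same_number` and the list-based register groups are hypothetical stand-ins for the transfer unit 112 and the register groups 111 and 211.

```python
# Illustrative sketch; the function and data layout are assumptions, not the patent's circuit.

NUM_REGS = 12  # registers r0..r11


def transfer_same_number(src_group, dst_group, numbers):
    """Copy the read data of each listed register into the register with the
    same number in the next stage's register group."""
    for n in numbers:
        if not 0 <= n < NUM_REGS:
            raise ValueError(f"register number out of range: {n}")
        dst_group[n] = src_group[n]
    return dst_group


group_111 = [100 + n for n in range(NUM_REGS)]  # first register file unit
group_211 = [0] * NUM_REGS                      # second register file unit
transfer_same_number(group_111, group_211, [3, 9])
print(group_211[3], group_211[9])  # 103 109
```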
  • the first arithmetic unit 120 performs substantial processing in the LAPP 101.
  • The first arithmetic unit 120 includes an arithmetic unit group 121 including arithmetic units 1-1 to 1-4, a holder group 122 including holders 1-1 to 1-4, and a transfer unit 123.
  • the first arithmetic unit 120 constitutes a first data processing stage together with the first register file unit 110, and the transfer unit 112 of the first register file unit 110 reads the read data of the registers r0 to r11 of the register group 111. Can be transferred to the first arithmetic unit 120.
  • Each of the arithmetic units 1-1 to 1-4 of the arithmetic unit group 121 of the first arithmetic unit 120 obtains two pieces of read data from the registers r0 to r11 of the first register file unit 110 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 1-1 to 1-4 is executed simultaneously.
  • the holders 1-1 to 1-4 of the holder group 122 store the calculation results of the corresponding calculators 1-1 to 1-4.
  • Each holder 1-1 to 1-4 corresponds one-to-one with each computing unit 1-1 to 1-4.
  • the transfer unit 123 transfers the calculation results of the calculators 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the outside of the first calculator 120.
  • the second arithmetic unit 220 performs substantial processing in the LAPP 101.
  • The second arithmetic unit 220 includes an arithmetic unit group 221 including arithmetic units 2-1 to 2-4, a holder group 222 including holders 2-1 to 2-4, and a transfer unit 223.
  • the second arithmetic unit 220 together with the second register file unit 210, constitutes a second data processing stage, and the transfer unit 212 of the second register file unit 210 reads data read from the registers r0 to r11 of the register group 211. Can be transferred to the second arithmetic unit 220.
  • Each of the arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 of the second arithmetic unit 220 obtains two pieces of read data from the registers r0 to r11 of the second register file unit 210 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 2-1 to 2-4 is executed simultaneously.
  • The arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 of the second arithmetic unit 220 can also acquire the calculation results stored in the holders 1-1 to 1-4 of the holder group 122 of the first arithmetic unit 120.
  • That is, the transfer unit 123 of the first arithmetic unit 120 can transfer the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic unit 220.
  • The arithmetic units 2-1 to 2-4 of the second arithmetic unit 220 can then execute arithmetic processing using these calculation results instead of the read data of the registers r0 to r11 of the second register file unit 210.
  • the holders 2-1 to 2-4 of the holder group 222 store the calculation results of the corresponding calculators 2-1 to 2-4.
  • Each of the holders 2-1 to 2-4 has a one-to-one correspondence with each of the arithmetic units 2-1 to 2-4.
  • the transfer unit 223 transfers the calculation results of the calculators 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the outside of the second calculation device 220.
  • arithmetic processing by the first arithmetic unit 120 is performed using read data of the registers r0 to r11 of the register group 111.
  • the read data of the registers r0 to r11 of the register group 111 that is not the target of the arithmetic processing by the first arithmetic device 120 is transferred to the second register file unit 210.
  • the arithmetic processing by the second arithmetic unit 220 is performed using the data transferred to the registers r0 to r11 of the register group 211 of the second register file unit 210.
  • the arithmetic processing by the first arithmetic device 120 is performed using the read data of the registers r0 to r11 of the register group 111.
  • The transfer unit 123 of the first arithmetic unit 120 transfers the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic unit 220.
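  • The two data paths described above (register data passed to the same-numbered register of the next stage, and holder results passed to the next arithmetic unit) can be sketched as follows. This is a simplified software analogy, not the patent's hardware: `run_stage`, the `("r", n)` / `("h", k)` operand encoding, and the chosen operations are all assumptions made for illustration.

```python
import operator

# Illustrative two-stage model; operand encoding and operations are assumptions.


def run_stage(regs, ops, prev_holders=None):
    """Run one stage of four arithmetic units.

    Each op is (function, src_a, src_b). A source is ("r", n) for register rn of
    this stage's register file, or ("h", k) for holder k of the previous stage.
    Returns the list of results, i.e. the contents of this stage's holders.
    """
    def fetch(src):
        kind, idx = src
        if kind == "r":
            return regs[idx]
        return prev_holders[idx]

    return [fn(fetch(a), fetch(b)) for fn, a, b in ops]


regs1 = list(range(12))  # r0..r11 of the first register file hold 0..11
stage1_ops = [
    (operator.add, ("r", 0), ("r", 1)),
    (operator.mul, ("r", 2), ("r", 3)),
    (operator.sub, ("r", 5), ("r", 4)),
    (operator.and_, ("r", 6), ("r", 7)),
]
holders1 = run_stage(regs1, stage1_ops)

regs2 = list(regs1)  # same-numbered registers transferred to the second stage
stage2_ops = [
    (operator.add, ("h", 0), ("h", 1)),   # uses two first-stage results
    (operator.add, ("r", 8), ("r", 9)),   # uses forwarded register data only
    (operator.mul, ("h", 2), ("r", 10)),  # mixes a holder result and a register
    (operator.or_, ("h", 3), ("r", 11)),
]
holders2 = run_stage(regs2, stage2_ops, prev_holders=holders1)

print(holders1)  # [1, 6, 1, 6]
print(holders2)  # [7, 17, 10, 15]
```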
  • The LAPP 102 shown in FIG. 2 includes a third register file unit 310 and a third arithmetic device (third arithmetic unit, third holding unit) 320 in addition to the configuration of the LAPP 101.
  • the arithmetic processing by the third arithmetic device 320 is also executed simultaneously.
  • the third register file unit 310 holds data necessary for arithmetic processing in the third arithmetic unit 320.
  • The third register file unit 310 includes a register group 311 consisting of a plurality of registers (third registers) r0 to r11, and a transfer unit 312 for transferring the read data of the registers r0 to r11 of the register group 311 to the outside of the third register file unit 310.
  • Reading and writing to the registers r0 to r11 of the register group 311 are executed based on the configuration data stored in the configuration memory 10.
  • Each register r0 to r11 of the register group 311 is read or written using its own register number 0 to 11 as an access key.
  • The registers r0 to r11 of the register group 311 correspond one-to-one with the registers r0 to r11 of the register group 211 of the second register file unit 210, and registers of the register group 211 and the register group 311 that have the same register number are associated with each other. The transfer unit 212 of the second register file unit 210 can therefore transfer the read data of each of the registers r0 to r11 of the register group 211 to the register of the register group 311 of the third register file unit 310 that has the same register number.
  • the transfer unit 312 transfers the data held in the register with the designated number to the outside of the third register file unit 310.
  • The third register file unit 310 can acquire, via the transfer unit 123 of the first arithmetic unit 120, the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 of the first arithmetic unit 120.
  • The third arithmetic unit 320 performs substantial processing in the LAPP 102.
  • The third arithmetic unit 320 includes an arithmetic unit group 321 including arithmetic units 3-1 to 3-4, a holder group 322 including holders 3-1 to 3-4, and a transfer unit 323.
  • The third arithmetic unit 320 constitutes a third data processing stage together with the third register file unit 310, and the transfer unit 312 of the third register file unit 310 can transfer the read data of the registers r0 to r11 of the register group 311 to the third arithmetic unit 320. Each of the arithmetic units 3-1 to 3-4 of the arithmetic unit group 321 of the third arithmetic unit 320 obtains two pieces of read data from the registers r0 to r11 of the third register file unit 310 and uses them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 3-1 to 3-4 is executed simultaneously.
  • the holders 3-1 to 3-4 of the holder group 322 store the calculation results of the corresponding calculators 3-1 to 3-4.
  • Each holder 3-1 to 3-4 has a one-to-one correspondence with each computing unit 3-1 to 3-4.
  • The transfer unit 323 transfers the calculation results of the calculators 3-1 to 3-4 stored in the holders 3-1 to 3-4 to the outside of the third arithmetic unit 320.
  • The third arithmetic unit 320 can acquire, via the transfer unit 223 of the second arithmetic unit 220, the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 of the second arithmetic unit 220.
  • the arithmetic processing by the second arithmetic unit 220 is performed using the read data of the registers r0 to r11 of the register group 211.
  • the read data of the registers r0 to r11 of the register group 211 that is not subject to the arithmetic processing by the second arithmetic device 220 is transferred to the third register file unit 310.
  • the arithmetic processing by the third arithmetic unit 320 is performed using the data transferred to the registers r0 to r11 of the register group 311 of the third register file unit 310.
  • the arithmetic processing by the second arithmetic device 220 is performed using the read data of the registers r0 to r11 of the register group 211.
  • The transfer unit 223 of the second arithmetic unit 220 transfers the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the third arithmetic unit 320.
  • When the second arithmetic unit 220 does not need the calculation result of the first arithmetic unit 120 but the third arithmetic unit 320 does, the calculation result of the first arithmetic unit 120 can be input to the third arithmetic unit 320 indirectly, by way of the third register file unit 310.
  • the configuration of the three data processing stages including the first to third data processing stages in the LAPP 102 may be extended to the configuration of the N data processing stage.
  • N is an integer of 1 or more.
  • When the calculation result of an arithmetic unit constituting the Nth data processing stage is used by arithmetic units at or after the (N + 2)th data processing stage, that result is written into the register file unit of the (N + 2)th data processing stage.
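  • The routing rule for results can be summarised in a short Python sketch. It assumes, as a simplification, that a result used by the immediately following stage travels through the holders, and that a result used two or more stages later is written to the register file of stage N + 2 and carried onward from there by the ordinary register transfers. The function name `route_result` and the tuple labels are hypothetical.

```python
# Illustrative routing rule; the function and labels are assumptions for explanation.


def route_result(producer_stage, consumer_stages):
    """Return where a result produced in `producer_stage` is delivered."""
    routes = []
    if producer_stage + 1 in consumer_stages:
        # The next stage reads the producer's holders directly.
        routes.append(("holder", producer_stage + 1))
    if any(c >= producer_stage + 2 for c in consumer_stages):
        # Later consumers see the value through the register file of stage N + 2.
        routes.append(("register_file", producer_stage + 2))
    return routes


print(route_result(1, {2}))     # [('holder', 2)]
print(route_result(1, {3}))     # [('register_file', 3)]
print(route_result(1, {2, 5}))  # [('holder', 2), ('register_file', 3)]
```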
  • FIG. 3 shows a configuration of the LAPP 103 in which the configuration of the three data processing stages including the first to third data processing stages in the LAPP 102 is expanded to the configuration of the N data processing stage.
  • a known technique can be used for the mechanism of the cache memory 14 and the configuration of the small-scale cache memory 15 and the propagation mechanism therebetween.
  • the LAPP 103 uses a method of supplying a large amount of data from a memory to a plurality of computing units 11. As the operation data propagates in one direction on the operation unit network composed of the plurality of operation units 11 via the plurality of register file units 12, the data on the memory is also propagated in the same direction.
  • a plurality of load instructions can refer to a plurality of memory addresses at the same time.
  • each of the three ways of the cache memory 14 is made to correspond to one array. Then, one word is read from each way for each cycle and propagated to the next stage. At each stage, the value of the three words being propagated is taken into the small-scale cache memory 15 at each stage, so that data in a predetermined memory address range can be referred to at random. Since arithmetic data and memory data propagate at the same speed, load instructions belonging to the same iteration can refer to the same memory address range regardless of which stage the small cache memory 15 is referred to.
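  • The claim that load instructions of the same iteration see the same address range at every stage can be checked with a small simulation. The sketch below is a simplified assumption: it models a single way instead of three, treats word index i as being read in cycle i for iteration i, and gives each stage a small cache that keeps only the last `window` words. The function `simulate` and its parameters are hypothetical.

```python
from collections import deque

# Illustrative single-way model; timing and cache policy are assumptions.


def simulate(num_words, num_stages, window):
    """Propagate one word per cycle through the stages and record which word
    indices each stage's small cache holds when iteration i reaches it."""
    latches = [None] * num_stages                    # word in flight at each stage
    caches = [deque(maxlen=window) for _ in range(num_stages)]
    visible = {}                                     # (iteration, stage) -> indices
    for cycle in range(num_words + num_stages):
        new_word = cycle if cycle < num_words else None
        latches = [new_word] + latches[:-1]          # move every word one stage on
        for stage, word in enumerate(latches):
            if word is not None:
                caches[stage].append(word)           # small cache captures the word
                visible[(word, stage)] = tuple(caches[stage])
    return visible


visible = simulate(num_words=8, num_stages=3, window=3)
print(visible[(5, 0)], visible[(5, 2)])  # (3, 4, 5) (3, 4, 5)
```

  • Because the memory words and the iteration advance one stage per cycle together, iteration 5 finds words 3 to 5 in the small cache whether it is at stage 0 or stage 2 in this model.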
  • the load / store unit 16 can be used even if an element near the array element of interest is required in each loop iteration.
  • the load / store instruction can be arranged at an arbitrary stage.
  • LAPP 103 can deal with the memory reference patterns shown in Table 1 using the above characteristics. Note that a wide range of random offsets can be handled only at the stage where the medium capacity memory 13 is directly connected. In addition, for the update type in which the load contents are changed and stored at the same address, the store data is stored in the original array with one round in the depth direction.
  • FIG. 4 is a schematic diagram for explaining data supply from the cache memory 14 in the LAPP 103.
  • the LAPP of the present invention employs a configuration in which medium capacity memories are distributed and arranged, as with the LAPP 103 described above, but does not provide a regular data path for unconditionally propagating data read from the medium capacity memory to subsequent stages. This prevents an increase in the number of inter-stage data paths, which was a problem with the LAPP 103 described above.
  • FIG. 5 shows a configuration in which one medium capacity memory is arranged every four stages.
  • the number of stages is not limited to four.
  • any medium-capacity memory may be arranged for each “bundle” (arithmetic unit bundle) in which a plurality of “stages” composed of one or a plurality of arithmetic units are connected (multistage configuration).
  • the LAPP of the present invention is a multistage configuration of a plurality of such “bundles”. Therefore, in the configuration of FIG. 5, the propagation mechanism between the small cache memories 15 shown in FIGS. 3 and 4 is unnecessary.
  • FIG. 6 is a detailed configuration diagram of a memory system including a medium capacity memory.
  • black squares mainly indicate output latches
  • white squares indicate latches used as calculation inputs other than outputs.
  • Each number attached to the right side is a bit width.
  • The LAPP of the present invention differs from the above-mentioned LAPP 103 in that one way of the cache memory is mounted in the medium capacity memory while that memory can be divided into a plurality of blocks. Further, by combining one base address and six offsets, it is possible to execute load instructions using six addresses for one way.
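  • The combination of one base address with six offsets can be written out as a minimal sketch. The function name `effective_addresses` and the example values are assumptions; the text does not state the address units or the offset ranges.

```python
# Illustrative address generation; values and units are assumptions.


def effective_addresses(base, offsets):
    """Form six effective addresses for one way from one base and six offsets."""
    if len(offsets) != 6:
        raise ValueError("expected exactly six offsets")
    return [base + off for off in offsets]


print(effective_addresses(0x1000, [0, 4, 8, 12, 16, 20]))
# [4096, 4100, 4104, 4108, 4112, 4116]
```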
  • the usable address range is constrained while corresponding to the reference pattern shown in Table 1.
  • a 6-read, 2-write memory function is physically realized using a general memory having one port for reading and one port for writing.
  • The LAPP 1 of the present invention mainly includes a computing unit network composed of a plurality of computing units 21, and a plurality of memory systems (data supply devices) 22 each including one way of a cache memory (not shown).
  • each memory system 22 is arranged at every four stages in an arithmetic unit network including a plurality of arithmetic units 21.
  • Each memory system 22 corresponds to each way in a cache memory (not shown) and exchanges data with the corresponding way.
  • the result of address calculation based on the address information supplied from the previous stage is stored in a plurality of latches (address holding units) 23 in front (upper part) of the memory system 22.
  • The medium-capacity memory and the like in the memory system 22 are then referenced, and the values read are stored in a plurality of latches 24 behind (below) the memory system 22. In the next cycle, these values are used as inputs to the plurality of computing units 21, and the computation results are stored.
  • The calculation results obtained after passing through the first-stage and second-stage computing units 21 from the bottom are stored in a plurality of latches 25 at the bottom. In the next cycle, the operation results stored in the plurality of latches 25 can be stored in the memory system 22, sent to the subsequent stage, or both.
  • FIG. 6 is a diagram showing a configuration of the memory system 22 shown in FIG.
  • The memory system 22 mainly includes a memory unit 31 divided into a plurality of blocks (here, four blocks), a connection unit 32 for connecting blocks adjacent to each other, and a shift register (shift register unit) 33.
  • the shift register 33 has a plurality of registers connected in a line.
  • The plurality of latches 23 in FIG. 5 include a plurality of latches (first address storage circuits) 23-1, 23-2, 23-4, 23-5, each connected to a block of the memory unit 31 so as to correspond to the blocks on a one-to-one basis, and a plurality of latches (second address storage circuits) 23-3, 23-6 that are not connected to any of the blocks of the memory unit 31.
  • the latches 23-3 and 23-6 may be associated with the blocks divided from the memory unit 31, respectively. Conversely, the latches 23-1, 23-2, 23-4, and 23-5 may not be connected to any block of the memory unit 31. In short, the memory unit 31 is divided into a plurality of blocks, and there may be a latch associated with each block.
  • the first case (1) in Table 1 is a case in which a wide range of addresses are referenced randomly. As shown in FIG. 7, when the base address is set in the LD-BASE 201 and the offset is set in the latch 202, the offset is added to the base address and the effective address A0 is designated.
  • Once the effective address A0 is stored in the latch 23-1, it is supplied in the next cycle to the latch 203 of “way0.blk0”, which is one block of the memory unit 31. Similarly, the effective address A0 is supplied to the latch 204 of “way0.blk1”, which is another block of the memory unit 31 adjacent to “way0.blk0”.
  • the value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32.
  • The connection unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1.
  • The data selected by the connection unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.
  • The connection function for connecting “way0.blk1” to “way0.blk0” as described above need not be used. That is, the data read from “way0.blk0” may be output to O0 of the latch 24.
  • the offset is added to the base address and the effective address A3 is designated.
  • Once the effective address A3 is stored in the latch 23-4, it is supplied to the latch 206 of “way0.blk2” and the latch 207 of “way0.blk3”, two blocks of the memory unit 31, and the value read from each block is sent to the connection unit 32.
  • The connection unit 32 selects one of the values using the upper bits of the effective address A3 stored in the latch 23-4, and outputs it to O3 of the latch 24 via the selector 33-5 of the shift register 33.
  • the second case (2) in Table 1 is a case in which six locations are referenced at the same time, although there are restrictions on the range of relative addresses based on a monotonically increasing address.
  • the base address set in the LD-BASE 301 is stored in the latch 23-1 as the effective address A0 via the latch 302.
  • the effective address A0 is supplied to the latch 303 of “way0.blk0” that is one block of the memory unit 31.
  • the effective address A0 is supplied to the latch 304 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.
  • the value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32.
  • The connection unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1.
  • The data selected by the connection unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33 (at this time, as in the first case described above, “way0.blk2” and “way0.blk3” may be connected to each other).
  • the latch 305 has an offset “ ⁇ e”
  • the latch 306 has an offset “ ⁇ d”
  • the latch 307 has an offset “ ⁇ c”
  • the latch 308 has an offset “ ⁇ b”
  • the latch 309 has an offset “ ⁇ a”.
  • Each offset is set in the latches 23-2, 23-3, 23-4, 23-5, and 23-6 as effective addresses A1, A2, A3, A4, and A5, respectively.
  • The data selected by the connection unit 32 is written to the top register 33-2 of the shift register 33 via the selector 33-1.
  • Using the effective addresses A5, A4, A3, A2, and A1 set in the latches 23-6, 23-5, 23-4, 23-3, and 23-2, respectively, addresses within the range that can be stored in the shift register 33 are designated. Thereby, addresses near the effective address A0 can be referred to simultaneously.
  • The effective addresses A5, A4, A3, A2, and A1 are values representing the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, they indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to for the values to be output to O5, O4, O3, O2, and O1 of the latch 24.
  • For each of the effective addresses A5, A4, A3, A2, and A1, a mechanism is required that compares the register positions of the shift register 33 with the address information and reads the contents of the matching register out to O5, O4, O3, O2, and O1 of the latch 24, respectively. Such a mechanism can be easily realized because the shift register 33 is small.
  • the third case (3) in Table 1 is based on a monotonically increasing address, and refers to six locations at the same time, although there are restrictions on the range of relative addresses.
  • The difference from the above-mentioned second case (2) is that the six addresses also increase monotonically.
  • In the second case (2), the offset is a random offset such as “ ⁇ a”, “ ⁇ b”, “ ⁇ c”, “ ⁇ d”, or “ ⁇ e”.
  • In the third case (3), the offset is fixed.
  • the offset is set using the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33. In other words, this is handled by a mechanism that reads directly from the shift register 33.
  • the base address set in the LD-BASE 401 is stored in the latch 23-1 as the effective address A0 via the latch 302.
  • the effective address A0 is supplied to the latch 403 of “way0.blk0” which is one block of the memory unit 31.
  • the effective address A0 is supplied to the latch 404 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.
  • the value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32.
  • The connection unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1.
  • The data selected by the connection unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.
  • The data selected by the connection unit 32 is written to the top register 33-2 of the shift register 33 via the selector 33-1.
  • In the third case, it is not necessary to set effective addresses A5, A4, A3, A2, and A1 that indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to for the values to be output to O2 to O5 of the latch 24. This is because, unlike the second case described above, the offset is fixed in the third case. The positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 therefore suffice to specify which register value should be referred to for each value to be output to O2 to O5 of the latch 24. That is, it can be said that the effective addresses A5, A4, A3, A2, and A1 are set by the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7.
  • the power consumption of the memory system 22 can be reduced.
  • the memory system 22 can also handle the third case (3).
  • the fourth case (4) in Table 1 is a case where two sets of access patterns that refer to three locations at the same time are required, although the range of relative addresses is limited based on a monotonically increasing address.
  • the base address set in the LD-BASE 501 is stored in the latch 23-1 as the effective address A0 via the latch 502.
  • the effective address A0 is supplied to the latch 503 of “way0.blk0”, which is one block of the memory unit 31.
  • the effective address A0 is supplied to the latch 504 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.
  • the value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32.
  • The connection unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1.
  • The data selected by the connection unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.
  • The data selected by the connection unit 32 is written to the top register 33-2 of the shift register 33 via the selector 33-1.
  • Addresses near the effective address A0 can be referred to simultaneously.
  • Effective addresses A1 and A2 are set using the positions of the registers 33-2 and 33-3 of the shift register 33.
  • The effective addresses A2 and A1 each need a mechanism that compares the register positions of the shift register 33 with the address information and reads the register contents of the matching positions to O2 and O1 of the latch 24, respectively.
  • Such a mechanism can be easily realized because the shift register 33 is small.
  • the base address newly set in the LD-BASE 501 is stored in the latch 23-4 as the effective address A3 via the latch 505.
  • the effective address A3 is supplied to the latch 506 of “way0.blk2” which is one block of the memory unit 31.
  • The effective address A3 is also supplied to the latch 507 of “way0.blk3”, which is another block of the memory unit 31 adjacent to “way0.blk2”.
  • the value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32.
  • The connection unit 32 performs the above selection using the upper bits of the effective address A3 stored in the latch 23-4.
  • The data selected by the connection unit 32 is output to O3 of the latch 24 via the selector 33-5 of the shift register 33.
  • the fourth case (4) is different from the second case (2) in that the data flow is divided in the middle of the shift register 33. Therefore, the selector 33-5 for interrupting the value read from “way0.blk2” is required in the middle of the shift register 33.
  • the data selected by the linking unit 32 is written to the register 33-6 in the middle of the shift register 33 via the selector 33-1.
  • Addresses near the effective address A3 can be referred to simultaneously.
  • Effective addresses A4 and A5 are set using the positions of the registers 33-6 and 33-7 of the shift register 33. For this reason, the effective addresses A5 and A4 each need a mechanism for comparing the register positions of the shift register 33 with the address information and reading the register contents of the matching positions to O5 and O4 of the latch 24, respectively. Such a mechanism can be easily realized because the shift register 33 is small.
  • The fifth case (5) in Table 1 is a case in which, while references are based on a monotonically increasing address with a restricted range of relative addresses, “way0.blk3”, “way0.blk2”, and “way0.blk1” can be accessed at the same time.
  • the base address set in the LD-BASE 601 is stored in the latch 23-1 as the effective address A0 via the latch 602.
  • the effective address A0 is supplied to the latch 606 of “way0.blk0”, which is one block of the memory unit 31.
  • the value read from “way0.blk0” is output to O0 of the latch 24 via the connection unit 32 and the selector 33-1 of the shift register 33.
  • the offset “ ⁇ b” is set in the latch 610, and the offset “ ⁇ a” is set in the latch 611. Each offset is set in latches 23-3 and 23-6 as effective addresses A2 and A5, respectively.
  • the selector 33-1 writes the value read from “way0.blk0” into the first register 33-2 of the shift register 33.
  • The effective addresses A2 and A5, set in the latches 23-3 and 23-6 respectively, are used to specify addresses within the range that can be stored in the shift register 33. Thereby, addresses near the effective address A0 can be referred to simultaneously.
  • Effective addresses A5 and A2 are values representing the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, it indicates which value of the registers 33-2, 33-3, 33-4, 33-6, 33-7 should be referred to as a value to be output to O5, O2 of the latch 24.
  • For example, suppose the effective address A2 refers to the register 33-2 and the effective address A5 refers to the register 33-3.
  • In that case, the value of the register 33-2 is output to O2 of the latch 24, and the value of the register 33-3 is output to O5 of the latch 24.
  • The effective addresses A5 and A2 each need a mechanism that compares the register positions of the shift register 33 with the address information and reads the register contents of the matching positions to O5 and O2 of the latch 24, respectively.
  • the base address newly set in the LD-BASE 601 is stored in the latch 23-2 as the effective address A1 via the latch 603.
  • the effective address A1 is supplied to the latch 607 of “way0.blk1” which is one block of the memory unit 31.
  • the value read from “way0.blk1” is output to O1 of the latch 24.
  • the base address newly set in the LD-BASE 601 is stored in the latch 23-4 as the effective address A3 via the latch 604.
  • the effective address A3 is supplied to the latch 608 of “way0.blk2” which is one block of the memory unit 31.
  • the value read from “way0.blk2” is output to O3 of the latch 24.
  • the base address newly set in the LD-BASE 601 is stored in the latch 23-5 as the effective address A4 via the latch 605.
  • the effective address A4 is supplied to the latch 609 of “way0.blk3” which is one block of the memory unit 31.
  • the value read from “way0.blk3” is output to O4 of the latch 24.
  • The effective addresses A4, A3, and A1 are directly connected to “way0.blk3”, “way0.blk2”, and “way0.blk1”, respectively, and the values read are written to O4, O3, and O1.
  • the sixth case (6) in Table 1 is a case where the read memory value is updated and written to the original memory, as shown in FIG. This can be realized by using a data path (feedback mechanism) 26 returning from the plurality of arithmetic units 21 to the memory system 22 shown in FIG.
  • the read memory value (ST-value) 612 is supplied to the latch 614 and the latch 615 of “way0.blk0”, which is one block of the memory unit 31.
  • each data supplied to the latch 614 and the latch 615 is written to “way0.blk0” using the base address set in the ST-base 613.
  • a combination of a medium-capacity memory and a small-capacity shift register makes it possible to issue a large number of load instructions to a certain range of memory space.
  • Self-updating memory references including floating point operations (multiple cycles) can be arranged in multiple stages without increasing interstage wiring.
  • FIGS. 13 and 14 are instruction sequences when an example of image processing is realized by the prior art and the present invention, respectively.
  • load instructions are arranged at each stage on the assumption that load data is sequentially propagated.
  • load instructions are arranged in the fourth, eighth, and twelfth stages, and neighboring data is extracted from the ways belonging to each stage and input to the computing unit.
  • a mechanism for unconditionally propagating load data is not necessary, and at the same time, the number of stages for storing programs is reduced from 24 to 19 stages.
  • In the present invention, a way can be reused in different stages by shifting the instruction mapping downward by four stages, without moving the contents of the medium-scale memory distributed across the stages. That is, the last stage and the first stage are connected in a ring structure.
  • The memory contents of stages 8 and 12 are left in place, and only the newly required memory data is placed in stage 16. An instruction can thereby be executed using the memory contents of stages 8, 12, and 16.
  • In FIG. 16 of the present invention, a data path for directly propagating load data and store data is not necessary.
  • In the eighth, twelfth, and sixteenth stages, update-type load, calculation, and store sequences can be mapped. Compared to the prior art, four times as many instructions can be mapped, and the processing performance is increased fourfold.
  • a FIFO unit having a plurality of first-in first-out (FIFO) buffers can be arranged.
  • each FIFO buffer of the FIFO unit is arranged to correspond to each of the effective addresses A5, A4, A3, A2, A1, and A0 on a one-to-one basis.
  • At the positions corresponding one-to-one to the effective addresses A5, A4, A3, A2, A1, and A0, that is, at the positions of the selector 33-1 and the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33, the FIFO buffers of the FIFO unit are arranged.
  • Each FIFO buffer of the FIFO unit has one selector and five registers, similar to the selector 33-1 and the registers 33-2, 33-3, 33-4, 33-6, and 33-7.
  • In the shift register 33, data is supplied from the memory unit 31 only to the selector 33-1 (here, attention is paid to the data supply to the selector 33-1; the data supply to the selector 33-5 is not considered).
  • In the FIFO unit, data is supplied from the memory unit 31 to the selector of each FIFO buffer.
  • The data read from one of the registers of each FIFO buffer is output to the output of the latch 24 corresponding to that FIFO buffer.
  • For example, in the FIFO buffer corresponding to the effective address A5, one of the five registers of the FIFO buffer is read using the effective address A5, and the read data is output to O5 of the latch 24. Similar processing is performed in the other FIFO buffers.
  • The present invention can also be expressed as follows. That is, the present invention is an accelerator configuration method in which one storage system is connected to each bundle of connected stages, each stage being composed of one or more arithmetic units. Each storage system includes a memory and a shift register, and the data read from the memory is input to the top or the middle of the shift register. The memory and the shift register are referenced using a plurality of pieces of address information input to the storage system, and the contents corresponding to each piece of address information are read out.
  • the memory unit divided into a plurality of blocks includes an address holding unit that holds address information for each block, and further includes an address holding unit that is not connected to the memory unit, and these address holding units It is preferable to read the register by specifying the register position in the shift register using the address information.
  • It is preferable that another block is also read using the data in the address holding unit provided for each block, and that one of the data read from the plurality of blocks is selected using some of the bits of the address information.
  • The data supply device supplies data to a computing unit bundle in which a plurality of computing units are configured in multiple stages, and includes a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a row. The shift register unit writes data read from the memory unit to its head register or a middle register. The memory unit and the shift register unit are each referred to based on a plurality of pieces of address information input to the data supply device, and output the contents of the address position corresponding to each piece of address information.
  • one memory unit is divided into a plurality of blocks, and data read from each block can be written to a register at the head or in the middle of the shift register unit.
  • Each of the memory section and the shift register section is referred to based on a plurality of address information input to the data supply device, and can output the contents of each address position corresponding to each address information.
  • the data supply device further includes a plurality of address holding units that respectively hold a plurality of address information input to the data supply device, and the plurality of address holding units correspond to each block of the memory unit on a one-to-one basis. It is preferable to include a plurality of first address storage circuits connected to each block and a plurality of second address storage circuits not connected to any of the blocks of the memory unit.
  • the data finally output from the shift register unit can be determined using the address information referring to the memory unit and the address information referring to the shift register unit.
  • The shift register unit preferably includes a selector that selects one of the data read from two different blocks of the memory unit, and uses the selector to select one of the data read from the two blocks based on some of the bits of the address information held in the first address storage circuit.
  • the data supply device further includes a feedback mechanism capable of writing the operation results of one or more arithmetic units constituting the final stage of the arithmetic unit bundle into the memory unit.
  • the output values from the memory unit and the shift register unit can be rewritten to the memory unit.
  • each address information held in each first address storage circuit includes an offset set in the address information input to the data supply device, and the offset is added to the address information.
  • Each piece of address information held in the second address storage circuits is preferably an offset set for the address information input to the data supply device. More preferably, the shift register unit determines an output value from each register using the offset.
  • the memory unit and the shift register unit can be referred to using the address information obtained by adding a random offset to the input address information.
  • the shift register unit determines an output value from each register by using the position of each register as the offset.
  • the memory unit and the shift register unit can be referred to using the address information obtained by adding a fixed offset to the input address information.
  • The shift register unit preferably uses the positions of some of the registers as offsets set for one piece of address information input to the data supply device to determine the output values from those registers, and uses the positions of the remaining registers as offsets set for another piece of address information input to the data supply device to determine the output values from those remaining registers.
  • For any piece of input address information, the shift register unit can be referred to using the address information obtained by adding a fixed offset to it.
  • The shift register unit preferably uses an offset set for one piece of address information input to the data supply device to determine the output value from each of the registers, and uses the remaining address information input to the data supply device to output data read from the blocks of the memory unit as output values from the registers.
  • The memory unit and the shift register unit can be referred to using the address information obtained by adding the offset to one piece of input address information, and the memory unit can be referred to using the remaining input address information.
  • The data processing apparatus is a data processing apparatus in which a plurality of the arithmetic unit bundles are configured in multiple stages. When the next high-speed execution is started after a certain series of high-speed executions, and the contents of the memory unit of the data supply device that supplies data to an arithmetic unit bundle can be used by another operation instruction, the mapping of operation instructions to the arithmetic units constituting the arithmetic unit bundle is changed.
  • the data processing device is a data processing device for executing an instruction code composed of a plurality of lines of machine language instructions, corresponding to a plurality of register numbers described in the instruction code, and each register number And a second register file unit including a plurality of second registers corresponding to each of the first registers of the first register file unit.
  • n is an integer greater than or equal to 1) register file units, and machine of any of the plurality of lines of machine language instructions using read data of each first register of the first register file unit
  • a first arithmetic unit that executes a calculation using a word instruction, which is one stage of the multi-stage configuration, and a machine used by the first arithmetic unit among any of the machine language instructions of the plurality of rows
  • An n number of holding units including a first holding unit which is an output destination of the calculation result of the first calculation unit and temporarily holds the calculation result of the first calculation unit; The unit transfers the data to the second register of the second register file unit corresponding to the first register that holds the data that is not subject to the arithmetic processing
  • the data in each first register in the first register file unit is transferred to each second register in the second register file unit corresponding to each first register in the first register file unit.
  • The second arithmetic unit can use the data read from the second register of the second register file unit to execute operations.
  • the calculation result of the first calculation unit is transferred to the second calculation unit.
  • the second calculation unit can use the calculation result of the first calculation unit for execution of the calculation immediately after the calculation by the first calculation unit.
  • the n register file units further include a third register file unit including a plurality of third registers corresponding to the second registers of the second register file unit, and the n operation units A unit that performs an operation using a machine language instruction that is different from a machine language instruction used by the first operation unit and the second operation unit among any of the machine language instructions of the plurality of rows; A third operation unit which is a certain stage of the configuration, wherein the n holding units are output destinations of the operation result of the second operation unit when the second operation unit executes the operation; and A second holding unit that temporarily holds a calculation result of the second calculation unit is further included, and the second register file unit holds data that is not subject to calculation processing by the second calculation unit.
  • the second calculation unit When the data is transferred to the third register of the corresponding third register file unit and the second holding unit holds the calculation result of the second calculation unit, the second calculation unit The output destination of the calculation result is the third calculation unit, the calculation result of the second calculation unit is transferred to the third calculation unit, and the third calculation unit transfers each third register of the third register file unit.
  • An operation using at least one of the read data of the data and an operation result transferred by the second holding unit, an operation executed by the first operation unit, and an operation executed by the second operation unit Parallel processing is preferable.
  • the data in each second register in the second register file unit is transferred to each third register in the third register file unit corresponding to each second register in the second register file unit.
  • The third calculation unit can use the data read from the third register of the third register file unit to execute operations, even when the data of the second register of the second register file unit is used for execution of the calculation of the second calculation unit.
  • the calculation result of the second calculation unit is transferred to the third calculation unit.
  • the third calculation unit can use the calculation result of the second calculation unit for execution of the calculation immediately after the calculation by the second calculation unit.
  • the Nth holding unit included in the n holding units (N is an integer of 1 or more and n or less) has a calculation result held by itself in the n calculating units.
  • the calculation result is transferred to the (N + 2) th register file unit included in the n number of register file units while being held by itself.
  • the calculation result to be performed is not used for the execution of calculation by the (N + 2) th and subsequent calculation units, it is preferable to transfer the calculation result to the (N + 1) th calculation unit included in the n calculation units. .
  • the calculation result held by the Nth holding unit is not used for the calculation execution by the (N + 2) th and subsequent calculation units, the calculation result is transferred to the (N + 1) th calculation unit.
  • unnecessary data transfer between the register file units is reduced, and as a result, power consumption can be further reduced.
  • the present invention can be suitably used for data supply to a data processing apparatus that has a plurality of arithmetic units and can perform arithmetic processing by each arithmetic unit synchronously.
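
The reuse of ways by shifting the instruction mapping around a ring, described in the list above (memory contents at stages 4, 8, and 12, then at stages 8, 12, and 16 after a shift of four stages), can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function names `shift_stage` and `plan_remap` are hypothetical, stages are numbered from 1, and a ring of 16 stages is assumed for the example because the text does not state the ring size.

```python
def shift_stage(stage, shift, ring_size):
    """Move a 1-based stage number down by `shift` stages, wrapping around the ring."""
    return (stage - 1 + shift) % ring_size + 1


def plan_remap(memory_stages, shift, ring_size):
    """Plan a remapping of the instructions by `shift` stages.

    Returns a tuple (used, kept, to_fill):
      used    - stages whose memory is referenced after the shift
      kept    - stages among `used` whose contents are already in place
      to_fill - stages among `used` that need newly arranged memory data
    """
    used = [shift_stage(s, shift, ring_size) for s in memory_stages]
    kept = [s for s in used if s in memory_stages]
    to_fill = [s for s in used if s not in memory_stages]
    return used, kept, to_fill


if __name__ == "__main__":
    # Assumed ring of 16 stages with memory systems at stages 4, 8 and 12.
    print(plan_remap([4, 8, 12], shift=4, ring_size=16))
```

With these assumed numbers, the plan keeps the contents of stages 8 and 12 and asks only for stage 16 to be filled, which is the behaviour the text describes.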

Abstract

A memory system (22) for providing data to a computing unit cluster comprising multiple computing units configured in multiple stages is provided with a memory unit (31) partitioned into multiple blocks, and a shift register (33) comprising multiple registers connected in series.
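
As a rough illustration of the arrangement summarized above, the following sketch models a memory unit partitioned into blocks together with a small shift register in software. It is not the circuit of the invention: the class name `MemorySystemModel`, the block size of four words, and the five-entry shift register are assumptions made for the example. Every block is read at the same in-block position and the upper part of the effective address selects one value, and register positions in the shift register act as offsets for referring to recently loaded neighbouring data.

```python
from collections import deque


class MemorySystemModel:
    """Toy model of one memory system: memory blocks, block selection, shift register."""

    def __init__(self, data, block_size=4, depth=5):
        if len(data) % block_size != 0:
            raise ValueError("data length must be a multiple of block_size")
        self.block_size = block_size
        # Memory unit partitioned into equally sized blocks.
        self.blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        # Shift register: position 0 is the head register (most recent value).
        self.shift = deque([None] * depth, maxlen=depth)

    def read(self, base, offset=0):
        """Read one word at the effective address base + offset."""
        effective = base + offset
        upper, lower = divmod(effective, self.block_size)
        # Every block is read at the same in-block position ...
        candidates = [block[lower] for block in self.blocks]
        # ... and the upper address bits select one of the values read.
        return candidates[upper]

    def load_and_shift(self, base):
        """Read the word at `base`, return it, and push it into the head register."""
        value = self.read(base)
        self.shift.appendleft(value)  # the oldest entry falls off the far end
        return value

    def neighbours(self, positions):
        """Return the contents of the shift-register positions given as offsets."""
        return [self.shift[p] for p in positions]


if __name__ == "__main__":
    model = MemorySystemModel(list(range(100, 116)))
    print(model.read(5, 2))              # word at effective address 7
    for address in range(6):             # monotonically increasing base address
        model.load_and_shift(address)
    print(model.neighbours([0, 1, 2]))   # fixed offsets: consecutive register positions
    print(model.neighbours([4, 0, 2]))   # arbitrary offsets within the shift register
```

In this model only one memory read is issued per step; the additional simultaneous references are served from the shift register, which is the reason a small register chain can stand in for extra read ports when the referenced addresses stay close to a monotonically increasing base address.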

Description

データ供給装置及びデータ処理装置Data supply device and data processing device
 本発明は、複数の演算器を有し、各演算器による演算処理を同期して行なうことができるデータ処理装置に係り、特に、当該データ処理装置へのデータ供給に好適なデータ供給手法に関するものである。 The present invention relates to a data processing apparatus having a plurality of arithmetic units and capable of performing arithmetic processing by each arithmetic unit synchronously, and more particularly to a data supply method suitable for supplying data to the data processing unit. It is.
 近年のマイクロプロセッサにおいては、マシンサイクルを短縮するとともに、1マシンサイクル当たりに実行される命令の数を増やすことにより、実効性能の向上を図る方式が多く提案されている。 In recent microprocessors, many methods for improving the effective performance by shortening the machine cycle and increasing the number of instructions executed per machine cycle have been proposed.
 このような多数の命令を並列に処理する方式として、例えば、演算器アレイ方式が知られている。この演算器アレイ方式は、目的とするデータ処理に合わせて演算器ネットワークを固定し、その固定された演算器ネットワークに入力データを流し込む方式である(例えば、特許文献1~3を参照)。 For example, an arithmetic unit array method is known as a method for processing such a large number of instructions in parallel. This arithmetic unit array method is a method in which an arithmetic unit network is fixed in accordance with target data processing, and input data is poured into the fixed arithmetic unit network (see, for example, Patent Documents 1 to 3).
 この演算器アレイ方式では、複数の演算器からなる演算器ネットワークを利用することにより、多くの機能を並列実行することが可能である。 In this arithmetic unit array method, it is possible to execute many functions in parallel by using an arithmetic unit network composed of a plurality of arithmetic units.
 しかし、演算器アレイ方式は、既存の機械語命令を実行することができない。このため、この演算器アレイ方式に特有の機械語命令を生成するための専用の機械語命令生成手段が必要であり、汎用性に欠けている。 However, the arithmetic unit array method cannot execute existing machine language instructions. For this reason, a dedicated machine language instruction generation means for generating machine language instructions peculiar to this arithmetic unit array system is necessary, and lacks versatility.
Known schemes that execute general machine language instructions and can execute them in parallel include, for example, the superscalar scheme, the vector scheme, and the VLIW (Very Long Instruction Word) scheme. In these schemes, a plurality of operations are specified in one instruction and executed simultaneously.
First, the superscalar scheme is one in which hardware dynamically detects simultaneously executable machine language instructions in a machine language instruction sequence and executes them in parallel.
The superscalar scheme has the advantage that existing software assets can be used as they are, but it has recently tended to be avoided because of the complexity of its mechanism and its high power consumption.
Next, the vector scheme repeatedly applies basic operations such as load, arithmetic, and store to vector registers, each of which consists of a large number of registers arranged in one dimension, and it enables power-efficient speedup. Furthermore, because no cache memory is needed, the data transfer rate between main memory and the vector registers is guaranteed, and stable speedup is achieved as a result.
However, the vector scheme permits only operations between elements with the same element number in different vector registers, and is not suited to programs that proceed while referring to adjacent elements in the same vector register.
Finally, the VLIW scheme is one in which a plurality of operations are specified in one instruction and executed simultaneously. For example, four instructions are fetched simultaneously and decoded simultaneously, the necessary data is read from general-purpose registers, a plurality of arithmetic devices operate simultaneously, and each operation result is stored in operation result storage means attached to the arithmetic device.
In the next cycle, the contents are read from the operation result storage means and written to the general-purpose registers; if the next operation requires a result read in this way, that result is bypassed to the input of the arithmetic device.
For a load instruction, on the other hand, the LD/ST unit refers to the cache memory and stores the load result in load result storage means attached to the LD/ST unit, and the arithmetic device then operates in the next cycle.
In this way, the VLIW scheme can execute as many operations simultaneously as there are juxtaposed arithmetic devices and LD/ST units. Furthermore, because instruction sequences that can be executed in parallel are scheduled in advance by a compiler or the like, the VLIW scheme does not need the mechanism that the superscalar scheme uses for hardware to detect simultaneously executable machine language instructions dynamically. The VLIW scheme therefore allows power-efficient instruction execution. However, executing many load/store instructions simultaneously requires a memory system with many ports. Because the area efficiency of such a memory system is extremely poor, there is a limit to how far the number of simultaneously executable instructions can be increased under the VLIW scheme.
Patent Document 1: Japanese Unexamined Patent Application Publication No. H8-83264 (published March 26, 1996)
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2001-312481 (published November 9, 2001)
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2003-76668 (published March 14, 2003)
In the arithmetic unit array scheme described above, adopting a cache scheme contributes greatly to performance. One such cache scheme incorporates primary caches throughout the arithmetic unit network and also provides a secondary cache between the network and the external main memory. This raises the hit rate of the secondary cache, reduces accesses to main memory, and improves the performance of each arithmetic unit.
When such a cache scheme is adopted, supplying data to a plurality of arithmetic units simultaneously requires a large-scale data propagation mechanism that delivers the contents of the primary caches placed throughout the arithmetic unit network to nearby arithmetic units.
Specifically, this is a mechanism in which data read from a primary cache is passed through all of the small buffers attached to the arithmetic units while each small buffer retains a fixed amount of it. In this mechanism, the many wires that read data from the primary cache every cycle are connected to many small buffers. In other words, the problem is that no consideration has been given to propagating the contents of the primary cache to the arithmetic units efficiently.
In view of the above problem, an object of the present invention is to provide a data supply device that, for a data processing device having a plurality of arithmetic units and capable of performing the arithmetic processing of each unit synchronously, supplies data to the data processing device efficiently and thereby reduces the power consumption of each arithmetic unit.
In the conventional arithmetic unit array scheme, primary caches that each aggregate a plurality of ways are placed throughout the arithmetic unit network, and data is supplied through small buffers to arithmetic units that are not directly connected to a primary cache. This scheme has the advantage of a high degree of freedom in placing load instructions in the machine language instruction sequence, but the disadvantage that the wiring connecting the small buffers becomes large-scale.
In the present invention, the ways of the primary cache are distributed uniformly near the arithmetic units, and the connections between small buffers are eliminated. This constrains the placement of load instructions in the machine language instruction sequence, but by changing the instruction mapping position according to the contents of the data stored in the primary cache, instruction execution capability substantially equal to that of the conventional technique is secured. That is, the problem is solved by reducing the number of wires without reducing capability.
To achieve the above object, a data supply device according to the present invention supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages, and comprises a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in series. Data read from the memory unit is written to the register at the head of, or partway along, the shift register unit. Each of the memory unit and the shift register unit is referenced on the basis of a plurality of pieces of address information input to the data supply device, and outputs the contents of the address position corresponding to each piece of address information.
That is, in this configuration one memory unit is divided into a plurality of blocks, and data read from each block can be written to the register at the head of, or partway along, the shift register unit.
Each of the memory unit and the shift register unit is referenced on the basis of the plurality of pieces of address information input to the data supply device, and can output the contents of the address position corresponding to each piece of address information.
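The arrangement described above can be pictured with a small software model. The following is only a minimal sketch under stated assumptions, not the circuit disclosed here: the class name `DataSupplySketch`, the address-to-block mapping, the `(address, value)` tag kept in each register, and the rule that a lookup checks the shift register before the memory blocks are all hypothetical choices made for illustration.

```python
class DataSupplySketch:
    """Toy model: a block-partitioned memory plus a shift register of tagged entries."""

    def __init__(self, num_blocks, block_size, num_registers):
        self.block_size = block_size
        self.blocks = [[0] * block_size for _ in range(num_blocks)]
        # Assumption: each register keeps an (address, value) pair so that it
        # can be looked up by address; None marks an empty register.
        self.shift = [None] * num_registers

    def _locate(self, address):
        # Assumption: consecutive addresses fill one block before the next.
        return divmod(address, self.block_size)

    def store(self, address, value):
        block, offset = self._locate(address)
        self.blocks[block][offset] = value

    def load_into_shift(self, address, position=0):
        """Read one word from memory and write it at `position` (0 = head)."""
        block, offset = self._locate(address)
        value = self.blocks[block][offset]
        # Entries from `position` onward move one step toward the tail;
        # the last entry falls off the end.
        self.shift[position + 1:] = self.shift[position:-1]
        self.shift[position] = (address, value)

    def read(self, addresses):
        """Return the contents for several addresses in one call."""
        results = []
        for address in addresses:
            entry = next((e for e in self.shift if e and e[0] == address), None)
            if entry is not None:
                results.append(entry[1])
            else:
                block, offset = self._locate(address)
                results.append(self.blocks[block][offset])
        return results


supply = DataSupplySketch(num_blocks=4, block_size=4, num_registers=3)
for addr in range(16):
    supply.store(addr, addr * 10)
supply.load_into_shift(5)
supply.load_into_shift(6)
print(supply.read([5, 6, 9]))
```

In this model, one call to `read` with several addresses stands in for the plurality of pieces of address information presented to the device at once.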
Using such a data supply device to supply data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages eliminates the need for data propagation between the data supply devices that serve different arithmetic unit bundles.
Consequently, the conventional large-scale data propagation mechanism for delivering the contents of primary caches placed throughout the arithmetic unit network to nearby arithmetic units is no longer needed, so data can be supplied to the data processing device efficiently and the power consumption of each arithmetic unit can be reduced.
A data processing device according to the present invention is one in which a plurality of the arithmetic unit bundles are arranged in multiple stages. When the next high-speed execution is started after a series of high-speed executions, and the contents of the memory unit of the data supply device that supplies data to a given arithmetic unit bundle can be used by another arithmetic instruction, the device changes the mapping of arithmetic instructions onto the arithmetic units constituting the arithmetic unit bundles.
With this configuration, changing the instruction mapping position according to the contents of the data stored in the memory unit of the data supply device secures instruction execution capability equal to that of the conventional technique.
As described above, the data supply device of the present invention supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages, and comprises a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in series. Data read from the memory unit is written to the register at the head of, or partway along, the shift register unit. Each of the memory unit and the shift register unit is referenced on the basis of a plurality of pieces of address information input to the data supply device, and outputs the contents of the address position corresponding to each piece of address information.
Therefore, for a data processing device having a plurality of arithmetic units and capable of performing the arithmetic processing of each unit synchronously, data can be supplied to the data processing device efficiently, with the effect that the power consumption of each arithmetic unit can be reduced.
A diagram showing the configuration of a LAPP in one embodiment of the present invention.
A diagram showing the configuration of a LAPP in another embodiment of the present invention.
A diagram showing the configuration of a LAPP in which the three-stage configuration consisting of the first to third data processing stages of the above LAPP is extended to a configuration of N data processing stages.
A schematic diagram for explaining data supply from the cache memory in the above LAPP.
A schematic diagram for explaining a configuration in which one medium-capacity memory is placed every four stages.
A detailed configuration diagram of a memory system including the medium-capacity memory.
An explanatory diagram for explaining the operation of the above memory system.
An explanatory diagram for explaining the operation of the above memory system.
An explanatory diagram for explaining the operation of the above memory system.
An explanatory diagram for explaining the operation of the above memory system.
An explanatory diagram for explaining the operation of the above memory system.
An explanatory diagram for explaining the operation of the above memory system.
A diagram showing the instruction sequence when an example of image processing is implemented with the conventional technique.
A diagram showing the instruction sequence when an example of image processing is implemented with the present invention.
A diagram showing the instruction sequence when an example of floating-point arithmetic processing is implemented with the conventional technique.
A diagram showing the instruction sequence when an example of floating-point arithmetic processing is implemented with the present invention.
Embodiments of the present invention are described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, so detailed descriptions of them are not repeated.
(Prerequisite technology of the present invention)
The present invention relates to a data supply technique for a computer architecture in which a large number of arithmetic units are juxtaposed. It is particularly closely related to memory reference mechanisms that handle the memory reference patterns shown in Table 1.
Figure JPOXMLDOC01-appb-T000001
In general, the two competing configurations for the memory reference mechanism in such a data supply technique are the vector mechanism and the many-core processor. For a program consisting of completely regular memory references and operations, that is, a program whose required memory performance and arithmetic performance are in a 1:1 ratio, the vector mechanism is optimal. A vector mechanism can use up both memory performance and arithmetic performance by overlapping the execution of vector load instructions and vector arithmetic instructions.
In practice, however, memory references usually contain an element of random access. The vector mechanism cannot handle references that are regular globally but random locally (for example, referring to array subscripts I-1, I, and I+1 simultaneously).
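The difference between the two access patterns can be made concrete with a short illustration. This is a sketch only; the function names `elementwise_add` and `three_point_sum` are hypothetical and do not come from the specification. The first function combines only elements with the same index, which is the pattern a vector mechanism handles well. The second needs subscripts I-1, I, and I+1 of the same array for every output element.

```python
def elementwise_add(a, b):
    """Vector-style operation: element i of a combines only with element i of b."""
    return [x + y for x, y in zip(a, b)]


def three_point_sum(a):
    """Stencil-style operation: output i needs a[i-1], a[i], and a[i+1] at once."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
print(elementwise_add(a, b))
print(three_point_sum(a))
```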
A many-core processor, on the other hand, can handle such random access, but maintaining a 1:1 ratio of memory performance to arithmetic performance requires extremely sophisticated superscalar functionality. In particular, to overlap address calculation, memory reference, and arithmetic completely, how well the address calculation can be hidden becomes important.
To meet this requirement, the following arithmetic unit array accelerator, a Linear Array Pipeline Processor (hereinafter "LAPP"), can be used, for example. This LAPP (data processing device) adopts a coarse-grained reconfigurable architecture (hereinafter "CGRA") in which a plurality of arithmetic units are arranged in a two-dimensional array, and it uses existing machine language instructions.
FIG. 1 shows the configuration of the LAPP described above. As shown in FIG. 1, the LAPP 101 includes a configuration memory 10, a first register file unit 110, a second register file unit 210, a first arithmetic device (first arithmetic section, first holding section) 120, and a second arithmetic device (second arithmetic section, second holding section) 220.
The configuration memory 10 is part of a known CGRA and stores configuration data, which defines the processing performed in the first arithmetic device 120 and the second arithmetic device 220. The configuration memory 10 transfers this configuration data to the first register file unit 110 and the second register file unit 210.
The first register file unit 110 holds the data needed for arithmetic processing in the first arithmetic device 120. It has a register group 111 consisting of a plurality of registers (first registers) r0 to r11, and a transfer unit 112 for transferring data read from the registers r0 to r11 of the register group 111 out of the first register file unit 110.
Reads from and writes to the registers r0 to r11 of the register group 111 are executed according to the configuration data stored in the configuration memory 10. Each register is read or written using its own register number, 0 to 11, as the access key.
When a read register number is specified, the transfer unit 112 transfers the data held in the register with that number out of the first register file unit 110.
The second register file unit 210 holds the data needed for arithmetic processing in the second arithmetic device 220. It has a register group 211 consisting of a plurality of registers (second registers) r0 to r11, and a transfer unit 212 for transferring data read from the registers r0 to r11 of the register group 211 out of the second register file unit 210.
Reads from and writes to the registers r0 to r11 of the register group 211 are executed according to the configuration data stored in the configuration memory 10. Each register is read or written using its own register number, 0 to 11, as the access key.
The registers r0 to r11 of the register group 211 correspond one-to-one with the registers r0 to r11 of the register group 111 of the first register file unit 110; registers with the same register number are associated with each other. The transfer unit 112 of the first register file unit 110 can transfer the data read from each register of the register group 111 to the register with the same register number in the register group 211 of the second register file unit 210.
For example, the transfer unit 112 of the first register file unit 110 can transfer data read from register r3 of the register group 111 to register r3 of the register group 211 of the second register file unit 210. Likewise, it can transfer data read from register r9 of the register group 111 to register r9 of the register group 211.
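The same-numbered transfer between register groups can be sketched as follows. This is an illustrative model, not the hardware: the name `transfer_same_number` and the use of plain Python lists for register groups 111 and 211 are assumptions. The only behavior taken from the text is that register rN in the first group is copied to register rN in the second.

```python
NUM_REGS = 12  # registers r0 to r11 in each register group


def transfer_same_number(src_group, dst_group, register_numbers):
    """Copy each selected register to the register with the same number downstream."""
    for n in register_numbers:
        dst_group[n] = src_group[n]


group_111 = [100 + n for n in range(NUM_REGS)]  # sample contents for stage 1
group_211 = [0] * NUM_REGS                      # stage 2 starts empty here

transfer_same_number(group_111, group_211, [3, 9])
print(group_211[3], group_211[9])
```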
When a read register number is specified, the transfer unit 212 transfers the data held in the register with that number out of the second register file unit 210.
The first arithmetic device 120 performs the actual processing in the LAPP 101. It has an arithmetic unit group 121 consisting of arithmetic units 1-1 to 1-4, a holder group 122 consisting of holders 1-1 to 1-4, and a transfer unit 123.
Together with the first register file unit 110, the first arithmetic device 120 constitutes a first data processing stage, and the transfer unit 112 of the first register file unit 110 can transfer data read from the registers r0 to r11 of the register group 111 to the first arithmetic device 120. Each of the arithmetic units 1-1 to 1-4 of the arithmetic unit group 121 obtains two items of read data from among the registers r0 to r11 of the first register file unit 110 and uses them to execute arithmetic processing such as the four basic arithmetic operations and logical operations. The arithmetic units 1-1 to 1-4 execute their processing simultaneously.
The holders 1-1 to 1-4 of the holder group 122 store the operation results of their corresponding arithmetic units 1-1 to 1-4. The holders correspond one-to-one with the arithmetic units.
The transfer unit 123 transfers the operation results of the arithmetic units 1-1 to 1-4, stored in the holders 1-1 to 1-4, out of the first arithmetic device 120.
The second arithmetic device 220 performs the actual processing in the LAPP 101. It has an arithmetic unit group 221 consisting of arithmetic units 2-1 to 2-4, a holder group 222 consisting of holders 2-1 to 2-4, and a transfer unit 223.
Together with the second register file unit 210, the second arithmetic device 220 constitutes a second data processing stage, and the transfer unit 212 of the second register file unit 210 can transfer data read from the registers r0 to r11 of the register group 211 to the second arithmetic device 220. Each of the arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 obtains two items of read data from among the registers r0 to r11 of the second register file unit 210 and uses them to execute arithmetic processing such as the four basic arithmetic operations and logical operations. The arithmetic units 2-1 to 2-4 execute their processing simultaneously.
Furthermore, the arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 of the second arithmetic device 220 can obtain the operation results stored in the holders 1-1 to 1-4 of the holder group 122 of the first arithmetic device 120. The transfer unit 123 of the first arithmetic device 120 can transfer the operation results of the arithmetic units 1-1 to 1-4, stored in the holders 1-1 to 1-4, to the second arithmetic device 220.
The arithmetic units 2-1 to 2-4 of the second arithmetic device 220 can then execute arithmetic processing using those operation results in place of the data read from the registers r0 to r11 of the second register file unit 210.
The holders 2-1 to 2-4 of the holder group 222 store the operation results of their corresponding arithmetic units 2-1 to 2-4. The holders correspond one-to-one with the arithmetic units.
The transfer unit 223 transfers the operation results of the arithmetic units 2-1 to 2-4, stored in the holders 2-1 to 2-4, out of the second arithmetic device 220.
Next, the operation of the LAPP 101 is described.
In the LAPP 101, the first arithmetic device 120 performs arithmetic processing using data read from the registers r0 to r11 of the register group 111.
Simultaneously with the arithmetic processing by the first arithmetic device 120, data read from those registers r0 to r11 of the register group 111 that were not subject to the arithmetic processing by the first arithmetic device 120 is transferred to the second register file unit 210.
In the next cycle, the second arithmetic device 220 performs arithmetic processing using the data transferred to the registers r0 to r11 of the register group 211 of the second register file unit 210.
Simultaneously with the arithmetic processing by the second arithmetic device 220, the first arithmetic device 120 performs arithmetic processing using data read from the registers r0 to r11 of the register group 111.
Furthermore, when the second arithmetic device 220 needs the operation results of the first arithmetic device 120, the transfer unit 123 of the first arithmetic device 120 transfers the operation results of the arithmetic units 1-1 to 1-4, stored in the holders 1-1 to 1-4, to the second arithmetic device 220.
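The two-stage flow described above can be followed in a small software sketch. It is a simplified model under stated assumptions, not the LAPP hardware: the helper `run_stage`, the `("r", n)` and `("h", k)` operand notation, the sample configurations, and the choice to copy every register to the next stage (rather than only the registers the first stage did not consume) are all hypothetical.

```python
import operator


def run_stage(registers, previous_holders, config):
    """Run four arithmetic units of one stage and return their holder contents.

    Each config entry is (operation, source_a, source_b). A source is
    ("r", n) for register rn of this stage, or ("h", k) for holder k of
    the previous stage (assumed notation).
    """
    def fetch(source):
        kind, index = source
        return registers[index] if kind == "r" else previous_holders[index]

    return [op(fetch(a), fetch(b)) for op, a, b in config]


# Cycle 1: stage 1 computes from register group 111.
regs_111 = list(range(12))  # r0..r11 hold 0..11 as sample data
stage1_config = [
    (operator.add, ("r", 0), ("r", 1)),
    (operator.mul, ("r", 2), ("r", 3)),
    (operator.sub, ("r", 5), ("r", 4)),
    (operator.and_, ("r", 6), ("r", 7)),
]
holders_1 = run_stage(regs_111, [], stage1_config)

# In the same cycle, register contents move to the same-numbered registers
# of register group 211 (simplification: all registers are copied).
regs_211 = list(regs_111)

# Cycle 2: stage 2 uses its registers and, where needed, stage-1 results.
stage2_config = [
    (operator.add, ("h", 0), ("h", 1)),
    (operator.add, ("r", 8), ("h", 2)),
    (operator.mul, ("h", 3), ("r", 9)),
    (operator.sub, ("r", 11), ("r", 10)),
]
holders_2 = run_stage(regs_211, holders_1, stage2_config)
print(holders_1, holders_2)
```

In real operation the first stage would already be processing the next set of inputs during cycle 2; the sketch follows a single set of inputs through both stages for clarity.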
The LAPP 102 shown in FIG. 2 adds a third register file unit 310 and a third arithmetic device (third arithmetic section, third holding section) 320 to the LAPP 101 of FIG. 1. As a result, arithmetic processing by the third arithmetic device 320 is executed simultaneously with the arithmetic processing by the first arithmetic device 120 and the second arithmetic device 220.
The third register file unit 310 holds the data needed for arithmetic processing in the third arithmetic device 320. It has a register group 311 consisting of a plurality of registers (third registers) r0 to r11, and a transfer unit 312 for transferring data read from the registers r0 to r11 of the register group 311 out of the third register file unit 310.
 レジスタ群311の各レジスタr0~r11に対する読み出しや書き込みは、コンフィギュレーションメモリ10に格納されたコンフィギュレーションデータに基づいて実行される。レジスタ群311の各レジスタr0~r11は、自身のレジスタ番号0~12をアクセスのキーとして読み出しや書き込みがされる。 Reading and writing to the registers r0 to r11 of the register group 311 are executed based on the configuration data stored in the configuration memory 10. Each register r0 to r11 of the register group 311 is read or written using its own register number 0 to 12 as an access key.
 レジスタ群311の各レジスタr0~r11は、第2レジスタファイル部210のレジスタ群211の各レジスタr0~r11と一対一に対応しており、レジスタ群211及びレジスタ群311の各レジスタ間においてレジスタ番号が同一のもの同士が対応付けられている。そして、第2レジスタファイル部210の転送器212は、レジスタ群211の各レジスタr0~r11の読み出しデータを、レジスタ群211の各レジスタr0~r11のレジスタ番号と同一のレジスタ番号を持つ、第3レジスタファイル部310のレジスタ群311の各レジスタr0~r11に、転送可能である。 The registers r0 to r11 of the register group 311 have a one-to-one correspondence with the registers r0 to r11 of the register group 211 of the second register file unit 210, and register numbers between the registers of the register group 211 and the register group 311 Are associated with each other. Then, the transfer unit 212 of the second register file unit 210 receives the read data of the registers r0 to r11 of the register group 211 in the third register number having the same register number as the register numbers of the registers r0 to r11 of the register group 211. Data can be transferred to each of the registers r0 to r11 in the register group 311 of the register file unit 310.
 転送器312は、読み出しレジスタ番号が指定されると、その指定された番号が付されたレジスタに保持されているデータを第3レジスタファイル部310の外部に転送する。 When the read register number is designated, the transfer unit 312 transfers the data held in the register with the designated number to the outside of the third register file unit 310.
 また、第3レジスタファイル部310は、第1演算装置120の転送器123により、第1演算装置120の各保持器1-1~1-4に格納されている、各演算器1-1~1-4の演算結果を取得することができる。 In addition, the third register file unit 310 is stored in each of the holders 1-1 to 1-4 of the first arithmetic unit 120 by the transfer unit 123 of the first arithmetic unit 120. The calculation result of 1-4 can be acquired.
The third arithmetic unit 320 performs the substantive processing in the LAPP 102. It has a computing unit group 321 consisting of computing units 3-1 to 3-4, a holder group 322 consisting of holders 3-1 to 3-4, and a transfer unit 323.
The third arithmetic unit 320, together with the third register file unit 310, constitutes the third data processing stage, and the transfer unit 312 of the third register file unit 310 can transfer the read data of registers r0 to r11 of register group 311 to the third arithmetic unit 320. Each of computing units 3-1 to 3-4 of computing unit group 321 obtains two items of read data from among registers r0 to r11 of the third register file unit 310 and uses them to execute various operations, such as the four basic arithmetic operations and logical operations. Computing units 3-1 to 3-4 execute their operations simultaneously.
Holders 3-1 to 3-4 of holder group 322 store the computation results of their corresponding computing units 3-1 to 3-4. Holders 3-1 to 3-4 correspond one-to-one with computing units 3-1 to 3-4.
The transfer unit 323 transfers the computation results of computing units 3-1 to 3-4, stored in holders 3-1 to 3-4, to the outside of the third arithmetic unit 320.
In addition, the third arithmetic unit 320 can obtain, via the transfer unit 223 of the second arithmetic unit 220, the computation results of computing units 2-1 to 2-4 stored in holders 2-1 to 2-4 of the second arithmetic unit 220.
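The structure of one data processing stage (twelve registers, four computing units that each read two registers, and four holders that capture the results) can be sketched in Python. This is only an illustrative model under assumptions: the class name `Stage`, the `(op, src_a, src_b)` configuration tuples, and the use of Python integers as register contents are hypothetical and do not come from the specification.

```python
import operator
from dataclasses import dataclass, field

NUM_REGS = 12   # registers r0 to r11
NUM_UNITS = 4   # computing units x-1 to x-4, one holder per unit


@dataclass
class Stage:
    """Hypothetical model of one register file unit plus one arithmetic unit."""
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)
    holders: list = field(default_factory=lambda: [0] * NUM_UNITS)

    def execute(self, config):
        """Run all computing units "simultaneously" for one cycle.

        config holds one (op, src_a, src_b) tuple per computing unit;
        src_a and src_b are register numbers (the access keys).
        """
        if len(config) != NUM_UNITS:
            raise ValueError("one configuration entry per computing unit")
        # Every unit reads the register file as it was at the start of the cycle.
        self.holders = [op(self.regs[a], self.regs[b]) for op, a, b in config]
        return list(self.holders)


stage = Stage(regs=list(range(NUM_REGS)))      # r0 = 0, r1 = 1, ...
config = [
    (operator.add, 1, 2),    # unit 1: r1 + r2
    (operator.mul, 3, 4),    # unit 2: r3 * r4
    (operator.sub, 10, 5),   # unit 3: r10 - r5
    (operator.and_, 6, 3),   # unit 4: r6 & r3
]
results = stage.execute(config)
```

In this model the configuration list plays the role of the configuration data, and `holders` is what a transfer unit would pass on to the next stage.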
Next, the operation of the LAPP 102 is described.
In the LAPP 102, the second arithmetic unit 220 performs arithmetic processing using the read data of registers r0 to r11 of register group 211.
Simultaneously with the arithmetic processing by the second arithmetic unit 220, the read data of those registers r0 to r11 of register group 211 that were not subject to the arithmetic processing by the second arithmetic unit 220 are transferred to the third register file unit 310.
Then, in the next cycle, the third arithmetic unit 320 performs arithmetic processing using the data transferred to registers r0 to r11 of register group 311 in the third register file unit 310.
Simultaneously with the arithmetic processing by the third arithmetic unit 320, the second arithmetic unit 220 performs arithmetic processing using the read data of registers r0 to r11 of register group 211.
Further, when the third arithmetic unit 320 requires the computation results of the second arithmetic unit 220, the transfer unit 223 of the second arithmetic unit 220 transfers the computation results of computing units 2-1 to 2-4, which are stored in holders 2-1 to 2-4, to the third arithmetic unit 320.
There are also cases in which the second arithmetic unit 220 does not require the computation results of the first arithmetic unit 120 but the third arithmetic unit 320 does. In such cases, storing the results of the first arithmetic unit 120 in the third register file unit 310 allows them to be supplied indirectly to the third arithmetic unit 320.
Note that the three-stage configuration of the LAPP 102, consisting of the first to third data processing stages, may be extended to a configuration with N data processing stages.
For example, let N be an integer of 1 or more. In this case, when the computation result of the arithmetic unit constituting the Nth data processing stage is used by an arithmetic unit in the (N+2)th data processing stage or later, the result is written to the register file unit of the (N+2)th data processing stage.
On the other hand, when no arithmetic unit in the (N+2)th data processing stage or later uses the result, it is input to the arithmetic unit of the (N+1)th data processing stage without being written to the register file unit of the (N+2)th data processing stage.
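The routing rule for the result of the Nth stage can be written as a small decision function. The sketch below is one reading of the rule, not the specification's own formulation: the function name `route_result`, its dictionary result, and the treatment of a result that is needed both by stage N+1 and by a later stage (forwarded and written) are assumptions.

```python
def route_result(producer_stage, consumer_stages):
    """Decide where the result computed in stage N has to go.

    producer_stage  -- N, the stage whose arithmetic unit produced the result
    consumer_stages -- stage numbers whose arithmetic units use the result
    """
    n = producer_stage
    if any(c <= n for c in consumer_stages):
        raise ValueError("data only flows toward later stages")
    used_later = any(c >= n + 2 for c in consumer_stages)
    return {
        # Direct input to the arithmetic unit of stage N+1.
        "forward_to_stage": n + 1 if (n + 1) in consumer_stages else None,
        # Written to the register file of stage N+2 only if stage N+2
        # or any later stage still needs the value.
        "write_to_regfile_of_stage": n + 2 if used_later else None,
    }


only_next = route_result(1, [2])      # needed by the next stage only
only_later = route_result(1, [3])     # needed by stage 3 only
both = route_result(1, [2, 4])        # needed by stage 2 and stage 4
```

With `producer_stage=1` and `consumer_stages=[3]`, this corresponds to the situation described for the LAPP 102 in which the first arithmetic unit's result bypasses the second arithmetic unit and reaches the third through the third register file unit.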
Next, the data supply method in the LAPP 101 and LAPP 102 described above is explained. FIG. 3 shows the configuration of the LAPP 103, in which the three-stage configuration of the LAPP 102, consisting of the first to third data processing stages, is extended to N data processing stages. In FIG. 3, known techniques can be used for the mechanism of the cache memory 14, the configuration of the small-scale cache memories 15, and the propagation mechanism between them.
As shown in FIG. 3, the LAPP 103 uses a method of supplying a large amount of data from memory to the plurality of computing units 11. As the operand data propagates in one direction through the computing unit network made up of the plurality of computing units 11, via the plurality of register file units 12, the data in memory is propagated in the same direction. This allows a plurality of load instructions to reference a plurality of memory addresses simultaneously.
Specifically, in the medium-capacity memories 13 placed at the first stage and at subsequent stages, each of the three ways of the cache memory 14 is associated with one array. One word is read from each way every cycle and propagated to the next stage. Each stage captures the three words being propagated into its small-scale cache memory 15, so that data within a predetermined memory address range can be referenced randomly. Because the operand data and the memory data propagate at the same speed, load instructions belonging to the same iteration can reference the same memory address range no matter at which stage they reference the small-scale cache memory 15. Because any stage can reference the contents of the medium-capacity memories 13 placed before it, the load/store unit 16 allows load/store instructions to be placed at any stage, even when each loop iteration requires elements near the array element of interest.
By exploiting these characteristics, the LAPP 103 can handle the memory reference patterns shown in Table 1. Note that wide-range random offsets can be handled only at stages to which a medium-capacity memory 13 is directly connected. For the update type, in which the loaded contents are modified and stored to the same address, the store data is circulated once in the depth direction and stored back into the original array.
(Problems in the technology underlying the present invention)
The LAPP 103 described above has the advantage that it includes a plurality of register file units 12, maps an ordinary machine-language instruction sequence onto the plurality of computing units 11, and can execute it at high speed. However, it has the following problems that stand in the way of practical use. These problems are described below with reference to FIG. 4. FIG. 4 is a schematic diagram for explaining data supply from the cache memory 14 in the LAPP 103. In FIG. 4, known techniques can be used for the mechanism of the cache memory 14, the configuration of the small-scale cache memories 15, and the propagation mechanism between them.
(1) Propagating the data read from the medium-capacity memories 13 to subsequent stages requires as many data paths 17 as there are ways. As the number of such inter-stage wires grows, it becomes difficult to connect multiple LSIs to build a large-scale computing mechanism.
(2) Some programs require many ways. Increasing the number of ways requires either adding stages in the depth direction of the LAPP 103 to increase the number of medium-capacity memories 13, or adding ways in the width direction to widen each medium-capacity memory 13. In either case, as in (1) above, the large number of inter-stage data paths 17 is an obstacle.
(3) Accumulating array elements requires loads and stores to the same array. In the LAPP 103 described above, because load data and store data propagate in one direction, the store data must be circulated once in the depth direction before being stored into the original array. When many arrays are stored, many data paths 18 must be provided for propagating store data, in addition to the data paths needed for propagating load data.
(Configuration of the present invention)
Like the LAPP 103 described above, the LAPP of the present invention adopts a configuration in which medium-capacity memories are distributed, but it does not provide regular data paths that unconditionally propagate the data read from a medium-capacity memory to subsequent stages. This prevents the increase in the number of inter-stage data path wires that was a problem with the LAPP 103.
FIG. 5 shows a configuration in which one medium-capacity memory is placed every four stages. The number of stages is of course not limited to four. What matters is that one medium-capacity memory is placed for each "bundle" (computing unit bundle) formed by connecting a plurality of "stages" in a multistage configuration, each stage consisting of one or more computing units. In other words, the LAPP of the present invention can be described as a multistage arrangement of a plurality of such bundles. Consequently, the configuration of FIG. 5 does not need the propagation mechanism between the small-scale cache memories 15 shown in FIGS. 3 and 4.
FIG. 6 is a detailed configuration diagram of a memory system including a medium-capacity memory. In the following figures, black squares mainly denote output latches, and white squares denote latches used, other than for output, as inputs to operations. The number to the right of each square is its bit width.
As shown in FIGS. 5 and 6, the LAPP of the present invention differs from the LAPP 103 described above in that each medium-capacity memory holds one way of the cache memory and can be used divided into a plurality of blocks. It also differs in that, by combining one base address with six offsets, load instructions using six addresses can be executed on a single way.
Ordinarily, making six arbitrary addresses usable requires designing a six-port memory. However, such a multiport memory is impractical in terms of area efficiency and operating speed.
In contrast, the present invention supports the reference patterns shown in Table 1 while placing constraints on the usable address range. As a result, a six-read, two-write memory function is realized using a physically ordinary memory with one read port and one write port.
As shown in FIG. 5, the LAPP 1 of the present invention mainly includes a computing unit network made up of a plurality of computing units 21, and a plurality of memory systems (data supply devices) 22, each containing one way of a cache memory (not shown).
As shown in FIG. 5, the memory systems 22 are placed every four stages in the computing unit network made up of the plurality of computing units 21. Each memory system 22 corresponds to one way of the cache memory (not shown) and exchanges data with that way.
In each memory system 22, the result of an address calculation based on the address information supplied from the preceding stage is stored in a plurality of latches (address holding units) 23 in front of (above) the memory system 22. In the next cycle, the medium-capacity memory and other elements in the memory system 22 are referenced, and the data read is stored in a plurality of latches 24 behind (below) the memory system 22. In the cycle after that, the data is used as input to the plurality of computing units 21, and the computation results are stored.
The computation results obtained after passing through the computing units 21 in the first and second stages from the bottom are stored in a plurality of latches 25 at the very bottom. In the next cycle, the computation results stored in latches 25 can be stored into the memory system 22, sent on to the subsequent stage, or both.
FIG. 6 shows the configuration of the memory system 22 shown in FIG. 5. As shown in FIG. 6, the memory system 22 mainly includes a memory unit 31 divided into a plurality of blocks (four blocks here), a connection unit 32 for joining mutually adjacent blocks, and a shift register (shift register unit) 33. As described later, the shift register 33 consists of a plurality of registers connected in a row.
As shown in FIG. 6, the plurality of latches 23 in FIG. 5 include a plurality of latches (first address storage circuits) 23-1, 23-2, 23-4, and 23-5, each connected to a block of the memory unit 31 so as to correspond one-to-one with the blocks, and a plurality of latches (second address storage circuits) 23-3 and 23-6, which are not connected to any block of the memory unit 31.
Of course, latches 23-3 and 23-6 may each be associated with a block into which the memory unit 31 is divided. Conversely, latches 23-1, 23-2, 23-4, and 23-5 may be left unconnected to any block of the memory unit 31. What matters is that the memory unit 31 is divided into a plurality of blocks and that there are latches associated with the blocks.
The operation of the memory system 22 shown in FIG. 6 is described below for each of the memory reference patterns in Table 1.
(First case)
The first case (1) in Table 1 is a case in which a wide range of addresses is referenced randomly. As shown in FIG. 7, when a base address is set in LD-BASE 201 and an offset is set in latch 202, the offset is added to the base address to give effective address A0.
When effective address A0 is stored in latch 23-1, in the next cycle it is supplied to latch 203 of "way0.blk0", one block of the memory unit 31. Likewise, effective address A0 is supplied to latch 204 of "way0.blk1", another block of the memory unit 31, adjacent to "way0.blk0".
The values read from the two blocks are sent to the connection unit 32, which selects one of them. The connection unit 32 makes this selection using the upper bits of effective address A0 stored in latch 23-1. The selected data is output to O0 of latches 24 via selector 33-1 of the shift register 33.
If the data to be output as O0 of latches 24 fits within "way0.blk0" alone, the connection function described above, which joins "way0.blk1" to "way0.blk0", need not be used. In that case, the data read from "way0.blk0" is simply output to O0 of latches 24.
Similarly, when a new base address is set in LD-BASE 201 and a new offset is set in latch 205, the offset is added to the base address to give effective address A3. When effective address A3 is stored in latch 23-4, it is supplied to latch 206 of block "way0.blk2" and latch 207 of block "way0.blk3" of the memory unit 31, and the values read from these blocks are sent to the connection unit 32. The connection unit 32 selects one of them using the upper bits of effective address A3 stored in latch 23-4 and outputs it to O3 of latches 24 via selector 33-5 of the shift register 33.
In this way, using the memory unit 31 divided into a plurality of blocks makes it possible to handle a plurality of random references.
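The first case can be illustrated with a small Python model of one block pair. It is a sketch under assumptions: the block size `BLOCK_WORDS`, the word-addressed layout, and the function names are hypothetical; the specification only states that both blocks are read with the effective address and that the connection unit selects one value using the upper bits.

```python
BLOCK_WORDS = 4  # hypothetical number of words per block


def effective_address(base, offset):
    """Base address (LD-BASE) plus offset gives the effective address."""
    return base + offset


def read_block_pair(blk_lo, blk_hi, addr):
    """Read both adjacent blocks, then let the upper address bit pick one."""
    index = addr % BLOCK_WORDS            # lower bits: word inside a block
    select = (addr // BLOCK_WORDS) & 1    # upper bit: which block of the pair
    candidates = (blk_lo[index], blk_hi[index])   # both blocks are read
    return candidates[select]             # role of the connection unit


# Contents of the four blocks of one way (arbitrary example data).
blk0, blk1 = [10, 11, 12, 13], [20, 21, 22, 23]
blk2, blk3 = [30, 31, 32, 33], [40, 41, 42, 43]

a0 = effective_address(2, 3)   # lands in the second block of the first pair
a3 = effective_address(0, 2)   # lands in the first block of the second pair
o0 = read_block_pair(blk0, blk1, a0)
o3 = read_block_pair(blk2, blk3, a3)
```

Because the two pairs are independent memories in this model, `o0` and `o3` stand for two random references served in the same cycle.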
(Second case)
The second case (2) in Table 1 is a case in which, taking a monotonically increasing address as the reference, six locations are referenced randomly at the same time, subject to a constraint on the range of relative addresses. As shown in FIG. 8, the base address set in LD-BASE 301 is stored, via latch 302, in latch 23-1 as effective address A0. In the next cycle, effective address A0 is supplied to latch 303 of "way0.blk0", one block of the memory unit 31. Likewise, effective address A0 is supplied to latch 304 of "way0.blk1", another block of the memory unit 31, adjacent to "way0.blk0".
The values read from the two blocks are sent to the connection unit 32, which selects one of them. The connection unit 32 makes this selection using the upper bits of effective address A0 stored in latch 23-1. The selected data is output to O0 of latches 24 via selector 33-1 of the shift register 33. (At this point, "way0.blk2" and "way0.blk3" may additionally be joined, as in the first case described above.)
Meanwhile, offset "-e" is set in latch 305, offset "-d" in latch 306, offset "-c" in latch 307, offset "-b" in latch 308, and offset "-a" in latch 309. These offsets are set in latches 23-2, 23-3, 23-4, 23-5, and 23-6 as effective addresses A1, A2, A3, A4, and A5, respectively.
Simultaneously with the write to O0, the data selected by the connection unit 32 is written, via selector 33-1, to register 33-2 at the head of the shift register 33. From the next cycle onward, while data flows through the shift register 33, effective addresses A5, A4, A3, A2, and A1, set in latches 23-6, 23-5, 23-4, 23-3, and 23-2 respectively, are used to specify addresses within the range that the shift register 33 can hold. This allows addresses near effective address A0 to be referenced simultaneously.
That is, effective addresses A5, A4, A3, A2, and A1 are values representing positions among registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, they indicate which of registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referenced for the values to be output to O5, O4, O3, O2, and O1 of latches 24.
For this purpose, a mechanism is needed that, for each of effective addresses A5, A4, A3, A2, and A1, compares the address information against every position of the shift register 33 and reads the contents of the matching register out to O5, O4, O3, O2, and O1 of latches 24, respectively. Because the shift register 33 is small, such a mechanism is easy to implement.
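The second case can be modeled as a short shift register whose entries are searched by address. The Python sketch below is illustrative only: storing an address tag next to each word, the class name `ShiftWindow`, the depth of five registers, and the simplified one-word-per-cycle timing are assumptions made for the example rather than details taken from the specification.

```python
from collections import deque


class ShiftWindow:
    """Hypothetical model of the shift register with address comparison."""

    def __init__(self, depth=5):
        # Newest entry sits at index 0; the oldest falls off the far end.
        self.regs = deque(maxlen=depth)

    def push(self, addr, word):
        self.regs.appendleft((addr, word))

    def lookup(self, addr):
        # Compare the requested address with every position and
        # return the contents of the matching register, if any.
        for tag, word in self.regs:
            if tag == addr:
                return word
        return None   # outside the range the shift register can hold


def run(memory, offsets, cycles):
    """Each cycle: read A0 from memory, serve the other references from the window."""
    window = ShiftWindow()
    outputs = []
    for a0 in range(cycles):              # A0 increases monotonically
        o0 = memory[a0]                   # the single real memory read
        neighbours = [window.lookup(a0 + off) for off in offsets]
        outputs.append([o0] + neighbours)
        window.push(a0, o0)               # the word enters the shift register
    return outputs


memory = [100 + i for i in range(16)]
outputs = run(memory, offsets=[-3, -1, -5, -2, -4], cycles=8)
last = outputs[-1]
```

Only one word is read from `memory` per cycle; the other five references are served from previously read words, which is how a memory with a single read port can present six reads as long as the offsets stay within the window.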
 (第3のケース)
 表1の第3のケース(3)は、単調増加するアドレスを基準とし、相対アドレスの範囲に制約があるものの、同時に6箇所を参照するケースである。上述の第2のケース(2)と異なる点は、6箇所のアドレスも単調増加する点である。上述の第2のケース(2)では、オフセットが「-a」、「-b」、「-c」、「-d」、「-e」といったランダムなオフセットである。一方、第3のケースは、オフセットが固定である。
(Third case)
The third case (3) in Table 1 is based on a monotonically increasing address, and refers to six locations at the same time, although there are restrictions on the range of relative addresses. The difference from the above-mentioned second case (2) is that six addresses also monotonously increase. In the second case (2) described above, the offset is a random offset such as “−a”, “−b”, “−c”, “−d”, or “−e”. On the other hand, in the third case, the offset is fixed.
 そこで、第3のケースの場合、シフトレジスタ33のレジスタ33-2、33-3、33-4、33-6、33-7の各位置を用いて、オフセットを設定する。すなわち、シフトレジスタ33から直接読み出す機構により対応する。 Therefore, in the third case, the offset is set using the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33. In other words, this is handled by a mechanism that reads directly from the shift register 33.
 このようなオフセットの設定により、第2のケースの場合とは異なり、図8のラッチ305~309、及び、ラッチ23-2~23-6を動作させる必要がない。これらの動作の分だけ、消費電力の削減が可能となる。 By setting such an offset, unlike the case of the second case, it is not necessary to operate the latches 305 to 309 and the latches 23-2 to 23-6 in FIG. Power consumption can be reduced by the amount of these operations.
 図9に示すように、LD-BASE401に設定されたベースアドレスは、ラッチ302を介して、有効アドレスA0として、ラッチ23-1に格納される。次のサイクルで、有効アドレスA0は、メモリ部31の1つのブロックである「way0.blk0」のラッチ403に供給される。同様に、有効アドレスA0は、「way0.blk0」に隣接する、メモリ部31の他のブロックである「way0.blk1」のラッチ404に供給される。 As shown in FIG. 9, the base address set in the LD-BASE 401 is stored in the latch 23-1 as the effective address A0 via the latch 302. In the next cycle, the effective address A0 is supplied to the latch 403 of “way0.blk0” which is one block of the memory unit 31. Similarly, the effective address A0 is supplied to the latch 404 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.
 各ブロックから読み出された値は連結部32に送られ、連結部32により、一方が選択される。連結部32は、ラッチ23-1に格納された有効アドレスA0の上位ビットを用いて、上述の選択を実行する。連結部32により選択されたデータは、シフトレジスタ33のセレクタ33-1を介して、ラッチ24のO0に出力される。 The value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32. The linking unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1. The data selected by the linking unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.
 O0に対する書き込みと同時に、連結部32により選択されたデータを、セレクタ33-1を介して、シフトレジスタ33の先頭のレジスタ33-2に書き込みを行なう。次のサイクル以降、シフトレジスタ33中にデータを流しながら、シフトレジスタ33のレジスタ33-2、33-3、33-4、33-6、33-7の各位置を用いて、固定オフセットを指定する。これにより、有効アドレスA0近傍のアドレスを同時に参照することができる。 Simultaneously with writing to O0, the data selected by the linking unit 32 is written to the top register 33-2 of the shift register 33 via the selector 33-1. Specify the fixed offset using the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33 while flowing data into the shift register 33 after the next cycle. To do. Thereby, addresses near the effective address A0 can be referred to simultaneously.
 すなわち、本第3のケースの場合、ラッチ24のO2~O5に出力すべき値として、レジスタ33-2、33-3、33-4、33-6、33-7のいずれの値を参照すべきかを表わす、有効アドレスA5、A4、A3、A2、A1の設定は不要となる。なぜなら、本第3のケースの場合、上述の第2のケースとは異なり、オフセットは固定である。それゆえ、レジスタ33-2、33-3、33-4、33-6、33-7の各位置を用いれば、ラッチ24のO2~O5に出力すべき値として、レジスタ33-2、33-3、33-4、33-6、33-7のいずれの値を参照すべきかを特定することができるからである。つまり、有効アドレスA5、A4、A3、A2、A1は、レジスタ33-2、33-3、33-4、33-6、33-7の各位置により設定される、と言える。 In other words, in the case of the third case, any of the values of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to as the value to be output to the O2 to O5 of the latch 24. It is not necessary to set the effective addresses A5, A4, A3, A2, and A1 representing the cracks. This is because, in the case of the third case, unlike the second case described above, the offset is fixed. Therefore, if the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 are used, the values to be output to O2 to O5 of the latch 24 are the registers 33-2, 33- This is because it is possible to specify which value of 3, 33-4, 33-6, and 33-7 should be referred to. That is, it can be said that the effective addresses A5, A4, A3, A2, and A1 are set by the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7.
As described above, the power consumption of the memory system 22 can be reduced in this case.
Of course, as with the second case (2) described above, the memory system 22 can also handle the third case (3).
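As an informal illustration only (it is not part of the disclosed hardware), the third case can be modelled in software as a short shift register whose register positions serve as the fixed offsets. The names `TappedShiftRegister` and `stream_with_fixed_taps`, the depth of five registers, the use of a Python list as the memory, and the choice to sample the taps before each shift are all assumptions made for this sketch.

```python
from collections import deque


class TappedShiftRegister:
    """Toy model: register positions themselves act as fixed offsets."""

    def __init__(self, depth):
        # Index 0 models the head register; higher indices hold older data.
        self.regs = deque([None] * depth, maxlen=depth)

    def push(self, value):
        # One cycle: the new value enters the head and older values move along.
        self.regs.appendleft(value)

    def tap(self, position):
        return self.regs[position]


def stream_with_fixed_taps(memory, base, length, taps, depth=5):
    """Read memory once per cycle; serve neighbouring values from the taps."""
    sr = TappedShiftRegister(depth)
    rows = []
    for addr in range(base, base + length):
        value = memory[addr]                    # the only memory read this cycle
        tapped = [sr.tap(p) for p in taps]      # assumed timing: sample, then shift
        rows.append((value, tapped))
        sr.push(value)
    return rows
```

With taps at positions 0 and 1, each cycle yields the newly read element together with the two elements read in the preceding cycles, although the memory is read only once per cycle. No per-output address needs to be supplied, which corresponds to the point made above that A1 to A5 need not be set.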
(Fourth case)
The fourth case (4) in Table 1 requires two sets of access patterns, each of which references three locations simultaneously relative to a monotonically increasing address, although the range of relative addresses is restricted.
As shown in FIG. 10, the base address set in LD-BASE 501 is stored in the latch 23-1 as the effective address A0 via the latch 502. In the next cycle, the effective address A0 is supplied to the latch 503 of “way0.blk0”, which is one block of the memory unit 31. The effective address A0 is likewise supplied to the latch 504 of “way0.blk1”, another block of the memory unit 31 adjacent to “way0.blk0”.
The values read from the respective blocks are sent to the linking unit 32, which selects one of them. The linking unit 32 makes this selection using the upper bits of the effective address A0 stored in the latch 23-1. The data selected by the linking unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.
Simultaneously with the write to O0, the data selected by the linking unit 32 is written, via the selector 33-1, to the head register 33-2 of the shift register 33. From the next cycle onward, while data flows through the shift register 33, fixed offsets are specified by the positions of the registers 33-2 and 33-3 of the shift register 33, so that addresses in the vicinity of the effective address A0 can be referenced simultaneously.
The effective addresses A1 and A2 are set using the positions of the registers 33-2 and 33-3 of the shift register 33. For this purpose, a mechanism is needed that, for each of the effective addresses A2 and A1, compares arbitrary positions of the shift register 33 with the address information and reads the contents of the matching register to O2 and O1 of the latch 24, respectively. Because the shift register 33 is small, such a mechanism is easy to implement.
Similarly, a base address newly set in LD-BASE 501 is stored in the latch 23-4 as the effective address A3 via the latch 505. In the next cycle, the effective address A3 is supplied to the latch 506 of “way0.blk2”, which is one block of the memory unit 31. It is likewise supplied to the latch 507 of “way0.blk3”, another block of the memory unit 31 adjacent to “way0.blk2”.
The values read from the respective blocks are sent to the linking unit 32, which selects one of them. The linking unit 32 makes this selection using the upper bits of the effective address A3 stored in the latch 23-4. The data selected by the linking unit 32 is output to O3 of the latch 24 via the selector 33-5 of the shift register 33.
Here, the fourth case (4) differs from the second case (2) in that the data stream is split partway along the shift register 33. The selector 33-5 is therefore required partway along the shift register 33 in order to insert the value read from “way0.blk2”.
Simultaneously with the write to O3, the data selected by the linking unit 32 is written, via the selector 33-5, to the register 33-6 partway along the shift register 33. From the next cycle onward, while data flows through the shift register 33, fixed offsets are specified by the positions of the registers 33-6 and 33-7 of the shift register 33, so that addresses in the vicinity of the effective address A3 can be referenced simultaneously.
The effective addresses A4 and A5 are set using the positions of the registers 33-6 and 33-7 of the shift register 33. For this purpose, a mechanism is needed that, for each of the effective addresses A5 and A4, compares arbitrary positions of the shift register 33 with the address information and reads the contents of the matching register to O5 and O4 of the latch 24, respectively. Because the shift register 33 is small, such a mechanism is easy to implement.
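The splitting of the data stream in the fourth case can be sketched in software as follows. This is a simplified model under stated assumptions, not the circuit of FIG. 10: `SplitShiftRegister` and `two_triples` are hypothetical names, the memory is a plain Python list, the segment lengths (three and two registers) follow the register numbering used above, and the taps are assumed to be sampled before each shift.

```python
class SplitShiftRegister:
    """Toy model of a shift register that can be cut into two segments."""

    def __init__(self, head_len=3, tail_len=2, split=True):
        self.head = [None] * head_len   # models registers 33-2, 33-3, 33-4
        self.tail = [None] * tail_len   # models registers 33-6, 33-7
        self.split = split

    def step(self, head_in, tail_in=None):
        carry = self.head[-1]           # value leaving the first segment
        self.head = [head_in] + self.head[:-1]
        # Mid-register selector (33-5 in the text): when the stream is split,
        # fresh memory data enters here; otherwise the carry passes through.
        feed = tail_in if self.split else carry
        self.tail = [feed] + self.tail[:-1]


def two_triples(memory, base_a, base_b, cycles):
    """Produce two groups of three neighbouring values per cycle."""
    sr = SplitShiftRegister()
    out = []
    for i in range(cycles):
        a = memory[base_a + i]          # first stream (output O0)
        b = memory[base_b + i]          # second stream (output O3)
        # Assumed timing: sample the taps before shifting, so they hold
        # the two previously read elements of each stream.
        first = (a, sr.head[0], sr.head[1])
        second = (b, sr.tail[0], sr.tail[1])
        out.append((first, second))
        sr.step(a, b)
    return out
```

After the first two cycles have filled the registers, each cycle delivers three neighbouring values for each of the two base addresses while issuing only two memory reads.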
(Fifth case)
The fifth case (5) in Table 1 simultaneously requires an access pattern that references three locations at once relative to a monotonically increasing address, although the range of relative addresses is restricted, and a mechanism that allows the blocks “way0.blk3”, “way0.blk2”, and “way0.blk1” of the memory unit 31 to be accessed independently.
As shown in FIG. 11, the base address set in LD-BASE 601 is stored in the latch 23-1 as the effective address A0 via the latch 602. In the next cycle, the effective address A0 is supplied to the latch 606 of “way0.blk0”, which is one block of the memory unit 31. The value read from “way0.blk0” is output to O0 of the latch 24 via the linking unit 32 and the selector 33-1 of the shift register 33.
The offset “-b” is set in the latch 610, and the offset “-a” is set in the latch 611. These offsets are set in the latches 23-3 and 23-6 as the effective addresses A2 and A5, respectively.
Simultaneously with the write to O0, the selector 33-1 writes the value read from “way0.blk0” to the head register 33-2 of the shift register 33. From the next cycle onward, while data flows through the shift register 33, the effective addresses A2 and A5 set in the latches 23-3 and 23-6, respectively, are used to specify addresses within the range that the shift register 33 can hold. Addresses in the vicinity of the effective address A0 can thereby be referenced simultaneously.
The effective addresses A5 and A2 are values representing positions among the registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, they indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referenced for the values to be output to O5 and O2 of the latch 24. Suppose, for example, that the effective address A2 indicates the register 33-2 and the effective address A5 indicates the register 33-3. In this case, the value of the register 33-2 is output to O2 of the latch 24, and the value of the register 33-3 is output to O5 of the latch 24.
For this reason, a mechanism is needed that, for each of the effective addresses A5 and A2, compares arbitrary positions of the shift register with the address information and reads the contents of the matching register to O5 and O2 of the latch 24, respectively.
Meanwhile, a base address newly set in LD-BASE 601 is stored in the latch 23-2 as the effective address A1 via the latch 603. In the next cycle, the effective address A1 is supplied to the latch 607 of “way0.blk1”, which is one block of the memory unit 31. The value read from “way0.blk1” is output to O1 of the latch 24.
In addition, a base address newly set in LD-BASE 601 is stored in the latch 23-4 as the effective address A3 via the latch 604. In the next cycle, the effective address A3 is supplied to the latch 608 of “way0.blk2”, which is one block of the memory unit 31. The value read from “way0.blk2” is output to O3 of the latch 24.
Furthermore, a base address newly set in LD-BASE 601 is stored in the latch 23-5 as the effective address A4 via the latch 605. In the next cycle, the effective address A4 is supplied to the latch 609 of “way0.blk3”, which is one block of the memory unit 31. The value read from “way0.blk3” is output to O4 of the latch 24.
The effective addresses A4, A3, and A1 are directly connected to “way0.blk3”, “way0.blk2”, and “way0.blk1”, respectively. Each independently references the contents of its block and writes the value read to O4, O3, and O1 of the latch 24, respectively.
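The combination described for the fifth case, one streamed block with position-selected taps plus three independently addressed blocks, can be sketched as below. This is an illustrative model only: `run_case5` and its parameters are hypothetical, the blocks are Python lists, the per-cycle addresses for A1, A3, and A4 are passed as lists, and A2 and A5 are represented directly as register positions sampled before each shift.

```python
def run_case5(blocks, base, cycles, pos_a2, pos_a5, addr_a1, addr_a3, addr_a4):
    """Toy model: blk0 streams through a shift register; blk1..blk3 are
    addressed independently in the same cycle."""
    regs = [None] * 5                 # models registers 33-2 ... 33-7
    out = []
    for i in range(cycles):
        o0 = blocks[0][base + i]      # monotonically increasing address A0
        out.append({
            "O0": o0,
            "O1": blocks[1][addr_a1[i]],   # A1 drives blk1 directly
            "O2": regs[pos_a2],            # A2 selects a register position
            "O3": blocks[2][addr_a3[i]],   # A3 drives blk2 directly
            "O4": blocks[3][addr_a4[i]],   # A4 drives blk3 directly
            "O5": regs[pos_a5],            # A5 selects a register position
        })
        regs = [o0] + regs[:-1]       # shift after sampling (assumed timing)
    return out
```

In this model O0, O2, and O5 form the three-location pattern around A0, while O1, O3, and O4 are unrelated reads that proceed in parallel.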
(Sixth case)
The sixth case (6) in Table 1 is, as shown in FIG. 12, a case in which a value read from memory is updated and written back to the original memory. It can be realized using the data path (feedback mechanism) 26 shown in FIG. 5, which returns from the plurality of arithmetic units 21 to the memory system 22.
For example, in FIG. 12, the read memory value (ST-value) 612 is supplied to the latch 614 and the latch 615 of “way0.blk0”, which is one block of the memory unit 31. In the next cycle, the data supplied to the latches 614 and 615 is written to “way0.blk0” using the base address set in ST-base 613.
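A rough software analogue of this read, update, and write-back loop is given below. It is a sketch under assumptions rather than a description of FIG. 12: `self_update`, the `latency` parameter (standing in for a multi-cycle operation in the arithmetic units), and the list used as the memory block are all hypothetical, and each address is assumed to be touched only once so that no read-after-write hazard arises.

```python
from collections import deque


def self_update(block, base, count, op, latency=3):
    """Read block[base+i], apply op, and write the result back `latency`
    cycles later through a modelled feedback path."""
    in_flight = deque()               # (cycle when ready, address, result)
    for cycle in range(count + latency):
        # Write-back side: store the result that becomes ready this cycle.
        if in_flight and in_flight[0][0] == cycle:
            _, addr, result = in_flight.popleft()
            block[addr] = result
        # Load side: issue one read per cycle while elements remain.
        if cycle < count:
            addr = base + cycle
            in_flight.append((cycle + latency, addr, op(block[addr])))
    return block
```

Because the result returns to the same block it was read from, nothing has to be carried to another stage, which is the property the feedback mechanism 26 is used for here.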
As described above, according to LAPP1 of the present invention:
(1) Distributing the medium-capacity memories and eliminating the inter-stage propagation data paths dedicated to loads and stores greatly reduces the number of wires between stages.
(2) Reducing the number of wires between stages makes it possible to implement a large-scale circuit across multiple LSIs without lowering the operating frequency.
(3) Combining a medium-capacity memory with a small-capacity shift register makes it possible to issue many load instructions to a given range of memory space.
(4) Self-updating memory references that include floating-point operations (multiple cycles) can be placed in multiple stages without increasing inter-stage wiring.
(5) Parallel operation of multiple medium-capacity buffers enables parallel processing of array data distributed over multiple stages.
(6) Moving the instruction map instead of the reusable data held in the Ways reduces the power and time associated with data movement.
(Specific example 1)
FIGS. 13 and 14 show instruction sequences for an example of image processing implemented with the prior art and with the present invention, respectively. In FIG. 13, a load instruction is placed in each stage on the assumption that load data is propagated sequentially.
In FIG. 14, by contrast, load instructions are placed in the fourth, eighth, and twelfth stages, and neighboring data is taken from the Way belonging to each of those stages and fed to the arithmetic units. As a result, no mechanism for unconditionally propagating load data is needed, and the number of stages needed to accommodate the program falls from 24 to 19.
In the prior art, data is propagated from the medium-scale memory placed in the first stage, so part of the contents of that memory could be reused simply by remapping its Way numbers in the first stage.
In the present invention, by contrast, Ways can be reused in different stages by shifting the instruction mapping downward by four stages, without moving the contents of the medium-scale memories distributed among the stages. That is, the last stage and the first stage are connected in a ring structure. For example, in the case of FIG. 14, stages 8 and 12 out of stages 4, 8, and 12 are reused: shifting the instruction mapping by four stages leaves the memory contents of stages 8 and 12 in place, and the newly required memory data is placed in stage 16. Instructions can then be executed using the memory contents of stages 8, 12, and 16.
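The stage arithmetic behind this remapping can be illustrated with a small sketch. The function `shift_instruction_map` and the parameter `num_stages` (the total number of stages in the ring, which the text above does not specify) are assumptions introduced only for this example; stages are numbered from 1 as in the description.

```python
def shift_instruction_map(load_stages, shift, num_stages):
    """Shift the stages that hold load instructions around a ring of stages.

    Returns (new_stages, reused, refill): the stages used after the shift,
    those whose memory contents are already in place, and those that need
    newly loaded data.
    """
    # Stages are 1-based, so convert to 0-based for the modulo and back.
    new_stages = [((s - 1 + shift) % num_stages) + 1 for s in load_stages]
    reused = [s for s in new_stages if s in load_stages]
    refill = [s for s in new_stages if s not in load_stages]
    return new_stages, reused, refill
```

For load stages 4, 8, and 12 and a shift of four stages, the sketch reports stages 8 and 12 as reused and stage 16 as the only one to be filled, matching the example above; the modulo models the ring connection between the last and first stages.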
(Specific example 2)
FIGS. 15 and 16 show instruction sequences for an example of floating-point arithmetic processing implemented with the prior art and with the present invention, respectively. In the prior art of FIG. 15, the store data from the sixth stage must travel one full loop to be stored in the first-stage memory, which makes it difficult to place a large number of stores.
In FIG. 16 of the present invention, by contrast, no data path for directly propagating load data or store data is needed. Update-type load → operation → store sequences can therefore be mapped onto the fourth, eighth, twelfth, and sixteenth stages. Four times as many instructions can be mapped as in the prior art, and processing performance increases fourfold.
The present invention is not limited to the embodiment described above, and various modifications are possible within the scope of the claims. That is, embodiments obtained by combining technical means modified as appropriate within the scope of the claims are also included in the technical scope of the present invention.
(Other embodiments)
In place of the shift register 33 of the above embodiment, a FIFO unit having a plurality of FIFO (First In First Out) buffers may be provided. In the configuration of FIG. 6, for example, the FIFO buffers of the FIFO unit are arranged in one-to-one correspondence with the effective addresses A5, A4, A3, A2, A1, and A0.
Specifically, in the configuration of FIG. 6, for example, the FIFO buffers of the FIFO unit are placed at the positions in one-to-one correspondence with the effective addresses A5, A4, A3, A2, A1, and A0, that is, at the positions of the selector 33-1 and the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33, in place of that selector and those registers.
Each FIFO buffer of the FIFO unit has one selector and five registers, similar to the selector 33-1 and the registers 33-2, 33-3, 33-4, 33-6, and 33-7.
In the shift register 33 of the above embodiment, data from the memory unit 31 is supplied only to the selector 33-1 (here, attention is focused on the data supply to the selector 33-1, and it is assumed that no data is supplied to the selector 33-5). In the FIFO unit, by contrast, data from the memory unit 31 is supplied to the selector of each FIFO buffer.
Then, using each of the effective addresses A5, A4, A3, A2, A1, and A0, data read from one of the registers of each FIFO buffer is output to the corresponding one of O0 to O5 of the latch 24. For example, in the FIFO buffer corresponding to the effective address A5, one of that buffer's five registers is read using the effective address A5, and the data read is output to O5 of the latch 24. The other FIFO buffers operate in the same way.
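The FIFO-based alternative can be modelled informally as one small FIFO per output, each fed from memory and each read at a position chosen by its own address. The class `FifoUnit` and its method names are hypothetical, and position 0 is assumed here to denote the most recently written register.

```python
from collections import deque


class FifoUnit:
    """Toy model: one fixed-depth FIFO buffer per output O0..O5."""

    def __init__(self, num_outputs=6, depth=5):
        self.fifos = [deque([None] * depth, maxlen=depth)
                      for _ in range(num_outputs)]

    def push(self, values):
        # Memory data is supplied to every FIFO buffer, not just the first.
        for fifo, value in zip(self.fifos, values):
            fifo.appendleft(value)

    def read(self, selects):
        # Each address selects one register inside its own FIFO buffer.
        return [fifo[sel] for fifo, sel in zip(self.fifos, selects)]
```

Compared with the single shift register, each output in this model has its own history, so the selected positions no longer have to refer to one shared data stream.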
The present invention can also be expressed as follows. That is, the present invention is an accelerator configuration method in which one storage system is connected to each bundle formed by linking a plurality of stages, each composed of one or more arithmetic units; each storage system is composed of a memory and a shift register; data read from the memory is fed into the head or a midpoint of the shift register; and the memory and the shift register are referenced using a plurality of pieces of address information input to the storage system, so that the contents of the address position corresponding to each address are read out.
In the above accelerator configuration method, it is preferable that the memory unit, which is divided into a plurality of blocks, be provided with an address holding unit for each block that holds address information, that address holding units not connected to the memory unit also be provided, and that the address information in these address holding units be used to identify register positions in the shift register and read those registers.
In the above accelerator configuration method, it is preferable that the data in the address holding unit provided for each block be used to read another block as well, and that some bits of the address information be used to select one of the data items read from the plurality of blocks.
In the above accelerator configuration method, it is preferable that feedback from the final stage of the bundle to the storage system be provided, so that within the bundle the memory can be both read and written.
In the above accelerator configuration method, it is preferable that, when the next high-speed execution is started after a series of high-speed executions and the memory contents belonging to a bundle can be used by another arithmetic instruction, the next high-speed execution be started without moving the memory contents belonging to the bundle, by changing the mapping of arithmetic instructions onto the arithmetic units.
The present invention can also be expressed as follows. That is, a data supply device according to the present invention is a data supply device that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages, and comprises a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a row. Data read from the memory unit is written to the head register or an intermediate register of the shift register unit, and each of the memory unit and the shift register unit is referenced on the basis of a plurality of pieces of address information input to the data supply device and thereby outputs the contents of the address position corresponding to each piece of address information.
According to the above configuration, one memory unit is divided into a plurality of blocks, and data read from each block can be written to the head register or an intermediate register of the shift register unit.
Each of the memory unit and the shift register unit is referenced on the basis of a plurality of pieces of address information input to the data supply device, and can output the contents of the address position corresponding to each piece of address information.
Using such a data supply device to supply data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages eliminates the need for data propagation between the data supply devices that supply data to the different arithmetic unit bundles.
Consequently, the conventional large-scale data propagation mechanism for supplying the contents of primary caches placed throughout the arithmetic unit network to nearby arithmetic units is no longer needed. Data can be supplied efficiently to the data processing device, and the power consumption of each arithmetic unit can thereby be reduced.
It is preferable that the above data supply device further comprise a plurality of address holding units that respectively hold the plurality of pieces of address information input to the data supply device, and that the plurality of address holding units include a plurality of first address storage circuits connected to the blocks of the memory unit in one-to-one correspondence and a plurality of second address storage circuits not connected to any block of the memory unit.
According to the above configuration, the data finally output from the shift register unit can be determined using the address information that references the memory unit and the address information that references the shift register unit.
In the above data supply device, it is preferable that the shift register unit include a selector that selects one of the data items read from two different blocks of the memory unit, and that, when the address information held in a first address storage circuit is used to read both the block to which that first address storage circuit is connected and another block adjacent to that block, the shift register unit use the selector to select one of the data items read from the two blocks on the basis of some bits of the address information held in the first address storage circuit.
According to the above configuration, two blocks can be linked, so that even data too large to fit in one block can be stored in the memory unit.
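How linking two blocks extends the addressable range can be sketched as follows. This is a minimal model under assumptions: `read_linked` and `block_words` are hypothetical, the blocks are Python lists, and a single address bit just above the in-block index is assumed to serve as the selecting bit.

```python
def read_linked(blk_a, blk_b, addr, block_words):
    """Read one word from two linked blocks using a single address."""
    index = addr % block_words               # lower bits index into both blocks
    select = (addr // block_words) & 1       # one upper bit picks the block
    candidates = (blk_a[index], blk_b[index])  # both blocks are read in parallel
    return candidates[select]                # the selector keeps one of them
```

With two four-word blocks, addresses 0 to 7 map onto a single contiguous eight-word region, so a data set twice the size of one block can be held.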
It is preferable that the above data supply device further comprise a feedback mechanism capable of writing, to the memory unit, the operation results of the one or more arithmetic units constituting the final stage of the arithmetic unit bundle.
According to the above configuration, the output values from the memory unit and the shift register unit can be written back to the memory unit.
In the above data supply device, it is preferable that each piece of address information held in the first address storage circuits be either an offset set for the address information input to the data supply device or address information obtained by adding that offset to the input address information, and that each piece of address information held in the second address storage circuits be an offset set for the address information input to the data supply device. It is more preferable that the shift register unit determine the output value from each register using the offset.
According to the above configuration, the memory unit and the shift register unit can be referenced using address information obtained by adding a random offset to the input address information.
In the above data supply device, it is preferable that the shift register unit determine the output value from each register by using the position of each of its registers as the offset.
According to the above configuration, the memory unit and the shift register unit can be referenced using address information obtained by adding a fixed offset to the input address information.
In the above data supply device, it is preferable that, when two pieces of address information are input to the data supply device, the shift register unit determine the output values from some of its registers by using the positions of those registers as offsets set for one of the two pieces of address information, and determine the output values from others of its registers by using the positions of those other registers as offsets set for the other piece of address information.
According to the above configuration, even when two pieces of address information are input to the data supply device, the memory unit and the shift register unit can be referenced, for either piece of address information, using address information obtained by adding a fixed offset to the input address information.
In the above data supply device, it is preferable that, when a plurality of pieces of address information are input to the data supply device, the shift register unit determine the output values from some of its registers using an offset set for one piece of the input address information, and output the data read from the blocks of the memory unit using the remaining pieces of address information as the output values from others of its registers.
According to the above configuration, the memory unit and the shift register unit can be referenced using address information obtained by adding an offset to one piece of the input address information, and can also be referenced using the remaining pieces of the input address information.
A data processing device according to the present invention is a data processing device in which a plurality of the arithmetic unit bundles are arranged in multiple stages, wherein, when the next high-speed execution is started after a series of high-speed executions and the contents of the memory unit of the data supply device that supplies data to an arithmetic unit bundle can be used by another arithmetic instruction, the mapping of arithmetic instructions onto the arithmetic units constituting that arithmetic unit bundle is changed.
 上記構成によれば、データ供給装置のメモリ部に格納されるデータの内容に応じて命令写像位置を変更することにより、従来技術と同等の命令実行能力を確保することができる。 According to the above configuration, by changing the instruction mapping position in accordance with the content of data stored in the memory unit of the data supply device, it is possible to ensure instruction execution capability equivalent to that of the conventional technology.
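One way to picture the remapping is the following sketch. The source does not state how the new mapping is chosen, so the policy here (rotate the instruction-to-stage mapping to maximise reuse of data already held in each stage's memory unit) is purely an assumption, as are the names `count_reuse` and `choose_mapping_shift` and the use of string tags to identify memory contents.

```python
def count_reuse(stage_contents, instr_inputs, shift):
    """Count instructions that would find their input already in memory.

    stage_contents -- tag of the data held by each stage's memory unit
    instr_inputs   -- tag each instruction needs (None if it needs none)
    shift          -- instruction i is mapped onto stage (i + shift) % n
    """
    n = len(stage_contents)
    return sum(
        1
        for i, needed in enumerate(instr_inputs)
        if needed is not None and stage_contents[(i + shift) % n] == needed
    )


def choose_mapping_shift(stage_contents, instr_inputs):
    """Pick the rotation with the most reuse (smallest shift wins ties)."""
    n = len(stage_contents)
    return max(range(n),
               key=lambda s: count_reuse(stage_contents, instr_inputs, s))


# Stages 1 and 3 still hold arrays "B" and "C" from the previous run.
print(choose_mapping_shift(["A", "B", None, "C"], ["B", None, "C", None]))  # 1
```

With a shift of 1, instruction 0 lands on the stage holding "B" and instruction 2 on the stage holding "C", so neither array has to be loaded again.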
Preferably, the above data processing device is a data processing device for executing an instruction code composed of a plurality of lines of machine language instructions, and comprises: n register file units (n being an integer of 1 or more) including a first register file unit that includes a plurality of first registers corresponding to a plurality of register numbers described in the instruction code and temporarily holding data corresponding to the respective register numbers, and a second register file unit that includes a plurality of second registers corresponding to the respective first registers of the first register file unit; n operation units including a first operation unit that forms one stage of the multi-stage configuration and executes an operation according to one of the plurality of lines of machine language instructions using data read from the first registers of the first register file unit, and a second operation unit that forms one stage of the multi-stage configuration and executes an operation according to a machine language instruction, among the plurality of lines of machine language instructions, different from the one used by the first operation unit; and n holding units including a first holding unit that is the output destination of the operation result of the first operation unit when the first operation unit has executed an operation and that temporarily holds that operation result, wherein the first register file unit transfers data that was not subject to the operation processing by the first operation unit to the second register of the second register file unit corresponding to the first register holding that data, the first holding unit, when it holds the operation result of the first operation unit, transfers that operation result to the second operation unit as its output destination, and the second operation unit executes an operation using at least one of the data read from the second registers of the second register file unit and the operation result transferred by the first holding unit, in parallel with the operation executed by the first operation unit.
According to the above configuration, the data in each first register of the first register file unit is transferred to the corresponding second register of the second register file unit.
Therefore, even when the data in a first register of the first register file unit is being used for operation execution by the first operation unit, the second operation unit can read that data from the second register of the second register file unit and use it to execute its own operation.
In addition, the operation result of the first operation unit is transferred to the second operation unit.
Therefore, the second operation unit can use the operation result of the first operation unit to execute its operation immediately after the first operation unit finishes.
Consequently, in the above data processing device, the two operations of the first and second operation units can be executed in parallel.
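The two-stage arrangement can be sketched as a cycle-by-cycle simulation. This is only an illustration under stated assumptions: the two instructions (`r1 + r2` in the first stage, the forwarded result times `r3` in the second), the function name `run_two_stage`, and the choice to copy the whole first register file forward each cycle are all hypothetical and not taken from the source.

```python
def run_two_stage(inputs):
    """Simulate two operation units working on consecutive data sets.

    inputs -- list of dicts, each giving the first register file contents
              {"r1": ..., "r2": ..., "r3": ...} for one data set.
    Stage 1 computes r1 + r2; stage 2 multiplies that result by r3.
    """
    rf2 = None     # second register file unit (copy forwarded from the first)
    hold1 = None   # first holding unit (result of the first operation unit)
    results = []
    for cycle in range(len(inputs) + 1):
        # Second operation unit: uses what was latched in the previous cycle.
        if hold1 is not None:
            results.append(hold1 * rf2["r3"])
        # First operation unit: works on the next data set in the same cycle.
        if cycle < len(inputs):
            rf1 = inputs[cycle]
            next_hold1 = rf1["r1"] + rf1["r2"]
            next_rf2 = dict(rf1)   # register data forwarded to the next stage
        else:
            next_hold1, next_rf2 = None, None
        # End of cycle: latch the holding unit and the second register file.
        hold1, rf2 = next_hold1, next_rf2
    return results


data = [{"r1": 1, "r2": 2, "r3": 3}, {"r1": 4, "r2": 5, "r3": 6}]
print(run_two_stage(data))   # [9, 54]
```

In cycle 1 of this sketch the second stage finishes the first data set while the first stage is already processing the second data set, which is the overlap the text describes.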
In the above data processing device, it is preferable that: the n register file units further include a third register file unit that includes a plurality of third registers corresponding to the respective second registers of the second register file unit; the n operation units further include a third operation unit that forms one stage of the multi-stage configuration and executes an operation according to a machine language instruction, among the plurality of lines of machine language instructions, different from those used by the first operation unit and the second operation unit; the n holding units further include a second holding unit that is the output destination of the operation result of the second operation unit when the second operation unit has executed an operation and that temporarily holds that operation result; the second register file unit transfers data that was not subject to the operation processing by the second operation unit to the third register of the third register file unit corresponding to the second register holding that data; the second holding unit, when it holds the operation result of the second operation unit, transfers that operation result to the third operation unit as its output destination; and the third operation unit executes an operation using at least one of the data read from the third registers of the third register file unit and the operation result transferred by the second holding unit, in parallel with the operations executed by the first operation unit and the second operation unit.
According to the above configuration, the data in each second register of the second register file unit is transferred to the corresponding third register of the third register file unit.
Therefore, even when the data in a second register of the second register file unit is being used for operation execution by the second operation unit, the third operation unit can read that data from the third register of the third register file unit and use it to execute its own operation.
In addition, the operation result of the second operation unit is transferred to the third operation unit.
Therefore, the third operation unit can use the operation result of the second operation unit to execute its operation immediately after the second operation unit finishes.
Consequently, in the above data processing device, the three operations of the first, second, and third operation units can be executed in parallel.
In the above data processing device, it is preferable that the N-th holding unit (N being an integer of 1 or more and n or less) among the n holding units transfers the operation result it holds to the (N+2)-th register file unit among the n register file units when that operation result is used for operation execution by the (N+2)-th or a later operation unit among the n operation units, and transfers the operation result to the (N+1)-th operation unit among the n operation units when that operation result is not used for operation execution by the (N+2)-th or a later operation unit.
According to the above configuration, when the operation result held by the N-th holding unit is not used for operation execution by the (N+2)-th or a later operation unit, it is transferred to the (N+1)-th operation unit. In this case, unnecessary data transfer between register file units is reduced, and as a result power consumption can be further reduced.
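The routing rule for the N-th holding unit can be written down as a small sketch. The function name `route_result`, the tuple return value, and the representation of consumers as a set of 1-based stage numbers are assumptions made for illustration only.

```python
def route_result(n_index, consumer_stages):
    """Decide where the N-th holding unit sends its operation result.

    n_index         -- N, the 1-based index of the holding unit
    consumer_stages -- 1-based indices of the operation units that use the result
    """
    # Used by the (N+2)-th or a later operation unit:
    # send it to the (N+2)-th register file unit.
    if any(stage >= n_index + 2 for stage in consumer_stages):
        return ("register_file", n_index + 2)
    # Otherwise forward it only to the (N+1)-th operation unit,
    # avoiding a transfer between register file units.
    return ("operation_unit", n_index + 1)


print(route_result(1, {2}))      # ('operation_unit', 2)
print(route_result(1, {2, 4}))   # ('register_file', 3)
```

The second branch is the case the text credits with the power saving: a result consumed only by the next stage never enters a register file unit.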
The present invention can be suitably used for supplying data to a data processing device that has a plurality of arithmetic units and can perform the arithmetic processing of the respective arithmetic units synchronously.
22  Memory system
23  Latch (address holding unit)
23-1, 23-2, 23-4, 23-5  Latches (first address storage circuits)
23-3, 23-6  Latches (second address storage circuits)
31  Memory unit
33  Shift register (shift register unit)
101, 102, 103  LAPP

Claims (13)

  1.  A data supply device that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are arranged in multiple stages, the data supply device comprising:
     a memory unit divided into a plurality of blocks; and
     a shift register unit in which a plurality of registers are connected in a line,
     wherein data read from the memory unit is written to the first register or an intermediate register of the shift register unit, and
     each of the memory unit and the shift register unit is referenced based on a plurality of pieces of address information input to the data supply device, and thereby outputs the contents of each address position corresponding to each piece of address information.
  2.  The data supply device according to claim 1, further comprising a plurality of address holding units that respectively hold the plurality of pieces of address information input to the data supply device,
     wherein the plurality of address holding units include:
      a plurality of first address storage circuits connected to the blocks of the memory unit in one-to-one correspondence with the blocks; and
      a plurality of second address storage circuits that are not connected to any of the blocks of the memory unit.
  3.  The data supply device according to claim 2, wherein the shift register unit includes a selector that selects one of the data items read from two different blocks of the memory unit, and
     when reads are performed, using the address information held in a first address storage circuit, from the block to which that first address storage circuit is connected and from another block adjacent to that block,
     the shift register unit uses the selector to select one of the data items read from the two blocks, based on some of the bits of the address information held in that first address storage circuit.
  4.  The data supply device according to any one of claims 1 to 3, further comprising a feedback mechanism capable of writing, to the memory unit, the operation results of one or more arithmetic units constituting the final stage of the arithmetic unit bundle.
  5.  The data supply device according to claim 2 or 3, wherein each piece of address information held in each first address storage circuit is either an offset set for the address information input to the data supply device or address information obtained by adding that offset to that address information, and
     each piece of address information held in each second address storage circuit is an offset set for the address information input to the data supply device.
  6.  The data supply device according to claim 5, wherein the shift register unit determines the output value from each register using the offset.
  7.  The data supply device according to claim 6, wherein the shift register unit determines the output value from each register by using the position of each of its registers as the offset.
  8.  The data supply device according to claim 6, wherein, when two pieces of address information are input to the data supply device,
     the shift register unit:
      determines the output values from some of its registers by using the positions of those registers as the offset set for one of the pieces of address information input to the data supply device; and
      determines the output values from the others of its registers by using the positions of those other registers as the offset set for the other piece of address information input to the data supply device.
  9.  The data supply device according to claim 6, wherein, when a plurality of pieces of address information are input to the data supply device,
     the shift register unit:
      determines the output values from some of its registers using the offset set for one piece of address information input to the data supply device; and
      using the remaining address information input to the data supply device, outputs data read from a block of the memory unit as the output values from the others of its registers.
  10.  A data processing device in which a plurality of the arithmetic unit bundles are arranged in multiple stages,
     wherein, when the next high-speed execution is started after a certain series of high-speed executions, if the contents of the memory unit of the data supply device according to any one of claims 1 to 9 that supplies data to a certain arithmetic unit bundle can be used by another operation instruction, the mapping of operation instructions onto the arithmetic units constituting the arithmetic unit bundle is changed.
  11.  The data processing device according to claim 10, wherein the data processing device is a data processing device for executing an instruction code composed of a plurality of lines of machine language instructions, and comprises:
     n register file units (n being an integer of 1 or more) including a first register file unit that includes a plurality of first registers corresponding to a plurality of register numbers described in the instruction code and temporarily holding data corresponding to the respective register numbers, and a second register file unit that includes a plurality of second registers corresponding to the respective first registers of the first register file unit;
     n operation units including a first operation unit that forms one stage of the multi-stage configuration and executes an operation according to one of the plurality of lines of machine language instructions using data read from the first registers of the first register file unit, and a second operation unit that forms one stage of the multi-stage configuration and executes an operation according to a machine language instruction, among the plurality of lines of machine language instructions, different from the one used by the first operation unit; and
     n holding units including a first holding unit that is the output destination of the operation result of the first operation unit when the first operation unit has executed an operation and that temporarily holds that operation result,
     wherein the first register file unit transfers data that was not subject to the operation processing by the first operation unit to the second register of the second register file unit corresponding to the first register holding that data,
     the first holding unit, when it holds the operation result of the first operation unit, transfers that operation result to the second operation unit as its output destination, and
     the second operation unit executes an operation using at least one of the data read from the second registers of the second register file unit and the operation result transferred by the first holding unit, in parallel with the operation executed by the first operation unit.
  12.  The data processing device according to claim 11, wherein the n register file units further include a third register file unit that includes a plurality of third registers corresponding to the respective second registers of the second register file unit,
     the n operation units further include a third operation unit that forms one stage of the multi-stage configuration and executes an operation according to a machine language instruction, among the plurality of lines of machine language instructions, different from those used by the first operation unit and the second operation unit,
     the n holding units further include a second holding unit that is the output destination of the operation result of the second operation unit when the second operation unit has executed an operation and that temporarily holds that operation result,
     the second register file unit transfers data that was not subject to the operation processing by the second operation unit to the third register of the third register file unit corresponding to the second register holding that data,
     the second holding unit, when it holds the operation result of the second operation unit, transfers that operation result to the third operation unit as its output destination, and
     the third operation unit executes an operation using at least one of the data read from the third registers of the third register file unit and the operation result transferred by the second holding unit, in parallel with the operations executed by the first operation unit and the second operation unit.
  13.  The data processing device according to claim 11 or 12, wherein the N-th holding unit (N being an integer of 1 or more and n or less) among the n holding units:
     transfers the operation result it holds to the (N+2)-th register file unit among the n register file units when that operation result is used for operation execution by the (N+2)-th or a later operation unit among the n operation units; and
     transfers the operation result it holds to the (N+1)-th operation unit among the n operation units when that operation result is not used for operation execution by the (N+2)-th or a later operation unit.
PCT/JP2013/057503 2012-03-16 2013-03-15 Data providing device and data processing device WO2013137459A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014505037A JP6164616B2 (en) 2012-03-16 2013-03-15 Data supply device and data processing device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-061110 2012-03-16
JP2012061110 2012-03-16

Publications (1)

Publication Number Publication Date
WO2013137459A1 true WO2013137459A1 (en) 2013-09-19

Family

ID=49161353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/057503 WO2013137459A1 (en) 2012-03-16 2013-03-15 Data providing device and data processing device

Country Status (2)

Country Link
JP (1) JP6164616B2 (en)
WO (1) WO2013137459A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05266056A (en) * 1992-03-23 1993-10-15 Nippon Telegr & Teleph Corp <Ntt> Parallel arithemtic unit for summing squares of sum/ difference of differential absolute value
JP2002215455A (en) * 2001-01-19 2002-08-02 Sony Corp Interleaved device
JP2004145476A (en) * 2002-10-22 2004-05-20 Hiroshima Industrial Promotion Organization Matching arithmetic circuit
WO2010044242A1 (en) * 2008-10-14 2010-04-22 国立大学法人奈良先端科学技術大学院大学 Data processing device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07230366A (en) * 1994-02-18 1995-08-29 Ricoh Co Ltd Picture processor
JP3652909B2 (en) * 1999-02-18 2005-05-25 日本電信電話株式会社 Pseudo multi-port memory device
JP4940497B2 (en) * 2001-01-19 2012-05-30 ソニー株式会社 Address generator

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016163421A1 (en) * 2015-04-08 2016-10-13 国立大学法人奈良先端科学技術大学院大学 Data processing device
CN107408076A (en) * 2015-04-08 2017-11-28 国立大学法人奈良先端科学技术大学院大学 Data processing equipment
US10275392B2 (en) 2015-04-08 2019-04-30 National University Corporation NARA Institute of Science and Technology Data processing device

Also Published As

Publication number Publication date
JPWO2013137459A1 (en) 2015-08-03
JP6164616B2 (en) 2017-07-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13760465

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014505037

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13760465

Country of ref document: EP

Kind code of ref document: A1