WO2013137459A1

WO2013137459A1 - Data providing device and data processing device

Info

Publication number: WO2013137459A1
Application number: PCT/JP2013/057503
Authority: WO
Inventors: 康彦中島; 駿姚
Original assignee: 国立大学法人奈良先端科学技術大学院大学
Priority date: 2012-03-16
Filing date: 2013-03-15
Publication date: 2013-09-19
Also published as: JP6164616B2; JPWO2013137459A1

Abstract

A memory system (22) for providing data to a computing unit cluster comprising multiple computing units configured in multiple stages is provided with a memory unit (31) partitioned into multiple blocks, and a shift register (33) comprising multiple registers connected in series.

Description

Data supply device and data processing device

The present invention relates to a data processing apparatus that has a plurality of arithmetic units and can perform arithmetic processing in each arithmetic unit synchronously, and more particularly to a data supply method suitable for supplying data to such a data processing apparatus.

In recent microprocessors, many methods for improving the effective performance by shortening the machine cycle and increasing the number of instructions executed per machine cycle have been proposed.

For example, an arithmetic unit array method is known as a method for processing such a large number of instructions in parallel. This arithmetic unit array method is a method in which an arithmetic unit network is fixed in accordance with target data processing, and input data is poured into the fixed arithmetic unit network (see, for example, Patent Documents 1 to 3).

In this arithmetic unit array method, it is possible to execute many functions in parallel by using an arithmetic unit network composed of a plurality of arithmetic units.

However, the arithmetic unit array method cannot execute existing machine language instructions. It therefore requires dedicated machine language instruction generation means for generating machine language instructions peculiar to the arithmetic unit array method, and so lacks versatility.

Therefore, for example, a superscalar method, a vector method, or a VLIW (Very Long Instruction Word) method is known as a method capable of executing general machine language instructions and executing machine language instructions in parallel. In these methods, a plurality of operations and the like are specified in one instruction and are executed simultaneously.

First, the superscalar method is a method in which hardware dynamically detects machine language instructions that can be executed simultaneously from a machine language instruction sequence and executes them in parallel.

This superscalar method has the advantage of being able to use existing software assets as they are, but recently tends to be avoided due to the complexity of the mechanism and the large amount of power consumption.

Next, the vector method is a method in which basic operations such as load, operation, and store are repeatedly applied using a vector register in which a large number of registers are arranged in a one-dimensional direction, and it can achieve high speed with high power efficiency. Furthermore, since no cache memory is required, the data transfer speed between the main memory and the vector register is guaranteed, and as a result, stable high speed is realized.

However, the vector method can only perform operations between the same element numbers of different vector registers, and is not suitable for a program for performing operations while referring to adjacent elements in the same vector register.

Finally, the VLIW method is a method in which a plurality of operations and the like are designated in one instruction and are executed simultaneously. In this VLIW method, for example, four instructions are fetched simultaneously and decoded simultaneously, the necessary data is read from a general-purpose register, the operations are performed simultaneously by a plurality of arithmetic devices, and each operation result is stored in the operation result storage means attached to the corresponding arithmetic device.

In the next cycle, the contents are read from the operation result storage means and written to the general-purpose register. When the operation result that has been read is required by the next operation, it is bypassed to the input of the arithmetic device.

On the other hand, for a load instruction, the cache memory is referred to in the LD / ST unit, the load result is stored in the load result storage means associated with the LD / ST unit, and the arithmetic device then operates in the next cycle.

In this way, in the VLIW system, it is possible to simultaneously execute operations for the number of juxtaposed arithmetic devices and LD / ST units. Furthermore, in the VLIW method, a sequence of instructions that can be executed in parallel is scheduled in advance by a compiler or the like, so that a mechanism for dynamically detecting machine language instructions that can be executed simultaneously as in the superscalar method becomes unnecessary. Therefore, in the VLIW method, it is possible to execute instructions with high power efficiency. However, in order to simultaneously execute a large number of load / store instructions, it is necessary to equip a memory system having a large number of ports. Since such a memory system has extremely poor area efficiency, there is a limit to the increase in the number of instructions that can be executed simultaneously by the VLIW method.

Patent Document 1: Japanese Patent Publication “JP-A-8-83264 (published March 26, 1996)”
Patent Document 2: Japanese Patent Publication “JP-A-2001-312481 (published November 9, 2001)”
Patent Document 3: Japanese Patent Publication “JP-A-2003-76668 (published March 14, 2003)”

By the way, in the above-described arithmetic unit array system, the adoption of the cache system greatly contributes to the performance improvement. As the cache system, there is a system in which a primary cache is incorporated everywhere in the arithmetic unit network, and at the same time, a secondary cache is provided between the external main memory. In this method, the hit rate of the secondary cache is increased, access to the main memory is reduced, and the performance of each arithmetic unit is improved.

When such a cache system is adopted, supplying data to a plurality of arithmetic units at the same time requires a large-scale data propagation mechanism that supplies the contents of the primary caches provided in various places in the arithmetic unit network to nearby arithmetic units.

Specifically, this is a mechanism in which data read from the primary cache is passed through all the small-scale buffers attached to the arithmetic units, while each small-scale buffer stores a certain amount of it. In this mechanism, a large number of wirings for reading out data from the primary cache every cycle are connected to many small-scale buffers. That is, there has been a problem that no consideration has been given to efficiently propagating the contents of the primary cache to the arithmetic units.

In view of the above problems, an object of the present invention is to provide a data supply device that, for a data processing apparatus having a plurality of arithmetic units and capable of performing arithmetic processing by each arithmetic unit synchronously, supplies data to the data processing apparatus efficiently and can thereby reduce the power consumption of each arithmetic unit.

In the conventional arithmetic unit array system, primary caches in which a plurality of ways are aggregated are arranged in various places in the arithmetic unit network, and data is supplied through small-scale buffers to the arithmetic units to which a primary cache is not directly connected. This method has the advantage that the degree of freedom in arranging load instructions in the machine language instruction sequence is large, but has the disadvantage that the wiring for connecting the small-scale buffers becomes large.

In the present invention, each way of the primary cache is distributed uniformly in the vicinity of the arithmetic units, and the connections between small-scale buffers are eliminated. Therefore, although there is a restriction on the arrangement of load instructions in the machine language instruction sequence, the same instruction execution capability is obtained by changing the instruction mapping position according to the content of the data stored in the primary cache. That is, the problem is solved by reducing the number of wirings without reducing the capacity.

In order to achieve the above object, a data supply apparatus according to the present invention is a data supply apparatus that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages, and includes a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a row. The shift register unit writes the data read from the memory unit to a register at the head or in the middle of the shift register unit, and each of the memory unit and the shift register unit outputs the contents of each address position corresponding to each item of address information by referring to the plurality of items of address information input to the data supply device.

That is, according to the above configuration, one memory unit is divided into a plurality of blocks, and the data read from each block can be written to the head or middle register of the shift register unit.

Each of the memory section and the shift register section is referred to based on a plurality of address information input to the data supply device, and can output the contents of each address position corresponding to each address information.
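The configuration described above can be pictured with a small behavioural model. The following is only a sketch under stated assumptions, not the circuit of the invention: the class name `DataSupplySketch`, the flat address layout (block index times block size plus offset), and the rule that a lookup prefers the shift register over the memory unit are all hypothetical choices made for illustration.

```python
class DataSupplySketch:
    """Toy model: a memory unit split into blocks plus a chain of registers."""

    def __init__(self, blocks, num_registers):
        self.blocks = [list(b) for b in blocks]    # memory unit, divided into blocks
        self.block_size = len(self.blocks[0])      # assumption: equal-sized blocks
        self.shift_regs = [None] * num_registers   # each entry: (address, value) or None

    def mem_read(self, address):
        # Assumed flat addressing: address = block * block_size + offset.
        block, offset = divmod(address, self.block_size)
        return self.blocks[block][offset]

    def shift(self):
        """Move every register's content one position toward the tail."""
        self.shift_regs = [None] + self.shift_regs[:-1]

    def fill(self, address, position=0):
        """Write the word at `address` to the head (0) or to a middle register."""
        self.shift_regs[position] = (address, self.mem_read(address))

    def lookup(self, addresses):
        """Answer several addresses at once, noting which part supplied each word."""
        held = dict(entry for entry in self.shift_regs if entry is not None)
        return [
            (held[a], "shift_register") if a in held else (self.mem_read(a), "memory")
            for a in addresses
        ]
```

In this model, `fill` with `position=0` corresponds to writing at the head of the shift register unit and any other position to writing in the middle, while `lookup` stands in for referring to a plurality of items of address information at once.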

By using such a data supply device to supply data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages, data propagation between data supply devices that supply data to different arithmetic unit bundles becomes unnecessary.

This eliminates the need for a large-scale data propagation mechanism that, as in the prior art, supplies the contents of the primary caches provided in various places in the arithmetic unit network to nearby arithmetic units. Data can therefore be supplied efficiently, which reduces the power consumption of each arithmetic unit.

A data processing apparatus according to the present invention is a data processing apparatus in which a plurality of the arithmetic unit bundles are configured in multiple stages, wherein, when the next high-speed execution is started after a certain series of high-speed executions and the contents of the memory unit of the data supply device that supplies data to the arithmetic units can be used by another operation instruction, the mapping of operation instructions to the arithmetic units constituting the arithmetic unit bundle is changed.

According to the above configuration, by changing the instruction mapping position in accordance with the content of data stored in the memory unit of the data supply device, it is possible to ensure instruction execution capability equivalent to that of the conventional technology.

As described above, the data supply device of the present invention is a data supply device that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages, and includes a memory unit divided into a plurality of blocks and a shift register unit in which a plurality of registers are connected in a row. The shift register unit writes data read from the memory unit to a register at the head or in the middle of the shift register unit, and each of the memory unit and the shift register unit outputs the contents of each address position corresponding to each item of address information by referring to the plurality of items of address information input to the data supply device.

Therefore, in a data processing apparatus that has a plurality of arithmetic units and can perform arithmetic processing by each arithmetic unit synchronously, supplying data to the data processing apparatus efficiently has the effect of reducing the power consumption of each arithmetic unit.

A diagram showing the configuration of a LAPP in one embodiment of the present invention.
A diagram showing the configuration of a LAPP in another embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of a LAPP in which the configuration of three data processing stages, including the first to third data processing stages in the LAPP, is expanded to a configuration of N data processing stages.
A schematic diagram for explaining the data supply from the cache memory in the LAPP.
A schematic diagram for explaining a configuration in which one medium-capacity memory is arranged for every four stages.
A detailed block diagram of a memory system including a medium-capacity memory.
An explanatory diagram for explaining the operation of the memory system.
An explanatory diagram for explaining the operation of the memory system.
An explanatory diagram for explaining the operation of the memory system.
An explanatory diagram for explaining the operation of the memory system.
An explanatory diagram for explaining the operation of the memory system.
An explanatory diagram for explaining the operation of the memory system.
A diagram showing the instruction sequence when an example of image processing is implemented by the prior art.
A diagram showing the instruction sequence when an example of image processing is implemented by the present invention.
A diagram showing the instruction sequence when an example of floating-point arithmetic processing is implemented by the prior art.
A diagram showing the instruction sequence when an example of floating-point arithmetic processing is implemented by the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings used for the following description, the same parts are denoted by the same reference numerals. Their names and functions are also the same. Therefore, detailed description thereof will not be repeated.

(Prerequisite technology of the present invention)
The present invention relates to a data supply method in a computer configuration system in which a large number of arithmetic units are juxtaposed. The present invention is particularly relevant to the memory reference mechanism corresponding to the memory reference patterns shown in Table 1.

Generally, for the memory reference mechanism in the above-described data supply method, the two competing configurations are the vector mechanism and the many-core processor. For a program consisting of completely regular memory references and computations, that is, a program whose required ratio of memory performance to computation performance is 1:1, the vector mechanism is optimal. With the vector mechanism, both the memory performance and the computation performance can be used to the full by overlapping the execution of vector load instructions and vector operation instructions.

However, in practice, memory references usually contain random access elements. For this reason, even if the memory references are globally regular, the vector mechanism cannot cope when they are locally random (for example, when array subscripts I-1, I, and I+1 are used simultaneously).
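A minimal illustration of such a locally random reference is given below. It is a sketch that does not appear in the original document; the function names `three_point_sum` and `elementwise_add` and the sample data are hypothetical. The first function reads subscripts I-1, I, and I+1 of the same array in every iteration, whereas the second combines only elements that have the same index, which is the pattern the vector method handles directly.

```python
def three_point_sum(a):
    """Return b where b[i] = a[i-1] + a[i] + a[i+1] for each interior index i.

    Each iteration reads three neighbouring elements of the same array, so the
    references are regular overall but not aligned element by element.
    """
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]


def elementwise_add(x, y):
    """Vector-style operation: only elements with the same index are combined."""
    return [xi + yi for xi, yi in zip(x, y)]
```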

On the other hand, although a many-core processor can cope with the above-mentioned random access, an extremely advanced superscalar function is required in order to maintain a 1:1 ratio of memory performance to computation performance. In particular, in order to completely overlap the address calculation, the memory reference, and the operation, it is important how well the address calculation can be hidden.

As a response to this requirement, for example, the following arithmetic array type accelerator (Linear Array Pipeline Processor) (hereinafter referred to as “LAPP”) can be used. This LAPP (data processing device) employs a coarse-grained reconfigurable array (hereinafter referred to as “CGRA”) in which a plurality of arithmetic units are arranged in a two-dimensional array, and uses existing machine language instructions.

FIG. 1 is a diagram showing the configuration of the LAPP described above. As shown in FIG. 1, the LAPP 101 includes a configuration memory 10, a first register file unit 110, a second register file unit 210, a first arithmetic device (first arithmetic unit, first holding unit) 120, and a second arithmetic device (second arithmetic unit, second holding unit) 220.

The configuration memory 10 constitutes a known CGRA and stores configuration data. The configuration data is data that defines processing contents in the first arithmetic device 120 and the second arithmetic device 220. The configuration memory 10 transfers such configuration data to the first register file unit 110 and the second register file unit 210.

The first register file unit 110 holds data necessary for arithmetic processing in the first arithmetic unit 120. The first register file unit 110 includes a register group 111 including a plurality of registers (first registers) r0 to r11, and a transfer unit 112 for transferring the read data of the registers r0 to r11 of the register group 111 to the outside of the first register file unit 110.

Reading and writing to each of the registers r0 to r11 of the register group 111 is executed based on configuration data stored in the configuration memory 10. Each register r0 to r11 of the register group 111 is read or written using its own register number 0 to 11 as an access key.

When the read register number is specified, the transfer unit 112 transfers the data held in the register with the specified number to the outside of the first register file unit 110.

The second register file unit 210 holds data necessary for arithmetic processing in the second arithmetic unit 220. The second register file unit 210 includes a register group 211 including a plurality of registers (second registers) r0 to r11, and a transfer unit 212 for transferring the read data of the registers r0 to r11 of the register group 211 to the outside of the second register file unit 210.

Reading and writing to each of the registers r0 to r11 in the register group 211 is executed based on configuration data stored in the configuration memory 10. Each register r0 to r11 of the register group 211 is read or written using its own register number 0 to 11 as an access key.

The registers r0 to r11 of the register group 211 have a one-to-one correspondence with the registers r0 to r11 of the register group 111 of the first register file unit 110, and the register numbers of the registers of the register group 111 and the register group 211 are associated with each other. The transfer unit 112 of the first register file unit 110 can thus transfer the read data of each of the registers r0 to r11 of the register group 111 to the register of the register group 211 of the second register file unit 210 that has the same register number.

For example, the transfer unit 112 of the first register file unit 110 can transfer the read data of the register r3 of the register group 111 to the register r3 of the register group 211 of the second register file unit 210. The transfer unit 112 of the first register file unit 110 can transfer read data of the register r9 of the register group 111 to the register r9 of the register group 211 of the second register file unit 210.
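The same-numbered transfer between register files can be sketched as follows. This is an illustrative model only; the function name `transfer_same_number`, the list representation of a register group, and the sample values are hypothetical and are not taken from the original document.

```python
NUM_REGS = 12  # registers r0 to r11, as in the description


def transfer_same_number(src_regs, dst_regs, numbers):
    """Copy the selected registers of one register group into the registers
    with the same numbers in the next register group.

    Returns the updated destination group as a new list.
    """
    updated = list(dst_regs)
    for n in numbers:
        if not 0 <= n < NUM_REGS:
            raise ValueError(f"register number out of range: {n}")
        updated[n] = src_regs[n]  # r<n> of the source goes to r<n> of the destination
    return updated
```

For instance, transferring registers 3 and 9 from a model of the register group 111 changes only r3 and r9 in the model of the register group 211 and leaves every other register as it was.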

When the read register number is specified, the transfer unit 212 transfers the data held in the register with the specified number to the outside of the second register file unit 210.

The first arithmetic unit 120 performs the substantial processing in the LAPP 101. The first arithmetic unit 120 includes an arithmetic unit group 121 including arithmetic units 1-1 to 1-4, a holder group 122 including holders 1-1 to 1-4, and a transfer unit 123.

The first arithmetic unit 120 constitutes a first data processing stage together with the first register file unit 110, and the transfer unit 112 of the first register file unit 110 can transfer the read data of the registers r0 to r11 of the register group 111 to the first arithmetic unit 120. The arithmetic units 1-1 to 1-4 of the arithmetic unit group 121 of the first arithmetic unit 120 each obtain two items of read data from the registers r0 to r11 of the first register file unit 110 and use them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 1-1 to 1-4 is executed simultaneously.

The holders 1-1 to 1-4 of the holder group 122 store the calculation results of the corresponding arithmetic units 1-1 to 1-4. The holders 1-1 to 1-4 correspond one-to-one with the arithmetic units 1-1 to 1-4.

The transfer unit 123 transfers the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the outside of the first arithmetic unit 120.

The second arithmetic unit 220 performs the substantial processing in the LAPP 101. The second arithmetic unit 220 includes an arithmetic unit group 221 including arithmetic units 2-1 to 2-4, a holder group 222 including holders 2-1 to 2-4, and a transfer unit 223.

The second arithmetic unit 220, together with the second register file unit 210, constitutes a second data processing stage, and the transfer unit 212 of the second register file unit 210 can transfer the read data of the registers r0 to r11 of the register group 211 to the second arithmetic unit 220. The arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 of the second arithmetic unit 220 each obtain two items of read data from the registers r0 to r11 of the second register file unit 210 and use them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 2-1 to 2-4 is executed simultaneously.

Further, the arithmetic units 2-1 to 2-4 of the arithmetic unit group 221 of the second arithmetic unit 220 can acquire the calculation results stored in the holders 1-1 to 1-4 of the holder group 122 of the first arithmetic unit 120. To this end, the transfer unit 123 of the first arithmetic unit 120 can transfer the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic unit 220.

The arithmetic units 2-1 to 2-4 of the second arithmetic unit 220 can then execute arithmetic processing using these calculation results instead of the read data of the registers r0 to r11 of the second register file unit 210.

The holders 2-1 to 2-4 of the holder group 222 store the calculation results of the corresponding arithmetic units 2-1 to 2-4. The holders 2-1 to 2-4 correspond one-to-one with the arithmetic units 2-1 to 2-4.

The transfer unit 223 transfers the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the outside of the second arithmetic unit 220.

Next, the operation of LAPP 101 will be described.

In LAPP 101, arithmetic processing by the first arithmetic unit 120 is performed using read data of the registers r0 to r11 of the register group 111.

Simultaneously with the arithmetic processing by the first arithmetic device 120, the read data of the registers r0 to r11 of the register group 111 that is not the target of the arithmetic processing by the first arithmetic device 120 is transferred to the second register file unit 210.

Then, in the next cycle, the arithmetic processing by the second arithmetic unit 220 is performed using the data transferred to the registers r0 to r11 of the register group 211 of the second register file unit 210.

Simultaneously with the arithmetic processing by the second arithmetic device 220, the arithmetic processing by the first arithmetic device 120 is performed using the read data of the registers r0 to r11 of the register group 111.

Further, when the second arithmetic device 220 needs the operation result of the first arithmetic device 120, the transfer unit 123 of the first arithmetic device 120 transfers the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 to the second arithmetic device 220.
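The two-stage operation described above can be traced with a small model. This is a sketch under stated assumptions rather than the actual hardware behaviour: the names `run_stage` and `run_two_stages` and the tuple encoding of an operation are hypothetical, and for simplicity every register is copied to the second stage, whereas the description transfers only the read data that is not the target of the first stage's processing.

```python
import operator


def run_stage(ops, regs, prev_holders):
    """Execute one stage's operations together and return its holder contents.

    Each op is (function, source_a, source_b). A source is ("reg", n) for
    register n of this stage's register file, or ("prev", k) for holder k of
    the preceding stage.
    """
    def read(source):
        kind, index = source
        return regs[index] if kind == "reg" else prev_holders[index]

    return [fn(read(a), read(b)) for fn, a, b in ops]


def run_two_stages(regs1, stage1_ops, stage2_ops):
    """Follow one set of inputs through two cycles of a two-stage pipeline."""
    # Cycle 1: stage 1 computes while its register contents move to stage 2.
    # Assumed simplification: every register is copied, same number to same number.
    holders1 = run_stage(stage1_ops, regs1, [])
    regs2 = list(regs1)
    # Cycle 2: stage 2 computes from the transferred registers and stage 1's holders.
    holders2 = run_stage(stage2_ops, regs2, holders1)
    return holders1, holders2
```

For example, with r0 = 2, r1 = 3, and r2 = 4, a first-stage addition of r0 and r1 followed by a second-stage multiplication of that result by r2 leaves 5 in the first-stage holder and 20 in the second-stage holder.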

The LAPP 102 shown in FIG. 2 further includes a third register file unit 310 and a third arithmetic unit (third arithmetic unit, third holding unit) 320 in addition to the LAPP 101 of FIG. Thereby, in addition to the arithmetic processing by the first arithmetic device 120 and the arithmetic processing by the second arithmetic device 220, the arithmetic processing by the third arithmetic device 320 is also executed simultaneously.

The third register file unit 310 holds data necessary for arithmetic processing in the third arithmetic unit 320. The third register file unit 310 includes a register group 311 including a plurality of registers (third registers) r0 to r11, and a transfer unit 312 for transferring the read data of the registers r0 to r11 of the register group 311 to the outside of the third register file unit 310.

Reading and writing to the registers r0 to r11 of the register group 311 are executed based on the configuration data stored in the configuration memory 10. Each register r0 to r11 of the register group 311 is read or written using its own register number 0 to 11 as an access key.

The registers r0 to r11 of the register group 311 have a one-to-one correspondence with the registers r0 to r11 of the register group 211 of the second register file unit 210, and the register numbers of the registers of the register group 211 and the register group 311 are associated with each other. The transfer unit 212 of the second register file unit 210 can thus transfer the read data of each of the registers r0 to r11 of the register group 211 to the register of the register group 311 of the third register file unit 310 that has the same register number.

When the read register number is designated, the transfer unit 312 transfers the data held in the register with the designated number to the outside of the third register file unit 310.

In addition, the third register file unit 310 can acquire, via the transfer unit 123 of the first arithmetic unit 120, the calculation results of the arithmetic units 1-1 to 1-4 stored in the holders 1-1 to 1-4 of the first arithmetic unit 120.

The third arithmetic unit 320 performs the substantial processing in the LAPP 102. The third arithmetic unit 320 includes an arithmetic unit group 321 including arithmetic units 3-1 to 3-4, a holder group 322 including holders 3-1 to 3-4, and a transfer unit 323.

The third arithmetic unit 320 constitutes a third data processing stage together with the third register file unit 310, and the transfer unit 312 of the third register file unit 310 can transfer the read data of the registers r0 to r11 of the register group 311 to the third arithmetic unit 320. The arithmetic units 3-1 to 3-4 of the arithmetic unit group 321 of the third arithmetic unit 320 each obtain two items of read data from the registers r0 to r11 of the third register file unit 310 and use them to execute various arithmetic processes such as the four basic arithmetic operations and logical operations. The arithmetic processing of the arithmetic units 3-1 to 3-4 is executed simultaneously.

The holders 3-1 to 3-4 of the holder group 322 store the calculation results of the corresponding arithmetic units 3-1 to 3-4. The holders 3-1 to 3-4 correspond one-to-one with the arithmetic units 3-1 to 3-4.

The transfer unit 323 transfers the calculation results of the arithmetic units 3-1 to 3-4 stored in the holders 3-1 to 3-4 to the outside of the third arithmetic unit 320.

In addition, the third arithmetic unit 320 can acquire, via the transfer unit 223 of the second arithmetic unit 220, the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 of the second arithmetic unit 220.

Next, the operation of the LAPP 102 will be described.

In LAPP102, the arithmetic processing by the second arithmetic unit 220 is performed using the read data of the registers r0 to r11 of the register group 211.

Simultaneously with the arithmetic processing by the second arithmetic device 220, the read data of the registers r0 to r11 of the register group 211 that is not subject to the arithmetic processing by the second arithmetic device 220 is transferred to the third register file unit 310.

Then, in the next cycle, the arithmetic processing by the third arithmetic unit 320 is performed using the data transferred to the registers r0 to r11 of the register group 311 of the third register file unit 310.

Simultaneously with the arithmetic processing by the third arithmetic device 320, the arithmetic processing by the second arithmetic device 220 is performed using the read data of the registers r0 to r11 of the register group 211.

Further, when the third arithmetic device 320 needs the operation result of the second arithmetic device 220, the transfer unit 223 of the second arithmetic device 220 transfers the calculation results of the arithmetic units 2-1 to 2-4 stored in the holders 2-1 to 2-4 to the third arithmetic device 320.

In some cases, the second arithmetic device 220 does not need the arithmetic result of the first arithmetic device 120, and the third arithmetic device 320 needs the arithmetic result of the first arithmetic device 120. In this case, by storing the result of the first arithmetic unit 120 in the third register file unit, the arithmetic result of the first arithmetic unit 120 can be input to the third arithmetic unit 320 indirectly.

Note that the configuration of three data processing stages including the first to third data processing stages in the LAPP 102 may be extended to a configuration of N data processing stages.

Here, N is an integer of 1 or more. In this case, when the calculation result of an arithmetic unit constituting the Nth data processing stage is used by arithmetic units at or after the (N + 2)th data processing stage, it is written to the register file unit of the (N + 2)th data processing stage.

On the other hand, when the arithmetic units at or after the (N + 2)th data processing stage do not use the calculation result, it is transferred to the arithmetic unit of the (N + 1)th data processing stage without being written to the register file unit of the (N + 2)th data processing stage.
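One reading of this routing rule can be written as a small decision function. It is a sketch only; the function name `route_result`, the tuple labels, and the representation of consumers as a collection of stage numbers are hypothetical, and the real control logic is not specified in this passage.

```python
def route_result(n, consumer_stages):
    """List where a result produced in stage n has to be delivered.

    consumer_stages: stage numbers of the operations that use the result.
    A consumer in stage n + 1 takes the result directly from the holder;
    a consumer in stage n + 2 or later needs it in the register file of
    stage n + 2, from which it can travel onward with the other registers.
    """
    destinations = []
    if n + 1 in consumer_stages:
        destinations.append(("arithmetic_unit", n + 1))
    if any(stage >= n + 2 for stage in consumer_stages):
        destinations.append(("register_file", n + 2))
    return destinations
```

With n = 1 and a single consumer in stage 3, the function returns only the register file of stage 3, which matches the case in which the result of the first arithmetic unit is stored in the third register file unit.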

Next, a data supply method in the above-described LAPPs 101 and 102 will be described. FIG. 3 shows a configuration of the LAPP 103 in which the configuration of the three data processing stages including the first to third data processing stages in the LAPP 102 is expanded to a configuration of N data processing stages. In FIG. 3, a known technique can be used for the mechanism of the cache memory 14, the configuration of the small-scale cache memory 15, and the propagation mechanism between them.

As shown in FIG. 3, the LAPP 103 uses a method of supplying a large amount of data from a memory to a plurality of computing units 11. As the operation data propagates in one direction on the operation unit network composed of the plurality of operation units 11 via the plurality of register file units 12, the data on the memory is also propagated in the same direction. Thus, a plurality of load instructions can refer to a plurality of memory addresses at the same time.

Specifically, in the medium capacity memory 13 arranged in the first stage or a subsequent stage, each of the three ways of the cache memory 14 is made to correspond to one array. One word is then read from each way every cycle and propagated to the next stage. At each stage, the three words being propagated are taken into the small-scale cache memory 15 of that stage, so that data in a predetermined memory address range can be referred to at random. Since arithmetic data and memory data propagate at the same speed, load instructions belonging to the same iteration can refer to the same memory address range regardless of the stage at which the small-scale cache memory 15 is referred to. Because the contents of the medium capacity memory 13 arranged upstream can be referred to at an arbitrary stage, even when each loop iteration requires elements near the array element of interest, load / store instructions can be arranged at an arbitrary stage by using the load / store unit 16.

LAPP 103 can deal with the memory reference patterns shown in Table 1 using the above characteristics. Note that a wide range of random offsets can be handled only at the stage where the medium capacity memory 13 is directly connected. In addition, for the update type in which the load contents are changed and stored at the same address, the store data is stored in the original array with one round in the depth direction.

(Problems in the prerequisite technology of the present invention)
The LAPP 103 described above has the advantage that, because a plurality of register file units 12 are provided, a normal machine language instruction sequence can be mapped to the plurality of arithmetic units 11 and executed at high speed. However, the LAPP 103 has the following problems in practical use. Hereinafter, these problems will be described with reference to FIG. 4. FIG. 4 is a schematic diagram for explaining data supply from the cache memory 14 in the LAPP 103. In FIG. 4, known techniques can be used for the mechanism of the cache memory 14, the configuration of the small-scale cache memory 15, and the propagation mechanism between them.

(1) In order to propagate the data read from the medium capacity memory 13 to the subsequent stage, the data paths 17 corresponding to the number of ways are required. When the number of wirings between such stages increases, it becomes difficult to realize a large-scale arithmetic mechanism by connecting a plurality of LSIs.

(2) Some programs require many ways. To increase the number of ways, it is necessary either to increase the number of medium capacity memories 13 by increasing the number of stages in the depth direction of the LAPP 103, or to widen each medium capacity memory 13 by increasing the number of way memories in the width direction. In either case, as in (1) above, the large number of data paths 17 between stages is an obstacle.

(3) When accumulating each array element, it is necessary to load and store the same array. In the LAPP 103 described above, since the load data and the store data propagate in one direction, it is necessary to store the store data in the original array by making one round in the depth direction. When the number of arrays to be stored is large, in addition to the data path necessary for propagation of load data, many data paths 18 must be provided for propagation of store data.

(Configuration of the present invention)
The LAPP of the present invention employs a configuration in which medium capacity memories are distributed and arranged, as with the LAPP 103 described above, but does not provide a regular data path for unconditionally propagating data read from the medium capacity memory to subsequent stages. This prevents an increase in the number of inter-stage data paths, which was a problem with the LAPP 103 described above.

FIG. 5 shows a configuration in which one medium capacity memory is arranged every four stages. Of course, the number of stages is not limited to four. In short, one medium capacity memory may be arranged for each “bundle” (arithmetic unit bundle), in which a plurality of “stages”, each composed of one or a plurality of arithmetic units, are connected in a multistage configuration. In other words, the LAPP of the present invention is a multistage configuration of a plurality of such “bundles”. Therefore, in the configuration of FIG. 5, the propagation mechanism between the small-scale cache memories 15 shown in FIGS. 3 and 4 is unnecessary.

FIG. 6 is a detailed configuration diagram of a memory system including a medium capacity memory. In the following drawings, black squares mainly indicate output latches, and white squares indicate latches used as calculation inputs other than outputs. Each number attached to the right side is a bit width.

As shown in FIGS. 5 and 6, the LAPP of the present invention differs from the above-mentioned LAPP 103 in that the medium capacity memory holds one way of the cache memory and can nevertheless be divided into a plurality of blocks. Further, by combining one base address and six offsets, a load instruction using six addresses can be executed for one way.

Normally, it is necessary to design a 6-port memory in order to be able to use any 6 addresses. However, such a multiport memory is not practical in terms of area efficiency and operation speed.

On the other hand, in the present invention, the usable address range is constrained while still covering the reference patterns shown in Table 1. As a result, a 6-read, 2-write memory function is realized using general memory that physically has one read port and one write port.

As shown in FIG. 5, the LAPP 1 of the present invention mainly includes a computing unit network composed of a plurality of computing units 21, and a plurality of memory systems (data supply devices) 22 each including one way of a cache memory (not shown).

As shown in FIG. 5, each memory system 22 is arranged at every four stages in an arithmetic unit network including a plurality of arithmetic units 21. Each memory system 22 corresponds to each way in a cache memory (not shown) and exchanges data with the corresponding way.

In the memory system 22, the result of address calculation based on the address information supplied from the previous stage is stored in a plurality of latches (address holding units) 23 in front of (above) the memory system 22. In the next cycle, the medium capacity memory and other elements in the memory system 22 are referred to, and the values read are stored in a plurality of latches 24 behind (below) the memory system 22. In the cycle after that, these values are used as inputs to the plurality of computing units 21, and the computation results are stored.

Note that the computation results obtained after passing through the first-stage and second-stage computing units 21 from the bottom are stored in a plurality of latches 25 at the bottom. In the next cycle, the computation results stored in the plurality of latches 25 can be written to the memory system 22, sent on to the subsequent stage, or both.

FIG. 6 is a diagram showing a configuration of the memory system 22 shown in FIG. 5. As shown in FIG. 6, the memory system 22 mainly includes a memory unit 31 divided into a plurality of blocks (here, four blocks), a connection unit 32 for connecting blocks adjacent to each other, and a shift register (shift register unit) 33. As will be described later, the shift register 33 has a plurality of registers connected in a line.

As shown in FIG. 6, the plurality of latches 23 in FIG. 5 include a plurality of latches (first address storage circuits) 23-1, 23-2, 23-4, and 23-5, which are connected to the blocks of the memory unit 31 in one-to-one correspondence, and a plurality of latches (second address storage circuits) 23-3 and 23-6, which are not connected to any block of the memory unit 31.

Of course, the latches 23-3 and 23-6 may be associated with the blocks divided from the memory unit 31, respectively. Conversely, the latches 23-1, 23-2, 23-4, and 23-5 may not be connected to any block of the memory unit 31. In short, the memory unit 31 is divided into a plurality of blocks, and there may be a latch associated with each block.

Hereinafter, the operation of the memory system 22 shown in FIG. 6 will be described for each memory reference pattern in Table 1 described above.

(First case)
The first case (1) in Table 1 is a case in which a wide range of addresses are referenced randomly. As shown in FIG. 7, when the base address is set in the LD-BASE 201 and the offset is set in the latch 202, the offset is added to the base address and the effective address A0 is designated.

When the effective address A0 is stored in the latch 23-1, the effective address A0 is supplied to the latch 203 of "way0.blk0" which is one block of the memory unit 31 in the next cycle. Similarly, the effective address A0 is supplied to the latch 204 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.

The value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32. The linking unit 32 performs the above selection using the upper bits of the effective address A0 stored in the latch 23-1. The data selected by the linking unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33.

When the data to be output to O0 of the latch 24 fits within “way0.blk0” alone, there is no need to use the connection function that connects “way0.blk1” to “way0.blk0” as described above. That is, the data read from “way0.blk0” may be output to O0 of the latch 24.

Similarly, when a new base address is set in the LD-BASE 201 and a new offset is set in the latch 205, the offset is added to the base address and the effective address A3 is designated. When the effective address A3 is stored in the latch 23-4, it is supplied to the latch 206 of “way0.blk2” and the latch 207 of “way0.blk3”, which are two blocks of the memory unit 31, and the value read from each block is sent to the connection unit 32. The connection unit 32 selects one of the values using the upper bits of the effective address A3 stored in the latch 23-4, and outputs it to O3 of the latch 24 via the selector 33-5 of the shift register 33.

Thus, by using the memory unit 31 divided into a plurality of blocks, it is possible to cope with a plurality of random references.
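The first case can be sketched in software as follows. This is a minimal model under stated assumptions: the block size `BLOCK_WORDS`, the function names, and the use of a single upper address bit for the selection are hypothetical choices made for illustration; the description only states that the upper bits of the effective address are used.

```python
BLOCK_WORDS = 4  # hypothetical number of words per block


def read_pair(blk_lo, blk_hi, effective_address):
    """Model one pair of adjacent blocks joined by the connection unit.

    Both blocks are read with the same address, and the upper address
    bit then decides which of the two values is kept.
    """
    index = effective_address % BLOCK_WORDS          # lower bits: word in block
    upper = (effective_address // BLOCK_WORDS) % 2   # upper bit: which block
    candidates = (blk_lo[index], blk_hi[index])      # both blocks are read
    return candidates[upper]


def first_case(blocks, base0, offset0, base3, offset3):
    """Return (O0, O3) for two independent random references."""
    a0 = base0 + offset0  # effective address A0 (latch 23-1)
    a3 = base3 + offset3  # effective address A3 (latch 23-4)
    o0 = read_pair(blocks[0], blocks[1], a0)  # way0.blk0 / way0.blk1
    o3 = read_pair(blocks[2], blocks[3], a3)  # way0.blk2 / way0.blk3
    return o0, o3
```

In this model each block is read at most once per cycle, which is why two random references can be served by single-read-port blocks.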

(Second case)
The second case (2) in Table 1 is a case in which six locations are referenced at the same time, although there are restrictions on the range of relative addresses based on a monotonically increasing address. As shown in FIG. 8, the base address set in the LD-BASE 301 is stored in the latch 23-1 as the effective address A0 via the latch 302. In the next cycle, the effective address A0 is supplied to the latch 303 of “way0.blk0” that is one block of the memory unit 31. Similarly, the effective address A0 is supplied to the latch 304 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.

The value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32. The connection unit 32 performs this selection using the upper bits of the effective address A0 stored in the latch 23-1. The data selected by the connection unit 32 is output to O0 of the latch 24 via the selector 33-1 of the shift register 33 (at this time, as in the first case described above, “way0.blk2” and “way0.blk3” may also be connected to each other).

On the other hand, the offsets “−e”, “−d”, “−c”, “−b”, and “−a” are set in the latches 305, 306, 307, 308, and 309, respectively. These offsets are then set in the latches 23-2, 23-3, 23-4, 23-5, and 23-6 as the effective addresses A1, A2, A3, A4, and A5, respectively.

Simultaneously with writing to O0, the data selected by the connection unit 32 is written to the head register 33-2 of the shift register 33 via the selector 33-1. From the next cycle onward, while data flows into the shift register 33, the effective addresses A5, A4, A3, A2, and A1 set in the latches 23-6, 23-5, 23-4, 23-3, and 23-2, respectively, are used to designate addresses within the range held in the shift register 33. Thereby, addresses near the effective address A0 can be referred to simultaneously.

That is, the effective addresses A5, A4, A3, A2, and A1 are values representing positions among the registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, they indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to for the values to be output to O5, O4, O3, O2, and O1 of the latch 24.

For this reason, a mechanism is required that, for each of the effective addresses A5, A4, A3, A2, and A1, compares the address information with the register positions of the shift register 33 and reads the contents of the matching register to O5, O4, O3, O2, and O1, respectively. Such a mechanism can be easily realized because the shift register 33 is small.
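The position-matching read of the second case can be sketched as below. The class name `ShiftRegisterWindow`, the depth of five registers, and the convention that offset −k selects the value read k cycles earlier are assumptions made for this illustration; the description does not fix the encoding of the offsets.

```python
from collections import deque


class ShiftRegisterWindow:
    """Toy model of a shift register fed by a monotonically increasing address."""

    def __init__(self, depth=5):
        # Index 0 is the head register; older values move toward the tail.
        self.regs = deque([None] * depth, maxlen=depth)

    def step(self, value_at_a0, offsets):
        """Return [O0, ...] for one cycle, then shift in the new value.

        offsets: negative relative addresses such as [-1, -3].
        """
        outputs = [value_at_a0]  # O0 comes straight from the memory unit
        for off in offsets:
            position = -off - 1  # offset -1 selects the head register
            if not 0 <= position < len(self.regs):
                raise ValueError("offset outside the shift register range")
            outputs.append(self.regs[position])
        self.regs.appendleft(value_at_a0)  # the oldest value falls off the tail
        return outputs
```

For example, feeding the values 10, 20, 30, 40 with offsets −1 and −2 yields `[40, 30, 20]` in the fourth cycle, so three neighbouring addresses are visible with a single memory read per cycle.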

(Third case)
The third case (3) in Table 1 is based on a monotonically increasing address, and refers to six locations at the same time, although there are restrictions on the range of relative addresses. The difference from the above-mentioned second case (2) is that six addresses also monotonously increase. In the second case (2) described above, the offset is a random offset such as “−a”, “−b”, “−c”, “−d”, or “−e”. On the other hand, in the third case, the offset is fixed.

Therefore, in the third case, the offset is set using the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33. In other words, this is handled by a mechanism that reads directly from the shift register 33.

By setting the offsets in this way, unlike in the second case, it is not necessary to operate the latches 305 to 309 and the latches 23-2 to 23-6 in FIG. 8, and power consumption can be reduced accordingly.

As shown in FIG. 9, the base address set in the LD-BASE 401 is stored in the latch 23-1 as the effective address A0 via the latch 302. In the next cycle, the effective address A0 is supplied to the latch 403 of “way0.blk0” which is one block of the memory unit 31. Similarly, the effective address A0 is supplied to the latch 404 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.

Simultaneously with writing to O0, the data selected by the connection unit 32 is written to the head register 33-2 of the shift register 33 via the selector 33-1. From the next cycle onward, while data flows into the shift register 33, the fixed offsets are specified using the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 of the shift register 33. Thereby, addresses near the effective address A0 can be referred to simultaneously.

In other words, in the third case it is not necessary to set the effective addresses A5, A4, A3, A2, and A1 to indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to for the values to be output to O1 to O5 of the latch 24. This is because, unlike in the second case described above, the offsets are fixed: the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 themselves determine which register value is output to each of O1 to O5 of the latch 24. That is, it can be said that the effective addresses A5, A4, A3, A2, and A1 are set by the positions of the registers 33-2, 33-3, 33-4, 33-6, and 33-7.

As described above, in this case, the power consumption of the memory system 22 can be reduced.

Of course, the memory system 22 can also handle the third case (3) in the same manner as the second case (2) described above.

(Fourth case)
The fourth case (4) in Table 1 is a case where two sets of access patterns that refer to three locations at the same time are required, although the range of relative addresses is limited based on a monotonically increasing address.

As shown in FIG. 10, the base address set in the LD-BASE 501 is stored in the latch 23-1 as the effective address A0 via the latch 502. In the next cycle, the effective address A0 is supplied to the latch 503 of “way0.blk0”, which is one block of the memory unit 31. Similarly, the effective address A0 is supplied to the latch 504 of “way0.blk1” that is another block of the memory unit 31 adjacent to “way0.blk0”.

Simultaneously with writing to O0, the data selected by the connection unit 32 is written to the head register 33-2 of the shift register 33 via the selector 33-1. From the next cycle onward, while data is sent into the shift register 33, fixed offsets are specified using the positions of the registers 33-2 and 33-3 of the shift register 33, so that addresses near the effective address A0 can be referred to simultaneously.

The effective addresses A1 and A2 are set using the positions of the registers 33-2 and 33-3 of the shift register 33. For this purpose, a mechanism is required that, for each of the effective addresses A2 and A1, compares the address information with the register positions of the shift register 33 and reads the contents of the matching register to O2 and O1 of the latch 24, respectively. Such a mechanism can be easily realized because the shift register 33 is small.

Similarly, the base address newly set in the LD-BASE 501 is stored in the latch 23-4 as the effective address A3 via the latch 505. In the next cycle, the effective address A3 is supplied to the latch 506 of “way0.blk2” which is one block of the memory unit 31. Similarly, it is supplied to the latch 507 of “way0.blk3” which is another block of the memory unit 31 adjacent to “way0.blk2”.

The value read from each block is sent to the connection unit 32, and one of the values is selected by the connection unit 32. The linking unit 32 performs the above selection using the upper bits of the effective address A3 stored in the latch 23-4. The data selected by the linking unit 32 is output to O3 of the latch 24 via the selector 33-5 of the shift register 33.

Here, the fourth case (4) differs from the second case (2) in that the data flow is divided in the middle of the shift register 33. Therefore, the selector 33-5 is required in the middle of the shift register 33 to insert the value read from “way0.blk2”.

Simultaneously with the writing to O3, the data selected by the connection unit 32 is written to the register 33-6 in the middle of the shift register 33 via the selector 33-5. From the next cycle onward, while data is sent into the shift register 33, fixed offsets are specified using the positions of the registers 33-6 and 33-7 of the shift register 33, so that addresses near the effective address A3 can be referred to simultaneously.

The effective addresses A4 and A5 are set using the positions of the registers 33-6 and 33-7 of the shift register 33. For this reason, a mechanism is required that, for each of the effective addresses A5 and A4, compares the address information with the register positions of the shift register 33 and reads the contents of the matching register to O5 and O4 of the latch 24, respectively. Such a mechanism can be easily realized because the shift register 33 is small.
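The divided data flow of the fourth case can be sketched as two independent halves of the shift register, each fed by its own stream. The function name `fourth_case`, the two-register depth of each half, and the assignment of the nearer register to O1 and O4 are assumptions for illustration only.

```python
def fourth_case(stream_a, stream_b):
    """Return one (O0, O1, O2, O3, O4, O5) tuple per cycle.

    stream_a: values read via way0.blk0 / way0.blk1 (effective address A0).
    stream_b: values read via way0.blk2 / way0.blk3 (effective address A3).
    """
    half_a = [None, None]  # stands in for registers 33-2 and 33-3
    half_b = [None, None]  # stands in for registers 33-6 and 33-7
    outputs = []
    for value_a, value_b in zip(stream_a, stream_b):
        outputs.append(
            (value_a, half_a[0], half_a[1], value_b, half_b[0], half_b[1])
        )
        # Each half shifts independently; the second stream enters in the
        # middle of the shift register instead of at its head.
        half_a = [value_a, half_a[0]]
        half_b = [value_b, half_b[0]]
    return outputs
```

After three cycles with the streams 1, 2, 3 and 7, 8, 9, the outputs are `(3, 2, 1, 9, 8, 7)`: two sets of three neighbouring values are available at once.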

(Fifth case)
The fifth case (5) in Table 1 is a case in which, based on a monotonically increasing address and with a restriction on the range of relative addresses, nearby locations are referenced through “way0.blk0” while “way0.blk3”, “way0.blk2”, and “way0.blk1” can be accessed at the same time.

As shown in FIG. 11, the base address set in the LD-BASE 601 is stored in the latch 23-1 as the effective address A0 via the latch 602. In the next cycle, the effective address A0 is supplied to the latch 606 of “way0.blk0”, which is one block of the memory unit 31. The value read from “way0.blk0” is output to O0 of the latch 24 via the connection unit 32 and the selector 33-1 of the shift register 33.

The offset “−b” is set in the latch 610, and the offset “−a” is set in the latch 611. Each offset is set in latches 23-3 and 23-6 as effective addresses A2 and A5, respectively.

Simultaneously with the writing to O0, the selector 33-1 writes the value read from “way0.blk0” into the head register 33-2 of the shift register 33. From the next cycle onward, while data flows into the shift register 33, the effective addresses A2 and A5 set in the latches 23-3 and 23-6, respectively, are used to specify addresses within the range held in the shift register 33. Thereby, addresses near the effective address A0 can be referred to simultaneously.

The effective addresses A5 and A2 are values representing positions among the registers 33-2, 33-3, 33-4, 33-6, and 33-7 in the shift register 33. In other words, they indicate which of the registers 33-2, 33-3, 33-4, 33-6, and 33-7 should be referred to for the values to be output to O5 and O2 of the latch 24. For example, suppose that the effective address A2 refers to the register 33-2 and the effective address A5 refers to the register 33-3. In this case, the value of the register 33-2 is output to O2 of the latch 24, and the value of the register 33-3 is output to O5 of the latch 24.

For this reason, a mechanism is required that, for each of the effective addresses A5 and A2, compares the address information with the register positions of the shift register 33 and reads the contents of the matching register to O5 and O2 of the latch 24, respectively.

On the other hand, the base address newly set in the LD-BASE 601 is stored in the latch 23-2 as the effective address A1 via the latch 603. In the next cycle, the effective address A1 is supplied to the latch 607 of “way0.blk1”, which is one block of the memory unit 31. The value read from “way0.blk1” is output to O1 of the latch 24.

The base address newly set in the LD-BASE 601 is stored in the latch 23-4 as the effective address A3 via the latch 604. In the next cycle, the effective address A3 is supplied to the latch 608 of “way0.blk2” which is one block of the memory unit 31. The value read from “way0.blk2” is output to O3 of the latch 24.

Furthermore, the base address newly set in the LD-BASE 601 is stored in the latch 23-5 as the effective address A4 via the latch 605. In the next cycle, the effective address A4 is supplied to the latch 609 of “way0.blk3” which is one block of the memory unit 31. The value read from “way0.blk3” is output to O4 of the latch 24.

The effective addresses A4, A3, and A1 are supplied directly to “way0.blk3”, “way0.blk2”, and “way0.blk1”, respectively, and the values read from these blocks are written to O4, O3, and O1.
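The fifth case combines direct block reads with shift register reads in one cycle, which can be sketched as follows. The function name, the argument layout, and the use of plain list indices for addresses and register positions are assumptions introduced for illustration.

```python
def fifth_case(blocks, a0, a1, a3, a4, shift_regs, pos_a2, pos_a5):
    """Return [O0, O1, O2, O3, O4, O5] for one cycle.

    blocks: four lists standing in for way0.blk0 to way0.blk3.
    shift_regs: values currently held in the shift register, head first
        (earlier values read from way0.blk0).
    pos_a2, pos_a5: register positions designated by A2 and A5.
    """
    o0 = blocks[0][a0]       # way0.blk0, also fed into the shift register
    o1 = blocks[1][a1]       # direct read of way0.blk1
    o2 = shift_regs[pos_a2]  # register selected by A2
    o3 = blocks[2][a3]       # direct read of way0.blk2
    o4 = blocks[3][a4]       # direct read of way0.blk3
    o5 = shift_regs[pos_a5]  # register selected by A5
    return [o0, o1, o2, o3, o4, o5]
```

Each block is still read only once per cycle; the two extra outputs come from the shift register rather than from additional memory ports.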

(Sixth case)
The sixth case (6) in Table 1 is a case where the read memory value is updated and written back to the original memory, as shown in FIG. 12. This can be realized by using the data path (feedback mechanism) 26 returning from the plurality of arithmetic units 21 to the memory system 22 shown in FIG. 5.

For example, in FIG. 12, the read memory value (ST-value) 612 is supplied to the latch 614 and the latch 615 of “way0.blk0”, which is one block of the memory unit 31. In the next cycle, each data supplied to the latch 614 and the latch 615 is written to “way0.blk0” using the base address set in the ST-base 613.
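The update-type reference of the sixth case amounts to a load, an operation, and a store to the same address within one bundle. A minimal sketch, assuming a hypothetical function name and modelling the block as a Python list, is:

```python
def accumulate_in_place(block, base, increments):
    """Load, update, and store back to the same addresses of one block."""
    for i, increment in enumerate(increments):
        loaded = block[base + i]      # load through the memory system
        updated = loaded + increment  # computed by the arithmetic units
        block[base + i] = updated     # written back over the feedback path
    return block
```

This sketch ignores pipeline timing; it only shows that the store returns to the block it was loaded from, with no round trip through other stages.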

As explained above, the LAPP 1 of the present invention provides the following effects.
(1) The number of wirings between stages can be greatly reduced by distributing medium capacity memories and eliminating the need for interstage propagation data paths dedicated to load / store.

(2) By reducing the number of wires between stages, a large-scale circuit can be realized with a plurality of LSI configurations without reducing the operating frequency.

(3) A combination of a medium-capacity memory and a small-capacity shift register makes it possible to issue a large number of load instructions to a certain range of memory space.

(4) Self-updating memory references including floating point operations (multiple cycles) can be arranged in multiple stages without increasing interstage wiring.

(5) Parallel processing of array data distributed in multiple stages becomes possible by parallel operation of multiple medium-capacity buffers.

(6) By moving the instruction mapping rather than the reusable data in the ways, the power and time associated with data movement can be reduced.

(Specific example 1)
FIGS. 13 and 14 are instruction sequences when an example of image processing is realized by the prior art and the present invention, respectively. In FIG. 13, load instructions are arranged at each stage on the assumption that load data is sequentially propagated.

On the other hand, in FIG. 14, load instructions are arranged in the fourth, eighth, and twelfth stages, and neighboring data is extracted from the ways belonging to each stage and input to the computing unit. As a result, a mechanism for unconditionally propagating load data is not necessary, and at the same time, the number of stages for storing programs is reduced from 24 to 19 stages.

In the prior art, since the data is propagated from the medium-scale memory arranged in the first stage, it is possible to partially reuse the contents of the medium-scale memory by simply replacing the way number of the medium-scale memory in the first stage.

On the other hand, in the present invention, ways can be reused at different stages by shifting the instruction mapping downward by four stages, without moving the contents of the medium-scale memories distributed among the stages. That is, the last stage and the first stage are connected by a ring structure. For example, in the case of FIG. 14, of the 4th, 8th, and 12th stages, the 8th and 12th stages are reused: by shifting the instruction mapping by four stages, the memory contents of the 8th and 12th stages stay where they are, and the newly required memory data is placed in the 16th stage. Thereby, instructions can be executed using the memory contents of the 8th, 12th, and 16th stages.
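The shift of the instruction mapping over the ring can be sketched as follows. The function name `shift_mapping` and the total of 16 stages are assumptions made for this illustration; the description gives only the example of stages 4, 8, 12, and 16.

```python
def shift_mapping(memory_stages, total_stages=16, step=4):
    """Shift the stages used by load instructions around the ring.

    Returns (shifted, reused, newly_loaded), where reused stages keep
    their memory contents and newly_loaded stages must be filled.
    """
    # Stages are numbered from 1; the last stage wraps to the first.
    shifted = [((s - 1 + step) % total_stages) + 1 for s in memory_stages]
    reused = [s for s in shifted if s in memory_stages]
    newly_loaded = [s for s in shifted if s not in memory_stages]
    return shifted, reused, newly_loaded
```

For the example of FIG. 14, `shift_mapping([4, 8, 12])` gives the shifted stages 8, 12, and 16, of which 8 and 12 are reused and only 16 needs new data.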

(Specific example 2)
FIGS. 15 and 16 are instruction sequences when an example of floating-point arithmetic processing is realized by the prior art and the present invention, respectively. In FIG. 15 of the prior art, the store data in the sixth stage must first be stored in the first-stage memory, and for this reason it is difficult to arrange a large number of stores.

On the other hand, in FIG. 16 of the present invention, a data path for directly propagating load data and store data is not necessary. Thereby, in the fourth stage, the eighth stage, the twelfth stage, and the sixteenth stage, it is possible to map update type load → calculation → store. Compared to the prior art, four times as many instructions can be mapped, and the processing performance is increased four times.

The present invention is not limited to the above-described embodiment, and various modifications can be made within the scope indicated in the claims. That is, embodiments obtained by combining technical means appropriately modified within the scope of the claims are also included in the technical scope of the present invention.

(Other embodiments)
Instead of the shift register 33 in the above embodiment, a FIFO unit having a plurality of first-in first-out (FIFO) buffers can be arranged. For example, in the case of the configuration shown in FIG. 6, each FIFO buffer of the FIFO unit is arranged to correspond to each of the effective addresses A5, A4, A3, A2, A1, and A0 on a one-to-one basis.

Specifically, for example, in the configuration of FIG. 6, the FIFO buffers of the FIFO unit are arranged at the positions corresponding one-to-one to the effective addresses A5, A4, A3, A2, A1, and A0, that is, in place of each of the selector 33-1, the register 33-2, the register 33-3, the register 33-4, the register 33-6, and the register 33-7 of the shift register 33.

Each FIFO buffer of the FIFO unit has one selector and five registers, in the same manner as the selector 33-1 and the registers 33-2, 33-3, 33-4, 33-6, and 33-7.

In the shift register 33 of the above embodiment, data is supplied from the memory unit 31 only to the selector 33-1 (here, attention is paid to the data supply to the selector 33-1, and the data supply to the selector 33-5 is not considered). In the FIFO unit, on the other hand, data is supplied from the memory unit 31 to the selector of each FIFO buffer.

Then, using each of the effective addresses A5, A4, A3, A2, A1, and A0, data is read from one of the registers of the corresponding FIFO buffer and output to the one of O0 to O5 of the latch 24 that corresponds to that FIFO buffer. For example, in the FIFO buffer corresponding to the effective address A5, one of the five registers of the FIFO buffer is read using the effective address A5, and the read data is output to O5 of the latch 24. Similar processing is performed in the other FIFO buffers.
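The FIFO variant can be sketched as one small buffer per output, each addressed independently. The class name `FifoUnit`, its method names, and the use of a zero-based register position as the address are assumptions for illustration.

```python
from collections import deque


class FifoUnit:
    """Toy model: one FIFO buffer per effective address."""

    def __init__(self, lanes=6, depth=5):
        # Each buffer holds `depth` registers; index 0 is the newest entry.
        self.buffers = [deque([None] * depth, maxlen=depth) for _ in range(lanes)]

    def push(self, values):
        """Supply one value from the memory unit to every FIFO buffer."""
        for buffer, value in zip(self.buffers, values):
            buffer.appendleft(value)

    def read(self, positions):
        """Read one register from each buffer; the results go to O0, O1, ..."""
        return [buffer[pos] for buffer, pos in zip(self.buffers, positions)]
```

Unlike the single shift register, every buffer here receives its own data from the memory unit, so each output can follow a different stream.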

The present invention can also be expressed as follows. That is, the present invention is an accelerator configuration method in which one storage system is connected to each bundle, a bundle being a plurality of connected stages each composed of one or a plurality of arithmetic units; each storage system includes a memory and a shift register; the data read from the memory is input to the head or the middle of the shift register; and, using a plurality of pieces of address information input to the storage system, the memory and the shift register are referred to and the contents corresponding to each piece of address information are read out.

In the accelerator configuration method, it is preferable that the memory, which is divided into a plurality of blocks, be provided with an address holding unit that holds address information for each block, that address holding units not connected to the memory also be provided, and that the address information in these address holding units be used to specify a register position in the shift register and read that register.

In the accelerator configuration method described above, it is preferable to read another block as well using the data in the address holding unit provided for each block, and to select one of the data read from the plurality of blocks using a part of the bits of the address information.

In the above accelerator configuration method, it is preferable that feedback from the final stage of the bundle to the storage system be provided so that the memory can be both read and written within the bundle.

In the above accelerator configuration method, when the memory contents belonging to a certain bundle can be used by other operation instructions at the start of the next high-speed execution following a series of high-speed executions, it is preferable to start the next high-speed execution by changing the mapping of the operation instructions to the arithmetic units, without moving the memory contents belonging to the bundle.

The present invention can also be expressed as follows. That is, the data supply device according to the present invention is a data supply device that supplies data to a computing unit bundle in which a plurality of computing units are configured in multiple stages, and includes a memory unit divided into a plurality of blocks and a plurality of registers. A shift register unit connected in a row, and the shift register unit writes data read from the memory unit to a head register or a middle register of the shift register unit, and the memory unit and the shift register unit Each is referred to based on a plurality of address information input to the data supply device, and outputs the contents of each address position corresponding to each address information.

According to the above configuration, one memory unit is divided into a plurality of blocks, and data read from each block can be written to a register at the head or in the middle of the shift register unit.
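The arrangement above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class name `DataSupplySketch`, the block and register counts, the shift direction, and the convention that position 0 is the head register are all hypothetical and are not taken from the specification.

```python
class DataSupplySketch:
    """Toy model: a block-partitioned memory feeding a chain of registers."""

    def __init__(self, num_blocks, block_size, num_registers):
        # The memory unit is modelled as a list of equally sized blocks.
        self.blocks = [[0] * block_size for _ in range(num_blocks)]
        # The shift register unit is modelled as registers connected in a row.
        self.registers = [0] * num_registers

    def shift(self):
        # Assumed direction: every value moves one place away from the head,
        # and the value in the last register drops out.
        self.registers = [0] + self.registers[:-1]

    def load(self, block, address, position=0):
        # position 0 is the head register; any other index is a middle register.
        self.registers[position] = self.blocks[block][address]


supply = DataSupplySketch(num_blocks=2, block_size=4, num_registers=4)
supply.blocks[0][1] = 7
supply.blocks[1][2] = 9
supply.load(block=0, address=1)              # write at the head
supply.shift()
supply.load(block=1, address=2, position=2)  # write in the middle
print(supply.registers)  # [0, 7, 9, 0]
```

In this sketch, data from different blocks can enter the register chain at different positions, which is the property the configuration above relies on.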

The data supply device preferably further includes a plurality of address holding units that respectively hold the plurality of pieces of address information input to the data supply device, the plurality of address holding units including a plurality of first address storage circuits, each connected to a block of the memory unit in one-to-one correspondence, and a plurality of second address storage circuits not connected to any of the blocks of the memory unit.

According to the above configuration, the data finally output from the shift register unit can be determined using the address information referring to the memory unit and the address information referring to the shift register unit.

In the data supply device, it is preferable that the shift register unit includes a selector that selects one of the data items read from two different blocks of the memory unit, and that, when the address information held in a first address storage circuit is used to read both the block to which that first address storage circuit is connected and another block adjacent to that block, the shift register unit uses the selector to select one of the data items read from the two blocks on the basis of some of the bits of the address information held in that first address storage circuit.

According to the above configuration, since two blocks can be connected, even data having a size that does not fit in one block can be stored in the memory unit.
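One way to picture the selector is the following Python sketch. The details are assumptions made for illustration only: the function name `read_spanning` is hypothetical, the low part of the address is taken as the position within a block, a single higher bit drives the selector, and the adjacent block is assumed to be the next one (wrapping at the end). The specification only states that some of the bits of the address information are used for the selection.

```python
def read_spanning(blocks, block_index, address):
    """Read both a block and its neighbour, then select one of the two values."""
    block_size = len(blocks[block_index])
    local = address % block_size             # position inside a block
    select = (address // block_size) & 1     # assumed selector bit
    own = blocks[block_index][local]
    neighbour = blocks[(block_index + 1) % len(blocks)][local]
    # Both reads are performed; the selector picks which value is passed on.
    return neighbour if select else own


blocks = [[10, 11, 12, 13], [20, 21, 22, 23]]
print(read_spanning(blocks, 0, 2))  # 12, address falls inside block 0
print(read_spanning(blocks, 0, 5))  # 21, address spills into the adjacent block
```

With this kind of selection, a data set of up to two block lengths can be addressed as if the two blocks were one.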

It is preferable that the data supply device further includes a feedback mechanism capable of writing the operation results of one or more arithmetic units constituting the final stage of the arithmetic unit bundle into the memory unit.

According to the above configuration, the output values from the memory unit and the shift register unit can be rewritten to the memory unit.

In the data supply device, it is preferable that each piece of address information held in each first address storage circuit is either an offset set in the address information input to the data supply device or address information obtained by adding the offset to that address information, and that each piece of address information held in each second address storage circuit is an offset set in the address information input to the data supply device. More preferably, the shift register unit determines the output value from each register using the offset.

According to the above configuration, the memory unit and the shift register unit can be referred to using the address information obtained by adding a random offset to the input address information.

In the data supply device, it is preferable that the shift register unit determines an output value from each register by using the position of each register as the offset.

According to the above configuration, the memory unit and the shift register unit can be referred to using the address information obtained by adding a fixed offset to the input address information.
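The two offset schemes can be contrasted with a short Python sketch. It is a simplified illustration, not the circuit itself: the memory is modelled as a flat list, and the function names `read_with_offsets` and `read_with_position_offsets` are hypothetical.

```python
def read_with_offsets(memory, base_address, offsets):
    """Return one output per register: memory[base_address + offset]."""
    return [memory[base_address + off] for off in offsets]


def read_with_position_offsets(memory, base_address, num_registers):
    """Fixed-offset case: register i uses its own position i as the offset."""
    return read_with_offsets(memory, base_address, range(num_registers))


memory = list(range(100, 116))  # memory[k] == 100 + k
print(read_with_offsets(memory, 4, [3, 0, 1]))   # [107, 104, 105]
print(read_with_position_offsets(memory, 4, 3))  # [104, 105, 106]
```

In the first call each register carries an explicitly set offset; in the second the offset is implied by the register position, so no per-register offset needs to be supplied.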

In the data supply device, when two pieces of address information are input to the data supply device, it is preferable that the shift register unit determines the output value from each register in one part of its registers by using the position of that register as the offset set in one of the pieces of address information input to the data supply device, and determines the output value from each register in the other part of its registers by using the position of that register as the offset set in the other piece of address information input to the data supply device.

According to the above configuration, even when two pieces of address information are input to the data supply device, the shift register unit can be referred to, for either piece of address information, using address information obtained by adding a fixed offset to the input address information.

In the data supply device, when a plurality of pieces of address information are input to the data supply device, it is preferable that the shift register unit determines the output value from each of some of its registers using an offset set in one piece of the address information input to the data supply device, and outputs, as the output value from each of the remaining registers, data read from a block of the memory unit using the remaining pieces of address information input to the data supply device.

According to the above configuration, the memory unit and the shift register unit are referred to using address information obtained by adding the offset to one piece of the input address information, and can also be referred to using the remaining pieces of input address information.

The data processing device is preferably a data processing device for executing an instruction code composed of a plurality of lines of machine language instructions, and includes the following: n register file units (n is an integer of 1 or more) including a first register file unit, which includes a plurality of first registers that correspond to a plurality of register numbers described in the instruction code and temporarily store data corresponding to the respective register numbers, and a second register file unit, which includes a plurality of second registers corresponding to the respective first registers of the first register file unit; n arithmetic units including a first arithmetic unit, forming one stage of the multi-stage configuration, that executes an operation according to one of the plurality of lines of machine language instructions using data read from the first registers of the first register file unit, and a second arithmetic unit, forming another stage of the multi-stage configuration, that executes an operation according to a machine language instruction, among the plurality of lines, different from the one used by the first arithmetic unit; and n holding units including a first holding unit that is the output destination of the operation result of the first arithmetic unit when the first arithmetic unit executes the operation and that temporarily holds that operation result. The first register file unit transfers data that is not subject to the operation by the first arithmetic unit to the second register of the second register file unit corresponding to the first register holding that data. When the first holding unit holds the operation result of the first arithmetic unit, it treats the second arithmetic unit as the output destination and transfers the operation result to the second arithmetic unit. The second arithmetic unit executes its operation using at least one of the data read from the second registers of the second register file unit and the operation result transferred by the first holding unit, in parallel with the operation executed by the first arithmetic unit.

According to the above configuration, the data in each first register in the first register file unit is transferred to each second register in the second register file unit corresponding to each first register in the first register file unit.

For this reason, even when the data of a first register of the first register file unit is being used for the operation of the first arithmetic unit, the second arithmetic unit can read the data from the corresponding second register of the second register file unit and use it to execute its own operation.

Also, the calculation result of the first calculation unit is transferred to the second calculation unit.

Therefore, the second calculation unit can use the calculation result of the first calculation unit for execution of the calculation immediately after the calculation by the first calculation unit.

Therefore, in the above data processing apparatus, two operations by the first and second operation units can be executed in parallel.
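The interplay of register file units, arithmetic units, and holding units can be sketched as a small cycle-by-cycle simulation in Python. This is a simplified model under explicit assumptions, not the device described here: the function name `run_pipeline`, the tuple format `(op, dst, src_a, src_b)`, and the three operations are hypothetical, and the model copies the whole register image to the next stage rather than only the registers not subject to the operation.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_pipeline(program, inputs):
    """program: one (op, dst, src_a, src_b) per stage.
    inputs: one register image (dict) per iteration.
    Returns (final register images, number of cycles)."""
    n = len(program)
    rf = [None] * (n + 1)   # rf[k] feeds stage k; rf[n] is the drained output
    fwd = [None] * (n + 1)  # fwd[k]: result forwarded by the preceding holding unit
    pending = list(inputs)
    results = []
    cycles = 0
    while pending or any(r is not None for r in rf[:n]):
        next_rf = [None] * (n + 1)
        next_fwd = [None] * (n + 1)
        for k in range(n):  # all occupied stages operate in the same cycle
            if rf[k] is None:
                continue
            # A stage sees its register file plus the forwarded result.
            view = {**rf[k], **(fwd[k] or {})}
            op, dst, a, b = program[k]
            result = OPS[op](view[a], view[b])
            next_rf[k + 1] = view             # register data moves to the next file
            next_fwd[k + 1] = {dst: result}   # result goes via the holding unit
        next_rf[0] = pending.pop(0) if pending else None
        rf, fwd = next_rf, next_fwd
        if rf[n] is not None:
            results.append({**rf[n], **fwd[n]})
        cycles += 1
    return results, cycles


program = [("add", "r3", "r1", "r2"), ("mul", "r4", "r3", "r1")]
inputs = [{"r1": 1, "r2": 2}, {"r1": 3, "r2": 4}, {"r1": 5, "r2": 6}]
results, cycles = run_pipeline(program, inputs)
print([r["r4"] for r in results], cycles)  # [3, 21, 55] 5
```

In the third and fourth cycles of this run both stages are busy at once, each on a different iteration, which corresponds to the two operations being executed in parallel.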

In the data processing device, it is preferable that the n register file units further include a third register file unit including a plurality of third registers corresponding to the respective second registers of the second register file unit; that the n arithmetic units further include a third arithmetic unit, forming another stage of the multi-stage configuration, that executes an operation according to a machine language instruction, among the plurality of lines, different from those used by the first arithmetic unit and the second arithmetic unit; and that the n holding units further include a second holding unit that is the output destination of the operation result of the second arithmetic unit when the second arithmetic unit executes the operation and that temporarily holds that operation result. The second register file unit transfers data that is not subject to the operation by the second arithmetic unit to the third register of the third register file unit corresponding to the second register holding that data. When the second holding unit holds the operation result of the second arithmetic unit, it treats the third arithmetic unit as the output destination and transfers the operation result to the third arithmetic unit. The third arithmetic unit executes its operation using at least one of the data read from the third registers of the third register file unit and the operation result transferred by the second holding unit, in parallel with the operations executed by the first arithmetic unit and the second arithmetic unit.

According to the above configuration, the data in each second register in the second register file unit is transferred to each third register in the third register file unit corresponding to each second register in the second register file unit.

For this reason, even when the data of a second register of the second register file unit is being used for the operation of the second arithmetic unit, the third arithmetic unit can read the data from the corresponding third register of the third register file unit and use it to execute its own operation.

Also, the calculation result of the second calculation unit is transferred to the third calculation unit.

For this reason, the third calculation unit can use the calculation result of the second calculation unit for execution of the calculation immediately after the calculation by the second calculation unit.

Therefore, in the above data processing apparatus, three operations by the first, second, and third operation units can be executed in parallel.

In the data processing device, it is preferable that the N-th holding unit among the n holding units (N is an integer of 1 or more and n or less), when the operation result it holds is used for an operation by the (N + 2)-th or a subsequent arithmetic unit among the n arithmetic units, transfers the operation result to the (N + 2)-th register file unit among the n register file units, and, when the operation result it holds is not used for an operation by the (N + 2)-th or any subsequent arithmetic unit, transfers the operation result to the (N + 1)-th arithmetic unit among the n arithmetic units.

According to the above configuration, when the calculation result held by the Nth holding unit is not used for the calculation execution by the (N + 2) th and subsequent calculation units, the calculation result is transferred to the (N + 1) th calculation unit. In this case, unnecessary data transfer between the register file units is reduced, and as a result, power consumption can be further reduced.
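The routing decision made by the N-th holding unit can be illustrated with a short Python sketch. The function name `route_result`, the return labels, and the instruction tuple format `(op, dst, src_a, src_b)` are hypothetical, and the check is deliberately simplified: it only looks for later readers of the destination register and ignores the case where the register is overwritten in between.

```python
def route_result(program, n_index):
    """Decide where holding unit N (1-based) sends the result of stage N.

    program: list of (op, dst, src_a, src_b), one instruction per stage.
    Returns "register_file" if any stage from N+2 onward reads the result
    (transfer to register file unit N+2), otherwise "arithmetic_unit"
    (transfer to arithmetic unit N+1 only).
    """
    _, dst, _, _ = program[n_index - 1]   # instruction executed by stage N
    later = program[n_index + 1:]         # stages N+2 and beyond
    needed_later = any(dst in (a, b) for _, _, a, b in later)
    return "register_file" if needed_later else "arithmetic_unit"


program = [
    ("add", "r3", "r1", "r2"),  # stage 1
    ("mul", "r4", "r3", "r1"),  # stage 2
    ("sub", "r5", "r3", "r4"),  # stage 3 reads r3 again
]
print(route_result(program, 1))  # register_file: r3 is read by stage 3
print(route_result(program, 2))  # arithmetic_unit: r4 is read only by stage 3
```

When the second branch is taken, the result never enters a register file unit, which is where the reduction in transfers described above comes from.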

The present invention can be suitably used for data supply to a data processing apparatus that has a plurality of arithmetic units and can perform arithmetic processing by each arithmetic unit synchronously.

22 Memory system
23 Latch (address holding unit)
23-1, 23-2, 23-4, 23-5 Latch (first address storage circuit)
23-3, 23-6 Latch (second address storage circuit)
31 Memory unit
33 Shift register (shift register unit)
101, 102, 103 LAPP

Claims

1. A data supply device that supplies data to an arithmetic unit bundle in which a plurality of arithmetic units are configured in multiple stages, the data supply device comprising:
a memory unit divided into a plurality of blocks; and
a shift register unit in which a plurality of registers are connected in a line,
wherein, in the shift register unit, data read from the memory unit is written to the register at the head or in the middle of the shift register unit, and
each of the memory unit and the shift register unit is referred to on the basis of a plurality of pieces of address information input to the data supply device and outputs the contents of the address position corresponding to each piece of address information.

2. The data supply device according to claim 1, further comprising a plurality of address holding units that respectively hold the plurality of pieces of address information input to the data supply device,
wherein the plurality of address holding units include:
a plurality of first address storage circuits, each connected to a block of the memory unit in one-to-one correspondence; and
a plurality of second address storage circuits that are not connected to any of the blocks of the memory unit.

3. The data supply device according to claim 2, wherein the shift register unit includes a selector that selects one of the data items read from two different blocks of the memory unit, and
when reads from the block to which a first address storage circuit is connected and from another block adjacent to that block are each performed using the address information held in the first address storage circuit,
the shift register unit uses the selector to select one of the data items read from the two blocks on the basis of some of the bits of the address information held in the first address storage circuit.

4. The data supply device according to any one of claims 1 to 3, further comprising a feedback mechanism capable of writing, into the memory unit, operation results of one or more arithmetic units constituting the final stage of the arithmetic unit bundle.

5. The data supply device according to any one of claims 2 to 4, wherein each piece of address information held in each first address storage circuit is either an offset set in the address information input to the data supply device or address information obtained by adding the offset to that address information, and
each piece of address information held in each second address storage circuit is an offset set in the address information input to the data supply device.

6. The data supply device according to claim 5, wherein the shift register unit determines an output value from each register using the offset.

7. The data supply device according to claim 6, wherein the shift register unit determines an output value from each register by using the position of each register of the shift register unit as the offset.

8. The data supply device according to claim 6, wherein, when two pieces of address information are input to the data supply device,
the shift register unit
determines the output value from each register in one part of its registers by using the position of that register as the offset set in one of the pieces of address information input to the data supply device, and
determines the output value from each register in the other part of its registers by using the position of that register as the offset set in the other piece of address information input to the data supply device.

9. The data supply device according to claim 6, wherein, when a plurality of pieces of address information are input to the data supply device,
the shift register unit
determines the output value from each of some of its registers using an offset set in one piece of the address information input to the data supply device, and
outputs, as the output value from each of the remaining registers, data read from a block of the memory unit using the remaining pieces of address information input to the data supply device.

10. A data processing device in which a plurality of the arithmetic unit bundles are configured in multiple stages,
wherein, when, at the start of the next high-speed execution following a series of high-speed executions, the contents of the memory unit of the data supply device according to any one of claims 1 to 9 that supplies data to a certain arithmetic unit bundle can be used by a different operation instruction, the mapping of operation instructions to the arithmetic units constituting the arithmetic unit bundle is changed.

11. The data processing device according to claim 10, wherein the data processing device is a data processing device for executing an instruction code composed of a plurality of lines of machine language instructions, and comprises:
n register file units (n is an integer of 1 or more) including a first register file unit, which includes a plurality of first registers that correspond to a plurality of register numbers described in the instruction code and temporarily store data corresponding to the respective register numbers, and a second register file unit, which includes a plurality of second registers corresponding to the respective first registers of the first register file unit;
n arithmetic units including a first arithmetic unit, forming one stage of the multi-stage configuration, that executes an operation according to one of the plurality of lines of machine language instructions using data read from the first registers of the first register file unit, and a second arithmetic unit, forming another stage of the multi-stage configuration, that executes an operation according to a machine language instruction, among the plurality of lines, different from the one used by the first arithmetic unit; and
n holding units including a first holding unit that is the output destination of the operation result of the first arithmetic unit when the first arithmetic unit executes the operation and that temporarily holds that operation result,
wherein the first register file unit transfers data that is not subject to the operation by the first arithmetic unit to the second register of the second register file unit corresponding to the first register holding that data,
the first holding unit, when holding the operation result of the first arithmetic unit, treats the second arithmetic unit as the output destination and transfers the operation result to the second arithmetic unit, and
the second arithmetic unit executes its operation using at least one of the data read from the second registers of the second register file unit and the operation result transferred by the first holding unit, in parallel with the operation executed by the first arithmetic unit.

12. The data processing device according to claim 11, wherein the n register file units further include a third register file unit including a plurality of third registers corresponding to the respective second registers of the second register file unit,
the n arithmetic units further include a third arithmetic unit, forming another stage of the multi-stage configuration, that executes an operation according to a machine language instruction, among the plurality of lines, different from those used by the first arithmetic unit and the second arithmetic unit,
the n holding units further include a second holding unit that is the output destination of the operation result of the second arithmetic unit when the second arithmetic unit executes the operation and that temporarily holds that operation result,
the second register file unit transfers data that is not subject to the operation by the second arithmetic unit to the third register of the third register file unit corresponding to the second register holding that data,
the second holding unit, when holding the operation result of the second arithmetic unit, treats the third arithmetic unit as the output destination and transfers the operation result to the third arithmetic unit, and
the third arithmetic unit executes its operation using at least one of the data read from the third registers of the third register file unit and the operation result transferred by the second holding unit, in parallel with the operation executed by the first arithmetic unit and the operation executed by the second arithmetic unit.

13. The data processing device according to claim 11 or 12, wherein the N-th holding unit among the n holding units (N is an integer of 1 or more and n or less),
when the operation result it holds is used for an operation by the (N + 2)-th or a subsequent arithmetic unit among the n arithmetic units, transfers the operation result to the (N + 2)-th register file unit among the n register file units, and
when the operation result it holds is not used for an operation by the (N + 2)-th or any subsequent arithmetic unit, transfers the operation result to the (N + 1)-th arithmetic unit among the n arithmetic units.