WO2011033862A1

WO2011033862A1 - Information processing device and data transfer method

Info

Publication number: WO2011033862A1
Application number: PCT/JP2010/062664
Authority: WO
Inventors: 健加納
Original assignee: 日本電気株式会社
Priority date: 2009-09-15
Filing date: 2010-07-28
Publication date: 2011-03-24
Also published as: JPWO2011033862A1; JP5423801B2

Abstract

Provided is an information processing device that can reduce the time and power a cache memory spends searching an address array when elements are transferred to the cache memory. The processing device, which transfers m elements having a predetermined arrangement order to a cache memory comprising n memory banks, comprises: a storage unit in which the m elements are interleaved in the arrangement order and stored in n register banks; and a control unit that calculates a processing size by dividing the block size in the memory banks by the stride, and determines whether the start address matches the head address of any of the blocks. The processing device further comprises a processing unit that divides the m elements into processing groups, each consisting of elements that are consecutive in the arrangement order and whose number equals the processing size multiplied by n, and that, when the start address matches a head address, reads the elements in parallel from the register banks and transfers them such that the elements are read in parallel, one from each piece of block data obtained by dividing the elements in the processing group, in the arrangement order, into units of the processing size.

Description

Information processing apparatus and data transfer method

The present invention relates to an information processing apparatus and a data transfer method, and more particularly, to an information processing apparatus and a data transfer method for transferring data to a cache memory having a plurality of banks (memory banks).

A vector computer which is an information processing apparatus has a plurality of vector pipelines in order to simultaneously process a plurality of data for each machine clock (hereinafter simply referred to as “one clock”). The vector computer stores a plurality of elements in vector data processed by one vector instruction in any of a plurality of vector pipelines. Each vector pipeline processes each element stored in it. For example, page 214 of Non-Patent Document 1 shows a parallel pipeline and a multiple parallel pipeline.

Even when the vector length of the vector data is short, the vector data is interleaved and held in each vector pipeline for each element so that the processing time in each vector pipeline is equalized.

Specifically, when there are N vector pipelines, vector pipeline 0 processes element 0, element N, element 2N, ..., vector pipeline 1 processes element 1, element N + 1, element 2N + 1, ..., and vector pipeline N-1 processes element N-1, element 2N-1, element 3N-1, ....
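The element-wise interleaving described above can be sketched as follows. This is a minimal illustration rather than anything taken from the cited documents; the function name `interleave` is hypothetical.

```python
def interleave(num_elements, num_pipelines):
    """Return, for each vector pipeline, the element numbers it holds.

    Element e is assigned to pipeline e % N (N = num_pipelines), so
    pipeline p holds elements p, p + N, p + 2N, ...
    """
    pipelines = [[] for _ in range(num_pipelines)]
    for element in range(num_elements):
        pipelines[element % num_pipelines].append(element)
    return pipelines


# Example: 16 elements spread over 4 pipelines.
for number, elements in enumerate(interleave(16, 4)):
    print(number, elements)
```

With 16 elements and 4 pipelines, pipeline 0 holds elements 0, 4, 8, and 12, so every pipeline receives the same number of elements even for a short vector.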

The method of simultaneously processing a plurality of data for each clock is used not only when data is calculated, but also when data is loaded or stored. For example, it is known to provide the cache memory or memory with the same number of input / output ports as there are vector pipelines, and to load data from or store data to the cache memory or memory through these input / output ports.

In the main storage device, a data storage unit is provided for each input / output port corresponding one-to-one with the input / output port of the vector pipeline.

When the address of the main storage device to be accessed is designated, the input / output port in the main storage device and the data storage unit in the main storage device are uniquely determined.

Also, in a main storage device generally used in a vector computer, data is interleaved and stored for each element. This is so that, when the vector computer accesses a plurality of data stored at consecutive addresses in the main storage device, the plurality of data storage units are accessed in parallel and the maximum data access performance is achieved.

Patent Document 1 discloses a vector computer as described above, specifically, a vector processor having a plurality of vector pipelines and a plurality of input / output ports.

Further, Patent Document 2 discloses a method of storing a large number of elements stored in a vector register by interleaving into a data storage unit of a main storage device for each of a plurality of consecutive elements.

This vector register has a plurality of banks. Each bank has a one-to-one correspondence with the vector pipeline. Each bank is included in a vector pipeline corresponding to itself, and stores elements (specifically, data constituting the elements). A serial number is assigned to each bank. The elements in each bank are transferred to the main storage device via the crossbar switch.

The vector processing device described in Patent Document 2 uses the serial number assigned to each bank as the position at which reading of elements starts in that bank, and thereby avoids the problem of elements read from the banks competing for an output port of the crossbar switch.

On the other hand, it is possible to add a cache memory to the main memory used by the vector computer to reduce the access time to data and achieve high throughput.

The cache memory stores a plurality of data with consecutive memory addresses as block data. Hereinafter, a storage area defined by a memory address of block data, that is, a storage area in which block data is stored is referred to as a “block”.

The cache memory includes an address array that stores the memory address of the block and the state of the block, and a data array that stores the block data. The cache memory operates by determining whether the requested memory address is stored in the address array, thereby determining a process to be performed by the cache memory.

When a cache memory is used, in a process that writes a plurality of data having consecutive memory addresses, such as a process of storing vector data (hereinafter referred to as “vector store”), a plurality of write requests for a given block arrive at the cache memory in succession.

When the block accessed by a write request is the same as the block accessed by the preceding write request, the cache memory can determine its operation without searching the address array.

Therefore, if the vector computer operates so that data write requests having consecutive memory addresses continuously arrive at the cache memory, the time and power consumption for the cache memory to search the address array can be reduced.
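The effect of request ordering on address array searches can be illustrated with a small model. This is only a sketch under a simplifying assumption: a single bank skips the search whenever a write request targets the same block as the immediately preceding request. The function name `count_address_array_searches` is hypothetical.

```python
def count_address_array_searches(write_addresses, block_size=64):
    """Count address array searches for a sequence of write requests.

    Assumed model: a search is needed only when the block of the current
    request differs from the block of the previous request.
    """
    searches = 0
    previous_block = None
    for address in write_addresses:
        block = address // block_size
        if block != previous_block:
            searches += 1
        previous_block = block
    return searches


# Sixteen 8-byte writes to two blocks, block by block.
grouped = [8 * i for i in range(8)] + [512 + 8 * i for i in range(8)]
# The same sixteen writes, alternating between the two blocks.
alternating = [a for i in range(8) for a in (8 * i, 512 + 8 * i)]

print(count_address_array_searches(grouped))      # 2
print(count_address_array_searches(alternating))  # 16
```

Under this model the same sixteen writes need 2 searches when they arrive block by block and 16 when they alternate between blocks.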

Patent Document 1: JP 2005-38185 A
Patent Document 2: International Publication No. 2008/111500 Pamphlet

The vector store method disclosed in Patent Document 2 has the following problems.

First, the method that uses the serial number assigned to each bank as the position at which reading of elements starts in that bank can prevent elements read from the banks from competing at an output port of the crossbar switch. However, a plurality of elements that have consecutive memory addresses and are stored in the same data storage unit in the main storage device are not written to the main storage device in succession.

Therefore, when a cache memory is added to the main storage device, a plurality of elements stored in the same block cannot be written to the cache memory in succession, and the time and power consumption for searching the address array cannot be reduced. For example, as shown in FIG. 8 of Patent Document 2, the 1st to 7th elements are output with a delay of 25 clocks after the 0th element is output.

FIG. 1 is a diagram for explaining the problem that arises when the technique disclosed in Patent Document 2 is executed with eight vector pipelines, a cache memory block size of 64 B (bytes), and eight data storage units in the cache memory.

The upper half of FIG. 1 shows the arrangement of 128 elements (elements 0 to 127) in the eight vector pipelines (vector pipelines 0 to 7), more specifically, the arrangement of elements 0 to 127 in the vector register 810 formed by the banks in the eight vector pipelines. Arranging elements 0 to 127 in the order of their element numbers restores the vector data.

The lower half of FIG. 1 shows elements output from the crossbar switch 111 to eight data storage units (data storage units 0 to 7) in the cache memory every clock. In the lower half of FIG. 1, a bold frame 811 indicates a block boundary of the cache memory.

Here, the start address of the vector store is “0” and the stride is 8B. Since the block size of the cache memory is 64B and the stride is 8B, each block in each data storage unit stores eight elements having consecutive memory addresses, that is, eight elements having consecutive element numbers.

In FIG. 1, a bold vertical line 801 shown in the upper half indicates a block boundary of the cache memory. For example, elements 0 to 7 are stored in the same block of the cache memory.

Each vector pipeline has a read pointer. Each read pointer designates a position at which an element is read from a bank in the vector pipeline using read positions 0 to 15 shown in the upper half of FIG. 1.

Each element 802 shown in the upper half of FIG. 1 (elements indicated by bold squares, specifically, element 0, element 9, element 18, element 27, element 36, element 45, element 54, element 63) is an element read out first from a bank in each vector pipeline.

When the first clock is input, the following operations are executed.

The read pointer of the vector pipeline 0 outputs “0” as the read position. The read pointer of the vector pipeline 1 outputs “1” as the read position. The read pointer of the vector pipeline 2 outputs “2” as the read position. The read pointer of the vector pipeline 3 outputs “3” as the read position. The read pointer of the vector pipeline 4 outputs “4” as the read position. The read pointer of the vector pipeline 5 outputs “5” as the read position. The read pointer of the vector pipeline 6 outputs “6” as the read position. The read pointer of the vector pipeline 7 outputs “7” as the read position.

Therefore, the element 0 corresponding to the read position “0” is read from the bank in the vector pipeline 0 and the element 0 is input to the crossbar switch 111. Further, the element 9 corresponding to the read position “1” is read from the bank in the vector pipeline 1, and the element 9 is input to the crossbar switch 111. Further, the element 18 corresponding to the read position “2” is read from the bank in the vector pipeline 2, and the element 18 is input to the crossbar switch 111. Further, the element 27 corresponding to the read position “3” is read from the bank in the vector pipeline 3, and the element 27 is input to the crossbar switch 111. Further, the element 36 corresponding to the read position “4” is read from the bank in the vector pipeline 4, and the element 36 is input to the crossbar switch 111. Further, the element 45 corresponding to the read position “5” is read from the bank in the vector pipeline 5, and the element 45 is input to the crossbar switch 111. Further, the element 54 corresponding to the read position “6” is read from the bank in the vector pipeline 6, and the element 54 is input to the crossbar switch 111. Further, the element 63 corresponding to the read position “7” is read from the bank in the vector pipeline 7, and the element 63 is input to the crossbar switch 111.

Element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 63 are output by the crossbar switch 111 to data storage unit 0, data storage unit 1, data storage unit 2, data storage unit 3, data storage unit 4, data storage unit 5, data storage unit 6, and data storage unit 7 in the cache memory, respectively (see time 1 in FIG. 1). For this reason, no output conflict occurs in the crossbar switch 111.

Subsequently, when the second clock is input, the following operation is executed.

The read pointer of the vector pipeline 0 outputs “1” as the read position. The read pointer of the vector pipeline 1 outputs “2” as the read position. The read pointer of the vector pipeline 2 outputs “3” as the read position. The read pointer of the vector pipeline 3 outputs “4” as the read position. The read pointer of the vector pipeline 4 outputs “5” as the read position. The read pointer of the vector pipeline 5 outputs “6” as the read position. The read pointer of the vector pipeline 6 outputs “7” as the read position. The read pointer of the vector pipeline 7 outputs “8” as the read position.

Therefore, the element 8 is read from the bank in the vector pipeline 0, and the element 8 is input to the crossbar switch 111. Further, the element 17 is read from the bank in the vector pipeline 1, and the element 17 is input to the crossbar switch 111. Further, the element 26 is read from the bank in the vector pipeline 2, and the element 26 is input to the crossbar switch 111. Further, the element 35 is read from the bank in the vector pipeline 3, and the element 35 is input to the crossbar switch 111. Further, the element 44 is read from the bank in the vector pipeline 4, and the element 44 is input to the crossbar switch 111. Further, the element 53 is read from the bank in the vector pipeline 5, and the element 53 is input to the crossbar switch 111. Further, the element 62 is read from the bank in the vector pipeline 6, and the element 62 is input to the crossbar switch 111. Further, the element 71 is read from the bank in the vector pipeline 7, and the element 71 is input to the crossbar switch 111.

Element 8, element 17, element 26, element 35, element 44, element 53, element 62, and element 71 are output by the crossbar switch 111 to data storage unit 1, data storage unit 2, data storage unit 3, data storage unit 4, data storage unit 5, data storage unit 6, data storage unit 7, and data storage unit 0 in the cache memory, respectively (see time 2 in FIG. 1). For this reason, no output conflict occurs in the crossbar switch 111.

Thereafter, each time one of the third to 16th clocks is input, elements are read from the vector pipelines as shown in FIG. 1.

In this way, since elements read from each vector pipeline in accordance with one clock are stored in different data storage units, output conflicts in the crossbar switch 111 do not occur. Therefore, the performance of the vector store in the crossbar switch 111 does not deteriorate.

However, as shown in the lower half of FIG. 1, a plurality of elements stored in the same block in the data storage unit of the cache memory may not arrive at the data storage unit having the block continuously.

For example, in the data storage unit 0, element 0 and elements 1 to 7, which are stored in the same block, arrive separated by 8 clocks. Elements 71 to 64, which are stored in another block, are inserted between element 0 and elements 1 to 7.

For this reason, the time and power consumption for searching the address array cannot be reduced in the way they could be if elements stored in the same block arrived at the data storage unit in succession.
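The arrival order at data storage unit 0 described above can be reproduced with a small simulation. This is a sketch for illustration only. It assumes the FIG. 1 parameters (8 vector pipelines, read positions 0 to 15, 64 B blocks, 8 data storage units), that the block containing address A belongs to data storage unit (A / 64) mod 8, and that each read pointer wraps around after position 15. The function name `simulate_arrivals` is hypothetical.

```python
NUM_PIPELINES = 8   # vector pipelines, one bank each
NUM_UNITS = 8       # data storage units in the cache memory
BLOCK_SIZE = 64     # bytes per block
NUM_POSITIONS = 16  # read positions 0 to 15 in each bank


def simulate_arrivals(start_address, stride, num_clocks):
    """Return {unit: [(clock, element), ...]} for the bank-number scheme.

    Pipeline p starts reading at position p and advances one position per
    clock. Wrapping modulo NUM_POSITIONS is an assumption of this sketch.
    """
    arrivals = {unit: [] for unit in range(NUM_UNITS)}
    for clock in range(1, num_clocks + 1):
        for p in range(NUM_PIPELINES):
            position = (p + clock - 1) % NUM_POSITIONS
            element = position * NUM_PIPELINES + p
            address = start_address + element * stride
            unit = (address // BLOCK_SIZE) % NUM_UNITS
            arrivals[unit].append((clock, element))
    return arrivals


unit0 = [element for _, element in simulate_arrivals(0, 8, 16)[0]]
print(unit0)
```

Under these assumptions, data storage unit 0 receives element 0 at clock 1, then elements 71 down to 64, and only then elements 7 down to 1, so the writes to the block holding elements 0 to 7 are interrupted by eight writes to another block.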

Second, Patent Document 2 shows only an example in which element 0 is the first, in memory address order, of the elements stored in the same block. In general, however, element 0 can be written to an arbitrary memory address.

When element 0 is written to an address other than the head address of a block, the method of simply using the bank number (vector pipeline number) as the read start position of each bank cannot completely avoid element contention at the output ports of the crossbar switch. For this reason, the execution time of the vector store increases.

Furthermore, two elements stored in different blocks in the same data storage unit in the cache memory compete at the output port of the crossbar switch. For this reason, in some cases, elements stored in the same block do not reach the data storage unit continuously. Therefore, it is impossible to reduce the time and power consumption for searching the address array.

FIG. 2 is a diagram for explaining a problem that occurs when the technique disclosed in Patent Document 2 is executed with eight vector pipelines, a cache memory block size of 64B, and eight data storage units in the cache memory.

The upper half of FIG. 2 shows the arrangement of 128 elements (elements 0 to 127) in the eight vector pipelines (vector pipelines 0 to 7), more specifically, the arrangement of elements 0 to 127 in the vector register 910 formed by the banks in the eight vector pipelines. Arranging elements 0 to 127 in the order of their element numbers restores the vector data.

The lower half of FIG. 2 shows elements output from the crossbar switch 111 to eight data storage units (data storage units 0 to 7) in the cache memory every clock.

Here, the start address of the vector store is “8” and the stride is 8B. Since the block size of the cache memory is 64B and the stride is 8B, each block in each data storage unit can store eight elements having consecutive memory addresses, that is, eight elements having consecutive element numbers.

In FIG. 2, bold vertical lines 901 shown in the upper half indicate the block boundaries of the cache memory. For example, elements 7 to 14 are stored in the same block of the cache memory.

Each element 902 shown in the upper half of FIG. 2 (elements indicated by bold squares, specifically, element 0, element 9, element 18, element 27, element 36, element 45, element 54, element 63) is an element read out first from a bank in each vector pipeline.

When the first clock is input, the following operations are executed.

As in the case of FIG. 1, the read pointer of each vector pipeline outputs the number of its own vector pipeline as the read position. Therefore, element 0 is read from the bank in vector pipeline 0, and element 0 is input to crossbar switch 111. Further, the element 9 is read from the bank in the vector pipeline 1, and the element 9 is input to the crossbar switch 111. Further, the element 18 is read from the bank in the vector pipeline 2, and the element 18 is input to the crossbar switch 111. Further, the element 27 is read from the bank in the vector pipeline 3, and the element 27 is input to the crossbar switch 111. Further, the element 36 is read from the bank in the vector pipeline 4, and the element 36 is input to the crossbar switch 111. Further, the element 45 is read from the bank in the vector pipeline 5, and the element 45 is input to the crossbar switch 111. Further, the element 54 is read from the bank in the vector pipeline 6, and the element 54 is input to the crossbar switch 111. Further, the element 63 is read from the bank in the vector pipeline 7, and the element 63 is input to the crossbar switch 111.

Element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 63 are directed by the crossbar switch 111 to data storage unit 0, data storage unit 1, data storage unit 2, data storage unit 3, data storage unit 4, data storage unit 5, data storage unit 6, and data storage unit 0 in the cache memory, respectively. For this reason, in the crossbar switch 111, element 0 and element 63 contend for the output to the data storage unit 0, and this contention reduces the performance of the vector store.

In FIG. 2, the element 0 arrives at the data storage unit 0 before the element 63, but the arrival order of the element 0 and the element 63 depends on the contention arbitration method in the crossbar switch 111. Therefore, it cannot be guaranteed that the elements stored in the same block arrive at the data storage unit continuously.
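The contention at the first clock can be checked with a short calculation. The sketch below is illustrative only. It assumes that pipeline p reads position p at the first clock (so it reads element 9p) and that the block containing address A belongs to data storage unit (A / 64) mod 8. The function name `first_clock_targets` is hypothetical.

```python
def first_clock_targets(start_address, stride, num_pipelines=8,
                        num_units=8, block_size=64):
    """Return {unit: [elements]} for the elements read at the first clock."""
    targets = {}
    for p in range(num_pipelines):
        element = p * num_pipelines + p  # read position p in pipeline p
        address = start_address + element * stride
        unit = (address // block_size) % num_units
        targets.setdefault(unit, []).append(element)
    return targets


aligned = first_clock_targets(start_address=0, stride=8)
shifted = first_clock_targets(start_address=8, stride=8)
print(aligned)
print(shifted)
```

With a start address of 0, every data storage unit receives exactly one element. With a start address of 8, element 63 falls at address 512, which belongs to data storage unit 0, so elements 0 and 63 target the same output port while data storage unit 7 receives nothing.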

An object of the present invention is to provide an information processing apparatus and a data transfer method that can solve the above-described problem that elements stored in the same block do not continuously arrive at a data storage unit.

The information processing apparatus according to the present invention is an information processing apparatus that transfers m data elements whose arrangement order is determined to a cache memory having n (n is an integer of 2 or more) memory banks, each having memory blocks of a predetermined size associated with different addresses, the apparatus comprising:
storage means having n register banks, in which the m data elements are interleaved in units of data elements in the arrangement order and stored in the n register banks in order;
control means for, upon receiving stride information indicating the stride between data elements and a start address that is the address of the first data element in the arrangement order, calculating a processing size by dividing the predetermined size by the stride and determining whether the start address matches the head address of any of the memory blocks; and
processing means for dividing the m data elements into processing groups, each consisting of data elements that are consecutive in the arrangement order and whose number equals the processing size multiplied by n, and, when it is determined that the start address matches the head address of any of the memory blocks, for each processing group into which the m data elements are divided, reading the data elements in parallel from the n register banks and transferring them to the cache memory such that one data element is read in parallel from each piece of block data obtained by dividing the data elements in the processing group, in the arrangement order, into units of the processing size, until all the data elements in the processing group have been read.

The data transfer method according to the present invention is a data transfer method in an information processing apparatus that transfers m data elements whose arrangement order is determined to a cache memory having n (n is an integer of 2 or more) memory banks, each having memory blocks of a predetermined size associated with different addresses, the method comprising:
interleaving the m data elements in units of data elements in the arrangement order and storing them in n register banks in order;
upon receiving stride information indicating the stride between data elements and a start address that is the address of the first data element in the arrangement order, calculating a processing size by dividing the predetermined size by the stride and determining whether the start address matches the head address of any of the memory blocks; and
dividing the m data elements into processing groups, each consisting of data elements that are consecutive in the arrangement order and whose number equals the processing size multiplied by n, and, when it is determined that the start address matches the head address of any of the memory blocks, for each processing group into which the m data elements are divided, reading the data elements in parallel from the n register banks and transferring them to the cache memory such that one data element is read in parallel from each piece of block data obtained by dividing the data elements in the processing group, in the arrangement order, into units of the processing size, until all the data elements in the processing group have been read.

According to the present invention, when data elements are transferred and stored in the cache memory, it is possible to reduce the time and power consumption for the cache memory to search the address array.

FIG. 1 is a diagram explaining a problem of the technique of Patent Document 2.
FIG. 2 is a diagram explaining a problem of the technique of Patent Document 2.
FIG. 3 is a diagram showing the vector processor of the first embodiment of the present invention.
FIG. 4 is a diagram showing the vector store control unit and the read position control unit.
FIG. 5 is a flowchart showing the operation of the read position control unit.
FIG. 6 is a diagram showing the address calculation unit.
FIG. 7 is a diagram showing the operation of a vector store.
FIG. 8 is a diagram showing the operation of a vector store.
FIG. 9 is a diagram showing the operation of a vector store.

Hereinafter, a vector processor which is an information processing apparatus according to an embodiment of the present invention will be described in detail with reference to the drawings. The information processing apparatus according to the present embodiment is not limited to a vector processor, and may be any information processing apparatus that has a plurality of calculation pipelines and transfers the data stored in each calculation pipeline to a plurality of banks (memory banks) in a cache memory.

FIG. 3 is a block diagram showing a configuration example of the vector processor of the present embodiment.

As shown in FIG. 3, the vector processor includes a CPU (Central Processing Unit) 101.

The CPU 101 includes an instruction issue control unit 102, a vector processing unit 103, and an address calculation unit 104. The CPU 101 is connected to the main storage device 106 via the cache memory 105.

The cache memory 105 includes a plurality of data storage units 107-0 to 107-7.

Each data storage unit 107-0 to 107-7 is a memory bank. Each of the data storage units 107-0 to 107-7 includes a storage area (memory block, hereinafter referred to as “block”) for storing a plurality of elements having consecutive memory addresses or one element. Each of the data storage units 107-0 to 107-7 has a plurality of blocks. The size (predetermined size) of each block is 64B. The element size is 8B. The memory address is 64 bits.

In the present embodiment, the data storage unit 107-0 has a block for caching data at memory addresses 0 to 56 and a block for caching data at memory addresses 512 to 568.

The data storage unit 107-1 has a block for caching data at memory addresses 64-120 and a block for caching data at memory addresses 576-632.

The data storage unit 107-2 has a block for caching data at memory addresses 128 to 184 and a block for caching data at memory addresses 640 to 696.

The data storage unit 107-3 has a block for caching data at memory addresses 192 to 248 and a block for caching data at memory addresses 704 to 760.

The data storage unit 107-7 includes a block for caching data at memory addresses 448 to 504 and a block for caching data at memory addresses 960 to 1016.

That is, blocks having a predetermined storage area (64B) are set in the data storage units 107-0 to 107-7 in a state where they are interleaved in the order of memory addresses.
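The block-level interleaving listed above corresponds to a simple address mapping, sketched here for the embodiment's values (64 B blocks, eight data storage units). The function names are hypothetical and are used only for illustration.

```python
BLOCK_SIZE = 64  # bytes per block
NUM_UNITS = 8    # data storage units 107-0 to 107-7


def data_storage_unit(address):
    """Return the number of the data storage unit that caches the address."""
    return (address // BLOCK_SIZE) % NUM_UNITS


def block_head_address(address):
    """Return the head address of the block containing the address."""
    return address - address % BLOCK_SIZE


for address in (0, 56, 64, 512, 568, 1016):
    print(address, data_storage_unit(address), block_head_address(address))
```

For example, addresses 0 to 56 and 512 to 568 both map to data storage unit 0, and addresses 960 to 1016 map to data storage unit 7, in agreement with the ranges given for 107-0 and 107-7.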

In the present embodiment, the number of memory banks (n) is “8”, but the number of banks may be an integer of 2 or more, more specifically, an arbitrary power of 2.

The vector processing unit 103 includes a plurality of vector pipelines 109-0 to 109-7, a vector store control unit 110, and a crossbar switch 111. The suffix of the vector pipeline 109 means the “number” assigned to the vector pipeline.

In the present embodiment, the number of vector pipelines (n) is “8”, but the number of vector pipelines may be an integer of 2 or more, more specifically, an arbitrary power of two.

The vector pipelines 109-0 to 109-7 have the same configuration. Therefore, the vector pipeline 109-0 among the vector pipelines 109-0 to 109-7 will be described.

The vector pipeline 109-0 includes a vector calculator 112-0 and a vector register 113-0. The vector register 113-0 includes a read position control unit 114-0 and a bank 113A-0. Each suffix of the vector calculator 112, the vector register 113, the read position control unit 114, and the bank 113A means a “number” assigned to the vector pipeline 109 in which they are included.

Each of 256 elements (elements 0 to 255) is stored in one of the banks 113A-0 to 113A-7. An element can generally be referred to as a data element. Each of the banks 113A-0 to 113A-7 can generally be referred to as a register bank.

The order of the 256 elements is determined. In this embodiment, the arrangement order of 256 elements is the order of element 0 to element 255. Therefore, element 0 is the first element in the arrangement order among 256 elements (elements 0 to 255). In this embodiment, 256 is used as the number m of elements, but m is not limited to 256.

In this embodiment, the element 8N (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-0 (specifically, the bank 113A-0). Element “8N + 1” (N = 0, 1, 2,..., 31) is stored in vector pipeline 109-1 (specifically, bank 113A-1). Element “8N + 2” (N = 0, 1, 2,..., 31) is stored in vector pipeline 109-2 (specifically, bank 113A-2). The element “8N + 3” (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-3 (specifically, the bank 113A-3). The element “8N + 4” (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-4 (specifically, the bank 113A-4). The element “8N + 5” (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-5 (specifically, the bank 113A-5). The element “8N + 6” (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-6 (specifically, the bank 113A-6). Further, the element “8N + 7” (N = 0, 1, 2,..., 31) is stored in the vector pipeline 109-7 (specifically, the bank 113A-7).

That is, 256 elements are interleaved for each element in the arrangement order and stored in the eight banks 113A-0 to 113A-7 in order.

The crossbar switch 111 has 8 input ports and 8 output ports. The number of input ports of the crossbar switch 111 is the same as the number of vector pipelines 109. The number of output ports of the crossbar switch 111 is the same as the number of data storage units 107.

In the vector store, the memory address from the address calculation unit 104 and the element (write data) from the vector register 113 are input to the input port of the crossbar switch 111. The crossbar switch 111 outputs a memory address and an element (write data) to an output port connected to the data storage unit 107 determined by the memory address.

The instruction issue control unit 102 provides the start address, which is the address of the first element 0 in the arrangement order (hereinafter simply referred to as “start address”), and stride information indicating the stride between elements, that is, between data elements (hereinafter simply referred to as “stride”), to the vector store control unit 110 and the address calculation unit 104.

The vector store control unit 110 can be generally called control means.

When the vector store control unit 110 receives the stride and the start address from the instruction issuance control unit 102, the vector store control unit 110 divides the block size (64B) by the stride to calculate a processing size (hereinafter referred to as a “processing group size”). Further, the vector store control unit 110 determines whether the start address matches the head address of any block.
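The two computations attributed to the vector store control unit 110 can be sketched as follows. This is an illustrative sketch, not the circuit of the embodiment. It assumes that the stride divides the block size evenly, and the function names are hypothetical.

```python
def processing_group_size(block_size, stride):
    """Return the processing size: the block size divided by the stride.

    Assumption of this sketch: the stride is positive and divides the
    block size evenly; other cases are rejected rather than handled.
    """
    if stride <= 0 or block_size % stride != 0:
        raise ValueError("stride must be positive and divide the block size")
    return block_size // stride


def starts_at_block_head(start_address, block_size):
    """Return True if the start address is the head address of a block."""
    return start_address % block_size == 0


print(processing_group_size(64, 8))  # 8
print(starts_at_block_head(0, 64))   # True
print(starts_at_block_head(8, 64))   # False
```

With the embodiment's 64 B blocks and an 8 B stride, the processing size is 8. A start address of 0 lies at the head of a block, whereas a start address of 8 does not.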

The vector store control unit 110 determines whether or not to correct the position where the reading of the element is started in each vector register 113 based on the start address and the stride.

For example, the vector store control unit 110 determines the size of the processing group based on the block size (64 B), the stride from the instruction issue control unit 102 (for example, 8 B), the number of banks in the cache memory 105 (eight), and the number of vector pipelines (eight).

The vector store control unit 110 transmits the result of determining whether or not to correct the position at which reading of elements starts in each vector register 113 (hereinafter simply referred to as “determination result”) and the size of the processing group to the read position control unit 114 in each vector register 113. The vector store control unit 110 will be described in detail later.

Each read position control unit 114 is included in the processing unit 114A. The processing unit 114A can be generally called processing means.

The processing unit 114A divides 256 elements into processing groups. Each processing group consists of elements that are consecutive in the arrangement order, and the number of elements in a group is the value obtained by multiplying the size of the processing group by the number of vector registers 113.

The processing unit 114A performs the following operation when it is determined that the start address matches the head address of any block in the cache memory 105.

For each processing group, the processing unit 114A divides the elements in the processing group, in the arrangement order, into block data each containing a number of elements equal to the processing group size. The processing unit 114A then reads elements in parallel from the eight banks 113A and transfers them to the cache memory 105 in such a way that one element is read in parallel from each of the block data, until all the elements in the processing group have been read.

Further, the processing unit 114A performs the following operation when the start address does not match the head address in any block in the cache memory 105.

For each processing group, the processing unit 114A assigns addresses to the data elements in the processing group based on the start address and the stride. When the data elements in the processing group are divided, based on these addresses, into nine block data each corresponding to one of the blocks, the processing unit 114A first reads one element in parallel from each of the eight block data other than the last block data, which has the largest address among the nine block data. After all the elements in the first block data, which has the smallest address among the nine block data, have been read, the processing unit 114A reads one unread element in parallel from each of the eight block data other than the first block data. The processing unit 114A reads the elements in parallel from the eight banks 113A in this way until all the elements in the processing group have been read, and transfers them to the cache memory 105.

Each read position control unit 114 calculates the read position of the element (vector data) to be transferred from the bank 113A to the cache memory 105, based on the determination result of the vector store control unit 110, the size of the processing group, and the number of the bank 113A.

Each read position control unit 114 provides the read position to the address calculation unit 104.

Further, each read position control unit 114 reads an element (data) from the read position in the bank 113A, and inputs the element as write data to the input port of the crossbar switch 111.

The read position control unit 114 will be described in detail later.

The address calculation unit 104 calculates a write memory address in which write data is stored based on the start address and stride from the instruction issue control unit 102 and the read position from each read position control unit 114.

The address calculation unit 104 inputs the write memory address to the input port of the crossbar switch 111 together with the write data from each vector pipeline 109. The address calculation unit 104 will be described in detail later.

FIG. 4 is a diagram showing the vector store control unit 110 and the read position control unit 114.

In FIG. 4, the vector store control unit 110 includes a register 201 for storing the lower 6 to 4 bits of the start address (hereinafter simply referred to as “start address register”), a stride register 202, a read start position correction calculation circuit 203 (hereinafter simply referred to as “calculation circuit”), and a processing group size determination circuit 204.

The start address register 201 stores lower 6 to 4 bit data of the start address from the instruction issue control unit 102. The lower 6 to 4 bits of data of the start address indicate the position of element 0 in the block.
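The role of these bits can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes that “lower 6 to 4 bits” counts bits from 1 at the least significant end (bits 5 to 3 when counting from 0), which fits a 64 B block addressed in 8 B units. The function names `start_bits` and `matches_block_head` are hypothetical and do not appear in the embodiment.

```python
BLOCK_SIZE = 64  # bytes per cache block in this embodiment


def start_bits(start_address: int) -> int:
    """Return the 6th to 4th lowest bits of the start address.

    Assumption: bits are counted from 1, so this is bits 5..3 counted from 0,
    i.e. the position of element 0 inside its block in 8 B units.
    """
    return (start_address >> 3) & 0b111


def matches_block_head(start_address: int) -> bool:
    """True when the start address is the head address of some 64 B block."""
    return start_address % BLOCK_SIZE == 0
```

Under these assumptions, a start address of 0 yields the bit pattern 000 and matches a block head, while a start address of 8 yields 001 and does not.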

The stride register 202 stores a stride from the instruction issue control unit 102.

The calculation circuit 203 performs the calculation shown in Table 1 based on the lower 6 to 4 bits of the start address in the start address register 201 and the stride in the stride register 202, and determines the vector pipelines 109 in which the read start position is to be corrected.

The calculation circuit 203 supplies a correction signal of “0” or “1” to the read position control unit 114 of each vector pipeline 109. In the present embodiment, a correction signal of “1” indicates that the read start position is to be corrected, and a correction signal of “0” indicates that the read start position is not to be corrected.

That is, when the calculation circuit 203 determines that the start address matches the head address of any block in the cache memory 105, the calculation circuit 203 supplies a correction signal of “0” to all the read position control units 114. If the calculation circuit 203 determines that the start address does not match the head address of any block in the cache memory 105, the calculation circuit 203 supplies a correction signal of “1” to at least one read position control unit 114.

The processing group size determination circuit 204 performs the calculation shown in Table 2 based on the stride in the stride register 202 to determine the size of the processing group.

Table 2 shows the size of the processing group for each stride in this embodiment, in which the number of vector pipelines is 8, the number of elements stored in one block in the cache memory 105 is 8, and the number of banks in the cache memory 105, that is, the number of data storage units 107, is 8.

In general,

Processing group size = number of elements per block × number of banks in the cache memory ÷ number of vector pipelines

Note that the processing group size determination circuit 204 calculates the number of elements per block as block size ÷ stride.
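As a minimal sketch, the general formula can be written in Python as below. The function names are hypothetical, and the sketch assumes that the stride divides the block size evenly and that the product is divisible by the number of vector pipelines; Table 2 itself is not reproduced here.

```python
def elements_per_block(block_size: int, stride: int) -> int:
    """Number of elements per block = block size / stride (assumed exact)."""
    return block_size // stride


def processing_group_size(block_size: int, stride: int,
                          num_banks: int, num_pipelines: int) -> int:
    """Elements per block x number of cache banks / number of vector pipelines."""
    return elements_per_block(block_size, stride) * num_banks // num_pipelines
```

With the values of this embodiment (64 B block, 8 B stride, eight banks, eight vector pipelines), `processing_group_size(64, 8, 8, 8)` evaluates to 8.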

The size of the processing group will be described in detail later.

The processing group size determination circuit 204 supplies the processing group size to the read position control unit 114 in each vector pipeline 109.

Next, the read position control unit 114 will be described.

The read position control unit 114 is provided in each vector pipeline 109. The read position control unit 114 includes a read start position determination circuit 205, a pipeline number register 206, a read position offset register 207, a read position base register 208, a decrementer 209, a modulo calculator 210, a multiplexer (MUX) 211, an adder 212, a multiplexer (MUX) 213, an adder 214, a decrementer 215, a counter 216, an effective read position register 218, a comparator 219, and a control circuit 217 for controlling them.

The pipeline number register 206 stores the number of the vector pipeline 109 including the pipeline number register 206.

In this embodiment, since there are eight vector pipelines 109, any one of “0” to “7” is stored in the pipeline number register 206.

“0” is stored in the pipeline number register 206 in the vector pipeline 109-0. “1” is stored in the pipeline number register 206 in the vector pipeline 109-1.

Also, “2” is stored in the pipeline number register 206 in the vector pipeline 109-2. “3” is stored in the pipeline number register 206 in the vector pipeline 109-3.

Also, “4” is stored in the pipeline number register 206 in the vector pipeline 109-4. “5” is stored in the pipeline number register 206 in the vector pipeline 109-5.

Also, “6” is stored in the pipeline number register 206 in the vector pipeline 109-6. “7” is stored in the pipeline number register 206 in the vector pipeline 109-7.

The read start position determination circuit 205 determines the read start position based on the correction signal from the calculation circuit 203 and the pipeline number from the pipeline number register 206.

When the correction signal is “0”, the read start position determining circuit 205 outputs “pipeline number” as the read start position. When the correction signal is “1”, the read start position determination circuit 205 outputs “pipeline number−1” as the read start position.

Next, the operation of the read position control unit 114 will be described with reference to the flowcharts of FIGS.

First, the control circuit 217 causes the multiplexer (MUX) 213 to select “0” and writes “0” to the read position base register 208 (step 301).

Next, the control circuit 217 causes the decrementer 215 to subtract “1” from the size of the processing group supplied by the vector store control unit 110, and sets the output of the decrementer 215 (processing group size − 1) in the counter 216 as an initial value (step 302).

Then, the control circuit 217 causes the multiplexer (MUX) 211 to select the read start position determined by the read start position determining circuit 205.

Subsequently, the control circuit 217 causes the modulo arithmetic unit 210 to perform an operation for taking the modulo of the size of the processing group with respect to the read start position selected by the multiplexer (MUX) 211.

Subsequently, the control circuit 217 writes the calculation result of the modulo calculator 210 in the read position offset register 207 in response to the clock (step 303).

Subsequently, the adder 214 adds the value in the read position offset register 207 and the value in the read position base register 208, and an element is read from the bank 113A using the addition result as the first read position.

The read position is also sent to the address calculation unit 104. The address calculation unit 104 calculates the write memory address at which the first element is to be stored, based on the read position and on the start address and stride of the vector store from the instruction issue control unit 102, and outputs the calculation result.

Next, the control circuit 217 checks whether the value of the counter 216 is “0” (step 304).

If the value of the counter 216 is not 0 (step 305), the control circuit 217 causes the decrementer 209 to perform an operation of subtracting “1” from the value in the reading position offset register 207.

Subsequently, the control circuit 217 causes the multiplexer (MUX) 211 to select the output of the decrementer 209.

Subsequently, the control circuit 217 causes the modulo arithmetic unit 210 to perform an operation for taking the modulo of the size of the processing group with respect to the value selected by the multiplexer (MUX) 211.

Subsequently, the control circuit 217 writes the calculation result of the modulo calculator 210 in the read position offset register 207 in response to the clock (step 306).

Subsequently, the adder 214 adds the value in the read position offset register 207 and the value in the read position base register 208, and an element is read from the bank 113A using the addition result as the next read position.

The read position is also sent to the address calculation unit 104. The address calculation unit 104 calculates the write memory address at which the next element is to be stored, based on the read position and on the start address and stride of the vector store from the instruction issue control unit 102, and outputs the calculation result.

Next, the control circuit 217 decrements the value of the counter 216 (step 307). Then, the control circuit 217 checks whether the value of the counter 216 is “0” (step 304).

When the value of the counter 216 is “0” (step 308), the control circuit 217 causes the adder 212 to add the size of the processing group to the value of the read position base register 208, and writes the result into the read position base register 208 via the multiplexer (MUX) 213 (step 309).

Then, the control circuit 217 checks whether or not the value of the read position base register 208 is a valid read position (step 310). In this embodiment, the control circuit 217 makes this determination by comparing the value of the valid read position register 218 with the value of the read position base register 208.

When the value of the read position base register 208 is valid (step 311), the control circuit 217 returns to step 302. If the value of the read position base register 208 is not valid (step 312), the control circuit 217 ends the process.
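The flow of steps 301 to 312 can be sketched as a Python generator. This is an illustrative model rather than the circuit itself: the function and parameter names are hypothetical, the comparator 219 and its invalid signal are not modeled, and the sketch assumes that a base value equal to the value of the valid read position register 218 still counts as valid (the text only gives cases where the base is smaller or larger).

```python
from typing import Iterator


def read_positions(pipeline: int, correction: int,
                   group_size: int, valid_limit: int) -> Iterator[int]:
    """Yield the read positions one vector pipeline outputs, clock by clock."""
    # Read start position determination circuit 205.
    start = pipeline - 1 if correction == 1 else pipeline
    base = 0                                    # step 301
    while True:
        counter = group_size - 1                # step 302
        offset = start % group_size             # step 303 (modulo calculator 210)
        yield base + offset                     # adder 214
        while counter != 0:                     # step 304
            offset = (offset - 1) % group_size  # step 306 (decrementer 209)
            yield base + offset
            counter -= 1                        # step 307
        base += group_size                      # step 309 (adder 212)
        if base > valid_limit:                  # step 310 (equality assumed valid)
            return                              # step 312
```

For example, `list(read_positions(1, 0, 8, 15))` gives `[1, 0, 7, 6, 5, 4, 3, 2, 9, 8, 15, 14, 13, 12, 11, 10]` for vector pipeline 1 with a processing group size of 8 and no correction.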

The effective read position register 218 stores the maximum read position calculated from the vector length (the length of the vector formed by adding the elements 0 to 255) and the vector pipeline number.

The comparator 219 compares the value of the valid read position register 218 with the read position from the adder 214.

When the read position is larger than the value of the valid read position register 218, the comparator 219 outputs an invalid signal to the crossbar switch 111 indicating that the element read from the bank 113A is invalid. The crossbar switch 111 does not process an element that is reported to be invalid.

The maximum read position set in the valid read position register 218 is calculated by the following formula.

When vector length mod number of vector pipelines < vector pipeline number:
Maximum read position = vector length ÷ number of vector pipelines − 1

When vector length mod number of vector pipelines ≥ vector pipeline number:
Maximum read position = vector length ÷ number of vector pipelines

Next, processing of the read position control unit 114, particularly the control circuit 217, will be described using specific values.

The following describes processing in the read position control unit 114 of the vector pipeline 109-1 when the stride is “8B” and the lower 6th to 4th bits of the start address are “000”.

The vector length is 128 elements. Since there are eight vector pipelines, “15” is set in the effective read position register 218.

First, the calculation circuit 203 outputs the correction signal “0” to the vector pipeline 109-1 according to Table 1.

Also, the processing group size determination circuit 204 outputs “8” as the size of the processing group according to Table 2.

First, the control circuit 217 writes “0” in the read position base register 208 (step 301).

Subsequently, the control circuit 217 sets “8-1 = 7” as an initial value in the counter 216 (step 302).

Since the correction signal is “0” and the value in the pipeline number register 206 is “1”, the read start position determination circuit 205 outputs “1”.

Subsequently, the control circuit 217 causes the multiplexer (MUX) 211 to select “1” from the read start position determination circuit 205.

Subsequently, the control circuit 217 causes the modulo arithmetic unit 210 to perform an operation for performing modulo 8 on “1” from the multiplexer (MUX) 211.

Subsequently, the control circuit 217 writes the calculation result “1” of the modulo calculator 210 in the read position offset register 207 (step 303).

As a result, the adder 214 outputs “0 + 1 = 1” as the reading position of the first element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “7”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “0” obtained by subtracting “1” from the value “1” in the read position offset register 207 using the modulo calculator 210. The calculation result “0” is written to the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 0 = 0” as the reading position of the second element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “7”, the value of the counter 216 is “6”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “6”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “−1” obtained by subtracting “1” from the value “0” in the read position offset register 207 using the modulo calculator 210. The calculation result “7” is written in the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 7 = 7” as the reading position of the third element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “6”, the value of the counter 216 is “5”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “5”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “6” obtained by subtracting “1” from the value “7” in the reading position offset register 207 using the modulo calculator 210. The calculation result “6” is written to the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 6 = 6” as the reading position of the fourth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “5”, the value of the counter 216 is “4”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “4”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “5” obtained by subtracting “1” from the value “6” in the read position offset register 207 using the modulo calculator 210. The calculation result “5” is written to the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 5 = 5” as the reading position of the fifth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “4”, the value of the counter 216 is “3”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “3”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “4” obtained by subtracting “1” from the value “5” in the read position offset register 207 using the modulo calculator 210. The calculation result “4” is written in the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 4 = 4” as the reading position of the sixth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “3”, the value of the counter 216 is “2”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “2”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “3” obtained by subtracting “1” from the value “4” in the reading position offset register 207 using the modulo calculator 210. The calculation result “3” is written to the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 3 = 3” as the reading position of the seventh element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “2”, the value of the counter 216 is “1”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “1”, the control circuit 217 proceeds to Step 306.

In step 306, the control circuit 217 performs a modulo 8 operation on the value “2” obtained by subtracting “1” from the value “3” in the read position offset register 207 using the modulo calculator 210. The calculation result “2” is written in the read position offset register 207 in response to the clock.

As a result, the adder 214 outputs “0 + 2 = 2” as the reading position of the eighth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

Next, in step 307, the control circuit 217 decrements the value of the counter 216. Since the value of the counter 216 is “1”, the value of the counter 216 is “0”.

Subsequently, in step 304, the control circuit 217 compares the value of the counter 216 with “0”. Since the value of the counter 216 is “0”, the control circuit 217 proceeds to Step 309.

In step 309, the control circuit 217 causes the adder 212 to add the processing group size “8” to the value “0” of the read position base register 208. Subsequently, the control circuit 217 causes the multiplexer (MUX) 213 to select the calculation result “8” of the adder 212 and writes it to the read position base register 208.

Subsequently, in step 310, the control circuit 217 checks whether the value of the read position base register 208 is valid. Since the value “8” of the read position base register 208 is smaller than the value “15” of the valid read position register 218, the control circuit 217 determines that the value of the read position base register 208 is valid, and proceeds to step 302.

Next, in step 302, the control circuit 217 sets “8-1 = 7” as an initial value in the counter 216.

As a result, the adder 214 outputs “8 + 1 = 9” as the reading position of the ninth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 0 = 8” as the reading position of the tenth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 7 = 15” as the reading position of the eleventh element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 6 = 14” as the reading position of the twelfth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 5 = 13” as the reading position of the thirteenth element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 4 = 12” as the reading position of the 14th element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 3 = 11” as the reading position of the 15th element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

As a result, the adder 214 outputs “8 + 2 = 10” as the reading position of the 16th element. Since the value of the valid read position register 218 is “15”, no invalid signal is sent to the crossbar switch 111.

In step 309, the control circuit 217 causes the adder 212 to add the processing group size “8” to the value “8” of the read position base register 208. Subsequently, the control circuit 217 causes the multiplexer (MUX) 213 to select the calculation result “16” of the adder 212 and writes it to the read position base register 208.

Subsequently, in step 310, the control circuit 217 checks whether the value of the read position base register 208 is valid. Since the value “16” of the read position base register 208 is larger than the value “15” of the valid read position register 218, the control circuit 217 determines that the value of the read position base register 208 is invalid and ends the processing.

FIG. 6 is a diagram showing the address calculation unit 104.

The address calculation unit 104 includes a start address register 401, a stride register 402, and an address calculation circuit 403 that corresponds to each vector pipeline 109 on a one-to-one basis.

In this embodiment, since there are eight vector pipelines 109, eight address calculation circuits 403 are also provided.

The start address register 401 receives and stores the start address of the vector store instruction from the instruction issue control unit 102.

The stride register 402 receives and stores the stride from the instruction issue control unit 102.

The address calculation circuit 403 calculates the memory address using the value of the start address register 401, the value of the stride register 402, and the read position from the read position control unit 114 in the vector pipeline 109 corresponding to its own circuit.

The address calculation circuit 403 includes a pipeline number register 404, an offset calculation circuit 405, and an adder 406.

In the pipeline number register 404, the number of the corresponding vector pipeline 109 is stored.

In this embodiment, since there are eight vector pipelines 109, any one of “0” to “7” is stored in the pipeline number register 404.

The offset calculation circuit 405 calculates the memory address offset based on the stride, the read position, and the pipeline number.

The offset calculation circuit 405 performs the following calculation to calculate the offset.

Offset = (read position × number of vector pipelines + vector pipeline number) × stride

The adder 406 calculates a memory address by adding the start address and the offset.
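The offset calculation and the addition can be sketched together in Python. The function name `write_address` and its parameter names are hypothetical; the arithmetic follows the offset formula and the adder 406 as described above.

```python
def write_address(start_address: int, stride: int, read_position: int,
                  pipeline: int, num_pipelines: int = 8) -> int:
    """Memory address of the element at `read_position` in one vector pipeline."""
    # Offset calculation circuit 405: the term in parentheses is the element
    # number, because elements are interleaved across the vector pipelines.
    offset = (read_position * num_pipelines + pipeline) * stride
    # Adder 406.
    return start_address + offset
```

For a start address of 0 and a stride of 8 B, read position 1 in vector pipeline 1 corresponds to element 9, so `write_address(0, 8, 1, 1)` is 72.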

Next, the vector store will be described with reference to FIGS.

The upper half of FIGS. 7 to 9 shows the arrangement of 128 elements (element 0 to element 127) in the eight vector pipelines (vector pipeline 0 to vector pipeline 7), more specifically, the arrangement of element 0 to element 127 in the vector registers composed of the banks in the eight vector pipelines. Note that the vector data is restored when element 0 to element 127 are arranged in the order of the element numbers.

The lower half of FIGS. 7 to 9 shows the elements output from the crossbar switch 111 to the eight data storage units (data storage unit 0 to data storage unit 7) in the cache memory at every clock. In the lower half of FIGS. 7 to 9, a bold frame 504 indicates a block boundary of the cache memory.

First, a vector store in a situation where the start address is “0” and the stride is 8B will be described with reference to FIG.

In FIG. 7, a vertical line 501 indicates a boundary between cache memory blocks. Since the block size of the cache memory is 64B and the stride is 8B, each block in each data storage unit stores eight elements having consecutive memory addresses, that is, eight elements having consecutive element numbers. For example, element 0 to element 7 are stored in the same block in the cache memory 105.

According to Table 1, when the stride is 8B and the lower 6 to 4 bits of the start address are “000”, the read start position is not corrected, so each read start position determination circuit 205 outputs the vector pipeline number.

In Table 2, when the stride is 8B, the size of the processing group is “8”, so the processing group size determination circuit 204 outputs “8”.

Therefore, the elements 0 to 127 are divided into a processing group composed of the elements 0 to 63 and a processing group composed of the elements 64 to 127.
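This division can be sketched in Python as follows. The function name `processing_groups` and the parameter `num_register_banks` are hypothetical; the sketch simply cuts the element numbers, in arrangement order, into runs of processing group size × number of register banks.

```python
from typing import List


def processing_groups(num_elements: int, group_size: int,
                      num_register_banks: int) -> List[range]:
    """Split element numbers 0..num_elements-1 into consecutive processing groups."""
    per_group = group_size * num_register_banks  # 8 x 8 = 64 in this example
    return [range(first, min(first + per_group, num_elements))
            for first in range(0, num_elements, per_group)]
```

With 128 elements, a processing group size of 8, and eight register banks, this yields two groups covering element 0 to element 63 and element 64 to element 127.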

In the upper half of FIG. 7, the elements 502 (element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 63) and the elements 503 (element 64, element 73, element 82, element 91, element 100, element 109, element 118, and element 127), which are drawn as bold squares, indicate the read start elements of each processing group.

When the first clock is input, the following operations are executed.

The read position control unit 114-0 outputs “0” as the read position. The read position control unit 114-1 outputs “1” as the read position. The read position control unit 114-2 outputs “2” as the read position. The read position control unit 114-3 outputs “3” as the read position. The read position control unit 114-4 outputs “4” as the read position. The read position control unit 114-5 outputs “5” as the read position. The read position control unit 114-6 outputs “6” as the read position. The read position control unit 114-7 outputs “7” as the read position.

Therefore, the element 0 is read from the bank 113A-0, and the element 0 is input to the crossbar switch 111. Element 9 is read from bank 113A-1, and element 9 is input to crossbar switch 111. The element 18 is read from the bank 113A-2, and the element 18 is input to the crossbar switch 111. The element 27 is read from the bank 113A-3, and the element 27 is input to the crossbar switch 111. The element 36 is read from the bank 113A-4, and the element 36 is input to the crossbar switch 111. The element 45 is read from the bank 113A-5, and the element 45 is input to the crossbar switch 111. The element 54 is read from the bank 113A-6, and the element 54 is input to the crossbar switch 111. The element 63 is read from the bank 113A-7, and the element 63 is input to the crossbar switch 111.

The crossbar switch 111 stores element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 63 in the data storage unit 107-0, the data storage unit 107-1, the data storage unit 107-2, the data storage unit 107-3, the data storage unit 107-4, the data storage unit 107-5, the data storage unit 107-6, and the data storage unit 107-7 in the cache memory 105, respectively. For this reason, output conflict does not occur in the crossbar switch 111.

Subsequently, when the second clock is input, the following operation is executed according to the second clock.

The read position control unit 114-0 outputs “7” as the read position. The read position control unit 114-1 outputs “0” as the read position. The read position control unit 114-2 outputs “1” as the read position. The read position control unit 114-3 outputs “2” as the read position. The read position control unit 114-4 outputs “3” as the read position. The read position control unit 114-5 outputs “4” as the read position. The read position control unit 114-6 outputs “5” as the read position. The read position control unit 114-7 outputs “6” as the read position.

Therefore, the element 56 is read from the bank 113A-0, and the element 56 is input to the crossbar switch 111. Element 1 is read from bank 113A-1, and element 1 is input to crossbar switch 111. The element 10 is read from the bank 113A-2, and the element 10 is input to the crossbar switch 111. The element 19 is read from the bank 113A-3, and the element 19 is input to the crossbar switch 111. The element 28 is read from the bank 113A-4, and the element 28 is input to the crossbar switch 111. The element 37 is read from the bank 113A-5, and the element 37 is input to the crossbar switch 111. The element 46 is read from the bank 113A-6, and the element 46 is input to the crossbar switch 111. The element 55 is read from the bank 113A-7, and the element 55 is input to the crossbar switch 111.

The crossbar switch 111 routes element 56, element 1, element 10, element 19, element 28, element 37, element 46, and element 55 to the data storage unit 107-7, the data storage unit 107-0, the data storage unit 107-1, the data storage unit 107-2, the data storage unit 107-3, the data storage unit 107-4, the data storage unit 107-5, and the data storage unit 107-6 in the cache memory 105, respectively. Because each element goes to a different data storage unit, no output conflict occurs in the crossbar switch 111.

Thereafter, when each of the third to eighth clocks is input, the elements are read from each vector pipeline as shown in FIG. 7.

Next, the operation when the ninth clock is input will be described. When the ninth clock is input, the processing group is switched.

When the 9th clock is input, the following operations are executed.

In each vector pipeline 109, the control circuit 217 adds “8”, which is the size of the processing group, to the value of the read position base register 208.

Therefore, from the 9th clock to the 16th clock, each of the read position control units 114 outputs “8” to “15” as read positions in a predetermined order. In the upper half of FIG. 7, an element 503 that is a bold square is an element that is read out by the ninth clock.
In this way, since elements read from each vector pipeline in accordance with one clock are stored in different data storage units, output contention in the crossbar switch 111 does not occur. Therefore, the performance of the vector store in the crossbar switch 111 does not deteriorate.
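
The clock-by-clock behaviour described above can be checked with a small simulation. The following is a minimal sketch under stated assumptions, not the circuit of the embodiment: it assumes that element i is held in bank i mod 8 at position i div 8 (consistent with element 56 being read from position 7 of the bank 113A-0), that a block is 64 bytes (inferred from eight elements at an 8B stride sharing one block), and that block b belongs to data storage unit b mod 8. The function names are hypothetical.

```python
N = 8        # vector pipelines, register banks and data storage units
BLOCK = 64   # ASSUMED block size in bytes (8 elements at an 8B stride)

def read_position(pipeline, clock, group_size=8):
    """Read position output by one pipeline on a 0-based clock.

    Each pipeline starts at its own pipeline number and steps down by one
    per clock, wrapping inside the current processing group; the base
    advances by group_size each time the processing group switches.
    """
    base = (clock // group_size) * group_size
    return base + (pipeline - clock) % group_size

def clock_transfer(clock, start=0, stride=8):
    """Return [(element, data storage unit)] for pipelines 0..7 on one clock."""
    out = []
    for p in range(N):
        element = read_position(p, clock) * N + p   # ASSUMED interleaved layout
        address = start + element * stride
        out.append((element, (address // BLOCK) % N))
    return out

first = clock_transfer(0)    # first clock in the text
second = clock_transfer(1)   # second clock
ninth = clock_transfer(8)    # ninth clock, first clock of the next group
for name, row in (("clock 1", first), ("clock 2", second), ("clock 9", ninth)):
    print(name, row)
```

Under these assumptions the first two rows reproduce the elements and data storage units listed above for the first and second clocks, and every row contains eight different data storage units.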

Further, as shown in the lower half of FIG. 7, the elements stored in the same block of a data storage unit 107 in the cache memory 105 arrive consecutively at the data storage unit 107 that holds the block. Therefore, each data storage unit 107 needs to search the address array only for the element that arrives first at the block among the elements stored in that block, and can thus perform writes efficiently.
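
The saving can be made concrete by counting how often the block changes in the stream of elements arriving at each data storage unit. The sketch below is illustrative only. It assumes one address-array search per change of block, a 64-byte block, element i held in bank i mod 8 at position i div 8, and block b mapped to data storage unit b mod 8; the helper names `arrivals` and `searches` are hypothetical.

```python
N, BLOCK = 8, 64   # units/banks and ASSUMED block size in bytes

def arrivals(start=0, stride=8, elements=256, group_size=8):
    """Block numbers arriving at each data storage unit, in clock order."""
    per_unit = {u: [] for u in range(N)}
    for clock in range(elements // N):
        base = (clock // group_size) * group_size
        for p in range(N):
            pos = base + (p - clock) % group_size       # read position of pipeline p
            block = (start + (pos * N + p) * stride) // BLOCK
            per_unit[block % N].append(block)
    return per_unit

def searches(blocks):
    """Count address-array searches: one per run of identical block numbers."""
    return sum(1 for i, b in enumerate(blocks) if i == 0 or b != blocks[i - 1])

per_unit = arrivals()
for unit, blocks in per_unit.items():
    print(unit, len(blocks), "writes,", searches(blocks), "searches")
```

Under these assumptions, for 256 elements with a start address of 0 and an 8B stride, each data storage unit receives 32 writes but sees only 4 changes of block.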

Next, the vector store in the situation where the start address is “8” and the stride is 8B will be described with reference to FIG. 8.

In FIG. 8, a vertical line 601 indicates a boundary between cache memory blocks. For example, elements 7 to 14 are stored in the same block in the cache memory 105.

In Table 1, when the stride is 8B and the lower 6 to 4 bits of the start address are “001”, the read start position is corrected only in the vector pipeline 109-7.

In the vector pipelines 109-0 to 109-6, the vector pipeline number is output from the read start position determining circuit 205. On the other hand, in the vector pipeline 109-7, “6” is output from the read start position determination circuit 205.

In the upper half of FIG. 8, the elements 602 (element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 55) and the elements 603 (element 64, element 73, element 82, element 91, element 100, element 109, element 118, and element 119), which are drawn as bold squares, indicate the read start elements of the respective processing groups.

When the first clock is input, the following operations are executed.

The read position control units 114-0 to 114-7 output “0”, “1”, “2”, “3”, “4”, “5”, “6”, and “6”, respectively, as the read position.

Therefore, the element 0 is read from the bank 113A-0, and the element 0 is input to the crossbar switch 111. Element 9 is read from bank 113A-1, and element 9 is input to crossbar switch 111. The element 18 is read from the bank 113A-2, and the element 18 is input to the crossbar switch 111. The element 27 is read from the bank 113A-3, and the element 27 is input to the crossbar switch 111. The element 36 is read from the bank 113A-4, and the element 36 is input to the crossbar switch 111. The element 45 is read from the bank 113A-5, and the element 45 is input to the crossbar switch 111. The element 54 is read from the bank 113A-6, and the element 54 is input to the crossbar switch 111. The element 55 is read from the bank 113A-7, and the element 55 is input to the crossbar switch 111.

The crossbar switch 111 routes element 0, element 9, element 18, element 27, element 36, element 45, element 54, and element 55 to the data storage unit 107-0, the data storage unit 107-1, the data storage unit 107-2, the data storage unit 107-3, the data storage unit 107-4, the data storage unit 107-5, the data storage unit 107-6, and the data storage unit 107-7 in the cache memory 105, respectively. Because each element goes to a different data storage unit, no output conflict occurs in the crossbar switch 111.

Subsequently, when the second clock is input, the following operations are executed.

The read position control units 114-0 to 114-7 output “7”, “0”, “1”, “2”, “3”, “4”, “5”, and “5”, respectively, as the read position.

Therefore, the element 56 is read from the bank 113A-0, and the element 56 is input to the crossbar switch 111. Element 1 is read from bank 113A-1, and element 1 is input to crossbar switch 111. The element 10 is read from the bank 113A-2, and the element 10 is input to the crossbar switch 111. The element 19 is read from the bank 113A-3, and the element 19 is input to the crossbar switch 111. The element 28 is read from the bank 113A-4, and the element 28 is input to the crossbar switch 111. The element 37 is read from the bank 113A-5, and the element 37 is input to the crossbar switch 111. The element 46 is read from the bank 113A-6, and the element 46 is input to the crossbar switch 111. The element 47 is read from the bank 113A-7, and the element 47 is input to the crossbar switch 111.

The crossbar switch 111 routes element 56, element 1, element 10, element 19, element 28, element 37, element 46, and element 47 to the data storage unit 107-7, the data storage unit 107-0, the data storage unit 107-1, the data storage unit 107-2, the data storage unit 107-3, the data storage unit 107-4, the data storage unit 107-5, and the data storage unit 107-6 in the cache memory 105, respectively. Because each element goes to a different data storage unit, no output conflict occurs in the crossbar switch 111.

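When the 9th clock is input, the processing group is switched and the following operations are executed.

In each vector pipeline 109, the control circuit 217 adds “8”, which is the size of the processing group, to the value of the read position base register 208.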

Therefore, from the 9th clock to the 16th clock, each of the read position control units 114 outputs “8” to “15” as read positions in a predetermined order. In the upper half of FIG. 8, a bold square element 603 is an element that is read by the ninth clock.
In this way, since elements read from each vector pipeline in accordance with one clock are stored in different data storage units, output contention in the crossbar switch 111 does not occur. Therefore, the performance of the vector store in the crossbar switch 111 does not deteriorate.

Further, as shown in the lower half of FIG. 8, the elements stored in the same block of a data storage unit 107 in the cache memory 105 arrive consecutively at the data storage unit 107 that holds the block. Therefore, each data storage unit 107 needs to search the address array only for the element that arrives first at the block among the elements stored in that block, and can thus perform writes efficiently.
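
The effect of correcting the read start position of the vector pipeline 109-7 can be reproduced in a few lines. This is a sketch under assumptions rather than the circuit itself: the start positions are taken from the description above (pipeline number for 109-0 to 109-6, “6” for 109-7), while the 64-byte block, the layout that places element i in bank i mod 8 at position i div 8, and the mapping of block b to data storage unit b mod 8 are assumptions. `START_POS` and `transfer` are hypothetical names.

```python
N, BLOCK = 8, 64                        # ASSUMED block size in bytes
START, STRIDE = 8, 8                    # start address "8", stride 8B
START_POS = [0, 1, 2, 3, 4, 5, 6, 6]    # only pipeline 7 is corrected (to 6)

def transfer(clock, group_size=8):
    """Return [(element, data storage unit)] for pipelines 0..7 on a 0-based clock."""
    base = (clock // group_size) * group_size       # +8 when the group switches
    row = []
    for p in range(N):
        pos = base + (START_POS[p] - clock) % group_size
        element = pos * N + p                       # ASSUMED interleaved layout
        block = (START + element * STRIDE) // BLOCK
        row.append((element, block % N))
    return row

rows = [transfer(c) for c in range(16)]             # two processing groups
conflict_free = all(len({u for _, u in row}) == N for row in rows)
print(rows[0])
print(rows[1])
print(conflict_free)
```

With these assumptions the first two rows match the elements and data storage units listed for the first and second clocks, the ninth row starts from elements 64, 73, 82, 91, 100, 109, 118 and 119, and no row sends two elements to the same data storage unit.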

Next, a vector store in a situation where the start address is “0” and the stride is 16B will be described with reference to FIG. 9.

In FIG. 9, two bold vertical lines 701 and 702 indicate the boundaries of the cache memory blocks. For example, elements 4 to 7 are stored in the same block in the cache memory 105.

According to Table 1, when the stride is 16B and the lower 6 to 4 bits of the start address are “000”, the read start position is not corrected, so the read start position determining circuit 205 outputs the vector pipeline number.

In Table 2, since the size of the processing group is “4” when the stride is 16B, the processing group size determination circuit 204 outputs “4”.

Therefore, element 0 to element 127 are divided into four processing groups: a processing group composed of element 0 to element 31, a processing group composed of element 32 to element 63, a processing group composed of element 64 to element 95, and a processing group composed of element 96 to element 127.
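
The group boundaries follow directly from the processing size. As a minimal sketch, assuming a 64-byte block (inferred from the processing sizes of 8 and 4 for strides of 8B and 16B) and eight banks, the hypothetical helper below divides the block size by the stride and multiplies the result by the number of banks:

```python
def processing_groups(stride, elements, block_size=64, banks=8):
    """Return (processing size, [(first element, last element), ...])."""
    size = block_size // stride        # processing size: elements per block
    per_group = size * banks           # elements in one processing group
    bounds = [(s, min(s + per_group, elements) - 1)
              for s in range(0, elements, per_group)]
    return size, bounds

print(processing_groups(16, 128))   # stride 16B
print(processing_groups(8, 128))    # stride 8B
```

For a 16B stride this gives a processing size of 4 and the four groups 0 to 31, 32 to 63, 64 to 95 and 96 to 127; for an 8B stride it gives a processing size of 8 and the two groups 0 to 63 and 64 to 127.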

In the upper half of FIG. 9, the elements 703 (element 0, element 9, element 18, element 27, element 4, element 13, element 22, and element 31), the elements 704 (element 32, element 41, element 50, element 59, element 36, element 45, element 54, and element 63), the elements 705 (element 64, element 73, element 82, element 91, element 68, element 77, element 86, and element 95), and the elements 706 (element 96, element 105, element 114, element 123, element 100, element 109, element 118, and element 127), which are drawn as bold squares, indicate the read start elements of the respective processing groups.

When the first clock is input, the following operations are executed.

The read position control units 114-0 to 114-7 output “0”, “1”, “2”, “3”, “0”, “1”, “2”, and “3”, respectively, as the read position.

Therefore, the element 0 is read from the bank 113A-0, and the element 0 is input to the crossbar switch 111. Element 9 is read from bank 113A-1, and element 9 is input to crossbar switch 111. The element 18 is read from the bank 113A-2, and the element 18 is input to the crossbar switch 111. The element 27 is read from the bank 113A-3, and the element 27 is input to the crossbar switch 111. Element 4 is read from bank 113A-4, and element 4 is input to crossbar switch 111. Element 13 is read from bank 113A-5, and element 13 is input to crossbar switch 111. The element 22 is read from the bank 113A-6, and the element 22 is input to the crossbar switch 111. The element 31 is read from the bank 113A-7, and the element 31 is input to the crossbar switch 111.

The crossbar switch 111 routes element 0, element 9, element 18, element 27, element 4, element 13, element 22, and element 31 to the data storage unit 107-0, the data storage unit 107-2, the data storage unit 107-4, the data storage unit 107-6, the data storage unit 107-1, the data storage unit 107-3, the data storage unit 107-5, and the data storage unit 107-7 in the cache memory 105, respectively. Because each element goes to a different data storage unit, no output conflict occurs in the crossbar switch 111.

Subsequently, when the second clock is input, the following operations are executed.

The read position control units 114-0 to 114-7 output “3”, “0”, “1”, “2”, “3”, “0”, “1”, and “2”, respectively, as the read position.

Therefore, the element 24 is read from the bank 113A-0, and the element 24 is input to the crossbar switch 111. Element 1 is read from bank 113A-1, and element 1 is input to crossbar switch 111. The element 10 is read from the bank 113A-2, and the element 10 is input to the crossbar switch 111. The element 19 is read from the bank 113A-3, and the element 19 is input to the crossbar switch 111. The element 28 is read from the bank 113A-4, and the element 28 is input to the crossbar switch 111. Element 5 is read from bank 113A-5, and element 5 is input to crossbar switch 111. The element 14 is read from the bank 113A-6, and the element 14 is input to the crossbar switch 111. The element 23 is read from the bank 113A-7, and the element 23 is input to the crossbar switch 111.

The crossbar switch 111 routes element 24, element 1, element 10, element 19, element 28, element 5, element 14, and element 23 to the data storage unit 107-6, the data storage unit 107-0, the data storage unit 107-2, the data storage unit 107-4, the data storage unit 107-7, the data storage unit 107-1, the data storage unit 107-3, and the data storage unit 107-5 in the cache memory 105, respectively. Because each element goes to a different data storage unit, no output conflict occurs in the crossbar switch 111.

Thereafter, when each of the third and fourth clocks is input, the elements are read from each vector pipeline as shown in FIG. 9.

Next, the operation when the fifth clock is input will be described. When the fifth clock is input, the processing group is switched.

When the fifth clock is input, the following operations are performed.

In each vector pipeline 109, the control circuit 217 adds “4”, which is the size of the processing group, to the value of the read position base register 208.

Accordingly, from the fifth clock to the eighth clock, each of the read position control units 114 outputs “4” to “7” as read positions in a predetermined order. In the upper half of FIG. 9, a bold square element 704 is an element read by the fifth clock.
In this way, since elements read from each vector pipeline in accordance with one clock are stored in different data storage units, output contention in the crossbar switch 111 does not occur. Therefore, the performance of the vector store in the crossbar switch 111 does not deteriorate.

Further, as shown in the lower half of FIG. 9, the elements stored in the same block of a data storage unit 107 in the cache memory 105 arrive consecutively at the data storage unit 107 that holds the block. Therefore, each data storage unit 107 needs to search the address array only for the element that arrives first at the block among the elements stored in that block, and can thus perform writes efficiently.
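
The 16B-stride case differs from the 8B-stride cases only in the processing size, which shortens the wrap-around of the read position to four. The sketch below is an illustration under assumptions: the start position of pipeline p is taken as p mod 4, which is inferred from the read positions listed for the first clock, and the 64-byte block, the interleaved register layout and the mapping of block b to data storage unit b mod 8 are likewise assumed. The name `transfer` is hypothetical.

```python
N, BLOCK, STRIDE = 8, 64, 16
SIZE = BLOCK // STRIDE                  # processing size 4 for a 16B stride

def transfer(clock, start=0):
    """Return [(element, data storage unit)] for pipelines 0..7 on a 0-based clock."""
    base = (clock // SIZE) * SIZE       # read position base, +4 per group switch
    row = []
    for p in range(N):
        pos = base + (p % SIZE - clock) % SIZE      # ASSUMED start position p mod 4
        element = pos * N + p                       # ASSUMED interleaved layout
        row.append((element, ((start + element * STRIDE) // BLOCK) % N))
    return row

rows = [transfer(c) for c in range(8)]              # two processing groups
unit0 = [e for row in rows for e, u in row if u == 0]
print([e for e, _ in rows[0]])   # read start elements of the first group
print([e for e, _ in rows[4]])   # read start elements of the second group
print(unit0)                     # arrival order at data storage unit 107-0
```

Under these assumptions, data storage unit 107-0 receives elements 0, 1, 2, 3 and then 32, 33, 34, 35, so the elements of one block arrive back to back and the block changes only when the processing group switches.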

Next, the effect of this embodiment will be described.

The first effect is that control is performed so that write data (elements) written to the same block of the cache memory continuously arrives at a bank (memory bank, data storage unit) in the cache memory having the block. Thus, the time and power consumption for searching the address array in the cache memory can be reduced.

The second effect is that, even when the address of element 0 is not the head address of a block, contention in the crossbar switch can be avoided, and an increase in the processing time of the vector store can therefore be avoided. Since contention does not occur, the write data (elements) written to the same block of the cache memory can be controlled so as to arrive consecutively at the bank (memory bank, data storage unit) in the cache memory that holds the block. Therefore, the time and power consumption for searching the address array in the cache memory can be reduced.

According to this embodiment, the processing unit 114A divides 256 elements into processing groups based on the size of the processing group.

The processing unit 114A performs the following operation when it is determined that the start address matches the head address of any block in the cache memory 105.

Each time the processing unit 114A divides the 256 elements into a processing group, it reads the elements in parallel from the eight banks 113A and transfers them to the cache memory 105 such that one element is output in parallel from each of the block data obtained by dividing the elements in the processing group, in the arrangement order, into units of the processing group size, and such that all the elements in the processing group are output.

Therefore, when the start address matches the head address of any block in the cache memory 105, the write data (elements) written to the same block in the cache memory 105 can arrive consecutively at each bank in the cache memory 105. This reduces the time and power consumption required for the cache memory to search the address array.

Further, the processing unit 114A performs the following operation when the start address does not match the head address in any of the memory blocks in the cache memory 105.

Each time the processing unit 114A divides the 256 elements into a processing group, it assigns an address to each element in the processing group based on the start address and the stride. With the elements in the processing group thereby divided into nine block data, each corresponding to one of the blocks in the cache memory 105, the processing unit 114A first outputs the elements in parallel, one at a time, from each of the eight block data other than the last block data, which has the largest address among the nine block data. Once all the elements in the first block data, which has the smallest address among the nine block data, have been output, the processing unit 114A outputs the not-yet-output elements in parallel, one at a time, from each of the eight block data other than the first block data. In this way, the elements are read in parallel from the eight banks 113A and transferred to the cache memory 105 so that all the elements in the processing group are output.

Therefore, even when the start address does not match the head address of any block in the cache memory 105, the write data (elements) written to the same block of the cache memory 105 can arrive consecutively at each bank in the cache memory 105. This reduces the time and power consumption required for the cache memory to search the address array.
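
The two-phase schedule for the unaligned case can also be written from the point of view of the block data rather than the read positions. The following sketch is hedged accordingly: the split into n + 1 block data and the switch from "all but the last" to "all but the first" follow the description above, but the rule used to choose which element of a block data is output on a given clock (the element held in bank (block + clock) mod n) is an assumption chosen because it reproduces the first and second clocks described for a start address of “8” and a stride of 8B. It is only meant for the 8B-stride case, and the 64-byte block and the function name are assumptions as well.

```python
def block_data_schedule(start, stride=8, group_start=0, block_size=64, n=8):
    """Return, per clock, the elements output (ordered by block address).

    Sketch for the 8B-stride case only: the bank-selection rule below is an
    ASSUMPTION, not taken from the description.
    """
    size = block_size // stride
    data = {}                                    # block number -> unread elements
    for i in range(group_start, group_start + size * n):
        data.setdefault((start + i * stride) // block_size, []).append(i)
    blocks = sorted(data)                        # n + 1 block data when unaligned
    clocks = []
    while any(data[b] for b in blocks):
        # Phase 1: every block data but the last. Phase 2, entered once the
        # first block data is drained: every block data but the first.
        active = blocks[:n] if data[blocks[0]] else blocks[1:]
        row = []
        for b in active:
            want = (b + len(clocks)) % n         # ASSUMED register bank to read
            pick = next(i for i in data[b] if i % n == want)
            data[b].remove(pick)
            row.append(pick)
        clocks.append(row)
    return clocks

rows = block_data_schedule(start=8)
for clock, row in enumerate(rows, 1):
    print(clock, row)
```

In this run the first block data (elements 0 to 6) is drained after seven clocks, and on the eighth clock the last block data supplies element 63, so the 64 elements of the processing group are output in eight clocks with one element per register bank on every clock.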

The present invention has been described above with reference to the embodiments, but the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2009-213262 filed on September 15, 2009, the entire disclosure of which is incorporated herein.

101 CPU
102 Instruction issue control unit
103 Vector processing unit
104 Address calculation unit
105 Cache memory
106 Main storage device
107 Data storage unit
109 Vector pipeline
110 Vector store control unit
111 Crossbar switch
112 Vector computing unit
113 Vector register
113A Bank
114 Read position control unit
114A Processing unit
201 Start address register
202 Stride register
203 Read start position correction calculation circuit
204 Processing group size determination circuit
205 Read start position determination circuit
206 Pipeline number register
207 Read position offset register
208 Read position base register
209 Decrementer
21 Modulo arithmetic unit
211 Multiplexer
212 Adder
213 Multiplexer
214 Adder
215 Decrementer
216 Counter
217 Control circuit
218 Valid read position register
219 Comparator
401 Start address register
402 Stride register
403 Address calculation circuit
404 Pipeline number register
405 Offset calculation circuit
406 Adder

Claims

An information processing apparatus for transferring m data elements having a determined arrangement order to a cache memory having n (n is an integer of 2 or more) memory banks, each having memory blocks of a predetermined size associated with different addresses, the information processing apparatus comprising:
storage means having n register banks, in which the m data elements are interleaved in units of the data elements in the arrangement order and stored in the n register banks in order;
control means for, upon receiving stride information indicating the stride between data elements and a start address that is the address of the first data element in the arrangement order, calculating a processing size by dividing the predetermined size by the stride, and determining whether the start address matches the head address of any of the memory blocks; and
processing means for dividing the m data elements into processing groups, each consisting of data elements that are consecutive in the arrangement order and whose number equals the processing size multiplied by n, and for, when it is determined that the start address matches the head address of any of the memory blocks, reading the data elements in parallel from the n register banks and transferring them to the cache memory each time the m data elements are divided into a processing group, such that one data element is read in parallel from each of the block data obtained by dividing the data elements in the processing group, in the arrangement order, into units of the processing size, and such that all the data elements in the processing group are read.
The information processing apparatus according to claim 1, wherein, when the start address does not match the head address of any of the memory blocks, the processing means, each time the m data elements are divided into a processing group, assigns an address to each data element in the processing group based on the start address and the stride information, and reads the data elements in parallel from the n register banks and transfers them to the cache memory such that, with the data elements in the processing group divided based on the addresses into n + 1 block data each corresponding to one of the memory blocks, the data elements are first read in parallel, one at a time, from each of the n block data other than the last block data having the largest address among the n + 1 block data, and, once all the data elements in the head block data having the smallest address among the n + 1 block data have been read, the unread data elements are read in parallel, one at a time, from each of the n block data other than the head block data among the n + 1 block data, so that all the data elements in the processing group are read.
A data transfer method in an information processing apparatus for transferring m data elements having a determined arrangement order to a cache memory having n (n is an integer of 2 or more) memory banks, each having memory blocks of a predetermined size associated with different addresses, the data transfer method comprising:
interleaving the m data elements in units of the data elements in the arrangement order and storing them in n register banks in order;
upon receiving stride information indicating the stride between data elements and a start address that is the address of the first data element in the arrangement order, calculating a processing size by dividing the predetermined size by the stride, and determining whether the start address matches the head address of any of the memory blocks; and
dividing the m data elements into processing groups, each consisting of data elements that are consecutive in the arrangement order and whose number equals the processing size multiplied by n, and, when it is determined that the start address matches the head address of any of the memory blocks, reading the data elements in parallel from the n register banks and transferring them to the cache memory each time the m data elements are divided into a processing group, such that one data element is read in parallel from each of the block data obtained by dividing the data elements in the processing group, in the arrangement order, into units of the processing size, and such that all the data elements in the processing group are read.
The data transfer method according to claim 3, wherein, when the start address does not match the head address of any of the memory blocks, each time the m data elements are divided into a processing group, an address is assigned to each data element in the processing group based on the start address and the stride information, and the data elements are read in parallel from the n register banks and transferred to the cache memory such that, with the data elements in the processing group divided based on the addresses into n + 1 block data each corresponding to one of the memory blocks, the data elements are first read in parallel, one at a time, from each of the n block data other than the last block data having the largest address among the n + 1 block data, and, once all the data elements in the head block data having the smallest address among the n + 1 block data have been read, the unread data elements are read in parallel, one at a time, from each of the n block data other than the head block data among the n + 1 block data, so that all the data elements in the processing group are read.