WO2013005343A1

WO2013005343A1 - Apparatus and method for a marker guided data transfer between a single memory and an array of memories with unevenly distributed data amount in an simd processor system

Info

Publication number: WO2013005343A1
Application number: PCT/JP2011/065739
Authority: WO
Inventors: Hanno Lieske
Original assignee: Renesas Electronics Corporation
Priority date: 2011-07-01
Filing date: 2011-07-01
Publication date: 2013-01-10
Also published as: TW201319932A; TWI512614B

Abstract

An end marker setting unit sets an end marker at the end of a data stream stored inside memory elements. When data is transferred from a processing element array to a single memory over a bus system and the end marker is detected for a certain processing element, a marker evaluation unit for write direction deletes the data transferred from that processing element in the following rows. When data is transferred from the single memory to the processing element array and the end marker is detected for a certain processing element, a marker evaluation unit for read direction inserts data for that processing element in the following row.

Description

DESCRIPTION

Title of Invention

APPARATUS AND METHOD FOR A MARKER GUIDED DATA TRANSFER BETWEEN A SINGLE MEMORY AND AN ARRAY OF MEMORIES WITH UNEVENLY DISTRIBUTED DATA AMOUNT IN AN SIMD PROCESSOR SYSTEM

Technical Field

[0001]

The present invention relates to a data transfer between a single memory and an array of memories in an SIMD processor system. More particularly, it relates to a fast data transfer with small implementation costs and low data transfer amount increase for an unevenly distributed amount of data in each memory of the memory array.

Background Art

[0002]

When processing e.g. a compression algorithm on the processing elements (PE) in an SIMD processor, the amount of data compressed in the memory of each PE of the PE array can be different.

For example, this situation can occur in the following case. Image data taken with a CCD camera or a CMOS sensor is processed by a plurality of PEs in parallel. As an example of image processing, image compression is executed by the PEs. Because the compression ratio can differ depending on the portion of the image data, the amount of compressed data in the memory of each PE of the PE array can differ.

[0003]

When the unevenly distributed data is transferred to a single memory over a bus system that the memories of the array can access only in parallel, the amount of data transferred to the single memory depends on the largest amount of data stored in any memory of the memory array, because that largest amount determines the number of transfers needed to move all necessary data between the memory array and the single memory.

[0004]

For unevenly distributed data in the memory array, there is a point in time at which some memories have already transferred all their compressed data while other memories still have further data to transfer.

Because of the SIMD-style data transfer, however, all memories are accessed at the same time, e.g., to read out the same amount of data, which is then transferred over a bus system to the single memory. A large amount of overhead data is therefore transferred, which, in the compression example, reduces the achievable compression factor.
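The overhead described above can be illustrated with a short calculation. The following is a minimal sketch, assuming that each memory contributes exactly one word per row; the function name and the fill levels are hypothetical and do not come from the source.

```python
def simd_transfer_words(words_per_pe):
    """Compare the words moved by a plain row-wise SIMD transfer with the words needed.

    words_per_pe: number of valid words held in each PE memory (hypothetical values).
    Returns (transferred, useful, overhead).
    """
    rows = max(words_per_pe)                 # the fullest memory sets the row count
    transferred = rows * len(words_per_pe)   # every memory is read once per row
    useful = sum(words_per_pe)               # words that actually carry data
    return transferred, useful, transferred - useful


# Hypothetical fill levels (in words) for eight PE memories
transferred, useful, overhead = simd_transfer_words([4, 2, 3, 4, 1, 2, 4, 3])
print(transferred, useful, overhead)
```

In this hypothetical case, 32 words are moved although only 23 carry data, so 9 words are pure overhead.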

[0005]

For the case of evenly distributed data in the memory array, there exists a solution to transfer the data between an internal memory array and a single external memory over a ring bus as described in NPL 1.

Fig. 18 shows the structure of the architecture used to explain the data transfer between an internal memory array and a single external memory in case of evenly distributed data in the memory array as presented in NPL 1.

[0006]

The architecture consists of an array of PEs with memory 14. The array is composed of PEs 11 and memory elements 12, which are grouped into groups of four "PE with memory element" units 13. Data is transferred between the internal memory array and a single external memory 18 over a bus system 15, which is a pipelined ring bus.

The registers 16 are arranged along the ring bus in such a way that, between two registers, either a group of PEs or the control unit 17 is connected to the bus 15.

[0007]

In NPL 1, for the write transfer from internal to external memory, the evenly distributed data in the memory elements of the internal memory array is accessed at the same time.

The data read from each memory element is then stored in the registers on the ring bus, from where it is successively transferred to the external memory.

For the read direction, the data is successively read element-wise from the external memory and stored in the registers on the ring bus, from where it is finally stored at the same time into the memory elements of the internal memory array.

Citation List

Patent Literature

[0008]

PTL 1: Japanese unexamined patent application publication No.H06-75929

PTL 2: Japanese unexamined patent application publication No.H05-94425

PTL 3: International Patent Publication No. WO2009/131007

Non Patent Literature

[0009]

NPL 1: S. Kyo, et al., "A Low-Cost Mixed-Mode Parallel Processor Architecture for Embedded Systems", Proceedings of the 21st Annual International Conference on Supercomputing, ICS'07, June 2007

Summary of Invention

Technical Problem

[0010]

While the described data transfer between the internal memory array and the external single memory is working for evenly distributed data without data storage overhead in the external memory, unevenly distributed data transfers would require such data storage overhead.

This is because the internal memory array can only be accessed line-wise, not element-wise, and a line consists of as many elements as there are memory elements inside the memory array. Using the line data transfer described in NPL 1 would therefore require storing line-wise data to the external memory until the required data from all the internal memory elements has been transferred.

[0011]

Here, PTL 1 (H06-75929) discloses a parallel processing device in which one processing element transmits its loads to other processing elements, thereby distributing loads between PEs. PTL 2 (H05-94425) discloses a task managing method that reduces the time required for load allocation. Furthermore, PTL 3 (WO2009/131007) discloses an SIMD parallel computer system which equalizes processing loads between PEs. However, even when the techniques disclosed in the above patent literature are employed, the above problem remains unsolved.

[0012]

The present invention has been made in view of the above-mentioned problems, and an object of the present invention is to reduce the amount of data which has to be stored to the single memory, compared to the line-wise transfer described above, for the case of unevenly distributed data stored inside the memory elements of the memory array.

Solution of Problem

[0013]

According to an aspect of the present invention, there is provided a data transfer apparatus comprising:

a processing element array that has multiple processing elements controlled in a Single Instruction Multiple Data style;

memory elements that are provided inside each of the processing elements, data access to all the memory elements of the processing elements being done in parallel;

a control unit controlling the processing element array in the Single Instruction Multiple Data style;

a data bus system connecting all of the processing elements with each other and with the control unit;

a single memory that exchanges data with the memory elements of the processing element array;

an end marker setting unit that is responsible to set an end marker at an end of a data stream stored inside the memory elements;

a marker evaluation unit for write direction; and

a marker evaluation unit for read direction,

wherein when transferring data from the processing element array to the single memory over the bus system, in the case that the end marker is detected for a certain processing element, the marker evaluation unit for write direction has the task to delete the data which is transferred from that processing element in the following rows, and

when transferring data from the single memory to the processing element array over the bus system, in the case that the end marker is detected for a certain processing element, the marker evaluation unit for read direction has the task to insert data for that processing element in the following row.

Advantageous Effects of Invention

[0014]

According to the present invention, unevenly distributed data inside the memory elements of the PE array can be transferred to the single memory quickly and efficiently, with small hardware implementation costs and a low increase in data transfer amount, because the end markers are set in advance and, in the case that the end marker is detected for a certain PE, the data from this PE in the following rows can be automatically skipped.

Brief Description of Drawings

[0015]

[Fig. 1] Fig. 1 shows the architecture of the SIMD processor.

[Fig. 2] Fig. 2 shows an end marker setting unit.

[Fig. 3] Fig. 3 shows an example of a state where the end markers have been added to the data stored inside each memory.

[Fig. 4A] Fig. 4A shows a situation where the memory is divided into sections of 4 bytes and 9 bytes data are stored inside the memory.

[Fig. 4B] Fig. 4B shows the end marker set at an unaligned position at the end of the data stream.

[Fig. 4C] Fig. 4C shows the end marker set at an aligned position at the end of the data stream.

[Fig. 5] Fig. 5 shows the marker evaluation unit for write direction of the marker evaluation apparatus.

[Fig. 6A] Fig. 6A shows the transition of flag values inside the flag register.

[Fig. 6B] Fig. 6B shows the transition of flag values inside the flag register.

[Fig. 7] Fig. 7 shows data that is stored in the PE array.

[Fig. 8] Fig. 8 shows the data after the end markers have been set inside each memory.

[Fig. 9] Fig. 9 shows a flowchart that is executed in the marker evaluation unit for write direction 420.

[Fig. 10] Fig. 10 shows a flowchart that is executed in the marker evaluation unit for write direction 420.

[Fig. 11] Fig. 11 shows the data 1001 in the external single memory after the transfer is finished.

[Fig. 12] Fig. 12 shows the marker evaluation unit for read direction of the marker evaluation apparatus.

[Fig. 13] Fig. 13 shows the operation of the selection switch.

[Fig. 14] Fig. 14 shows the stored data in the external memory and output data 1102 to the PE array.

[Fig. 15] Fig. 15 shows data that are transferred from the single external memory to the PE array.

[Fig. 16] Fig. 16 shows a possible system design in which the SIMD processor with the example architecture could operate.

[Fig. 17] Fig. 17 shows the case where an end marker setting unit 1302 is placed next to the marker evaluation apparatus 1301 into the control unit 1300.

[Fig. 18] Fig. 18 shows the structure of an architecture used to explain the data transfer between an internal memory array and a single external memory in the case of evenly distributed data in the memory array, as presented in NPL 1.

Description of Embodiments

[0016]

With reference to the accompanying drawings, exemplary embodiments of the present invention will be described.

[First embodiment]

As a first embodiment, transfer of unevenly distributed data from the memory array to a single external memory will be described.

Fig. 1 shows the architecture of the SIMD processor 100. The SIMD processor 100 in Fig. 1 has an array 200 of PEs 220. In the array 200, four PEs 220 compose one group 210 of PEs. In addition to a memory 230, each PE 220 has an end marker setting unit 240.

[0017]

Fig. 2 shows the end marker setting unit 240. The end marker setting unit 240 adds in each PE 220 an end marker at the end of the data stream which should be transferred from the memory 230 of the PE 220 to a single memory 500. The setting of the end marker can be either at an unaligned position (Fig. 4B) or an aligned position (Fig. 4C).

[0018]

Fig. 3 shows an example of a state where the end markers 600 have been added to the data 231 stored inside each memory 230. Here, we take as an example the situation where the memory is divided into sections of 4 bytes and 9 bytes of data are stored inside the memory, as shown in Fig. 4A. In this case, Fig. 4B shows the end marker set at an unaligned position at the end of the data stream, and Fig. 4C shows the end marker set at an aligned position at the end of the data stream.
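The two marker positions can be sketched as a byte offset calculation. This is a minimal sketch under an assumption that the source does not state explicitly: "unaligned" is taken to mean directly after the last data byte, and "aligned" is taken to mean the next section boundary. The function name `marker_offset` is hypothetical.

```python
def marker_offset(data_bytes, section_bytes=4, aligned=False):
    """Return the byte offset at which the end marker would start.

    Assumption (not stated in the source): an unaligned marker follows the
    last data byte directly, while an aligned marker starts at the next
    multiple of the section size.
    """
    if not aligned:
        return data_bytes
    sections = -(-data_bytes // section_bytes)   # ceiling division
    return sections * section_bytes


# Situation of Fig. 4A: 4-byte sections, 9 bytes of data
print(marker_offset(9, aligned=False))
print(marker_offset(9, aligned=True))
```

Under this assumption, the unaligned marker starts at byte offset 9 and the aligned marker at byte offset 12.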

[0019]

In the end marker setting unit 240, the selection whether the end marker or the data input is transferred to the data output is done using the data output selector 241. Data from the memory 230 is input to the end marker setting unit 240 sequentially.

The end marker setting unit 240 determines whether the input data from the memory 230 is the end data (the last data) or not.

When the input data is the end data, the data output selector 241 adds the end marker 600 to the end data. When the input data is not the end data, the data output selector 241 allows the input data to pass without any change.
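The behavior of the end marker setting unit 240 can be sketched as a generator that passes data words through and then emits the marker. This is a minimal sketch, assuming that "adds the end marker to the end data" means the marker follows the last data word; the marker value `END_MARKER` is hypothetical, since the source does not specify the end marker code.

```python
END_MARKER = 0xFFFFFFFF  # hypothetical 32-bit marker code, not given in the source


def end_marker_setting_unit(words):
    """Pass data words through unchanged, then emit the end marker.

    Models the data output selector 241: it selects the data input for every
    data word and selects the end marker once the end data has passed.
    """
    for word in words:
        yield word        # selector passes the data input unchanged
    yield END_MARKER      # selector switches to the end marker


stream = list(end_marker_setting_unit([0x11, 0x22, 0x33]))
print([hex(w) for w in stream])
```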

[0020]

Data is transferred between the PE array 200 and the single external memory 500 over a bus system 300 which is in this embodiment a pipelined ring bus.

Several registers (shift registers) 310 are arranged along the ring bus 300 in such a way that, between two registers 310, either a group of PEs 210 or the control unit 400 is connected to the ring bus 300. In this embodiment, the ring bus 300 has a capacity of 128 bits, and each line 250 that connects a PE 220 to the ring bus 300 has a capacity of 32 bits.

[0021]

A control unit 400 is provided between the ring bus 300 and the external memory 500. The control unit 400 has a marker evaluation apparatus 410, which in turn has a marker evaluation unit for write direction 420 and a marker evaluation unit for read direction 430.

Transferred data passes through either the marker evaluation unit for write direction 420 or the marker evaluation unit for read direction 430 inside the marker evaluation apparatus 410.

[0022]

Fig. 5 shows the marker evaluation unit for write direction 420 of the marker evaluation apparatus 410. Data transferred from the memories 230 via the ring bus 300 is taken into the control unit 400 and input to the marker evaluation unit for write direction 420. A comparator 421 is provided in the marker evaluation unit for write direction 420. Here, the comparator 421 has an inverter at its output terminal. In addition to the input data, an end marker code is input to the comparator 421.

[0023]

The data input is compared with the end marker code in the comparator 421. The result is stored in a flag register 422, which is provided at the stage following the comparator 421. The output of the flag register 422 controls a switch 423, which lets the input data pass to the output buffer 424 only if no end marker has previously been detected for that PE. If an end marker has already been detected for that PE, no data is allowed to pass.

[0024]

Fig. 6A and Fig. 6B show the transition of flag values inside the flag register 422. The flag register 422 stores a flag status for each PE 220, which indicates whether the end marker of that PE 220 has already passed. As shown in Fig. 6A, all flag values are initially "1".

As an example, when all data stored in PE6 has been sent and the end marker of PE6 reaches the comparator 421, the comparator outputs a low-level signal. As a result, the flag value for PE6 is changed to "0", as shown in Fig. 6B. While the flag for PE6 is "0", the switch 423 is open and does not pass data from PE6.
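The flag transition from Fig. 6A to Fig. 6B can be sketched as follows. This is a minimal sketch; the array size of eight PEs, the zero-based PE numbering, the marker value, and the function name `update_flag` are all assumptions made for illustration.

```python
NUM_PES = 8              # hypothetical array size
END_MARKER = 0xFFFFFFFF  # hypothetical marker code, not given in the source

flags = [1] * NUM_PES    # Fig. 6A: every flag starts at 1


def update_flag(flags, pe, word, end_marker):
    """Clear the flag of `pe` when the compared word equals the end marker code."""
    if word == end_marker:   # comparator match; the inverted output goes low
        flags[pe] = 0
    return flags


# The end marker of PE6 (zero-based index 6 assumed) reaches the comparator
update_flag(flags, 6, END_MARKER, END_MARKER)
print(flags)
```

After the call, only the entry for PE6 is 0, which corresponds to the state of Fig. 6B.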

[0025]

Data from the switch are stored in an output buffer 424 temporarily. Here, as an example, the output buffer 424 has a capacity of 128 bytes.

[0026]

Furthermore, a comparator 425 checks whether the output buffer 424 is full. In the case that the output buffer 424 is full, the data is sent to the single memory 500 by switching on a switch 426, and the buffer 424 is emptied by switching on a switch 427.

[0027]

Next, the operation of this SIMD processor 100 is described. As shown in Fig. 7, data is stored in the PE array 200, and this data is to be sent to the single external memory 500. In each PE 220, as shown in Fig. 8, the end marker setting unit 240 adds the end marker to the data stored in its own memory. The end marker is set either unaligned or aligned, as shown in Fig. 4B and Fig. 4C.

[0028]

Each PE 220 outputs the data stored in its own memory to the ring bus 300 sequentially. The data output from each PE 220 is transferred to the control unit 400, and the control unit 400 takes in the data (ST100). Every time the control unit 400 takes in data, the flowcharts of Fig. 9 and Fig. 10 are executed in the marker evaluation unit for write direction 420.

[0029]

The received data (ST100) is compared with the end marker code by the comparator 421 (ST110). The result is output to the flag register 422 and updates the flag for the PE 220 to which the data element belongs.

The flag value specifies whether the end marker has been transferred with this data element or not. When the input data equals the end marker code (ST110: YES), the flag value is changed to "0" (ST120). When the input data is not equal to the end marker code (ST110: NO), the flag value is kept at "1" and the next data is received (ST100).

[0030]

The information of flag value is read out of the flag register 422 (ST200) and the data is transferred to the output buffer 424 depending on the flag value. This selection is performed with the switch 423.

In the case that the flag value is "1" (ST210: NO), the data is transferred to the output buffer 424 (ST230).

In the case that the flag value is "0" (ST210: YES), the data is not transferred (ST220). In other words, in the case that an end marker is detected regarding certain PE 220, the data from this PE in the following rows is automatically skipped.

[0031]

For example, Fig. 11 shows the data 1001 in the external single memory 500 after the transfer is finished. The first data of each PE 220 is transferred to the external single memory 500 starting from the left side; then the following rows are transferred. The end marker of PE6 is detected in the second row; therefore, data from PE6 is skipped in the third row. Similarly, the end marker of PE3 is detected in the third row; therefore, data from PE3 is skipped in the fourth row.
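The row-wise write transfer can be sketched in software as follows. This is a minimal sketch, not the circuit of Fig. 5. It assumes that the marker word itself is forwarded and only words in later rows are dropped, which is consistent with the second embodiment, where the end markers are found in the external memory. The marker value, the four-PE example data, and the function name `write_transfer` are hypothetical.

```python
END_MARKER = 0xFFFFFFFF  # hypothetical marker code, not given in the source


def write_transfer(pe_memories, end_marker=END_MARKER):
    """Pack row-wise SIMD output into one stream, skipping PEs whose marker has passed.

    pe_memories: one list of words per PE, each already terminated by the marker.
    """
    rows = max(len(mem) for mem in pe_memories)
    flags = [1] * len(pe_memories)       # 1 = no end marker seen yet for this PE
    packed = []
    for row in range(rows):
        for pe, mem in enumerate(pe_memories):
            # Every memory is read in every row; past its end the word is a don't-care
            word = mem[row] if row < len(mem) else 0
            if flags[pe] == 1:
                packed.append(word)      # switch closed: word goes to the output
                if word == end_marker:
                    flags[pe] = 0        # drop this PE's words in the following rows
    return packed


E = END_MARKER
memories = [[10, 11, 12, E], [20, E], [30, 31, E], [40, 41, 42, E]]
packed = write_transfer(memories)
print(len(packed), "words stored instead of", 4 * len(memories))
```

For this hypothetical data, 13 words reach the single memory instead of the 16 words that a plain row-wise transfer would store.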

[0032]

As already described with reference to Fig. 4A, when one data unit is composed of 4 bytes (32 bits), skipping the processing of even one data unit saves many processing steps. Moreover, once end markers have been detected, the data from the PEs whose end markers have already been detected can be skipped in the following rows.

Therefore, the data transfer amount can be reduced substantially.

[0033]

The output buffer 424 is checked to determine whether all places are filled with elements (ST250). In the case that the output buffer 424 is full (ST250: YES), the data stored in the output buffer 424 is sent to the single external memory 500 and the content of the output buffer 424 is cleared.
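The buffer check of ST250 can be sketched as a small class. This is a minimal sketch; the class name, the use of a Python list as a stand-in for the single memory, and the capacity of four words in the example are assumptions. How a partially filled buffer is handled at the end of a transfer is not described in the source and is therefore not modeled.

```python
class OutputBuffer:
    """Collect words and flush them to the single memory when the buffer is full."""

    def __init__(self, capacity, single_memory):
        self.capacity = capacity            # number of words the buffer can hold
        self.single_memory = single_memory  # list standing in for the single memory
        self.slots = []

    def push(self, word):
        self.slots.append(word)
        if len(self.slots) == self.capacity:       # ST250: all places filled?
            self.single_memory.extend(self.slots)  # send the content to the single memory
            self.slots.clear()                     # empty the buffer


memory = []
buffer = OutputBuffer(capacity=4, single_memory=memory)
for word in range(6):
    buffer.push(word)
print(memory, buffer.slots)
```

After six pushes with a capacity of four, the first four words have been written to the memory and two words remain in the buffer.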

[0034]

In this embodiment, unevenly distributed data inside the memory elements of the PE array 200 can be transferred to the single memory quickly and efficiently, with small hardware implementation costs and a low increase in data transfer amount, because the end markers are set in advance and, in the case that the end marker is detected for a certain PE 220, the data from this PE in the following rows can be automatically skipped.

[0035]

[Second embodiment]

As a second embodiment, transfer of unevenly distributed data from the single external memory to the PE array will be described.

Fig. 12 shows the marker evaluation unit for read direction 430 of the marker evaluation apparatus 410. Data is transferred from the external memory 500 to the control unit 400.

Here, the data 1001 in the external single memory 500 already contains the end markers at the appropriate positions, because the data from the PEs was transferred in the manner described in the first embodiment.

[0036]

The data received by the control unit 400 is input to the marker evaluation unit for read direction 430. A comparator 431, a flag register 432, an output buffer 434, a comparator 435, a switch 436, and a switch 437 are fundamentally the same as the corresponding parts of the marker evaluation unit for write direction 420 of the first embodiment.

A selection switch 433 is provided in the marker evaluation unit for read direction 430. The output of the flag register 432 controls the selection switch 433. The selection switch 433 lets the input data pass to the output buffer 434 only if no end marker has previously been detected for that PE. If an end marker has been detected, zero data is passed to the output buffer 434 for that PE instead.

[0037]

Next, the operation of this SIMD processor is described.

The operation of the flag register 432 is fundamentally the same as the operation of the flag register 422 of the marker evaluation unit for write direction 420. Fig. 9 and its explanation also apply to the flag register 432.

[0038]

Fig. 13 shows the operation of the selection switch 433.

First, the flag value is read out of the flag register 432 (ST300), and the data input from the external memory 500 is transferred to the output buffer depending on the flag value. This selection is performed in the selection switch 433. In the case that the flag value is "1" (ST310: NO), the data is transferred to the output buffer 434 (ST330). In the case that the flag value is "0" (ST310: YES), zero data is transferred to the output buffer 434 instead (ST320).

[0039]

Fig. 14 shows the stored data 1101 in the external memory 500 and the output data 1102 to the PE array 200. Starting from the left side, the data elements are transferred sequentially to each memory of the PE array 200.

Then, the following rows are transferred. In the case that an end marker is detected, this end marker is the last data from the external memory 500 that is transferred to the PE array 200 for this PE.

[0040]

Afterwards, only filling zeros are transferred for this PE.

As shown in Fig. 14, the end marker for PE6 is detected in the second row; therefore, "zero data" is selected for PE6 in the third row. Data line 1102 is output to the ring bus 300, and each PE takes in its own data. As a result, the PE array receives the data shown in Fig. 15, where "Zero" is written out explicitly to help the reader understand the operation of this invention.
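The read direction can be sketched as the inverse operation: the packed stream is expanded back into full rows. This is a minimal sketch under an assumption that the source does not spell out: a word is consumed from the stream only for PEs whose flag is still "1", while PEs whose flag is "0" receive zero data without consuming a word. The marker value, the four-PE example stream, and the function name `read_transfer` are hypothetical.

```python
END_MARKER = 0xFFFFFFFF  # hypothetical marker code, not given in the source


def read_transfer(packed, num_pes, end_marker=END_MARKER, fill=0):
    """Expand a packed stream into full rows, inserting `fill` after each PE's marker."""
    flags = [1] * num_pes        # 1 = no end marker seen yet for this PE
    rows = []
    pos = 0
    while pos < len(packed) and any(flags):
        row = []
        for pe in range(num_pes):
            if flags[pe] == 1 and pos < len(packed):
                word = packed[pos]       # flag "1": pass the data from the single memory
                pos += 1
                row.append(word)
                if word == end_marker:
                    flags[pe] = 0        # the marker is the last word for this PE
            else:
                row.append(fill)         # flag "0": zero data is selected instead
        rows.append(row)
    return rows


E = END_MARKER
packed = [10, 20, 30, 40, 11, E, 31, 41, 12, E, 42, E, E]
rows = read_transfer(packed, num_pes=4)
for row in rows:
    print(row)
```

In this hypothetical stream, the second PE receives its end marker in the second row and zero data in the third and fourth rows, analogous to the case of PE6 in Fig. 14.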

[0041]

In this embodiment, data from the external memory 500 can be stored in an unevenly distributed form inside the memory units of the PE array 200 efficiently with small hardware implementation costs.

[0042]

(Modified embodiment)

This invention is not limited to the embodiment described above.

Fig. 16 shows a possible system design in which the SIMD processor 1202 with the example architecture could operate. Other units inside the system could be a central processing unit 1201 and a single memory element 1203, which are all connected over connections 1205 to a bus system 1204.

[0043]

Moreover, as an alternative to the implementation shown in Fig. 1, Fig. 17 shows the case where the end marker setting units are taken out of each PE and one end marker setting unit (global marker setting unit) 1302 is placed next to the marker evaluation apparatus 1301 in the control unit 1300. This unit is responsible for setting the end markers in all memory elements of the memory array at the request of the responsible processing elements.

[0044]

It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.

Industrial applicability

[0045]

The present invention can be applied to a method and an apparatus for image processing, and the image data can be acquired with a camera, with a laser probe, or over the Internet.

Reference Signs List

[0046]

100 the SIMD processor

200 PE array

210 one group of PEs

220 PE (processing element)

230 memory inside a PE

240 end marker setting unit

300 ring bus

310 shift register

400 control unit

410 marker evaluation apparatus

420 marker evaluation unit for write direction

430 marker evaluation unit for read direction

Claims

[Claim 1]

A data transfer apparatus comprising:

a processing element array that comprises multiple processing elements controlled in a Single Instruction Multiple Data style;

memory elements that are provided inside each of the processing elements, data access to all the memory elements of the processing elements being done in parallel;

a bus system connecting all of the processing elements with each other and with the control unit;

a marker evaluation unit for write direction; and

a marker evaluation unit for read direction,

[Claim 2]

The data transfer apparatus according to claim 1, wherein the marker setting unit is provided inside each of the processing elements.

[Claim 3]

The data transfer apparatus according to claim 1, wherein the marker setting unit is provided inside the control unit.

[Claim 4]

The data transfer apparatus according to any one of claims 1, 2 and 3, wherein the marker setting unit adds the end marker at an aligned position or an unaligned position.

[Claim 5]

The data transfer apparatus according to any one of claims 1 to 4, wherein the bus system is a ring bus.

[Claim 6]

The data transfer apparatus according to any one of claims 1 to 5, wherein the single memory is an external memory.

[Claim 7]

A data transfer method for transferring data between a processing element array that comprises multiple processing elements with own memory element and a single memory in parallel processing, the data transfer method comprising:

transferring data from the processing element array to the single memory over a bus system, and

transferring data from the single memory to the processing element array over the bus system,

wherein in a case of transferring data from the processing element array to the single memory over the bus system;

setting an end marker at an end of a data stream stored inside the memory elements;

transferring data from the processing element array to the single memory over a bus system;

detecting the end marker for a certain processing element; and

deleting the data which is transferred from that processing element in the following rows when the end marker is detected for the certain processing element, and,

in a case of transferring data from the single memory to the processing element array over the bus system;

transferring data from the single memory to the processing element array over the bus system;

detecting the end marker for a certain processing element; and

inserting data for that processing element in the following row when the end marker is detected for the certain processing element.