CN102012803A - Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Info

Publication number: CN102012803A
Application number: CN2010105594582A
Authority: CN
Inventors: 陈书明; 张凯; 陈海燕; 万江华; 彭元喜; 刘仲; 阳柳; 杨惠; 刘蓬侠; 胡春媚; 唐涛
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2010-11-25
Filing date: 2010-11-25
Publication date: 2011-04-13
Anticipated expiration: 2030-11-25
Also published as: CN102012803B

Abstract

The invention relates to a configurable matrix register unit supporting multi-width single instruction, multiple data (SIMD) and multi-granularity single instruction, multiple threads (SIMT) operation. The unit comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit; its width is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multithreading modes that the vector processor can handle. The unit is simple in principle and easy to operate, its block size and thread count can be configured flexibly, and it supports access to vector data in multi-width SIMD and multi-granularity SIMT modes at the same time.

Description

Configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT

Technical field

The present invention relates mainly to the design of vector registers in vector processors, and in particular to a matrix register for a vector processor whose block size and thread count are configurable, so that vector operation units operating in single instruction stream, multiple data stream (SIMD) and single instruction stream, multiple thread (SIMT) modes can access data at multiple widths and multiple granularities.

Background art

With the in-depth study of 4G wireless communication technology and video and image processing technology, vector processors have come into wide use. Rapidly evolving wireless communication protocols and video and image processing algorithms require a large amount of matrix computation, such as channel estimation, MIMO equalization and the DCT. The parallel granularity of the matrix operations differs from algorithm to algorithm, as does the size of the matrix blocks processed. Only by efficiently supporting matrix operations of these differing counts and block sizes can a vector processor adapt well to such data-intensive applications and meet their real-time processing requirements.

The core algorithms of wireless communication protocols and video and image processing usually exhibit data-level parallelism and thread-level parallelism at the same time. Vector processors aimed at such applications usually adopt a very long instruction word (VLIW), single instruction stream, multiple data stream (SIMD) architecture, and may also support single instruction stream, multiple thread (SIMT) techniques in order to obtain sufficient parallel computing capability. These two classes of algorithms also share the following characteristic: as the protocols evolve rapidly, the vector lengths handled by the algorithms keep changing, and so does the thread-level parallelism that can be exploited. In 3G wireless communication protocols, for example, protocol evolution keeps changing the number of antennas at base stations and handheld terminals, which in turn keeps changing the vector lengths in the channel equalization matrix. This means that both the width of the vector data the vector processing unit must handle and the number of threads processed simultaneously keep changing. These characteristics place strong demands on whether a vector processor can, at the architecture level, support multi-width SIMD processing and multi-granularity SIMT processing effectively enough. The present invention therefore proposes a matrix register whose block size and thread count are configurable, which can meet the demand for vector operations of different parallel granularities and block sizes.

The storage cell array of a matrix register generally consists of N*M storage cells (M and N being integers greater than 1), each typically 4, 8, 12, 16 or 32 bits wide. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or as M column vector registers CVR_0 to CVR_{M-1}, where N and M are generally powers of 2. Each row vector register contains M elements (storage cells) E_{i,0} to E_{i,M-1} (i = 0, 1, 2, ..., N-1), and each column vector register contains N elements E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., M-1). Row and column vectors are read and written under the control of the matrix register's read/write enable, read/write address and row/column select signals.

Existing research provides access to blocks of fixed size in such matrix registers. These techniques read or write one row vector or one column vector of the matrix at a time, and the vector length is fixed. When the vector to be processed is longer or shorter than this fixed length, several short vectors are usually combined into one long vector for parallel processing, or a long vector is split into several short vectors that are processed step by step. Such schemes cannot flexibly handle matrix data of different sizes, do not support multi-width SIMD processing, and do not support simultaneous access to several matrices in a multi-granularity SIMT manner. They therefore lack flexibility and cannot exploit enough parallelism, thread-level parallelism in particular.

In summary, how to process matrix data efficiently and flexibly in a vector processor, providing flexible and sufficient parallel operations for multi-granularity SIMT and multi-width SIMD processing and improving the parallel processing efficiency of vector processors and array processors so as to meet the demand of applications such as wireless communication and image processing for large-scale matrix computation, remains an active research topic in this field.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a matrix register unit that is simple in principle and easy to operate, whose block size and thread count can be configured flexibly, and that supports access to vector data in multi-width SIMD and multi-granularity SIMT modes at the same time.

To solve the above technical problem, the present invention adopts the following technical solution:

A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR; the matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2; the control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit; and the width of the control register is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multithreading modes that the vector processor can handle.

As further improvements of the present invention:

When M is 0, the matrix register is not partitioned, and the vector operation unit accesses one row vector or one column vector of the matrix register at a time. When M is not 0, the vector operation unit accesses one or more sub-row vectors or sub-column vectors of equal length in the matrix register, depending on the number of threads processed simultaneously; these equal-length sub-row vectors or sub-column vectors come from different blocks.

When the vector operation unit accesses the matrix register, the address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

The control register SR is either an independent control register or is stored in the reserved bits of another control register, in which case the number of reserved bits of that control register is an integer greater than log₂C + log₂T.

Compared with the prior art, the advantages of the present invention are as follows: the matrix register unit supports access to vector data in multi-width SIMD and multi-granularity SIMT modes, is simple in principle and easy to operate, and allows the block size and thread count to be configured flexibly. It enables efficient and flexible processing of matrix data in a vector processor and provides flexible and sufficient parallel operations for multi-granularity SIMT and multi-width SIMD processing, thereby improving the parallel processing efficiency of vector processors and array processors and meeting the demand of applications such as wireless communication and image processing for large-scale matrix computation.

Description of drawings

Fig. 1 is a schematic diagram of the overall structure of the matrix register of the present invention;

Fig. 2 is a schematic diagram of the architecture of the vector processor;

Fig. 3 is a schematic diagram of the structure of the SR register in the present invention;

Fig. 4 is a schematic diagram of the storage cell array structure of the matrix register of the present invention;

Fig. 5 is a schematic diagram of the row address space when the matrix register of the present invention is not partitioned;

Fig. 6 is a schematic diagram of the column address space when the matrix register of the present invention is not partitioned;

Fig. 7 is a schematic diagram of the row address space in single-thread mode when the matrix register of the present invention is partitioned;

Fig. 8 is a schematic diagram of the column address space in single-thread mode when the matrix register of the present invention is partitioned;

Fig. 9 is a schematic diagram of the row address space in M-thread mode when the matrix register of the present invention is partitioned;

Fig. 10 is a schematic diagram of the column address space in M-thread mode when the matrix register of the present invention is partitioned;

Fig. 11 is a schematic diagram of the decoding path in the present invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 shows the overall structure of the matrix register of the present invention. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT comprises a matrix register and a control register SR.

When the read/write enable signal is active, the address decoding logic decodes the read/write address under the control of the row/column select signal and the control register SR, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing. The matrix register consists of an N*N array of storage cells, each W bits wide, for a total capacity of N*N*W bits. The matrix register can be partitioned by configuring certain fields of the control register SR. When the content of SR indicates that the block mode is M*M, the matrix register is divided into M parts along the rows and M parts along the columns, that is, into M*M blocks, each of size (N/M)*(N/M). M generally takes the values 0, 2, 4, 8, ... and does not exceed N/2. When M is 0, the matrix register is not partitioned (this may also be called block mode 0*0). When M is not 0, each row vector register or column vector register is logically divided into M sub-row vector registers or sub-column vector registers of N/M storage cells each; the unit of access by the vector operation unit is then the length of one sub-row or sub-column vector register, and one or more sub-row vectors or sub-column vectors can be selected in each access.

In the present invention, the functional units of the vector operation unit can access the matrix register in the same way as they access general-purpose registers. The matrix register additionally provides access by column, accommodates vector reads and writes of different lengths, and supports SIMT-style access. The maximum bandwidth of a single read or write of the matrix register is N*W bits.
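The partition geometry described above can be sketched as follows. This is only an illustration: the function name `partition_geometry` and the returned field names are hypothetical, and reporting an unpartitioned register as a single N*N block is an assumption made for convenience, not something stated in the text.

```python
def is_power_of_two(x: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return x > 0 and (x & (x - 1)) == 0


def partition_geometry(n: int, m: int, w: int) -> dict:
    """Describe an n*n array of w-bit cells in block mode m*m.

    m == 0 means "not partitioned", following the convention in the text.
    Field names are illustrative only.
    """
    if n < 2 or not is_power_of_two(n):
        raise ValueError("n must be a power of 2")
    if m == 0:
        # Assumption: treat the unpartitioned register as one n*n block.
        blocks, side = 1, n
    elif m >= 2 and is_power_of_two(m) and m <= n // 2:
        blocks, side = m * m, n // m
    else:
        raise ValueError("m must be 0 or a power of 2 in the range 2..n/2")
    return {
        "blocks": blocks,      # number of blocks
        "block_side": side,    # each block is side*side cells
        "access_len": side,    # cells in one (sub-)row or (sub-)column vector
        "max_bits": n * w,     # upper bound on bits moved per access
    }
```

For example, a 16*16 register of 32-bit cells in block mode 4*4 has 16 blocks of 4*4 cells, a sub-vector length of 4 and a maximum access width of 512 bits.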

Fig. 3 is a schematic diagram of the structure of the control register SR. SR records the current block mode of the matrix register and the number of threads currently being processed simultaneously by the operation units. To reduce cost and improve design reuse, the present invention treats SR as one of the processor's control registers, so that the programmer can access it with the existing control-register access instructions and no additional instructions are needed. The bit width of the valid data in SR is the smallest integer greater than (log₂C + log₂T), where C is the number of block modes of the matrix register, T is the number of multithreading modes that the vector processor can handle, and C and T are integers greater than 1. SR may exist independently as a dedicated control register. Alternatively, since the control registers of a processor generally have reserved bits, if the number of reserved bits in one of the processor's existing control registers is greater than or equal to log₂C + log₂T, no additional SR register is needed; otherwise a control register SR must be added. In either case SR is reached through the existing control-register access instructions, so it can be configured dynamically without adding instructions.
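A minimal sketch of how the SR fields might be sized and packed is given below. It rests on several assumptions not found in the text: each field is rounded up to a whole number of bits separately, the block mode field occupies the high bits, and modes are encoded as plain indices. All function names are hypothetical.

```python
def field_bits(count: int) -> int:
    """Bits needed to encode `count` distinct modes (ceiling of log2)."""
    if count < 2:
        raise ValueError("the text requires C and T to be greater than 1")
    return (count - 1).bit_length()


def sr_width(c: int, t: int) -> int:
    """Valid-data width of SR for c block modes and t thread modes."""
    return field_bits(c) + field_bits(t)


def fits_in_reserved(reserved_bits: int, c: int, t: int) -> bool:
    """Can SR live in the reserved bits of an existing control register?"""
    return reserved_bits >= sr_width(c, t)


def pack_sr(block_mode: int, thread_mode: int, c: int, t: int) -> int:
    """Pack both mode indices; block mode in the high bits (an assumption)."""
    if not (0 <= block_mode < c and 0 <= thread_mode < t):
        raise ValueError("mode index out of range")
    return (block_mode << field_bits(t)) | thread_mode


def unpack_sr(value: int, c: int, t: int) -> tuple:
    """Inverse of pack_sr: returns (block_mode, thread_mode)."""
    t_bits = field_bits(t)
    return value >> t_bits, value & ((1 << t_bits) - 1)
```

With four block modes and four thread modes, this layout needs four bits, so an existing control register with at least four reserved bits could hold SR under these assumptions.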

For every access to the matrix register, the vector processing unit must supply three signals: the read/write enable, the read/write address and the row/column select signal. The address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

The present invention provides a complete address mapping scheme. Under different block modes and thread modes the matrix register presents different address views, which give the programmer a complete and flexible set of access modes. The rules of the mapping are as follows. Between blocks, addresses run first from top to bottom and then from left to right. Within a block, row addresses increase linearly from top to bottom and column addresses increase linearly from left to right. When the number of threads is L (L ≤ max(M)), L consecutive blocks in the row direction share one column-vector address space, and L consecutive blocks in the column direction share one row-vector address space. That is, with L threads the vector processing unit accesses L sub-row vectors or L sub-column vectors of the matrix register simultaneously, and each sub-row vector or sub-column vector is processed by V vector processing units, where V is the length of the sub-row vector or sub-column vector.
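As a rough illustration of the last point, the number of processing units kept busy by one access can be computed from N, M and L. The function name `active_units` is hypothetical, and restricting L to at most M in a given block mode is an assumption (the text only states L ≤ max(M)).

```python
def active_units(n: int, m: int, l: int) -> int:
    """Processing units used by one access to an n*n register.

    m is the block mode (0 = not partitioned), l the thread count.
    Each thread handles one vector of length V, on V processing units.
    """
    if m == 0:
        if l != 1:
            raise ValueError("an unpartitioned register is single-threaded")
        return n                      # one full row or column vector
    if not 1 <= l <= m:               # assumption: at most m threads
        raise ValueError("thread count must be between 1 and m")
    v = n // m                        # length of one sub-vector
    return l * v
```

With N = 16 and M = 4, a single thread uses 4 units while four threads use all 16, which is how the thread count trades vector width against thread-level parallelism.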

Fig. 2 shows the architecture of the vector processor in the present invention. A vector processor generally consists of N parallel processing elements (PEs). Each PE has i functional units, which may be MAC, ALU, division, shift and similar units, and each functional unit can read and write the matrix register as required. The N PEs together form the SIMD processing mode, while each PE generally adopts a VLIW structure internally, with several functional units operating in parallel. The cell array of the matrix register is of size N*N and logically forms N row vector registers VR_0 to VR_{N-1} and N column vector registers CVR_0 to CVR_{N-1}. Each row vector register VR_i contains N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1), and each column vector register CVR_i contains N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). The matrix register is designed with multiple read/write ports, so that it can supply source operands to several functional units of the N PEs simultaneously and can likewise accept writes from several functional units of the N PEs.

Fig. 4 is a schematic diagram of the storage cell array structure of the matrix register in the present invention. The storage cell array generally consists of N*N storage cells, where N is generally a power of 2. Each storage cell is W bits wide, and W is typically 4, 8, 12, 16 or 32. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or as N column vector registers CVR_0 to CVR_{N-1}. Each row vector register contains N elements (storage cells) E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1); taking VR_0 as an example, this row vector register contains the storage cells E_{0,0} to E_{0,N-1}. By column, the array is divided into N storage columns of N*W bits each, every column consisting of the N elements in the same column position. These N storage columns correspond one to one with the N column vector registers CVR_0 to CVR_{N-1} and implement access to the corresponding column vector register. Taking CVR_{N-1} as an example, this column vector register contains the last element E_{i,N-1} (i = 0, 1, 2, ..., N-1) of every row vector register VR_0 to VR_{N-1}.
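The two logical views of the same array can be illustrated with a small software model. This is a sketch only: the nested-list representation and the names `make_register`, `read_row` and `read_col` are hypothetical, and each cell simply stores its own (row, column) index so that the views are easy to inspect.

```python
def make_register(n: int) -> list:
    """Build an n*n array whose cell E[i][j] holds the pair (i, j)."""
    return [[(i, j) for j in range(n)] for i in range(n)]


def read_row(reg: list, i: int) -> list:
    """Row vector register VR_i: cells E[i][0] .. E[i][n-1]."""
    return list(reg[i])


def read_col(reg: list, j: int) -> list:
    """Column vector register CVR_j: cells E[0][j] .. E[n-1][j]."""
    return [row[j] for row in reg]


reg = make_register(4)
first_row = read_row(reg, 0)   # cells of VR_0
last_col = read_col(reg, 3)    # last element of every row, i.e. CVR_3
```

Reading column N-1 returns the last element of every row, matching the CVR_{N-1} example in the text.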

Fig. 5 is a schematic diagram of the row address space of the matrix register when it is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. The matrix register consists of N row vector registers, each containing N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit reads or writes one row vector register at a time, with a data bandwidth of N*W bits per access. The addresses of the row vector registers increase linearly from top to bottom, and the vector operation unit accesses different row vectors of the matrix register by supplying different row addresses.

Fig. 6 is a schematic diagram of the column address space of the matrix register when it is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. The matrix register consists of N column vector registers, each containing N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit reads or writes one column vector register at a time, with a data bandwidth of N*W bits per access. The addresses of the column vector registers increase linearly from left to right, and the vector operation unit accesses different column vectors of the matrix register by supplying different column addresses.

Fig. 7 is a schematic diagram of the row address space in single-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates single-threaded operation, the matrix register consists of N*M sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., (N-1)*M). A functional unit of the vector operation unit reads or writes one sub-row vector register at a time, with a data bandwidth of (N/M)*W bits per access. The sub-row vector registers are addressed according to the following rule: between blocks, first from top to bottom and then from left to right; within a block, addresses increase linearly from top to bottom. The vector operation unit accesses different sub-row vectors of the matrix register by supplying different row addresses. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access.
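One possible reading of this addressing rule is sketched below. The interpretation is an assumption: blocks are taken in order down the leftmost block column first and then column by column to the right, so that each block column contributes N consecutive addresses. The function names and the nested-list register model are hypothetical.

```python
def subrow_location(addr: int, n: int, m: int) -> tuple:
    """Map a sub-row address to (row, first_column) in an n*n register.

    Assumed order: blocks top to bottom, then left to right; rows inside
    a block top to bottom. There are n*m addresses in total.
    """
    if not 0 <= addr < n * m:
        raise ValueError("sub-row address out of range")
    side = n // m                       # cells per sub-row vector
    block_col, row = divmod(addr, n)    # n sub-rows per block column
    return row, block_col * side


def read_subrow(reg: list, addr: int, m: int) -> list:
    """Return the n/m cells of the sub-row vector at `addr`."""
    n = len(reg)
    row, col0 = subrow_location(addr, n, m)
    return reg[row][col0:col0 + n // m]


# 4*4 register in block mode 2*2; cell (i, j) holds 10*i + j.
demo = [[10 * i + j for j in range(4)] for i in range(4)]
left_top = read_subrow(demo, 0, 2)    # row 0, columns 0..1
right_top = read_subrow(demo, 4, 2)   # row 0, columns 2..3
```

Under this reading, addresses 0 to 3 walk down the left pair of blocks and addresses 4 to 7 walk down the right pair.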

Fig. 8 is a schematic diagram of the column address space in single-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates single-threaded operation, the matrix register consists of N*M sub-column vector registers, each containing N/M storage cells E_{0,i} to E_{N/M-1,i} (i = 0, 1, 2, ..., (N-1)*M). A functional unit of the vector operation unit reads or writes one sub-column vector register at a time, with a data bandwidth of (N/M)*W bits per access. The sub-column vector registers are addressed according to the following rule: between blocks, first from top to bottom and then from left to right; within a block, addresses increase linearly from left to right. The vector operation unit accesses different sub-column vectors of the matrix register by supplying different column addresses. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access.

Fig. 9 is a schematic diagram of the row address space in M-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates multithreaded operation with L threads, the matrix register consists of N*M/L sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., (N-1)*M/L). A functional unit of the vector operation unit reads or writes L sub-row vector registers at a time, with a data bandwidth of (L*N/M)*W bits per access. The sub-row vector registers are addressed according to the following rule: between blocks, first from top to bottom and then from left to right; within a block, addresses increase linearly from top to bottom; and L consecutive blocks in the column direction share one row address space. For each row address the vector operation unit accesses L different sub-row vectors of the matrix register. These L sub-row vectors share the same address but come from different matrix data blocks; in essence, the L sub-row vectors constitute one multithreaded data access. Fig. 9 shows the sub-row vector addressing when L equals M. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
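The multithreaded row access can be modelled as below. This is a sketch under stated assumptions: L divides M, the L blocks sharing an address are vertically adjacent (one reading of "consecutive blocks in the column direction"), and groups of blocks are enumerated top to bottom and then left to right. The function name and register model are hypothetical.

```python
def read_subrows_mt(reg: list, addr: int, m: int, l: int) -> list:
    """Return the l sub-row vectors that share row address `addr`.

    reg is an n*n nested list in block mode m*m; thread k receives the
    sub-row taken from the k-th block of the selected group of l blocks.
    """
    n = len(reg)
    if l < 1 or m % l:
        raise ValueError("assumption: the thread count must divide m")
    if not 0 <= addr < n * m // l:
        raise ValueError("row address out of range")
    side = n // m                       # cells per sub-row vector
    per_block_col = n // l              # addresses per block column
    block_col, rest = divmod(addr, per_block_col)
    group, r = divmod(rest, side)       # group of l blocks, row in block
    col0 = block_col * side
    vectors = []
    for k in range(l):
        row = (group * l + k) * side + r
        vectors.append(reg[row][col0:col0 + side])
    return vectors


# 4*4 register in block mode 2*2 with two threads; cell holds 10*i + j.
demo = [[10 * i + j for j in range(4)] for i in range(4)]
both_threads = read_subrows_mt(demo, 0, 2, 2)
```

One address thus returns L vectors of N/M cells each, which matches the (L*N/M)*W bits per access stated above; with L = 1 the mapping reduces to the single-thread case.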

Fig. 10 is a schematic diagram of the column address space in M-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates multithreaded operation with L threads, the matrix register consists of N*M/L sub-column vector registers, each containing N/M storage cells E_{0,i} to E_{N/M-1,i} (i = 0, 1, 2, ..., (N-1)*M/L). A functional unit of the vector operation unit reads or writes L sub-column vector registers at a time, with a data bandwidth of (L*N/M)*W bits per access. The sub-column vector registers are addressed according to the following rule: between blocks, first from top to bottom and then from left to right; within a block, addresses increase linearly from left to right; and L consecutive blocks in the row direction share one column address space. For each column address the vector operation unit accesses L different sub-column vectors of the matrix register. These L sub-column vectors share the same address but come from different matrix data blocks; in essence, the L sub-column vectors constitute one multithreaded data access. Fig. 10 shows the sub-column vector addressing when L equals M. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.

Fig. 11 is a schematic diagram of the decoding path of the matrix register of the present invention. The decoding path consists of the address decoding logic, the read and write data buffer units, and the read and write data buses. When the vector processing unit reads or writes the matrix register, the address decoding logic decodes the address according to the read/write address, the row/column select signal and the content of SR, and selects one or more row or column vectors, or one or more sub-row or sub-column vectors, for reading or writing. On a read, once a storage cell has been selected by the decoding logic, its content is read out onto the read data bus of the row in which the cell lies and is then delivered to the read data buffer. The read data buffer unit assembles the data from the different storage cells into vector form and returns it to the vector operation unit. On a write, the write data buffer unit splits the vector data from the vector operation unit into the individual data items to be written to different storage cells. Once a storage cell has been selected by the decoding logic, the content to be written is placed on the write data bus of the row in which the cell lies and is written into the cell when the clock is active.
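The read and write paths can be mimicked in software as a select, gather and scatter sequence. The sketch below covers only the unpartitioned case, ignores buses and clocking, and uses hypothetical names (`select_cells`, `read_vector`, `write_vector`); it is a behavioural illustration, not the circuit.

```python
def select_cells(n: int, is_column: bool, index: int) -> list:
    """Decode stage: list the (row, col) cells chosen by one access."""
    if not 0 <= index < n:
        raise ValueError("address out of range")
    if is_column:
        return [(k, index) for k in range(n)]
    return [(index, k) for k in range(n)]


def read_vector(reg: list, cells: list) -> list:
    """Read buffer: gather the selected cells into one vector."""
    return [reg[r][c] for r, c in cells]


def write_vector(reg: list, cells: list, data: list) -> None:
    """Write buffer: split the vector and store one item per cell."""
    if len(data) != len(cells):
        raise ValueError("vector length does not match the selection")
    for (r, c), value in zip(cells, data):
        reg[r][c] = value


# Write column 1 of a zeroed 4*4 register, then read row 2 back.
reg = [[0] * 4 for _ in range(4)]
write_vector(reg, select_cells(4, True, 1), [1, 2, 3, 4])
row2 = read_vector(reg, select_cells(4, False, 2))
```

Because the same cell selection feeds both directions, a value written through the column view is visible through the row view, which is the property the matrix register relies on.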

The above are merely preferred embodiments of the present invention. The scope of protection of the invention is not limited to the embodiments described above; all technical solutions falling within the concept of the invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the invention shall also be regarded as falling within its scope of protection.

Claims

1. A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that: it comprises a matrix register and a control register SR; the matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2; the control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit; and the width of the control register is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multithreading modes that the vector processor can handle.

2. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, characterized in that: when M is 0, the matrix register is not partitioned, and the vector operation unit accesses one row vector or one column vector of the matrix register at a time; when M is not 0, the vector operation unit accesses one or more sub-row vectors or sub-column vectors of equal length in the matrix register, depending on the number of threads processed simultaneously, and these equal-length sub-row vectors or sub-column vectors come from different blocks.

3. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 2, characterized in that: when the vector operation unit accesses the matrix register, the address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

4. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, 2 or 3, characterized in that: the control register SR is either an independent control register or is stored in the reserved bits of another control register, in which case the number of reserved bits of that control register is an integer greater than log₂C + log₂T.