Background technology
With the continued development of 4G wireless communication technology and video/image processing technology, vector processors have come into wide use. Evolving wireless communication protocols and video/image processing algorithms require large numbers of matrix operations to be performed quickly, such as channel estimation, MIMO equalization, and the DCT transform. The parallel granularity of the matrix operations differs from algorithm to algorithm, as does the size of the matrix blocks each algorithm handles. Only by providing efficient support for matrix operations of these differing counts and block sizes can a vector processor adapt well to this class of data-intensive applications and meet real-time data processing requirements.
The core algorithms of wireless communication protocols and video/image processing typically exhibit data-level parallelism and thread-level parallelism at the same time. Vector processors targeting this class of applications therefore usually adopt very long instruction word (VLIW) and single-instruction multiple-data (SIMD) architectures, and may also support the single-instruction multiple-thread (SIMT) technique, so as to obtain sufficient concurrent computing capability. The two classes of algorithms above also commonly show the following characteristic: as protocols evolve rapidly, the vector lengths an algorithm handles keep changing, and at the same time the thread-level parallelism the algorithm can exploit changes as well. In 3G wireless communication protocols, protocol evolution has continually changed the number of antennas at the base station and the handheld terminal; this has caused the vector length in the channel-equalization matrix to change continually, which means that both the width of the vector data a vector processing unit must handle and the number of threads processed simultaneously are changing. These characteristics place strong demands on the vector processor to provide sufficiently effective support, at the architecture level, for multi-width SIMD processing and multi-granularity SIMT processing. The present invention therefore proposes a matrix register whose block size and thread count are configurable, which can meet the vector-operation demands of the different parallel granularities and block sizes found in such algorithms.
The memory cell array of a matrix register generally consists of N*M storage units (M and N being integers greater than 1), each with a bit width of typically 4, 8, 12, 16, or 32 bits. Logically, the array can be regarded as N row-vector registers VR0-VRN-1 or M column-vector registers CVR0-CVRM-1; N and M are generally powers of 2. Each row-vector register comprises M elements (storage units) Ei,0-Ei,M-1 (i=0,1,2,...,N-1), and each column-vector register comprises N elements E0,i-EN-1,i (i=0,1,2,...,M-1). Under the control of the read/write enable, the read/write address, and the row/column select signal, the matrix register completes reads and writes of row and column vectors.
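The row/column view described above can be illustrated with a minimal behavioral model. This is only an illustrative sketch (the class and method names, such as `MatrixRegister`, are not from the source): each row read returns a row-vector register VRi and each column read returns a column-vector register CVRj, built from the same N*M cell array.

```python
class MatrixRegister:
    """Minimal behavioral model of an N x M matrix-register cell array."""
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.cells = [[0] * m for _ in range(n)]   # element E[i][j]

    def read_row(self, i):
        # Row-vector register VRi: elements Ei,0 .. Ei,M-1
        return list(self.cells[i])

    def read_col(self, j):
        # Column-vector register CVRj: elements E0,j .. EN-1,j
        return [self.cells[i][j] for i in range(self.n)]

    def write_row(self, i, values):
        assert len(values) == self.m
        self.cells[i] = list(values)

    def write_col(self, j, values):
        assert len(values) == self.n
        for i in range(self.n):
            self.cells[i][j] = values[i]

mr = MatrixRegister(4, 4)
mr.write_row(1, [10, 11, 12, 13])
print(mr.read_row(1))   # -> [10, 11, 12, 13]
print(mr.read_col(2))   # -> [0, 12, 0, 0]
```

The point of the model is that row and column accesses address the same cells, which is what the read/write enable, address, and row/column select signals arbitrate in hardware.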
Existing research provides access to fixed-size blocks of data in the matrix register described above: these techniques read or write one row vector or column vector of the matrix at a time, with a fixed vector length. When the required vector length is greater or smaller than this fixed length, multiple short vectors are usually combined into one long vector for parallel processing, or one long vector is split into several short vectors processed step by step. Such schemes cannot flexibly handle matrix data of different sizes, do not support multi-width SIMD processing, and do not support accessing multiple matrix data blocks in a multi-granularity SIMT manner; they therefore lack sufficient flexibility and cannot exploit enough parallelism, in particular thread-level parallelism.
In summary, how to provide efficient and flexible processing of matrix data in a vector processor, supply flexible and sufficient numbers of parallel operations for multi-granularity SIMT and multi-width SIMD processing, and improve the parallel processing efficiency of vector and array processors, so as to meet the demands of applications such as wireless communication and image processing for large-scale matrix operations, remains a hot research topic in this field.
Summary of the invention
The technical problem to be solved by the present invention is this: in view of the technical problems of the prior art, the invention provides a matrix register unit that is simple in principle, easy to operate, configurable in block size and thread count, and that simultaneously supports multi-width SIMD and multi-granularity SIMT access to vector data.
To solve the above technical problems, the present invention adopts the following technical solution:
A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than 0 and a power of 2, or, when M is 0, the matrix register is not partitioned. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing units; the width of the control register is log2C + log2T, where C is the number of block modes of the matrix register and T is the maximum number of multithread modes the vector processor can handle.
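The stated SR width follows directly from encoding the two fields. A small sketch (the function name `sr_width` is illustrative; `ceil` is used so the sketch also covers C or T values that are not powers of two, an assumption beyond the source's log2C + log2T formula):

```python
from math import ceil, log2

def sr_width(c, t):
    # Bits needed to encode C block modes and T thread modes in SR.
    # The document gives the width as log2(C) + log2(T); taking the
    # ceiling generalizes this to non-power-of-two mode counts.
    assert c > 1 and t > 1
    return ceil(log2(c)) + ceil(log2(t))

print(sr_width(4, 8))   # -> 5
print(sr_width(5, 8))   # -> 6
```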
As a further improvement on the present invention:
When M is 0, the matrix register is not partitioned, and the vector operation units can access one row vector or one column vector of the matrix register at a time. When M is not 0, the vector operation units access one or more equal-length sub-row vectors or sub-column vectors of the matrix register according to the number of threads being processed simultaneously, and these equal-length sub-row or sub-column vectors come from different partitioned blocks.
When the vector operation units access the matrix register, the address decoding logic performs decoding according to the contents of the control register SR, the read/write address, and the row/column select signal, and selects a row vector or column vector of the matrix register to read or write, or selects one or more sub-row vectors or sub-column vectors to read or write.
The control register SR is either an independent control register or is stored in the reserved bits of another control register, the reserved bit length of that other control register being an integer greater than log2C + log2T.
Compared with the prior art, the advantages of the present invention are as follows: the invention provides a matrix register unit supporting multi-width SIMD and multi-granularity SIMT access to vector data that is simple in principle and easy to operate, with configurable block size and thread count. It enables efficient and flexible processing of matrix data in a vector processor and provides flexible and sufficient numbers of parallel operations for multi-granularity SIMT and multi-width SIMD processing, thereby improving the parallel processing efficiency of vector and array processors and meeting the demands of applications such as wireless communication and image processing for large-scale matrix operations.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic diagram of the overall structure of the matrix register of the present invention. The configurable matrix register unit of the present invention, supporting multi-width SIMD and multi-granularity SIMT, comprises a matrix register and a control register SR.
When the read/write enable signal is asserted, the address decoding logic performs decoding according to the read/write address, under the control of the row/column select signal and the control register SR, and selects a row vector or column vector of the matrix register to read or write, or selects one or more sub-row vectors or sub-column vectors to read or write. The matrix register consists of an N*N memory cell array; the bit width of each cell is W, so the storage capacity is N*N*W bits. By configuring a field of the control register SR, the matrix register can be partitioned. When the contents of SR indicate a partition mode of M*M, the matrix register is divided into M sub-blocks along the rows and M sub-blocks along the columns, i.e. into M*M sub-blocks, each of size (N/M)*(N/M). M is typically 0, 2, 4, or 8; M is an integer greater than 0 and a power of 2 (when M is 0, the matrix register is not partitioned, which may also be called block mode 0*0), and M does not exceed N/2. When M is not 0, each row-vector or column-vector register is logically divided into M sub-row-vector or sub-column-vector registers, each comprising N/M storage units. The unit length of each read or write by the vector operation units is then the length of a sub-row-vector or sub-column-vector register, and one or more sub-row vectors or sub-column vectors can be selected for reading or writing at a time. In the present invention, the functional units of the vector operation units can access the matrix register in the same way they access general-purpose registers; the matrix register also provides column access, supports vector reads and writes of different vector lengths, and supports the SIMT access mode. The maximum bandwidth of each read or write of this matrix register is N*W bits.
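The relationship between N, M, and the sub-block size described above can be captured in a few lines. A hedged sketch (the function name `block_geometry` is illustrative, not from the source):

```python
def block_geometry(n, m):
    # Sub-block layout under M*M partitioning of an N*N matrix register.
    # M == 0 means no partitioning (block mode 0*0); otherwise M is a
    # power of 2 not exceeding N/2, and each sub-block is (N/M) x (N/M).
    if m == 0:
        return (1, n)                 # one block covering the whole register
    assert m >= 2 and (m & (m - 1)) == 0 and m <= n // 2
    return (m * m, n // m)            # (block count, block side length)

print(block_geometry(16, 4))  # -> (16, 4): sixteen 4x4 sub-blocks
print(block_geometry(16, 0))  # -> (1, 16): unpartitioned
```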
Fig. 3 is a schematic structural diagram of the control register SR in the present invention. The control register SR records the current block mode of the matrix register and the number of threads the arithmetic units are processing simultaneously. To reduce cost and improve design reusability, the present invention uses SR as a control register of the processor: the programmer can access SR with the existing control-register access instructions, with no need to add new instructions. The bit width of the valid data in SR is the smallest positive integer greater than log2C + log2T (where C is the number of block modes of the matrix register, T is the maximum number of multithread modes the vector processor can handle, and C and T are integers greater than 1). SR can exist independently as a dedicated control register, or it can use the reserved bits commonly present in the processor's existing control registers: if the number of reserved bits in an existing control register is greater than or equal to log2C + log2T, no extra SR register is needed; otherwise, a control register SR must be added. In either case, the programmer can access SR with the existing control-register access instructions, so the dynamic configuration of SR is achieved without adding extra instructions.
Each access by the vector processing units to the matrix register requires three kinds of signals: the read/write enable, the read/write address, and the row/column select signal. The address decoding logic performs decoding according to the contents of the control register SR, the read/write address, and the row/column select signal; it selects a row vector or column vector of the matrix register to read or write, or selects one or more sub-row vectors or sub-column vectors to read or write.
The present invention designs a complete address mapping scheme: under different block modes and thread modes, the matrix register presents different address views, and these address views give the programmer complete and flexible access modes. The rules of the address mapping scheme are as follows. Between blocks, addresses follow top-to-bottom then left-to-right order; within a block, row addresses follow top-to-bottom order and column addresses follow left-to-right order, increasing linearly. When the number of threads is L (L <= max(M)), L consecutive blocks in the row direction share one column-vector address space, and L consecutive blocks in the column direction share one row-vector address space. When the number of threads is L, the vector processing units access L sub-row vectors or L sub-column vectors of the matrix register simultaneously; each sub-row or sub-column vector is processed by V vector processing units, where V is the length of the sub-row or sub-column vector.
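The single-thread part of this mapping can be sketched as a function from block coordinates to a linear sub-row address. This is only one reading of the stated ordering (blocks numbered top-to-bottom, then left-to-right; rows inside a block numbered top-to-bottom), and the function name `sub_row_address` is illustrative:

```python
def sub_row_address(n, m, block_row, block_col, row_in_block):
    # Linear address of a sub-row-vector register under M x M partitioning
    # of an N x N matrix register. Blocks are numbered top-to-bottom, then
    # left-to-right (one reading of the "top-down, then left-right" rule).
    rows_per_block = n // m
    block_id = block_col * m + block_row
    return block_id * rows_per_block + row_in_block

# 8x8 register, 2x2 blocks: the block in row 1, column 0 starts at address 4
print(sub_row_address(8, 2, 1, 0, 0))  # -> 4
```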
Fig. 2 is the architectural frame of the vector processor in the present invention. A vector processor generally consists of N parallel processing elements (PEs); each processing element has i functional units, which may be MAC, ALU, division, or shift units, among others, and each functional unit can read and write the matrix register as needed. The N PEs form a SIMD processing mode, and each PE internally generally adopts a VLIW structure in which multiple functional units operate concurrently. The cell array of the matrix register is of size N*N, logically forming N row-vector registers VR0-VRN-1 and N column-vector registers CVR0-CVRN-1. Each row-vector register VRi comprises N storage units Ei,0-Ei,N-1 (i=0,1,2,...,N-1), and each column-vector register CVRi comprises N storage units E0,i-EN-1,i (i=0,1,2,...,N-1). The matrix register is designed with multiport read/write, so it can supply source operands to multiple functional units of the N PEs and simultaneously support writes of data from multiple functional units of the N PEs.
Fig. 4 is a schematic diagram of the memory cell array structure of the matrix register in the present invention. The memory cell array of the matrix register generally consists of N*N storage units, where N is generally a power of 2. The bit width of each storage unit is W, generally 4, 8, 12, 16, or 32 bits. Logically, the array can be regarded as composed of N row-vector registers VR0-VRN-1 or N column-vector registers CVR0-CVRN-1; each row-vector register comprises N elements (storage units) Ei,0-Ei,N-1 (i=0,1,2,...,N-1). Taking VR0 as an example, this row-vector register comprises storage units E0,0-E0,N-1. Dividing the memory cell array by columns gives N memory cell columns of N*W bits each, every column consisting of the N elements in the same column position. These N memory cell columns correspond one-to-one to the N column-vector registers CVR0-CVRN-1, realizing the access function of the corresponding column-vector registers. Taking CVRN-1 as an example, this column-vector register comprises the last element Ei,N-1 (i=0,1,2,...,N-1) of every row-vector register VR0-VRN-1.
Fig. 5 is a schematic diagram of the row address space of the matrix register of the present invention when unpartitioned. When the block-mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. The matrix register consists of N row-vector registers, each comprising N storage units Ei,0-Ei,N-1 (i=0,1,2,...,N-1). A functional unit of the vector operation units can read or write one row-vector register, with a data bandwidth of N*W bits per access. The addresses of the row-vector registers increase linearly from top to bottom. The vector operation units can access different row vectors of the matrix register through different row addresses.
Fig. 6 is a schematic diagram of the column address space of the matrix register of the present invention when unpartitioned. When the block-mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. The matrix register consists of N column-vector registers, each comprising N storage units E0,i-EN-1,i (i=0,1,2,...,N-1). A functional unit of the vector operation units can read or write one column-vector register, with a data bandwidth of N*W bits per access. The addresses of the column-vector registers increase linearly from left to right. The vector operation units can access different column vectors of the matrix register through different column addresses.
Fig. 7 is a schematic diagram of the row address space of the matrix register of the present invention in single-thread mode when partitioned. When the block-mode field in SR indicates a block mode of M*M and the thread-mode field indicates that the current operation is single-threaded, the matrix register consists of N*M sub-row-vector registers, each comprising N/M storage units Ei,0-Ei,N/M-1 (i=0,1,2,...,N*M-1). A functional unit of the vector operation units can read or write one sub-row-vector register, with a data bandwidth of (N/M)*W bits per access. The sub-row-vector registers are addressed according to the following rule: between blocks, addresses follow top-to-bottom then left-to-right order; within a block, addresses increase linearly from top to bottom. The vector operation units can access different sub-row vectors of the matrix register through different row addresses. Different values of M realize support for SIMD operations of different lengths, i.e. multi-width SIMD access.
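The decode direction of this single-thread addressing can be sketched as the inverse mapping, from a linear sub-row address back to the cells it selects. A hedged sketch under the same assumed block ordering (top-to-bottom, then left-to-right); the function name `read_sub_row` is illustrative:

```python
def read_sub_row(cells, n, m, addr):
    # Decode a linear sub-row address into (block, row) and return that
    # sub-row vector of length N/M. Blocks are assumed numbered
    # top-to-bottom, then left-to-right; rows in a block top-to-bottom.
    rows_per_block = n // m
    block_id, row_in_block = divmod(addr, rows_per_block)
    block_col, block_row = divmod(block_id, m)
    row = block_row * rows_per_block + row_in_block
    col0 = block_col * rows_per_block
    return cells[row][col0 : col0 + rows_per_block]

# 8x8 array where each cell holds its linear index, 2x2 blocks:
cells = [[r * 8 + c for c in range(8)] for r in range(8)]
print(read_sub_row(cells, 8, 2, 4))  # -> [32, 33, 34, 35] (block (1,0), row 4)
```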
Fig. 8 is a schematic diagram of the column address space of the matrix register of the present invention in single-thread mode when partitioned. When the block-mode field in SR indicates a block mode of M*M and the thread-mode field indicates that the current operation is single-threaded, the matrix register consists of N*M sub-column-vector registers, each comprising N/M storage units E0,i-EN/M-1,i (i=0,1,2,...,N*M-1). A functional unit of the vector operation units can read or write one sub-column-vector register, with a data bandwidth of (N/M)*W bits per access. The sub-column-vector registers are addressed according to the following rule: between blocks, addresses follow top-to-bottom then left-to-right order; within a block, addresses increase linearly from left to right. The vector operation units can access different sub-column vectors of the matrix register through different column addresses. Different values of M realize support for SIMD operations of different lengths, i.e. multi-width SIMD access.
Fig. 9 is a schematic diagram of the row address space of the matrix register of the present invention in M-thread mode when partitioned. When the block-mode field in SR indicates a block mode of M*M and the thread-mode field indicates that the current operation is multithreaded (with L threads), the matrix register consists of N*M/L sub-row-vector registers, each comprising N/M storage units Ei,0-Ei,N/M-1 (i=0,1,2,...,N*M/L-1). The functional units of the vector operation units can read or write L sub-row-vector registers, with a data bandwidth of (L*N/M)*W bits per access. The sub-row-vector registers are addressed according to the following rule: between blocks, addresses follow top-to-bottom then left-to-right order; within a block, addresses increase linearly from top to bottom; and L consecutive blocks in the column direction share one row-address space. The vector operation units can access L different sub-row vectors of the matrix register through different row addresses; these L sub-row vectors share the same address but come from different matrix data blocks, i.e. the L different sub-row vectors are in essence one multithreaded data access. Fig. 9 shows the sub-row-vector addressing when L equals M. Different values of M realize support for SIMD operations of different lengths, supporting multi-width SIMD access; different values of L realize thread-level parallelism of different granularities, supporting the multi-granularity SIMT access mode.
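The shared-address SIMT read can be sketched as one address fanning out to L sub-row vectors from L vertically consecutive blocks. A hedged sketch under the same assumed ordering (blocks top-to-bottom, then left-to-right); the function name `simt_read_sub_rows` is illustrative:

```python
def simt_read_sub_rows(cells, n, m, l, addr):
    # One address selects L sub-row vectors, one from each of L vertically
    # consecutive blocks (the shared row-address space of the text).
    rows_per_block = n // m
    group_id, row_in_block = divmod(addr, rows_per_block)
    groups_per_col = m // l                  # block groups per block column
    block_col, group_row = divmod(group_id, groups_per_col)
    out = []
    for t in range(l):                       # one sub-row per thread
        block_row = group_row * l + t
        row = block_row * rows_per_block + row_in_block
        col0 = block_col * rows_per_block
        out.append(cells[row][col0 : col0 + rows_per_block])
    return out

# 8x8 array, 2x2 blocks, L=2 threads: address 0 hits row 0 of blocks (0,0), (1,0)
cells = [[r * 8 + c for c in range(8)] for r in range(8)]
print(simt_read_sub_rows(cells, 8, 2, 2, 0))
# -> [[0, 1, 2, 3], [32, 33, 34, 35]]
```

Each returned sub-row would go to a different thread's V processing elements, which is the multithreaded data access the paragraph describes.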
Fig. 10 is a schematic diagram of the column address space of the matrix register of the present invention in M-thread mode when partitioned. When the block-mode field in SR indicates a block mode of M*M and the thread-mode field indicates that the current operation is multithreaded (with L threads), the matrix register consists of N*M/L sub-column-vector registers, each comprising N/M storage units E0,i-EN/M-1,i (i=0,1,2,...,N*M/L-1). The functional units of the vector operation units can read or write L sub-column-vector registers, with a data bandwidth of (L*N/M)*W bits per access. The sub-column-vector registers are addressed according to the following rule: between blocks, addresses follow top-to-bottom then left-to-right order; within a block, addresses increase linearly from left to right; and L consecutive blocks in the row direction share one column-address space. The vector operation units can access L different sub-column vectors of the matrix register through different column addresses; these L sub-column vectors share the same address but come from different matrix data blocks, i.e. the L different sub-column vectors are in essence one multithreaded data access. Fig. 10 shows the sub-column-vector addressing when L equals M. Different values of M realize support for SIMD operations of different lengths, supporting multi-width SIMD access; different values of L realize thread-level parallelism of different granularities, supporting the multi-granularity SIMT access mode.
Fig. 11 is a schematic diagram of the decoding path of the matrix register of the present invention. The decoding path consists of the address decoding logic, the read/write data buffer units, and the read/write data buses. When the vector processing units read or write the matrix register, the address decoding logic performs address decoding according to the read/write address, the row/column select signal, and SR, selecting one or more row/column vectors, or one or more sub-row/sub-column vectors, to read or write. For a read of the matrix register, once a storage unit has been selected by the decoding logic, its contents are read out onto the read-data bus of the row in which that unit resides and then delivered to the read-data buffer; the read-data buffer unit organizes the data from the different storage units into vector form and returns it to the vector operation units. For a write to the matrix register, the write-data buffer unit splits the vector data from the vector operation units into the multiple data items to be written to different storage units; once a storage unit has been selected by the decoding logic, the contents to be written are placed on the write-data bus of the row in which that unit resides and written into the unit when the clock is asserted.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as within the protection scope of the present invention.