CN102012803B - Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT - Google Patents

Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Info

Publication number
CN102012803B
Authority
CN
China
Prior art keywords
register
vector
matrix
matrix register
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010559458.2A
Other languages
Chinese (zh)
Other versions
CN102012803A (en)
Inventor
陈书明
张凯
陈海燕
万江华
彭元喜
刘仲
阳柳
杨惠
刘蓬侠
胡春媚
唐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201010559458.2A
Publication of CN102012803A
Application granted
Publication of CN102012803B
Legal status: Active

Abstract

The invention relates to a configurable matrix register unit supporting multi-width single instruction, multiple data (SIMD) and multi-granularity single instruction, multiple threads (SIMT) operation. The unit comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer power of 2 and M is either 0 or a positive integer power of 2. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is $\log_2 C + \log_2 T$, where C is the number of block modes of the matrix register and T is the number of multithread modes the vector processor can handle. The advantages of the unit are that its principle is simple, it is simple and convenient to operate, its block size and thread count can be configured flexibly, and it supports access to vector data in multi-width SIMD and multi-granularity SIMT modes at the same time.

Description

Configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT
Technical field
The present invention relates mainly to the design of vector registers in vector processors, and in particular to a matrix register whose block size and thread count are configurable, so as to support vector operation units that access data at multiple widths and multiple granularities in single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT) modes.
Background
With in-depth research into 4G wireless communication and video and image processing technology, vector processors have come into wide use. Evolving wireless communication protocols and video and image processing algorithms require large numbers of fast matrix operations, such as channel estimation, MIMO equalization and the DCT. The parallel granularity of the matrix operations differs between algorithms, as does the size of the matrix blocks they process. Only a vector processor that efficiently supports matrix operations of these differing counts and block sizes can adapt well to such data-intensive applications and meet real-time processing requirements.
The core algorithms of wireless communication protocols and of video and image processing usually exhibit data-level parallelism and thread-level parallelism at the same time. Vector processors targeting these applications typically adopt a very long instruction word (VLIW), SIMD architecture, and may also support SIMT, in order to obtain sufficient parallel computing capability. These two classes of algorithms also share the following characteristic: as protocols evolve rapidly, the vector lengths the algorithms handle keep changing, and so does the thread-level parallelism that can be exploited. In 3G wireless communication protocols, for example, protocol evolution has continually changed the number of antennas at base stations and handheld terminals, so the vector length in the channel equalization matrix keeps changing as well. This means that both the width of the vector data a vector processing unit must handle and the number of threads it processes simultaneously are changing. These characteristics place strong demands on whether a vector processor can, at the architecture level, provide sufficiently effective support for multi-width SIMD and multi-granularity SIMT processing. The present invention therefore proposes a matrix register with configurable block size and thread count, which can meet the vector operation demands of different parallel granularities and block sizes.
The memory cell array of a matrix register generally consists of N*M storage units (M and N being integers greater than 1), each typically 4, 8, 12, 16 or 32 bits wide. Logically, the array can be viewed as N row vector registers $VR_0$ to $VR_{N-1}$ or as M column vector registers $CVR_0$ to $CVR_{M-1}$; N and M are generally powers of 2. Each row vector register comprises M elements (storage units) $E_{i,0}$ to $E_{i,M-1}$ ($i = 0, 1, 2, \ldots, N-1$), and each column vector register comprises N elements $E_{0,i}$ to $E_{N-1,i}$ ($i = 0, 1, 2, \ldots, M-1$). When read/write enable is asserted, the matrix register reads or writes a row or column vector under the control of the read/write address and the row/column select signal.
Existing work provides access to fixed-size data blocks of such a matrix register. These techniques read or write one row vector or column vector of the matrix at a time, and the vector length is fixed. When the required vector length is greater or smaller than this fixed length, the usual approach is either to combine several short vectors into one long vector for parallel processing, or to split one long vector into several short vectors that are processed step by step. Such schemes cannot flexibly handle matrix data of different sizes: they do not support multi-width SIMD processing, nor do they support accessing multiple matrix data blocks simultaneously in a multi-granularity SIMT manner. They therefore lack flexibility and cannot exploit enough parallelism, thread-level parallelism in particular.
In summary, how to process matrix data efficiently and flexibly in a vector processor remains an active research problem in this field. The goal is to give the multi-granularity SIMT and multi-width SIMD processing of the vector processor a flexible and sufficient number of parallel operations, and to improve the parallel processing efficiency of vector processors and array processors, so as to meet the large-scale matrix operation demands of applications such as wireless communication and image processing.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a matrix register unit that is simple in principle and easy to operate, whose block size and thread count can be configured flexibly, and that supports accessing vector data in multi-width SIMD and multi-granularity SIMT modes at the same time.
To solve the above technical problem, the present invention adopts the following technical solution:
A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer power of 2, and M is either an integer greater than 0 that is a power of 2, or 0; M = 0 indicates that the matrix register is not partitioned. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is $\log_2 C + \log_2 T$, where C is the number of block modes of the matrix register and T is the maximum number of multithread modes the vector processor can handle.
As further improvements of the present invention:
When M is 0, the matrix register is not partitioned, and the vector operation unit can access one row vector or one column vector of the matrix register at a time. When M is not 0, the vector operation unit accesses one or more equal-length sub-row vectors or sub-column vectors in the matrix register, depending on the number of threads being processed simultaneously; these equal-length sub-row or sub-column vectors come from different blocks of the matrix.
When the vector operation unit accesses the matrix register, the address decoding logic decodes according to the contents of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.
The control register SR is either an independent control register or is stored in the reserved bits of another control register, in which case the number of reserved bits of that control register is an integer greater than $\log_2 C + \log_2 T$.
Compared with the prior art, the advantages of the present invention are as follows. The matrix register unit supports accessing vector data in multi-width SIMD and multi-granularity SIMT modes, is simple in principle and easy to operate, and allows block size and thread count to be configured flexibly. It enables efficient and flexible processing of matrix data in a vector processor and provides a flexible and sufficient number of parallel operations for multi-granularity SIMT and multi-width SIMD processing, thereby improving the parallel processing efficiency of vector processors and array processors and meeting the large-scale matrix operation demands of applications such as wireless communication and image processing.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall structure of the matrix register of the present invention;
Fig. 2 is a schematic diagram of the architecture of the vector processor;
Fig. 3 is a schematic diagram of the structure of the SR register in the present invention;
Fig. 4 is a schematic diagram of the memory cell array structure of the matrix register of the present invention;
Fig. 5 is a schematic diagram of the row address space when the matrix register of the present invention is not partitioned;
Fig. 6 is a schematic diagram of the column address space when the matrix register of the present invention is not partitioned;
Fig. 7 is a schematic diagram of the row address space in single-thread mode when the matrix register of the present invention is partitioned;
Fig. 8 is a schematic diagram of the column address space in single-thread mode when the matrix register of the present invention is partitioned;
Fig. 9 is a schematic diagram of the row address space in M-thread mode when the matrix register of the present invention is partitioned;
Fig. 10 is a schematic diagram of the column address space in M-thread mode when the matrix register of the present invention is partitioned;
Fig. 11 is a schematic diagram of the decoding path in the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows the overall structure of the matrix register of the present invention. The configurable matrix register unit of the present invention, which supports multi-width SIMD and multi-granularity SIMT, comprises a matrix register and a control register SR.
When the read/write enable signal is asserted, the address decoding logic decodes the read/write address under the control of the row/column select signal and the control register SR, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing. The matrix register consists of an N*N array of storage units, each W bits wide, for a total capacity of N*N*W bits. The matrix register can be partitioned by configuring the relevant field of the control register SR. When the contents of SR indicate that the block mode is M*M, the matrix register is divided into M parts by row and, at the same time, into M parts by column, giving M*M sub-blocks, each of size (N/M)*(N/M). M is generally 0, 2, 4 or 8: it is either an integer greater than 0 that is a power of 2, or 0, and it does not exceed N/2. When M is 0, the matrix register is not partitioned (this can also be called block mode 0*0). When M is not 0, each row vector register or column vector register is logically divided into M sub-row vector registers or sub-column vector registers, each comprising N/M storage units. In that case the unit length for reads and writes of the matrix register by the vector operation unit is the length of a sub-row or sub-column vector register, and one or more sub-row vectors or sub-column vectors can be selected for each read or write. In the present invention, the functional units of the vector operation unit can access the matrix register in the same way as they access general-purpose registers. The matrix register also provides row access, can serve vector reads and writes of different vector lengths, and supports SIMT-style access as well. The maximum bandwidth of each read or write of the matrix register is N*W bits.
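The sizes stated above can be checked with a short Python sketch. The function name `partition_geometry` and the returned dictionary keys are hypothetical, and treating the unpartitioned register (M = 0) as a single block is an assumption made for illustration, not something the text states.

```python
def partition_geometry(n, m, w):
    """Derived sizes for an n*n matrix register of w-bit cells in m*m block mode.

    m == 0 means "not partitioned", following the convention in the text.
    """
    def is_pow2(x):
        return x > 0 and (x & (x - 1)) == 0

    if not is_pow2(n):
        raise ValueError("n must be a positive power of 2")
    if m != 0 and (not is_pow2(m) or m > n // 2):
        raise ValueError("m must be 0 or a power of 2 no larger than n/2")

    # Assumption: an unpartitioned register behaves like one n*n block.
    splits = m if m else 1
    return {
        "blocks": splits * splits,
        "block_shape": (n // splits, n // splits),   # (N/M) * (N/M)
        "sub_vector_len": n // splits,               # N/M cells per sub-vector
        "capacity_bits": n * n * w,                  # N*N*W
        "max_bandwidth_bits": n * w,                 # N*W per read or write
    }
```

For example, with N = 16, M = 4 and W = 32, the register holds 16 blocks of 4*4 cells and each sub-row or sub-column vector has 4 elements.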
Fig. 3 shows the structure of the control register SR in the present invention. SR records the current block mode of the matrix register and the number of threads the arithmetic unit is processing simultaneously. To reduce cost and improve design reusability, the present invention treats SR as a control register of the processor, so the programmer can access SR with the existing control-register access instructions and no additional instructions are needed. The bit width of the valid data in SR is the smallest integer greater than ($\log_2 C + \log_2 T$), where C is the number of block modes of the matrix register and T is the maximum number of multithread modes the vector processor can handle; C and T are integers greater than 1. SR can exist independently as a dedicated control register. Alternatively, since the control registers of a processor generally have reserved bits, no extra SR register is needed if the number of reserved bits in an existing control register is greater than or equal to $\log_2 C + \log_2 T$; otherwise a separate control register SR must be added. In either case, the programmer can access SR with the existing control-register access instructions, so SR can be configured dynamically without adding any instructions.
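As a rough illustration of how the two SR fields could be sized and packed, consider the following Python sketch. The field order (block mode in the low bits, thread mode above it), the per-field rounding up to whole bits, and all function names are assumptions of this sketch; the text fixes only the total width in terms of $\log_2 C + \log_2 T$.

```python
def sr_field_widths(c, t):
    """Bits assumed for the block-mode field and the thread-mode field."""
    if c < 2 or t < 2:
        raise ValueError("C and T must be integers greater than 1")
    # (x - 1).bit_length() equals ceil(log2(x)) for x >= 1.
    return (c - 1).bit_length(), (t - 1).bit_length()


def pack_sr(block_mode_idx, thread_mode_idx, c, t):
    """Pack the two mode indices into one integer (hypothetical layout)."""
    block_bits, _ = sr_field_widths(c, t)
    if not 0 <= block_mode_idx < c or not 0 <= thread_mode_idx < t:
        raise ValueError("mode index out of range")
    return (thread_mode_idx << block_bits) | block_mode_idx


def unpack_sr(value, c, t):
    """Recover (block_mode_idx, thread_mode_idx) from a packed value."""
    block_bits, thread_bits = sr_field_widths(c, t)
    block = value & ((1 << block_bits) - 1)
    thread = (value >> block_bits) & ((1 << thread_bits) - 1)
    return block, thread


def fits_in_reserved_bits(reserved_bits, c, t):
    """True if an existing control register has room for both fields."""
    return reserved_bits >= sum(sr_field_widths(c, t))
```

With C = 4 block modes (for instance M = 0, 2, 4, 8) and T = 4 thread modes, each field takes 2 bits under these assumptions, so SR fits into any control register with at least 4 reserved bits.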
The vector processing unit must supply three signals for each access to the matrix register: read/write enable, read/write address, and row/column select. The address decoding logic decodes according to the contents of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.
The present invention defines a complete address mapping scheme. Under different block modes and thread modes the matrix register presents different address views, which give the programmer complete and flexible access modes. The rules of the address mapping scheme are as follows. Blocks are ordered top to bottom first and then left to right. Within a block, row addresses increase linearly from top to bottom and column addresses increase linearly from left to right. When the number of threads is L (L ≤ max(M)), L consecutive blocks in the row direction share one column vector address space, and L consecutive blocks in the column direction share one row vector address space. With L threads, the vector processing unit accesses L sub-row vectors or L sub-column vectors of the matrix register simultaneously; each sub-row or sub-column vector is processed by V vector processing units, where V is the length of the sub-row or sub-column vector.
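One possible reading of these rules, for row addresses only, is sketched below in Python. The function name `sub_row_cells`, the zero-based addresses, the requirement that L divide M, and the exact ordering of block groups (down a block column first, then on to the next block column) are assumptions of this sketch; the figures of the patent, which are not reproduced here, define the authoritative layout.

```python
def sub_row_cells(n, m, l, addr):
    """Cells touched by one row-address access in m*m block mode with l threads.

    Returns a list of l sub-row vectors, each a list of (row, col) pairs of
    length n // m. Thread k reads the k-th vertically stacked block of the
    group that shares this address.
    """
    if m <= 0 or l <= 0 or n % m or m % l:
        raise ValueError("need m > 0, l > 0, m dividing n, and l dividing m")
    size = n // m                      # rows (and columns) per block
    if not 0 <= addr < n * m // l:
        raise IndexError("row address out of range")

    group, row_in_block = divmod(addr, size)
    groups_per_block_col = m // l      # groups of l stacked blocks per block column
    block_col, group_row = divmod(group, groups_per_block_col)

    vectors = []
    for thread in range(l):
        block_row = group_row * l + thread
        row = block_row * size + row_in_block
        vectors.append([(row, block_col * size + c) for c in range(size)])
    return vectors
```

Under this reading, with N = 8, M = 2 and L = 2, address 0 returns the first four cells of row 0 for thread 0 and the first four cells of row 4 for thread 1, and the N*M/L = 8 addresses together touch every cell of the register exactly once.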
Fig. 2 shows the architecture of the vector processor in the present invention. A vector processor generally consists of N parallel processing elements (PEs). Each PE has i functional units, which may be MAC, ALU, division or shift units and so on, and each functional unit can read and write the matrix register as required. The N PEs together form the SIMD processing mode, while each PE generally adopts a VLIW structure internally, with multiple functional units operating in parallel. The cell array of the matrix register is of size N*N and logically forms N row vector registers $VR_0$ to $VR_{N-1}$ and N column vector registers $CVR_0$ to $CVR_{N-1}$. Each row vector register $VR_i$ comprises N storage units $E_{i,0}$ to $E_{i,N-1}$ ($i = 0, 1, 2, \ldots, N-1$), and each column vector register $CVR_i$ comprises N storage units $E_{0,i}$ to $E_{N-1,i}$ ($i = 0, 1, 2, \ldots, N-1$). The matrix register is designed with multiple read/write ports, so it can supply source operands to multiple functional units of the N PEs and can also accept simultaneous writes from multiple functional units of the N PEs.
Fig. 4 shows the memory cell array structure of the matrix register in the present invention. The array generally consists of N*N storage units, where N is generally a power of 2. Each storage unit is W bits wide, with W generally 4, 8, 12, 16 or 32. Logically, the array can be viewed as N row vector registers $VR_0$ to $VR_{N-1}$ or as N column vector registers $CVR_0$ to $CVR_{N-1}$, and each row vector register comprises N elements (storage units) $E_{i,0}$ to $E_{i,N-1}$ ($i = 0, 1, 2, \ldots, N-1$). Taking $VR_0$ as an example, this row vector register comprises storage units $E_{0,0}$ to $E_{0,N-1}$. Divided by column, the array forms N storage columns of N*W bits each, every column consisting of the N elements in the same column position. These N storage columns correspond one to one with the N column vector registers $CVR_0$ to $CVR_{N-1}$ and implement access to the corresponding column vector register. Taking $CVR_{N-1}$ as an example, this column vector register comprises the last element $E_{i,N-1}$ ($i = 0, 1, 2, \ldots, N-1$) of every row vector register $VR_0$ to $VR_{N-1}$.
Fig. 5 shows the row address space when the matrix register of the present invention is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-thread operation. The matrix register consists of N row vector registers, each comprising N storage units $E_{i,0}$ to $E_{i,N-1}$ ($i = 0, 1, 2, \ldots, N-1$). A functional unit of the vector operation unit can read or write one row vector register at a time, and the data bandwidth of each read or write is N*W bits. The addresses of the row vector registers increase linearly from top to bottom. The vector operation unit accesses different row vectors of the matrix register by supplying different row addresses.
Fig. 6 shows the column address space when the matrix register of the present invention is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-thread operation. The matrix register consists of N column vector registers, each comprising N storage units $E_{0,i}$ to $E_{N-1,i}$ ($i = 0, 1, 2, \ldots, N-1$). A functional unit of the vector operation unit can read or write one column vector register at a time, and the data bandwidth of each read or write is N*W bits. The addresses of the column vector registers increase linearly from left to right. The vector operation unit accesses different column vectors of the matrix register by supplying different column addresses.
Fig. 7 shows the row address space in single-thread mode when the matrix register of the present invention is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates that the current operation is single-threaded, the matrix register consists of N*M sub-row vector registers, each comprising N/M storage units $E_{i,0}$ to $E_{i,N/M-1}$ ($i = 0, 1, 2, \ldots, (N-1) \cdot M$). A functional unit of the vector operation unit can read or write one sub-row vector register at a time, and the data bandwidth of each read or write is (N/M)*W bits. The sub-row vector registers are addressed by the following rule: blocks are ordered top to bottom first and then left to right, and within a block addresses increase linearly from top to bottom. The vector operation unit accesses different sub-row vectors of the matrix register by supplying different row addresses. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access.
Fig. 8 shows the column address space in single-thread mode when the matrix register of the present invention is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates that the current operation is single-threaded, the matrix register consists of N*M sub-column vector registers, each comprising N/M storage units $E_{0,i}$ to $E_{N/M-1,i}$ ($i = 0, 1, 2, \ldots, (N-1) \cdot M$). A functional unit of the vector operation unit can read or write one sub-column vector register at a time, and the data bandwidth of each read or write is (N/M)*W bits. The sub-column vector registers are addressed by the following rule: blocks are ordered top to bottom first and then left to right, and within a block addresses increase linearly from left to right. The vector operation unit accesses different sub-column vectors of the matrix register by supplying different column addresses. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access.
Fig. 9 shows the row address space in M-thread mode when the matrix register of the present invention is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates that the current operation is multithreaded (with L threads), the matrix register consists of N*M/L sub-row vector registers, each comprising N/M storage units $E_{i,0}$ to $E_{i,N/M-1}$ ($i = 0, 1, 2, \ldots, (N-1) \cdot M/L$). A functional unit of the vector operation unit can read or write L sub-row vector registers at a time, and the data bandwidth of each read or write is (L*N/M)*W bits. The sub-row vector registers are addressed by the following rule: blocks are ordered top to bottom first and then left to right, addresses within a block increase linearly from top to bottom, and L consecutive blocks in the column direction share one row address space. For each row address, the vector operation unit accesses L different sub-row vectors of the matrix register. These L sub-row vectors share the same address but come from different matrix data blocks, so accessing them is in essence a single multithreaded data access. Fig. 9 shows the sub-row vector addressing when L equals M. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
Fig. 10 shows the column address space in M-thread mode when the matrix register of the present invention is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates that the current operation is multithreaded (with L threads), the matrix register consists of N*M/L sub-column vector registers, each comprising N/M storage units $E_{0,i}$ to $E_{N/M-1,i}$ ($i = 0, 1, 2, \ldots, (N-1) \cdot M/L$). A functional unit of the vector operation unit can read or write L sub-column vector registers at a time, and the data bandwidth of each read or write is (L*N/M)*W bits. The sub-column vector registers are addressed by the following rule: blocks are ordered top to bottom first and then left to right, addresses within a block increase linearly from left to right, and L consecutive blocks in the row direction share one column address space. For each column address, the vector operation unit accesses L different sub-column vectors of the matrix register. These L sub-column vectors share the same address but come from different matrix data blocks, so accessing them is in essence a single multithreaded data access. Fig. 10 shows the sub-column vector addressing when L equals M. Different values of M give support for SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
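The address counts and per-access bandwidths given for Figs. 5 to 10 can be tabulated with a small Python helper. The name `access_profile`, the dictionary keys, and the requirement that L divide M are hypothetical choices made for this sketch.

```python
def access_profile(n, m, l, w):
    """Address-space size and per-access data volume for one SR setting.

    n: register dimension, m: block mode (0 = not partitioned),
    l: number of threads, w: cell width in bits.
    """
    if m == 0:
        # The unpartitioned register supports single-thread operation only.
        if l != 1:
            raise ValueError("unpartitioned mode is single-threaded")
        return {"addresses": n, "vectors_per_access": 1,
                "elements_per_vector": n, "bits_per_access": n * w}
    if m < 0 or l < 1 or n % m or m % l:
        raise ValueError("need m dividing n and l dividing m")
    sub_len = n // m
    return {
        "addresses": n * m // l,            # N*M/L sub-vector addresses
        "vectors_per_access": l,            # one sub-vector per thread
        "elements_per_vector": sub_len,     # N/M cells
        "bits_per_access": l * sub_len * w, # (L*N/M)*W
    }
```

For N = 16 and W = 32, block mode 4*4 with one thread gives 64 addresses of 128 bits each, while the same block mode with four threads gives 16 addresses of 512 bits each, which matches the N*W maximum bandwidth of the unpartitioned register.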
Fig. 11 shows the decoding path of the matrix register of the present invention. The decoding path consists of the address decoding logic, the read and write data buffer units, and the read and write data buses. When the vector processing unit reads or writes the matrix register, the address decoding logic decodes the address according to the read/write address, the row/column select signal and the contents of SR, and selects one or more row or column vectors, or one or more sub-row or sub-column vectors, for reading or writing. For a read, once a storage unit has been selected by the decoding logic, its contents are read out onto the read data bus of the row in which the storage unit is located and then delivered to the read data buffer. The read data buffer unit assembles the data from the different storage units into vector form and returns it to the vector operation unit. For a write, the write data buffer unit splits the vector data from the vector operation unit into the individual data items to be written to the different storage units. Once a storage unit has been selected by the decoding logic, the data to be written is placed on the write data bus of the row in which the storage unit is located and is written into the storage unit when the clock is active.
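The read and write flow through the decoding path can be mimicked by a toy software model. The Python sketch below covers only the unpartitioned mode, and the class name `MatrixRegisterModel` and its methods are hypothetical; it illustrates the decode, gather and scatter steps and is not a description of the hardware design.

```python
class MatrixRegisterModel:
    """Toy model of the decode path for an unpartitioned n*n matrix register."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def decode(self, addr, select_column):
        """Address decode: choose the cells of one row or one column."""
        if not 0 <= addr < self.n:
            raise IndexError("address out of range")
        if select_column:
            return [(i, addr) for i in range(self.n)]
        return [(addr, j) for j in range(self.n)]

    def read(self, addr, select_column=False):
        # Read buffer: gather the selected cells into one vector.
        return [self.cells[r][c] for r, c in self.decode(addr, select_column)]

    def write(self, addr, vector, select_column=False):
        selected = self.decode(addr, select_column)
        if len(vector) != len(selected):
            raise ValueError("vector length must match the selected cells")
        # Write buffer: split the vector into one datum per selected cell.
        for (r, c), value in zip(selected, vector):
            self.cells[r][c] = value
```

Writing `[1, 2, 3, 4]` to row 1 of a 4*4 model and then reading column 2 returns `[0, 3, 0, 0]`, showing that row and column views address the same storage units.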
The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions that fall within the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (2)

1. A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that: it comprises a matrix register and a control register SR; the matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer power of 2, and M is either an integer greater than 0 that is a power of 2, or 0; the control register SR records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit; the width of the control register SR is the smallest integer greater than $\log_2 C + \log_2 T$, where C is the number of block modes of the matrix register and T is the maximum number of multithread modes the vector processor can handle;
when M is 0, the matrix register is not partitioned, and the vector operation unit can access one row vector or one column vector of the matrix register at a time; when M is not 0, the vector operation unit accesses one or more equal-length sub-row vectors or sub-column vectors in the matrix register, depending on the number of threads being processed simultaneously, and these equal-length sub-row or sub-column vectors come from different blocks of the matrix;
when the vector operation unit accesses the matrix register, the address decoding logic decodes according to the contents of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing;
SIMD denotes single instruction, multiple data, and SIMT denotes single instruction, multiple threads.
2. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, characterized in that: the control register SR is an independent control register, or is stored in the reserved bits of another control register, the number of reserved bits of that control register being an integer greater than $\log_2 C + \log_2 T$.
CN201010559458.2A 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT Active CN102012803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010559458.2A CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010559458.2A CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Publications (2)

Publication Number Publication Date
CN102012803A (en) 2011-04-13
CN102012803B (en) 2014-09-10

Family

ID=43842979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010559458.2A Active CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Country Status (1)

Country Link
CN (1) CN102012803B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102447462B (en) * 2011-12-13 2013-08-28 北京控制工程研究所 Over current (OC) instruction matrix circuit with impact resistance
US9021237B2 (en) * 2011-12-20 2015-04-28 International Business Machines Corporation Low latency variable transfer network communicating variable written to source processing core variable register allocated to destination thread to destination processing core variable register allocated to source thread
CN103294623B (en) * 2013-03-11 2016-04-27 浙江大学 A kind of multi-thread dispatch circuit of configurable SIMD system
US11544214B2 (en) * 2015-02-02 2023-01-03 Optimum Semiconductor Technologies, Inc. Monolithic vector processor configured to operate on variable length vectors using a vector length register
GB2552154B (en) * 2016-07-08 2019-03-06 Advanced Risc Mach Ltd Vector register access
CN106484519B (en) * 2016-10-11 2019-11-08 东南大学苏州研究院 Asynchronous thread recombination method and SIMT processor based on this method
WO2018169911A1 (en) * 2017-03-14 2018-09-20 Yuan Li Reconfigurable parallel processing
CN109684602B (en) * 2018-12-29 2023-06-06 上海商汤智能科技有限公司 Batch processing method and device and computer readable storage medium
CN111158874A (en) * 2019-12-20 2020-05-15 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112346783B (en) * 2020-11-05 2022-11-22 海光信息技术股份有限公司 Processor and operation method, device, equipment and medium thereof
CN114565075A (en) * 2020-11-27 2022-05-31 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for supporting multiple access modes
WO2023236013A1 (en) * 2022-06-06 2023-12-14 Intel Corporation Data re-arrangement by mixed simt and simd execution mode

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5513366A (en) * 1994-09-28 1996-04-30 International Business Machines Corporation Method and system for dynamically reconfiguring a register file in a vector processor
CN1180864A (en) * 1996-08-19 1998-05-06 三星电子株式会社 Single-instruction-multiple-data processing in multimedia signal processor and device thereof
CN1684058A (en) * 2004-04-16 2005-10-19 索尼株式会社 Processor
CN101776988A (en) * 2010-02-01 2010-07-14 中国人民解放军国防科学技术大学 Restructurable matrix register file with changeable block size

Also Published As

Publication number Publication date
CN102012803A (en) 2011-04-13

Similar Documents

Publication Publication Date Title
CN102012803B (en) Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT
US8422330B2 (en) Memory controller and memory controlling method
EP3035204B1 (en) Storage device and method for performing convolution operations
EP3035249B1 (en) Method and apparatus for distributed and cooperative computation in artificial neural networks
US9047193B2 (en) Processor-cache system and method
KR101710116B1 (en) Processor, Apparatus and Method for memory management
US20070239970A1 (en) Apparatus For Cooperative Sharing Of Operand Access Port Of A Banked Register File
US11342944B2 (en) Computational memory with zero disable and error detection
US6804771B1 (en) Processor with register file accessible by row column to achieve data array transposition
US11468002B2 (en) Computational memory with cooperation among rows of processing elements and memory thereof
EP3035203A1 (en) Fine-grain storage interface and method for low power accelerators
US9442893B2 (en) Product-sum operation circuit and product-sum operation system
US10599586B2 (en) Information processing apparatus, memory control circuitry, and control method of information processing apparatus
US20110087859A1 (en) System cycle loading and storing of misaligned vector elements in a simd processor
WO2013186155A1 (en) An element selection unit and a method therein
US20150178217A1 (en) 2-D Gather Instruction and a 2-D Cache
Fan et al. A parallel-access mapping method for the data exchange buffers around DCT/IDCT in HEVC encoders based on single-port SRAMs
US20180232207A1 (en) Arithmetic processing apparatus and control method for arithmetic processing apparatus
US20140164706A1 (en) Multi-core processor having hierarchical cahce architecture
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm
US20210042111A1 (en) Efficient encoding of high fanout communications
US20090077325A1 (en) Method and arrangements for memory access
Kumaki et al. CAM enhanced super parallel SIMD processor with high-speed pattern matching capability
JP2023509813A (en) SIMT command processing method and device
CN115883006A (en) Apparatus and method for MIMO decoding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CI01 Correction of invention patent gazette

Correction item: Abstract|Description

Correct: Abstract and description are correct

False: Abstract and description are wrong

Number: 15

Volume: 27

CI02 Correction of invention patent application

Correction item: Abstract|Description

Correct: Abstract and description are correct

False: Abstract and description are wrong

Number: 15

Page: The title page

Volume: 27

ERR Gazette correction

Free format text: CORRECT: ABSTRACT; DESCRIPTION; FROM: ABSTRACT, DESCRIPTION IS WRONG TO: ABSTRACT, DESCRIPTION IS CORRECT

C14 Grant of patent or utility model
GR01 Patent grant