CN102012803A - Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT - Google Patents

Info

Publication number
CN102012803A
Authority
CN
China
Prior art keywords
vector
register
matrix
matrix register
simt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105594582A
Other languages
Chinese (zh)
Other versions
CN102012803B (en)
Inventor
陈书明
张凯
陈海燕
万江华
彭元喜
刘仲
阳柳
杨惠
刘蓬侠
胡春媚
唐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201010559458.2A
Publication of CN102012803A
Application granted
Publication of CN102012803B
Legal status: Active
Anticipated expiration

Abstract

The invention relates to a configurable matrix register unit supporting multi-width single instruction, multiple data (SIMD) and multi-granularity single instruction, multiple threads (SIMT) operation. The unit comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is log2(C) + log2(T), where C is the number of block modes of the matrix register and T is the number of multithreading modes the vector processor can handle. The unit is simple in principle and easy to operate, its block size and thread count can be configured flexibly, and it supports access to vector data in both multi-width SIMD and multi-granularity SIMT modes.

Description

Configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT
Technical field
The present invention relates mainly to the design of vector registers in vector processors, and in particular to a matrix register whose block size and thread count are configurable. It lets vector operation units that work in single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT) modes access data at multiple widths and multiple granularities.
Background art
With in-depth research into 4G wireless communication and video/image processing technology, vector processors have come into wide use. Rapidly evolving wireless communication protocols and video/image processing algorithms require large numbers of matrix operations, such as channel estimation, MIMO equalization and the DCT. The parallel granularity of these matrix operations differs from algorithm to algorithm, as does the size of the matrix blocks they process. Only a vector processor that efficiently supports matrix operations of these different counts and block sizes can adapt well to such data-intensive applications and meet their real-time processing requirements.
The core algorithms of wireless communication protocols and video/image processing usually exhibit data-level and thread-level parallelism at the same time. Vector processors aimed at these applications typically adopt a very long instruction word (VLIW), SIMD architecture and may also support SIMT in order to obtain enough parallel computing capability. Both classes of algorithm share a further characteristic: as protocols evolve quickly, the vector length an algorithm handles keeps changing, and so does the thread-level parallelism that can be exploited. In 3G wireless communication protocols, for example, protocol evolution keeps changing the number of antennas at base stations and handheld terminals, which in turn changes the vector length in the channel equalization matrix. The width of the vector data the vector processing unit must handle, and the number of threads it processes simultaneously, therefore both vary. These characteristics place a strong demand on the vector processor to support multi-width SIMD and multi-granularity SIMT processing effectively at the architecture level. The present invention therefore proposes a matrix register with configurable block size and thread count that meets the vector operation demands of algorithms with different parallel granularities and block sizes.
The storage cell array of a matrix register generally consists of N*M storage cells (M and N are integers greater than 1), each typically 4, 8, 12, 16 or 32 bits wide. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or M column vector registers CVR_0 to CVR_{M-1}; N and M are generally powers of 2. Each row vector register contains M elements (storage cells) E_{i,0} to E_{i,M-1} (i = 0, 1, 2, ..., N-1), and each column vector register contains N elements E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., M-1). Row and column vectors are read and written under the control of the read/write enable, the read/write address and the row/column select signal.
Existing work provides access to such a matrix register only in data blocks of fixed size. These techniques read or write one row vector or one column vector of the matrix at a time, and the vector length is fixed. When the required vector length is greater or smaller than this fixed length, the usual approach is either to combine several short vectors into one long vector and process them in parallel, or to split a long vector into several short vectors and process them step by step. Such schemes cannot handle matrix data of different sizes flexibly, do not support multi-width SIMD processing, and do not support accessing several matrix data blocks simultaneously in multi-granularity SIMT fashion. They therefore offer too little flexibility and cannot exploit enough parallelism, thread-level parallelism in particular.
In summary, how to process matrix data efficiently and flexibly in a vector processor, how to supply flexible and sufficient parallel operations for its multi-granularity SIMT and multi-width SIMD processing, and how to improve the parallel processing efficiency of vector and array processors so as to meet the demand for large-scale matrix operations in applications such as wireless communication and image processing, remain active research topics in this field.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the problems in the prior art, the invention provides a matrix register unit that is simple in principle and easy to operate, whose block size and thread count can be configured flexibly, and that supports access to vector data in both multi-width SIMD and multi-granularity SIMT modes.
To solve this technical problem, the invention adopts the following technical solution:
A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is log2(C) + log2(T), where C is the number of block modes of the matrix register and T is the number of multithreading modes the vector processor can handle.
As further improvements of the present invention:
When M is 0, the matrix register is not partitioned, and the vector operation unit accesses one row vector or one column vector of the matrix register at a time. When M is not 0, the vector operation unit accesses one or more sub-row vectors or sub-column vectors of equal length in the matrix register, depending on the number of threads processed simultaneously; these equal-length sub-row or sub-column vectors come from different blocks.
When the vector operation unit accesses the matrix register, the address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row or sub-column vectors, for reading or writing.
The control register SR is either an independent control register or is stored in the reserved bits of another control register, in which case the number of reserved bits in that register is an integer greater than log2(C) + log2(T).
Compared with the prior art, the advantages of the invention are as follows. The matrix register unit supports access to vector data in both multi-width SIMD and multi-granularity SIMT modes, is simple in principle and easy to operate, and allows the block size and thread count to be configured flexibly. It enables efficient and flexible processing of matrix data in a vector processor and supplies flexible and sufficient parallel operations for multi-granularity SIMT and multi-width SIMD processing. This improves the parallel processing efficiency of vector and array processors and meets the demand for large-scale matrix operations in applications such as wireless communication and image processing.
Description of the drawings
Fig. 1 is a diagram of the overall structure of the matrix register of the present invention;
Fig. 2 is a diagram of the architecture of the vector processor;
Fig. 3 is a structural diagram of the SR register of the present invention;
Fig. 4 is a structural diagram of the storage cell array of the matrix register of the present invention;
Fig. 5 is a diagram of the row address space when the matrix register is not partitioned;
Fig. 6 is a diagram of the column address space when the matrix register is not partitioned;
Fig. 7 is a diagram of the row address space in single-thread mode when the matrix register is partitioned;
Fig. 8 is a diagram of the column address space in single-thread mode when the matrix register is partitioned;
Fig. 9 is a diagram of the row address space in M-thread mode when the matrix register is partitioned;
Fig. 10 is a diagram of the column address space in M-thread mode when the matrix register is partitioned;
Fig. 11 is a diagram of the decoding path of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 shows the overall structure of the matrix register of the present invention. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT comprises a matrix register and a control register SR.
When the read/write enable signal is asserted, the address decoding logic decodes the read/write address under the control of the row/column select signal and the control register SR, and selects either a row vector or column vector of the matrix register, or one or more sub-row or sub-column vectors, for reading or writing. The matrix register consists of an N*N storage cell array; each cell is W bits wide, so the storage capacity is N*N*W bits. The matrix register is partitioned by configuring a field of the control register SR. When the content of SR indicates a block mode of M*M, the matrix register is divided into M parts by row and, at the same time, into M parts by column, giving M*M sub-blocks, each of size (N/M)*(N/M). M generally takes the values 0, 2, 4, 8, ... and does not exceed N/2. When M is 0, the matrix register is not partitioned (this can also be called block mode 0*0). When M is not 0, each row vector register or column vector register is logically divided into M sub-row or sub-column vector registers, each containing N/M storage cells. In this case the unit of access from the vector operation unit to the matrix register is the length of one sub-row or sub-column vector register, and each access can select one or more sub-row or sub-column vectors. In the present invention, the functional units of the vector operation unit access the matrix register in the same way as they access general-purpose registers. The matrix register additionally provides column access, serves vector reads and writes of different lengths, and supports SIMT access. The maximum bandwidth of a single read or write of the matrix register is N*W bits.
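The sizes stated in this paragraph can be checked with a short calculation. The following Python sketch is illustrative only: the function name `partition_geometry` and its dictionary keys are hypothetical, and the rule that rejects M = 1 is an assumption drawn from the listed values 0, 2, 4, 8, ...

```python
def partition_geometry(n, m, w):
    """Return basic sizes for an N*N matrix register split into M*M blocks.

    n: cells per row/column (a power of 2)
    m: block mode (0 means the register is not partitioned)
    w: bit width of one storage cell
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("N must be a positive power of 2")
    # Assumption: valid non-zero block modes are powers of 2 from 2 up to N/2.
    if m != 0 and (m < 2 or m & (m - 1) or m > n // 2):
        raise ValueError("M must be 0 or a power of 2 with 2 <= M <= N/2")
    sub_len = n if m == 0 else n // m       # cells per (sub-)vector
    return {
        "capacity_bits": n * n * w,          # N*N*W
        "blocks": 1 if m == 0 else m * m,    # M*M sub-blocks
        "sub_vector_cells": sub_len,         # N/M cells per sub-vector
        "sub_vector_bits": sub_len * w,      # (N/M)*W bits per sub-vector
        "max_access_bits": n * w,            # N*W bits at most per access
    }
```

For example, a 16*16 register of 32-bit cells in block mode 4*4 would hold 16 sub-blocks of 4*4 cells, and each sub-row or sub-column vector would be 4 cells (128 bits) long.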
Fig. 3 is a structural diagram of the control register SR. SR records the current block mode of the matrix register and the number of threads the arithmetic unit is processing simultaneously. To reduce cost and improve design reuse, the invention treats SR as one of the processor's control registers, so the programmer accesses SR with the existing control-register access instructions and no extra instruction is needed. The bit width of the valid data in SR is the smallest integer not less than log2(C) + log2(T), where C is the number of block modes of the matrix register, T is the number of multithreading modes the vector processor can handle, and C and T are integers greater than 1. SR can exist independently as a dedicated control register. Alternatively, because the control registers of a processor generally have reserved bits, if an existing control register has at least log2(C) + log2(T) reserved bits, no extra SR register is needed; otherwise a control register SR must be added. In either case the programmer accesses SR through the existing control-register access instructions, so SR can be configured dynamically without any additional instruction.
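As a rough illustration of the SR width rule, the Python sketch below computes the number of bits needed for C block modes and T thread modes and packs two mode indices into one value. The packing layout (`block_mode * T + thread_mode`) and the function names are assumptions made for this example; the source gives only the total width, not the field layout of Fig. 3.

```python
def sr_width(c, t):
    """Smallest bit count that distinguishes every (block mode, thread mode) pair.

    Equals ceil(log2(C) + log2(T)); integer arithmetic avoids float rounding.
    """
    if c < 2 or t < 2:
        raise ValueError("C and T must be integers greater than 1")
    return (c * t - 1).bit_length()


def sr_encode(block_mode, thread_mode, c, t):
    """Pack two mode indices into one SR value (hypothetical layout)."""
    if not (0 <= block_mode < c and 0 <= thread_mode < t):
        raise ValueError("mode index out of range")
    return block_mode * t + thread_mode


def sr_decode(value, t):
    """Inverse of sr_encode: return (block_mode, thread_mode)."""
    return divmod(value, t)
```

When C and T are both powers of 2, this layout coincides with two adjacent bit fields of log2(C) and log2(T) bits.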
Every access from the vector processing unit to the matrix register must supply three signals: the read/write enable, the read/write address and the row/column select signal. The address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row or sub-column vectors, for reading or writing.
The invention defines a complete address mapping scheme. Under different block modes and thread modes the matrix register presents different address views, which give the programmer a complete and flexible set of access modes. The mapping rules are as follows. Between blocks, addresses run from top to bottom first and then from left to right. Within a block, row addresses increase linearly from top to bottom and column addresses increase linearly from left to right. When the number of threads is L (L ≤ max(M)), L consecutive blocks in the row direction share one column vector address space, and L consecutive blocks in the column direction share one row vector address space. That is, with L threads the vector processing unit accesses L sub-row vectors or L sub-column vectors of the matrix register simultaneously, and each sub-row or sub-column vector is processed by V vector processing units, where V is the length of the sub-row or sub-column vector.
Fig. 2 shows the architecture of the vector processor. A vector processor generally consists of N parallel processing elements (PEs). Each PE has i functional units, such as MAC, ALU, division and shift units, and each functional unit can read and write the matrix register as required. The N PEs together form the SIMD processing mode, while each PE generally adopts a VLIW structure internally, with several functional units operating in parallel. The cell array of the matrix register is of size N*N and logically forms N row vector registers VR_0 to VR_{N-1} and N column vector registers CVR_0 to CVR_{N-1}. Each row vector register VR_i contains N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1), and each column vector register CVR_i contains N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). The matrix register is designed with multiple read/write ports, so it can supply source operands to several functional units of the N PEs at the same time and can also accept data written by several functional units of the N PEs.
Fig. 4 is a structural diagram of the storage cell array of the matrix register. The array generally consists of N*N storage cells, where N is generally a power of 2. Each storage cell is W bits wide, and W is generally 4, 8, 12, 16 or 32. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or N column vector registers CVR_0 to CVR_{N-1}. Each row vector register contains N elements (storage cells) E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1). Taking VR_0 as an example, this row vector register contains the storage cells E_{0,0} to E_{0,N-1}. By column, the array is divided into N storage cell columns of N*W bits, each made up of the N elements in the same column. These N storage cell columns correspond one to one with the N column vector registers CVR_0 to CVR_{N-1} and implement access to the corresponding column vector register. Taking CVR_{N-1} as an example, this column vector register contains the last element E_{i,N-1} (i = 0, 1, 2, ..., N-1) of every row vector register VR_0 to VR_{N-1}.
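The dual row/column view of the cell array can be modelled in a few lines. This Python sketch is a software analogy only, not the hardware design: the class name `MatrixRegister` and its methods are hypothetical, and cell widths and multiple ports are not modelled.

```python
class MatrixRegister:
    """N*N grid of storage cells, readable and writable by row or by column."""

    def __init__(self, n):
        self.n = n
        # cells[i][j] models storage cell E_{i,j}
        self.cells = [[0] * n for _ in range(n)]

    def read_row(self, i):
        """Return row vector register VR_i as a list of N elements."""
        return list(self.cells[i])

    def read_col(self, j):
        """Return column vector register CVR_j as a list of N elements."""
        return [self.cells[i][j] for i in range(self.n)]

    def write_row(self, i, values):
        if len(values) != self.n:
            raise ValueError("a row vector holds exactly N elements")
        self.cells[i] = list(values)

    def write_col(self, j, values):
        if len(values) != self.n:
            raise ValueError("a column vector holds exactly N elements")
        for i, value in enumerate(values):
            self.cells[i][j] = value
```

Writing a matrix row by row and then reading it column by column yields the transposed vectors without any separate transpose step, which is the practical benefit of the column access path.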
Fig. 5 is a diagram of the row address space when the matrix register is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. It consists of N row vector registers, each containing N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit reads or writes one row vector register at a time, with a data bandwidth of N*W bits per access. Row vector register addresses increase linearly from top to bottom. The vector operation unit accesses different row vectors of the matrix register through different row addresses.
Fig. 6 is a diagram of the column address space when the matrix register is not partitioned. When the block mode field in SR indicates that the matrix register is not partitioned, the matrix register supports only single-threaded operation. It consists of N column vector registers, each containing N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit reads or writes one column vector register at a time, with a data bandwidth of N*W bits per access. Column vector register addresses increase linearly from left to right. The vector operation unit accesses different column vectors of the matrix register through different column addresses.
Fig. 7 is a diagram of the row address space in single-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates single-threaded operation, the matrix register consists of N*M sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., (N-1)*M). A functional unit of the vector operation unit reads or writes one sub-row vector register at a time, with a data bandwidth of (N/M)*W bits per access. Sub-row vector registers are addressed by the following rule: between blocks, from top to bottom first and then from left to right; within a block, linearly increasing from top to bottom. The vector operation unit accesses different sub-row vectors of the matrix register through different row addresses. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access.
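The single-thread sub-row addressing rule can be sketched as a mapping from a sub-row address to the cells it selects. The Python below is one possible reading of the rule, not a definitive decoder: it assumes that blocks are numbered down the first block column and then down the next one, and that addresses run from 0 to N*M-1. The function name `sub_row_cells` is hypothetical.

```python
def sub_row_cells(addr, n, m):
    """Map a sub-row address to the (row, column) cells it selects.

    Assumes blocks are numbered top to bottom within a block column, then
    left to right across block columns, and that rows inside a block are
    numbered top to bottom.
    """
    s = n // m                        # cells per sub-row vector, N/M
    if not 0 <= addr < n * m:         # N*M sub-row vector registers in total
        raise ValueError("sub-row address out of range")
    block, r = divmod(addr, s)        # which block, which row inside it
    block_col, block_row = divmod(block, m)
    row = block_row * s + r
    return [(row, block_col * s + k) for k in range(s)]
```

With N = 8 and M = 2, addresses 0 to 3 would select the four sub-rows of the top-left block, addresses 4 to 7 those of the bottom-left block, and addresses 8 to 15 those of the right-hand blocks.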
Fig. 8 is a diagram of the column address space in single-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates single-threaded operation, the matrix register consists of N*M sub-column vector registers, each containing N/M storage cells E_{0,i} to E_{N/M-1,i} (i = 0, 1, 2, ..., (N-1)*M). A functional unit of the vector operation unit reads or writes one sub-column vector register at a time, with a data bandwidth of (N/M)*W bits per access. Sub-column vector registers are addressed by the following rule: between blocks, from top to bottom first and then from left to right; within a block, linearly increasing from left to right. The vector operation unit accesses different sub-column vectors of the matrix register through different column addresses. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access.
Fig. 9 is a diagram of the row address space in M-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates multithreaded operation with L threads, the matrix register consists of N*M/L sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., (N-1)*M/L). A functional unit of the vector operation unit reads or writes L sub-row vector registers at a time, with a data bandwidth of (L*N/M)*W bits per access. Sub-row vector registers are addressed by the following rule: between blocks, from top to bottom first and then from left to right; within a block, linearly increasing from top to bottom; L consecutive blocks in the column direction share one row address space. Through different row addresses the vector operation unit accesses L different sub-row vectors of the matrix register. These L sub-row vectors share the same address but come from different matrix data blocks, so accessing them is in essence a multithreaded data access. Fig. 9 shows the sub-row vector addressing when L equals M. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
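The shared row address space for L threads can be sketched in the same style. This Python example rests on several assumptions that the source does not spell out: blocks are numbered top to bottom and then left to right, L divides M so that each group of L consecutive blocks lies in one block column, and addresses run from 0 to N*M/L-1. The names `multithread_sub_rows` and `access_bits` are hypothetical.

```python
def multithread_sub_rows(addr, n, m, threads):
    """Return one sub-row vector per thread for a shared row address.

    Each returned vector is a list of (row, column) cells and comes from a
    different block, so one address serves `threads` threads at once.
    """
    if threads < 1 or m % threads:
        raise ValueError("this sketch assumes L divides M")
    s = n // m                                  # cells per sub-row, N/M
    if not 0 <= addr < n * m // threads:        # N*M/L shared addresses
        raise ValueError("row address out of range")
    group, r = divmod(addr, s)                  # block group, row in block
    vectors = []
    for t in range(threads):
        block = group * threads + t             # L consecutive blocks
        block_col, block_row = divmod(block, m)
        row = block_row * s + r
        vectors.append([(row, block_col * s + k) for k in range(s)])
    return vectors


def access_bits(n, m, threads, w):
    """Data bandwidth of one access: (L*N/M)*W bits."""
    return threads * (n // m) * w
```

With `threads` set to 1 the mapping reduces to the single-thread case, and with `threads` equal to M one address selects the same row offset in every block of a block column.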
Fig. 10 is a diagram of the column address space in M-thread mode when the matrix register is partitioned. When the block mode field in SR indicates block mode M*M and the thread mode field indicates multithreaded operation with L threads, the matrix register consists of N*M/L sub-column vector registers, each containing N/M storage cells E_{0,i} to E_{N/M-1,i} (i = 0, 1, 2, ..., (N-1)*M/L). A functional unit of the vector operation unit reads or writes L sub-column vector registers at a time, with a data bandwidth of (L*N/M)*W bits per access. Sub-column vector registers are addressed by the following rule: between blocks, from top to bottom first and then from left to right; within a block, linearly increasing from left to right; L consecutive blocks in the row direction share one column address space. Through different column addresses the vector operation unit accesses L different sub-column vectors of the matrix register. These L sub-column vectors share the same address but come from different matrix data blocks, so accessing them is in essence a multithreaded data access. Fig. 10 shows the sub-column vector addressing when L equals M. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
Fig. 11 is a diagram of the decoding path of the matrix register. The decoding path consists of the address decoding logic, the read and write data buffer units, and the read and write data buses. When the vector processing unit reads or writes the matrix register, the address decoding logic decodes the address according to the read/write address, the row/column select signal and the content of SR, and selects one or more row or column vectors, or one or more sub-row or sub-column vectors, for reading or writing. On a read, once a storage cell has been selected by the decoding logic, its content is read out onto the read data bus of the row in which the cell lies and then passed to the read data buffer. The read data buffer unit assembles the data from the different storage cells into a vector and returns it to the vector operation unit. On a write, the write data buffer unit splits the vector data from the vector operation unit into the individual values destined for the different storage cells. Once a storage cell has been selected by the decoding logic, the value to be written is placed on the write data bus of the row in which the cell lies and is written into the cell when the clock is active.
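The role of the two buffer units can be pictured as a gather on reads and a scatter on writes. The Python sketch below is a behavioural analogy under stated assumptions: `selected` stands for the list of (row, column) cells chosen by the decoding logic, and the function names are hypothetical. Buses and clocking are not modelled.

```python
def read_vector(cells, selected):
    """Read path: gather the selected cells into one vector."""
    return [cells[i][j] for i, j in selected]


def write_vector(cells, selected, data):
    """Write path: split a vector into per-cell values and store them."""
    if len(data) != len(selected):
        raise ValueError("vector length must match the number of selected cells")
    for (i, j), value in zip(selected, data):
        cells[i][j] = value
```

A column access on a 4*4 register, for instance, would pass `selected = [(0, 2), (1, 2), (2, 2), (3, 2)]` to write or read column 2.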
The above is only a preferred embodiment of the present invention. The scope of protection of the invention is not limited to this embodiment; all technical solutions that fall under the concept of the invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the invention are also regarded as falling within its scope of protection.

Claims (4)

1. A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR; the matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2; the control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit; the width of the control register is log2(C) + log2(T), where C is the number of block modes of the matrix register and T is the number of multithreading modes the vector processor can handle.
2. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, characterized in that: when M is 0, the matrix register is not partitioned, and the vector operation unit accesses one row vector or one column vector of the matrix register at a time; when M is not 0, the vector operation unit accesses one or more sub-row vectors or sub-column vectors of equal length in the matrix register, depending on the number of threads processed simultaneously, and these equal-length sub-row or sub-column vectors come from different blocks.
3. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 2, characterized in that: when the vector operation unit accesses the matrix register, the address decoding logic decodes according to the content of the control register SR, the read/write address and the row/column select signal, and selects either a row vector or column vector of the matrix register, or one or more sub-row or sub-column vectors, for reading or writing.
4. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, 2 or 3, characterized in that: the control register SR is either an independent control register or is stored in the reserved bits of another control register, in which case the number of reserved bits in that register is an integer greater than log2(C) + log2(T).
CN201010559458.2A 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT Active CN102012803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010559458.2A CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010559458.2A CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Publications (2)

Publication Number Publication Date
CN102012803A true CN102012803A (en) 2011-04-13
CN102012803B CN102012803B (en) 2014-09-10

Family

ID=43842979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010559458.2A Active CN102012803B (en) 2010-11-25 2010-11-25 Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Country Status (1)

Country Link
CN (1) CN102012803B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102447462A (en) * 2011-12-13 2012-05-09 北京控制工程研究所 Over current (OC) instruction matrix circuit with impact resistance
CN103294623A (en) * 2013-03-11 2013-09-11 浙江大学 Configurable multi-thread dispatch circuit for SIMD system
CN104011676A (en) * 2011-12-20 2014-08-27 国际商业机器公司 Low Latency Variable Transfer Network For Fine Grained Parallelism Of Virtual Threads Across Multiple Hardware Threads
CN106484519A (en) * 2016-10-11 2017-03-08 东南大学 Asynchronous thread recombination method and the SIMT processor based on the method
CN107408037A (en) * 2015-02-02 2017-11-28 优创半导体科技有限公司 Monolithic vector processor configured to operate on variable-length vectors
CN109416634A (en) * 2016-07-08 2019-03-01 Arm有限公司 Vector register access
CN109684602A (en) * 2018-12-29 2019-04-26 上海商汤智能科技有限公司 Batch processing method and device and computer readable storage medium
CN111158874A (en) * 2019-12-20 2020-05-15 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112346783A (en) * 2020-11-05 2021-02-09 海光信息技术股份有限公司 Processor and operation method, device, equipment and medium thereof
CN114238204A (en) * 2017-03-14 2022-03-25 珠海市芯动力科技有限公司 Reconfigurable parallel processing
WO2022111013A1 (en) * 2020-11-27 2022-06-02 安徽寒武纪信息科技有限公司 Device supporting multiple access modes, method and readable storage medium
WO2023236013A1 (en) * 2022-06-06 2023-12-14 Intel Corporation Data re-arrangement by mixed simt and simd execution mode

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5513366A (en) * 1994-09-28 1996-04-30 International Business Machines Corporation Method and system for dynamically reconfiguring a register file in a vector processor
CN1180864A (en) * 1996-08-19 1998-05-06 三星电子株式会社 Single-instruction-multiple-data processing in multimedia signal processor and device thereof
CN1684058A (en) * 2004-04-16 2005-10-19 索尼株式会社 Processor
CN101776988A (en) * 2010-02-01 2010-07-14 中国人民解放军国防科学技术大学 Restructurable matrix register file with changeable block size

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102447462A (en) * 2011-12-13 2012-05-09 北京控制工程研究所 Over current (OC) instruction matrix circuit with impact resistance
CN102447462B (en) * 2011-12-13 2013-08-28 北京控制工程研究所 Over current (OC) instruction matrix circuit with impact resistance
CN104011676A (en) * 2011-12-20 2014-08-27 国际商业机器公司 Low Latency Variable Transfer Network For Fine Grained Parallelism Of Virtual Threads Across Multiple Hardware Threads
CN104011676B (en) * 2011-12-20 2017-03-01 国际商业机器公司 Method and circuit arrangement for transferring variables between hardware threads in multiple processing cores
CN103294623A (en) * 2013-03-11 2013-09-11 浙江大学 Configurable multi-thread dispatch circuit for SIMD system
CN103294623B (en) * 2013-03-11 2016-04-27 浙江大学 Configurable multi-thread dispatch circuit for SIMD system
CN107408037B (en) * 2015-02-02 2021-03-02 优创半导体科技有限公司 Monolithic vector processor configured to operate on variable length vectors
CN107408037A (en) * 2015-02-02 2017-11-28 优创半导体科技有限公司 Monolithic vector processor configured to operate on variable-length vectors
CN109416634A (en) * 2016-07-08 2019-03-01 Arm有限公司 Vector register access
CN106484519B (en) * 2016-10-11 2019-11-08 东南大学苏州研究院 Asynchronous thread recombination method and SIMT processor based on this method
CN106484519A (en) * 2016-10-11 2017-03-08 东南大学 Asynchronous thread recombination method and the SIMT processor based on the method
CN114238204A (en) * 2017-03-14 2022-03-25 珠海市芯动力科技有限公司 Reconfigurable parallel processing
CN114238204B (en) * 2017-03-14 2023-01-06 珠海市芯动力科技有限公司 Reconfigurable parallel processing
CN109684602A (en) * 2018-12-29 2019-04-26 上海商汤智能科技有限公司 Batch processing method and device and computer readable storage medium
CN109684602B (en) * 2018-12-29 2023-06-06 上海商汤智能科技有限公司 Batch processing method and device and computer readable storage medium
CN111158874A (en) * 2019-12-20 2020-05-15 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112346783A (en) * 2020-11-05 2021-02-09 海光信息技术股份有限公司 Processor and operation method, device, equipment and medium thereof
CN112346783B (en) * 2020-11-05 2022-11-22 海光信息技术股份有限公司 Processor and operation method, device, equipment and medium thereof
WO2022111013A1 (en) * 2020-11-27 2022-06-02 安徽寒武纪信息科技有限公司 Device supporting multiple access modes, method and readable storage medium
WO2023236013A1 (en) * 2022-06-06 2023-12-14 Intel Corporation Data re-arrangement by mixed simt and simd execution mode

Also Published As

Publication number Publication date
CN102012803B (en) 2014-09-10

Similar Documents

Publication Publication Date Title
CN102012803B (en) Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT
KR102655386B1 (en) Method and apparatus for distributed and cooperative computation in artificial neural networks
EP3035204B1 (en) Storage device and method for performing convolution operations
US8422330B2 (en) Memory controller and memory controlling method
CN102141905B (en) Processor system structure
CN104603795B (en) Realize instruction and the micro-architecture of the instant context switching of user-level thread
KR101710116B1 (en) Processor, Apparatus and Method for memory management
CN102144225A (en) Method & apparatus for real-time data processing
CN102648456A (en) Memory device and method
EP3384498B1 (en) Shift register with reduced wiring complexity
US20220263525A1 (en) Computational memory with zero disable and error detection
US9442893B2 (en) Product-sum operation circuit and product-sum operation system
US11468002B2 (en) Computational memory with cooperation among rows of processing elements and memory thereof
EP3035203A1 (en) Fine-grain storage interface and method for low power accelerators
US9350584B2 (en) Element selection unit and a method therein
US20100146241A1 (en) Modified-SIMD Data Processing Architecture
US20130024652A1 (en) Scalable Processing Unit
US20180232207A1 (en) Arithmetic processing apparatus and control method for arithmetic processing apparatus
US7130985B2 (en) Parallel processor executing an instruction specifying any location first operand register and group configuration in two dimensional register file
Fan et al. A parallel-access mapping method for the data exchange buffers around DCT/IDCT in HEVC encoders based on single-port SRAMs
US7080216B2 (en) Data access in a processor
CN111045965B (en) Hardware implementation method for multi-channel conflict-free splitting, computer equipment and readable storage medium for operating method
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
US20210042111A1 (en) Efficient encoding of high fanout communications
Kumaki et al. CAM enhanced super parallel SIMD processor with high-speed pattern matching capability

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CI01 Publication of corrected invention patent application

Correction item: Abstract|Description

Correct: The description is correct

False: The abstract of the description is in error

Number: 15

Volume: 27

CI02 Correction of invention patent application

Correction item: Abstract|Description

Correct: The description is correct

False: The abstract of the description is in error

Number: 15

Page: The title page

Volume: 27

ERR Gazette correction

Free format text: CORRECT: ABSTRACT; DESCRIPTION; FROM: ABSTRACT, DESCRIPTION IS WRONG TO: ABSTRACT, DESCRIPTION IS CORRECT

C14 Grant of patent or utility model
GR01 Patent grant