CN101021778A - Computing group structure for superlong instruction word and instruction flow multidata stream fusion - Google Patents

Computing group structure for superlong instruction word and instruction flow multidata stream fusion Download PDF

Info

Publication number
CN101021778A
CN101021778A CN200710034567A
Authority
CN
China
Prior art keywords
data
instruction
simd
group
microcontroller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710034567
Other languages
Chinese (zh)
Other versions
CN100456230C (en)
Inventor
邢座程
杨学军
张民选
蒋江
张承义
马驰远
李勇
陈海燕
高军
李晋文
衣晓飞
张明
穆长富
阳柳
曾献君
倪晓强
唐遇星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CNB2007100345670A priority Critical patent/CN100456230C/en
Publication of CN101021778A publication Critical patent/CN101021778A/en
Application granted granted Critical
Publication of CN100456230C publication Critical patent/CN100456230C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Advance Control (AREA)

Abstract

This invention discloses a computing cluster structure that fuses very long instruction word (VLIW) execution with single-instruction-stream multiple-data-stream (SIMD) execution. The structure comprises a data buffer and a microcontroller, both connected to a main controller, and a computing cluster connected to the data buffer and the microcontroller. The main controller is responsible for moving instructions and data: it loads instructions into the microcontroller, loads data into the data buffer, and receives their output. The data buffer receives data from the main controller, supplies operands to the computing cluster, receives the cluster's results, and outputs the final results to the main controller. The microcontroller receives the VLIW sequence from the main controller, decodes it, and broadcasts it to each processing unit for parallel execution. The computing cluster consists of a number of identical processing units, each containing several processing elements; all units execute the same instruction sequence, but their data come from different locations in the data buffer.

Description

Computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution
Technical field
The present invention relates generally to instruction-processing techniques in microprocessor design, and in particular to a computing cluster structure that fuses very long instruction word and single-instruction-stream multiple-data-stream execution in processors intended for compute-intensive applications.
Background technology
Instruction-level parallelism (ILP) is the main avenue of parallelism exploitation in microprocessor design: independent instructions can execute in parallel, raising the processor's execution efficiency. Representative techniques are superscalar execution and the very long instruction word (VLIW) technique. A VLIW machine provides multiple functional units in hardware that can execute instructions in parallel; the compiler is responsible for packing operations into a sequence of very long instruction words, each containing several primitive operations that can execute in parallel and map directly onto the functional units during execution. ILP is thereby exploited without complex hardware dependence-detection and issue logic. At the same time, compute-intensive applications such as multimedia and scientific computing also contain abundant data-level parallelism (DLP): data of the same type or structure often undergo the same operation or chain of operations. Dedicated instruction-processing techniques can exploit the DLP in such programs effectively and thus raise the processor's execution performance.
First, the exploitation of DLP can be converted into the exploitation of ILP. Software pipelining is an effective compiler scheduling method for ILP: the compiler unrolls a loop and reschedules instructions from different iterations that have no data dependences, forming a new loop body, thereby increasing the number of instructions the processor can execute in parallel and resolving data dependences. In programs with abundant DLP, the dependences between instructions are mainly the data dependences between memory accesses and computation, which software pipelining can dissolve with comparative ease. Software pipelining can be combined with superscalar or VLIW techniques to exploit ILP.
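The conversion of DLP into ILP by loop unrolling can be sketched in a few lines. The following is an illustrative model only, not the patent's compiler: splitting one serial accumulator into four breaks the add-to-add dependence chain, so a machine with four functional units could execute the four partial sums in parallel. All names are hypothetical.

```python
def serial_sum(xs):
    total = 0
    for x in xs:          # every add depends on the previous one
        total += x
    return total

def unrolled_sum(xs, ways=4):
    # one accumulator per notional functional unit; the adds in a
    # single unrolled iteration are mutually independent
    accs = [0] * ways
    body = len(xs) - len(xs) % ways
    for i in range(0, body, ways):
        for k in range(ways):
            accs[k] += xs[i + k]
    tail = sum(xs[body:])      # leftover iterations
    return sum(accs) + tail    # reduce the partial sums

data = list(range(1, 101))
assert unrolled_sum(data) == serial_sum(data) == 5050
```

The rewritten loop computes the same result; what changes is only how many of its operations are independent of one another, which is exactly what a VLIW compiler schedules onto parallel functional units.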
Vector processing is an effective way to raise a processor's bulk-data throughput. It exploits temporal overlap to convert data parallelism into instruction parallelism: by vectorizing loop statements that apply the same operation, it not only reduces the program's code size but also hides the control dependences between loop iterations inside the vector instructions, improving hardware execution efficiency. Vector chaining, built on vector processing, further reduces the storage required for intermediate results and eases the allocation pressure on vector registers.
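The benefit of vector chaining mentioned above can be illustrated with a small sketch. This is not the patent's hardware; generators merely stand in for a chained vector pipeline in which each element of one vector operation's result feeds the next operation directly, so the full intermediate vector is never materialized.

```python
def vmul(xs, ys):
    # "vector multiply" producing its result elements lazily
    return (x * y for x, y in zip(xs, ys))

def vadd(xs, ys):
    # "vector add", chained onto the previous operation's output
    return (x + y for x, y in zip(xs, ys))

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = [10, 10, 10, 10]

# d = a*b + c without ever storing the full a*b intermediate
d = list(vadd(vmul(a, b), c))
assert d == [15, 22, 31, 42]
```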
Single-instruction-stream multiple-data-stream (Single Instruction, Multiple Data, abbreviated SIMD) execution is a resource-replication technique that exploits DLP by providing multiple parallel processing units, or by splitting one processing unit into several data paths. The control signals of a single instruction drive multiple arithmetic units simultaneously, while the data processed come from multiple data streams. Intel's IA-32 and IA-64 instruction sets both provide SIMD instruction extensions, which improve the performance of numerical applications.
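The SIMD principle just described, one instruction's control signals driving several lanes that each read their own data stream, can be modeled minimally as follows. The lane model is an illustrative assumption, not any real instruction set.

```python
def simd_execute(opcode, lanes):
    """Apply one decoded instruction to every lane's private operands."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    op = ops[opcode]                       # decoded once, shared by all lanes
    return [op(a, b) for (a, b) in lanes]  # each lane has its own data

# four lanes, one instruction stream
lanes = [(1, 10), (2, 20), (3, 30), (4, 40)]
assert simd_execute("add", lanes) == [11, 22, 33, 44]
assert simd_execute("mul", lanes) == [10, 40, 90, 160]
```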
As the above shows, current parallelism exploitation for compute-intensive applications such as multimedia and scientific computing targets either ILP or DLP in isolation. As application scale and complexity keep growing, parallelization with these methods alone becomes harder and the performance gains obtained keep shrinking. Supporting the joint exploitation of DLP and ILP in the hardware architecture is a new approach to this class of problems.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the shortcomings of the prior art, to provide a computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution that combines the VLIW, SIMD, and software pipelining techniques, supports the simultaneous exploitation of data-level and instruction-level parallelism, and thereby further improves, along several avenues, the execution performance of compute-intensive application programs on the processor.
To solve the above technical problem, the solution proposed by the present invention is a computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution, characterized in that: it comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller; the data buffer and the microcontroller are each connected to the main controller, and the data buffer and the microcontroller are connected to each other through the SIMD computing cluster. The main controller is responsible for preparing instructions and data: it loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer. The data buffer receives the data transferred from the main controller and stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller. The microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing element F of every processing unit of the SIMD computing cluster for parallel execution. The SIMD computing cluster is a set of parallel processing units organized in SIMD fashion; every processing unit has the same structure and contains multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
The SIMD computing cluster is a group of structurally identical processing units PE that simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller. Each PE contains multiple arithmetic/logic processing elements Fn and local register files; the processing elements Fn support very long instruction word execution and can process several operations of different types in parallel, and each processing element Fn has its own local register file, which directly supplies operands to the element and stores its results.
Compared with the prior art: in compute-intensive application programs such as multimedia and scientific computing, the volume of data to be processed is large, and data of the same type or structure often undergo the same operation or chain of operations. Deploying the hardware structure proposed by the invention in microprocessors for such applications therefore brings the following advantages:
(1) The computing cluster structure fusing VLIW and SIMD proposed by the invention keeps the intermediate results of program computation in local registers inside each PE and in the data buffer, avoiding wasted memory bandwidth; by combining resource replication with compiler scheduling it exploits DLP and ILP simultaneously, improving the execution efficiency of application programs on the processor.
(2) DLP and ILP are exploited at the same time. Because VLIW programs execute in SIMD fashion, the ILP and the DLP in a program are exploited jointly, greatly raising the processor's throughput on this class of applications.
(3) Hardware efficiency is high. Multiple PEs share one set of instruction control logic for fetch, decode, dispatch, and mapping, while their operands come from different data streams; this SIMD execution style uses the control path of a single instruction to achieve the data throughput that would otherwise require a multiprocessor system.
(4) Hardware implementation complexity is low. Because the compiler can determine the latency of every instruction, the work of parallelism exploitation is done entirely by the compiler, avoiding complex hardware dependence-detection logic and pipeline interlock logic.
(5) The memory-bandwidth bottleneck is relieved. Because the local registers inside the PEs are used, the intermediate results produced by instruction operations need not occupy the external data buffer, easing the bandwidth pressure on external storage and speeding up operand reads.
In summary, the proposed hardware structure combines the advantages of VLIW and SIMD in exploiting program parallelism. It is well suited to processors for compute-intensive applications, but is not limited to them; other processors that need to exploit several kinds of parallelism simultaneously can also adopt it.
Description of drawings
Fig. 1 is a schematic diagram of the overall architecture of the present invention;
Fig. 2 is a schematic diagram of the computing cluster structure fusing VLIW and SIMD in the present invention;
Fig. 3 is a schematic flowchart of instruction processing in the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, the computing cluster structure of the present invention, fusing very long instruction word and single-instruction-stream multiple-data-stream execution, comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller. The main controller is responsible for preparing instructions and data: it loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer. The data buffer receives the data transferred from the main controller and stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller. The microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing element F of every processing unit (Processing Element, abbreviated PE) of the SIMD computing cluster for parallel execution. The SIMD computing cluster is a set of parallel processing units organized in SIMD fashion; every processing unit has the same structure and contains multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
Fig. 2 shows the computing cluster structure fusing VLIW and SIMD. The SIMD computing cluster is a group of structurally identical processing units PE (PE0, PE1, ..., PEN). All PEs have the same structure and simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller. Each PE contains multiple arithmetic/logic processing elements (F1, F2, ..., Fn) and local register files. The processing elements F1, F2, ..., Fn support VLIW execution and can process several operations of different types in parallel (such as addition, multiplication, multiply-add, and logic operations). Each processing element has its own local register file, which directly supplies operands to the element and stores its results. Operands are first read from the data buffer into the local registers; because the main controller has already loaded the operands into the data buffer before starting the microcontroller's decode, dispatch, and mapping, the latency of moving an operand from the data buffer into a PE's local registers is fixed. The local register files inside a processing unit are interconnected by a network and can exchange the temporary data generated during computation; each processing element reads its operands directly from its own local registers, so a processing element never suffers an unpredictable stall caused by operands not being ready. Consequently, the latency of any operation of a very long instruction word is identical on all PEs. It follows that the compiler can fully exploit ILP, generate the very long instruction word sequence, and use software pipelining to restructure the loops in the program and dissolve the data dependences between very long instruction words, without any hardware intervention. Interconnection networks also exist between the PEs, allowing the necessary synchronization and data exchange.
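The broadcast execution just described can be captured in a toy software model. This is an assumption-laden sketch, not the patented hardware: the microcontroller broadcasts one VLIW word to all PEs, and each PE applies the word's operation slots to its own local register file, so all PEs start and finish a word together.

```python
class PE:
    def __init__(self, operands):
        self.regs = dict(operands)        # local register file

    def execute(self, vliw_word):
        # each (dst, fn, srcs) slot models one operation mapped onto one Fi
        for dst, fn, srcs in vliw_word:
            self.regs[dst] = fn(*(self.regs[s] for s in srcs))

def broadcast(vliw_word, pes):
    for pe in pes:                        # same word, different local data
        pe.execute(vliw_word)

# two PEs with different operands, one shared instruction stream
pes = [PE({"a": 2, "b": 3}), PE({"a": 10, "b": 4})]
word = [("c", lambda x, y: x + y, ("a", "b")),   # slot for F1: c = a + b
        ("d", lambda x, y: x * y, ("a", "b"))]   # slot for F2: d = a * b
broadcast(word, pes)
assert [pe.regs["c"] for pe in pes] == [5, 14]
assert [pe.regs["d"] for pe in pes] == [6, 40]
```

Note that results land only in each PE's own `regs`, mirroring the claim that intermediates stay in PE-local registers rather than traveling through the external buffer.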
Take the following pseudo-code program as an example:
// data preparation
load(data1[m], mem1);
load(data2[m], mem2);
// data processing
for (i from 1 to m) {
    func(data1[i], data2[i], data3[i], data4[i]);
    func(data3[i], data4[i], data5[i], data6[i]);
}
// write the data back to external memory or a network interface
send(data5[m], mem3);
send(data6[m], chan0);
// definition of the data-processing function: inputs in_a and in_b, outputs out_c and out_d
func(in_a, in_b, out_c, out_d) {
    // first very long instruction word I1
    OP_11    // executed on element F1
    OP_12    // executed on element F2
    ...
    OP_1n;;  // executed on element Fn; ";;" marks the very long instruction word boundary
    // second very long instruction word I2
    OP_21
    OP_22
    ...
    OP_2n;;
}
This program reads a series of same-structure data (data1, data2), applies a sequence of operations to them in order (the two func functions), and produces a new series of data (data5, data6); its body is a for loop. The second func in the loop has a read-after-write data dependence on the first func. With traditional instruction-processing techniques, execution efficiency, that is, parallelism, is limited by this dependence; moreover, a large amount of temporary data (data3, data4) is produced between the two calls, and if register resources are insufficient it must be swapped out to memory and read back into registers when needed, wasting memory bandwidth. The hardware structure proposed by the present invention effectively avoids these bottlenecks. The program executes on the present invention as follows:
1. Execute the load instructions. The main controller controls the data flow and starts the data buffer, which fetches the data sequences data1 and data2 required by the program from external memory and stores them in the data buffer at fixed index addresses.
2. The main controller sends the very long instruction word sequence of the program (the first func function) to the instruction cache in the microcontroller.
3. The microcontroller decodes the very long instruction words I1, I2, and so on. Each cycle it translates one word into n micro-operations (OP_i1, OP_i2, ..., OP_in), which are dispatched in parallel onto the n processing elements of the N PEs:
(a) data-load operations move the data (in_a, in_b) from the data buffer into the local registers of the PEs;
(b) data-writeback operations move the data (out_c, out_d) from the local registers of the PEs back to the data buffer;
(c) arithmetic/logic operations take their source operands from, and write their results back to, their own local registers;
(d) data can move between local registers through the interconnection network inside a PE, and across PEs through the interconnection network between PEs.
4. Execute the second func function, repeating steps 2 and 3.
5. Execute the send instructions. The main controller controls the data flow and starts the data buffer, which writes the data sequences data5 and data6 back to external memory or to the designated network port.
Every micro-operation has a fixed, predictable execution latency, and all PEs start and finish at the same time: the PEs run in SIMD fashion with respect to one another, while the parallelism inside each PE uses the VLIW style.
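The worked example above can be sketched in executable form under illustrative assumptions: each PE conceptually handles one loop index i, and the first func's outputs (data3, data4) stay in PE-local storage and feed the second func directly, so the read-after-write intermediates never travel through the external buffer. The body of `func` here is a hypothetical stand-in; any two-input, two-output body fits the pattern.

```python
def func(in_a, in_b):
    # hypothetical stand-in for the example's func
    return in_a + in_b, in_a * in_b

def run_on_cluster(data1, data2):
    data5, data6 = [], []
    for a, b in zip(data1, data2):   # conceptually, one PE per index
        d3, d4 = func(a, b)          # intermediates stay PE-local
        c, d = func(d3, d4)          # RAW dependence resolved locally
        data5.append(c)
        data6.append(d)
    return data5, data6

d5, d6 = run_on_cluster([1, 2], [3, 4])
# func(1,3)->(4,3); func(4,3)->(7,12); func(2,4)->(6,8); func(6,8)->(14,48)
assert (d5, d6) == ([7, 14], [12, 48])
```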
Fig. 3 shows the instruction-processing flow of the present invention. After an instruction is decoded by the main controller, it is judged to be either a data-preparation instruction or a data-processing instruction. A data-preparation instruction is sent to the data buffer for execution, where it is judged to be either a write or a read. A read fetches a block of data from external memory according to the data address and length given in the instruction and stores it in the data buffer at fixed index addresses; a write outputs the designated data block in the data buffer to the designated destination (external memory or the network port). For a data-processing instruction, the corresponding very long instruction word sequence is fetched from external memory into the microcontroller, which decodes, dispatches, and executes the instructions one by one. The micro-operations translated from each very long instruction word are dispatched simultaneously to the N PEs and mapped onto the corresponding processing elements F; the required data come from the data buffer, and the PEs execute in SIMD fashion. When one very long instruction word finishes, the microcontroller starts decoding and dispatching the next; when one instruction sequence finishes, the main controller starts reading in and processing the next.
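The decision flow of Fig. 3 can be summarized as a small dispatcher. The instruction records here are hypothetical; the sketch only shows the routing logic: data-preparation instructions go to the buffer (read versus write), everything else goes to the microcontroller.

```python
def route(instr, buffer, external_mem):
    kind = instr["kind"]
    if kind == "read":
        # data preparation: load a block from external memory into the buffer
        addr, n = instr["addr"], instr["len"]
        buffer[instr["index"]] = external_mem[addr:addr + n]
        return "buffer-read"
    if kind == "write":
        # data preparation: write the designated block back out
        external_mem.extend(buffer.pop(instr["index"]))
        return "buffer-write"
    # data processing: hand the VLIW sequence to the microcontroller
    return "microcontroller"

mem = [10, 20, 30, 40]
buf = {}
assert route({"kind": "read", "addr": 1, "len": 2, "index": 0}, buf, mem) == "buffer-read"
assert buf[0] == [20, 30]
assert route({"kind": "vliw"}, buf, mem) == "microcontroller"
```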

Claims (2)

1. A computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution, characterized in that: it comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller; the data buffer and the microcontroller are each connected to the main controller, and the data buffer and the microcontroller are connected to each other through the SIMD computing cluster; the main controller is responsible for preparing instructions and data, loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer; the data buffer receives the data transferred from the main controller, stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller; the microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing elements of every processing unit of the SIMD computing cluster for parallel execution; the SIMD computing cluster is a set of structurally identical parallel processing units, each configured with multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
2. The computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution according to claim 1, characterized in that: the SIMD computing cluster is a group of structurally identical processing units PE that simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller; each PE comprises multiple arithmetic/logic processing elements Fn and local register files; the processing elements Fn support very long instruction word execution and can process several operations of different types in parallel; each processing element Fn has its own local register file, which directly supplies operands to the element and stores its results.
CNB2007100345670A 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion Expired - Fee Related CN100456230C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100345670A CN100456230C (en) 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion


Publications (2)

Publication Number Publication Date
CN101021778A true CN101021778A (en) 2007-08-22
CN100456230C CN100456230C (en) 2009-01-28

Family

ID=38709553

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100345670A Expired - Fee Related CN100456230C (en) 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion

Country Status (1)

Country Link
CN (1) CN100456230C (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452394B (en) * 2007-11-28 2012-05-23 无锡江南计算技术研究所 Compiling method and compiler
CN102970049A (en) * 2012-10-26 2013-03-13 北京邮电大学 Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit
WO2015173674A1 (en) * 2014-05-12 2015-11-19 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9740486B2 (en) 2014-09-09 2017-08-22 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US9971602B2 (en) 2015-01-12 2018-05-15 International Business Machines Corporation Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
WO2019104638A1 (en) * 2017-11-30 2019-06-06 深圳市大疆创新科技有限公司 Neural network processing method and apparatus, accelerator, system, and mobile device
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6718457B2 (en) * 1998-12-03 2004-04-06 Sun Microsystems, Inc. Multiple-thread processor for threaded software applications
JP2001175618A (en) * 1999-12-17 2001-06-29 Nec Eng Ltd Parallel computer system
WO2005036384A2 (en) * 2003-10-14 2005-04-21 Koninklijke Philips Electronics N.V. Instruction encoding for vliw processors
US7949856B2 (en) * 2004-03-31 2011-05-24 Icera Inc. Method and apparatus for separate control processing and data path processing in a dual path processor with a shared load/store unit
US9047094B2 (en) * 2004-03-31 2015-06-02 Icera Inc. Apparatus and method for separate asymmetric control processing and data path processing in a dual path processor
CN100357932C (en) * 2006-06-05 2007-12-26 中国人民解放军国防科学技术大学 Method for decreasing data access delay in stream processor

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452394B (en) * 2007-11-28 2012-05-23 无锡江南计算技术研究所 Compiling method and compiler
CN102970049A (en) * 2012-10-26 2013-03-13 北京邮电大学 Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit
CN102970049B (en) * 2012-10-26 2016-01-20 北京邮电大学 Based on parallel circuit and the RS decoding circuit of money searching algorithm and Fu Ni algorithm
WO2015173674A1 (en) * 2014-05-12 2015-11-19 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9665372B2 (en) 2014-05-12 2017-05-30 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9690585B2 (en) 2014-05-12 2017-06-27 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9690586B2 (en) 2014-05-12 2017-06-27 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US10157064B2 (en) 2014-05-12 2018-12-18 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9740486B2 (en) 2014-09-09 2017-08-22 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9760375B2 (en) 2014-09-09 2017-09-12 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US11144323B2 (en) 2014-09-30 2021-10-12 International Business Machines Corporation Independent mapping of threads
US9870229B2 (en) 2014-09-30 2018-01-16 International Business Machines Corporation Independent mapping of threads
US10545762B2 (en) 2014-09-30 2020-01-28 International Business Machines Corporation Independent mapping of threads
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9977678B2 (en) 2015-01-12 2018-05-22 International Business Machines Corporation Reconfigurable parallel execution and load-store slice processor
US10983800B2 (en) 2015-01-12 2021-04-20 International Business Machines Corporation Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices
US10083039B2 (en) 2015-01-12 2018-09-25 International Business Machines Corporation Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices
US9971602B2 (en) 2015-01-12 2018-05-15 International Business Machines Corporation Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices
US11734010B2 (en) 2015-01-13 2023-08-22 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
US10223125B2 (en) 2015-01-13 2019-03-05 International Business Machines Corporation Linkable issue queue parallel execution slice processing method
US11150907B2 (en) 2015-01-13 2021-10-19 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10564978B2 (en) 2016-03-22 2020-02-18 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
US10268518B2 (en) 2016-05-11 2019-04-23 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10255107B2 (en) 2016-05-11 2019-04-09 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10042770B2 (en) 2016-05-11 2018-08-07 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US9940133B2 (en) 2016-06-13 2018-04-10 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
WO2019104638A1 (en) * 2017-11-30 2019-06-06 深圳市大疆创新科技有限公司 Neural network processing method and apparatus, accelerator, system, and mobile device
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium
CN113032013B (en) * 2021-01-29 2023-03-28 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Also Published As

Publication number Publication date
CN100456230C (en) 2009-01-28

Similar Documents

Publication Publication Date Title
CN100456230C (en) Computing group structure for superlong instruction word and instruction flow multidata stream fusion
CN108268278B (en) Processor, method and system with configurable spatial accelerator
EP3726389B1 (en) Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US10515046B2 (en) Processors, methods, and systems with a configurable spatial accelerator
JP6525286B2 (en) Processor core and processor system
Udupa et al. Software pipelined execution of stream programs on GPUs
US11029958B1 (en) Apparatuses, methods, and systems for configurable operand size operations in an operation configurable spatial accelerator
US20220100680A1 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
JP2018519602A (en) Block-based architecture with parallel execution of continuous blocks
CN100489830C (en) 64-bit stream processor chip architecture oriented to scientific computing
Karim et al. A multilevel computing architecture for embedded multimedia applications
Lisper Towards parallel programming models for predictability
US11907713B2 (en) Apparatuses, methods, and systems for fused operations using sign modification in a processing element of a configurable spatial accelerator
Tan et al. Optimizing the LINPACK algorithm for large-scale PCIe-based CPU-GPU heterogeneous systems
US20230367604A1 (en) Method of interleaved processing on a general-purpose computing core
Krashinsky Vector-thread architecture and implementation
She et al. OpenCL code generation for low energy wide SIMD architectures with explicit datapath
Sandokji et al. Task scheduling frameworks for heterogeneous computing toward exascale
CN104615496B (en) Parallel expansion method for a reconfigurable structure based on a multi-level heterogeneous architecture
Dey et al. Embedded support vector machine: Architectural enhancements and evaluation
Evripidou et al. Data-flow vs control-flow for extreme level computing
Rutzig Multicore platforms: Processors, communication and memories
Luo et al. HAD: A Prototype Of Dataflow Compute Architecture
Margerm Leveraging Dynamic Task Parallelism in Hardware Accelerators
Berkovich et al. XMT-M: A scalable decentralized processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090128

Termination date: 20110319