CN102508643A - Multicore-parallel digital signal processor and method for operating parallel instruction sets - Google Patents

Multicore-parallel digital signal processor and method for operating parallel instruction sets

Info

Publication number
CN102508643A
Authority
CN
China
Prior art keywords
parallel
processor
access
subset
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103638203A
Other languages
Chinese (zh)
Inventor
刘大可
王建
猷阿·索
安德里雅思·卡尔松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN2011103638203A priority Critical patent/CN102508643A/en
Publication of CN102508643A publication Critical patent/CN102508643A/en
Pending legal-status Critical Current

Landscapes

  • Advance Control (AREA)

Abstract

An embodiment of the invention provides a multicore-parallel digital signal processor and a method for operating parallel instruction sets. The multicore-parallel digital signal processor at least operates a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset, and comprises a main processor and a plurality of auxiliary processors. Each auxiliary processor consists of a parallel access unit and a parallel computing unit, which are mutually independent. The control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are coded independently and executed by independent hardware units respectively. The main processor is used for operational control of the instruction subsets, the parallel access units of the one or more auxiliary processors are used for parallel access of the instruction subsets, and the parallel computing units of the one or more auxiliary processors are used for parallel computing of the instruction subsets. Using the multicore-parallel digital signal processor and the method for operating the parallel instruction sets can improve efficiency of the processor and reduce or wipe out redundant operations to the utmost extent, and thereby processing performance of the processor can be improved.

Description

Multicore-parallel digital signal processor and method for operating parallel instruction sets
Technical field
The present invention relates to the technical field of multicore processors, and in particular to a multicore-parallel digital signal processor and a method for operating parallel instruction sets.
Background
Traditional computer architecture design has concentrated on optimizing cache memories, branch prediction, and out-of-order superscalar execution. These approaches suit general-purpose processor design but are not the best choice for embedded systems. Similarly, parallel computing plays a vital role in general high-performance computing, but existing parallel architectures and parallel programming models were not designed for high-performance embedded systems.
Embedded processors are widely used; for example, mobile phones and other battery-powered systems use processors with ultra-low power consumption. The application-specific instruction-set processor (ASIP) is the optimal processor structure for embedded systems. Embedded signal processing systems use ASIPs to achieve high performance, low power consumption, and programmability within a specific application domain. An ASIP has its instruction set architecture designed and optimized for one class of applications, which reduces the power consumption and silicon area of the processor.
The architecture of an application-specific processor is generally designed using the following two models:
Pipeline-parallel model: this model comprises several processing units, each running one task, connected in a chain. The output of processing unit N is connected to the input of processing unit N+1. The model is widely used in communication and multimedia signal processing. The key to using it is that the task run time on every processor must be shorter than the interval between successive data inputs to the whole system.
Data-parallel model: several processing units perform the same task on different data to compute the result. The key to using it is to identify the regularity of the data and, on that basis, to split the data and process the parts in parallel.
An application-specific processor design is usually a flexible combination of these two models. The goal of processor or system-on-chip design is to find the combination of the two models that just meets the performance demands of the specific application, thereby avoiding the hardware overhead and power consumption of unnecessary arithmetic units and on-chip interconnect.
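The two models described above can be illustrated with a minimal Python sketch. The function names, the contiguous-chunk splitting rule, and the example numbers are assumptions made for illustration only; they do not come from the patent.

```python
def pipeline_keeps_up(stage_times, input_interval):
    """Pipeline-parallel condition: every stage's task run time must be
    shorter than the interval between successive inputs to the system."""
    return all(t < input_interval for t in stage_times)


def data_parallel_map(task, data, num_units):
    """Data-parallel model: split the data into contiguous chunks (an assumed
    splitting rule) and let each unit run the same task on one chunk."""
    if not data:
        return []
    chunk = -(-len(data) // num_units)  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    return [task(part) for part in parts]


# Three chained stages against an input interval of 5 time units.
print(pipeline_keeps_up([3, 4, 2], 5))
# Three units each summing their own slice of the data.
print(data_parallel_map(sum, [1, 2, 3, 4, 5, 6], 3))
```

A combined design would apply the first check to the chain of stages and use the second pattern inside any stage that is too slow on its own.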
The earliest high-performance signal processors used application-specific integrated circuits (ASICs), which at the time were the only way to achieve both high performance and low power consumption. ASICs, however, lack flexibility. Because new standards and new algorithms for signal processing applications are continually being proposed, flexibility and programmability of the hardware design have become important requirements. Typical examples are wireless baseband processors and multimedia processors. A wireless baseband processor must support baseband signal processing for multiple wireless communication standards through software programming. A multimedia processor must support multiple audio and video codec standards.
To provide greater computing capability, application-specific processors use parallel multicore structures. Each processor core also uses instruction-level or data-level parallelism to improve its computing capability. Existing programmable digital signal processor cores use two microarchitectures: one is based on the very long instruction word (VLIW) structure; the other uses the single instruction, multiple data (SIMD) structure. Existing multicore processors mainly use three structures: the first is a dual-core architecture based on one digital signal processing (DSP) core and one VLIW core; the second is a multicore processor based on one controller and several SIMD cores; the third is a massively parallel computing array similar to a graphics processing unit (GPU).
The processing efficiency of a digital signal processor is defined as the number of algorithm-function arithmetic operations divided by the total number of operations. Algorithm-function arithmetic operations are the operations that the processor must support for the user. Non-functional control operations and data access operations are redundant operations. To improve processor efficiency, redundant operations must be reduced or hidden as far as possible during instruction set architecture design and multicore system design. Instructions and structures designed specifically to reduce redundancy increase programming complexity, and this added complexity must be hidden by the assembler and compiler tools and by the programming flow.
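The efficiency definition above can be written out as a small Python sketch. It assumes that the total is simply the sum of functional, control, and data access operations that occupy execution time; the function name and the sample counts are hypothetical.

```python
def processing_efficiency(functional_ops, control_ops, access_ops):
    """Efficiency = algorithm-function operations / total operations.

    Assumption: total = functional + control + data access operations,
    where control and access operations count as redundant.
    """
    total = functional_ops + control_ops + access_ops
    if total == 0:
        raise ValueError("no operations were counted")
    return functional_ops / total


# 60 functional operations with 15 control and 25 access operations.
baseline = processing_efficiency(60, 15, 25)
# Same workload when the access operations are fully hidden (counted as 0).
hidden = processing_efficiency(60, 15, 0)
print(baseline, hidden)
```

Hiding an operation, in this sketch, means it no longer adds to the total because it overlaps with functional work.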
Summary of the invention
Embodiments of the invention provide a multicore-parallel digital signal processor and a method for operating parallel instruction sets, in order to improve processor efficiency and to reduce or hide redundant operations as far as possible.
In one aspect, an embodiment of the invention provides a multicore-parallel digital signal processor that runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The multicore-parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit and a parallel computing unit that are independent of each other. The control instruction subset, the parallel access instruction subset, and the parallel computing instruction subset are each encoded independently and each executed by an independent hardware unit: the main processor runs the control instruction subset; the parallel access units of one or more auxiliary processors run the parallel access instruction subset; and the parallel computing units of one or more auxiliary processors run the parallel computing instruction subset.
Optionally, in an embodiment of the invention, the finite state machine controller of the parallel access unit of one or more auxiliary processors is used to run a single instruction that performs a single data access, or to run a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.
Optionally, in an embodiment of the invention, the finite state machine controller of the parallel computing unit of one or more auxiliary processors is used to run a single-instruction arithmetic/logic operation, or to run a task-level instruction that performs: a single-instruction loop; or a multi-instruction loop; or a single-instruction, multiple-operation computation based on the parallel data path (started by a task-level instruction, it uses the data path of the parallel computing unit and completes multiple arithmetic/logic operations by combining the computing elements in the data path); or a single-instruction, multiple-operation computation based on the parallel data path and the finite state machine controller (started by a task-level instruction, it uses the data path of the parallel computing unit and completes multiple arithmetic/logic operations under the control of the finite state machine controller); or a single-instruction, multiple-operation arithmetic/logic computation based on a tightly coupled accelerator module (started by a task-level instruction, it uses the tightly coupled accelerator of the parallel computing unit to perform the multiple-operation computation).
Optionally, in an embodiment of the invention, the use of the parallel access units of the one or more auxiliary processors to run the parallel access instruction subset comprises: the parallel access unit, which comprises a register group and a local memory group made up of a plurality of memories, runs the parallel access instruction subset, provides the parallel computing unit with parallel access to multiple channels of data, and exchanges data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.
Optionally, in an embodiment of the invention, the use of the parallel computing units of the one or more auxiliary processors to run the parallel computing instruction subset comprises: through a data path of single instruction, multiple data (SIMD) structure in the parallel computing unit of one or more auxiliary processors, performing the same, mutually independent operation on multiple channels of data and returning multiple results, or operating on multiple channels of data to obtain a single result.
Optionally, in an embodiment of the invention, the use of the parallel access units of the one or more auxiliary processors to run the parallel access instruction subset comprises: performing parallel or serial data access through the data access channels from the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.
Optionally, in an embodiment of the invention, the use of the parallel access units of the one or more auxiliary processors to run the parallel access instruction subset comprises: using a multi-channel interleaving controller in the parallel access unit, based on an address lookup table, to perform parallel conflict-free access to the memory group.
Optionally, in an embodiment of the invention, the use of the multi-channel interleaving controller based on an address lookup table to perform parallel conflict-free access to the memory group comprises: using a first interleaving table, scattering the data stored sequentially in the main memory and depositing them into the memories of the parallel auxiliary processors as required by the algorithm to be executed; and using a second interleaving table, restoring the out-of-order operation results in the memories of the parallel auxiliary processors to their original order and depositing them into the main memory.
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor, by executing the orthogonal instruction subsets in parallel, performs arithmetic computation at the same time as data access; and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset either in independent loops or in a combined loop.
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) data path. The loop operations are of two types: one type runs function-evaluation tasks and does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, for which the local vector memory must supply the vector data and the coefficient array.
In another aspect, an embodiment of the invention provides a method for operating parallel instruction sets, in which a multicore-parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The multicore-parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit and a parallel computing unit that are independent of each other, and the three instruction subsets are each encoded independently and each executed by an independent hardware unit. The method comprises: running the control instruction subset on the main processor; running the parallel access instruction subset on the parallel access units of one or more auxiliary processors; and running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: through the finite state machine controller of the parallel access unit of one or more auxiliary processors, running a single instruction that performs a single data access, or running a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.
Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: through the finite state machine controller of the parallel computing unit of one or more auxiliary processors, running a single-instruction arithmetic/logic operation, or running a task-level instruction that performs: a single-instruction loop; or a multi-instruction loop; or a single-instruction, multiple-operation computation based on the parallel data path; or a single-instruction, multiple-operation computation based on the parallel data path and the finite state machine controller; or a single-instruction, multiple-operation arithmetic/logic computation based on a tightly coupled accelerator module.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: the parallel access unit, which comprises a register group and a local memory group made up of a plurality of memories, runs the parallel access instruction subset, provides the parallel computing unit with parallel access to multiple channels of data, and exchanges data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.
Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: through a data path of single instruction, multiple data (SIMD) structure in the parallel computing unit of one or more auxiliary processors, performing the same, mutually independent operation on multiple channels of data and returning multiple results, or operating on multiple channels of data to obtain a single result.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: performing parallel or serial data access through the data access channels from the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: using a multi-channel interleaving controller in the parallel access unit, based on an address lookup table, to perform parallel conflict-free access to the memory group.
Optionally, in an embodiment of the invention, using the multi-channel interleaving controller based on an address lookup table to perform parallel conflict-free access to the memory group comprises: using a first interleaving table, scattering the data stored sequentially in the main memory and depositing them into the memories of the parallel auxiliary processors as required by the algorithm to be executed; and using a second interleaving table, restoring the out-of-order operation results in the memories of the parallel auxiliary processors to their original order and depositing them into the main memory.
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor, by executing the orthogonal instruction subsets in parallel, performs arithmetic computation at the same time as data access; and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset either in independent loops or in a combined loop.
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) data path. The loop operations are of two types: one type runs function-evaluation tasks and does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, for which the local vector memory must supply the vector data and the coefficient array.
The above technical solution has the following beneficial effects. A multicore-parallel digital signal processor runs at least three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit and a parallel computing unit that are independent of each other, and the three instruction subsets are each encoded independently and each executed by an independent hardware unit: the control instruction subset runs on the main processor, the parallel access instruction subset runs on the parallel access units of one or more auxiliary processors, and the parallel computing instruction subset runs on the parallel computing units of one or more auxiliary processors. By these technical means, processor efficiency is improved and redundant operations are reduced or hidden as far as possible, thereby improving the processing performance of the processor.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the architecture of a multicore-parallel digital signal processor according to an embodiment of the invention;
Fig. 2 is a block diagram of a parallel memory comprising three data access hardware components according to an embodiment of the invention;
Fig. 3 is a schematic diagram of an example of interleaving input data from an external memory according to an embodiment of the invention;
Fig. 4 is a schematic diagram of an example multicore processor structure according to an embodiment of the invention.
Embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic diagram of the architecture of a multicore-parallel digital signal processor according to an embodiment of the invention. The multicore-parallel digital signal processor runs at least the following three instruction subsets: control instruction subset 21, parallel access instruction subset 22, and parallel computing instruction subset 23. It comprises one main processor 11 and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit 13 and a parallel computing unit 14 that are independent of each other. Control instruction subset 21, parallel access instruction subset 22, and parallel computing instruction subset 23 are each encoded independently and each executed by an independent hardware unit: main processor 11 runs control instruction subset 21; the parallel access units 13 of one or more auxiliary processors run parallel access instruction subset 22; and the parallel computing units 14 of one or more auxiliary processors run parallel computing instruction subset 23. The multicore-parallel digital signal processor further comprises a direct memory access (DMA) controller 12, which accesses data in the main memory through storage subsystem 15 and transfers data between main processor 11 and the auxiliary processors.
Optionally, in an embodiment of the invention, the finite state machine controller of the parallel access unit 13 of one or more auxiliary processors is used to run a single instruction that performs a single data access, or to run a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.
Optionally, in an embodiment of the invention, the finite state machine controller of the parallel computing unit 14 of one or more auxiliary processors is used to run a single-instruction arithmetic/logic operation, or to run a task-level instruction that performs: a single-instruction loop; or a multi-instruction loop; or a single-instruction, multiple-operation computation based on the parallel data path; or a single-instruction, multiple-operation computation based on the parallel data path and the finite state machine controller; or a single-instruction, multiple-operation arithmetic/logic computation based on a tightly coupled accelerator module.
Optionally, in an embodiment of the invention, the use of the parallel access units 13 of the one or more auxiliary processors to run parallel access instruction subset 22 comprises: parallel access unit 13, which comprises a register group and a local memory group made up of a plurality of memories, runs parallel access instruction subset 22, provides parallel computing unit 14 with parallel access to multiple channels of data, and exchanges data between the local memory group and the register group, between the local memory group and parallel computing unit 14, and between the register group and parallel computing unit 14.
Optionally, in an embodiment of the invention, the use of the parallel computing units 14 of the one or more auxiliary processors to run parallel computing instruction subset 23 comprises: through a data path of single instruction, multiple data (SIMD) structure in the parallel computing unit 14 of one or more auxiliary processors, performing the same, mutually independent operation on multiple channels of data and returning multiple results, or operating on multiple channels of data to obtain a single result.
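The two kinds of SIMD data path computation named above (multiple channels in, multiple results out; multiple channels in, one result out) can be sketched in Python as follows. The function names are hypothetical, and plain Python lists stand in for the hardware lanes.

```python
from functools import reduce


def simd_elementwise(op, lanes_a, lanes_b):
    """Apply the same independent operation to every lane pair and
    return one result per lane."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("both operands must have the same number of lanes")
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]


def simd_reduce(op, lanes):
    """Combine all lanes into a single result with a binary operation."""
    return reduce(op, lanes)


# Four-lane addition returns four results.
print(simd_elementwise(lambda a, b: a + b, [1, 2, 3, 4], [10, 20, 30, 40]))
# Four-lane summation returns one result.
print(simd_reduce(lambda a, b: a + b, [1, 2, 3, 4]))
```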
Optionally, in an embodiment of the invention, the use of the parallel access units 13 of the one or more auxiliary processors to run parallel access instruction subset 22 comprises: performing parallel or serial data access through the data access channels from parallel access unit 13 to direct memory access (DMA) controller 12 and to parallel computing unit 14.
Optionally, in an embodiment of the invention, the use of the parallel access units 13 of the one or more auxiliary processors to run parallel access instruction subset 22 comprises: using a multi-channel interleaving controller in parallel access unit 13, based on an address lookup table, to perform parallel conflict-free access to the memory group.
Optionally, in an embodiment of the invention, the use of the multi-channel interleaving controller in parallel access unit 13, based on an address lookup table, to perform parallel conflict-free access to the memory group comprises: using a first interleaving table, scattering the data stored sequentially in the main memory and depositing them into the memories of the parallel auxiliary processors as required by the algorithm to be executed; and using a second interleaving table, restoring the out-of-order operation results in the memories of the parallel auxiliary processors to their original order and depositing them into the main memory.
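The role of the two interleaving tables can be shown with a minimal Python sketch. The table format (one `(bank, offset)` pair per element), the round-robin layout, and the function names are assumptions for illustration; the patent does not specify how the lookup tables are encoded.

```python
def scatter(main_memory, first_table, num_banks):
    """Use the first table to spread sequential data across memory banks.
    first_table[i] is the assumed (bank, offset) destination of element i."""
    depth = max((off for _, off in first_table), default=-1) + 1
    banks = [[None] * depth for _ in range(num_banks)]
    for value, (bank, off) in zip(main_memory, first_table):
        banks[bank][off] = value
    return banks


def gather(banks, second_table):
    """Use the second table to read bank contents back in original order.
    second_table[i] is the assumed (bank, offset) source of output element i."""
    return [banks[bank][off] for bank, off in second_table]


data = list(range(8))
# Hypothetical round-robin layout over 4 banks: element i -> bank i % 4.
table = [(i % 4, i // 4) for i in range(8)]
banks = scatter(data, table, 4)
# Stand-in for the parallel computation: double every stored value in place.
results = [[2 * v for v in bank] for bank in banks]
print(gather(results, table))
```

In this sketch the second table coincides with the first because the results are written back in place; a design that leaves results in a different layout would need a different second table.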
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor, by executing the orthogonal instruction subsets in parallel, performs arithmetic computation at the same time as data access; and, by setting execution counts, executes parallel computing instruction subset 23 and the parallel access instruction subset either in independent loops or in a combined loop.
Optionally, in an embodiment of the invention, the multicore-parallel digital signal processor uses the finite state machine controller of parallel access unit 13 and the finite state machine controller of parallel computing unit 14 as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) data path. The loop operations are of two types: one type runs function-evaluation tasks and does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, for which the local vector memory must supply the vector data and the coefficient array.
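The two loop types can be contrasted in a short Python sketch. The function names, the default lane count of 4, and the choice of deriving operands from the loop counter in the first loop are all assumptions made for illustration.

```python
def function_loop(func, count):
    """Loop type 1: function evaluation. Operands come from the loop
    counter (an assumption), so no local vector memory is read."""
    return [func(i) for i in range(count)]


def mac_loop(vector_memory, coefficients, lanes=4):
    """Loop type 2: multiply-accumulate. The local vector memory supplies
    both the vector data and the coefficient array, `lanes` items per step."""
    if len(vector_memory) != len(coefficients) or len(vector_memory) % lanes:
        raise ValueError("operands must match and fill whole lane groups")
    acc = [0] * lanes  # one accumulator per SIMD lane
    for base in range(0, len(vector_memory), lanes):
        for lane in range(lanes):
            acc[lane] += vector_memory[base + lane] * coefficients[base + lane]
    return sum(acc)  # final reduction across lanes


print(function_loop(lambda i: i * i, 4))
print(mac_loop([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))
```

The outer `for` loop plays the part of the loop controller: its iteration count is fixed before the loop starts, which is what allows a state machine rather than per-cycle instructions to drive it.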
The embodiment of the invention discloses an efficient parallel architecture based on a three-dimensional orthogonal parallel instruction set. The instruction set of this architecture is divided into three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The control instruction subset runs in the main controller and is used for the execution of the top-level program, system control, resource management, and a small number of unpredictable algorithms. The binary encoding of each instruction in the control instruction subset is very short, so its program length can be allowed to be relatively long.
The parallel access instruction subset controls the parallel data access channels. It is the data access instruction subset, used to program and carry out data access between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit. Because most signal processing consists of loop iterations, the parallel access instruction subset should be able both to supply per-clock-cycle execution settings in the form of program code and to supply, in the form of reconfigurable vector code, a fixed control vector for a specific subroutine that need not change every clock cycle. Therefore, although the code of the data access instruction subset may be relatively long, it changes comparatively little, the number of instructions is small, and the code overhead is comparatively low.
The parallel computing instruction subset controls the parallel data processing channels. It is the function-execution instruction subset of parallel subroutines, and it is used to control the parallel data path so that parallel algorithms are executed on it. Execution of this subset's code presupposes that the data have already been delivered to the inlet of the data path by the local memory group in the parallel memory group or data. When a task-level macro instruction is executed, a single instruction can control the execution of all or most of a subroutine. Therefore, although the code of the function-execution instruction subset of parallel subroutines may be relatively long, it changes comparatively little, the number of instructions is small, and the code overhead is comparatively low.
Separating the three instruction subsets and defining and encoding each independently yields the following advantages: 1) each instruction subset is binary-encoded separately, which greatly reduces the storage overhead of the code; 2) because the program code of the parallel access instruction subset and of the parallel computing instruction subset is short, their binary encoding can use microcode, which opens up to the programmer, to the greatest possible degree, the controllability of the hardware resources and the flexibility to develop and reconfigure data and addressing functions; 3) the code of the three instruction subsets can further be executed in parallel in a pipelined manner, so the programmer can use the parallel hardware to the fullest extent. The multicore-parallel digital signal processor of this embodiment improves processor efficiency and reduces or hides redundant operations as far as possible, thereby improving the processing performance of the processor.
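A rough cycle count shows why running the access and computing subsets in a pipelined fashion hides access overhead. The sketch below is an assumption-laden model, not taken from the patent: it assumes fixed per-block access and compute costs, and that a block can be fetched while the previous one is being processed.

```python
def sequential_cycles(num_blocks, access_cycles, compute_cycles):
    """Access and compute strictly alternate on one unit."""
    return num_blocks * (access_cycles + compute_cycles)


def pipelined_cycles(num_blocks, access_cycles, compute_cycles):
    """Independent access and compute units overlap (assumed two-stage
    pipeline): fetch block k+1 while computing on block k."""
    if num_blocks == 0:
        return 0
    steady = max(access_cycles, compute_cycles)
    return access_cycles + (num_blocks - 1) * steady + compute_cycles


# Hypothetical workload: 4 blocks, 3 cycles to fetch, 5 cycles to compute.
print(sequential_cycles(4, 3, 5), pipelined_cycles(4, 3, 5))
```

When compute time dominates, as in this example, all but the first fetch are hidden behind computation.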
Figure 2 is a block diagram of the parallel memory of an embodiment of the invention, which comprises three pieces of data access hardware: (1) a memory group 31 containing multiple parallel random access memories; (2) a controllable interleaving switch group 32 that interleaves the parallel input and parallel output data of the memory group; and (3) a reconfigurable finite state machine (FSM) controller 33 that controls the memory group and the interleaving switch group.
Hardware 31, the memory group, consists of multiple random access memories arranged in parallel and supports simultaneous parallel reads and writes to several memories. An access address is supplied for each memory during a read or write, so a different address can be accessed at random in each memory. The parallel memory group has two access channels. Channel 1 (on the left of the memory group) is the external access channel: input data from external memory are written into the parallel memory through it, and local data in the parallel memory are output to external memory through it. Channel 2 (on the right of the memory group) is the access channel for the vector register group and for vector data processing. The processor supplies the access addresses of the parallel memory under program control, so random addresses in each memory block can be accessed in parallel. During a parallel read, several data channels may read the same address in the same memory block.
Hardware 32 is the controllable interleaving switch group, which interleaves the parallel input and output data exchanged between external memory and the parallel memory. When input data are interleaved, they are stored in the parallel memory group in shuffled order. When output data are interleaved, the data read from the parallel memory are reordered before being output to external memory.
Hardware 33 is the FSM controller that controls the memory group and the interleaving switch group. It controls the interleaver through the interleaving control vector produced by running a parallel access instruction, and thereby adjusts the order of input or output vector data. A parallel access instruction can control interleaving for the read or write of a single vector, or for the read or write of a data stream made up of multiple data items.
Interleaving of input data from external memory:
Figure 3 is a schematic example of interleaving input data from external memory in an embodiment of the invention. The parallel memory group consists of four random access memories. The input data stream is D0-D15, and four items are written to the parallel memory group in parallel at a time. As the data pass through the controllable interleaving switch group, their write positions in the parallel memory are rearranged. For example, the second vector, D4 D5 D6 D7, is reordered to D7 D4 D5 D6 before being written. The purpose of the rearrangement is to support parallel data access by the parallel computing unit. In this example the parallel computing unit needs to process D0, D4, D8 and D12 in parallel; the input interleaving switch group places the elements of a vector that must be accessed together in different memories, so that access during computation is parallel and conflict-free.
In the opposite direction, when output data are interleaved, the results of the parallel computing unit are reordered as they are written into the parallel memory; the reordering allows the local memory group to be accessed without conflicts when the data are output in parallel to external memory.
The interleaving switch group in the example above is controlled by a reconfigurable FSM controller, whose input is the interleaving control vector produced by running a parallel access instruction. By reconfiguring this FSM controller, interleaving control can be generated, according to the configuration, either for a single input or output vector or for each vector in a data stream.
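The input interleaving example can be sketched in a few lines of Python. This is only an illustration under an assumed rule: the text gives the reordering of the second vector (D4 D5 D6 D7 becomes D7 D4 D5 D6) but not of the others, so the sketch assumes that vector number r is rotated right by r positions. The names `interleave_write` and `bank_of` are hypothetical and do not come from the patent.

```python
def interleave_write(stream, num_banks=4):
    """Write `stream` into `num_banks` memories, one vector per address.

    Assumed rule (not stated in the source beyond the second vector):
    the vector written at address r is rotated right by r positions.
    Returns banks[b][r], the element held by memory b at address r.
    """
    banks = [[] for _ in range(num_banks)]
    for start in range(0, len(stream), num_banks):
        row = start // num_banks
        vector = stream[start:start + num_banks]
        shift = row % num_banks
        # Rotate right by `shift`; a shift of 0 leaves the vector unchanged.
        rotated = vector[-shift:] + vector[:-shift] if shift else vector
        for b, value in enumerate(rotated):
            banks[b].append(value)
    return banks


def bank_of(banks, value):
    """Return the index of the memory that holds `value`."""
    for b, contents in enumerate(banks):
        if value in contents:
            return b
    raise ValueError(f"{value} not stored")


stream = [f"D{i}" for i in range(16)]
banks = interleave_write(stream)
second_row = [banks[b][1] for b in range(4)]
column = ["D0", "D4", "D8", "D12"]
# The four elements the computing unit wants together sit in four different memories.
conflict_free = len({bank_of(banks, d) for d in column}) == len(column)
print(second_row, conflict_free)
```

Under this assumed rotation, both a row of the input (such as D4 to D7) and a column (such as D0, D4, D8, D12) touch every memory exactly once, which is the conflict-free property the example describes.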
Multicore networking:
The processors of this multicore processor and the external memory are interconnected by a network-on-chip and exchange data over it. Figure 4 is a schematic diagram of an example multicore processor structure in an embodiment of the invention. It consists of one main processor and N auxiliary processors, and each auxiliary processor comprises a parallel computing unit and a parallel memory unit. Data transfers between external memory and local memory can be carried out by a DMA controller connected to the network-on-chip. Auxiliary processors can also exchange data with one another over the network-on-chip.
Parallel access:
Parallel access to data is implemented by the parallel memory group, which consists of multiple random access memories. Each memory can be read or written independently at the same time. In a parallel write, each element of the input vector corresponds to one memory, so the whole vector can be written to the memory group in a single operation. In a parallel read, one data item is read from each random access memory, so a whole vector is read in parallel.
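As a rough software model of this behavior, the sketch below treats the memory group as a list of independent memories, each given its own address in the same access. The class name `MemoryGroup` and its methods are hypothetical, and the model ignores timing; it only shows that one vector access touches every memory once.

```python
class MemoryGroup:
    """Toy model of a parallel memory group: one element per memory per access."""

    def __init__(self, num_banks, depth):
        self.banks = [[None] * depth for _ in range(num_banks)]

    def write_vector(self, addresses, vector):
        """Write vector[i] to memory i at addresses[i], all in one access."""
        if not (len(addresses) == len(vector) == len(self.banks)):
            raise ValueError("need one address and one element per memory")
        for bank, addr, value in zip(self.banks, addresses, vector):
            bank[addr] = value

    def read_vector(self, addresses):
        """Read one element from every memory, each at its own address."""
        if len(addresses) != len(self.banks):
            raise ValueError("need one address per memory")
        return [bank[addr] for bank, addr in zip(self.banks, addresses)]


group = MemoryGroup(num_banks=4, depth=8)
group.write_vector([0, 3, 1, 2], [10, 20, 30, 40])
readback = group.read_vector([0, 3, 1, 2])
```

Because each memory receives exactly one address per access, no two elements of the vector compete for the same memory.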
Parallel computation:
Parallel computation on data is carried out by the parallel computing unit of an auxiliary processor. Data-level parallel processors usually adopt a single instruction, multiple data (SIMD) structure, which computes on multiple input data through a multi-input datapath. Parallel computation includes operations that apply the same operation to multiple inputs to obtain multiple results, such as vector addition, which adds the corresponding elements of two vectors to give a result vector. It also includes operations that compute a single result from multiple inputs, such as vector summation, whose result is the sum of all elements of the input vector.
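The two kinds of SIMD operation named above can be illustrated with plain Python lists standing in for the data lanes. This is only a functional sketch: the loops below run sequentially, whereas the hardware would process all lanes at once, and the function names are illustrative.

```python
def vector_add(a, b):
    """Multiple inputs, multiple results: add corresponding elements lane by lane."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def vector_sum(a):
    """Multiple inputs, single result: reduce all lanes to their sum."""
    total = 0
    for x in a:
        total += x
    return total


added = vector_add([1, 2, 3, 4], [10, 20, 30, 40])
reduced = vector_sum([1, 2, 3, 4])
```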
Corresponding to the apparatus embodiment above, an embodiment of the invention also provides a method for operating a parallel instruction set. In the method, a multicore parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset. The multicore parallel digital signal processor comprises one main processor and a plurality of auxiliary processors; each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other; and the three instruction subsets are encoded independently and executed by separate hardware units. The method comprises: running the control instruction subset on the main processor; running the parallel access instruction subset on the parallel access units of one or more auxiliary processors; and running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: using the FSM controller of the parallel access unit of one or more auxiliary processors either to run a single instruction that performs a single data access, or to run a task-level instruction that performs a single-instruction-loop or multi-instruction-loop data access operation.
Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: using the FSM controller of the parallel computing unit of one or more auxiliary processors either to run a single-instruction arithmetic or logic operation, or to run a task-level instruction that performs one of the following arithmetic or logic operations: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath; a single-instruction multi-operation based on the parallel datapath and the FSM controller; or a single-instruction multi-operation based on a tightly coupled accelerator unit.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: running the parallel access instruction subset on a parallel access unit that comprises a register group and a local memory group made up of multiple memories; providing the parallel computing unit with parallel access to multiple data lanes; and exchanging data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.
Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: using a datapath of single instruction, multiple data (SIMD) structure in the parallel computing unit of one or more auxiliary processors to perform either an operation that applies the same, mutually independent operation to multiple data lanes and returns multiple results, or an operation that combines multiple data lanes into a single result.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: performing parallel or serial data access through the data access channels that connect the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.
Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: using a multi-way interleaving controller in the parallel access unit, based on an address lookup table, to access the memory group in parallel without conflicts.
Optionally, in an embodiment of the invention, using the multi-way interleaving controller based on an address lookup table to access the memory group in parallel without conflicts comprises: using a first interleaving table to scatter data stored sequentially in main memory and store them in the memories of the parallel auxiliary processors as the algorithm being executed requires; and using a second interleaving table to restore the out-of-order results held in the memories of the parallel auxiliary processors to their original order and store them in main memory.
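The role of the two interleaving tables can be sketched as a pair of lookup-table permutations. The sketch makes two assumptions that the source does not state: each table is a plain list mapping a source index to a destination index, and the second table is simply the inverse of the first (in practice the results might be in a different order from the inputs). The names `scatter` and `inverse_table`, and the bit-reversed example table, are illustrative only.

```python
def scatter(data, table):
    """Place data[i] at position table[i]; `table` must be a permutation."""
    out = [None] * len(data)
    for i, dst in enumerate(table):
        out[dst] = data[i]
    return out


def inverse_table(table):
    """Build the table that undoes `table`."""
    inv = [0] * len(table)
    for i, dst in enumerate(table):
        inv[dst] = i
    return inv


# First table: scatter sequential main-memory data into local memory
# (a bit-reversed order is used here purely as an example).
first_table = [0, 4, 2, 6, 1, 5, 3, 7]
main_memory = list(range(8))
local_memory = scatter(main_memory, first_table)

# Stand-in for the algorithm: square every element where it sits.
results = [x * x for x in local_memory]

# Second table: restore the out-of-order results to the original order.
second_table = inverse_table(first_table)
restored = scatter(results, second_table)
```

After the second pass, `restored[k]` holds the result for the element that was originally at position k in main memory.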
Optionally, in an embodiment of the invention, the multicore parallel digital signal processor executes the orthogonal instruction subsets in parallel so that arithmetic computation proceeds concurrently with data access, and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.
Optionally, in an embodiment of the invention, the multicore parallel digital signal processor uses the FSM controller of the parallel access unit and the FSM controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) datapath. The loop operations are of two types: one type is a function-solving task, whose loop does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, whose loop needs the local vector memory to supply vector data and a coefficient array.
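The difference between the two loop types can be illustrated as follows. The sketch is a software analogy, not the patent's hardware: it assumes that a "function-solving task" is an iteration on values already held in registers (Newton's method for a square root is used as a stand-in), and that the multiply-accumulate loop fetches one group of samples and coefficients from vector memory per iteration. The function names and the `lanes` parameter are hypothetical.

```python
def function_loop(step, x0, iterations):
    """Loop type 1: iterate on a register value; no vector memory traffic."""
    x = x0
    for _ in range(iterations):
        x = step(x)
    return x


def mac_loop(samples, coefficients, lanes=4):
    """Loop type 2: multiply-accumulate fed from vector memory.

    Each iteration fetches up to `lanes` samples and coefficients
    (the lists stand in for the local vector memory) and accumulates
    the sum of their products.
    """
    if len(samples) != len(coefficients):
        raise ValueError("samples and coefficients must have equal length")
    acc = 0
    for start in range(0, len(samples), lanes):
        x = samples[start:start + lanes]
        c = coefficients[start:start + lanes]
        acc += sum(xi * ci for xi, ci in zip(x, c))
    return acc


# Newton iteration for sqrt(2): only the running value is needed.
root = function_loop(lambda x: 0.5 * (x + 2.0 / x), 1.0, 6)

# Dot product of a sample vector with a coefficient array.
dot = mac_loop([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 1, 1, 1, 1])
```

In the first loop the execution count is the only thing the loop controller needs; in the second, every iteration also depends on a data access, which is where the access-side and compute-side controllers would have to run as a combined loop.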
The above embodiment of the method for operating a parallel instruction set improves processor efficiency and reduces or hides redundant operations as far as possible, thereby improving the processing performance of the processor.
The embodiments described above explain the purpose, technical solution and beneficial effects of the invention in further detail. It should be understood that they are merely specific embodiments of the invention and do not limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (20)

1. A multicore parallel digital signal processor, characterized in that the multicore parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset; wherein the multicore parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other, and the control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are encoded independently and executed by separate hardware units; the main processor is used to run the control instruction subset; the parallel access units of one or more auxiliary processors are used to run the parallel access instruction subset; and the parallel computing units of one or more auxiliary processors are used to run the parallel computing instruction subset.
2. The multicore parallel digital signal processor according to claim 1, characterized in that
the finite state machine controller of the parallel access unit of one or more auxiliary processors is used either to run a single instruction that performs a single data access, or to run a task-level instruction that performs a single-instruction-loop or multi-instruction-loop data access operation.
3. The multicore parallel digital signal processor according to claim 1, characterized in that
the finite state machine controller of the parallel computing unit of one or more auxiliary processors is used either to run a single-instruction arithmetic or logic operation, or to run a task-level instruction that performs one of the following arithmetic or logic operations: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath; a single-instruction multi-operation based on the parallel datapath and the finite state machine controller; or a single-instruction multi-operation based on a tightly coupled accelerator unit.
4. The multicore parallel digital signal processor according to claim 1, characterized in that the parallel access units of one or more auxiliary processors being used to run the parallel access instruction subset comprises: running the parallel access instruction subset on a parallel access unit that comprises a register group and a local memory group made up of multiple memories; providing the parallel computing unit with parallel access to multiple data lanes; and exchanging data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.
5. The multicore parallel digital signal processor according to claim 1, characterized in that the parallel computing units of one or more auxiliary processors being used to run the parallel computing instruction subset comprises: using a datapath of single instruction, multiple data (SIMD) structure in the parallel computing unit of one or more auxiliary processors to perform either an operation that applies the same, mutually independent operation to multiple data lanes and returns multiple results, or an operation that combines multiple data lanes into a single result.
6. The multicore parallel digital signal processor according to claim 1, characterized in that the parallel access units of one or more auxiliary processors being used to run the parallel access instruction subset comprises: performing parallel or serial data access through the data access channels that connect the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.
7. The multicore parallel digital signal processor according to claim 1, characterized in that the parallel access units of one or more auxiliary processors being used to run the parallel access instruction subset comprises: using a multi-way interleaving controller in the parallel access unit, based on an address lookup table, to access the memory group in parallel without conflicts.
8. The multicore parallel digital signal processor according to claim 7, characterized in that using the multi-way interleaving controller in the parallel access unit, based on an address lookup table, to access the memory group in parallel without conflicts comprises: using a first interleaving table to scatter data stored sequentially in main memory and store them in the memories of the parallel auxiliary processors as the algorithm being executed requires; and using a second interleaving table to restore the out-of-order results held in the memories of the parallel auxiliary processors to their original order and store them in main memory.
9. The multicore parallel digital signal processor according to claim 1, characterized in that the multicore parallel digital signal processor executes the orthogonal instruction subsets in parallel so that arithmetic computation proceeds concurrently with data access, and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.
10. The multicore parallel digital signal processor according to claim 1, characterized in that the multicore parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) datapath; the loop operations are of two types: one type is a function-solving task, whose loop does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, whose loop needs the local vector memory to supply vector data and a coefficient array.
11. A method for operating a parallel instruction set, characterized in that the method runs at least the following three instruction subsets on a multicore parallel digital signal processor: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset; wherein the multicore parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other, and the control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are encoded independently and executed by separate hardware units; the method comprising:
running the control instruction subset on the main processor;
running the parallel access instruction subset on the parallel access units of one or more auxiliary processors;
running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.
12. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:
using the finite state machine controller of the parallel access unit of one or more auxiliary processors either to run a single instruction that performs a single data access, or to run a task-level instruction that performs a single-instruction-loop or multi-instruction-loop data access operation.
13. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises:
using the finite state machine controller of the parallel computing unit of one or more auxiliary processors either to run a single-instruction arithmetic or logic operation, or to run a task-level instruction that performs one of the following arithmetic or logic operations: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath; a single-instruction multi-operation based on the parallel datapath and the finite state machine controller; or a single-instruction multi-operation based on a tightly coupled accelerator unit.
14. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:
running the parallel access instruction subset on a parallel access unit that comprises a register group and a local memory group made up of multiple memories; providing the parallel computing unit with parallel access to multiple data lanes; and exchanging data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.
15. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises:
using a datapath of single instruction, multiple data (SIMD) structure in the parallel computing unit of one or more auxiliary processors to perform either an operation that applies the same, mutually independent operation to multiple data lanes and returns multiple results, or an operation that combines multiple data lanes into a single result.
16. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:
performing parallel or serial data access through the data access channels that connect the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.
17. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:
using a multi-way interleaving controller in the parallel access unit, based on an address lookup table, to access the memory group in parallel without conflicts.
18. The method for operating a parallel instruction set according to claim 17, characterized in that using the multi-way interleaving controller in the parallel access unit, based on an address lookup table, to access the memory group in parallel without conflicts comprises:
using a first interleaving table to scatter data stored sequentially in main memory and store them in the memories of the parallel auxiliary processors as the algorithm being executed requires;
using a second interleaving table to restore the out-of-order results held in the memories of the parallel auxiliary processors to their original order and store them in main memory.
19. The method for operating a parallel instruction set according to claim 11, characterized in that the multicore parallel digital signal processor executes the orthogonal instruction subsets in parallel so that arithmetic computation proceeds concurrently with data access, and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.
20. The method for operating a parallel instruction set according to claim 11, characterized in that the multicore parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction, multiple data (SIMD) datapath; the loop operations are of two types: one type is a function-solving task, whose loop does not need the local vector memory to supply data; the other type is a loop based on the multiply-accumulate function, whose loop needs the local vector memory to supply vector data and a coefficient array.
CN2011103638203A 2011-11-16 2011-11-16 Multicore-parallel digital signal processor and method for operating parallel instruction sets Pending CN102508643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103638203A CN102508643A (en) 2011-11-16 2011-11-16 Multicore-parallel digital signal processor and method for operating parallel instruction sets

Publications (1)

Publication Number Publication Date
CN102508643A true CN102508643A (en) 2012-06-20

Family

ID=46220737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103638203A Pending CN102508643A (en) 2011-11-16 2011-11-16 Multicore-parallel digital signal processor and method for operating parallel instruction sets

Country Status (1)

Country Link
CN (1) CN102508643A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902512A (en) * 2012-08-31 2013-01-30 浪潮电子信息产业股份有限公司 Multi-thread parallel processing method based on multi-thread programming and message queue
CN103324465A (en) * 2013-05-10 2013-09-25 刘保国 Parallel algorithm and structure of multivariable complex control system
CN103440225A (en) * 2013-08-21 2013-12-11 复旦大学 Multi-core processor and method for reconstructing single instruction and multiple processes
CN103605572A (en) * 2013-12-05 2014-02-26 用友软件股份有限公司 Multithread calculation device
CN104035898A (en) * 2014-06-04 2014-09-10 同济大学 Memory access system based on VLIW (Very Long Instruction Word) type processor
CN105207957A (en) * 2015-08-18 2015-12-30 中国电子科技集团公司第五十八研究所 On-chip network multi-core framework
CN105975048A (en) * 2016-05-05 2016-09-28 高靳旭 DSP chip and construction method thereof
CN106293640A (en) * 2015-06-26 2017-01-04 英特尔公司 Hardware processor and method for closely-coupled Heterogeneous Computing
CN108874730A (en) * 2018-06-14 2018-11-23 北京理工大学 A kind of data processor and data processing method
CN108897263A (en) * 2018-09-13 2018-11-27 杭州华澜微电子股份有限公司 Smart circuit unit and its system and control method with multidimensional data transfer and processing function
CN108984235A (en) * 2018-06-29 2018-12-11 郑州云海信息技术有限公司 A kind of method and relevant apparatus of data processing
CN109063831A (en) * 2017-10-30 2018-12-21 上海寒武纪信息科技有限公司 Artificial intelligence process device and the method for executing vector adduction instruction using processor
CN109086228A (en) * 2018-06-26 2018-12-25 深圳市安信智控科技有限公司 High-speed memory chip with multiple independent access channels
CN109558170A (en) * 2018-11-06 2019-04-02 海南大学 It is a kind of to support data level parallel and the 2-D data access framework of multiple instructions fusion
CN110472747A (en) * 2019-08-16 2019-11-19 第四范式(北京)技术有限公司 For executing the distributed system and its method of multimachine device learning tasks
CN111447394A (en) * 2020-03-05 2020-07-24 视联动力信息技术股份有限公司 Video data processing method, electronic equipment and storage medium
CN111459551A (en) * 2020-04-14 2020-07-28 上海兆芯集成电路有限公司 Microprocessor with highly advanced branch predictor
CN112286456A (en) * 2020-10-27 2021-01-29 清华大学 Storage method and device
CN113204518A (en) * 2020-01-31 2021-08-03 慧与发展有限责任合伙企业 Master and slave processors for configuring a subsystem
CN114902619A (en) * 2019-12-31 2022-08-12 北京希姆计算科技有限公司 Storage management device and chip
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2020-12-11 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5450603A (en) * 1992-12-18 1995-09-12 Xerox Corporation SIMD architecture with transfer register or value source circuitry connected to bus
JP2003177929A (en) * 2001-12-07 2003-06-27 Nri & Ncc Co Ltd Master machine, slave machine, and clustering system having them
CN102144225A (en) * 2008-05-29 2011-08-03 阿克西斯半导体有限公司 Method & apparatus for real-time data processing

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902512A (en) * 2012-08-31 2013-01-30 浪潮电子信息产业股份有限公司 Multi-thread parallel processing method based on multi-thread programming and message queue
CN102902512B (en) * 2012-08-31 2015-12-16 浪潮电子信息产业股份有限公司 A kind of multi-threading parallel process method based on multi-thread programming and message queue
CN103324465A (en) * 2013-05-10 2013-09-25 刘保国 Parallel algorithm and structure of multivariable complex control system
CN103440225A (en) * 2013-08-21 2013-12-11 复旦大学 Multi-core processor and method for reconstructing single instruction and multiple processes
CN103440225B (en) * 2013-08-21 2018-04-03 复旦大学 A kind of polycaryon processor and method of the multi-process of restructural single instrction
CN103605572A (en) * 2013-12-05 2014-02-26 用友软件股份有限公司 Multithread calculation device
CN104035898A (en) * 2014-06-04 2014-09-10 同济大学 Memory access system based on VLIW (Very Long Instruction Word) type processor
CN106293640A (en) * 2015-06-26 2017-01-04 英特尔公司 Hardware processor and method for closely-coupled Heterogeneous Computing
CN106293640B (en) * 2015-06-26 2018-12-04 英特尔公司 Hardware processor, method and the hardware device of Heterogeneous Computing for close-coupled
CN105207957A (en) * 2015-08-18 2015-12-30 中国电子科技集团公司第五十八研究所 On-chip network multi-core framework
CN105207957B (en) * 2015-08-18 2018-10-30 中国电子科技集团公司第五十八研究所 System based on network-on-chip multicore architecture
CN105975048A (en) * 2016-05-05 2016-09-28 高靳旭 DSP chip and construction method thereof
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109063831A (en) * 2017-10-30 2018-12-21 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing a vector addition instruction using the processor
CN108874730A (en) * 2018-06-14 2018-11-23 北京理工大学 Data processor and data processing method
CN109086228A (en) * 2018-06-26 2018-12-25 深圳市安信智控科技有限公司 High-speed memory chip with multiple independent access channels
CN109086228B (en) * 2018-06-26 2022-03-29 深圳市安信智控科技有限公司 High speed memory chip with multiple independent access channels
CN108984235A (en) * 2018-06-29 2018-12-11 郑州云海信息技术有限公司 Data processing method and related apparatus
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
CN108897263A (en) * 2018-09-13 2018-11-27 杭州华澜微电子股份有限公司 Smart circuit unit and its system and control method with multidimensional data transfer and processing function
CN109558170B (en) * 2018-11-06 2021-05-04 极芯通讯技术(南京)有限公司 Two-dimensional data path architecture supporting data level parallelism and multi-instruction fusion
CN109558170A (en) * 2018-11-06 2019-04-02 海南大学 Two-dimensional data path architecture supporting data level parallelism and multi-instruction fusion
CN110472747A (en) * 2019-08-16 2019-11-19 第四范式(北京)技术有限公司 Distributed system and method for executing multiple machine learning tasks
CN114902619A (en) * 2019-12-31 2022-08-12 北京希姆计算科技有限公司 Storage management device and chip
CN114902619B (en) * 2019-12-31 2023-07-25 北京希姆计算科技有限公司 Storage management device and chip
CN113204518A (en) * 2020-01-31 2021-08-03 慧与发展有限责任合伙企业 Master and slave processors for configuring a subsystem
CN111447394A (en) * 2020-03-05 2020-07-24 视联动力信息技术股份有限公司 Video data processing method, electronic equipment and storage medium
CN111459551B (en) * 2020-04-14 2022-08-16 上海兆芯集成电路有限公司 Microprocessor with highly advanced branch predictor
CN111459551A (en) * 2020-04-14 2020-07-28 上海兆芯集成电路有限公司 Microprocessor with highly advanced branch predictor
CN112286456B (en) * 2020-10-27 2022-03-08 清华大学 Storage method and device
CN112286456A (en) * 2020-10-27 2021-01-29 清华大学 Storage method and device
US11996105B2 (en) 2020-12-11 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device

Similar Documents

Publication Publication Date Title
CN102508643A (en) Multicore-parallel digital signal processor and method for operating parallel instruction sets
Abts et al. Think fast: A tensor streaming processor (TSP) for accelerating deep learning workloads
CN108268278B (en) Processor, method and system with configurable spatial accelerator
CN102750133B (en) 32-bit triple-issue digital signal processor supporting SIMD
Fang et al. swdnn: A library for accelerating deep learning applications on sunway taihulight
CN102004719B (en) Very long instruction word processor structure supporting simultaneous multithreading
CN101751244B (en) Microprocessor
CN1142484C (en) Vector processing method of microprocessor
CN102402415B (en) Device and method for buffering data in dynamic reconfigurable array
CN105335331B (en) SHA256 implementation method and system based on a large-scale coarse-grained reconfigurable processor
Sano et al. Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth
CN102279818A (en) Vector data access and storage control method supporting limited sharing and vector memory
CN102306139A (en) Heterogeneous multi-core digital signal processor for orthogonal frequency division multiplexing (OFDM) wireless communication system
CN101847093B (en) Digital signal processor with reconfigurable low power consumption data interleaving network
CN102306141A (en) Method for describing configuration information of dynamic reconfigurable array
CN107506329A (en) Coarse-grained reconfigurable array automatically supporting loop iteration pipelining, and configuration method thereof
CN101021831A (en) 64 bit stream processor chip system structure oriented to scientific computing
CN114356836A (en) RISC-V based three-dimensional interconnected many-core processor architecture and working method thereof
CN103761072A (en) Coarse granularity reconfigurable hierarchical array register file structure
Song et al. Gpnpu: Enabling efficient hardware-based direct convolution with multi-precision support in gpu tensor cores
CN103235717B (en) Processor with polymorphic instruction set architecture
CN102012802B (en) Vector processor-oriented data exchange method and device
CN102411490B (en) Instruction set optimization method for dynamically reconfigurable processors
CN102023846B (en) Shared front-end assembly line structure based on monolithic multiprocessor system
Tan et al. A pipelining loop optimization method for dataflow architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120620