CN102508643A

CN102508643A - Multicore-parallel digital signal processor and method for operating parallel instruction sets

Info

Publication number: CN102508643A
Application number: CN2011103638203A
Authority: CN
Inventors: 刘大可; 王建; 猷阿·索; 安德里雅思·卡尔松
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-11-16
Filing date: 2011-11-16
Publication date: 2012-06-20

Abstract

An embodiment of the invention provides a multicore-parallel digital signal processor and a method for operating parallel instruction sets. The multicore-parallel digital signal processor at least operates a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset, and comprises a main processor and a plurality of auxiliary processors. Each auxiliary processor consists of a parallel access unit and a parallel computing unit, which are mutually independent. The control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are coded independently and executed by independent hardware units respectively. The main processor is used for operational control of the instruction subsets, the parallel access units of the one or more auxiliary processors are used for parallel access of the instruction subsets, and the parallel computing units of the one or more auxiliary processors are used for parallel computing of the instruction subsets. Using the multicore-parallel digital signal processor and the method for operating the parallel instruction sets can improve efficiency of the processor and reduce or wipe out redundant operations to the utmost extent, and thereby processing performance of the processor can be improved.

Description

Multi-core parallel digital signal processor and method for operating a parallel instruction set

Technical field

The present invention relates to the technical field of multi-core processors, and in particular to a multi-core parallel digital signal processor and a method for operating a parallel instruction set.

Background art

Traditional computer architecture design focuses on optimizing cache memory, branch prediction, and out-of-order superscalar execution. These approaches suit general-purpose processor design but are not the best choice for embedded systems. Similarly, parallel computing plays a vital role in general high-performance computing, but existing parallel architectures and parallel programming models were not designed for high-performance embedded systems.

Embedded processors are widely used; mobile phones and other battery-powered systems, for example, use processors with ultra-low power consumption. The application-specific instruction-set processor (ASIP) is the optimal processor structure for embedded systems. Embedded signal processing systems use ASIPs to achieve high performance, low power consumption, and programmability within a specific application domain. An ASIP therefore designs and optimizes its instruction set architecture for one class of applications, thereby reducing the processor's power consumption and silicon area.

The design of an application-specific processor architecture generally uses the following two models:

Pipeline parallel model: this model comprises several processing units, each running one task, with the processing units connected in a chain. The output of processing unit N is connected to the input of processing unit N+1. This model is widely used in communication and multimedia signal processing. The key to using the pipeline parallel model is that the task run time on every processor must be shorter than the interval at which the overall system receives input data.

Data parallel model: several processing units perform the same task on different data to compute the result. The key to using the data parallel model is to identify the regularity of the data and, on that basis, to partition the data and process it in parallel.

An application-specific processor design is usually a flexible combination of the two models above. The goal of processor or system-on-chip design is to find the combination of the two models that just meets the performance demands of the specific application, so as to avoid the hardware cost and power consumption of unnecessary arithmetic units and on-chip interconnect.

Early high-performance signal processors used application-specific integrated circuits (ASICs), which at the time were the only way to achieve high performance and low power consumption. ASICs, however, lack flexibility. Because new standards and new algorithms for signal processing applications are continually being proposed, flexibility and programmability of the hardware design have become an important requirement. Typical examples are wireless baseband processors and multimedia processors. A wireless baseband processor must support baseband signal processing for multiple wireless communication standards through software programming. A multimedia processor must support multiple audio and video codec standards.

To provide greater computing capability, application-specific processors use parallel multi-core structures. Each processor core also uses instruction-level or data-level parallelism to increase computing capability. Existing programmable digital signal processor cores use two microarchitectures. One is the processor based on the very long instruction word (VLIW) structure. The other is the processor using the single instruction multiple data (SIMD) structure. Existing multi-core processors mainly use the following three structures. The first is a dual-core architecture based on one digital signal processing (DSP) core and one VLIW core. The second is a multi-core processor based on one controller and several SIMD cores. The third is a massively parallel computing array similar to a graphics processing unit (GPU).

The processing efficiency of a digital signal processor is defined as the number of algorithm-function arithmetic operations divided by the total number of operations. Algorithm-function arithmetic operations are the operations the processor must perform for the user. Non-functional control operations and data access operations are redundant operations. To improve processor efficiency, redundant operations must be reduced or hidden as far as possible during instruction set architecture design and multi-core system design. However, the instructions and structures specially designed to reduce redundancy increase programming complexity. This added complexity must be hidden by the assembly and compilation tools and by the programming flow.
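As a rough illustration of the efficiency definition above, the following sketch computes the ratio of algorithm-function operations to total operations. The function name, the parameter names, and the operation counts are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def processing_efficiency(functional_ops: int, control_ops: int, access_ops: int) -> float:
    """Return algorithm-function operations divided by total operations.

    Control and data access operations count as redundant operations.
    """
    total_ops = functional_ops + control_ops + access_ops
    if total_ops <= 0:
        raise ValueError("total operation count must be positive")
    return functional_ops / total_ops


# Hypothetical operation counts for one kernel, before and after
# part of the control and access work is hidden or removed.
baseline = processing_efficiency(functional_ops=600, control_ops=150, access_ops=250)
improved = processing_efficiency(functional_ops=600, control_ops=50, access_ops=150)
print(baseline, improved)
```

With the same functional work, lowering the redundant operation counts raises the ratio, which is the sense in which the text speaks of improving efficiency.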

Summary of the invention

Embodiments of the invention provide a multi-core parallel digital signal processor and a method for operating a parallel instruction set, so as to improve processor efficiency and to reduce or hide redundant operations as far as possible.

In one aspect, an embodiment of the invention provides a multi-core parallel digital signal processor that runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The multi-core parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit and a parallel computing unit that are mutually independent. The control instruction subset, the parallel access instruction subset, and the parallel computing instruction subset are coded independently and executed by independent hardware units respectively: the main processor runs the control instruction subset; the parallel access units of one or more auxiliary processors run the parallel access instruction subset; and the parallel computing units of one or more auxiliary processors run the parallel computing instruction subset.

Optionally, in an embodiment of the invention, the finite state machine (FSM) controller of the parallel access unit of one or more auxiliary processors is used to run a single instruction that performs a single data access, or to run a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.

Optionally, in an embodiment of the invention, the FSM controller of the parallel computing unit of one or more auxiliary processors is used to run a single-instruction arithmetic-logic operation, or to run a task-level instruction that performs one of the following: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath (started by the task-level instruction, it uses the datapath of the parallel computing unit and completes multiple arithmetic-logic operations by combining the computing elements in the datapath); a single-instruction multi-operation based on the parallel datapath and the FSM controller (started by the task-level instruction, it uses the datapath of the parallel computing unit and completes multiple arithmetic-logic operations under the control of the FSM controller); or a single-instruction multi-operation arithmetic-logic operation based on a tightly coupled accelerator (started by the task-level instruction, it uses the tightly coupled accelerator of the parallel computing unit to carry out the multi-operation computation).

Optionally, in an embodiment of the invention, the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: running the parallel access instruction subset by means of a parallel access unit comprising a register file and a local memory group made up of a plurality of memories; providing parallel access to multi-way data for the parallel computing unit; and exchanging data between the local memory group and the register file, between the local memory group and the parallel computing unit, and between the register file and the parallel computing unit.

Optionally, in an embodiment of the invention, the parallel computing units of the one or more auxiliary processors being used to run the parallel computing instruction subset comprises: using a datapath of single instruction multiple data (SIMD) structure in the parallel computing units of the one or more auxiliary processors to perform either a computation that applies the same independent operation to each way of multi-way data and returns multi-way results, or a computation on multi-way data that yields a single result.

Optionally, in an embodiment of the invention, the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: performing parallel or serial data access through the data access channels from the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.

Optionally, in an embodiment of the invention, the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: using a multi-way interleaving controller, based on an address lookup table, in the parallel access unit to perform parallel conflict-free access to the memory group.

Optionally, in an embodiment of the invention, using the multi-way interleaving controller based on the address lookup table in the parallel access unit to perform parallel conflict-free access to the memory group comprises: using a first interleaving table to scatter the data stored sequentially in main memory and deposit it into the memories of the parallel auxiliary processors as required by the algorithm being executed; and using a second interleaving table to restore the out-of-order computation results in the memories of the parallel auxiliary processors to their original order and deposit them into main memory.

Optionally, in an embodiment of the invention, the multi-core parallel digital signal processor executes the orthogonal instruction subsets in parallel so that arithmetic computation proceeds concurrently with data access, and, by setting execution counts, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.

Optionally, in an embodiment of the invention, the multi-core parallel digital signal processor uses the FSM controller of the parallel access unit and the FSM controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction multiple data (SIMD) datapath. The loop operation comprises two types: one type runs a function-evaluation task, and this type of loop does not need the local vector memory to supply data; the other type is a loop based on a multiply-accumulate function, and for this type of loop the local vector memory must supply the vector data and the coefficient array.

In another aspect, an embodiment of the invention provides a method for operating a parallel instruction set, in which a multi-core parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. The multi-core parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit and a parallel computing unit that are mutually independent. The control instruction subset, the parallel access instruction subset, and the parallel computing instruction subset are coded independently and executed by independent hardware units respectively. The method comprises: running the control instruction subset on the main processor; running the parallel access instruction subset on the parallel access units of one or more auxiliary processors; and running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.

Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: using the FSM controller of the parallel access unit of one or more auxiliary processors to run a single instruction that performs a single data access, or to run a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.

Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: using the FSM controller of the parallel computing unit of one or more auxiliary processors to run a single-instruction arithmetic-logic operation, or to run a task-level instruction that performs one of the following: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath; a single-instruction multi-operation based on the parallel datapath and the FSM controller; or a single-instruction multi-operation arithmetic-logic operation based on a tightly coupled accelerator.

Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: running the parallel access instruction subset by means of a parallel access unit comprising a register file and a local memory group made up of a plurality of memories; providing parallel access to multi-way data for the parallel computing unit; and exchanging data between the local memory group and the register file, between the local memory group and the parallel computing unit, and between the register file and the parallel computing unit.

Optionally, in an embodiment of the invention, running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises: using a datapath of single instruction multiple data (SIMD) structure in the parallel computing units of the one or more auxiliary processors to perform either a computation that applies the same independent operation to each way of multi-way data and returns multi-way results, or a computation on multi-way data that yields a single result.

Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: performing parallel or serial data access through the data access channels from the parallel access unit to the direct memory access (DMA) controller and to the parallel computing unit.

Optionally, in an embodiment of the invention, running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises: using a multi-way interleaving controller, based on an address lookup table, in the parallel access unit to perform parallel conflict-free access to the memory group.

The above technical solution has the following beneficial effects. The multi-core parallel digital signal processor runs at least three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset. It comprises one main processor and a plurality of auxiliary processors, each consisting of a parallel access unit and a parallel computing unit that are mutually independent. The three instruction subsets are coded independently and executed by independent hardware units respectively: the main processor runs the control instruction subset, the parallel access units of one or more auxiliary processors run the parallel access instruction subset, and the parallel computing units of one or more auxiliary processors run the parallel computing instruction subset. Through these technical means, processor efficiency is improved and redundant operations are reduced or hidden as far as possible, thereby improving the processing performance of the processor.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the architecture of a multi-core parallel digital signal processor according to an embodiment of the invention;

Fig. 2 is a block diagram of a parallel memory comprising three data access hardware components according to an embodiment of the invention;

Fig. 3 is a schematic diagram of an example of interleaving external memory input data according to an embodiment of the invention;

Fig. 4 is a schematic diagram of an example multi-core processor structure according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Fig. 1 is a schematic diagram of the architecture of a multi-core parallel digital signal processor according to an embodiment of the invention. The multi-core parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset 21, a parallel access instruction subset 22, and a parallel computing instruction subset 23. The multi-core parallel digital signal processor comprises one main processor 11 and a plurality of auxiliary processors, each auxiliary processor consisting of a parallel access unit 13 and a parallel computing unit 14 that are mutually independent. The control instruction subset 21, the parallel access instruction subset 22, and the parallel computing instruction subset 23 are coded independently and executed by independent hardware units respectively: the main processor 11 runs the control instruction subset 21; the parallel access units 13 of one or more auxiliary processors run the parallel access instruction subset 22; and the parallel computing units 14 of one or more auxiliary processors run the parallel computing instruction subset 23. The multi-core parallel digital signal processor further comprises a direct memory access (DMA) controller 12, which accesses data in main memory through a storage subsystem 15 and transfers data between the main processor 11 and the auxiliary processors.

Optionally, in an embodiment of the invention, the finite state machine (FSM) controller of the parallel access unit 13 of one or more auxiliary processors is used to run a single instruction that performs a single data access, or to run a task-level instruction that performs single-instruction-loop or multi-instruction-loop data access operations.

Optionally, in an embodiment of the invention, the FSM controller of the parallel computing unit 14 of one or more auxiliary processors is used to run a single-instruction arithmetic-logic operation, or to run a task-level instruction that performs one of the following: a single-instruction loop; a multi-instruction loop; a single-instruction multi-operation based on the parallel datapath; a single-instruction multi-operation based on the parallel datapath and the FSM controller; or a single-instruction multi-operation arithmetic-logic operation based on a tightly coupled accelerator.

Optionally, in an embodiment of the invention, the parallel access units 13 of the one or more auxiliary processors being used to run the parallel access instruction subset 22 comprises: running the parallel access instruction subset 22 by means of a parallel access unit 13 comprising a register file and a local memory group made up of a plurality of memories; providing parallel access to multi-way data for the parallel computing unit 14; and exchanging data between the local memory group and the register file, between the local memory group and the parallel computing unit 14, and between the register file and the parallel computing unit 14.

Optionally, in an embodiment of the invention, the parallel computing units 14 of the one or more auxiliary processors being used to run the parallel computing instruction subset 23 comprises: using a datapath of single instruction multiple data (SIMD) structure in the parallel computing units 14 of the one or more auxiliary processors to perform either a computation that applies the same independent operation to each way of multi-way data and returns multi-way results, or a computation on multi-way data that yields a single result.
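The two kinds of SIMD computation named above can be sketched in software as a lane-wise map and a reduction. This is only a behavioural model under stated assumptions: the helper names `simd_map` and `simd_reduce`, the lane count of four, and the sample data are hypothetical and are not taken from the patent.

```python
from functools import reduce
from operator import add, mul
from typing import Callable, List, Sequence


def simd_map(op: Callable[[int, int], int], a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Apply the same independent operation to every lane; one result per lane."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


def simd_reduce(op: Callable[[int, int], int], lanes: Sequence[int]) -> int:
    """Combine all lanes into a single result."""
    return reduce(op, lanes)


# Hypothetical 4-lane operands.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

products = simd_map(mul, a, b)            # multi-way in, multi-way out
dot_product = simd_reduce(add, products)  # multi-way in, single result out
print(products, dot_product)
```

In hardware all lanes would be processed in the same cycle; the list comprehension here only models the result, not the timing.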

Optionally, in an embodiment of the invention, the parallel access units 13 of the one or more auxiliary processors being used to run the parallel access instruction subset 22 comprises: performing parallel or serial data access through the data access channels from the parallel access unit 13 to the direct memory access (DMA) controller 12 and to the parallel computing unit 14.

Optionally, in an embodiment of the invention, the parallel access units 13 of the one or more auxiliary processors being used to run the parallel access instruction subset 22 comprises: using a multi-way interleaving controller, based on an address lookup table, in the parallel access unit 13 to perform parallel conflict-free access to the memory group.

Optionally, in an embodiment of the invention, using the multi-way interleaving controller based on the address lookup table in the parallel access unit 13 to perform parallel conflict-free access to the memory group comprises: using a first interleaving table to scatter the data stored sequentially in main memory and deposit it into the memories of the parallel auxiliary processors as required by the algorithm being executed; and using a second interleaving table to restore the out-of-order computation results in the memories of the parallel auxiliary processors to their original order and deposit them into main memory.
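A minimal sketch of the two-table idea follows, assuming each interleaving table is a plain permutation that maps a source position to a destination position, and assuming the second table is simply the inverse of the first. The patent does not specify the table format, so the names `scatter` and `inverse_table`, the table contents, and the stand-in computation are all hypothetical.

```python
from typing import List, Sequence


def scatter(data: Sequence, table: Sequence[int]) -> List:
    """Write data[i] to position table[i] (table must be a permutation)."""
    if sorted(table) != list(range(len(data))):
        raise ValueError("table must be a permutation of the data positions")
    out = [None] * len(data)
    for src, dst in enumerate(table):
        out[dst] = data[src]
    return out


def inverse_table(table: Sequence[int]) -> List[int]:
    """Build the table that undoes `table` (assumed role of the second table)."""
    inv = [0] * len(table)
    for src, dst in enumerate(table):
        inv[dst] = src
    return inv


# Hypothetical first interleaving table: even-indexed items go to the first
# half of local memory, odd-indexed items to the second half.
first_table = [0, 4, 1, 5, 2, 6, 3, 7]
main_memory = list("ABCDEFGH")

local_memory = scatter(main_memory, first_table)
results = [item.lower() for item in local_memory]  # stand-in for the computation
second_table = inverse_table(first_table)
restored = scatter(results, second_table)
print(local_memory, restored)
```

Deriving the second table from the first is a design convenience of this sketch; an actual implementation could store both tables independently, for example when the computation itself reorders the results.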

Optionally, in an embodiment of the invention, the multi-core parallel digital signal processor executes the orthogonal instruction subsets in parallel so that arithmetic computation proceeds concurrently with data access, and, by setting execution counts, executes the parallel computing instruction subset 23 and the parallel access instruction subset in independent loops or in a combined loop.

Optionally, in an embodiment of the invention, the multi-core parallel digital signal processor uses the FSM controller of the parallel access unit 13 and the FSM controller of the parallel computing unit 14 as loop controllers to control the loop operation of a single instruction multiple data (SIMD) datapath. The loop operation comprises two types: one type runs a function-evaluation task, and this type of loop does not need the local vector memory to supply data; the other type is a loop based on a multiply-accumulate function, and for this type of loop the local vector memory must supply the vector data and the coefficient array.
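The two loop types can be modelled in software as follows. This is a behavioural sketch only: the function names, the representation of the local vector memory as a list of rows, and the sample values are assumptions made for illustration, and the FSM loop controller is reduced to a plain iteration count.

```python
from typing import Callable, List, Sequence


def function_loop(fn: Callable[[int], int], seeds: Sequence[int], iterations: int) -> List[int]:
    """Function-evaluation loop: each lane iterates on its own state.

    No data is fetched from a local vector memory inside the loop.
    """
    state = list(seeds)
    for _ in range(iterations):
        state = [fn(x) for x in state]
    return state


def mac_loop(vector_memory: Sequence[Sequence[int]],
             coeff_memory: Sequence[Sequence[int]],
             iterations: int) -> List[int]:
    """Multiply-accumulate loop: every iteration needs one data vector and
    one coefficient vector supplied by the (modelled) local vector memory."""
    acc = [0] * len(vector_memory[0])
    for step in range(iterations):
        data = vector_memory[step]
        coeffs = coeff_memory[step]
        acc = [a + d * c for a, d, c in zip(acc, data, coeffs)]
    return acc


# Hypothetical 4-lane data and coefficients for a two-iteration MAC loop.
vectors = [[1, 2, 3, 4], [5, 6, 7, 8]]
coeffs = [[1, 1, 1, 1], [2, 2, 2, 2]]
mac_result = mac_loop(vectors, coeffs, iterations=2)

# Hypothetical function-evaluation loop on two lanes.
fn_result = function_loop(lambda x: 2 * x + 1, seeds=[0, 1], iterations=3)
print(mac_result, fn_result)
```

The difference that matters for the access unit is visible in the loop bodies: `mac_loop` indexes the memories on every iteration, while `function_loop` touches no memory after the seeds are loaded.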

The embodiment of the invention discloses an efficient parallel architecture based on a three-dimensional orthogonal parallel instruction set. The instruction set of this architecture is divided into three instruction subsets: a control instruction subset, a parallel access instruction subset, and a parallel computing instruction subset.

The control instruction subset runs in the main controller and is used for executing the top-level program, system control, resource management, and a small number of unpredictable algorithms. The binary encoding of each instruction in the control instruction subset is very short, so its programs can be allowed to be longer.

The parallel access instruction subset controls the parallel data access paths. It is the data access instruction subset, used to program and execute data accesses between the local memory group and the register file, between the local memory group and the parallel computing unit, and between the register file and the parallel computing unit. Because signal processing consists mostly of loop-iteration computations, the parallel access instruction subset should be able both to supply per-clock execution settings by means of program code and to supply, in the form of reconfigurable vector code, a fixed control vector for a specific subroutine that need not change every clock. Therefore, although the code of the data access instruction subset may be long, it changes relatively little and the number of instructions is small, so the code overhead is relatively low.

The parallel computing instruction subset controls the parallel data processing paths. It is the function-execution instruction subset of parallel subroutines and is used to control the parallel datapath so that parallel algorithms execute on it. Execution of this subset's code presupposes that the data have already been delivered to the inlet of the datapath by the parallel memory group or the local memory group. When a task-level macro instruction is executed, a single instruction can control the execution of all or most of a subroutine. Therefore, although the code of the function-execution instruction subset of parallel subroutines may be long, it changes relatively little and the number of instructions is small, so the code overhead is relatively low.

Separating the three instruction subsets, defining them separately, and coding them independently yields the following advantages: 1) each instruction subset is binary-coded separately, which greatly reduces code storage overhead; 2) because the program code of the parallel access instruction subset and the parallel computing instruction subset is short, their binary encoding can use microcode, which opens to the programmer, to the greatest extent, the controllability of the hardware resources and the flexibility to develop and reconfigure data and addressing functions; 3) the code of the three instruction subsets can further be executed in parallel in pipelined fashion, so the programmer can use the parallel hardware to the fullest extent. The multi-core parallel digital signal processor of the embodiment described above improves processor efficiency and reduces or hides redundant operations as far as possible, thereby improving the processing performance of the processor.
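The separation into three independently coded instruction streams can be pictured with a toy model in which each instruction is routed to the queue of its own hardware unit and every unit with pending work issues one instruction per cycle. The instruction mnemonics, the `Subset` enumeration, and the one-instruction-per-cycle timing are hypothetical, and the model deliberately ignores data dependencies and synchronization between units.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, List, Sequence, Tuple


class Subset(Enum):
    CONTROL = "control"                   # run by the main processor
    ACCESS = "parallel_access"            # run by a parallel access unit
    COMPUTE = "parallel_computing"        # run by a parallel computing unit


def dispatch(program: Sequence[Tuple[Subset, str]]) -> Dict[Subset, Deque[str]]:
    """Route each instruction to the queue of the unit that executes its subset."""
    queues: Dict[Subset, Deque[str]] = {subset: deque() for subset in Subset}
    for subset, instruction in program:
        queues[subset].append(instruction)
    return queues


def run_in_parallel(queues: Dict[Subset, Deque[str]]) -> List[Dict[str, str]]:
    """Each cycle, every unit that still has work issues one instruction."""
    trace = []
    while any(queues.values()):
        cycle = {subset.value: q.popleft() for subset, q in queues.items() if q}
        trace.append(cycle)
    return trace


# Hypothetical program mixing the three subsets.
program = [
    (Subset.CONTROL, "start_task"),
    (Subset.ACCESS, "load_vector"),
    (Subset.COMPUTE, "mac_loop"),
    (Subset.ACCESS, "store_vector"),
    (Subset.CONTROL, "wait_done"),
]

trace = run_in_parallel(dispatch(program))
print(len(trace), trace)
```

Five instructions complete in two modelled cycles because the three units work side by side, which is the intuition behind executing the subsets in parallel; real hardware would additionally need to order the store after the computation.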

Fig. 2 is a block diagram of a parallel memory comprising three data access hardware components according to an embodiment of the invention. The parallel memory comprises: (1) a memory group 31 containing a plurality of parallel random access memories; (2) a controlled interleaving switch group 32 that interleaves the parallel input and parallel output data of the memory group; and (3) a reconfigurable FSM controller 33 that controls the memory group and the interleaving switch group.

Hardware 31, the memory group, consists of a plurality of parallel random access memories and supports simultaneous parallel reads and writes to multiple memories. An access address is supplied for each memory during a read or write, so a different address can be randomly accessed in each memory. This parallel memory group has two access channels. Channel 1 (the left channel of the memory group) is the external access channel: input data from external memory are written to the parallel memory through this channel, or local data in the parallel memory are output to external memory through it. Channel 2 (the right channel of the memory group) is the access channel for the vector register file and vector data processing. The auxiliary processor supplies the access addresses of the parallel memory by programming, so random addresses in each memory block can be accessed in parallel. When a parallel read is performed, multiple data channels may read the same address in the same memory block.

Hardware 32 is the controlled interleaving switch group that interleaves parallel input and parallel output data. It interleaves the data passing between external memory and the parallel memory. When input data are interleaved, they are stored in the parallel memory group in scrambled order. When output data are interleaved, the data read from the parallel memory are reordered before being output to external memory.

Hardware 33 is the finite state machine controller that controls the memory group and the interleaving switch group. It controls the interleaver through the interleaving control vector produced by running a parallel access instruction, thereby adjusting the order of the input or output vector data. A parallel access instruction can control the interleaving of a single vector being read or written, or the interleaving of a data stream made up of multiple vectors being read or written.

Interleaving of external memory input data:

Figure 3 is a schematic example of interleaving external memory input data in an embodiment of the invention. The parallel memory group consists of 4 random access memories. The input data stream is D0-D15, and 4 data are written in parallel to the parallel memory group at a time. As the data pass through the controllable interleaving switch group, their write positions in the parallel memory are rearranged. For example, the second vector D4 D5 D6 D7 is rearranged to D7 D4 D5 D6 before being written to the parallel memory. The purpose of the rearrangement is to give the parallel computing unit parallel access to the data. In this example the parallel computing unit needs to process D0 D4 D8 D12 in parallel, so the input interleaving switch group places the elements of a vector that must be accessed in parallel in different memories, which makes the accesses conflict-free during parallel computation.
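The input interleaving example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the rearrangement of the second vector alone (D4 D5 D6 D7 becomes D7 D4 D5 D6), and the sketch assumes that every vector r is rotated right by r positions, which reproduces that case. The names `interleave_write` and `read_column` are hypothetical and do not come from the patent.

```python
NUM_BANKS = 4  # the example uses 4 random access memories


def interleave_write(stream, num_banks=NUM_BANKS):
    """Write the stream num_banks elements at a time.

    Assumption: vector (row) r is rotated right by r positions, so the
    element in column c of row r lands in bank (c + r) % num_banks.
    """
    rows = len(stream) // num_banks
    banks = [[None] * rows for _ in range(num_banks)]
    for i, value in enumerate(stream):
        row, col = divmod(i, num_banks)
        banks[(col + row) % num_banks][row] = value
    return banks


def read_column(banks, col):
    """Read one element of column `col` from every row in a single step."""
    num_banks = len(banks)
    rows = len(banks[0])
    accesses = [((col + row) % num_banks, row) for row in range(rows)]
    used_banks = [bank for bank, _ in accesses]
    # Conflict-free means that no bank is addressed twice in the same step.
    if len(set(used_banks)) != len(used_banks):
        raise RuntimeError("bank conflict")
    return [banks[bank][addr] for bank, addr in accesses]


stream = [f"D{i}" for i in range(16)]
banks = interleave_write(stream)
second_vector = [banks[b][1] for b in range(NUM_BANKS)]
column0 = read_column(banks, 0)
print(second_vector)  # ['D7', 'D4', 'D5', 'D6']
print(column0)        # ['D0', 'D4', 'D8', 'D12']
```

Without the rotation, D0, D4, D8 and D12 would all sit in the first memory and would have to be read one after another. With it, each of the four lies in a different memory and can be fetched in the same step.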

In the opposite direction, when output data are interleaved, the results of the parallel computing unit are reordered as they are written into the parallel memory. The reordering allows the local memory group to be accessed without conflict when the data are output in parallel to external memory.

The interleaving switch group in the above example is controlled by the reconfigurable finite state machine controller, whose input is the interleaving control vector produced by running a parallel access instruction. By reconfiguring this finite state machine controller, interleaving control can be generated, according to the configuration, for a single input or output vector, or for each vector in a data stream.
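One way to picture an interleaving control vector is as a permutation applied to the lanes of a vector, with a matching inverse permutation restoring the original order on the way out. The sketch below assumes that representation; the patent does not specify the encoding of the control vector, and `apply_control` and `invert_control` are hypothetical names used only for illustration.

```python
def apply_control(vector, control):
    """Output lane j takes input element control[j] (assumed encoding)."""
    return [vector[src] for src in control]


def invert_control(control):
    """Build the control vector that undoes `control`."""
    inverse = [0] * len(control)
    for dst, src in enumerate(control):
        inverse[src] = dst
    return inverse


# Control vector that turns D4 D5 D6 D7 into D7 D4 D5 D6, as in the example.
write_control = [3, 0, 1, 2]
stored = apply_control(["D4", "D5", "D6", "D7"], write_control)
restored = apply_control(stored, invert_control(write_control))
print(stored)    # ['D7', 'D4', 'D5', 'D6']
print(restored)  # ['D4', 'D5', 'D6', 'D7']
```

For a data stream, a controller of this kind would supply a different control vector for each vector of the stream, for example one rotation amount per row.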

Multi-core networking:

The multiple processors of this multi-core processor and the external memory are interconnected and exchange data through a network-on-chip. Figure 4 is a schematic diagram of an example multi-core processor structure of an embodiment of the invention. It consists of one main processor and N auxiliary processors, and each auxiliary processor contains a parallel computing unit and a parallel memory unit. Data transfer between external memory and local memory can be performed by a DMA controller connected to the network-on-chip. Auxiliary processors can also exchange data with one another through the network-on-chip.

Parallel access:

Parallel access to data is implemented by the parallel memory group, which consists of multiple random access memories. At any one time, an independent data read or write can be performed on each memory. In a parallel write, each element of the input vector corresponds to one memory, so the vector can be written into the memory group in a single operation. In a parallel read, one datum is read from each random access memory, so that a whole vector is read in parallel.
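The behaviour described above can be modelled with a small Python class in which each random access memory is a list and one address per memory is supplied for every access. This is a toy software model under that assumption, not the hardware of the patent; `ParallelMemoryGroup` and its methods are hypothetical names.

```python
class ParallelMemoryGroup:
    """Toy model: one Python list per random access memory."""

    def __init__(self, num_banks, depth):
        self.banks = [[0] * depth for _ in range(num_banks)]

    def write_vector(self, addresses, values):
        """Write values[k] into memory k at addresses[k], as one operation."""
        if not (len(addresses) == len(values) == len(self.banks)):
            raise ValueError("need one address and one value per memory")
        for bank, addr, value in zip(self.banks, addresses, values):
            bank[addr] = value

    def read_vector(self, addresses):
        """Read one element from every memory, each at its own address."""
        if len(addresses) != len(self.banks):
            raise ValueError("need one address per memory")
        return [bank[addr] for bank, addr in zip(self.banks, addresses)]


mem = ParallelMemoryGroup(num_banks=4, depth=8)
mem.write_vector([0, 0, 0, 0], [10, 11, 12, 13])  # same address everywhere
mem.write_vector([5, 2, 7, 1], [20, 21, 22, 23])  # a different address per memory
same_row = mem.read_vector([0, 0, 0, 0])
scattered = mem.read_vector([5, 2, 7, 1])
print(same_row)   # [10, 11, 12, 13]
print(scattered)  # [20, 21, 22, 23]
```

Because every memory receives its own address, a vector does not have to occupy the same row in all memories, which is what makes the interleaved layouts described earlier readable in a single step.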

Parallel computation:

Parallel computation on data is implemented by the parallel computing unit of the auxiliary processor. A data-level parallel processor usually adopts a Single Instruction Multiple Data (SIMD) structure, which computes on multiple input data through a data path with multiple inputs. Parallel computation includes calculations that apply the same operation to multiple inputs to obtain multiple results, such as vector addition, in which the corresponding elements of two vectors are added to produce a result vector. It also includes calculations that reduce multiple inputs to a single result, such as vector summation, whose result is the sum of the elements of the input vector.
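The two kinds of parallel computation can be illustrated with plain Python lists standing in for SIMD lanes. The sketch is an assumption-laden software analogy: a real SIMD data path performs the lane operations simultaneously, and the pairwise reduction used in `vector_sum` is one common way to sum lanes, not necessarily the one used in the patent. Both function names are hypothetical.

```python
def vector_add(a, b):
    """Same operation on every lane: multiple inputs, multiple results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def vector_sum(a):
    """Reduce all lanes to a single result by pairwise addition."""
    lanes = list(a)
    while len(lanes) > 1:
        if len(lanes) % 2:
            lanes.append(0)  # pad odd lengths with the additive identity
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]
    return lanes[0] if lanes else 0


added = vector_add([1, 2, 3, 4], [10, 20, 30, 40])
total = vector_sum([1, 2, 3, 4])
print(added)  # [11, 22, 33, 44]
print(total)  # 10
```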

Corresponding to the above apparatus embodiment, an embodiment of the invention also provides a method for operating a parallel instruction set. In the method, a multicore-parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset. The multicore-parallel digital signal processor comprises one main processor and a plurality of auxiliary processors; each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other; and the control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are each encoded independently and each executed by an independent hardware unit. The method comprises: running the control instruction subset on the main processor; running the parallel access instruction subset on the parallel access units of one or more auxiliary processors; and running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.

The above embodiment of the method for operating a parallel instruction set improves processor efficiency and reduces or hides redundant operations as far as possible, thereby improving the processing performance of the processor.

The embodiments described above further explain the purpose, technical solution and beneficial effects of the invention in detail. It should be understood that the above are merely specific embodiments of the invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A multicore-parallel digital signal processor, characterized in that the multicore-parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset, wherein the multicore-parallel digital signal processor comprises one main processor and a plurality of auxiliary processors; each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other; the control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are each encoded independently and each executed by an independent hardware unit; the main processor is used to run the control instruction subset; the parallel access units of one or more auxiliary processors are used to run the parallel access instruction subset; and the parallel computing units of one or more auxiliary processors are used to run the parallel computing instruction subset.

2. The multicore-parallel digital signal processor according to claim 1, characterized in that

the finite state machine controller of the parallel access unit of one or more auxiliary processors is used to run a single instruction to perform a single data access, or to run a task-level instruction to perform single-instruction-loop or multi-instruction-loop data access operations.

3. The multicore-parallel digital signal processor according to claim 1, characterized in that

the finite state machine controller of the parallel computing unit of one or more auxiliary processors is used to run single-instruction arithmetic and logic operations, or to run a task-level instruction to perform: a single-instruction loop; or a multi-instruction loop; or single-instruction multiple-operation based on parallel data paths; or single-instruction multiple-operation based on parallel data paths and the finite state machine controller; or arithmetic and logic operations of single-instruction multiple-operation based on a tightly coupled accelerator unit.

4. The multicore-parallel digital signal processor according to claim 1, characterized in that the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: running the parallel access instruction subset through a parallel access unit that comprises a local memory group made up of multiple memories and a register group, providing parallel access to multiple channels of data for the parallel computing unit, and exchanging data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.

5. The multicore-parallel digital signal processor according to claim 1, characterized in that the parallel computing units of the one or more auxiliary processors being used to run the parallel computing instruction subset comprises: through data paths of a single instruction multiple data (SIMD) structure in the parallel computing units of one or more auxiliary processors, performing an operation that applies the same, mutually independent operation to multiple channels of data and returns multiple results, or an operation that operates on multiple channels of data to obtain a single result.

6. The multicore-parallel digital signal processor according to claim 1, characterized in that the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: performing parallel or serial data access through the data access channels from the parallel access unit to a direct memory access (DMA) controller and to the parallel computing unit.

7. The multicore-parallel digital signal processor according to claim 1, characterized in that the parallel access units of the one or more auxiliary processors being used to run the parallel access instruction subset comprises: using a multi-way interleaving controller based on an address lookup table in the parallel access unit to perform parallel, conflict-free access to the memory group.

8. The multicore-parallel digital signal processor according to claim 7, characterized in that using the multi-way interleaving controller based on an address lookup table in the parallel access unit to perform parallel, conflict-free access to the memory group comprises: using a first interleaving table to scatter data stored sequentially in main memory and store them in the parallel memories of the auxiliary processor as required by the algorithm to be executed; and using a second interleaving table to restore the out-of-order computation results in the parallel memories of the auxiliary processor to their original order and store them in main memory.

9. The multicore-parallel digital signal processor according to claim 1, characterized in that the multicore-parallel digital signal processor, by executing orthogonal instruction subsets in parallel, performs arithmetic computation concurrently with data access, and, by setting an execution count, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.

10. The multicore-parallel digital signal processor according to claim 1, characterized in that the multicore-parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction multiple data (SIMD) data path; the loop operations are of two kinds: one kind runs function-evaluation tasks, and loops of this kind do not need the local vector memory to supply data; the other kind is loops based on the multiply-accumulate function, and loops of this kind need the local vector memory to supply vector data and a coefficient array.

11. A method for operating a parallel instruction set, characterized in that, in the method, a multicore-parallel digital signal processor runs at least the following three instruction subsets: a control instruction subset, a parallel access instruction subset and a parallel computing instruction subset, wherein the multicore-parallel digital signal processor comprises one main processor and a plurality of auxiliary processors, each auxiliary processor consists of a parallel access unit and a parallel computing unit that are independent of each other, and the control instruction subset, the parallel access instruction subset and the parallel computing instruction subset are each encoded independently and each executed by an independent hardware unit, the method comprising:

running the control instruction subset on the main processor;

running the parallel access instruction subset on the parallel access units of one or more auxiliary processors; and

running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors.

12. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:

through the finite state machine controller of the parallel access unit of one or more auxiliary processors, running a single instruction to perform a single data access, or running a task-level instruction to perform single-instruction-loop or multi-instruction-loop data access operations.

13. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises:

through the finite state machine controller of the parallel computing unit of one or more auxiliary processors, running single-instruction arithmetic and logic operations, or running a task-level instruction to perform: a single-instruction loop; or a multi-instruction loop; or single-instruction multiple-operation based on parallel data paths; or single-instruction multiple-operation based on parallel data paths and the finite state machine controller; or arithmetic and logic operations of single-instruction multiple-operation based on a tightly coupled accelerator unit.

14. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:

running the parallel access instruction subset through a parallel access unit that comprises a local memory group made up of multiple memories and a register group, providing parallel access to multiple channels of data for the parallel computing unit, and exchanging data between the local memory group and the register group, between the local memory group and the parallel computing unit, and between the register group and the parallel computing unit.

15. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel computing instruction subset on the parallel computing units of one or more auxiliary processors comprises:

through data paths of a single instruction multiple data (SIMD) structure in the parallel computing units of one or more auxiliary processors, performing an operation that applies the same, mutually independent operation to multiple channels of data and returns multiple results, or an operation that operates on multiple channels of data to obtain a single result.

16. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:

performing parallel or serial data access through the data access channels from the parallel access unit to a direct memory access (DMA) controller and to the parallel computing unit.

17. The method for operating a parallel instruction set according to claim 11, characterized in that running the parallel access instruction subset on the parallel access units of one or more auxiliary processors comprises:

using a multi-way interleaving controller based on an address lookup table in the parallel access unit to perform parallel, conflict-free access to the memory group.

18. The method for operating a parallel instruction set according to claim 17, characterized in that using the multi-way interleaving controller based on an address lookup table in the parallel access unit to perform parallel, conflict-free access to the memory group comprises:

using a first interleaving table to scatter data stored sequentially in main memory and store them in the parallel memories of the auxiliary processor as required by the algorithm to be executed; and

using a second interleaving table to restore the out-of-order computation results in the parallel memories of the auxiliary processor to their original order and store them in main memory.

19. The method for operating a parallel instruction set according to claim 11, characterized in that the multicore-parallel digital signal processor, by executing orthogonal instruction subsets in parallel, performs arithmetic computation concurrently with data access, and, by setting an execution count, executes the parallel computing instruction subset and the parallel access instruction subset in independent loops or in a combined loop.

20. The method for operating a parallel instruction set according to claim 11, characterized in that the multicore-parallel digital signal processor uses the finite state machine controller of the parallel access unit and the finite state machine controller of the parallel computing unit as loop controllers to control the loop operation of a single instruction multiple data (SIMD) data path; the loop operations are of two kinds: one kind runs function-evaluation tasks, and loops of this kind do not need the local vector memory to supply data; the other kind is loops based on the multiply-accumulate function, and loops of this kind need the local vector memory to supply vector data and a coefficient array.