CN101031904A

CN101031904A - Programmable processor system with two kinds of subprocessor to execute multimedia application

Info

Publication number: CN101031904A
Application number: CN 200580030649
Authority: CN
Inventors: R·阿米特; R·H·小约翰
Original assignee: 3Plus1 Technology Inc
Current assignee: 3Plus1 Technology Inc
Priority date: 2004-07-13
Filing date: 2005-07-12
Publication date: 2007-09-05

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W by a factor of two. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges memory to accommodate execution of applications, allowing for fast operations.

Description

Programmable processor system having two types of sub-processors for executing multimedia applications

Background of the invention

Cross reference to related applications

This application claims the benefit of U.S. Provisional Patent Application No. 60/598,691, filed July 13, 2004, entitled "Quasi-Adiabatic Programmable or COOL Processors Architecture", and of U.S. Provisional Patent Application No. 60/598,417, filed August 2, 2004, entitled "Quasi-Adiabatic Programmable Processors Architecture".

Technical field

The present invention relates generally to the field of processors and, more specifically, to processors that have low power consumption, high performance, and low die area, and that can be used flexibly and scalably in multimedia and communications applications.

Background art

With the advent of popular consumer gadgets such as cellular or mobile phones, digital cameras, iPods, and personal digital assistants (PDAs), many new standards for communicating with these gadgets have been widely adopted by industry. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3, and security. An emerging problem, however, is that a great deal of development is needed to make different gadgets that use different standards communicate with one another. One reason for this problem is that no processor or sub-processor currently available on the market can be easily programmed for use by all digital devices while complying with the various standards in question. It is only a matter of time before the problem worsens, because the new trend in consumer electronics is to accommodate even more of the standards adopted by industry.

One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute enough code to handle multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to stay in the sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Because processors are used so widely in consumer products, they must be inexpensive to manufacture; otherwise their use in the most common consumer electronics is impractical.

To provide specific examples of the problems with current processors, the problems associated with the following are briefly described below: RISC processors used in some consumer products, microprocessors used in others, digital signal processors (DSPs) used in others, application-specific integrated circuits (ASICs) used in still others, and certain other well-known processors, each of which presents its own problems. These problems are outlined together with the advantages of each kind of processor: shortcomings are discussed under "Cons" and advantages under "Pros".

A. RISC/superscalar processors

RISC and superscalar processors are the most widely accepted architectural solutions for all general-purpose computing. They are often enhanced with special-purpose accelerators to solve certain specific problems within the context of a general solution.

Examples include the ARM series, the ARC series, the StrongARM series, and the MIPS series.

Pros:

Wide industry acceptance has led to more mature toolchains and a wide choice of software.

A robust programming model results from very efficient automatic code generators that produce binaries from high-level languages such as C.

Processors in this category are very good general-purpose solutions.

Moore's law can be used effectively to increase performance.

Cons:

The general-purpose nature of the architecture cannot leverage the common or specific characteristics of a set or subset of applications to obtain better price, power, and performance.

They consume moderate to high amounts of power relative to the amount of computation provided.

Performance increases come mainly at the cost of pipeline latency, which adversely affects several multimedia and communications algorithms.

Complex hardware schedulers, advanced control mechanisms, and significantly relaxed restrictions intended to make automatic code generation for general-purpose algorithms more efficient make this class of solutions less area-efficient.

B. Very long instruction word (VLIW) and DSP

VLIW architectures eliminate some of the inefficiencies found in RISC and superscalar architectures to create solutions that are fairly general within the digital signal processing space. Parallelism is significantly increased. Responsibility for scheduling is shifted from hardware to software in order to save area.

Examples include the TI 64xx, TI 55xx, StarCore SC140, and ADI SHARC series.

Pros:

Restricting the solution to the signal processing space improves 3P (price, power, and performance) compared with RISC and superscalar architectures.

VLIW architectures offer a higher level of parallelism than RISC and superscalar architectures.

Efficient toolchains and wide industry acceptance developed fairly quickly.

Automatic code generation and programmability are showing great progress, because many processors designed for signal processing fall into this category.

Cons:

Although the problem-solving scope is narrowed to the digital signal processing space, it is still too broad for a general solution like a VLIW machine to achieve efficient 3P.

Control is expensive and power-hungry, particularly for the basic control code found in many multimedia and communications applications.

Several power- and area-inefficient techniques are used to make automatic code generation easier. The software community's heavy reliance on these techniques has carried this inefficiency forward from generation to generation.

VLIW architectures are not well suited to processing serial code.

C. Reconfigurable computing

Over the past 10 years, several efforts in industry and academia have concentrated on developing a flexible solution with ASIC-like price, power, and performance characteristics. Many of these efforts have challenged existing, mature rules and design paradigms with little industry success. Most attempts have moved toward solutions based on architectures resembling coarse-grained FPGAs.

Pros:

Some designs that are restricted to specific applications, while providing the flexibility needed within those applications, have proven competitive in price, power, and performance.

Research shows that such restricted yet flexible solutions can be created and can address many application hot spots.

Cons:

Several designs in this space do not provide an efficient, easy-to-program solution, and so have not been widely accepted by the community skilled in programming DSPs.

For many designs, automatic code generation from higher-level languages such as C is either practically impossible or extremely inefficient.

The 3P advantage is lost when attempting to build heterogeneous applications using a single type of interconnect and a single level of granularity. Utilization of the provided parallelism suffers severely.

For most designs, the overhead of reconfiguration is very large in terms of 3P.

In many cases the external interface is complex, because the proprietary reconfigurable fabric cannot be matched to industry-standard design methodologies.

Reconfigurable machines are uniprocessors and depend heavily on a tightly integrated RISC, even for handling primitive control.

D. Processor arrays

Some recent approaches focus on making reconfigurable systems better suited to handling heterogeneous applications. Solutions in this direction combine multiple processors, each optimized for one application or a set of applications, to create a processor array architecture.

Pros:

Connecting different processors, each optimized for a different set of applications, through an efficient fabric can help solve many problems.

As performance requirements increase, the scaling model allows multiple processors to be linked together uniformly.

Complex algorithms can be partitioned effectively.

Cons:

Although performance requirements can be fully met, the power and price inefficiencies are still too severe.

The programming model differs from processor to processor, which makes the application developer's job more difficult.

Uniform scaling of multiple processors is very expensive and power-hungry. It has also been shown to introduce some non-determinism that may be harmful to the performance of the overall system.

In the absence of shared memory resources (because shared memory does not scale uniformly), the programming model suffers from the complexity of communicating data, code, and control information at the system level.

The large and repetitive glue logic needed to connect different types of processors to a homogeneous network adds area inefficiency, increases power, and increases latency.

In view of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor that allows one or more multimedia applications to be executed simultaneously.

Summary of the invention

Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W or more bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges bytes when transferring bytes to or from the memory so as to accommodate execution of applications, allowing for fast operations.

Brief description of the drawings

Fig. 1 shows an application 10 with reference to a digital product 12 that includes an embodiment of the present invention.

Fig. 2 shows an exemplary integrated circuit 20, in accordance with an embodiment of the present invention, that includes a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24.

Fig. 3 shows further details of processor 20 in accordance with an embodiment of the present invention.

Fig. 4 shows a high-level block diagram of the blocks or components included in one of the W-type blocks (such as block 74 or 76) in accordance with an embodiment of the present invention.

Fig. 5 shows a block diagram of the circuit blocks included in block 402 in accordance with an embodiment of the present invention.

Fig. 6 shows in greater detail the standard building blocks used for register files and forwarding within the macro-functional units (in particular, in blocks 402, 404, 406, and 408).

Fig. 7 shows, in high-level block diagram form, further details of block 408 in accordance with an embodiment of the present invention.

Fig. 8 shows, in block diagram form, further details of block 404 in accordance with an embodiment of the present invention.

Figs. 9 and 10 show further details of block 404, particularly with respect to performing displacement.

Fig. 11 shows, in high-level block diagram form, further details of the components of block 406 in accordance with an embodiment of the present invention.

Fig. 12 shows a high-level block diagram of the details of block 78 in accordance with an embodiment of the present invention.

Fig. 13 shows, in high-level block diagram form, further details of block 78 in accordance with an embodiment of the present invention.

Fig. 14 shows further details of block 1322 in accordance with an embodiment of the present invention.

Fig. 15 shows, in high-level block diagram form, further details of the circuitry included in block 1324 in accordance with an embodiment of the present invention.

Fig. 16 shows a block diagram of the reduce circuit block 1602 included in block 1520 in accordance with an embodiment of the present invention.

Fig. 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326 in accordance with an embodiment of the present invention.

Fig. 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330 in accordance with an embodiment of the present invention.

Fig. 19 shows, in high-level block diagram form, further details of the circuitry included in block 1332 in accordance with an embodiment of the present invention.

Fig. 20 shows, in high-level block diagram form, further details of the circuitry included in block 1334 in accordance with an embodiment of the present invention.

Fig. 21 shows an example of the programming flow and tools used with processor 22 in accordance with an embodiment of the present invention.

Fig. 22 shows an example of the scalability of an embodiment of the present invention.

Fig. 23 shows a graph of some of the benefits provided by the scalability of the present invention.

Detailed description of the embodiments

Referring now to Fig. 1, an application 10 is shown with reference to a digital product 12 that includes an embodiment of the present invention. Fig. 1 is intended to give the reader a perspective on some, but not necessarily all, of the advantages of a product that includes an embodiment of the present invention over products currently available on the market.

Product 12 is therefore a converging product, in that it incorporates all of the applications that today need to be executed by a mobile phone device 14, a digital camera device 16, a digital recording or music device 18, and a PDA device 20. Product 12, however, can perform one or more of the functions of devices 14-20 simultaneously while using less power.

Product 12 is typically battery-operated and therefore consumes very little power even when executing several of the applications executed by devices 14-20. It can also execute code to carry out operations that comply with multiple applications, including, but not limited to, H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3, and security.

Fig. 2 shows an exemplary integrated circuit 20, in accordance with an embodiment of the present invention, that includes a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24. Fig. 2 also shows processor 22 coupled to interface circuit 26 through general-purpose bus 30, coupled to interface circuit 28 through general-purpose bus 31, and further coupled to general-purpose processor 32 through bus 30 and bus 31. Circuit 20 is also shown to include clock, reset, and power management 34, which generates clock and reset signals and manages power; the clock is used by the remaining circuitry of circuit 10, and the reset signal is used in the same manner. Circuit 20 also includes a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips.

Interface circuit 26, shown coupled to bus 30, and interface circuit 28, shown coupled to bus 31, include blocks 40-66, which are generally known to those skilled in the art and are used by current processors.

Processor 22, shown as a heterogeneous multiprocessor, includes shared data memory 70, shared data memory 72, CoolW sub-processor (or block) 74, CoolW sub-processor (or block) 76, CoolN sub-processor (or block) 78, and CoolN sub-processor (or block) 80. Each of blocks 74-80 is associated with an instruction memory: CoolW block 74 with instruction memory 82, CoolW block 76 with instruction memory 84, CoolN block 78 with instruction memory 86, and CoolN block 80 with instruction memory 88. Similarly, each of blocks 74-80 is associated with a control block: block 74 with control block 90, block 76 with control block 92, block 78 with control block 94, and block 80 with control block 96. Blocks 74 and 76 are designed to operate efficiently for 16-, 24-, 32-, and 64-bit operations or applications, and blocks 78 and 80 are designed to operate efficiently for 1-, 4-, or 8-bit operations or applications.

Blocks 74-80 are essentially sub-processors: CoolW blocks 74 and 76 are wide (or W) type blocks, and CoolN blocks 78 and 80 are narrow (or N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, which gives processor 22 its heterogeneous character. In addition, circuit 24 is coupled directly to one of the sub-processors, i.e., one of blocks 74-80, resulting in the lowest-latency path to the sub-processor to which it is coupled. In Fig. 2, circuit 24 is shown directly coupled to block 76, although it may be coupled to any of blocks 74, 78, or 80. Higher-priority agents or tasks may be assigned to the block that is directly coupled to circuit 24.

It should be noted that although four blocks 74-80 are shown, other numbers of blocks may be used; using additional blocks, however, clearly results in additional die area and higher manufacturing cost.

Complex applications requiring substantial processing power are not scattered across circuit 20; rather, they are aggregated in, or confined to, a particular sub-processor or block for processing. This eliminates, or at least reduces, wire (metal) or routing lengths, thereby reducing wire capacitance and substantially improving power consumption. In addition, utilization is increased and activity is reduced, which contributes to lower power consumption.

Circuit 20 is an example of a silicon on chip (SoC) offering quasi-adiabatic programmable sub-processors for multimedia and communications applications. It includes two types of sub-processor, as previously indicated: W-type and N-type. W-type, or wide-type, processors are designed for high power, price, and performance efficiency in applications requiring 16-, 24-, 32-, and 64-bit processing. N-type, or narrow-type, processors are designed for high efficiency in applications requiring 8-, 4-, and 1-bit processing. Although these bit widths are used in the figures and description of embodiments of the present invention, other bit widths can readily be employed.
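By way of illustration only, the division of work between the two sub-processor types can be modeled as a lookup on operand width. The following Python sketch is not part of the disclosed embodiment; the names `W_TYPE_WIDTHS`, `N_TYPE_WIDTHS`, and `sub_processor_type` are hypothetical, and only the bit widths are taken from the description above.

```python
# Hypothetical model of the W-type / N-type split; names are illustrative only.
W_TYPE_WIDTHS = {16, 24, 32, 64}  # widths the description assigns to W-type blocks
N_TYPE_WIDTHS = {1, 4, 8}         # widths the description assigns to N-type blocks


def sub_processor_type(bit_width):
    """Return "W" or "N" for the sub-processor type suited to a given width."""
    if bit_width in W_TYPE_WIDTHS:
        return "W"
    if bit_width in N_TYPE_WIDTHS:
        return "N"
    # The description notes that other widths may be used; this sketch
    # simply rejects widths it has not been told about.
    raise ValueError("no sub-processor type defined for width %d" % bit_width)
```

In this sketch a 32-bit kernel would be routed to a W-type block and a 4-bit kernel to an N-type block.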

Different applications require different performance or processing power and are therefore executed by different types of blocks or sub-processors. Applications typically executed by DSPs, for example, are generally handled by W-type sub-processors, such as block 74 or 76 of Fig. 2, because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to: fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filters, IIR filters, resistor-capacitor root raised cosine (RRC) filters, color space converters, 3D bilinear texture, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filters, alpha blending, higher-order surface tessellation, vertex shade (Trans/Light), triangle setup, full-screen anti-aliasing, and quantization.

Other commonly occurring DSP kernels can be executed by N-type sub-processors, such as blocks 78 and 80. These include, but are not limited to: variable-length codecs, Viterbi codecs, turbo codecs, cyclic redundancy check (CRC), Walsh code generators, interleavers/de-interleavers, LFSRs, scramblers, de-spreaders, convolutional encoders, Reed-Solomon codecs, scrambling code generators, and puncturing/de-puncturing.
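Several of the kernels listed above operate on single bits, which is the kind of work the description assigns to N-type blocks. As an illustration only, a linear feedback shift register (LFSR) can be sketched in Python as follows. The register width, tap positions, and function name are assumptions chosen for the example and are not values from the disclosure.

```python
def lfsr_bits(state, taps, width, count):
    """Generate `count` output bits from a Fibonacci LFSR.

    state: initial register contents (must be non-zero)
    taps:  bit positions XORed together to form the feedback bit
    width: register width in bits
    """
    mask = (1 << width) - 1
    out = []
    for _ in range(count):
        out.append(state & 1)                 # output the least significant bit
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1      # XOR of the tapped bits
        # Shift right by one and insert the feedback bit at the top.
        state = ((state >> 1) | (feedback << (width - 1))) & mask
    return out


# Assumed example: a 4-bit register with taps at bits 0 and 1.
bits = lfsr_bits(state=0b0001, taps=(0, 1), width=4, count=30)
```

Each step touches only a handful of single-bit values, so a wide data path would be mostly idle on this kind of kernel.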

Compared with approaches based on existing architectures, such as RISC, reconfigurable, superscalar, VLIW, and multiprocessor approaches, both W-type and N-type sub-processors keep the net activity per transition, and hence the resulting energy, low while maintaining high performance with increased utilization. The sub-processor architecture of processor 22 reduces die size, resulting in an optimal processing solution, and embodies a novel architecture referred to as the "quasi-adiabatic" or "COOL" architecture. Programmable processors according to this architecture are referred to as quasi-adiabatic programmable or COOL processors.

Quasi-adiabatic programmable or COOL processors optimize data path, control, memory, and functional unit granularity to match the limited subset of applications described above. The manner in which this is accomplished will become clear from the discussion and figures presented below relating to the different units, blocks, or circuits of processor 22 and their inter-operation.

"Quasi-adiabatic programmable" or concurrent applications of heterogeneous interconnect and functional units (COOL) processors. In thermodynamics, adiabatic processes do not waste heat; they transfer all of the energy used into performing useful work. Because of the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one can never make a fully adiabatic processor. Among the different possible processor architectures, however, some may be closer to adiabatic than others. The various embodiments of the present invention show a class of processor architectures that are significantly closer to adiabatic than prior-art architectures while remaining programmable. They are referred to as "quasi-adiabatic programmable processors".

Integrated circuit 20 allows as many applications to be executed simultaneously as its resources can support, and the number of such applications supported by processor 22 far exceeds the number supported by current processors. Examples of applications that can be executed simultaneously or concurrently by integrated circuit 20 include, but are not limited to, downloading an application from a wireless device while decoding a movie that has been received; thus, a movie can be downloaded and decoded at the same time. Because simultaneous application execution is achieved on integrated circuit 20, which has a small die size or silicon real estate relative to the number of applications it supports, the cost of manufacturing the integrated circuit is substantially lower than the cost of the multiple devices of Fig. 1. In addition, processor 22 gives the user a single programmable framework for implementing multiple functions, such as multimedia convergence applications. Of significant value is the ability of integrated circuit 20 (i.e., processor 22) to support future standards adopted by industry, which are expected to be more complex than today's standards.

Each of blocks 74-80 can execute one sequence (or stream) of a program at a given time. A sequence of a program is a function associated with a particular application; an FFT, for example, is one type of sequence. Different sequences may, however, depend on one another. For example, once an FFT program completes, its result can be stored in memory 70, and the next sequence can then use the stored result. Different sequences that share information, or depend on one another, in this manner are referred to as "stream flow".
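The dependency between sequences described above can be sketched, for illustration only, as two functions that communicate through a shared store. In the following Python sketch the dictionary `shared_memory` merely stands in for shared memory 70, a naive discrete Fourier transform stands in for the FFT, and all function and key names are hypothetical.

```python
import cmath

# Stand-in for shared data memory 70; the key name is an assumption.
shared_memory = {}


def fft_sequence(samples):
    """First sequence: compute a spectrum and leave it in shared memory.

    A naive O(n^2) DFT is used here in place of a real FFT for brevity.
    """
    n = len(samples)
    shared_memory["spectrum"] = [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def peak_sequence():
    """Second sequence: consume the stored spectrum and find its largest bin."""
    magnitudes = [abs(x) for x in shared_memory["spectrum"]]
    return magnitudes.index(max(magnitudes))


# The second sequence depends on the result stored by the first.
fft_sequence([1.0] * 8)
strongest_bin = peak_sequence()
```

The only coupling between the two sequences is the stored result, which is the relationship the description calls "stream flow".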

In Fig. 2, memories 70 and 72 each include eight blocks of 16-kilobyte memory; in other embodiments, however, memory of a different size may be used.

Instruction memories 82, 84, 86, and 88 store the instructions executed by blocks 74-80, respectively.

Fig. 3 shows further details of processor 20 in accordance with an embodiment of the present invention. In Fig. 3, processor 20 is shown to include sub-processors 74-80, each of which includes an instruction cache 302-308, respectively, for storing the instructions processed by that sub-processor. Processor 20 is also shown to include arbitration block 310, data memory 312, general-purpose input/output (GPIO) block 314, shared SoC bus block 316, radio frequency interface with DMA block 318, DMA controller block 320, and memory controller block 322, coupled in the manner shown in Fig. 3. Data memory 312 serves as a store for data information and is used by the sub-processors and other blocks under the direction of arbitration block 310, which directs the operation and data traffic of the various components and blocks shown in Fig. 3. Block 314 coordinates input and output traffic to and from processor 22, block 320 controls the DMA operations performed by processor 22 through bus 316, block 322 controls operations related to memory 312 through bus 316, and block 318 includes circuitry that controls DMA operations and can receive and transmit RF signals coupled through signal 324.

Optionally, shared registers 326 and 328 provide direct communication between the two types of sub-processor. In Fig. 3, for example, register 326 is shown coupled to blocks 74 and 78 to facilitate the storage of information shared by those blocks, which makes it easier to use more than one sub-processor to execute an application and thereby speeds up its execution. Similarly, register 328 is shown coupled to blocks 80 and 76 and serves the same purpose as register 326.

Fig. 4 shows a high-level block diagram of the blocks or components included in one of the W-type blocks (such as block 74 or 76) in accordance with an embodiment of the present invention. Block 74 is used as an example in Fig. 4. In Fig. 4 and throughout this document, functional units or macro-blocks have a very specific interconnect structure among components such as adders, multipliers, registers, and multiplexers. These macro-blocks are referred to as "macro-functional units" or "MFUs". An MFU represents an efficient, programmable subset of a limited set of one or more operations that commonly occur in multimedia and communications applications. The high efficiency of the macro-functional units results from replacing the critical set of atomic operations found in the target applications with a derived set of operations that exhibits excellent performance and power characteristics. In some cases, commonly occurring operations are combined in unique ways so that hardware is reused efficiently.

In Fig. 4, block 74 is shown to include load/store MFU block 402, scalar arithmetic logic unit (ALU) and multiply-accumulate (ACC) MFU block 406, vector X MFU block 404, vector ALU and multiply ACC MFU block 408, and local memory 410, coupled together in the manner shown in Fig. 4. Block 402 generates memory addresses, which are coupled onto memory address bus 412. Memory data is coupled onto memory data bus 414, which is bidirectionally coupled to blocks 404 and 406. A vector store mask is generated by block 404 and coupled onto vector store mask bus 416. Further details of each block are presented and discussed with respect to subsequent figures. Before that presentation and discussion, some general properties of block 74 and its blocks are discussed below.

Blocks 406 and 408 perform most of the actual computation on data. Load/store MFU block 402 computes the addresses used to access memory 312 and memory 410. Vector X MFU block 404 rearranges vector data en route between memory 312 and block 408. Vector X MFU block 404 is also used to generate vector store masks for vector stores to memory 312. Block 406 operates on only a single piece of data at a given time, whereas blocks 404 and 408 operate on data in vector form. Block 402 provides the addresses for memory accesses. Some computation is performed by block 402, but it is in the nature of overhead computation.

Apart from operations that move data between MFU blocks, a machine instruction encodes separate operations (if needed) for the various MFU blocks. All of the operations in a single instruction execute in parallel. Under the control of the separately encoded operations in an instruction, vector X MFU block 404 rearranges vector data and generates vector store masks. Local memory 410 is used to store information locally, avoiding the need to access information outside block 74 for every instruction. Bus 412, over which memory addresses are provided, is coupled to memory 312.

Block 402 is shown coupled to block 404 through bus 424, to block 406 through bus 426, and to block 410 through bus 428. Blocks 404, 408, and 410 are shown coupled to one another through vector bus 420, and blocks 406, 404, 408, and 410 are shown coupled to one another through scalar bus 422. A bus is typically a set of wires, each carrying one signal, with the wires running parallel to one another so that the signals are coupled in parallel. The number of wires in a bus defines a number of binary bits, which characterizes the bus. In Fig. 4, vector bus 420 is wider than scalar bus 422; that is, bus 420 includes more bits or wires than bus 422 and can couple more signals in parallel. An example ratio of the number of bits of bus 420 to the number of bits of bus 422 is four, as in the case where bus 422 is 32 bits wide and bus 420 is 4 times 32 bits, or 128 bits, wide.
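The 4-to-1 width ratio in the example above can be illustrated, without limitation, by packing four 32-bit scalar values into a single 128-bit vector word. The following Python sketch is a software analogy only; the lane ordering (lane 0 in the least significant bits) and the function names are assumptions, not features of buses 420 and 422.

```python
SCALAR_BITS = 32                      # example width of the scalar bus
LANES = 4                             # example ratio of vector to scalar width
VECTOR_BITS = SCALAR_BITS * LANES     # 4 x 32 = 128 bits
LANE_MASK = (1 << SCALAR_BITS) - 1


def pack_lanes(lanes):
    """Pack LANES scalar values into one vector word (lane 0 lowest, by assumption)."""
    if len(lanes) != LANES:
        raise ValueError("expected %d lanes" % LANES)
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (i * SCALAR_BITS)
    return word


def unpack_lanes(word):
    """Split one vector word back into LANES scalar values."""
    return [(word >> (i * SCALAR_BITS)) & LANE_MASK for i in range(LANES)]
```

One transfer of the packed word moves the same data that would take four transfers at the scalar width.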

Block 404 also provides the vector store mask, which is coupled onto bus 416.

Memory data is coupled from block 402 to block 406 for use in computation, but vector data is first provided to block 404. Importantly, block 404 provides the ability to match the organization of data in memory to the organization required in the computation unit (namely block 408), thereby greatly improving performance.

Fig. 5 is a block diagram of the circuit blocks included in block 402 according to an embodiment of the invention. Block 402 is shown to include address block 502, circular buffer register block 504, address generator block 508, address generator block 506, multiplexer (mux) 510 and multiplexer 512, coupled together as shown in Fig. 5.

Block 502 is coupled to the other blocks of block 402 as shown in Fig. 4, and block 502 stores addresses. Block 504 stores a circular buffer range in one of the circular buffer registers (block 504). When requested by the program, blocks 506 and 508 cause address computations to wrap around within the circular buffer range. The arrows pointing into block 504 allow these registers to be loaded. That is, block 506 modifies an address generated by block 504, an address received from block 406, or even an address generated by block 502, and block 508 modifies an address received from block 502 and/or block 406, or even from block 504.

The address registers of block 402 and the circular buffer registers of block 404 provide inputs to the address generators of blocks 506 and 508. In the case of the address registers of block 402, these inputs are previously stored addresses; in the case of the circular buffer registers of block 404, these inputs are information about the circular buffer.

Blocks 506 and 508 are used to modify addresses. That is, block 506 modifies an address generated by block 504, an address received from block 406, or even an address generated by block 502, and block 508 modifies an address received from block 502 and/or block 406, or even from block 504. The output of block 506 is then provided as an input to multiplexer 512, which also receives as an input the address generated by block 502. Multiplexer 512 then selects one of its inputs and couples it onto bus 520 to be received by the other blocks of block 74, as shown in Fig. 4. Similarly, the output of block 508 is provided as an input to multiplexer 510, which also receives as an input the address generated by block 502. Multiplexer 510 then selects one of its inputs and couples it onto bus 522 to be received by the memory of block 74, as shown in Fig. 4.

The load/store MFU can therefore generate two addresses in parallel. An address is computed by combining an address register with a constant or with a value from the scalar ALU MFU. The computed address can optionally wrap around within the bounds of a circular buffer. A computed address is intended primarily for accessing memory, but it can also be assigned to an address register or a circular buffer register, or used as an input to another MFU.
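The address computation described above can be illustrated with a minimal Python sketch. The function name `next_address`, the representation of a circular buffer as a base address plus a length, and the example addresses are assumptions made for illustration; the description does not specify how the circular buffer range is encoded.

```python
def next_address(addr_reg, offset, buf_base=None, buf_len=None):
    """Combine an address register with an offset (a constant or a scalar value).

    If a circular buffer range is given (hypothetical base/length encoding),
    the result wraps around inside [buf_base, buf_base + buf_len).
    """
    addr = addr_reg + offset
    if buf_base is not None and buf_len:
        # Python's % always returns a non-negative result, so negative
        # offsets wrap to the end of the buffer as well.
        addr = buf_base + (addr - buf_base) % buf_len
    return addr


# Two independent computations, mirroring the two address generators.
load_addr = next_address(0x1000, 4)
store_addr = next_address(0x20F8, 16, buf_base=0x2000, buf_len=0x100)
```

In this sketch the second address would overshoot the end of the buffer and is wrapped back to its start, which is the behavior the wrap-around option is meant to provide.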

Fig. 6 shows in greater detail the standard components used for register files and forwarding inside the macro functional units (specifically in blocks 402, 404, 406 and 408). Fig. 6 shows a plurality of registers 602, a plurality of multiplexers 604, a crossbar 606, a register block 608, a plurality of staging registers 610, a plurality of functional units 612 and a plurality of multiplexers 614, according to an embodiment of the invention. Registers 602 are shown coupled to multiplexers 604, which in turn are shown coupled to crossbar 606. Crossbar 606 is shown coupled to registers 610, which in turn are shown coupled to functional units 612, which are shown coupled to multiplexers 614. In general, a multiplexer selects among the inputs provided to it and outputs the selected input. The output of crossbar 606 is also provided to the other blocks of Fig. 4. Although Fig. 6 shows a specific number of units, multiplexers and/or registers, other numbers of these components may be used.

The components of Fig. 6 are coupled together as shown in the figure. Multiplexers 604 are shown receiving additional inputs from the other blocks of Fig. 4, at least two such inputs, and receiving the outputs of multiplexers 614.

The registers and feedback paths (couplings) of Fig. 6 provide a unique organization that optimizes the trade-off among area, energy and performance. This organization has three main features:

A register file that is visible to assembly language and has more than a few registers is divided into two subgroups: a few registers are implemented with full accessibility, and the remaining registers are implemented with more limited access. In most cases, only four registers (numbered 0 to 3) support full accessibility. For a machine operation involving such a register file, any and all of the fully accessible registers can be selected simultaneously as sources and destinations of operations. In contrast, the registers with limited access share only a small number of read and write ports among them: at most two read ports and one write port. This arrangement provides most of the benefits of a register file with a large number of read and write ports, without requiring more than one or two read/write ports for most of the registers in the group.

At the input of each functional unit is a "staging register". Before a functional unit is used in a clock cycle, its input staging registers must have been set to the appropriate input values at the end of the previous clock cycle. Functional units that cannot be used at the same time can be grouped together to share the same staging registers, which reduces the total number of registers. If none of the functional units sharing a staging register is needed in a clock cycle, the register holds its previous value, which eliminates switching power consumption in those functional units during that cycle.

Forwarding between functional units is implemented in two stages. In the first stage, the next values of the fully accessible registers are selected by multiplexers, together with the one or more values (if any) to be written to the limited-access registers. In the second stage, the next values of the fully accessible registers, together with the values from the read ports of the limited-access registers, are fed to the crossbar, which selects the values to be written to the staging registers at the end of the clock cycle (and thus used by the functional units in the next clock cycle). At the possible cost of the added delay of passing through two multiplexing stages rather than one, this organization minimizes the number of inputs to the crossbar, which strongly affects its size.

Forwarding between the write and read ports of the limited-access registers may or may not be implemented. If forwarding is not implemented here, an additional cycle of latency occurs between an operation that writes one of these registers and a subsequent operation that reads that register.

Fig. 7 shows further details of block 408 in high-level block diagram form according to an embodiment of the invention. In Fig. 7, vector register block 702 is shown coupled to N ALU block 704, vector element shifter block 706, vector element selector block 708, 2N and N bit converter block 710, N ALU block 712 and 2N multiplier block 714. In Fig. 7, block 408 is also shown to include vector register block 716, which is coupled to N adder block 718, N shifter block 720, vector summation block 722, N 3-input adder block 724, 2N and N bit converter 726, multiplexer 723 and multiplexer 732. The blocks and multiplexers of Fig. 7 are coupled together as shown in Fig. 7. Block 702 is coupled to the other blocks of Fig. 4 and to blocks 704-714. Block 716 is shown receiving inputs from block 406 and from the outputs of multiplexer 732, block 710, block 714 and block 724. Block 702 is shown coupled to multiplexer 704, which is also coupled to blocks 712 and 726. In general, the circuits or blocks of Fig. 7 operate in parallel on vector-type values, such as N values of M bits each, where M is an integer number of bits.

Multiplexer 732 receives as inputs the outputs generated by blocks 718 and 720, and multiplexer 730 receives the inputs generated by blocks 704 and 706 and generates an output that is received by block 702.

The outputs of blocks 708 and 722 are provided to block 406. As used here, N is an integer value; for example, "N ALU" denotes N ALU circuits.

Blocks 702-714 and multiplexer 730 generally perform the multiply-accumulate (MAC) function, and blocks 716-726 and multiplexer 732 perform the ALU function; however, the number of bits on which such MAC and ALU functions are performed in parallel is typically N times greater than the number of bits processed by block 406.

Blocks 704 and 712 are segmentable; that is, they can selectively segment the addition operation. For example, when processing N 32-bit values in parallel, each ALU block can perform, in addition to N 32-bit additions, 2N 16-bit additions or 4N 8-bit additions. Block 714 operates in the same manner as block 1110 of Figure 11, which will be described shortly. Blocks 710 and 726 operate to convert N 32-bit values to N 40-bit values, or 2N 16-bit values to 2N 40-bit values. In one example, 32-bit values are converted to 40-bit values, and in another example, 16-bit values are converted to 40-bit values, thus providing a bit-conversion capability.
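Segmented addition can be illustrated for a single 32-bit lane with the following Python sketch. The function name `segmented_add` is hypothetical, and the wrap-around (non-saturating) behavior of each segment is an assumption; the description does not state how overflow within a segment is handled.

```python
def segmented_add(a, b, seg_bits, width=32):
    """Add two width-bit words as independent seg_bits-wide segments."""
    assert width % seg_bits == 0
    mask = (1 << seg_bits) - 1
    result = 0
    for shift in range(0, width, seg_bits):
        # Each segment is added on its own, so no carry crosses a
        # segment boundary (assumed wrap-around on overflow).
        s = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (s & mask) << shift
    return result


one_32 = segmented_add(0x0000FFFF, 0x00000001, 32)   # one 32-bit addition
two_16 = segmented_add(0x0000FFFF, 0x00000001, 16)   # two 16-bit additions
four_8 = segmented_add(0x0000FFFF, 0x00000001, 8)    # four 8-bit additions
```

The same pair of input words gives three different results, because the carry out of the low byte or low half-word is either propagated or discarded depending on the chosen segmentation.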

Block 706 shifts a vector value, i.e., N M-bit values, to the right or to the left by an integer value. An example of a vector shift is to take a vector such as

<a0, a1, a2, a3, a4, a5, a6, a7>

which in this example has 8 values, and to return the vector

<a1, a2, a3, a4, a5, a6, a7, 0>

or

<0, 0, 0, a0, a1, a2, a3, a4>

This operation is not usually interpreted as a multiplication or division of any kind. Block 708 allows individual elements of a vector value to be selected; for example, a particular byte (8 bits) can be selected from the vector value.
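The two example results above can be reproduced with a short Python sketch. The function name `shift_elements` and the sign convention for the shift count are assumptions chosen for illustration only.

```python
def shift_elements(vec, count, fill=0):
    """Shift whole elements of a vector, filling vacated positions with `fill`.

    A positive count moves elements toward index 0; a negative count moves
    them toward the end (hypothetical sign convention).
    """
    n = len(vec)
    if count >= 0:
        return vec[count:] + [fill] * min(count, n)
    return [fill] * min(-count, n) + vec[:count]


v = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
by_one = shift_elements(v, 1)
by_three_back = shift_elements(v, -3)
```

Elements are moved as units and are never altered, which is why the operation is not a multiplication or division of the values themselves.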

Block 720 operates in the same manner as block 706, and block 726 operates in the same manner as block 710.

The outputs of blocks 712 and 726 are selectively provided to block 702 by multiplexer 704, and the outputs of blocks 706 and 704 are selectively provided to block 702 by multiplexer 730. In addition, the outputs of blocks 720 and 718 are selectively provided to block 716 by multiplexer 732.

Block 722 performs a vector-wise addition, whereas the other blocks of block 408 operate element-wise. That is, block 722 adds together all the elements of a single vector, whereas the element-wise blocks operate on one or more selected, corresponding elements of different vectors.
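The distinction between the two kinds of operation can be shown with a small Python sketch. Both function names are hypothetical, and plain Python integers stand in for the M-bit elements.

```python
def elementwise_add(x, y):
    """Element-wise operation: combines corresponding elements of two vectors."""
    return [a + b for a, b in zip(x, y)]


def vector_sum(x):
    """Vector-wise operation: adds together all elements of a single vector."""
    return sum(x)


pairwise = elementwise_add([1, 2, 3, 4], [10, 20, 30, 40])
total = vector_sum([1, 2, 3, 4])
```

The element-wise result is again a vector of the same length, whereas the vector-wise sum reduces a vector to a single value.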

Blocks 710 and 726 each selectively allow conversion from N or 2N values. Fig. 8 also shows that the output of block 804 is fed back to the input of block 802.

Fig. 8 shows further details of block 404 in block diagram form according to an embodiment of the invention. In Fig. 8, block 404 is shown to include mask control register block 802, mask generator block 804, mask register block 806, vector register block 808 and vector byte+mask permutation block 810, coupled together as shown in Fig. 8.

Block 802 is shown receiving inputs from the other blocks of Fig. 4 and generating an input to block 804, and block 804 is shown coupled to block 806. Block 806 is shown coupled to block 801 and also to the other blocks of Fig. 4 and to memory 312. Block 808 is shown coupled to memory 312 and to the other blocks of Fig. 4. Block 810 is shown coupled to receive inputs from blocks 806 and 808.

In one example, block 404 has a register file of N*32-bit vector registers, block 808, where N is the same as for block 408. Block 806 of block 404 comprises mask registers of N*4 bits each. Each bit of a mask register corresponds to one byte of a vector register. When an N*32-bit vector is stored to the external shared memory, an N*4-bit mask can be provided to indicate which bytes of the vector are actually written to memory. (Memory bytes corresponding to zero bits in the mask remain unchanged.) The mask generator function computes a 4*N-bit mask from the settings of the mask control register.
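A masked vector store of this kind can be sketched in Python as follows. The function name `masked_store`, the use of a `bytearray` as memory, and the assumption that mask bit i corresponds to byte i of the vector are illustrative choices; the description only states that there is one mask bit per vector byte.

```python
def masked_store(memory, addr, vector_bytes, mask):
    """Write only those vector bytes whose mask bit is 1.

    Bytes whose mask bit is 0 leave the corresponding memory byte unchanged.
    Bit i of `mask` is assumed to correspond to vector_bytes[i].
    """
    for i, byte in enumerate(vector_bytes):
        if (mask >> i) & 1:
            memory[addr + i] = byte


# N = 4: a 4*32-bit vector is 16 bytes, so the mask has 4*4 = 16 bits.
memory = bytearray([0xEE] * 32)
vector = list(range(16))
masked_store(memory, 8, vector, 0b0000000000001111)
```

Here only the first four bytes of the vector reach memory; the other twelve memory bytes keep their previous contents.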

Block 404 can permute the 8*N bytes of two vector registers to select 4*N bytes. In the general case, the specific permutation is controlled by the value of a third vector register. Certain "pre-coded" permutations do not require a control vector; these include all funnel left shifts and right shifts of two input vector registers. When the 8*N bytes of two vector registers are permuted, the 8*N bits of two mask registers can be permuted identically, so as to preserve the bit-for-byte correspondence between mask and data values.
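The following Python sketch illustrates a two-input byte permutation with an identically permuted mask. Representing the control vector as a list of source indices into the concatenated inputs is an assumption, as are the names `permute_bytes` and `funnel_shift_control`; the actual control encoding is described with reference to Figs. 9 and 10.

```python
def permute_bytes(a, b, mask_a, mask_b, control):
    """Select len(control) bytes from the concatenation of a and b.

    The mask bits are permuted with the same control so that each output
    byte keeps its own mask bit (bit-for-byte correspondence).
    """
    src = list(a) + list(b)
    src_mask = list(mask_a) + list(mask_b)
    out = [src[i] for i in control]
    out_mask = [src_mask[i] for i in control]
    return out, out_mask


def funnel_shift_control(n_bytes, shift):
    """Index list for a "pre-coded" funnel shift by `shift` byte positions."""
    return [i + shift for i in range(n_bytes)]


data, mask = permute_bytes(
    [0, 1, 2, 3], [4, 5, 6, 7],      # two input vectors (4 bytes each here)
    [1, 0, 1, 1], [0, 1, 1, 1],      # one mask bit per byte
    funnel_shift_control(4, 1),
)
```

A funnel shift is simply the special case in which the selected indices are consecutive, so no program-supplied control vector is needed for it.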

The blocks of Fig. 8 operate on vector values. Block 810 allows vector values to be rearranged, as briefly described earlier. This is done using permutations, which are further described with reference to Figs. 9 and 10. Block 810 provides information about which permutation is desired. Similarly, the permuted masks from blocks 804 and 806 indicate which permuted mask is to be provided. In general, there is one mask bit for each byte to be stored.

Blocks 802, 804, 806, 808 and 810 of Fig. 8 provide the ability to rearrange addresses in memory to suit the particular application being executed. In the prior art, rearrangement is typically performed automatically; in embodiments of the present invention, however, the programmer can perform rearrangement programmably, as needed, according to the program or code. This allows a nearly unlimited set of rearrangements according to the programmer's needs, which the prior art does not provide at all; that is, in the prior art the rearrangement capability is predetermined and comprises a predetermined set of possible rearrangements. Masks are therefore generated according to the program being executed, which provides further flexibility in rearranging addresses in memory.

SIMD is an abbreviation for Single Instruction, Multiple Data, and MIMD is an abbreviation for Multiple Instruction, Multiple Data. These are standard terms in computer architecture and programming and are well known to those skilled in the art.

Figs. 9 and 10 show further details of the permutation circuit of block <number>, where <number> is the number of the "vector byte+mask permutation" box. Block 404 has a functional unit that permutes two vectors to generate a permuted result vector, as shown in Figs. 9 and 10. The circuit that performs the permutation can be described in general terms as taking two input vectors A and B, each of N elements, and generating an output vector Z, also of N elements, where an element is any arbitrary but uniform number of bits, and where N is required to be a power of 2. Let K be the base-2 logarithm of N. The permutation circuit has K+1 stages, each with N switch boxes of a particular type, as shown. There are three types of switch box in total, called "type A", "type B" and "type C". Type A switch boxes are used only in the first stage; type C switch boxes are used only in the last stage; all intermediate stages use only type B switch boxes. The connections supported by each type of switch box are shown separately. Between the switch boxes of each pair of adjacent stages is a butterfly exchange, starting with an exchange at distance 1 and progressing to an exchange at distance N/2. The switch box settings are all determined independently by a "control vector", which is the third input to the permutation circuit. Because the setting of each type A and type C switch box requires only a single bit and the setting of each type B switch box requires exactly two bits, the complete control vector requires 2*K*N bits. The control vector can be entirely implied by the permutation instruction being executed, or it can be partly or wholly supplied by the program in some manner.
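The control vector size stated above follows from counting the switch boxes stage by stage, as the following Python sketch shows. The function names are hypothetical, and the sketch only reproduces the arithmetic of the description; it does not model the switch boxes themselves.

```python
def control_vector_bits(n):
    """Count control bits for a permutation network on N-element vectors."""
    k = n.bit_length() - 1            # K = log2(N)
    assert n >= 2 and (1 << k) == n, "N must be a power of 2"
    stages = k + 1
    type_a_bits = n * 1               # first stage: N boxes, 1 bit each
    type_b_bits = (stages - 2) * n * 2  # K-1 middle stages: N boxes, 2 bits each
    type_c_bits = n * 1               # last stage: N boxes, 1 bit each
    return type_a_bits + type_b_bits + type_c_bits


def butterfly_distances(n):
    """Exchange distances between adjacent stages: 1, 2, ..., N/2."""
    k = n.bit_length() - 1
    return [1 << s for s in range(k)]


bits_for_16 = control_vector_bits(16)
distances_for_16 = butterfly_distances(16)
```

The total is N + 2N(K-1) + N = 2*K*N, and the K exchange distances correspond to the K gaps between the K+1 stages.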

Figure 11 shows further details of the components of block 406 in block diagram form according to an embodiment of the invention. In Figure 11, register block 1102 is shown coupled to ALU block 1104, bit converter block 1106, ALU block 1108 and multiplier block 1110. Block 406 is also shown to include register block 1112, shifter block 1114, adder block 1116 and bit converter block 1118. Figure 11 also shows multiplexers 1122, 1120 and 1124. The multiplexers and blocks of Figure 11 are coupled together as shown in the figure.

Block 1102 is shown coupled to memory 312 and the other blocks of Fig. 4, and receives inputs from multiplexer 1122 and multiplexer 1120. Shifter block 1114 provides one of the inputs of multiplexer 1122, and block 1104 provides the other. Multiplexer 1120 receives its inputs from blocks 1118 and 1108. Block 1114 is also shown coupled to block 1102, and multiplexer 1124 is shown receiving inputs from blocks 1112 and 1102 and generating an output that is provided to block 1114.

Block 1112 is shown coupled to block 1116, and block 1116 generates an output that is provided as an input to block 1112. Block 1118 is shown coupled to block 1112, and blocks 1106 and 1110 are shown coupled to block 1112.

Blocks 1102, 1104, 1106, 1108 and 1110 and multiplexer 1122 cause the ALU function to be performed, and blocks 1112-1118 and multiplexer 1124 cause the multiply-accumulate (MAC) function to be performed.

Blocks 1104 and 1108 are ALUs and perform ALU functions; their outputs are selectively provided to block 1102 as inputs (or feedback) by multiplexers 1122 and 1120. Two ALU operations can be performed in each clock cycle. Block 1110 performs a multiply function and produces an output that is provided to block 1112, which can process a greater number of bits in parallel than block 1102. For example, where block 1102 has a 32-bit capacity, block 1112 has a 40-bit capacity. Block 1112 serves as an accumulator register; that is, it adds its inputs cumulatively.

Block 1106 converts an N-bit value to an N+X-bit value, where X is an integer value. For example, a 32-bit value can be converted to a 40-bit value. Block 1114 shifts a value by a predetermined number of bits and sends the result to block 1102 through multiplexer 1122.

Block 1118 converts from a higher number of bits to a lower number of bits, such as from 40 bits to 32 bits. This block is coupled to block 408. Block 406 can perform two ALU operations in parallel on values from block 1102. In place of the first ALU operation, an N-bit shift can be performed, or a conversion from an N-bit value to an X-bit value to be stored in block 1112. In place of the second ALU operation, a multiplication can be performed by block 1110, with the result stored in one of the registers of block 1112.

Block 406 can perform, in parallel, a 40-bit shift, a 40-bit addition/subtraction, and a conversion from a 40-bit value to a 32-bit value to be stored in one of the 32-bit registers of the scalar ALU MFU.

Further details of one of the N-type sub-processors, such as block 78, are now discussed with reference to the following figures. It should be noted that blocks 406 and 404 of Fig. 4, described in relation to the W-type sub-processor, are common to N-type sub-processors such as block 78.

Figure 12 shows a high-level block diagram of details of block 78 according to an embodiment of the invention. In Figure 12, block 78 is shown to include data path unit (DPU) block 1202, path-to-memory block 1204, and controller, sequencer and data address generator (DAG) block 1206. Blocks 1204 and 1206 are common to, and found in, the blocks of the W-type sub-processor. Block 1206 is generally functionally identical to block 402.

Figure 13 shows further details of block 78 in high-level block diagram form according to an embodiment of the invention. In Figure 13, store unit block 1302 is shown coupled to X unit block 1304, which in turn is shown coupled to load unit block 1306. Block 1304 is generally functionally identical to block 404, which was discussed in detail above.

Block 1306 is also shown coupled to macro block 1340, which in turn is shown coupled to block 1302 by macro functional bus 1310. Block 1302 is shown to include store buffer 1314, store buffer 1312 and bus interconnect block 1308. Block 1302 generates an output that is provided to memory (such as memory 312) and is coupled accordingly through block 1314. Block 1304 is shown receiving input from, or coupled to, memory, such as memory 312. Block 1306 is shown to include load buffer 1320, load buffer 1318 and bus interconnect block 1316, and block 1316 is coupled to block 1340.

Block 1340 is shown to include Galois field MAC block 1322, special ALU block 1324, assembler block 1326, memory 1328, puncture/de-puncture block 1330, interleaver block 1332 and Viterbi block 1334, each shown coupled to bus 1310. Blocks 1322-1332 are each shown receiving input from, or coupled to, block 1316. Block 1334 receives input from block 1332 and is coupled to receive data from it and to generate data to it.

The data flow is such that data or information flows from, and through, block 1306 into block 1340, then to block 1302, and out to memory. This introduces a pipelining effect, in which multiple operations overlap and are processed concurrently in pipelined fashion. For example, information can be loaded by block 1306 while information is being stored to memory by block 1302. After being received from memory by block 1304, data is stored in blocks 1320 and 1328 of block 1306 and is subsequently provided to and processed by block 1340, the details of which are discussed shortly with reference to subsequent figures.

After processing by block 1340 is complete, the processed data is provided to block 1302 over bus 1310 and stored in blocks 1312 and 1314, where it remains until it is coupled to and received by memory. The buffers of blocks 1314, 1312, 1318 and 1320 have a predetermined width, or number of bits, in parallel. In one example, each of these buffers is 256 bits wide; however, other numbers of bits can be used.

Values or data that may have been processed by block 1340 can be moved from block 1302 to block 1306 for reuse. Also, data can be received from memory by block 1304 and then moved to block 1306 for processing. Further details of each block of block 1340 are now provided. Blocks 1314 and 1312 provide a double-buffering effect, which helps reduce the "stalling" commonly experienced in pipelined operation, as do blocks 1318 and 1320. Stalls are caused by the timing of memory accesses to blocks 1302 and 1306. In another embodiment, blocks 1314 and 1312 can be a single block, and blocks 1318 and 1320 can be a single block.

Latency can be associated with an operation, or there can be a pipelining effect. Latency can be caused by each of the blocks of block 1340.

Figure 14 shows further details of block 1322 according to an embodiment of the invention. In Figure 14, Galois field block 1402 is shown coupled to exclusive-OR (XOR)/Clr circuit 1404, which in turn is shown coupled to accumulator register block 1406. Block 1402 is shown generating Galois field output signal 1408, which is an input to Galois field multiplexer 1410; multiplexer 1410 also receives another input, generated by block 1406 and referred to as accumulator register block output signal 1412.

Signals 1408 and 1412 are inputs to multiplexer 1410, which selectively generates Galois field MAC output signal 1416, which is coupled onto bus 1310 of Figure 13. Select signal 1414, another input to multiplexer 1410, selects one of signals 1408 and 1412 for generating signal 1416. Thus, either the output of block 1402 or the result of the Galois field MAC operation is provided as the output of block 1322, where the output of block 1402 is in effect the result of a Galois field operation.

The output of block 1406 is shown coupled to circuit 1404 as another of its inputs. The output of circuit 1404 is provided to block 1406, and this coupling implements the MAC portion of the Galois field MAC operation. Circuit 1404 in effect performs the XOR multiply operation typically used in Galois field MAC operations.

Block 1402 is shown to include register block 1420 and register block 1422, which are shown coupled to XOR tree block 1424. Block 1420 is also shown to include register block 1426, Galois field multiply iteration 1 block 1428, register block 1430, Galois field multiply iteration 1 block 1432, register block 1434 and register block 1436. Although not shown in the figure, an additional number of register blocks such as blocks 1434 and 1436 are also included, coupled in series between blocks 1434 and 1436.

Block 1424 is shown coupled to block 1426, which in turn is shown coupled to block 1428, which in turn is shown coupled to block 1430, which in turn is shown coupled to block 1432, which in turn is shown coupled to block 1434; block 1434 is coupled to block 1436 or to one or more register blocks located between blocks 1434 and 1436.

In Figure 14, blocks 1420 and 1422 receive inputs from block 1306; in another embodiment, they can be combined into a single block. Block 1402 generally performs Galois field processing known to those skilled in the art, and the remaining blocks of Figure 14 cause the MAC operation to be performed. Blocks 1426, 1430, 1434 and 1436 serve the different iterations of the Galois tree; experience shows that the number of iterations is 8 in the worst case, so 8 register blocks are needed. The multiply portion of the MAC operation is generally performed by the XOR operation carried out by circuit 1404, and block 1406 provides the accumulator function. Circuit 1404 receives its input from the last iteration of the Galois field operation performed by block 1402 (block 1436 in the case of Figure 14).

In operation, block 1322 operates on an N-bit value or data, such as an 8-bit value, and generates an N-bit value or data from it, the generation being performed by shifting the original value eight ways based on another N-bit value. The resulting values are then XORed by circuit 1404, the result is reduced to N bits using a reduction constant, and the result is optionally added to the contents of an N-bit accumulator register, such as a value in block 1406. A "clear" operation can also be performed by block 1406. Examples of applications that use the Galois field MAC operation, and thus block 1322, include but are not limited to cyclic redundancy code (CRC) operations, convolutional encoder operations, scrambling code generator operations and so on.
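A Galois field multiply-accumulate of this kind can be sketched in Python for N = 8. The reduction polynomial 0x11B is an assumption chosen because its products are well documented; the description does not name a specific reduction constant. The function names `gf_mul` and `gf_mac` are likewise hypothetical, and the sketch models the arithmetic only, not the pipelined circuit of Figure 14.

```python
def gf_mul(a, b, poly=0x11B, nbits=8):
    """Multiply a and b in GF(2^nbits), reducing by the polynomial `poly`."""
    product = 0
    for i in range(nbits):
        if (b >> i) & 1:
            # XOR in a copy of a shifted by i, selected by bit i of b.
            product ^= a << i
    # Reduce the wide product back to nbits bits, highest bit first.
    for bit in range(2 * nbits - 2, nbits - 1, -1):
        if (product >> bit) & 1:
            product ^= poly << (bit - nbits)
    return product


def gf_mac(acc, a, b, clear=False):
    """Accumulate a Galois field product; addition in GF(2^n) is XOR."""
    return (0 if clear else acc) ^ gf_mul(a, b)


product = gf_mul(0x57, 0x83)
accumulated = gf_mac(0xFF, 0x57, 0x83)
cleared = gf_mac(0xFF, 0x57, 0x83, clear=True)
```

The eight conditionally selected shifted copies correspond to the eight-way shift of the description, and the optional `clear` argument stands in for the "clear" operation on the accumulator.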

Figure 15 shows, in high-level block diagram form, further details of the circuitry included in block 1324 according to an embodiment of the invention. In Figure 15, multiplexers 1504 and 1502 are shown coupled to A register block 1508 and B register block 1506, respectively. Block 1508 stores a value referred to as A, and block 1502 stores a value referred to as B; the A and B values are the data on which block 1324 operates. The A and B values are each N bits wide.

Blocks 1508 and 1506 are shown generating inputs to condition register block 1512 and are also shown coupled to generate inputs to add/subtract/absolute value/difference/conditional add-subtract/multiply (AGU) block 1510, which in turn generates an input to output register block 1514. Block 1514 is shown coupled to multiplexer 1516, which in turn is shown coupled to accumulator 1518. Accumulator 1518 is shown coupled to accumulator register block 1520, whose output is shown as another input to accumulator 1518. Another output of block 1520 is shown as an input to multiplexer 1522, which receives the output of block 1514 as its other input. Multiplexer 1522 generates output 1530, which is coupled to bus 1310. Some of the inputs to multiplexers 1504 and 1502 are received from block 1316.

Each of multiplexers 1504 and 1502 is shown receiving four inputs. One input of multiplexer 1504, dp, is received from block 1306, as is the dp input of multiplexer 1502. Another input of multiplexer 1504 is the series of lowest-order bits output from block 1514, and the same is true of one of the inputs of multiplexer 1502. Another input of multiplexer 1504 is the higher-order bits of the same output of block 1514. Another input of multiplexer 1504 is the value "0". One of the inputs of multiplexer 1502 is the value "1", and another of its inputs is the value "1". The values "0", "1" and "1" are provided to speed up the operations performed by block 1324: experience shows that these values are used repeatedly in various operations, so having them available improves system performance. It should be noted that a plurality of blocks 1510 may be used to improve performance. Block 1324 is organized as shown in Figure 15 to allow many operations to be performed, so that many operations are performed in a single clock cycle.

In operation, blocks 1510 and 1512 operate on the A and B values provided by blocks 1508 and 1506, respectively. Two other inputs to multiplexer 1516 are generated by the reduction operation block (not shown in Figure 15) in block 1520, which is discussed shortly. For now, these two inputs are called "neighbor-acc-reg" and "reduction-acc-reg", and each is 2N bits wide.

Block 1512 is a 2N-wide register that allows block 1510 to perform conditional addition or conditional subtraction for use in despreading operations. In effect, block 1512 modifies the A and B values for use by block 1510.

In effect, multiplexer 1522 allows the output of block 1510, after being stored in block 1514, to be selectively provided to block 1302 via signal 1530, as determined by a select signal supplied as another input to multiplexer 1522. Otherwise, the result of block 1510 undergoes an accumulate-add operation by blocks 1518 and 1520, and the final result is stored in block 1520 before being provided to block 1302.

Block 1324 is an N-layer ALU comprising one or more ALUs that support the following operations:

- Addition/subtraction of two N-bit values, generating their sum or difference

- N-bit exclusive OR (XOR) of two input values

- Maximum/minimum of two N-bit input values

- The max* operation on two N-bit input values, whose result is computed as max(a, b) + constant (from memory or a small preloaded lookup table)

- Conditional add-subtract: this function, typically enabled by the use of block 1512, conditionally adds or subtracts a stream of N-bit values depending on an input code. The input code is preloaded into a control register. A '1' in the input code causes a subtraction, and a '0' causes an addition. The output is available in a 16-bit accumulator register. A "gather" operation from other special ALUs that support this operation is also supported.

- SAD, using the same accumulator as the conditional add-subtract operation

- N × N multiplication
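The conditional add-subtract operation in the list above can be illustrated with a minimal software sketch. The function name `conditional_add_subtract`, its parameters, and the two's-complement wrap of the accumulator are assumptions made for illustration; the source states only that a '1' code bit causes a subtraction, a '0' causes an addition, and the result is available in a 16-bit accumulator.

```python
def conditional_add_subtract(samples, code_bits, acc_bits=16):
    """Accumulate a stream of values under control of an input code.

    A code bit of 1 subtracts the sample and a code bit of 0 adds it,
    as stated in the text. Wrapping the result to a signed acc_bits-wide
    accumulator is an assumption of this sketch.
    """
    if len(samples) != len(code_bits):
        raise ValueError("samples and code_bits must have the same length")

    acc = 0
    for sample, bit in zip(samples, code_bits):
        acc += -sample if bit else sample

    # Wrap to a signed two's-complement value of acc_bits bits (assumed).
    acc &= (1 << acc_bits) - 1
    if acc >= 1 << (acc_bits - 1):
        acc -= 1 << acc_bits
    return acc
```

For example, `conditional_add_subtract([3, 5, 2, 7], [0, 1, 0, 1])` computes 3 - 5 + 2 - 7 and returns -7.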

Block 1510 is common to the W-type sub-processor, in which each block 1510 can read at least 128 bits, so that when there is no contention in the memory, two blocks can read at least 256 bits per clock cycle.

Figure 16 shows a block diagram of a reduction circuit block 1602 included in block 1520, according to an embodiment of the invention. Figure 16 shows an M-stage accumulator register circuit, with the details of each accumulator register circuit shown in acc-reg block 1610. For example, acc-reg circuit block 1602 comprises four blocks 1610, coupled as shown in Figure 16. Similarly, each of the acc-reg circuit blocks 1604-1608 comprises a four-stage acc-reg circuit, such as one made up of blocks 1610. The output, or result, of each stage in each of blocks 1602-1608 serves as an input to the next stage, so that the results are added, or accumulated. Blocks 1602-1608 are each shown comprising four stages, that is, four blocks such as block 1610, but other numbers of blocks or stages may be used.

The result of each of blocks 1602-1608 is made available to another block. For example, the result of block 1602 serves as an input to block 1604, the result or output of block 1604 serves as an input to the last acc-reg block in block 1608, and the result or output of block 1606 serves as an input to block 1608. Because block results are passed through at the same time as the stages within a block are accumulated, only 7 cycles are needed to perform the reduction operation when four-stage acc-reg blocks are used.

Block 1610 comprises a multiplexer coupled to an adder. The multiplexer is a 2:1 multiplexer that selects which of two inputs is provided to the adder. One of the two inputs of the multiplexer of block 1610 is provided by the output of block 1514, and the other is the result of the preceding acc-reg stage. The reduction function of Figure 16 is thus flexible in how it steers data. Each input from the output of the immediately preceding stage is called a 'neighbor' signal 1616, which generates the neighbor-acc-reg input to multiplexer 1516. The outputs of certain stages generate the reduction-acc-reg input to multiplexer 1516 and are called 'reduction' signals 1618. The output of the last acc-reg block of block 1608 generates output 1620, which is coupled to multiplexer 1530. The reduction circuit of Figure 16 minimizes the number of clock cycles needed to perform a reduction operation while also saving power.

Figure 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326, according to an embodiment of the invention. In Figure 17, block 1326 is shown comprising shifters 1702-1712 for shifting the data input received from block 1306. In one embodiment, input 1700 is 128 bits, although other numbers of bits may be used. The output of each of shifters 1702-1712 is shown coupled to register bank block 1714. Shifters 1702-1712 generate different combinations of the bits of input 1700.

Block 1714 comprises a plurality of registers, including registers 1716 to 1746, which are used to create combinations of the outputs of shifters 1702-1712. For example, the lower 8 bits of the output of each of shifters 1702-1712 can be passed through a multiplexer to select which of these lower 8 bits is ultimately produced. Each register of block 1714 can therefore select arbitrarily among the shifted "bits of interest". The bits of interest are determined by the output of each of shifters 1702-1712. The output of block 1714 is provided to bus 1310.

Thus, in one embodiment of the invention, block 1326 comprises four 20-bit and two 24-bit input registers. It comprises eight 16-bit registers, in which random combinations of 32, 16, 8 and 4 bits from the bits of its input registers are created and stored. Block 1326 can be used in three modes: 1) using two specific 20-bit registers for output generation; 2) using four 20-bit registers for output generation; or 3) using all seven registers for output generation. Shifters 1702-1712 include input registers, which are not shown because the structure and function of shifters are well known to those skilled in the art.

To reduce the amount of hardware, blocks, or circuits needed to perform the combining function of block 1326, each bit in the 32-bit output register can be received only from the 8 least significant bits of two 20-bit registers in the first mode, from the 4 least significant bits of four 20-bit registers in the second mode, and from the 2 least significant bits of four 20-bit registers and the 4 least significant bits of the 24-bit registers in the third mode. Random combination from the input registers is a two-step process: the first step shifts the bits "of interest" to the least significant bit positions, from which, depending on the mode, they can be filled arbitrarily into the output register. In the example used here with reference to Figure 17, when the shift operations on the input registers are pipelined so that the bits of interest arrive at the least significant bit positions, block 1326 can create 16 combined bits per cycle. Some output combinations may take multiple clock cycles.
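The two-step process described above (shift the bits of interest to the least significant positions, then fill the output register from those positions) can be sketched in software as follows. The function name `combine_bits`, the `selectors` format, and the `window` parameter are hypothetical and introduced only for illustration; this is not a model of the actual circuitry of block 1326.

```python
def combine_bits(registers, shifts, selectors, window=8):
    """Build an output word from the low-order bits of shifted registers.

    registers: input register values (non-negative integers)
    shifts:    right-shift amount applied to each register (step one)
    selectors: one (register_index, bit_index) pair per output bit,
               listed from output bit 0 upward (step two)
    window:    number of low-order bits that may be selected from each
               shifted register (8, 4, or 2 depending on mode in the text)
    """
    # Step one: move the bits of interest to the least significant positions.
    shifted = [value >> amount for value, amount in zip(registers, shifts)]

    # Step two: fill each output bit from the low-order window of a register.
    output = 0
    for out_pos, (reg_idx, bit_idx) in enumerate(selectors):
        if not 0 <= bit_idx < window:
            raise ValueError("selected bit lies outside the low-order window")
        output |= ((shifted[reg_idx] >> bit_idx) & 1) << out_pos
    return output
```

For example, `combine_bits([0b10110000, 0b01100000], [4, 5], [(0, 2), (1, 0), (0, 3), (1, 2)])` shifts the registers to `0b1011` and `0b011` and returns `0b0110`.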

Memory 1326 is ordinary random access memory and is therefore not discussed in further detail. Suffice it to say that the size of this memory depends on the application in which the N-type sub-processor is to be used.

Figure 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330, according to an embodiment of the invention. In Figure 18, a single word register 1802 is shown comprising 8 bit positions, and each bit position 1804 can be modified by bit select circuit 1806. Such modifications include, but are not limited to: inserting a '0'; inserting a '1'; complementing the bit, that is, inverting it; or leaving it unmodified, which is equivalent to a "NOP" or no operation. The single word register 1802 is replicated; that is, each of word registers 1810-1820 stores and modifies a word just as register 1802 does. Thus, in the example of 16-bit words and 8 words, the modification of eight 16-bit words is performed in one clock cycle, unlike a conventional DSP, which needs multiple cycles to do the same work. The modification, or puncturing/de-puncturing, of each bit of these words is controlled by multiplexer 1824 and flip-flop 1826, which are coupled to each other and to register 1802 in the manner shown in Figure 18. Registers 1810-1822 are similarly coupled to other multiplexer and flip-flop circuits. Mode select bits, generated from the instruction code, select which of the four multiplexer inputs is chosen. Two of the inputs 1828 to multiplexer 1824 also come from the instruction code, and the other two inputs of this multiplexer come from memory; one of these can be the inverted version of the other, as shown in Figure 18.
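The four per-bit modifications named above (insert '0', insert '1', invert, no operation) can be illustrated with a short sketch. The names `BitOp` and `modify_word` are hypothetical, and treating "insert" as forcing the bit position to the given value is an assumed reading of the text; the sketch processes one word sequentially, whereas the text describes eight words being modified in one clock cycle.

```python
from enum import Enum


class BitOp(Enum):
    """Per-bit modification modes named in the text (encoding is assumed)."""
    NOP = 0        # leave the bit unchanged
    INSERT_0 = 1   # force the bit position to 0
    INSERT_1 = 2   # force the bit position to 1
    INVERT = 3     # complement the bit


def modify_word(word, ops, width=16):
    """Apply one BitOp to each bit position of a word (index 0 is the LSB)."""
    if len(ops) != width:
        raise ValueError("one operation is required per bit position")

    result = 0
    for pos in range(width):
        bit = (word >> pos) & 1
        op = ops[pos]
        if op is BitOp.INSERT_0:
            bit = 0
        elif op is BitOp.INSERT_1:
            bit = 1
        elif op is BitOp.INVERT:
            bit ^= 1
        result |= bit << pos
    return result
```

For example, with `width=4`, applying `[BitOp.INVERT, BitOp.NOP, BitOp.INSERT_1, BitOp.INSERT_0]` to `0b1010` gives `0b0111`.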

The input to the circuit of block 1330 is generated by block 1332, which is discussed shortly; for now, it suffices to say that block 1332 generates the fully interleaved, partially interleaved, or non-interleaved N-bit words supplied to block 1330. In one example, the operation is on 256-bit words, in which case block 1330 operates on 16 bits at a given time. A prefetched control word is used to determine which bits of the 16-bit word must be inverted. Optionally, in addition to inversion, '0' and '1' values are inserted at specific bit positions.

Figure 19 shows, in high-level block diagram form, further details of the circuitry included in block 1332, according to an embodiment of the invention. In Figure 19, memory array 1902 is shown receiving input 104 from an input device via bus 1316, receiving read-enable input 1906 via bus 1316, and receiving input from control row-column address generation block 1908, to generate the output device signal 1910 provided to block 1302. In one example, block 1902 comprises a memory array of 128 × 16 bits. Data can be written to or read from block 1902 by row or by column. That is, either a row or a column of the memory array of block 1902 can be read. In addition, data can be written by row and read out by column, and vice versa.
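Writing by row and reading by column, as described above, is the basis of a classic block interleaver. The following is a minimal sketch of that general technique; the function names and the use of Python lists in place of the memory array of block 1902 are assumptions for illustration only.

```python
def block_interleave(symbols, rows, cols):
    """Write symbols into a rows x cols array by row, then read by column."""
    if len(symbols) != rows * cols:
        raise ValueError("symbol count must equal rows * cols")
    array = [symbols[r * cols:(r + 1) * cols] for r in range(rows)]
    return [array[r][c] for c in range(cols) for r in range(rows)]


def block_deinterleave(symbols, rows, cols):
    """Inverse operation: write by column, then read by row."""
    if len(symbols) != rows * cols:
        raise ValueError("symbol count must equal rows * cols")
    columns = [symbols[c * rows:(c + 1) * rows] for c in range(cols)]
    return [columns[c][r] for r in range(rows) for c in range(cols)]
```

For example, `block_interleave([0, 1, 2, 3, 4, 5], 2, 3)` returns `[0, 3, 1, 4, 2, 5]`, and `block_deinterleave` with the same dimensions restores the original order.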

Figure 20 shows, in high-level block diagram form, further details of the circuitry included in block 1334, according to an embodiment of the invention. In Figure 20, branch metric unit 2002 is shown receiving input from block 1332 and coupled to an add/compare/select block, which is shown coupled to survivor memory block 2012, which is in turn shown coupled to multiplexer 2020, which generates output 2022 coupled to bus 1310. Multiplexer 2020 is also shown receiving another input from the output of adder 2018, which receives its input from multiplexer 2016. Optionally, a sum of absolute differences (SAD) block 2008 and a despreader block 2010 are used to generate the inputs to multiplexer 2016. Without blocks 2008 and 2010, multiplexer 2016, block 2018, and multiplexer 2020 are not used. Local memory 2006 is shown coupled to block 2004. Block 2002 performs branch metric calculations, which are well known to those skilled in Viterbi encoding/decoding. The survivor paths, likewise well known to those skilled in Viterbi encoding/decoding, are stored in block 2012.

Piece 1334 can be carried out turbo decoder, SAD and despreading function.In an example, 32 to 256 phase add-compare-select operations can be carried out for the 16 bit branches and the path metric value that are generated by local storage 2006 concurrently by piece 2004.In an example, the size of local storage 2006 is 1k bit and 16k bit.

Can include a plurality of 2004 in piece 1334, each piece can comprise 8 totalizers than peculiar sign.In addition, each piece can comprise comparison and select piece that it returns triumph path and decision bits.Phase add-compare-select operation cause winning path and decision bits.The triumph path can be used to reduce " multicast " interconnect scheme of grid and shares with adjacent piece 2004 by use.Decision bits with triumph branch and path metric value is stored, to be used to recall (backtrack).
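The add-compare-select operation mentioned above, which yields a winning path and a decision bit, can be sketched for a single trellis state as follows. This is the textbook form of the operation, not the circuit of block 2004; the function name, the two-predecessor trellis, and the convention that a smaller metric wins are assumptions.

```python
def add_compare_select(path_metric_0, branch_metric_0,
                       path_metric_1, branch_metric_1):
    """Perform one add-compare-select step for a single trellis state.

    Adds each predecessor's path metric to its branch metric, compares
    the two candidates, and selects the smaller one (assumed convention).
    Returns (winning_metric, decision_bit); the decision bit records which
    predecessor won and would be stored for later traceback.
    """
    candidate_0 = path_metric_0 + branch_metric_0
    candidate_1 = path_metric_1 + branch_metric_1
    if candidate_1 < candidate_0:
        return candidate_1, 1
    return candidate_0, 0
```

For example, `add_compare_select(10, 3, 8, 4)` compares 13 with 12 and returns `(12, 1)`.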

Block 2008 uses four 8-bit ALUs and, in one example, can compute four absolute differences per cycle. A reduction tree is built into block 2004 to add the absolute differences into a 16-bit accumulator. A multicast network can be used to send these values on for further reduction. A total of 128 8-bit (64 16-bit) SAD operations of block 2008 per clock cycle is possible. It is believed, however, that effective utilization, taking all overhead into account, would result in a lower number.
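The sum of absolute differences computed by block 2008 can be illustrated with a minimal sketch. The function name `sum_of_absolute_differences` and the choice to saturate at the 16-bit accumulator limit are assumptions; the source specifies only 8-bit operands and a 16-bit accumulator.

```python
def sum_of_absolute_differences(block_a, block_b, acc_bits=16):
    """Sum |a - b| over two equal-length sequences of 8-bit samples.

    Saturating at the maximum value of an acc_bits-wide accumulator is
    an assumption of this sketch, not a behavior stated in the text.
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    total = sum(abs(a - b) for a, b in zip(block_a, block_b))
    return min(total, (1 << acc_bits) - 1)
```

For example, `sum_of_absolute_differences([10, 200, 30, 0], [12, 180, 30, 255])` adds 2, 20, 0, and 255 and returns 277.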

This special ALU block implements the same conditional add-subtract function discussed above. The control bits needed for despreading must be loaded into local memory, from which they are fetched and stored in a register. The result is accumulated in a 16-bit accumulator, from which it can be transferred to other blocks 2004 for reduction operations. With despreading, in one example, it is possible to perform 128 simultaneous conditional add-subtract operations in a single cycle. The energy per transfer in this unit is higher than the energy used by the special ALU, which serves some general functions in addition to despreading and SAD. For a smaller number of fingers, or for lower-rate motion estimation, the special ALU is the more power-efficient option.

Figure 21 shows an example of the programming flow and tools for using processor 22, according to an embodiment of the invention. Figure 22 shows an example of the scalability of embodiments of the invention. For example, Figure 22 shows clusters 2202 of N-type and W-type sub-processors, interconnected using bus 2204. Each cluster 2202 comprises two or four sub-processors. In one example, bus 2204 is a standard SoC bus. Interconnectivity is addressed by maintaining a hierarchical design approach.

Scaling processor 20 results in clusters of four sub-processors, with a separate bus for each cluster; otherwise, four sub-processors can share a single memory. Scalability with respect to processors is usually achieved by increasing the number of processors or by raising processor frequency or speed. However, the scaling needed by complex applications exceeds what has been done before. In the present invention, the W-type and N-type sub-processors are modified so that four such sub-processors, forming one processor, can handle a single application.

Processor 22 is therefore equipped to run the control and sequential DSP code found in target applications more efficiently than RISC and superscalar processors that rely directly on compilation from C code. At the same time, it is designed to use the automatic code generation techniques employed by RISC and superscalar processors for legacy applications and compact applications. Moreover, processor 22 works with mature, industry-standard software tools, such as Simulink, for application mapping and development. Moore's law can be exploited to enhance the performance of processor 22. Processor 22 is not merely a highly parallel machine but a heterogeneous multiprocessor. That parallel heterogeneous multiprocessors are needed to handle demanding multimedia and communications applications is an established fact in industry and academia. It allows the use of many of the automatic code generation techniques used in VLIW without using any technique that is inefficient in power and area. Its compilation of control code from C is optimized to exploit repeating patterns. This greatly reduces control power and makes it possible to run compiled serial code efficiently. In addition, the programming model of processor 22 is designed to suit tools (such as Simulink) that are familiar to the large community of DSP programmers. Its development flow provides a means for efficient C compilation of control and sequential DSP code. In addition, an extensive set of libraries of efficient communications and multimedia kernels is provided. Examples are parameterized libraries for FFT, IDCT, RRC, Viterbi, VLC, 2D/3D graphics, turbo codecs, and descramblers.

The datapath design in processor 22 successfully integrates a variety of interconnect structures that connect functional units of different granularities, while still effectively addressing the application mix on which it is focused.

The scalability of processor 22 is designed around a standard SoC bus, such that within a single (time-multiplexed) block all applications are provided with nearest-neighbor connections. A large amount of inefficiency and indeterminism at all system levels is reduced, because multiple blocks can be used to handle multiple applications without any dedicated communication between them.

Figure 23 shows a chart presenting some of the benefits of the scalability of the present invention.

Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. The following claims are therefore intended to be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.

Claims

1. A heterogeneous, high-performance, scalable processor, comprising:

at least one W-type sub-processor capable of processing W bits or more in parallel, W being an integer value;

at least one N-type sub-processor capable of processing N bits in parallel, wherein N is an integer value smaller than W;

a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor; and

a shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor,

wherein the W-type sub-processor rearranges bytes when transferring bytes to or from memory, to accommodate execution of applications allowing for fast operations.

2. The heterogeneous, high-performance, scalable processor as recited in claim 1, wherein the processor is scalable.

3. The heterogeneous, high-performance, scalable processor as recited in claim 1, comprising two of the at least one W-type sub-processor and two of the at least one N-type sub-processor.

4. The heterogeneous, high-performance, scalable processor as recited in claim 2, wherein the at least one W-type sub-processor and the at least one N-type sub-processor execute programs for multimedia applications.

5. The heterogeneous, high-performance, scalable processor as recited in claim 4, wherein each of the at least one W-type sub-processor comprises a plurality of macro function units.

6. The heterogeneous, high-performance, scalable processor as recited in claim 5, wherein the plurality of macro function units comprises a load block memory for generating memory addresses for use by other macro function units of the plurality of macro function units.

7. The heterogeneous, high-performance, scalable processor as recited in claim 6, wherein the plurality of macro function units comprises a scalar arithmetic logic unit (ALU) and multiply-accumulate block coupled to the load block memory, which performs scalar arithmetic, logic, and multiply operations on data received from the load block memory.

8. The heterogeneous, high-performance, scalable processor as recited in claim 7, wherein the plurality of macro function units comprises a vector X block coupled to the load block memory and to the scalar ALU and multiply-accumulate block, which performs vector operations on data from the load block memory, the vector X block generating vector data.

9. The heterogeneous, high-performance, scalable processor as recited in claim 8, wherein the plurality of macro function units comprises a vector ALU and multiply-accumulate block coupled to the scalar ALU and multiply-accumulate block and to the vector X block, for performing vector ALU and multiply-accumulate operations on the vector data received from the vector X block.

10. The heterogeneous, high-performance, scalable processor as recited in claim 2, wherein the at least one N-type sub-processor comprises a store unit block, macro blocks, and a load unit block, the macro blocks being coupled to the load unit block and to a macro function bus, the macro function bus being used to couple the macro blocks to block memory.

11. The heterogeneous, high-performance, scalable processor as recited in claim 10, wherein the at least one N-type sub-processor comprises a data path unit (DPU) block and a controller, sequencer, and data address generator (DAG) block shared by the at least one W-type sub-processor.

12. The heterogeneous, high-performance, scalable processor as recited in claim 10, wherein the macro blocks comprise a Galois field multiply-accumulate (MAC) block coupled to the macro function bus and the load unit block 1306, for performing Galois field operations.

13. The heterogeneous, high-performance, scalable processor as recited in claim 12, wherein the macro blocks comprise a special ALU coupled to the load unit block and a load unit block, for performing special ALU operations.

14. The heterogeneous, high-performance, scalable processor as recited in claim 13, wherein the macro blocks comprise a puncture/de-puncture block coupled to the load unit block and a load unit block, for performing puncture/de-puncture operations.

15. The heterogeneous, high-performance, scalable processor as recited in claim 14, wherein the macro blocks comprise an interleaver block coupled to the load unit block and a load unit block, for performing interleaving operations.

16. The heterogeneous, high-performance, scalable processor as recited in claim 15, wherein the macro blocks comprise a Viterbi block coupled to the store unit block and the interleaver block, for performing Viterbi operations.

17. The heterogeneous, high-performance, scalable processor as recited in claim 16, wherein the macro blocks comprise a combiner block coupled to the load unit block and a load unit block, for performing combining operations.

18. The heterogeneous, high-performance, scalable processor as recited in claim 16, wherein the at least one N-type sub-processor comprises an X unit block coupled between the store unit block and the load unit block.

19. The heterogeneous, high-performance, scalable processor as recited in claim 16, comprising a shared register coupled between the at least one W-type sub-processor and the at least one N-type sub-processor, for direct communication between them.

20. A method of processing information, comprising:

in a heterogeneous, high-performance, scalable processor:

processing data using at least one W-type sub-processor capable of processing W bits in parallel, W being an integer value;

simultaneously processing data using at least one N-type sub-processor capable of processing N bits in parallel, wherein N is an integer value and is one half of W; and

enabling fast execution of multimedia applications while maintaining low power consumption and ease of programming.