CN1781088A

CN1781088A - Multithreaded processor with efficient processing for convergence device applications

Info

Publication number: CN1781088A
Application number: CN 02827350
Authority: CN
Inventors: Erdem Hokenek; Mayan Moudgill; John C. Glossner
Original assignee: Sandbridge Technologies Inc
Current assignee: Qualcomm Inc
Priority date: 2001-12-20
Filing date: 2002-12-11
Publication date: 2006-05-31
Anticipated expiration: 2022-12-11
Also published as: CN100359506C

Abstract

A multithreaded processor includes an instruction decoder for decoding retrieved instructions to determine an instruction type for each of the retrieved instructions, an integer unit coupled to the instruction decoder for processing integer type instructions, and a vector unit coupled to the instruction decoder for processing vector type instructions. A reduction unit is preferably associated with the vector unit and receives parallel data elements processed in the vector unit. The reduction unit generates a serial output from the parallel data elements. The processor may be configured to execute at least control code, digital signal processor DSP code, Java code and network processing code, and is therefore well-suited for use in a convergence device. The processor is preferably configured to utilize token triggered threading in conjunction with instruction pipelining.

Description

Multithreaded processor with efficient processing for convergence device applications

Related application

This application claims priority to U.S. Provisional Patent Application Serial No. 60/341,289, filed December 20, 2001 and entitled "Method and Apparatus for Multithreaded Processor," which is incorporated herein by reference.

This application is related to the inventions described in the applications with Attorney Docket No. 1007-5, entitled "Method and Apparatus for Thread-Based Memory Access in a Multithreaded Processor," Attorney Docket No. 1007-7, entitled "Method and Apparatus for Register File Port Reduction in a Multithreaded Processor," and Attorney Docket No. 1007-8, entitled "Method and Apparatus for Token Triggered Multithreading," all of which are filed concurrently herewith and incorporated herein by reference.

Technical field

The present invention relates generally to the field of digital data processors, and more particularly to multithreaded processors.

Background art

A multithreaded processor is a processor that supports simultaneous execution of multiple distinct instruction sequences, or "threads." Conventional threading techniques are described in, for example, M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design," Jones and Bartlett Publishers, Boston, MA, 1995, and G.A. Blaauw and Frederick P. Brooks, "Computer Architecture: Concepts and Evolution," Addison-Wesley, Reading, Massachusetts, 1997, both of which are incorporated herein by reference.

Existing multithreaded processors are generally not optimized for use in convergence devices, that is, devices configured to process voice, data, audio, video and other information transmitted over a variety of different media. Such a device typically needs to execute many different types of code, including digital signal processor (DSP) code associated with signal processing operations and high-level application code written in Java or another object-oriented programming language. A more specific example of such a convergence device is a wireless mobile unit for a recently developed high-speed CDMA communication system, such as the Third Generation Partnership Project (3GPP) wideband CDMA (WCDMA) system described in 3GPP Technical Specifications TS 25.1xx, which are incorporated herein by reference.

A need therefore exists for an improved multithreaded processor that is particularly well suited for use in a convergence device.

Summary of the invention

The present invention provides an improved multithreaded processor which, in an illustrative embodiment, can efficiently execute RISC-based control code, DSP code, Java code and network processing code, and is therefore particularly well suited for use in a 3GPP WCDMA mobile unit or other convergence device.

In accordance with one aspect of the invention, a multithreaded processor comprises: an instruction decoder for decoding retrieved instructions to determine an instruction type for each of the retrieved instructions; an integer unit coupled to the instruction decoder for processing integer type instructions; and a vector unit coupled to the instruction decoder for processing vector type instructions. A reduction unit is preferably associated with the vector unit and receives parallel data elements processed in the vector unit. The reduction unit generates a serial output from the parallel data elements. The processor is preferably configured to utilize token triggered threading in conjunction with instruction pipelining.

Brief description of the drawings

Fig. 1 is a block diagram of an illustrative embodiment of a multithreaded processor in accordance with the invention.

Fig. 2 is a block diagram illustrating one possible implementation of a processing system incorporating the multithreaded processor of Fig. 1.

Detailed description

The present invention will be described herein in conjunction with an exemplary multithreaded processor and a corresponding processing system. It should be understood, however, that the invention does not require the particular multithreaded processor and processing system configurations of the illustrative embodiment, and is more generally suitable for use in any multithreaded processor or information processing system application in which improved processor performance is desired. In addition, although particularly well suited for use in convergence devices, the multithreaded processor of the invention can also be used in other types of devices.

As will be described in greater detail below, an illustrative embodiment of a multithreaded processor in accordance with the invention can execute RISC-based control code, digital signal processor (DSP) code, Java code and network processing code. The processor includes a single instruction multiple data (SIMD) vector unit, a reduction unit, and long instruction word (LIW) compound instruction execution.

Fig. 1 shows a multithreaded processor 102 in accordance with the invention. The multithreaded processor 102 includes a multithreaded cache memory 110, a multithreaded data memory 112, an instruction decoder 116, a register file 118 and a memory management unit (MMU) 120. The multithreaded cache memory 110 is also referred to herein as the multithreaded cache.

The multithreaded cache 110 includes a plurality of thread caches 110-1, 110-2, ..., 110-N, where N generally denotes the number of threads supported by the multithreaded processor 102; in this particular example, N=4. Other values of N may of course be used, as will be apparent to those skilled in the art.

Each thread thus has a corresponding thread cache associated with it in the multithreaded cache 110. Similarly, the data memory 112 includes N distinct data memory instances, denoted data memories 112-1, 112-2, ..., 112-N as shown.

The processor 102 can implement token triggered multithreading, as described, for example, in the above-cited U.S. patent application with Attorney Docket No. 1007-8, entitled "Method and Apparatus for Token Triggered Multithreading." Token triggered threading typically assigns a different token to each of a plurality of processor threads. For example, token triggered threading may use a token, in conjunction with the current processor clock cycle, to identify the particular one of the processor threads that will be permitted to issue an instruction for a subsequent clock cycle. Other types of threading may be used in addition or as an alternative.
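The token mechanism can be illustrated with a minimal sketch. It assumes the simplest possible policy, in which the token held in the current cycle names the next thread in a fixed circular order; the source does not specify the token sequence, and the names N_THREADS, next_token and issue_schedule are hypothetical.

```python
# Hypothetical sketch of token triggered threading with a round-robin token.
# N_THREADS, next_token() and issue_schedule() are illustrative names only.

N_THREADS = 4  # the illustrative embodiment uses N = 4


def next_token(current_thread, n_threads=N_THREADS):
    """Return the thread permitted to issue in the subsequent clock cycle.

    Assumption: the token simply names the next thread in circular order.
    """
    return (current_thread + 1) % n_threads


def issue_schedule(start_thread, cycles):
    """List which thread issues on each of `cycles` consecutive clock cycles."""
    schedule = []
    thread = start_thread
    for _ in range(cycles):
        schedule.append(thread)
        thread = next_token(thread)
    return schedule
```

Under this assumed policy, each thread issues once every N cycles, so a thread's own instructions are naturally spaced apart in the pipeline. Other token orderings would fit the same interface.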

Each of the thread caches in the multithreaded cache 110 may comprise a memory array having one or more sets of memory locations. A given thread cache may further comprise a thread identifier register for storing an associated thread identifier.

The multithreaded cache 110 interfaces with a main memory (not shown) external to the processor 102 via the MMU 120. Like the cache 110, the MMU 120 includes a separate instance for each of the N threads supported by the processor. The MMU 120 ensures that the appropriate instructions from main memory are loaded into the multithreaded cache 110. The MMU 120, which may comprise or otherwise be associated with a cache controller, may implement at least a portion of an address mapping technique, such as fully associative mapping, direct mapping or set-associative mapping. Illustrative set-associative mapping techniques suitable for use in conjunction with the present invention are described in commonly assigned U.S. Patent Application Serial Nos. 10/161,774 and 10/161,874, both filed June 4, 2002, which are incorporated herein by reference.
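As a generic illustration of set-associative mapping (not the specific techniques of the cited applications), the sketch below splits an address into a tag, a set index and a byte offset. The line size of 32 bytes and the 64 sets are assumed values chosen only for the example, and split_address is a hypothetical helper.

```python
# Generic set-associative address split; parameters are assumptions,
# not values taken from the patent.

def split_address(addr, line_bytes=32, n_sets=64):
    """Split a byte address into (tag, set_index, offset)."""
    offset = addr % line_bytes                 # byte within the cache line
    set_index = (addr // line_bytes) % n_sets  # which set the line maps to
    tag = addr // (line_bytes * n_sets)        # identifies the line within the set
    return tag, set_index, offset
```

With n_sets equal to 1 this degenerates to fully associative mapping, and with one line per set it corresponds to direct mapping, which is why the three techniques are usually presented together.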

The data memory 112 is also typically connected directly to the above-noted external main memory, although this connection is not explicitly shown in the figure. Also coupled to the data memory 112 is a data buffer 130.

Thread-based access to the data memory 112 or to other memories associated with a multithreaded processor is described in the above-cited U.S. patent application with Attorney Docket No. 1007-5, entitled "Method and Apparatus for Thread-Based Memory Access in a Multithreaded Processor."

In general, the multithreaded cache 110 stores the instructions to be executed by the multithreaded processor 102, while the data memory 112 stores the data operated on by those instructions. Instructions are fetched from the multithreaded cache 110 by the instruction decoder 116 and decoded. Depending on the instruction type, the instruction decoder 116 may forward a given instruction or associated information to various other units within the processor, as described below.

The processor 102 also includes a set of auxiliary registers 132, which in this example comprise control registers (CR) 134, link registers (LR) 136 and counter registers (CTR) 138. These auxiliary registers assist program control flow by modifying the location from which instructions are fetched. As shown, this illustrative embodiment includes one instance of each of the auxiliary registers 134, 136 and 138 for each thread.

Other registers in the processor 102 include a branch register 140 and a program counter (PC) register 142. Like the auxiliary registers 134, 136 and 138, the program counter register 142 includes one instance for each thread. The branch register 140 receives instructions from the instruction decoder 116 and, in conjunction with the program counter register 142, provides input to an add block 144. Elements 140, 142 and 144 collectively comprise a branch unit of the processor 102. The branch unit controls the fetching of instructions in the instruction pipeline implemented by the processor.

The register file 118 provides temporary storage of integer results. Instructions from the instruction decoder 116 are provided to an integer instruction queue (IQ) 150 and decoded, and the proper hardware thread unit is selected through use of an offset unit 152, which is shown as including a separate instance for each thread. The offset unit 152 inserts explicit bits into register file addresses so that independent thread data is not corrupted. For a given thread, these explicit bits may comprise, for example, the corresponding thread identifier.
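One way to picture the offset unit is as a function that places the thread identifier in the upper bits of the register file address. The sketch below is an assumption for illustration only: the source does not state the number of registers per thread or where the inserted bits sit, and REG_INDEX_BITS and physical_register are hypothetical names.

```python
# Hypothetical model of an offset unit that prefixes the thread identifier
# to a register index. The 5-bit index width is an assumed value.

REG_INDEX_BITS = 5  # assumed: 32 architectural registers per thread


def physical_register(thread_id, reg_index, n_threads=4):
    """Combine a thread identifier and a register index into one address."""
    if not 0 <= thread_id < n_threads:
        raise ValueError("thread_id out of range")
    if not 0 <= reg_index < (1 << REG_INDEX_BITS):
        raise ValueError("reg_index out of range")
    # Thread identifier occupies the high bits, so threads never overlap.
    return (thread_id << REG_INDEX_BITS) | reg_index
```

Because the thread identifier bits differ for every thread, two threads that name the same architectural register are directed to distinct physical locations, which is the property the offset unit is described as guaranteeing.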

As shown, the register file 118 is coupled to input registers RA and RB, the outputs of which are coupled to an add block 154. The input registers RA and RB are used in implementing instruction pipelining. The output of the add block 154 is coupled to the data memory 112.

In accordance with the invention, the register file 118, integer instruction queue 150, offset unit 152, elements RA and RB, and add block 154 collectively comprise an exemplary integer unit.

Techniques for thread-based access to register files such as the register file 118 are described in the above-cited U.S. patent application with Attorney Docket No. 1007-7, entitled "Method and Apparatus for Register File Port Reduction in a Multithreaded Processor."

The instruction types executable in the processor 102 include Branch, Load, Store, Integer and Vector/SIMD instruction types. If a given instruction does not specify a Branch, Load, Store or Integer operation, it is a Vector/SIMD instruction. Other instruction types may also be used. The Integer and Vector/SIMD instruction types are examples of what are more generally referred to herein as integer and vector instruction types, respectively.
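The classification rule above, in which Vector/SIMD is the default type, can be sketched as a table lookup with a fallback. The mnemonics in KNOWN_OPS are hypothetical placeholders, since the source defines no opcode encoding; only the five type names and the queue reference numerals come from the description.

```python
from enum import Enum


class InstrType(Enum):
    BRANCH = "branch"
    LOAD = "load"
    STORE = "store"
    INTEGER = "integer"
    VECTOR = "vector"


# Hypothetical mnemonics; the source does not define an opcode encoding.
KNOWN_OPS = {
    "br": InstrType.BRANCH,
    "ld": InstrType.LOAD,
    "st": InstrType.STORE,
    "add": InstrType.INTEGER,
}


def classify(mnemonic):
    """Anything that is not branch, load, store or integer is vector/SIMD."""
    return KNOWN_OPS.get(mnemonic, InstrType.VECTOR)


def route(mnemonic):
    """Name the queue that would receive the decoded instruction."""
    kind = classify(mnemonic)
    if kind is InstrType.INTEGER:
        return "integer instruction queue 150"
    if kind is InstrType.VECTOR:
        return "vector instruction queue 156"
    return "other units"
```

For example, route("add") selects the integer instruction queue, while an unrecognized mnemonic such as "vmul" falls through to the vector instruction queue.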

A vector IQ 156 receives Vector/SIMD instructions forwarded from the instruction decoder 116. A corresponding offset unit 158, shown as including a separate instance for each thread, inserts the appropriate bits to ensure that independent thread data is not corrupted.

A vector unit 160 of the processor 102 is divided into N distinct parallel portions and includes a vector file 162 that is similarly divided. The vector file 162 serves substantially the same purpose as the register file 118, except that the former operates on Vector/SIMD instruction types.

The vector unit 160 illustratively comprises the vector instruction queue 156, the offset unit 158, the vector file 162, and the arithmetic and storage elements associated therewith.

The vector unit 160 operates as follows. A Vector/SIMD block of data encoded as a fractional or integer data type is read from the vector file 162 and stored in architecturally visible registers VRABC. From there, processing continues through the MPY blocks, which perform parallel concurrent multiplication of the Vector/SIMD data. The results are stored in architecturally visible registers PABC. Adder units may then perform additional arithmetic operations, with the results stored in accumulator (ACC) registers. The data then passes to the reduction unit 164, where the results are accumulated in parallel while serial semantics are preserved. Serial semantics means that the output is substantially the same as the result that would be generated if the four saturating values computed in parallel in the vector unit 160 were instead computed serially. Such an output is also referred to herein as a serial output. The resulting reduced sum is stored in a saturation register denoted SAT.
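The significance of serial semantics can be shown with a small reference model. The sketch below is not the parallel hardware algorithm of the reduction unit; it only computes the serial result that such a unit would have to reproduce, and contrasts it with a naive sum that saturates once at the end. The 32-bit accumulator range and the function names are assumptions.

```python
# Reference model of serial saturating multiply-accumulate semantics.
# A 32-bit signed accumulator is assumed; the source gives no bit widths.

SAT_MAX = 2**31 - 1
SAT_MIN = -(2**31)


def saturate(x):
    """Clamp x to the assumed accumulator range."""
    return max(SAT_MIN, min(SAT_MAX, x))


def serial_mac(acc, a, b):
    """Serial semantics: saturate after every multiply-accumulate step."""
    for x, y in zip(a, b):
        acc = saturate(acc + x * y)
    return acc


def naive_mac(acc, a, b):
    """Sum all products first and saturate only once at the end."""
    return saturate(acc + sum(x * y for x, y in zip(a, b)))
```

Starting from an accumulator of SAT_MAX - 10 with products 100 and -200, the serial model clamps at SAT_MAX after the first step and ends at SAT_MAX - 200, whereas the naive sum ends at SAT_MAX - 110. A parallel reduction that claims serial semantics therefore has to account for intermediate saturation rather than simply adding the partial results.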

The reduction unit 164 and other portions of the vector unit 160 may also utilize techniques similar to those described in N. Yadav, M. Schulte and J. Glossner, "Parallel Saturating Fractional Arithmetic Units," Proceedings of the 9th Great Lakes Symposium on VLSI, pp. 172-179, Ann Arbor, Michigan, March 4-6, 1999, which is incorporated herein by reference.

Although the reduction unit 164 is shown as part of the vector unit 160 in this illustrative embodiment, it may alternatively be implemented as a separate unit.

The processor 102 preferably utilizes pipelined instruction processing. For example, the processor 102 may use an instruction pipeline in which each thread issues a single instruction per processor clock cycle. As another example, the instruction pipeline may be configured so that each thread issues multiple instructions per processor clock cycle. More particularly, with a sufficient number of threads and appropriate pipelining, each thread of the processor can issue a load instruction and a vector multiply instruction concurrently in a given processor clock cycle without stalling any of the threads.

Advantageously, the processor 102 shown in Fig. 1 can efficiently execute a variety of different types of code, including RISC-based control code, DSP code, Java code and network processing code. The processor 102 is therefore particularly well suited for implementation in a convergence device such as a 3GPP WCDMA mobile unit.

Fig. 2 shows an example of a processing system 200 in which the processor 102 may be implemented. The processing system 200 may be viewed, for example, as an element of a convergence device such as the above-noted 3GPP WCDMA mobile unit.

More particularly, the processing system 200 in this embodiment is configured to support both WCDMA and Global System for Mobile communications (GSM) wireless communications, while also processing voice, data, audio, video and other information transmitted over a variety of different media.

The processing system 200 includes DSP hardware 202 and a microprocessor 204. The DSP hardware 202 is shown as comprising first and second instances, denoted 202-1 and 202-2. The DSP hardware 202 is coupled to an associated internal memory 206, and the microprocessor 204 is coupled to an associated internal memory 208. The memories 206 and 208 are referred to as "internal" because they are internal to the processing system 200, and both may represent portions of a common memory. The DSP hardware 202 and the microprocessor 204 may each also communicate with one or more external memories, which are not shown.

The DSP hardware 202 and the microprocessor 204 are preferably both implemented using a single multithreaded processor such as that shown in Fig. 1. Other configurations, such as those based on multiple processors, may also be used.

The first instance 202-1 of the DSP hardware 202 illustratively includes a number of processing elements, including a GSM channel equalizer, GSM channel encoder, GSM burst builder, GSM channel decoder, GSM speech decoder, GSM speech encoder, GSM transmitter, encryption/decryption, timing control, WCDMA transmitter, filtering, gain and frequency control, WCDMA searcher, Rake receiver, channel encoder, WCDMA speech decoder, WCDMA speech encoder and channel decoder. Other elements include Windows Media Audio (WMA), Real Media, Joint Photographic Experts Group (JPEG/JPEG2000), Moving Picture Experts Group Layer 3 Audio (MP3), Advanced Audio Coding (AAC) and Musical Instrument Digital Interface (MIDI). The operation of these elements is well known in the art and is therefore not described in further detail herein.

The second instance 202-2 of the DSP hardware 202 may be configured in a similar manner, or may include other processing elements suitable for supporting other communication functions within the processing system 200.

The microprocessor 204 is shown as including a number of processing elements, including man-machine interface (MMI), Moving Picture Experts Group 4 (MPEG4), protocol stack, short message service/message management system (SMS/MMS) and real-time operating system (OS) elements. Again, the operation of these elements is well known in the art.

The processing system 200 further includes a communication bus 210 coupled between the DSP hardware 202, the microprocessor 204 and system elements 212. Similarly, a communication bus 214 is coupled between the DSP hardware 202 and system elements 216.

The system elements 212 include a digital camera, video camera, universal serial bus (USB), universal asynchronous receiver/transmitters (UARTS), SCSI parallel interface (SPI), intelligent interface controller (I2C), general-purpose input/output (GPIO), security identity module/universal subscriber identity module (SIM/USIM), external memory I/O, keyboard, LCD, interrupt controller and direct memory access (DMA) controller.

The system elements 216 include receiver I/O, transmitter I/O and Bluetooth I/O.

Other system elements shown in the figure include test input/output (I/O) 218, system clock and control 220, and power management 222.

The operation of the system elements 212, 216, 218, 220 and 222 is well known in the art, and these elements are therefore not further described herein.

As noted above, the functions associated with both the DSP hardware 202 and the microprocessor 204 can be performed on a single multithreaded processor such as the multithreaded processor 102. The multithreaded processor 102 can thus be used to execute code associated with the system elements 212, 216, 218, 220 and 222 as well as code associated with the DSP hardware 202 and the microprocessor 204.

The microprocessor 204 in the processing system 200 may be used to run code associated with higher-layer applications.

The processing elements associated with the DSP hardware 202 may be implemented using software compilation. Advantageously, software compilation allows high-level programming languages to be translated efficiently.

It should be understood that the invention does not require the particular multithreaded processor and processing system configurations shown in Figs. 1 and 2. As noted previously, the invention can be implemented in a wide variety of other multithreaded processor and processing system configurations.

It should also be appreciated that the particular arrangements shown in Figs. 1 and 2 are simplified for clarity of illustration, and that additional or alternative elements not explicitly shown may be included.

The above-described embodiments of the invention are thus intended to be illustrative only, and numerous alternative embodiments within the scope of the claims will be apparent to those skilled in the art.

Claims

1. A multithreaded processor comprising:

an instruction decoder for decoding retrieved instructions to determine an instruction type for each of at least a subset of the retrieved instructions;

an integer unit coupled to the instruction decoder for processing integer type instructions received therefrom; and

a vector unit coupled to the instruction decoder for processing vector type instructions received therefrom.

2. The multithreaded processor of claim 1, further comprising a reduction unit associated with the vector unit and receiving parallel data elements processed in the vector unit, the reduction unit generating a serial output from the parallel data elements.

3. The multithreaded processor of claim 1, wherein the instructions are retrieved by the instruction decoder from a multithreaded cache memory of the multithreaded processor, the multithreaded cache memory comprising a thread cache for each of a plurality of threads of the processor.

4. The multithreaded processor of claim 1, wherein the integer unit further comprises: an integer instruction queue having an input coupled to an output of the instruction decoder; a register file having an input coupled to an output of the integer instruction queue; an offset unit having an output coupled to an input of the register file; and an adder unit having at least one input coupled to an output of the register file.

5. The multithreaded processor of claim 4, wherein the offset unit comprises a separate instance for each of a plurality of threads supported by the processor.

6. The multithreaded processor of claim 1, wherein the vector unit further comprises: a vector instruction queue having an input coupled to an output of the instruction decoder; a vector file having an input coupled to an output of the vector instruction queue; an offset unit having an output coupled to an input of the vector file; and at least one arithmetic element having an input coupled to an output of the vector file.

7. The multithreaded processor of claim 6, wherein the offset unit comprises a separate instance for each of a plurality of threads supported by the processor.

8. The multithreaded processor of claim 1, wherein the processor is configured to support at least branch, load, store, integer and vector instruction types.

9. The multithreaded processor of claim 8, wherein the vector instruction type comprises a single instruction multiple data (SIMD) instruction type.

10. The multithreaded processor of claim 1, wherein the vector unit comprises a plurality of parallel portions, each corresponding to a particular thread of the processor.

11. The multithreaded processor of claim 10, wherein each of the parallel portions comprises a series combination of a portion of a vector file, a multiplier, an adder and an accumulator.

12. The multithreaded processor of claim 1, wherein the processor is configured to execute at least control code, digital signal processor (DSP) code, Java code and network processing code.

13. The multithreaded processor of claim 1, wherein the processor is configured to utilize token triggered threading.

14. The multithreaded processor of claim 13, wherein the token triggered threading uses a token, in conjunction with a current processor clock cycle, to identify a particular one of a plurality of threads of the processor that will be permitted to issue an instruction for a subsequent clock cycle.

15. The multithreaded processor of claim 13, wherein the token triggered threading assigns a different token to each of a plurality of threads of the processor.

16. The multithreaded processor of claim 1, wherein the processor is configured for pipelined instruction processing.

17. The multithreaded processor of claim 16, wherein the processor utilizes an instruction pipeline in which each thread issues a single instruction per processor clock cycle.

18. The multithreaded processor of claim 16, wherein the processor utilizes an instruction pipeline in which each thread issues multiple instructions per processor clock cycle.

19. The multithreaded processor of claim 18, wherein each of a plurality of threads of the processor issues a load instruction and a vector multiply instruction concurrently in each of a corresponding plurality of processor clock cycles without stalling any of the plurality of threads.

20. A processing system comprising:

a multithreaded processor; and

a memory coupled to the multithreaded processor;

the multithreaded processor comprising: an instruction decoder for decoding retrieved instructions to determine an instruction type for each of at least a subset of the retrieved instructions; an integer unit coupled to the instruction decoder for processing integer type instructions received therefrom; and a vector unit coupled to the instruction decoder for processing vector type instructions received therefrom.