CN101031904A - Programmable processor system with two kinds of sub-processors for executing multimedia applications

Info

Publication number: CN101031904A
Application number: CN 200580030649
Authority: CN
Grant status: Application
Prior art keywords: block, processor, coupled, heterogeneous, high performance
Other languages: Chinese (zh)
Inventor: R·阿米特, R·H·小约翰
Original Assignee: 3加1科技公司

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W by a factor of two. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges memory to accommodate the execution of applications, allowing for fast operation.

Description

A programmable processor system with two types of sub-processors for executing multimedia applications

Background of the Invention

Cross-Reference to Related Patent Applications

This application claims the benefit of U.S. Provisional Patent Application No. 60/598,691, filed July 13, 2004 and entitled "Quasi-Adiabatic Programmable or COOL Processors Architecture", and of U.S. Provisional Patent Application No. 60/598,417, filed August 2, 2004 and entitled "Quasi-Adiabatic Programmable Processors Architecture".

Technical Field

The present invention relates generally to the field of processors and, more particularly, to processors that have low power consumption, high performance and low die area, and that can be used flexibly and scalably in multimedia and communications applications.

Background Art

With the growing popularity of consumer gadgets such as cellular or mobile phones, digital cameras, iPods and personal data assistants (PDAs), many new standards for communicating with these gadgets have been widely adopted by the industry. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security. An emerging problem, however, is that governing communication among and with different gadgets under different standards requires a tremendous amount of development effort. One reason for this problem is that no processor or sub-processor currently available on the market is easily programmable for use by all digital devices while conforming to the various mandated standards. It is only a matter of time before this problem worsens, because the trend in consumer electronics is to admit even more standards adopted by the industry in the future.

One of the emerging, if not already current, requirements of processors is low power consumption combined with the ability to execute enough code to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to stay under sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Because processors are so widely used in consumer products, they must be inexpensive to manufacture; otherwise their use in most common consumer electronics is impractical.

To provide specific examples of current processor problems, the problems associated with RISC processors, microprocessors, digital signal processors (DSPs) and application-specific integrated circuits (ASICs), each used in different consumer products, as well as certain other well-known processors, each presenting unique problems, are briefly described below. These problems, along with the advantages of each type of processor, are outlined in the "Cons" sections, which discuss disadvantages, and the "Pros" sections, which discuss advantages.

A. RISC/Superscalar Processors

RISC and superscalar processors are the most widely accepted architectural solution for all general-purpose computing. They are often augmented with special-purpose accelerators to solve certain specialized problems within the context of a general solution.

Examples include the ARM series, ARC series, StrongARM series and MIPS series.

Pros:

· Wide industry acceptance has led to more mature tool chains and a broad choice of software.

· A robust programming model results from very efficient automatic code generators, which are used to generate binaries from high-level languages such as C.

· Processors in this category are very good general-purpose solutions.

· Moore's law can be used effectively to increase performance.

Cons:

· The general-purpose nature of the architecture cannot leverage the common or specific characteristics of a set or subset of applications to achieve better price, power and performance.

· They consume moderate to high amounts of power relative to the amount of computation provided.

· Performance gains are achieved mainly at the expense of pipeline latency, which adversely affects several multimedia and communications algorithms.

· Complicated hardware schedulers, sophisticated control mechanisms and significantly relaxed restrictions, adopted for more efficient automatic code generation for general algorithms, make this category of solutions rather area-inefficient.

B. Very Long Instruction Word (VLIW) and DSPs

VLIW architectures eliminate some of the inefficiencies found in RISC and superscalar architectures to create a fairly general solution in the digital signal processing space. Parallelism is significantly increased. Responsibility for scheduling is shifted from hardware to software in order to save area.

Examples include TI64xx, TI55xx, StarCore SC140 and the ADI SHARC series.

Pros:

· Restricting the solution to the signal processing space improves the 3P (price, power and performance) compared with RISC and superscalar architectures.

· VLIW architectures provide a higher level of parallelism than RISC and superscalar architectures.

· Efficient tool chains and wide industry acceptance were generated fairly quickly.

· Automatic code generation and programmability are showing significant improvements, as several kinds of processors designed for signal processing fall into this category.

Cons:

· Although the problem-solving capability is narrowed to the digital signal processing space, it is still too broad for a general solution such as a VLIW machine to have efficient 3P.

· Control is expensive and power-consuming, especially for the primitive control code found in many multimedia and communications applications.

· Several power- and area-inefficient techniques are used to make automatic code generation easier. The software community's strong dependence on these techniques has carried this inefficiency forward from generation to generation.

· VLIW architectures are not well suited to processing serial code.

C. Reconfigurable Computing

Over the past 10 years, several efforts in industry and academia have focused on developing a flexible solution with ASIC-like price, power and performance characteristics. Many of them challenged existing and mature laws and design paradigms with little industry success. Most attempts have been in the direction of creating solutions based on coarser-grained, FPGA-like architectures.

Pros:

· Some designs that are restricted to a specific application, while providing the flexibility needed within that application, have proved competitive on price, power and performance.

· Research shows that such restricted yet still flexible solutions can be created to address many application hotspots.

Cons:

· Several designs in this space do not provide an efficient and easy programming solution and have therefore not been widely accepted by a community adept at programming DSPs.

· Automatic code generation from higher-level languages such as C is either practically impossible or highly inefficient for many designs.

· The 3P advantage is lost when attempting to combine heterogeneous applications using one type of interconnect and one level of granularity. Utilization of the provided parallelism suffers severely.

· For most designs, the overhead of reconfiguration is significant in 3P terms.

· In many cases the external interface is complicated, because the proprietary reconfigurable fabric does not match industry-standard system design methodologies.

· Reconfigurable machines are uniprocessors and rely heavily on a tightly integrated RISC, even for processing primitive control.

D. Array of Processors

Some recent approaches focus on making reconfigurable systems better suited to processing heterogeneous applications. Solutions in this direction connect multiple processors, each optimized for one application or a set of applications, to create a processor array fabric.

Pros:

· Different processors optimized for different sets of applications, when connected together using an efficient fabric, can help solve many problems.

· A uniform scaling model allows multiple processors to be connected together as performance requirements increase.

· Complex algorithms can be partitioned efficiently.

Cons:

· Although performance requirements may be adequately met, the power and price inefficiencies remain too severe.

· The programming model varies from processor to processor, which makes the application developer's job harder.

· Uniform scaling of multiple processors is very expensive and consumes considerable power. It has also been shown to exhibit some non-determinism that may be detrimental to overall system performance.

· Without shared memory resources (shared memory does not scale uniformly), the programming model at the system level suffers from the complexity of communicating data, code and control information.

· The large, repeated glue logic required to connect different types of processors to a homogeneous network adds to area inefficiency, power and latency.

In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor that allows one or more multimedia applications to be executed simultaneously.

Summary of the Invention

Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or more in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges bytes while transferring them to or from memory to accommodate the execution of applications, allowing for fast operation.
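The byte rearrangement mentioned in the summary can be pictured with a small sketch. The text does not say which rearrangements the W-type sub-processor supports, so the function name `rearrange_bytes`, the `order` parameter and the byte-reversal example below are all assumptions made purely for illustration.

```python
from typing import Sequence


def rearrange_bytes(word: bytes, order: Sequence[int]) -> bytes:
    """Permute the bytes of `word` so that output[i] == word[order[i]].

    Hypothetical helper: `order` must list every byte index exactly once.
    """
    if sorted(order) != list(range(len(word))):
        raise ValueError("order must be a permutation of the byte indices")
    return bytes(word[i] for i in order)


# Illustrative case only: reverse the byte order of a 32-bit word
# while it is (conceptually) being moved to or from memory.
word = bytes([0x11, 0x22, 0x33, 0x44])
swapped = rearrange_bytes(word, [3, 2, 1, 0])
```

Performing such a permutation as part of the transfer, rather than as a separate pass over the data, is one plausible reading of how the rearrangement helps applications run faster.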

Brief Description of the Drawings

FIG. 1 shows an application 10 with reference to a digital product 12 that includes an embodiment of the present invention.

FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention.

FIG. 3 shows further details of the processor 20, in accordance with an embodiment of the present invention.

FIG. 4 shows a high-level block diagram of the blocks or functional units included in one of the W-type blocks, such as block 74 or 76, in accordance with an embodiment of the present invention.

FIG. 5 shows a block diagram of the circuit blocks included in block 402, in accordance with an embodiment of the present invention.

FIG. 6 shows in more detail the generalized structure used for register-file forwarding within the macro functional units, specifically blocks 402, 404, 406 and 408.

FIG. 7 shows, in high-level block diagram form, further details of block 408, in accordance with an embodiment of the present invention.

FIG. 8 shows, in block diagram form, further details of block 404, in accordance with an embodiment of the present invention.

FIGS. 9 and 10 show further details of block 404, specifically with respect to performing permutations.

FIG. 11 shows, in high-level block diagram form, further details of the components of block 406, in accordance with an embodiment of the present invention.

FIG. 12 shows a high-level block diagram of details of block 78, in accordance with an embodiment of the present invention.

FIG. 13 shows, in high-level block diagram form, further details of block 78, in accordance with an embodiment of the present invention.

FIG. 14 shows further details of block 1322, in accordance with an embodiment of the present invention.

FIG. 15 shows, in high-level block diagram form, further details of the circuitry included in block 1324, in accordance with an embodiment of the present invention.

FIG. 16 shows a block diagram of the reduction circuit block 1602 included within block 1520, in accordance with an embodiment of the present invention.

FIG. 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326, in accordance with an embodiment of the present invention.

FIG. 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330, in accordance with an embodiment of the present invention.

FIG. 19 shows, in high-level block diagram form, further details of the circuitry included in block 1332, in accordance with an embodiment of the present invention.

FIG. 20 shows, in high-level block diagram form, further details of the circuitry included in block 1334, in accordance with an embodiment of the present invention.

FIG. 21 shows an example of the programming flow and tools using the processor 22, in accordance with an embodiment of the present invention.

FIG. 22 shows an example of the scalability of an embodiment of the present invention.

FIG. 23 shows a graph presenting some of the benefits of the scalability of the present invention.

Detailed Description

Referring now to FIG. 1, an application 10 is shown with reference to a digital product 12 that includes an embodiment of the present invention. FIG. 1 is intended to give the reader a perspective on some, but not necessarily all, of the advantages of a product that includes an embodiment of the present invention relative to products available on the market.

Accordingly, the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously while using less power.

The product 12 is typically battery-operated and therefore consumes little power even while executing several of the applications executed by the devices 14-20. It is also capable of executing code to carry out operations conforming to many applications, including but not limited to H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.

FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention. FIG. 2 further shows the processor 22 coupled to an interface circuit 26 through a general-purpose bus 30, to an interface circuit 28 through a general-purpose bus 31, and to a general-purpose processor 32 through the buses 30 and 31. The circuit 20 is further shown to include a clock, reset and power management block 34, which generates a clock used by the remaining circuitry of the circuit 10 and a reset signal used in the same manner, and which includes circuitry for managing power. A Joint Test Action Group (JTAG) circuit 36 is also included in the circuit 20. JTAG is used as a standard for testing chips.

The interface circuit 26, shown coupled to the bus 30, and the interface circuit 28, shown coupled to the bus 31, include blocks 40-66, which are generally known to those of ordinary skill in the art and are used by current processors.

The processor 22, a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has an associated instruction memory: CoolW block 74 has instruction memory 82, CoolW block 76 has instruction memory 84, CoolN block 78 has instruction memory 86, and CoolN block 80 has instruction memory 88. Similarly, each of the blocks 74-80 has an associated control block: block 74 has control block 90, block 76 has control block 92, block 78 has control block 94, and block 80 has control block 96. Blocks 74 and 76 are designed to operate efficiently, in general, for 16-, 24-, 32- and 64-bit operations or applications, whereas blocks 78 and 80 are designed to operate efficiently, in general, for 1-, 4- or 8-bit operations or applications.
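The associations in this paragraph can be summarized as a small lookup table. The sketch below only records the reference numerals and bit widths stated in the text; the `SubProcessor` class and the `candidates_for` helper are hypothetical names introduced for illustration and are not part of the described hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubProcessor:
    block: int               # reference numeral of the sub-processor block
    kind: str                # "W" (wide, CoolW) or "N" (narrow, CoolN)
    instruction_memory: int  # reference numeral of its instruction memory
    control_block: int       # reference numeral of its control block
    widths: tuple            # operand widths, in bits, it is designed for


W_WIDTHS = (16, 24, 32, 64)
N_WIDTHS = (1, 4, 8)

SUB_PROCESSORS = (
    SubProcessor(74, "W", 82, 90, W_WIDTHS),
    SubProcessor(76, "W", 84, 92, W_WIDTHS),
    SubProcessor(78, "N", 86, 94, N_WIDTHS),
    SubProcessor(80, "N", 88, 96, N_WIDTHS),
)


def candidates_for(width_bits: int) -> list:
    """Return the blocks designed to handle the given operand width."""
    return [sp.block for sp in SUB_PROCESSORS if width_bits in sp.widths]
```

For example, `candidates_for(32)` selects the two CoolW blocks and `candidates_for(8)` the two CoolN blocks; a width the text does not list, such as 2, matches no block in this simplified table.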

The blocks 74-80 are essentially sub-processors; the CoolW blocks 74 and 76 are wide (W) type blocks, whereas the CoolN blocks 78 and 80 are narrow (N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, which gives the processor 22 its heterogeneous character. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80, resulting in the lowest-latency path to the sub-processor to which it is coupled. In FIG. 2, the circuit 24 is shown directly coupled to block 76, although it may instead be coupled to any of blocks 74, 78 or 80. Higher-priority agents or tasks may be assigned to the block that is directly coupled to the circuit 24.

It should be noted that while four blocks 74-80 are shown, other numbers of blocks may be used; additional blocks, however, clearly require additional die space and raise manufacturing cost.
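As a rough illustration of assigning higher-priority tasks to the block directly coupled to the DMA circuit 24, consider the sketch below. The text gives no scheduling algorithm, so the `assign_tasks` function, the numeric priorities and the task names are assumptions; the sketch also ignores whether a task suits a W-type or N-type block.

```python
DMA_COUPLED_BLOCK = 76          # as drawn in FIG. 2; could be 74, 78 or 80
BLOCKS = (74, 76, 78, 80)


def assign_tasks(tasks):
    """Map blocks to task names, most urgent task first.

    `tasks` is a list of (name, priority) pairs; a larger priority value
    means a more urgent task (an assumed convention). The most urgent task
    goes to the DMA-coupled block, which has the lowest-latency path.
    """
    ordered = sorted(tasks, key=lambda t: t[1], reverse=True)
    targets = [DMA_COUPLED_BLOCK] + [b for b in BLOCKS if b != DMA_COUPLED_BLOCK]
    return {block: name for block, (name, _) in zip(targets, ordered)}


placement = assign_tasks([("audio_decode", 1), ("video_decode", 5)])
```

Here the hypothetical `video_decode` task, having the larger priority value, lands on block 76, while `audio_decode` is placed on the next available block.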

Complicated applications requiring great processing power are not scattered across the circuit 20; rather, they are grouped or confined to a particular sub-processor or block for processing. This substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths and thereby reducing wire capacitance. In addition, utilization is increased and activity is reduced, which contributes to lower power consumption.

The circuit 20 is an example of a system on chip (SoC) offering quasi-adiabatic programmable sub-processors for multimedia and communications applications, and includes two types of sub-processors, as previously indicated: W-type and N-type. W-type, or wide, processors are designed for high power, price and performance efficiency in applications requiring 16-, 24-, 32- and 64-bit processing. N-type, or narrow, processors are designed for high efficiency in applications requiring 8-, 4- and 1-bit processing. While these bit widths are used in the figures and description of the embodiments of the present invention, other bit widths may readily be employed.

Different applications require different performance or processing capability and are therefore executed by different types of blocks or sub-processors. Take, for example, applications typically executed by DSPs: they are generally processed by W-type sub-processors, such as block 74 or 76 of FIG. 2, because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to: fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filters, IIR filters, root raised cosine (RRC) filters, color space converters, 3D bilinear texture mapping, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filters, alpha blending, higher-order surface tessellation, vertex shading (transform/lighting), triangle setup, full-screen anti-aliasing and quantization.
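To make the notion of a wide-operand DSP kernel concrete, here is a minimal sketch of a direct-form FIR filter, one of the kernels listed above. It is a textbook formulation, not the implementation used by the W-type sub-processors, and the function name `fir_filter` is chosen only for this example.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0  # with 16-bit samples and coefficients, each product needs
                 # up to 32 bits, which is why wide datapaths suit this kernel
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


# Two-tap moving sum over a short ramp.
filtered = fir_filter([1, 2, 3, 4], [1, 1])
```

Python integers are unbounded, so the sketch does not model overflow or saturation; a fixed-width implementation would have to choose an accumulator width explicitly.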

Other commonly occurring DSP kernels may be executed by N-type sub-processors, such as blocks 78 and 80. They include, but are not limited to: variable-length codecs, Viterbi codecs, Turbo codecs, cyclic redundancy check, Walsh code generators, interleavers/de-interleavers, LFSRs, scramblers, de-spreaders, convolutional encoders, Reed-Solomon codecs, scrambling code generators, and puncturing/de-puncturing.
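By contrast, the narrow kernels operate on individual bits or bytes. The sketch below shows a bitwise cyclic redundancy check as one such kernel. The polynomial 0x07 (a common CRC-8 choice), the zero initial value and the function name `crc8` are assumptions for illustration; the text does not specify which CRC the N-type sub-processors compute.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bit-serial CRC-8, most significant bit first, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Top bit set: shift out and fold in the polynomial.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


message = b"123456789"
checksum = crc8(message)
# Appending the checksum and recomputing yields zero for this CRC variant.
residue = crc8(message + bytes([checksum]))
```

Every step here is a 1-bit test, an 8-bit shift or an 8-bit exclusive-or, which is the kind of work that gains nothing from a 32- or 64-bit datapath.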

In comparison with existing architectural approaches such as RISC, reconfigurable, superscalar, VLIW and multi-processor approaches, both W-type and N-type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization. The sub-processor architecture of the processor 22 reduces die size, resulting in an optimal processing solution, and includes a novel architecture referred to as the "Quasi-Adiabatic" or "COOL" architecture. Programmable processors in accordance with this architecture are referred to as quasi-adiabatic programmable or COOL processors.

Quasi-adiabatic programmable or COOL processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The manner in which this is accomplished will become clear from the discussion and presentation of the figures below, which relate to the different units, blocks or circuits of the processor 22 and their interactions.

"Quasi-adiabatic programmable" or concurrent applications of heterogeneous interconnect and functional units (COOL) processors. In thermodynamics, adiabatic processes do not waste heat; they convert all of the energy used into useful work. Because of the non-adiabatic nature of existing standard processes, circuit design and logic cell library design techniques, one can never make a truly adiabatic processor. Among the possible processor architectures, however, some may be closer to adiabatic than others. The various embodiments of the present invention show a processor architecture that is significantly closer to adiabatic than prior-art architectures while still being programmable. Such processors are referred to as "quasi-adiabatic programmable processors".

The integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently, and the number of such applications far exceeds that supported by current processors. Examples of applications that can be executed simultaneously or concurrently by the integrated circuit 20 include, but are not limited to, downloading an application from a wireless device while decoding a received movie; thus, a movie can be downloaded and decoded at the same time. Because simultaneous application execution is achieved on the integrated circuit 20, which has a small die size or silicon real estate relative to the number of applications it supports, the cost of manufacturing the integrated circuit is far lower than that of the multiple devices of FIG. 1. Additionally, the processor 22 offers the user a single programmable framework for implementing multiple functions, such as complex multimedia applications. Of important value is the ability of the integrated circuit 20, namely the processor 22, to support future standards adopted by the industry, which are expected to be more complex than today's standards.

Each of blocks 74-80 can execute only one sequence (or stream) of a program at a given time. A sequence of a program refers to a function associated with a particular application. For example, an FFT is one type of sequence. However, different sequences may depend on one another. For example, once an FFT routine completes, its result can be stored in memory 70, and the next sequence can then use the stored result. Different sequences that share information or depend on one another in this way are referred to as a "stream flow".

In FIG. 2, memories 70 and 72 each include eight blocks of 16-kilobyte memory; in other embodiments, however, memories of different sizes may be used.

Instruction memories 82, 84, 86 and 88 are used to store the instructions executed by blocks 74-80, respectively.

FIG. 3 shows further details of processor 20 in accordance with an embodiment of the present invention. In FIG. 3, processor 20 is shown to include sub-processors 74-80, each of which includes one of instruction caches 302-308 for storing the instructions processed by the respective sub-processor. Processor 20 is further shown to include an arbitration block 310, a data memory 312, a general-purpose input/output (GPIO) block 314, a shared SoC bus block 316, a radio-frequency (RF) interface with DMA block 318, a DMA controller block 320, and a memory controller block 322, coupled as shown in FIG. 3. Data memory 312 serves as storage for data; it is used by the sub-processors and the other blocks under the direction of arbitration block 310, which directs the operation and data traffic of the various components/blocks shown in FIG. 3. Block 314 regulates input and output traffic to and from processor 22; block 320 controls, via bus 316, the DMA operations performed by processor 22; block 322 controls, via bus 316, operations relating to memory 312; and block 318 includes circuitry that handles DMA operations and can receive and transmit RF signals coupled through signal 324.

Optionally, shared registers 326 and 328 enable direct communication between the two types of sub-processors. For example, in FIG. 3, register 326 is shown coupled to blocks 74 and 78 to provide storage for information shared by those blocks; this makes it easier to execute an application on more than one sub-processor and so speeds up its execution. Likewise, register 328 is shown coupled to blocks 80 and 76 and serves the same purpose as register 326.

FIG. 4 shows a high-level block diagram of the blocks or components included in one of the W-type blocks (such as block 74 or 76) in accordance with an embodiment of the present invention. Block 74 is used as the example in FIG. 4. In FIG. 4 and throughout this document, functional units or macro-blocks are presented that have a very specific interconnect structure among components such as adders, multipliers, registers and multiplexers. These macro-blocks are referred to as "macro functional units" or "MFUs". An MFU represents an efficiently programmable subset of one or more commonly occurring operations in a limited set of multimedia and communications applications. The high efficiency of a macro functional unit results from replacing a critical set of atomic operations found in the target applications with a derived set of operations that exhibits excellent performance and power characteristics. In some cases, commonly occurring operations are combined in unique ways so that hardware is reused efficiently.

In FIG. 4, block 74 is shown to include a load/store MFU block 402, a scalar arithmetic logic unit (ALU) and multiply-accumulate (ACC) MFU block 406, a vector X MFU block 404, a vector ALU and multiply ACC MFU block 408, and a local memory 410, coupled together as shown in FIG. 4. Block 402 generates a memory address and couples it onto memory address bus 412. Memory data is coupled onto memory data bus 414 and is bidirectionally coupled to blocks 404 and 406. The vector store mask is generated by block 404 and coupled onto vector store mask bus 416. Further details of each block are presented and discussed with respect to subsequent figures. Before that presentation and discussion, some general characteristics of block 74 and its blocks are discussed below.

Blocks 406 and 408 perform most of the actual computation on data. Load/store MFU block 402 computes the addresses used for accesses to and from memory 312 and memory 410. Vector X MFU block 404 rearranges vector data in transit between memory 312 and block 408. Vector X MFU block 404 is also used to generate vector store masks for vector stores to memory 312. Block 406 operates on only one item of data at a given time, whereas blocks 404 and 408 operate on data in vector form. Block 402 provides the addresses for memory accesses. Block 402 performs some computation, but this is in the nature of overhead computation.

A machine instruction encodes separate operations (as needed) for the various MFU blocks, in addition to operations that move data between MFU blocks. All operations in a single instruction are executed in parallel. Under control of the separately encoded operations in an instruction, vector X MFU block 404 rearranges vector data and generates vector store masks. Local memory 410 is used to store information locally, so that information outside block 74 need not be accessed for every instruction. Bus 412 is coupled to memory 312 and supplies the memory address to it.

Block 402 is shown coupled to block 404 through bus 424, to block 406 through bus 426, and to block 410 through bus 428. Blocks 404, 408 and 410 are shown coupled to one another through vector bus 420, and blocks 406, 404, 408 and 410 are shown coupled to one another through scalar bus 422. A bus is generally a set of wires, each carrying a signal; the wires run in parallel and can therefore carry signals in parallel. The number of wires in a bus defines the number of binary bits that characterizes the bus. In FIG. 4, vector bus 420 is wider than scalar bus 422; that is, bus 420 includes more bits or wires than bus 422 and can carry more signals in parallel. An example ratio of the width of bus 420 to that of bus 422 is four, as in the case where bus 422 is 32 bits and bus 420 is 4 times 32 bits, or 128 bits.

Block 404 also provides the vector store mask, which is coupled onto bus 416.

Memory data is coupled from block 402 to block 406 for computational operations, but vector data is first provided to block 404. Importantly, block 404 provides the ability to organize the data in memory to match the data needed in the compute unit (that is, in block 408), which greatly improves performance.

FIG. 5 shows a block diagram of the circuit blocks included in block 402 in accordance with an embodiment of the present invention. Block 402 is shown to include an address block 502, a circular buffer register block 504, an address generator block 508, an address generator block 506, a multiplexer (mux) 510 and a multiplexer 512, coupled together as shown in FIG. 5.

Block 502 is coupled to other blocks of block 402, as shown in FIG. 4, and stores addresses. Block 504 stores circular buffer ranges in one of its circular buffer registers. When requested by the program, blocks 506 and 508 cause the address computation to wrap around within the circular buffer range. The arrows pointing into block 504 allow these registers to be loaded. Block 506 serves to modify addresses generated by block 504, addresses received from block 406, or even addresses generated by block 502, while block 508 serves to modify addresses received from block 502 and/or block 406, and even block 504.

The address registers of block 402 and the circular buffer registers of block 404 provide inputs to the address generators of blocks 506 and 508. For the address registers of block 402, these inputs are previously stored addresses; for the circular buffer registers of block 404, they are information about the circular buffers.

Blocks 506 and 508 serve to modify addresses. That is, block 506 serves to modify addresses generated by block 504, addresses received from block 406, or even addresses generated by block 502, while block 508 serves to modify addresses received from block 502 and/or block 406, and even block 504. The output of block 506 is then provided as an input to multiplexer 512, which also receives as an input the address generated by block 502. Multiplexer 512 then selects one of its inputs and couples it onto bus 520 for receipt by the other blocks of block 74, as shown in FIG. 4. Likewise, the output of block 508 is provided as an input to multiplexer 510, which also receives as an input the address generated by block 502. Multiplexer 510 then selects one of its inputs and couples it onto bus 522 for receipt by the memory of block 74, as shown in FIG. 4.

Thus, the load/store MFU can generate two addresses in parallel. An address is computed by combining an address register with a constant or with a value from the scalar ALU MFU. The computed address may optionally wrap around within the bounds of a circular buffer. Computed addresses are intended mainly for accessing memory, but they can also be assigned to an address register or a circular buffer register, or used as an input to other MFUs.
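The address computation described above can be illustrated with a minimal software sketch. It is not the circuit of FIG. 5: the function name `generate_address`, its parameters, and the representation of a circular buffer as a start address plus a length are assumptions made only for illustration.

```python
def generate_address(address_register, offset, buffer_start=None, buffer_length=None):
    """Combine an address register with an offset (a constant or a scalar value).

    If a circular buffer is given (hypothetical start/length encoding), the
    result wraps around so that it stays inside
    [buffer_start, buffer_start + buffer_length).
    """
    address = address_register + offset
    if buffer_start is not None:
        # Python's % always returns a non-negative result, so negative
        # offsets wrap to the end of the buffer.
        address = buffer_start + (address - buffer_start) % buffer_length
    return address


# Two addresses produced for the same instruction, as the text allows.
linear = generate_address(0x100, 8)
wrapped = generate_address(0x1F8, 0x10, buffer_start=0x100, buffer_length=0x100)
print(hex(linear), hex(wrapped))
```

In this sketch the second address would run past the end of the buffer, so it wraps back toward the start rather than leaving the buffer.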

FIG. 6 shows in more detail the general components used for the register files and for forwarding inside the macro functional units (specifically, in blocks 402, 404, 406 and 408). In FIG. 6, in accordance with an embodiment of the present invention, a plurality of registers 602, a plurality of multiplexers 604, a crossbar 606, a register block 608, a plurality of staging registers 610, a plurality of functional units 612 and a plurality of multiplexers 614 are shown. Registers 602 are shown coupled to multiplexers 604, which in turn are shown coupled to crossbar 606. Crossbar 606 is shown coupled to registers 610, which in turn are shown coupled to functional units 612, and functional units 612 are shown coupled to multiplexers 614. In general, a multiplexer selects among the inputs provided to it and outputs the selected input. The output of crossbar 606 is also provided to other blocks of FIG. 4. Although a particular number of units, multiplexers and/or registers is shown in FIG. 6, other numbers of these components may be used.

The components of FIG. 6 are coupled together as shown in the figure. Multiplexers 604 are shown receiving additional inputs from other blocks of FIG. 4, of which there are at least two, as well as the outputs of multiplexers 614.

The registers and feedback paths (couplings) of FIG. 6 provide a unique organization that optimizes the trade-off among area, energy and performance. This organization has three main features:

· A register file that is visible to the assembly language and has more than a few registers is divided into two subsets: a few registers are implemented with full accessibility, while the remaining registers are implemented with more limited accessibility. In most cases, only the first four registers (numbered 0 to 3) support full accessibility. For machine operations involving such registers, any and all of the fully accessible registers can be selected simultaneously as sources and destinations of operations. In contrast, the limited-access registers share only a small number of read and write ports among them: at most two read ports and one write port. This arrangement gives most of the benefit of a register file with a large number of read and write ports, without requiring more than one or two read/write ports for most of the registers in the file.

· At the input of each functional unit is a "staging register". Before a functional unit is used in a clock cycle, its input staging registers must be set to the appropriate input values at the end of the preceding clock cycle. Functional units that cannot be used at the same time may be grouped together to share the same staging registers, reducing the total number of registers. If none of the functional units sharing a staging register is needed in a clock cycle, the register holds its previous value, thereby eliminating switching power consumption in those functional units during that cycle.

· Forwarding between functional units is implemented in two stages. First, the next values of the fully accessible registers are selected by multiplexers, along with the value or values (if any) to be written to the limited-access registers. In the second stage, the next values of the fully accessible registers, together with the values from the read ports of the limited-access registers, are fed to the crossbar, which selects the values to be written into the staging registers at the end of the clock cycle (and thus used by the functional units in the next clock cycle). This organization minimizes the number of inputs to the crossbar, which strongly affects its size, at the possible cost of the added delay of passing through two multiplexing stages instead of one.

Forwarding may or may not be implemented between the write and read ports of the limited-access registers. If forwarding is not done here, there is one extra cycle of latency between an operation that writes one of these registers and a subsequent operation that reads it.
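The port limits of the split register file described above can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `fits_port_budget` is hypothetical, registers 0 to 3 are taken as the fully accessible ones (the "most cases" configuration in the text), and repeated reads of the same limited-access register are assumed to share one read port, which the text does not specify.

```python
FULL_ACCESS = range(0, 4)    # registers 0-3 support full accessibility
MAX_LIMITED_READS = 2        # read ports shared by the limited-access registers
MAX_LIMITED_WRITES = 1       # write port shared by the limited-access registers


def fits_port_budget(sources, destinations):
    """Return True if one instruction's register usage fits the shared ports.

    Fully accessible registers never count against the budget. Distinct
    limited-access registers are counted once each (an assumption).
    """
    limited_reads = {r for r in sources if r not in FULL_ACCESS}
    limited_writes = {r for r in destinations if r not in FULL_ACCESS}
    return (len(limited_reads) <= MAX_LIMITED_READS
            and len(limited_writes) <= MAX_LIMITED_WRITES)


# Four fully accessible sources plus two limited-access sources still fit.
print(fits_port_budget(sources=[0, 1, 2, 3, 5, 6], destinations=[0, 7]))
# Three distinct limited-access sources exceed the two shared read ports.
print(fits_port_budget(sources=[4, 5, 6], destinations=[0]))
```

A compiler or assembler for such a machine would need a check of this kind when packing several MFU operations into one instruction.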

FIG. 7 shows further details of block 408, in high-level block diagram form, in accordance with an embodiment of the present invention. In FIG. 7, vector register block 702 is shown coupled to an N ALUs block 704, a vector element shifter block 706, a vector element selector block 708, a 2N and N bit converter block 710, an N ALUs block 712, and a 2N multipliers block 714. Block 408 is further shown to include a vector register block 716, which is coupled to an N adders block 718, an N shifters block 720, a vector summation block 722, an N 3-input adders block 724, a 2N and N bit converter 726, multiplexer 723 and multiplexer 732. The blocks and multiplexers of FIG. 7 are coupled together as shown in FIG. 7. Block 702 is coupled to other blocks of FIG. 4 and also to blocks 704-714. Block 716 is shown receiving inputs from block 406 and from the outputs of multiplexer 732, block 710, block 714 and block 724. Block 702 is shown coupled to multiplexer 704, which is also coupled to blocks 712 and 726. In general, the circuits or blocks of FIG. 7 operate in parallel on vector-type values, for example on N M-bit values, where M is an integer number of bits.

Multiplexer 732 receives as inputs the outputs generated by blocks 718 and 720; multiplexer 730 receives inputs generated by blocks 704 and 706 and generates an output received by block 702. The outputs of blocks 708 and 722 are provided to block 406. As used here, N is an integer value; for example, "N ALUs" means N ALU circuits.

Blocks 702-714 and multiplexer 730 generally perform the multiply-accumulate (MAC) function, while blocks 716-726 and multiplexer 732 perform the ALU function; however, the number of bits on which these MAC and ALU functions operate in parallel is generally N times the number of bits processed by block 406. Blocks 704 and 712 are segmentable; that is, they can selectively segment their addition operations. For example, when N 32-bit values are processed in parallel, each ALU block can perform 2N 16-bit additions or 4N 8-bit additions, in addition to N 32-bit additions. Block 714 operates in the same manner as block 1110 of FIG. 11, which is described shortly. Blocks 710 and 726 convert N 32-bit values into N 40-bit values, or 2N 16-bit values into 2N 40-bit values. In one example a 32-bit value is converted into a 40-bit value, and in another a 16-bit value is converted into a 40-bit value, thus providing a bit-conversion capability.
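Segmented addition as described for blocks 704 and 712 can be sketched for a single 32-bit word (the N = 1 case). The function name `segmented_add` is hypothetical, and the sketch assumes that each lane wraps around on overflow; the text does not say whether the hardware wraps or saturates.

```python
def segmented_add(a, b, lane_bits, word_bits=32):
    """Add two words lane by lane, blocking carries at lane boundaries.

    lane_bits = 32 gives one 32-bit addition, 16 gives two 16-bit additions,
    and 8 gives four 8-bit additions on the same 32-bit operands.
    Each lane wraps modulo 2**lane_bits (an assumption).
    """
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift   # drop the carry out of the lane
    return result


a, b = 0x0000FFFF, 0x00000001
for lane_bits in (32, 16, 8):
    print(lane_bits, hex(segmented_add(a, b, lane_bits)))
```

The same operands give different results for each segmentation because the carry out of a lane is discarded instead of propagating into the next lane.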

Block 706 shifts a vector value, that is, N M-bit values, right or left by an integer amount. An example of a vector shift would be to take a vector such as

<a0,a1,a2,a3,a4,a5,a6,a7>

which in this example has eight values, and return the vector

<a1,a2,a3,a4,a5,a6,a7,0>

or perhaps

<0,0,0,a0,a1,a2,a3,a4>

This operation is not generally interpreted as any kind of multiplication or division. Block 708 allows a single element of a vector value to be selected; for example, a particular byte (8 bits) can be selected from the vector value.
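The two example shifts above can be reproduced with a short sketch. The function name `shift_elements` and the sign convention (a positive count moves elements toward index 0, a negative count moves them toward higher indices) are assumptions chosen for illustration; the text only says that the shift can go right or left.

```python
def shift_elements(vector, count):
    """Shift whole elements of a vector, filling vacated positions with 0.

    count > 0 moves elements toward index 0; count < 0 moves them toward
    higher indices. Elements shifted out of the vector are discarded.
    """
    n = len(vector)
    k = min(abs(count), n)
    if count >= 0:
        return vector[k:] + [0] * k
    return [0] * k + vector[:n - k]


v = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
print(shift_elements(v, 1))    # first example in the text
print(shift_elements(v, -3))   # second example in the text
```

Because whole elements move and nothing is carried between them, the operation rearranges data rather than scaling it, which is why it is not read as a multiplication or division.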

Block 720 operates in the same manner as block 706, and block 726 in the same manner as block 710. The outputs of blocks 712 and 726 are selectively provided to block 702 through multiplexer 704, and the outputs of blocks 706 and 704 are selectively provided to block 702 through multiplexer 730. In addition, the outputs of blocks 720 and 718 are selectively provided to block 716 through multiplexer 732.

Block 722 performs vector-based addition, whereas the other blocks of block 408 operate element-wise. That is, block 722 adds together all the elements of a single vector, while the element-wise blocks operate on one or more selected, corresponding elements of different vectors.
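The difference between the vector summation of block 722 and an element-wise operation can be shown with a minimal sketch. The function names are hypothetical, and plain Python integers are used, so the fixed lane widths of the hardware are not modeled.

```python
def elementwise_add(x, y):
    """Element-wise operation: combine corresponding elements of two vectors."""
    return [a + b for a, b in zip(x, y)]


def vector_sum(x):
    """Vector summation: add all elements of one vector into a single value."""
    return sum(x)


x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
print(elementwise_add(x, y))   # a vector with one result per element
print(vector_sum(x))           # a single scalar result
```

The first function returns a vector of the same length as its inputs, while the second reduces one vector to a scalar, which matches the text's note that the output of block 722 goes to the scalar block 406.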

Blocks 710 and 726 each selectively allow conversion from N or from 2N values. FIG. 8 also shows the output of block 804 fed back to the input of block 802.

FIG. 8 shows further details of block 404, in block diagram form, in accordance with an embodiment of the present invention. In FIG. 8, block 404 is shown to include a mask control register block 802, a mask generator block 804, a mask register block 806, a vector register block 808, and a vector byte + mask permute block 810, coupled together as shown in FIG. 8.

Block 802 is shown receiving inputs from other blocks of FIG. 4 and generating an input to block 804, which is shown coupled to block 806. Block 806 is shown coupled to block 801 and also to other blocks of FIG. 4 and to memory 312. Block 808 is shown coupled to memory 312 and to other blocks of FIG. 4. Block 810 is shown coupled to receive inputs from blocks 806 and 808.

In one example, block 404 has a register file of N*32-bit vector registers, block 808, where N is the same as for block 408. Block 806 of block 404 includes mask registers of size N*4 bits. Each bit of a mask register corresponds to one byte of a vector register. When an N*32-bit vector is stored to the external shared memory, an N*4-bit mask can be supplied to indicate which bytes of the vector are actually written to memory. (Memory bytes corresponding to zero bits in the mask remain unchanged.) The mask generator function computes a 4*N-bit mask from the settings of the mask control registers.
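The masked vector store described above can be sketched as follows. The function name `masked_store` is hypothetical, memory is modeled as a `bytearray`, and bit i of the mask is assumed to correspond to byte i of the vector; the text states the one-bit-per-byte correspondence but not the bit ordering.

```python
def masked_store(memory, address, vector_bytes, mask):
    """Write only the vector bytes whose mask bit is 1.

    memory is a bytearray, vector_bytes holds 4*N bytes, and mask is an
    integer with one bit per byte. Bytes whose mask bit is 0 leave the
    corresponding memory byte unchanged.
    """
    for i, value in enumerate(vector_bytes):
        if (mask >> i) & 1:
            memory[address + i] = value


memory = bytearray(b"\xAA" * 8)
masked_store(memory, 2, bytes([1, 2, 3, 4]), mask=0b0101)   # N = 1: 4 bytes, 4 mask bits
print(memory.hex())
```

Only bytes 0 and 2 of the vector are written here; the other memory bytes keep their previous contents, so a partial vector can be stored without a read-modify-write sequence in the program.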

Block 404 can permute the 8*N bytes of two vector registers to select 4*N bytes. In the general case, the particular permutation is controlled by the value of a third vector register. Certain "pre-coded" permutations need no control vector; these include all funnel left and right shifts of the two input vector registers. While the 8*N bytes of two vector registers are permuted, the 8*N bits of two mask registers can be permuted identically, preserving the same bit-for-byte correspondence between mask and vector values.
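The selection of 4*N bytes out of the 8*N bytes of two registers, and the identical treatment of the mask bits, can be sketched as below. This is a behavioral simplification only: the function names are hypothetical, the permutation is given as a list of source indices rather than in the hardware's control-vector format, and a funnel shift is modeled as a window sliding over the concatenation of the two inputs.

```python
def permute_select(a, b, indices):
    """Pick len(indices) items from the concatenation of a and b."""
    pool = list(a) + list(b)
    return [pool[i] for i in indices]


def funnel_shift_indices(width, shift):
    """Index list for a funnel shift: a window of `width` items starting at `shift`."""
    return [shift + i for i in range(width)]


# N = 1: two 4-byte vector registers and their 4-bit mask registers.
vec_a, vec_b = [0xA0, 0xA1, 0xA2, 0xA3], [0xB0, 0xB1, 0xB2, 0xB3]
mask_a, mask_b = [1, 0, 1, 1], [0, 0, 1, 0]

indices = funnel_shift_indices(width=4, shift=1)
print([hex(v) for v in permute_select(vec_a, vec_b, indices)])
print(permute_select(mask_a, mask_b, indices))   # mask bits follow their bytes
```

Applying the same index list to the bytes and to the mask bits keeps each mask bit attached to the byte it describes, which is the bit-for-byte correspondence the text refers to.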

The blocks of FIG. 8 operate on vector values. Block 810 allows vector values to be rearranged, as briefly described above. This is done using permutations, which are described further with reference to FIGS. 9 and 10. Block 810 provides information about which permutation is intended. Likewise, the permuted masks from blocks 804 and 806 indicate which permuted masks are to be provided. Generally, there is one mask bit for each byte to be stored.

Blocks 802, 804, 806, 808 and 810 of FIG. 8 provide the ability to rearrange addresses in memory to suit the particular application being executed. In the prior art, rearrangement is typically performed automatically; in embodiments of the present invention, however, the programmer can perform rearrangement programmably, as needed, through the program or code. This allows a nearly unlimited set of rearrangements according to the programmer's needs, which the prior art does not offer at all: there, the rearrangement capability is predetermined and comprises a predetermined set of possible rearrangements. Masks are thus generated according to the program being executed, which provides further flexibility in rearranging addresses in memory.

SIMD is an acronym for Single Instruction, Multiple Data, and MIMD is an acronym for Multiple Instruction, Multiple Data. These are standard terms in computer architecture and programming and are well known to those skilled in the art.

FIGS. 9 and 10 show further details of the permutation circuit of block <number>, where <number> is the number of the "vector byte + mask permute" box. Block 404 has a functional unit that performs a permutation of two vectors to produce a permuted result vector, as shown in FIGS. 9 and 10. The circuit used to perform the permutation can be described in general terms as taking two input vectors A and B, each of N elements, and producing an output vector Z, also of N elements, where an element is any arbitrary but uniform number of bits and N is required to be a power of two. Let K be the base-2 logarithm of N. The permutation circuit has K+1 stages, each with N switch boxes of a particular type, as shown. There are three types of switch box in all, referred to as "type A", "type B" and "type C". Type A switch boxes are used only in the first stage; type C switch boxes are used only in the last stage; all intermediate stages use only type B switch boxes. The connections supported by each type of switch box are shown separately. Between each pair of adjacent stages of switch boxes is a butterfly exchange, starting with an exchange of distance 1 and progressing to an exchange of distance N/2. The settings of the switch boxes are all independently determined by the "control vector", which is the third input to the permutation circuit.
Since the setting of each type A and type C switch box requires only a single bit to specify, and the setting of each type B switch box requires exactly two bits, the complete control vector requires 2*K*N bits. The control vector may be entirely implied by the permute instruction being executed, or it may be supplied in part or in whole by the program in some way.
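The 2*K*N figure follows from counting the control bits stage by stage: one stage of N type A boxes at 1 bit each, K-1 intermediate stages of N type B boxes at 2 bits each, and one stage of N type C boxes at 1 bit each. The sketch below performs that count; the function name `control_vector_bits` is hypothetical, and N is restricted to at least 2 so that the first and last stages are distinct.

```python
def control_vector_bits(n):
    """Count control bits for the (K+1)-stage permutation network, K = log2(n)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    k = n.bit_length() - 1            # K = log2(N)
    type_a_bits = n * 1               # first stage: N boxes, 1 bit each
    type_b_bits = (k - 1) * n * 2     # K-1 middle stages: N boxes, 2 bits each
    type_c_bits = n * 1               # last stage: N boxes, 1 bit each
    return type_a_bits + type_b_bits + type_c_bits


for n in (2, 4, 16):
    k = n.bit_length() - 1
    print(n, control_vector_bits(n), 2 * k * n)
```

The stage-by-stage total N + 2N(K-1) + N simplifies to 2KN, in agreement with the text.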

FIG. 11 shows further details of the components of block 406, in block diagram form, in accordance with an embodiment of the present invention. In FIG. 11, register block 1102 is shown coupled to ALU block 1104, bit converter block 1106, ALU block 1108, and multiplier block 1110. Block 406 is further shown to include register block 1112, shifter block 1114, adder block 1116, and bit converter block 1118. Multiplexers 1122, 1120 and 1124 are also shown in FIG. 11. The multiplexers and blocks of FIG. 11 are coupled together as shown in the figure.

Block 1102 is shown coupled to memory 312 and to other blocks of FIG. 4, and receives inputs from multiplexer 1122 and multiplexer 1120. Shifter block 1114 provides one input of multiplexer 1122, and block 1104 provides the other. Multiplexer 1120 receives its inputs from blocks 1118 and 1108. Block 1114 is also shown coupled to block 1102, and multiplexer 1124 is shown receiving inputs from blocks 1112 and 1102 and generating an output applied to block 1114.

Block 1112 is shown coupled to block 1116, which generates an output that is provided as an input to block 1112. Block 1118 is shown coupled to block 1112, and blocks 1106 and 1110 are shown coupled to block 1112.

Blocks 1102, 1104, 1106, 1108 and 1110, together with multiplexer 1122, allow ALU functions to be performed, while blocks 1112-1118 and multiplexer 1124 allow multiply-accumulate (MAC) functions to be performed.

Blocks 1104 and 1108 are ALUs and perform ALU functions, and their outputs are selectively provided through multiplexers 1122 and 1120 as inputs (or feedback) to block 1102. Two ALU operations can be performed in each clock cycle. Block 1110 performs a multiply function and produces an output that is provided to block 1112, which can process a greater number of bits in parallel than block 1102. For example, where block 1102 has a capacity of 32 bits, block 1112 has a capacity of 40 bits. Block 1112 serves as an accumulator register, that is, it cumulatively adds its inputs.

Block 1106 converts an N-bit value into an (N+X)-bit value, where X is an integer. For example, a 32-bit value may be converted into a 40-bit value. Block 1114 shifts a value by a predetermined number of bits and passes the result to block 1102 through multiplexer 1122.

Block 1118 converts from a higher number of bits to a lower number of bits, such as from 40 bits to 32 bits. This block is coupled to block 408. Block 406 can perform two ALU operations in parallel on values from block 1102. In place of the first ALU operation, an N-bit shift operation may be performed, or a conversion from an N-bit value to an X-bit value to be stored in block 1112. In place of the second ALU operation, a multiplication may be performed by block 1110, with the result stored in one of the registers of block 1112.
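The two bit converters can be pictured as a widening step (32 to 40 bits) on the way into the accumulator and a narrowing step (40 to 32 bits) on the way out. The sketch below is an illustration under stated assumptions: the description does not say how the conversion is done, so two's-complement sign extension for widening and saturation for narrowing are assumed here, and the function names are hypothetical.

```python
def sign_extend(value: int, from_bits: int = 32, to_bits: int = 40) -> int:
    """Widen a two's-complement bit pattern (assumed behaviour of a block such as 1106)."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):                       # sign bit set: fill the new high bits
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern of the given width as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def narrow_saturating(value: int, from_bits: int = 40, to_bits: int = 32) -> int:
    """Narrow with saturation (assumed behaviour of a block such as 1118)."""
    signed = to_signed(value, from_bits)
    hi = (1 << (to_bits - 1)) - 1
    lo = -(1 << (to_bits - 1))
    clamped = max(lo, min(hi, signed))                 # clamp to the narrower range
    return clamped & ((1 << to_bits) - 1)


if __name__ == "__main__":
    print(hex(sign_extend(0xFFFFFFFF)))                # -1 stays -1 after widening
    print(hex(narrow_saturating(0x0100000000)))        # 2**32 clamps to the 32-bit maximum
```

The extra 8 bits act as guard bits, so a run of accumulations can exceed the 32-bit range before the result is narrowed again.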

Block 406 can perform, in parallel, a 40-bit shift, a 40-bit add/subtract, and a conversion from a 40-bit value to a 32-bit value to be stored in one of the 32-bit registers of the scalar ALU MFU.

Further details of one of the N-type sub-processors, such as block 78, are now discussed with reference to the following figures. It should be noted that blocks 406 and 404 of Figure 4, described in relation to the W-type sub-processor, are common to the N-type sub-processors, such as block 78.

Figure 12 shows a high-level block diagram of details of block 78 in accordance with an embodiment of the present invention. In Figure 12, block 78 is shown to include data path unit (DPU) block 1202, path-to-memory block 1204, and controller, sequencer and data address generator (DAG) block 1206. Blocks 1204 and 1206 are common to, and are found in, the blocks of the W-type sub-processor. Block 1206 is generally the same in function as block 402.

Figure 13 shows, in high-level block diagram form, further details of block 78 in accordance with an embodiment of the present invention. In Figure 13, store unit block 1302 is shown coupled to X unit block 1304, which in turn is shown coupled to load unit block 1306. Block 1304 is generally the same in function as block 404, which has been discussed in more detail above.

Block 1306 is shown also coupled to macro function block 1340, which in turn is shown coupled to block 1302 through macro function bus 1310. Block 1302 is shown to include store buffer 1314, store buffer 1312 and bus interconnect block 1308. Block 1302 generates an output that is provided to memory (such as memory 312) and is coupled accordingly through block 1314. Block 1304 is shown receiving input from, or being coupled to, memory, such as memory 312. Block 1306 is shown to include load buffer 1320, load buffer 1318 and bus interconnect block 1316, which is coupled to block 1340.

Block 1340 is shown to include Galois field MAC block 1322, special ALU block 1324, combiner block 1326, memory 1328, puncture/de-puncture block 1330, interleaver block 1332 and Viterbi block 1334, each of which is shown coupled to bus 1310. Blocks 1322-1332 are each shown receiving input from, or being coupled to, block 1316. Block 1334 receives input from block 1332, and is coupled to receive and generate the data applied to it.

The data flow is such that data or information flows from and through block 1306 into block 1340, then to block 1302, and out to memory. This introduces a pipelining effect, in which multiple operations overlap and are processed concurrently in pipelined fashion. For example, information may be loaded by block 1306 while information is being stored to memory by block 1302. After being received from memory by block 1304, data is stored in blocks 1320 and 1318 of block 1306 and is subsequently provided to, and processed by, block 1340, details of which are discussed shortly with reference to subsequent figures.

After processing by block 1340 is complete, the processed data is provided to block 1302 through bus 1310 and is stored in blocks 1312 and 1314, where it remains until it is received by memory. The buffers of blocks 1314, 1312, 1318 and 1320 have a predetermined parallel width, or number of bits. In one example, each of these buffers is 256 bits wide; however, other numbers of bits may be employed.

Values or data that may have been processed by block 1340 can be moved from block 1302 to block 1306 for reuse. In addition, data may be received from memory by block 1304 and then moved to block 1306 for processing. Further details of each of the blocks of block 1340 are now presented. Blocks 1314 and 1312 produce a double-buffering effect, which helps reduce the "stalling" commonly experienced in pipelined operation, as do blocks 1318 and 1320. Stalling is caused by simultaneous accesses involving memory and blocks 1302 and 1306. In another embodiment, blocks 1314 and 1312 may be a single block, and blocks 1318 and 1320 may be a single block.

Latency may be associated with an operation, or there may be a pipelining effect. Latency may arise from each of the blocks of block 1340.

Figure 14 shows further details of block 1322 in accordance with an embodiment of the present invention. In Figure 14, Galois field block 1402 is shown coupled to XOR (exclusive-OR)/Clr circuit 1404, which in turn is shown coupled to accumulator register block 1406. Block 1402 is shown generating Galois field output signal 1408, which serves as an input to Galois field multiplexer 1410; multiplexer 1410 also receives another input, generated from the output of block 1406 and referred to as accumulator register block output signal 1412. Signals 1408 and 1412 serve as inputs to multiplexer 1410 for selectively generating Galois field MAC output signal 1416, which is coupled onto bus 1310 of Figure 13. Select signal 1414, which serves as a further input to multiplexer 1410, operates to select one of signals 1408 and 1412 for generating signal 1416. Thus, either the output of block 1402 or the result of the Galois field MAC operation is provided as the output of block 1322, where the output of block 1402 is in effect the result of a Galois field operation.

The output of block 1406 is shown coupled to circuit 1404 as another of its inputs. The output of block 1404 is provided to block 1406, and this coupling implements the MAC portion of the Galois field MAC operation. Block 1404 in effect performs the XOR multiplication operation typically used in Galois field MAC operations.

Block 1402 is shown to include register block 1420 and register block 1422, which are shown coupled to XOR tree block 1424. Block 1420 is also shown to include register block 1426, Galois field multiplication iteration 1 1428, register block 1430, Galois field multiplication iteration 1 1432, register block 1434 and register block 1436. Although not shown in the figure, an additional number of register blocks such as blocks 1434 and 1436 are also included, coupled in series between blocks 1434 and 1436.

Block 1424 is shown coupled to block 1426, which in turn is shown coupled to block 1428, which in turn is shown coupled to block 1430, which in turn is shown coupled to block 1432, which in turn is shown coupled to block 1434. Block 1434 is coupled to block 1436 or to one or more register blocks located between blocks 1434 and 1436.

In Figure 14, blocks 1420 and 1422 receive inputs from block 1306 and, in another embodiment, may be combined into a single block. Block 1402 generally performs Galois field processing that is well known to those skilled in the art, and the remaining blocks of Figure 14 carry out the MAC operation. Blocks 1426, 1430, 1434 and 1436 serve as different iterations of the Galois tree; it is known from experience that the number of iterations is 8 in the worst case, so eight register blocks are needed. The multiplication portion of the MAC operation is generally carried out by the XOR operation performed by circuit 1404, and block 1406 serves the accumulator function. Circuit 1404 receives its input from the last iteration of the Galois field operation performed by block 1402 (block 1436 in the case of Figure 14).

In operation, block 1322 operates on an N-bit value or data, such as an 8-bit value, and generates an N-bit value or data from it, the generation being performed by shifting the original value eight ways based on another N-bit value. The N-bit values are then XORed by block 1404 until the result has been reduced to N bits using a reduction constant, and the result is optionally added to the contents of an N-bit accumulator register, such as the value in block 1406. A "clear" operation may also be performed by block 1406. Examples of applications that employ Galois field MAC operations, and thus block 1322, include, but are not limited to: cyclic redundancy code (CRC) operations, convolutional encoder operations, scrambling code generator operations, and the like.
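The shift, XOR and reduce sequence described above corresponds to the usual shift-and-add multiplication in GF(2^N) followed by an XOR accumulation. The sketch below is a software illustration only, not the circuit of Figure 14. The reduction constant is not given in the description, so the polynomial 0x11B (the one used by AES) is assumed here, and the names `gf_mul` and `gf_mac` are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B, n: int = 8) -> int:
    """Multiply a and b in GF(2^n) by XORing shifted copies of a selected by the bits of b.

    poly is the assumed reduction constant (it includes the x^n term).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    result = 0
    for _ in range(n):            # at most n iterations (8 for 8-bit values)
        if b & 1:
            result ^= a           # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1
        if a >> n:                # the shifted copy overflowed n bits: reduce it
            a ^= poly
    return result


def gf_mac(acc: int, a: int, b: int, poly: int = 0x11B, n: int = 8) -> int:
    """Galois field multiply-accumulate: the accumulate step is also an XOR."""
    return acc ^ gf_mul(a, b, poly, n)


if __name__ == "__main__":
    print(hex(gf_mul(0x57, 0x83)))        # worked example from the AES specification
    print(hex(gf_mac(0x00, 0x57, 0x83)))
```

Because addition in the field is XOR, accumulating the same product twice clears the accumulator, which is one reason a combined XOR/clear circuit is a natural fit for the accumulate path.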

Figure 15 shows, in high-level block diagram form, further details of the circuitry included in block 1324 in accordance with an embodiment of the present invention. In Figure 15, multiplexers 1504 and 1502 are shown coupled to A register block 1508 and B register block 1506, respectively. Block 1508 stores a value referred to as A and block 1506 stores a value referred to as B; the A and B values are the data on which block 1324 is to operate. The A and B values are each N bits wide.

Blocks 1508 and 1506 are shown generating inputs to condition register block 1512, and are also shown coupled to generate inputs to add/subtract/absolute value/difference/conditional add-subtract/multiply (AGU) block 1510, which in turn generates an input to output register block 1514. Block 1514 is shown coupled to multiplexer 1516, which in turn is shown coupled to adder 1518. Adder 1518 is shown coupled to accumulator register block 1520, an output of which is shown serving as another input to adder 1518. Another output of block 1520 is shown serving as an input to multiplexer 1522, which receives the output of block 1514 as its other input. Multiplexer 1522 generates output 1530, which is coupled to bus 1310. Some of the inputs to multiplexers 1504 and 1502 are received from block 1316.

Multiplexers 1504 and 1502 are each shown receiving four inputs. One input of multiplexer 1504, dp, is received from block 1306, as is the dp input of multiplexer 1502. Another input of multiplexer 1504 is a run of the lowest-order bits of the output of block 1514, as is one of the inputs of multiplexer 1502. Another input of multiplexer 1504 is the highest-order bits of the same output of block 1514. A further input of multiplexer 1504 is the value "0". One of the inputs of multiplexer 1502 is the value "1", and another is the value "-1". The values "0", "1" and "-1" are provided to speed up the operations performed by block 1324: experience has shown that these values are used repeatedly in various operations, so having them present there improves system performance. It should be noted that multiple blocks 1510 may be used to improve performance. Block 1324 is organized as shown in Figure 15 to allow many operations to be performed, so that many operations are performed in a single clock cycle.

In operation, blocks 1510 and 1512 each operate on the A and B values provided by blocks 1508 and 1506, respectively. Two other inputs to multiplexer 1516 are generated by a reduction operation block within block 1520 (not shown in Figure 15), which is discussed shortly. For now, these two inputs are referred to as "neighbor-acc-reg" and "reduction-acc-reg", each of which is 2N wide.

Block 1512 is a 2N-wide register that allows block 1510 to perform conditional add or conditional subtract operations for use in despreading. Block 1512 in effect modifies the A and B values for use by block 1510.

Multiplexer 1522 in effect allows the output of block 1510, after being stored by block 1514, to be selectively provided to block 1302 as signal 1530, as determined by a select signal provided as another input to multiplexer 1522. Otherwise, the result of block 1510 undergoes an accumulate-add operation through blocks 1518 and 1520, and the final result is stored in block 1520 before being provided to block 1302.

Block 1324 is an N-layer ALU that includes one or more ALUs supporting the following operations:

- N-fold add/subtract, in which two N-bit values are operated on to generate their sum or difference.

- N-bit XOR (exclusive-OR) of two input values.

- Maximum/minimum of two N-bit input values.

- Maximum* of two N-bit input values, whose result is computed as max(a, b) + constant, the constant coming from memory or from a small preloaded lookup table.

- Conditional add-subtract: this function, which generally arises from the use of block 1512, conditionally adds or subtracts a stream of N-bit values depending on an input code. The input code is preloaded into a control register. A '1' in the input code results in a subtraction and a '0' results in an addition. The output is available in a 16-bit accumulator register. A "gather" operation from other special ALUs that support this operation is also supported.

- SAD, using the same accumulator as the conditional add-subtract operation.

- N×N multiplication.

Block 1510 is common to the W-type sub-processor, where each block 1510 can read at least 128 bits, so that when there is no contention in memory, two blocks can read at least 256 bits per clock cycle.
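Two of the listed operations, conditional add-subtract and maximum*, can be sketched in software as follows. This is an illustration of the arithmetic only, under stated assumptions: the function names are hypothetical, and the wrap-around of the 16-bit accumulator on overflow is assumed, since the description does not say how overflow is handled.

```python
def conditional_add_subtract(samples, code_bits, acc_bits=16):
    """Accumulate a stream of values, subtracting where the code bit is 1 and adding where it is 0."""
    if len(samples) != len(code_bits):
        raise ValueError("one code bit is needed per sample")
    acc = 0
    for sample, code in zip(samples, code_bits):
        acc += -sample if code else sample
    # Assumption: the accumulator wraps as a two's-complement register of acc_bits bits.
    acc &= (1 << acc_bits) - 1
    return acc - (1 << acc_bits) if acc >> (acc_bits - 1) else acc


def max_star(a, b, constant):
    """max* as described: max(a, b) plus a constant taken from memory or a lookup table."""
    return max(a, b) + constant


if __name__ == "__main__":
    print(conditional_add_subtract([3, 5, 2, 7], [0, 1, 0, 1]))   # 3 - 5 + 2 - 7
    print(max_star(4, 9, 1))
```

In despreading, the code bits would be the chips of the spreading sequence, so a single accumulator value summarizes the correlation of the input stream with that sequence.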

Figure 16 shows a block diagram of reduction circuit block 1602, which is included in block 1520, in accordance with an embodiment of the present invention. In Figure 16, M stages of accumulator register circuits are shown, the details of each of which are shown in acc-reg block 1610. For example, acc-reg circuit block 1602 includes four blocks 1610, coupled in the manner shown in Figure 16. Likewise, each of acc-reg circuit blocks 1604-1608 includes four stages of acc-reg circuits such as the one made up of block 1610. The output, or result, of each stage within each of blocks 1602-1608 is used as an input to the next stage, so that the results are added together to achieve accumulation. Blocks 1602-1608 are each shown to include four stages, or four blocks such as block 1610, but other numbers of blocks or stages may be employed.

The result of each of blocks 1602-1608 is made available to another block. For example, the result of block 1602 serves as an input to block 1604, the result or output of block 1604 serves as an input to the last acc-reg block within block 1608, and the result or output of block 1606 serves as an input to block 1608. Because the results of the blocks are forwarded at the same time as the stages within the blocks accumulate, only 7 cycles are needed to perform a reduction operation when four-stage acc-reg blocks are used.

Block 1610 includes a multiplexer coupled to an accumulator. The multiplexer is a 2:1 multiplexer that selects which of two inputs is provided to the accumulator. One of the two inputs of the multiplexer of block 1610 is provided by the output of block 1514, and the other is the result of the previous-stage acc-reg block. The reduction function of Figure 16 is thus flexible in how it manipulates data. Each input from the output of the immediately preceding stage is referred to as 'neighbor' signal 1616, which generates the neighbor-acc-seq input to multiplexer 1516. The outputs of certain stages generate the reduction-acc-seg input to multiplexer 1516 and are referred to as 'reduction' signal 1618. The output of the last acc-reg block of block 1608 generates output 1620, which is coupled to multiplexer 1530. The reduction circuit of Figure 16 results in a minimum number of clock cycles for performing the reduction operation, while also saving power.

Figure 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326 in accordance with an embodiment of the present invention. In Figure 17, block 1326 is shown to include shifters 1702-1712 for shifting the data input received from block 1306. In one embodiment, input 1700 is 128 bits; however, other numbers of bits may be used. The output of each of shifters 1702-1712 is shown coupled to register bank block 1714. Shifters 1702-1712 generate different combinations of the bits of input 1700.

Block 1714 includes a plurality of registers, including registers 1716 through 1746, which are used to create combinations of the outputs of shifters 1702-1712. For example, the lower 8 bits of the output of each of shifters 1702-1712 may be passed through a multiplexer to select which of those lower 8 bits is ultimately produced. Each register of block 1714 can therefore select arbitrarily among the "portions of interest" of the shifted bits. The portions of interest are determined by the outputs of shifters 1702-1712. The output of block 1714 is provided to bus 1310.

Thus, in one embodiment of the invention, block 1326 includes four 20-bit and two 24-bit input registers. It includes eight 16-bit registers in which random 32-, 16-, 8- and 4-bit combinations of bits from its input registers are created and stored. Block 1326 can be used in three modes: 1) using two particular 20-bit registers for output generation; 2) using four 20-bit registers for output generation; or 3) using all seven registers for output generation. Shifters 1702-1712 include input registers, which are not shown because the structure and function of shifters are well known to those skilled in the art.

To reduce the amount of hardware, or the number of blocks or circuits, needed to perform the combining function of block 1326, each bit of the 32-bit output register can be filled only from the 8 least significant bits of the two 20-bit registers in the first mode, from the 4 least significant bits of the four 20-bit registers in the second mode, and from the 2 least significant bits of the four 20-bit registers and the 4 least significant bits of the 24-bit registers in the third mode. Forming a random combination from the input registers is a two-step process, in which the first step involves shifting the bits "of interest" to the least significant positions, from which, in the given mode, random filling into the output register is allowed. In the example used here with reference to Figure 17, block 1326 can create 16 combined bits per cycle when the shift operations on the input registers are pipelined so that the bits of interest arrive at the least significant positions. Some combinations of output may take multiple clock cycles.

Memory 1328 is an ordinary random access memory and is therefore not discussed in further detail. Suffice it to say that the size of this memory depends on the application in which the N-type sub-processor is to be used.

Figure 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330 in accordance with an embodiment of the present invention. In Figure 18, single word register 1802 is shown to include 8 bit positions, each bit position 1804 being modifiable by bit select circuit 1806. Such modifications include, but are not limited to: inserting a '0'; inserting a '1'; negating the bit, which amounts to inverting it; or not modifying it at all, which amounts to a "NOP", or no operation. Single word register 1802 is replicated; that is, word registers 1810-1820 each store and modify a word just as register 1802 does. Thus, in the example of 16-bit words and 8 words, the modification of eight 16-bit words is performed in one clock cycle, unlike a conventional DSP, which needs multiple cycles to do the same work. The modification, or puncturing/de-puncturing, of each bit of these words is controlled by multiplexer 1824 and flip-flop 1826, which are coupled to each other and to register 1802 in the manner shown in Figure 18. Registers 1810-1822 are similarly coupled to other multiplexer and flip-flop circuits. Mode select bits, which are generated from the instruction code, select which of the four inputs of the multiplexer is selected. Two of the inputs 1828 to multiplexer 1824 also come from the instruction code, while the other two inputs of the multiplexer come from memory, one of which may be the inverted version of the other, as shown in Figure 18.

The input to the circuitry of block 1330 is generated by block 1332, which is discussed shortly; for now, it suffices that block 1332 generates fully interleaved, partially interleaved, or non-interleaved N-bit words that are applied to block 1330. In one example, operation is on 256-bit words, in which case block 1330 operates on 16 bits at a given time. A prefetched control word is used to decide which bits within the 16-bit word must be inverted. Optionally, in addition to inversion, '0' and '1' values are inserted at particular bit positions.
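The per-bit modification performed by the puncture/de-puncture block can be modelled as a four-way selection applied independently to every bit of a word. The sketch below is a software illustration under stated assumptions: "inserting" a '0' or '1' is taken to mean overwriting the bit in place rather than shifting the remaining bits, and the mode names and the function `modify_word` are hypothetical.

```python
# Hypothetical per-bit mode encodings; the actual encoding of the mode select bits is not given.
NOP, FORCE0, FORCE1, INVERT = range(4)


def modify_word(word: int, modes, width: int = 16) -> int:
    """Apply one mode to each bit of a word; modes[0] controls the least significant bit."""
    if len(modes) != width:
        raise ValueError("one mode is needed per bit position")
    out = 0
    for pos, mode in enumerate(modes):
        bit = (word >> pos) & 1
        if mode == FORCE0:
            bit = 0                  # insert a '0' (assumed to overwrite in place)
        elif mode == FORCE1:
            bit = 1                  # insert a '1' (assumed to overwrite in place)
        elif mode == INVERT:
            bit ^= 1                 # negate the bit
        elif mode != NOP:
            raise ValueError("unknown mode")
        out |= bit << pos
    return out


if __name__ == "__main__":
    # 4-bit example: invert bit 0, keep bit 1, force bit 2 to 1, force bit 3 to 0.
    print(bin(modify_word(0b1010, [INVERT, NOP, FORCE1, FORCE0], width=4)))
```

In hardware all bit positions of all eight words are handled at once; the loop here only stands in for that parallel selection.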

Figure 19 shows, in high-level block diagram form, further details of the circuitry included in block 1332 in accordance with an embodiment of the present invention. In Figure 19, memory array 1902 is shown receiving input 104 from an input device via bus 1316 and read enable input 1906 via bus 1316, and also receiving input from control row-column address generation block 1908, to generate output device signal 1910, which is provided to block 1302. In one example, block 1902 includes a memory array of 128×16 bits. Data may be written to or read from block 1902 by row or by column. That is, a row of the memory array of block 1902 may be read, or a column of the memory array of block 1902 may be read. In addition, data may be written by row and read by column, and vice versa.
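Writing by row and reading by column is the classic block interleaver. The following minimal sketch models the memory array as a list of rows; the function names are hypothetical, and the row and column counts are parameters rather than the 128×16 array of the example.

```python
def block_interleave(symbols, rows: int, cols: int):
    """Write the symbols into a rows x cols array row by row, then read them out column by column."""
    if len(symbols) != rows * cols:
        raise ValueError("the symbols must fill the array exactly")
    grid = [symbols[r * cols:(r + 1) * cols] for r in range(rows)]
    return [grid[r][c] for c in range(cols) for r in range(rows)]


def block_deinterleave(symbols, rows: int, cols: int):
    """Undo block_interleave by swapping the roles of rows and columns."""
    return block_interleave(symbols, cols, rows)


if __name__ == "__main__":
    data = list(range(6))
    mixed = block_interleave(data, rows=2, cols=3)
    print(mixed)
    print(block_deinterleave(mixed, rows=2, cols=3))
```

Writing by column and reading by row gives the inverse permutation, which is why supporting both access directions in one array covers interleaving and de-interleaving.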

Figure 20 shows, in high-level block diagram form, further details of the circuitry included in block 1334 in accordance with an embodiment of the present invention. In Figure 20, branch metric unit 2002 is shown receiving input from block 1332 and is shown coupled to an add/compare/select block, which is shown coupled to survivor memory block 2012, which in turn is shown coupled to multiplexer 2020, which generates output 2022 coupled to bus 1310. Multiplexer 2020 is also shown receiving another input from the output of accumulator 2018, which receives input from multiplexer 2016. Optionally, sum of absolute differences (SAD) block 2008 and despreader block 2010 are used to generate the inputs to multiplexer 2016. Where blocks 2008 and 2010 are absent, multiplexer 2016, block 2018 and multiplexer 2020 are not used. Local memory 2006 is shown coupled to block 2004. Block 2002 performs branch metric computations that are well known to those familiar with Viterbi encoding/decoding. Survivor paths, likewise well known to those familiar with Viterbi encoding/decoding, are stored in block 2012.

Block 1334 can perform turbo decoder, SAD, and despreading functions. In one example, 32 to 256 add-compare-select operations can be performed in parallel by block 2004 on 16-bit branch and path metrics generated by the local memory 2006. In one example, the size of the local memory 2006 is 1 kbit and 16 kbit.

Block 1334 may include multiple blocks 2004, each of which may include an 8-bit signed adder. In addition, each block may include a compare-and-select block that returns the winning path and a decision bit. The add-compare-select operation yields the winning path and the decision bit. The winning path can be shared with adjacent blocks 2004 through a "multicast" interconnect scheme used to reduce the trellis. The decision bits, together with the winning branch and path metrics, are stored for backtracking.
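The add-compare-select operation described above can be sketched in software as follows. This is a minimal illustration, not the circuit of block 2004: the function name, the two-predecessor trellis layout, and the convention that the smaller metric wins are all assumptions made for the example, and the hardware performs the per-state work in parallel rather than in a loop.

```python
def add_compare_select(path_metrics, branch_metrics, predecessors):
    """One add-compare-select step over all trellis states.

    path_metrics[p]   : metric of the survivor path ending in state p
    predecessors[s]   : (p0, p1), the two states that can lead to state s
    branch_metrics[s] : (b0, b1), metrics of the branches p0->s and p1->s

    Returns (new_metrics, decisions). decisions[s] is the decision bit
    for state s: 0 if the path through p0 won, 1 if the path through p1 won.
    Assumption: a smaller metric is better.
    """
    new_metrics = []
    decisions = []
    for s, (p0, p1) in enumerate(predecessors):
        b0, b1 = branch_metrics[s]
        candidate0 = path_metrics[p0] + b0   # add
        candidate1 = path_metrics[p1] + b1   # add
        if candidate1 < candidate0:          # compare
            new_metrics.append(candidate1)   # select
            decisions.append(1)
        else:
            new_metrics.append(candidate0)   # select
            decisions.append(0)
    return new_metrics, decisions


# Two-state example: both states can be reached from state 0 or state 1.
metrics, bits = add_compare_select(
    path_metrics=[0, 3],
    branch_metrics=[(1, 0), (5, 0)],
    predecessors=[(0, 1), (0, 1)],
)
```

The decision bits are what a traceback stage would later read from the survivor memory to recover the most likely path.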

Block 2008 uses four 8-bit ALUs; in one example, their four absolute differences can be computed every cycle. A reduction tree is built in block 2004 to accumulate the absolute differences into a 16-bit accumulator. The multicast network can be used to send these values onward for further reduction. A total of 128 8-bit (64 16-bit) block 2008 operations per clock cycle is possible. However, it is believed that the effective utilization, once all overhead is taken into account, may result in a lower number.
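The SAD computation described above can be sketched as follows. This is an illustrative model only: the function names are hypothetical, each loop iteration stands in for one cycle of four 8-bit ALUs, and the pairwise reduction is one assumed way of arranging the reduction tree.

```python
def abs_diff_four_lanes(a, b):
    """Four absolute differences of 8-bit values, one per ALU lane."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("exactly four values per operand are expected")
    return [abs((x & 0xFF) - (y & 0xFF)) for x, y in zip(a, b)]


def reduce_tree(values):
    """Sum a list by pairwise addition, level by level."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad an odd level with a neutral element
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0


def sad(block_a, block_b):
    """Sum of absolute differences, accumulated in a 16-bit accumulator.

    Assumption: both blocks have the same length, a multiple of four.
    """
    if len(block_a) != len(block_b) or len(block_a) % 4:
        raise ValueError("blocks must have equal length, a multiple of 4")
    acc = 0
    for i in range(0, len(block_a), 4):
        diffs = abs_diff_four_lanes(block_a[i:i + 4], block_b[i:i + 4])
        acc = (acc + reduce_tree(diffs)) & 0xFFFF  # 16-bit wraparound
    return acc


result = sad([10, 20, 30, 40], [12, 18, 30, 45])
```

In motion estimation, a SAD of this kind is evaluated for each candidate block position and the position with the smallest sum is kept.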

The ALU implements the same conditional add-subtract function that the specialized ALU block implements, as discussed above. The control bits required for despreading must be loaded into the local memory, from which they are fetched and stored in a register. The result is accumulated in a 16-bit accumulator, from which it can be forwarded to other blocks 2004 for reduction operations there. With despreading, in one example, it is possible to perform 128 simultaneous conditional add-subtracts in a single cycle. The energy per transfer in this unit is higher than that used by the specialized ALU, which serves some general-purpose functions in addition to despreading and SAD. For a smaller number of pointers, or for lower-rate motion estimation, the specialized ALU is the more power-efficient option.
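The conditional add-subtract used for despreading can be sketched as follows. The sketch makes several assumptions that the text does not state: the function name is hypothetical, a control bit of 1 means "subtract" and 0 means "add", the samples are plain integers, and the 16-bit accumulator wraps in two's complement.

```python
def despread(samples, control_bits):
    """Accumulate samples, adding or subtracting each one by its control bit.

    control_bits[i] == 0 : add samples[i]        (assumed convention)
    control_bits[i] == 1 : subtract samples[i]   (assumed convention)

    The sum is wrapped to a signed 16-bit value, modelling a 16-bit
    accumulator.
    """
    if len(samples) != len(control_bits):
        raise ValueError("one control bit is needed per sample")
    acc = 0
    for sample, bit in zip(samples, control_bits):
        acc += -sample if bit else sample
    acc &= 0xFFFF                       # keep the low 16 bits
    return acc - 0x10000 if acc & 0x8000 else acc


# Samples that match the spreading code pattern add up coherently.
symbol = despread([3, -3, 3, 3], [0, 1, 0, 0])
```

Because every sample needs only one conditional add or subtract, many such operations can run side by side, which is the parallelism the paragraph above refers to.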

FIG. 21 shows an example of a programming flow and tools using the processor 22 in accordance with an embodiment of the present invention. FIG. 22 shows an example of the scalability of an embodiment of the present invention. For example, FIG. 22 shows clusters 2202 of N-type and W-type sub-processors interconnected by a bus 2204. Each cluster 2202 includes two or four sub-processors. In one example, the bus 2204 is a standard SoC bus. Interconnectivity is addressed by maintaining a hierarchical design approach.

Scaling of the processor 20 results in clusters of four sub-processors, with a separate bus for each cluster; otherwise, the four sub-processors may share a single memory. Processor scalability is typically achieved by increasing the number of processors or by raising the processor frequency or speed. However, the scaling that complex applications require goes beyond what has been done before. In the present invention, the W-type and N-type sub-processors are modified so that four such sub-processors, forming one processor, can handle a single application.

Accordingly, the processor 22 is able to run the control and sequential DSP code found in the target applications more efficiently than RISC and superscalar processors that rely directly on compilation from C code. At the same time, it is designed to exploit the automatic code generation techniques used in RISC and superscalar processors for legacy and small applications. Moreover, the processor 22 works with mature, industry-standard software tools, such as Simulink, for application mapping and development. Moore's law can be leveraged to enhance the performance of the processor 22. The processor 22 is not only a highly parallel machine but also a heterogeneous multiprocessor. That parallel heterogeneous multiprocessors are needed to address demanding multimedia and communications applications is a proven fact in industry and academia. It allows many of the automatic code generation techniques used in VLIW to be exploited without using any technique that is inefficient in power and area. It is optimized to exploit repeating patterns in control code compiled from C. This greatly reduces control power and makes it possible to run compiled serial code efficiently. In addition, the programming model of the processor 22 is designed to suit the large community of DSP programmers by using tools familiar to them, such as Simulink. Its development flow provides a means for efficient C compilation of control and sequential DSP code.
In addition, a large set of libraries of efficient communications and multimedia kernels is provided. Examples are parameterized libraries for FFT, IDCT, RRC, Viterbi, VLC, 2D/3D graphics, turbo codecs, and descramblers.

The datapath design in the processor 22 successfully integrates a variety of interconnect structures connecting functional units of different granularities, effectively addressing a focused, yet highly advantageous, mix of applications.

The scalability of the processor 22 is designed around a standard SoC bus so that all applications within a single block (time-multiplexed) are provided with nearest-neighbor connections inside the block. A great deal of inefficiency and all system-level non-determinism are reduced, because multiple blocks can be used to handle multiple applications without any dedicated communication between them.

FIG. 23 shows a diagram presenting some of the benefits of the scalability of the present invention.

Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.

Claims (20)

  1. A heterogeneous, high-performance, scalable processor comprising: at least one W-type sub-processor capable of processing W bits or more in parallel, W being an integer value; at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W; a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor; and a shared memory coupled to the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges bytes when transferring bytes to or from the memory, to accommodate execution of applications, allowing for fast operations.
  2. The heterogeneous, high-performance, scalable processor as recited in claim 1, wherein the processor is scalable.
  3. The heterogeneous, high-performance, scalable processor as recited in claim 1, including two of the at least one W-type sub-processor and two of the at least one N-type sub-processor.
  4. The heterogeneous, high-performance, scalable processor as recited in claim 2, wherein the at least one W-type sub-processor and the at least one N-type sub-processor execute programs for multimedia applications.
  5. The heterogeneous, high-performance, scalable processor as recited in claim 4, wherein each of the at least one W-type sub-processor includes a plurality of macro-functional units.
  6. The heterogeneous, high-performance, scalable processor as recited in claim 5, wherein the plurality of macro-functional units includes a load-store block for generating memory addresses for use by the other macro-functional units of the plurality of macro-functional units.
  7. The heterogeneous, high-performance, scalable processor as recited in claim 6, wherein the plurality of macro-functional units includes a scalar arithmetic logic unit (ALU) and multiply-accumulate block coupled to the load-store block, which performs scalar arithmetic, logic, and multiplication operations on data received from the load-store block.
  8. The heterogeneous, high-performance, scalable processor as recited in claim 7, wherein the plurality of macro-functional units includes a vector X block coupled to the load-store block and to the scalar ALU and multiply-accumulate block, which performs vector operations on data from the load-store block, the vector X block generating vector data.
  9. The heterogeneous, high-performance, scalable processor as recited in claim 8, wherein the plurality of macro-functional units includes a vector ALU and multiply-accumulate block coupled to the scalar ALU and multiply-accumulate block and to the vector X block, for performing vector ALU and multiply-accumulate operations on the vector data received from the vector X block.
  10. The heterogeneous, high-performance, scalable processor as recited in claim 2, wherein the at least one N-type sub-processor includes a store unit block, a macro-function block, and a load unit block, the macro-function block being coupled to the load unit block and further coupled to a macro-function bus, the macro-function bus coupling the macro-function block to the store block.
  11. The heterogeneous, high-performance, scalable processor as recited in claim 10, wherein the at least one N-type sub-processor includes a data path unit (DPU) block shared by the at least one W-type sub-processor, and a controller, sequencer, and data address generator (DAG) block.
  12. The heterogeneous, high-performance, scalable processor as recited in claim 10, wherein the macro-function block includes a Galois field multiply-accumulate (MAC) block, coupled to the macro-function bus and the load unit block 1306, for performing Galois field operations.
  13. The heterogeneous, high-performance, scalable processor as recited in claim 12, wherein the macro-function block includes a specialized ALU, coupled to the load unit block and a load unit block, for performing specialized ALU operations.
  14. The heterogeneous, high-performance, scalable processor as recited in claim 13, wherein the macro-function block includes a puncture/depuncture block, coupled to the load unit block and a load unit block, for performing puncture/depuncture operations.
  15. The heterogeneous, high-performance, scalable processor as recited in claim 14, wherein the macro-function block includes an interleaver block, coupled to the load unit block and a load unit block, for performing interleaving operations.
  16. The heterogeneous, high-performance, scalable processor as recited in claim 15, wherein the macro-function block includes a Viterbi block, coupled to the store unit block and the interleaver block, for performing Viterbi operations.
  17. The heterogeneous, high-performance, scalable processor as recited in claim 16, wherein the macro-function block includes a combiner block, coupled to the load unit block and a load unit block, for performing combining operations.
  18. The heterogeneous, high-performance, scalable processor as recited in claim 16, wherein the at least one N-type sub-processor includes an X unit block coupled between the store unit block and the load unit block.
  19. The heterogeneous, high-performance, scalable processor as recited in claim 16, including a shared register coupled between the at least one W-type sub-processor and the at least one N-type sub-processor, for direct communication between them.
  20. A method of processing information comprising, in a heterogeneous, high-performance, scalable processor: processing data using at least one W-type sub-processor capable of processing W bits in parallel, W being an integer value; concurrently processing data using at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and one half of W; and enabling fast execution of multimedia applications while maintaining low power consumption and ease of programmability.
CN 200580030649 2004-07-13 2005-07-12 Programmable processor system with two kinds of subprocessor to execute multimedia application CN101031904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US58769104 2004-07-13 2004-07-13

Publications (1)

Publication Number Publication Date
CN101031904A (en) 2007-09-05

Family

ID=38716298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200580030649 CN101031904A (en) 2004-07-13 2005-07-12 Programmable processor system with two kinds of subprocessor to execute multimedia application

Country Status (1)

Country Link
CN (1) CN101031904A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9513918B2 (en) 2011-12-22 2016-12-06 Intel Corporation Apparatus and method for performing permute operations
CN104094182A (en) * 2011-12-23 2014-10-08 英特尔公司 Apparatus and method of mask permute instructions
US9588764B2 (en) 2011-12-23 2017-03-07 Intel Corporation Apparatus and method of improved extract instructions
US9619236B2 (en) 2011-12-23 2017-04-11 Intel Corporation Apparatus and method of improved insert instructions
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
US9658850B2 (en) 2011-12-23 2017-05-23 Intel Corporation Apparatus and method of improved permute instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C02 Deemed withdrawal of patent application after publication (patent law 2001)