JP2012530978A

JP2012530978A - Method and apparatus for performing shift and exclusive OR operations with a single instruction

Info

Publication number: JP2012530978A
Application number: JP2012516393A
Authority: JP
Inventors: Gopal, Vinodh; Guilford, James D.; Ozturk, Erdinc; Feghali, Wajdi; Wolrich, Gilbert M.; Dixon, Martin G.
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2009-12-17
Filing date: 2010-10-29
Publication date: 2012-12-06
Anticipated expiration: 2030-10-29
Also published as: CN104679478A; US20150089196A1; CN104699459A; TW201533660A; US9495165B2; JP5941498B2; US9495166B2; TWI531969B; JP6615819B2; DE112010004887T5; US9501281B2; TWI575456B; TWI610235B; CN104598203A; KR101411064B1; US20170351519A1; GB201119720D0; CN104699456A; TWI562067B; KR20120099236A

Abstract

A method and apparatus for performing shift and XOR operations are disclosed. In one embodiment, the apparatus includes an execution resource that executes a first instruction. In response to the first instruction, the execution resource performs a shift and an XOR on at least one value.
[Selected drawing] FIG. 1A

Description

The present disclosure relates to the field of computer processing. More particularly, embodiments relate to an instruction that performs shift and exclusive OR (XOR) operations.

Single-instruction, multiple-data (SIMD) instructions are useful in a variety of applications that process many data elements (packed data) in parallel. Executing operations such as shifts or exclusive ORs (XORs) serially degrades performance.

The present invention is illustrated by way of example, and not by way of limitation, in the accompanying drawings.

A block diagram of a computer system formed with a processor that includes an execution unit to execute a shift and XOR instruction, according to one embodiment of the present invention.
A block diagram of another example computer system according to an alternative embodiment of the present invention.
A block diagram of yet another example computer system according to another alternative embodiment of the present invention.
A block diagram of the microarchitecture of a processor of one embodiment that includes logic circuits to perform shift and XOR operations according to the present invention.
Various packed data type representations in multimedia registers according to one embodiment of the present invention.
Packed data types according to an alternative embodiment.
Various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention.
One embodiment of an operation encoding (opcode) format.
An alternative operation encoding (opcode) format.
Yet another alternative operation encoding format.
A block diagram of one embodiment of logic to execute an instruction according to the present invention.
A flow diagram of operations performed in one embodiment.

The following description sets out embodiments of techniques for performing a shift or XOR operation within a processing apparatus, computer system, or software program. Numerous specific details, such as processor types, microarchitectural conditions, events, and enablement mechanisms, are set forth to provide a more thorough understanding of the present invention. It will be apparent to one skilled in the art, however, that embodiments of the invention may be practiced without these specific details. In addition, some well-known structures, circuits, and the like are not shown in detail so as not to obscure embodiments of the present invention unnecessarily.

Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. The same techniques and teachings of the present invention can readily be applied to other types of circuits or semiconductor devices that benefit from higher pipeline throughput and improved performance. The teachings of the present invention are applicable to any processor or machine that performs data manipulation. Moreover, embodiments of the present invention are not limited to processors or machines that perform 256-bit, 128-bit, 64-bit, 32-bit, or 16-bit data operations, and can be applied to any processor or machine in which manipulation of packed data is needed.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by software stored on a tangible medium. In one embodiment, the methods of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention. Embodiments of the present invention may be provided as a computer program product or software, which may include a machine-readable or computer-readable medium storing instructions that can be used to program a computer (or other electronic device) to perform a process according to the present invention. Alternatively, the steps of the present invention may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components. Such software can be stored in memory in the system. Similarly, the code can be distributed via a network or by way of other computer-readable media.

A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine, including, but not limited to, floppy disks, optical disks, CDs, CD-ROMs, magneto-optical disks, ROM, RAM, EPROM, EEPROM, magnetic or optical cards, flash memory, transmission over the Internet, and electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals). Accordingly, a computer-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer). The present invention may also be downloaded as a computer program product. In that case, the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client). The transfer of the program may be by way of electrical, optical, acoustical, or other forms of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or a network connection).

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. An optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage such as a disc may be the machine-readable medium. Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that copying, buffering, or retransmission of the electrical signal is performed. Thus, a communication provider or a network provider may make copies of an article (a carrier wave) embodying techniques of the present invention.

Modern processors use a number of different execution units to process and execute a variety of code and instructions. Not all instructions are created equal: some complete quickly, while others take an enormous number of clock cycles. The faster the throughput of instructions, the better the overall performance of the processor. It is therefore advantageous to have as many instructions as possible execute as fast as possible. However, certain instructions have greater complexity and require more execution time and processor resources. Examples include floating-point instructions, load/store operations, and data moves.

As more computer systems are used in Internet and multimedia applications, additional processor support has been introduced over time. For example, single-instruction, multiple-data (SIMD) integer/floating-point instructions and Streaming SIMD Extensions (SSE) are instructions that reduce the overall number of instructions required to execute a particular program task, which in turn can reduce power consumption. Instructions of this kind can speed up software performance by operating on multiple data elements in parallel. As a result, performance gains can be achieved in a wide range of applications, including video, speech, and image/photo processing. Implementing SIMD instructions in microprocessors and similar types of logic circuits usually involves a number of challenges. Furthermore, because SIMD operations are complex, additional circuitry is often needed to process and manipulate the data correctly.

SIMD shift and XOR instructions are not currently available. Without a SIMD shift and XOR instruction, such as that provided by embodiments of the present invention, a large number of instructions and data registers may be needed to achieve the same results in applications such as audio/video/graphics compression, processing, and manipulation. Thus, at least one shift and XOR instruction in accordance with embodiments of the present invention can reduce code overhead and resource requirements. Embodiments of the present invention provide a way to implement a shift and XOR operation as an algorithm that makes use of SIMD-related hardware. Presently, it is difficult and tedious to perform shift and XOR operations on data in a SIMD register. Some algorithms require more instructions to arrange data for arithmetic operations than the actual number of instructions needed to execute those operations. By implementing a shift and XOR operation in accordance with embodiments of the present invention, the number of instructions needed to achieve shift and XOR processing can be drastically reduced.

Embodiments of the present invention involve an instruction for implementing a shift and XOR operation. In one embodiment, the shift and XOR operation ...

The shift and XOR operation of one embodiment, as applied to data elements, can be generically represented as DEST1 ← SRC1 [SRC2].

In one embodiment, SRC1 stores a first operand having a plurality of data elements, and SRC2 contains a value representing the value to be shifted by the shift and XOR instruction. In other embodiments, an indicator of the shift and XOR value may be stored in an immediate field.
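
The operand roles described above can be sketched in Python. This is only an illustrative model, not the instruction's defined behavior: the text says a value is shifted and XORed but does not fix the details, so the right-shift direction, the 32-bit element width, the function name `shift_xor`, and the separate `xor_values` operand are all assumptions. The `shift_count` argument stands in for the value that would come from SRC2 or from an immediate field.

```python
ELEMENT_BITS = 32                       # assumed element width
ELEMENT_MASK = (1 << ELEMENT_BITS) - 1  # keeps values inside one element


def shift_xor(src1, shift_count, xor_values):
    """Shift each element of src1 right and XOR it with the matching
    element of xor_values, returning the results as a new list.

    Shift direction, element width, and the xor_values operand are
    assumptions made for illustration only.
    """
    if not 0 <= shift_count < ELEMENT_BITS:
        raise ValueError("shift count out of range for the assumed width")
    return [((a & ELEMENT_MASK) >> shift_count) ^ (x & ELEMENT_MASK)
            for a, x in zip(src1, xor_values)]


# Every element of the first operand is processed by the one call.
dest1 = shift_xor([0x10, 0xFF], 4, [0x1, 0x0])
```

Passing the shift count as a plain argument mirrors both operand forms mentioned in the text: the caller may take it from a register-like variable or hard-code it as a constant.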

In the flow above, "DEST" and "SRC" are generic terms representing the source and destination of the corresponding data or operation. In some embodiments, they may be implemented by registers, memory, or other storage areas having names or functions other than those depicted. In one embodiment, for example, DEST1 and DEST2 may be first and second temporary storage areas (e.g., "TEMP1" and "TEMP2" registers), SRC1 and SRC3 may be first and second destination storage areas (e.g., "DEST1" and "DEST2" registers), and so forth. In other embodiments, two or more of the SRC and DEST storage areas may correspond to different data storage elements within the same storage area (e.g., a SIMD register).

FIG. 1A is a block diagram of a computer system formed with a processor that includes an execution unit to execute a shift and XOR instruction, according to one embodiment of the present invention. System 100 includes components such as a processor 102 that employs an execution unit including logic to perform algorithms for processing data in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the PENTIUM® III, PENTIUM® 4, Xeon®, Itanium®, XScale®, and/or StrongARM® microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS® operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (e.g., UNIX® and Linux®), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices such as handheld devices and embedded applications. Examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPCs), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform shift and XOR operations on operands. Furthermore, some architectures have been implemented so that instructions can operate on several data items simultaneously, to improve the efficiency of multimedia applications. As the type and volume of data increase, computers and their processors have to be enhanced to manipulate data in more efficient ways.

FIG. 1A is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm that shifts and XORs a number of data elements, in accordance with one embodiment of the present invention. One embodiment is described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a hub architecture. Computer system 100 includes a processor 102 to process data signals. Processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computer (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. Processor 102 is coupled to a processor bus 110 that can transmit data signals between processor 102 and other components in system 100. The elements of system 100 perform their conventional functions, which are well known to those skilled in the art.

In one embodiment, processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, processor 102 can have a single internal cache or multiple levels of internal cache. In another embodiment, the cache memory can reside external to processor 102. Other embodiments can include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register.

Execution unit 108, which includes logic to perform integer and floating-point operations, also resides in processor 102. Processor 102 further includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. In this embodiment, execution unit 108 includes logic to handle a packed instruction set 109. In one embodiment, packed instruction set 109 includes a packed shift and XOR instruction that performs a shift and an XOR on a number of operands. By including packed instruction set 109 in the instruction set of general-purpose processor 102, along with the associated circuitry to execute the instructions, the operations used by many multimedia applications can be performed using packed data in general-purpose processor 102. Many multimedia applications can therefore be accelerated and executed more efficiently by using the full width of the processor's data bus to operate on packed data. This eliminates the need to transfer smaller units of data across the processor's data bus to perform operations one data element at a time.
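
The benefit of packed data can be illustrated with a small Python model. It is a sketch under stated assumptions rather than a description of processor 102: the lane count, the 16-bit lane width, and the helper names `pack`, `unpack`, `xor_serial`, and `xor_packed` are all invented for the example. It shows that one XOR on a wide word gives the same result as handling each element separately.

```python
LANES = 4          # number of packed elements (assumed for illustration)
LANE_BITS = 16     # width of each element in bits (assumed)
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elements):
    """Pack LANES unsigned elements into one integer, element 0 lowest."""
    word = 0
    for i, value in enumerate(elements):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a packed integer back into its LANES elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def xor_serial(a, b):
    """One XOR per element: LANES separate operations."""
    return [x ^ y for x, y in zip(a, b)]


def xor_packed(a, b):
    """A single XOR on the packed words covers every lane at once."""
    return unpack(pack(a) ^ pack(b))
```

XOR never carries between bit positions, so lanes cannot interfere with one another here. A packed shift would additionally need per-lane masking, which is part of what dedicated packed instructions handle in hardware.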

Alternative embodiments of execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a DRAM device, an SRAM device, a flash memory device, or another memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by processor 102.

A system logic chip 116 is coupled to processor bus 110 and memory 120. In the illustrated embodiment, system logic chip 116 is a memory controller hub (MCH). Processor 102 can communicate with MCH 116 via processor bus 110. MCH 116 provides a high-bandwidth memory path 118 to memory 120 for storing instructions and data and for storing graphics commands, data, and textures. MCH 116 directs data signals between processor 102, memory 120, and other components in system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O 122. In some embodiments, system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. MCH 116 is coupled to memory 120 through memory interface 118. Graphics card 112 is coupled to MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a dedicated hub interface bus 122 to couple MCH 116 to an I/O controller hub (ICH) 130. ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to memory 120, the chipset, and processor 102. Some examples are an audio controller, a firmware hub (flash BIOS) 128, a wireless transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. Data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or another mass storage device.
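
The hub topology described for system 100 can be written down as a small graph, which makes it easy to see which hubs a transfer passes through. The Python sketch below is illustrative only: the component names and reference numerals come from the text, while the dictionary `LINKS`, the function `route`, and the choice to fold buses and interconnects into edges are assumptions made for the example.

```python
from collections import deque

# Connections as described for system 100. Buses and interconnects are
# represented as edges (noted in comments) rather than as nodes.
LINKS = {
    "processor 102": ["MCH 116"],             # via processor bus 110
    "MCH 116": [
        "processor 102",
        "memory 120",                         # via memory path 118
        "graphics card 112",                  # via AGP interconnect 114
        "ICH 130",                            # via hub interface bus 122
    ],
    "memory 120": ["MCH 116"],
    "graphics card 112": ["MCH 116"],
    "ICH 130": [
        "MCH 116",
        "data storage 124",
        "wireless transceiver 126",
        "firmware hub 128",
        "network controller 134",
    ],
    "data storage 124": ["ICH 130"],
    "wireless transceiver 126": ["ICH 130"],
    "firmware hub 128": ["ICH 130"],
    "network controller 134": ["ICH 130"],
}


def route(start, goal):
    """Breadth-first search for the shortest chain of components."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in LINKS.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection between the two components
```

For instance, `route("processor 102", "data storage 124")` passes through both hubs, whereas a route to memory 120 involves only MCH 116.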

In another embodiment of a system, an execution unit that executes an algorithm with a shift and XOR instruction can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or a graphics controller, can also be located on the system on a chip.

FIG. 1B illustrates a data processing system 140 that implements the principles of one embodiment of the present invention. Those skilled in the art will readily appreciate that the embodiments described herein can be used with alternative processing systems without departing from the scope of the invention.

Computer system 140 comprises a processing core 159 capable of performing SIMD operations including a shift and XOR operation. In one embodiment, processing core 159 represents a processing unit of any type of architecture, including, but not limited to, a CISC, RISC, or VLIW architecture. Processing core 159 may also be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate that manufacture.

Processing core 159 comprises an execution unit 142, a set of register files 145, and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary to an understanding of the present invention. Execution unit 142 is used to execute instructions received by processing core 159. In addition to recognizing typical processor instructions, execution unit 142 can recognize instructions in packed instruction set 143 for performing operations on packed data formats. Packed instruction set 143 includes instructions that support shift and XOR operations, and may also include other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As previously mentioned, the particular storage area used for storing the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 is used to decode instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations.
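
The decode-then-execute relationship between decoder 144 and execution unit 142 can be modeled in a few lines of Python. This is a toy sketch, not the core's actual encoding: the opcode values, the control-signal dictionary, and the functions `decode`, `execute`, and `run` are hypothetical, and the shift and XOR behavior shown (right shift followed by XOR with a second list) is an assumption.

```python
# Hypothetical opcodes and control-signal names, chosen only for illustration.
DECODE_TABLE = {
    0x01: {"unit": "alu", "op": "add"},
    0x02: {"unit": "packed", "op": "shift_xor"},
}


def decode(opcode):
    """Map a received opcode to control signals for the execution unit."""
    try:
        return DECODE_TABLE[opcode]
    except KeyError:
        raise ValueError(f"unrecognized opcode {opcode:#x}") from None


def execute(signals, operands):
    """Carry out the operation selected by the decoded control signals."""
    if signals["op"] == "add":
        a, b = operands
        return a + b
    if signals["op"] == "shift_xor":
        values, count, xor_values = operands
        # Assumed semantics: shift right, then XOR element by element.
        return [(v >> count) ^ x for v, x in zip(values, xor_values)]
    raise ValueError("no handler for decoded operation")


def run(opcode, operands):
    """Decode an instruction, then hand the result to the execution step."""
    return execute(decode(opcode), operands)
```

Keeping the table separate from the handlers reflects the split in the text: the decoder only produces control signals, and the execution unit decides what work those signals select.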

Processing core 159 is coupled to a bus 141 for communicating with various other system devices, which may include, but are not limited to, synchronous dynamic random access memory (SDRAM) control 146, static random access memory (SRAM) control 147, burst flash memory interface 148, Personal Computer Memory Card International Association (PCMCIA)/CompactFlash (CF) card control 149, liquid crystal display (LCD) control 150, direct memory access (DMA) controller 151, and alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include, but are not limited to, a universal asynchronous receiver/transmitter (UART) 155, a Universal Serial Bus (USB) 156, a Bluetooth® wireless UART 157, and an I/O expansion interface 158.

One embodiment of data processing system 140 provides for mobile, network, and/or wireless communications and a processing core 159 capable of performing SIMD operations including a shift and XOR operation. Processing core 159 may be programmed with various audio, video, imaging, and communications algorithms, including discrete transformations such as a Walsh-Hadamard transform, a fast Fourier transform (FFT), a discrete cosine transform (DCT), and their respective inverse transforms; compression/decompression techniques such as color space transformation, video encode motion estimation, or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse code modulation (PCM). Some embodiments of the invention may also be applied to graphics applications, such as three-dimensional ("3D") modeling, rendering, object collision detection, and 3D object transformation and lighting.

FIG. 1C illustrates yet another embodiment of a data processing system capable of performing SIMD shift and XOR operations. In accordance with one alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167, and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing SIMD operations including shift and XOR operations. Processing core 170 may be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate the manufacture of all or part of data processing system 160 including processing core 170.

In one embodiment, SIMD coprocessor 161 includes an execution unit 162 and a set of register files 164. One embodiment of main processor 166 includes a decoder 165 to recognize instructions of instruction set 163, including SIMD shift and XOR instructions, for execution by execution unit 162. In alternative embodiments, SIMD coprocessor 161 also includes at least part of a decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes additional circuitry (not shown) that is not necessary to an understanding of embodiments of the present invention.

In operation, the main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with the cache memory 167 and the input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, the main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 166, from which they are received by any attached SIMD coprocessors. In this case, the SIMD coprocessor 161 accepts and executes any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. For one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. For another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. In one embodiment of processing core 170, main processor 166 and SIMD coprocessor 161 are integrated into a single processing core 170 that includes an execution unit 162, a set of register files 164, and a decoder 165 to recognize instructions of instruction set 163, including SIMD shift and XOR instructions.

FIG. 2 is a block diagram of the micro-architecture of a processor 200 of one embodiment that includes logic circuits to perform shift and XOR operations in accordance with the present invention. In one embodiment of the shift and XOR instruction, the instruction shifts a floating point mantissa value to the right by the amount indicated by the exponent and XORs the shifted value with a given value to produce the final result. In one embodiment, the in-order front end 201 is the part of the processor 200 that fetches macro-instructions to be executed and prepares them for later use in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches macro-instructions from memory and feeds them to an instruction decoder 228, which in turn decodes them into primitives called micro-instructions or micro-operations (also called micro-ops or uops) that the machine can execute. In one embodiment, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences or traces in the uop queue 234 for execution. When the trace cache 230 encounters a complex micro-instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Many macro-instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete a macro-instruction, the decoder 228 accesses the microcode ROM 232 to execute the macro-instruction. In one embodiment, a packed shift and XOR instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction for a packed shift and XOR algorithm can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading the microcode sequence for the shift and XOR algorithm from the microcode ROM 232. After the microcode ROM 232 finishes sequencing micro-ops for the current macro-instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.
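The decode rule described above for one embodiment (a macro-instruction needing more than four micro-ops is sequenced from the microcode ROM) can be sketched as a small C function. This is only an illustration: the enum, the function name, and the constant are hypothetical and do not come from the disclosure, which describes hardware rather than software.

```c
/* Hypothetical names, used only for illustration. */
enum uop_source {
    UOP_FROM_DECODER,       /* decoder 228 emits the micro-ops directly */
    UOP_FROM_MICROCODE_ROM  /* microcode ROM 232 sequences the micro-ops */
};

/* Threshold stated for one embodiment: more than four micro-ops. */
#define DECODER_MAX_UOPS 4u

/* Choose where the micro-ops for one macro-instruction come from. */
static enum uop_source select_uop_source(unsigned uops_needed)
{
    return (uops_needed > DECODER_MAX_UOPS) ? UOP_FROM_MICROCODE_ROM
                                            : UOP_FROM_DECODER;
}
```

For example, `select_uop_source(2)` selects the decoder path, while `select_uop_source(5)` selects the microcode ROM path.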

Some SIMD and other multimedia types of instructions are considered complex instructions. Most floating point related instructions are also complex instructions. As such, when the instruction decoder 228 encounters a complex macro-instruction, the microcode ROM 232 is accessed at the appropriate location to retrieve the microcode sequence for that macro-instruction. The various micro-ops needed to perform that macro-instruction are communicated to the out-of-order execution engine 203 for execution at the appropriate integer and floating point execution units.

The out-of-order execution engine 203 is where the micro-instructions are prepared for execution. The out-of-order execution logic smooths out and reorders the flow of micro-instructions to optimize performance as they go down the pipeline and are scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each uop in one of two uop queues, one for memory operations and one for non-memory operations, in front of the instruction scheduler, memory scheduler, fast scheduler 202, slow/general floating point scheduler 204, and simple floating point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the uop needs to complete its operation. The fast scheduler 202 of this embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.

Register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in the execution block 211. There are separate register files 208, 210 for integer and floating point operations, respectively. Each register file 208, 210 of this embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating point register file 210 are also capable of communicating data with each other. In one embodiment, the integer register file 208 is split into two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. The floating point register file 210 of one embodiment has 128-bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.

The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210 that store the integer and floating point data operand values that the micro-instructions need to execute. The processor 200 of this embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating point ALU 222, and floating point move unit 224. In this embodiment, the floating point execution blocks 222, 224 execute floating point, MMX, SIMD, and SSE operations. The floating point ALU 222 of this embodiment includes a 64-bit by 64-bit floating point divider to execute divide, square root, and remainder micro-ops. In embodiments of the present invention, any operation involving a floating point value is performed in the floating point hardware. For example, conversions between integer format and floating point format involve the floating point register file. Similarly, a floating point divide operation is performed at the floating point divider. On the other hand, non-floating point numbers and integer types are handled with integer hardware resources. The simple, very frequent ALU operations go to the fast ALU execution units 216, 218. The fast ALUs 216, 218 of this embodiment can execute fast operations with an effective latency of half a clock cycle. In one embodiment, most complex integer operations go to the slow ALU 220, because the slow ALU 220 includes integer execution hardware for long-latency types of operations, such as multiply, shift, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. In this embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data widths, including 16, 32, 128, and 256 bits. Similarly, the floating point units 222, 224 can be implemented to support a range of operands having various bit widths. In one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

The term "register" is used herein to refer to on-board processor storage locations that are used as part of macro-instructions to identify operands. In other words, the registers referred to herein are those that are visible from outside the processor (that is, from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, or combinations of dedicated and dynamically allocated physical registers. In one embodiment, integer registers store 32-bit integer data. The register file of one embodiment also contains sixteen XMM and general purpose registers, and eight multimedia SIMD registers (e.g., "EM64T" additions) for packed data. For the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit wide MMX® registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, or later technology (referred to generically as "SSEx") can also be used to hold such packed data operands. In this embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, other registers or combinations of registers may be used to store 256 bits or more of data.

In the examples of the following figures, a number of data operands are described. FIG. 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. FIG. 3A illustrates the data types for a packed byte 310, a packed word 320, and a packed doubleword (dword) 330 for 128-bit wide operands. The packed byte format 310 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can be performed on the sixteen data elements in parallel.
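The packed byte layout just described (byte i occupying bits 8i+7 through 8i of the 128-bit operand) can be illustrated with a short C sketch. The `packed128` type and the `packed_byte` helper are hypothetical names introduced here for illustration; the disclosure describes a register layout, not this software model.

```c
#include <stdint.h>

/* Hypothetical software model of a 128-bit packed operand:
   lo holds bits 63..0 and hi holds bits 127..64. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} packed128;

/* Return byte element i (0..15). Byte i occupies bits 8*i+7 .. 8*i,
   so byte 0 is bits 7..0 and byte 15 is bits 127..120. */
static uint8_t packed_byte(packed128 v, unsigned i)
{
    uint64_t half = (i < 8) ? v.lo : v.hi;   /* pick the 64-bit half */
    return (uint8_t)(half >> (8 * (i & 7))); /* shift the byte down  */
}
```

With `lo = 0x0706050403020100` and `hi = 0x0F0E0D0C0B0A0908`, each byte element holds its own index, which makes the bit-to-byte mapping easy to check by hand.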

Generally, a data element is an individual piece of data that is stored in a single register or memory location together with other data elements of the same length. In packed data sequences relating to SSEx technology, the number of data elements stored in an XMM register is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an MMX register is 64 bits divided by the length in bits of an individual data element. Although the data types illustrated in FIG. 3A are 128 bits long, embodiments of the present invention can also operate on 64-bit wide operands or operands of other sizes. The packed word format 320 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains 16 bits of information. The packed doubleword format 330 of FIG. 3A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains 32 bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
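The element-count rule stated above is a single division. The following C sketch restates it; the function name is hypothetical and chosen only for illustration.

```c
/* Number of data elements held in a packed register: the register
   width in bits divided by the width in bits of one data element. */
static unsigned element_count(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}
```

For a 128-bit XMM register this gives 16 bytes, 8 words, 4 doublewords, or 2 quadwords, matching the formats of FIG. 3A; for a 64-bit MMX register it gives, for example, 4 words.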

FIG. 3B illustrates alternative in-register data storage formats. Each packed data can include more than one independent data element. Three packed data formats are illustrated: packed half 341, packed single 342, and packed double 343. One embodiment of packed half 341, packed single 342, and packed double 343 contains fixed-point data elements. In an alternative embodiment, one or more of packed half 341, packed single 342, and packed double 343 may contain floating point data elements. One alternative embodiment of packed half 341 is 128 bits long and contains eight 16-bit data elements. One embodiment of packed single 342 is 128 bits long and contains four 32-bit data elements. One embodiment of packed double 343 is 128 bits long and contains two 64-bit data elements. Such packed data formats may be further extended to other register lengths, for example, 96 bits, 160 bits, 192 bits, 224 bits, 256 bits, or more.

FIG. 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of unsigned packed bytes in a SIMD register. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can be performed on the sixteen data elements in parallel. Signed packed byte representation 345 illustrates the storage of signed packed bytes. The eighth bit of every byte data element is the sign indicator. Unsigned packed word representation 346 illustrates how word 7 through word 0 are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. The sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. The necessary sign bit is the thirty-second bit of each doubleword data element.
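The sign indicator described above is the most significant bit of each element: the eighth bit of a byte, the sixteenth bit of a word, and the thirty-second bit of a doubleword. A minimal C sketch, assuming the element has already been extracted into the low bits of a 32-bit value (the function name is hypothetical):

```c
#include <stdint.h>

/* Return 1 if the sign indicator of a signed packed element is set.
   element_bits is the element width: 8, 16 or 32. The sign indicator
   is the most significant bit, i.e. bit (element_bits - 1) counting
   from zero. */
static int sign_indicator(uint32_t element, unsigned element_bits)
{
    return (int)((element >> (element_bits - 1)) & 1u);
}
```

For instance, the byte value 0x80 has its sign indicator set, while 0x7F does not.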

FIG. 3D depicts one embodiment of an operation encoding (opcode) format 360, having 32 or more bits and register/memory operand addressing modes corresponding to a type of opcode format described in the "IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference", which is published by Intel Corporation of Santa Clara, California and is available at www.intel.com/design/litcentr. In one embodiment, a shift and XOR operation may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. In one embodiment of the shift and XOR instruction, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment of the shift and XOR instruction, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the result of the shift and XOR operations, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. In one embodiment of the shift and XOR instruction, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.

FIG. 3E depicts another operation encoding (opcode) format 370, having 40 or more bits. Opcode format 370 corresponds to opcode format 360 and includes an optional prefix byte 378. This type of shift and XOR operation may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. In one embodiment of the shift and XOR instruction, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. In one embodiment of the shift and XOR instruction, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment of the shift and XOR instruction, one of the operands identified by operand identifiers 374 and 375 is shifted and XORed with another of the operands identified by operand identifiers 374 and 375, and is overwritten by the result of the shift and XOR operations, whereas in other embodiments the result of the shift and XOR of the operands identified by identifiers 374 and 375 is written to another data element in another register. Opcode formats 360 and 370 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.

Turning next to FIG. 3F, in some alternative embodiments, 64-bit single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. For alternative embodiments of shift and XOR operations, this type of CDP instruction may be encoded by one or more of fields 383, 384, 387, and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8-, 16-, 32-, and 64-bit values. In one embodiment, the shift and XOR operation is performed on floating point data elements. In some embodiments, a shift and XOR instruction may be executed conditionally, using selection field 381. For some shift and XOR instructions, source data sizes may be encoded by field 383. In some embodiments of the shift and XOR instruction, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.

FIG. 4 is a block diagram of one embodiment of logic to perform a shift and XOR operation on packed data operands in accordance with the present invention. Embodiments of the present invention can be implemented to function with various types of operands such as those described above. For simplicity, the following discussion and examples are in the context of a shift and XOR instruction that processes data elements. In one embodiment, a first operand 401 is shifted by shifter 410 by an amount specified by input 405. In one embodiment, this is a right shift. In other embodiments, however, the shifter performs a left shift operation. In some embodiments the operand is a scalar value, whereas in other embodiments it is a packed data value having any of a number of possible data sizes and types (e.g., floating point, integer). In one embodiment, the shift count 405 is a packed (or "vector") value, each element of which corresponds to an element of the packed operand to be shifted by the corresponding shift count element. In other embodiments, a single shift count applies to all elements of the first data operand. Furthermore, in some embodiments the shift count is specified by a field in the instruction, such as an immediate, r/m, or other field. In other embodiments, the shift count is specified by a register indicated by the instruction.
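The two shift-count variants described above (a packed count with one element per lane, or a single count applied to every lane) can be sketched in C for the right-shift embodiment. This is a software illustration under stated assumptions, not the disclosed hardware: the lane count of four 32-bit elements, the function names, and the choice to produce zero for counts of 32 or more are all assumptions made here; the disclosure does not specify behavior for oversized counts.

```c
#include <stdint.h>

#define LANES 4  /* assumption: four 32-bit elements in a 128-bit operand */

/* Logical right shift of one element. Shifting a 32-bit value by 32 or
   more is undefined in C, so such counts are mapped to 0 here; this
   choice is an assumption of the sketch. */
static uint32_t shift_right_element(uint32_t x, uint32_t count)
{
    return (count < 32u) ? (x >> count) : 0u;
}

/* Packed ("vector") shift count: lane i is shifted by counts[i]. */
static void shift_right_packed(uint32_t dst[LANES],
                               const uint32_t src[LANES],
                               const uint32_t counts[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = shift_right_element(src[i], counts[i]);
}

/* Single shift count applied to every lane, as when the count comes
   from an immediate field or a register. */
static void shift_right_uniform(uint32_t dst[LANES],
                                const uint32_t src[LANES],
                                uint32_t count)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = shift_right_element(src[i], count);
}
```

A left-shift embodiment would follow the same structure with `<<` in place of `>>`.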

The shifted operand is then XORed with a value 430 by logic 420, and the XORed result is stored in a destination storage location (e.g., a register) 425. In one embodiment, the XOR value 430 is a packed (or "vector") value, each element of which corresponds to an element of the packed operand to be XORed with that XOR element. In other embodiments, a single XOR value 430 applies to all elements of the first data operand. Further, in some embodiments, the XOR value is specified by a field in the instruction (e.g., an immediate, r/m, or other field). In other embodiments, the XOR value is specified by a register indicated by the instruction.
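The data path of FIG. 4 can be modeled in a few lines of Python. This is only a minimal sketch under stated assumptions: the names `shift_xor` and `packed_shift_xor`, the default 32-bit element width, and the use of a logical (zero-filling) shift are illustrative choices, not details taken from the disclosure. The sketch shows both options described above: a per-element (packed) shift count or XOR value, and a single value applied to every element.

```python
def shift_xor(value, count, xor_value, left=False, width=32):
    """Shift one element by `count` bits, then XOR it with `xor_value`.

    Assumption: the shift is logical (zero-filling) and results are
    truncated to `width` bits.
    """
    mask = (1 << width) - 1
    value &= mask
    shifted = (value << count) & mask if left else value >> count
    return shifted ^ (xor_value & mask)


def packed_shift_xor(values, counts, xor_values, left=False, width=32):
    """Apply shift_xor element by element over a packed operand.

    `counts` and `xor_values` may each be a sequence (one entry per
    element) or a single int that is applied to all elements.
    """
    n = len(values)
    if isinstance(counts, int):
        counts = [counts] * n
    if isinstance(xor_values, int):
        xor_values = [xor_values] * n
    return [
        shift_xor(v, c, x, left=left, width=width)
        for v, c, x in zip(values, counts, xor_values)
    ]
```

Passing a list for `counts` models the packed shift count, while passing a plain integer models the embodiment in which one shift count applies to all elements; the same holds for `xor_values`.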

FIG. 5 illustrates the operation of a shift and XOR instruction according to one embodiment of the present invention. When a shift and XOR instruction is received at operation 501, the first operand is shifted by the shift count at operation 505. In one embodiment, this is a right shift. In other embodiments, the shifter may perform a left shift. In some embodiments, the operand is a scalar value, while in other embodiments it is a packed data value having any of a number of possible data sizes and types (e.g., floating point, integer). In one embodiment, the shift count 405 is a packed (or "vector") value, each element of which corresponds to an element of the packed operand to be shifted by that shift count element. In other embodiments, a single shift count applies to all elements of the first data operand. Further, in some embodiments, the shift count is specified by a field in the instruction (e.g., an immediate, r/m, or other field). In other embodiments, the shift count is specified by a register indicated by the instruction.

At operation 510, the shifted value is XORed with the XOR value. In one embodiment, the XOR value 430 is a packed (or "vector") value, each element of which corresponds to an element of the packed operand to be XORed with that XOR element. In other embodiments, a single XOR value 430 applies to all elements of the first data operand. Further, in some embodiments, the XOR value is specified by a field in the instruction (e.g., an immediate, r/m, or other field). In other embodiments, the XOR value is specified by a register indicated by the instruction.

At operation 515, the shifted and XORed value is stored in a given location. In one embodiment, the location is a scalar register. In another embodiment, the location is a packed data register. In another embodiment, the destination location also serves as a source location (e.g., a packed data register specified by the instruction). In other embodiments, the destination location differs from the source locations that hold the first operand or the other values (e.g., the shift count or the XOR value).
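The flow of FIG. 5 can be traced step by step against a toy register file. The sketch below is illustrative only: the dictionary-based register file, the register names, and the function name `execute_shift_xor` are hypothetical, and a logical right shift on 32-bit scalars is assumed. It shows that the destination may be a separate location or the same location as the source.

```python
def execute_shift_xor(regs, dst, src, count, xor_value, width=32):
    """Model operations 501 to 515 on a dict-based register file."""
    mask = (1 << width) - 1
    operand = regs[src] & mask              # 501: instruction received, operand read
    shifted = operand >> count              # 505: shift by the shift count
    result = shifted ^ (xor_value & mask)   # 510: XOR with the XOR value
    regs[dst] = result                      # 515: store at the destination
    return result


# Hypothetical register file with two scalar registers.
regs = {"r0": 0xF0, "r1": 0}
execute_shift_xor(regs, "r1", "r0", 4, 0x3)  # destination differs from source
execute_shift_xor(regs, "r0", "r0", 4, 0x0)  # destination is also the source
```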

In one embodiment, the shift and XOR instruction is useful for performing data de-duplication in various computer applications. Data de-duplication attempts to find blocks of data that are common between files in order to optimize disk storage and/or network bandwidth. In one embodiment, the shift and XOR instruction is useful for improving de-duplication performance in operations such as finding chunk boundaries using a rolling hash, computing hash digests (e.g., SHA1 or MD5), and compressing unique chunks (using a fast Lempel-Ziv scheme).
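As a rough illustration of the de-duplication flow described above, the sketch below keeps one compressed copy of each unique chunk, keyed by its SHA-1 digest. The function names `dedup_store` and `rebuild` are hypothetical, the input is assumed to have been split into chunks already, and `zlib` (DEFLATE, an LZ77-based scheme) merely stands in for the fast Lempel-Ziv compressor mentioned in the text.

```python
import hashlib
import zlib


def dedup_store(chunks):
    """Store each unique chunk once and return (store, recipe).

    store:  maps a SHA-1 digest to the compressed bytes of a unique chunk.
    recipe: ordered list of digests needed to rebuild the original stream.
    """
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha1(chunk).digest()
        if digest not in store:          # only unique chunks are compressed
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original byte stream from the stored chunks."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)
```

Repeated chunks add only a digest to the recipe, which is where the storage or bandwidth saving comes from.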

For example, one data de-duplication algorithm can be expressed by the following pseudocode.

In the algorithm above, the scramble table is a 256-entry array of random 32-bit constants, and v is the rolling hash, which holds a hash value of the past 32 bytes of data. When a chunk boundary is found, the algorithm returns with ret = 1, and the position p denotes the boundary of the chunk. The value z is a constant, such as 12 to 15, that yields good chunk detection and may be chosen according to the application. In one embodiment, the shift and XOR instruction allows the above algorithm to run at a rate of about 2 cycles/byte. In other embodiments, the shift and XOR instruction may let the algorithm run faster or slower than this, depending on the use.
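The description above can be turned into a small runnable sketch. It is a reconstruction under assumptions, not the pseudocode of the disclosure: the function names, the seeded random table, and in particular the boundary test (the low z bits of v being zero) are guesses chosen to fit the prose. The one-bit right shift per byte is inferred from the later remark that the alternative form shifts v left "rather than right", and it is what limits a 32-bit v to the last 32 bytes.

```python
import random

MASK32 = 0xFFFFFFFF


def make_scramble_table(seed=0):
    """Build a 256-entry table of random 32-bit constants (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]


def roll(v, byte, scramble):
    """One rolling-hash step: shift v right by one bit, then XOR in a table entry."""
    return ((v >> 1) ^ scramble[byte]) & MASK32


def find_chunk_boundary(data, scramble, z, v=0, start=0):
    """Scan `data` from `start` and return (ret, p, v).

    ret is 1 if a boundary was found at position p, else 0.
    Assumed boundary test: the low z bits of v are all zero.
    """
    low_bits = (1 << z) - 1
    p = start
    while p < len(data):
        v = roll(v, data[p], scramble)
        if v & low_bits == 0:
            return 1, p, v
        p += 1
    return 0, p, v
```

Because each step shifts a 32-bit v right by one bit, the contribution of a byte disappears after 32 further steps, so v depends only on the most recent 32 bytes. Each step is exactly one shift followed by one XOR, which is the pair of operations a combined shift and XOR instruction would fuse.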

At least one embodiment that uses the shift and XOR instruction can be expressed by the following pseudocode.

In the algorithm above, each entry of the bref1_scramble array contains a bit-reflected version of the corresponding entry in the original scramble array. In one embodiment, the above algorithm shifts v left rather than right, and v contains a bit-reflected version of the rolling hash. In one embodiment, the check for a chunk boundary is performed by checking for a minimum number of leading zeros.
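The bit-reflected form can be sketched in the same way. Again this is an illustration under assumptions rather than the pseudocode of the disclosure: the helper names are hypothetical, a 32-bit hash and a one-bit shift per byte are assumed, and the table is built by reversing the bit order of each entry of an ordinary scramble table.

```python
MASK32 = 0xFFFFFFFF


def bit_reflect32(x):
    """Reverse the bit order of a 32-bit value (bit 0 swaps with bit 31, and so on)."""
    return int(f"{x & MASK32:032b}"[::-1], 2)


def make_reflected_table(scramble):
    """Build the bit-reflected table from an original scramble table."""
    return [bit_reflect32(s) for s in scramble]


def roll_reflected(v, byte, bref1_scramble):
    """One step of the reflected hash: shift v left by one bit, then XOR."""
    return ((v << 1) ^ bref1_scramble[byte]) & MASK32


def has_leading_zeros(v, z):
    """True if the top z bits of the 32-bit value v are zero (0 <= z <= 32)."""
    return (v & MASK32) >> (32 - z) == 0


def find_chunk_boundary_reflected(data, bref1_scramble, z, v=0, start=0):
    """Return (ret, p, v); ret is 1 when v has at least z leading zeros at p."""
    p = start
    while p < len(data):
        v = roll_reflected(v, data[p], bref1_scramble)
        if has_leading_zeros(v, z):
            return 1, p, v
        p += 1
    return 0, p, v
```

Since bit reflection turns a right shift into a left shift and distributes over XOR, the value produced here is, step for step, the bit-reflected image of a right-shifting hash built on the unreflected table. A run of leading zeros in this v therefore corresponds to a run of trailing zeros in the unreflected hash; treating that as the matching boundary test is an inference from the text, not something it states.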

In other embodiments, the shift and XOR instruction may be used in other useful computer operations and algorithms. Furthermore, embodiments can improve the performance of the many programs that make extensive use of shift and XOR operations.

Thus, techniques for performing a shift and XOR instruction have been disclosed. While certain embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the broad invention, and that this invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail, as facilitated by technological advancements, without departing from the principles of the present disclosure or the scope of the appended claims.

Claims

A processor comprising logic to execute a shift and XOR instruction that shifts a first value by a given shift amount and XORs the shifted value with a second value.

The processor of claim 1, wherein the first value is shifted left.

The processor of claim 1, wherein the first value is right shifted.

The processor of claim 1, wherein the first value is logically shifted.

The processor of claim 1, wherein the first value is arithmetically shifted.

The processor of claim 1, comprising a shifter and an XOR circuit.

The processor of claim 1, wherein the shift and XOR instruction includes a first field that stores the second value.

The processor of claim 1, wherein the first value is a packed data type.

A system comprising:
storage for storing a first instruction that performs shift and XOR operations; and
a processor including logic to execute a shift and XOR instruction that shifts a first value by a given shift amount and XORs the shifted value with a second value.

The system of claim 9, wherein the first value is shifted left.

The system of claim 9, wherein the first value is right shifted.

The system of claim 9, wherein the first value is logically shifted.

The system of claim 9, wherein the first value is arithmetically shifted.

The system of claim 9, comprising a shifter and an XOR circuit.

The system of claim 9, wherein the shift and XOR instruction includes a first field that stores the second value.

The system of claim 9, wherein the first value is a packed data type.

A method comprising performing a shift and XOR instruction that shifts a first value by a given shift amount and XORs the shifted value with a second value.

The method of claim 17, wherein the first value is shifted left.

The method of claim 17, wherein the first value is right shifted.

The method of claim 17, wherein the first value is logically shifted.

The method of claim 17, wherein the first value is arithmetically shifted.

The method of claim 17, comprising a shifter and an XOR circuit.

The method of claim 17, wherein the shift and XOR instruction includes a first field storing the second value.

The method of claim 17, wherein the first value is a packed data type.

A machine-readable medium storing instructions that, when executed by a machine, cause the machine to perform a method comprising:
shifting a first value by a given shift amount; and
XORing the shifted value with a second value.

The machine-readable medium of claim 25, wherein the first value is shifted left.

The machine-readable medium of claim 25, wherein the first value is shifted right.

The machine-readable medium of claim 25, wherein the first value is logically shifted.

The machine-readable medium of claim 25, wherein the first value is arithmetically shifted.

The machine-readable medium of claim 25, comprising a shifter and an XOR circuit.

The machine-readable medium of claim 25, wherein the shift and XOR instruction includes a first field that stores the second value.

The machine-readable medium of claim 25, wherein the first value is a packed data type.

A method comprising:
performing an exclusive OR (XOR) operation between a first shifted value and a second bit-reflected value, and storing the execution result in a first register; and
checking for a minimum number of leading zeros in the execution result.

The method of claim 33, wherein the execution result corresponds to a first chunk if the execution result contains the minimum number of leading zeros.

The method of claim 34, wherein the first shifted value is shifted left by one bit position.

The method of claim 34, wherein the first shifted value is shifted right by one bit position.