JP2011165216A

JP2011165216A - Method and apparatus for vectorizing multiple input instructions

Info

Publication number: JP2011165216A
Application number: JP2011110994A
Authority: JP
Inventors: Yoav Almog; Roni Rosner; Ronny Ronen
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2004-06-24
Filing date: 2011-05-18
Publication date: 2011-08-25
Anticipated expiration: 2025-05-25
Also published as: CN1977241B; WO2006007193A1; DE112005001277B4; GB2429554B; CN1977241A; US20050289529A1; DE112005003852A5; DE112005001277T5; JP5646390B2; JP2008503836A; GB2429554A; GB0619968D0; US7802076B2; DE112005003852B4

Abstract

PROBLEM TO BE SOLVED: To provide an effective method and apparatus for vectorizing a plurality of input instructions.
SOLUTION: The apparatus has an optimization unit that searches a trace for two or more instructions having a common operation code and, when the two or more instructions are at the same level in a trace dependency tree, merges them into a single SIMD (Single Instruction Multiple Data) instruction. The trace dependency tree arranges instructions in a plurality of levels, each level containing instructions of the same height, and the instructions of the trace are stored in a memory of the apparatus.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a method and apparatus for vectorizing a plurality of input instructions.

A central processing unit (CPU) of a computer system may include a plurality of functional execution units that process instructions in parallel. These instructions may include SIMD (Single Instruction Multiple Data) instructions. A SIMD instruction can perform a common operation on multiple data items in parallel. SIMD instructions may therefore allow the CPU to perform multiple repetitive computations simultaneously, reducing overall execution time. SIMD processing may be particularly effective in multimedia applications such as audio and image processing.

An object of the present invention is to provide an effective method and apparatus for vectorizing a plurality of input instructions.

To solve the above problem, one aspect of the present invention relates to an apparatus having an optimization unit that searches a trace for two or more instructions having a common operation code and, when the two or more instructions are at the same level in a trace dependency tree, merges the two or more instructions into a single SIMD (Single Instruction Multiple Data) instruction, wherein the trace dependency tree has instructions at a plurality of levels, each level containing instructions of the same height, and the instructions of the trace are stored in a memory.

According to the present invention, an effective method and apparatus for vectorizing a plurality of input instructions can be provided.

FIG. 1 is a block diagram of a computer system according to an embodiment of the present invention.
FIG. 2 is a block diagram of an optimization unit according to an embodiment of the present invention.
FIG. 3 is a diagram of an exemplary dependency tree useful for describing a method of converting instructions into SIMD instructions according to an embodiment of the present invention.
FIG. 4 is a table useful for describing a vectorization process according to an embodiment of the present invention.
FIG. 5 is a table useful for describing a vectorization process according to another embodiment of the present invention.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, those skilled in the art will understand that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.

Some portions of the following detailed description are presented in terms of algorithms and symbolic representations of operations on data bits or binary digital signals within a computer memory. These algorithmic descriptions and representations may be the techniques used by those skilled in the data processing arts to convey the substance of their work to others skilled in the art.

As will be apparent from the following description, unless specifically stated otherwise, throughout this specification terms such as "processing", "calculating", and "determining" refer to the actions and/or processes of a computer, computing system, or similar electronic computing device that manipulates and/or transforms data represented as physical quantities, such as electronic quantities, within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers, or other such information storage, transmission, or display devices. In addition, the term "plurality" may be used throughout this specification to describe two or more components, devices, elements, parameters, and the like. For example, "a plurality of instructions" describes two or more instructions.

It should be understood that the terms "SIMDization" and "vectorization" are equivalent terms that refer to the process of merging operations that are scheduled for execution and that require similar execution resources, such as registers and functional units, into a single SIMD instruction. Although the scope of the present invention is not limited in this respect, for simplicity of description the term "vectorization" is used herein to describe this merging process.

It should be understood that the present invention may be used in a variety of applications. Although the present invention is not limited in this respect, the circuits and techniques disclosed herein may be used in many devices, such as computer systems, processors, and CPUs. Processors intended to be included within the scope of the present invention include, by way of example only, a RISC (Reduced Instruction Set Computer) processor, a processor having a pipeline, a CISC (Complex Instruction Set Computer) processor, and the like.

Some embodiments of the present invention may be implemented, for example, using a machine-readable medium or article that can store an instruction or a set of instructions that, when executed by a machine (for example, by a processor and/or another suitable machine), cause the machine to perform a method and/or operations in accordance with embodiments of the present invention. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software.
The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium, and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writable or rewritable media, digital or analog media, a hard disk, a floppy (registered trademark) disk, a CD-ROM (Compact Disk Read Only Memory), a CD-R (Compact Disk Recordable), a CD-RW (Compact Disk Rewritable), an optical disk, magnetic media, various types of DVD (Digital Versatile Disk), a tape, a cassette, and the like. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, or dynamic code, and may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled, and/or interpreted programming language, such as C, C++, Java (registered trademark), BASIC, Pascal, Fortran, Cobol, assembly language, or machine code.

Referring to FIG. 1, a block diagram of a computer system 100 according to an embodiment of the present invention is shown. Although the scope of the present invention is not limited in this respect, computer system 100 may be a personal computer (PC), a personal digital assistant (PDA), an Internet appliance, a cellular telephone, or any other computing device. In one example, computer system 100 may include a main processing unit 110 powered by a power supply 120. In embodiments of the present invention, main processing unit 110 may include a multi-processing unit 130 electrically coupled by a system interconnect 135 to a memory device 140 and one or more interface circuits 150. For example, system interconnect 135 may be an address/data bus, if desired. It should be understood that interconnects other than buses may be used to connect multi-processing unit 130 to memory device 140. For example, one or more dedicated lines and/or a crossbar may be used to connect multi-processing unit 130 to memory device 140.

According to some embodiments of the present invention, multi-processing unit 130 may include any type of processing unit, such as a processor from the Intel® Pentium™ family of microprocessors, the Intel® Itanium™ family of microprocessors, and/or the Intel® XScale™ family of processors. In addition, multi-processing unit 130 may include any type of cache memory, such as SRAM (Static Random Access Memory). Memory device 140 may include DRAM (Dynamic Random Access Memory), non-volatile memory, or the like. In one example, memory device 140 may store a software program that can be executed by multi-processing unit 130, if desired.

Although the scope of the present invention is not limited in this respect, interface circuit 150 may include an Ethernet (registered trademark) interface, a USB (Universal Serial Bus) interface, or the like. In embodiments of the present invention, one or more input devices 160 may be connected to interface circuit 150 for entering data and commands into main processing unit 110. For example, input devices 160 may include a keyboard, a mouse, a touch screen, a track pad, a track ball, an isopoint, a voice recognition system, and the like.

Although the scope of the present invention is not limited in this respect, output devices 170 may be operatively coupled to main processing unit 110 via one or more of the interface circuits 150 and may include one or more displays, printers, speakers, and/or other output devices, if desired. For example, one of the output devices may be a display. The display may be a CRT (Cathode Ray Tube), an LCD (Liquid Crystal Display), or any other type of display.

Although the scope of the present invention is not limited in this respect, computer system 100 may include one or more storage devices 180. For example, computer system 100 may include one or more hard drives, one or more CD drives, one or more DVD drives, and/or other computer media input/output (I/O) devices, if desired.

Although the scope of the present invention is not limited in this respect, computer system 100 may exchange data with other devices via a connection to a network 190. The network connection may be any type of network connection, such as an Ethernet (registered trademark) connection, a digital subscriber line (DSL), a telephone line, or a coaxial cable. Network 190 may be any type of network, such as the Internet, a telephone network, a cable network, or a wireless network.

Although the scope of the present invention is not limited to this embodiment, in this exemplary embodiment multi-processing unit 130 may include an optimization unit 200. According to embodiments of the present invention, optimization unit 200 may search a trace for two or more candidate instructions. Furthermore, optimization unit 200 may merge the two or more candidate instructions into a SIMD instruction according to the depth of a trace dependency tree. In some embodiments of the present invention, the candidate instructions may include operation codes of a similar and/or the same type, which may be included in the SIMD instruction. For example, optimization unit 200 may search for candidate instructions that perform similar operations based on the dependency depth of the candidate instructions. According to embodiments of the present invention, optimization unit 200 may merge at least some of the candidate instructions into a SIMD instruction, if desired. Although the scope of the present invention is not limited in this respect, it should be understood that optimization unit 200 may be implemented in software, in hardware, or in any suitable combination of software and hardware.

Referring to FIG. 2, a block diagram of optimization unit 200 of FIG. 1 according to an embodiment of the present invention is shown. Although the scope of the present invention is not limited in this respect, optimization unit 200 may include an input trace buffer 210, a sequencer 220, a vectorization unit 230, and an output trace buffer 240. Although the scope of the present invention is not limited in this respect, in some embodiments of the present invention vectorization unit 230 may include a first stage 232, a second stage 234, and a memory 236, such as a cache memory.

Although the scope of the present invention is not limited in this respect, input trace buffer 210 may receive a trace of instructions that include operation codes (opcodes). In some embodiments of the present invention, sequencer 220 may receive instructions from input trace buffer 210 and provide the opcodes and/or the instruction trace (e.g., as a sequence) to vectorization unit 230. For example, the instructions may include at least two types of operations: memory operations such as LOAD and STORE, and arithmetic operations such as ADD, SUBTRACT, MULT, SHIFT, and AND. In addition, an instruction may include input values and output values, such as registers and constants.

According to embodiments of the present invention, vectorization unit 230 may receive the trace from sequencer 220 and search for candidate instructions according to the trace dependencies. In some embodiments of the present invention, first stage 232 may process the instructions and/or opcodes received from sequencer 220. For example, the instructions and/or opcodes of the trace may be converted into SSA (static single assignment) form. In SSA form, a register may be written only once in the trace, and the conversion may introduce "virtual" register names to satisfy the SSA condition. Although the scope of the present invention is not limited in this respect, program code such as code written for a conventional ISA (Instruction Set Architecture) may present two source registers that have the same name as the same register.
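As a rough illustration of the SSA conversion described above, the following Python sketch renames destination registers so that each name is written only once in a trace. The tuple layout, the function name `to_ssa`, and the `REG.n` naming scheme for virtual registers are assumptions made for this example; they do not come from the patent.

```python
def to_ssa(trace):
    """Rename registers so every destination is written exactly once.

    Each instruction is a (dest, opcode, sources) tuple. This layout and
    the 'REG.n' virtual-name scheme are illustrative assumptions.
    """
    version = {}   # architectural register -> latest version number
    current = {}   # architectural register -> current virtual name
    out = []
    for dest, opcode, sources in trace:
        # Sources read the most recent virtual name, if one exists.
        new_sources = tuple(current.get(s, s) for s in sources)
        # Each write gets a fresh virtual name.
        version[dest] = version.get(dest, 0) + 1
        current[dest] = f"{dest}.{version[dest]}"
        out.append((current[dest], opcode, new_sources))
    return out


trace = [
    ("EAX", "LOAD", ("ESP",)),
    ("EAX", "ADD", ("EAX", "EBX")),
    ("ECX", "ADD", ("EAX", "EAX")),
]
ssa = to_ssa(trace)
```

In this sketch the second write to EAX receives a new virtual name, so the third instruction reads the renamed value rather than the first one.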

Although the scope of the present invention is not limited in this respect, first stage 232 may search for candidates for vectorization by placing the instructions in a dependency tree.

Referring to FIG. 3, an exemplary dependency tree 300 useful for describing a method of generating SIMD instructions according to an embodiment of the present invention is shown. Although the scope of the present invention is not limited in this respect, dependency tree 300 may include instructions at different heights, and each level of dependency tree 300 may include instructions of the same height. A first level 310 may include instructions 312 and 314, a second level 320 may include instruction 322, a third level 330 may include instructions 332 and 334, and a fourth level 340 may include instruction 342. Furthermore, the depth of dependency tree 300 may be calculated according to the distance from the first height (level 310) to the last height (level 340) of dependency tree 300 (for example, the distance may be indicated by the arrows between levels).
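The level structure can be sketched in Python as follows. The instruction numerals follow FIG. 3, but the dependency edges are hypothetical because the figure is not reproduced in this text. Counting depth as the number of steps between the first and last level is also an assumption.

```python
from collections import defaultdict

# Assumed dependency edges (instruction -> instructions it depends on).
# The numerals follow FIG. 3; the edges themselves are hypothetical.
depends_on = {
    312: [], 314: [],
    322: [312, 314],
    332: [322], 334: [322],
    342: [332, 334],
}


def height(instr):
    """Height 1 for an instruction with no predecessors,
    otherwise 1 + the largest predecessor height."""
    preds = depends_on[instr]
    return 1 + max((height(p) for p in preds), default=0)


# Group instructions of equal height into levels.
levels = defaultdict(list)
for instr in depends_on:
    levels[height(instr)].append(instr)

# Distance from the first level to the last level.
depth = max(levels) - min(levels)
```

With these assumed edges, the grouping reproduces the four levels listed above, and instructions that share a level (such as 312 and 314) do not depend on each other.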

Referring again to FIG. 2, although the scope of the present invention is not limited in this respect, first stage 232 may store the candidate instructions for vectorization in memory 236. According to embodiments of the present invention, second stage 234 may search memory 236 for similar opcodes that have the same or a similar level and may generate a SIMD instruction. Furthermore, second stage 234 may replace the original trace instructions with the SIMD instruction and may store the SIMD instruction in output trace buffer 240.

Although the scope of the present invention is not limited in this respect, the operations of first stage 232 and second stage 234 of optimization unit 200 may be described by an exemplary C-like pseudo-code algorithm.

Although the scope of the present invention is not limited in this respect, the first part of the C-like pseudo-code algorithm defines constants, variable structures, and the like.

For example, the pseudo-code defines constants for the maximum number of instructions in a trace, the maximum number of sources of an instruction, the maximum number of destinations of an instruction, and the trace range and internal buffer size.

According to the C-like pseudo-code algorithm, an instruction structure may include a Boolean variable indicating whether the instruction is suitable for vectorization, destination registers, an opcode, and source registers.
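The patent's own listings for these constants and for the instruction structure are not reproduced in this text. The following Python sketch only illustrates the kind of declarations described. `MAX_TRACE_SIZE` is a name used in the text, while `MAX_SOURCES`, `MAX_DESTS`, all numeric values, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_TRACE_SIZE = 256   # name appears in the text; the value is assumed
MAX_SOURCES = 3        # hypothetical name and value
MAX_DESTS = 2          # hypothetical name and value


@dataclass
class Instruction:
    vectorizable: bool           # whether the instruction suits vectorization
    dests: Tuple[str, ...]       # destination registers
    opcode: str                  # operation code, e.g. "LOAD" or "ADD"
    sources: Tuple[object, ...]  # source registers or constants

    def __post_init__(self):
        # Enforce the (assumed) operand limits.
        if len(self.sources) > MAX_SOURCES or len(self.dests) > MAX_DESTS:
            raise ValueError("too many sources or destinations")


load = Instruction(True, ("EAX",), "LOAD", ("ESP", 4))
```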

According to the C-like pseudo-code algorithm, a trace is defined as a sequence of at most MAX_TRACE_SIZE instructions, represented by a vector of MAX_TRACE_SIZE entries. In addition, a two-dimensional (2D) trace dependency bitmap may be used to indicate the validity of the instructions of the trace. If the actual number of instructions in the trace is INITIAL_TRACE_SIZE, only the first INITIAL_TRACE_SIZE entries may be valid.

According to the C-like pseudo-code algorithm, the SIMD matrix stored in memory 236 may include opcodes and may hold N lines of M opcode positions (for example, a total of N × M × log(MAX_TRACE_SIZE) bits).
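The storage estimate can be checked with a small Python sketch. It assumes that the logarithm is base 2, rounded up, and that each matrix position holds an index into the trace. Both are interpretations, and the function name and sample sizes are hypothetical.

```python
def simd_matrix_bits(n_lines, m_slots, max_trace_size):
    """Bits needed for N lines of M positions, each holding a trace index."""
    # ceil(log2(max_trace_size)) for sizes >= 2, with a minimum of 1 bit.
    index_bits = max(1, (max_trace_size - 1).bit_length())
    return n_lines * m_slots * index_bits


bits = simd_matrix_bits(8, 4, 256)   # 8 lines x 4 positions x 8 bits
```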

Although the scope of the present invention is not limited in this respect, in this exemplary algorithm first stage 232 of vectorization unit 230 may search for candidate instructions by iterating over the instructions of the trace in ascending order. First stage 232 may compare the set of all predecessors of trace[i] constructed during the renaming process. Furthermore, first stage 232 may tag the height (e.g., level) of each instruction in the dependency tree (e.g., dependency tree 300) by computing the dependency height (e.g., level) of trace[i] and its earliest possible scheduling position.
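A minimal Python sketch of the height-tagging part of this step is given below. It assumes the trace is a list of (dest, opcode, sources) tuples and finds predecessors through the last writer of each source register. The function name and data layout are hypothetical, and the earliest-scheduling-position calculation mentioned above is not modeled.

```python
def tag_heights(trace):
    """Return (predecessors, heights) for a trace of (dest, opcode, sources)."""
    last_writer = {}   # register -> index of the instruction that last wrote it
    preds, heights = [], []
    for i, (dest, _opcode, sources) in enumerate(trace):   # ascending order
        # Predecessors are the producers of this instruction's sources.
        p = {last_writer[s] for s in sources if s in last_writer}
        preds.append(p)
        # Height is one more than the tallest predecessor (1 if none).
        heights.append(1 + max((heights[j] for j in p), default=0))
        last_writer[dest] = i
    return preds, heights


trace = [
    ("EAX", "LOAD", ("ESP", 4)),
    ("EBX", "LOAD", ("ESP", 8)),
    ("ECX", "ADD", ("EAX", "EBX")),
    ("EDX", "ADD", ("ECX", 1)),
]
preds, heights = tag_heights(trace)
```

Here the two LOAD instructions have no predecessors in the trace and share height 1, which is what makes them candidates for merging.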

Although the scope of the present invention is not limited in this respect, in this exemplary C-like pseudo-code algorithm second stage 234 may search memory 236 (e.g., the SIMD matrix) for instructions suitable for vectorization. For example, a suitable instruction may be an earlier instruction trace[j] at the same dependency tree height (e.g., level). Furthermore, second stage 234 may generate a SIMD instruction and replace the original instructions with the SIMD instruction.
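The second stage can be sketched in Python as a grouping by opcode and height. This is an illustration under stated assumptions, not the patent's listing. The dictionary stands in for the SIMD matrix, the `SIMD_` prefix and tuple layout are hypothetical, the trace is assumed to be in SSA form, and operand-compatibility checks and memory-ordering constraints are ignored.

```python
def vectorize(trace, heights):
    """Merge instructions that share an opcode and a dependency height.

    trace entries are (dest, opcode, sources). Output entries are
    (dests, opcode or "SIMD_" + opcode, list of source tuples).
    """
    groups = {}   # (height, opcode) -> member indices (stand-in for the matrix)
    for i, (_dest, opcode, _sources) in enumerate(trace):
        groups.setdefault((heights[i], opcode), []).append(i)

    out = []
    # Emitting level by level keeps every instruction after its predecessors.
    for (_h, opcode), members in sorted(groups.items(),
                                        key=lambda kv: (kv[0][0], kv[1][0])):
        dests = tuple(trace[j][0] for j in members)
        srcs = [trace[j][2] for j in members]
        name = opcode if len(members) == 1 else "SIMD_" + opcode
        out.append((dests, name, srcs))
    return out


trace = [
    ("EAX", "LOAD", ("ESP", 4)),
    ("EBX", "LOAD", ("ESP", 8)),
    ("ECX", "ADD", ("EAX", "EBX")),
    ("EDX", "ADD", ("ECX", 1)),
]
heights = [1, 1, 2, 3]   # as a first stage might have tagged them
result = vectorize(trace, heights)
```

The two ADD instructions are left unmerged because they sit at different heights, meaning one depends on the other.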

According to some embodiments of the present invention, optimization unit 200 may generate SIMD instructions according to the rule that two instructions that access memory are combined into a single SIMD instruction if they access consecutive memory addresses. That is, whether the data accessed by the two instructions is adjacent (at least in the virtual memory space) can be computed from their memory addresses and the corresponding data widths. For example, in a trace that includes the following instructions:
1. LOAD 4 bytes from ESP+4
2. LOAD 4 bytes from ESP+12
3. LOAD 4 bytes from ESP+8
the instructions may be combined, if desired, into the single SIMD instruction "LOAD 12 bytes from ESP+4".
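The adjacency check can be illustrated with a short Python sketch. It assumes each load is described by a (base register, offset, size) tuple, and the function name is hypothetical. Loads are merged only when, after sorting by offset, each access starts exactly where the previous one ends.

```python
def merge_contiguous_loads(loads):
    """Merge (base, offset, size) loads that cover one unbroken byte range.

    Returns a single (base, offset, size) load, or None if the loads use
    different base registers or leave gaps or overlaps between them.
    """
    if not loads or len({base for base, _, _ in loads}) != 1:
        return None
    ordered = sorted(loads, key=lambda ld: ld[1])
    for (_, off, size), (_, next_off, _) in zip(ordered, ordered[1:]):
        if off + size != next_off:   # next access must start where this one ends
            return None
    base, first_off, _ = ordered[0]
    total = sum(size for _, _, size in ordered)
    return (base, first_off, total)


# The three loads from the example: ESP+4, ESP+12, and ESP+8, 4 bytes each.
merged = merge_contiguous_loads([("ESP", 4, 4), ("ESP", 12, 4), ("ESP", 8, 4)])
```

For the three loads in the example, the sketch yields a single 12-byte load from ESP+4. Dropping the load at ESP+8 leaves a gap, so no merge is produced.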

Referring to FIG. 4, a table 400 is shown. Although the scope of the present invention is not limited in this respect, table 400 may include a level column showing the level of each instruction in the dependency tree (e.g., dependency tree 300), an original trace column showing the original instructions provided by input trace buffer 210 and sequencer 220, and a vectorized trace column showing the instructions in output trace buffer 240. Each row of table 400 may show the level of an instruction, the original instruction, and the instruction after vectorization.

Although the scope of the present invention is not limited in this respect, optimization unit 200 may tag the depth of the trace dependency graph (e.g., the height of each trace instruction). For example, according to table 400, optimization unit 200 may identify the instructions "EAX ← LOAD(ESP, 4)" and "EBX ← LOAD(ESP, 8)", which are at the same level (e.g., level 2), as candidates for vectorization and, if desired, combine them into the SIMD instruction "EAX, EBX ← SIMD_LOAD(ESP, 4)". Although the scope of the present invention is not limited in this respect, optimization unit 200 may generate SIMD instructions according to the rule that two instructions that perform a common operation (e.g., LOAD) and are at the same depth (e.g., height) of the trace dependency graph are combined into a single SIMD instruction (e.g., SIMD_LOAD) if all of their non-constant sources (e.g., registers) are similar and/or their constant or immediate sources differ.
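One possible reading of this rule is sketched below in Python. It assumes instructions are (dest, opcode, sources) tuples in which register sources are strings and constant or immediate sources are integers. The function name and this encoding are hypothetical, and the "and/or" in the rule is interpreted as "register sources must match, immediates may differ".

```python
def can_merge(a, b, height_a, height_b):
    """Rule check for two (dest, opcode, sources) instructions.

    Mergeable when opcode and dependency height match and the
    non-constant sources (register names, modeled as str) are identical.
    Constant sources (modeled as int) are allowed to differ.
    """
    _, op_a, src_a = a
    _, op_b, src_b = b
    if op_a != op_b or height_a != height_b or len(src_a) != len(src_b):
        return False
    return all(x == y or (isinstance(x, int) and isinstance(y, int))
               for x, y in zip(src_a, src_b))


load_eax = ("EAX", "LOAD", ("ESP", 4))
load_ebx = ("EBX", "LOAD", ("ESP", 8))
ok = can_merge(load_eax, load_ebx, 2, 2)
```

The two LOAD instructions from table 400 pass the check because they share the ESP source and differ only in the immediate offset.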

Referring to FIG. 5, a table 500 according to another embodiment of the present invention is shown. Although the scope of the present invention is not limited in this respect, table 500 may include a level column showing the level of each original instruction in the dependency tree (e.g., dependency tree 300), an original trace column showing the original instructions provided by input trace buffer 210 and sequencer 220, a level column showing the level of each instruction after basic transformations such as SSA conversion, a column showing the transformed instructions, and a column showing the instructions of the vectorized trace in output trace buffer 240. Each row of table 500 may show the level of an original instruction, the original instruction, the level of the instruction after the basic transformations, the instruction after the basic transformations, and the instruction after vectorization.

Although the scope of the present invention is not limited in this respect, according to exemplary table 500, the optimization unit 200 may tag the heights of the original instructions in the trace. The optimization unit 200 may convert the instructions of the trace into SSA form, and may then transform the instructions of the trace by taking advantage of the fact that the trace is in SSA form. The optimization unit 200 may tag transformed instructions that are at the same level as candidates for vectorization, for example

[instruction listing not reproduced]

and may combine them into SIMD instructions

[instruction listing not reproduced]

respectively.
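The sequence described above (SSA conversion, level tagging, and selection of same-level candidates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the flow of table 500: the tuple encoding of instructions, the helper names `to_ssa`, `levels`, and `group_candidates`, and the four-instruction `example_trace` are all hypothetical. The sketch assumes that an instruction with no producer inside the trace is at level 1.

```python
from typing import Dict, List, Tuple

# (destination, operation code, sources); immediates are written as strings.
Instr = Tuple[str, str, Tuple[str, ...]]


def to_ssa(trace: List[Instr]) -> List[Instr]:
    """Rename registers so that every destination is written exactly once."""
    version: Dict[str, int] = {}
    out: List[Instr] = []
    for dest, op, srcs in trace:
        # Reads refer to the latest version of a register, if one exists.
        new_srcs = tuple(f"{s}_{version[s]}" if s in version else s for s in srcs)
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", op, new_srcs))
    return out


def levels(ssa_trace: List[Instr]) -> List[int]:
    """Level = 1 + highest level among the producers of the sources."""
    level_of: Dict[str, int] = {}
    result: List[int] = []
    for dest, _op, srcs in ssa_trace:
        lvl = 1 + max((level_of.get(s, 0) for s in srcs), default=0)
        level_of[dest] = lvl
        result.append(lvl)
    return result


def group_candidates(ssa_trace: List[Instr]) -> Dict[Tuple[int, str], List[Instr]]:
    """Group instructions by (level, operation); keep groups of two or more."""
    groups: Dict[Tuple[int, str], List[Instr]] = {}
    for instr, lvl in zip(ssa_trace, levels(ssa_trace)):
        groups.setdefault((lvl, instr[1]), []).append(instr)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


# Hypothetical trace: EAX is reused by the last LOAD after the ADD reads it.
example_trace: List[Instr] = [
    ("EAX", "LOAD", ("ESP", "4")),
    ("EBX", "LOAD", ("ESP", "8")),
    ("EAX", "ADD", ("EAX", "EBX")),
    ("EAX", "LOAD", ("ESP", "12")),
]
ssa_trace = to_ssa(example_trace)
candidates = group_candidates(ssa_trace)
```

In this sketch the renaming gives the final LOAD its own destination (`EAX_3`), so it no longer conflicts with the earlier uses of `EAX` and lands at level 1 together with the first two LOADs. Whether the optimization unit 200 forms its groups in this way is not stated in the text.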

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes and equivalents will occur to those skilled in the art. It is therefore to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

100 Computer system
110 Main processing unit
120 Power supply
130 Multi-processing unit
140 Storage device

Claims

An apparatus comprising an optimization unit to search for two or more instructions of a trace that have a common operation code and, when the two or more instructions are at the same level in a trace dependency tree, to merge the two or more instructions into one SIMD (Single Instruction Multiple Data) instruction,
wherein the trace dependency tree has instructions at a plurality of levels, each level having instructions of the same height, and
wherein the instructions of the trace are stored in a memory.

The apparatus of claim 1, wherein the common operation code comprises a memory operation code or an arithmetic operation code.

The apparatus of claim 1, wherein the optimization unit comprises:
a first stage to search for the two or more instructions in a trace of instructions received from a sequencer by placing the two or more instructions in the trace dependency tree;
a cache memory to store the instructions of the trace dependency tree; and
a second stage to search the cache memory for the two or more instructions in order to generate the one SIMD instruction.

The apparatus of claim 1, wherein the optimization unit is capable of combining the two or more instructions, which access consecutive memory addresses, into the one SIMD instruction.

The apparatus of claim 1, wherein the optimization unit is capable of converting an input instruction into an SSA (Single Static Assignment) format.

A method comprising:
searching for two or more instructions of a trace that have a common operation code; and
merging the two or more instructions into one SIMD (Single Instruction Multiple Data) instruction if the two or more instructions are at the same level in a trace dependency tree,
wherein the trace dependency tree has a plurality of levels, each level having instructions of the same height.

The method of claim 6, comprising selecting the common operation code, the common operation code being a memory operation code or an arithmetic operation code.

The method of claim 6, comprising combining the two or more instructions, which access consecutive memory addresses, into the one SIMD instruction.

The method of claim 6, wherein the merging comprises converting an input instruction of the trace into an SSA (Single Static Assignment) format.

A system comprising:
a bus;
a storage device connected to the bus; and
a processor having an optimization unit to search for two or more instructions of a trace that have a common operation code and, when the two or more instructions are at the same level in a trace dependency tree, to merge the two or more instructions into one SIMD (Single Instruction Multiple Data) instruction,
wherein the trace dependency tree has instructions at a plurality of levels, each level having instructions of the same height, and
wherein the instructions of the trace are stored in the storage device.

The system of claim 10, wherein the common operation code comprises a memory operation code or an arithmetic operation code.

The system of claim 10, wherein the optimization unit comprises:
a first stage to search for the two or more instructions in a trace of instructions received from a sequencer by placing the two or more instructions in the trace dependency tree;
a cache memory to store the instructions of the trace dependency tree; and
a second stage to search the cache memory for the two or more instructions in order to generate the one SIMD instruction.

The system of claim 10, wherein the optimization unit is capable of combining the two or more instructions, which access consecutive memory addresses, into the one SIMD instruction.

The system of claim 10, wherein the optimization unit is capable of converting an input instruction into an SSA (Single Static Assignment) format.

A program for causing a computer to execute:
searching for two or more instructions of a trace that have a common operation code; and
merging the two or more instructions into one SIMD (Single Instruction Multiple Data) instruction if the two or more instructions are at the same level in a trace dependency tree,
wherein the trace dependency tree has a plurality of levels, each level having instructions of the same height.

The program according to claim 15, further causing the computer to execute a step of selecting the common operation code, the common operation code being a memory operation code or an arithmetic operation code.

The program according to claim 15, further causing the computer to execute a step of combining the two or more instructions, which access consecutive memory addresses, into the one SIMD instruction.

The program according to claim 15, further causing the computer to execute a step of converting an input instruction into an SSA (Single Static Assignment) format.