JP6651485B2

JP6651485B2 - Instruction and logic circuits for processing character strings

Info

Publication number: JP6651485B2
Application number: JP2017154302A
Authority: JP
Inventors: ジュリア，マイケル，エー．; グレイ，ジェフリー，ディー．; チェヌパティー，スリニヴァス; マーケス，ショーン，ピー．; セコニ，マーク，ピー．
Original assignee: Intel Corporation
Priority date: 2006-09-22
Filing date: 2017-08-09
Publication date: 2020-02-19
Anticipated expiration: 2027-09-21
Also published as: US11029955B2; JP6567285B2; DE102007063911B3; US20110246751A1; US9804848B2; US8819394B2; JP2015143992A; JP7052171B2; CN105607890A; US9772846B2; JP7452930B2; CN104657113B; KR101300362B1; KR20110038185A; JP6193281B2; JP2022050519A; US20180052689A1; US20150106594A1; DE102007045496B4; CN108052348B

Description

The present disclosure relates to the field of processors and associated software and software sequences that perform logical and mathematical operations.

Computer systems have become increasingly pervasive in our society. The processing capabilities of computers have increased the efficiency and productivity of workers in a wide range of professions. As the cost of purchasing and owning a computer continues to fall, more and more consumers can take advantage of newer and faster machines. Furthermore, many people enjoy using notebook computers because of the freedom they offer. Portable computers let users easily carry their data and keep working when they leave the office or travel. This scenario is familiar to sales staff, managers, and even students.

As processor technology advances, newer software is also being developed to run on machines with these processors. Users generally expect and demand higher performance from their computers regardless of the type of software they use. One source of performance problems is the kinds of instructions and operations that are actually executed inside the processor. Certain types of operations take longer to complete than others because of the complexity of the operation, the type of circuitry it requires, or both. This provides the motivation to optimize the way certain complex operations are executed inside the processor.

Communications applications have driven microprocessor development for more than a decade. In fact, the boundary between computing and communication has become increasingly blurred, due in part to the use of character strings in communications applications. String applications are widespread in the consumer market and across many devices, from mobile phones to personal computers, and these devices are required to process string information ever faster. Devices that communicate character strings continue to evolve into computing and communication devices that run applications such as Microsoft® Instant Messenger™, e-mail applications (e.g., Microsoft® Outlook™), and mobile phone text applications. As a result, tomorrow's personal computing and communication experience is expected to be even richer in its ability to handle strings.

Accordingly, processing or parsing the string information exchanged between computing or communication devices has become increasingly important for current computing and communication devices. In particular, the interpretation of strings of character information by a communication or computing device includes some of the most important operations performed on string data. Such operations may be computationally intensive, yet they offer a high level of data parallelism. This parallelism can be exploited through efficient implementations that use various data storage devices, such as single instruction multiple data (SIMD) registers. Many current computer architectures also require multiple operations, instructions, or lower-level instructions (often called "micro-instructions" or "μops") to perform various logical and mathematical operations on a number of operands. This reduces throughput and increases the number of clock cycles required to perform those logical and mathematical operations.
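As a rough illustration of the data parallelism mentioned above, the following Python sketch models a SIMD register as one integer that holds several packed elements. The 128-bit register width, the 8-bit element size, and the names `pack` and `unpack` are assumptions chosen for illustration; they are not taken from this disclosure.

```python
LANES = 16  # assumed: a 128-bit register divided into 8-bit elements


def pack(text):
    """Pack up to 16 ASCII characters into one integer, element 0 in the low byte."""
    data = text.encode("ascii")
    if len(data) > LANES:
        raise ValueError("too many elements for one register")
    # Unused elements are filled with zero bytes.
    return int.from_bytes(data.ljust(LANES, b"\0"), "little")


def unpack(register):
    """Return the 16 element values held in the modeled register."""
    return list(register.to_bytes(LANES, "little"))


register = pack("abc")
print(unpack(register)[:4])
```

Because all sixteen elements travel together in one value, a single operation on the register can in principle act on every element at once, which is the property a packed instruction relies on.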

For example, an instruction sequence consisting of many instructions may be needed to perform the one or more operations required to interpret a particular word in a string. These operations include comparing two or more string words represented by various data types within a processor, system, or computer program. However, such conventional techniques may require many processing cycles, and the processor or system may consume unnecessary power to produce the result. Furthermore, some conventional techniques are limited in the data types on which they can operate.

According to one aspect of the present invention, a machine-readable medium having instructions stored thereon is provided. The instructions, when executed by a machine, cause the machine to perform a method that includes comparing each data element of a first packed operand with each data element of a second packed operand, and storing a first result of the comparison.

A block diagram of a computer system that includes a processor whose execution unit executes an instruction for a string comparison operation according to one embodiment of the present invention.
A block diagram of another example computer system according to another embodiment of the present invention.
A block diagram of yet another example computer system according to yet another embodiment of the present invention.
A block diagram of the microarchitecture of a processor according to one embodiment, including logic circuits that perform one or more string comparison operations according to the present invention.
Representations of various packed data types in multimedia registers according to one embodiment of the present invention.
Packed data types according to another embodiment.
Representations of various signed and unsigned packed data types in multimedia registers according to one embodiment of the present invention.
One embodiment of an operation encoding (opcode) format.
Another operation encoding (opcode) format.
Yet another operation encoding format.
A block diagram of a logic circuit that performs at least one string comparison operation on one or more single-precision packed data operands according to one embodiment of the present invention.
A block diagram of an array that may be used to perform at least one string comparison operation according to one embodiment.
Operations that may be performed in one embodiment of the present invention.

The present invention is described by way of example and is not limited by the embodiments or the accompanying drawings.

The following description sets out embodiments of a technique for performing comparison operations between the elements of strings within a processor, computer system, or software program. Numerous specific details, such as processor types, microarchitectural conditions, events, and enablement mechanisms, are set forth to provide a more thorough understanding of the present invention. Those skilled in the art will appreciate, however, that the invention may be practiced without such specific details. In addition, some well-known structures, circuits, and the like are not shown in detail to avoid unnecessarily obscuring the present invention.

The following embodiments are described with reference to a processor, but other embodiments are applicable to other types of integrated circuits and logic devices. The same techniques and teachings can readily be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of the present invention apply to any processor or machine that performs data manipulation. The present invention is not limited to processors or machines that perform 256-bit, 128-bit, 64-bit, 32-bit, or 16-bit data operations; it can be applied to any processor or machine that needs to operate on packed data.

In the following description, numerous specific details are set forth for purposes of explanation, to provide a thorough understanding of the present invention. Those skilled in the art will understand, however, that these specific details are not required to practice the invention. In some instances, well-known electrical structures and circuits are not described in particular detail to avoid unnecessarily obscuring the present invention. In addition, the following description and the accompanying drawings provide various examples for purposes of illustration. These examples should not be construed in a limiting sense: they are intended only to illustrate the present invention, not to provide an exhaustive list of all its possible implementations.

The following examples describe instruction handling and distribution in the context of execution units and logic circuits, but other embodiments of the present invention can be accomplished in software. In one embodiment, the methods of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor programmed with them to perform the steps of the present invention. The present invention may be provided as a computer program product or software, which may include a machine-readable or computer-readable medium having instructions stored thereon. The instructions may be used to program a computer (or other electronic device) to perform a process according to the present invention. Alternatively, the steps of the present invention may be performed by specific hardware components that contain hardwired logic for performing them, or by any combination of programmed computer components and custom hardware components. Such software can be stored in a memory of the system. Similarly, the instructions can be distributed over a network or by way of other computer-readable media.

Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). Machine-readable media include, but are not limited to: floppy diskettes; optical disks; compact discs; CD-ROMs; magneto-optical disks; ROM; RAM; EPROM; EEPROM; magnetic or optical cards; flash memory; transmission over the Internet; electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); or the like. Accordingly, a computer-readable medium includes any type of medium or machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer). Moreover, the present invention may be downloaded as a computer program product. That is, the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client). The transfer may take place by way of electrical, optical, acoustical, or other forms of data signals embodied in a carrier wave or other propagation medium, over a communication link (e.g., a modem connection, a network connection, or the like).

A design may go through various stages, from creation through simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulation, the hardware may be represented using a hardware description language or another functional description language. In addition, a circuit-level model with logic and/or transistor gates may be produced at some stage of the design process. Furthermore, most designs at some stage reach a level of data that represents the physical placement of the various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be data specifying the presence or absence of various features on the different layers of the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. The machine-readable medium may be an optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or magnetic or optical storage (e.g., a disc). Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that the electrical signal is copied, buffered, or retransmitted. Thus, a communication provider or a network provider may make copies of an article (i.e., a carrier wave) embodying the techniques of the present invention.

Modern processors use a number of different execution units to process and execute a variety of instructions. Not all instructions are created equal: some complete quickly, while others can take an enormous number of clock cycles. The faster instructions execute, the better the overall performance of the processor, so it is advantageous to execute as many instructions as possible, as fast as possible. However, some instructions are far more complex than others and therefore require more execution time and processor resources. Examples include floating-point instructions, load/store operations, and data moves.

As more and more computer systems are used for Internet, text, and multimedia applications, additional processor support has been introduced over time. For example, single instruction multiple data (SIMD) integer and floating-point instructions and Streaming SIMD Extensions (SSE) are instructions that reduce the total number of instructions required to execute a particular program task, which in turn can reduce power consumption. These instructions can speed up software by operating on multiple data elements in parallel. As a result, performance gains can be achieved in a wide range of applications, including video, speech, and image and photo processing. Implementing SIMD instructions in microprocessors and similar logic circuits typically involves a number of issues. Furthermore, the complexity of SIMD operations often requires additional circuitry to process and manipulate the data correctly.

Currently, there is no SIMD instruction that compares each data element of at least two packed operands. Without a SIMD packed compare instruction such as the one performed in an embodiment of the present invention, many instructions and data registers may be needed to achieve the same result in applications that interpret, compress and decompress, process, and manipulate character strings. In the embodiments disclosed here, "character string" comparisons and "string" comparisons are referred to interchangeably. However, embodiments of the present invention may be applied to any string of information (e.g., a string of characters, numbers, or other data).
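To make the cost concrete, the following Python sketch shows a scalar baseline in which every element of one string is compared with every element of another, one comparison at a time. The function name `scalar_compare_each` and the idea of counting comparisons are illustrative assumptions, not part of this disclosure.

```python
def scalar_compare_each(src1, src2):
    """Compare every element of src1 with every element of src2, one pair per step.

    Returns (flags, compares): flags[i] is True when src1[i] equals at least
    one element of src2; compares is the number of single comparisons made.
    """
    compares = 0
    flags = []
    for a in src1:
        hit = False
        for b in src2:
            compares += 1  # each pair costs one separate comparison
            if a == b:
                hit = True
        flags.append(hit)
    return flags, compares


flags, compares = scalar_compare_each("abc", "cx")
print(flags, compares)
```

For operands of n and m elements the loop performs n × m separate comparisons, which is the kind of work a single packed compare operation is meant to absorb.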

Thus, at least one string comparison instruction according to embodiments of the present invention can reduce code overhead and resource requirements. Embodiments of the present invention provide a way to implement string parsing operations as algorithms that use SIMD-related hardware. At present, performing string parsing operations on data in SIMD registers is somewhat difficult and tedious. Some algorithms require more instructions to arrange the data for arithmetic operations than the actual number of instructions that perform those operations. By implementing string comparison operations according to embodiments of the present invention, the number of instructions needed to process strings can be reduced significantly.

Embodiments of the present invention include instructions for implementing one or more string comparison operations. A string comparison operation generally involves comparing data elements from two strings of data to determine which data elements match. Other variations may be made on the generic string comparison algorithm, and these are disclosed later. In a generalized sense, one embodiment of a string comparison operation applies to the individual data elements in two packed operands that represent two strings of data. This embodiment can be represented generically as follows:
DEST1 <- SRC1 cmp SRC2;
For packed SIMD data operands, this generic operation can be applied to each data element position of each operand.
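The generic operation above can be sketched in Python as an all-pairs comparison. This is a minimal sketch under stated assumptions: the operands are modeled as Python sequences, `cmp` is taken to be equality, and the row and column layout of the result is chosen for illustration rather than taken from this disclosure.

```python
def packed_compare(src1, src2):
    """Model DEST1 <- SRC1 cmp SRC2 with equality as the comparison.

    result[i][j] is 1 when src1[i] == src2[j], else 0 (assumed layout).
    """
    return [[int(a == b) for b in src2] for a in src1]


dest1 = packed_compare("ab", "ba")
for row in dest1:
    print(row)
```

Every element of the first operand meets every element of the second, so the result holds one comparison outcome per pair of element positions.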

In the operation above, "DEST" and "SRC" are generic terms that represent the destination and source of the corresponding data or operation. In some embodiments, they may be implemented by registers, memory, or other storage areas having names or functions different from those depicted. For example, in one embodiment, DEST1 is a temporary storage register or other storage area, while SRC1 and SRC2 are first and second source storage registers or other storage areas. In other embodiments, the SRC and DEST storage areas correspond to different data storage elements within the same storage area (e.g., a SIMD register).

Furthermore, in one embodiment, a string comparison operation generates an indicator of whether each element of one source register is equal to each element of the other source register, and stores the indicator in a register such as DEST1. In one embodiment, the indicator is an index value. In another embodiment, the indicator is a mask value. In other embodiments, the indicator represents another data structure or a pointer.
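The difference between a mask indicator and an index indicator can be sketched as follows. The helper names, the choice of bit 0 for the first element, the "first match" rule for the index, and the use of the element count as a "no match" value are all assumptions made for illustration; the disclosure does not fix them at this point.

```python
def equal_any(src1, src2):
    """Flag each element of src2 that equals at least one element of src1."""
    return [any(b == a for a in src1) for b in src2]


def to_mask(flags):
    """Mask indicator: bit i is set when element i matched (assumed bit order)."""
    mask = 0
    for position, flag in enumerate(flags):
        if flag:
            mask |= 1 << position
    return mask


def to_index(flags):
    """Index indicator: position of the first match, or len(flags) if none (assumed)."""
    for position, flag in enumerate(flags):
        if flag:
            return position
    return len(flags)


flags = equal_any("aeiou", "strings")
print(to_mask(flags), to_index(flags))
```

Here only the "i" at position 3 of "strings" is a vowel, so the mask has bit 3 set (value 8) and the index is 3. A mask keeps every match, while an index reports a single position.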

FIG. 1A is a block diagram of an exemplary computer system having a processor that includes an execution unit for executing an instruction for a string comparison operation according to one embodiment of the present invention. System 100 includes a component, such as processor 102, that uses an execution unit containing logic to perform algorithms for processing data according to the present invention, such as in the embodiments described here. System 100 is representative of processing systems based on the PENTIUM® III, PENTIUM® 4, Xeon™, Itanium®, XScale®, and StrongARM® microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 runs a version of the Windows® operating system from Microsoft Corporation of Redmond, Washington, although other operating systems (e.g., UNIX and Linux), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices, such as handheld devices and embedded applications. Examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications include microcontrollers, digital signal processors (DSPs), systems on a chip, network computers (NetPCs), set-top boxes, network hubs, wide area network (WAN) switches, and any other system that performs string comparison operations on operands. Furthermore, some architectures have been implemented that execute an instruction on several data items simultaneously to improve the efficiency of multimedia applications. As the types and volume of data increase, computers and their processors must be enhanced to manipulate data in more efficient ways.

FIG. 1A is a block diagram of a computer system 100 having a processor 102. Processor 102 includes one or more execution units 108 that perform an algorithm to compare data elements of one or more operands. One embodiment is described in the context of a single-processor desktop or server system, but alternative embodiments can be used in multiprocessor systems. System 100 is an example of a hub architecture. Computer system 100 includes processor 102, which processes data signals. Processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. Processor 102 is coupled to a processor bus 110 that can transmit data signals between processor 102 and other components of system 100. The elements of system 100 perform conventional functions that are well known to those skilled in the art.

In one embodiment, processor 102 includes a level 1 (L1) internal cache memory 104. Depending on the architecture, processor 102 may have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory may reside outside processor 102. Other embodiments may combine internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register.

プロセッサ１０２には、実行ユニット１０８もあり、整数及び浮動小数点の演算を実行する論理回路を含む。プロセッサ１０２は、マクロ命令のマイクロコードを格納するマイクロコード（μコード）ROMも含む。この実施形態では、実行ユニット１０８はパック化命令セット１０９を処理する論理回路を含む。一実施形態では、パック化命令セット１０９は、複数のオペランドの要素を比較するパック化ストリング比較命令（packed string comparison instruction）を含む。汎用プロセッサ１０２の命令セットにパック化命令セット１０９を含めることにより、その命令を実行する関連回路とともに、多くのマルチメディアアプリケーションで利用する演算を汎用プロセッサ１０２においてパック化データを用いて実行することができる。このように、プロセッサのデータバスの幅を最大限に用いてパック化データ（packed data）に演算を行うことにより、多くのマルチメディアアプリケーションを高速化し、より効率的に実行することができる。これにより、プロセッサのデータバスを介してデータを小さい単位で転送して、一度に一データ要素に演算を実行する必要が無くなる。 Processor 102 also has an execution unit 108, which includes logic for performing integer and floating-point operations. Processor 102 also includes a microcode (μcode) ROM that stores microcode for macro instructions. In this embodiment, execution unit 108 includes logic for handling a packed instruction set 109. In one embodiment, packed instruction set 109 includes a packed string comparison instruction that compares elements of a plurality of operands. By including packed instruction set 109 in the instruction set of general-purpose processor 102, along with the associated circuitry to execute the instructions, the operations used by many multimedia applications can be performed using packed data in general-purpose processor 102. Thus, many multimedia applications can be accelerated and executed more efficiently by using the full width of the processor's data bus to operate on packed data. This eliminates the need to transfer smaller units of data across the processor's data bus to perform operations one data element at a time.

マイクロコントローラ、組み込みプロセッサ、グラフィックスデバイス、DSP、その他のタイプの論理回路において、実行ユニット１０８の別の実施形態を利用することもできる。システム１００は、メモリ１２０を含む。メモリ１２０は、ＤＲＡＭ（dynamic random access memory）デバイス、ＳＲＡＭ（static random access memory）デバイス、フラッシュメモリデバイス、その他のメモリデバイスである。メモリ１２０は、プロセッサ１０２により実行できる、データ信号で表された命令及び／またはデータを格納できる。システム論理チップ１１６はプロセッサバス１１０とメモリ１２０に結合している。例示した実施形態では、システム論理チップ１１６はメモリコントローラハブ（ＭＣＨ）である。プロセッサ１０２は、プロセッサバス１１０を介してＭＣＨ１１６と通信できる。ＭＣＨ１１６は、命令とデータの格納、グラフィックスコマンド、データ、及びテクスチャの格納のために、メモリ１２０への広帯域幅メモリパス１１８を提供する。ＭＣＨ１１６は、プロセッサ１０２、メモリ１２０、及びシステム１００のその他のコンポーネントの間でデータ信号を方向付け（direct）、プロセッサバス１１０、メモリ１２０、及びシステムＩ／Ｏ１２２間のデータ信号をブリッジする。実施形態によっては、システム論理チップ１１６は、グラフィックスコントローラ１１２に結合するためのグラフィックスポートを提供する。ＭＣＨ１１６は、メモリインターフェイス１１８を通してメモリ１２０に結合している。グラフィックスカード１１２は、ＡＧＰ（Accelerated Graphics Port）インターコネクト１１４によりＭＣＨ１１６に結合されている。 Other embodiments of execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 is a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or another memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by processor 102. A system logic chip 116 is coupled to processor bus 110 and memory 120. In the illustrated embodiment, system logic chip 116 is a memory controller hub (MCH). Processor 102 can communicate with MCH 116 via processor bus 110. MCH 116 provides a high-bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. MCH 116 directs data signals between processor 102, memory 120, and other components of system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O 122. In some embodiments, system logic chip 116 provides a graphics port for coupling to a graphics controller 112. MCH 116 is coupled to memory 120 through memory interface 118. Graphics card 112 is coupled to MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

システム１００は、独自のハブインターフェイスバス１２２を用いて、ＭＣＨ１１６をＩ／Ｏコントローラハブ（ＩＣＨ）１３０に結合する。ＩＣＨ１３０は、ローカルＩ／Ｏバスを介してＩ／Ｏデバイスに直接接続する。ローカルＩ／Ｏバスは、メモリ１２０、チップセット、及びプロセッサ１０２に周辺機器を接続する高速Ｉ／Ｏバスである。例としては、オーディオコントローラ、ファームウェアハブ（フラッシュＢＩＯＳ）１２８、ワイヤレストランシーバ１２６、データストレージ１２４、ユーザ入力及びキーボードインターフェイスを含むレガシーＩ／Ｏコントローラ、ＵＳＢ（Universal Serial Bus）等のシリアル拡張ポート、及びネットワークコントローラ１３４がある。データストレージデバイス１２４は、ハードディスクドライブ、フロッピー（登録商標）ディスクドライブ、ＣＤ−ＲＯＭデバイス、フラッシュメモリデバイス、その他の大容量ストレージデバイスである。 System 100 uses a proprietary hub interface bus 122 to couple MCH 116 to an I/O controller hub (ICH) 130. ICH 130 connects directly to I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus that connects peripheral devices to memory 120, the chipset, and processor 102. Examples include an audio controller, a firmware hub (flash BIOS) 128, a wireless transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. Data storage device 124 is a hard disk drive, a floppy (registered trademark) disk drive, a CD-ROM device, a flash memory device, or another mass storage device.

システムの他の実施形態の場合、ストリング比較命令を含むアルゴリズムを実行する実行ユニットをシステムオンチップ（system on a chip）で利用できる。システムオンチップの一実施形態は、プロセッサ及びメモリである。かかるシステムのメモリはフラッシュメモリである。フラッシュメモリはプロセッサ及びその他のシステムコンポーネントと同じダイ（die）にあってもよい。また、他の論理ブロック、例えばメモリコントローラまたはグラフィックスコントローラ等がシステムオンチップ上にあってもよい。 For other embodiments of the system, an execution unit that executes an algorithm including a string comparison instruction is available on a system on a chip. One embodiment of a system on a chip is a processor and a memory. The memory of such a system is a flash memory. The flash memory may be on the same die as the processor and other system components. Also, other logic blocks, such as a memory controller or a graphics controller, may be on the system-on-chip.

図１Ｂは、本発明の一実施形態の原理を化体するデータ処理システム１４０を示す。当業者には言うまでもなく、本発明の範囲から逸脱することなく、ここに説明する実施形態を別の処理システムで利用することもできる。 FIG. 1B shows a data processing system 140 embodying the principles of one embodiment of the present invention. Those skilled in the art will appreciate that the embodiments described herein may be utilized in alternative processing systems without departing from the scope of the invention.

コンピュータシステム１４０は、ストリング比較演算を含むＳＩＭＤ演算を実行できるプロセッシングコア１５９を有する。一実施形態では、プロセッシングコア１５９は、任意タイプのアーキテクチャの処理ユニットを表し、ＣＩＳＣ、ＲＩＳＣ、ＶＬＩＷなど各タイプのアーキテクチャを含むが、これらには限定されない。プロセッシングコア１５９は、１つまたは複数のプロセステクノロジーでの生産に適しており、機械読み取り可能媒体で十分に詳しく表せるので、生産が容易になる。 Computer system 140 has a processing core 159 that can perform SIMD operations including string comparison operations. In one embodiment, processing core 159 represents a processing unit of any type of architecture, including, but not limited to, each type of architecture such as CISC, RISC, VLIW, and the like. Processing core 159 is suitable for production in one or more process technologies, and can be represented in machine-readable media in sufficient detail to facilitate production.

プロセッシングコア１５９は、実行ユニット１４２、一組のレジスタファイル１４５、及びデコーダ１４４を有する。プロセッシングコア１５９は、この他の回路（図示せず）も含むが、この回路は本発明を理解するためには必要ない。実行ユニット１４２は、プロセッシングコア１５９が受け取った命令を実行するために使用する。実行ユニット１４２は、一般的なプロセッサ命令を認識するのに加え、パック化命令セット１４３の命令を認識して、パック化データフォーマットに演算を実行する。パック化命令セット１４３は、ストリング比較演算をサポートする命令を含み、他のパック化命令を含んでも良い。実行ユニット１４２は内部バスによりレジスタファイル１４５に結合している。レジスタファイル１４５は、データを含む情報を格納する、プロセッシングコア１５９上の記憶領域を表す。上記の通り、パック化データを記憶するのに用いる記憶領域は必須ではない。実行ユニット１４２はデコーダ１４４に結合している。デコーダ１４４は、プロセッシングコア１５９が受け取った命令を制御信号及び／またはマイクロコードエントリーポイント（microcode entry points）にデコードするために用いられる。実行ユニット１４２は、これらの制御信号及び／またはマイクロコードエントリーポイントに応じて適切な演算を実行する。 The processing core 159 has an execution unit 142, a set of register files 145, and a decoder 144. Processing core 159 also includes other circuitry (not shown), which is not necessary for understanding the present invention. Execution unit 142 is used to execute instructions received by processing core 159. The execution unit 142 recognizes instructions in the packed instruction set 143 in addition to recognizing general processor instructions and performs operations on packed data format. Packed instruction set 143 includes instructions that support string comparison operations, and may include other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. The register file 145 represents a storage area on the processing core 159 for storing information including data. As described above, the storage area used to store the packed data is not essential. Execution unit 142 is coupled to decoder 144. The decoder 144 is used to decode the instructions received by the processing core 159 into control signals and / or microcode entry points. The execution unit 142 performs an appropriate operation according to these control signals and / or microcode entry points.

プロセッシングコア１５９は、他の様々なシステムデバイスと通信するためにバス１４１と結合されている。システムデバイスには、シンクロナスＤＲＡＭ（ＳＤＲＡＭ）コントロール１４６、スタティックＲＡＭ（ＳＲＡＭ）コントロール１４７、バーストフラッシュメモリインターフェイス１４８、ＰＣＭＣＩＡ／コンパクトフラッシュ（登録商標）（ＣＦ）カードコントロール１４９、液晶ディスプレイ（ＬＣＤ）コントロール１５０、ＤＭＡコントローラ１５１、代替バスマスターインターフェイス１５２が含まれるが、これらには限定されない。一実施形態では、データプロセッシングシステム１４０は、Ｉ／Ｏバス１５３を介して様々なＩ／Ｏデバイスと通信するためのＩ／Ｏブリッジ１５４も有する。Ｉ／Ｏデバイスには、例えばＵＡＲＴ１５５、ＵＳＢ１５６、ブルートゥース（登録商標）ワイヤレスＵＡＲＴ１５７、及びＩ／Ｏ拡張インターフェイス１５８が含まれるが、これらには限定されない。 Processing core 159 is coupled to bus 141 for communicating with various other system devices. These system devices include, but are not limited to, a synchronous DRAM (SDRAM) control 146, a static RAM (SRAM) control 147, a burst flash memory interface 148, a PCMCIA/CompactFlash (registered trademark) (CF) card control 149, a liquid crystal display (LCD) control 150, a DMA controller 151, and an alternate bus master interface 152. In one embodiment, data processing system 140 also has an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices include, but are not limited to, a UART 155, a USB 156, a Bluetooth (registered trademark) wireless UART 157, and an I/O expansion interface 158.

データプロセッシングシステム１４０の一実施形態は、ストリング比較演算を含むＳＩＭＤ演算を実行できる、モバイル、ネットワーク及び／またはワイヤレス通信およびプロセッシングコア１５９である。プロセッシングコア１５９は、様々なオーディオ、ビデオ、画像化、及び通信アルゴリズムでプログラムすることができる。これらのアルゴリズムには、例えば、ウォルシュ・アダマール変換、高速フーリエ変換、離散余弦変換（ＤＣＴ）、これらのそれぞれの逆変換；色空間変換等の圧縮・解凍方法、ビデオエンコード動き予測、またはビデオデコード動き補償；パルスコード変調（ＰＣＭ）等の変復調（ＭＯＤＥＭ）機能等が含まれる。 One embodiment of data processing system 140 provides mobile, network, and/or wireless communications together with a processing core 159 that can perform SIMD operations, including string comparison operations. Processing core 159 can be programmed with various audio, video, imaging, and communication algorithms. These algorithms include, for example, the Walsh-Hadamard transform, the fast Fourier transform, the discrete cosine transform (DCT), and their respective inverse transforms; compression and decompression techniques such as color space transformation, video encode motion estimation, or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse code modulation (PCM).

図１Ｃは、ＳＩＭＤストリング比較演算を実行できるデータ処理システムのさらに別の実施形態を示す。別の一実施形態によるデータプロセッシングシステム１６０は、メインプロセッサ１６６、ＳＩＭＤコ・プロセッサ１６１、キャッシュメモリ１６７、及び入出力システム１６８を含む。入出力システム１６８は、任意的に、ワイヤレスインターフェイス１６９に結合している。ＳＩＭＤコ・プロセッサ１６１は、ストリング比較演算を含むＳＩＭＤ演算を実行できる。プロセッシングコア１７０は、１つまたは複数のプロセステクノロジーでの生産に適しており、機械読み取り可能媒体で十分に詳しく表せるので、プロセッシングコア１７０を含むデータプロセッシングシステム１６０の全部または一部の生産が容易になる。 FIG. 1C illustrates yet another embodiment of a data processing system capable of performing SIMD string comparison operations. According to one alternative embodiment, data processing system 160 includes a main processor 166, a SIMD co-processor 161, a cache memory 167, and an input/output system 168. Input/output system 168 is optionally coupled to a wireless interface 169. SIMD co-processor 161 can perform SIMD operations, including string comparison operations. Processing core 170 is suitable for manufacture in one or more process technologies and, because it can be represented on a machine-readable medium in sufficient detail, facilitates the manufacture of all or part of data processing system 160, including processing core 170.

一実施形態では、ＳＩＭＤコ・プロセッサ１６１は、実行ユニット１６２と一組のレジスタファイル１６４を有する。メインプロセッサ１６６の一実施形態は、実行ユニット１６２が実行するＳＩＭＤストリング比較命令を含む命令セット１６３の命令を認識するデコーダ１６５を有する。別の実施形態では、ＳＩＭＤコ・プロセッサ１６１は、デコーダ１６５Ｂの少なくとも一部を有し、命令セット１６３の命令をデコードする。プロセッシングコア１７０は、この他の回路（図示せず）も含むが、この回路は本発明の実施形態を理解するためには必要ない。 In one embodiment, SIMD co-processor 161 has an execution unit 162 and a set of register files 164. One embodiment of main processor 166 has a decoder 165 that recognizes instructions of instruction set 163, including SIMD string comparison instructions, for execution by execution unit 162. In another embodiment, SIMD co-processor 161 also has at least part of a decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes other circuitry (not shown) that is not necessary for understanding embodiments of the present invention.

動作中、メインプロセッサ１６６は、キャッシュメモリ１６７や入出力システム１６８とのインターラクションを含む、一般的なタイプのデータ処理演算を制御するデータ処理命令ストリーム（stream of data processing instructions）を実行する。ＳＩＭＤコ・プロセッサ命令はデータ処理命令ストリームの中に組み込まれている。メインプロセッサ１６６のデコーダ１６５は、ＳＩＭＤコ・プロセッサ命令を、付随するＳＩＭＤコ・プロセッサ１６１が実行すべきタイプであるとして認識する。従って、メインプロセッサ１６６は、これらのＳＩＭＤコ・プロセッサ命令（または、ＳＩＭＤコ・プロセッサ命令を表す制御信号）をコ・プロセッサバス１６６上に発行し、付随するＳＩＭＤコ・プロセッサはコ・プロセッサバス１６６からコ・プロセッサ命令を受け取る。この場合、ＳＩＭＤコ・プロセッサ１６１は、それに宛てられたＳＩＭＤコ・プロセッサ命令を受け取り、実行する。 In operation, main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with cache memory 167 and input/output system 168. SIMD co-processor instructions are embedded in the stream of data processing instructions. Decoder 165 of main processor 166 recognizes these SIMD co-processor instructions as being of a type that should be executed by an attached SIMD co-processor 161. Accordingly, main processor 166 issues these SIMD co-processor instructions (or control signals representing SIMD co-processor instructions) on co-processor bus 166, from which they are received by any attached SIMD co-processor. In this case, SIMD co-processor 161 accepts and executes any received SIMD co-processor instructions intended for it.

ＳＩＭＤコ・プロセッサ命令が処理するデータは、ワイヤレスインターフェイス１６９を介して受け取ってもよい。一例として、音声通信をデジタル信号の形式で受信して、ＳＩＭＤコ・プロセッサ命令で処理して、その音声通信を表すデジタルオーディオサンプルを再生する。他の一例として、圧縮オーディオ及び／またはビデオをデジタルビットストリームの形式で受信して、ＳＩＭＤコ・プロセッサ命令で処理して、そのデジタルオーディオサンプル及び／またはモーションビデオフレームを再生してもよい。プロセッシングコア１７０の一実施形態では、メインプロセッサ１６６とＳＩＭＤコ・プロセッサ１６１は単一のプロセッシングコア１７０に集積されている。プロセッシングコア１７０は、実行ユニット１６２、一組のレジスタファイル１６４、及びデコーダ１６５を有し、ＳＩＭＤストリング比較命令を含む命令セット１６３の命令を認識する。 Data processed by the SIMD co-processor instructions may be received via the wireless interface 169. As an example, a voice communication is received in the form of a digital signal and processed with SIMD co-processor instructions to reproduce digital audio samples representing the voice communication. As another example, compressed audio and / or video may be received in the form of a digital bitstream and processed with SIMD co-processor instructions to play the digital audio samples and / or motion video frames. In one embodiment of processing core 170, main processor 166 and SIMD co-processor 161 are integrated on a single processing core 170. Processing core 170 has an execution unit 162, a set of register files 164, and a decoder 165, and recognizes instructions in instruction set 163, including SIMD string compare instructions.

図２は、プロセッサ２００のマイクロアーキテクチャを示すブロック図である。プロセッサ２００は、本発明の一実施形態によるストリング比較命令を実行する論理回路を含む。ストリング比較命令の一実施形態では、第１のオペランドの各データ要素を第２のオペランドの各データ要素と比較して、各比較結果が一致したかを示すインジケータを格納する。実施形態では、サイズがバイト、ワード、ダブルワード、クアッドワード（quadword）等であり、データタイプが整数や浮動小数点であるデータ要素に、ストリング比較命令を演算することができる。一実施形態では、インオーダー（in-order）フロントエンド２０１がプロセッサ２００の一部となっており、実行するマクロ命令をフェッチして、後でプロセッサパイプラインで使用するように準備する。フロントエンド２０１は複数のユニットを含む。一実施形態では、命令プリフェッチャ２２６は、メモリからマクロ命令をフェッチして、命令デコーダ２２８に供給（feed）する。命令デコーダ２２８は、マクロ命令を、機械が実行可能なマイクロ命令またはマイクロ演算（micro opやμopsとも呼ぶ）と呼ばれるプリミティブ（primitives）にデコードする。一実施形態では、トレースキャッシュ２３０は、デコードされたマイクロ演算を取って、プログラムオーダーシーケンス（program ordered sequences）またはトレース（traces）を組立、実行のためにマイクロ演算キュー２３４に入れる。トレースキャッシュ２３０が複雑なマクロ命令を見つける（encounter）と、マイクロコードＲＯＭ２３２がその演算を完了するのに必要なマイクロ演算を供給する。 FIG. 2 is a block diagram illustrating the micro-architecture of the processor 200. Processor 200 includes logic for executing a string compare instruction according to one embodiment of the present invention. In one embodiment of the string compare instruction, each data element of the first operand is compared to each data element of the second operand, and an indicator is stored that indicates whether each comparison result matches. In the embodiment, a string comparison instruction can be performed on a data element whose size is byte, word, double word, quadword, or the like, and whose data type is an integer or a floating point. In one embodiment, an in-order front end 201 is part of the processor 200 and fetches macro instructions for execution and prepares them for later use in the processor pipeline. The front end 201 includes a plurality of units. In one embodiment, instruction prefetcher 226 fetches macro instructions from memory and feeds them to instruction decoder 228. The instruction decoder 228 decodes the macro instruction into primitives called machine-executable micro-instructions or micro-operations (also called micro-ops or μops). 
In one embodiment, trace cache 230 takes the decoded micro-operations and assembles them into program ordered sequences, or traces, in micro-operation queue 234 for execution. When trace cache 230 encounters a complex macro instruction, microcode ROM 232 provides the micro-operations needed to complete the operation.
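
The element-by-element comparison that this paragraph attributes to one embodiment of the string compare instruction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `compare_all_pairs` and the list-of-lists result are hypothetical choices made for this example, and hardware would perform the comparisons in parallel rather than in nested loops.

```python
def compare_all_pairs(op1, op2):
    """Compare every element of op1 with every element of op2.

    Returns a matrix of indicators: result[i][j] is True when
    op1[i] equals op2[j]. (Hypothetical helper, not from the patent.)
    """
    return [[a == b for b in op2] for a in op1]


# Two 4-element operands made of byte-sized characters.
first = list(b"abca")
second = list(b"aXcY")
matches = compare_all_pairs(first, second)
```

Each row of `matches` holds the indicators for one data element of the first operand, so a 16-byte operand pair would yield a 16 x 16 set of indicators.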

多数のマクロ命令は単一のマイクロ演算に変換されるが、他のマクロ命令はその演算を完全に完了するのに複数の（several）マイクロ演算を必要とする。一実施形態では、１つのマクロ命令を完了するのに５つ以上のマイクロ演算が必要であれば、デコーダ２２８はマイクロコードＲＯＭ２３２にアクセスしてマクロ命令を実行する。一実施形態では、パック化ストリング比較命令を少数のマイクロ演算にデコードして、命令デコーダ２２８で処理する。他の実施形態では、演算を行うのに多数のマイクロ演算が必要な場合、パック化ストリング比較アルゴリズムをマイクロコードＲＯＭ２３２内に格納することもできる。トレースキャッシュ２３０は、マイクロコードＲＯＭ２３２のストリング比較アルゴリズムのマイクロコードシーケンスを読むための、正しいマイクロ命令ポインタを決定するエントリーポイントのプログラマブルロジックアレイ（ＰＬＡ）である。マイクロコードＲＯＭ２３２がカレントの（current）マクロ命令のマイクロ演算のシーケンス決定（sequencing）を終了すると、マシンのフロントエンド２０１は、トレースキャッシュ２３０からマイクロ演算のフェッチを再開する。 Many macro instructions are translated into a single micro-operation, while other macro instructions require several micro-operations to complete the operation completely. In one embodiment, if more than four micro-operations are required to complete a macro instruction, decoder 228 accesses microcode ROM 232 to execute the macro instruction. In one embodiment, the packed string compare instruction is decoded into a small number of micro-operations and processed by instruction decoder 228. In other embodiments, the packed string comparison algorithm may be stored in microcode ROM 232 if the operation requires a large number of micro-operations. Trace cache 230 is a programmable logic array (PLA) of entry points that determines the correct microinstruction pointer for reading the microcode sequence of the string comparison algorithm in microcode ROM 232. When the microcode ROM 232 finishes sequencing the micro-operations of the current macro instruction, the front end 201 of the machine resumes fetching the micro-operations from the trace cache 230.
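
The choice between the instruction decoder and the microcode ROM described above can be sketched as a simple threshold test. The names `MICROCODE_THRESHOLD` and `decode_path` are hypothetical; the threshold of four follows the one embodiment described in the text, and other embodiments may differ.

```python
# In the embodiment described, macro instructions needing more than four
# micro-operations are sequenced from the microcode ROM.
MICROCODE_THRESHOLD = 4


def decode_path(uop_count):
    """Return which front-end structure supplies the micro-operations."""
    if uop_count < 1:
        raise ValueError("a macro instruction needs at least one micro-operation")
    return "microcode_rom" if uop_count > MICROCODE_THRESHOLD else "decoder"
```

Under this sketch, a packed string compare that decodes into a few micro-operations stays on the decoder path, while a longer sequence is read from the ROM.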

一部のＳＩＭＤその他のマルチメディアタイプの命令は複雑な命令であると考えられる。浮動小数点関係の命令もほとんどが複雑な命令である。そこで、命令デコーダ２２８は複雑なマクロ命令が来ると（encounter）、マイクロコードＲＯＭ２３２の適切な場所にアクセスして、そのマクロ命令のマイクロコードシーケンスを読み出す。そのマクロ命令を実行するのに必要な様々なマイクロ演算を、アウトオブオーダー（out-of-order）実行エンジン２０３に送り、適切な整数実行ユニット及び浮動小数点実行ユニットで実行する。 Some SIMD and other multimedia type instructions are considered to be complex instructions. Most floating point instructions are also complicated instructions. Thus, when a complex macro instruction arrives, the instruction decoder 228 accesses an appropriate location in the microcode ROM 232 and reads the microcode sequence of the macro instruction. The various micro-operations required to execute the macro instruction are sent to an out-of-order execution engine 203 for execution in the appropriate integer and floating point execution units.

アウトオブオーダー実行エンジン２０３は、マイクロ命令の実行準備をするところである。アウトオブオーダー実行論理回路は、多数のバッファを有し、マイクロ命令がパイプラインを下り、実行スケジューリングがなされるにつれ、実行を最適化するように、マイクロ命令のフローをスムースにして並べ替える。アロケータロジックは、各マイクロ演算を実行するために必要なマシンバッファとリソースをアロケートする。レジスタリネーミングロジックは、ロジックレジスタをレジスタファイルのエントリーにリネーム（rename）する。アロケータは、命令スケジューラであるメモリスケジューラ、高速スケジューラ２０２、低速・一般浮動小数点スケジューラ２０４、及び単純浮動小数点スケジューラ２０６の前にある、メモリ演算用と非メモリ演算用の２つのマイクロ演算キューの一方の各マイクロ演算にエントリーをアロケートする。マイクロ演算スケジューラ２０２、２０４、２０６は、マイクロ演算が依存する入力レジスタオペランドソースの準備状況（readiness）と、マイクロ演算がその演算を完了するのに必要とする実行リソースの利用可能性とに基づき、マイクロ演算がいつ実行できるか決定する。本実施形態の高速スケジューラ２０２は、メインクロックサイクルの半分ごとにスケジューリングをできるが、他のスケジューラはメインプロセッサクロックサイクルごとにしかスケジューリングができない。複数のスケジューラはディスパッチポートをアービトレーションしてマイクロ演算の実行をスケジューリングする。 Out-of-order execution engine 203 is where micro-instructions are prepared for execution. The out-of-order execution logic has a number of buffers that smooth out and reorder the flow of micro-instructions, optimizing execution as they go down the pipeline and are scheduled for execution. The allocator logic allocates the machine buffers and resources that each micro-operation needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each micro-operation in one of two micro-operation queues, one for memory operations and one for non-memory operations, ahead of the instruction schedulers: the memory scheduler, fast scheduler 202, slow/general floating-point scheduler 204, and simple floating-point scheduler 206. Micro-operation schedulers 202, 204, 206 determine when a micro-operation is ready to execute based on the readiness of the input register operand sources it depends on and the availability of the execution resources it needs to complete its operation. Fast scheduler 202 of this embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule micro-operations for execution.
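
The readiness test applied by the micro-operation schedulers (source operands ready, execution resource available) can be illustrated with the following sketch. The `MicroOp` class, the `select_ready` function, and the register and unit names are all hypothetical; real schedulers arbitrate over dispatch ports in hardware, which this loop only approximates.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple  # names of the registers this micro-op reads
    unit: str       # execution resource it needs, e.g. "alu" or "agu"


def select_ready(queue, ready_registers, free_units):
    """Pick micro-ops whose operands are ready and whose unit is free.

    Each free unit is granted to at most one micro-op per call, which
    stands in for arbitration over a dispatch port.
    """
    available = set(free_units)
    dispatched = []
    for uop in queue:
        operands_ready = all(src in ready_registers for src in uop.sources)
        if operands_ready and uop.unit in available:
            available.remove(uop.unit)
            dispatched.append(uop.name)
    return dispatched


queue = [
    MicroOp("load", ("r1",), "agu"),
    MicroOp("add", ("r2", "r3"), "alu"),
    MicroOp("sub", ("r2", "r4"), "alu"),  # r4 is not ready yet
    MicroOp("xor", ("r2", "r2"), "alu"),  # ready, but the ALU is already taken
]
picked = select_ready(queue, {"r1", "r2", "r3"}, ["agu", "alu"])
```

Because a stalled micro-operation such as `sub` does not block younger ones, a later call with only `sub` and `xor` queued would dispatch `xor` first, which is the out-of-order behavior the paragraph describes.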

レジスタファイル２０８、２１０はスケジューラ２０２、２０４、２０６と、実行ブロック２１１の実行ユニット２１２、２１４、２１６、２１８、２２０、２２２、２２４との間にある。整数演算と浮動小数点演算にはそれぞれ別のレジスタファイル２０８、２１０がある。他の実施形態では、整数レジスタ及び浮動小数点レジスタは同一レジスタファイルにあってもよい。本実施形態の各レジスタファイル２０８、２１０は、ちょうど完了した結果であってまだレジスタファイルに書き込まれていないものを、新しいディペンデント（dependent）なマイクロ演算にバイパスまたは転送するバイパスネットワークを含む。整数レジスタファイル２０８と浮動小数点レジスタファイル２１０は、互いにデータをやりとりすることができる。一実施形態では、整数レジスタファイル２０８は、下位３２ビット用と上位３２ビット用である２つの別々のレジスタファイルに分離されている。一実施形態の浮動小数点レジスタファイル２１０は、１２８ビット幅のエントリーを有する。浮動小数点命令は、一般的には６４ビットから１２８ビットの幅のオペランドを有するからである。 The register files 208, 210 are between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 of the execution block 211. There are separate register files 208 and 210 for integer arithmetic and floating point arithmetic, respectively. In other embodiments, the integer and floating point registers may be in the same register file. Each register file 208, 210 of the present embodiment includes a bypass network that bypasses or transfers just completed results that have not yet been written to the register file to a new dependent micro-operation. The integer register file 208 and the floating point register file 210 can exchange data with each other. In one embodiment, the integer register file 208 is separated into two separate register files, one for the lower 32 bits and one for the upper 32 bits. In one embodiment, the floating point register file 210 has 128 bit wide entries. This is because floating point instructions typically have operands that are 64 to 128 bits wide.

実行ブロック２１１は、実行ユニット２１２，２１４，２１６，２１８，２２０，２２２，２２４を含み、これらにより命令が実際に実行される。このセクションにはレジスタファイル２０８，２１０が含まれる。レジスタファイル２０８，２１０は、マイクロ命令の実行に必要な整数及び浮動小数点データオペランドの値が記憶される。本実施形態のプロセッサ２００は、複数の実行ユニット、すなわちアドレス生成ユニット（AGU）２１２、AGU２１４、高速ALU２１６、高速ALU２１８、低速ALU２２０、浮動小数点ＡＬＵ２２２、浮動小数点moveユニット２２４により構成されている。本実施形態では、浮動小数点実行ブロック２２２、２２４は、浮動小数点演算、ＭＭＸ演算、ＳＩＭＤ演算、及びＳＳＥ演算を実行する。本実施形態の浮動小数点ＡＬＵ２２２は、６４ビット対６４ビットの浮動小数点割り算器を含み、割り算、平方根、剰余のマイクロ演算を実行する。本発明の実施形態では、浮動小数点値が関わる動作は浮動小数点ハードウェアで行われる。例えば、整数形式と浮動小数点形式の間の変換には浮動小数点レジスタファイルが関与する。同様に、浮動小数点割り算演算は浮動小数点割り算器で行われる。一方、非浮動小数点型や整数型は整数ハードウェアリソースで処理される。単純かつ頻度が高いＡＬＵ演算は高速ＡＬＵ実行ユニット２１６、２１８に行く。本実施形態の高速ＡＬＵ２１６，２１８は、有効レイテンシーがクロックサイクルの半分である高速演算を実行できる。一実施形態では、ほとんどの複雑な整数演算は低速ALU２２０に行く。低速ALU２２０が、乗算、シフト、フラグロジック、ブランチ処理等のレイテンシーが長いタイプの演算用の整数実行ハードウェアを含むからである。メモリロード・ストア命令は、ＡＧＵ２１２，２１４で実行される。この実施形態は、整数ＡＬＵ２１６，２１８，２２０は、６４ビットデータオペランドに整数演算を実行するものとして説明した。別の実施形態では、ＡＬＵ２１６，２１８，２２０は、１６，３２，１２８，２５６等の様々なデータビットをサポートするように実施することもできる。同様に、浮動小数点ユニット２２２，２２４は、様々な幅のビットを有するある範囲のオペランドをサポートするように実施することもできる。一実施形態では、浮動小数点ユニット２２２、２２４は、ＳＩＭＤ命令やマルチメディア命令とともに、１２８ビット幅のパック化データオペランドに演算をすることができる。 Execution block 211 includes execution units 212, 214, 216, 218, 220, 222, and 224, which execute the instructions. This section includes the register files 208 and 210. The register files 208 and 210 store the values of the integer and floating-point data operands necessary for executing the microinstruction. The processor 200 of this embodiment includes a plurality of execution units, that is, an address generation unit (AGU) 212, AGU 214, high-speed ALU 216, high-speed ALU 218, low-speed ALU 220, floating-point ALU 222, and floating-point move unit 224. In the present embodiment, the floating point execution blocks 222 and 224 execute a floating point operation, an MMX operation, a SIMD operation, and an SSE operation. The floating-point ALU 222 according to the present embodiment includes a 64-bit to 64-bit floating-point divider, and performs division, square root, and remainder micro operations. 
In embodiments of the present invention, operations involving floating point values are performed in floating point hardware. For example, conversion between integer format and floating point format involves a floating point register file. Similarly, floating point division operations are performed in floating point dividers. On the other hand, non-floating point types and integer types are handled by integer hardware resources. Simple and frequent ALU operations go to the fast ALU execution units 216, 218. The high-speed ALUs 216 and 218 according to the present embodiment can execute high-speed operations in which the effective latency is half a clock cycle. In one embodiment, most complex integer operations go to the slow ALU 220. This is because the low-speed ALU 220 includes integer execution hardware for long-latency type operations such as multiplication, shift, flag logic, and branch processing. The memory load / store instruction is executed by the AGUs 212 and 214. This embodiment has been described assuming that integer ALUs 216, 218, and 220 perform integer operations on 64-bit data operands. In another embodiment, ALUs 216, 218, 220 may be implemented to support various data bits, such as 16, 32, 128, 256, and the like. Similarly, floating point units 222 and 224 may be implemented to support a range of operands having bits of various widths. In one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands along with SIMD and multimedia instructions.

本実施形態では、マイクロ演算スケジューラ２０２，２０４，２０６は、親のロード（load）の実行が終わる前に、ディペンデント演算（dependent operations）をディスパッチする。マイクロ演算はプロセッサ２００においてスペキュレーティブ（speculatively）にスケジューリングされるので、プロセッサ２００はメモリミスを処理するロジックも含む。データキャッシュにおいてデータロードがミスすると、パイプライン中には、データが一時的に正しくないディペンデント演算がある。正しくないデータを使う命令をリプレイメカニズムが追跡し、再実行する。ディペンデント演算のみをリプレイする必要があり、インディペンデント演算は完了することができる。プロセッサの一実施形態のスケジューラとリプレイメカニズムは、ストリング比較演算の命令シーケンスを捉えるように設計されている。 In the present embodiment, the micro operation schedulers 202, 204, 206 dispatch dependent operations before the execution of the parent load ends. Because micro-operations are speculatively scheduled in processor 200, processor 200 also includes logic to handle memory misses. If the data load misses in the data cache, there is a dependent operation in the pipeline where the data is temporarily incorrect. The replay mechanism tracks instructions that use incorrect data and re-executes them. Only the dependent operation needs to be replayed, and the independent operation can be completed. The scheduler and replay mechanism of one embodiment of the processor is designed to capture the instruction sequence of the string comparison operation.
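
The replay rule stated above (only operations that depend on the missed load are re-executed; independent operations complete) can be sketched as a transitive dependency search. The function `ops_to_replay` and the dependency map are hypothetical illustrations, not the mechanism claimed in the patent.

```python
def ops_to_replay(missed_load, dependencies):
    """Return the operations that consumed data from a load that missed.

    `dependencies` maps each operation to the values it reads. An
    operation must be replayed if it depends, directly or transitively,
    on the missed load.
    """
    replay = set()
    changed = True
    while changed:
        changed = False
        for op, sources in dependencies.items():
            if op in replay:
                continue
            if any(src == missed_load or src in replay for src in sources):
                replay.add(op)
                changed = True
    return replay


deps = {
    "load_a": [],
    "cmp": ["load_a"],   # reads the loaded value
    "branch": ["cmp"],   # reads the compare result
    "add": ["r_other"],  # independent of the load
}
replayed = ops_to_replay("load_a", deps)
```

Here `cmp` and `branch` are replayed once the correct data arrives, while `add` is allowed to complete.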

「レジスタ」という用語は、オペランドを特定するマクロ命令の一部として使われる、オンボードプロセッサの記憶場所を言う。換言すると、ここでレジスタとは、プロセッサの外側から（プログラマーの視点から）見えるレジスタである。しかし、一実施形態のレジスタは、特定タイプの回路を意味していると限定すべきではない。むしろ、実施形態のレジスタは、データを記憶して供給し、本明細書に記載する機能を実行できるだけでよい。ここで説明したレジスタは、専用の物理的レジスタ、レジスタリネーミングを利用した動的割当ての物理的レジスタ、専用の物理的レジスタ及び動的割当の物理的レジスタの組み合わせなど、任意数の異なる技術を用いて、プロセッサ内の回路により実施することができる。一実施形態では、整数レジスタは３２ビットの整数データを記憶する。一実施形態のレジスタファイルは、パック化データ用に８個のマルチメディアＳＩＭＤレジスタも含む。以下の説明では、レジスタは、カリフォルニア州サンタクララ市のインテルコーポレイションのＭＭＸテクノロジーで実現された、マイクロプロセッサの６４ビット幅ＭＭＸ（登録商標）レジスタ（場合によっては「ｍｍ」レジスタとも呼ぶ）などの、パック化データを保持するように設計されたデータレジスタであるものとする。これらのＭＭＸレジスタは、整数形式と浮動小数点形式とがあるが、ＳＩＭＤ命令やＳＳＥ命令をともなうパック化データ要素に利用できる。同様に、ＳＳＥ２，ＳＳＥ３，ＳＳＥ４またはそれ以降（総称的に「ＳＳＥｘ」と呼ぶ）のテクノロジーに関する１２８ビット幅のＸＭＭレジスタも、このようなパック化データオペランドを保持するために用いることができる。本実施形態では、パック化データや整数データを記憶する際、レジスタは２つのデータタイプを区別する必要はない。 The term "register" refers to on-board processor storage locations that are used as part of macro instructions to identify operands. In other words, the registers referred to here are those that are visible from outside the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data and performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, or combinations of dedicated and dynamically allocated physical registers. In one embodiment, integer registers store 32-bit integer data. The register file of one embodiment also contains eight multimedia SIMD registers for packed data. In the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit wide MMX (registered trademark) registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, or later (referred to generically as "SSEx") technology can also be used to hold such packed data operands. In this embodiment, the registers do not need to differentiate between the two data types when storing packed data and integer data.

以下の図の実施例では、複数のデータオペランドを説明する。図３Ａは、本発明の一実施形態によるマルチメディアレジスタにおける様々なパック化データタイプを表した図である。図３Ａは、１２８ビット幅オペランドの、パック化バイト３１０、パック化ワード３２０、及びパック化ダブルワード３３０を示している。本実施例のパック化バイトフォーマット３１０は、１２８ビットの長さで、１６個のパック化バイトデータ要素を含む。ここでは、１バイトは８ビットのデータであると定義する。各バイトデータ要素の情報は、バイト０がビット７からビット０まで、バイト１がビット１５からビット８まで、バイト２がビット２３からビット１６まで、そして最終的にバイト１５がビット１２７からビット１２０までに記憶される。このように、レジスタのすべてのビットが利用される。このような記憶構成をとることにより、プロセッサの記憶効率が高まる。また、１６個のデータ要素にアクセスするので、１つの演算を１６個のデータ要素に並行に演算することができる。 The examples in the following figures describe a number of data operands. FIG. 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. FIG. 3A shows a packed byte 310, a packed word 320, and a packed doubleword 330 for 128-bit wide operands. Packed byte format 310 of this example is 128 bits long and contains 16 packed byte data elements. A byte is defined here as 8 bits of data. The information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits of the register are used. This storage arrangement increases the storage efficiency of the processor. Also, because 16 data elements are accessed, one operation can be performed on 16 data elements in parallel.
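
The packed byte layout described above, with byte i occupying bits 8i+7 through 8i, can be modeled with a Python integer standing in for the 128-bit register. The helper names `pack_bytes` and `extract_byte` are hypothetical and exist only for this sketch.

```python
def pack_bytes(elements):
    """Pack 16 byte values into a 128-bit integer.

    Byte i is placed in bits 8*i+7 down to 8*i, matching the layout
    described for packed byte format 310.
    """
    if len(elements) != 16:
        raise ValueError("a 128-bit packed byte operand holds exactly 16 elements")
    value = 0
    for i, byte in enumerate(elements):
        if not 0 <= byte <= 0xFF:
            raise ValueError("each element must fit in 8 bits")
        value |= byte << (8 * i)
    return value


def extract_byte(packed, i):
    """Read byte i back out of the packed 128-bit value."""
    return (packed >> (8 * i)) & 0xFF


packed = pack_bytes(list(range(16)))
```

With this layout, byte 15 sits in the top eight bits (127 through 120), so `packed >> 120` recovers it directly.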

一般的に、データ要素は、単一のレジスタや記憶場所（memory location）に格納される個別のデータ（individual piece of data）であり、他のデータ要素と同じ長さのものである。ＳＳＥｘテクノロジーに関連するパック化データシーケンスでは、ＸＭＭレジスタに格納されるデータ要素数は、１２８ビットを個々のデータ要素のビット長で割った数である。ＭＭＸ及びＳＳＥテクノロジーに関連するパック化データシーケンスでは、ＭＭＸレジスタに格納されるデータ要素数は、６４ビットを個々のデータ要素のビット長で割った数である。図３Ａに示したデータタイプは１２８ビット長であるが、本発明の実施形態は、６４ビット幅でもその他のサイズのオペランドでも動作可能である。本実施例のパック化ワードフォーマット３２０は、１２８ビットの長さで、８個のパック化ワードデータ要素を含む。各パック化ワードは１６ビットの情報を含む。図３Ａのパック化ダブルワードフォーマット３３０は、１２８ビットの長さで、４個のパック化ダブルワードデータ要素を含む。各パック化ダブルワードデータ要素は３２ビットの情報を含む。パック化クアドワード（quadword）は、１２８ビットの長さであり、２つのパック化クアドワードデータ要素を含む。 In general, a data element is an individual piece of data stored in a single register or memory location and has the same length as other data elements. In packed data sequences related to SSEx technology, the number of data elements stored in the XMM register is 128 bits divided by the bit length of each data element. In packed data sequences related to MMX and SSE technologies, the number of data elements stored in the MMX register is 64 bits divided by the bit length of each data element. Although the data type shown in FIG. 3A is 128 bits long, embodiments of the present invention can operate with 64-bit wide or other size operands. The packed word format 320 of this embodiment is 128 bits long and includes eight packed word data elements. Each packed word contains 16 bits of information. The packed doubleword format 330 of FIG. 3A is 128 bits long and includes four packed doubleword data elements. Each packed doubleword data element contains 32 bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
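
The element counts given above follow from dividing the register width by the element width. A minimal sketch, using a hypothetical helper named `element_count`:

```python
def element_count(register_bits, element_bits):
    """Number of equal-width data elements that fit in a register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


# 128-bit XMM register: bytes, words, doublewords, quadwords.
xmm_counts = [element_count(128, width) for width in (8, 16, 32, 64)]
# 64-bit MMX register holding bytes.
mmx_bytes = element_count(64, 8)
```

The resulting counts for the 128-bit case are 16, 8, 4, and 2, matching the packed byte, word, doubleword, and quadword formats in the text.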

図３Ｂは、別のレジスタ内データ記憶フォーマットを示す図である。各パック化データは独立した２つ以上のデータ要素を含んでいても良い。パック化ハーフ３４１、パック化シングル３４２、及びパック化ダブル３４３である３つのパック化データフォーマットを示した。パック化ハーフ３４１、パック化シングル３４２、及びパック化ダブル３４３の一実施形態は、固定小数点データ要素である。別の実施形態では、パック化ハーフ３４１、パック化シングル３４２、及びパック化ダブル３４３は、浮動小数点データ要素を含んでいてもよい。パック化ハーフ３４１の別の一実施形態は、８個の１６ビットデータ要素を含む１２８ビット長データである。パック化シングル３４２の一実施形態は、１２８ビットの長さであり、４個の３２ビットデータ要素を含む。パック化ダブル３４３の一実施形態は、１２８ビットの長さであり、２つの６４ビットデータ要素を含む。言うまでもなく、かかるパック化データフォーマットは、例えば、９６ビット、１６０ビット、１９２ビット、２２４ビット、２５６ビット、またはそれ以上のレジスタ長に拡張することができる。 FIG. 3B is a diagram showing another data storage format in a register. Each packed data may include two or more independent data elements. Three packed data formats are shown: packed half 341, packed single 342, and packed double 343. One embodiment of packed half 341, packed single 342, and packed double 343 is a fixed point data element. In another embodiment, packed half 341, packed single 342, and packed double 343 may include floating point data elements. Another embodiment of packed half 341 is 128-bit long data containing eight 16-bit data elements. One embodiment of packed single 342 is 128 bits long and includes four 32-bit data elements. One embodiment of packed double 343 is 128 bits long and includes two 64-bit data elements. Of course, such a packed data format can be extended to register lengths of, for example, 96 bits, 160 bits, 192 bits, 224 bits, 256 bits, or more.

FIG. 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of unsigned packed bytes in a SIMD register. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits of the register are used. This storage arrangement increases the storage efficiency of the processor. In addition, because sixteen data elements are accessed, one operation can be performed on sixteen data elements in parallel. Signed packed byte representation 345 illustrates the storage of signed packed bytes. The eighth bit of each byte data element is the sign indicator. Unsigned packed word representation 346 shows how word 7 through word 0 are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. The sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. The necessary sign bit is the thirty-second bit of each doubleword data element. In one embodiment, an operand may be a constant and is therefore not changed by the instruction it accompanies.
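
As an illustration of the sign indicator described above, the following sketch reinterprets an unsigned element value as signed by treating its top bit (the 8th, 16th, or 32nd) as the sign. The two's-complement encoding is an assumption made here for the example; the text only states which bit is the sign indicator. The name `as_signed` is hypothetical.

```python
def as_signed(value, element_bits):
    """Reinterpret an unsigned element as signed, assuming two's complement."""
    sign_bit = 1 << (element_bits - 1)   # the top bit of the element
    magnitude_mask = sign_bit - 1        # all bits below the sign bit
    return (value & magnitude_mask) - (value & sign_bit)
```

The same stored bits therefore compare differently depending on whether an instruction treats the element as signed or unsigned: `0xFF` is 255 as an unsigned byte but -1 as a signed byte.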

FIG. 3D illustrates one embodiment of an operation encoding (opcode) format 360 having thirty-two or more bits, with register/memory operand addressing modes corresponding to a type of opcode format described in the "IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference," which is available from Intel Corporation, Santa Clara, California, on the World Wide Web at intel.com/design/litcentr. In one embodiment, a string comparison operation is encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. In one embodiment of the string compare instruction, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment of the string compare instruction, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the result of the string compare instruction, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. In one embodiment of the string compare instruction, operand identifiers 364 and 365 are used to identify 32-bit or 64-bit source and destination operands.

FIG. 3E illustrates another operation encoding (opcode) format 370 having forty or more bits. Opcode format 370 corresponds to opcode format 360 and includes an optional prefix byte 378. The type of string comparison operation is encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction are identified by source operand identifiers 374 and 375 and by prefix byte 378. In one embodiment of the string compare instruction, prefix byte 378 is used to identify 32-bit, 64-bit, or 128-bit source and destination operands. In one embodiment of the string compare instruction, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, the string comparison operation compares each element of one operand identified by operand identifiers 374 and 375 with each element of the other operand identified by operand identifiers 374 and 375, and those elements are overwritten by the result of the string comparison operation. In other embodiments, the result of the string comparison of the operands identified by identifiers 374 and 375 is written to another data element in another register. Opcode formats 360 and 370 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.

次に図３Ｆを参照して、別の実施形態では、６４ビット単一命令複数データ（ＳＩＭＤ）算術演算は、コ・プロセッサデータ処理（ＣＤＰ）命令により実行される。オペレーションエンコーディング（opcode）フォーマット３８０は、ＣＤＰopcodeフィールド３８２と３８９を有するかかるＣＤＰ命令を示す。ストリング比較演算の別の実施形態では、ＣＤＰ命令のタイプは、１つまたは複数のフィールド３８３、３８４、３８７及び３８８でエンコードされる。２つまでのソースオペランド識別子３８５と３９０と、１つのデスティネーションオペランド識別子３８６とを含め、一命令につき３つまでのオペランドの場所を特定できる。コ・プロセッサの一実施形態は、８、１６、３２及び６４ビット値で動作できる。一実施形態では、ストリング比較演算は整数データ要素に実行される。実施形態では、ストリング比較命令は、条件フィールド３８１を用いて、条件付きで実行してもよい。ストリング比較命令によっては、ソースデータサイズはフィールド３８３によりエンコードできる。ストリング比較命令の実施形態では、ＳＩＭＤフィールドでゼロ（Ｚ）、ネガティブ（Ｎ）、キャリー（Ｃ）、オーバーフロー（Ｖ）の検出をできる。命令によっては飽和のタイプをフィールド３８４でエンコードしてもよい。 Referring now to FIG. 3F, in another embodiment, a 64-bit single instruction multiple data (SIMD) arithmetic operation is performed by a co-processor data processing (CDP) instruction. Operation encoding (opcode) format 380 indicates such a CDP instruction having CDP opcode fields 382 and 389. In another embodiment of the string comparison operation, the type of the CDP instruction is encoded in one or more fields 383, 384, 387 and 388. Up to three operands can be specified per instruction, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the co-processor can operate on 8, 16, 32 and 64 bit values. In one embodiment, the string comparison operation is performed on integer data elements. In embodiments, the string comparison instruction may be conditionally executed using condition field 381. Depending on the string comparison instruction, the source data size can be encoded by field 383. In the embodiment of the string comparison instruction, zero (Z), negative (N), carry (C), and overflow (V) can be detected in the SIMD field. Depending on the instruction, the type of saturation may be encoded in field 384.

一実施形態では、ストリング比較演算の結果が非ゼロであることを示すために、フィールドまたは「フラグ」を用いてもよい。実施形態によっては、ソース要素が無効であることを示すフラグや、ストリング比較演算の結果のＬＳＢまたはＭＳＢを示すフラグなどの他のフィールドを使ってもよい。 In one embodiment, a field or "flag" may be used to indicate that the result of the string comparison operation is non-zero. In some embodiments, other fields may be used, such as a flag indicating that the source element is invalid, or a flag indicating the LSB or MSB of the result of the string comparison operation.

図４は、本発明による、パック化データオペランドにストリング比較演算を実行するロジックの一実施形態を示すブロック図である。本発明の実施形態は、上記のような様々なタイプのオペランドで機能するように実施できる。一実施形態では、本発明によるストリング比較演算は、特定のデータタイプに作用する命令セットとして実施する。例えば、整数と浮動小数点を含む３２ビットデータタイプの比較を実行するパック化ストリング比較命令を提供する。同様に、整数と浮動小数点を含む６４ビットデータタイプの比較を実行するパック化ストリング比較命令を提供する。以下の説明と実施例により、データ要素が何を表しているかに関わらずデータ要素を比較する比較命令の動作を説明する。説明を簡単にするため、一部の実施例は、データ要素がテキストの言葉である１つまたは複数のストリング比較命令の実行を示す。 FIG. 4 is a block diagram illustrating one embodiment of logic for performing a string comparison operation on packed data operands according to the present invention. Embodiments of the present invention can be implemented to work with various types of operands as described above. In one embodiment, the string comparison operation according to the present invention is implemented as an instruction set that operates on a particular data type. For example, it provides a packed string compare instruction that performs a comparison of 32-bit data types including integer and floating point. Similarly, a packed string compare instruction is provided that performs a comparison of 64-bit data types including integer and floating point. The following description and examples describe the operation of a compare instruction that compares data elements regardless of what the data elements represent. For simplicity, some embodiments show the execution of one or more string comparison instructions where the data elements are textual words.

In one embodiment, the string compare instruction compares each element of a first data operand DATA A 410 with each element of a second data operand DATA B 420, and the result of each comparison is stored in a RESULTANT 440 register. In the following description, DATA A, DATA B, and RESULTANT are generally referred to as registers, but they are not so restricted and also include registers, register files, and memory locations. In one embodiment, a text string compare instruction (e.g., "PCMPxSTRy") is decoded into one micro-operation. In an alternative embodiment, each instruction may be decoded into a varying number of micro-operations that perform the text string compare operation on the data operands. In this example, operands 410 and 420 are 128-bit wide pieces of information stored in a source register/memory having word-wide data elements. In one embodiment, operands 410 and 420 are held in 128-bit long SIMD registers, such as 128-bit SSEx XMM registers. In one embodiment, RESULTANT 440 is also an XMM data register. In other embodiments, RESULTANT 440 may be a different type of register, such as an extended register (e.g., "EAX"), or a memory location. Depending on the embodiment, the operands and registers may be of other lengths, such as 32, 64, or 256 bits, and may have byte, doubleword, or quadword sized data elements. Although the data elements of this example are word sized, the same concept can be extended to byte and doubleword sized elements. In one embodiment, where the data operands are 64 bits wide, MMX registers are used in place of the XMM registers.

In one embodiment, the first operand 410 comprises a set of eight data elements: A7, A6, A5, A4, A3, A2, A1, and A0. Each comparison between elements of the first and second operands may correspond to a data element position in the resultant 440. In one embodiment, the second operand 420 comprises another set of eight data segments: B7, B6, B5, B4, B3, B2, B1, and B0. The data segments here are of equal length, and each comprises a single word (16 bits) of data. However, data elements and data element positions can have granularities other than words. If each data element were a byte (8 bits), a doubleword (32 bits), or a quadword (64 bits), the 128-bit operands would have sixteen byte-wide, four doubleword-wide, or two quadword-wide data elements, respectively. Embodiments of the present invention are not restricted to data operands or data segments of a particular length, and can be sized appropriately for each implementation.

Operands 410 and 420 can reside in a register, a memory location, a register file, or a mix of these. The data operands 410 and 420 are sent to the string comparison logic 430 of an execution unit in the processor along with a text string compare instruction. In one embodiment, by the time the instruction reaches the execution unit, it has already been decoded earlier in the processor pipeline. Thus, the string compare instruction can be in the form of a micro-operation (μop) or some other decoded format. In one embodiment, the two data operands 410 and 420 are received at string comparison logic 430. In one embodiment, the text string comparison logic generates an indication of whether elements of the two data operands are equal. In one embodiment, only valid elements of each operand are compared. The valid elements are indicated, for each element of each operand, by another register or memory location. In one embodiment, each element of operand 410 is compared with each element of operand 420, which produces a number of comparison results equal to the number of elements of operand 410 multiplied by the number of elements of operand 420. For example, where each of operands 410 and 420 is a 32-bit value, the result register 440 stores up to 32 × 32 result indicators of the text comparison operation performed by string comparison logic 430. In one embodiment, the data elements from the first and second operands are single precision (e.g., 32 bits), whereas in other embodiments the data elements from the first and second operands are double precision (e.g., 64 bits). In still other embodiments, the first and second operands may include integer elements of any size, including 8, 16, and 32 bits.

In one embodiment, the data elements for all of the data positions are processed in parallel. In another embodiment, a certain portion of the data element positions can be processed together at a time. In one embodiment, the RESULTANT 440 comprises multiple results of the comparisons made between each of the data elements stored in operands 410 and 420. Specifically, in one embodiment, the RESULTANT may store as many comparison results as the square of the number of data elements in one of operands 410 or 420.

In one embodiment, the RESULTANT stores comparison results only for comparisons made between valid data elements of operands 410 and 420. In one embodiment, the data elements of each operand may be indicated to be valid either explicitly or implicitly. For example, in one embodiment each operand data element corresponds to a validity indicator, such as a valid bit, stored in another storage area, such as a valid register. In one embodiment, the validity bits for each element of both operands are stored in the same valid register, whereas in other embodiments the validity bits for one operand are stored in a first valid register and the validity bits for the other operand are stored in a second valid register. Before the operand data elements are compared, or in conjunction with the comparison, a determination may be made as to whether both data elements are valid (e.g., by checking the corresponding valid bits), so that comparisons are made only between valid elements.

In one embodiment, the valid data elements of each operand may be indicated implicitly by the use of null or "zero" fields stored within one or both of the operands. For example, in one embodiment a null byte (or a null field of another size) may be stored in an element to indicate that all data elements more significant than the null byte are invalid, whereas all data elements less significant than the null byte are valid and should therefore be compared with the corresponding valid data elements of the other operand. Furthermore, in one embodiment, the valid data elements of one operand may be indicated explicitly (as described above), whereas the valid data elements of the other operand may be indicated implicitly using null fields. In one embodiment, valid data elements are indicated by a count corresponding to the number of valid data elements or sub-elements within one or more source operands.
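
The implicit, null-terminated form of validity described above can be sketched as follows. The function name `implicit_validity` is hypothetical, element 0 is taken to be the least significant element, and treating the null element itself as invalid is an assumption of this sketch rather than something the text states.

```python
def implicit_validity(elements):
    """Return one validity flag per element for a null-terminated operand.

    Elements less significant than the first null are valid (1); the null
    element and every more significant element are treated as invalid (0).
    """
    valid = []
    seen_null = False
    for value in elements:  # index 0 is the least significant element
        if value == 0:
            seen_null = True
        valid.append(0 if seen_null else 1)
    return valid
```

For the four-element operand `['a', 'b', 0, 'c']`, only the first two elements would be compared; the stale `'c'` above the terminator is ignored.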

Regardless of the method by which the valid data elements of each operand are indicated, at least one embodiment compares the data elements of each operand that are indicated to be valid. Comparing only valid data elements may be performed in a number of ways in various embodiments. For the purpose of providing a thorough and understandable description, the method by which only valid data elements are compared between two text string operands is best conceptualized as set out below. However, the following description is merely one example of how the comparison of only the valid data elements of text string operands may be conceptualized or implemented. In other embodiments, other conceptualizations or methods may be used to illustrate how valid data elements are compared.

In one embodiment, only the valid data elements of each operand are compared with each other, regardless of whether the number of valid data elements in an operand is indicated explicitly (e.g., by valid bits in a validity register, or by a count of the number of valid bytes/words starting from the least significant) or implicitly (e.g., by a null character within the operand itself). In one embodiment, the aggregation of the validity indicators and the data elements to be compared is described conceptually with reference to FIG. 5.

Referring to FIG. 5, in one embodiment, arrays 501 and 505 contain entries that indicate whether each element of the first operand and of the second operand, respectively, is valid. For example, in the example above, array 501 contains a "1" in each array element for which the first operand contains a corresponding valid data element. Similarly, array 505 contains a "1" in each array element for which the second operand contains a corresponding valid data element. In one embodiment, arrays 501 and 505 contain ones starting at array element zero, one for each valid element present in each of the two respective operands. For example, if the first operand contains four valid elements, array 501 contains ones only in its first four array elements, and all other array elements of array 501 are zeros.
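
A minimal sketch of how validity arrays such as 501 and 505 could be built from an explicit count of valid elements. The function name `validity_from_count` and the choice to clamp out-of-range counts are assumptions made for illustration only.

```python
def validity_from_count(valid_count, array_size=16):
    """Build a validity array with ones in the first `valid_count` entries."""
    # Clamp the count so it never exceeds the array (an assumption of this sketch).
    count = max(0, min(valid_count, array_size))
    return [1] * count + [0] * (array_size - count)


# A first operand with four valid byte elements out of sixteen
array_501 = validity_from_count(4)
```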

一実施形態では、アレイ５０１と５０５はサイズが１６要素であり、２つの１２８ビットオペランドの１６個のデータ要素を表し、各々はサイズが８ビット（１バイト）である。他の実施形態では、オペランドのデータ要素のサイズが１６ビットであり、アレイ５０１と５０５は８要素のみを含む。他の実施形態では、アレイ５０１と５０５は、対応するオペランドのサイズに応じて大きくても小さくてもよい。 In one embodiment, arrays 501 and 505 are 16 elements in size and represent 16 data elements of two 128-bit operands, each 8 bits (1 byte) in size. In another embodiment, the data element size of the operand is 16 bits, and arrays 501 and 505 contain only 8 elements. In other embodiments, arrays 501 and 505 may be larger or smaller depending on the size of the corresponding operand.

In one embodiment, each data element of the first operand is compared with each data element of the second operand, and the result is represented by an i × j array 510. For example, the first data element of the first operand, representing a text string, is compared with each data element of the other operand, representing another text string, and a "1" stored in an array element within the first row of array 510 corresponds to a match between the first data element of the first operand and the respective data element of the second operand. This is repeated for each data element of the first operand until array 510 is complete.

In one embodiment, a second array 515 of i × j entries is generated to store indications of whether only valid operand data elements are equal. For example, in one embodiment, each entry of the top row 511 of array 510 is logically ANDed with the corresponding valid array element 506 and valid array element 502, and the result is placed in the corresponding element 516 of array 515. The AND operation may be performed between each element of array 510 and the corresponding elements of valid arrays 501 and 505, and the result placed in the corresponding element of array 520.
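
The all-pairs comparison and the validity masking described above can be sketched as two small functions. The names `compare_all` and `mask_with_validity` are hypothetical, and rows are indexed by elements of the first operand as in the description of array 510. This is a conceptual model, not a description of the hardware.

```python
def compare_all(first, second):
    """Model of array 510: entry [i][j] is 1 when first[i] equals second[j]."""
    return [[1 if a == b else 0 for b in second] for a in first]


def mask_with_validity(bool_res, valid_first, valid_second):
    """Model of array 515: AND each comparison with both validity flags."""
    return [
        [bool_res[i][j] & valid_first[i] & valid_second[j]
         for j in range(len(valid_second))]
        for i in range(len(valid_first))
    ]


first = [1, 2, 0, 0]      # two valid elements followed by padding
second = [2, 1, 0, 0]
raw = compare_all(first, second)
masked = mask_with_validity(raw, [1, 1, 0, 0], [1, 1, 0, 0])
```

Note that the padding zeros compare equal to each other in `raw`, which is exactly the spurious match the validity masking removes in `masked`.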

In one embodiment, the result array 520 indicates whether any data elements of one operand are related to data elements of the other operand. For example, by ANDing pairs of elements of array 515 and ORing all of the results of the ANDing, result array 520 can store bits indicating whether any data elements fall within a set of ranges defined by the data elements of the other operand.
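
One way to read the range test described above is sketched below. It assumes, beyond what the text states, that the bounds operand holds consecutive (lower, upper) pairs, that the per-element comparisons are "greater than or equal" for a lower bound and "less than or equal" for an upper bound, and that all elements passed in are valid. The name `in_ranges` is hypothetical.

```python
def in_ranges(bounds, data):
    """For each data element, return 1 if it lies in any [lower, upper] pair."""
    result = []
    for value in data:
        hit = 0
        for k in range(0, len(bounds) - 1, 2):
            lower_ok = 1 if value >= bounds[k] else 0      # compare with lower bound
            upper_ok = 1 if value <= bounds[k + 1] else 0  # compare with upper bound
            hit |= lower_ok & upper_ok                     # AND the pair, OR across pairs
        result.append(hit)
    return result


# Which characters of "bBz" are lowercase ASCII letters?
lowercase_hits = in_ranges([ord("a"), ord("z")], [ord("b"), ord("B"), ord("z")])
```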

図５は、少なくとも２つのパック化オペランドのデータ要素間の比較に関する様々なインジケータを記憶する結果アレイ５２０も示す。例えば、結果アレイ５２０は、アレイ５１５の対応する要素をＯＲ演算することにより、２つのオペランド間に等しいデータ要素はあるかどうかを示すビットを記憶する。アレイ５１５のアレイ要素のどれかが、例えば、オペランドの有効なデータ要素間に一致するものがあることを示す「１」を含む場合、これは結果アレイ５２０に反映される。結果アレイ５２０の要素をＯＲ演算して、オペランドの有効なデータ要素が等しいか判断することもできる。 FIG. 5 also shows a result array 520 that stores various indicators for comparisons between data elements of at least two packed operands. For example, result array 520 stores a bit indicating whether there is an equal data element between the two operands by ORing the corresponding elements of array 515. If any of the array elements of the array 515 include, for example, a "1" indicating that there is a match between valid data elements of the operands, this is reflected in the result array 520. Elements of the result array 520 may be ORed to determine whether the valid data elements of the operands are equal.
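
The OR reductions described above can be sketched as follows. The input is a validity-masked comparison matrix of the kind array 515 represents. Reducing along the first operand's axis, so that there is one result bit per element of the second operand, is an assumption of this sketch, and the names `equal_any` and `any_match` are hypothetical.

```python
def equal_any(masked):
    """One bit per column: 1 if any valid element of the first operand matched."""
    columns = len(masked[0]) if masked else 0
    return [int(any(row[j] for row in masked)) for j in range(columns)]


def any_match(result):
    """OR of all result bits: 1 if the operands share at least one valid element."""
    return int(any(result))


masked_example = [
    [0, 1, 0],
    [0, 0, 0],
]
result_example = equal_any(masked_example)
```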

一実施形態では、アレイ内の隣接する「１」を検出することにより、結果アレイ５２０内の、２つのオペランドのデータ要素間の有効な一致の連続を検出する。一実施形態では、これは、連続する結果アレイ要素を一度にＡＮＤ演算し、「０」を検出するまで一ＡＮＤ演算の結果と次の結果とをＡＮＤ演算することにより、実現できる。他の実施形態では、他の論理を用いて２つのパック化演算のデータ要素の有効な一致の範囲を検出してもよい。 In one embodiment, detecting a contiguous "1" in the array detects a sequence of valid matches between the data elements of the two operands in the result array 520. In one embodiment, this can be achieved by ANDing successive result array elements at one time and ANDing the result of one AND operation with the next result until a "0" is detected. In other embodiments, other logic may be used to detect a valid match range of the data elements of the two packed operations.
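
The running AND described above can be modeled as below. Measuring only the run of ones that starts at the first result entry is an assumption made to keep the sketch short, and `leading_match_length` is a hypothetical name.

```python
def leading_match_length(result):
    """Length of the run of consecutive 1 entries starting at entry 0."""
    running = 1
    length = 0
    for bit in result:
        running &= bit      # AND the accumulated value with the next entry
        if not running:     # stop as soon as a "0" is detected
            break
        length += 1
    return length
```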

一実施形態では、結果アレイ５２０は、対応する結果アレイエントリーに「１」を返すことにより、両方のオペランドの各データ要素が一致するか示すこともできる。すべてのエントリーが等しいか判断するため、結果アレイエントリーにＸＯＲ演算を実行してもよい。他の実施形態では、他の論理を用いて２つのオペランドの有効データ要素が等しいか判断してもよい。 In one embodiment, result array 520 may indicate whether each data element of both operands matches by returning a "1" to the corresponding result array entry. An XOR operation may be performed on the result array entries to determine if all entries are equal. In other embodiments, other logic may be used to determine whether the valid data elements of the two operands are equal.

In one embodiment, the presence of a string of data elements somewhere within another string of data elements can be detected by comparing a test string with equally sized portions of the other string and indicating in the result array a match between the test string and that portion of the other string. For example, in one embodiment, a test string of three characters, corresponding to three data elements in a first operand, is compared with a first set of three data elements of a second string. If a match is detected, the match is reflected in the result array by storing a "1" in the group of three result entries corresponding to the match. The test string may then be compared with the next three data elements of the other operand. Alternatively, two of the previously compared data elements of the other operand and a new third data element may be compared with the test string, so that the test string "slides" along the other operand as it is compared.
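
The "sliding" variant described above can be sketched as follows. Operands are modeled as Python lists of element values, all elements are assumed valid, and `find_substring` is a hypothetical name. Each match marks the group of result entries it covers with "1", as in the text.

```python
def find_substring(test, other):
    """Slide `test` along `other`; mark every position covered by a match with 1."""
    size = len(test)
    result = [0] * len(other)
    if size == 0:
        return result
    for start in range(len(other) - size + 1):   # slide one element at a time
        if other[start:start + size] == test:
            for position in range(start, start + size):
                result[position] = 1
    return result


hits = find_substring(list(b"abc"), list(b"xxabcx"))
```

Advancing `start` by `size` instead of by one would give the non-overlapping variant, in which the test string is compared with the next three data elements each time.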

In one embodiment, the entries of the result array may be inverted, or negated, depending on the application. In other embodiments, only some of the result entries are negated, for example only those corresponding to valid matches between the data elements of the two operands. In other embodiments, other operations may be performed on the result entries of result array 520. For example, in some embodiments the result array 520 is represented as a mask value. In other embodiments, the result array is represented by an index value, which is stored in a storage location such as a register. The index is represented by a group of the most significant bits of the result array in one embodiment, and by a group of the least significant bits of the array in another embodiment. In one embodiment, the index is represented by an offset value to the least or most significant bit that is set. The mask is zero-extended in one embodiment, whereas in other embodiments it is a byte/word mask or has some other granularity.
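
The post-processing options described above (negation, mask output, index output) can be sketched as below. All function names are hypothetical. Returning the array length when no bit is set is a convention chosen for this sketch, not something the text specifies.

```python
def negate(result, valid=None):
    """Invert every entry, or only the entries flagged in `valid`."""
    if valid is None:
        return [bit ^ 1 for bit in result]
    return [bit ^ flag for bit, flag in zip(result, valid)]


def to_mask(result):
    """Pack the result entries into an integer bit mask, entry 0 in bit 0."""
    mask = 0
    for position, bit in enumerate(result):
        mask |= (bit & 1) << position
    return mask


def lsb_index(result):
    """Offset of the least significant set entry (len(result) if none)."""
    for position, bit in enumerate(result):
        if bit:
            return position
    return len(result)


def msb_index(result):
    """Offset of the most significant set entry (len(result) if none)."""
    for position in range(len(result) - 1, -1, -1):
        if result[position]:
            return position
    return len(result)
```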

In various embodiments, each of the above variations in comparing the elements of SIMD operands may be performed as an individual instruction. In other embodiments, the above variations may be performed by altering attributes of a single instruction, such as immediate fields associated with the instruction. FIG. 6 illustrates various operations performed by one or more instructions to compare each data element of two or more SIMD operands. In one embodiment, the operands compared by the operations of FIG. 6 each represent a text string. In other embodiments, the operands may represent some other information or data.

Referring to FIG. 6, in operation 610 the elements of a first SIMD operand 601 and a second SIMD operand 605 are compared with each other. In one embodiment, one operand is stored in a register such as an XMM register, and the other operand is stored in another XMM register or in memory. In one embodiment, the type of comparison is controlled by an immediate field of the instruction that performs the operations shown in FIG. 6. For example, in one embodiment a 2-bit immediate field (e.g., IMM8[1:0]) indicates whether the data elements to be compared are signed bytes, signed words, unsigned bytes, or unsigned words. In one embodiment, the comparison produces an i×j array (e.g., BoolRes[i,j]) or a portion of an i×j array.
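
As an illustration only, the all-pairs comparison of operation 610 can be modeled in software. The Python sketch below is not the hardware described here; the function names and the particular assignment of the four 2-bit type codes are assumptions made for the example, since the text only says that a 2-bit field selects among signed or unsigned bytes and words.

```python
def to_signed(value, bits):
    """Reinterpret an unsigned bit pattern as a two's-complement integer."""
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def decode_elements(raw, element_type):
    """Split a packed operand into elements according to a 2-bit type code.

    Hypothetical encoding, assumed for this sketch:
    0 = unsigned bytes, 1 = unsigned words, 2 = signed bytes, 3 = signed words.
    """
    width = 2 if element_type & 1 else 1
    signed = bool(element_type & 2)
    elements = []
    for offset in range(0, len(raw), width):
        value = int.from_bytes(raw[offset:offset + width], "little")
        elements.append(to_signed(value, 8 * width) if signed else value)
    return elements


def bool_res(op1, op2):
    """BoolRes[i][j] is True when element i of op1 equals element j of op2."""
    return [[a == b for b in op2] for a in op1]


# Two 16-byte operands holding zero-padded text, compared as unsigned bytes.
op1 = decode_elements(b"ab".ljust(16, b"\0"), 0)
op2 = decode_elements(b"banana".ljust(16, b"\0"), 0)
table = bool_res(op1, op2)
```

In this model `table[0][1]` is true because the `a` in the first operand equals the second byte of `banana`. The zero padding of one operand also compares equal to the zero padding of the other, which is why the raw comparison array has to be combined with per-element validity information before it is used.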

In operation 613, in parallel, the end of the string represented by each of operands 601 and 605 is found and the validity of each element of operands 601 and 605 is determined. In one embodiment, the validity of each element of operands 601 and 605 is indicated explicitly by setting one or more corresponding bits in a register or memory location. In one embodiment, the one or more bits correspond to the number of consecutive valid data elements (e.g., bytes) starting from the LSB position of operands 601 and 605. For example, depending on the operand size, a register such as EAX or RAX is used to store the bits indicating the validity of each data element of the first operand. Similarly, depending on the operand size, a register such as EDX or RDX is used to store the bits indicating the validity of each data element of the second operand. In other embodiments, the validity of each element of operands 601 and 605 may be indicated implicitly, by the means already described in this disclosure.
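
The two ways of establishing validity in operation 613 can be sketched as follows. This is a minimal Python model with hypothetical function names. The explicit variant assumes the register holds a count of valid elements starting at the least significant position. The implicit variant assumes that a zero element marks the end of the string, which is one plausible reading of the implicit indication and is not spelled out in this passage.

```python
def explicit_validity(length, element_count):
    """Validity from an explicit length, such as a count held in a register.

    The count is clamped to the number of elements the operand can hold.
    """
    valid = max(0, min(length, element_count))
    return [index < valid for index in range(element_count)]


def implicit_validity(elements):
    """Validity from a terminator: elements before the first zero are valid.

    Assumption for this sketch: a zero element ends the string.
    """
    valid = []
    seen_terminator = False
    for element in elements:
        if element == 0:
            seen_terminator = True
        valid.append(not seen_terminator)
    return valid


# A 3-element string in an 8-element operand, described both ways.
by_length = explicit_validity(3, 8)
by_terminator = implicit_validity([ord("c"), ord("a"), ord("t"), 0, 7, 7, 7, 7])
```

Both calls mark the first three elements valid and the remaining five invalid, including the nonzero values that follow the terminator in the second case.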

In one embodiment, in operation 615 the comparison and validity information is combined by an aggregation function to produce a result of comparing the elements of the two operands. In one embodiment, the aggregation function is determined by an immediate field associated with the instruction that performs the comparison of the elements of the two operands. For example, in one embodiment the immediate field indicates whether the comparison is to show that any data elements of the two operands are equal, that ranges of data elements of the two operands are equal, that each data element of the two operands is equal, or that at least a partial sequence of data elements of the operands is the same.
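
To make the four aggregation functions of operation 615 concrete, the sketch below models each one in Python on lists of element values and validity flags. The function names are hypothetical and the semantics are deliberately simplified: invalid elements never contribute a match, the range variant treats consecutive pairs of the first operand as inclusive (low, high) bounds, and the ordered variant reports only complete matches. An actual implementation may treat invalid elements and partial matches differently.

```python
def codes(text):
    """Helper for the examples: character codes of an ASCII string."""
    return list(text.encode("ascii"))


def equal_any(op1, valid1, op2, valid2):
    """Entry j is set when valid op2[j] equals any valid element of op1."""
    return [
        valid2[j] and any(ok and a == op2[j] for a, ok in zip(op1, valid1))
        for j in range(len(op2))
    ]


def in_ranges(op1, valid1, op2, valid2):
    """Entry j is set when valid op2[j] lies in a valid (low, high) pair of op1."""
    pairs = [
        (op1[i], op1[i + 1])
        for i in range(0, len(op1) - 1, 2)
        if valid1[i] and valid1[i + 1]
    ]
    return [
        valid2[j] and any(low <= op2[j] <= high for low, high in pairs)
        for j in range(len(op2))
    ]


def equal_each(op1, valid1, op2, valid2):
    """Entry i is set when op1[i] and op2[i] are both valid and equal."""
    return [
        ok1 and ok2 and a == b
        for a, ok1, b, ok2 in zip(op1, valid1, op2, valid2)
    ]


def equal_ordered(op1, valid1, op2, valid2):
    """Entry j is set when the valid part of op1 occurs in op2 starting at j."""
    needle = [a for a, ok in zip(op1, valid1) if ok]
    haystack = [b for b, ok in zip(op2, valid2) if ok]
    return [
        bool(needle) and haystack[j:j + len(needle)] == needle
        for j in range(len(op2))
    ]


all_valid = lambda n: [True] * n
vowels = equal_any(codes("aeiou"), all_valid(5), codes("hello"), all_valid(5))
lower = in_ranges(codes("az"), all_valid(2), codes("Hi5!"), all_valid(4))
found = equal_ordered(codes("an"), all_valid(2), codes("banana"), all_valid(6))
```

Here `vowels` flags positions 1 and 4 of `hello`, `lower` flags only the `i` in `Hi5!`, and `found` flags positions 1 and 3 of `banana`, where `an` begins.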

In operation 620, in one embodiment, the result of the aggregation function (e.g., stored in IntRes1) is negated. In one embodiment, bits of the immediate field (e.g., IMM8[6:5]) control the type of negation applied to the result of the aggregation function. For example, the immediate field may indicate that the aggregation result is not negated at all, that the entire result of the aggregation function is negated, or that only the aggregation results corresponding to valid elements of the operands are negated. In one embodiment, the result of the negation operation is stored in an array (e.g., an IntRes2 array).
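
The three negation choices of operation 620 can be expressed compactly. In the Python sketch below the mode constants and the function name are hypothetical; the text names the three behaviors but this passage does not give the bit patterns that select them.

```python
# Hypothetical mode values; the source does not specify the actual encodings.
NEGATE_NONE, NEGATE_ALL, NEGATE_VALID_ONLY = range(3)


def negate(int_res1, valid, mode):
    """Produce IntRes2 from IntRes1 according to the selected negation mode."""
    if mode == NEGATE_NONE:
        return list(int_res1)
    if mode == NEGATE_ALL:
        return [not bit for bit in int_res1]
    if mode == NEGATE_VALID_ONLY:
        # Flip entries for valid elements, pass the others through unchanged.
        return [(not bit) if ok else bit for bit, ok in zip(int_res1, valid)]
    raise ValueError(f"unknown negation mode: {mode}")


int_res1 = [True, False, True, False]
valid = [True, True, False, False]
int_res2 = negate(int_res1, valid, NEGATE_VALID_ONLY)
```

With these inputs only the first two entries are flipped, so `int_res2` is `[False, True, True, False]`. Negating only valid entries is useful for searches such as "first character that is not in the set", where positions past the end of the string must not turn into matches.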

In one embodiment, the array produced by the negation operation is converted into an index value or a mask value in operations 625 and 630, respectively. When the negation result is converted into an index, a bit of the immediate field (e.g., IMM8[6]) controls whether the MSB or the LSB of the comparison result is encoded into the index, and the result is stored in a register (e.g., ECX or RCX). In one embodiment, when the negation result is represented as a mask value, a bit of the immediate field (e.g., IMM8[6]) controls whether the mask is zero-extended or expanded to a byte (or word) mask.
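
The output conversions of operations 625 and 630 can be sketched as below. This is a Python model with hypothetical function names. Returning the element count when no entry is set is an assumption of the sketch, since the passage does not say what index is produced in that case.

```python
def to_index(int_res2, most_significant=False):
    """Offset of the lowest (or highest) set entry of the result array.

    Assumption for this sketch: when no entry is set, the number of
    entries is returned as an out-of-range marker.
    """
    positions = [k for k, bit in enumerate(int_res2) if bit]
    if not positions:
        return len(int_res2)
    return positions[-1] if most_significant else positions[0]


def to_bit_mask(int_res2):
    """Zero-extended mask: entry k becomes bit k, all higher bits stay zero."""
    return sum(1 << k for k, bit in enumerate(int_res2) if bit)


def to_element_mask(int_res2, element_bytes=1):
    """Expanded mask: each set entry becomes an all-ones byte or word."""
    bits = 8 * element_bytes
    ones = (1 << bits) - 1
    return sum(ones << (bits * k) for k, bit in enumerate(int_res2) if bit)


int_res2 = [False, True, False, True]
lsb_index = to_index(int_res2)
msb_index = to_index(int_res2, most_significant=True)
bit_mask = to_bit_mask(int_res2)
byte_mask = to_element_mask(int_res2, element_bytes=1)
```

For this four-entry example the indices are 1 and 3, the zero-extended mask is `0b1010`, and the byte-expanded mask is `0xFF00FF00`. The index form suits "find first" style searches, while the expanded mask can be applied directly to the packed data with bitwise operations.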

Thus, techniques for performing a string comparison operation have been disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of the invention and do not restrict it, and that the invention is not limited to the specific constructions and arrangements shown and described, since various modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail, as facilitated by enabling technological advancements, without departing from the principles of the present disclosure or the scope of the accompanying claims.

The following supplementary notes are provided for the above embodiments.
(Supplementary note 1) A machine-readable medium storing instructions which, when executed by a machine, cause the machine to perform a method comprising:
comparing each data element of a first packed operand with each data element of a second packed operand; and
storing a first result of the comparison.
(Supplementary note 2) The machine-readable medium of supplementary note 1, wherein only valid data elements of the first operand are compared with only valid data elements of the second operand.
(Supplementary note 3) The machine-readable medium of supplementary note 1, wherein the first result indicates whether any of the data elements are equal.
(Supplementary note 4) The machine-readable medium of supplementary note 1, wherein the first result indicates whether a range of data elements indicated in the first operand is equal to a range of data elements indicated in the second operand.
(Supplementary note 5) The machine-readable medium of supplementary note 1, wherein the first result indicates whether each data element of the first operand is equal to each data element of the second operand.
(Supplementary note 6) The machine-readable medium of supplementary note 1, wherein the first result indicates whether the order of a portion of the data elements of the first operand is equal to the order of a portion of the data elements of the second operand.
(Supplementary note 7) The machine-readable medium of supplementary note 1, wherein a portion of the first result is negated.
(Supplementary note 8) The machine-readable medium of supplementary note 1, wherein the first result is represented by either a mask value or an index value.
(Supplementary note 9) An apparatus comprising:
comparison logic to compare only valid data elements of a first operand with only valid data elements of a second operand; and
a first control signal to control the comparison logic.
(Supplementary note 10) The apparatus of supplementary note 9, wherein the validity of the data elements of the first and second operands is indicated explicitly.
(Supplementary note 11) The apparatus of supplementary note 9, wherein the validity of the data elements of the first and second operands is indicated implicitly.
(Supplementary note 12) The apparatus of supplementary note 9, wherein the first control signal includes a sign control signal indicating whether the comparison logic is to compare signed or unsigned values.
(Supplementary note 13) The apparatus of supplementary note 12, wherein the first control signal includes an aggregation function signal indicating whether the comparison logic is to perform an aggregation function selected from a list consisting of: any equal, ranges equal, each equal, non-contiguous substring, and equal order.
(Supplementary note 14) The apparatus of supplementary note 13, wherein the first control signal includes a negation signal that causes the comparison logic to negate at least a portion of the result of the comparison.
(Supplementary note 15) The apparatus of supplementary note 14, wherein the first control signal includes an index signal indicating whether the comparison logic is to generate an index of the MSB or of the LSB of the result of the comparison.
(Supplementary note 16) The apparatus of supplementary note 15, wherein the first control signal includes a mask signal indicating whether the comparison logic is to generate a zero-extended mask or an expanded mask as the result of the comparison.
(Supplementary note 17) The apparatus of supplementary note 16, wherein the first control signal is a control field storing a plurality of bits.
(Supplementary note 18) A system comprising:
a first memory to store a single instruction multiple data (SIMD) compare instruction; and
a processor to execute the SIMD compare instruction to compare data elements of first and second operands indicated by the SIMD compare instruction.
(Supplementary note 19) The system of supplementary note 18, wherein the first operand is indicated in the instruction by an address of a first register.
(Supplementary note 20) The system of supplementary note 19, wherein the second operand is indicated in the instruction by a memory address or a second register.
(Supplementary note 21) The system of supplementary note 20, wherein the instruction includes an immediate field indicating control signals for the processor.
(Supplementary note 22) The system of supplementary note 21, wherein the immediate field indicates whether the operands contain signed bytes, unsigned bytes, signed words, or unsigned words.
(Supplementary note 23) The system of supplementary note 22, wherein the immediate field indicates an aggregation function to be performed by the processor.
(Supplementary note 24) The system of supplementary note 23, wherein the immediate field indicates whether a mask or an index is to be generated in response to execution of the instruction.
(Supplementary note 25) The system of supplementary note 18, wherein the instruction causes only explicitly valid data elements of the first and second operands to be compared.
(Supplementary note 26) The system of supplementary note 18, wherein the instruction causes only implicitly valid data elements of the first and second operands to be compared.
(Supplementary note 27) A processor comprising:
a first storage area to store a first packed operand corresponding to a first text string;
a second storage area to store a second packed operand corresponding to a second text string;
comparison logic to compare all valid data elements of the first packed operand with all valid data elements of the second packed operand; and
a third storage area to store a result array of the comparison performed by the comparison logic.
(Supplementary note 28) The processor of supplementary note 27, wherein the comparison logic generates a two-dimensional array of values, and entries of the array correspond to comparisons between valid data elements of the first packed operand and valid data elements of the second packed operand.
(Supplementary note 29) The processor of supplementary note 28, wherein the comparison logic performs, on the two-dimensional array of values, one of the aggregation functions consisting of: any equal, ranges equal, each equal, non-contiguous substring, and equal order.
(Supplementary note 30) The processor of supplementary note 29, wherein the result array is represented by either a mask value or an index value.

Claims

1. A processor comprising:
at least one processor core to execute instructions and process data;
a multi-level cache including a level 1 (L1) cache;
a first packed data source register;
a second packed data source register;
a decoder to decode instructions including a packed compare instruction, wherein a first set of packed integer data elements is stored in the first packed data source register, a second set of packed integer data elements is stored in the second packed data source register, one or more packed integer data elements of the first set encode a character, and the second set of packed integer data elements encodes a set of ranges; and
an execution circuit to execute the packed compare instruction, to determine that a first character encoded by a first packed integer data element of the first set is within a range defined by the second set of packed integer data elements, and to update a result register to provide information that the first character is within the range.

2. The processor of claim 1, wherein the first set of packed integer data elements and the second set of packed integer data elements include packed bytes.

3. The processor of claim 1, further comprising a microcode read-only memory (ROM) to store microcode for the packed compare instruction.

4. The processor of claim 1, wherein the first packed data source register and the second packed data source register include 128-bit packed data registers.

5. The processor of claim 1, further comprising an instruction scheduler to schedule the packed compare instruction for execution by the execution circuit.

6. The processor of claim 1, wherein the characters include text characters and numeric characters.

7. A method comprising:
storing a first set of packed integer data elements in a first packed data source register;
storing a second set of packed integer data elements in a second packed data source register;
decoding instructions including a packed compare instruction, wherein one or more packed integer data elements of the first set encode a character and the second set of packed integer data elements encodes a set of ranges;
executing the packed compare instruction;
determining, in response to execution of the packed compare instruction, whether a first character encoded by a first packed integer data element of the first set is within a range defined by the second set of packed integer data elements; and
updating a result register to provide information that the first character is within the range.

8. The method of claim 7, wherein the first set of packed integer data elements and the second set of packed integer data elements include packed bytes.

9. The method of claim 7, wherein the decoding comprises translating the packed compare instruction using a microcode read-only memory (ROM).

10. The method of claim 7, wherein the first packed data source register and the second packed data source register include 128-bit packed data registers.

11. The method of claim 7, further comprising scheduling the packed compare instruction for execution following the decoding.

12. The method of claim 7, wherein the characters include text characters and numeric characters.

13. A computer program that causes a computer to execute:
storing a first set of packed integer data elements in a first packed data source register;
storing a second set of packed integer data elements in a second packed data source register;
decoding instructions including a packed compare instruction, wherein one or more packed integer data elements of the first set encode a character and the second set of packed integer data elements encodes a set of ranges;
executing the packed compare instruction;
determining, in response to execution of the packed compare instruction, whether a first character encoded by a first packed integer data element of the first set is within a range defined by the second set of packed integer data elements; and
updating a result register to provide information that the first character is within the range.

14. The computer program of claim 13, wherein the first set of packed integer data elements and the second set of packed integer data elements include packed bytes.

15. The computer program of claim 13, wherein the decoding includes translating the packed compare instruction using a microcode read-only memory (ROM).

16. The computer program of claim 13, wherein the first packed data source register and the second packed data source register include 128-bit packed data registers.

17. The computer program of claim 13, further causing the computer to execute scheduling the packed compare instruction for execution following the decoding.

18. The computer program of claim 13, wherein the characters include text characters and numeric characters.

19. A machine-readable medium storing the computer program according to any one of claims 13 to 18.