JP4107043B2

JP4107043B2 - Arithmetic processing unit

Info

Publication number: JP4107043B2
Application number: JP2002300511A
Authority: JP
Inventors: 久志高須; 岳史長田
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2002-10-15
Filing date: 2002-10-15
Publication date: 2008-06-25
Anticipated expiration: 2022-10-15
Also published as: JP2004139156A

Description

【０００１】
【発明の属する技術分野】
本発明は、演算処理装置に関し、特にＳＩＭＤ演算を用いてベクトルの外積演算を実行する演算処理装置に関する。
【０００２】
【従来の技術】
従来より、演算処理装置では、処理の高速化のために様々な手法が用いられており、その一つとして、単一の演算命令で複数の演算器と複数のデータを扱って演算を並列に処理するＳＩＭＤ（Single Instruction Multiple Data）演算が知られている。
【０００３】
即ち、レジスタに格納されたデータに対して各種演算を実行する演算処理ユニットは、通常の命令では、レジスタに単一のデータが格納されているものとして処理するのに対して、ＳＩＭＤ演算では、ｍ×ｎ（ｍ，ｎはいずれも正整数）ビット幅のレジスタに、ｎビット幅のデータがｍ個詰め込まれているものとして、これらｍ個のデータを並列に処理するのである。
【０００４】
従って、ＳＩＭＤ演算を実行する演算処理ユニットを備えた演算処理装置は、音声処理や画像処理（特に３次元グラフィックス）等のように、大量のデータに対して同様の演算を繰り返し適用する必要がある処理に対して優れた処理能力が得られる。
【０００５】
ところで、３次元グラフィクスにて、大量に処理されるベクトルデータや座標データ（以下総称して単に「ベクトルデータ」という。）は、通常、ｘ，ｙ，ｚの３成分からなる。これらベクトルデータの各成分に対してＳＩＭＤ演算を実行する演算処理装置は、各成分がｎビット幅にて表される場合、２ｎ（つまりｍ＝２）又は４ｎ（ｍ＝４）ビット幅のデータバスやレジスタを備えることになる。
【０００６】
そして、メモリに格納されたベクトルデータをデータバスを介してレジスタに読み込む時には、異なるベクトルデータの成分が同時に転送されることのないように、メモリ上では、通常、図４に示すように、ｘ成分，ｙ成分，ｚ成分に続けてダミー成分（以下「ｗ成分」ともいう。）が挿入されている。なお、図中（ａ）は、メモリアクセスの最小単位がｎビットのメモリの場合、（ｂ）は、メモリアクセスの最小単位が２ｎビットのメモリの場合を示す。但し、ｎ＝３２である。
【０００７】
このため、ベクトルデータの転送時には、転送を２ｎビット単位で行う場合には２回に１度、４ｎビット単位で行う場合には毎回、ダミー成分が含まれることになり、転送効率が低下するだけでなく、他の成分と共にレジスタに読み込まれたダミー成分もＳＩＭＤ演算の対象となってしまうため、演算処理ユニットでの処理効率も低下してしまうという問題があった。
【０００８】
このようなダミー成分に対する無駄な演算を回避するには、メモリからレジスタに読み込んだデータを、レジスタ上で再配置すればよいが、レジスタ間転送という新たな処理を発生させてしまうことになる。即ち、３次元グラフィクスの処理においてＳＩＭＤ演算を行っても、バスの転送能力や演算処理装置の処理能力を最大限に引き出すことができなかった。
【０００９】
特に、ベクトル外積演算（（１）式参照）の際に必要となる６個の乗算結果をＳＩＭＤ演算を用いて求める場合には、内積演算（（２）式参照）やマトリクス演算を行う場合とは異なり、メモリに記憶されている各成分の並び順と、演算で対応させるべき成分の並び順とが一致しない。このため、メモリからレジスタに各成分を読み込んだ後、レジスタ上で各成分の再配置を行わなければ、ＳＩＭＤ演算を実行することができなかった。
【００１０】
【数１】

【００１１】
ここで、図５は、ＳＩＭＤ演算が可能な演算処理装置におけるレジスタ周辺の構成を表すブロック図である。
なお、演算処理装置１０２は、レイトレーシングによるシェーディングを行う三次元グラフィクス処理を実行するものであり、データバスＤを介して接続されるメモリ（図示せず）には、少なくとも３次元のベクトルデータが格納されている。
【００１２】
また、メモリに格納されるベクトルデータの各成分は、ｎビットにて表され、このｎビットを１ワードとも呼ぶ。そして、データバスＤは２ワード分（２ｎビット）のビット幅を有しており、また、メモリは、１ワード単位又は２ワード単位でのアクセスが可能なように構成されているものとする。
【００１３】
図５に示すように、演算処理装置１０２は、２ワード分のビット幅を有するｐ個のレジスタｒ１〜ｒｐからなるレジスタファイル１０４と、データバスＤを介してメモリ等からデータの読込を行う時に、レジスタファイル１０４を構成する各レジスタｒ１〜ｒｐの中のいずれか一つを、レジスタ選択信号Ｃに従って指定するレジスタ選択部１０８と、プログラムを実行するコア部（図示せず）からの指令を受けて、レジスタ選択信号Ｃを生成する制御信号生成部１１０とを備えている。
【００１４】
そして、各レジスタｒ１〜ｒｐの上位ｎビットからなる上位ワードｒih（ｉ＝１〜ｐ）は、内部上位バスＩＤＨを介して、また、下位ｎビットからなる下位ワードｒilは、内部下位バスＩＤＬを介して、ＳＩＭＤ演算を実行可能な演算処理ユニット（図示せず）等に接続されている。つまり、演算処理ユニットは、レジスタｒ１〜ｒｐをそのまま用いた２ワードデータの単一演算（例えばｒ１×ｒ２）だけでなく、上位ワード同士、及び下位ワード同士で行う１ワードデータの並列演算（例えばｒ1h×ｒ2h，ｒ1l×ｒ2l）も実行可能なように構成されている。
【００１５】
なお、演算処理装置１０２は、バスサイジング機能を有しており、１ワード単位でレジスタにデータを読み込んだ時には、レジスタの下位ワードｒilにデータが格納される。
また、メモリには、図４（ａ）に示すように、ｍｍ１番地にベクトルＡのｘ成分Ａｘ、ｍｍ１＋１番地にベクトルＡのｙ成分Ａｙ、ｍｍ１＋２番地にベクトルＡのｚ成分Ａｚ、ｍｍ１＋３番地にベクトルＡのｗ（ダミー）成分Ａｗが格納され、同様に、ｍｍ２番地にベクトルＢのｘ成分Ｂｘ、ｍｍ２＋１番地にベクトルＢのｙ成分Ｂｙ、ｍｍ２＋２番地にベクトルＢのｚ成分Ｂｚ、ｍｍ２＋３番地にベクトルＢのｗ（ダミー）成分Ｂｗが格納されているものとする。
【００１６】
このように構成された演算処理装置１０２では、ＳＩＭＤ演算にてベクトル外積演算を行う場合、図６に示す手順に従って、メモリからレジスタへのデータ（両ベクトルの各成分）の読込、ＳＩＭＤ演算に適したデータ配列となるようにレジスタ上でのデータの再配置を行う。
【００１７】
即ち、まず、ベクトル外積演算の演算対象となる両ベクトルＡ，Ｂの各成分Ａｘ，Ａｙ，Ａｚ，Ｂｘ，Ｂｙ，Ｂｚを、１ワード単位でのメモリアクセスを用いて、レジスタｒ７〜ｒ１２の下位ワードに順次読み込んだ後［ステップＳ１０１〜Ｓ１０６］、これらレジスタｒ７〜ｒ１２に格納された各成分を、ベクトル外積演算で必要な６個の乗算結果Ａｘ・Ｂｙ、Ａｙ・Ｂｘ、Ａｚ・Ｂｘ、Ａｘ・Ｂｚ、Ａｙ・Ｂｚ、Ａｚ・ＢｙがＳＩＭＤ演算にて得られるように２個ずつ組み合わせて、レジスタｒ１〜ｒ６に格納する［ステップＳ１０７〜Ｓ１１２］。
【００１８】
このように、各ステップを動作クロックの１サイクルにて実行可能であるとすると、ＳＩＭＤ演算を開始するまでに、１２サイクルもの期間が費やされることになる。つまり、ベクトル演算を行う場合には、ＳＩＭＤ演算を用いても、処理時間を大幅には短縮することができなかったのである。
【００１９】
これに対して、レジスタに格納された各成分の配列を並び替えるための専用回路（ツイスト及びジップユニット７４／混合回路１５）を設けた装置が知られている（例えば特許文献１，特許文献２。）
【００２０】
【特許文献１】
特開平８−３２８８４９号公報
（例えば、段落「００４８」〜「００７４」，図９〜図１３）
【００２１】
【特許文献２】
特開平９−１６３９７号公報
（例えば、段落「００１８」〜「００２０」，図２）
【００２２】
【発明が解決しようとする課題】
しかし、これらの専用回路では、レジスタ上でのデータの再配置を高速に行うことが可能となるが、多くのマルチプレクサが用いられており、回路構成が複雑化し且つ大規模化してしまうという問題があった。
【００２３】
本発明は、上記問題点を解決するために、データバスを介して読み込まれるデータの並び順とレジスタに設定すべきデータの並び順とが異なる場合でも、演算処理装置の処理能力を十分に引き出せるようにすることを目的とする。
【００２４】
【課題を解決するための手段】
上記目的を達成するための発明である請求項１記載の演算処理装置では、ｍ×ｎ（ｍ，ｎはいずれも正整数）ビット幅のデータバスを介してデータが入出力される複数のレジスタからなるレジスタファイルを備えている。そして、このレジスタファイルを構成する各レジスタを、それぞれｎビット幅のｍ個の部分レジスタに分割し、各レジスタの同一桁の部分レジスタを集めたものを、部分レジスタファイルと呼び、データバスをｎビット幅に分割したものを部分データバスと呼ぶ。
【００２５】
なお、同一桁の部分レジスタとは、例えば各レジスタのビット幅が３ｎビットであり、これをｎビット幅の３個の部分レジスタに分割した場合には、上位ｎビットの部分レジスタ同士、中位ｎビットの部分レジスタ同士、下位ｎビットの部分レジスタ同士のことを言う。
【００２６】
そして、部分レジスタ選択手段が、これらｍ個の部分レジスタファイル毎に、部分レジスタを一つずつ選択すると共に、部分データバス選択手段が、ｍ個の部分データバスの中から、部分レジスタファイル毎に、部分レジスタ選択手段が選択した部分レジスタに接続すべき部分データバスを選択することにより、データバスを介して供給されるデータの配列を、部分データバス単位で組み替えてレジスタに格納する。
【００２７】
このように、本発明の演算装置によれば、データバスを介して各レジスタにデータを読み込む際に、データの配列を部分データバス単位で組み替えることができるため、データバスに接続されたメモリなどから供給されるデータの並び順と、実際の処理のためにレジスタに格納すべきデータの並び順とが異なっていても、レジスタに読み込んでから再配列のための処理を別途行う必要がなく、処理時間を短縮することができる。
【００２８】
しかも、このようなデータの並び順の組み替えを、部分レジスタ選択手段及び部分データバス選択手段にて２段階の選択を行うという簡易な構成にて実現しているため、装置構成を小型化できる。
【００２９】
また、請求項１記載の演算処理装置では、レジスタファイルを構成する各レジスタは二つの部分レジスタ（ｍ＝２）からなり、部分レジスタ選択手段及び部分データバス選択手段は、３成分からなる二つのベクトルの外積演算が、レジスタを構成する各部分レジスタに対する演算を単一の演算命令で並列に処理するＳＩＭＤ演算にて実行可能となるように、前記二つのベクトルの各成分を、６個のレジスタの各部分レジスタに格納する。
【００３０】
つまり、ベクトル外積演算では、二つのベクトルの各成分が、乗算の際に互いに異なる成分と組み合わされるため（（１）式参照）、この乗算をＳＩＭＤ演算にて実行可能とするためには、データ配列の組み替えが必須となる。この組み替えを、部分データバス選択手段と部分レジスタ選択手段とを用いて行うことにより、ベクトル外積演算を効率よく行うことができる。
【００３１】
なお、６個のレジスタの各部分レジスタに格納される二つのベクトルの各成分は、第１ベクトルの第１の成分と第２ベクトルの第１の成分、第１ベクトルの第２の成分と第２ベクトルの第２の成分、第１ベクトルの第３の成分と第２ベクトルの第３の成分はそれぞれ同一の属性を有するものとした場合には、請求項２記載のように、第１ベクトルの各成分は、第１の成分が第１の上位部分レジスタと第２の下位部分レジスタに、第２の成分が第１の下位部分レジスタと第３の上位部分レジスタに、第３の成分が第２の上位部分レジスタと第３の下位部分レジスタに格納され、第２ベクトルの各成分は、第１の成分が第４の下位部分レジスタと第５の上位部分レジスタに、第２の成分が第４の上位部分レジスタと第６の下位部分レジスタに、第３の成分が第５の下位部分レジスタと第６の上位部分レジスタに格納されるようにすればよい。
【００３２】
また、この場合、二つのベクトルにおいて同一の属性を有する成分は、立体空間における座標値を示すｘ成分，ｙ成分，ｚ成分からなるものとした場合には、請求項３記載のように、前記第１の成分をｘ成分かつ前記第２の成分をｙ成分かつ前記第３の成分をｚ成分、又は前記第１の成分をｚ成分かつ前記第２の成分をｙ成分かつ前記第３の成分をｘ成分に対応させればよい。
【００３３】
そして、本発明は、請求項４記載のように、レイトレーシングやＺバッファ等のシェーディングによる画像処理を実行する演算処理装置に適用した場合に、大きな効果を得ることができる。
即ち、これらの処理では、光線を反射する面の法線ベクトルを求める必要があり、この法線ベクトルの算出にベクトル外積演算が使用され、ベクトル外積演算を実行する比率が高いため、処理速度を大幅に向上させることができる。
【００３４】
【発明の実施の形態】
以下に本発明の実施形態を図面と共に説明する。
図１は、実施形態の演算処理装置の主要部を示すブロック図であり、ここでは、従来装置とは構成の異なるレジスタ周辺の構成のみを示す。なお、本実施形態の演算処理装置は、従来装置と同様に、レイトレーシングによるシェーディングを行う三次元グラフィクス処理を実行するためのものであり、データバスＤを介して接続されるメモリ（図示せず）には、少なくとも３次元のベクトルデータや座標データ（以下、総称して「ベクトルデータ」と呼ぶ。）が格納されている。
【００３５】
また、メモリに格納されるベクトルデータの各成分は、ｎ（本実施形態ではｎ＝３２）ビットにて表され、このｎビットを１ワードとも呼ぶ。そして、データバスは２ワード分（２ｎビット）のビット幅を有しており、また、メモリは、１ワード単位又は２ワード単位でのアクセスが可能なように構成されている。
【００３６】
図１に示すように、本実施形態の演算処理装置２は、２ワード分のビット幅を有する８個のレジスタｒ１〜ｒ８からなるレジスタファイル４を備えている。このレジスタファイル４を構成する各レジスタｒｉ（ｉ＝１〜８）は、それぞれが１ワード分のデータ幅を有する一対の部分レジスタｒih，ｒilからなり、上位の部分レジスタr1h〜r8hを総称して上位部分レジスタファイル４ａ、下位の部分レジスタr1l〜r8lを総称して下位部分レジスタファイル４ｂとも呼ぶ。
【００３７】
また本実施形態の演算処理装置２は、２ｎビット幅を有するデータバスＤを、上位ｎビットからなる上位部分データバスＤＨと下位ｎビットからなる下位部分データバスＤＬとに分け、２ビットのバス選択信号Ｓ（Ｓ０，Ｓ１）に従って、部分レジスタファイル４ａ，４ｂのそれぞれに、両部分データバスＤＨ，ＤＬのいずれかを接続する部分データバス選択手段としてのバス選択部６と、バス選択部６により上位部分レジスタファイル４ａに接続された部分データバスを介して供給されるデータの格納先を、３ビットのレジスタ選択信号ＣＨ（ＣＨ０〜ＣＨ２）に従って、部分レジスタｒ1h〜ｒ8hの中から選択すると共に、バス選択部６により下位部分レジスタファイル４ｂに接続された部分データバスを介して供給されるデータの格納先を、３ビットのレジスタ選択信号ＣＬ（ＣＬ０〜ＣＬ２）に従って、部分レジスタｒ1l〜ｒ8lの中から選択する部分レジスタ選択手段としてのレジスタ選択部８と、プログラムを実行する図示しないコア部からの指令に従って、バス選択信号Ｓ、レジスタ選択信号ＣＨ，ＣＬを生成する制御信号生成部１０とを備えている。
【００３８】
このうち、バス選択部６は、バス選択信号Ｓ０に従って上位部分レジスタファイル４ａに接続する部分データバスを選択する第１セレクタと、バス選択信号Ｓ１に従って下位部分レジスタファイル４ｂに接続する部分データバスを選択する第２セレクタとからなる。
【００３９】
具体的には、図２（ａ）に示すように、第１セレクタは、Ｓ０＝０の時には上位部分データバスＤＨ、Ｓ０＝１の時には下位部分データバスＤＬを選択し、第２セレクタは、Ｓ１＝０の時には下位部分データバスＤＬ、Ｓ１＝１の時には上位部分データバスＤＨを選択するように構成されている。
【００４０】
また、レジスタ選択部８は、レジスタ選択信号ＣＨに従って、部分レジスタｒ1h〜ｒ8hのいずれかを選択する第１マルチプレクサと、レジスタ選択信号ＣＬに従って、部分レジスタｒ1l〜ｒ8lのいずれかを選択する第２マルチプレクサとからなる。
【００４１】
具体的には、第１マルチプレクサは、図２（ｂ）に示すように、ＣＨ０〜ＣＨ２のビットパターンを２進数としてみた数値ｋ（ｋ＝０〜７）に従い、ｉ＝ｋ＋１として、レジスタｒihを選択するように構成されている。これと同様に、第２マルチプレクサは、図２（ｃ）に示すように、ＣＬ０〜ＣＬ２のビットパターンを２進数としてみた数値ｋ（ｋ＝０〜７）に従い、ｉ＝ｋ＋１として、レジスタｒilを選択するように構成されている。
【００４２】
そして、上位部分レジスタファイル４ａを構成する各部分レジスタｒihは内部上位バスＩＤＨを介して、また、下位部分レジスタファイル４ｂを構成する各部分レジスタｒilは内部下位バスＩＤＬを介して、ＳＩＭＤ演算が可能な演算処理ユニット（図示せず）等に接続されている。つまり、演算処理ユニットは、レジスタｒ１〜ｒｐをそのまま用いた２ワードデータの単一演算だけでなく、上位部分レジスタファイル４ａを構成する各部分レジスタ（ｒ1h〜ｒ8h）同士、及び下位部分レジスタファイル４ｂを構成する各部分レジスタ（ｒ1l〜ｒ8l）同士で行う１ワードデータの並列演算も実行可能なように構成されている。
【００４３】
なお、メモリには、図４（ａ）に示すように、ｍｍ１番地にベクトルＡのｘ成分Ａｘ、ｍｍ１＋１番地にベクトルＡのｙ成分Ａｙ、続くｍｍ１＋２番地にベクトルＡのｚ成分Ａｚ、ｍｍ１＋３番地にベクトルＡのｗ（ダミー）成分Ａｗが格納され、同様に、ｍｍ２番地にベクトルＢのｘ成分Ｂｘ、ｍｍ２＋１番地にベクトルＢのｙ成分Ｂｙ、ｍｍ２＋２番地にベクトルＢのｚ成分Ｂｚ、ｍｍ２＋３番地にベクトルＢのｗ（ダミー）成分Ｂｗが格納されているものとする。
【００４４】
このように構成された演算処理装置２では、ベクトルＡ（Ａｘ，Ａｙ，Ａｚ）とベクトルＢ（Ｂｘ，Ｂｙ，Ｂｚ）の外積演算をＳＩＭＤ演算により実行する際に、以下に示すステップＳ１〜Ｓ６に従って、メモリアクセスと制御信号生成部１０でのバス選択信号Ｓ及びレジスタ選択信号ＣＨ，ＣＬの生成とを行い、各演算要素（ベクトル成分）を、ＳＩＭＤ演算の実行に適した並び順にしてレジスタｒ１〜ｒ６に格納する。
【００４５】
なお、各ステップは、いずれも動作クロックの１サイクルにて実行される。また、以下では、Ｓ＝［Ｓ０，Ｓ１］、ＣＨ＝［ＣＨ０，ＣＨ１，ＣＨ２］、ＣＬ＝［ＣＬ０，ＣＬ１，ＣＬ２］とする。
［ステップＳ１］
・メモリアクセス：アドレスｍｍ１（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［０，０，０］，ＣＬ＝［０，０，１］
これにより、アドレスｍｍ１，ｍｍ１＋１に格納された２ワードデータのうち、バス選択信号Ｓにて選択された上位ワード（アドレスｍｍ１）のデータ、即ちベクトルＡのｘ成分Ａｘが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ1h，ｒ2lに、それぞれ格納される（レジスタ格納値：ｒ1h←Ａｘ，ｒ2l←Ａｘ）。
［ステップＳ２］
・メモリアクセス：アドレスｍｍ１（２ワードアクセス）
・バス選択信号：Ｓ＝［１，０］
・レジスタ選択信号：ＣＨ＝［０，１，０］，ＣＬ＝［０，０，０］
これにより、アドレスｍｍ１，ｍｍ１＋１に格納された２ワードデータのうち、バス選択信号Ｓにて選択された下位ワード（アドレスｍｍ１＋１）のデータ、即ちベクトルＡのｙ成分Ａｙが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ3h，ｒ1lに、それぞれ格納される（レジスタ格納値：ｒ3h←Ａｙ，ｒ1l←Ａｙ）。
［ステップＳ３］
・メモリアクセス：アドレスｍｍ１＋２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［０，０，１］，ＣＬ＝［０，１，０］
これにより、アドレスｍｍ１＋２，ｍｍ１＋３に格納された２ワードデータのうち、バス選択信号Ｓにて選択された上位ワード（アドレスｍｍ１＋２）のデータ、即ちベクトルＡのｚ成分Ａｚが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ2h，ｒ3lに、それぞれ格納される（レジスタ格納値：ｒ2h←Ａｚ，ｒ3l←Ａｚ）。
［ステップＳ４］
・メモリアクセス：アドレスｍｍ２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［１，０，０］，ＣＬ＝［０，１，１］
これにより、アドレスｍｍ２，ｍｍ２＋１に格納された２ワードデータのうち、バス選択信号Ｓにて選択された上位ワード（アドレスｍｍ２）のデータ、即ちベクトルＢのｘ成分Ｂｘが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ5h，ｒ4lに、それぞれ格納される（レジスタ格納値：ｒ5h←Ｂｘ，ｒ4l←Ｂｘ）。
［ステップＳ５］
・メモリアクセス：アドレスｍｍ２（２ワードアクセス）
・バス選択信号：Ｓ＝［１，０］
・レジスタ選択信号：ＣＨ＝［０，１，１］，ＣＬ＝［１，０，１］
これにより、アドレスｍｍ２，ｍｍ２＋１に格納された２ワードデータのうち、バス選択信号Ｓにて選択された下位ワード（アドレスｍｍ２＋１）のデータ、即ちベクトルＢのｙ成分Ｂｙが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ4h，ｒ6lに、それぞれ格納される（レジスタ格納値：ｒ4h←Ｂｙ，ｒ6l←Ｂｙ）。
［ステップＳ６］
・メモリアクセス：アドレスｍｍ２＋２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［１，０，１］，ＣＬ＝［１，０，０］
これにより、アドレスｍｍ２＋２，ｍｍ２＋３に格納された２ワードデータのうち、バス選択信号Ｓにて選択された上位ワード（アドレスｍｍ２＋２）のデータ、即ちベクトルＢのｚ成分Ｂｚが、レジスタ選択信号ＣＨ，ＣＬにて選択されたｒ6h，ｒ5lに、それぞれ格納される（レジスタ格納値：ｒ6h←Ｂｚ，ｒ5l←Ｂｚ）。
【００４６】
このようなステップＳ１〜Ｓ６により、各レジスタｒ１〜ｒ６に格納されたデータを演算要素として、レジスタｒ１，ｒ４、レジスタｒ２，ｒ５、レジスタｒ３，ｒ６の間でＳＩＭＤ演算を実行して、上位部分レジスタの値同士、下位部分レジスタの値同士で乗算を行うことにより、ベクトル外積を求める際に必要な６つの乗算値、Ａｘ・Ｂｙ、Ａｙ・Ｂｘ、Ａｚ・Ｂｘ、Ａｘ・Ｂｚ、Ａｙ・Ｂｚ、Ａｚ・Ｂｙを得る。
【００４７】
以上説明したように、本実施形態の演算処理装置２では、データバスＤを介してレジスタファイル４を構成するレジスタｒ１〜ｒ８にデータを読み込む際に、データの並び順を１ワード（部分レジスタ／部分データバスＤＨ，ＤＬ）毎に、任意に組み替えることができる。このため、データバスＤを介してメモリから供給されるデータの並び順と、実際の処理のためにレジスタｒ１〜ｒ８に格納すべきデータの並び順とが異なっていても、レジスタｒ１〜ｒ８に読み込んでから再配列のための処理を別途行う必要がなく、処理時間を短縮することができる。
【００４８】
しかも、このようなデータの並び順の組み替えを、セレクタからなるバス選択部、及びマルチプレクサからなるレジスタ選択部により簡易な構成にて実現しているため、装置構成を小型化できる。
なお、本実施形態では、ベクトル外積演算用のデータを各レジスタに設定するステップＳ１〜Ｓ６において、上位部分データバスＤＨ又は下位部分データバスＤＬのいずれかを選択し、同一ステップにて両部分レジスタファイル４ａ，４ｂに同じ部分データバスが接続されるように制御しているが、以下に示すステップＳ１１〜Ｓ１６（但し、ステップＳ１３，Ｓ１６は、それぞれステップＳ３，Ｓ６と同じ）に示すように、同一ステップにて両部分レジスタファイル４ａ，４ｂに、互いに異なる部分データバスが接続されるように制御してもよい。
［ステップＳ１１］
・メモリアクセス：アドレスｍｍ１（２ワードアクセス）
・バス選択信号：Ｓ＝［０，０］
・レジスタ選択信号：ＣＨ＝［０，０，０］，ＣＬ＝［０，０，０］
（レジスタ格納値：ｒ1h←Ａｘ，ｒ1l←Ａｙ）
［ステップＳ１２］
・メモリアクセス：アドレスｍｍ１（２ワードアクセス）
・バス選択信号：Ｓ＝［１，１］
・レジスタ選択信号：ＣＨ＝［０，１，０］，ＣＬ＝［０，０，１］
（レジスタ格納値：ｒ3h←Ａｙ，ｒ2l←Ａｘ）
［ステップＳ１３］
・メモリアクセス：アドレスｍｍ１＋２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［０，０，１］，ＣＬ＝［０，１，０］
（レジスタ格納値：ｒ2h←Ａｚ，ｒ3l←Ａｚ）
［ステップＳ１４］
・メモリアクセス：アドレスｍｍ２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，０］
・レジスタ選択信号：ＣＨ＝［１，０，０］，ＣＬ＝［１，０，１］
（レジスタ格納値：ｒ5h←Ｂｘ，ｒ6l←Ｂｙ）
［ステップＳ１５］
・メモリアクセス：アドレスｍｍ２（２ワードアクセス）
・バス選択信号：Ｓ＝［１，１］
・レジスタ選択信号：ＣＨ＝［０，１，１］，ＣＬ＝［０，１，１］
（レジスタ格納値：ｒ4h←Ｂｙ，ｒ4l←Ｂｘ）
［ステップＳ１６］
・メモリアクセス：アドレスｍｍ２＋２（２ワードアクセス）
・バス選択信号：Ｓ＝［０，１］
・レジスタ選択信号：ＣＨ＝［１，０，１］，ＣＬ＝［１，０，０］
（レジスタ格納値：ｒ6h←Ｂｚ，ｒ5l←Ｂｚ）
このステップＳ１１〜Ｓ１６に従って制御した場合、各部分レジスタにデータが格納される順番が異なるだけで、上記実施形態と同じ処理量、且つ同じ並び順で、レジスタｒ１〜ｒ６に演算要素（ベクトルＡ，Ｂの各成分）を設定することができる。
【００４９】
また、上記実施形態では、１ワード単位でアドレスを付与したメモリを用いることにより、メモリアクセスの最小単位が１ワードとされているが、図４（ｂ）に示すように、２ワード単位でアドレスを付与したメモリを用いることにより、メモリアクセスの最小単位が２ワードとされていてもよい。
【００５０】
また、上記実施形態では、ｍ×ｎビットのデータバス、レジスタファイルにおいて、ｍ＝２の場合を用いて説明してきたが、ｍは２に限定されることはなく、２^k （ｋは正整数）の構成とすることができ、ｋを大きくすることにより、データの並び順の組み換え時間の高速化、演算回路の小型化などの効果をより向上させることができる。
【図面の簡単な説明】
【図１】実施形態の演算処理装置におけるレジスタ周辺の構成を示すブロック図である。
【図２】バス選択部を構成するセレクタ、及びレジスタ選択部を構成するマルチプレクサの動作を説明するための表である。
【図３】ベクトル外積演算をＳＩＭＤ演算にて行う準備として、レジスタに演算要素を設定する際の手順を示す説明図である。
【図４】メモリへのベクトルデータの格納状態を示す説明図である。
【図５】従来装置におけるレジスタ周辺の構成を示すブロック図である。
【図６】従来装置にてベクトル外積演算をＳＩＭＤ演算にて行う準備として、レジスタに演算要素を設定する際の手順を示す説明図である。
【符号の説明】
２…演算処理装置、４…レジスタファイル、４ａ…上位部分レジスタファイル、４ｂ…下位部分レジスタファイル、６…バス選択部、８…レジスタ選択部、１０…制御信号生成部、Ｄ…データバス、ＩＤＬ…内部下位バス、ＩＤＨ…内部上位バス。
[0001]
[Technical field of the invention]
The present invention relates to an arithmetic processing device, and more particularly to an arithmetic processing device that performs vector cross product operations using SIMD operations.
[0002]
[Prior art]
Conventionally, arithmetic processing devices have used various techniques to speed up processing. One known technique is the SIMD (Single Instruction, Multiple Data) operation, in which a single operation instruction drives multiple arithmetic units on multiple data items so that the operations are processed in parallel.
[0003]
That is, with an ordinary instruction, an arithmetic processing unit that executes various operations on data stored in registers treats each register as holding a single data item. In a SIMD operation, by contrast, a register of m × n bits (m and n both positive integers) is treated as being packed with m data items of n bits each, and these m data items are processed in parallel.
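As a rough illustration of the packed-register view described above, the following Python sketch treats a 64-bit integer as a register holding m = 2 words of n = 32 bits and multiplies the two lanes independently. The helper names (`pack`, `unpack`, `simd_mul`) and the truncation of each product to 32 bits are assumptions made for this example; they are not taken from the description.

```python
MASK32 = 0xFFFFFFFF  # n = 32 bits per lane (assumed word size for this sketch)


def pack(hi: int, lo: int) -> int:
    """Pack two 32-bit words into one 64-bit register value (m = 2)."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def unpack(reg: int) -> tuple[int, int]:
    """Split a 64-bit register value into its (upper, lower) 32-bit words."""
    return (reg >> 32) & MASK32, reg & MASK32


def simd_mul(ra: int, rb: int) -> int:
    """Multiply upper with upper and lower with lower in one call.

    Each product is truncated to 32 bits (an assumption of this sketch).
    """
    ah, al = unpack(ra)
    bh, bl = unpack(rb)
    return pack(ah * bh, al * bl)


print(unpack(simd_mul(pack(3, 5), pack(7, 11))))  # (21, 55)
```

A single call to `simd_mul` stands in for one SIMD instruction: both lane products come out of the same operation.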
[0004]
Therefore, an arithmetic processing device equipped with an arithmetic processing unit that executes SIMD operations delivers excellent performance for workloads that must repeatedly apply the same operation to large amounts of data, such as audio processing and image processing (three-dimensional graphics in particular).
[0005]
Incidentally, the vector data and coordinate data (hereinafter collectively referred to simply as “vector data”) that are processed in large quantities in three-dimensional graphics usually consist of three components: x, y, and z. When each component is represented with an n-bit width, an arithmetic processing device that performs SIMD operations on the components of such vector data is provided with a data bus and registers that are 2n bits (that is, m = 2) or 4n bits (m = 4) wide.
[0006]
When vector data stored in memory is read into a register via the data bus, components of different vector data should not be transferred together. For this reason, a dummy component (hereinafter also called the “w component”) is usually inserted in memory after the x, y, and z components, as shown in FIG. 4. In the figure, (a) shows a memory whose minimum access unit is n bits, and (b) shows a memory whose minimum access unit is 2n bits. Here, n = 32.
[0007]
As a result, when vector data is transferred, a dummy component is included in every second transfer when transfers are made in 2n-bit units, and in every transfer when they are made in 4n-bit units. This not only lowers the transfer efficiency. Because the dummy component read into a register along with the other components also becomes an operand of the SIMD operation, the processing efficiency of the arithmetic processing unit falls as well.
[0008]
Such wasted operations on the dummy component can be avoided by rearranging, within the registers, the data read from memory, but this introduces a new processing step, namely register-to-register transfers. In other words, even when SIMD operations were used for three-dimensional graphics processing, the transfer capacity of the bus and the processing capacity of the arithmetic processing device could not be fully exploited.
[0009]
In particular, when the six multiplication results needed for a vector cross product (see equation (1)) are obtained with SIMD operations, the order in which the components are stored in memory does not match the order in which they must be paired in the operation, unlike the case of an inner product (see equation (2)) or a matrix operation. Consequently, the SIMD operation could not be executed unless the components were rearranged within the registers after being read from memory.
[0010]
[Expression 1]
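The equation image is not reproduced in this text. For reference, the standard definitions of the cross product and the inner product of A = (Ax, Ay, Az) and B = (Bx, By, Bz) are given below, written so that the six products named in this description appear explicitly. This is a reconstruction from the surrounding text; the layout of the original equations (1) and (2) may differ.

```latex
\begin{align}
A \times B &= \left( A_y B_z - A_z B_y,\; A_z B_x - A_x B_z,\; A_x B_y - A_y B_x \right) \tag{1} \\
A \cdot B &= A_x B_x + A_y B_y + A_z B_z \tag{2}
\end{align}
```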

[0011]
Here, FIG. 5 is a block diagram showing the configuration around the registers in an arithmetic processing device capable of SIMD operations.
The arithmetic processing device 102 executes three-dimensional graphics processing that performs shading by ray tracing, and at least three-dimensional vector data is stored in a memory (not shown) connected via the data bus D.
[0012]
Each component of the vector data stored in the memory is represented by n bits, and these n bits are also called one word. The data bus D is two words (2n bits) wide, and the memory is configured so that it can be accessed in units of one word or two words.
[0013]
As shown in FIG. 5, the arithmetic processing device 102 comprises: a register file 104 consisting of p registers r1 to rp, each two words wide; a register selection unit 108 that, when data is read from the memory or the like via the data bus D, designates one of the registers r1 to rp constituting the register file 104 in accordance with a register selection signal C; and a control signal generation unit 110 that generates the register selection signal C in response to commands from a core unit (not shown) that executes programs.
[0014]
The upper word rih (i = 1 to p), consisting of the upper n bits of each of the registers r1 to rp, is connected via an internal upper bus IDH, and the lower word ril, consisting of the lower n bits, via an internal lower bus IDL, to an arithmetic processing unit (not shown) and the like capable of executing SIMD operations. That is, the arithmetic processing unit is configured to execute not only a single operation on two-word data that uses the registers r1 to rp as they are (for example, r1 × r2), but also parallel operations on one-word data performed between upper words and between lower words (for example, r1h × r2h and r1l × r2l).
[0015]
Note that the arithmetic processing device 102 has a bus sizing function: when data is read into a register in units of one word, the data is stored in the lower word ril of the register.
As shown in FIG. 4(a), the memory is assumed to hold the x component Ax of vector A at address mm1, the y component Ay of vector A at address mm1 + 1, the z component Az of vector A at address mm1 + 2, and the w (dummy) component Aw of vector A at address mm1 + 3. Likewise, it holds the x component Bx of vector B at address mm2, the y component By of vector B at address mm2 + 1, the z component Bz of vector B at address mm2 + 2, and the w (dummy) component Bw of vector B at address mm2 + 3.
[0016]
When the arithmetic processing device 102 configured in this way performs a vector cross product using SIMD operations, it follows the procedure shown in FIG. 6: it reads the data (the components of both vectors) from the memory into registers, and then rearranges the data within the registers into a layout suited to the SIMD operation.
[0017]
That is, first, the components Ax, Ay, Az, Bx, By, and Bz of the two vectors A and B that are the operands of the cross product are read one after another into the lower words of the registers r7 to r12 using one-word memory accesses [steps S101 to S106]. The components held in the registers r7 to r12 are then combined in pairs and stored in the registers r1 to r6 so that the six multiplication results required for the cross product, Ax·By, Ay·Bx, Az·Bx, Ax·Bz, Ay·Bz, and Az·By, can be obtained by SIMD operations [steps S107 to S112].
[0018]
Thus, assuming that each step can be executed in one cycle of the operating clock, as many as 12 cycles are spent before the SIMD operation can begin. In other words, when vector operations were performed, the processing time could not be shortened significantly even with SIMD operations.
[0019]
In response, devices provided with a dedicated circuit for rearranging the components stored in registers (a twist and zip unit 74, or a mixing circuit 15) are known (see, for example, Patent Document 1 and Patent Document 2).
[0020]
[Patent Document 1]
JP-A-8-328849
(For example, paragraphs “0048” to “0074”, FIGS. 9 to 13)
[0021]
[Patent Document 2]
JP-A-9-16397
(For example, paragraphs “0018” to “0020”, FIG. 2)
[0022]
[Problems to be solved by the invention]
These dedicated circuits can rearrange data within the registers at high speed, but they use many multiplexers, so the circuit configuration becomes complicated and large.
[0023]
To solve the above problem, an object of the present invention is to make it possible to fully exploit the processing capability of an arithmetic processing device even when the order of the data read via the data bus differs from the order in which the data must be set in the registers.
[0024]
[Means for Solving the Problems]
The arithmetic processing device according to claim 1, an invention made to achieve the above object, comprises a register file consisting of a plurality of registers to and from which data is input and output via a data bus that is m × n bits wide (m and n both positive integers). Each register constituting the register file is divided into m partial registers of n bits each. The set of partial registers occupying the same digit position in every register is called a partial register file, and each n-bit-wide slice of the data bus is called a partial data bus.
[0025]
Partial registers of the same digit position are to be understood as follows. When, for example, each register is 3n bits wide and is divided into three n-bit partial registers, the upper n-bit partial registers form one group, the middle n-bit partial registers form another, and the lower n-bit partial registers form a third.
[0026]
The partial register selection means selects one partial register from each of the m partial register files, and the partial data bus selection means selects, for each partial register file, which of the m partial data buses is to be connected to the partial register chosen by the partial register selection means. In this way, the arrangement of the data supplied via the data bus is rearranged in units of partial data buses as it is stored in the registers.
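The two-stage selection described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the function name `load`, the list-based register file, and the m = 3 example values are hypothetical, and digit 0 is taken to be the upper partial register.

```python
from typing import Sequence


def load(regs: list[list[object]], bus_words: Sequence[object],
         bus_sel: Sequence[int], reg_sel: Sequence[int]) -> None:
    """Store one bus transfer into the register file.

    regs[i][j]   : partial register j of register i (j = 0 is the upper digit)
    bus_words[b] : word currently on partial data bus b
    bus_sel[j]   : partial data bus connected to partial register file j
    reg_sel[j]   : index of the register written in partial register file j
    """
    for j in range(len(bus_words)):
        regs[reg_sel[j]][j] = bus_words[bus_sel[j]]


# Hypothetical example with m = 3 and four registers.
regs = [[None] * 3 for _ in range(4)]
load(regs, ["x", "y", "z"], bus_sel=[2, 0, 1], reg_sel=[0, 3, 1])
print(regs)
```

Each partial register file receives exactly one word per transfer, so the reordering happens during the load itself rather than in a later register-to-register step.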
[0027]
Thus, according to the arithmetic processing device of the present invention, the data arrangement can be rearranged in units of partial data buses while data is being read into the registers via the data bus. Even if the order of the data supplied from a memory or the like connected to the data bus differs from the order in which the data must be stored in the registers for actual processing, no separate rearrangement step is needed after the data has been read into the registers, and the processing time can be shortened.
[0028]
Moreover, because this rearrangement of the data order is achieved with a simple configuration that performs a two-stage selection using the partial register selection means and the partial data bus selection means, the device can be made smaller.
[0029]
Further, in the arithmetic processing device according to claim 1, each register constituting the register file consists of two partial registers (m = 2). The partial register selection means and the partial data bus selection means store the components of two three-component vectors in the partial registers of six registers so that the cross product of the two vectors can be executed by SIMD operations, in which the operations on the partial registers constituting a register are processed in parallel by a single operation instruction.
[0030]
That is, in a vector cross product, each component of the two vectors is paired with a different component at multiplication time (see equation (1)), so the data arrangement must be rearranged before these multiplications can be executed by SIMD operations. Performing this rearrangement with the partial data bus selection means and the partial register selection means allows the vector cross product to be computed efficiently.
[0031]
As for the components of the two vectors stored in the partial registers of the six registers, suppose that the first components of the first and second vectors have the same attribute, as do their second components and their third components. Then, as set forth in claim 2, the components may be stored as follows. For the first vector, the first component is stored in the first upper partial register and the second lower partial register, the second component in the first lower partial register and the third upper partial register, and the third component in the second upper partial register and the third lower partial register. For the second vector, the first component is stored in the fourth lower partial register and the fifth upper partial register, the second component in the fourth upper partial register and the sixth lower partial register, and the third component in the fifth lower partial register and the sixth upper partial register.
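The placement described above can be checked with a small Python sketch. It fills six (upper, lower) register pairs as stated, multiplies registers 1 and 4, 2 and 5, and 3 and 6 lane by lane (the pairing that yields the six products named earlier), and combines the products with the standard cross product formula. The function name and the final subtraction step are assumptions added for illustration; the description itself only specifies where the components are stored and which products are needed.

```python
def cross_from_layout(a, b):
    """Cross product computed from the six-register layout (illustrative sketch)."""
    ax, ay, az = a  # first vector: first, second, third component
    bx, by, bz = b  # second vector: first, second, third component

    # (upper, lower) contents of registers 1 to 6, as stated in the text.
    r1, r2, r3 = (ax, ay), (az, ax), (ay, az)
    r4, r5, r6 = (by, bx), (bx, bz), (bz, by)

    # Lane-wise multiplies: upper with upper, lower with lower.
    p14 = (r1[0] * r4[0], r1[1] * r4[1])  # (Ax*By, Ay*Bx)
    p25 = (r2[0] * r5[0], r2[1] * r5[1])  # (Az*Bx, Ax*Bz)
    p36 = (r3[0] * r6[0], r3[1] * r6[1])  # (Ay*Bz, Az*By)

    # Assumed final step: upper product minus lower product per register pair.
    return (p36[0] - p36[1], p25[0] - p25[1], p14[0] - p14[1])


print(cross_from_layout((1, 2, 3), (4, 5, 6)))  # (-3, 6, -3)
```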
[0032]
In this case, when the components having the same attribute in the two vectors are the x, y, and z components representing coordinate values in three-dimensional space, then, as set forth in claim 3, the first component may correspond to the x component, the second to the y component, and the third to the z component. Alternatively, the first component may correspond to the z component, the second to the y component, and the third to the x component.
[0033]
As set forth in claim 4, the present invention is particularly effective when applied to an arithmetic processing device that executes image processing involving shading, such as ray tracing or Z-buffering.
That is, these processes must obtain the normal vector of the surface that reflects a light ray, and the normal vector is calculated with a vector cross product. Because cross products account for a large share of the operations executed, the processing speed can be greatly improved.
[0034]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a block diagram showing the main part of the arithmetic processing device of the embodiment. Only the configuration around the registers, which differs from that of the conventional device, is shown. As with the conventional device, the arithmetic processing device of this embodiment executes three-dimensional graphics processing that performs shading by ray tracing, and a memory (not shown) connected via the data bus D stores at least three-dimensional vector data and coordinate data (hereinafter collectively called “vector data”).
[0035]
Each component of the vector data stored in the memory is represented by n bits (n = 32 in this embodiment), and these n bits are also called one word. The data bus is two words (2n bits) wide, and the memory is configured so that it can be accessed in units of one word or two words.
[0036]
As shown in FIG. 1, the arithmetic processing device 2 of this embodiment includes a register file 4 consisting of eight registers r1 to r8, each two words wide. Each register ri (i = 1 to 8) constituting the register file 4 consists of a pair of partial registers rih and ril, each one word wide. The upper partial registers r1h to r8h are collectively called the upper partial register file 4a, and the lower partial registers r1l to r8l are collectively called the lower partial register file 4b.
[0037]
The arithmetic processing device 2 of this embodiment divides the 2n-bit-wide data bus D into an upper partial data bus DH consisting of the upper n bits and a lower partial data bus DL consisting of the lower n bits, and comprises: a bus selection unit 6, serving as the partial data bus selection means, that connects one of the two partial data buses DH and DL to each of the partial register files 4a and 4b in accordance with a 2-bit bus selection signal S (S0, S1); a register selection unit 8, serving as the partial register selection means, that selects from among the partial registers r1h to r8h, in accordance with a 3-bit register selection signal CH (CH0 to CH2), the destination of the data supplied via the partial data bus connected to the upper partial register file 4a by the bus selection unit 6, and selects from among the partial registers r1l to r8l, in accordance with a 3-bit register selection signal CL (CL0 to CL2), the destination of the data supplied via the partial data bus connected to the lower partial register file 4b by the bus selection unit 6; and a control signal generation unit 10 that generates the bus selection signal S and the register selection signals CH and CL in accordance with commands from a core unit (not shown) that executes programs.
[0038]
Of these, the bus selection unit 6 consists of a first selector that selects, in accordance with the bus selection signal S0, the partial data bus to be connected to the upper partial register file 4a, and a second selector that selects, in accordance with the bus selection signal S1, the partial data bus to be connected to the lower partial register file 4b.
[0039]
Specifically, as shown in FIG. 2(a), the first selector selects the upper partial data bus DH when S0 = 0 and the lower partial data bus DL when S0 = 1, and the second selector selects the lower partial data bus DL when S1 = 0 and the upper partial data bus DH when S1 = 1.
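The behaviour of the two selectors can be written as a small truth-table function. The following Python sketch is illustrative only; the function name `bus_select` and the use of strings to stand for the words on DH and DL are assumptions of this example.

```python
def bus_select(s0: int, s1: int, dh, dl):
    """Return (word routed to upper file 4a, word routed to lower file 4b)."""
    to_upper = dh if s0 == 0 else dl  # first selector, controlled by S0
    to_lower = dl if s1 == 0 else dh  # second selector, controlled by S1
    return to_upper, to_lower


# Print the full table for all four values of S = [S0, S1].
for s0 in (0, 1):
    for s1 in (0, 1):
        print([s0, s1], bus_select(s0, s1, "DH", "DL"))
```

With S = [0, 0] the words pass straight through, with S = [1, 1] they are swapped, and with S = [0, 1] or S = [1, 0] the same word is delivered to both partial register files.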
[0040]
The register selection unit 8 consists of a first multiplexer that selects one of the partial registers r1h to r8h in accordance with the register selection signal CH, and a second multiplexer that selects one of the partial registers r1l to r8l in accordance with the register selection signal CL.
[0041]
Specifically, as shown in FIG. 2(b), the first multiplexer is configured to select the partial register rih with i = k + 1, where k (k = 0 to 7) is the value obtained by reading the bit pattern of CH0 to CH2 as a binary number. Similarly, as shown in FIG. 2(c), the second multiplexer is configured to select the partial register ril with i = k + 1, where k (k = 0 to 7) is the value obtained by reading the bit pattern of CL0 to CL2 as a binary number.
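The mapping from a 3-bit selection signal to a register number can be sketched as follows. The function name `register_number` is hypothetical, and the bit order (the first listed bit, CH0 or CL0, as the most significant) is an inference from the step listings in this description, where CL = [0, 0, 1] selects r2l; FIG. 2 itself is not reproduced here.

```python
def register_number(bits):
    """Map [b0, b1, b2] to the register number i = k + 1.

    b0 is treated as the most significant bit (an inference, see lead-in).
    """
    k = (bits[0] << 2) | (bits[1] << 1) | bits[2]  # k = 0 to 7
    return k + 1                                    # i = 1 to 8


print(register_number([0, 0, 0]))  # 1
print(register_number([0, 0, 1]))  # 2
print(register_number([1, 0, 1]))  # 6
```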
[0042]
Each partial register rih constituting the upper partial register file 4a is connected via the internal upper bus IDH, and each partial register ril constituting the lower partial register file 4b via the internal lower bus IDL, to an arithmetic processing unit (not shown) and the like capable of SIMD operations. That is, the arithmetic processing unit is configured to execute not only a single operation on two-word data that uses the registers r1 to r8 as they are, but also parallel operations on one-word data performed between the partial registers (r1h to r8h) of the upper partial register file 4a and between the partial registers (r1l to r8l) of the lower partial register file 4b.
[0043]
As shown in FIG. 4(a), the memory is assumed to hold the x component Ax of vector A at address mm1, the y component Ay of vector A at address mm1 + 1, the z component Az of vector A at the following address mm1 + 2, and the w (dummy) component Aw of vector A at address mm1 + 3. Likewise, it holds the x component Bx of vector B at address mm2, the y component By of vector B at address mm2 + 1, the z component Bz of vector B at address mm2 + 2, and the w (dummy) component Bw of vector B at address mm2 + 3.
[0044]
When the arithmetic processing device 2 configured in this way executes the cross product of vector A (Ax, Ay, Az) and vector B (Bx, By, Bz) by SIMD operations, it performs memory accesses and has the control signal generation unit 10 generate the bus selection signal S and the register selection signals CH and CL in accordance with steps S1 to S6 below. The operands (vector components) are thereby stored in the registers r1 to r6 in an order suited to executing the SIMD operations.
[0045]
Each step is executed in one cycle of the operation clock. In the following description, S = [S0, S1], CH = [CH0, CH1, CH2], and CL = [CL0, CL1, CL2].
[Step S1]
- Memory access: Address mm1 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [0, 0, 0], CL = [0, 0, 1]
As a result, of the two words of data stored at addresses mm1 and mm1 + 1, the upper word (address mm1) selected by the bus selection signal S, that is, the x component Ax of the vector A, is stored in r1h and r2l selected by the register selection signals CH and CL (register stored values: r1h ← Ax, r2l ← Ax).
[Step S2]
- Memory access: Address mm1 (2 word access)
- Bus selection signal: S = [1, 0]
- Register selection signal: CH = [0, 1, 0], CL = [0, 0, 0]
As a result, of the two words of data stored at addresses mm1 and mm1 + 1, the lower word (address mm1 + 1) selected by the bus selection signal S, that is, the y component Ay of the vector A, is stored in r3h and r1l selected by the register selection signals CH and CL (register stored values: r3h ← Ay, r1l ← Ay).
[Step S3]
- Memory access: Address mm1 + 2 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [0, 0, 1], CL = [0, 1, 0]
As a result, of the two words of data stored at addresses mm1 + 2 and mm1 + 3, the upper word (address mm1 + 2) selected by the bus selection signal S, that is, the z component Az of the vector A, is stored in r2h and r3l selected by the register selection signals CH and CL (register stored values: r2h ← Az, r3l ← Az).
[Step S4]
- Memory access: Address mm2 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [1, 0, 0], CL = [0, 1, 1]
As a result, of the two words of data stored at addresses mm2 and mm2 + 1, the upper word (address mm2) selected by the bus selection signal S, that is, the x component Bx of the vector B, is stored in r5h and r4l selected by the register selection signals CH and CL (register stored values: r5h ← Bx, r4l ← Bx).
[Step S5]
- Memory access: Address mm2 (2 word access)
- Bus selection signal: S = [1, 0]
- Register selection signal: CH = [0, 1, 1], CL = [1, 0, 1]
As a result, of the two words of data stored at addresses mm2 and mm2 + 1, the lower word (address mm2 + 1) selected by the bus selection signal S, that is, the y component By of the vector B, is stored in r4h and r6l selected by the register selection signals CH and CL (register stored values: r4h ← By, r6l ← By).
[Step S6]
- Memory access: Address mm2 + 2 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [1, 0, 1], CL = [1, 0, 0]
As a result, of the two words of data stored at addresses mm2 + 2 and mm2 + 3, the upper word (address mm2 + 2) selected by the bus selection signal S, that is, the z component Bz of the vector B, is stored in r6h and r5l selected by the register selection signals CH and CL (register stored values: r6h ← Bz, r5l ← Bz).
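Steps S1 to S6 can be traced with the following Python sketch, which is only an illustrative software model of the hardware behavior. The base addresses, the helper names `index` and `load_step`, and the string placeholders for the components are hypothetical. The meaning assigned to the bus selection bits (S0 = 0 connects the upper partial register file to DH and S0 = 1 to DL; S1 = 0 connects the lower partial register file to DL and S1 = 1 to DH) is an assumption inferred from the register stored values listed in the steps.

```python
MM1, MM2 = 0, 4  # hypothetical base addresses
memory = ["Ax", "Ay", "Az", "Aw", "Bx", "By", "Bz", "Bw"]


def index(bits):
    """Register number i = k + 1, first bit taken as the MSB of k (assumed)."""
    b0, b1, b2 = bits
    return ((b0 << 2) | (b1 << 1) | b2) + 1


def load_step(upper, lower, addr, s, ch, cl):
    """One cycle: 2-word read, then route DH/DL into one rih and one ril."""
    dh, dl = memory[addr], memory[addr + 1]
    upper[index(ch)] = dl if s[0] else dh  # S0: 0 -> DH, 1 -> DL (inferred)
    lower[index(cl)] = dh if s[1] else dl  # S1: 0 -> DL, 1 -> DH (inferred)


STEPS_S1_S6 = [
    (MM1,     [0, 1], [0, 0, 0], [0, 0, 1]),  # S1: r1h, r2l <- Ax
    (MM1,     [1, 0], [0, 1, 0], [0, 0, 0]),  # S2: r3h, r1l <- Ay
    (MM1 + 2, [0, 1], [0, 0, 1], [0, 1, 0]),  # S3: r2h, r3l <- Az
    (MM2,     [0, 1], [1, 0, 0], [0, 1, 1]),  # S4: r5h, r4l <- Bx
    (MM2,     [1, 0], [0, 1, 1], [1, 0, 1]),  # S5: r4h, r6l <- By
    (MM2 + 2, [0, 1], [1, 0, 1], [1, 0, 0]),  # S6: r6h, r5l <- Bz
]

upper, lower = {}, {}
for addr, s, ch, cl in STEPS_S1_S6:
    load_step(upper, lower, addr, s, ch, cl)

registers = {i: (upper[i], lower[i]) for i in range(1, 7)}
for i, (hi, lo) in registers.items():
    print(f"r{i}: upper={hi} lower={lo}")
```

In this model the dummy components Aw and Bw are read from memory in steps S3 and S6 but are never routed into a register.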
[0046]
After steps S1 to S6, SIMD operations are performed between the registers r1 and r4, between the registers r2 and r5, and between the registers r3 and r6, using the data stored in the registers r1 to r6 as the operation elements. Multiplying the values of the upper partial registers together and the values of the lower partial registers together yields the six products needed to obtain the cross product: Ax · By, Ay · Bx, Az · Bx, Ax · Bz, Ay · Bz, and Az · By.
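As a rough numerical check, the sketch below models each register as an (upper, lower) pair, performs the three two-lane multiplications, and combines the six products into the cross product. The function names and the sample vectors are hypothetical. The final subtraction uses the standard definition A × B = (AyBz − AzBy, AzBx − AxBz, AxBy − AyBx); how the hardware carries out that subtraction is not described in this passage.

```python
def simd_mul(ra, rb):
    """Two-lane multiply: upper with upper, lower with lower."""
    return (ra[0] * rb[0], ra[1] * rb[1])


def cross_from_registers(r1, r2, r3, r4, r5, r6):
    """Combine the six lane products into the cross product A x B."""
    p14 = simd_mul(r1, r4)  # (Ax*By, Ay*Bx)
    p25 = simd_mul(r2, r5)  # (Az*Bx, Ax*Bz)
    p36 = simd_mul(r3, r6)  # (Ay*Bz, Az*By)
    cx = p36[0] - p36[1]    # Ay*Bz - Az*By
    cy = p25[0] - p25[1]    # Az*Bx - Ax*Bz
    cz = p14[0] - p14[1]    # Ax*By - Ay*Bx
    return (cx, cy, cz)


# Sample vectors (illustrative values only).
Ax, Ay, Az = 1, 2, 3
Bx, By, Bz = 4, 5, 6

# Register contents as left by steps S1 to S6: (upper, lower).
result = cross_from_registers(
    (Ax, Ay), (Az, Ax), (Ay, Az),   # r1, r2, r3
    (By, Bx), (Bx, Bz), (Bz, By),   # r4, r5, r6
)
print(result)  # (-3, 6, -3)
```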
[0047]
As described above, in the arithmetic processing unit 2 of this embodiment, when data is read via the data bus D into the registers r1 to r8 constituting the register file 4, the order of the data can be rearranged arbitrarily in units of one word (that is, in units of a partial register or a partial data bus DH, DL). Therefore, even if the order of the data supplied from the memory via the data bus D differs from the order in which the data must be stored in the registers r1 to r8 for the actual processing, no separate rearrangement is needed after the data has been read into the registers r1 to r8, and the processing time can be shortened.
[0048]
In addition, the rearrangement of the data order is realized with a simple configuration, namely the bus selection unit made up of selectors and the register selection unit made up of multiplexers, so that the device can be reduced in size.
In this embodiment, in each of steps S1 to S6, in which the data for the cross product operation is set in the registers, either the upper partial data bus DH or the lower partial data bus DL is selected, and both partial register files 4a and 4b are connected to the same partial data bus within the same step. However, as in steps S11 to S16 below (steps S13 and S16 are identical to steps S3 and S6, respectively), the control may instead connect different partial data buses to the two partial register files 4a and 4b within the same step.
[Step S11]
- Memory access: Address mm1 (2 word access)
- Bus selection signal: S = [0, 0]
- Register selection signal: CH = [0, 0, 0], CL = [0, 0, 0]
(Register stored values: r1h ← Ax, r1l ← Ay)
[Step S12]
- Memory access: Address mm1 (2 word access)
- Bus selection signal: S = [1, 1]
- Register selection signal: CH = [0, 1, 0], CL = [0, 0, 1]
(Register stored values: r3h ← Ay, r2l ← Ax)
[Step S13]
- Memory access: Address mm1 + 2 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [0, 0, 1], CL = [0, 1, 0]
(Register stored values: r2h ← Az, r3l ← Az)
[Step S14]
- Memory access: Address mm2 (2 word access)
- Bus selection signal: S = [0, 0]
- Register selection signal: CH = [1, 0, 0], CL = [1, 0, 1]
(Register stored values: r5h ← Bx, r6l ← By)
[Step S15]
- Memory access: Address mm2 (2 word access)
- Bus selection signal: S = [1, 1]
- Register selection signal: CH = [0, 1, 1], CL = [0, 1, 1]
(Register stored values: r4h ← By, r4l ← Bx)
[Step S16]
- Memory access: Address mm2 + 2 (2 word access)
- Bus selection signal: S = [0, 1]
- Register selection signal: CH = [1, 0, 1], CL = [1, 0, 0]
(Register stored values: r6h ← Bz, r5l ← Bz)
When the control is performed in accordance with steps S11 to S16, only the order in which the data is stored in the partial registers differs; the operation elements (the components of the vectors A and B) are set in the registers r1 to r6 in exactly the same arrangement as with steps S1 to S6.
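The equivalence of the two step sequences can be checked with the following Python sketch, again only a software model. The base addresses, helper names, and string placeholders are hypothetical, and the decoding of the bus selection bits (S0 = 0 routes DH and S0 = 1 routes DL to the upper partial register file; S1 = 0 routes DL and S1 = 1 routes DH to the lower partial register file) is an assumption inferred from the register stored values listed in the steps.

```python
MM1, MM2 = 0, 4  # hypothetical base addresses
memory = ["Ax", "Ay", "Az", "Aw", "Bx", "By", "Bz", "Bw"]


def index(bits):
    """Register number i = k + 1, first bit taken as the MSB of k (assumed)."""
    b0, b1, b2 = bits
    return ((b0 << 2) | (b1 << 1) | b2) + 1


def run(steps):
    """Apply a sequence of (address, S, CH, CL) steps; return r1..r6 contents."""
    upper, lower = {}, {}
    for addr, s, ch, cl in steps:
        dh, dl = memory[addr], memory[addr + 1]
        upper[index(ch)] = dl if s[0] else dh  # S0: 0 -> DH, 1 -> DL (inferred)
        lower[index(cl)] = dh if s[1] else dl  # S1: 0 -> DL, 1 -> DH (inferred)
    return {i: (upper[i], lower[i]) for i in range(1, 7)}


S1_TO_S6 = [
    (MM1,     [0, 1], [0, 0, 0], [0, 0, 1]),
    (MM1,     [1, 0], [0, 1, 0], [0, 0, 0]),
    (MM1 + 2, [0, 1], [0, 0, 1], [0, 1, 0]),
    (MM2,     [0, 1], [1, 0, 0], [0, 1, 1]),
    (MM2,     [1, 0], [0, 1, 1], [1, 0, 1]),
    (MM2 + 2, [0, 1], [1, 0, 1], [1, 0, 0]),
]

S11_TO_S16 = [
    (MM1,     [0, 0], [0, 0, 0], [0, 0, 0]),  # S11: r1h <- Ax, r1l <- Ay
    (MM1,     [1, 1], [0, 1, 0], [0, 0, 1]),  # S12: r3h <- Ay, r2l <- Ax
    (MM1 + 2, [0, 1], [0, 0, 1], [0, 1, 0]),  # S13: same as S3
    (MM2,     [0, 0], [1, 0, 0], [1, 0, 1]),  # S14: r5h <- Bx, r6l <- By
    (MM2,     [1, 1], [0, 1, 1], [0, 1, 1]),  # S15: r4h <- By, r4l <- Bx
    (MM2 + 2, [0, 1], [1, 0, 1], [1, 0, 0]),  # S16: same as S6
]

same = run(S1_TO_S6) == run(S11_TO_S16)
print("final register contents identical:", same)
```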
[0049]
Further, in the above embodiment, the minimum unit of memory access is set to one word by using a memory to which addresses are assigned in units of one word. However, as shown in FIG. 4(b), the minimum unit of memory access may instead be set to two words by using a memory to which addresses are assigned in units of two words.
[0050]
In the above embodiment, the case where m = 2 in the m × n-bit data bus and register file has been described. However, m is not limited to 2 and may be 2^k (k is a positive integer). Increasing k further enhances effects such as shortening the time needed to rearrange the data order and reducing the size of the arithmetic circuit.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration around a register in an arithmetic processing unit according to an embodiment.
FIG. 2 is a table for explaining operations of a selector configuring a bus selection unit and a multiplexer configuring a register selection unit.
FIG. 3 is an explanatory diagram showing a procedure for setting operation elements in registers as preparation for performing a vector cross product operation by SIMD operation.
FIG. 4 is an explanatory diagram showing a storage state of vector data in a memory.
FIG. 5 is a block diagram showing a configuration around a register in a conventional apparatus.
FIG. 6 is an explanatory diagram showing a procedure for setting operation elements in registers as preparation for performing a vector cross product operation by SIMD operation in a conventional apparatus.
[Explanation of symbols]
2 ... arithmetic processing unit, 4 ... register file, 4a ... upper partial register file, 4b ... lower partial register file, 6 ... bus selection unit, 8 ... register selection unit, 10 ... control signal generation unit, D ... data bus, IDL ... internal lower bus, IDH ... internal upper bus.

Claims

1. An arithmetic processing unit comprising:
a register file composed of a plurality of registers to and from which data is input and output via a data bus having a bit width of m × n (m and n are both positive integers);
partial register selection means for dividing each register constituting the register file into m partial registers each having an n-bit width, and selecting one partial register from each of m partial register files, each partial register file consisting of the set of partial registers at the same digit position of the registers; and
partial data bus selection means for dividing the data bus into m partial data buses each having an n-bit width, and selecting, for each partial register file, the partial data bus to be connected to the partial register selected by the partial register selection means,
the arithmetic processing unit storing the sequence of data supplied via the data bus in the registers after rearranging it in units of the partial data bus, wherein
each register constituting the register file is composed of two partial registers (m = 2), and
the partial register selection means and the partial data bus selection means store the components of two vectors, each consisting of three components, in the partial registers of six registers such that the cross product of the two vectors can be executed as a SIMD operation, in which the operations on the partial registers constituting a register are processed in parallel by a single operation instruction.

2. The arithmetic processing unit according to claim 1, wherein
among the components of the two vectors stored in the partial registers of the six registers, the first component of the first vector and the first component of the second vector, the second component of the first vector and the second component of the second vector, and the third component of the first vector and the third component of the second vector each have the same attribute,
the first component of the first vector is stored in the first upper partial register and the second lower partial register, the second component in the first lower partial register and the third upper partial register, and the third component in the second upper partial register and the third lower partial register, and
the first component of the second vector is stored in the fourth lower partial register and the fifth upper partial register, the second component in the fourth upper partial register and the sixth lower partial register, and the third component in the fifth lower partial register and the sixth upper partial register.

3. The arithmetic processing apparatus according to claim 2, wherein the components having the same attribute in the two vectors are an x component, a y component, and a z component indicating coordinate values in three-dimensional space, and either the first component corresponds to the x component, the second component to the y component, and the third component to the z component, or the first component corresponds to the z component, the second component to the y component, and the third component to the x component.

4. The arithmetic processing apparatus according to claim 1, wherein image processing by shading is executed.