JP2017123034A

JP2017123034A - Image processing apparatus including CIP and image processing method

Info

Publication number: JP2017123034A
Application number: JP2016001394A
Authority: JP
Inventors: 周作澤戸; Shusaku Sawato; 拓大澤; Hiroshi Osawa
Original assignee: Digital Media Professionals Inc
Current assignee: Digital Media Professionals Inc
Priority date: 2016-01-06
Filing date: 2016-01-06
Publication date: 2017-07-13

Abstract

PROBLEM TO BE SOLVED: To provide an image processing apparatus that draws an image in accordance with a shader program and achieves faster processing and lower power consumption by making efficient use of a fixed-function pipeline. SOLUTION: An image processing apparatus includes: a main unit 10 whose calculation processing is programmable; a subunit 20 which has a plurality of calculation units that each perform predetermined calculation processing, the calculation units being connected in series so that the result from the calculation unit in the preceding stage is input to the calculation unit in the following stage; and a compilation processing unit 30 which extracts, from a shader program, code that can be executed by the subunit 20 in accordance with the connection pattern of the calculation units constituting the subunit 20, and rewrites the extracted code into subunit code executable by the subunit 20. In accordance with the shader program rewritten by the compilation processing unit 30, the main unit 10 causes the subunit 20 to execute the subunit code and executes the other code itself. SELECTED DRAWING: Figure 1

Description

The present invention relates to an image processing apparatus including a CIP (Combined Instruction Processor) that functions as a fixed-function pipeline, and to an image processing method executed by the apparatus. More specifically, the present invention relates to an apparatus and a method that draw computer graphics at high speed and with low power consumption by having part of the arithmetic processing of a programmable pipeline (main unit) executed by a fixed-function pipeline (subunit: CIP).

Computer graphics have traditionally been drawn by fixed-function pipelines, which are easy to implement, fast, and compact. As computers have advanced, however, image processing by fixed-function pipelines has been replaced by image processing by programmable pipelines, which are more flexible and capable of a wider variety of processing, and today most image processing is performed by programmable pipelines (Patent Document 1, Patent Document 2). A programmable pipeline incorporates, for example, programmable shaders with which a programmer defines the image generation algorithm, and it can be used for a variety of purposes. Programmable shaders may be configured individually as a vertex shader, a geometry shader, a fragment shader (pixel shader), and so on, or they may be integrated into a unified shader. In hardware, such programmable shader processors are generally implemented by placing multiple processors, each carrying multiple programmable shaders, on a chip or a board according to the processing capability required. Because a programmable pipeline can run arbitrary shader programs, it allows flexible expression and can render complex shading that conventional fixed-function pipelines could not.

However, because a programmable pipeline is designed to perform a wide variety of operations in a general-purpose manner, a processor equipped with one tends to be larger than a configuration with only a fixed-function pipeline. Moreover, obtaining a given level of computing capability requires making the programmable pipeline larger still, which is a particularly serious problem in devices with limited mounting space such as portable game machines. The applicant therefore developed a hybrid image processing apparatus that receives information processed by a CPU and includes both a fixed-function pipeline that performs predetermined arithmetic processing and a programmable pipeline whose arithmetic processing is programmable (Patent Document 3). In this image processing apparatus, the information processed by the CPU is input to either the fixed-function pipeline or the programmable pipeline and is selectively processed. This made image processing faster while limiting power consumption and chip mounting space.

Patent Document 1: JP 2005-322224 A
Patent Document 2: JP 2010-250625 A
Patent Document 3: JP 2012-164238 A

However, although the image processing apparatus disclosed in Patent Document 3 implements both a fixed-function pipeline and a programmable pipeline, it operates only one of them when performing image processing. That is, information processed by the CPU is selectively input to either the fixed-function pipeline or the programmable pipeline: complex processing, such as regions that attract the viewer's attention or portions that require special arithmetic processing, is executed by the programmable pipeline, while other, simpler processing is executed by the fixed-function pipeline. The apparatus thus processes an input shader program in one pipeline or the other and has no function for operating both simultaneously. Furthermore, because the apparatus operates the two pipelines selectively, the programmable pipeline must always be used whenever a shader program exceeds the processing capability of the fixed-function pipeline even slightly. If such processing continues, the fixed-function pipeline does no work at all; not only is the speedup expected from the fixed-function pipeline lost, but the space occupied by the fixed-function pipeline may also be wasted.

Accordingly, the problem to be solved by the present invention is to achieve still faster image processing and lower power consumption in an image processing apparatus that has both a fixed-function pipeline and a programmable pipeline, by using the fixed-function pipeline efficiently.

As a result of diligent study of means for solving the above problem, the inventors arrived at the following finding. In compile processing, code (lines) that can be executed by the fixed-function pipeline (subunit) is extracted from the original shader program and rewritten into code executable by the fixed-function pipeline. The compiled shader program is then input to the programmable pipeline (main unit), and under the control of the programmable pipeline the rewritten portion of the code is executed by the fixed-function pipeline. As a result, even when a shader program that exceeds the processing capability of the fixed-function pipeline is input, part of that shader program can be executed by the fixed-function pipeline while the remainder is executed at the same time by the programmable pipeline. The fixed-function pipeline is therefore used efficiently, and still faster image processing and lower power consumption can be achieved. The inventors conceived that the problems of the prior art can be solved on the basis of this finding and completed the present invention. Specifically, the present invention has the following configurations and steps.

A first aspect of the present invention relates to an image processing apparatus that draws an image according to a shader program. The image processing apparatus according to the present invention includes a main unit 10, a subunit 20, and a compile processing unit 30.
The main unit 10 is a programmable pipeline; the arithmetic processing it executes is programmable.
The subunit 20 is a fixed-function pipeline. It has a plurality of arithmetic units that each perform predetermined arithmetic processing, and the arithmetic units are connected in series so that the result of the arithmetic unit in the preceding stage is input to the arithmetic unit in the following stage. One or more subunits 20 are provided in the image processing apparatus. A configuration in which a plurality of arithmetic units are connected in series in this way is referred to in this specification as a CIP (Combined Instruction Processor).
The compile processing unit 30 performs compile processing in which, according to the connection pattern of the arithmetic units constituting the subunit 20, it extracts from the shader program code that can be executed by the subunit 20 and rewrites the extracted code into subunit code executable by the subunit 20. The shader program compiled by the compile processing unit 30 is input to the main unit 10.
In accordance with the shader program rewritten by the compile processing unit 30, the main unit 10 causes the subunit 20 to execute the subunit code and executes the remaining code itself.

In the present invention, a shader program includes a plurality of codes (lines), each of which defines a variable obtained from an operation performed according to an instruction. For example, in the code "A = ADD B, C" shown in FIG. 2(a), the part "A" is the variable and the part "ADD B, C" is the instruction. Dependencies exist among the codes of a shader program: the variable of one code appears in the instruction of another code. In the compile processing, the compile processing unit 30 extracts from the shader program those codes whose inter-code dependencies match the dependencies between the arithmetic units constituting the subunit 20, and rewrites them into subunit code executable by the subunit 20.
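The dependency matching described above can be sketched as a search for a run of lines in which each line reads the variable defined by the previous one and the instruction names follow the subunit's connection pattern. This is only an illustrative sketch: the `Code` tuple, the `find_chain` function, the `TEX` mnemonic, and the three-stage pattern are assumptions made for the example and are not taken from the specification, and the sketch ignores complications such as a variable being redefined.

```python
from typing import NamedTuple, Optional, Sequence

class Code(NamedTuple):
    """One shader line such as "A = ADD B, C" (hypothetical representation)."""
    var: str      # variable defined by the line
    op: str       # instruction mnemonic
    args: tuple   # operand names read by the instruction

def find_chain(program: Sequence[Code], pattern: Sequence[str]) -> Optional[list]:
    """Return the indices of the first chain of lines whose instructions match
    `pattern` in order and where each line reads the previous line's variable."""
    for start, line in enumerate(program):
        if line.op != pattern[0]:
            continue
        chain = [start]
        for op in pattern[1:]:
            prev = program[chain[-1]]
            following = range(chain[-1] + 1, len(program))
            nxt = next((i for i in following
                        if program[i].op == op and prev.var in program[i].args), None)
            if nxt is None:
                break
            chain.append(nxt)
        if len(chain) == len(pattern):
            return chain
    return None

program = [
    Code("A", "ADD", ("B", "C")),
    Code("X", "TEX", ("U", "V")),   # unrelated line, stays on the main unit
    Code("D", "MUL", ("A", "E")),
    Code("F", "ADD", ("D", "G")),
]
chain = find_chain(program, ["ADD", "MUL", "ADD"])
```

Here the lines defining `A`, `D`, and `F` form a chain that matches an ADD, MUL, ADD connection pattern, while the unrelated `TEX` line is skipped.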

In the present invention, the compile processing unit 30 preferably extracts a code for rewriting into subunit code only when the entire set of instructions that use the variable of that code can also be extracted from the shader program for rewriting into subunit code.
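One way to read this condition is that an intermediate variable of a candidate chain must not be read by any line outside the chain, since such a line would still need the value after the chain has been folded into the subunit. The check below is a minimal sketch under that reading; the tuple layout `(variable, instruction, operands)` and the function names are assumptions for illustration.

```python
def uses(program, var):
    """Indices of the lines whose instruction reads `var`."""
    return {i for i, (_, _, args) in enumerate(program) if var in args}

def chain_is_extractable(program, chain):
    """True only if every intermediate variable of the chain is read
    exclusively by lines that belong to the chain."""
    members = set(chain)
    for idx in chain[:-1]:          # the last result is handed back to the main unit
        var = program[idx][0]
        if not uses(program, var) <= members:
            return False
    return True

program = [
    ("A", "ADD", ("B", "C")),
    ("D", "MUL", ("A", "E")),
    ("F", "ADD", ("D", "G")),
    ("H", "MUL", ("A", "F")),       # reads A again, outside the chain
]
ok_chain = chain_is_extractable(program[:3], [0, 1, 2])
bad_chain = chain_is_extractable(program, [0, 1, 2])
```

With only the first three lines the chain qualifies; once the fourth line also reads `A`, the same chain is rejected.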

The image processing apparatus of the present invention preferably includes a plurality of subunits 20. In this case, the connection patterns of the arithmetic units constituting the respective subunits 20 preferably differ from one another. By preparing multiple configurations of the subunit 20 in this way, and by designing the apparatus so that the main unit 10 specifies which configuration to use when it requests processing from the subunit 20, processing can be assigned to the subunit 20 at multiple places in the shader program, which further increases how often the subunit 20 is used.
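As a rough sketch of how a request might name one of several configurations, the lookup below maps a chain's instruction sequence to a configuration identifier. The table contents, the identifiers, and the `select_config` function are hypothetical; the specification does not state how configurations are encoded.

```python
# Hypothetical table of subunit configurations, keyed by a configuration id.
SUBUNIT_CONFIGS = {
    0: ("ADD", "MUL", "ADD", "MUL"),
    1: ("MUL", "ADD", "MUL", "ADD"),
    2: ("MUL", "MUL", "ADD", "ADD"),
}

def select_config(chain_ops):
    """Return the id of the configuration whose connection pattern equals
    the chain's instruction sequence, or None if no configuration fits."""
    wanted = tuple(chain_ops)
    for config_id, pattern in SUBUNIT_CONFIGS.items():
        if pattern == wanted:
            return config_id
    return None

first = select_config(["MUL", "ADD", "MUL", "ADD"])
missing = select_config(["ADD", "ADD", "ADD", "ADD"])
```

A chain with no matching configuration would simply remain on the main unit in this sketch.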

In the present invention, the main unit 10 preferably has a plurality of contexts, each storing the values of variables used in arithmetic processing, and can switch the arithmetic processing it performs by switching the context in use. Because the main unit 10 has this context-switching function, it can perform other arithmetic processing itself while the subunit 20 is executing a particular process. This hides the time spent waiting for results from the subunit 20 and makes image processing still more efficient.
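The latency-hiding effect can be illustrated with a toy cycle count. The model below is an assumption made purely for illustration: each context is a list of operations, `"M"` is a one-cycle main-unit operation, `"S"` is a request to the subunit that takes one cycle to issue and then blocks its context for `latency` cycles, and the subunit is assumed to accept overlapping requests.

```python
def total_cycles(contexts, latency, switching):
    """Cycles needed to finish all contexts, with or without context switching."""
    pc = [0] * len(contexts)         # next operation per context
    ready_at = [0] * len(contexts)   # cycle at which each context may run again
    cycle = 0
    while any(pc[i] < len(c) for i, c in enumerate(contexts)):
        unfinished = [i for i, c in enumerate(contexts) if pc[i] < len(c)]
        if switching:
            runnable = [i for i in unfinished if ready_at[i] <= cycle]
        else:
            # Without switching the main unit stays on one context and stalls.
            first = unfinished[0]
            runnable = [first] if ready_at[first] <= cycle else []
        if runnable:
            i = runnable[0]
            if contexts[i][pc[i]] == "S":
                ready_at[i] = cycle + 1 + latency   # result arrives later
            pc[i] += 1
        cycle += 1
    return max(cycle, max(ready_at))

work = [["M", "S", "M"], ["M", "S", "M"]]
stalled = total_cycles(work, latency=4, switching=False)
overlapped = total_cycles(work, latency=4, switching=True)
```

In this toy setup the two contexts finish in fewer cycles when the main unit switches to the other context while a subunit request is outstanding.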

In the image processing apparatus of the present invention, the subunit 20 preferably has four stages of arithmetic units. Considering the size and performance of the circuits constituting the image processing apparatus, four stages is statistically the preferable number for the subunit 20.

In the image processing apparatus of the present invention, the arithmetic units constituting the subunit 20 preferably consist of adders and multipliers. Image processing for computer graphics involves frequent addition and multiplication, so having the subunit 20 perform these operations makes more effective use of the subunit 20.
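A subunit of this kind can be pictured as a short chain of adders and multipliers in which each stage combines the running result with one new operand. The sketch below models such a chain in software; the function `run_subunit`, the particular ADD, MUL, ADD, MUL ordering, and the assumption that each stage takes exactly one fresh operand are illustrative choices, not details from the specification.

```python
from typing import Callable, Sequence

# A stage combines the value from the preceding stage with one new operand.
Stage = Callable[[float, float], float]

def add(a: float, b: float) -> float:
    return a + b

def mul(a: float, b: float) -> float:
    return a * b

def run_subunit(stages: Sequence[Stage], first: float,
                operands: Sequence[float]) -> float:
    """Pass `first` through the chain; stage i receives operands[i]."""
    if len(stages) != len(operands):
        raise ValueError("each stage needs exactly one additional operand")
    value = first
    for stage, operand in zip(stages, operands):
        value = stage(value, operand)   # output of one stage feeds the next
    return value

# Hypothetical four-stage connection pattern: ADD -> MUL -> ADD -> MUL
PATTERN = [add, mul, add, mul]
result = run_subunit(PATTERN, 1.0, [2.0, 3.0, 4.0, 5.0])   # ((1 + 2) * 3 + 4) * 5
```

Only the final value leaves the chain; the intermediate results never need to be stored by the caller.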

In the image processing apparatus of the present invention, the main unit 10 preferably includes a processing unit (for example, a texture processing unit) that does not perform arithmetic processing at the same time as the subunit 20, and at least some of the arithmetic units constituting that processing unit are preferably shared with the arithmetic units constituting the subunit 20. Sharing some of the arithmetic units of the main unit 10 with the subunit 20 in this way further reduces the circuit scale of the image processing apparatus as a whole.

In the image processing apparatus according to the present invention, the compile processing unit 30 described above need not be provided in the image processing apparatus itself. That is, the shader program may be one in which another computer or the like has already rewritten the code executable by the subunit 20 into subunit code, according to the connection pattern of the arithmetic units constituting the subunit 20. In this case, in accordance with the previously rewritten shader program, the main unit 10 causes the subunit 20 to execute the subunit code and executes the remaining code itself.

A second aspect of the present invention relates to an image processing method executed by an image processing apparatus that draws an image according to a shader program. As in the first aspect, the image processing apparatus includes a main unit 10 whose arithmetic processing is programmable, and one or more subunits 20 each having a plurality of arithmetic units that perform predetermined arithmetic processing, the arithmetic units being connected in series so that the result of the arithmetic unit in the preceding stage is input to the arithmetic unit in the following stage.
The image processing method includes a compile processing step and an execution step.
In the compile processing step, code that can be executed by the subunit 20 is extracted from the shader program according to the connection pattern of the arithmetic units constituting the subunit 20, and the extracted code is rewritten into subunit code executable by the subunit 20.
In the execution step, in accordance with the shader program rewritten in the compile processing step, the main unit 10 causes the subunit 20 to execute the subunit code and executes the remaining code itself.
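The two steps of the method can be sketched end to end as follows. This is a simplified illustration, not the patented implementation: the `CIP` pseudo-instruction, its operand encoding, and the function names `compile_step` and `execute_step` are all assumptions, and the sketch handles only commutative two-operand instructions in a chain that has already been identified.

```python
OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}

def compile_step(program, chain):
    """Fold the lines at indices `chain` into one hypothetical CIP line that
    takes the place of the last line of the chain."""
    stages = [program[i] for i in chain]
    # Each later stage contributes the operand that is not the previous result.
    extra = tuple(next(a for a in cur[2] if a != prev[0])
                  for prev, cur in zip(stages, stages[1:]))
    ops = tuple(stage[1] for stage in stages)
    cip_line = (stages[-1][0], "CIP", (ops, stages[0][2], extra))
    dropped = set(chain[:-1])
    return [cip_line if i == chain[-1] else line
            for i, line in enumerate(program) if i not in dropped]

def execute_step(program, inputs):
    """Main-unit loop: CIP lines go to the (simulated) subunit, the rest run here."""
    env = dict(inputs)
    for var, op, args in program:
        if op == "CIP":
            ops, (a, b), extra = args
            value = OPS[ops[0]](env[a], env[b])
            for stage_op, name in zip(ops[1:], extra):
                value = OPS[stage_op](value, env[name])   # chained stages
            env[var] = value
        else:
            env[var] = OPS[op](env[args[0]], env[args[1]])
    return env

original = [
    ("A", "ADD", ("B", "C")),
    ("D", "MUL", ("A", "E")),
    ("F", "ADD", ("D", "G")),
]
inputs = {"B": 1.0, "C": 2.0, "E": 3.0, "G": 4.0}
compiled = compile_step(original, [0, 1, 2])
before = execute_step(original, inputs)["F"]
after = execute_step(compiled, inputs)["F"]
```

The rewritten program has a single line, and it produces the same final value for `F` as the original three lines.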

According to the present invention, in an image processing apparatus having both a fixed-function pipeline and a programmable pipeline, the fixed-function pipeline is used efficiently, so that still faster image processing and lower power consumption can be achieved.

More specifically, in the compile processing of the present invention, code that can be executed by the subunit 20 (fixed-function pipeline) is extracted from the original shader program and rewritten into code executable by the subunit 20. The compiled shader program is then input to the main unit 10 (programmable pipeline), and under the control of the main unit 10 the rewritten portion of the code is executed by the subunit 20. As a result, even when a shader program that as a whole exceeds the processing capability of the subunit 20 is input, part of that shader program can be executed by the subunit 20 while the remainder is executed at the same time by the main unit 10. Because the main unit 10 and the subunit 20 can operate simultaneously to perform image processing, computing capability is increased and image processing is accelerated while growth in circuit scale is limited. In addition, the subunit 20, being a fixed-function pipeline, consumes far less power than the main unit 10, so actively assigning the executable portions to the subunit 20 keeps the power consumption of the entire image processing apparatus low.

Furthermore, even when a shader program that exceeds the processing capability of the subunit 20 is input, part of its processing can be assigned to the subunit 20. The subunit 20 is therefore actively used, and the space it occupies is not wasted.

In addition, in the compile processing of the present invention, the original shader program is analyzed according to the connection pattern of the arithmetic units constituting the subunit 20, so that code executable by the subunit 20 is extracted and automatically rewritten into code for the subunit 20. The programmer therefore does not need to write the shader program with the connection pattern of the arithmetic units in the subunit 20 in mind, which reduces the burden on the programmer. Moreover, even when an existing shader program written in the past is input unchanged to the image processing apparatus of the present invention, it is rewritten automatically, so any shader program can be used in a general-purpose manner.

Using the subunit 20, in which a plurality of arithmetic units are connected in series, also reduces the number of times intermediate variables of the arithmetic processing are read from and written to the context, so the size of the context can be kept small. Fewer reads and writes of variables to the context also lighten the load on the memory access bus. With a lighter bus load, contention for access to images such as textures decreases, so each unit operates more smoothly.
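A back-of-the-envelope count shows why chaining reduces context traffic. The model below is an assumption for illustration only: every operation has two operands, the main unit reads both operands from the context and writes one result per operation, and a chained subunit reads each fresh operand once and writes only the final result.

```python
def context_accesses(n_ops: int, chained: bool) -> int:
    """Count context reads plus writes for a chain of n two-operand operations."""
    if chained:
        # n + 1 fresh operands are read once; only the final result is written.
        return (n_ops + 1) + 1
    # Each operation reads two operands and writes one result.
    return 3 * n_ops

separate = context_accesses(4, chained=False)
chained = context_accesses(4, chained=True)
```

Under these assumptions a four-operation chain needs half as many context accesses when it runs on the subunit.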

FIG. 1 is a block diagram showing the basic concept of the image processing apparatus and the image processing method.
FIG. 2 is a block diagram showing a specific configuration of the image processing apparatus.
FIG. 3 is a block diagram showing a configuration example of the subunit.
FIG. 4 shows the basic concept of processing by the compile processing unit.
FIG. 5 is a block diagram showing a specific configuration of the subunit.
FIG. 6 shows a specific example of processing by the compile processing unit.
FIG. 7 is an explanatory diagram of the conditions for processing by the compile processing unit.
FIG. 8 shows an example of optimization processing by the compile processing unit.
FIG. 9 shows an example of the hardware configuration of the subunit.
FIG. 10 is a block diagram showing an embodiment of the image processing apparatus.
FIG. 11 is a block diagram showing another embodiment of the image processing apparatus.

Embodiments for carrying out the present invention are described below with reference to the drawings. The present invention is not limited to the embodiments described below and includes modifications that a person skilled in the art could make to them within an obvious range.

FIG. 1 is a block diagram showing the basic concept of the image processing apparatus and the image processing method according to the present invention. As shown in FIG. 1, an original shader program composed of a plurality of codes (lines) is input to the compile processing unit 30. The compile processing unit 30 performs compile processing that rewrites a shader program written in a programming language (source code) into a shader program in machine language that a computer can execute directly (binary code). In this compile processing, the compile processing unit 30 extracts from the original shader program the code that can be executed by the subunit 20 (CIP), whose arithmetic function is fixed, and rewrites the extracted code into subunit code executable by the subunit 20. The details of the compile processing are described later. The compiled shader program is input to the GPU 100 (Graphics Processing Unit). The GPU 100 includes a main unit 10, whose arithmetic processing is freely programmable, and one or more subunits 20 (CIP), whose arithmetic functions are fixed. On receiving the compiled shader program, the main unit 10 follows the program, causing the subunit 20 to execute the subunit code and executing the remaining code itself. The main unit 10 then combines its own results with the results from the subunit 20 and outputs them to the external memory 200 or another external circuit.

The compile processing unit 30 is preferably implemented in the same image processing apparatus (computer) as the GPU 100. However, the compile processing unit 30 can also be implemented on a computer other than the one on which the GPU 100 is implemented. In that case, the shader program compiled by the compile processing unit 30 of the other computer is input to the image processing apparatus containing the GPU 100 via a recording medium or a communication network such as the Internet.

FIG. 2 is a block diagram showing the configuration of the GPU 100 of FIG. 1 in more detail. The GPU 100 includes a main unit 10 and one or more subunits 20 (CIP), and the subunit 20 operates under the control of the main unit 10. For example, the main unit 10 can be implemented as hardware that functions as a programmable pipeline supporting OpenGL 2.x and later, and the subunit can be implemented as hardware that functions as a fixed-function pipeline supporting the legacy OpenGL 1.x.

The main unit 10 of the GPU 100 typically includes a management unit 11, a vertex processing unit 12, a rasterization processing unit 13, a fragment processing unit 14, a texture processing unit 15, a color update processing unit 16, and an internal memory 17. The main unit 10 is also connected to an external memory 200 located outside the GPU 100.

The management unit 11 receives the compiled shader program from the compile processing unit 30 and controls the processing units 12 to 16 according to this shader program. The management unit 11 reads the vertex information of polygon data (a three-dimensional model) from the external memory 200 and passes it to the vertex processing unit 12.

The vertex processing unit 12 performs vertex processing, including coordinate transformation and lighting calculation, on each vertex of the polygon data. Vertex processing includes, for example, coordinate transformation that converts vertex coordinates expressed in the modeling coordinate system into the world coordinate system, the camera coordinate system, or the projection coordinate system; lighting calculation that computes the luminance of each vertex from its distance and angle to the light source; and texture processing that computes texture coordinate values for texture mapping. The vertex processing unit 12 passes the vertex information obtained from the vertex processing to the rasterization processing unit 13.
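As a small illustration of the two calculations named above, the sketch below transforms a vertex with a 4x4 matrix and computes a luminance from the angle and distance to a point light. The function names, the row-major matrix layout, the unit-length normal, and the `1 / (1 + d^2)` attenuation are assumptions chosen for the example; the specification does not give formulas.

```python
import math

def transform(matrix, vertex):
    """Apply a 4x4 row-major matrix to a 3D point, then divide by w."""
    x, y, z = vertex
    column = (x, y, z, 1.0)
    out = [sum(m * c for m, c in zip(row, column)) for row in matrix]
    return tuple(component / out[3] for component in out[:3])

def vertex_luminance(vertex, normal, light_pos):
    """Cosine of the angle to the light (clamped at 0), attenuated by distance.
    `normal` is assumed to have unit length; the attenuation is illustrative."""
    to_light = [l - p for l, p in zip(light_pos, vertex)]
    distance = math.sqrt(sum(c * c for c in to_light))
    if distance == 0.0:
        return 1.0
    cos_angle = sum(n * c for n, c in zip(normal, to_light)) / distance
    return max(0.0, cos_angle) / (1.0 + distance * distance)

# Modeling-to-world transform that shifts the model by 10 along x.
TRANSLATE_X10 = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
world = transform(TRANSLATE_X10, (1.0, 2.0, 3.0))
lum = vertex_luminance((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0))
```

Both functions consist almost entirely of multiplications and additions, the operations the subunit is described as providing.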

The rasterization processing unit 13 performs rasterization, which generates per-pixel data (fragments) for the polygon data from the vertex information obtained by the vertex processing. The results of vertex processing apply only to the vertices themselves; to actually draw computer graphics, a drawing color must be computed for every pixel inside the polygon formed by those vertices. The rasterization processing unit 13 therefore converts a figure expressed as a combination of vertices and line segments (vectors) into a set of pixels. Interpolation of luminance values inside the polygon from the luminance values at the vertices, as in Gouraud shading, may also be performed during rasterization. The rasterization processing unit 13 passes the fragments generated by rasterization to the fragment processing unit 14.
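A minimal software sketch of this step is shown below: it walks the bounding box of a triangle, keeps the pixels whose centers lie inside it, and interpolates the vertex luminance with barycentric weights, as in Gouraud shading. The function names, the integer vertex coordinates, and the pixel-center sampling rule are assumptions for illustration and say nothing about how the hardware unit is built.

```python
def edge(a, b, p):
    """Signed area term: positive if p is to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, lum):
    """Return {(x, y): luminance} for pixels whose centers fall inside the
    triangle; `lum` holds the luminance at v0, v1, v2 (integer coordinates)."""
    area = edge(v0, v1, v2)
    if area == 0:
        return {}                      # degenerate triangle covers no pixels
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    fragments = {}
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            p = (x + 0.5, y + 0.5)     # sample at the pixel center
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                fragments[(x, y)] = w0 * lum[0] + w1 * lum[1] + w2 * lum[2]
    return fragments

fragments = rasterize((0, 0), (4, 0), (0, 4), lum=(1.0, 0.0, 0.0))
```

Each fragment carries a luminance that falls off smoothly with distance from the bright vertex at the origin.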

The fragment processing unit 14 performs arithmetic processing (fragment processing) on the fragments generated by rasterization. For each fragment, the fragment processing unit 14 performs various color-related calculations, such as color value, luminance, and transparency. The fragment processing unit 14 can also perform operations based on the luminance value of each fragment and the value of a referenceable texture; in that case it computes a blend (a sum or a product) of the texture value with the luminance value of the fragment. Specifically, the fragment processing unit 14 can perform texture processing, which applies a texture image that gives the surface of a three-dimensional object its appearance. Through texture processing, texture coordinates (u, v) corresponding to the texture image are assigned to the pixel data. The fragment processing unit 14 can also perform shading, which changes the color value (tone and gradation) of each pixel of the object in consideration of the angle of the light and the distance from the light source. For example, the vertex processing unit 12 performs light source calculation during vertex processing, obtains the light source information for shading, the lighting model, the normal vector of each vertex of the object, and the like, and obtains the color value (RGB) of each vertex of the polygon from this vertex information. The fragment processing unit 14 can then obtain the color value of each pixel of the polygon from the per-vertex color values by, for example, Phong shading or Gouraud shading. Thereafter, the fragment processing unit 14 writes the results of the fragment color calculations to the external memory 200 for the display screen.

When texture data stored in the external memory 200 is needed for image processing of a three-dimensional model, the texture processing unit 15 reads the necessary texture data from the external memory 200 and passes it to the vertex processing unit 12, the rasterization processing unit 13, or the fragment processing unit 14. The texture data that the texture processing unit 15 has read from the external memory 200 can also be stored temporarily in the internal memory 17, in which case the vertex processing unit 12 and the other units can read the necessary texture data from the internal memory 17.

The color update processing unit 16 is an optional element. The color update processing unit 16 merges the result of the fragment processing into the content stored in the frame buffer in the external memory 200 (or the internal memory 17), or updates the per-fragment information about that content.

The internal memory 17 is, in principle, connected to each of the processing units 12 to 16, and the information that the processing units 12 to 16 need for their arithmetic processing is written to and read from it. The internal memory 17 can temporarily store part or all of the shader program executed by the GPU 100 and can store the values of variables that appear during calculation; the internal memory 17 thus also serves as a cache. Ultimately, all information, including the shader program, the final result image, the referenced texture images, and the vertex information, is stored in the external memory 200.

The configuration of the main unit 10 described here is only an example, and any known configuration can be adopted as appropriate. For example, the vertex processing unit 12 and the fragment processing unit 14 may be implemented as a single device that executes vertex processing and fragment processing in a time-shared manner. Various other configurations are conceivable, such as connecting the fragment processing unit 14 and the texture processing unit 15 directly, or having them exchange data through the internal memory 17. The details of these configurations do not limit the specification of the present invention.

The subunit 20 (CIP) of the GPU 100 takes over part of the processing that would otherwise be executed by the main unit 10, in order to increase processing speed and reduce power consumption. That is, while the subunit 20 is executing a process, the main unit 10 can execute other processing at the same time. The subunit 20 can be connected to one or more of the processing units of the main unit 10 (the vertex processing unit 12, the rasterization processing unit 13, the fragment processing unit 14, the texture processing unit 15, and the color update processing unit 16). In the embodiment shown in FIG. 2, the subunit 20 is connected to the vertex processing unit 12 and the fragment processing unit 14. The subunit 20 thus preferably takes over part of the vertex processing and/or fragment processing executed by the vertex processing unit 12 and the fragment processing unit 14. However, the processing units to which the subunit 20 is connected are not limited to these. In the example shown in FIG. 2, one subunit 20 is mounted in the image processing apparatus, but two or more subunits 20 can also be mounted.

FIG. 3 shows an example of the configuration of the subunit 20. As shown in FIG. 3, the subunit 20 includes a plurality of arithmetic units 21, 22, and 23 that perform predetermined arithmetic processing. The arithmetic units 21 to 23 are connected in series so that the result of each preceding arithmetic unit is input to the following arithmetic unit. That is, in the example shown in FIG. 3, the result of the first arithmetic unit 21 is input to the second arithmetic unit 22, and the second arithmetic unit 22 performs an operation based on the result of the first arithmetic unit 21. The result of the second arithmetic unit 22 is in turn input to the third arithmetic unit 23, and the third arithmetic unit 23 performs an operation based on the result of the second arithmetic unit 22. The result of the third arithmetic unit 23 is then returned to the main unit 10. In this specification, the configuration of the subunit 20 (fixed function pipeline) having a plurality of arithmetic units 21 to 23 connected in series in this way is referred to as a CIP (Combined Instruction Processor). The subunit 20 has no internal memory for writing out the results of the individual arithmetic units or for reading data when an arithmetic unit performs an operation. Because the arithmetic units are connected in series, the subunit 20 can perform its operations without using an internal memory. Because it neither writes to nor reads from an internal memory, the subunit 20 can complete its arithmetic processing at high speed and with low power consumption. The number of arithmetic unit stages in the subunit 20 may be two, or may be three or more. Statistically, four stages are particularly preferable.
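The serial connection can be pictured with a short sketch. This is only an illustration under assumptions: the function name `run_serial_units`, the use of two-input units throughout, and the convention that the preceding result feeds the first operand are introduced here and are not taken from the specification.

```python
import operator


def run_serial_units(units, first_operands, later_operands):
    """Pass a value through arithmetic units connected in series.

    units          : list of two-input functions, one per stage
    first_operands : the two inputs of the first stage
    later_operands : one extra input for each stage after the first
    No intermediate result is stored; each result goes straight to the
    next stage, and only the final result is returned to the caller.
    """
    if not units:
        raise ValueError("at least one arithmetic unit is required")
    if len(later_operands) != len(units) - 1:
        raise ValueError("need one extra operand per stage after the first")
    result = units[0](*first_operands)
    for unit, operand in zip(units[1:], later_operands):
        result = unit(result, operand)
    return result


# Three stages in series, for example ADD -> ADD -> ADD.
ADD_ADD_ADD = [operator.add, operator.add, operator.add]
```

For example, `run_serial_units(ADD_ADD_ADD, (1, 2), [3, 4])` adds 1 and 2 in the first stage, then adds 3 and 4 in the later stages, without any variable being written to a memory in between.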

Known arithmetic units can be used for the subunit 20. For example, the arithmetic units preferably include at least an adder (ADD) and a multiplier (MUL). In addition, one or more types of arithmetic unit selected from MADD, EQUAL, NEQUAL, LESS, LEQUAL, SUM, RSQ, SEL, EXP, and LOG can be used. As shown in FIG. 3, each arithmetic unit may have two inputs, or may have one input or three or more inputs. For example, a two-input arithmetic unit preferably performs addition or multiplication, and a three-input arithmetic unit preferably performs a multiply-add instruction. A one-input arithmetic unit preferably performs the operation "RSQ(x) = 1/sqrt(x)". Which operation each arithmetic unit executes is set in advance by the driver program before drawing starts. These arithmetic unit settings may be created when the compile processing unit 30 converts the original shader program. When a plurality of subunits 20 are provided in the image processing apparatus, the arithmetic unit patterns of the individual subunits 20 preferably differ from one another. Providing a plurality of subunits 20 with different settings allows subunit processing to be assigned to a plurality of locations in the shader program, which further increases how often the subunits 20 are used.

As shown in FIG. 2, the subunit 20 may also share some or all of its arithmetic units with a processing unit of the main unit 10. That is, the main unit 10 includes a processing unit that does not perform arithmetic processing at the same time as the subunit 20. In this case, sharing some or all of the arithmetic units between that processing unit of the main unit 10 and the subunit 20 reduces the number of arithmetic units mounted in the image processing apparatus, so the image processing apparatus can be made even smaller. In the example shown in FIG. 2, the main unit 10 includes the texture processing unit 15, which basically does not operate at the same time as the subunit 20. Sharing arithmetic units between the texture processing unit 15 and the subunit 20 therefore does not affect image processing, and in the example shown in FIG. 2 the texture processing unit 15 and the subunit 20 can share arithmetic units.

The compile processing unit 30 knows the connection pattern of the arithmetic units that make up the subunit 20 described above, and performs compilation that includes rewriting the original shader program, in accordance with this connection pattern, into a program containing code executable by the subunit 20. The "connection pattern of the arithmetic units" here is information that includes the number of arithmetic unit stages in the subunit 20 and the type of arithmetic unit at each stage. For example, when the subunit 20 has three stages connected as "ADD → ADD → ADD", the compile processing unit 30 knows, as the connection pattern, that there are three stages and that the combination (dependency) of the arithmetic units is "ADD → ADD → ADD".

FIG. 4 is an explanatory diagram showing the concept of the code extraction and rewriting performed by the compile processing unit 30. As shown in FIG. 4, a shader program contains a plurality of lines of code, each defining a "variable" obtained by an operation according to an "instruction". In the example shown in FIG. 4(a), program 1 (a shader program) consists of three lines of code in total: a first line "A = ADD B, C" and two further lines. In the first line, "A" is the variable and "ADD B, C" is the instruction; the line means that constant B and constant C are added to obtain variable A. In program 1 shown in FIG. 4(a), no variable defined by one line appears in the instruction of another line, so there are no dependencies between the lines. To perform the operations of program 1, the GPU 100 must refer to a context, such as context 1, in which the constant values are already stored and space is provided for the variable values. Here, a table that stores variable values is called a context. Because the lines of program 1 have no dependencies, the processing cannot be assigned to a subunit 20 having the CIP configuration shown in FIG. 3. The compile processing unit 30 therefore cannot rewrite program 1 for the subunit 20.

In contrast, as shown in FIG. 4(b), in program 2 (a shader program) the variable "A" of the first line appears in the instruction of the second line, and the variable "E" of the second line appears in the instruction of the third line. The three lines of program 2 therefore have the dependency first line → second line → third line. However, with instructions written as in program 2, the first line is executed to obtain variable "A" and its value is stored in context 2; when the second line is executed, the value of "A" is read from context 2 to obtain variable "E", whose value is stored in context 2; and when the third line is executed, the value of "E" is read from context 2 to obtain variable "G", whose value is stored in context 3. When data must be written to and read from a context in this way, the arithmetic processing must be performed by the main unit 10, which has the internal memory 17, and cannot be performed by the subunit 20, which has no internal memory.
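The context traffic described above can be made concrete with a small sketch. Only the first line "A = ADD B, C" and the variable names A, E, and G come from the description; the remaining operands (D and F), the tuple encoding of a line, and the function name `run_on_main_unit` are hypothetical.

```python
OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}

# Hypothetical stand-in for program 2: each line is (variable, op, lhs, rhs).
PROGRAM_2 = [
    ("A", "ADD", "B", "C"),
    ("E", "ADD", "A", "D"),
    ("G", "ADD", "E", "F"),
]


def run_on_main_unit(program, context):
    """Execute line by line, reading and writing every value via the context.

    Returns the number of context reads and writes that were needed.
    """
    reads = writes = 0
    for dest, op, lhs, rhs in program:
        a, b = context[lhs], context[rhs]   # operands come from the context
        reads += 2
        context[dest] = OPS[op](a, b)       # result goes back to the context
        writes += 1
    return reads, writes
```

With a context holding B, C, D, and F, the three lines cost six reads and three writes, two of which (A and E) exist only to carry a value to the next line. These are the accesses that a memoryless subunit cannot perform.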

The compile processing unit 30 therefore rewrites program 2 so that it can be executed even by the subunit 20, which has no internal memory. For example, when the subunit 20 consists of three arithmetic unit stages as shown in FIG. 3 and every arithmetic unit is an adder (ADD), the dependency between the arithmetic units of the subunit 20 matches the dependency between the lines of program 2 exactly. In this case, the compile processing unit 30 extracts all the lines of program 2, rewrites them into code executable by the subunit 20, and outputs the compiled program 3 (a shader program). In the example shown in FIG. 4(b), program 3 has the variable parts of the first and second lines removed and has [PREV] written in the instruction parts of the second and third lines. [PREV] means that the result of the preceding line (expression) is substituted directly. In program 3, therefore, the result of the first line is substituted directly into the instruction of the second line, and the result of the second line is substituted directly into the instruction of the third line. Because each result is passed directly to the next line, program 3 needs no context for reading and writing intermediate results. And because the dependency between the lines of program 3 matches the dependency between the arithmetic units of the subunit 20, program 3 can be executed by the subunit 20. The compile processing unit 30 thus has the function of rewriting part or all of the original shader program to match the connection pattern of the arithmetic units that make up the subunit 20.
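A minimal sketch of the [PREV] rewrite follows. The operands D and F, the tuple encoding, and the function names `rewrite_with_prev` and `run_chain` are hypothetical; the sketch assumes the lines already form a strict chain in which each line consumes the variable of the line before it.

```python
OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}
PREV = "[PREV]"

# Hypothetical stand-in for program 2: (variable, op, lhs, rhs).
PROGRAM_2 = [
    ("A", "ADD", "B", "C"),
    ("E", "ADD", "A", "D"),
    ("G", "ADD", "E", "F"),
]


def rewrite_with_prev(program):
    """Drop intermediate variables and refer to them as [PREV] instead."""
    rewritten = []
    prev_dest = None
    last = len(program) - 1
    for index, (dest, op, lhs, rhs) in enumerate(program):
        if prev_dest is not None:
            if prev_dest not in (lhs, rhs):
                raise ValueError("lines do not form a dependency chain")
            lhs = PREV if lhs == prev_dest else lhs
            rhs = PREV if rhs == prev_dest else rhs
        # Only the final line keeps its variable; earlier ones are removed.
        rewritten.append((dest if index == last else None, op, lhs, rhs))
        prev_dest = dest
    return rewritten


def run_chain(rewritten, inputs):
    """Evaluate the rewritten lines; no intermediate value is stored by name."""
    prev = None
    for _, op, lhs, rhs in rewritten:
        a = prev if lhs == PREV else inputs[lhs]
        b = prev if rhs == PREV else inputs[rhs]
        prev = OPS[op](a, b)
    return prev
```

The rewritten form reads only the external inputs (B, C, D, F here) and hands each result to the next line, which mirrors how program 3 avoids the context.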

The code extraction and rewriting performed by the compile processing unit 30 will now be described in more detail with reference to FIGS. 5 and 6. FIG. 5 shows the connection pattern of the arithmetic units that make up the subunit 20. In the example shown in FIG. 5, the subunit 20 has three arithmetic unit stages 21 to 23 connected in series: the first arithmetic unit 21 is an adder (ADD), the second arithmetic unit 22 is a multiplier (MUL), and the third arithmetic unit 23 is an adder (ADD). The compile processing unit 30 knows this connection pattern of the arithmetic units of the subunit 20.

FIG. 6 shows the flow of the code extraction and rewriting performed by the compile processing unit 30. First, program 4 is input to the compile processing unit 30 as the original shader program. Program 4 consists of eight lines of code in total. The compile processing unit 30 analyzes program 4 and extracts, from among its lines, those whose dependency matches the dependency between the arithmetic units of the subunit 20. Specifically, because the dependency between the arithmetic units of the subunit 20 is "ADD → MUL → ADD" as described above, the compile processing unit 30 extracts from program 4 the lines that match this dependency. In the example shown in FIG. 6, the dependency among the first, fifth, and seventh lines of program 4 corresponds to the dependency between the arithmetic units. The compile processing unit 30 therefore extracts the first, fifth, and seventh lines and rewrites the program into program 5, in which these lines are rearranged into an appropriate order. The three lines that were scattered across the first, fifth, and seventh lines of program 4 are rewritten so that they are lined up as the fifth to seventh lines of program 5. The compile processing unit 30 can further combine the fifth to seventh lines of program 5 into a single line of code that the subunit 20 (CIP) can execute easily, generating program 6. In program 6, the fifth line is rewritten as "R15 = CIP(R2, R1, R12, R12)". As shown in FIG. 5, this line means that "R2" is supplied to input 1 of the subunit (CIP), "R1" to input 2, "R12" to input 3, and "R12" to input 4, and that the subunit (CIP) performs the operation to obtain the variable "R15". Because the fifth line of program 6 corresponds to the connection pattern of the arithmetic units of the subunit 20, it can be executed by the subunit 20. The first to fourth lines and the sixth line of program 6 cannot be executed by the subunit 20 and are therefore executed by the main unit 10.
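The extraction and fusion flow can be sketched as below. The eight-line `PROGRAM_4` is a hypothetical stand-in, since the description gives only the matching lines (first, fifth, seventh), the ADD → MUL → ADD pattern, and the fused line "R15 = CIP(R2, R1, R12, R12)". The functions `find_chain` and `fuse`, the tuple encoding, and the simple single-consumer rule are likewise assumptions, and the sketch relies on ADD and MUL being commutative.

```python
# Hypothetical program 4: (variable, op, source operands).
PROGRAM_4 = [
    ("R3", "ADD", ("R2", "R1")),     # line 1
    ("R5", "MUL", ("R4", "R4")),
    ("R7", "ADD", ("R5", "R6")),
    ("R9", "MUL", ("R8", "R7")),
    ("R13", "MUL", ("R3", "R12")),   # line 5
    ("R10", "MUL", ("R9", "R5")),
    ("R15", "ADD", ("R13", "R12")),  # line 7
    ("R16", "MUL", ("R15", "R10")),
]


def find_chain(program, pattern):
    """Return indices of lines whose dependency chain matches the pattern."""
    def consumers(var):
        return [k for k, (_, _, srcs) in enumerate(program) if var in srcs]

    def extend(chain):
        if len(chain) == len(pattern):
            return chain
        produced = program[chain[-1]][0]
        users = consumers(produced)
        if len(users) != 1:          # intermediate must have one consumer
            return None
        j = users[0]
        _, op, srcs = program[j]
        if j > chain[-1] and op == pattern[len(chain)] and srcs.count(produced) == 1:
            return extend(chain + [j])
        return None

    for i, (_, op, _) in enumerate(program):
        if op == pattern[0]:
            chain = extend([i])
            if chain:
                return chain
    return None


def fuse(program, chain):
    """Replace the chain with one CIP line placed where the chain ends."""
    inputs = list(program[chain[0]][2])
    for prev_idx, idx in zip(chain, chain[1:]):
        produced = program[prev_idx][0]
        inputs += [s for s in program[idx][2] if s != produced]
    cip_line = (program[chain[-1]][0], "CIP", tuple(inputs))
    fused = []
    for k, line in enumerate(program):
        if k == chain[-1]:
            fused.append(cip_line)
        elif k not in chain:
            fused.append(line)
    return fused
```

With the pattern `("ADD", "MUL", "ADD")`, the sketch picks the first, fifth, and seventh lines and produces a six-line program whose fifth line is the CIP call, in line with the description of program 6.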

By rewriting the original shader program (program 4) into a program (program 6) that contains code executable by the subunit 20 together with other code, the subunit 20 and the main unit 10 can operate at the same time and perform operations in parallel. This speeds up the arithmetic processing. In addition, because the subunit 20, being a fixed function pipeline, consumes far less power than the main unit 10, assigning part of the arithmetic processing described in the shader program to the subunit 20 reduces the power consumption of the image processing apparatus as a whole. Furthermore, because the subunit 20, composed of a plurality of arithmetic units, has a smaller circuit scale than the programmable main unit 10, many of the required calculation units, such as adders and multipliers, can be mounted in a small area. The circuit scale of the entire image processing apparatus can therefore be reduced while its processing performance is maintained.

Next, the conditions under which code in a shader program can be rewritten into code executable by the subunit 20 will be described with reference to FIGS. 7 and 8. In the shader program shown in FIG. 7 (program 7), the fifth to seventh lines match the fifth to seventh lines of program 6 shown in FIG. 6. It is therefore conceivable to rewrite the fifth to seventh lines of program 7 into code executable by the subunit 20 shown in FIG. 5. In program 7, however, the instruction of the eighth line requires the variable "R3" obtained by the operation of the fifth line. If the fifth to seventh lines of program 7 were rewritten into subunit code and executed by the subunit 20, the variable "R3" would be consumed inside the subunit 20 without ever being written to the context. The variable "R3" could then not be used by the eighth line of program 7, and the result of the eighth line could not be obtained. In such a case, even though program 7 contains lines that match the dependency between the arithmetic units of the subunit 20, those lines cannot be rewritten into subunit code. Thus, the compile processing unit 30 can extract lines for rewriting as subunit code only when every instruction that uses the variables of those lines can also be extracted from the shader program for rewriting as subunit code.

In other words, the code group that the compile processing unit 30 selects from the shader program for execution by the subunit 20 must satisfy the following two conditions.
(Condition 1) Each variable is assigned only once in the entire shader program.
(Condition 2) Every instruction that uses a variable of the selected code group is itself included in the code group for the subunit.
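The two conditions can be expressed as a small check. This is a sketch under one stated interpretation: Condition 2 is applied to the intermediate variables of the selected lines, while the final variable of the group may be used elsewhere because it is returned to the main unit. The program fragment, the register names other than R3, and the function name `check_conditions` are hypothetical.

```python
# Hypothetical fragment in the spirit of program 7: (variable, op, sources).
PROGRAM_7 = [
    ("R3", "ADD", ("R2", "R1")),
    ("R13", "MUL", ("R3", "R12")),
    ("R15", "ADD", ("R13", "R12")),
    ("R16", "MUL", ("R15", "R3")),   # uses R3 outside the selected group
]


def check_conditions(program, group):
    """Return (condition_1, condition_2) for the selected line indices."""
    # Condition 1: each variable is assigned only once in the whole program.
    dests = [dest for dest, _, _ in program]
    condition_1 = len(dests) == len(set(dests))

    # Condition 2: every line that reads an intermediate variable of the
    # group must itself belong to the group.
    ordered = sorted(group)
    intermediates = {program[i][0] for i in ordered[:-1]}
    condition_2 = all(
        k in group
        for k, (_, _, srcs) in enumerate(program)
        if intermediates & set(srcs)
    )
    return condition_1, condition_2
```

For the fragment above, selecting the first three lines satisfies Condition 1 but fails Condition 2, because the fourth line still needs R3 after the subunit has consumed it internally.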

Conditions 1 and 2 need only be satisfied by the shader program after it has been optimized by the compile processing unit 30. Specifically, in the shader program shown in FIG. 8 (program 9), the same variable "R3" is assigned in both the first and second lines, so program 9 before optimization does not satisfy Condition 1. However, the first line is dead code, so the compile processing unit 30 should delete it during optimization and thereby omit that assignment to "R3". Once the first line has been deleted, Condition 1 is satisfied. Also, in program 9 the instruction of the line that obtains the last variable "R18" uses the variable "R3", so program 9 before optimization does not satisfy Condition 2. However, the last line is also dead code, so the compile processing unit 30 should delete it during optimization. Once the last line has been deleted, Condition 2 is satisfied. Thus, program 9 cannot be assigned to the subunit 20 in its unoptimized state, but after optimization removes the dead code it satisfies Conditions 1 and 2 and can be assigned to the subunit 20.
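The effect of dead code removal on the two conditions can be sketched with a backward liveness pass over straight-line code. The description does not say how the compile processing unit 30 detects dead code, so this technique, the hypothetical `PROGRAM_9`, the choice of "R15" as the only live output, and the function name `eliminate_dead_code` are all assumptions.

```python
# Hypothetical program 9: (variable, op, sources).
PROGRAM_9 = [
    ("R3", "ADD", ("R0", "R0")),     # dead: R3 is overwritten before use
    ("R3", "ADD", ("R2", "R1")),
    ("R13", "MUL", ("R3", "R12")),
    ("R15", "ADD", ("R13", "R12")),
    ("R18", "MUL", ("R3", "R12")),   # dead: R18 is never used
]


def eliminate_dead_code(program, outputs):
    """Keep only lines whose result can reach one of the output variables."""
    live = set(outputs)
    kept = []
    for dest, op, srcs in reversed(program):
        if dest in live:
            live.discard(dest)   # this assignment satisfies the demand
            live.update(srcs)    # its operands are now needed
            kept.append((dest, op, srcs))
        # Otherwise the value is never needed, so the line is dropped.
    kept.reverse()
    return kept
```

After the pass, "R3" is assigned once and is read only by the MUL line, so the remaining three lines satisfy both conditions and match an ADD → MUL → ADD pattern.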

FIG. 9 shows a preferred example of the hardware configuration of the subunit 20. As shown in FIG. 9, the subunit 20 has a plurality of arithmetic units (STAGE 0 to STAGE N) connected in series so that the result of each preceding arithmetic unit is input directly to the following arithmetic unit. The subunit 20 further includes a plurality of input registers (IN 0 to IN 3) that temporarily hold the operand values (in particular, variables) supplied from the main unit 10, and a plurality of constant registers (CONST 0 to CONST 7) that temporarily hold constants supplied from the driver program or the like. The values held in the input registers and constant registers are supplied to the arithmetic units as needed and used in their operations. As shown in FIG. 9, however, the subunit 20 has no memory or register for storing the results of the individual arithmetic units. The value computed through the arithmetic units is output directly to the main unit 10 without being stored in a memory or the like. Because the subunit 20 has this simple configuration without a memory, it consumes little power and can perform arithmetic processing at high speed.
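A behavioral model of this register arrangement might look as follows. It is a sketch only: the selector encoding (`("IN", i)`, `("CONST", i)`, `"PREV"`), the function name `run_subunit`, and the reading of MADD as a*b+c are assumptions, while the register counts (IN 0 to IN 3, CONST 0 to CONST 7) and the RSQ formula follow the description.

```python
import math

OPS = {
    "ADD": lambda a, b: a + b,
    "MUL": lambda a, b: a * b,
    "MADD": lambda a, b, c: a * b + c,     # assumed multiply-add semantics
    "RSQ": lambda x: 1.0 / math.sqrt(x),   # RSQ(x) = 1/sqrt(x)
}


def run_subunit(stages, in_regs, const_regs):
    """Evaluate the configured stages and return only the final result.

    stages     : list of (op, selectors); a selector is "PREV",
                 ("IN", index), or ("CONST", index)
    in_regs    : values from the main unit (at most 4)
    const_regs : constants from the driver program (at most 8)
    """
    if len(in_regs) > 4 or len(const_regs) > 8:
        raise ValueError("only IN 0 to IN 3 and CONST 0 to CONST 7 exist")
    banks = {"IN": in_regs, "CONST": const_regs}
    prev = None
    for op, selectors in stages:
        operands = []
        for sel in selectors:
            if sel == "PREV":
                if prev is None:
                    raise ValueError("the first stage has no previous result")
                operands.append(prev)
            else:
                bank, index = sel
                operands.append(banks[bank][index])
        prev = OPS[op](*operands)   # held only until the next stage uses it
    return prev


# ADD -> MUL -> ADD, with the last operand taken from a constant register.
ADD_MUL_ADD = [
    ("ADD", [("IN", 0), ("IN", 1)]),
    ("MUL", ["PREV", ("IN", 2)]),
    ("ADD", ["PREV", ("CONST", 0)]),
]
```

The only state in the model is the input and constant registers; intermediate results exist just long enough to reach the next stage.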

In the image processing apparatus of the present invention, the main unit 10 preferably has a plurality of contexts that store the values of the variables used in arithmetic processing, and can switch the contents of the arithmetic processing by switching the context in use. The context switch function of the main unit 10 is used, for example, when an instruction takes a long time before the subunit 20 outputs its result: during that waiting time, the main unit 10 performs other arithmetic processing using another context. Even for one and the same shader program, the main unit 10 executes a plurality of processes with different variable contents at the same time. For this reason, the main unit 10 is provided with a plurality of contexts (tables of variable sets) describing the contents of the variables, and when processing based on one context has to wait, it can switch to another context and execute other processing, which improves efficiency. By using a main unit 10 with this context switching function, the image processing apparatus of the present invention allows the main unit 10 and a plurality of subunits 20 to cooperate effectively. In other words, because the main unit 10 has a function for switching contexts, the main unit 10 itself can perform other arithmetic processing while a subunit 20 executes a specific process. This hides the time spent waiting for the result from the subunit 20 and further improves the efficiency of image processing.
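How context switching hides the subunit's latency can be sketched with a small cycle-level simulation in Python. The model is an assumption made for illustration: contexts are represented as generators, the 3-cycle latency is arbitrary, and the round-robin policy is not taken from the specification.

```python
def shader_instance(name, log):
    """One context: the same program run on its own variable values (hypothetical steps)."""
    log.append(f"{name}: main-unit work")
    yield 3  # hand work to the subunit; assume the result is ready 3 cycles later
    log.append(f"{name}: consume subunit result")


def run_main_unit(contexts):
    """Each cycle, run one ready context; contexts waiting on the subunit are skipped."""
    waiting = {}            # context -> cycle at which its subunit result is ready
    ready = list(contexts)
    cycle = 0
    while ready or waiting:
        # Wake up contexts whose subunit result has arrived.
        for ctx, due in list(waiting.items()):
            if due <= cycle:
                del waiting[ctx]
                ready.append(ctx)
        if ready:
            ctx = ready.pop(0)
            try:
                latency = next(ctx)
                waiting[ctx] = cycle + latency
            except StopIteration:
                pass        # this context has finished
        cycle += 1
    return cycle


log = []
total = run_main_unit([shader_instance(n, log) for n in ("A", "B", "C")])
single = run_main_unit([shader_instance("A", [])])
```

Under these assumptions one context alone takes 4 cycles, two of them idle, while three contexts finish in 6 cycles rather than 12, because the main unit works on contexts B and C during the cycles in which context A is waiting.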

Next, the location where the compile processing unit 30 is provided will be described with reference to FIG. 10 and FIG. 11. As shown in FIGS. 10 and 11, a computer (image processing apparatus) 1000 having an image processing function generally includes a GPU 100 (graphics processing unit), an external memory 200 (storage device), and a CPU 300 (central processing unit), which are connected to one another through a bus or the like. The CPU 300 executes an operating system program and, as appropriate, reads and executes application programs and driver programs stored in the external memory 200. When graphics need to be drawn, the CPU 300 executes the driver program for the GPU 100 and, in accordance with that driver program, makes the necessary settings (such as setting constants) in the GPU 100. Thereafter, the GPU 100 draws the desired graphics in the external memory 200 in accordance with a shader program stored in the external memory 200. The CPU 300 sends the image stored in the external memory 200 to a display device as necessary.

In a graphics processing system such as OpenGL, the shader program executed by the GPU 100 is provided to the computer (image processing apparatus) 1000 in one of two ways: as binary code that has already been compiled, or as source code that has not yet been compiled.

FIG. 10 shows an example in which the shader program is provided to the image processing apparatus 1000 from another computer 2000 as binary code that has already been compiled. The compiled shader program can be provided to the image processing apparatus 1000 via a recording medium such as a CD-ROM, or through an information communication line such as the Internet. This mode is common in portable game machines and the like. When the processing capabilities of the GPU 100 and the CPU 300 are limited, as in a portable game machine, the shader program is provided to the image processing apparatus 1000 (the portable game machine) as binary code. In this case the image processing apparatus 1000 itself does not need to compile the shader program, so the GPU 100 can load the shader program stored in the external memory 200 and execute it as is. In the mode shown in FIG. 10, it is the other computer 2000 that compiles the shader program. In this mode, therefore, the compile processing unit 30 is provided in the other computer 2000.

FIG. 11, on the other hand, shows an example in which the shader program is provided to the image processing apparatus 1000 from another computer 2000 as source code that has not yet been compiled. In a processing device such as a desktop PC, a graphics shader program is generally provided as source code, and the compiler program for the GPU 100 is bundled with the driver program. This is because, in such a device, the GPU 100 and the CPU 300 generally come from different manufacturers, so the interface between them needs to be unified. In this mode, the shader program used by the GPU 100 is converted into a binary program immediately before drawing is executed, by a proprietary compiler program that the GPU manufacturer provides and builds into the driver. That is, the CPU 300 executes the shader compiler program bundled with the driver program, reads the source-code shader program stored in the external memory 200, and compiles it into binary code. The CPU 300 then provides the binary shader program to the GPU 100, and the GPU 100 executes graphics processing according to this compiled shader program. In the mode shown in FIG. 11, therefore, it is the CPU 300 in the image processing apparatus 1000 that compiles the shader program, and the compile processing unit 30 is accordingly provided in the CPU 300 in the image processing apparatus 1000.

As described above, compiling for a GPU is a very heavy process. On a desktop PC or the like, the CPU is capable of compiling the shader program immediately before drawing. In a portable game machine or the like, however, the shader program is often compiled and converted into a binary program in advance, because the CPU has limited capability and the GPU model is fixed beforehand. In the present invention, the compilation of the shader program for the GPU may be executed by a computer other than the image processing apparatus, with the compiled shader program stored in the image processing apparatus, as shown in FIG. 10, or it may be executed immediately before drawing by the CPU of the image processing apparatus, as shown in FIG. 11. That is, the present invention covers both the mode of FIG. 10 and the mode of FIG. 11.
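The difference between the two delivery modes comes down to where compilation happens, which can be expressed as a minimal Python sketch. All names here (`ShaderPackage`, `compile_for_gpu`, `prepare_shader`) are hypothetical, and the "compiler" is a stand-in that merely tags the source bytes.

```python
from dataclasses import dataclass


@dataclass
class ShaderPackage:
    payload: bytes
    is_binary: bool  # True: precompiled elsewhere (FIG. 10); False: source code (FIG. 11)


def compile_for_gpu(source: bytes) -> bytes:
    """Stand-in for the driver-bundled compiler; a real one would emit GPU code."""
    return b"BIN:" + source


def prepare_shader(package: ShaderPackage) -> bytes:
    """Return the binary the GPU loads, compiling on the CPU only when needed."""
    if package.is_binary:
        return package.payload               # FIG. 10: load as is
    return compile_for_gpu(package.payload)  # FIG. 11: compile just before drawing


precompiled = prepare_shader(ShaderPackage(b"BIN:shader", is_binary=True))
compiled_here = prepare_shader(ShaderPackage(b"shader", is_binary=False))
```

In either branch the GPU ends up with the same kind of binary; in this sketch, the code path that calls `compile_for_gpu` is where the compile processing unit 30 would reside.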

The embodiments of the present invention have been described above with reference to the drawings in order to convey the content of the invention. However, the present invention is not limited to the above embodiments and encompasses modifications and improvements that would be obvious to those skilled in the art on the basis of the matters described in this specification.

The present invention relates to an image processing apparatus and an image processing method for computer graphics. Accordingly, the present invention can be suitably used in the computer-related industry.

DESCRIPTION OF SYMBOLS
10 ... Main unit
11 ... Management unit
12 ... Vertex processing unit
13 ... Rasterization processing unit
14 ... Fragment processing unit
15 ... Texture processing unit
16 ... Color update processing unit
17 ... Internal memory
20 ... Subunit
21 ... First arithmetic unit
22 ... Second arithmetic unit
23 ... Third arithmetic unit
30 ... Compile processing unit
100 ... GPU
200 ... External memory
300 ... CPU
1000 ... Image processing apparatus
2000 ... Another computer

Claims

An image processing apparatus for drawing an image according to a shader program, comprising:
a main unit (10) in which the arithmetic processing to be executed is programmable;
one or a plurality of subunits (20) each having a plurality of arithmetic units that perform predetermined arithmetic processing, the arithmetic units being connected in series so that the result of a preceding-stage arithmetic unit is input to the subsequent-stage arithmetic unit; and
a compile processing unit (30) that extracts, from the shader program, code executable by the subunit (20) in accordance with the connection pattern of the arithmetic units constituting the subunit (20), and rewrites the extracted code into subunit code executable by the subunit (20),
wherein the main unit (10), in accordance with the shader program rewritten by the compile processing unit (30), causes the subunit (20) to execute the subunit code and executes the other code itself.

The image processing apparatus according to claim 1, wherein the shader program includes a plurality of codes each defining a variable obtained from an operation according to an instruction, and among the plurality of codes there exists a dependency relationship in which the variable of one code is included in the instruction of another code, and
the compile processing unit (30) extracts from the shader program the codes whose dependency relationship matches the dependency relationship between the arithmetic units constituting the subunit (20), and rewrites them into the subunit code executable by the subunit (20).

The image processing apparatus according to claim 2, wherein the code that the compile processing unit (30) extracts for rewriting into the subunit code is limited to cases in which every instruction that uses the variable of that code can also be extracted from the shader program for rewriting into the subunit code.

The image processing apparatus according to claim 1, wherein a plurality of the subunits (20) are provided, and a connection pattern of computing units constituting each subunit (20) is different.

The image processing apparatus, wherein the main unit (10) has a plurality of contexts that store values of variables used in arithmetic processing, and the contents of the arithmetic processing can be switched by switching the context to be used.

The image processing device according to claim 1, wherein the number of stages of the computing units constituting the subunit (20) is four.

The image processing apparatus according to claim 1, wherein the computing unit constituting the subunit (20) includes an adder and a multiplier.

The image processing apparatus according to claim 1, wherein the main unit (10) includes a processing unit that does not perform arithmetic processing simultaneously with the subunit (20), and the arithmetic units constituting that processing unit and the arithmetic units constituting the subunit (20) are at least partly shared.

An image processing apparatus for drawing an image according to a shader program, comprising:
a main unit (10) in which the arithmetic processing to be executed is programmable; and
one or a plurality of subunits (20) each having a plurality of arithmetic units that perform predetermined arithmetic processing, the arithmetic units being connected in series so that the result of a preceding-stage arithmetic unit is input to the subsequent-stage arithmetic unit,
wherein, in the shader program, code executable by the subunit (20) has been rewritten, in accordance with the connection pattern of the arithmetic units constituting the subunit (20), into subunit code executable by the subunit (20), and
the main unit (10), in accordance with the rewritten shader program, causes the subunit (20) to execute the subunit code and executes the other code itself.

An image processing method executed by an image processing apparatus for drawing an image according to a shader program,
the image processing apparatus comprising:
a main unit (10) in which the arithmetic processing to be executed is programmable; and
one or a plurality of subunits (20) each having a plurality of arithmetic units that perform predetermined arithmetic processing, the arithmetic units being connected in series so that the result of a preceding-stage arithmetic unit is input to the subsequent-stage arithmetic unit,
the image processing method comprising:
a compile processing step of extracting, from the shader program, code executable by the subunit (20) in accordance with the connection pattern of the arithmetic units constituting the subunit (20), and rewriting the extracted code into subunit code executable by the subunit (20); and
a step in which the main unit (10), in accordance with the shader program rewritten in the compile processing step, causes the subunit (20) to execute the subunit code and executes the other code itself.