JP3720178B2 - Digital processing unit

Info

Publication number: JP3720178B2
Application number: JP33010997A
Authority: JP
Inventors: 健次郎山本; 雅嗣亀谷; 二宮　　拓; 博之品田; 理山田; 康継宇佐見
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-12-01
Filing date: 1997-12-01
Publication date: 2005-11-24
Anticipated expiration: 2017-12-01
Also published as: JPH11161633A

Description

【０００１】
【発明の属する技術分野】
本発明は、デジタル演算処理装置に係わり、特に、高速な演算速度が要求されるデジタル演算装置に関する。
【０００２】
【従来の技術】
従来、高速演算制御装置には以下の例がある。
（１）特開平５−２５８７０３号公報に示されている公知例では、高速制御出力演算をデジタル演算器で行わず、アナログ回路を用いて制御出力を行って高速制御に対応している。
【０００３】
（２）特開平５−２２６２３４号公報に示されている電子線描画装置では、演算器としてのＤＳＰを組み合わせて演算処理部を構成し、ＤＳＰ間は２ポートメモリを用いてデータの転送を行っている。また、各演算ＤＳＰの制御は、マスタとなるＤＳＰが管理する構成となっている。さらに、デジタルデータをアナログ出力に変換するＤ／Ａ変換器には、プロセッサがデータを出力するため、２面のレジスタを交互に切り替える構成となっている。
【０００４】
【発明が解決しようとする課題】
例えば、電子線描画装置等の制御装置においては、その処理速度の向上化が望まれている。処理速度を向上するためには、上記特開平５−２５８７０３号公報のように、高速制御出力演算をデジタル演算器で行わず、アナログ回路を用いることが考えられる。ところが、アナログ回路を用いた場合には、処理速度は向上するが、高精度な制御処理は望めない。
【０００５】
このため、電子線描画装置等の高速演算制御装置において、精度の向上や処理時間短縮のために、高速演算が可能なデジタル演算処理装置が望まれているが、その実現化には、以下のような問題点があった。
【０００６】
（１）高精度かつ高速演算を行うためには、実数演算処理を並列パイプライン的に実行する必要があるが、非常に多くのトランジスタを必要とする。さらに、リアルタイムで動作する制御部と連動するため、レイテンシタイムを小さく保つ必要があり、配線短縮の点からもコンパクトに構成する必要性が生ずる。
このため、非常に高集積なＬＳＩ又は電子基板を実現しなければならず、高度な論理設計技術やトランジスタ数の削減技術が要求される。
【０００７】
上記要求を満足させようとすると、多くのトランジスタが小さなエリアで大量にスイッチング動作を行うようにしなければならず、それに伴って大量の発熱が生ずる恐れがある。このため、発熱を押さえる回路設計上の工夫が必要となる。しかしながら、これでは、回路構成の大規模化及び複雑化を伴い、高価格となってしまう。
【０００８】
（２）マスタプロセッサと高速デジタル演算処理装置との間で高速クロックに同期してスムーズな情報のやり取りが必要となる。
従来は、２ポートメモリを用いた手段が用いられることが多いが、メモリ等のデバイスのアクセススピードの制限やアクセス競合の問題があり、高速化が困難であった。
【０００９】
（３）制御対象に指令値を与えるため、高精度な高速制御出力演算データをアナログ信号に変換する必要がある。
【００１０】
しかしながら、デジタルアナログ変換に際して、高速高精度のデジタルアナログ変換器には制限があり、誤差成分も補正した形で演算処理結果を高精度なデジタルデータとして、例えば、１２ｂｉｔ精度以上で１０ｎｓ以下のクロック周期で出力するのは困難であり、それ以上の演算速度の高速化ができなかった。
【００１１】
本発明の目的は、回路構成の大規模化及び複雑化を伴うこと無く、制御出力周期を１００ＭＨｚ以上で行うことを可能とする高速デジタル演算処理制御装置を実現することである。
【００１２】
【課題を解決するための手段】
上記目的を達成するため、本発明は次のように構成される。
（１）高速なクロック信号周期に同期してデジタル演算処理を行うデジタル演算処理装置において、
入力データを内部形式に変換する初段、変換した入力データの乗算を実施する段、及び乗算結果を内部形式から出力形式に変換して出力するパイプレジスタからなる最終段を有する実数乗算器と、
入力データを内部形式に変換するパイプレジスタからなる初段、変換した入力データの加算を実施する段、及び加算結果を内部形式から出力形式に変換して出力する最終段を有する実数加算器と、をパイプライン化し、演算処理の基本単位となる演算を、上記パイプライン化した上記実数乗算器と実数加算器とを１つに結合して、乗算と加算とを順次実施して行なうＭＡＣ演算器を備え、
上記実数乗算器の最終段であるパイプラインレジスタと実数加算器の初段であるパイプラインレジスタとの演算器における段数を、同レベルにそろえ、実数乗算器の最終段の処理と、実数加算器の初段の処理の１部とを並列に動作させる。
【００１３】
本構成において、実数乗算器の出力段と、実数加算器の入力段とのレベルが同レベルであるので、これらの信号を並列に処理可能であり、実数乗算器の最終段の次段で、ＩＥＥＥ形式に変換するステージを設ける必要が無く、内部形式のまま、次段の実数加算器にデータを引き渡すことができる。
【００１４】
従って、実数加算器の初段で、乗算器からの結果に対してＩＥＥＥ形式からの変換ステージを実行する必要もなくなる。これにより、パイプライン段数とトランジスタ数の削減が可能となり、回路構成の大規模化及び複雑化と発熱を伴うことなく高集積化が実現でき、高速で高精度なデジタル演算処理が可能となる。
【００１５】
（２）第１の高速クロック信号に同期し、デジタル処理を行うデジタル演算処理装置において、
上記第１の高速クロック信号と非同期の第２のクロック信号に同期して動作するプロセッサと、上記プロセッサからのデータを第１のゲート信号に応答してラッチする機能を有する１段目のラッチレジスタと、上記１段目のラッチレジスタからのデータを第２のゲート信号に応答してラッチする機能を有する２段目のラッチレジスタと、を備え、上記第１のゲート信号は、上記プロセッサからのライトアクセス信号を基に生成され、第２のゲート信号は、上記第１の高速クロック信号に同期化した信号を基に生成され、上記１段目のラッチレジスタに書き込まれた上記プロセッサからの複数のデータが上記第２のゲート信号により上記２段目のラッチレジスタに一斉に引き渡される。
【００１６】
第１の周期で高速に変化する情報を処理する演算処理装置に、第２の周期で低速に変化する情報を処理するプロセッサを設けることにより、低速で制御する部分と、高速で制御する部分とを分割して、制御の適切化が可能になるが、本構成により、上記２種類の周期で動作する部分のスムーズな情報のやり取りが実現できる。これにより、情報のやり取りの同期が問題とならずに、高速なデジタル演算処理可能な演算手段を用いることができ、高速で高精度なデータ処理が可能なデジタル演算処理装置を実現することができる。
【００１７】
（３）高速なクロック信号周期に同期してデジタルデータをアナログデータとして出力するデジタル演算処理装置において、デジタルデータを連続的なビット列で構成された少なくとも２つの出力データに分割する手段と、その上位側の出力データに対応した補正データを記憶するメモリ手段と、上記少なくとも２つの出力データと補正データとに対応したアナログデータを出力する少なくとも３つのデジタルアナログ変換器と、上記少なくとも２つの出力データを、対応するデジタルアナログ変換器に与えるデータフォーマットに変換する手段と、上記少なくとも２つの出力データと補正データとの出力タイミングを合わせる手段と、を備え、上記少なくとも３つのデジタルアナログ変換器のアナログデータを加算する。
【００１８】
本構成により、高速で演算処理されたデジタル信号を、適切に高精度かつ高速にアナログ信号に変換可能なデジタル演算処理を実現することができる。
【００１９】
（４）ビーム光源からのビームを、ビーム走査制御部により走査して、被検出物に照射し、画像処理部により被検出物の画像情報を得るビーム走査型の画像情報取り込み装置において、上記ビーム走査制御部は、入力データを内部形式に変換する初段、変換した入力データの乗算を実施する段、及び乗算結果を内部形式から出力形式に変換して出力する最終段を有する実数乗算器と、入力データを内部形式に変換する初段、変換した入力データの加算を実施する段、及び加算結果を内部形式から出力形式に変換して出力する最終段を有する実数加算器と、をパイプライン化し、パイプライン化した実数乗算器及び実数加算器を１つに融合し、上記実数乗算器の最終段であるパイプラインレジスタと上記実数加算器の初段であるパイプラインレジスタとの演算器における段数を、同レベルにそろえ、上記実数乗算器の最終段の処理と、上記実数加算器の初段の処理の１部とを並列に動作させて、乗算と加算とを順次実施するＭＡＣ演算器を有し、ビーム走査位置の誤差から生じる画像の歪の補正演算を行ない、ビーム制御デジタルデータを出力するデジタル演算処理手段と、上記デジタル演算手段の第１の高速クロック信号と非同期の第２のクロック信号に同期して上記補正演算の係数データを設定するプロセッサと、上記プロセッサからのデータを、上記プロセッサからのライトアクセス信号を基に生成される第１のゲート信号に応答してラッチする機能を有する１段目のラッチレジスタと、上記１段目のラッチレジスタに書き込まれた上記プロセッサからの複数のデータを、上記プロセッサから出力される第２のゲート信号により一斉にラッチし、上記デジタル演算処理手段に与える機能を有する２段目のラッチレジスタと、上記ビーム制御デジタルデータをアナログビーム走査制御信号に変換するために、上記デジタル演算処理手段からのデジタルデータを連続的なビット列で構成された少なくとも２つの出力データに分割する手段と、その上位側の出力データに対応した補正データを記憶するメモリ手段と、上記少なくとも２つの出力データと補正データとに対応したアナログデータを出力する少なくとも３つのデジタルアナログ変換器と、上記少なくとも２つの出力データを、対応するデジタルアナログ変換器に与えるデータフォーマットに変換する手段と、上記少なくとも２つの出力データと補正データとの出力タイミングを合わせる手段と、上記少なくとも３つのデジタルアナログ変換器のアナログデータを加算する手段と、を備え、デジタル演算により正確なビーム走査位置を制御する。
【００２０】
本構成により、高速で高精度な処理が可能なビーム走査型の画像情報取り込み装置を実現することができる。
【００２１】
【発明の実施の形態】
本発明の一実施形態は、第１の周期で高速に変化する情報と、第２の周期で低速に変化する情報との組み合わせ演算を行い、第１の周期毎に高速で結果を出力するデジタル演算処理装置を実現する。
下記に示すように、上記第１の周波数とは１００ＭＨｚ〜２００ＭＨｚ、第２の周波数とは１００ＫＨｚ〜２００ＫＨｚオーダである。
【００２２】
本発明の実施形態であるデジタル演算処理装置が適用される装置の例としては、図１に示すビーム走査型の画像情報取り込み装置が挙げられる。この画像情報取り込み装置は、ビーム光源１と、ビーム走査部２と、レンズ部３と、被検出物（観察物）である試料４と、ステージ部５と、検出部６と、画像処理部７と、ビーム走査制御部８とから構成される。
【００２３】
図１に示すビーム走査型の画像情報取り込み装置においては、ビーム光源１から生成されるビームをビーム走査部２で適切な角度に振り、レンズ部３で、そのビームをフォーカスして試料４上を適切な拡大率をもって、ある方向、例えばＸ方向に走査させ、検出部６及び画像処理部７に得られたＸ方向の線画像を、ステージ部５にて、例えばＹ方向にずらしながら連続的にＹ方向に連結してＸ−Ｙの面画像を得るものである。
【００２４】
現状、上記Ｘ方向の１ピクセルに相当する画像を得る時間を第１の周波数ｆ１とし、ｆ１＝１００ＭＨｚ〜２００ＭＨｚ程度とする。そして、Ｘ方向に１ライン走査する時間を第２の周波数ｆ２とし、ｆ２＝１００ＫＨｚ〜２００ＫＨｚ程度を設定している。
【００２５】
上記第１の周波数ｆ１及び第２の周波数ｆ２の値は、例えば、被検出試料４が半導体のウェハ上のＬＳＩチップであり、そのＬＳＩチップ上のパターン画像を得て、それが正しいか否かを検査する装置として、上記画像情報取り込み装置を用いる場合の画像処理分解能及びタクトタイム等から計算した、現実的に必要とされるスペックの１つである。
【００２６】
図１の装置において、本発明が提案するデジタル演算処理装置を必要とする重要部分は、ビーム走査制御部８である。ビーム走査制御部８では、主に、具体的に次に示す２つの誤差対象に対して補正制御演算を行う必要がある。
【００２７】
（１）光学的歪みに代表される半固定的誤差であり、連続的な誤差関数により事前定義可能なものである誤差対象。これらは、変換関数を用いた座標変換等の数値計算や事前の形状計測情報、又は両者の組み合わせ等によって補正処理を行い、制御出力に反映する。
ビーム経路中、検出物４の位置に依存し、上記計測情報を用いた補正もこれに含まれる。
【００２８】
（２）ステージ部５の移動に伴う位置ずれ、速度むら等の機械的変動、温度変動等の環境変化に呼応した変動や経時変化による誤差に対する補正であり、ステージ部５からのセンシング情報及び随時行う計測情報を用いて補正処理を行い、制御出力に反映する。
【００２９】
ここで、制御出力とは、ビームを正しく制御するためのビーム走査部２に対する指令に相当する。
図２にビーム走査制御部８の基本システム構成を示す。
図２において、関数演算部１２が上記（１）の動作に相当する処理部であり、制御情報演算部１１が上記（２）の動作に相当する処理部である。
【００３０】
レジスタ部ａ１４は、制御情報演算部１１に上記第２の周期で変化する情報ｇ（ＰＣＤ０…）を保持しており、レジスタ部ｂ１５は、関数演算部１２に上記第１の周期に同期化され、第２の周期で変化する情報ｈ（ＦＣＤ０…）を保持している。これらのレジスタ部１４、１５は、いずれもマスタプロセッサ部１６によってその情報が変更され、各演算部の処理に変数又は定数として用いられる。
【００３１】
外部情報入力部１０は、ステージ制御部１７の測長部から得る位置情報等のステージ部５の状態を監視するための情報ａを得て、計算処理可能なように情報ｂを生成する前処理を行う。ここでの入力情報は、ステージ部５を駆動する制御情報のフィードバック情報でも良いし、ステージ部５に装備されるセンサからのフィードフォワード情報を使用しても良い。
制御情報演算部１１は、外部情報入力部１０からの前処理済情報を得て、関数演算部１２に与えるための情報ｃを第１の周期にて生成する。
【００３２】
関数演算部１２は、例えば試料２０に対してビームを走査する際の位置（Ｘ，Ｙ）の関数として、例として、次式（１）に示すような３次式で表される座標変換式で表現される、光学系の歪みを補正して平面等方化するための投影処理を行い、ビーム制御のための基本情報ｄを生成する。
図には示していないが、ａ、ｂの係数は、事前計測情報により得られるもので、ビームの目標位置上の試料２０の高さ等の情報から、マスタプロセッサ１６からレジスタ部ｂ１５を通して与えられる。
【００３３】
【数１】

【００３４】
Ｘがライン方向だとすると、画像処理部１９での１ラインあたりの画素数ｎ個分を走査する時間がＹ方向の変化する最小の周期となる。Ｘの変化周期は１画素分の走査時間に相当し、従って、Ｙの変化周期は、およそ（Ｘの変化周期）×ｎ＋αとできる。
【００３５】
１ラインあたりの走査時間を、およそ１０μsで、Ｘ方向１ライン当たりの画素数を、およそ１０００と仮定すると、Ｘの変化周期は、およそ１０μs／１０００＝１０ｎｓ（ｆ＝１００ＭＨｚ）となり、Ｙの変化周期は１０μs＋αとなる。情報ｄは、Ｘの変化周期に応答するため、１０ｎsの周期の高速な変化情報となる。これを上記第１の周波数における周期と定義し、１０μs＋αを上記第２の周波数における周期と定義する。
【００３６】
制御情報出力部１３は、ＤＡＣ部２２にて、ビーム走査部１８に与える指令情報ｆ（アナログ制御情報）を生成するための元となるデジタル制御情報ｅを生成する。これについては後に詳しく述べる。
【００３７】
マスタプロセッサ部１６は、制御情報演算部１１、関数演算部１２、画像処理部１９、ステージ制御部１７等からの情報を集約して、総合的な判断処理、管理処理、レジスタ部ａ１４、ｂ１５上のパラメータ変更処理等を、上記第２の周波数における周期を基本周期として行う。すなわち、マスタプロセッサ部１６は、ビーム走査制御部２１の総合制御／管理部と位置づける事ができる。
【００３８】
さて、この例において、デジタル処理を行う上で重要かつ実現困難なビーム走査制御部２１の第１の構成要素は、上記式（１）にて示した関数処理を、周波数ｆ＝１００ＭＨｚ以上のスループットにて動作しなければならない関数演算部１２である。
【００３９】
単純に、上記式（１）を実行するだけで３６個の加算、乗算が必要であり、正規化する事等も含めると４０演算以上のオペレーションが要求される。また、これらの演算は、高精度の観点から実数演算が要求されており、式（１）と同様の汎用的な記述に基づいて実行するとなると、浮動小数点型の実数演算処理を１秒間に４Ｇ回（４ＧＦＬＯＰＳ）処理する能力が要求される。その他、前処理演算及び補正演算を組み合わせて実行する必要が生ずる場合もあり、総合すると、１０Ｇ回／ｓ（１０ＧＦＬＯＰＳ）程度の処理能力が必要となるケースも予想される。
【００４０】
関数演算部１２を構成する上で、問題となる事項を以下にまとめておく。
（ａ）上述のような高速演算を行うためには、実数演算処理を並列パイプライン的に実行する必要があるが、非常に多くのトランジスタを必要とする。さらに、リアルタイムで動作する制御部と連動するため、レイテンシタイムを小さく保つ必要があり、配線短縮の点からもコンパクトに構成する必要性が生ずる。すなわち、非常に高集積なＬＳＩ又は電子基板を実現しなければならず高度な論理設計技術やトランジスタ数削減技術が要求される。
【００４１】
（ｂ）上記（ａ）の点を実現しようとすると、多くのトランジスタが小さなエリアで大量にスイッチング動作を行うため、それに伴って大量の発熱が生ずる恐れがある。したがって、発熱を押さえる回路設計上の工夫が必要となる。
【００４２】
（ｃ）レジスタ部からの情報等、異なる周期で変化する情報をスムーズに高速処理の中に取り込んだり、処理情報をリアルタイムでマスタプロセッサへ読みだしたりする必要がある。すなわち、回路動作上の高度な同期化処理技術が要求される。
【００４３】
上記の理由からビーム走査制御部２１は、ＤＡＣ部２２を除いては、ＬＳＩで構成するのが良い。図２に示した例では、ゲート量とピン数との制約から、外部情報入力部１０と、制御情報演算部１１と、レジスタ部ａ１４とを１チップとし、関数演算部１２と、制御情報出力部１３と、レジスタ部ｂ１５とを他の１チップとして実現し、かつ１種類のＬＳＩ上でセレクト信号により切り替えられるようになっている。
【００４４】
上記問題（ａ）の解決方法の一例として、上記式（１）の演算を、乗算器と加算器を積和型に一体化したＭＡＣ演算器（乗算加算積和型演算器）を基本演算器として構成し、それを組み合わせて最も効率良く並列に実行する方式を図３に示す。
【００４５】
ＭＡＣ演算器は、実数入力に対して所望の数値範囲で結果が得られるように、例えばＩＥＥＥ規格の実数フォーマットに準拠した汎用の実数演算器として構成する。
【００４６】
図３において、ＭＡＣ演算器３１には（Ｙｂ，ａ７，ａ５）が入力され、ＭＡＣ演算器３２には（Ｙｂ，ａ６，ａ３）が入力される。また、ＭＡＣ演算器３３には（Ｙｂ，ａ８，ａ４）が入力される。
【００４７】
また、ＭＡＣ演算器３４には（Ｘｂ，ａ９）が入力されるとともに、ＭＡＣ演算器３１からの出力が入力され、ＭＡＣ演算器３５には（Ｙｂ，ａ２）が入力されるとともに、ＭＡＣ演算器３２からの出力が入力される。また、ＭＡＣ演算器３６には（Ｙｂ，ａ１）が入力されるとともに、ＭＡＣ演算器３３からの出力が入力される。
【００４８】
また、ＭＡＣ演算器３７にはＸｂが入力されるとともに、ＭＡＣ演算器３４及びＭＡＣ演算器３５からの出力が入力され、ＭＡＣ演算器３８には（Ｙｂ，ａ０）が入力されるとともに、ＭＡＣ演算器３６からの出力が入力される。また、ＭＡＣ演算器３９にはＸｂが入力されるとともに、ＭＡＣ演算器３７及びＭＡＣ演算器３８からの出力が入力される。そして、ＭＡＣ演算器３９からＳｘ又はＳｙが出力される。
【００４９】
上記図３に示した式（１）の演算器３０は、２つの制御方向（Ｘ，Ｙ）のうちの１方向のみの演算について構成したものである。式（１）を実現するためには、図３に示した構成の演算器を２つ並列に動作させれば良い。
【００５０】
ＭＡＣ演算器３１〜３９を構成した場合の利点を以下に示す。
【００５１】
１）中間フォーマットを自由に設定できるため、乗算器を加算器に単純に接続する場合より省ゲート化が可能である（少なくとも１０００ゲート以上の省ゲート化が可能）。
【００５２】
２）丸め処理が少なくなり、精度を高く保つことができる。
３）後述するパイプライン化の際、上記１）、２）等の効果と相俟って、演算レイテンシタイムの短縮が図れるため、パイプライン段数を少なくできる。この事も省ゲート化、省電力化に大きく貢献する。
【００５３】
ところで、図３に示した例の構成を透過タイプのスカラ演算器で構成した場合、ＣＭＯＳプロセスのＬＳＩとして設計すると、ＭＡＣ演算１段当たりのレイテンシタイムは、５０ｎｓ程度必要である。すべての演算を処理するためのレイテンシタイムは、このような最適な並列処理構造を採用したとしても、２００ｎｓ程度かかることになり、１０ｎｓ（周波数ｆ＝１００ＭＨｚ）以下の計算周期を得ることは不可能である。
【００５４】
そこで、上記（ａ）で述べたように、パイプライン並列型の演算器構造を採用する必要がある。しかし、単純にパイプライン化しても、中間データを保持するためのパイプラインレジスタが増大し、上記（ｂ）に示したパイプラインレジスタでのスイッチングに伴う発熱が発生するとともに、トランジスタ数（ゲート数）が増加してしまう。
【００５５】
そこで、図４に示す５段のパイプライン構造を有するＭＡＣ演算器４０を提案する。詳細は図６に示し、後で述べる。
図４及び図６において、パラメータａ及びｂは、それぞれレジスタ８０及び８１、ステージ４２及び４３を介して、共にレジスタ８２、ステージ４４に供給される。そして、ステージ４４からの出力は、レジスタ８３、ステージ４５を介してステージ４７に供給される。
【００５６】
一方、パラメータｃは、レジスタ８４、ステージ４６に直接供給されるとともに、直列に接続された２つのレジスタ４１、８７を介して、レジスタ８４、ステージ４６に供給される。そして、ステージ４６からの出力は、ステージ４７に供給される。
ステージ４７からの出力は、レジスタ８５、ステージ４８に供給され、このステージ４８から、レジスタ８６、ステージ４９、５０を通して出力Ｓが出力される。
【００５７】
つまり、レジスタ８０及び８１と、ステージ４２、４３、もしくはレジスタ４１で１段、レジスタ８２、ステージ４４、もしくはレジスタ８７で２段、レジスタ８３、８４、ステージ４５、４６、４７で３段、レジスタ８５、ステージ４８で４段、レジスタ８６とステージ４９、５０で５段となる。
【００５８】
図４で示した点線で示した部分が単純にパイプライン化したときに、演算ステージを合わせるために必要となっていたパイプラインレジスタ４１、８７であり、これを削減すれば、トランジスタ換算でＭＡＣ演算器１つ当たり約１６００トランジスタ分の省ゲート化とスイッチングパワーの除去が可能である。
【００５９】
ところで、ＭＡＣ演算器に入力される数値パラメータｃの入力タイミングがパラメータａ、ｂと異なるため、演算ステージ段数が合わなくなってしまう可能性がある。しかし、図５に示す様に、周波数ｆ＝１００ＭＨｚ以上で変化する入力変換（Ｘｂ，Ｙｂ）の整合用パイプラインパスのみを調整すれば全体の処理を矛盾なく実行させることが可能である。
【００６０】
図５に示した例は、図３の構成に対し、図４のパイプライン化されたＭＡＣ演算器を適用して、全体的にパイプライン化を図ったものである。各モジュールの下及び上に示したＸＸ段→ＹＹ段は、その出力段までのトータルパイプライン段数を示し、ＸＸが図４の点線部分を含む場合、ＹＹが本方式の省ゲートタイプＭＡＣ演算器を用いた場合である。
【００６１】
トータルレイテンシタイムはもちろん整合用のパイプライン段数も減らせることがわかる。結局、トータルレイテンシタイムとして２０段から１８段に短縮され、パイプラインレジスタの本数も総合で２４段も省略できたことになる。単純に、乗算器と加算器を組み合わせると、ＭＡＣ処理当たり６段のパイプライン段数となり、結果的に本方式よりも５７段ものパイプラインレジスタが余分に必要となる。
【００６２】
図６に、図４に示したパイプライン構造のＭＡＣ演算器４０の演算分割配分を示す。
図６において、入力パラメータａ、ｂは、パイプラインレジスタ８０、８１、乗算ステージＭＰＹＳＴＧ１Ａ（４２）、ＭＰＹＳＴＧ１Ｂ（４３）、パイプラインレジスタ８２、乗算ステージＭＰＹＳＴＧ２（４４）、パイプラインレジスタ８３、乗算ステージＭＰＹＳＴＧ３Ａ（４５）を介して、加算ステージＡＤＤＳＴＧ１Ｂ（４７）に供給される。
一方、入力パラメータｃは、パイプラインレジスタ８４、加算ステージＡＤＤＳＴＧ１Ａ（４６）を介してＡＤＤＳＴＧ１Ｂ（４７）に供給される。
【００６３】
そして、ＡＤＤＳＴＧ１Ｂ（４７）からの出力は、パイプラインレジスタ８５、加算ステージＡＤＤＳＴＧ２（４８）、パイプラインレジスタ８６、加算ステージＡＤＤＳＴＧ３Ａ（４９）を介して、加算ステージＡＤＤＳＴＧ３Ｂ（５０）に供給される。この加算ステージＡＤＤＳＴＧ３Ｂ（５０）から出力Ｓ（Ｓ＝ａｘｂ＋ｃ）が出力される。
【００６４】
上記ＭＡＣ演算器４０は、約１０ｎｓの周期（周波数ｆ＝１００ＭＨｚ）で動作できる。すなわち、入力パラメータａ、ｂ、ｃは、１０ｎｓ周期でクロック信号に同期して投入可能であり、パイプライン的に処理（Ｓ＝ａ×ｂ＋ｃ）された結果、出力Ｓは、１０ｎｓ周期で出力される。
【００６５】
入力段のステージＭＰＹＳＴＧ１Ａ（４２）及びＡＤＤＳＴＧ１Ａ（４６）では、ＩＥＥＥ規格で入力されたデータ（ａ，ｂ，ｃ）を、演算処理を施し易い内部形式（２進形式）に変更する必要がある。
【００６６】
この処理に約１．５〜３ｎｓかかるが、乗算器と加算器とを融合した本発明によるＭＡＣ演算器では、乗算の最終ステージＭＰＹＳＴＧ３Ａ（４５）と、パラメータｃの加算の入力部の内部形式への変化ステージＡＤＤＳＴＧ１Ａ（４６）とを並列に処理可能である。
【００６７】
つまり、本発明によれば、実数乗算器と実数加算器とを、実数乗算器の出力段である最終段のパイプラインレジスタと実数加算器の入力段である初段のパイプラインレジスタとを同レベルにそろえ、実数乗算器の最終ステージの処理と、実数加算器の初段ステージの１部とを並列に動作させるという、融合手段が開示され、この融合手段により、乗算の最終ステージＭＰＹＳＴＧ３Ａ（４５）と、パラメータｃの加算の入力部の内部形式への変化ステージＡＤＤＳＴＧ１Ａ（４６）とを並列に処理可能である。
【００６８】
また、乗算ステージの最終段ＭＰＹＳＴＧ３Ａ（４５）の次段で、ＩＥＥＥ形式に変換するステージ（ＭＰＹＳＴＧ３Ｂに相当する）を設ける必要が無く、内部形式のまま加算器のステージＡＤＤＳＴＧ１Ｂ（４７）にデータを引き渡すことができる。
【００６９】
従って、加算ステージの初段で,乗算器からの結果に対してＩＥＥＥ形式からの変換ステージ（ＡＤＤＳＴＧ１Ａに相当する）を実行する必要もなくなる。次の演算器へＩＥＥＥ形式に変換（丸め処理も行う）して出力する出力段ステージ（ＡＤＤＳＴＧ３Ｂに相当する）についても、加算器の最終段にのみ設けるだけで良い。
【００７０】
以上から、関数演算部１２の基本単位となるＭＡＣ演算器の構成は、乗算ステージＭＰＹＳＴＧ１Ａ（４２）、ＭＰＹＳＴＧ１Ｂ（４３）が合計９ｎｓ、乗算ステージＭＰＹＳＴＧ２（４４）が９ｎｓ、乗算ステージＭＰＹＳＴＧ３Ａ（４５）が３ｎｓ、加算ステージＡＤＤＳＴＧ１Ａ（４６）が乗算ステージＭＰＹＳＴＧ３Ａ（４５）と並列に３ｎｓ、加算ステージＡＤＤＳＴＧ１Ｂ（４７）が６ｎｓ、加算ステージＡＤＤＳＴＧ２（４８）が９ｎｓ、加算ステージＡＤＤＳＴＧ３Ａ（４９）が３ｎｓ、加算ステージＡＤＤＳＴＧ３Ｂ（５０）が３ｎｓ、というレイテンシタイムの配分となっている。
【００７１】
なお、最終段ステージＡＤＤＳＴＧ３Ａ（４９）、ＡＤＤＳＴＧ３Ｂ（５０）は、合計６ｎｓとなっているが、次段の演算器に送るために約３ｎｓの余裕（伝送路の遅延マージン）を持たせているためである。なお、ＩＥＥＥの形式に圧縮してデータの入出力を行う必要があるのは、外部からの汎用データ入力形式と整合性をとる目的もあるが、以下のａ）及びｂ）の理由等からでもある。
ａ）加算器と乗算器とで有効な内部形式がそれぞれ異なる。
ｂ）内部形式のビット幅はＩＥＥＥ形式よりも広く、ゲート数、スイッチングパワー、演算器間の結線量のいずれも内部形式の方が不利である。
【００７２】
以上から、本発明によるＭＡＣ演算器は、５段のパイプラインで構成可能となっており、単純に汎用乗算器を組み合わせた場合より、パイプライン段数で１〜２段、トータルゲート数で１５〜２０％程度削減できている。
【００７３】
次に、上記（ｃ）に示した外部との入出力に関わる同期化の問題についての解決策について述べる。
ここで、外部とは、主としてマスタプロセッサとのやり取りを示す。
【００７４】
まず、関数処理部に与えるパラメータ（図３、図５の実数パラメータａ０〜ａ９に相当する）を保持するレジスタ部ｂ（１５）へのデータセット方法について、本発明では以下のレジスタ構成と手法を採る。
【００７５】
（イ）マスタプロセッサを動作させるクロック信号と、ビーム走査制御部の基準クロック信号（周波数ｆ＝１００ＭＨｚ以上）とは、非同期と考えるべきであり、マスタプロセッサ側から、ビーム走査制御部内のレジスタに対し、自在にアクセスするためには、マスタプロセッサからのアクセス判断信号と、前記基準クロックとの間で同期化を図る必要がある。
【００７６】
これは、図８に示すように、マスタプロセッサからのライトコマンド（／ＣＰＵＷＴ）を、クロック信号ＣＬＫ（周波数ｆ＝１００ＭＨｚ以上）を用いて、２段以上のフリップフロップ回路でシフトすることにより、ライト信号／ＷＴａを生成する非同期信号の同期化処理を施す。
【００７７】
さらに、ライト信号／ＷＴａを１段分以上シフトして、ライト信号／ＷＴｂを生成すれば、ライト信号／ＷＴａ＝Ｈｉかつライト信号／ＷＴｂ＝Ｌｏの期間を取り出し、クロック信号ＣＬＫに同期したライト信号ＷＴＥが生成可能である。例えば、図７に示すラッチレジスタＡ（５１）に、ライト信号ＷＴＥ（５３）に応答してマスタプロセッサからのデータをラッチすれば、ラッチされたデータＬＤＡＴＡ−Ａ（８０）はクロック信号ＣＬＫに同期して出力できる。
【００７８】
（ロ）事前に変更しておいたパラメータのみをあるタイミング（例えばサンプリング周期の初め）で、一斉に変更して関数演算部１２に与えたいケースがある。これは、図７に示すように、もう１つのラッチレジスタＢ（５２）をラッチレジスタＡ（５１）の後段に設け、一斉に変更すべきタイミングを示す信号（ＲＥＰＴＲＧ）に応答してラッチレジスタＡ（５１）の内容をラッチレジスタＢにコピーする方法を採る。
【００７９】
ＲＥＰＴＲＧ信号に対応するレジスタ群のラッチレジスタＢ（５２）に共通して接続すれば、そのレジスタ群の内容を適切なタイミングで同時に変更可能である。その場合の出力としてはＬＤＡＴＡ−Ｂを用いる。
【００８０】
なお、ＲＥＰＴＲＧ信号は、マスタプロセッサ部１６からのアクセス制御信号（／ＣＰＵＷＴ，／ＣＰＵＲＤ）に応答して、ライト信号ＷＴＥの生成と同様の非同期信号の同期化手法を用いてクロック信号ＣＬＫに同期化させて生成するのが一般的であるが、外部からのリプレースコマンドをクロック信号ＣＬＫに同期化して用いて生成しても良い。
【００８１】
（ハ）図７に示すレジスタの構成の中で、ラッチレジスタＡ（５１）、ラッチレジスタＢ（５２）は、ゲートラッチ回路を用いて構成する。ゲートラッチとはこの場合、Ｇ入力に与える信号（ここではライト信号ＷＴＥ５３、ＲＥＰＴＲＧ５４）がＨｉレベルのとき、Ｄ入力のデータを透過してＱ出力（ＬＤＡＴＡ−Ａ（８０）、ＬＤＡＴＡ−Ｂ（８１））に出力し、Ｇ入力に与える信号がＬｏレベルに遷移するタイミングでＤ入力のデータをラッチし保持する機能を有している。ゲートラッチ回路を用いれば、フリップフロップ回路を用いる場合の約１／２のゲート数で構成可能であり、消費電力的にも有利である。
【００８２】
次に、関数演算部を含むビーム走査制御部内のクロック信号ＣＬＫ（周波数ｆ＝１００ＭＨｚ以上）に同期したデータ群を、マスタプロセッサ側に読み出す際の同期化手段について述べる。
【００８３】
イ）図８に示すように、マスタプロセッサ部１６側から生成されるリードコマンド（／ＣＰＵＲＤ）を、ライト信号／ＷＴａ生成時と同様の同期化手段にてクロック信号ＣＬＫに同期化し、リード信号／ＲＤａ信号を生成する。
【００８４】
ロ）図９に示す内部レジスタをラッチするためのラッチレジスタ５５を設け、生成したリード信号／ＲＤａ信号５６の立ち上がりタイミングに応答してマルチプレクサＭＵＸ５７を介して選択信号ＳＥＬ５９により選択されたクロック信号ＣＬＫに同期した内部データ５８をラッチレジスタ５５にラッチする。
【００８５】
これにより、マスタプロセッサ部１６に対しては、リード信号／ＲＤａが立ち下がる約１ＣＬＫ程度以上前のタイミングから、／ＣＰＵＲＤが立ち上がる（終了する）少なくとも１ＣＬＫ以上先のタイミングまでの期間、所望の内部データを正しく表示することができる。マスタプロセッサはこの表示データを読み込めば良い。
【００８６】
なお、マルチプレクサＭＵＸ５７を切り換え、所望の内部データをラッチレジスタ５５に対して与えるための選択信号ＳＥＬ５９には、一般的にマスタプロセッサ部１６からのアドレス信号か、それに応答してモディファイされた信号を用いれば良い。
【００８７】
次に、関数演算部１２からの結果を高精度なアナログ情報に変換して１００ＭＨｚ以上のレートで出力する制御情報出力部１３について述べる。
【００８８】
図１０に、周波数ｆ＝１００ＭＨｚ以上の周波数で高精度なアナログ情報に変更する手段を示す。
図１０において、ＦＩ６０は、浮動小数点データ（実数）を整数値（３２ｂｉｔ）に変換する演算器、ＭＵＸＨ６１、ＭＵＸＬ６２及びＭＵＸＡ６３は、それぞれ選択信号ＳＥＬＨ６４、ＳＥＬＬ６５及びＳＥＬＡ６６に対応して、演算器ＦＩ６０から出力される３２ビットデータのうち上位２０ビットから１６ビット分を選択するマルチプレクサである。
【００８９】
マルチプレクサＭＵＸＨ６１、ＭＵＸＬ６２の出力は、フリップフロップ回路ＦＦで構成されるパイプラインレジスタ６７、６８を介して、ＤＡＣ（デジタルアナログ変換器）の入力フォーマット（ストレートバイナリ、オフセットバイナリ、コンプリメンタリ等）に変換するロジックＦＭ回路（ＭＳＢとその他のビットを反転させる回路）６９、７０を経由し、さらにパイプラインレジスタ７１、７２を介して、それぞれ１００ＭＨｚ以上のサンプリング周波数性能を有するＤ／Ａ変換器であるＤＡＣＨ７３、ＤＡＣＬ７４に入力される。
【００９０】
一方、ＭＵＸＡ６３の出力は、パイプラインレジスタ７５を介して、メモリユニット７６のアドレス入力に与えられ、メモリユニット７６からは対応するデータが出力される。そして、このメモリユニット７６からの出力データは、パイプラインレジスタ７７を介した後、１００ＭＨｚ以上のサンプリング周波数性能を有するＤＡＣＡＤＪ７８（補正用ＤＡＣ）に入力される。
【００９１】
上述した例では、ＤＡＣＨ７３とＤＡＣＬ７４とからのアナログ出力をアナログ的に加算することにより、最大３２ビット分解能レベルのアナログ出力が得られる。しかし、ＤＡＣの非線形性や、基準オフセット誤差等を補正しないと十分な精度が得られないため、精度的にネックとなるＤＡＣＨ部の補正を主眼として、ＤＡＣＡＤＪ７８により補正加算値を出力する。
【００９２】
補正加算値は、ＤＡＣＨ７３とＤＡＣＬ７４の加算値を高精度電圧測定器で事前に測定しておき、誤差の補正分を加算値として、メモリ書き込み手段７９によって予めメモリユニット７６に保持させておけば良い。また、補正加算値は、アンプ部の動的な歪の逆関数に対応する数値をメモリユニットに保持させることで、アナログ歪も補正可能となる。従って、ＤＡＣＨ、ＤＡＣＬ、ＤＡＣＡＤＪの各アナログ出力をアナログ的に加算して用いれば、高精度なアナログ情報を出力することができる。
【００９３】
【発明の効果】
本発明は、以上説明したように構成されているため、次のような効果がある。デジタル演算処理装置において、パイプライン化した実数乗算器と実数加算器とを融合手段により１つに結合して構成したＭＡＣ演算器を用い、実数乗算器の出力段である最終段のパイプラインレジスタと実数加算器の入力段である初段のパイプラインレジスタとを同レベルにそろえ、実数乗算器の最終ステージの処理と、実数加算器の初段ステージの１部とを並列に動作させるように構成される。
【００９４】
これにより、実数乗算器の出力段と、実数加算器の入力段とのレベルが同レベルであるので、これらの信号を並列に処理可能であり、実数乗算器の最終段の次段で、ＩＥＥＥ形式に変換するステージを設ける必要が無く、内部形式のまま、次段の実数加算器にデータを引き渡すことができる。
【００９５】
従って、実数加算器の初段で,乗算器からの結果に対してＩＥＥＥ形式からの変換ステージを実行する必要もなくなる。これにより、パイプライン段数とトランジスタ数の削減が可能となり、回路構成の大規模化及び複雑化と発熱を伴うことなく高集積化が実現でき、高速で高精度なデジタル演算処理が可能となる。
【００９６】
また、第１の周期で高速に変化する情報を処理する演算処理装置に、第２の周期で低速に変化する情報を処理するプロセッサを設けることにより、低速で制御する部分と、高速で制御する部分とを分割して、制御の適切化が可能になるが、本構成により、上記２種類の周期で動作する部分のスムーズな情報のやり取りが実現できる。これにより、情報のやり取りの同期が問題とならずに、高速なデジタル演算処理可能な演算手段を用いることができ、高速で高精度なデータ処理が可能なデジタル演算処理装置を実現することができる。
【００９７】
また、デジタルデータを連続的なビット列で構成された出力データに分割する手段と、その上位側の出力データに対応した補正データを記憶するメモリ手段と、２つの出力データと補正データとに対応したアナログデータを出力する３つのデジタルアナログ変換器と、少なくとも２つの出力データを、対応するデジタルアナログ変換器に与えるデータフォーマットに変換する手段と、２つの出力データと補正データとの出力タイミングを合わせる手段とを備え、３つのデジタルアナログ変換器のアナログデータを加算し、高精度なアナログ出力を生成する。
【００９８】
これにより、高速で演算処理されたデジタル信号を、適切に高精度かつ高速にアナログ信号に変換可能なデジタル演算処理を実現することができる。
【００９９】
また、上記デジタル演算処理装置は、ビーム走査型の画像情報取り込み装置に適用することができ、高速で高精度な画像取り込み処理が可能なビーム走査型の画像情報取り込み装置を実現することができる。
【０１００】
さらに、高速デジタル演算処理装置において、１００ＭＨｚ以上のクロック周波数に同期して、外部からの情報やマスタプロセッサからの情報を取り込み、１０ｎｓ以下の周期でデジタル処理をパイプライン的に進め、１０ｎｓ以下の周期での結果外部出力が達成できる効果がある。
【０１０１】
また、演算処理部の論理回路量やパイプラインレジスタの削減可能となり、それによりトランジスタのスイッチングパワーを小さくでき発熱を押さえる効果と、演算処理のレイテンシタイムを小さくする効果とが同時に得られる。
【０１０２】
また、マスタプロセッサと高速デジタル演算処理装置との間で高速クロックに同期してスムーズに情報のやり取りが可能となる効果がある。
【０１０３】
また、デジタルアナログ変換器に対して、そのデジタルアナログ変換器の誤差成分も補正した形で演算処理結果を高精度なデジタルデータとして１０ｎｓ以下のクロック周期で出力できる効果がある。
【図面の簡単な説明】
【図１】本発明の数値演算システムを必要とする装置であるビーム走査型の画像情報取り込み装置の概略構成図である。
【図２】図１の例におけるビーム走査制御部の基本システム構成を示した図である。
【図３】基本演算器としてＭＡＣ演算器で構成した演算器の例を示した図である。
【図４】５段のパイプライン構造を有するＭＡＣ演算器を説明した図である。
【図５】図３の構成に対し、図４のパイプライン化されたＭＡＣ演算器を適用した場合のパイプライン段数の削減を説明した図である。
【図６】図４に示したパイプライン構造のＭＡＣ演算器の演算分割配分を示した図である。
【図７】ライト時の同期化手段であるライトデータ用レジスタの構成を説明した図である。
【図８】マスタプロセッサからのアクセス信号と、基準クロック信号との同期化を説明した図である。
【図９】リード時の同期化手段であるリードデータ用レジスタの構成を説明した図である。
【図１０】デジタルデータを１００ＭＨｚ以上の周期で高精度なアナログ情報に変更する手段を説明する図である。
【符号の説明】
１ビーム光源
２ビーム走査部
３レンズ部
４被検出試料
５ステージ部
６検出部
７画像処理部
８ビーム走査制御部
１０外部情報入力部
１１制御情報演算部
１２関数演算部
１３制御情報出力部
１４レジスタ部ａ
１５レジスタ部ｂ
１６マスタプロセッサ部
１７ステージ制御部
１８ビーム走査部
１９画像処理部
２０被検出試料
２１ビーム走査制御部
３０演算器
３１〜３９、４０ＭＡＣ演算器
４１レジスタ
４２ＭＰＹＳＴＧ１Ａ
４３ＭＰＹＳＴＧ１Ｂ
４４ＭＰＹＳＴＧ２
４５ＭＰＹＳＴＧ３Ａ
４６ＡＤＤＳＴＧ１Ａ
４７ＡＤＤＳＴＧ１Ｂ
４８ＡＤＤＳＴＧ２
４９ＡＤＤＳＴＧ３Ａ
５０ＡＤＤＳＴＧ３Ｂ
５１ラッチレジスタＡ
５２ラッチレジスタＢ
５３ＷＴＥ
５４ＲＥＰＴＲＧ
８０ＬＤＡＴＡ−Ａ
８１ＬＤＡＴＡ−Ｂ
５５ラッチレジスタ
５６／ＲＤａ信号
５７ＭＵＸ
５８内部データ
５９ＳＥＬ
６０ＦＩ
６１ＭＵＸＨ
６２ＭＵＸＬ
６３ＭＵＸＡ
６４ＳＥＬＨ
６５ＳＥＬＬ
６６ＳＥＬＡ
６７、６８、７１パイプラインレジスタ
７２、７５、７７パイプラインレジスタ
６９、７０ＦＭ
７３ＤＡＣＨ
７４ＤＡＣＬ
７６メモリユニット
７８ＤＡＣＡＤＪ
７９メモリ書き込み手段
８０〜８７パイプラインレジスタ

[0001]
[Technical Field of the Invention]
The present invention relates to a digital arithmetic processing device, and more particularly to a digital arithmetic device that requires a high arithmetic speed.
[0002]
[Prior art]
Conventionally, there are the following examples of high-speed arithmetic control devices.
(1) In the known example shown in Japanese Patent Laid-Open No. 5-258703, high-speed control output calculation is not performed by a digital arithmetic unit, but control output is performed using an analog circuit to support high-speed control.
[0003]
(2) In the electron beam drawing apparatus disclosed in Japanese Patent Laid-Open No. 5-226234, the arithmetic processing unit is built by combining DSPs as arithmetic units, and data is transferred between the DSPs through a two-port memory. Each arithmetic DSP is controlled by a master DSP. Furthermore, because the processor itself writes data to the D/A converter that converts the digital data into an analog output, two banks of registers are switched alternately.
[0004]
[Problems to be solved by the invention]
For example, in a control apparatus such as an electron beam drawing apparatus, it is desired to improve the processing speed. In order to improve the processing speed, it is conceivable to use an analog circuit instead of performing a high-speed control output operation with a digital operation unit as disclosed in Japanese Patent Application Laid-Open No. 5-258703. However, when an analog circuit is used, the processing speed is improved, but high-precision control processing cannot be expected.
[0005]
For this reason, in high-speed arithmetic control devices such as electron beam drawing apparatuses, a digital arithmetic processing device capable of high-speed arithmetic is desired in order to improve accuracy and shorten processing time. Realizing such a device, however, has involved the following problems.
[0006]
(1) In order to perform high-accuracy and high-speed arithmetic, it is necessary to execute real number arithmetic processing in a parallel pipeline, but a very large number of transistors are required. Furthermore, since it is linked with the control unit operating in real time, it is necessary to keep the latency time small, and there is a need for a compact configuration from the viewpoint of shortening the wiring.
For this reason, a very highly integrated LSI or electronic substrate must be realized, and advanced logic design technology and technology for reducing the number of transistors are required.
[0007]
Satisfying the above requirements means that a large number of transistors must switch heavily within a small area, which may generate a large amount of heat. Circuit-design measures to suppress heat generation are therefore required. However, such measures increase the scale and complexity of the circuit configuration and raise the cost.
[0008]
(2) Smooth exchange of information is required between the master processor and the high-speed digital arithmetic processing device in synchronization with the high-speed clock.
Conventionally, a two-port memory has often been used for this purpose, but limits on the access speed of devices such as memory and the problem of access contention have made it difficult to increase the speed.
[0009]
(3) In order to give a command value to the controlled object, it is necessary to convert the high-precision high-speed control output calculation data into an analog signal.
[0010]
However, high-speed, high-precision digital-to-analog converters are subject to limitations. It has been difficult to output the arithmetic result, with error components also corrected, as high-precision digital data at, for example, 12-bit precision or better with a clock period of 10 ns or less, and so the calculation speed could not be raised any further.
[0011]
An object of the present invention is to realize a high-speed digital arithmetic processing control device that can produce control outputs at a rate of 100 MHz or higher without increasing the scale or complexity of the circuit configuration.
[0012]
[Means for Solving the Problems]
  In order to achieve the above object, the present invention is configured as follows.
  (1) A digital arithmetic processing device that performs digital arithmetic processing in synchronization with a high-speed clock signal cycle comprises:
  a real-number multiplier having a first stage that converts input data into an internal format, a stage that multiplies the converted input data, and a final stage, consisting of a pipeline register, that converts the multiplication result from the internal format to an output format and outputs it; and
  a real-number adder having a first stage, consisting of a pipeline register, that converts input data into the internal format, a stage that adds the converted input data, and a final stage that converts the addition result from the internal format to the output format and outputs it. The multiplier and the adder are pipelined, and the device includes a MAC arithmetic unit in which the pipelined real-number multiplier and real-number adder are combined into one, so that the operation forming the basic unit of arithmetic processing is carried out by performing multiplication and addition in sequence.
  The pipeline register forming the final stage of the real-number multiplier and the pipeline register forming the first stage of the real-number adder are aligned to the same stage level within the arithmetic unit, and the final-stage processing of the real-number multiplier and part of the first-stage processing of the real-number adder operate in parallel.
[0013]
In this configuration, the output stage of the real-number multiplier and the input stage of the real-number adder are at the same level, so their processing can be performed in parallel. No stage for conversion to the IEEE format is needed after the final stage of the multiplier, and data can be handed to the following real-number adder in the internal format.
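The effect of handing the product to the adder without an intermediate format conversion can be illustrated in software. The following is only a minimal sketch, not the circuit itself: it assumes that the internal format carries the product exactly (modelled with Python's `fractions.Fraction`) and that IEEE double precision stands in for the output format. The function names `mac_separate` and `mac_fused` are hypothetical.

```python
from fractions import Fraction

def mac_separate(a: float, b: float, c: float) -> float:
    """Multiply, round to the output format, then add (two roundings)."""
    product = a * b          # rounded to IEEE double here
    return product + c       # rounded again after the addition

def mac_fused(a: float, b: float, c: float) -> float:
    """Keep the product exact (stand-in for the internal format), round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single rounding at the final stage

# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0 in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate_result = mac_separate(a, b, c)  # the small term is lost in the first rounding
fused_result = mac_fused(a, b, c)        # the small term -2**-60 survives
```

With these inputs the two-rounding version returns zero while the single-rounding version keeps the small residual, which is the kind of accuracy benefit an unconverted hand-off can give under the stated assumption.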
[0014]
Accordingly, the first stage of the real-number adder no longer needs to execute a conversion stage from the IEEE format on the result from the multiplier. This reduces both the number of pipeline stages and the number of transistors, so that high integration can be achieved without increasing the scale or complexity of the circuit configuration or the heat generated, and high-speed, high-precision digital arithmetic processing becomes possible.
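Why the number of pipeline stages matters can be seen from a small behavioural model. The sketch below is an illustration under stated assumptions, not the patented design: it treats the MAC unit as a simple shift pipeline that accepts one (a, b, c) triple per clock and delivers a*b + c a fixed number of clocks later. The five-stage depth follows the embodiment described in this document; the function name `run_mac_pipeline` is hypothetical.

```python
from collections import deque

def run_mac_pipeline(inputs, depth=5):
    """Feed one (a, b, c) triple per clock; return (clock, result) pairs.

    The pipeline is modelled as `depth` registers. An operand set that
    enters on clock n leaves on clock n + depth.
    """
    pipe = deque([None] * depth)
    results = []
    # Append idle clocks so the last operands can drain out of the pipeline.
    stream = list(inputs) + [None] * depth
    for clock, item in enumerate(stream, start=1):
        pipe.appendleft(item)     # new operands enter the first stage
        finished = pipe.pop()     # oldest entry leaves the last stage
        if finished is not None:
            a, b, c = finished
            results.append((clock, a * b + c))
    return results

results = run_mac_pipeline([(2, 3, 1), (4, 5, 6), (1, 1, 1)])
```

In this model each removed stage shortens the latency by one clock, while the throughput stays at one result per clock once the pipeline is full.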
[0015]
  (2) A digital arithmetic processing device that performs digital processing in synchronization with a first high-speed clock signal comprises:
  a processor that operates in synchronization with a second clock signal asynchronous to the first high-speed clock signal; a first-stage latch register having a function of latching data from the processor in response to a first gate signal; and a second-stage latch register having a function of latching data from the first-stage latch register in response to a second gate signal. The first gate signal is generated based on a write access signal from the processor, and the second gate signal is generated based on a signal synchronized to the first high-speed clock signal. A plurality of data items written by the processor into the first-stage latch register are handed over all at once to the second-stage latch register by the second gate signal.
[0016]
By adding a processor that handles information changing slowly at the second period to an arithmetic processing device that handles information changing rapidly at the first period, the slowly controlled part and the rapidly controlled part can be separated and each controlled appropriately. With this configuration, information can be exchanged smoothly between the parts operating at the two different periods. As a result, synchronization of the information exchange no longer poses a problem, arithmetic means capable of high-speed digital arithmetic processing can be used, and a digital arithmetic processing device capable of high-speed, high-precision data processing can be realized.
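The two-stage latch arrangement behaves like a double-buffered parameter register. The following is a minimal behavioural sketch, assuming software stand-ins for the two gate signals; the class and method names are hypothetical and the clock-domain synchronization logic itself is not modelled.

```python
class TwoStageLatch:
    """Behavioural model of a first-stage and second-stage latch register pair."""

    def __init__(self, size):
        self.stage1 = [0] * size   # written by the slow processor
        self.stage2 = [0] * size   # read by the fast datapath

    def cpu_write(self, index, value):
        """First gate signal: derived from the processor's write access.

        Updates a single first-stage entry; the datapath does not see it yet.
        """
        self.stage1[index] = value

    def transfer(self):
        """Second gate signal: synchronized to the fast clock.

        Copies every first-stage entry to the second stage at once.
        """
        self.stage2 = list(self.stage1)

    def read_for_datapath(self):
        """Values currently visible to the fast arithmetic unit."""
        return tuple(self.stage2)

regs = TwoStageLatch(3)
regs.cpu_write(0, 1.5)
regs.cpu_write(2, -0.25)
before = regs.read_for_datapath()   # still the old values
regs.transfer()
after = regs.read_for_datapath()    # all updates appear together
```

The point of the arrangement is that the fast side never observes a partially updated set of parameters, however slowly the processor writes them.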
[0017]
  (3) A digital arithmetic processing device that outputs digital data as analog data in synchronization with a high-speed clock signal cycle comprises: means for dividing the digital data into at least two output data items, each composed of a continuous bit string; memory means for storing correction data corresponding to the upper output data item; at least three digital-to-analog converters that output analog data corresponding to the at least two output data items and the correction data; means for converting the at least two output data items into the data format supplied to the corresponding digital-to-analog converters; and means for aligning the output timing of the at least two output data items and the correction data. The analog data from the at least three digital-to-analog converters are added together.
[0018]
With this configuration, digital signals computed at high speed can be converted appropriately into analog signals with high accuracy and at high speed.
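The split-and-correct output scheme can be sketched numerically. The code below is an illustration under loud assumptions: the 8-bit/8-bit field widths, the error table of the upper converter, and all function names are hypothetical, the lower converter is treated as ideal, and timing alignment is not modelled. It shows only the idea that a memory addressed by the upper field can supply a value that cancels the upper converter's error when the three outputs are summed.

```python
HI_BITS = 8   # assumed width of the upper field
LO_BITS = 8   # assumed width of the lower field

def split_word(word: int) -> tuple[int, int]:
    """Split a word into contiguous upper and lower bit fields."""
    return word >> LO_BITS, word & ((1 << LO_BITS) - 1)

# Hypothetical per-code error of the upper DAC, in units of one LSB.
UPPER_DAC_ERROR = [((code * 37) % 11 - 5) / 8 for code in range(1 << HI_BITS)]

# Correction memory: addressed by the upper field, holds the cancelling value.
CORRECTION = [-err for err in UPPER_DAC_ERROR]

def dac_upper(code: int) -> float:
    """Upper DAC: ideal output plus its modelled nonlinearity."""
    return code * (1 << LO_BITS) + UPPER_DAC_ERROR[code]

def dac_lower(code: int) -> float:
    """Lower DAC, treated as ideal in this sketch."""
    return float(code)

def dac_adjust(upper_code: int) -> float:
    """Correction DAC driven by the memory lookup."""
    return CORRECTION[upper_code]

def analog_output(word: int) -> float:
    """Sum of the three DAC outputs, in LSB units."""
    hi, lo = split_word(word)
    return dac_upper(hi) + dac_lower(lo) + dac_adjust(hi)

uncorrected = dac_upper(1) + dac_lower(0)   # shows the modelled error
corrected = analog_output(1 << LO_BITS)     # same code with correction applied
```

In a real device the table contents would come from measurement rather than from a formula, and the correction converter would only need to cover the small error range.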
[0019]
  (4) In a beam-scanning image information capturing apparatus in which a beam from a beam light source is scanned by a beam scanning control unit and irradiated onto an object to be detected, and image information of the object is obtained by an image processing unit, the beam scanning control unit comprises: digital arithmetic processing means that has a MAC arithmetic unit, performs correction calculations for image distortion arising from errors in the beam scanning position, and outputs beam control digital data, the MAC arithmetic unit being formed by pipelining a real-number multiplier (having a first stage that converts input data into an internal format, a stage that multiplies the converted input data, and a final stage that converts the multiplication result from the internal format to an output format and outputs it) and a real-number adder (having a first stage that converts input data into the internal format, a stage that adds the converted input data, and a final stage that converts the addition result from the internal format to the output format and outputs it), merging the pipelined multiplier and adder into one, aligning the pipeline register forming the final stage of the multiplier and the pipeline register forming the first stage of the adder to the same stage level within the arithmetic unit, and operating the final-stage processing of the multiplier and part of the first-stage processing of the adder in parallel so that multiplication and addition are performed in sequence; a processor that sets coefficient data for the correction calculation in synchronization with a second clock signal asynchronous to the first high-speed clock signal of the digital arithmetic processing means; a first-stage latch register having a function of latching data from the processor in response to a first gate signal generated based on a write access signal from the processor; a second-stage latch register having a function of latching all at once, by a second gate signal output from the processor, the plurality of data items written by the processor into the first-stage latch register and supplying them to the digital arithmetic processing means; and, in order to convert the beam control digital data into an analog beam scanning control signal, means for dividing the digital data from the digital arithmetic processing means into at least two output data items each composed of a continuous bit string, memory means for storing correction data corresponding to the upper output data item, at least three digital-to-analog converters that output analog data corresponding to the at least two output data items and the correction data, means for converting the at least two output data items into the data format supplied to the corresponding digital-to-analog converters, means for aligning the output timing of the at least two output data items and the correction data, and means for adding the analog data of the at least three digital-to-analog converters. The beam scanning position is thereby controlled accurately by digital calculation.
[0020]
With this configuration, it is possible to realize a beam scanning type image information capturing device capable of high-speed and high-precision processing.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
One embodiment of the present invention realizes a digital arithmetic processing device that performs combined calculations on information that changes rapidly at a first period and information that changes slowly at a second period, and outputs the result at high speed every first period.
As described below, the first period corresponds to a first frequency of 100 MHz to 200 MHz, and the second period corresponds to a second frequency on the order of 100 kHz to 200 kHz.
[0022]
An example of an apparatus to which the digital arithmetic processing apparatus according to the embodiment of the present invention is applied is a beam scanning type image information capturing apparatus shown in FIG. This image information capturing apparatus includes a beam light source 1, a beam scanning unit 2, a lens unit 3, a sample 4 that is an object to be detected (observed object), a stage unit 5, a detection unit 6, and an image processing unit 7. And a beam scanning control unit 8.
[0023]
In the beam scanning type image information capturing apparatus shown in FIG. 1, the beam generated from the beam light source 1 is deflected to an appropriate angle by the beam scanning unit 2, focused by the lens unit 3 at an appropriate magnification, and scanned over the sample 4 in a fixed direction, for example the X direction. The X-direction line images obtained by the detection unit 6 and the image processing unit 7 are joined in the Y direction while the stage unit 5 shifts the sample continuously, for example in the Y direction, so that an XY plane image is obtained.
[0024]
At present, the rate at which the image for one pixel in the X direction is obtained is taken as the first frequency f1, with f1 = about 100 MHz to 200 MHz. The rate at which one line is scanned in the X direction is taken as the second frequency f2, with f2 = about 100 kHz to 200 kHz.
[0025]
The values of the first frequency f1 and the second frequency f2 are among the specifications actually required. They are calculated from the image processing resolution and tact time when the image information capturing device is used, for example, with an LSI chip on a semiconductor wafer as the detected sample 4, to obtain a pattern image of the LSI chip and inspect whether it is correct.
[0026]
In the apparatus shown in FIG. 1, the important part that requires the digital arithmetic processing apparatus proposed by the present invention is the beam scanning control unit 8. The beam scanning control unit 8 mainly needs to perform correction control calculations for the following two classes of error.
[0027]
(1) An error object that is a semi-fixed error typified by optical distortion and can be predefined by a continuous error function. These are corrected by a numerical calculation such as coordinate transformation using a transformation function, prior shape measurement information, or a combination of both, and reflected in the control output.
This depends on the position of the detection object 4 in the beam path, and includes correction using the measurement information.
[0028]
(2) Errors caused by positional shifts, speed fluctuations, environmental changes such as temperature fluctuations, and changes over time. These are corrected using sensing information from the stage unit 5 and measurement information acquired as needed, and the correction is reflected in the control output.
[0029]
Here, the control output corresponds to a command to the beam scanning unit 2 for correctly controlling the beam.
FIG. 2 shows a basic system configuration of the beam scanning control unit 8.
In FIG. 2, the function calculation unit 12 is a processing unit corresponding to the operation (1), and the control information calculation unit 11 is a processing unit corresponding to the operation (2).
[0030]
The register unit a14 holds the information g (PCD0...) that changes with the second period for use in the control information calculation unit 11, and the register unit b15 holds the information h (FCD0...) that changes with the second period for use in the function calculation unit 12, which operates in synchronization with the first period. The information in these register units 14 and 15 is changed by the master processor unit 16 and is used as a variable or a constant in the processing of each arithmetic unit.
[0031]
The external information input unit 10 obtains information a for monitoring the state of the stage unit 5, such as position information obtained from the length measuring unit of the stage control unit 17, and preprocesses it to generate information b in a form on which the calculation can be performed. The input information here may be feedback information of the control information for driving the stage unit 5, or feedforward information from a sensor mounted on the stage unit 5.
The control information calculation unit 11 obtains preprocessed information from the external information input unit 10 and generates information c to be given to the function calculation unit 12 in the first period.
[0032]
The function calculation unit 12 generates the basic information d for beam control as a function of the position (X, Y) at which the beam scans the sample 20. For example, it performs a projection process that corrects the distortion of the optical system and makes the plane isotropic, using a coordinate conversion formula expressed as a cubic polynomial such as formula (1) below.
Although not shown in the figure, the coefficients a and b are obtained from pre-measurement information and are supplied from the master processor 16 through the register b15 based on information such as the height of the sample 20 at the target position of the beam.
[0033]
[Expression 1]

[0034]
Assuming that X is in the line direction, the time for scanning n pixels per line in the image processing unit 19 is the minimum period in which the Y direction changes. The change period of X corresponds to the scanning time for one pixel. Therefore, the change period of Y can be approximately (X change period) × n + α.
[0035]
Assuming that the scanning time per line is about 10 μs and the number of pixels per line in the X direction is about 1000, the change period of X is about 10 μs / 1000 = 10 ns (f = 100 MHz), and the change of Y The period is 10 μs + α. Since the information d responds to the change cycle of X, it becomes high-speed change information with a cycle of 10 ns. This is defined as the period at the first frequency, and 10 μs + α is defined as the period at the second frequency.
[0036]
The control information output unit 13 generates the digital control information e, from which the DAC unit 22 generates the command information f (analog control information) to be given to the beam scanning unit 18. This will be described in detail later.
[0037]
The master processor unit 16 aggregates information from the control information calculation unit 11, the function calculation unit 12, the image processing unit 19, the stage control unit 17, and the like, and performs comprehensive judgment processing, management processing, and parameter changing processing for the register units a14 and b15, using the period of the second frequency as its basic period. That is, the master processor unit 16 can be regarded as the overall control and management unit of the beam scanning control unit 21.
[0038]
In this example, the first component of the beam scanning control unit 21 that is both important and difficult to realize in digital form is the function calculation unit 12, which must evaluate the function expressed by the above equation (1) at a throughput of frequency f = 100 MHz or more.
[0039]
Simply executing the above equation (1) requires 36 additions and multiplications, and more than 40 operations when normalization is included. In addition, these operations must be real number operations for the sake of accuracy, and if they are executed in a general-purpose form that follows equation (1) as written, the ability to perform floating-point real number operations 4 G times per second (4 GFLOPS) is required. Furthermore, it may be necessary to execute preprocessing calculations and correction calculations in combination, in which case a total processing capability of about 10 G operations per second (10 GFLOPS) can be expected to be required.
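The figures above can be checked with a short calculation. The following is only an illustrative sketch: the variable names are assumptions, the scan timing is taken from paragraph [0035], and the count of 40 operations is the rounded lower bound quoted in the text.

```python
# Scan timing from paragraph [0035]: about 10 us per line, about 1000 pixels per line.
LINE_TIME_NS = 10_000
PIXELS_PER_LINE = 1_000

pixel_period_ns = LINE_TIME_NS / PIXELS_PER_LINE  # change period of X
f1_hz = 1e9 / pixel_period_ns                     # first frequency

# Formula (1) as mapped onto FIG. 3: 9 MAC operations per direction,
# each one multiplication plus one addition, for two directions (X and Y).
ops_formula = 9 * 2 * 2

# Rounded figure from the text once normalization is included (assumed lower bound).
OPS_WITH_NORMALIZATION = 40

required_flops = OPS_WITH_NORMALIZATION * f1_hz

print(f"pixel period : {pixel_period_ns} ns")
print(f"f1           : {f1_hz / 1e6} MHz")
print(f"ops, formula : {ops_formula}")
print(f"required     : {required_flops / 1e9} GFLOPS")
```

At f1 = 200 MHz, the upper end of the stated range, the same count of operations would double the requirement, which is in line with the text's expectation that around 10 GFLOPS may be needed once additional preprocessing and correction are combined.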
[0040]
Items constituting the problem in configuring the function calculation unit 12 are summarized below.
(a) In order to perform the high-speed operation described above, real number arithmetic must be executed in a parallel pipeline, which requires a very large number of transistors. Furthermore, since the unit is linked with a control unit operating in real time, the latency must be kept small, and a compact configuration is needed in order to shorten the wiring. That is, a very highly integrated LSI or electronic substrate must be realized, and advanced logic design technology and transistor count reduction technology are required.
[0041]
(b) If point (a) above is to be realized, many transistors perform a large amount of switching in a small area, which may generate a large amount of heat. Therefore, a circuit design that suppresses heat generation must be devised.
[0042]
(c) Information that changes at a different period, such as information from the register unit, must be taken smoothly into the high-speed processing, and processing information must be read out to the master processor in real time. That is, an advanced synchronization technique for circuit operation is required.
[0043]
For the above reasons, the beam scanning control unit 21, except for the DAC unit 22, is preferably implemented as LSI. In the example shown in FIG. 2, owing to restrictions on the gate count and the number of pins, the external information input unit 10, the control information calculation unit 11, and the register unit a14 are configured as one chip, and the function calculation unit 12, the control information output unit 13, and the register unit b15 as another chip. Both are realized on a single type of LSI, with the function switched by a select signal.
[0044]
As an example of a solution to problem (a), FIG. 3 shows a scheme in which the operation of the above formula (1) is reorganized into product-sum form, a MAC operator (multiply-accumulate, product-sum type operator) is used as the basic operator, and these operators are executed in parallel in the most efficient manner.
[0045]
The MAC arithmetic unit is configured as a general-purpose real number arithmetic unit conforming to the real number format of the IEEE standard, for example, so that a result is obtained in a desired numerical range with respect to the real number input.
[0046]
In FIG. 3, (Yb, a7, a5) is input to the MAC calculator 31, (Yb, a6, a3) is input to the MAC calculator 32, and (Yb, a8, a4) is input to the MAC calculator 33.
[0047]
In addition, (Xb, a9) and the output of the MAC calculator 31 are input to the MAC calculator 34; (Yb, a2) and the output of the MAC calculator 32 are input to the MAC calculator 35; and (Yb, a1) and the output of the MAC calculator 33 are input to the MAC calculator 36.
[0048]
Further, Xb and the outputs of the MAC calculators 34 and 35 are input to the MAC calculator 37, and (Yb, a0) and the output of the MAC calculator 36 are input to the MAC calculator 38. Finally, Xb and the outputs of the MAC calculators 37 and 38 are input to the MAC calculator 39, which outputs Sx or Sy.
[0049]
The computing unit 30 of the formula (1) shown in FIG. 3 is configured for computation in only one of the two control directions (X, Y). In order to realize the expression (1), two arithmetic units having the configuration shown in FIG. 3 may be operated in parallel.
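The connections of FIG. 3 described above can be sketched in software as follows. This is a hedged illustration rather than the circuit itself: the text does not state which two inputs of each MAC calculator are multiplied and which is added, so the operand roles below are an assumption, chosen so that the nine MAC operations together evaluate a complete cubic polynomial in (Xb, Yb) with ten coefficients. The function names `mac`, `fig3_tree`, and `cubic_direct` are hypothetical.

```python
def mac(a, b, c):
    """One product-sum operation: a * b + c."""
    return a * b + c


def fig3_tree(xb, yb, k):
    """Evaluate one control direction with nine MAC operations.

    k holds ten coefficients k[0]..k[9] (a0..a9 in the text).
    Operand roles are an assumption; see the lead-in paragraph.
    """
    m31 = mac(yb, k[7], k[5])
    m32 = mac(yb, k[6], k[3])
    m33 = mac(yb, k[8], k[4])
    m34 = mac(xb, k[9], m31)
    m35 = mac(yb, m32, k[2])
    m36 = mac(yb, m33, k[1])
    m37 = mac(xb, m34, m35)
    m38 = mac(yb, m36, k[0])
    m39 = mac(xb, m37, m38)
    return m39


def cubic_direct(x, y, k):
    """The same cubic written out term by term, for comparison."""
    return (k[9] * x**3 + k[7] * x**2 * y + k[6] * x * y**2 + k[8] * y**3
            + k[5] * x**2 + k[3] * x * y + k[4] * y**2
            + k[2] * x + k[1] * y + k[0])


coeffs = list(range(1, 11))  # arbitrary integer coefficients for the check
print(fig3_tree(3, -2, coeffs) == cubic_direct(3, -2, coeffs))
```

Under this assumption the tree is four MAC levels deep (31, 34, 37, 39), and a second copy of the tree with a different coefficient set would produce the other direction.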
[0050]
The advantages of building the arithmetic unit from the MAC calculators 31 to 39 are described below.
[0051]
1) Since the intermediate format can be freely set, it is possible to save gates compared to the case where the multiplier is simply connected to the adder (at least 1000 gates or more can be saved).
[0052]
2) The rounding process is reduced and the accuracy can be kept high.
3) When pipeline processing described later is performed, combined with the effects 1), 2), etc., the operation latency time can be shortened, so that the number of pipeline stages can be reduced. This also greatly contributes to gate saving and power saving.
[0053]
Incidentally, when the configuration of the example shown in FIG. 3 is built from transparent-type scalar arithmetic units and designed as a CMOS-process LSI, the latency per MAC stage is about 50 ns. Even with this optimal parallel processing structure, the latency for processing all the operations is about 200 ns, so a calculation period of 10 ns or less (frequency f = 100 MHz or more) cannot be obtained.
[0054]
Therefore, as described in (a) above, a pipelined parallel arithmetic unit structure must be adopted. However, if the unit is simply pipelined, the number of pipeline registers for holding intermediate data increases, which increases both the heat generated by switching in the pipeline registers, as noted in (b) above, and the number of transistors (number of gates).
[0055]
Accordingly, a MAC computing unit 40 having a five-stage pipeline structure shown in FIG. 4 is proposed. Details are shown in FIG. 6 and described later.
In FIGS. 4 and 6, the parameters a and b are supplied to the register 82 and the stage 44 through the registers 80 and 81 and the stages 42 and 43, respectively. The output from the stage 44 is supplied to the stage 47 via the register 83 and the stage 45.
[0056]
On the other hand, the parameter c is supplied to the register 84 and the stage 46 either directly or via the two registers 41 and 87 connected in series (the dotted-line portion of FIG. 4). The output from the stage 46 is then supplied to the stage 47.
The output from the stage 47 is supplied to the register 85 and the stage 48, and the output from the stage 48 passes through the register 86 and the stages 49 and 50 to become the output S.
[0057]
That is, the registers 80 and 81, the stages 42 and 43, and the register 41 form the first stage; the register 82, the stage 44, and the register 87 form the second stage; the registers 83 and 84 and the stages 45, 46, and 47 form the third stage; the register 85 and the stage 48 form the fourth stage; and the register 86 and the stages 49 and 50 form the fifth stage.
[0058]
The portion indicated by the dotted line in FIG. 4 consists of the pipeline registers 41 and 87, which would be needed to align the operation stages if the unit were simply pipelined. Omitting them saves about 1600 transistors per arithmetic unit and eliminates their switching power.
[0059]
Incidentally, since the input timing of the numerical parameter c of the MAC arithmetic unit differs from that of the parameters a and b, the numbers of operation stages may fail to match. However, as shown in FIG. 5, as long as the matching pipeline paths are adjusted only for the input variables (Xb, Yb), which change at a frequency f = 100 MHz or more, the entire process can be executed without inconsistency.
[0060]
In the example shown in FIG. 5, the pipelined MAC computing unit shown in FIG. 4 is applied to the configuration shown in FIG. 3. The notation "XX stages → YY stages" shown above and below each module indicates the total number of pipeline stages up to that module's output: XX is the count when the dotted-line portion of FIG. 4 is included, and YY is the count when the gate-saving MAC processor of this scheme is used.
[0061]
It can be seen that the number of pipeline stages needed for matching is reduced along with the total latency. In the end, the total latency was reduced from 20 stages to 18, and as many as 24 pipeline registers in total could be omitted. If a multiplier and an adder were simply combined, each MAC operation would take 6 pipeline stages, and 57 more pipeline registers would be required than with this method.
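The reduction from 20 stages to 18 can be reproduced with a simple latency model of the FIG. 3 network. The sketch below is an assumption-laden illustration, not the patent's own analysis: it assumes the same operand roles as one plausible reading of FIG. 3 (the third input of each MAC calculator is the addend c), that the multiplicand inputs pass through all 5 pipeline stages, and that the addend input enters at the third stage when registers 41 and 87 are omitted. The table `FIG3` and the function `output_stage` are hypothetical names.

```python
# MAC id -> ((multiplicand sources), addend source).
# Strings are primary inputs available at stage 0; integers are other MAC ids.
FIG3 = {
    31: (("Yb", "a7"), "a5"),
    32: (("Yb", "a6"), "a3"),
    33: (("Yb", "a8"), "a4"),
    34: (("Xb", "a9"), 31),
    35: (("Yb", 32), "a2"),
    36: (("Yb", 33), "a1"),
    37: (("Xb", 34), 35),
    38: (("Yb", 36), "a0"),
    39: (("Xb", 37), 38),
}


def output_stage(node, mul_depth, add_depth):
    """Pipeline stage count at which `node` delivers its result.

    mul_depth: stages from the a/b inputs to the MAC output.
    add_depth: stages from the c input to the MAC output.
    """
    if isinstance(node, str):
        return 0
    mul_srcs, add_src = FIG3[node]
    ready_mul = max(output_stage(s, mul_depth, add_depth) for s in mul_srcs) + mul_depth
    ready_add = output_stage(add_src, mul_depth, add_depth) + add_depth
    return max(ready_mul, ready_add)


# With registers 41 and 87 present, c also passes through all 5 stages.
print(output_stage(39, mul_depth=5, add_depth=5))
# Gate-saving version: c enters at the third stage, so only 3 stages remain.
print(output_stage(39, mul_depth=5, add_depth=3))
```

The model counts only the critical path; it does not attempt to reproduce the register counts (24 and 57) quoted in the text, which depend on how the matching delays for Xb and Yb are laid out in FIG. 5.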
[0062]
FIG. 6 shows a calculation division distribution of the MAC calculator 40 having the pipeline structure shown in FIG.
In FIG. 6, the input parameters a and b are supplied to the addition stage ADDSTG1B (47) via the pipeline registers 80 and 81, the multiplication stages MPYSTG1A (42) and MPYSTG1B (43), the pipeline register 82, the multiplication stage MPYSTG2 (44), the pipeline register 83, and the multiplication stage MPYSTG3A (45).
On the other hand, the input parameter c is supplied to ADDSTG1B (47) via the pipeline register 84 and the addition stage ADDSTG1A (46).
[0063]
The output from ADDSTG1B (47) is supplied to the addition stage ADDSTG3B (50) via the pipeline register 85, the addition stage ADDSTG2 (48), the pipeline register 86, and the addition stage ADDSTG3A (49). The output S (S = a × b + c) is output from the addition stage ADDSTG3B (50).
[0064]
The MAC computing unit 40 can operate with a period of about 10 ns (frequency f = 100 MHz). That is, the input parameters a, b, and c can be input in synchronization with the clock signal at a cycle of 10 ns, and the result of the pipelined processing (S = a × b + c) is output as the output S at a cycle of 10 ns.
[0065]
In the input stages MPYSTG1A (42) and ADDSTG1A (46), the data (a, b, c), which is input in IEEE standard format, must be converted into an internal format (binary format) that is easy to process arithmetically.
[0066]
This process takes about 1.5 to 3 ns. However, in the MAC computing unit according to the present invention, in which a multiplier and an adder are integrated, the final multiplication stage MPYSTG3A (45) and the stage ADDSTG1A (46), which converts the parameter c at the input of the addition into the internal format, can be processed in parallel.
[0067]
In other words, the present invention discloses a fusion means that connects the real number multiplier and the real number adder so that the final-stage pipeline register, which is the output stage of the real number multiplier, and the first-stage pipeline register, which is the input stage of the real number adder, are at the same level, and so that the final-stage processing of the real number multiplier and part of the first-stage processing of the real number adder operate in parallel. By this fusion means, the final multiplication stage MPYSTG3A (45) and the stage ADDSTG1A (46), which converts the parameter c at the input of the addition into the internal format, can be processed in parallel.
[0068]
Further, it is not necessary to provide an IEEE format conversion stage (corresponding to MPYSTG3B) after the final multiplication stage MPYSTG3A (45), and the data can be passed to the addition stage ADDSTG1B (47) in the internal format.
[0069]
Therefore, it is not necessary to execute the conversion stage (corresponding to ADDSTG1A) from the IEEE format on the result from the multiplier at the first stage of the addition stage. The output stage (corresponding to ADDSTG3B) that is converted into IEEE format (also rounded) and output to the next arithmetic unit need only be provided at the final stage of the adder.
[0070]
From the above, the MAC computing unit serving as the basic unit of the function computing unit 12 has the following latency distribution: the multiplication stages MPYSTG1A (42) and MPYSTG1B (43), 9 ns in total; the multiplication stage MPYSTG2 (44), 9 ns; the multiplication stage MPYSTG3A (45), 3 ns; the addition stage ADDSTG1A (46), 3 ns in parallel with MPYSTG3A (45); the addition stage ADDSTG1B (47), 6 ns; the addition stage ADDSTG2 (48), 9 ns; the addition stage ADDSTG3A (49), 3 ns; and the addition stage ADDSTG3B (50), 3 ns.
[0071]
The final stages ADDSTG3A (49) and ADDSTG3B (50) total 6 ns, leaving a margin of about 3 ns (transmission path delay margin) for sending the result to the next-stage arithmetic unit. Note that data is compressed into IEEE format for input and output not only for consistency with the general-purpose data format used externally, but also for the following reasons a) and b).
a) The effective internal format differs between the adder and the multiplier.
b) The bit width of the internal format is wider than that of the IEEE format, so the internal format is disadvantageous in terms of the number of gates, the switching power, and the amount of wiring between the calculators.
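The latency distribution given above can be tabulated per pipeline stage to confirm that each stage fits within the 10 ns clock. This is a minimal sketch using only the delays quoted in the text; the grouping of blocks into stages follows paragraph [0057], and the "serial" figure for the third stage is a hypothetical comparison showing what that stage would cost if MPYSTG3A and ADDSTG1A did not overlap.

```python
CLOCK_NS = 10

# Delay in ns of the blocks executed one after another inside each pipeline stage.
# MPYSTG3A (45) and ADDSTG1A (46) overlap, so they count once (3 ns) in stage 3.
STAGES = {
    1: [("MPYSTG1A + MPYSTG1B", 9)],
    2: [("MPYSTG2", 9)],
    3: [("MPYSTG3A in parallel with ADDSTG1A", 3), ("ADDSTG1B", 6)],
    4: [("ADDSTG2", 9)],
    5: [("ADDSTG3A", 3), ("ADDSTG3B", 3)],
}

totals = {n: sum(delay for _, delay in blocks) for n, blocks in STAGES.items()}

# Hypothetical: stage 3 if the two 3 ns conversions ran one after the other.
serial_stage3 = 3 + 3 + 6

# Margin of the final stage relative to the longest stage (transmission delay budget).
final_margin = max(totals.values()) - totals[5]

for n, t in totals.items():
    print(f"stage {n}: {t} ns, fits in clock: {t <= CLOCK_NS}")
print(f"stage 3 without overlap: {serial_stage3} ns")
print(f"final-stage margin: {final_margin} ns")
```

Every stage stays at or below 9 ns only because the two conversion steps in the third stage overlap, which is the point of the fusion means described in paragraph [0067].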
[0072]
From the above, the MAC arithmetic unit according to the present invention can be configured with a five-stage pipeline. Compared with a simple combination of a general-purpose multiplier and adder, the number of pipeline stages is reduced by one to two, and the total number of gates is reduced by about 15 to 20%.
[0073]
Next, a solution for the synchronization problem related to external input / output shown in (c) above will be described.
Here, the term “external” mainly refers to an exchange with the master processor.
[0074]
First, for setting data in the register unit b (15), which holds the parameters given to the function processing unit (corresponding to the real number parameters a0 to a9 in FIGS. 3 and 5), the present invention adopts the following register configuration and method.
[0075]
(A) The clock signal that operates the master processor and the reference clock signal (frequency f = 100 MHz or more) of the beam scanning control unit should be regarded as asynchronous. For the master processor to access the registers in the beam scanning control unit freely, the access determination signal from the master processor must be synchronized with the reference clock.
[0076]
As shown in FIG. 8, the write command (/CPUWT) from the master processor is shifted through a flip-flop circuit of two or more stages clocked by the clock signal CLK (frequency f = 100 MHz or more). This synchronizes the asynchronous signal and generates the signal /WTa.
[0077]
Further, if the write signal /WTa is shifted by one or more stages to generate the write signal /WTb, then extracting the condition in which /WTa = Hi and /WTb = Lo yields a write signal WTE synchronized with the clock signal CLK. For example, if the data from the master processor is latched in the latch register A (51) shown in FIG. 7 in response to the write signal WTE (53), the latched data LDATA-A (80) can be output in synchronization with the clock signal CLK.
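The generation of WTE can be illustrated with a cycle-by-cycle model. This is a sketch under stated assumptions: the asynchronous strobe /CPUWT is represented only by its value at each rising edge of CLK, metastability is not modeled, and the function name `synchronize_write` and its parameters are hypothetical. Signals with a leading slash are active-low, so 0 means asserted.

```python
def synchronize_write(cpuwt_n, sync_stages=2, delay_stages=1):
    """Return the WTE value at each CLK edge.

    cpuwt_n      : samples (0 or 1) of the active-low /CPUWT at each CLK edge
    sync_stages  : flip-flops used to produce /WTa (two or more in the text)
    delay_stages : further flip-flops used to produce /WTb from /WTa
    """
    chain = [1] * (sync_stages + delay_stages)  # all flip-flops idle at Hi
    wte = []
    for sample in cpuwt_n:
        chain = [sample] + chain[:-1]           # shift on the clock edge
        wta_n = chain[sync_stages - 1]          # synchronized strobe /WTa
        wtb_n = chain[-1]                       # delayed copy /WTb
        wte.append(1 if (wta_n == 1 and wtb_n == 0) else 0)
    return wte


# /CPUWT is Lo for four clocks, then returns to Hi.
strobe = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(synchronize_write(strobe))
```

In this model WTE is asserted for exactly one clock (with one delay stage) shortly after the write strobe is released, regardless of how long the master processor held /CPUWT low.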
[0078]
(B) There are cases in which parameters changed in advance should take effect together at a given timing (for example, at the beginning of the sampling period) and be given to the function calculation unit 12 all at once. For this purpose, as shown in FIG. 7, another latch register B (52) is provided in the stage after the latch register A (51), and the contents of the latch register A (51) are copied to the latch register B (52) in response to a signal (REPTRG) indicating the timing of the simultaneous change.
[0079]
If the REPTRG signal is connected in common to the latch registers B (52) of the corresponding register group, the contents of the register group can be changed simultaneously at an appropriate timing. In this case, LDATA-B is used as the output.
[0080]
Note that the REPTRG signal may be generated in synchronization with the clock signal CLK from the access control signals (/CPUWT, /CPURD) of the master processor unit 16, using an asynchronous signal synchronization method similar to that used to generate the write signal WTE. Alternatively, it may be generated from an external replace command synchronized with the clock signal CLK.
[0081]
(C) In the register configuration shown in FIG. 7, the latch register A (51) and the latch register B (52) are configured using gate latch circuits. A gate latch passes the data at its D input through to its Q output (LDATA-A (80), LDATA-B (81)) while the signal applied to its G input (here, the write signal WTE 53 or REPTRG 54) is at the Hi level, and latches and holds the D input data at the moment the G input transitions to the Lo level. Using gate latch circuits roughly halves the number of gates compared with flip-flop circuits, which is also advantageous in terms of power consumption.
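The two-level register arrangement of FIG. 7 can be sketched behaviorally as follows. The class names `GateLatch` and `ParameterRegister` are hypothetical, and the model is deliberately simplified: it evaluates each latch once per step and ignores propagation delay.

```python
class GateLatch:
    """Transparent latch: Q follows D while G is Hi and holds when G is Lo."""

    def __init__(self, initial=0):
        self.q = initial

    def update(self, d, g):
        if g:
            self.q = d
        return self.q


class ParameterRegister:
    """Latch register A (written by WTE) followed by latch register B (REPTRG)."""

    def __init__(self):
        self.latch_a = GateLatch()
        self.latch_b = GateLatch()

    def step(self, cpu_data, wte, reptrg):
        ldata_a = self.latch_a.update(cpu_data, wte)
        ldata_b = self.latch_b.update(ldata_a, reptrg)
        return ldata_a, ldata_b


# Three parameter registers are written one at a time, then replaced together.
regs = [ParameterRegister() for _ in range(3)]
new_values = [11, 22, 33]
for i, value in enumerate(new_values):
    for j, reg in enumerate(regs):
        reg.step(cpu_data=value, wte=(i == j), reptrg=False)
before = [reg.latch_b.q for reg in regs]

for reg in regs:                      # common REPTRG pulse
    reg.step(cpu_data=0, wte=False, reptrg=True)
after = [reg.latch_b.q for reg in regs]

print(before, after)
```

The outputs LDATA-B seen by the function calculation unit keep their old values while the master processor writes the new parameters one by one, and all of them change together on the common REPTRG pulse.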
[0082]
Next, a synchronization means for reading a data group synchronized with the clock signal CLK (frequency f = 100 MHz or more) in the beam scanning control unit including the function calculation unit to the master processor side will be described.
[0083]
A) As shown in FIG. 8, the read command (/CPURD) generated by the master processor unit 16 is synchronized with the clock signal CLK by the same synchronization means used to generate the write signal /WTa, and the read signal /RDa is generated.
[0084]
B) A latch register 55 for latching internal register data, shown in FIG. 9, is provided. At the rising timing of the generated read signal /RDa (56), the internal data 58, which is synchronized with the clock signal CLK and selected through the multiplexer MUX57 by the selection signal SEL59, is latched in the latch register 55.
[0085]
As a result, the desired internal data can be presented correctly to the master processor unit 16 for a period from about 1 CLK or more before the read signal /RDa falls until at least 1 CLK or more before /CPURD rises (ends). The master processor need only read this presented data.
[0086]
Note that, in general, an address signal from the master processor unit 16, or a signal derived from it, may be used as the selection signal SEL59 that switches the multiplexer MUX57 and supplies the desired internal data to the latch register 55.
[0087]
Next, the control information output unit 13 that converts the result from the function calculation unit 12 into highly accurate analog information and outputs it at a rate of 100 MHz or higher will be described.
[0088]
FIG. 10 shows means for converting to high-precision analog information at a frequency f = 100 MHz or higher.
In FIG. 10, FI 60 is an arithmetic unit that converts floating point data (a real number) into an integer value (32 bits). MUXH 61, MUXL 62, and MUXA 63 are multiplexers that, in response to the selection signals SELH 64, SELL 65, and SELA 66 respectively, each select 16 bits from the upper 20 bits of the 32-bit data output from the arithmetic unit FI 60.
[0089]
The outputs of the multiplexers MUXH61 and MUXL62 pass through the pipeline registers 67 and 68, which are composed of flip-flop circuits FF, and then through the FM circuits 69 and 70, which are logic that converts the data into the input format of the DAC (digital-to-analog converter), such as straight binary, offset binary, or complementary, by inverting the MSB and other bits. They then pass through the pipeline registers 71 and 72 and are input to DACH73 and DACL74, respectively, which are D/A converters with a sampling frequency performance of 100 MHz or more.
[0090]
On the other hand, the output of the MUXA 63 is given to the address input of the memory unit 76 via the pipeline register 75, and the corresponding data is output from the memory unit 76. Then, the output data from the memory unit 76 is input to the DACADJ 78 (correction DAC) having a sampling frequency performance of 100 MHz or higher after passing through the pipeline register 77.
[0091]
In the above-described example, the analog outputs from the DACH 73 and the DACL 74 are added in an analog manner to obtain an analog output having a maximum 32-bit resolution level. However, sufficient accuracy cannot be obtained unless the nonlinearity of the DAC, the reference offset error, and the like are corrected. Therefore, the correction addition value is output by the DACADJ 78 with the focus on the correction of the DACH portion that is a bottleneck in accuracy.
[0092]
The correction addition value may be obtained by measuring the sum of the outputs of DACH 73 and DACL 74 in advance with a high-precision voltage measuring instrument and storing the resulting error correction amount in the memory unit 76 beforehand by the memory writing means 79. The correction addition value can also correct analog distortion if the memory unit holds a numerical value corresponding to the inverse function of the dynamic distortion of the amplifier unit. Therefore, if the analog outputs of DACH, DACL, and DACADJ are added in an analog manner before use, highly accurate analog information can be output.
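The three-converter scheme can be illustrated with an idealized integer model. Everything specific in this sketch is an assumption made for the demonstration: the field widths are reduced to 8 bits each so that the table stays small, the DACH error is an arbitrary made-up function, voltages are expressed in units of one low-order LSB, and the correction memory is addressed directly by the high-order field.

```python
HIGH_BITS = 8  # assumed demo width (the text selects 16-bit fields)
LOW_BITS = 8


def dach_error(high):
    """Hypothetical static error of DACH for a given input code, in low-order LSBs."""
    return ((high * 37) % 11) - 5


def build_correction_table():
    """Contents of the correction memory: the negated, previously measured error."""
    return [-dach_error(h) for h in range(1 << HIGH_BITS)]


def analog_sum(code, table):
    """Sum of the DACH, DACL, and DACADJ outputs for one digital input code."""
    high = code >> LOW_BITS                 # field routed to DACH and to the memory address
    low = code & ((1 << LOW_BITS) - 1)      # field routed to DACL
    v_high = (high << LOW_BITS) + dach_error(high)  # DACH output including its error
    v_low = low                                      # DACL output, treated as ideal
    v_adj = table[high]                              # DACADJ output from the memory
    return v_high + v_low + v_adj


table = build_correction_table()
uncorrected = [0] * (1 << HIGH_BITS)
worst_without = max(abs(analog_sum(c, uncorrected) - c) for c in range(1 << 16))
worst_with = max(abs(analog_sum(c, table) - c) for c in range(1 << 16))
print(worst_without, worst_with)
```

In this model the summed output matches the input code exactly once the correction is applied, whereas without it the high-order converter's error of several low-order LSBs appears directly in the output. A real system would also have to match the output timing of the three paths, which the pipeline registers of FIG. 10 provide.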
[0093]
[Effects of the invention]
Since the present invention is configured as described above, the following effects are obtained. The digital arithmetic processing unit uses a MAC arithmetic unit in which a pipelined real number multiplier and a real number adder are combined into one by a fusion means. The final-stage pipeline register, which is the output stage of the real number multiplier, and the first-stage pipeline register, which is the input stage of the real number adder, are arranged at the same level, and the final-stage processing of the real number multiplier and part of the first-stage processing of the real number adder are operated in parallel.
[0094]
As a result, since the output stage of the real number multiplier and the input stage of the real number adder are at the same level, their processing can proceed in parallel. No stage for conversion into the IEEE format needs to be provided after the final stage of the real number multiplier, and data can be passed to the real number adder at the next stage in the internal format.
[0095]
Therefore, it is not necessary to execute the conversion stage from the IEEE format on the result from the multiplier at the first stage of the real number adder. As a result, the number of pipeline stages and the number of transistors can be reduced, high integration can be achieved without enlarging or complicating the circuit configuration and without the accompanying heat generation, and high-speed, high-precision digital arithmetic processing can be performed.
[0096]
In addition, by providing a processor that processes information changing at low speed with the second period alongside the arithmetic processing unit that processes information changing at high speed with the first period, the control can be made appropriate by separating the low-speed control portion from the high-speed control portion. With this configuration, information can also be exchanged smoothly between the portions operating at the two periods. As a result, arithmetic means capable of high-speed digital arithmetic processing can be used without synchronization problems in the exchange of information, and a digital arithmetic processing device capable of high-speed and high-precision data processing can be realized.
[0097]
Further, means for dividing the digital data into output data composed of continuous bit strings, memory means for storing correction data corresponding to the higher-order output data, and two output data and correction data Three digital-to-analog converters for outputting analog data, means for converting at least two output data into a data format to be supplied to the corresponding digital-to-analog converter, and means for matching the output timings of the two output data and correction data And adding the analog data of the three digital-analog converters to generate a highly accurate analog output.
[0098]
Accordingly, it is possible to realize digital arithmetic processing capable of appropriately converting a digital signal subjected to arithmetic processing into an analog signal with high accuracy and high speed.
[0099]
The digital arithmetic processing device can be applied to a beam scanning type image information capturing device, and can realize a beam scanning type image information capturing device capable of high-speed and highly accurate image capturing processing.
[0100]
Further, in a high-speed digital arithmetic processing apparatus, information from the outside or from the master processor can be taken in in synchronization with a clock frequency of 100 MHz or more, digital processing can proceed in a pipeline with a cycle of 10 ns or less, and the result can be output externally at a cycle of 10 ns or less.
[0101]
In addition, the amount of logic circuitry and the number of pipeline registers in the arithmetic processing unit can be reduced, which reduces the switching power of the transistors and suppresses heat generation while at the same time reducing the latency of arithmetic processing.
[0102]
In addition, there is an effect that information can be smoothly exchanged between the master processor and the high-speed digital arithmetic processing device in synchronization with a high-speed clock.
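The exchange between the master processor and the high-speed unit can be pictured as a two-stage register. The following is a minimal behavioral sketch, not the circuit of the embodiment: the class name TwoStageLatch, its methods, and the explicit "commit" flag are assumptions introduced only for illustration.

```python
class TwoStageLatch:
    """Hypothetical model of a two-stage write register between two clock domains."""

    def __init__(self, n_words):
        self.stage1 = [0] * n_words  # written by the processor, word by word
        self.stage2 = [0] * n_words  # seen by the high-speed arithmetic unit
        self.pending = False         # assumed flag: a group of writes is complete

    def processor_write(self, index, value):
        """First gate: latch one word on the processor's write access."""
        self.stage1[index] = value

    def processor_commit(self):
        """Assumed handshake: the processor marks the whole group as ready."""
        self.pending = True

    def fast_clock_edge(self):
        """Second gate: on a fast-clock edge, hand over all words at once."""
        if self.pending:
            self.stage2 = list(self.stage1)
            self.pending = False
        return self.stage2


latch = TwoStageLatch(3)
latch.processor_write(0, 11)
latch.processor_write(1, 22)
before = latch.fast_clock_edge()   # still the old values: group not complete
latch.processor_write(2, 33)
latch.processor_commit()
after = latch.fast_clock_edge()    # all three new values appear together
```

The point of the second stage is that the high-speed side never observes a partially updated set of values, even though the processor writes them one at a time on its own, slower clock.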
[0103]
In addition, in digital-to-analog conversion, there is an effect that the arithmetic processing result can be output as high-precision digital data at a clock cycle of 10 ns or less, with the error component of the digital-to-analog converter corrected.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of a beam scanning type image information capturing device which is a device requiring the numerical operation system of the present invention.
FIG. 2 is a diagram showing a basic system configuration of a beam scanning control unit in the example of FIG.
FIG. 3 is a diagram illustrating an example of an arithmetic unit configured by a MAC arithmetic unit as a basic arithmetic unit.
FIG. 4 is a diagram illustrating a MAC computing unit having a five-stage pipeline structure.
FIG. 5 is a diagram for explaining the reduction in the number of pipeline stages when the pipelined MAC computing unit of FIG. 4 is applied to the configuration of FIG. 3.
FIG. 6 is a diagram showing the operation division of the MAC computing unit having the pipeline structure shown in FIG. 4.
FIG. 7 is a diagram illustrating a configuration of a write data register that is a synchronization unit at the time of writing.
FIG. 8 is a diagram illustrating synchronization between an access signal from a master processor and a reference clock signal.
FIG. 9 is a diagram illustrating a configuration of a read data register that is a synchronization means at the time of reading.
FIG. 10 is a diagram for explaining means for converting digital data into highly accurate analog information at a rate of 100 MHz or more.
[Explanation of symbols]
1 Beam light source
2 Beam scanning unit
3 Lens part
4 Sample to be detected
5 Stage part
6 Detector
7 Image processing section
8 Beam scanning controller
10 External information input section
11 Control information calculation part
12 Function calculator
13 Control information output section
14 Register part a
15 Register part b
16 Master processor section
17 Stage controller
18 Beam scanning unit
19 Image processing unit
20 Sample to be detected
21 Beam scanning controller
30 Calculator
31-39, 40 MAC computing unit
41 Registers
42 MPYSTG1A
43 MPYSTG1B
44 MPYSTG2
45 MPYSTG3A
46 ADDSTG1A
47 ADDSTG1B
48 ADDSTG2
49 ADDSTG3A
50 ADDSTG3B
51 Latch register A
52 Latch register B
53 WTE
54 RETRRG
80 LDATA-A
81 LDATA-B
55 Latch register
56 / RDa signal
57 MUX
58 Internal data
59 SEL
60 FI
61 MUXH
62 MUXL
63 MUXA
64 SELH
65 SELL
66 SELA
67, 68, 71 Pipeline register
72, 75, 77 Pipeline register
69, 70 FM
73 DACH
74 DACL
76 Memory units
78 DACADJ
79 Memory writing means
80-87 Pipeline registers

Claims

In a digital arithmetic processing device that performs digital arithmetic processing in synchronization with a high-speed clock signal cycle,
A real multiplier having a first stage that converts input data into an internal format, a stage that performs multiplication of the converted input data, and a final stage consisting of a pipe register that converts the multiplication result from the internal format to an output format and outputs it;
A real adder pipelined into a first stage consisting of a pipe register that converts input data into the internal format, a stage that performs addition of the converted input data, and a final stage that converts the addition result from the internal format to the output format and outputs it; and
A MAC arithmetic unit that merges the pipelined real multiplier and real adder into one and performs multiplication and addition in sequence as the basic unit of arithmetic processing,
A digital arithmetic processing device, wherein the pipeline register forming the final stage of the real multiplier and the pipeline register forming the first stage of the real adder are set at the same stage level within the arithmetic unit, and the processing of the final stage of the real multiplier and a part of the processing of the first stage of the real adder are operated in parallel.

In a digital arithmetic processing device that performs digital processing in synchronization with the first high-speed clock signal,
A processor that operates in synchronization with a second clock signal asynchronous with the first high-speed clock signal;
A first-stage latch register having a function of latching data from the processor in response to a first gate signal;
A second-stage latch register having a function of latching data from the first-stage latch register in response to a second gate signal,
A digital arithmetic processing device, wherein the first gate signal is generated based on a write access signal from the processor, the second gate signal is generated based on a signal synchronized with the first high-speed clock signal, and a plurality of data from the processor written in the first-stage latch register are delivered all at once to the second-stage latch register by the second gate signal.

In a digital arithmetic processing device that outputs digital data as analog data in synchronization with a high-speed clock signal cycle,
Means for dividing the digital data into at least two output data composed of continuous bit strings;
Memory means for storing correction data corresponding to the higher-order output data;
At least three digital-to-analog converters that output analog data corresponding to the at least two output data and the correction data;
Means for converting the at least two output data into a data format to be provided to a corresponding digital-to-analog converter;
And means for matching the output timings of the at least two output data and the correction data,
A digital arithmetic processing device, wherein the analog data of the at least three digital-to-analog converters are added.

In a beam scanning type image information capturing device that scans a beam from a beam light source with a beam scanning control unit, irradiates a detected object, and obtains image information of the detected object with an image processing unit,
The beam scanning control unit comprises:
Digital arithmetic processing means that includes a MAC arithmetic unit, performs correction calculation of image distortion resulting from errors in the beam scanning position, and outputs beam control digital data, the MAC arithmetic unit merging into one a real multiplier pipelined into a first stage that converts input data into an internal format, a stage that performs multiplication of the converted input data, and a final stage that converts the multiplication result from the internal format to an output format and outputs it, and a real adder pipelined into a first stage that converts input data into the internal format, a stage that performs addition of the converted input data, and a final stage that converts the addition result from the internal format to the output format and outputs it, the pipeline register forming the final stage of the real multiplier and the pipeline register forming the first stage of the real adder being set at the same stage level within the arithmetic unit, the processing of the final stage of the real multiplier and a part of the processing of the first stage of the real adder being operated in parallel, and the multiplication and addition being carried out in sequence;
A processor that, in synchronization with a second clock signal asynchronous with a first high-speed clock signal, sets the coefficient data of the correction operation in the digital arithmetic processing means;
A first-stage latch register having a function of latching data from the processor in response to a first gate signal generated based on a write access signal from the processor;
A second-stage latch register having a function of simultaneously latching a plurality of data from the processor written in the first-stage latch register by a second gate signal output from the processor and supplying the data to the digital arithmetic processing means;
Means for dividing the digital data from the digital arithmetic processing means into at least two output data composed of continuous bit strings in order to convert the beam control digital data into analog beam scanning control signals;
Memory means for storing correction data corresponding to the higher-order output data;
At least three digital-to-analog converters that output analog data corresponding to the at least two output data and the correction data;
Means for converting the at least two output data into a data format to be provided to a corresponding digital-to-analog converter;
Means for matching the output timings of the at least two output data and the correction data;
Means for adding analog data of the at least three digital-to-analog converters;
A beam scanning type image information capturing device, wherein the beam scanning position is accurately controlled by digital calculation.