JP3895031B2 - Matrix vector multiplier

Info

Publication number: JP3895031B2
Application number: JP02539698A
Authority: JP
Inventors: 井芳朗坪
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-02-06
Filing date: 1998-02-06
Publication date: 2007-03-22
Anticipated expiration: 2018-02-06
Also published as: JPH11224246A

Description

【０００１】
【発明の属する技術分野】
本発明は、行列ベクトル乗算器に係り、特に行列（マトリックス）とベクトルの演算処理を高速かつ低消費電力で行なうことができる行列ベクトル乗算器に関するものである。
【０００２】
【従来の技術】
画像データ圧縮のための基礎的な要素アルゴリズムとして広く用いられているものに、離散コサイン変換（以下、ＤＣＴ― Discrete Cosine Transform―と略記する。）方法およびその逆変換（以下、ＩＤＣＴ―Inverse Discrete Cosine Transform ―と略記する。）方法を用いた帯域圧縮技術がある。上記のＤＣＴやＩＤＣＴはいずれも直交変換の一種であるが、実際の演算においては定数係数のマトリックスとベクトルの乗算であり、画素値のベクトルｘ_n、ＤＣＴ係数のベクトルＸ_kに対し下式で定義される：
【０００３】
【数１】

動画像圧縮の国際標準規格であるＭＰＥＧ（Moving Picture Experts Group）１、ＭＰＥＧ２等においては水平方向・垂直方向に各々８画素ずつの６４画素からなる正方形型の領域（ブロック）を単位としたＮ＝８の２次元ＤＣＴおよびＩＤＣＴが用いられている。
【０００４】
動画像圧縮装置は一般に演算量が非常に大きく、特に実時間処理が必要な利用形態（ＨＤＴＶ放送、ＴＶ電話、映像監視システム等）における動画像圧縮符号化・復号化はその処理系に対して要求される性能が非常に高くなっている。このため、上記離散コサイン変換および逆変換の計算も高速アルゴリズムと専用ハードウェアを用いて処理するのが一般的であり、積和演算の回数を大幅に削減できるチェン（Ｃｈｅｎ）の高速アルゴリズムや、乗算器を用いずに計算を実行できる分散演算（ＤＡ―Distributed Arithmetic―）法などの方式が広範に用いられており、これらを応用した専用ハードウェアの構成としては、例えば文献IEICE Trans. Electron., Vol. E75-C(1992), No.4 pp.390-397 等に開示されている。ここではＤＡ法による離散コサイン変換の計算方法、およびハードウェアの構成方法を簡単に説明する。
【０００５】
まず、上記式（１）のマトリックス−ベクトル乗算の第ｋ行は次のような内積の形に表すことができる：
【０００６】
【数２】

さらに、ハードウェアのデータパスのビット幅をＭとする時、各Ｘ_nは２進数（２の補数表現）で：
【０００７】
【数３】

と表されるので、上記式（２）は内積の線形性を用いると：
【０００８】
【数４】

の形となる。ここで、部分和
【０００９】
【数５】

は入力データ｛ｘ_n｝のｉビット目｛ｂ_n,i｝の関数であるが、予めあらゆるビットパターン（２^N通り）について計算しメモリに格納しておけば、｛ｂ_n,i｝をＮビットのアドレスとみなして読み出すことができる。また、２^i-1の乗算はｉ−１ビットの左シフトに対応していることに注意すると、結局上記式（４）はＮビットアドレスのメモリ参照と左シフト、加算および減算の組み合わせにより求めることができることが分かる。
【００１０】
ＤＡ法は上述の原理を応用したものであり、以下、Ｎ＝８の場合の具体的な乗算器の構成とその手順の一例について図１０ないし図１２を参照しながら説明する。まず、図１０を用いて従来の行列ベクトル乗算器の回路構成について説明する。図１０において、従来の行列ベクトル演算回路は、各行毎に入力ポートを有し、第１の所定数であるｎ列と第２の所定数であるｋ行の縦横方向に０か１の情報が配列されたデータとしてのマトリックス１と、図１１を用いて後述するようなテーブルを格納すると共にアドレス信号により前記マトリックス１のｉビット目毎の行方向から読み出された情報により前記テーブルの対応する項目が読み出される読出し専用メモリ（以下、ＲＯＭ―Read Only Memory―と略記する。）２と、このＲＯＭ２より読出された値を累積加算する累積加算器５と、を備えている。
【００１１】
累積加算器５は、ＲＯＭ２に格納されたテーブルより読み出したビットスライスを所定量、例えば３ビットずつシフトさせるシフタ６と、このシフタ６の出力を順次に累積加算する２入力１出力の加算器７と、加算器７の出力を一時的に蓄積すると共に前記加算器７の他方の入力側にその出力を供給する出力レジスタ８と、を備えている。この従来の演算回路によれば、実質上はシフタ３と加算器４だけでマトリックス・ベクトル演算を行なっているために、簡単な構成で複雑な演算を処理することができ、汎用の演算器回路を適用できる長所がある。
【００１２】
上記構成の従来の行列ベクトル乗算器の動作について説明する。まず、図１１に示すように、全ての８ビットパターン｛ｂ_n｝（２⁸＝２５６通り）についてマトリックスの各行ｋ（ｋ＝０，…，７）に対応する部分和をあらかじめ必要な精度（例えば１６ビット）で計算したテーブルを作成する。図１１に示されるテーブルは、上述したように、図１０に示すＲＯＭ２に予め格納されている。このＲＯＭ２は、８ビットのアドレスで１６ビットの精度を有する場合、その容量は４キロバイトとなる。
【００１３】
次に、８個の入力データ｛ｘ_n；ｎ＝０，…，７｝の各々から２進表現の最下位ビット（以下、ＬＳＢ― Least Significant Bit―と略記する。）を取り出して、この８ビット｛ｂ_n,0｝をアドレス信号として前記ＲＯＭ２から読出した部分和のデータを各行ごとに累積加算器５により積算する。続いて、下から２番目のビット｛ｂ_n,1｝をアドレスとして部分和のデータをＲＯＭ２から読み出した後、１ビット左シフトを行ない（これが２を乗算することに相当している）、累積加算器５により積算する。以下同様に、図１１に示すように、下からｉ番目のビットについて部分和をｉ−１ビット左シフトして積算する動作｛図１２（ａ）（ｂ）参照｝を最上位ビット（以下、ＭＳＢ―Most Significant Bit―と略記する。）の直前まで繰り返すことにより、上記式（４）の第２項が累積加算器に積算される。ＭＳＢについては符号ビットであるため、｛ｂ_n,M-1｝のアドレスで部分和を読み出して、これを「Ｍ−１」ビットだけ左へシフトした後、符号を反転してから積算を行なうと上記式（４）の第１項が加算されて、式（４）の解が求められることになる。
【００１４】
この方法を用いると、乗算専用のハードウェアを用意することなく図１０に示すようにな簡単な汎用の演算回路により行列（マトリックス）とベクトルデータとの積を簡単に計算することができるため、ハードウェア量を節約して高速に処理をする目的には適している。
【００１５】
【発明が解決しようとする課題】
しかしながら、上述した従来の行列ベクトル乗算器によれば、入力データの性質や内容の相関に関係なく、その演算量やハードウェアの稼働率は常に一定であるため、例えば「０」を連続的に演算する等のように、自明な計算や不必要な計算についても累積加算器５を動作させなければならず、演算効率が良好でないという問題があった。
【００１６】
また、良く知られているように、ＤＡ法においてはどのような性質を持ったデータ列に対しても常に同じ動作を行なうことになり、上述の例のように不要な計算や自明な計算を繰り返すことに起因するスイッチング動作の機会が増加し、このため効率が悪いばかりでなく、消費電力の増加を招くという問題もあった。さらに、メモリへのアクセスが入力データの各桁をアドレスとしたランダムアクセスとなるので、メモリに高度な機能が要求されることになり、マトリックス１を読み込むＲＯＭ２の構造が複雑になってしまうという問題点もあった。
【００１７】
本発明の目的は、画像処理等に頻繁に現れる定数成分の行列（マトリックス）とベクトル変数の乗算を行なう演算回路を、汎用の乗算器を用いるのではなく、並列に動作する加算器およびシフタと、マトリックスの成分を保持する読み出し専用メモリとを用いて構成することにより、不必要な演算を省略して演算効率の改善を図ると共に消費電力の上昇を抑え、更にマトリックス回路の構造が簡単なもので済む行列ベクトル乗算器を提供することにある。
【００１８】
【課題を解決するための手段】
上記目的を達成するため、本発明に係る行列ベクトル乗算器は、第１の所定数より成る列成分と第２の所定数より成る行成分により構成される行列データを符号部分と絶対値部分とにより表現して順次蓄積する行列データ蓄積手段と、前記行列データ蓄積手段に蓄積された行列データの特定の列成分の列番号を指定するアドレス制御信号を出力すると共に前記行列データの前記符号部分を符号制御信号として出力する符号制御部と、前記アドレス制御信号に対応する前記行列データの前記絶対値部分を加算制御信号として出力する加算制御部と、を含む演算制御手段と、前記行列データの列成分に対応する係数データとその反対符号のデータとを予め符号部分と絶対値部分とにより表現して格納すると共に前記符号制御部より出力される前記アドレス制御信号および前記符号制御信号に基づいて対応する列成分のデータを順次出力する読出し専用記憶手段と、前記行列データの前記行成分ごとに設けられて前記読出し専用記憶手段より供給された前記列成分の符号に対応する前記係数データを一時的に蓄積すると共に前記符号制御信号および加算制御信号に基づいて前記アドレス制御信号の１周期ごとに前記係数データの前記絶対値部分を所定量ずつ所定方向に移動させる複数の入力蓄積部と、前記加算制御信号に応じて前記読出し専用記憶手段からのデータの加算・非加算を切り換える複数の加算部と、前記複数の加算部の各々の積算値を一時的に蓄積すると共にこの積算値を前記所定量ずつ前記所定方向に移動させる複数の出力蓄積部と、より各々が構成される複数の累積加算回路より成る累積加算手段と、を備えることを特徴としている。
【００１９】
また、請求項２に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算手段の入力蓄積部が前記符号制御信号および加算制御信号に基づいて前記行列データの前記絶対値部分を１ビットずつ左へシフトさせるシフト機能を備えた入力レジスタにより構成され、前記出力蓄積部が前記符号制御信号および加算制御信号に基づいて前記加算部の積算値を１ビットずつ左にシフトさせて出力する出力レジスタにより構成されていることを特徴としている。
【００２０】
また、請求項３に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算手段の入力蓄積部が前記符号制御信号および加算制御信号に基づいて前記行列データの前記絶対値部分を１ビットずつ右へシフトさせるシフト機能を備えた入力レジスタにより構成され、前記出力蓄積部が前記符号制御信号および加算制御信号に基づいて前記加算部の積算値を１ビットずつ右にシフトさせて出力する出力レジスタにより構成されていることを特徴としている。
【００２１】
また、請求項４に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算手段の前記累積加算回路が前記行列データの列数である前記第１の所定数と同数設けられ、前記読出し専用記憶手段は前記行列データの前記第１の所定数と同数の行成分データを同時に出力し、前記累積加算回路が並列に動作して前記読出し専用記憶手段からの行成分のデータを積算することを特徴としている。
【００２２】
また、請求項５に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算回路が、前記行列データ蓄積手段からの前記加算制御信号およびアドレス制御信号の組合せにより、内容を０に設定するリセットモード，現在の値を更新する通常モード，現在の値を保持するホールドモード，および入力を１ビットシフトして内部に設定するシフトモードのそれぞれのモードを切り換える機能を有する１対の前記入力蓄積部および出力蓄積部が一方の入力と出力とに接続され、さらに前記出力蓄積部の出力が分岐して他方の入力に接続されるように構成されていることを特徴としている。
【００２３】
また、請求項６に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記演算制御手段の前記符号制御部が、クロック信号ごとにその内容を１つずつ積算するカウンタにより構成されていることを特徴としている。
【００２４】
さらに、請求項７に係る行列ベクトル乗算器は、請求項６に記載の乗算器において、前記演算制御手段の符号制御部は、前記カウンタの前記クロック信号の下位側の複数のビットにより前記アドレス制御信号を発生させることを特徴としている。
【００２５】
請求項８に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算手段は、前記行列データ蓄積手段の蓄積する行列データの特定位のビット列を前記加算制御信号として用いて、前記アドレス制御信号が１周期を経過するごとに前記加算制御信号として用いる前記ビット列の位置を１ビット右シフトすることを特徴としている。
【００２６】
さらに、請求項９に係る行列ベクトル乗算器は、請求項８に記載の乗算器において、前記累積加算手段が、前記入力蓄積部の入力データビット幅分だけ左シフトした位置に、前記読出し専用記憶手段から読出されたデータの符号拡張を行なってから保持するように構成されていることを特徴としている。
【００２７】
また、請求項１０に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記演算制御手段の前記加算制御部が、前記行列データ蓄積手段の任意の桁からなるビット列を選択して、これを前記加算制御信号である選択信号として送出する機能を備えるセレクタより構成されていることを特徴としている。
【００２８】
また、請求項１１に係る行列ベクトル乗算器は、請求項１０に記載の乗算器において、前記演算制御手段の前記符号制御部は、クロック信号ごとにその内容を１つずつ積算するカウンタにより構成され、このカウンタの上位側の複数のビットを前記セレクタの選択信号として前記累積加算手段が用いることを特徴としている。
また、請求項１２に係る行列ベクトル乗算器は、請求項１に記載の乗算器において、前記累積加算手段は、前記行列データ蓄積手段の蓄積する行列データの特定位のビット列を前記加算制御信号として用いて、前記アドレス制御信号が１周期を経過するごとに前記加算制御信号として用いる前記ビット列の位置を１ビット左シフトすることを特徴としている。
また、請求項１３に係る行列ベクトル乗算器は、請求項１２に記載の乗算器において、前記累積加算手段は、前記行列データ蓄積手段の入力データビット幅分だけ右シフト位置に、前記読出し専用記憶手段から読出されたデータの符号拡張を行なってから保持するように構成されていることを特徴としている。
【００２９】
したがって、本発明に係る行列ベクトル乗算器によれば、Ｎ成分からなる入力データの各成分を上位の桁よりビットごとに取り出して、得られたビット列（Ｎ bit）を制御信号として加算器の動作モードを順次切り替え、ＲＯＭから読み出された各ビットに対応する係数データを積算する。この際、制御ビットが１ならばレジスタに保持されたデータを加算し、０ならばレジスタをホールドするようにモードを切り替えることにより、不必要な加算動作をレジスタの更新によるスイッチングを減らすことができ、低消費電力動作が可能となる。Ｎ個の係数データを処理したら結果を１ビット左シフトし、入力データの次の桁について同様の操作を繰り返す。以下同様に全桁を処理するまで繰り返すことになる。
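[0030]
To perform the matrix multiplication correctly with the procedure above, the input data of the cumulative addition means uses a sign-magnitude representation rather than a two's complement representation. In moving picture compression with motion compensation, such as MPEG, difference-image data such as P-pictures is highly likely to be concentrated around 0, so expressing it in sign-magnitude form yields many zeros in the upper bits. For a matrix in which zeros appear consecutively, placing the operation mode in the hold state therefore reduces the number of switching events even further.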
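[0031]
[Embodiments of the Invention]
Embodiments of the matrix vector multiplier according to the present invention are described in detail below with reference to the drawings. First, the configuration of a matrix vector multiplier according to a first embodiment, which embodies the basic principle of the invention, is described with reference to FIG. 1.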
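[0032]
In FIG. 1, the matrix vector multiplier 10 according to the first embodiment comprises: matrix data storage means 11, which sequentially stores matrix data composed of n column components (a first predetermined number) and k row components (a second predetermined number), expressed as a sign part and an absolute value part; operation control means 12, which includes a sign control unit 13 that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the sign part of the matrix data as a sign control signal, and an addition control unit 14 that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal; read-only storage means (hereinafter, ROM) 15, which stores in advance the coefficient data corresponding to the column components of the matrix data together with the data of opposite sign, expressed as a sign part and an absolute value part, and sequentially outputs the data of the corresponding column component based on the address control signal and the sign control signal output from the sign control unit 13; and cumulative addition means 20, which has a plurality of cumulative addition circuits 21a to 21n, n in number, equal to the first predetermined number n of the matrix data.
[0033]
The cumulative addition means 20 has a plurality of cumulative addition circuits 21a to 21n connected in parallel. Each of the cumulative addition circuits 21a to 21n comprises: an input storage unit 23a to 23n, provided for each row component of the matrix data, which temporarily holds the coefficient data supplied from the ROM 15 with the sign corresponding to the column component, and which moves the absolute value part of the coefficient data by a predetermined amount in a predetermined direction once per period of the address control signal, based on the sign control signal and the addition control signal; an addition unit 24a to 24n, which switches between adding and not adding the data from the ROM 15 according to the addition control signal input to the corresponding input storage unit 23a to 23n; and an output storage unit 25a to 25n, which temporarily holds the accumulated value of the corresponding addition unit 24a to 24n and moves this accumulated value by the predetermined amount in the predetermined direction.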
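[0034]
The operation of the matrix vector multiplier according to the first embodiment is not described in detail here, since it is covered by the description of the operation of the second to fourth embodiments below.
[0035]
Next, a matrix vector multiplier according to a second embodiment of the present invention is described in detail with reference to FIGS. 2 to 6. The second embodiment is a more concrete form of the multiplier configuration of the first embodiment.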
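[0036]
The multiplier according to the second embodiment consists of N M-bit registers that store N input data in sign-magnitude form; an N:1 selector that designates one of the N bits taken from digit M−1 of that register group; a ROM that stores the matrix component data and can be read column by column; cumulative adders, N of which operate in parallel to accumulate the data from the ROM; and a counter that generates the signal designating the matrix column number. The algorithm, configuration, function, and effect are described below for the case N = 8 and M = 12. In terms of the configuration of the multiplier 10 of the first embodiment, the first predetermined number is 8 and the second predetermined number is 12.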
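[0037]
First, the algorithm of the multiplier according to the second embodiment is described. Using a sign-magnitude representation with a bit width of 12, the input data x_n can be written as:
[0038]
[Expression 6]

so that, using linearity, the inner product calculation of equation (2) can be written as:
[0039]
[Expression 7]

Since b_i = 0 or 1, equation (6) is ultimately obtained, for each row k of the matrix, by a combination of shift operations and additions of +a_{k,n}, −a_{k,n}, or 0.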
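The two expression images referred to here are not present in this text. The following is a reconstruction from the surrounding definitions rather than the original typesetting. It assumes a sign bit s_n and 0-based magnitude bits b_{n,0} to b_{n,10}, matching the use of b_{n,10} for the magnitude MSB in the next paragraph; the symbol s_n is an assumed name.

```latex
\begin{align}
x_n &= (-1)^{s_n} \sum_{i=0}^{10} b_{n,i}\, 2^{i}, \qquad s_n, b_{n,i} \in \{0, 1\} \\
X_k &= \sum_{n=0}^{7} a_{k,n}\, x_n
     = \sum_{i=0}^{10} 2^{i} \sum_{n=0}^{7} b_{n,i}\, (-1)^{s_n} a_{k,n}
\end{align}
```

In the second line, each inner term is +a_{k,n}, −a_{k,n}, or 0, which is what allows the sum to be formed with additions and shifts only.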
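[0040]
Specifically, a read-only memory storing the data of each matrix component and the data of opposite sign is prepared first, and +a_{k,n} or −a_{k,n} is read from this memory in the order n = 0, ... according to the sign of x_n. The eight input data x_n are held in sign-magnitude form. For the bit string {b_{n,10}; n = 0, ..., 7} taken from the most significant bit of the absolute value part (digit 10), the corresponding +a_{k,n} (or −a_{k,n}) is accumulated by the cumulative adder only when b_{n,10} = 1; when it is 0, no operation is performed. After n = 0, ..., 7 have been processed, the accumulated result is shifted left by 1 bit, and the same accumulation of +a_{k,n} (or −a_{k,n}) is repeated for the next bit of the absolute value part (digit 9). Repeating this process down to the least significant bit yields equation (6).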
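The procedure above can be sketched in software as follows. This is a minimal behavioural model under stated assumptions, not the patented circuit: the function name `matvec_msb_first`, the integer coefficients, and the `adds`/`holds` counters are illustrative, and the accumulator is shifted before each digit rather than after it, which avoids the extra factor of 2 that the hardware description later compensates for.

```python
def matvec_msb_first(a, xs, mag_bits=11):
    """Return (result, adds, holds) for the product a * xs using shift-and-add.

    a  : list of rows of integer coefficients (stands in for the ROM contents)
    xs : input integers with |x| < 2**mag_bits (sign-magnitude range)
    """
    acc = [0] * len(a)
    adds = holds = 0
    for i in range(mag_bits - 1, -1, -1):      # magnitude MSB first
        acc = [v << 1 for v in acc]            # shift the accumulated result left
        for n, x in enumerate(xs):
            if (abs(x) >> i) & 1:              # control bit 1: add +/- column n
                sign = -1 if x < 0 else 1
                for k, row in enumerate(a):
                    acc[k] += sign * row[n]
                adds += 1
            else:                              # control bit 0: nothing is updated
                holds += 1
    return acc, adds, holds


# Hypothetical 8x8 integer matrix and a mostly-zero input vector.
a = [[(3 * k + 5 * n) % 7 - 3 for n in range(8)] for k in range(8)]
xs = [3, -2, 0, 0, 1, 0, -1, 0]
result, adds, holds = matvec_msb_first(a, xs)
expected = [sum(row[n] * xs[n] for n in range(8)) for row in a]
```

For inputs clustered around zero, most of the 88 column steps fall into the hold branch, which is the property the power-saving argument relies on.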
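[0041]
In the above algorithm, the sign attached to a_{k,n} is determined for each column n by the sign of x_n (independently of k). Therefore, if the eight data of the n-th column of the matrix are stored in memory as one set and the data of opposite sign as another set, and one of the two sets is referenced as a whole for each column, the calculations for k = 0, ..., 7 can be performed in parallel.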
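[0042]
Next, the overall configuration of the multiplier according to the second embodiment is described with reference to FIG. 2. The multiplier comprises: a matrix 1, i.e., data forming a matrix of n columns and M rows (the significant data being k rows); operation control means 12 that controls the cumulative addition for each component of the matrix 1; a ROM 15 serving as read-only storage means that stores the data needed for multiplication as a table and outputs the required data in response to control signals from the operation control means; and cumulative addition means 20 that performs the multiplication based on the ROM 15 and the control signals from the operation control means 12 and has a plurality (n) of cumulative addition circuits 21a to 21n arranged in parallel. The operation control means 12 includes a counter 13 functioning as the sign control unit and a selector 14 functioning as the addition control unit; the counter 13 also functions as the address designating means 11. Here, in0 to in7 are input registers that store the respective column components of the matrix data storage means 11, and Acc0 to Acc7 are the row components processed by the individual cumulative addition circuits 21a to 21n constituting the cumulative addition means 20. In this second embodiment, processing starts from the most significant bit.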
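[0043]
To describe the above configuration in more detail: first, a total of 128 data items, namely the 8 × 8 matrix components and their sign-inverted counterparts (i.e., expressed in two's complement), are prepared with a bit width matching the required precision. The required bit width depends on the application; for MPEG1 or MPEG2, for example, a precision of about 16 bits is required. These data can be stored in a read-only memory, but because the present invention involves no random access to memory by address reference, a sequencer built from combinational circuits or the like may be used instead. The term ROM 15 is used here as a generic name for these various functional configurations.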
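[0044]
As shown in FIG. 3, the ROM 15 is configured to output simultaneously, eight at a time (128 bits if the precision is 16 bits), the data of the designated sign for the designated column, according to a signal designating the matrix column number (0 to 7) and a signal S1 (SIGN) designating the sign of each column. Each datum in the output of the ROM 15 is supplied to one of the eight cumulative addition circuits 21a to 21n, which are provided one per matrix row and operate in parallel. As shown in FIG. 4, each of the cumulative addition circuits 21a to 21n is essentially a 2-input, 1-output adder 24 equipped with one input register 23, to which the ROM data D1 is supplied, and one output register 25 for accumulation, connected to the output port of the adder 24; the signal line from the output port of the adder 24 passes through the output register 25 and then branches in two, one branch being connected to the other input port of the adder 24.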
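[0045]
As shown in FIG. 5, each of the registers 23 and 25 of the adder 24 has a normal mode, in which the input value is latched in synchronization with the clock signal; a hold mode, in which the previous value is retained regardless of the input value; a reset mode, in which the contents are reset to 0; and a shift mode, in which the input is shifted left by 1 bit and then latched. The operation mode is switched according to the combination of two control signals: signal S2 (SHIFT) as the sign control signal and signal S4 (EXEC) as the addition control signal.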
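[0046]
When the ROM data is 16 bits, the input data is 12 bits (M = 12), and there are 8 input components (N = 8), the adder requires a bit width of 16 + 11 + 3 bits; in this second embodiment, one more bit is provided for a reason described later. The eight input data are normally given in two's complement form, and they are converted to sign-magnitude form before being stored in the input data registers. This can be done by a well-known method: only when the most significant bit of the input data (MSB; digit 12) is 1, invert the bits of digits 1 to 11, add 1 to the result in an 11-bit width, and concatenate it with the most significant bit again. When the MSB is 0, no operation is needed. The MSB of each datum is supplied to the ROM 15 as the sign signal S1 (SIGN) and determines the sign of the output data of the corresponding column.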
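The conversion described above can be illustrated with a short sketch. The function name `twos_to_sign_magnitude` and the `width` parameter are assumptions made for the example; the steps (invert digits 1 to 11, add 1 in an 11-bit width, rejoin with the MSB) follow the text.

```python
def twos_to_sign_magnitude(word, width=12):
    """Convert a `width`-bit two's complement word to sign-magnitude form."""
    mag_mask = (1 << (width - 1)) - 1          # digits 1..11 for width = 12
    sign = (word >> (width - 1)) & 1           # MSB (digit 12)
    low = word & mag_mask
    if sign:
        # Invert the lower digits, then add 1 within the 11-bit width.
        low = ((low ^ mag_mask) + 1) & mag_mask
    return (sign << (width - 1)) | low         # rejoin with the sign bit


minus_five = twos_to_sign_magnitude(-5 & 0xFFF)   # 0xFFB in two's complement
plus_five = twos_to_sign_magnitude(5)
```

Note that the most negative two's complement value (−2048 for 12 bits) has no 12-bit sign-magnitude counterpart; in this sketch it maps to a zero magnitude with the sign bit set, so a real design would need to exclude or saturate that input.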
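[0047]
The 8-bit bit string (bit slice) {b_n; n = 0, ..., 7} taken from the top digit (digit 11) of the absolute value part (digits 1 to 11) of the input data registers is selected by an 8:1 selector, and the selector output is used as the EXEC signal of the input-side and output-side registers of each cumulative adder to switch between normal mode (EXEC = 1) and hold mode (EXEC = 0). The input data registers also have a shifter: on receiving SHIFT = 1, the absolute value part is shifted left by 1 bit at the next clock.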
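[0048]
The output signal S3 of the lower 3 bits of the counter (7 bits wide) takes values 0 to 7 and is used both as the signal S1 designating the column number of the ROM 15 and as the selection signal S4 of the selector. That is, the output data of the ROM is determined by the sign bit of the input register and the counter value, and the operation mode of the cumulative adder is determined in synchronization with it. SHIFT = 1 is sent to the cumulative adders and the input data registers only when the lower 3 bits of the counter are 000. The upper 4 bits of the counter are used to determine up to which digit of the input data has been processed.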
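[0049]
Next, the operation of the cumulative adder in the multiplier according to the second embodiment is described with reference to FIG. 5. Before processing, the counter and the input and output registers of the cumulative adders are reset, and the input data is converted to sign-magnitude form and stored in the input registers in advance. When processing starts, the ROM data corresponding to the counter value and the sign is supplied in sequence to the cumulative addition circuits 21a to 21n, and the selector 14 sends the 11th bits b₀, ..., b₇ of the input registers 23a to 23n, in this order according to the value of the counter 13, to the adders 24a to 24n as the EXEC signal. For example, when the lower 3 bits of the counter 13 are 000, the 0th column components of the matrix are passed to the input-side registers of the cumulative adders with a positive or negative sign according to the sign of the 0th component of the input data. If EXEC = 1 at this time (the most significant bit of the absolute value of the 0th component of the input data is 1), the register 23 of the cumulative addition circuit 21 is in normal mode, so at the next clock the data from the ROM 15 is sign-extended and set in the input register 23 of the cumulative addition circuit 21, and the sum of this value and the contents of the output register 25 appears at the output port. If EXEC = 1, this value is set in the output register 25 at the following clock. If EXEC = 0, on the other hand, both registers 23 and 25 of the cumulative addition circuit 21 are in the hold state: the data from the ROM 15 is not set in the input register 23, and the value at the output port of the adder 24 is not set in the output register 25. As shown in FIG. 6, the adder 24 therefore keeps its previous state and performs no switching.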
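[0050]
When the above operation has been performed eight times and the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adders, and at the next clock the data at the output port is shifted left by 1 bit (equivalent to multiplying by 2) and set in the output register 25. At the same time, the absolute value part of the input register 23 is also shifted left by 1 bit, so that digit 10 of the input data moves into digit 11 of the input register 23. The above procedure is then repeated for every digit of the absolute value part (11 times). Because the result is multiplied by 2 each time the output port is shifted, the final result is the sum in which the partial sum for digit 11 is multiplied by 2¹¹, the partial sum for digit 10 by 2¹⁰, and likewise the result for digit i by 2ⁱ. Processing of all digits is complete at the clock following the one at which the counter 13 indicates 1010111, so the result of the multiply-accumulate operation is obtained by taking data of the required bit width from the output registers of the cumulative adders. Since each partial sum carries one power of 2 too many, the bit width of the output register is increased by 1 bit in advance to allow for this, and the correct result is obtained by ignoring the least significant bit of the output when the result is read.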
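[0051]
The effects of the matrix vector multiplier according to the second embodiment, configured and operating as described above, are as follows. In normal mode, the contents of the input register 23 and the output register 25 are updated at each clock, and accumulation is carried out through switching inside the adder 24. In hold mode, both input ports of the adder 24 are held at their values from before the clock, so no switching occurs and unnecessary power consumption is suppressed.
[0052]
Consequently, the more zeros the binary representation of the input data contains, the lower the power consumption. In particular, for difference-image data in moving picture compression encoding and decoding with motion compensation, the data distribution is concentrated around 0, so the upper bits are likely to be 0 in sign-magnitude form. Applied to data sequences with this property, the multiplier can reduce power consumption substantially compared with the ordinary DA method and the like. In addition, the ROM storing the matrix components is never accessed randomly and merely sends out data sequentially in a fixed order, so its structure is very simple.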
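[0053]
Next, a matrix vector multiplier according to a third embodiment of the present invention is described with reference to FIGS. 7 and 8. The third embodiment uses essentially the same algorithm as the second embodiment but performs the additions in the reverse order (from the low-order digits). The configuration, function, and effect are described below for N = 8 and M = 12.
[0054]
FIG. 7 shows the basic configuration of the multiplier according to the third embodiment. As in the second embodiment, it includes operation control means 12 having a counter 13 and a selector 14, a ROM 15, and cumulative addition means 20. The matrix data storage means 11 is also configured in almost the same way as in the second embodiment, except that the bit slice is taken from the least significant bit of the absolute value part rather than the most significant bit, and is then selected by the 8:1 selector and sent out as the EXEC signal. When the input-side register of the cumulative adder takes in data from the ROM, it shifts the data left by 11 bits, sign-extends it, and latches it (equivalent to multiplying by 2¹¹ in advance), with the lower 11 bits always 0. In shift mode, the output-side register of the cumulative adder shifts its input right by 1 bit before latching it.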
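[0055]
The function of the multiplier according to the third embodiment is also similar to that of the second embodiment. Before processing, the counter and the input-side and output-side registers of the cumulative adders are reset, and the input data is stored in the input registers in sign-magnitude form. When processing starts, the ROM data corresponding to the lower 3 bits of the counter and the sign is output in sequence to the cumulative adders. The selector selects the bit slice b₀, ..., b₇ taken from the least significant bit of the input data registers according to the value of the lower 3 bits of the counter and sends it, in this order, to the cumulative adders as the EXEC signal. For example, when the lower 3 bits of the counter are 010, the second column components of the matrix (corresponding to 010) are supplied to the input-side registers of the cumulative adders with the sign corresponding to the sign of the second component of the input data. If EXEC = 1 at this time (the least significant bit of the absolute value of the second component of the input data is 1) and the registers 23 and 25 of the cumulative addition circuit 21 are in normal mode, then at the next clock the data from the ROM 15 is sign-extended and set in bit 12 and above of the input register 23, and, as in the second embodiment, the sum of this value and the contents of the output-side register appears at the output port of the adder 24. If EXEC = 0, then, again as in the second embodiment, both registers 23 and 25 of the cumulative addition circuit are in the hold state as shown in FIG. 5: the data from the ROM is not set in the input-side register 23 and the value at the output port is not set in the output-side register 25, so the adder 24 keeps its previous state and performs no switching.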
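[0056]
The same operation as in the second embodiment is then performed eight times. When the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adders, and at the next clock the data at the output port is shifted right by 1 bit (equivalent to multiplying by 1/2) and set in the output-side register 25. At the same time, the absolute value part of the input register 23 is also shifted right by 1 bit, so that digit 2 of the input data moves into digit 1 of the input data register, completing the processing for digit 1 of the input data. The above procedure is then repeated for every digit of the absolute value part, 11 times in all. The data supplied to the input register has been multiplied by 2¹¹ in advance, but the result is multiplied by 1/2 each time the output port is shifted right, so the final result is the sum in which the partial sum for digit 1 is multiplied by 2⁰, the partial sum for digit 2 by 2, and likewise the result for digit i by 2^(i−1). Processing of all digits is complete at the clock following the one at which the counter indicates 1010111, so the result of the multiply-accumulate operation is obtained by taking data of the required bit width from the output registers of the cumulative adders.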
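The LSB-first variant with pre-scaling and right shifts can be modelled as below. This is a behavioural sketch only; the function name `matvec_lsb_first` and the integer coefficients are assumptions, and register widths and sign extension are not modelled because Python integers are unbounded.

```python
def matvec_lsb_first(a, xs, mag_bits=11):
    """Compute a * xs by processing magnitude bits LSB first.

    Each coefficient is pre-scaled by 2**mag_bits when loaded, and the
    accumulator is shifted right by 1 bit after every digit.
    """
    acc = [0] * len(a)
    for i in range(mag_bits):                       # digit 1 (LSB) first
        for n, x in enumerate(xs):
            if (abs(x) >> i) & 1:                   # control bit 1: accumulate
                for k, row in enumerate(a):
                    coeff = -row[n] if x < 0 else row[n]
                    acc[k] += coeff << mag_bits     # loaded 11 bits to the left
        # Shift step: halve the running sum. The shift is exact because the
        # low-order bits of every accumulated term are still zero here.
        acc = [v >> 1 for v in acc]
    return acc


a = [[(3 * k + 5 * n) % 7 - 3 for n in range(8)] for k in range(8)]
xs = [1023, -2047, 5, 0, -1, 300, -77, 2]
result = matvec_lsb_first(a, xs)
expected = [sum(row[n] * xs[n] for n in range(8)) for row in a]
```

A term added at digit i (counting from 1) is shifted right 12 − i times in total, so its net weight is 2¹¹ / 2^(12−i) = 2^(i−1), in agreement with the weights stated in the text.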
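[0057]
The effects of the matrix vector multiplier according to the third embodiment, configured and operating as described above, are as follows. When data from the ROM is supplied to the input register, the portion corresponding to the difference between the bit width of the ROM data and that of the cumulative adder must be sign-extended. As shown in FIG. 8, if the data passed from the ROM is negative, all bits from the top of the ROM data to the top of the input-side register (14 bits when M = 12 and N = 8) must be set to 1, and if it is positive they must be set to 0, before the data is set in the register. Because difference-image data is distributed almost uniformly on the positive and negative sides around 0, the data passed from the ROM is positive or negative in a ratio of about 1:1 and in an unpredictable order, so the switching probability of this portion can become very high. If, however, the data is shifted by 11 bits before being loaded, so that the lower 11 bits are always 0, the number of switching events in the upper bits caused by sign extension can be reduced, enabling a further reduction in power consumption.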
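[0058]
Finally, the configuration, function, and effect of a matrix vector multiplier according to a fourth embodiment of the present invention are described with reference to FIG. 9. The fourth embodiment is also described for the case where the number of columns n (the first predetermined number) is 8, the number of rows (the second predetermined number) is also 8, and the number of bits M is 12.
[0059]
FIG. 9 shows the configuration of the matrix vector multiplier according to the fourth embodiment. The multiplier of the fourth embodiment includes a ROM 15, cumulative addition means 20, and a counter 13 configured in the same way as in the second and third embodiments. The matrix data storage means 11 is also configured in almost the same way as in the first to third embodiments, except that a bit slice can be taken from any bit of the absolute value part rather than only the most significant bit: a second selector 34 with an 11:1 selection ratio controls, by means of a signal S5, which digit is taken according to the upper 4 bits of the counter. The correspondence between the upper 4 bits of the counter 13 and the digits of the storage means 11 may be chosen freely, but mapping, for example, 0000 to digit 11 and 1010 to digit 1 simplifies the configuration.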
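[0060]
The operation of the multiplier according to the fourth embodiment with the above configuration is as follows. As in the second embodiment, before processing the counter 13 and the input-side and output-side registers 23 and 25 of the cumulative addition circuits 21 are reset, and the input data is stored in the input-side registers 23 in sign-magnitude form. When processing starts, the ROM data corresponding to the lower 3 bits of the counter 13 and the sign is passed in sequence to the adders 24 of the cumulative addition circuits 21. The second selector 34 takes the bit slice b₀, ..., b₇ from a given digit of the matrix data storage means 11 (designated by the upper 4 bits of the counter), and the selector 14 selects from it according to the value of the lower 3 bits of the counter 13 and sends the bits, in this order, to the cumulative addition circuits 21 as the EXEC signal S4. For example, when the counter 13 is 0011101, the fifth column components of the matrix stored in the storage means 11 (corresponding to 101) are passed to the input-side registers 23 of the cumulative addition circuits 21 with a positive or negative sign according to the sign of the fifth component of the input data, and the fifth component of the bit slice taken from digit 8 of the input registers (corresponding to 0011) is passed to the cumulative adders as the EXEC signal. Cumulative addition then proceeds in the same way as in the second embodiment.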
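The counter decoding in this example can be checked with a few lines of code. The sketch assumes the example mapping given earlier (upper bits 0000 select digit 11, 1010 select digit 1); the function name `decode_counter` is hypothetical.

```python
def decode_counter(counter):
    """Split a 7-bit counter value into (column, digit).

    Lower 3 bits: matrix column number (0..7).
    Upper 4 bits: which magnitude digit supplies the bit slice,
    using the example mapping 0000 -> digit 11, ..., 1010 -> digit 1.
    """
    column = counter & 0b111
    digit = 11 - (counter >> 3)
    return column, digit


example = decode_counter(0b0011101)   # the counter value used in the text
first = decode_counter(0b0000000)
last = decode_counter(0b1010111)
```

With this mapping the counter simply counts upward from 0000000 to 1010111, visiting all 8 columns for each of the 11 magnitude digits in turn.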
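[0061]
In the second embodiment, the input registers were shifted left by 1 bit each time the additions for one digit were completed. Here, simply adding the second selector 34 as hardware for switching the digit from which the bit slice is taken according to the upper 4 bits of the counter eliminates the left-shift mechanism and also suppresses the register updates that accompany shifting, enabling a further reduction in power consumption.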
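[0062]
[Effects of the Invention]
As described in detail above, the matrix vector multiplier according to the present invention performs no addition when a bit of the input matrix data is 0, so power consumption can be greatly reduced, and the accumulation for each row is processed in parallel by a plurality of cumulative addition circuits connected in parallel, so processing can be made faster.
[0063]
Furthermore, the number of switching operations of the input and output registers of the parallel cumulative addition circuits can be reduced, which also lowers power consumption, and because no random access to the read-only storage means occurs, the structure of the read-only storage means (ROM) can be simplified.
[0064]
In addition, the number of switching events during sign extension can be reduced, which is expected to lower power consumption; the shift function of the matrix data storage means becomes unnecessary, which simplifies the structure of the storage means and reduces the number of switching events; and sharing the counter reduces the amount of hardware.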
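[Brief Description of the Drawings]
[FIG. 1] Block diagram showing the configuration of the matrix vector multiplier according to the first embodiment, which represents the basic concept of the present invention.
[FIG. 2] Block diagram showing the configuration of the matrix vector multiplier according to the second embodiment of the present invention.
[FIG. 3] Explanatory diagram showing a table representing the function of the ROM used in common in the multipliers of the second to fourth embodiments, and the case where the + sign of the fifth column is designated.
[FIG. 4] Block diagram showing the cumulative addition circuit used in common in the matrix vector multipliers according to the present invention.
[FIG. 5] Explanatory diagram showing the mode assignment table for the control signals input to the input and output registers 23 and 25 in FIG. 4.
[FIG. 6] Explanatory diagram illustrating an example of the operating state of the cumulative addition circuit in the multiplier of the second embodiment, for the case where the bit slice of the most significant bit of the absolute value part of the input data is (01101100) and the bit slice of the next digit is (0110....).
[FIG. 7] Block diagram showing the configuration of the matrix vector multiplier according to the third embodiment of the present invention.
[FIG. 8] Explanatory diagrams showing (a) how data is set in the input-side register of the cumulative addition circuit in the second embodiment and (b) how data is set in the input-side register of the cumulative addition circuit in the third embodiment.
[FIG. 9] Block diagram showing the configuration of the matrix vector multiplier according to the fourth embodiment of the present invention.
[FIG. 10] Block diagram showing the configuration of a conventional matrix vector multiplier using the DA method.
[FIG. 11] Explanatory diagram showing the table stored in the ROM of the conventional matrix vector multiplier.
[FIG. 12] Explanatory diagram showing the calculation algorithm of the conventional multiplier.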
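[Explanation of Reference Numerals]
10 Matrix vector multiplier
11 Matrix data storage means
12 Operation control means
13 Sign control unit (address designating counter)
14 Addition control unit (selector)
15 Read-only storage means (ROM)
20 Cumulative addition means
21 (a to n) Cumulative addition circuits
23 (a to n) Input storage units (input registers, AREG)
25 (a to n) Output storage units (output registers, BREG)
34 Sign control unit (second selector)
[0001]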
[Technical Field of the Invention]
The present invention relates to a matrix vector multiplier, and more particularly to a matrix vector multiplier capable of performing matrix and vector operation processing at high speed and with low power consumption.
[0002]
[Prior art]
A band compression technique using the discrete cosine transform (hereinafter abbreviated as DCT, Discrete Cosine Transform) and its inverse transform (hereinafter abbreviated as IDCT, Inverse Discrete Cosine Transform) is widely used as a basic building-block algorithm for image data compression. Both the DCT and the IDCT are orthogonal transforms, but in actual computation each is a multiplication of a constant-coefficient matrix by a vector, defined for the pixel-value vector x_n and the DCT-coefficient vector X_k by the following expression:
[0003]
[Expression 1]

In MPEG (Moving Picture Experts Group) 1, MPEG2, and other international standards for moving picture compression, a two-dimensional DCT and IDCT with N = 8 are used, operating on square regions (blocks) of 64 pixels, 8 pixels each in the horizontal and vertical directions.
[0004]
Moving picture compression apparatuses generally involve a very large amount of computation, and moving picture compression encoding and decoding in applications that require real-time processing (HDTV broadcasting, videophones, video surveillance systems, and the like) place very high performance demands on the processing system. The discrete cosine transform and its inverse are therefore generally computed with fast algorithms and dedicated hardware. Widely used methods include Chen's fast algorithm, which greatly reduces the number of multiply-accumulate operations, and the distributed arithmetic (DA) method, which performs the calculation without multipliers. Dedicated hardware configurations applying these methods are disclosed in, for example, IEICE Trans. Electron., Vol. E75-C (1992), No. 4, pp. 390-397. The method of computing the discrete cosine transform by the DA method and the corresponding hardware configuration are briefly described here.
[0005]
First, the k-th row of the matrix-vector multiplication of the above equation (1) can be expressed in the form of an inner product as follows:
[0006]
[Expression 2]

Further, when the bit width of the hardware data path is M, each x_n is expressed as a binary number (two's complement representation) as:
[0007]
[Expression 3]

so that, using the linearity of the inner product, the above equation (2) takes the form:
[0008]
[Expression 4]

Here, the partial sum
[0009]
[Expression 5]

is a function of the i-th bits {b_{n,i}} of the input data {x_n}. If it is computed in advance for every bit pattern (2^N patterns) and stored in memory, it can be read out by treating {b_{n,i}} as an N-bit address. Noting also that multiplication by 2^(i−1) corresponds to a left shift by i−1 bits, the above equation (4) can ultimately be obtained by a combination of memory lookups with N-bit addresses, left shifts, additions, and subtractions.
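The expression images for equations (2) to (4) are not present in this text. The following is a reconstruction from the surrounding definitions rather than the original typesetting. It assumes 0-based bit indices, with b_{n,0} the LSB and b_{n,M-1} the sign bit as in the later procedure description, and the order of the two terms in (4) is inferred from the later references to its first term (the MSB) and second term (the remaining bits).

```latex
\begin{align}
X_k &= \sum_{n=0}^{N-1} a_{k,n}\, x_n \tag{2} \\
x_n &= -\,b_{n,M-1}\, 2^{M-1} + \sum_{i=0}^{M-2} b_{n,i}\, 2^{i} \tag{3} \\
X_k &= -\,2^{M-1} \sum_{n=0}^{N-1} a_{k,n}\, b_{n,M-1}
       + \sum_{i=0}^{M-2} 2^{i} \sum_{n=0}^{N-1} a_{k,n}\, b_{n,i} \tag{4}
\end{align}
```

Under these assumptions the partial sum for bit position i is \sum_{n} a_{k,n} b_{n,i}, which depends only on the N-bit pattern (b_{0,i}, ..., b_{N-1,i}) and can therefore be tabulated.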
[0010]
The DA method applies the principle described above. A specific multiplier configuration for N = 8 and an example of its procedure are described below with reference to FIGS. 10 to 12. First, the circuit configuration of a conventional matrix vector multiplier is described with reference to FIG. 10. In FIG. 10, the conventional matrix vector arithmetic circuit comprises: a matrix 1, which has an input port for each row and holds data in which 0/1 information is arranged in n columns (a first predetermined number) and k rows (a second predetermined number); a read-only memory (hereinafter abbreviated as ROM) 2, which stores a table described later with reference to FIG. 11 and from which the corresponding table entry is read out using, as the address signal, the information read in the row direction from the matrix 1 for each i-th bit; and a cumulative adder 5, which cumulatively adds the values read from the ROM 2.
[0011]
The cumulative adder 5 comprises a shifter 6 that shifts the bit slice read from the table stored in the ROM 2 by a predetermined amount, for example 3 bits at a time; a 2-input, 1-output adder 7 that cumulatively adds the outputs of the shifter 6 in sequence; and an output register 8 that temporarily holds the output of the adder 7 and feeds it back to the other input of the adder 7. Because this conventional arithmetic circuit performs the matrix-vector operation with essentially only the shifter 3 and the adder 4, it can process complex operations with a simple configuration and has the advantage that general-purpose arithmetic circuits can be used.
[0012]
The operation of the conventional matrix vector multiplier having the above configuration is as follows. First, as shown in FIG. 11, a table is created in which, for all 8-bit patterns {b_n} (2⁸ = 256 patterns), the partial sums corresponding to each row k (k = 0, ..., 7) of the matrix are computed in advance to the required precision (for example, 16 bits). As described above, the table shown in FIG. 11 is stored in advance in the ROM 2 shown in FIG. 10. With 8-bit addresses and 16-bit precision, the capacity of this ROM 2 is 4 kilobytes.
[0013]
Next, the least significant bit (hereinafter abbreviated as LSB) of the binary representation is taken from each of the eight input data {x_n; n = 0, ..., 7}, and with these 8 bits {b_{n,0}} as the address signal, the partial-sum data read from the ROM 2 is accumulated for each row by the cumulative adder 5. Then, with the second bit from the bottom {b_{n,1}} as the address, the partial-sum data is read from the ROM 2, shifted left by 1 bit (which corresponds to multiplying by 2), and accumulated by the cumulative adder 5. In the same way, as shown in FIG. 11, the operation of shifting the partial sum for the i-th bit from the bottom left by i−1 bits and accumulating it (see FIGS. 12(a) and 12(b)) is repeated up to just before the most significant bit (hereinafter abbreviated as MSB), so that the second term of equation (4) is accumulated in the cumulative adder. Because the MSB is the sign bit, the partial sum is read using {b_{n,M-1}} as the address, shifted left by M−1 bits, and sign-inverted before being accumulated; this adds the first term of equation (4), giving the value of equation (4).
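The conventional DA procedure just described can be sketched as follows. This is an illustrative model under stated assumptions, not the circuit of FIG. 10: the names `build_da_table` and `da_inner_product` are hypothetical, coefficients are plain integers, and a single matrix row is processed (the hardware table holds all eight rows per address).

```python
def build_da_table(row):
    """Partial sums of one matrix row for every N-bit pattern (ROM contents)."""
    n = len(row)
    return [sum(row[j] for j in range(n) if (addr >> j) & 1)
            for addr in range(1 << n)]


def da_inner_product(row, xs, m):
    """Inner product of `row` with m-bit two's complement integers `xs`."""
    table = build_da_table(row)
    acc = 0
    for i in range(m):                          # LSB first
        addr = 0
        for j, x in enumerate(xs):              # bit slice i forms the address
            addr |= ((x >> i) & 1) << j
        term = table[addr] << i                 # weight 2**i via left shift
        # The MSB slice is the sign bit, so its term is subtracted.
        acc += -term if i == m - 1 else term
    return acc


row = [3, -1, 4, 1, -5, 9, 2, -6]
xs = [5, -3, 0, 7, -2048, 2047, -1, 2]          # 12-bit two's complement range
value = da_inner_product(row, xs, m=12)
rom_bytes = (1 << 8) * 8 * 2                    # 256 addresses x 8 rows x 16 bits
```

Every one of the m bit slices triggers a table lookup and an addition regardless of the data, which is the fixed workload that the following paragraphs identify as the drawback.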
[0014]
With this method, the product of a matrix and vector data can be computed easily by a simple general-purpose arithmetic circuit such as that shown in FIG. 10, without dedicated multiplication hardware, so the method is suited to high-speed processing with a reduced amount of hardware.
[0015]
[Problems to be solved by the invention]
However, with the conventional matrix vector multiplier described above, the amount of computation and the hardware utilization are always constant regardless of the nature of the input data or the correlation of its contents. The cumulative adder 5 therefore has to operate even for trivial or unnecessary calculations, such as operating on a run of consecutive 0s, and computational efficiency is poor.
[0016]
Moreover, as is well known, the DA method always performs the same operations whatever the properties of the data sequence. Repeating unnecessary or trivial calculations, as in the example above, increases the opportunities for switching, which is not only inefficient but also increases power consumption. In addition, because memory is accessed randomly with each digit of the input data as the address, the memory requires sophisticated functions, and the structure of the ROM 2 that reads in the matrix 1 becomes complicated.
[0017]
An object of the present invention is to provide a matrix vector multiplier in which the arithmetic circuit that multiplies a vector variable by a matrix of constant components, an operation that appears frequently in image processing and the like, is built not from general-purpose multipliers but from adders and shifters operating in parallel and a read-only memory holding the matrix components, thereby omitting unnecessary operations to improve computational efficiency, suppressing the rise in power consumption, and allowing a simple matrix circuit structure.
[0018]
[Means for Solving the Problems]
To achieve the above object, a matrix vector multiplier according to the present invention comprises: matrix data storage means that sequentially stores matrix data composed of column components of a first predetermined number and row components of a second predetermined number, expressed as a sign part and an absolute value part; operation control means including a sign control unit that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the sign part of the matrix data as a sign control signal, and an addition control unit that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal; read-only storage means that stores in advance coefficient data corresponding to the column components of the matrix data together with the data of opposite sign, expressed as a sign part and an absolute value part, and sequentially outputs the data of the corresponding column component based on the address control signal and the sign control signal output from the sign control unit; and cumulative addition means consisting of a plurality of cumulative addition circuits, each comprising an input storage unit, provided for each row component of the matrix data, that temporarily holds the coefficient data supplied from the read-only storage means with the sign corresponding to the column component and moves the absolute value part of the coefficient data by a predetermined amount in a predetermined direction once per period of the address control signal based on the sign control signal and the addition control signal, an addition unit that switches between adding and not adding the data from the read-only storage means according to the addition control signal, and an output storage unit that temporarily holds the accumulated value of the addition unit and moves this accumulated value by the predetermined amount in the predetermined direction.
[0019]
The matrix vector multiplier according to claim 2 is the multiplier according to claim 1, wherein the input storage unit of the cumulative addition means is an input register having a shift function that shifts the absolute value part of the matrix data left by 1 bit at a time based on the sign control signal and the addition control signal, and the output storage unit is an output register that shifts the accumulated value of the addition unit left by 1 bit at a time based on the sign control signal and the addition control signal and outputs it.
[0020]
The matrix vector multiplier according to claim 3 is the multiplier according to claim 1, wherein the input storage unit of the cumulative addition means is an input register having a shift function that shifts the absolute value part of the matrix data right by 1 bit at a time based on the sign control signal and the addition control signal, and the output storage unit is an output register that shifts the accumulated value of the addition unit right by 1 bit at a time based on the sign control signal and the addition control signal and outputs it.
[0021]
The matrix vector multiplier according to claim 4 is the multiplier according to claim 1, wherein the cumulative addition circuits of the cumulative addition means are provided in the same number as the first predetermined number, which is the number of columns of the matrix data, the read-only storage means simultaneously outputs row component data equal in number to the first predetermined number of the matrix data, and the cumulative addition circuits operate in parallel to accumulate the row component data from the read-only storage means.
[0022]
The matrix vector multiplier according to claim 5 is the multiplier according to claim 1, wherein each cumulative addition circuit is configured such that a pair consisting of the input storage unit and the output storage unit, each having a function of switching, according to the combination of the addition control signal and the address control signal from the matrix data storage means, among a reset mode that sets the contents to 0, a normal mode that updates the current value, a hold mode that retains the current value, and a shift mode that shifts the input by 1 bit and latches it, is connected to one input and to the output, and the output of the output storage unit is branched and connected to the other input.
[0023]
The matrix vector multiplier according to claim 6 is the multiplier according to claim 1, wherein the sign control unit of the operation control means is a counter that increments its contents by one at each clock signal.
[0024]
Further, the matrix vector multiplier according to claim 7 is the multiplier according to claim 6, wherein the sign control unit of the operation control means generates the address control signal from a plurality of low-order bits of the counter driven by the clock signal.
[0025]
The matrix vector multiplier according to claim 8 is the multiplier according to claim 1, wherein the cumulative addition means uses, as the addition control signal, the bit string at a specific digit of the matrix data stored in the matrix data storage means, and shifts the position of the bit string used as the addition control signal right by 1 bit each time one period of the address control signal elapses.
[0026]
Furthermore, the matrix vector multiplier according to claim 9 is the multiplier according to claim 8, wherein the cumulative addition means sign-extends the data read from the read-only storage means and holds it at a position shifted left by the input data bit width of the input storage unit.
[0027]
The matrix vector multiplier according to claim 10 is the multiplier according to claim 1, wherein the addition control unit of the arithmetic control means includes a selector that selects a bit string taken from an arbitrary digit of the matrix data storage means and sends it out as a selection signal serving as the addition control signal.
[0028]
  Further, the matrix vector multiplier according to claim 11 is the multiplier according to claim 10, wherein the sign control unit of the arithmetic control means is constituted by a counter that increments its contents by one at each clock signal, and the cumulative addition means uses a plurality of higher-order bits of the counter as selection signals for the selector.
  The matrix vector multiplier according to claim 12 is the multiplier according to claim 1, wherein the cumulative addition means uses a bit string at a specific position of matrix data stored in the matrix data storage means as the addition control signal. The position of the bit string used as the addition control signal is shifted to the left by one bit every time the address control signal passes one cycle.
  A matrix vector multiplier according to claim 13 is the multiplier according to claim 12, wherein the cumulative addition means sign-extends the data read from the read-only storage means and holds it at a position shifted right by the input data bit width of the matrix data storage means.
[0029]
Therefore, according to the matrix vector multiplier of the present invention, each component of the input data consisting of N components is extracted bit by bit from the upper digit, the operation mode of the adder is switched sequentially using the obtained bit string (N bits) as a control signal, and the coefficient data read from the ROM for each bit is integrated. At this time, the mode is switched so that the data is added to the value held in the register if the control bit is 1 and the register is held if the control bit is 0. This reduces unnecessary addition operations and the register switching that accompanies them, so low power consumption operation becomes possible. When the N coefficient data have been processed, the result is shifted left by 1 bit and the same operation is repeated for the next digit of the input data. The process is repeated in the same way until all digits have been processed.
[0030]
In order to perform the matrix multiplication correctly by the above procedure, sign-absolute value representation is used instead of two's complement representation for the input data of the cumulative addition means. In moving image compression using motion compensation such as MPEG, difference image data such as P-pictures is highly likely to be concentrated around 0, so the number of zero bits increases in this representation. The number of switching operations can therefore be further reduced by setting the operation mode to the hold state wherever zeros appear consecutively.
[0031]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of a matrix vector multiplier according to the present invention will be described below in detail with reference to the drawings. First, the configuration of a matrix vector multiplier according to the first embodiment as the basic principle of the present invention will be described with reference to FIG.
[0032]
  In FIG. 1, a matrix vector multiplier 10 according to the first embodiment comprises: matrix data storage means 11 that sequentially stores matrix data, composed of a first predetermined number n of column components and a second predetermined number k of row components, expressed by a sign part and an absolute value part; arithmetic control means 12 including a sign control unit 13 that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the sign part of the matrix data as a sign control signal, and an addition control unit 14 that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal; read-only storage means (hereinafter referred to as ROM) 15 that stores in advance the coefficient data corresponding to the column components of the matrix data and the data of the opposite sign, expressed by a sign part and an absolute value part, and sequentially outputs the corresponding column component data based on the address control signal and the sign control signal output from the sign control unit 13; and cumulative addition means 20 having a plurality of cumulative addition circuits 21a to 21n provided in the same number as the first predetermined number n of the matrix data.
[0033]
  The cumulative addition means 20 has a plurality of cumulative addition circuits 21a to 21n connected in parallel. The cumulative addition circuits 21a to 21n are provided for each row component of the matrix data and comprise: a plurality of input storage units 23a to 23n that temporarily store the coefficient data, supplied from the ROM 15, corresponding to the sign of the column component and that, based on the sign control signal and the addition control signal, move the absolute value portion of the coefficient data by a predetermined amount in a predetermined direction for each cycle of the address control signal; a plurality of addition units 24a to 24n that switch between addition and non-addition of the data from the ROM 15 according to the addition control signal input to each of the input storage units 23a to 23n; and a plurality of output storage units 25a to 25n that temporarily store the integrated value of each of the addition units 24a to 24n and move the integrated value by the predetermined amount in the predetermined direction.
[0034]
A description of the operation of the matrix vector multiplier according to the first embodiment is omitted here, since it is covered by the description of the operations of the second to fourth embodiments given in detail below.
[0035]
Next, a matrix vector multiplier according to a second embodiment of the present invention will be described in detail with reference to FIGS. The matrix vector multiplier according to the second embodiment is described with a more specific configuration of the multiplier according to the first embodiment.
[0036]
The multiplier according to the second embodiment includes N M-bit registers that store N input data in sign-absolute value representation, an N:1 selector that designates one of the N bits extracted from the (M-1)-th digit of the register group, a ROM that stores the matrix component data and can read it out column by column, cumulative adders that operate in parallel and accumulate the data from the ROM, and a counter that generates a signal designating the column number of the matrix. Hereinafter, the algorithm, configuration, function, and effect will be described for the case of N = 8 and M = 12. Here, in terms of the configuration of the multiplier 10 according to the first embodiment, the first predetermined number is "8" and the second predetermined number is "12".
[0037]
First, the multiplier algorithm according to the second embodiment will be described. Writing the input data x_n in sign-absolute value representation with a bit width of 12 gives:
[0038]
[Formula 6]

so that, using the linearity of the inner product, equation (2) can be written as:
[0039]
[Formula 7]

Since each b_{n,i} is 0 or 1, equation (6) ultimately reduces, for each row k of the matrix, to a combination of additions of "+a_{k,n}", "-a_{k,n}", or "0" and shift operations.
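The two formulas referred to above are not reproduced in this text. One possible reconstruction from the surrounding description is sketched below; it is an assumption, not the original notation. The symbol s_n for the sign bit of x_n is introduced here for illustration only, and the digits of the absolute value part are numbered i = 1, ..., 11 with weight 2^{i-1}. The text itself numbers the top digit of the absolute value part as the 10th in one place and the 11th in another, so this indexing is a choice made for the sketch.

```latex
% Sign-absolute value form of a 12-bit input (s_n is a hypothetical symbol for the sign bit)
x_n = (-1)^{s_n} \sum_{i=1}^{11} b_{n,i}\, 2^{i-1}, \qquad s_n,\ b_{n,i} \in \{0, 1\}

% Substituting into the inner product of equation (2) and exchanging the sums
X_k = \sum_{n=0}^{7} a_{k,n}\, x_n
    = \sum_{i=1}^{11} 2^{i-1} \sum_{n=0}^{7} (-1)^{s_n}\, a_{k,n}\, b_{n,i}
```

Read this way, the inner sum only ever adds +a_{k,n}, adds -a_{k,n}, or adds nothing, and the outer factor 2^{i-1} is a shift.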
[0040]
Specifically, a read-only memory storing the data of each component of the matrix and the data of the opposite sign is first prepared, and from this memory +a_{k,n} or -a_{k,n} is read out sequentially for n = 0, ..., 7, depending on the sign of x_n. The eight input data x_n are expressed in sign-absolute value representation, and for the bit string {b_{n,10}; n = 0, ..., 7} obtained by taking out the most significant bit (10th digit) of the absolute value portion, the corresponding +a_{k,n} (or -a_{k,n}) is accumulated by a cumulative adder only when b_{n,10} = 1. After the processing for n = 0, ..., 7 is completed, the integration result is shifted left by 1 bit, and the accumulation of +a_{k,n} (or -a_{k,n}) by the cumulative adder is repeated in the same way for the next bit (9th digit) of the absolute value portion. This process is then repeated until the least significant bit is reached, thereby obtaining equation (6).
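The procedure in this paragraph can be illustrated by a short Python sketch. It is a software model under stated assumptions, not the hardware of the embodiment: the function name matvec_msb_first is hypothetical, the coefficients are plain integers, and the left shift is applied before each digit rather than after it, which gives the same result because the accumulators start at 0.

```python
def matvec_msb_first(A, x, abs_bits=11):
    """Compute A @ x using only conditional add/subtract and left shifts.

    A: list of rows of integer coefficients a[k][n] (assumed already quantized).
    x: list of integers with |x[n]| < 2**abs_bits.
    """
    signs = [1 if v < 0 else 0 for v in x]       # sign part of each input
    mags = [abs(v) for v in x]                   # absolute value part
    acc = [0] * len(A)                           # one accumulator per row k
    for digit in range(abs_bits - 1, -1, -1):    # most significant digit first
        acc = [v << 1 for v in acc]              # shift between digits
        for n, mag in enumerate(mags):
            if (mag >> digit) & 1:               # control bit 1: add, 0: hold
                for k, row in enumerate(A):
                    # the sign of x[n] picks +a[k][n] or -a[k][n]
                    acc[k] += -row[n] if signs[n] else row[n]
    return acc


if __name__ == "__main__":
    A = [[1, 2], [3, -4]]
    x = [5, -3]
    print(matvec_msb_first(A, x))
```

No multiplication appears in the loop body: each step is an addition, a subtraction, or nothing at all.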
[0041]
In the above algorithm, the sign attached to a_{k,n} is determined for each column n by the sign of x_n. Therefore, by storing in the memory in advance the eight data of the n-th column of the matrix as one set and the data of the opposite sign as another set, and referring to one of the two sets as a whole according to the sign, the calculations for k = 0, ..., 7 can be performed in parallel.
[0042]
Next, the overall configuration of the multiplier according to the second embodiment will be described with reference to FIG. The multiplier comprises: a matrix 1 as data composed of n columns and M rows (of which k rows are significant data); arithmetic control means 12 for controlling the cumulative addition for each component of the matrix 1; a ROM 15 as read-only storage means that stores the data necessary for multiplication as a table and outputs the necessary data in response to a control signal from the arithmetic control means; and cumulative addition means 20 including a plurality (n) of cumulative addition circuits 21a to 21n provided in parallel, which perform the multiplication based on the data from the ROM 15 and the control signals from the arithmetic control means 12. The arithmetic control means 12 includes a counter 13 that functions as the sign control unit and a selector 14 that functions as the addition control unit. The counter 13 also functions as an address designation unit 11. Note that in0 to in7 are input registers that store the respective column components in the matrix data storage means 11, and Acc0 to Acc7 denote the row components computed by the individual cumulative addition circuits 21a to 21n constituting the cumulative addition means 20. In the second embodiment, the arithmetic processing starts from the most significant bit.
[0043]
The above configuration will be described in more detail. First, a total of 128 data, namely each of the 8 × 8 components constituting the matrix and the data obtained by inverting their signs (expressed in two's complement), are prepared with a bit width corresponding to the required accuracy. Although the required bit width varies with the application field, an accuracy of about 16 bits is required in the case of MPEG1, MPEG2, and the like. These data can be stored in a read-only memory, but since random access to the memory by address reference does not occur in the present invention, a sequencer using a combinational circuit or the like may be used instead. The term ROM 15 is used here generically for these various functional configurations.
[0044]
As shown in FIG. 3, the ROM 15 is assumed to be configured so that, in response to a signal designating the column number (0 to 7) of the matrix and a signal S1 (SIGN) designating the sign of each column, it simultaneously outputs the eight data of the corresponding column with the corresponding sign (128 bits if the precision is 16 bits). Each datum included in the output of the ROM 15 corresponds to one row of the matrix and is supplied to one of the eight cumulative addition circuits 21a to 21n, which operate in parallel. As shown in FIG. 4, each of the cumulative addition circuits 21a to 21n is essentially a two-input, one-output adder 24, provided with an input register 23 that is connected to one input port and receives the ROM data D1, and an output register 25 that is connected to the output port of the adder 24 and holds the integrated value. The signal line from the output port of the adder 24 passes through the output register 25 and is then branched into two, one branch being connected to the other input port of the adder 24.
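The column-and-sign lookup described for the ROM 15 can be modelled in a few lines of Python. This is a minimal sketch and not the circuit of FIG. 3: the names build_rom and rom_read are hypothetical, and the coefficients are held as plain integers rather than 16-bit words.

```python
def build_rom(A):
    """Build the two sets of column data: rom[sign][column] -> one value per row.

    A: list of rows of integer coefficients (assumption: already quantized).
    """
    n_rows, n_cols = len(A), len(A[0])
    positive = [tuple(A[k][n] for k in range(n_rows)) for n in range(n_cols)]
    negative = [tuple(-v for v in column) for column in positive]
    return (positive, negative)          # index 0: SIGN = 0, index 1: SIGN = 1


def rom_read(rom, column, sign):
    """Return all row entries of one column at once, with the requested sign."""
    return rom[sign][column]


if __name__ == "__main__":
    rom = build_rom([[1, 2], [3, 4]])
    print(rom_read(rom, column=1, sign=0))
    print(rom_read(rom, column=1, sign=1))
```

Because the column number simply counts up, the table is read in a fixed order, which is why the text allows a sequencer in place of an addressable memory.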
[0045]
As shown in FIG. 5, each of the registers 23 and 25 of the adder 24 has a normal mode in which the input value is set internally in synchronization with the clock signal, a hold mode in which the previous value is held regardless of the input value, a reset mode in which the contents are reset to 0, and a shift mode in which the input is shifted left by 1 bit and set internally. The operation mode is switched according to the combination of two control signals: the signal S2 (SHIFT) as a sign control signal and the signal S4 (EXEC) as the addition control signal.
[0046]
When the ROM data is 16 bits, the input data is 12 bits (M = 12), and the input has 8 components (N = 8), the bit width of the adder is 16 + 11 + 3 bits; in the second embodiment, however, one more bit is provided for a reason described later. The eight input data are usually given in two's complement representation, and they are converted into sign-absolute value representation before being stored in the input data registers. This can be done by a well-known method: only when the most significant bit (MSB; 12th digit) of the input data is 1, digits 1 to 11 are inverted, 1 is added to the result within an 11-bit width, and the result is concatenated with the most significant bit again. When the MSB is 0, no operation is required. The MSB of each datum is supplied to the ROM 15 as the sign signal S1 (SIGN) and determines the sign of the output data of the corresponding column.
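The conversion described here (invert the lower 11 digits, add 1 within 11 bits, keep the MSB as the sign) can be written out as follows. The function name to_sign_magnitude is hypothetical, and the sketch assumes the input lies in the range -2047 to 2047; the value -2048 has no 11-bit absolute value and wraps to 0 in this scheme.

```python
def to_sign_magnitude(value, width=12):
    """Convert a two's complement integer into (sign, magnitude) fields.

    Assumption: -(2**(width-1)) < value < 2**(width-1); the most negative
    two's complement value is not representable and wraps to magnitude 0.
    """
    mag_bits = width - 1
    mask = (1 << mag_bits) - 1
    raw = value & ((1 << width) - 1)       # two's complement bit pattern
    sign = raw >> mag_bits                 # MSB (12th digit)
    low = raw & mask                       # digits 1 to 11
    if sign:
        low = ((low ^ mask) + 1) & mask    # invert, add 1, keep 11-bit width
    return sign, low                       # MSB = 0: nothing to do


if __name__ == "__main__":
    for v in (5, -5, 0, -2047):
        print(v, to_sign_magnitude(v))
```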
[0047]
Also, the bits of the 8-bit bit string (bit slice) {b_n; n = 0, ..., 7} taken out from the most significant digit (11th digit) of the absolute value portion (1st to 11th digits) of the input data registers are selected one at a time by an 8:1 selector, and the output of this selector is used as the EXEC signal of the input-side and output-side registers of each cumulative adder to switch between the normal mode (EXEC = 1) and the hold mode (EXEC = 0). The input data register includes a shifter; when SHIFT = 1 is received, the absolute value portion is shifted left by 1 bit at the next clock.
[0048]
The output signal S3 of the lower 3 bits of the counter (7-bit width) takes a value from 0 to 7 and is used both as the signal S1 designating the column number of the ROM 15 and as the selection signal S4 of the selector. That is, the output data of the ROM is determined by the sign bit of the input register and the value of the counter, and the operation mode of the cumulative adder is determined in synchronization with it. Only when the lower 3 bits of the counter are 000 is SHIFT = 1 sent to the cumulative adder and the input data register. The upper 4 bits of the counter are used to determine how many digits of the input data have been processed.
[0049]
Next, the operation of the cumulative adder in the multiplier according to the second embodiment will be described with reference to FIG. Prior to the arithmetic processing, the counter and the input and output registers of the cumulative adder are reset, and the input data is converted into sign-absolute value representation and stored in the input register in advance. When the arithmetic processing is started, ROM data corresponding to the counter value and the sign is supplied sequentially to the cumulative addition circuits 21a to 21n, and the selector 14 selects the 11th bits b_0, ..., b_7 of the input registers 23a to 23n according to the value of the counter 13 and sends them to the adders 24a to 24n as the EXEC signal in this order. For example, when the lower 3 bits of the counter 13 are 000, the 0th column component of the matrix is passed to the input register of the cumulative adder with a positive or negative sign corresponding to the sign of the 0th component of the input data. At this time, if EXEC = 1 (the most significant bit of the absolute value of the 0th component of the input data is 1), the register 23 of the cumulative addition circuit 21 is in the normal mode, so at the next clock the data from the ROM 15 is sign-extended and set in the input register 23, and the sum of this value and the contents of the output register 25 appears at the output port. Since EXEC = 1, this value is set in the output register 25 at the next clock. On the other hand, if EXEC = 0, both registers 23 and 25 of the cumulative addition circuit 21 are in the hold state: the data from the ROM 15 is not set in the input register 23, and the value at the output port of the adder 24 is not set in the output register 25. As shown in FIG. 6, the adder 24 therefore maintains its previous state and performs no switching operation.
[0050]
When the above operation has been performed 8 times and the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adder, and at the next clock the data at the output port is shifted left by 1 bit (equivalent to multiplying by 2) and set in the output register 25. At this time, the absolute value portion of the input register 23 is also shifted left by 1 bit, so that the 10th digit of the input data enters the 11th digit of the input register 23. Thereafter, the above procedure is repeated for all the digits of the absolute value portion (11 times). Since the result is multiplied by 2 every time the output port is shifted, the partial sum for the 11th digit is finally multiplied by 2^11, the partial sum for the 10th digit by 2^10, and likewise the result for the i-th digit by 2^i, and the sum of these is obtained. The processing for all digits is completed at the clock following the one in which the counter 13 indicates "1010111", so the result of the product-sum operation is obtained by extracting data of the necessary bit width from the output register of the cumulative adder. Here, the power of 2 applied to each partial sum is one too large; the bit width of the output register is therefore increased by one bit in advance, and the correct result is obtained by ignoring the least significant bit of the output when the calculation result is taken out.
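A clock-by-clock model makes the role of the counter, the EXEC and SHIFT signals, and the extra output bit easier to follow. The sketch below is an illustration under assumptions, not the circuit itself: the name simulate_msb_first is hypothetical, the add and the shift are collapsed into one loop iteration per counter value, and coefficients are plain Python integers, so register widths and sign extension are not modelled.

```python
def simulate_msb_first(A, x, abs_bits=11):
    """Counter-driven model of the second embodiment (MSB-first digit order).

    Returns (result per row, number of clocks in which the adders were active).
    Assumes 8 input components with |x[n]| < 2**abs_bits.
    """
    n_cols = 8
    assert len(x) == n_cols and all(len(row) == n_cols for row in A)
    signs = [1 if v < 0 else 0 for v in x]
    mags = [abs(v) for v in x]                 # absolute value part of in0..in7
    mask = (1 << abs_bits) - 1
    acc = [0] * len(A)                         # output registers, one per row
    active = 0
    for counter in range(abs_bits * n_cols):   # last value is 0b1010111 (87)
        column = counter & 0b111               # lower 3 bits: column number
        exec_bit = (mags[column] >> (abs_bits - 1)) & 1   # top digit of the slice
        if exec_bit:                           # normal mode: add the ROM data
            for k, row in enumerate(A):
                acc[k] += -row[column] if signs[column] else row[column]
            active += 1
        # EXEC = 0: hold mode, no register changes
        if column == 0b111:                    # SHIFT = 1 after the eighth column
            acc = [v << 1 for v in acc]
            mags = [(m << 1) & mask for m in mags]
    # every partial sum was shifted once too often: ignore the lowest bit
    return [v >> 1 for v in acc], active


if __name__ == "__main__":
    A = [[(k + 1) * (n - 3) for n in range(8)] for k in range(8)]
    x = [5, -3, 0, 7, -1, 2, 0, -2047]
    print(simulate_msb_first(A, x))
```

In this model the number of active clocks equals the number of 1 bits in the absolute values of the inputs; all other clocks leave the accumulators untouched.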
[0051]
The effect of the matrix vector multiplier according to the second embodiment, configured and operating as described above, will now be described. In the normal mode, when the contents of the input register 23 and the output register 25 are updated at the clock, integration is performed through a switching operation in the adder 24. In the hold mode, both input ports of the adder 24 are fixed at their values before the clock, so no switching operation occurs and unnecessary power consumption is suppressed.
[0052]
Therefore, power consumption decreases as the number of zeros in the binary representation of the input data increases. In particular, in the difference image data of moving image compression encoding and decoding using motion compensation, the data is distributed around 0, so when sign-absolute value representation is used the high-order bits have a high probability of being 0. When applied to a data string with this property, power consumption can be greatly reduced compared with the ordinary DA method. In addition, the ROM storing the matrix components is not randomly accessed and only sends out data sequentially in a fixed order, so its structure is very simple.
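The point about near-zero data can be made concrete by counting 1 bits in the two representations. The sketch below is only a rough proxy and not a power measurement: the function names and the sample values are hypothetical, and the count of 1 bits in the absolute value part is used because each such bit corresponds to one clock in which the adders are not held.

```python
def ones_twos_complement(value, width=12):
    """Number of 1 bits in the two's complement pattern of value."""
    return bin(value & ((1 << width) - 1)).count("1")


def ones_sign_magnitude(value, width=12):
    """Number of 1 bits in the absolute value part only.

    The sign bit is excluded because it selects the ROM data set and does
    not trigger an addition.
    """
    return bin(abs(value) & ((1 << (width - 1)) - 1)).count("1")


# Hypothetical difference-image samples clustered around 0
samples = [0, 1, -1, 2, -2, 3, -3]
tc_total = sum(ones_twos_complement(v) for v in samples)
sm_total = sum(ones_sign_magnitude(v) for v in samples)

if __name__ == "__main__":
    print("two's complement 1 bits:", tc_total)
    print("sign-absolute value 1 bits:", sm_total)
```

Small negative values are the decisive case: in two's complement they consist almost entirely of 1 bits, while their absolute value has only a few.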
[0053]
Next, a matrix vector multiplier according to a third embodiment of the present invention will be described with reference to FIGS. The multiplier according to the third embodiment of the present invention uses an algorithm that is essentially the same as that of the second embodiment, and is changed so that the order of addition is reversed (from the lower digit). Hereinafter, the configuration method, function, and effect will be described for N = 8 and M = 12.
[0054]
FIG. 7 shows the basic configuration of the multiplier according to the third embodiment of the present invention. In FIG. 7, the same components as in the second embodiment are provided, namely an arithmetic control means 12 having a counter 13 and a selector 14, a ROM 15, and a cumulative addition means 20. The matrix data storage means 11 is also configured in substantially the same manner as in the second embodiment, except that the bit slice is extracted for the least significant bit, not the most significant bit, of the absolute value portion, and is then selected by the 8:1 selector and sent out as the EXEC signal. In addition, when acquiring data from the ROM, the input-side register of the cumulative adder shifts the data left by 11 bits, sign-extends it, and sets it internally (that is, the data is multiplied by 2^11 in advance), so the lower 11 bits are always 0. Further, in the shift mode, the output register of the cumulative adder shifts its input right by 1 bit and sets it internally.
[0055]
The function of the multiplier according to the third embodiment is the same as that of the second embodiment. That is, prior to the arithmetic processing, the counter and the input and output registers of the cumulative adder are reset, and the input data is stored in the input register in sign-absolute value representation. When the processing is started, ROM data corresponding to the lower 3 bits of the counter and the sign is output sequentially to the cumulative adder. The selector also selects the bits b_0, ..., b_7 of the bit slice extracted from the least significant bit of the input data register according to the value indicated by the lower 3 bits of the counter, and sends them to the cumulative adder as the EXEC signal in this order. For example, when the lower 3 bits of the counter are 010, the second column component of the matrix (corresponding to 010) is supplied to the input-side register of the cumulative adder with a sign corresponding to the sign of the second component of the input data. At this time, if EXEC = 1 (the least significant bit of the absolute value of the second component of the input data is 1), the registers 23 and 25 of the cumulative addition circuit 21 are in the normal mode, so at the next clock the data from the ROM 15 is sign-extended and set in the 12th bit and above of the input register 23, and, as in the second embodiment, the sum of this value and the contents of the output-side register appears at the output port of the adder 24. On the other hand, if EXEC = 0, as in the second embodiment, the registers 23 and 25 of the cumulative addition circuit are both in the hold state, as shown in FIG. The value at the output port is therefore not set in the output-side register 25, and the adder 24 maintains its previous state and performs no switching operation.
[0056]
Thereafter, the same operation as in the second embodiment is performed 8 times. When the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adder, and at the next clock the data at the output port is shifted right by 1 bit (equivalent to multiplying by 1/2) and set in the output-side register 25. At this time, the absolute value portion of the input register 23 is also shifted right by 1 bit, so that the second digit of the input data enters the first digit of the input data register, and the processing for the first digit of the input data is completed. Thereafter, the above procedure is repeated 11 times, once for each digit of the absolute value portion. Although the data is multiplied by 2^11 in advance in the input register, the result is multiplied by 1/2 every time the output port is shifted right, so that the partial sum for the first digit is finally multiplied by 2^0, the partial sum for the second digit by 2^1, and likewise the result for the i-th digit by 2^{i-1}, and the sum of these is obtained. The processing for all digits is completed at the clock following the one in which the counter indicates 1010111, so the result of the product-sum operation is obtained by extracting data of the required bit width from the output register of the cumulative adder.
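The reversed digit order can be checked with the same kind of software model. The sketch below is an assumption-laden illustration (the name simulate_lsb_first is hypothetical, and Python integers stand in for fixed-width registers): each coefficient enters pre-shifted by 11 bits, and the accumulator is halved after every group of eight columns.

```python
def simulate_lsb_first(A, x, abs_bits=11):
    """Model of the third embodiment: LSB-first digits, pre-shifted ROM data.

    Assumes |x[n]| < 2**abs_bits and integer coefficients.
    """
    signs = [1 if v < 0 else 0 for v in x]
    mags = [abs(v) for v in x]                  # absolute value part of the inputs
    acc = [0] * len(A)
    for _ in range(abs_bits):                   # one pass per digit, lowest first
        for column, mag in enumerate(mags):
            if mag & 1:                         # EXEC = 1: least significant bit set
                for k, row in enumerate(A):
                    coeff = -row[column] if signs[column] else row[column]
                    acc[k] += coeff << abs_bits     # set 11 bits to the left
            # EXEC = 0: hold, nothing changes
        # SHIFT = 1: halve the sums (exact, the lowest bit is always 0 here)
        acc = [v >> 1 for v in acc]
        mags = [m >> 1 for m in mags]           # next digit moves into digit 1
    return acc


if __name__ == "__main__":
    A = [[1, 2], [3, -4]]
    x = [5, -3]
    print(simulate_lsb_first(A, x))
```

After the eleventh halving, the partial sum that entered for digit i carries the weight 2^{i-1}, so no correction bit is needed at the output.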
[0057]
The effect of the matrix vector multiplier according to the third embodiment, with the configuration and operation described above, will now be described. When data from the ROM is supplied to the input register, the portion corresponding to the difference between the bit width of the ROM data and the bit width of the cumulative adder must be sign-extended, as shown in FIG. That is, if the data passed from the ROM is negative, all the bits from the top of the ROM data to the top of the input-side register (14 bits if M = 12 and N = 8) must be set to 1, and if it is positive they must be set to 0, before the data is set in the register. Since difference image data is distributed almost uniformly on the positive and negative sides around 0, the ratio of positive to negative data passed from the ROM is almost 1:1 and their order cannot be predicted, so the switching probability of these upper bits may become very large. However, if the data is shifted by 11 bits in advance so that the lower 11 bits are always 0, the number of upper bits that switch with sign extension is reduced, and power consumption can be lowered.
[0058]
Finally, the configuration, function, and effect of the matrix vector multiplier according to the fourth embodiment of the present invention will be described with reference to FIG. The multiplier according to the fourth embodiment will also be described for the case where the first predetermined number is n columns and the second predetermined number is 8 rows, but the bit number M is 12.
[0059]
First, FIG. 9 shows the configuration of the matrix vector multiplier according to the fourth embodiment. The multiplier according to the fourth embodiment is provided with a ROM 15, a cumulative addition means 20, and a counter 13 having the same configuration as in the multipliers according to the second and third embodiments. The matrix data storage means 11 has substantially the same configuration as in the first to third embodiments, except that a bit slice can be extracted not only from the most significant bit of the absolute value portion but from an arbitrary bit. The second selector 34, having a selection ratio of : 1, controls which digit is taken out according to the upper 4 bits of the counter. The correspondence between the upper 4 bits of the counter 13 and the digits of the storage means 11 may be arbitrary, but if, for example, 0000 is associated with the 11th digit and 1010 with the first digit, the configuration can be made simpler.
[0060]
The functional operation of the multiplier according to the fourth embodiment based on the above configuration will be described. As in the multiplier according to the second embodiment, prior to the arithmetic processing, the counter 13 and the input and output registers 23 and 25 of the cumulative addition circuit 21 are reset, and the input data is stored in the input register in sign-absolute value representation. When the arithmetic processing is started, ROM data corresponding to the lower 3 bits of the counter 13 and the sign is transferred sequentially to the adder 24 of the cumulative addition circuit 21. In addition, the second selector 34 extracts a bit slice b_0, ..., b_7 from a certain digit (specified by the upper 4 bits of the counter) of the matrix data storage means 11, and the selector 14 selects its bits according to the value indicated by the lower 3 bits of the counter 13 and sends them to the cumulative addition circuit 21 as the EXEC signal S4 in this order. For example, when the counter 13 is 0011101, the fifth column component (corresponding to 101) of the matrix stored in the storage means 11 is passed to the cumulative adder with a positive or negative sign corresponding to the sign of the fifth component of the input data, and the fifth component of the bit slice taken out from the eighth digit (corresponding to 0011) of the input register is passed to the cumulative adder as the EXEC signal. Thereafter, cumulative addition is performed by the same procedure as in the second embodiment.
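The counter decoding in this example can be sketched as follows, assuming the example correspondence given for the second selector (0000 for the 11th digit down to 1010 for the first digit). The function names decode_counter and exec_signal are hypothetical.

```python
def decode_counter(counter):
    """Split a 7-bit counter value into (column number, digit number).

    Assumes the example mapping: upper bits 0000 -> 11th digit, 1010 -> 1st digit.
    """
    column = counter & 0b111             # lower 3 bits: column / selector 14
    upper = (counter >> 3) & 0b1111      # upper 4 bits: second selector 34
    digit = 11 - upper                   # digits are numbered 1 to 11
    return column, digit


def exec_signal(mags, counter):
    """Pick the EXEC bit for this clock from the absolute value parts."""
    column, digit = decode_counter(counter)
    return (mags[column] >> (digit - 1)) & 1


if __name__ == "__main__":
    print(decode_counter(0b0011101))     # column 5, eighth digit
```

With this mapping the digit index is a fixed function of the counter, so the input registers never need to shift.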
[0061]
In the case of the multiplier according to the second embodiment, the input register is shifted to the left by 1 bit each time addition of a certain digit is completed. Here, the digit from which the bit slice is extracted is switched according to the upper 4 bits of the counter. By simply adding the second selector 34 as hardware for this purpose, the mechanism for the left shift is omitted, and further, register updating caused by the shift is suppressed, so that further reduction in power consumption is possible.
[0062]
[Effects of the Invention]
As described above in detail, the matrix vector multiplier according to the present invention does not perform addition when the bit of the input matrix data is 0, so that power consumption can be significantly reduced. At the same time, the integration of each row can be processed in parallel by a plurality of cumulative addition circuits connected in parallel, so that the processing speed can be increased.
[0063]
Furthermore, since the number of switching operations of the input and output registers of the cumulative addition circuits provided in parallel can be reduced, power consumption can be reduced in this respect as well. In addition, since no random access to the read-only storage means occurs, the structure of the read-only storage means (ROM) can be simplified.
[0064]
In addition, the number of switching operations at the time of sign extension can be reduced, so a reduction in power consumption can be expected. Since the shift function of the matrix data storage means is not required, the structure of the storage means can be simplified and the number of switching operations is also reduced. Furthermore, the amount of hardware can be reduced by sharing the counter.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a matrix vector multiplier according to a first embodiment as a basic concept of the present invention.
FIG. 2 is a block diagram showing a configuration of a matrix vector multiplier according to a second embodiment of the present invention.
FIG. 3 is an explanatory diagram showing a table representing ROM functions commonly used in the multipliers of the second to fourth embodiments and a case where a + sign in the fifth column is designated.
FIG. 4 is a block diagram showing a cumulative adder circuit commonly used in a matrix vector multiplier according to the present invention.
FIG. 5 is an explanatory diagram showing a mode assignment table for the control signals input to the input and output registers 23 and 25 in FIG. 4.
FIG. 6 is a diagram illustrating an example of an operation state of the cumulative addition circuit in the multiplier according to the second embodiment. The bit slice of the most significant bit of the input data absolute value portion is (01101100), and the bit slice of the next digit is ( 0110..
FIG. 7 is a block diagram showing a configuration of a matrix vector multiplier according to a third embodiment of the present invention.
FIG. 8 is an explanatory diagram showing (a) the method of setting data in the input-side register of the cumulative addition circuit in the second embodiment of the multiplier according to the present invention, and (b) the method of setting data in the input-side register of the cumulative addition circuit in the third embodiment.
FIG. 9 is a block diagram showing a configuration of a matrix vector multiplier according to a fourth embodiment of the present invention.
FIG. 10 is a block diagram showing a configuration of a conventional matrix vector multiplier using the DA method.
FIG. 11 is an explanatory diagram showing a table stored in a ROM of a conventional matrix vector multiplier.
FIG. 12 is an explanatory diagram showing a calculation algorithm in a conventional multiplier.
[Explanation of symbols]
10 Matrix vector multiplier
11 Matrix data storage means
12 Arithmetic control means
13 Sign control unit (addressing counter)
14 Addition control unit (selector)
15 Read-only memory (ROM)
20 Cumulative addition means
21 (a to n) Cumulative addition circuit
23 (a to n) Input accumulation unit (input register -AREG-)
25 (a to n) Output accumulation unit (output register -BREG-)
34 Sign control unit (second selector)

Claims

A matrix vector multiplier comprising:
matrix data storage means for sequentially storing matrix data, composed of a first predetermined number of column components and a second predetermined number of row components, divided into a sign part and an absolute value part;
a sign control unit that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the sign part of the matrix data as a sign control signal, and an addition control unit that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal;
read-only storage means in which coefficient data corresponding to the column components of the matrix data, and the same data with the opposite sign, are stored in advance expressed as a sign part and an absolute value part, and which sequentially outputs the corresponding column component data based on the address control signal and the sign control signal output from the sign control unit; and
cumulative addition means comprising a plurality of cumulative addition circuits provided for the respective row components of the matrix data, each cumulative addition circuit comprising: an input accumulation unit that temporarily accumulates the coefficient data supplied from the read-only storage means in accordance with the sign of the column component and, based on the sign control signal and the addition control signal, moves the absolute value part of the coefficient data in a predetermined direction by a predetermined amount for each cycle of the address control signal; an addition unit that switches between adding and not adding the data from the read-only storage means in response to the addition control signal; and an output accumulation unit that temporarily accumulates the integrated value of the addition unit and moves the integrated value in the predetermined direction by the predetermined amount.
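
By way of non-limiting illustration only, the sign-and-absolute-value accumulation recited above may be sketched in software as follows. The function name `da_matvec_sign_magnitude`, the integer coefficient table `coeffs`, and the most-significant-bit-first ordering are assumptions introduced for this sketch and are not taken from the specification; the sketch models only the arithmetic, not the register-level circuit.

```python
def da_matvec_sign_magnitude(coeffs, x, width):
    """Multiply a K x N integer coefficient table by a vector x.

    Each x[n] is handled as a sign part and an absolute value part,
    with the absolute value scanned one bit slice at a time.
    Assumes abs(x[n]) < 2**width (hypothetical parameter).
    """
    signs = [-1 if v < 0 else 1 for v in x]   # sign part -> sign control signal
    mags = [abs(v) for v in x]                # absolute value part
    acc = [0] * len(coeffs)                   # one accumulator per output row

    for bit in reversed(range(width)):        # MSB-first bit slices (assumed order)
        acc = [a << 1 for a in acc]           # move the running sums by one bit
        for n in range(len(x)):               # address control: column number n
            if (mags[n] >> bit) & 1:          # addition control: add or skip
                for k, row in enumerate(coeffs):
                    # the table supplies the column with the sign already applied
                    acc[k] += signs[n] * row[n]
    return acc
```

Because the sign is applied when the coefficient column is selected, the inner loop only decides between adding and not adding; no subtraction step for a two's complement sign bit is needed in this sketch.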

The matrix vector multiplier according to claim 1, wherein the input accumulation unit of the cumulative addition means is constituted by an input register having a shift function for shifting the absolute value part of the coefficient data to the left bit by bit based on the sign control signal and the addition control signal, and the output accumulation unit is constituted by an output register that shifts the integrated value of the addition unit to the left bit by bit and outputs it based on the sign control signal and the addition control signal.

The matrix vector multiplier according to claim 1, wherein the input accumulation unit of the cumulative addition means is constituted by an input register having a shift function for shifting the absolute value part of the coefficient data to the right bit by bit based on the sign control signal and the addition control signal, and the output accumulation unit is constituted by an output register that shifts the integrated value of the addition unit to the right bit by bit and outputs it based on the sign control signal and the addition control signal.

The matrix vector multiplier according to claim 1, wherein the cumulative addition circuits of the cumulative addition means are provided in the same number as the first predetermined number, which is the number of columns of the matrix data, the read-only storage means simultaneously outputs data for as many row components as the first predetermined number of the matrix data, and the cumulative addition circuits operate in parallel to accumulate the row component data from the read-only storage means.

The matrix vector multiplier according to claim 1, wherein, in the cumulative addition circuit, a pair consisting of the input accumulation unit and the output accumulation unit, each having a function of switching, according to a combination of the addition control signal and the address control signal from the matrix data storage means, among a reset mode for setting the content to 0, a normal mode for updating the current value, a hold mode for holding the current value, and a shift mode in which the input is shifted by one bit and set internally, is connected to one input and to an output, respectively, and the output of the output accumulation unit is further branched and connected to the other input.
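
As a non-limiting illustration, the four register modes named above (reset, normal, hold, shift) may be modeled as a next-state function. The names `Mode` and `next_value`, the `shift_left` parameter, and the use of symbolic modes in place of the actual control-signal encoding are assumptions made for this sketch; the mode assignment table itself is not reproduced here.

```python
from enum import Enum


class Mode(Enum):
    RESET = "reset"    # set the content to 0
    NORMAL = "normal"  # update the content with the input
    HOLD = "hold"      # keep the current content
    SHIFT = "shift"    # store the input shifted by one bit


def next_value(mode, current, data_in, shift_left=True):
    """Return the register content after one clock in the given mode.

    shift_left selects the shift direction (hypothetical parameter);
    a left shift and a right shift correspond to the two shift
    directions recited in the dependent claims.
    """
    if mode is Mode.RESET:
        return 0
    if mode is Mode.NORMAL:
        return data_in
    if mode is Mode.HOLD:
        return current
    # Mode.SHIFT: the input is shifted by one bit and then stored
    return data_in << 1 if shift_left else data_in >> 1
```

In such a model the hold mode leaves the stored value unchanged, which is consistent with the reduction in switching operations described as an effect of the invention.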

The matrix vector multiplier according to claim 1, wherein the sign control unit of the arithmetic control means is constituted by a counter that increments its content by one for each clock signal.

The matrix vector multiplier according to claim 6, wherein the sign control unit of the arithmetic control means generates the address control signal from a plurality of lower-order bits of the counter.

The matrix vector multiplier according to claim 1, wherein the cumulative addition means uses a bit string at a specific position of the matrix data stored in the matrix data storage means as the addition control signal, and shifts the position of the bit string used as the addition control signal to the right by 1 bit each time one period of the address control signal elapses.

The matrix vector multiplier according to claim 8, wherein the cumulative addition means is configured to sign-extend the data read from the read-only storage means and hold it at a position shifted to the left by the input data bit width of the matrix data storage means.

The matrix vector multiplier according to claim 1, wherein the addition control unit of the arithmetic control means is constituted by a selector having a function of selecting a bit string consisting of arbitrary digits of the matrix data storage means and sending it out as a selection signal serving as the addition control signal.

The matrix vector multiplier according to claim 10, wherein the sign control unit of the arithmetic control means is constituted by a counter that increments its content by one for each clock signal, and the cumulative addition means uses a plurality of higher-order bits of the counter as the selection signal for the selector.
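
As a non-limiting illustration of sharing one counter, the lower-order bits may serve as the column address while the higher-order bits select the bit slice. The function name `split_counter` and the field widths `addr_bits` and `sel_bits` are assumptions for this sketch; a 3-bit address corresponds to the N = 8 case discussed in the description, and the selector width is chosen arbitrarily.

```python
def split_counter(count, addr_bits=3, sel_bits=3):
    """Split a free-running counter value into two control fields.

    Lower addr_bits  -> address control signal (column number).
    Next sel_bits    -> selection signal for the bit-slice selector.
    Both widths are hypothetical parameters.
    """
    address = count & ((1 << addr_bits) - 1)
    select = (count >> addr_bits) & ((1 << sel_bits) - 1)
    return address, select
```

With this arrangement the address field cycles through every column before the selection field advances by one, so a single counter sequences both the column scan and the bit-slice scan.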

The matrix vector multiplier according to claim 1, wherein the cumulative addition means uses a bit string at a specific position of the matrix data stored in the matrix data storage means as the addition control signal, and shifts the position of the bit string used as the addition control signal to the left by 1 bit each time one period of the address control signal elapses.

The matrix vector multiplier according to claim 12, wherein the cumulative addition means is configured to sign-extend the data read from the read-only storage means and hold it at a position shifted to the right by the input data bit width of the matrix data storage means.