JP3895031B2 - Matrix vector multiplier - Google Patents
[0001]
The present invention relates to a matrix vector multiplier, and more particularly to a matrix vector multiplier capable of performing matrix and vector operation processing at high speed and with low power consumption.
[Prior art]
A band compression technique widely used as a basic building-block algorithm for image data compression is based on the discrete cosine transform (hereinafter abbreviated as DCT, Discrete Cosine Transform) and its inverse (hereinafter abbreviated as IDCT, Inverse Discrete Cosine Transform). Both the DCT and the IDCT are orthogonal transforms, but in actual computation each is the multiplication of a constant-coefficient matrix by a vector, defined for the pixel-value vector x_n and the DCT-coefficient vector X_k by:
[Expression 1]
In MPEG (Moving Picture Experts Group) 1, MPEG2, and other international standards for moving picture compression, a two-dimensional DCT and IDCT with N = 8 are used, operating on square areas (blocks) of 64 pixels, 8 pixels each in the horizontal and vertical directions.
A moving image compression apparatus generally involves a very large amount of computation, and moving image compression encoding and decoding place very high performance demands on the processing system, particularly in applications that require real-time processing (HDTV broadcasting, videophones, video surveillance systems, and so on). The discrete cosine transform and its inverse are therefore generally computed using a fast algorithm and dedicated hardware. Widely used methods include Chen's fast algorithm, which greatly reduces the number of product-sum operations, and the distributed arithmetic (DA) method, which performs the calculation without multipliers. Dedicated hardware configurations applying these methods are disclosed in, for example, IEICE Trans. Electron., Vol. E75-C (1992), No. 4, pp. 390-397. The calculation of the discrete cosine transform by the DA method and the corresponding hardware configuration are briefly described here.
First, the k-th row of the matrix-vector multiplication of the above equation (1) can be expressed in the form of an inner product as follows:
[Expression 2]
Further, when the bit width of the hardware data path is M, each x_n is expressed as a binary number (two's complement representation):
[Equation 3]
Therefore, using the linearity of the inner product, the above equation (2) takes the form:
[Expression 4]
Here the partial sum
[Equation 5]
is a function of the i-th bits {b_{n,i}} of the input data {x_n}. If it is calculated in advance for every bit pattern (2^N patterns) and stored in memory, it can be read out by treating {b_{n,i}} as an N-bit address. Noting also that multiplication by 2^(i-1) corresponds to a left shift by i-1 bits, equation (4) can be obtained by a combination of memory references with an N-bit address, left shifts, additions, and subtractions.
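The expression images are not reproduced in this text. As a sketch consistent with the surrounding description, equations (2) to (5) can be written as follows. The form assumes coefficients a_{k,n}, bits b_{n,i} indexed from the least significant bit i = 0, and the symbol S_k(i) for the partial sum (a name introduced here for illustration); the exact notation of the original images may differ.

```latex
\begin{align}
X_k &= \sum_{n=0}^{N-1} a_{k,n}\, x_n \tag{2}\\
x_n &= -\,b_{n,M-1}\, 2^{M-1} + \sum_{i=0}^{M-2} b_{n,i}\, 2^{i},
       \qquad b_{n,i}\in\{0,1\} \tag{3}\\
X_k &= -\,2^{M-1}\, S_k(M-1) + \sum_{i=0}^{M-2} 2^{i}\, S_k(i) \tag{4}\\
S_k(i) &= \sum_{n=0}^{N-1} a_{k,n}\, b_{n,i} \tag{5}
\end{align}
```

If the bits are counted from 1 instead of 0, the weight 2^i becomes 2^(i-1), which is the form the text refers to. The first term of (4) is the sign-bit (MSB) contribution and the second term collects all lower bits.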
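The DA method is an application of the above principle. A specific multiplier configuration and procedure for N = 8 will be described below with reference to FIGS. 10 to 12. First, the circuit configuration of a conventional matrix vector multiplier will be described with reference to FIG. 10. In FIG. 10, the conventional matrix vector arithmetic circuit comprises: a matrix 1, which has an input port for each row and holds 0/1 information arranged in n columns (a first predetermined number) and k rows (a second predetermined number); a read-only memory (hereinafter abbreviated as ROM) 2, which stores a table described later with reference to FIG. 11 and from which the corresponding table entry is read using, as an address signal, the information read out of the matrix 1 in the row direction for each i-th bit; and a cumulative adder 5 that accumulates the values read from the ROM 2.
The operation of the conventional matrix vector multiplier having the above configuration will be described. First, as shown in FIG. 11, a table is created in which the partial sums corresponding to each row k (k = 0, ..., 7) of the matrix are calculated in advance, with the required precision (for example, 16 bits), for all 8-bit patterns {b_n} (2^8 = 256 patterns). As described above, the table shown in FIG. 11 is stored in advance in the ROM 2 shown in FIG. 10. With an 8-bit address and 16-bit precision, the capacity of this ROM 2 is 4 kilobytes.
Next, the least significant bit (hereinafter abbreviated as LSB) of the binary representation is taken from each of the eight input data {x_n; n = 0, ..., 7}, and, using these 8 bits {b_{n,0}} as the address signal, the partial-sum data read from the ROM 2 is accumulated for each row by the cumulative adder 5. Then the partial-sum data is read from the ROM 2 using the second bits from the bottom {b_{n,1}} as the address, shifted left by 1 bit (which corresponds to multiplying by 2), and accumulated by the cumulative adder 5. In the same way, as shown in FIG. 11, the operation of shifting the partial sum for the i-th bit from the bottom left by i-1 bits and accumulating it (see FIGS. 12(a) and 12(b)) is repeated up to just before the most significant bit (hereinafter abbreviated as MSB), whereby the second term of equation (4) is accumulated in the cumulative adder. Since the MSB is the sign bit, the partial sum is read at the address {b_{n,M-1}}, shifted left by M-1 bits, and accumulated after its sign is inverted; this adds the first term of equation (4) and yields the solution of equation (4).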
With this method, the product of the matrix and the vector data can be calculated by a simple general-purpose arithmetic circuit such as that shown in FIG. 10, without preparing dedicated multiplication hardware, which makes it suitable for high-speed processing with a small amount of hardware.
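The conventional DA procedure described above can be sketched in software as follows. This is only an illustration under stated assumptions: the function names `build_da_table` and `da_matvec` are hypothetical, the coefficients are small integers rather than fixed-point cosine values, and the Python lists stand in for the ROM and the cumulative adders.

```python
def build_da_table(row):
    """Precompute the partial sum of one matrix row for every N-bit pattern.

    Bit j of the address selects coefficient row[j] (stand-in for the ROM table).
    """
    n = len(row)
    return [sum(row[j] for j in range(n) if (addr >> j) & 1)
            for addr in range(1 << n)]


def da_matvec(matrix, x, width):
    """Multiply an integer matrix by signed `width`-bit integers using only
    table lookups, shifts, additions and one subtraction for the sign bit."""
    tables = [build_da_table(row) for row in matrix]
    mask = (1 << width) - 1
    words = [v & mask for v in x]            # two's complement bit patterns
    acc = [0] * len(matrix)
    for i in range(width):                   # LSB first, as in the text
        # Bit slice: bit i of every input component forms the table address.
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> i) & 1) << j
        for k, table in enumerate(tables):
            term = table[addr] << i          # left shift by i = multiply by 2**i
            # The MSB slice has weight -2**(width-1), so it is subtracted.
            acc[k] += -term if i == width - 1 else term
    return acc


# Hypothetical 2 x 4 example (the patent uses 8 x 8 with 12-bit inputs).
A = [[3, -1, 2, 0], [1, 4, -2, 5]]
x = [7, -3, 0, -8]
result = da_matvec(A, x, width=12)
```

Note that every one of the `width` bit slices triggers a lookup and an addition regardless of the data, which is the property the following section identifies as the problem.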
[Problems to be solved by the invention]
However, in the conventional matrix vector multiplier described above, the amount of calculation and the hardware utilization rate are always constant, regardless of the nature of the input data or any correlation in its content. For example, “0” is continuously set. The
As is well known, the DA method always performs the same operations on a data string of any nature, so unnecessary or trivial calculations, as in the above example, are repeated. The resulting increase in switching activity is not only inefficient but also increases power consumption. Furthermore, since the memory is accessed randomly using each digit of the input data as an address, a sophisticated function is required of the memory, and the structure of the
An object of the present invention is to provide a matrix vector multiplier, that is, an arithmetic circuit that multiplies a matrix of constant components, which appears frequently in image processing and similar fields, by a vector variable without using a general-purpose multiplier. By using adders and shifters that operate in parallel and a read-only memory that holds the matrix components, it omits unnecessary calculations to improve calculation efficiency and suppress the rise in power consumption, and it can be realized with a simple circuit structure.
[Means for Solving the Problems]
In order to achieve the above object, a matrix vector multiplier according to the present invention comprises: matrix data storage means that sequentially stores matrix data, composed of a first predetermined number of column components and a second predetermined number of row components, expressed as a sign part and an absolute value part; arithmetic control means including a sign control unit that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the sign part of the matrix data as a sign control signal, and an addition control unit that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal; read-only storage means in which coefficient data corresponding to the column components of the matrix data, and data of the opposite sign, are stored in advance expressed as a sign part and an absolute value part, and which sequentially outputs the data of the corresponding column component based on the address control signal and the sign control signal output from the sign control unit; and cumulative addition means including a plurality of cumulative addition circuits provided for each row component of the matrix data.
Each cumulative addition circuit comprises: an input accumulation unit that temporarily stores the coefficient data, corresponding to the sign of the column component, supplied from the read-only storage means, and that, based on the sign control signal and the addition control signal, moves the absolute value part of the coefficient data by a predetermined amount in a predetermined direction each time the address control signal completes one cycle; an adder that switches between adding and not adding the data from the read-only storage means; and an output accumulation unit that temporarily stores the accumulated value of the adder and moves the accumulated value by the predetermined amount in the predetermined direction.
Further, the matrix vector multiplier according to
Further, the matrix vector multiplier according to
Further, the matrix vector multiplier according to
A matrix vector multiplier according to
A matrix vector multiplier according to a sixth aspect of the present invention is the multiplier according to the first aspect, wherein the sign control unit of the arithmetic control means includes a counter that increments its content by one on each clock signal.
Further, the matrix vector multiplier according to
A matrix vector multiplier according to an eighth aspect of the present invention is the multiplier according to the first aspect, wherein the cumulative addition means uses a bit string at a specific digit of the matrix data stored in the matrix data storage means as the addition control signal, and the position of the bit string used as the addition control signal is shifted right by 1 bit every time one period of the address control signal elapses.
Furthermore, the matrix vector multiplier according to claim 9 is the multiplier according to
The matrix vector multiplier according to
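Further, the matrix vector multiplier according to claim 11 is the multiplier according to claim 10, wherein the sign control unit of the arithmetic control means is constituted by a counter that increments its content by one on each clock signal, and the cumulative addition means uses a plurality of upper bits of this counter as the selection signal of the selector.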
The matrix vector multiplier according to
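Further, the matrix vector multiplier according to claim 13 is the multiplier according to claim 12, wherein the cumulative addition means is configured to sign-extend the data read from the read-only storage means and then hold it at a position shifted right by the input data bit width of the matrix data storage means.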
Therefore, according to the matrix vector multiplier of the present invention, each component of the N-component input data is taken out bit by bit starting from the upper digit, the resulting bit string (N bits) is used as a control signal to switch the operation mode of the adder in sequence, and the coefficient data read from the ROM for each bit is accumulated. If the control bit is 1, the data held in the register is added; if it is 0, the register is held. Switching modes in this way reduces unnecessary additions and the switching caused by register updates, enabling low-power operation. After N coefficient data have been processed, the result is shifted left by 1 bit and the same operation is repeated for the next digit of the input data, and so on until all digits have been processed.
In order to perform the matrix multiplication correctly by the above procedure, sign-absolute value representation is used instead of two's complement representation for the input data of the cumulative addition means. In moving image compression using motion compensation, such as MPEG, difference image data such as P-pictures is highly likely to be concentrated around 0, so the number of zero bits increases. The number of switching events can therefore be reduced further by placing the operation mode in the hold state wherever zeros appear in succession.
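The benefit of sign-absolute value representation for values near zero can be illustrated by counting 1 bits. The sketch below is an illustration only; the helper names are hypothetical and the 12-bit width is taken from the embodiments described later.

```python
def to_sign_magnitude(value, width=12):
    """Split a signed integer into (sign bit, magnitude) for a `width`-bit
    sign-absolute value word; the magnitude occupies width - 1 bits."""
    magnitude = abs(value)
    if magnitude >= 1 << (width - 1):
        raise ValueError("value does not fit in the absolute value part")
    return (1 if value < 0 else 0), magnitude


def ones_twos_complement(value, width=12):
    """Number of 1 bits in the two's complement pattern of `value`."""
    return bin(value & ((1 << width) - 1)).count("1")


def ones_magnitude(value, width=12):
    """Number of 1 bits in the absolute value part only."""
    _, magnitude = to_sign_magnitude(value, width)
    return bin(magnitude).count("1")


# Small negative values have many 1 bits in two's complement
# but few in the absolute value part.
for v in (3, -3, -1, 0):
    print(v, ones_twos_complement(v), ones_magnitude(v))
```

For example, -3 has eleven 1 bits as a 12-bit two's complement word but only two in its absolute value part, so far fewer additions are triggered when each 1 bit enables one addition.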
Embodiments of a matrix vector multiplier according to the present invention will be described below in detail with reference to the drawings. First, the configuration of a matrix vector multiplier according to the first embodiment as the basic principle of the present invention will be described with reference to FIG.
In FIG. 1, a
The cumulative addition means 20 has a plurality of
The operation of the matrix vector multiplier according to the first embodiment is omitted here, since it is covered by the description of the operations of the second to fourth embodiments given in detail below.
Next, a matrix vector multiplier according to a second embodiment of the present invention will be described in detail with reference to the drawings. The multiplier according to the second embodiment is a more specific configuration of the multiplier according to the first embodiment.
The multiplier according to the second embodiment includes: N M-bit registers that store N pieces of input data in sign-absolute value representation; an N:1 selector that designates one of the N bits extracted from the (M-1)-th digit of the register group; a ROM that stores the matrix component data and can be read out column by column; cumulative adders that operate in parallel and accumulate the data from the ROM; and a counter that generates a signal designating the matrix column number. The algorithm, configuration method, function, and effect are described below for the case of N = 8 and M = 12. Here, in the configuration of the
First, the multiplier algorithm according to the second embodiment will be described. Expressing the input data x_n in sign-absolute value representation with a bit width of 12:
[Formula 6]
the inner product of equation (2) can, by linearity, be written as:
[Expression 7]
Since b_i = 0 or 1, equation (6) is ultimately obtained, for each row k of the matrix, by a combination of shift operations and additions of +a_{k,n}, -a_{k,n}, or 0.
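The images for equations (6) and (7) are not reproduced in this text. A form consistent with the description is sketched below, assuming a sign factor s_n (a symbol introduced here), absolute-value bits b_{n,i} indexed i = 0, ..., 10 as suggested by the reference to b_{n,10} as the most significant bit, and N = 8.

```latex
\begin{align}
x_n &= s_n \sum_{i=0}^{10} b_{n,i}\, 2^{i},
       \qquad s_n \in \{+1,-1\},\; b_{n,i}\in\{0,1\} \tag{6}\\
X_k &= \sum_{i=0}^{10} 2^{i} \sum_{n=0}^{7} \left(s_n\, a_{k,n}\right) b_{n,i} \tag{7}
\end{align}
```

Writing T_k(i) = \sum_n (s_n a_{k,n}) b_{n,i}, the sum in (7) can be evaluated from the top digit down as ((T_k(10) \cdot 2 + T_k(9)) \cdot 2 + \cdots) \cdot 2 + T_k(0), which uses only conditional additions of \pm a_{k,n} and 1-bit left shifts.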
Specifically, a read memory storing the data of each matrix component and the data of the opposite sign is prepared first, and +a_{k,n} or -a_{k,n} is read from this memory in the order n = 0, ... according to the sign of x_n. The eight input data x_n are expressed in sign-absolute value representation. For the bit string {b_{n,10}; n = 0, ..., 7} obtained by taking the most significant bit (10th digit) of the absolute value part, the corresponding +a_{k,n} (or -a_{k,n}) is accumulated by the cumulative adder only when b_{n,10} = 1; when it is 0, no operation is performed. After the processing for n = 0, ..., 7 is complete, the accumulated result is shifted left by 1 bit, and +a_{k,n} (or -a_{k,n}) is accumulated in the same way for the next bit (9th digit) of the absolute value part. By repeating this process down to the least significant bit, equation (6) is obtained.
In the above algorithm, the sign attached to a_{k,n} is determined for each column n by the sign of x_n (independently of k). Therefore, if the eight data of the n-th column of the matrix are stored in memory in advance as one set and the data of the opposite sign as another set, and one of the two sets is referenced as a whole for each column, the calculations for k = 0, ..., 7 can be performed in parallel.
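The algorithm of the second embodiment can be sketched in software as follows. This is a minimal illustration, not the circuit itself: `signmag_matvec` is a hypothetical name, the coefficients are small integers, and the inner loop over rows stands in for the accumulators that work in parallel in hardware. The accumulator is shifted before each digit instead of after it; since it starts at zero this is equivalent for the final result.

```python
def signmag_matvec(matrix, x, width=12):
    """MSB-first shift-and-add product using sign-absolute value inputs.

    An addition happens only when the current magnitude bit is 1;
    a 0 bit leaves every accumulator untouched.
    """
    rows = len(matrix)
    signs = [-1 if v < 0 else 1 for v in x]
    mags = [abs(v) for v in x]
    if any(m >= 1 << (width - 1) for m in mags):
        raise ValueError("input does not fit in the absolute value part")
    acc = [0] * rows
    for i in range(width - 2, -1, -1):        # magnitude bits, MSB first
        acc = [a << 1 for a in acc]           # shift earlier digits left by 1
        for n, m in enumerate(mags):          # columns n = 0, 1, ...
            if (m >> i) & 1:                  # control bit = 1: add
                for k in range(rows):         # all rows k at once in hardware
                    acc[k] += signs[n] * matrix[k][n]   # +a[k][n] or -a[k][n]
    return acc


# Hypothetical 2 x 4 example.
A = [[3, -1, 2, 0], [1, 4, -2, 5]]
x = [7, -3, 0, -8]
result = signmag_matvec(A, x)
```

Because the sign of each coefficient depends only on the column n, a single choice between the stored set and its opposite-sign set serves every row.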
Next, the overall configuration of the multiplier according to the second embodiment will be described with reference to FIG.
The above configuration will be described in more detail. First, a total of 128 data items are required: each of the 8 × 8 components constituting the matrix and the data obtained by inverting its sign (that is, expressed in two's complement). These are prepared with a bit width corresponding to the required precision. The required bit width varies with the field of application; in the case of MPEG1, MPEG2, and the like, for example, a precision of about 16 bits is required. These data can be stored in a read-only memory, but since random access to the memory by address reference does not occur in the present invention, a sequencer built from combinational circuits or the like may be used instead. The
As shown in FIG. 3, the
As shown in FIG. 5, each
The bit width of the adder is 16 + 11 + 3 bits when the ROM data is 16 bits, the input data is 12 bits (M = 12), and the input has 8 components (N = 8); in this second embodiment, one more bit is provided for a reason described later. The eight input data are usually given in two's complement representation, but they are converted into sign-absolute value representation and stored in the input data registers. This can be done by a well-known method, for example: only when the most significant bit (MSB; 12th digit) of the input data is 1,
Also, the 8-bit bit string (bit slice) {b_n; n = 0, ..., 7} taken from the most significant digit (11th digit) of the absolute value part (1st to 11th digits) of the input data registers is selected by an 8:1 selector, and the selector output is used as the EXEC signal of the input-side and output-side registers of each cumulative adder to switch between normal mode (EXEC = 1) and hold mode (EXEC = 0). The input data register includes a shifter; when it receives SHIFT = 1, its absolute value part is shifted left by 1 bit on the next clock.
The output signal S3 of the lower 3 bits of the counter (7-bit width) takes a value of 0 to 7, and is used for both the signal S1 for designating the column number of the
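Next, the operation of the cumulative adder in the multiplier according to the second embodiment will be described with reference to FIG. 5. Prior to the arithmetic processing, the counter and the input/output registers of the cumulative adders are reset, and the input data is converted into sign-absolute value representation and stored in the input registers in advance. When the arithmetic processing starts, ROM data corresponding to the counter value and the sign is supplied sequentially to the cumulative addition circuits 21a to 21n, and the selector 14 sends the 11th bits b_0, ..., b_7 of the input registers 23a to 23n, in this order according to the value of the counter 13, to the adders 24a to 24n as the EXEC signal. For example, when the lower 3 bits of the counter 13 are 000, the 0th column component of the matrix is passed to the input-side register of the cumulative adder with a positive or negative sign corresponding to the sign of the 0th component of the input data. At this time, if EXEC = 1 (the most significant bit of the absolute value of the 0th component of the input data is 1), the register 23 of the cumulative addition circuit 21 is in normal mode, so on the next clock the data from the ROM 15 is sign-extended and set in the input register 23, and the sum of this value and the contents of the output register 25 appears at the output port. If EXEC = 1, this value is set in the output register 25 on the following clock. If EXEC = 0, on the other hand, both registers 23 and 25 of the cumulative addition circuit 21 are in the hold state: the data from the ROM 15 is not set in the input register 23, and the value at the output port of the adder 24 is not set in the output register 25. As shown in FIG. 6, the adder 24 therefore keeps its previous state and performs no switching operation.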
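When the above operation has been performed eight times and the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adder, and on the next clock the data at the output port is shifted left by 1 bit (equivalent to multiplying by 2) and set in the output register 25. At this time, the absolute value part of the input register 23 is also shifted left by 1 bit, so that the 10th digit of the input data moves into the 11th digit of the input register 23. Thereafter, the above procedure is repeated for all digits of the absolute value part (11 times). Since the result is multiplied by 2 each time the output port is shifted, the partial sum for the 11th digit is ultimately multiplied by 2^11, that for the 10th digit by 2^10, and in general the result for the i-th digit by 2^i, and the total of these is obtained. All digits have been processed at the clock following the one at which the counter 13 indicates "1010111", so the product-sum result is obtained by taking data of the required bit width from the output register of the cumulative adder. Because the power of 2 applied to each partial sum is one too large, the bit width of the output register is increased by 1 bit in advance, and the correct result is obtained by ignoring the least significant bit of the output when the result is read out.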
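The counter-driven sequence of the second embodiment can be traced with a small behavioral model. The sketch below is an assumption-laden illustration: the names `simulate`, `regs`, and `out` are hypothetical, the one-clock delays of the input and output registers are not modeled, and the coefficients are integers. It does reproduce the shift after every eighth count and the final removal of the extra factor of 2.

```python
N, M = 8, 12                 # number of components and input bit width
DIGITS = M - 1               # absolute value part: 11 digits


def simulate(matrix, x):
    """Step a 7-bit counter through the 88 cycles of the second embodiment."""
    signs = [1 if v < 0 else 0 for v in x]       # sign parts
    regs = [abs(v) for v in x]                   # absolute value parts
    out = [0] * len(matrix)                      # one output register per row
    for counter in range(DIGITS * N):            # 0 .. 87 (binary 1010111)
        col = counter & 0b111                    # lower 3 bits: column number
        exec_bit = (regs[col] >> (DIGITS - 1)) & 1   # top absolute-value bit
        if exec_bit:                             # EXEC = 1: normal mode
            for k, row in enumerate(matrix):
                out[k] += -row[col] if signs[col] else row[col]
        # EXEC = 0: hold mode, nothing is updated.
        if col == 0b111:                         # SHIFT = 1 after 8 columns
            out = [v << 1 for v in out]
            regs = [(r << 1) & ((1 << DIGITS) - 1) for r in regs]
    # Each partial sum carries one extra factor of 2: ignore the lowest bit.
    return [v >> 1 for v in out]


# Hypothetical 8 x 8 integer matrix and 12-bit inputs.
matrix = [[(k + n) % 5 - 2 for n in range(N)] for k in range(N)]
x = [5, -3, 0, 100, -7, 0, 1, -1]
assert simulate(matrix, x) == [sum(a * v for a, v in zip(row, x)) for row in matrix]
```

The last counter value, 87, is 1010111 in binary, matching the value quoted in the text.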
The effect of the matrix vector multiplier according to the second embodiment operating as described above with the above configuration will be described. First, in the normal mode, when the contents of the
Power consumption therefore decreases as the number of zeros in the binary representation of the input data increases. In particular, in the difference image data of moving image compression encoding and decoding using motion compensation, the data is distributed around 0, so in sign-absolute value representation the upper bits have a high probability of being 0. When the multiplier is applied to data with this property, power consumption can be greatly reduced compared with the ordinary DA method. In addition, the ROM storing the matrix components is not accessed randomly and only sends out data sequentially in a fixed order, so its structure can be very simple.
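As a rough proxy for this effect, one can count how many of the addition slots actually update a register. The sketch below is illustrative only: `count_updates` is a hypothetical helper, the sample values are invented, and the count is not a power measurement.

```python
def count_updates(x, width=12):
    """Return (updates performed, total slots) for one cumulative adder
    when additions are skipped for every 0 bit of the absolute value parts."""
    digits = width - 1
    mags = [abs(v) for v in x]
    if any(m >= 1 << digits for m in mags):
        raise ValueError("input does not fit in the absolute value part")
    performed = sum(bin(m).count("1") for m in mags)   # one update per 1 bit
    return performed, digits * len(x)                  # 11 digits x N columns


# Hypothetical difference-image values clustered around 0.
residual = [2, -1, 0, 0, 3, 0, -1, 0]
performed, total = count_updates(residual)
```

For this sample only 5 of the 88 slots perform an addition; the remaining slots leave the registers in the hold state.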
Next, a matrix vector multiplier according to a third embodiment of the present invention will be described with reference to the drawings. The multiplier according to the third embodiment uses essentially the same algorithm as the second embodiment, modified so that the order of addition is reversed (starting from the lower digit). The configuration method, function, and effect are described below for N = 8 and M = 12.
FIG. 7 shows a basic configuration of a multiplier according to the third embodiment of the present invention. In FIG. 7, the same configuration as that of the second embodiment, that is, an arithmetic control means 12 having a
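The function of the multiplier according to the third embodiment is similar to that of the second embodiment. That is, prior to the arithmetic processing, the counter and the input/output-side registers of the cumulative adders are reset, and the input data is stored in the input registers in sign-absolute value representation. When processing starts, ROM data corresponding to the lower 3 bits of the counter and the sign is output sequentially to the cumulative adders. The selector selects the bit slice b_0, ..., b_7 taken from the least significant bit of the input data registers, according to the value indicated by the lower 3 bits of the counter, and sends the bits in this order to the cumulative adders as the EXEC signal. For example, when the lower 3 bits of the counter are 010, the second column component of the matrix (corresponding to 010) is supplied to the input-side register of the cumulative adder with a sign corresponding to the sign of the second component of the input data. At this time, if EXEC = 1 (the least significant bit of the absolute value of the second component of the input data is 1) and the registers 23 and 25 of the cumulative addition circuit 21 are in normal mode, the data from the ROM 15 is sign-extended and set in the 12th and higher bits of the input register 23 on the next clock, and, as in the second embodiment, the sum of this value and the contents of the output-side register appears at the output port of the adder 24. If EXEC = 0, on the other hand, both registers 23 and 25 of the cumulative addition circuit are in the hold state as shown in FIG. 5, again as in the second embodiment: the data from the ROM is not set in the input-side register 23, and the value at the output port is not set in the output-side register 25, so the adder 24 keeps its previous state and performs no switching operation.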
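Thereafter, the same operation as in the second embodiment is performed eight times. When the lower 3 bits of the counter indicate 111, SHIFT = 1 is sent to the cumulative adder, and on the next clock the data at the output port is shifted right by 1 bit (equivalent to multiplying by 1/2) and set in the output-side register 25. At this time, the absolute value part of the input register 23 is also shifted right by 1 bit, so that the 2nd digit of the input data moves into the 1st digit of the input data register, and the processing for the 1st digit of the input data is complete. Thereafter, the above procedure is repeated 11 times, once for each digit of the absolute value part. The data supplied to the input register has been multiplied by 2^11 in advance, but the result is multiplied by 1/2 each time the output port is shifted right, so the partial sum for the 1st digit is ultimately multiplied by 2^0, that for the 2nd digit by 2, and in general the result for the i-th digit by 2^(i-1), and the total of these is obtained. All digits have been processed at the clock following the one at which the counter indicates 1010111, so the product-sum result is obtained by taking data of the required bit width from the output register of the cumulative adder.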
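The LSB-first variant of the third embodiment can be sketched as follows. This is an illustration under assumptions: `lsb_first_matvec` is a hypothetical name, coefficients are integers, and register timing is not modeled. Each coefficient is added pre-scaled by 2^11, and the accumulator is shifted right by 1 bit after each digit.

```python
def lsb_first_matvec(matrix, x, width=12):
    """LSB-first product: add coefficients pre-scaled by 2**(width-1),
    then shift the accumulator right by 1 bit after every digit."""
    digits = width - 1
    signs = [-1 if v < 0 else 1 for v in x]
    mags = [abs(v) for v in x]
    if any(m >= 1 << digits for m in mags):
        raise ValueError("input does not fit in the absolute value part")
    acc = [0] * len(matrix)
    for i in range(digits):                       # 1st digit (LSB) first
        for n, m in enumerate(mags):
            if (m >> i) & 1:                      # EXEC = 1 for this column
                for k, row in enumerate(matrix):
                    # Coefficient placed above the lower 11 bits (x 2**11).
                    acc[k] += (signs[n] * row[n]) << digits
        # Shift right after each digit; the value is always even here,
        # so no information is lost.
        acc = [a >> 1 for a in acc]
    return acc


# Hypothetical 2 x 4 example.
A = [[3, -1, 2, 0], [1, 4, -2, 5]]
x = [7, -3, 0, -8]
result = lsb_first_matvec(A, x)
```

After all 11 digits, the digit processed first has been shifted right 11 times and the last one once, which gives the weights 2^0 through 2^10 stated in the text without any final correction.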
The effect of the matrix vector multiplier according to the third embodiment, configured and operating as described above, will now be described. When data from the ROM is supplied to the input register, the portion corresponding to the difference between the bit width of the ROM data and that of the cumulative adder must be sign-extended, as illustrated in the drawings. That is, if the data passed from the ROM is negative, all the bits from the top of the ROM data to the top of the input-side register (14 bits if M = 12 and N = 8) must be set to 1; if it is positive, they must all be set to 0. Since difference image data is distributed almost uniformly on both the positive and negative sides of 0, the ratio of positive to negative data passed from the ROM is almost 1:1 and the order cannot be predicted, so the switching probability of these bits can become very large. However, if the data is set shifted by 11 bits in advance so that the lower 11 bits are always 0, the number of switching events in the upper bits associated with sign extension is reduced, and power consumption can be lowered.
Finally, the configuration, function, and effect of the matrix vector multiplier according to the fourth embodiment of the present invention will be described with reference to the drawings. The multiplier according to the fourth embodiment is likewise described for the case where the first predetermined number is n columns, the second predetermined number is 8 rows, and the number of bits M is 12.
First, FIG. 9 shows the configuration of a matrix vector multiplier according to the fourth embodiment. The multiplier according to the fourth embodiment is provided with a
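The functional operation of the multiplier according to the fourth embodiment, based on the above configuration, will now be described. As in the multiplier according to the second embodiment, prior to the arithmetic processing the counter 13 and the input/output-side registers 23 and 25 of the cumulative addition circuit 21 are reset, and the input data is stored in the input-side register 23 in sign-absolute value representation. When the arithmetic processing starts, ROM data corresponding to the lower 3 bits of the counter 13 and the sign is passed sequentially to the adder 24 of the cumulative addition circuit 21. The second selector 34 takes the bit slice b_0, ..., b_7 from the digit of the matrix data storage means 11 designated by the upper 4 bits of the counter; the selector 14 selects from this slice according to the value indicated by the lower 3 bits of the counter 13, and the bits are sent in this order to the cumulative addition circuit 21 as the EXEC signal S4. For example, when the counter 13 is 0011101, the fifth column component (corresponding to 101) of the matrix stored in the storage means 11 is passed to the input-side register 23 of the cumulative addition circuit 21 with a positive or negative sign corresponding to the sign of the fifth component of the input data, and the fifth component of the bit slice taken from the 8th digit (corresponding to 0011) of the input register is passed to the cumulative adder as the EXEC signal. Cumulative addition is then performed by the same procedure as in the second embodiment.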
In the case of the multiplier according to the second embodiment, the input register is shifted to the left by 1 bit each time addition of a certain digit is completed. Here, the digit from which the bit slice is extracted is switched according to the upper 4 bits of the counter. By simply adding the second selector 34 as hardware for this purpose, the mechanism for the left shift is omitted, and further, register updating caused by the shift is suppressed, so that further reduction in power consumption is possible.
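The digit selection of the fourth embodiment can be sketched as a decoding of the 7-bit counter. This is an illustration only: `decode_counter` and `exec_signal` are hypothetical names, and the mapping from the upper 4 bits to a digit (0000 to the 11th digit, counting down) is inferred from the single example in the text, where 0011 corresponds to the 8th digit.

```python
def decode_counter(counter, digits=11):
    """Split a 7-bit counter into (column, digit).

    Digits are counted from 1 at the least significant bit of the
    absolute value part; the upper-bit mapping is an assumption.
    """
    column = counter & 0b111             # lower 3 bits: column number
    digit = digits - (counter >> 3)      # upper 4 bits: 0 -> 11th digit
    return column, digit


def exec_signal(magnitudes, counter):
    """Pick the EXEC bit directly from the unshifted absolute value parts."""
    column, digit = decode_counter(counter)
    return (magnitudes[column] >> (digit - 1)) & 1


# Example from the text: counter 0011101 selects column 5 and the 8th digit.
column, digit = decode_counter(0b0011101)
```

Because the bit is selected in place, the input registers never need to be shifted, which is the source of the additional saving described above.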
[Effect of the invention]
As described above in detail, the matrix vector multiplier according to the present invention does not perform addition when the bit of the input matrix data is 0, so that power consumption can be significantly reduced. At the same time, the integration of each row can be processed in parallel by a plurality of cumulative addition circuits connected in parallel, so that the processing speed can be increased.
Furthermore, since the number of switching operations of the input/output registers of the cumulative addition circuits provided in parallel can be reduced, power consumption can be reduced in this respect as well. In addition, since no random access to the read-only storage means occurs, the structure of the read-only storage means (ROM) can be simplified.
In addition, the number of times of switching at the time of sign extension can be reduced, and a reduction in power consumption can be expected, and since the shift function of the matrix data storage means is not required, the structure of the storage means can be simplified, The number of switching operations is also reduced. Furthermore, the amount of hardware can be reduced by sharing the counter.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a matrix vector multiplier according to a first embodiment as a basic concept of the present invention.
FIG. 2 is a block diagram showing a configuration of a matrix vector multiplier according to a second embodiment of the present invention.
FIG. 3 is an explanatory diagram showing a table representing ROM functions commonly used in the multipliers of the second to fourth embodiments and a case where a + sign in the fifth column is designated.
FIG. 4 is a block diagram showing a cumulative adder circuit commonly used in a matrix vector multiplier according to the present invention.
FIG. 5 is an explanatory diagram showing a mode assignment table for the control signals respectively input to the input/output registers 23 and 25 in FIG. 4.
FIG. 6 is a diagram illustrating an example of an operation state of the cumulative addition circuit in the multiplier according to the second embodiment. The bit slice of the most significant bit of the input data absolute value portion is (01101100), and the bit slice of the next digit is ( 0110..
FIG. 7 is a block diagram showing a configuration of a matrix vector multiplier according to a third embodiment of the present invention.
FIGS. 8A and 8B are explanatory diagrams showing (a) the method of setting data in the input-side register of the cumulative addition circuit in the second embodiment of the multiplier according to the present invention, and (b) the method of setting data in the input-side register of the cumulative addition circuit in the third embodiment.
FIG. 9 is a block diagram showing a configuration of a matrix vector multiplier according to a fourth embodiment of the present invention.
FIG. 10 is a block diagram showing a configuration of a conventional matrix vector multiplier using the DA method.
FIG. 11 is an explanatory diagram showing a table stored in a ROM of a conventional matrix vector multiplier.
FIG. 12 is an explanatory diagram showing a calculation algorithm in a conventional multiplier.
[Explanation of symbols]
10 Matrix vector multiplier
11 Matrix data storage means
12 Calculation control means
13 Code control unit (addressing counter)
14 Addition control unit (selector)
15 Read-only memory (ROM)
20 Cumulative addition means
21 (a to n) Cumulative addition circuit
23 (a to n) Input accumulation unit (input register -AREG-)
25 (a to n) Output accumulation unit (output register -BREG-)
34 Code control unit (second selector)
Claims (13)
A matrix vector multiplier comprising:
matrix data storage means for sequentially storing matrix data, composed of a first predetermined number of column components and a second predetermined number of row components, as a code (sign) part and an absolute value part;
a code control unit that outputs an address control signal designating the column number of a specific column component of the matrix data stored in the matrix data storage means and outputs the code part of the matrix data as a code control signal, and an addition control unit that outputs the absolute value part of the matrix data corresponding to the address control signal as an addition control signal;
read-only storage means that stores in advance the coefficient data corresponding to the column components of the matrix data, together with the data of the opposite sign, each expressed as a code part and an absolute value part, and that sequentially outputs the corresponding column component data based on the address control signal and the code control signal output from the code control unit; and
cumulative addition means comprising a plurality of cumulative addition circuits provided one for each row component of the matrix data, each cumulative addition circuit comprising: an input accumulation unit that temporarily accumulates the coefficient data supplied from the read-only storage means in accordance with the code of the column component and that, based on the code control signal and the addition control signal, moves the absolute value part of the coefficient data in a predetermined direction by a predetermined amount for each cycle of the address control signal; an addition unit that switches between adding and not adding the data from the read-only storage means in response to the addition control signal; and an output accumulation unit that temporarily accumulates the integrated value of the addition unit and moves the integrated value in the predetermined direction by the predetermined amount.
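The elements recited in the claim can be mirrored in a short software model. This is a simplified sketch, not the claimed circuit: the helper names (`to_sign_magnitude`, `build_rom`, `CumulativeAdder`, `multiply`) are hypothetical, coefficients are assumed to be integers, the ROM is modeled as a dictionary holding each coefficient with both signs, and only the output register is shifted, which simplifies the register movement described in the claim.

```python
from dataclasses import dataclass


def to_sign_magnitude(value):
    """Split a signed integer into (sign_bit, magnitude); sign_bit 1 means negative."""
    return (1 if value < 0 else 0, abs(value))


def build_rom(coeffs):
    """Hypothetical ROM image: entry (column, sign_bit) holds +c or -c."""
    rom = {}
    for col, c in enumerate(coeffs):
        rom[(col, 0)] = c
        rom[(col, 1)] = -c
    return rom


@dataclass
class CumulativeAdder:
    areg: int = 0  # input register (AREG)
    breg: int = 0  # output register (BREG)

    def step(self, coeff, add_enable):
        # Assumption: registers change only when the addition control bit is 1.
        if add_enable:
            self.areg = coeff
            self.breg += self.areg

    def shift(self):
        self.breg <<= 1


def multiply(matrix_rows, coeffs, mag_bits):
    rom = build_rom(coeffs)
    encoded = [[to_sign_magnitude(v) for v in row] for row in matrix_rows]
    adders = [CumulativeAdder() for _ in matrix_rows]  # one circuit per row
    for bit in range(mag_bits - 1, -1, -1):            # MSB first
        for adder in adders:
            adder.shift()
        for col in range(len(coeffs)):                 # address control signal
            for adder, row in zip(adders, encoded):
                sign_bit, magnitude = row[col]         # code / addition control
                adder.step(rom[(col, sign_bit)], (magnitude >> bit) & 1)
    return [adder.breg for adder in adders]


print(multiply([[3, -5, 0, 7], [-1, 2, -4, 8]], [91, -34, 12, 5], mag_bits=4))
```

Storing both signs of each coefficient lets the sign part of the input select the operand directly, so the accumulators in this model only ever add.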
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
JP02539698A JP3895031B2 (en) | 1998-02-06 | 1998-02-06 | Matrix vector multiplier |
Publications (2)
Publication Number | Publication Date |
JPH11224246A (en) | 1999-08-17 |
JP3895031B2 (en) | 2007-03-22 |
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
JP02539698A Expired - Fee Related JP3895031B2 (en) | 1998-02-06 | 1998-02-06 | Matrix vector multiplier |
Country Status (1)
Country | Link |
JP (1) | JP3895031B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
JP7225831B2 (en) | 2019-01-23 | 2023-02-21 | 富士通株式会社 | Processing unit, program and control method for processing unit |
Also Published As
Publication number | Publication date |
JPH11224246A (en) | 1999-08-17 |
Similar Documents
Publication | Publication Date | Title |
KR100329339B1 (en) | An apparatus for performing multiply-add operations on packed data | |
US6546480B1 (en) | Instructions for arithmetic operations on vectored data | |
JPH07236143A (en) | High-speed digital signal decoding method | |
KR19990044304A (en) | Instruction set for compressed data operation | |
US6574651B1 (en) | Method and apparatus for arithmetic operation on vectored data | |
US7020671B1 (en) | Implementation of an inverse discrete cosine transform using single instruction multiple data instructions | |
EP1065884A1 (en) | Dct arithmetic device | |
JP3857308B2 (en) | Apparatus and method for performing inverse discrete cosine transform | |
KR100241078B1 (en) | Vector processor instruction for video compression and decompression | |
EP0928100B1 (en) | Run-length encoding | |
JP3895031B2 (en) | Matrix vector multiplier | |
JP2002519957A (en) | Method and apparatus for processing a sign function | |
JP5589628B2 (en) | Inner product calculation device and inner product calculation method | |
US5671169A (en) | Apparatus for two-dimensional inverse discrete cosine transform | |
JP4243277B2 (en) | Data processing device | |
US8693796B2 (en) | Image processing apparatus and method for performing a discrete cosine transform | |
JP4405452B2 (en) | Inverse conversion circuit | |
JP2790911B2 (en) | Orthogonal transform operation unit | |
KR100575285B1 (en) | High speed low power discrete cosine converter and method | |
JP3610564B2 (en) | Information processing device | |
KR100350943B1 (en) | Fast Discrete Cosine Transform Processors using Distributed Arithmetic | |
KR100408884B1 (en) | Discrete cosine transform circuit of distributed arithmetic | |
Zandonai et al. | An architecture for MPEG motion estimation | |
JP2004234407A (en) | Data processor | |
Ismail et al. | High speed on-chip multiple cosine transform generator |
Legal Events
Date | Code | Title | Description |
2004-04-13 | A621 | Written request for application examination | Free format text: JAPANESE INTERMEDIATE CODE: A621 |
2006-04-25 | A977 | Report on retrieval | Free format text: JAPANESE INTERMEDIATE CODE: A971007 |
2006-05-12 | A131 | Notification of reasons for refusal | Free format text: JAPANESE INTERMEDIATE CODE: A131 |
2006-07-11 | A521 | Written amendment | Free format text: JAPANESE INTERMEDIATE CODE: A523 |
 | TRDD | Decision of grant or rejection written | |
2006-12-01 | A01 | Written decision to grant a patent or to grant a registration (utility model) | Free format text: JAPANESE INTERMEDIATE CODE: A01 |
2006-12-13 | A61 | First payment of annual fees (during grant procedure) | Free format text: JAPANESE INTERMEDIATE CODE: A61 |
 | LAPS | Cancellation because of no payment of annual fees | |