JP2012022363A

JP2012022363A - Inner product calculation device and inner product calculation method

Info

Publication number: JP2012022363A
Application number: JP2010157564A
Authority: JP
Inventors: Atsuo Hashimoto; 篤男橋本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2010-07-12
Filing date: 2010-07-12
Publication date: 2012-02-02
Anticipated expiration: 2030-07-12
Also published as: JP5589628B2

Abstract

PROBLEM TO BE SOLVED: To provide an inner product calculation device and an inner product calculation method that use a multiplier-free computing unit with a small amount of hardware, achieve a short cycle time suited to highly parallel computing, and perform the inner product calculation efficiently and accurately even without a ROM. SOLUTION: An inner product calculation device comprises: input element registers 2 for storing a plurality of input vector elements; a barrel shifter 3 for calculating the partial product of a power-of-two term of a constant vector element and an input vector element; an adder-subtracter 4 for accumulating the partial products; an accumulator 5 for storing the accumulation result of the adder-subtracter 4; a shifter 6 for rounding the intermediate accumulation result stored in the accumulator 5; and calculation control means that first accumulates the partial products of the least significant power-of-two terms of the constant vector elements with all corresponding input vector elements, then repeats the accumulation for successively higher-order power-of-two terms up to the most significant power-of-two term.

Description

The present invention relates to an inner product calculation device and an inner product calculation method for computing the vector inner products used in orthogonal transforms such as the discrete cosine transform.

The vector inner product is the core operation of orthogonal transforms, typified by the discrete cosine transform, which is widely used in fields such as image compression. Because the amount of computation is enormous, meeting demands for high-speed processing such as real-time processing generally requires a large amount of hardware, and the cost of the apparatus tends to increase. The discrete cosine transform (hereinafter, DCT), to which the inner product operation is applied, is briefly described below. The following is the general formula of the Nth-order DCT for a one-dimensional sequence.
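The formula itself is not reproduced in this text. For reference, one common form is the orthonormal DCT-II shown below; it is given as an assumption, not as the patent's own equation. The symbols X_n and Z_k follow the input and output vector names used later in the description, and c_k is a normalization factor introduced here. With N = 8 and coefficients scaled by 2^10, this form reproduces the row (502, 426, 284, 100, −100, −284, −426, −502) quoted later in the description.

```latex
Z_k = c_k \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} X_n \cos\frac{(2n+1)k\pi}{2N},
\qquad k = 0, 1, \dots, N-1,
\qquad c_k = \begin{cases} 1/\sqrt{2} & (k = 0) \\ 1 & (k \neq 0) \end{cases}
```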

For N = 8, the 8th-order DCT is expressed by the following matrix product.

The matrix on the right-hand side of this equation is called the DCT coefficient matrix; written in decimal notation, it is the following matrix.

To process such a matrix expression with digital hardware, the coefficients can be represented, for example for fixed-point arithmetic, as a sign and a 10-bit integer, which gives the following integer matrix expression.

DCT processing of planar image signals is mostly 8th-order, as in the above equation. For example, the DCT used in the JPEG (Joint Photographic Experts Group) scheme operates on blocks of 8 × 8 pixels (8 horizontal by 8 vertical): a one-dimensional DCT is applied in the vertical direction and then in the horizontal direction, decomposing the block into frequency components on an 8 × 8 two-dimensional plane. This is the so-called two-dimensional DCT. In the following, for ease of explanation, each input vector element is an 8-bit integer, as is common for pixel values, and each DCT coefficient is a fixed-point value of about 8 to 16 bits, as is common in typical hardware configurations.

In Expression 4, the calculation of Z1, an element of the output vector Z, is nothing other than the inner product of the partial row vector enclosed by the frame in the second row of the DCT coefficient matrix, (502, 426, 284, 100, −100, −284, −426, −502), and the input vector (X0, X1, X2, X3, X4, X5, X6, X7); it requires 8 multiplications and 7 additions or subtractions. The inner product thus involves many multiplications and additions and a large amount of computation, and many hardware designs have been devised to process it at high speed.
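As a rough illustration of the operation count above, the following sketch rebuilds the quoted row and evaluates the inner product directly. It assumes the orthonormal DCT-II normalization and a scale factor of 2^10, which reproduce the quoted values; the function names and the sample pixel values are hypothetical.

```python
import math

SCALE_BITS = 10  # assumption: coefficients are scaled by 2**10 and rounded


def dct_row_fixed(k, n=8, scale_bits=SCALE_BITS):
    """Row k of an n-point orthonormal DCT-II matrix as rounded integers."""
    norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    scale = 1 << scale_bits
    return [
        round(norm * math.cos((2 * i + 1) * k * math.pi / (2 * n)) * scale)
        for i in range(n)
    ]


def inner_product(coeffs, xs):
    """Direct inner product: len(xs) multiplications, len(xs) - 1 additions."""
    acc = coeffs[0] * xs[0]
    for c, x in zip(coeffs[1:], xs[1:]):
        acc += c * x
    return acc


row1 = dct_row_fixed(1)                     # second row of the coefficient matrix
pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-bit samples
z1 = inner_product(row1, pixels)
```

The result `z1` carries the 2^10 scale factor; a real pipeline would remove it in a later normalization step.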

FIG. 15 shows an example of a conventional inner product calculation device. The inner product calculation device 100 shown in FIG. 15 comprises adder-subtracters 101, 102, 103, and 104; registers 105, 106, 107, and 108; multipliers 109, 110, 111, and 112; and a parallel adder 113.

The inner product calculation device 100 of FIG. 15 multiplies each element of the input vector X by a DCT coefficient, in parallel, in the multipliers 109, 110, 111, and 112, and sums the products in the parallel adder 113 to obtain the inner product. The input vector elements X0 to X7 are input to the multipliers 109, 110, 111, and 112 all at once in parallel; the DCT coefficients, being constants, are held in registers and input to the multipliers 109, 110, 111, and 112. The DCT coefficients may instead be stored in a ROM.

Let the partial row vector of the DCT coefficient matrix used in the inner product be (C0, C1, C2, C3, C4, C5, C6, C7). By the nature of the DCT coefficients, the absolute values of C0 and C7, C1 and C6, C2 and C5, and C3 and C4 are pairwise equal. The input elements corresponding to each symmetric pair, namely X0 and X7, X1 and X6, X2 and X5, and X3 and X4, are therefore first added or subtracted according to the signs of the coefficients (a butterfly operation), giving a hardware configuration in which the number of multiplications is reduced to four.
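The butterfly step can be sketched as follows. This is a minimal software model, not the circuit of FIG. 15; the function name and the `antisymmetric` flag are hypothetical, and the flag simply states whether the mirrored coefficients have opposite sign (as in the row quoted earlier) or the same sign.

```python
def inner_product_butterfly(half_coeffs, xs, antisymmetric):
    """Inner product using only the first half of a (anti)symmetric row.

    half_coeffs   -- (C0, ..., C3) for an 8-element row
    antisymmetric -- True if C7 = -C0, C6 = -C1, ...; False if C7 = C0, ...
    """
    n = len(xs)
    acc = 0
    for i, c in enumerate(half_coeffs):
        mirror = xs[n - 1 - i]
        # Pre-add or pre-subtract the paired inputs, then multiply once.
        pair = xs[i] - mirror if antisymmetric else xs[i] + mirror
        acc += c * pair
    return acc


xs = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-bit samples
z1 = inner_product_butterfly([502, 426, 284, 100], xs, antisymmetric=True)
```

Each call performs four multiplications instead of eight, at the cost of four pre-additions or pre-subtractions.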

However, the above configuration inherently requires as many multipliers as there are multiplications. Multipliers generally have a large hardware scale, consuming many elements, much circuit area, and much power, and as the number of operand digits grows, propagation delay within the circuit degrades the cycle time. To speed up processing, one can also build a parallel configuration in which several inner product units compute several inner products simultaneously; but silicon area and printed circuit board area are finite resources under pressure to shrink, so if each unit is large, large-scale parallelization becomes difficult. Furthermore, when circuit-size constraints limit the number of digits of the multipliers and adder-subtracters, the inputs or the products must be rounded, so the partial products are not accumulated from the true low-order digits and calculation accuracy falls.

Inner product calculation devices that do not use multipliers have also been proposed, for example in Patent Documents 1 and 2. FIG. 16 shows an example of such a device. The inner product calculation device 200 shown in FIG. 16 comprises a ROM 201, an adder-subtracter 202, an accumulator 203, and a shifter 204.

In the inner product calculation device 200 shown in FIG. 16, the input vector elements A to D are 4 bits wide and the constant vector elements C0 to C3 are 16 bits wide. Each element of the constant vector is a fixed value. Since each input element is 4 bits wide, write its bits as b3, b2, b1, b0. The least significant bit slice of the input vector is then A(b0), B(b0), C(b0), D(b0), and the corresponding partial inner product is C0 × A(b0) + C1 × B(b0) + C2 × C(b0) + C3 × D(b0).

Because A(b0), B(b0), C(b0), and D(b0) each take only the value 1 or 0, the partial inner product for a bit slice reduces to simple additions and subtractions of C0 to C3. Since C0 to C3 form a fixed constant vector, the partial inner products can be precomputed for every possible pattern of the bit slice A(b0), B(b0), C(b0), D(b0) and stored in the ROM 201; during calculation, the partial inner product is obtained by reading the ROM 201 at the address given by the bit pattern of the input bit slice. This arrangement is called a ROM accumulator, abbreviated RAC. The inner product is obtained by accumulating the partial inner products starting from the least significant bit slice and proceeding upward. The accumulation is performed by the adder-subtracter 202 and the accumulator 203; the accumulator 203 is shifted right by the shifter 204 before accumulation of the next higher digit begins.
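The ROM accumulator idea can be sketched in software as follows. This is a simplified model under stated assumptions: inputs are treated as unsigned 4-bit values, the function names and sample constants are hypothetical, and each bit slice is weighted by a left shift, which is arithmetically equivalent to the right-shifting accumulator of FIG. 16 but is not how the hardware aligns the digits.

```python
def build_rom(consts):
    """Precompute the partial inner product for every bit-slice pattern."""
    rom = []
    for pattern in range(1 << len(consts)):
        # Bit i of the pattern says whether constant i contributes.
        rom.append(sum(c for i, c in enumerate(consts) if (pattern >> i) & 1))
    return rom


def rac_inner_product(rom, xs, input_bits=4):
    """Accumulate ROM lookups, least significant bit slice first."""
    acc = 0
    for b in range(input_bits):
        pattern = 0
        for i, x in enumerate(xs):
            pattern |= ((x >> b) & 1) << i   # gather bit b of every input
        acc += rom[pattern] << b             # weight the slice by 2**b
    return acc


consts = [502, 426, -284, -100]   # hypothetical constants C0..C3
inputs = [9, 3, 15, 6]            # hypothetical 4-bit inputs A..D
rom = build_rom(consts)
result = rac_inner_product(rom, inputs)
```

The table has 2^4 = 16 entries for four inputs; it doubles with every additional vector element, which is the growth in ROM capacity that the description later identifies as a drawback.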

The inner product calculation device 200 shown in FIG. 16 has the following advantages: the inner product is realized with shifts and additions/subtractions only, so the amount of hardware is small; the partial inner products are held in ROM as a RAC, so operation is fast; and because accumulation proceeds from the partial inner product of the lowest digit upward, the word length of the arithmetic unit can be chosen according to the required result precision, and even when it is shorter than the full product word length, accuracy is preserved because the accumulation of the partial inner products from the low-order digits is carried through.

As described above, a conventional product-sum device having a dedicated multiplication circuit, an adder-subtracter, and an accumulator suffers from a large multiplication circuit and increased cost. In particular, when the multiplication circuit is an array multiplier, its circuitry is complex, regularity is reduced, and each unit is large, which makes it difficult to adopt a SIMD (Single Instruction-stream Multiple Data-stream) configuration in which many arithmetic units are arranged in parallel on a large scale. In addition, because the multiplication circuit is complex and large, signal propagation time increases and the cycle time of the whole device degrades. Furthermore, when an inner product device is built from small arithmetic blocks in a large-scale, highly parallel arrangement, or when the inner product is programmed on an existing arithmetic device such as a microprocessor, the number of arithmetic digits is limited. The input digits must then be reduced, or the products rounded, to limit the intermediate digits, and the accuracy of the inner product falls.

An inner product calculation device that uses a ROM accumulator and computes the inner product by cumulative addition/subtraction of partial inner products and shift operations accumulates the partial inner products in order from the lowest digit, so the precision of every product term is preserved and the accuracy of the inner product does not fall. However, the ROM capacity of the ROM accumulator grows as the input word length and the number of vector elements, that is, the number of product terms, increase. As a result, building a large-scale parallel arithmetic unit requires many copies of the same ROM, which uses hardware area inefficiently and limits the scale of parallelization.

An object of the present invention is to solve these problems.

That is, an object of the present invention is to provide an inner product calculation device and an inner product calculation method that use a multiplier-free arithmetic unit configuration with a small amount of hardware, are suited to highly parallel operation, achieve a short cycle time, and perform the inner product calculation without loss of accuracy even without a ROM.

The invention according to claim 1, made to solve the above problems, is an inner product calculation device for obtaining the inner product of an input vector composed of a plurality of input vector elements having a predetermined bit word length and a constant vector composed of a plurality of constant vector elements, the device comprising: storage means for storing the plurality of input vector elements; shift means for selecting an input vector element from the storage means and shifting the selected input vector element left to obtain the partial product of a power-of-two term of a constant vector element and the input vector element; addition/subtraction means that accumulates the partial products obtained by the shift means and has fewer bit digits than the multiplication precision required when the input vector element and the constant vector element are multiplied; an accumulator for storing the accumulation result of the addition/subtraction means; rounding means for rounding the calculation result by truncating, through a bit shift of a predetermined number of digits, the intermediate result stored in the accumulator during accumulation by the addition/subtraction means; and calculation control means that causes the addition/subtraction means to accumulate the partial products of all input vector elements for the least significant power-of-two term of the constant vector elements and store the result in the accumulator, then repeats the accumulation of partial products for successively higher power-of-two terms up to the most significant power-of-two term, and, before the addition/subtraction means overflows, causes the rounding means to truncate the low-order digits of the intermediate result stored in the accumulator so that the truncated result serves as the initial value for subsequent accumulation.
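The accumulation order of claim 1 can be sketched in software as follows. This is a minimal model under stated assumptions, not the patent's implementation: constants are treated as a sign plus a magnitude, the `truncate_after` schedule is a hypothetical parameter standing in for the predetermined rounding shifts, and the word-length limit of the adder-subtracter is not modeled.

```python
def shift_add_inner_product(consts, xs, coeff_bits=10, truncate_after=None):
    """Multiplier-free inner product, lowest power-of-two term first.

    truncate_after maps a coefficient bit index to the number of low-order
    accumulator bits to drop once that bit has been processed.
    Returns (accumulator, total number of dropped bits).
    """
    truncate_after = truncate_after or {}
    acc = 0
    dropped = 0
    for bit in range(coeff_bits):
        for c, x in zip(consts, xs):
            if (abs(c) >> bit) & 1:
                partial = x << (bit - dropped)                     # barrel shifter
                acc = acc + partial if c >= 0 else acc - partial   # adder-subtracter
        t = truncate_after.get(bit, 0)
        if t:
            if dropped + t > bit + 1:
                raise ValueError("cannot drop digits that later terms still need")
            acc >>= t        # rounding shifter: discard low-order digits
            dropped += t
    return acc, dropped


row = [502, 426, 284, 100, -100, -284, -426, -502]
pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-bit samples
exact, _ = shift_add_inner_product(row, pixels)
rounded, shift = shift_add_inner_product(row, pixels, truncate_after={3: 4, 7: 3})
```

Because every lower-order term has already been accumulated when digits are dropped, `rounded` equals the exact inner product shifted right by `shift` bits, which is the sense in which the intermediate truncation costs no accuracy beyond the final rounding.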

The invention according to claim 2 is the invention according to claim 1, wherein the calculation control means, based on a predetermined table, causes the shift means to obtain the partial product for every two bits of the input vector element and causes the addition/subtraction means to accumulate the partial products.

The invention according to claim 3 is the invention according to claim 1 or 2, wherein the number of bit digits by which the rounding means truncates the intermediate result stored in the accumulator during accumulation by the addition/subtraction means is predetermined to be no greater than the bit word length of the addition/subtraction means minus the bit word length of the input vector elements and minus the base-2 logarithm of the number of input vector elements.
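Written as a formula, the bound of claim 3 reads as below. The symbol names are introduced here for illustration only: t is the number of truncated digits, W_add the bit word length of the addition/subtraction means, W_x the bit word length of an input vector element, and N the number of input vector elements.

```latex
t \le W_{\mathrm{add}} - W_x - \log_2 N
```

For example, with a hypothetical 16-bit adder-subtracter, 8-bit inputs, and N = 8 elements, the bound gives t ≤ 16 − 8 − 3 = 5.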

The invention according to claim 4 is an inner product calculation device for obtaining the inner product of an input vector composed of a plurality of input vector elements having a predetermined bit word length and a constant vector composed of a plurality of constant vector elements, the device comprising: storage means for storing the plurality of input vector elements; addition/subtraction means that selects an input vector element from the storage means, obtains and accumulates the partial product of a power-of-two term of a constant vector element and the selected input vector element, and has fewer bit digits than the multiplication precision required when the input vector element and the constant vector element are multiplied; an accumulator that stores the accumulation result of the addition/subtraction means on its high-order side; first shift means that shifts the contents of the accumulator right, toward the low-order digits, to form the value for subsequent accumulation, and has fewer bit digits than the multiplication precision required when the input vector element and the constant vector element are multiplied; and calculation control means that causes the addition/subtraction means to accumulate the partial products of all input vector elements for the power-of-two term of the same digit of the constant vector elements and store the result in the accumulator, causes the first shift means to shift the accumulation result stored in the accumulator right, and repeats this operation up to the most significant digit of the power-of-two terms of the constant vector elements.
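The right-shifting variant of claim 4 can be sketched as follows. This is a simplified model with assumptions: constants are treated as a sign plus a magnitude, the accumulator is shifted by one bit before each new digit rather than after the last one (a modeling choice), and the accumulator word length is not modeled.

```python
def right_shift_inner_product(consts, xs, coeff_bits=10):
    """Inner product without a barrel shifter.

    Inputs are added unshifted; the accumulator moves one digit toward the
    least significant side before each higher power-of-two term.
    """
    acc = 0
    for bit in range(coeff_bits):
        if bit:
            acc >>= 1   # first shift means: right-shift the running sum
        for c, x in zip(consts, xs):
            if (abs(c) >> bit) & 1:
                acc = acc + x if c >= 0 else acc - x
    return acc


row = [502, 426, 284, 100, -100, -284, -426, -502]
pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-bit samples
result = right_shift_inner_product(row, pixels)
```

In this model the result equals the exact inner product shifted right by `coeff_bits - 1` bits, since each shift discards only digits to which no later term contributes.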

The invention according to claim 5 is the invention according to claim 4, further comprising second shift means for doubling the input vector element stored in the storage means, wherein the calculation control means, based on a predetermined table, selects either the input vector element stored in the storage means or the input vector element doubled by the second shift means, causes the addition/subtraction means to accumulate it, and causes the first shift means to shift by 2 bits.
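One way to picture the selection between an input element and its double is radix-4 Booth recoding of the constants, sketched below. This is an illustrative model, not the patent's table: the recoding is computed from the standard radix-4 formula on a two's-complement constant, six digits are assumed (enough for a sign plus 10 bits), and all function names are hypothetical.

```python
def booth_radix4_digits(c, num_digits=6):
    """Radix-4 Booth digits of c, least significant first, each in -2..2."""
    digits = []
    prev = 0                      # implicit bit below the least significant bit
    for j in range(num_digits):
        b0 = (c >> (2 * j)) & 1
        b1 = (c >> (2 * j + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_inner_product(consts, xs, num_digits=6):
    """Accumulate x or 2x per Booth digit, shifting right 2 bits per step."""
    recoded = [booth_radix4_digits(c, num_digits) for c in consts]
    acc = 0
    for j in range(num_digits):
        if j:
            acc >>= 2                                   # first shift means, 2 bits
        for digits, x in zip(recoded, xs):
            d = digits[j]
            if d:
                operand = x << 1 if abs(d) == 2 else x  # second shift means doubles x
                acc = acc + operand if d > 0 else acc - operand
    return acc


row = [502, 426, 284, 100, -100, -284, -426, -502]
pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-bit samples
result = booth_inner_product(row, pixels)
```

Each constant is covered in six steps instead of ten, and in this model the result equals the exact inner product shifted right by `2 * (num_digits - 1)` bits.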

The invention according to claim 6 is an inner product calculation method for obtaining the inner product of an input vector and a constant vector, using an instruction that performs addition/subtraction and a shift of an operand together, on a microprocessor comprising: storage means for storing the input vector, which is composed of a plurality of input vector elements having a predetermined bit word length; shift means for shifting an input vector element left to obtain the partial product of a power-of-two term of the constant vector, which is composed of a plurality of constant vector elements, and the input vector element; addition/subtraction means that accumulates the partial products obtained by the shift means and has fewer bit digits than the multiplication precision required when the input vector element and the constant vector element are multiplied; and an accumulator for storing the accumulation result of the addition/subtraction means. The method comprises: a first step in which the shift means selects an input vector element from the storage means and obtains the partial product; a second step in which the addition/subtraction means accumulates the partial products of all input vector elements for the least significant power-of-two term of the constant vector elements and the result is stored in the accumulator; and a third step in which, before the addition/subtraction means overflows, the low-order digits of the intermediate result stored in the accumulator are truncated and the truncated result is used as the initial value for subsequent accumulation. The first and second steps are repeated for successively higher power-of-two terms up to the most significant power-of-two term, and the third step is performed at least once during the repetition of the first and second steps.

The invention according to claim 7 is the invention according to claim 6, wherein the first step obtains the partial product by selecting either the input vector element itself or twice the input vector element, and the second step accumulates the partial products two bits at a time.

The invention according to claim 8 is the invention according to claim 6 or 7, wherein the number of bit digits truncated in the third step from the intermediate result stored in the accumulator during accumulation by the addition/subtraction means is predetermined to be no greater than the bit word length of the addition/subtraction means minus the bit word length of the input vector elements and minus the base-2 logarithm of the number of input vector elements.

According to the invention of claim 1, the device is built as a cumulative addition/subtraction structure comprising addition/subtraction means, an accumulator, and shift means, without a multiplication circuit, so an arithmetic device with high regularity and a small amount of hardware, suited to highly parallel operation, can be constructed. In addition, the partial products are accumulated in order from the low-order digits, so the partial product sums of the lower digits are always reflected in the higher digits, and the inner product can be calculated without loss of accuracy. Furthermore, the contents of the accumulator can be shifted right and rounded by the rounding means during accumulation, so even an adder-subtracter and accumulator with fewer digits than the product word length can compute the inner product without overflow and without loss of accuracy.

According to the invention of claim 2, the calculation control means can apply Booth's algorithm to accumulate the partial products for every two bits of the constant vector, which reduces the number of control steps and allows the calculation to complete in fewer cycles.

According to the invention of claim 3, the inner product can be calculated reliably without overflow even when the adder-subtracter and accumulator have fewer digits than the product word length.

According to the invention of claim 4, the device is built as a cumulative addition/subtraction structure comprising addition/subtraction means, an accumulator, and shift means, without a multiplication circuit, so an arithmetic device with high regularity and a small amount of hardware, suited to highly parallel operation, can be constructed. Because the device accumulates the partial products in order from the low-order digits, the partial product sums of the lower digits are always completed and reflected in the higher digits, and the inner product can be calculated without loss of accuracy. Furthermore, the contents of the accumulator can be shifted right and rounded during accumulation, so even an arithmetic device whose adder-subtracter and accumulator have fewer digits than the product word length can compute the inner product without overflow and without loss of accuracy. Moreover, since no barrel shifter is needed to shift the input left by an arbitrary number of digits, the circuit scale can be reduced further.

According to the invention of claim 5, the calculation control means can apply Booth's algorithm to accumulate the partial products for every two bits of the constant vector, which reduces the number of control steps and allows the calculation to complete in fewer cycles.

According to the invention of claim 6, on a microprocessor having an addition/subtraction instruction combined with an operand shift, the partial products are accumulated in order from the low-order digits, so the partial product sums of the lower digits are always completed and reflected in the higher digits, and the inner product can be calculated without loss of accuracy. In addition, the method includes the third step of shifting the contents of the accumulator right and rounding them during accumulation, so it provides a way to compute the inner product without overflow and without loss of accuracy even on an arithmetic device whose adder-subtracter and accumulator have fewer digits than the product word length.

According to the invention of claim 7, Booth's algorithm is applied to accumulate the partial products for every two bits of the constant vector, so the number of partial products to be accumulated is reduced and the number of executed instructions can be reduced further.

According to the invention of claim 8, the inner product can be calculated reliably without overflow even on an arithmetic device whose adder/subtracter and accumulator have fewer digits than the multiplication word length of the product terms.

Configuration diagram of the inner product calculation device according to the first embodiment of the present invention.
Program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 1.
Booth's algorithm table applied to the inner product calculation device according to the second embodiment of the present invention.
Program list showing the inner product calculation operation of the inner product calculation device according to the second embodiment of the present invention.
Configuration diagram of the inner product calculation device according to the third embodiment of the present invention.
Program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 5.
Configuration diagram of the inner product calculation device according to the fourth embodiment of the present invention.
Program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 7.
Configuration diagram of the arithmetic unit portion of a microprocessor that executes the inner product calculation method according to the fifth embodiment of the present invention.
Machine language instruction code format of the microprocessor shown in FIG. 9.
One part of the program list showing the inner product calculation operation of the microprocessor shown in FIG. 9.
Another part of the program list showing the inner product calculation operation of the microprocessor shown in FIG. 9.
Flowchart of the inner product calculation method that runs on the microprocessor shown in FIG. 9.
Program list showing the inner product calculation operation of the microprocessor according to the sixth embodiment of the present invention.
Configuration diagram showing a conventional inner product calculation device.
Configuration diagram showing a conventional inner product calculation device.

(First embodiment)
A first embodiment of the present invention is described below with reference to FIGS. 1 and 2. FIG. 1 is a configuration diagram of the inner product calculation device according to the first embodiment of the present invention. FIG. 2 is a program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 1.

FIG. 1 shows an inner product calculation device 1 according to the first embodiment of the present invention. The inner product calculation device 1 shown in FIG. 1 includes an input element register 2, a barrel shifter 3, an adder/subtracter 4, an accumulator 5, a shifter 6, a selector 7, and a control unit 8.

The input element register 2, serving as storage means, consists of eight registers R0 to R7, each with an 8-bit word length. Each register stores an input vector element (for example, X0 to X7 of Equation 4); a control signal from the control unit 8 selects one register, whose contents are output to the barrel shifter 3.

The barrel shifter 3, serving as shift means, left-shifts the supplied input vector element by an arbitrary number of digits, and its output is connected to one input of the adder/subtracter 4. In this example it takes an 8-bit input, produces a 16-bit output, and provides a sign-extending left shift of 0 to 9 bits. The shift amount of the barrel shifter 3 is selected every cycle by a control signal from the control unit 8.

The adder/subtracter 4, serving as addition/subtraction means, is an arithmetic unit with a 16-bit word length on both input and output whose operation, addition or subtraction, is selected every cycle by a control signal from the control unit 8. That is, it has fewer bit digits than the multiplication precision required when an input vector element is multiplied by a constant vector element. The output of the adder/subtracter 4 is connected to the accumulator 5.

The accumulator 5 is a 16-bit register that stores the intermediate and final results of the adder/subtracter 4. The output of the accumulator 5 is connected to the shifter 6.

The shifter 6, serving as rounding means, has a 16-bit word length. Under a control signal from the control unit 8, it right-shifts the result stored in the accumulator 5 by a fixed 5 bits, so that an intermediate result can be rounded by truncation.

The selector 7 selects either the output of the shifter 6 or the output of the accumulator 5 according to a control signal from the control unit 8. Its output is connected to the other input of the adder/subtracter 4 and is also output externally as the calculation result.

The control unit 8, serving as calculation control means, controls the calculation operation of the inner product calculation device 1 and outputs control signals to the input element register 2, the barrel shifter 3, the adder/subtracter 4, the accumulator 5, the shifter 6, and the selector 7 according to the current operation step.

To perform vector inner product calculation, the inner product calculation device 1 configured as above includes the input element register 2, which has registers for storing a plurality of input vector elements and means for selecting among them, and the adder/subtracter 4, which has fewer digits than the multiplication word length. It further includes the shifter 6 for truncating lower digits that are no longer needed once their accumulation is complete, so that the addition/subtraction does not overflow partway through the accumulation of partial products.

Next, the product-sum (inner product) processing of the inner product calculation device 1 is described.

As an example, the following explains how the inner product calculation device 1 calculates the inner product of the input vector elements X0, X1, X2, ... X7, each represented as a signed 8-bit value, and the constant vector elements C0, C1, C2, ... C7. The constant vector corresponds to the cosine coefficients of DCT processing. The description below uses the inner product for output element Z1 of the integer matrix expression of Equation 4, which represents the DCT processing, as the example.

The following values are taken from the second row vector of the DCT coefficient matrix in 10-bit fixed-point representation (the second row of Equation 4). Each entry gives the integer value, its sign, and the absolute value in binary.
C0 = 502 (+) 1_1111_0110
C1 = 426 (+) 1_1010_1010
C2 = 284 (+) 1_0001_1100
C3 = 100 (+) 0_0110_0100
C4 = −100 (−) 0_0110_0100
C5 = −284 (−) 1_0001_1100
C6 = −426 (−) 1_1010_1010
C7 = −502 (−) 1_1111_0110
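The listed binary patterns can be checked mechanically. The following is a minimal sketch, not part of the patent; the names `LISTED`, `pattern_value`, and `set_bit_positions` are hypothetical and chosen only for illustration. It converts each underscore-grouped pattern back to an integer and lists which powers of two each constant contains.

```python
# Constant magnitudes and their binary patterns as given in the list above.
LISTED = {
    502: "1_1111_0110",
    426: "1_1010_1010",
    284: "1_0001_1100",
    100: "0_0110_0100",
}


def pattern_value(pattern):
    """Convert an underscore-grouped binary pattern to an integer."""
    return int(pattern.replace("_", ""), 2)


def set_bit_positions(value):
    """Return the exponents k for which 2^k appears in value, lowest first."""
    return [k for k in range(value.bit_length()) if (value >> k) & 1]


for value, pattern in LISTED.items():
    # Each pattern must decode to the decimal value printed next to it.
    assert pattern_value(pattern) == value
    print(value, pattern, set_bit_positions(value))
```

None of the four magnitudes has bit 0 set, which is why the accumulation described later starts from the 2^1 term.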

Expressed as polynomials in powers of two, these absolute values are as follows (in the expressions below, * denotes multiplication and, for example, 2^1 denotes 2 to the power of 1).

502 = 2^8 + 2^7 + 2^6 + 2^5 + 2^4 + 2^2 + 2^1
426 = 2^8 + 2^7 + 2^5 + 2^3 + 2^1
284 = 2^8 + 2^4 + 2^3 + 2^2
100 = 2^6 + 2^5 + 2^2

The vector inner product Z is generally expressed as

Z = C0*X0 + C1*X1 + C2*X2 + C3*X3 + C4*X4 + C5*X5 + C6*X6 + C7*X7

and Z1, the second element of the DCT output vector, is obtained as

Z1 = 502*X0 + 426*X1 + 284*X2 + 100*X3 − 100*X4 − 284*X5 − 426*X6 − 502*X7

In this expression, the multiplication 502*X0 can be written as

502*X0 = (2^8 + 2^7 + 2^6 + 2^5 + 2^4 + 2^2 + 2^1)*X0

which expands to

502*X0 = 2^8*X0 + 2^7*X0 + 2^6*X0 + 2^5*X0 + 2^4*X0 + 2^2*X0 + 2^1*X0

Here 2^1*X0 is the partial product for the 2^1 term of the constant "502" and 2^4*X0 is the partial product for its 2^4 term. Adding the partial products up to the 2^8 term, the most significant digit of the constant "502", gives the result of the multiplication 502*X0.

In other words, each partial product is X0 multiplied by a power of two, so it can be obtained by simply left-shifting the data, and accumulating these partial products in the adder/subtracter 4 yields the multiplication result. This is the principle behind the shift-and-add/subtract product-sum operation described below. The terms other than 502*X0 are written out in the same way:

426*X1 = 2^8*X1 + 2^7*X1 + 2^5*X1 + 2^3*X1 + 2^1*X1
284*X2 = 2^8*X2 + 2^4*X2 + 2^3*X2 + 2^2*X2
100*X3 = 2^6*X3 + 2^5*X3 + 2^2*X3
−100*X4 = −(2^6*X4 + 2^5*X4 + 2^2*X4)
−284*X5 = −(2^8*X5 + 2^4*X5 + 2^3*X5 + 2^2*X5)
−426*X6 = −(2^8*X6 + 2^7*X6 + 2^5*X6 + 2^3*X6 + 2^1*X6)
−502*X7 = −(2^8*X7 + 2^7*X7 + 2^6*X7 + 2^5*X7 + 2^4*X7 + 2^2*X7 + 2^1*X7)
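The shift-and-add principle described above can be sketched in a few lines of Python. This is an illustration only, not the patent's hardware: the function name `multiply_by_constant` is hypothetical, and Python's unbounded integers stand in for the barrel shifter and adder, so word-length limits are ignored here.

```python
def multiply_by_constant(x, c):
    """Multiply x by the non-negative constant c using only shifts and adds.

    Each set bit k of c contributes the partial product 2^k * x,
    which is obtained as x << k.
    """
    acc = 0
    for k in range(c.bit_length()):
        if (c >> k) & 1:
            acc += x << k  # accumulate the partial product for the 2^k term
    return acc


# 502 has seven set bits, so 502 * x takes seven shifted additions.
print(multiply_by_constant(37, 502))
```

A negative constant is handled by subtracting the same partial products instead of adding them, which is the role of the add/subtract selection in the device.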

FIG. 2 shows a program list of the operations that obtain Z1 according to this principle. The program runs on the control unit 8. The leftmost column of FIG. 2 is the step number (line number), and the part to its right gives the actual control content.

In the program list of FIG. 2, step 001 resets the accumulator 5. Steps 002 to 005 then handle bit 1, the 2^1 term, which is the lowest power-of-two term of the constant vector elements: the partial product for that term is obtained by selecting the corresponding input vector element and left-shifting it, and the partial products are cumulatively added or subtracted one per cycle. Because the constant vector elements C0, C1, C2, and C3 are positive and C4, C5, C6, and C7 are negative, each partial product is added or subtracted according to that sign. The third control field selects the input vector element, and the fourth control field selects the left-shift amount of the barrel shifter 3, that is, the power of two by which the input is multiplied. The second control field selects whether the adder/subtracter adds or subtracts, switching between the two according to the sign of the constant vector element.

That is, in each such step the barrel shifter 3 takes the input vector element selected from the input element register 2 and left-shifts it to obtain the partial product of a power-of-two term of the constant vector element and the input vector element, the adder/subtracter 4 accumulates that partial product, and the result is stored in the accumulator 5.

Steps 002 to 005 therefore calculate 2^1*X0 + 2^1*X1 − 2^1*X6 − 2^1*X7, that is, the cumulative addition/subtraction of the rightmost terms of the expressions in Equation 10 (the lowest power-of-two term of each constant vector element). In the constant vector of this embodiment every 2^0 term is "0", so the 2^1 term is treated as the lowest term. Depending on the constant vector, the 2^0 term may be the lowest term, in which case cumulative addition/subtraction starts from the 2^0 term.

Next, steps 006 to 011 generate the partial products for the 2^2 terms of the constant vector elements and cumulatively add or subtract them. In the same way, steps 012 to 015 handle the partial products for the 2^3 terms and steps 016 to 019 those for the 2^4 terms.

At this point the accumulated result in the accumulator 5 can reach 16 bits at most. To avoid overflow in the subsequent operations, step 020, which begins the cumulative addition/subtraction of the partial products for the 2^5 terms, uses the fifth control field to have the shifter 6 right-shift the contents of the accumulator 5 by 5 bits before they enter the adder/subtracter 4, truncating the lower 5 bits of the intermediate result. That is, the result held in the accumulator 5 partway through accumulation is rounded by truncation using a bit shift of a predetermined number of digits. To keep the partial products aligned with the shifted accumulator, the shift amount of the barrel shifter 3 restarts from zero from step 020 onward, that is, for the partial products of the 2^5 and higher terms of the constant vector.

The features of the present invention are that cumulative addition/subtraction proceeds in order from the partial products of the lower digits, and that the lower digits of an intermediate result whose accumulation is already complete are right-shifted and truncated so that the running value does not overflow during the calculation. When processing such as the DCT used in image compression is carried out by a digital arithmetic unit, the word length of the unit is limited; truncating the intermediate result of the cumulative addition/subtraction allows the inner product to be calculated without overflow and without loss of accuracy.

Next, steps 026 to 029 cumulatively add or subtract the partial products for the 2^6 terms, steps 030 to 033 those for the 2^7 terms, and steps 034 to 039 those for the 2^8 terms.

In short, executing the program list of FIG. 2 causes the adder/subtracter 4 to accumulate, for the lowest power-of-two term of the constant vector elements, the partial products of all input vector elements to which that term applies, and to store the result in the accumulator 5. The accumulation is then repeated for successively higher power-of-two terms up to the highest one. Before the adder/subtracter 4 can overflow, the shifter 6 truncates the lower digits of the result held in the accumulator 5 partway through accumulation, and the truncated value becomes the initial value for the subsequent accumulation.
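The control sequence summarized above can be modeled in software. The sketch below is an assumption-laden illustration, not the program list of FIG. 2: the names are hypothetical, the order of elements within one bit position is simply X0 to X7, and the right shift is assumed to be an arithmetic shift (Python's `>>` on integers, which rounds toward negative infinity). It accumulates partial products from the lowest constant bit upward, truncates 5 bits before the 2^5 terms, and checks that the accumulator never leaves the 16-bit signed range.

```python
C_MAG = [502, 426, 284, 100, 100, 284, 426, 502]  # |C0| .. |C7|
C_SIGN = [+1, +1, +1, +1, -1, -1, -1, -1]         # signs of C0 .. C7
TRUNC_BITS = 5                                    # fixed right shift of shifter 6
ACC_MIN, ACC_MAX = -(1 << 15), (1 << 15) - 1      # 16-bit signed accumulator


def inner_product_lsb_first(x):
    """Accumulate partial products bit by bit, lowest constant bit first."""
    acc = 0
    for bit in range(10):                         # constants are 10 bits wide
        if bit == TRUNC_BITS:
            acc >>= TRUNC_BITS                    # drop the finished low digits
        # After truncation the barrel shift amount restarts from zero.
        shift = bit - TRUNC_BITS if bit >= TRUNC_BITS else bit
        for xi, mag, sign in zip(x, C_MAG, C_SIGN):
            if (mag >> bit) & 1:
                acc += sign * (xi << shift)       # add or subtract 2^shift * Xi
                if not ACC_MIN <= acc <= ACC_MAX:
                    raise OverflowError("accumulator exceeded 16 bits")
    return acc


def reference(x):
    """Full-precision inner product, truncated by 5 bits at the end."""
    exact = sum(s * m * xi for xi, m, s in zip(x, C_MAG, C_SIGN))
    return exact >> TRUNC_BITS


sample = [12, -7, 100, -128, 127, 0, -1, 55]
print(inner_product_lsb_first(sample), reference(sample))
```

Because every partial product added after the truncation is a multiple of 2^5 in the original scale, truncating midway gives the same value as truncating the full-precision result once at the end, which is what the comparison with `reference` demonstrates.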

In this embodiment, the inner product of a signed 8-bit input vector and a constant (coefficient) vector in unsigned fixed-point format with 10 fractional bits ultimately yields a result of at most 16 bits, consisting of a signed 11-bit integer part and a 5-bit fractional part. In a two-dimensional DCT, for example, this result is used as the input to the next DCT stage.

According to this embodiment, no dedicated multiplier is used; the device consists only of the input element register 2, the adder/subtracter 4, the accumulator 5, the barrel shifter 3, and the shifter 6, so a regular arithmetic device with a small amount of hardware can be configured. Because one of the two vectors in the inner product is assumed to be constant, the inner product can be realized with simple control of only the input vector element (register) selection, the shift amount, and addition or subtraction. A conventional multiplier-based configuration requires addition/subtraction precision equal to the maximum word length of the multiplication result. By accumulating partial products of the same digit in order from the lowest digit, an arithmetic unit with a word length shorter than that maximum can still calculate the inner product without losing carries from the lower-digit accumulation, that is, while preserving accuracy. In addition, because the shifter 6 can right-shift and round the contents of the accumulator 5 during accumulation, the inner product can be calculated without overflow and without loss of accuracy even though the adder/subtracter 4 and the accumulator 5 have fewer digits than the multiplication word length of the product terms.

In this embodiment, intermediate results are rounded by the shifter 6, which performs a fixed 5-bit right shift. The number of digits to truncate is determined as follows. For example, if the input vector elements have a word length of 8 bits, the number of vector elements is 8, and the adder/subtracter has a word length of 16 bits, then partial products can be accumulated without overflow up to those corresponding to (16 − (8 + log₂8)) + 1 = 6 bits of the constant vector elements (a number of bit digits no greater than the bit word length of the adder/subtracter 4 minus the bit word length of the input vector elements and the base-2 logarithm of the number of input vector elements). In other words, accumulation up to the partial products corresponding to at most 6 bits of the constant vector elements can be carried out without overflow, and at that point up to 6 lower digits whose accumulation is already complete can be truncated. In this embodiment, however, the result is output with the maximum word length of 16 bits (11-bit integer part, 5-bit fractional part) to keep the precision lost to truncation as small as possible, so a fixed 5-digit truncation, which is within the maximum of 6 digits, is performed once the partial products corresponding to 5 bits of the constant vector elements have been accumulated. Fixing the number of rounding digits in advance in this way allows intermediate results to be truncated using only the fixed-digit shifter 6, without a hardware-intensive arbitrary-digit barrel shifter.
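The digit count quoted above follows from simple arithmetic. The sketch below only evaluates the expression as the text states it and does not independently prove the overflow bound; the function name and parameter names are hypothetical, and rounding the logarithm up for element counts that are not powers of two is an added assumption.

```python
import math


def max_accumulable_constant_bits(adder_bits, input_bits, n_elements):
    """Evaluate (adder_bits - (input_bits + log2(n_elements))) + 1.

    The logarithm is rounded up so that element counts that are not
    powers of two are treated conservatively (an assumption of this sketch).
    """
    growth = math.ceil(math.log2(n_elements))
    return adder_bits - (input_bits + growth) + 1


# 16-bit adder/subtracter, 8-bit inputs, 8 vector elements.
print(max_accumulable_constant_bits(16, 8, 8))
```

With the values of this embodiment the expression gives 6, and the fixed shift of 5 bits used by the shifter 6 stays within that limit.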

(Second embodiment)
Next, a second embodiment of the present invention is described with reference to FIGS. 3 and 4. Parts identical to those of the first embodiment are given the same reference numerals and their description is omitted. FIG. 3 is the Booth's algorithm table applied to the inner product calculation device according to the second embodiment of the present invention. FIG. 4 is a program list showing the inner product calculation operation of the inner product calculation device according to the second embodiment of the present invention.

This embodiment has the same configuration as the first embodiment but changes the control of the inner product calculation so that it takes fewer cycles. Second-order (radix-4) Booth's algorithm is applied to generate a partial product for every two bits of the constant vector elements, which reduces the number of partial products to accumulate and makes the calculation faster.

In second-order Booth's algorithm, a constant vector element is taken as the multiplier and written as the 10-bit binary value b9, b8, b7, b6, b5, b4, b3, b2, b1, b0. Instead of obtaining a partial product with the multiplicand (an input vector element) for each bit of the multiplier, that is, for each power-of-two term, partial products are obtained for every two bits of the multiplier, in order from the lowest: (b1, b0), then (b3, b2), (b5, b4), (b7, b6), and (b9, b8). However, when for example (b1, b0) = (1, 1), the partial product is three times the multiplicand, which would require an extra adder. To avoid this, the partial product is replaced by one that is a power-of-two multiple of the multiplicand, according to the table in FIG. 3.

The three multiplier bits in the table of FIG. 3 are (b1, b0, "0") for the lowest position and (b3, b2, b1), (b5, b4, b3), (b7, b6, b5), and (b9, b8, b7) for the following positions. In the partial product to be added, "M" denotes the multiplicand itself and "2M" denotes twice the multiplicand, that is, the multiplicand shifted left by one bit.
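The table of FIG. 3 is not reproduced in the text, so the sketch below assumes it is the standard radix-4 Booth recoding table, which agrees with every entry the description quotes ("100" gives −2M, "011" gives +2M, "101" and "110" give −M, "010" gives +M, "000" gives nothing). The names `BOOTH_TABLE`, `booth_digits`, and `booth_value` are hypothetical. The code recodes a 10-bit constant into five digits and confirms that the digits reproduce the constant.

```python
# Standard radix-4 Booth table: (upper bit, lower bit, bit below) -> multiple of M.
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): +1, (0, 1, 0): +1, (0, 1, 1): +2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(c, n_bits=10):
    """Recode a non-negative constant into radix-4 Booth digits, lowest first.

    The constant must fit in n_bits with its top bit clear, so that the
    final digit does not need a sign correction.
    """
    digits = []
    below = 0                          # the implicit "0" under b0
    for k in range(0, n_bits, 2):
        lower = (c >> k) & 1
        upper = (c >> (k + 1)) & 1
        digits.append(BOOTH_TABLE[(upper, lower, below)])
        below = upper
    return digits


def booth_value(digits):
    """Rebuild the constant: digit j carries the weight 4^j."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


for c in (502, 426, 284, 100):
    print(c, booth_digits(c), booth_value(booth_digits(c)))
```

Each nonzero digit corresponds to one add or subtract step, and a digit of magnitude 2 only changes the shift amount, so no multiple of three ever has to be formed.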

Booth's algorithm only reduces the number of partial products to be added or subtracted, and since the multiplier is a constant, those partial products are known in advance. This embodiment therefore needs no hardware beyond the arithmetic unit configuration of the first embodiment. Only the partial products to be added or subtracted differ, so only the control steps need to change.

FIG. 4 shows the program list for the operation of this embodiment. Like FIG. 2, FIG. 4 is for obtaining Z1 of Equation 4. The program runs on the control unit 8.

In the program list of FIG. 4, step 001 resets the accumulator 5 to initialize the calculation. Steps 002 to 005 then calculate the sum of the partial products corresponding to the lowest two bits, b1 and b0, of the constant vector, that is, a partial inner product. The constant vector elements C0 to C7 are
C0 = 502 (+) 1_1111_0110
C1 = 426 (+) 1_1010_1010
C2 = 284 (+) 1_0001_1100
C3 = 100 (+) 0_0110_0100
C4 = −100 (−) 0_0110_0100
C5 = −284 (−) 1_0001_1100
C6 = −426 (−) 1_1010_1010
C7 = −502 (−) 1_1111_0110
so the partial inner product is calculated by looking up, in the Booth's algorithm table of FIG. 3, the bit pattern (b1, b0, "0") containing the lowest two bits of each constant vector element, together with the sign of that element.

Accordingly, in step 002 the constant vector element C0 is positive and (b1, b0, "0") is "100", which FIG. 3 maps to "−2M", so twice the multiplicand, input element X0, is subtracted. In step 003, C1 is positive and (b1, b0, "0") is "100", giving "−2M", so twice X1 is subtracted. In step 004, C6 is negative and (b1, b0, "0") is "100", giving "−2M", so twice X6 is added. In step 005, C7 is negative and (b1, b0, "0") is "100", giving "−2M", so twice X7 is added. For the constant vector elements C2 to C5, (b1, b0, "0") is "000", so there is no partial product to add or subtract. That is, in each step the barrel shifter 3 obtains a partial product for two bits of the constant vector element, and the adder/subtracter 4 accumulates it.

Next, steps 006 to 013 calculate the sum of the partial products corresponding to the next two bits, b3 and b2, of the constant vector, that is, a partial inner product. Here the digit position of the partial product is referenced to b2, two bits higher, so the partial product "M" is the multiplicand times four, that is, shifted left by 2 bits, and the partial product "2M" is the multiplicand times eight, that is, shifted left by 3 bits.

Accordingly, in step 006 the constant vector element C0 is positive and (b3, b2, b1) is “011”, so FIG. 3 gives “+2M” and eight times the multiplicand, input element X0, is added. In step 007, C1 is positive and (b3, b2, b1) is “101”, so FIG. 3 gives “−M” and four times input element X1 is subtracted. In step 008, C2 is positive and (b3, b2, b1) is “110”, so FIG. 3 gives “−M” and four times input element X2 is subtracted. In step 009, C3 is positive and (b3, b2, b1) is “010”, so FIG. 3 gives “+M” and four times input element X3 is added. In step 010, C4 is negative and (b3, b2, b1) is “010”, so FIG. 3 gives “+M” and, because of the negative sign, four times input element X4 is subtracted. In step 011, C5 is negative and (b3, b2, b1) is “110”, so FIG. 3 gives “−M” and four times input element X5 is added. In step 012, C6 is negative and (b3, b2, b1) is “101”, so FIG. 3 gives “−M” and four times input element X6 is added. In step 013, C7 is negative and (b3, b2, b1) is “011”, so FIG. 3 gives “+2M” and eight times input element X7 is subtracted.
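The selection made in steps 006 to 013 can be sketched in Python as follows. This is a minimal sketch, assuming that the table of FIG. 3 is the standard second-order (radix-4) Booth table; only the four patterns quoted above (“011”, “101”, “110”, “010”) are confirmed by this text. The names BOOTH_TABLE, booth_digit, partial_inner_product, and booth_dot are hypothetical and do not appear in the original.

```python
# Second-order (radix-4) Booth table, assumed to match FIG. 3:
# 3-bit pattern (b[k+1], b[k], b[k-1]) -> multiple of the multiplicand M.
BOOTH_TABLE = {
    0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_digit(magnitude: int, k: int) -> int:
    """Booth digit for the bit pair (b[k+1], b[k]) of a non-negative constant."""
    lower = (magnitude >> (k - 1)) & 1 if k > 0 else 0  # b[-1] is taken as 0
    pair = (magnitude >> k) & 0b11
    return BOOTH_TABLE[(pair << 1) | lower]


def partial_inner_product(xs, consts, k):
    """Sum of the partial products for the bit pair that starts at bit k.

    The sign of each constant is handled separately from its Booth-recoded
    magnitude, as in the text: a negative constant swaps add and subtract.
    """
    total = 0
    for x, c in zip(xs, consts):
        sign = -1 if c < 0 else 1
        digit = booth_digit(abs(c), k)
        total += sign * digit * (x << k)  # "M" at bit k is x shifted left by k
    return total


def booth_dot(xs, consts, nbits):
    """Inner product as the sum of all 2-bit partial inner products.

    nbits must be even and every |c| must be below 2**(nbits - 1).
    """
    return sum(partial_inner_product(xs, consts, k) for k in range(0, nbits, 2))
```

For the bit pair (b3, b2), `k` is 2, so a digit of +2 contributes `2 * (x << 2)`, that is, eight times the input element, in agreement with step 006.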

Next, the sums of the partial products, that is, the partial inner products, are accumulated for each following pair of bits of the constant vector in turn: b5 and b4 in steps 014 to 021, b7 and b6 in steps 022 to 025, and b9 and b8 in steps 026 to 031. In step 018, as in the first embodiment, the lower 5 bits (a fixed amount) of the intermediate accumulation result are truncated to avoid overflow in the adder/subtractor 4.

According to this embodiment, the inner product is computed by accumulating a partial inner product for every two bits of the constant vector elements, so the number of partial products to accumulate is reduced, the control steps are reduced, and the number of operation cycles can be cut further. For example, comparing the first embodiment (FIG. 2) with this embodiment (FIG. 4), FIG. 2 requires 39 steps whereas FIG. 4 requires 31, which is clearly an improvement in operation speed of about 20%.

(Third embodiment)
Next, a third embodiment of the present invention will be described with reference to FIGS. 5 and 6. Parts identical to those of the first and second embodiments described above are given the same reference numerals, and their description is omitted. FIG. 5 is a block diagram of the inner product calculation device according to the third embodiment of the present invention. FIG. 6 is a program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 5.

In this embodiment, the barrel shifter 3 and the selector 7 are removed from the inner product calculation device 1 shown in FIG. 1. The adder/subtractor 4 is replaced by an adder/subtractor 9 with an 11-bit word length, and a shifter 10 serving as first shift means is provided downstream of the accumulator 5.

The adder/subtractor 9 is an arithmetic unit whose operation, addition or subtraction, is selected every cycle by a control signal from the control unit 8. One input is 8 bits wide, the other input is 11 bits wide, and the output is 11 bits wide; the output of the input element register 2 is connected to the 8-bit input. In other words, the unit is built with fewer bit digits than the multiplication precision required when an input vector element is multiplied by a constant vector element.

In the accumulator 5, the output of the adder/subtractor 9 is input to the upper 11 bits, and the lower 5 bits of the shifter 10, described later, are input to the lower 5 bits.

The shifter 10 has a 16-bit word length and, under a control signal from the control unit 8, shifts the operation result stored in the accumulator 5 right by a fixed 1 bit.

In the inner product calculation device 1 configured as above, the input element register 2 stores a plurality of input vector elements, one of which is selected by a control signal and connected to one input of the adder/subtractor 9. The control signal selects addition or subtraction in the adder/subtractor 9 every cycle. The output of the adder/subtractor 9 is connected to the accumulator 5, which stores the intermediate and final results. The output of the accumulator 5 is connected to the shifter 10, which can shift the intermediate result stored in the accumulator 5 right by 1 bit. The shifter 10 does not shift while partial products of the same digit are being accumulated; once accumulation for that digit is finished, it shifts and accumulation for the next higher digit begins. Whether the shift is performed is switched by a control signal.

FIG. 6 shows a program list of the operation of this embodiment. Like FIGS. 2 and 4, FIG. 6 is for obtaining Z1 of Equation 4. This program runs on the control unit 8.

To explain the program list of FIG. 6, step 001 resets the accumulator 5 and thereby resets the computation. Next, in steps 002 to 005, the partial products for bit 1, the lowest power term of each element of the constant vector, that is, the 2^1 term, are cumulatively added or subtracted cycle by cycle, selecting the corresponding input vector element each time. Because the output of the adder/subtractor 9 is connected to the upper 11 bits of the accumulator 5, the first accumulation result is obtained with the sixth bit of the accumulator 5 as its least significant bit. The constant vector elements C0, C1, C2, and C3 are positive and C4, C5, C6, and C7 are negative, so each partial product is added or subtracted according to that sign. Here the third control term selects the input vector element and the fourth control term controls the right shift of the accumulator output. The second control term selects addition or subtraction in the adder/subtractor 9 and switches between them according to the sign of the constant vector element. In short, in this one stage the adder/subtractor 9 accumulates the partial products of all input vector elements for the same power-of-two term, that is, the same digit, of the constant vector elements, and the result is stored in the accumulator 5.

Next, in step 006, the right-shifted output of the accumulator 5 is selected and input to the adder/subtractor 9 in order to start accumulating the partial products of the next higher digit. In steps 006 to 011, the partial products of the 2^2 term of each constant vector element are cumulatively added or subtracted. Likewise, the partial products are accumulated for the 2^3 term in steps 012 to 015, the 2^4 term in steps 016 to 019, the 2^5 term in steps 020 to 025, the 2^6 term in steps 026 to 029, the 2^7 term in steps 030 to 033, and the 2^8 term in steps 034 to 039. In the numerical example of this embodiment there is no 2^9 term. In other words, the shifter 10 shifts the accumulation result stored in the accumulator 5 right, the partial products of the next higher digit are accumulated, and this is repeated up to the most significant digit.

In steps 040 and 041, the contents of the accumulator 5 are simply shifted right by 1 bit each to align the significant digits. The final result is 16 bits in total: an 11-bit signed integer part and a 5-bit fractional part.

According to this embodiment, there is no separate multiplier and the device has only the input element register 2, the adder/subtractor 9, the accumulator 5, and the shifter 10, so a regular arithmetic device with a small amount of hardware can be built. Because one of the vectors in the inner product is assumed to be constant, the inner product can be computed with only simple control of input vector element (register) selection, shift amount, and addition or subtraction. A conventional multiplier-based configuration needs addition/subtraction precision equal to the maximum word length of the multiplication result. Here, partial products are accumulated starting from the lowest digit, and intermediate results that are shifted right beyond the word length of the accumulator 5 are truncated automatically. By accumulating partial products of the same digit in order from the lowest digit, even an arithmetic unit whose word length is smaller than the maximum word length of the multiplication can compute the inner product without losing the carries from lower-digit accumulation, that is, while preserving accuracy. Furthermore, no barrel shifter is used to obtain the partial products, so the circuit scale can be reduced.
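The accumulate-then-shift-right scheme of this embodiment can be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not the circuit itself: Python integers are unbounded, so the 11-bit adder and 16-bit accumulator word lengths are not modeled, and the final alignment shifts of steps 040 and 041 are omitted. The function name shift_accumulate_dot and the parameter frac_bits (the number of accumulator bits below the adder output, 5 in the text) are hypothetical.

```python
def shift_accumulate_dot(xs, consts, nbits, frac_bits=5):
    """LSB-first inner product with a right-shifting accumulator.

    For each bit of the constant magnitudes, lowest first, the selected
    inputs are added to or subtracted from the upper part of the
    accumulator. Between bit positions the accumulator is shifted right by
    one bit; bits that fall below `frac_bits` are truncated.

    Returns the accumulator as an integer with `frac_bits` fractional bits,
    scaled so that the highest constant bit (nbits - 1) has weight 1.
    """
    acc = 0
    for k in range(nbits):
        if k > 0:
            acc >>= 1                    # shifter: 1-bit arithmetic right shift
        for x, c in zip(xs, consts):
            if (abs(c) >> k) & 1:        # a partial product exists for this bit
                term = x << frac_bits    # adder output feeds the upper bits
                acc += -term if c < 0 else term
    return acc
```

When frac_bits is at least nbits - 1, nothing is truncated and the result equals the exact inner product scaled by 2^(frac_bits - (nbits - 1)). With fewer fractional bits, the truncation error in this model stays below one least significant bit of the accumulator, because every earlier truncation is halved by each later shift.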

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described with reference to FIGS. 7 and 8. Parts identical to those of the first to third embodiments described above are given the same reference numerals, and their description is omitted. FIG. 7 is a block diagram of the inner product calculation device according to the fourth embodiment of the present invention. FIG. 8 is a program list showing the inner product calculation operation of the inner product calculation device shown in FIG. 7.

This embodiment has the same basic configuration as the third embodiment, but the output of the input element register 2 can be selected to be passed through as is or doubled (shifted by 1 bit). To this end, a shifter 11 with a sign extension function, serving as second shift means that doubles the output of the input element register 2, and a selector 12 are added, and one input of the adder/subtractor 13 is 9 bits wide.

In addition, the shifter 14 serving as the first shift means is changed so that, under a control signal from the control unit 8, it shifts the operation result stored in the accumulator 5 right by a fixed 2 bits.

The basic operation of this embodiment is the same as that of the third embodiment. As described for the second embodiment, however, the second-order Booth algorithm is adopted: for every two bits of a constant vector element, the 3-bit pattern formed by appending the next lower bit determines the partial product to be added or subtracted, namely the input vector element itself, “M”, or twice its value, “2M”, which is then accumulated. That is, either the input vector element or the input vector element doubled by the shifter 11 is selected and accumulated by the adder/subtractor 13. Generation and selection of the element itself or its doubled value are done by controlling the shifter 11 and the selector 12 every cycle. When accumulation of the partial products of the same digit position is finished for a 2-bit group of the constant vector elements, the shifter 14 shifts the contents of the accumulator 5 right by 2 bits to start accumulation for the next higher digits, and the shifted value becomes the input to the next addition or subtraction.

FIG. 8 shows a program list of the operation of this embodiment. Like FIGS. 2, 4, and 6, FIG. 8 is for obtaining Z1 of Equation 4. This program runs on the control unit 8.

To explain the program list of FIG. 8, step 001 resets the accumulator 5 and thereby resets the computation. In steps 002 to 005, for the two bits bit 1 and bit 0, the lowest power terms of each element of the constant vector, that is, the 2^1 and 2^0 terms, each input vector element itself or twice its value is cumulatively added or subtracted cycle by cycle as the partial product, according to the Booth algorithm table shown in FIG. 3. The details of the computation are the same as in the second embodiment and are not repeated here. Because the output of the adder/subtractor 13 is connected to the upper 11 bits of the accumulator 5, the first accumulation result is obtained with the sixth bit of the accumulator 5 as its least significant bit.

Next, in step 006, in accordance with the Booth algorithm, the contents of the accumulator are shifted right by the shifter 14 and input to the adder/subtractor 13 in order to start accumulating the partial products two bits higher. In steps 006 to 013, the partial products of the 2^2 and 2^3 terms of each constant vector element are cumulatively added or subtracted. Likewise, the partial products are accumulated for the 2^4 and 2^5 terms in steps 014 to 021, the 2^6 and 2^7 terms in steps 022 to 025, and the 2^8 and 2^9 terms in steps 026 to 031.

In step 032, the contents of the accumulator 5 are simply shifted right by 2 bits to align the significant digits. The final result is 16 bits in total: a 12-bit signed integer part and a 4-bit fractional part.
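The combination of Booth selection and a 2-bit right shift described for this embodiment can be sketched in Python as follows. This is an illustrative model only, assuming the standard second-order Booth table for FIG. 3 and unbounded integers in place of the fixed word lengths; the final alignment shift of step 032 is omitted. The names BOOTH, booth_shift_accumulate, and frac_bits are hypothetical.

```python
# Second-order Booth table, assumed to match FIG. 3.
BOOTH = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
         0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_shift_accumulate(xs, consts, nbits, frac_bits=4):
    """Inner product, two constant bits per pass, accumulator shifted right by 2.

    nbits must be even and every |c| must be below 2**(nbits - 1).
    Returns the accumulator with `frac_bits` fractional bits, scaled so that
    the highest bit pair (starting at bit nbits - 2) has weight 1.
    """
    acc = 0
    for k in range(0, nbits, 2):
        if k > 0:
            acc >>= 2                              # first shift means: 2-bit right shift
        for x, c in zip(xs, consts):
            # 3-bit pattern (b[k+1], b[k], b[k-1]) of the magnitude, b[-1] = 0
            pattern = ((abs(c) << 1) >> k) & 0b111
            digit = BOOTH[pattern]
            if digit == 0:
                continue                           # nothing to accumulate
            operand = x << 1 if abs(digit) == 2 else x   # second shifter + selector
            term = operand << frac_bits
            subtract = (digit < 0) != (c < 0)      # constant sign flips add/subtract
            acc += -term if subtract else term
    return acc
```

With frac_bits of at least nbits - 2 the model is exact, and the returned value equals the inner product scaled by 2^(frac_bits - (nbits - 2)).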

According to this embodiment, the inner product is computed by accumulating a partial inner product for every two bits of the constant vector elements, so the number of partial products to accumulate is reduced, the control steps are reduced, and the number of operation cycles can be cut further. For example, comparing the third embodiment (FIG. 6) with this embodiment (FIG. 8), FIG. 6 requires 41 steps whereas FIG. 8 requires 32, an improvement in operation speed of about 20%.

(Fifth embodiment)
Next, a fifth embodiment of the present invention will be described with reference to FIGS. 9 to 13. Parts identical to those of the first to fourth embodiments described above are given the same reference numerals, and their description is omitted. FIG. 9 is a block diagram of the arithmetic unit portion of a microprocessor that executes the inner product calculation method according to the fifth embodiment of the present invention. FIG. 10 shows the machine language instruction code format of the microprocessor shown in FIG. 9. FIG. 11 is one part of a program list showing the inner product calculation operation of the microprocessor shown in FIG. 9. FIG. 12 is the other part of that program list. FIG. 13 is a flowchart of the inner product calculation method that runs on the microprocessor shown in FIG. 9.

This embodiment presents an inner product calculation method that can be carried out by program instructions, for example on a microprocessor. In particular, it presents a method that can be realized on a microprocessor having neither a dedicated multiplication circuit and instruction nor a dedicated inner product circuit and instruction.

The microprocessor 20 shown in FIG. 9, serving as the arithmetic device, includes: an arithmetic logic unit (ALU) 22 serving as addition/subtraction means, which performs logical AND, logical OR, arithmetic addition, and arithmetic subtraction; an accumulator 24 that stores the results of the ALU 22; a register 26 serving as storage means, connected to the accumulator 24 and a barrel shifter 28 via a bus 30; and the barrel shifter 28 serving as shift means, which shifts operand data sent from the register 26 to the left and sends it to the ALU 22.

The ALU 22 is an arithmetic logic unit with a 16-bit word length. One input is connected to the output of the barrel shifter 28, so that operand data sent from the register 26 is shifted left before entering the ALU 22. The other input is connected to the output of the accumulator 24, and the output of the ALU 22 is connected to the accumulator 24, so that the results of the ALU 22 accumulate in the accumulator 24.

The accumulator 24 has a 16-bit word length and is connected to the other input of the ALU 22 and, via the 8-bit-wide bus 30, to the register 26, so that data accumulated in the accumulator 24 can be transferred to the register 26.

The register 26 comprises, for example, thirty-two 8-bit-wide general-purpose registers, R0 to R31.

The barrel shifter 28 shifts the 8-bit data (operand data) stored in the register 26 by the shift amount specified in the machine language instruction code described below, and outputs the result to the ALU 22 as 16-bit data.

As shown in FIG. 10, the machine language instruction code 37 contains the operation type C, the extension designation S, and the shift amount BSH. The operation type C covers addition, subtraction, logical AND, and logical OR, which are distinguished by the value of C. The extension designation S selects between zero extension and sign extension: S is set to 0 for zero extension and to 1 for sign extension. The shift amount BSH can be specified from 0 to 15 digits.

FIGS. 11 and 12 show a program list for performing the inner product calculation on the microprocessor 20 configured as above. The input vector elements are assumed to be stored in the registers labeled TmR0 (R0 of the register 26) through TmR7 (R7 of the register 26). Like FIGS. 2, 4, 6, and 8, the program lists of FIGS. 11 and 12 are for obtaining Z1 of Equation 4.

Next, the instruction mnemonics will be described.
lda #0
This is the mnemonic of an instruction that loads a zero value into the accumulator 24.
add TmR0:s0
This is the mnemonic of an instruction that reads the value of the register labeled “TmR0”, the source operand, and adds it to the value of the accumulator 24. The auxiliary code “:s0” appended to the right of the source operand “TmR0” means that the register value is read with a left shift, with sign and bit extension, by the number of bits given after “s”. Similarly, “sub” is the subtraction instruction.

“sta” is an instruction that writes the contents of the accumulator 24 to the destination operand. It is written, for example, as
sta TmR12:z5
Here the auxiliary code “:z5” is appended to the right of the destination operand “TmR12”; it causes the contents of the accumulator 24 to be shifted right by the number of bits given after “z” and transferred to the register 26. “lda” is an instruction that reads an immediate value or the register value of the source operand into the accumulator 24.
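The behavior of these mnemonics can be pictured with the small Python model below. It is a hypothetical sketch, not the instruction set of the microprocessor 20: the class and method names are invented, and two details are assumptions not stated in the text, namely that the accumulator wraps as a 16-bit two's-complement value and that “sta” with “:zN” performs a logical right shift and stores the low 8 bits.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class TinyAccumulatorMachine:
    """Hypothetical model: one 16-bit accumulator and 8-bit labeled registers."""

    def __init__(self):
        self.acc = 0      # accumulator, kept as a signed 16-bit value
        self.regs = {}    # label -> 8-bit pattern (0..255)

    def _operand(self, label, shift):
        # ":sN": sign-extend the 8-bit register value, then shift left by N
        return to_signed(self.regs[label], 8) << shift

    def lda_imm(self, value):          # lda #value
        self.acc = to_signed(value, 16)

    def add(self, label, shift=0):     # add label:sN
        self.acc = to_signed(self.acc + self._operand(label, shift), 16)

    def sub(self, label, shift=0):     # sub label:sN
        self.acc = to_signed(self.acc - self._operand(label, shift), 16)

    def sta(self, label, shift=0):     # sta label:zN
        # Assumption: logical right shift of the 16-bit pattern, low 8 bits stored
        self.regs[label] = ((self.acc & 0xFFFF) >> shift) & 0xFF
```

In this model, the sequence `lda_imm(0)`, `add("TmR0", 4)`, `sub("TmR1", 1)` accumulates 16 times TmR0 minus 2 times TmR1, which is how shifted adds and subtracts stand in for multiplication by a constant.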

In the program lists of FIGS. 11 and 12, a line marked with a semicolon is a comment line, and the instruction on that line is not executed. The instruction mnemonics are left in the figures as comments to aid understanding. The commented-out instructions represent the addition or subtraction of partial products for digits where the bit of the constant vector is “0”; as explained for the first embodiment, these additions and subtractions are not executed.

The numerical values of the constant vector elements and the operation of the program lists of FIGS. 11 and 12 are the same as in the first embodiment; the operation is outlined below with reference to the flowchart of FIG. 13.

First, a zero value is loaded into the accumulator 24 to reset it (step S1; the first line of FIG. 11). Then the partial product for the least significant digit of a constant vector element is obtained and accumulated, and this is repeated until all constant vector elements have been accumulated (steps S2 to S4; for example, the eight steps under COEF_bit1 MUL/ADD in FIG. 11). Step S2 corresponds to the first step in the claims, and step S3 corresponds to the second step in the claims.

Next, the process moves up one digit (step S5), obtains the partial product for that digit of a constant vector element, accumulates it, and repeats until all constant vector elements have been accumulated (steps S6 to S8; for example, the eight steps under COEF_bit2 MUL/ADD in FIG. 11). This is repeated until the most significant digit has been processed, at which point the process ends (step S11). Step S6 corresponds to the first step in the claims, and step S7 corresponds to the second step in the claims.
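The loop structure of steps S1 to S8 and S11 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the overflow check and truncation of steps S9 and S10 are left out, Python integers stand in for the 16-bit accumulator, and the function name bitplane_dot is hypothetical.

```python
def bitplane_dot(xs, consts, nbits):
    """Inner product by accumulating shifted inputs, one constant digit at a time."""
    acc = 0                                        # S1: load zero into the accumulator
    for k in range(nbits):                         # S5: move up one digit per pass
        for x, c in zip(xs, consts):               # S2-S4 and S6-S8: all elements
            if not (abs(c) >> k) & 1:
                continue                           # zero bit: no add/sub is issued
            partial = x << k                       # first step: shifted operand
            acc += -partial if c < 0 else partial  # second step: accumulate
    return acc                                     # S11: finished at the top digit
```

Skipping the zero bits corresponds to the commented-out lines of the program list, so the number of executed add and subtract instructions equals the number of “1” bits across the constant magnitudes.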

When the accumulation result reaches the maximum digit of the accumulator 24, it is shifted right by 5 bits for truncation (round-down) processing (steps S9 and S10; the part of FIG. 11 labeled intermediate result / Overflow (16bit) avoidance). Step S10 corresponds to the third step in the claims. As in the first embodiment, these 5 truncated bits are determined in advance, calculated from a number of bit digits no greater than the bit word length of the adder/subtractor 4 minus the bit word length of the input vector elements and minus the base-2 logarithm of the number of input vector elements.
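The bound on the number of truncated bits can be checked with a short Python calculation. This is only an illustration: the function name is hypothetical, rounding the logarithm up for element counts that are not powers of two is an assumption, and the example values (a 16-bit word length, 8-bit input elements, and the eight input elements X0 to X7) are taken from this section on the assumption that they are the ones the rule refers to.

```python
def max_truncation_bits(adder_bits, element_bits, num_elements):
    """Upper bound W - w - log2(N) on the number of bits to truncate.

    (num_elements - 1).bit_length() equals ceil(log2(num_elements)) for
    num_elements >= 1; rounding up is an assumption for non-powers of two.
    """
    log2_n = (num_elements - 1).bit_length()
    return adder_bits - element_bits - log2_n


# 16-bit word length, 8-bit elements, 8 elements: 16 - 8 - 3 = 5
bits = max_truncation_bits(16, 8, 8)
```

With these values the bound is 5, which agrees with the 5-bit right shift used in steps S9 and S10.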

In this embodiment the result is stored in the accumulator 24 and is 16 bits wide, but to pass it to the program instructions of the next stage it is transferred in two halves, lower and upper, to the registers labeled TmR12 (R12 of the register 26) and TmR13 (R13 of the register 26). The truncation by a 5-bit right shift in the middle of the computation is likewise carried out separately for the upper and lower halves, because of the register width and the instruction set.

According to this embodiment, on a microprocessor 20 having neither a dedicated multiplication circuit and instruction nor a dedicated inner product circuit and instruction, add and subtract instructions that incorporate a shift of the source operand are used to accumulate partial products of the same digit in order from the lowest digit. The inner product can therefore be computed without losing the carries from lower-digit accumulation, that is, while preserving accuracy. In addition, because there is a step that bit-shifts and rounds the intermediate accumulation result, an accurate inner product can be computed even on a microprocessor whose arithmetic unit has a word length smaller than the maximum word length of the multiplication.

(Sixth embodiment)
Next, a sixth embodiment of the present invention will be described with reference to FIG. 14. Parts identical to those of the first to fifth embodiments described above are given the same reference numerals, and their description is omitted. FIG. 14 is a program list showing the inner product calculation operation of the microprocessor according to the sixth embodiment of the present invention.

This embodiment has the same configuration as the fifth embodiment but, as described for the second and fourth embodiments, adopts the second-order Booth algorithm: for every two bits of a constant vector element, the 3-bit pattern formed by appending the next lower bit determines the partial product to be added or subtracted, namely the input vector element itself, “M”, or twice its value, “2M”, which is then accumulated. That is, the first step obtains the partial product by selecting either the input vector element itself or twice its value, and the second step accumulates the partial products two bits at a time.

FIG. 14 shows a program list of the operation of this embodiment. Like FIGS. 2, 4, 6, 8, 11, and 12, FIG. 14 is for obtaining Z1 of Equation 4. The details of the computation are the same as in the second embodiment and are not repeated here.

According to this embodiment, the inner product is computed by accumulating a partial inner product for every two bits of the constant vector elements, so the number of partial products to accumulate is reduced, the control steps are reduced, and the number of operation cycles can be cut further. For example, comparing the fifth embodiment (FIGS. 11 and 12) with this embodiment (FIG. 14), FIGS. 11 and 12 require 48 steps whereas FIG. 14 requires 38, an improvement in operation speed of about 20%.

The present invention is not limited to the embodiments described above. Various modifications can be made without departing from the gist of the present invention.

1 Inner product arithmetic device
2 Input element register (storage means)
3 Barrel shifter (shift means)
4 Adder/subtracter (addition/subtraction means)
5 Accumulator
6 Shifter (rounding means)
7 Selector
8 Control unit (arithmetic control means)
9 Adder/subtracter (addition/subtraction means)
10 Shifter (first shift means)
11 Shifter (second shift means)
12 Selector
13 Adder/subtracter (addition/subtraction means)
14 Shifter (first shift means)
20 Microprocessor (arithmetic device)
22 ALU (addition/subtraction means)
24 Accumulator
26 Register (storage means)
28 Barrel shifter (shift means)

Japanese Patent Publication No. 5-26229
JP 2000-132539 A

Claims

An inner product arithmetic device for obtaining an inner product of an input vector composed of a plurality of input vector elements having a predetermined bit word length and a constant vector composed of a plurality of constant vector elements, the device comprising:
storage means for storing the plurality of input vector elements;
shift means for selecting an input vector element from the storage means and shifting the selected input vector element left to obtain a partial product of a power-of-two term of the constant vector element and the input vector element;
addition/subtraction means for accumulating the partial products obtained by the shift means, configured with a number of bit digits smaller than the multiplication precision required when the input vector element and the constant vector element are multiplied;
an accumulator for storing the accumulation result of the addition/subtraction means;
rounding means for rounding the operation result by truncating, through a bit shift of a predetermined number of digits, the intermediate result stored in the accumulator during accumulation by the addition/subtraction means; and
arithmetic control means that causes the addition/subtraction means to accumulate the partial products of all input vector elements related to the least significant power-of-two term of the constant vector elements and to store the result in the accumulator, thereafter repeats the accumulation of the partial products related to successively higher power-of-two terms up to the most significant power-of-two term, and, before overflow of the addition/subtraction means occurs, causes the rounding means to truncate the lower digits of the intermediate result stored in the accumulator and sets the truncated result as the initial value of the subsequent accumulation.

The inner product arithmetic device according to claim 1, wherein the arithmetic control means causes the shift means to obtain the partial product for every two bits of the input vector element based on a predetermined table, and causes the addition/subtraction means to accumulate the partial products.

The inner product arithmetic device according to claim 1 or 2, wherein the number of bit digits by which the rounding means truncates the intermediate result stored in the accumulator during accumulation by the addition/subtraction means is predetermined as a number of bit digits equal to or less than a number obtained by subtracting a logarithm with 2 as the base.

An inner product arithmetic device for obtaining an inner product of an input vector composed of a plurality of input vector elements having a predetermined bit word length and a constant vector composed of a plurality of constant vector elements, the device comprising:
storage means for storing the plurality of input vector elements;
addition/subtraction means for selecting an input vector element from the storage means and obtaining and accumulating a partial product of a power-of-two term of the constant vector element and the selected input vector element, configured with a number of bit digits smaller than the multiplication precision required when the input vector element and the constant vector element are multiplied;
an accumulator for storing the accumulation result of the addition/subtraction means on its upper-digit side;
first shift means for shifting the contents of the accumulator right, toward the lower digits, to obtain the value for the subsequent accumulation, configured with a number of bit digits smaller than the multiplication precision required when the input vector element and the constant vector element are multiplied; and
arithmetic control means for repeating, up to the most significant power-of-two term of the constant vector elements, an operation in which the addition/subtraction means accumulates the partial products of all input vector elements related to the power-of-two term of the same digit of the constant vector elements and stores the result in the accumulator, and the first shift means shifts the accumulation result stored in the accumulator right.
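The accumulate-then-shift-right loop recited in this claim can be sketched in Python as follows. This is only an illustration under stated assumptions: the constants are read as two's-complement integers of `const_bits` bits, Python's unbounded integers stand in for the accumulator (so no digits are lost, whereas a fixed-width accumulator would drop the bits shifted out), and the function name `inner_product_right_shift` is hypothetical.

```python
def inner_product_right_shift(xs, cs, const_bits):
    """Inner product by bit planes: add on the upper-digit side, then shift right.

    cs are read as two's-complement integers of const_bits bits (assumption).
    """
    acc = 0
    for k in range(const_bits):               # same-digit term, LSB first
        plane = 0
        for x, c in zip(xs, cs):
            if (c >> k) & 1:                  # bit k of this constant is set
                plane += x
        if k == const_bits - 1:
            plane = -plane                    # sign bit carries negative weight
        acc += plane << const_bits            # accumulate on the upper-digit side
        acc >>= 1                             # right shift before the next digit
    return acc
```

A term added in pass k is shifted right `const_bits - k` times in total, which leaves it with weight 2**k, so the loop reproduces the ordinary inner product without ever shifting an input element left by a digit-dependent amount.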

The inner product arithmetic device according to claim 4, further comprising second shift means for doubling the input vector element stored in the storage means,
wherein the arithmetic control means selects, based on a predetermined table, either the input vector element stored in the storage means or the input vector element doubled by the second shift means, causes the addition/subtraction means to accumulate the selected value, and causes the first shift means to shift by 2 bits.

An inner product calculation method for obtaining an inner product of an input vector and a constant vector, using storage means for storing the input vector composed of a plurality of input vector elements having a predetermined bit word length, shift means for shifting an input vector element left to obtain a partial product of a power-of-two term of the constant vector, which is composed of a plurality of constant vector elements, and the input vector element, addition/subtraction means for accumulating the partial products obtained by the shift means, configured with a number of bit digits smaller than the multiplication precision required when the input vector element and the constant vector element are multiplied, and an accumulator for storing the accumulation result of the addition/subtraction means, and using an instruction that can execute addition/subtraction and an operand shift integrally in a microprocessor, the method comprising:
a first step in which the shift means selects the input vector element from the storage means and obtains the partial product;
a second step of accumulating the partial products of all the input vector elements related to the least significant power-of-two term of the constant vector elements and storing the result in the accumulator; and
a third step of truncating, before overflow of the addition/subtraction means occurs, the lower digits of the intermediate result stored in the accumulator and setting the truncated result as the initial value of the subsequent accumulation,
wherein the first step and the second step are repeated sequentially to accumulate the partial products for successively higher power-of-two terms up to the most significant power-of-two term, and the third step is performed at least once during the repetition of the first and second steps.
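The three steps of this method can be sketched in Python as below. The sketch is an assumption-laden illustration, not the claimed implementation: constants are read as two's-complement integers of `const_bits` bits, truncation happens once, after the bit plane `round_after`, and the names `inner_product_lsb_first`, `round_after`, and `drop` are hypothetical.

```python
def inner_product_lsb_first(xs, cs, const_bits, round_after=None, drop=0):
    """Shift-and-add inner product over bit planes of cs, LSB first.

    After bit plane `round_after` the accumulator is shifted right by `drop`
    bits, and the return value then approximates sum(x * c) >> drop.
    """
    if round_after is not None and drop > round_after + 1:
        raise ValueError("drop must not exceed round_after + 1")
    acc = 0
    dropped = 0
    for k in range(const_bits):
        shift = k - dropped                   # later shifts shrink after truncation
        for x, c in zip(xs, cs):
            if (c >> k) & 1:
                partial = x << shift          # first step: partial product by left shift
                if k == const_bits - 1:
                    acc -= partial            # sign bit has negative weight
                else:
                    acc += partial            # second step: accumulate
        if k == round_after:
            acc >>= drop                      # third step: truncate lower digits
            dropped = drop
    return acc
```

With `xs = [3, -5, 7, 2]`, `cs = [5, -3, 6, 1]`, and `const_bits = 4`, the untruncated call returns 74. Truncating 2 digits after plane 1 (`round_after=1, drop=2`) returns 18, which corresponds to 72 at the original scale, so the error stays below 2**2.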

The inner product calculation method according to claim 6, wherein in the first step the partial product is obtained by selecting either the input vector element itself or a value twice the input vector element, and
in the second step the partial products are accumulated every two bits.

The inner product calculation method according to claim 6 or 7, wherein the number of bit digits truncated in the third step from the intermediate result stored in the accumulator during accumulation by the addition/subtraction means is predetermined as a number of bit digits equal to or less than the number obtained by subtracting, from the bit word length of the addition/subtraction means, the bit word length of the input vector element and the base-2 logarithm of the number of elements of the input vector.