JP4243277B2

JP4243277B2 - Data processing device

Info

Publication number: JP4243277B2
Application number: JP2005508741A
Authority: JP
Inventors: 辰男落合; 治赤平
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2003-08-28
Filing date: 2003-08-28
Publication date: 2009-03-25
Anticipated expiration: 2023-08-28
Also published as: WO2005024625A1; JPWO2005024625A1

Description

The present invention relates to a data processing apparatus, and more particularly to a technique for implementing rounding with an efficient hardware configuration in the fixed-point arithmetic unit of a processor element.

Image processing, including video compression, consists of repeating relatively simple calculation algorithms and offers a high degree of data parallelism under a single instruction. SIMD (Single Instruction Multiple Data stream) parallel computation is therefore well suited to speeding up image processing.
MPEG (ISO/IEC 14496-2 (MPEG-4), ISO/IEC 13818-2 (MPEG-2), ISO/IEC 11172-2 (MPEG-1)) is a known family of video compression standards. Under these standards, a digitized image is divided into blocks, a motion vector is detected for each block, DCT (discrete cosine transform) and quantization are applied, and the result is Huffman coded to compress the image data.
When a SIMD parallel computing architecture is applied to MPEG video compression, each of a plurality of processors can serve as the calculation unit for one image block. For motion vector detection, the shifted blocks of the search range are placed in the local memories of the processors, the block to be compressed is broadcast from the control system to all processors, and the frame differences are computed in parallel, so a speedup proportional to the number of processors can be expected. Likewise, the block images to be compressed, or the block-wise frame-difference data at the detected motion vector positions, are placed in the local memories of the processors, and DCT (or IDCT) or quantization (or inverse quantization) is computed in parallel, again with an expected speedup proportional to the number of processors.

According to the inventors' study, in the technique described above, fixed-point data has its binary point moved by a shifter in the processor, either to adjust the bit count changed by a multiplication or to adjust the number of significant integer digits for addition and subtraction. When LSB-side bits are dropped, or the data is shifted right by the shifter, to reduce a multiplication result to a given data width, the LSB of the data must be rounded to avoid degrading arithmetic precision.
For rounding, IEEE 754, the floating-point format standard, defines methods for rounding an intermediate result to a given bit of the mantissa before output. The standard provides four methods: round to nearest, round toward −∞, round toward +∞, and round toward 0. An application can select any of these methods (hereinafter, rounding modes) according to the precision it needs.
In the technique described above, the hardware simply truncates, so data is rounded toward −∞. Using round to nearest, round toward +∞, or round toward 0 requires doing the work in software: the program examines the bits to be discarded according to the desired rounding method, judges their state, and adds 1 to the least significant bit of the rounded result when required.
Implementing rounding in a program, however, requires a shift operation to evaluate the bits to be discarded and a condition-code acquisition by the ALU (arithmetic logic unit) to judge their state.
Consequently, even for an operation that pipelining can process at an apparent rate of one datum per step in the −∞ direction, where the hardware does nothing special, round to nearest, round toward +∞, and round toward 0 require extra processing steps for rounding, and speed suffers. In a SIMD parallel arithmetic configuration in particular, multiple data items exist simultaneously for one program, so it is difficult to gain speed by branching according to the data state, and the processing time of every case must be spent.
An object of the present invention is to provide a data processing device that reduces the overhead of rounding with an efficient circuit configuration and efficiently improves the numerical precision of an information processing device that uses a fixed-point arithmetic unit.
The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.
A brief outline of representative aspects of the invention disclosed in this application is as follows.
That is, the data processing device of the present invention has the following features.
(1) The data processing device comprises a processor, which includes a shifter having a right bit shift function and an ALU having a carry-in function that adds 1-bit data to the least significant bit when adding two input data, and a control unit that controls the processor with a single instruction. The shifter has rounding evaluation means that, while outputting the shift result, evaluates the bits discarded by a right bit shift and outputs 1-bit data indicating whether "1" must be added to the least significant bit of the shift result. The ALU takes the shifter output as one of its two input data and takes the 1-bit data output by the rounding evaluation means as the data for the carry-in function.
When data with different binary point positions are added, the data with fewer integer digits is shifted right for alignment before the addition so that the integer digits are preserved. With the above configuration, the shifter first outputs, together with the alignment right shift, a 1-bit indication of whether a rounding increment is needed; the ALU then adds this bit at the same time as it adds the two aligned data. The time otherwise spent rounding the right-shifted data is thus eliminated.
Moreover, the rounding increment can use the add-with-carry-in function of a conventional ALU, so the only additions are the rounding evaluation in the shifter and the circuit that outputs the resulting 1-bit data.
(2) The right shift in the shifter and the add with carry-in in the ALU are operations on two's complement data.
By using conventional two's complement arithmetic for the shifter and ALU operations, rounding can be implemented for signed operations with the same configuration. For example, if the shifter and ALU operations can be switched between unsigned and signed in response to a control signal from the control system, the same rounding benefit is obtained even in applications that mix unsigned and signed data.
(3) Rounding mode selection means is provided, and the rounding evaluation means performs the rounding evaluation corresponding to each of the rounding modes selectable by the rounding mode selection means.
By making the rounding evaluation circuit support multiple rounding modes, rounding in the desired mode can be implemented with the same configuration. For example, if the rounding mode can be selected in response to a control signal from the control system, the same benefit is obtained in applications that switch rounding modes dynamically.
(4) There are a plurality of processors, and the rounding mode selection means selects the rounding mode by data storage means provided in each of the processors.
Using storage means such as a register in each processor to select the rounding mode reduces the number of control signals from the control system to the processors.
(5) The rounding evaluation is the logical OR of the bits discarded by the right bit shift.
This implements rounding toward +∞ (rounding up) for unsigned data and for signed two's complement data. For sign-magnitude data, it implements rounding toward the infinity of the data's sign.
(6) The rounding evaluation is the logical AND of the logical OR of the bits discarded by the right bit shift and the sign of the shifted data.
This implements rounding toward 0 for signed two's complement data. For sign-magnitude data, it implements rounding toward −∞.
(7) The rounding evaluation is the logical AND of two terms: the most significant of the bits discarded by the right bit shift, and the logical OR of the remaining discarded bits and the least significant bit of the shift result.
This implements round to nearest for unsigned data, signed two's complement data, and sign-magnitude data.
(8) The device is formed on a single semiconductor substrate.
To obtain truncation as in the prior art, the rounding mode selection means need only provide a state in which the 1-bit output from the rounding evaluation circuit to the ALU is always negated. The device then rounds toward −∞ for signed two's complement data and toward 0 (truncation) for unsigned or sign-magnitude data.
Adding the rounding evaluation circuit adds only two kinds of terminals to the shifter: the mode designation signal and the 1-bit output to the ALU. Even when the rounding evaluation circuit supports multiple rounding modes, the logical OR of the discarded bits can be shared among them, so the circuit stays small.
In this way, the overhead of rounding is reduced with an efficient circuit configuration.

FIG. 1 is a block diagram showing a circuit configuration that implements rounding according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed circuit configuration of the shifter of FIG. 1.
FIG. 3 is a diagram showing the truth table for the operation of the rounding evaluation circuit of FIG. 2.
FIG. 4 is a diagram showing the configuration of a SIMD parallel DSP that includes the circuit configuration of FIG. 1.
FIG. 5 is a diagram showing a configuration in which storage means for specifying the shift count is added to the shifter of FIGS. 2 and 3.
FIG. 6 is a diagram showing another configuration of the data operation execution section of FIG. 4.
FIG. 7 is a block diagram showing the detailed circuit configuration of the shifter of FIG. 6.
FIG. 8 is a diagram showing the truth table for the operation of the rounding evaluation circuit of FIG. 7.
FIG. 9 is a diagram showing a configuration in which the rounding of the present invention is applied to a conventional data processing device.

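Embodiments of the present invention are described in detail below with reference to the drawings. In all drawings for explaining the embodiments, identical members carry identical reference numerals, and repeated description is omitted.
FIG. 1 is a block diagram showing a circuit configuration that implements rounding according to an embodiment of the present invention.
In FIG. 1, n denotes the number of bits of the data signals and can be, for example, 32.
The shifter 10 performs a shift operation on n-bit data supplied through the shifter input line, selected from either the general-purpose register file 20 or an external source, and outputs the result to the n-bit shifter output latch 30. At the same time, it outputs the rounding evaluation result to the 1-bit latch r40.
The data on the shifter input line is selected by, for example, an externally supplied control signal (not shown).
The shifter 10 provides arithmetic and logical shifts of up to m (< n) bits to the left or right, and performs the operation specified by, for example, an externally supplied control signal (not shown).
An arithmetic shift differs from a logical shift in that it treats the input as a two's complement binary number and does not change the sign (the MSB). For a right shift, the vacated upper bits of the output are filled with the sign bit (the MSB of the input) in an arithmetic shift and with 0 in a logical shift.
For a left shift, the vacated lower bits of the output are filled with 0 in both arithmetic and logical shifts. In an arithmetic shift, however, overflow processing is applied to any input whose upper (shift count + 1) bits are not all identical, and the extreme value with the same sign as the input is output (for 32 bits, the positive maximum is H'7fffffff and the negative maximum is H'80000000, where H' is the prefix for a hexadecimal number).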
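The shift behaviour described above can be sketched in software. This is an illustrative model only, assuming n = 32 as in the text; the function names `to_signed`, `shift_right`, and `shift_left_arith` are hypothetical and do not come from the patent.

```python
N = 32  # data width assumed in the text's example

def to_signed(u, n=N):
    """Interpret the low n bits of u as a two's complement integer."""
    u &= (1 << n) - 1
    return u - (1 << n) if u >> (n - 1) else u

def shift_right(x, s, arithmetic=True, n=N):
    """Right shift by s bits: sign fill (arithmetic) or zero fill (logical)."""
    if arithmetic:
        return to_signed(x, n) >> s       # Python's >> on a negative int fills with the sign
    return (x & ((1 << n) - 1)) >> s      # unsigned view, so zeros enter from the left

def shift_left_arith(x, s, n=N):
    """Arithmetic left shift that saturates instead of changing the sign.

    Overflow occurs exactly when the upper s+1 bits of x are not all equal,
    which is the same as the exact product x * 2**s leaving the n-bit range.
    """
    y = to_signed(x, n) << s
    hi, lo = (1 << (n - 1)) - 1, -(1 << (n - 1))
    return hi if y > hi else lo if y < lo else y
```

The saturation values match the H'7fffffff and H'80000000 limits given in the text when read as signed integers.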
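The shifter 10 also has a rounding evaluation function: it outputs 1 to the latch r40 when 1 must be added to round the bits shifted out to the right into the right-shift result, and 0 otherwise.
The ALU 50 performs an arithmetic or logical operation on three inputs: first n-bit data or the constant "0", supplied through the ALU input (1) line and selected from the shifter output latch 30, the general-purpose register file 20, or an external source; second n-bit data or the constant "0", supplied through the ALU input (2) line and selected from the accumulation register ΣR60 or the general-purpose register file 20; and 1-bit data selected by the selector 90 from the 1-bit latch r40 or the latch CO70. It outputs the result to the n-bit ALU output latch 80 and to the accumulation register ΣR60.
At the same time, it outputs the carry out of the most significant bit to the 1-bit latch CO70. The selection of data on the ALU input (1) and (2) lines, of the constant 0, and of the 1-bit data is made by, for example, an externally supplied control signal (not shown).
The ALU 50 provides addition and subtraction as arithmetic operations, as well as logical operations, and performs the operation specified by, for example, an externally supplied control signal (not shown). The arithmetic operations are signed addition and subtraction, which treat the first and second data as two's complement binary numbers, and unsigned addition and subtraction, which treat them as unsigned binary numbers. For both, a carry-in function that simultaneously adds 1-bit data at the least significant bit (LSB) can be selected.
The logical operations are, without particular limitation, OR, AND, XOR, NOR, NAND, and XNOR on corresponding bits of the first and second data.
The accumulation register ΣR60 is written with the n-bit data output by the ALU 50 in response to, for example, an externally supplied control signal (not shown).
The general-purpose register file 20 is, without particular limitation, a group of registers of n bits × multiple words with one input and two outputs; registers specified by, for example, an externally supplied control signal (not shown) are written and read.
Specifically, in response to the control signal, n-bit data supplied through the RF write line, selected from the shifter output latch 30 or the ALU output latch 80, is written to the register specified by general-purpose register address 1 (RFA1) contained in the control signal.
Of the two outputs of the general-purpose register file 20, one carries the data of the register specified by general-purpose register address 0 (RFA0) contained in the control signal and feeds the shifter input line and the ALU input (1) line. The other carries the data of the register specified by general-purpose register address 1 (RFA1) and feeds the ALU input (2) line and the outside. When the register specified by RFA1 is being written, the output data is the data before the write.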
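Next, operation using rounding in the circuit configuration of FIG. 1 is explained with an example.
The example adds three two's complement binary numbers A, B, and C, stored in the general-purpose register file 20 with different fixed-point positions, and stores the sum in the general-purpose register file 20.
For simplicity, the data width is n = 32 bits, and a fixed-point position is written as name(integer bits.fraction bits). The three data are A(10.22), B(1.31), and C(15.17), and the sum X has the position X(16.16).
[Processing step 1]
The general-purpose register file 20 is given the address of A(10.22) as RFA0 and outputs A(10.22).
At the same time, the general-purpose register file 20 is selected on the shifter input line, passing A(10.22) to the shifter 10.
At the same time, a 6-bit arithmetic right shift is specified to the shifter 10.
With this control, in processing step 1 the data obtained by arithmetically shifting A(10.22) right by 6 bits is output to the shifter output latch 30, and the rounding evaluation result for the 6 discarded bits is output to the latch r40.
[Processing step 2]
The general-purpose register file 20 is given the address of B(1.31) as RFA0 and outputs B(1.31).
At the same time, the general-purpose register file 20 is selected on the shifter input line, passing B(1.31) to the shifter 10.
At the same time, a 15-bit arithmetic right shift is specified to the shifter 10.
At the same time, the shifter output latch 30 is selected on the ALU input (1) line, passing the shift result of A from processing step 1 as the first data of the ALU 50.
At the same time, the constant 0 is selected on the ALU input (2) line, making the second data of the ALU 50 zero.
At the same time, a signed addition with carry-in, taking the carry-in from the latch r40, is specified to the ALU 50.
At the same time, a write is specified to the accumulation register ΣR60.
With this control, in processing step 2 the data obtained by arithmetically shifting B(1.31) right by 15 bits is output to the shifter output latch 30, and the rounding evaluation result for the 15 discarded bits is output to the latch r40. In addition, the ALU 50 rounds the 6 LSB-side bits dropped when aligning A(10.22) to A(16.16), and the result is output to the accumulation register ΣR60.
[Processing step 3]
The general-purpose register file 20 is given the address of C(15.17) as RFA0 and outputs C(15.17).
At the same time, the general-purpose register file 20 is selected on the shifter input line, passing C(15.17) to the shifter 10.
At the same time, a 1-bit arithmetic right shift is specified to the shifter 10.
At the same time, the shifter output latch 30 is selected on the ALU input (1) line, passing the shift result of B from processing step 2 as the first data of the ALU 50.
At the same time, the accumulation register ΣR60 is selected on the ALU input (2) line, passing the rounded A from processing step 2 as the second data of the ALU 50.
At the same time, a signed addition with carry-in, taking the carry-in from the latch r40, is specified to the ALU 50.
At the same time, a write is specified to the accumulation register ΣR60.
With this control, in processing step 3 the data obtained by arithmetically shifting C(15.17) right by 1 bit is output to the shifter output latch 30, and the rounding evaluation result for the 1 discarded bit is output to the latch r40. In addition, the ALU 50 simultaneously rounds the 15 LSB-side bits dropped when aligning B(1.31) to B(16.16) and adds the result to A(16.16) aligned in processing step 2; the result is output to the accumulation register ΣR60.
[Processing step 4]
The shifter output latch 30 is selected on the ALU input (1) line, passing the shift result of C from processing step 3 as the first data of the ALU 50.
At the same time, the accumulation register ΣR60 is selected on the ALU input (2) line, passing the rounded sum of A and B from processing step 3 as the second data of the ALU 50.
At the same time, a signed addition with carry-in, taking the carry-in from the latch r40, is specified to the ALU 50.
With this control, in processing step 4 the ALU 50 simultaneously rounds the 1 LSB-side bit dropped when aligning C(15.17) to C(16.16) and adds the result to the sum of A(16.16) and B(16.16); the result is output to the ALU output latch 80.
[Processing step 5]
The ALU output latch 80 is selected on the RF write line.
The general-purpose register file 20 is given the address at which X(16.16) is to be stored as RFA1, and a write is specified.
With this control, in processing step 5 the sum of A, B, and C, each rounded and aligned to (16.16), is output to the general-purpose register file 20, giving the desired X(16.16).
As described, adding multiple data with different fixed-point positions takes as many processing steps as there are data plus two steps of pipeline overhead, and the processing steps otherwise needed to round the bits discarded by alignment are eliminated.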
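The pipelined flow of processing steps 1 to 4 can be traced with a small software model. This is a sketch under stated assumptions: the text does not say which rounding mode the example uses, so round to nearest (ties to even) is assumed here, overflow of the 32-bit sum is ignored, and the names `shift_round_bit` and `add3_to_q16_16` are hypothetical.

```python
def shift_round_bit(x, s):
    """Model of the shifter: arithmetic right shift by s bits plus a round bit.

    The round bit uses the nearest (ties to even) rule, an assumed mode.
    """
    d = x & ((1 << s) - 1)                       # the s discarded bits
    top = (d >> (s - 1)) & 1                     # most significant discarded bit
    rest = int(d & ((1 << (s - 1)) - 1) != 0)    # OR of the remaining discarded bits
    q = x >> s                                   # value for the shifter output latch
    return q, top & (rest | (q & 1))             # (latch 30, latch r40)

def add3_to_q16_16(a_q22, b_q31, c_q17):
    """Add A(10.22), B(1.31), and C(15.17) into X(16.16)."""
    latch30, r40 = shift_round_bit(a_q22, 6)     # step 1: shifter works on A
    acc = latch30 + 0 + r40                      # step 2: ALU rounds A via carry-in
    latch30, r40 = shift_round_bit(b_q31, 15)    # step 2: shifter works on B
    acc = latch30 + acc + r40                    # step 3: ALU adds rounded B
    latch30, r40 = shift_round_bit(c_q17, 1)     # step 3: shifter works on C
    return latch30 + acc + r40                   # step 4: ALU adds rounded C
```

In each step the ALU consumes the latches written in the previous step while the shifter prepares the next operand, so the rounding increment costs no step of its own.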
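Next, the detailed circuit configuration of the shifter 10 of FIG. 1 is described.
FIG. 2 is a block diagram showing the detailed circuit configuration of the shifter of FIG. 1.
In FIG. 2, the barrel shifter 100 shifts the n-bit data on the shifter input line according to the control signals and outputs n bits.
Among the control signals, the arithmetic/logical shift select signal instructs the barrel shifter 100, without particular limitation, to perform an arithmetic shift when 0 and a logical shift when 1.
The shift count is, without particular limitation, a signal carrying a value within ±n encoded in two's complement (denoted m); a positive value gives a left shift count and a negative value a right shift count.
The rounding mode select signal tells the rounding evaluation circuit 110 which rounding mode to use. Without particular limitation, it is 2 bits: B'00 selects round toward −∞, B'01 round toward +∞, B'10 round toward 0, and B'11 round to nearest (B' is the prefix for a binary number).
Round toward −∞ rounds to the nearest value not greater than the true value, where the true value is the value represented by the input data and a nearest value is the closest value, among those representable at the fixed-point position of the output data, that satisfies the condition.
Round toward +∞ rounds to the nearest value not less than the true value.
Round toward 0 rounds to the nearest value whose absolute value is not greater than that of the true value.
Round to nearest rounds unconditionally to the nearest value. When two nearest values are equidistant from the true value, it rounds to the one whose LSB is "0".
When the shift count m specifies a right shift (m < 0), the rounding evaluation circuit 110 evaluates, in the rounding mode given by the rounding mode select signal, whether 1 must be added to round the lower |m| bits of the input data (the discarded bits) into the shift result (the output of the barrel shifter 100). It outputs one bit: 1 if the addition is needed, 0 if not.
FIG. 3 shows the truth table for the operation of the rounding evaluation circuit 110 of FIG. 2.
Based on the truth table of FIG. 3, the 1-bit evaluation output R is expressed by the following logical expression for each rounding mode.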
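(1) Round toward −∞: R = 0
(2) Round toward +∞: R = (OR of the discarded bits)
= (OR of bits |m|−1 through 0 of the input data)
(3) Round toward 0: R = (input data is negative) ∩ (OR of the discarded bits)
= (MSB of the input data) ∩ (OR of bits |m|−1 through 0 of the input data)
(4) Round to nearest: R = (most significant discarded bit) ∩ {(OR of the discarded bits other than the most significant) ∪ (bit that becomes the LSB of the output data)}
= (bit |m|−1 of the input data) ∩ {(OR of bits |m|−2 through 0) ∪ (bit |m| of the input data)}
In these expressions, "∪" denotes logical OR and "∩" denotes logical AND. "Bit i" is the bit at position i when the bits of the input data are numbered with consecutive integers from 0 at the LSB to n−1 at the MSB.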
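The four expressions for R translate directly into bit operations. The following is a minimal sketch for two's complement input, using the 2-bit mode encoding given in the text; the function names `round_bit` and `shift_round` are hypothetical, and `s` stands for |m|.

```python
def round_bit(x, s, mode, n=32):
    """Return R (0 or 1) for an arithmetic right shift of x by s bits, 1 <= s < n.

    mode: 0b00 toward -inf, 0b01 toward +inf, 0b10 toward 0, 0b11 nearest.
    x is a Python int within the signed n-bit range.
    """
    discarded = x & ((1 << s) - 1)               # bits s-1 .. 0 of the input
    any_set = int(discarded != 0)                # OR of the discarded bits (shared term)
    if mode == 0b00:
        return 0
    if mode == 0b01:
        return any_set
    if mode == 0b10:
        return any_set & ((x >> (n - 1)) & 1)    # AND with the sign bit
    top = (discarded >> (s - 1)) & 1             # bit s-1
    rest = int(discarded & ((1 << (s - 1)) - 1) != 0)  # OR of bits s-2 .. 0
    lsb = (x >> s) & 1                           # bit s, the LSB of the result
    return top & (rest | lsb)

def shift_round(x, s, mode, n=32):
    """Shift result plus R, as the ALU carry-in would add it."""
    return (x >> s) + round_bit(x, s, mode, n)
```

For example, 7 shifted right by 1 bit represents 3.5 and −7 represents −3.5; the four modes then give 3, 4, 3, 4 and −4, −3, −3, −4 respectively, with the nearest mode resolving each tie to the even value.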
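As described with FIGS. 2 and 3, adding the rounding function to the shifter 10 adds only two terminals to the shifter block: the rounding mode select signal and the 1-bit evaluation output. This minimizes inter-block wiring when, for example, the circuit is divided into blocks on a semiconductor substrate. Furthermore, even with multiple rounding modes, logic such as (OR of bits |m|−2 through 0) in the expressions above can be shared, which limits the increase in implementation area.
Next, the configuration of a SIMD parallel DSP is described, comprising a plurality of processors that each include the circuit configuration of FIG. 1 and one control unit that controls them.
FIG. 4 shows the configuration of a SIMD parallel DSP that includes the circuit configuration of FIG. 1; it is formed, for example, on a semiconductor substrate.
In FIG. 4, the control unit 200 comprises, without particular limitation, a program execution control section 220 containing a program memory 210 and a data control section 240 containing a data memory 230.
Data can be input to and output from the program memory 210 and the data memory 230 externally, enabling information processing according to an externally set calculation procedure (algorithm).
The processor array 250 consists of a plurality of identically configured processors 260. All processors 260 are connected to the control unit 200 by an instruction bus, a broadcast data bus, and a common data bus via tri-state buffers.
Each processor 260 also receives a processor select signal, shared with the control of the tri-state buffer provided for that processor, and the control unit 200 always selects exactly one processor 260.
Each processor 260 comprises a processor control section 270 and a data operation execution section 280.
The data operation execution section 280 can have, for example, the circuit configuration of FIG. 1.
The processor control section 270 controls the data operation execution section based on the instructions supplied from the control unit 200 through the instruction bus.
The instructions that the program execution control section 220 in the control unit 200 outputs in synchronization with the processing steps use, for example, a VLIW (Very Long Instruction Word) format, and the operation of each part is controlled horizontally. That is, the instruction for each step output by the control unit 200 has a field for each part, so multiple functional blocks can be controlled horizontally in one step.
Of these instructions, the instruction for the processors 260 (the processor instruction field) is delivered to all processors 260 through the instruction bus. That is, the control unit 200 controls all processors 260 with the same instruction.
The processor instruction field delivered to a processor 260 is passed to the processor control section 270 within it.
The processor control section 270 operates according to the instruction addressed to it within the processor instruction field (the processor control section instruction field) and passes the instruction for the data operation execution section 280 (the data operation execution section instruction field) to that section.
When an instruction in the processor control section instruction field specifies address-masked execution, the processor control section 270 masks all write instructions in the data operation execution section instruction field with the negated state of the processor select signal before passing them to the data operation execution section 280. This makes it possible to select and operate a single processor 260, for example to set data specific to each processor in the general-purpose register file 20 of the data operation execution section 280.
The processor control section 270 also receives condition code signals (CC) from the data operation execution section 280. When an instruction in the processor control section instruction field specifies group-masked execution, the processor control section 270 masks all write instructions in the data operation execution section instruction field with the specified condition code signal before passing them to the data operation execution section 280.
The condition code signals are, without particular limitation, signals indicating the operation state of the data operation execution section 280. They comprise the carry-out signal output from the ALU 50 through the latch CO70 in FIG. 1 and, although not shown in FIG. 1, the following signals output from the ALU 50 through the latch 80 in the same way: a sign signal (the MSB of the data), a zero signal (asserted when the result is 0), and an overflow signal (the XOR of the carry out of the MSB and the carry out of bit MSB−1).
This allows operation that depends on the internal state of each processor 260. For example, conditional execution is possible in which only the processors 260 whose result was negative are group-masked and their data is sign-inverted by subtraction from 0, yielding the absolute value.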
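The absolute-value example can be pictured as a masked write-back across the processor array. The sketch below is a loose software analogy, not the hardware: each list element stands for one processor's register, and the polarity of the mask (writes enabled where the sign condition code is asserted) is an assumption. The name `simd_abs` is hypothetical.

```python
def simd_abs(values):
    """Model group-masked execution of 'subtract from 0' across all processors."""
    sign_cc = [v < 0 for v in values]     # sign condition code per processor
    negated = [0 - v for v in values]     # the same instruction is issued to every processor
    # Write-back takes effect only where the condition code enables it.
    return [neg if cc else v for v, neg, cc in zip(values, negated, sign_cc)]
```

Because every processor executes the same subtraction and only the write is gated, no data-dependent branch is needed, which is the property the SIMD configuration relies on.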
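The data operation execution section instruction field passed to the data operation execution section 280 in each processor 260 contains a shifter instruction field for the shifter 10, an ALU instruction field for the ALU 50, and a general-purpose register file instruction field for the general-purpose register file 20. These control the functional blocks of the circuit configuration of FIG. 1 to carry out the rounding illustrated in processing steps 1 to 5 above.
The rounding mode select signal for the rounding evaluation circuit in the shifter of FIG. 2 is controlled by a rounding mode select register RMR290 provided in the processor control section 270. The data on the broadcast data bus is written to RMR290 when an instruction in the processor control section instruction field specifies an RMR write.
Without particular limitation, because the rounding mode select signal of FIGS. 2 and 3 is 2 bits, RMR290 is a 2-bit register, and the lower 2 bits of the n-bit (for example, 32-bit) data from the broadcast data bus are written to it.
Because the rounding mode rarely needs to change at every processing step, no rounding mode field is needed on the instruction bus; the mode can be set through the register RMR290, which limits growth of the instruction bus width.
Results are read out of a processor 260 by selecting the desired processor 260 with the processor select signal; the tri-state buffer 300 of that processor 260 then enters the drive state, and the data is output to the control unit through the common data bus.
As described above, in the SIMD parallel DSP semiconductor device of FIG. 4, the processors 260 operate in parallel on their own local data under a single instruction and data issued by one control system, so parallel algorithms based on addition and subtraction can be processed at high speed with appropriately rounded arithmetic.
The embodiment above focused on the rounding of the present invention, but it can of course be modified in various ways according to the application.
For example, a storage means that specifies the shift count of the shift operation can be added to each processor of the SIMD parallel DSP, making the shift count data specific to each processor.
FIG. 5 shows an example in which such a storage means is added to the shifter of the present invention described with FIGS. 2 and 3.
In FIG. 5, the storage means is a shift bit register SBR400 with the number of bits needed to represent the shift count of the shifter of FIG. 2 (for example, 6 bits when n = 32). Data is set in it in response to a write instruction added to the data operation execution section instruction field.
The data written to SBR400 is, without particular limitation, the output of the ALU 50 passed through an overflow processing circuit 410. The overflow processing circuit 410 takes the n-bit ALU output as a two's complement binary integer and outputs it saturated to a two's complement binary integer with the bit width of SBR400.
That is, with a 6-bit SBR400, an ALU output from −31 to 31 is passed through unchanged, an output of 32 or more (positive overflow) yields 31, and an output of −32 or less (negative overflow) yields −31.
The shift count supplied to the shifter 10 is selected by a selector 420 from the instruction in the data operation execution section instruction field and the shift bit register SBR400. Without particular limitation, the selector 420 is controlled by a data compare circuit 430: when the shift count specified in the instruction is the most negative value (−32 for a 6-bit shift count), the SBR output is selected; otherwise the shift count specified in the instruction is selected. The selector switches according to the "0" or "1" output of the compare circuit 430.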
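The saturation in the overflow processing circuit and the selection rule for the shift count can be sketched as follows, assuming the 6-bit example from the text. The constant and function names (`SBR_BITS`, `USE_SBR`, `saturate_to_sbr`, `select_shift_count`) are hypothetical.

```python
SBR_BITS = 6
USE_SBR = -(1 << (SBR_BITS - 1))   # -32, the most negative 6-bit value, used as a sentinel

def saturate_to_sbr(alu_out):
    """Overflow processing: clamp the ALU output to -31..31 before it enters SBR."""
    limit = (1 << (SBR_BITS - 1)) - 1          # 31
    return max(-limit, min(limit, alu_out))

def select_shift_count(instr_count, sbr):
    """Selector: the sentinel in the instruction means 'use the per-processor SBR'."""
    return sbr if instr_count == USE_SBR else instr_count
```

Clamping negative overflow to −31 rather than −32 keeps the stored value distinct from the sentinel, which appears to be why the range is symmetric.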
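To process parallel algorithms such as image processing and neural networks at high speed with the embodiment above, each processor can be given a multiplier for fast multiply-accumulate operations and a local memory for distributing more processor-specific data.
FIG. 6 shows another circuit configuration of the data operation execution section in the processor of FIG. 4, in which a multiplier and a local memory LM are added to the circuit configuration of FIG. 1.
In FIG. 6, the multiplier 500 multiplies first n-bit data, supplied through the multiplier input (1) line and selected from the shifter output latch 30, the general-purpose register file 20, or an external source, by second n-bit data, supplied through the multiplier input (2) line and selected from the general-purpose register file 20 or the output latch 520 of the local memory LM510. It passes the 2n-bit result (64 bits when n is 32) to the multiplier output latch 530. The data on the multiplier input (1) and (2) lines is selected by, for example, an externally supplied control signal (not shown).
The output of the multiplier output latch 530 is passed to the shifter input line.
The local memory LM510 is a RAM (Random Access Memory) of n bits × multiple words. In response to, for example, an externally supplied control signal (not shown), it writes n-bit data, selected from the shifter output latch 30, the ALU output latch 80, or an external source, to the address supplied by the local memory pointer register LMPR550 in the LM address control section 540, or reads the data stored at that address into the local memory output latch 520 that follows it.
The output of the local memory output latch 520 is passed to the multiplier input (2) line, the shifter input line, and the ALU input (2) line.
The LM address control section 540 contains the local memory pointer register LMPR550, which outputs the write/read address of the local memory LM510. According to, for example, an externally supplied control signal (not shown), it increments or decrements LMPR550 or writes to it the data supplied from the ALU output latch 80.
Besides serving as the address for LM510, the output of LMPR550 is passed to the RF write line, so it can be saved to the general-purpose register file 20 and operated on by the ALU 50.
With the multiplier 500, LM510, and LM address control section 540 added, the change to the functional blocks of FIG. 1 is that the input data width of the shifter 600 is increased to match the output data width of the multiplier 500.
The shifter input line is 2n bits wide. When the data selected by the control signal is n-bit data from the general-purpose register file 20, an external source, or the local memory output latch 520, it is placed on the MSB-side n bits of the shifter input line, and the LSB-side n bits are 0.
FIG. 7 is a block diagram showing the detailed circuit configuration of the shifter of FIG. 6.
In FIG. 7, the barrel shifter 610 performs the same shift operations as the barrel shifter 100 of FIG. 2, referenced to the MSB of the 2n-bit input data, and outputs n bits. The difference from the barrel shifter 100 is in a left shift without overflow: the vacated LSB-side bits of the output are 0 in FIG. 2, whereas the barrel shifter 610 of FIG. 7 fills them with the upper bits of the lower n bits of the input data.
FIG. 8 shows the truth table for the operation of the rounding evaluation circuit of FIG. 7. Because the barrel shifter 610 shifts the 2n-bit input relative to the MSB and outputs n bits, bits are discarded even in a left shift (including a shift of 0 bits). Accordingly, the rounding evaluation circuit 620 of FIG. 7 differs from the rounding evaluation circuit 110 of FIG. 2 in that it evaluates the discarded bits of the 2n-bit input data regardless of the shift direction, that is, always.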
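One way to read the MSB-referenced shift is that a shift count m (positive for left, negative for right) leaves the low n − m bits of the 2n-bit input to be discarded and rounded. The sketch below follows that reading, which is an interpretation rather than something the text states in this form; it assumes round to nearest (ties to even), omits left-shift saturation, and uses the hypothetical name `shift_2n_to_n`.

```python
def shift_2n_to_n(x, m, n=32):
    """Reduce a signed 2n-bit value to n bits with an MSB-referenced shift by m.

    Returns (n-bit result, round bit). Requires m < n so that k >= 1.
    """
    k = n - m                                    # number of low-order bits discarded
    d = x & ((1 << k) - 1)                       # the discarded bits
    q = x >> k                                   # arithmetic shift keeps the sign
    top = (d >> (k - 1)) & 1                     # most significant discarded bit
    rest = int(d & ((1 << (k - 1)) - 1) != 0)    # OR of the remaining discarded bits
    return q, top & (rest | (q & 1))
```

As a usage example, multiplying two (1.31) values gives a (2.62) product; a left shift of 1 then returns a (1.31) result, so 0.5 × 0.5 held as `(1 << 30) * (1 << 30)` yields `1 << 29` with a round bit of 0.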
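As a result, a rounding function can be built into an information processing device that processes parallel algorithms such as image processing or neural networks at high speed on a SIMD parallel DSP consisting of one control system and a plurality of processors, with a comparatively small circuit addition and no extra processing steps. This achieves the object of the present invention: efficiently improving the numerical precision of an information processing device that uses a fixed-point arithmetic unit.
FIG. 9 shows a configuration in which the rounding of the present invention is applied to a conventional data processing device.
In FIG. 9, each processor of the data processing device includes an arithmetic unit 11, a local memory unit 12, and a tri-state buffer 13.
The arithmetic unit 11 includes a processor control circuit 1101 containing the rounding mode select register RMR290 of this embodiment; latch circuits 1102, 1105, 1107, 1112, and 1114; a general-purpose register file 1103; a multiplier 1104; a condition code register (CCR) 1106 containing the CO register 70 of this embodiment; an accumulation register 1108; the ALU (arithmetic logic unit) 50 and selector 90 of this embodiment; an SBR (shift bit register) 1110; the shifter 10 and latch r40 of this embodiment; and an absolute difference calculator 1113.
The local memory unit 12 includes local memories (LM0, LM1) 1202 and 1204; an address calculation circuit 1201 that generates the address signal for the local memory 1202; an address calculation circuit 1203 that generates the address signal for the local memory 1204; and selectors 1205, 1206, 1207, 1208, and 1209.
In the data processing device of FIG. 9, the shifter 10, latch r40, ALU (arithmetic logic unit) 50, and selector 90 operate as described in this embodiment; the other operations are the same as those of the data processing device described in Japanese Patent Application No. 2003-23076.
The invention made by the present inventors has been described specifically based on its embodiments, but the present invention is not limited to those embodiments and can of course be modified in various ways without departing from its gist.
The effects obtained by representative aspects of the invention disclosed in this application are briefly as follows.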
Providing the rounding evaluation function in the shifter adds only the rounding mode select signal and the 1-bit evaluation output to the terminals of the shifter block, and the overhead of rounding is reduced with an efficient circuit configuration. The numerical precision of an information processing device that uses a fixed-point arithmetic unit is thus improved efficiently.
FIG. 1 is a block diagram showing a circuit configuration for realizing a rounding process according to an embodiment of the present invention.
In FIG. 1, n represents the number of bits of the data signal, and can be, for example, 32 bits.
The shifter 10 performs a shift operation on the n-bit data selectively transmitted from the general-purpose register file 20 or the outside via the shifter input line, and outputs the result to the n-bit shifter output latch 30. At the same time, the rounding evaluation result is output to the 1-bit latch r40.
Here, the selection of data on the shifter input line is performed by, for example, a control signal supplied from the outside although not shown.
The shift operation in the shifter 10 has an arithmetic / logical value shift function of left / right m (<n) bit shift, and although not shown, for example, an operation designated by an external control signal or the like is performed.
The difference between the arithmetic shift and the logical shift is that the arithmetic shift is a shift in which the input data is a two's complement binary number, and the code (MSB bit) by the shift operation is not changed. That is, in the case of the right shift, the sign bit (input MSB bit) is packed in the number of upper shift bits of the output data in the arithmetic shift, and 0 is packed in the logical shift.
In the case of left shift, the lower shift bits of the output data are padded with 0 for both arithmetic and logical shifts. However, in the arithmetic shift, overflow processing is performed for input data whose upper shift bits + 1 bit of the input data are not the same. The maximum value equal to the sign of the input data (in the case of 32 bits, the positive maximum value is H'7fffffff, the negative maximum value is H'80000000, where H 'is a prefix for representing a hexadecimal number) is output. The
Further, the shifter 10 has a rounding evaluation function. When 1 must be added to the output data of the right shift in order to round off the bits shifted out to the right, the shifter 10 outputs 1 to the latch r40; otherwise, it outputs 0 to the latch r40.
The ALU 50 performs an arithmetic or logical operation on three inputs and outputs the result to the n-bit ALU output latch 80 and the cumulative addition register ΣR60. The first input is n-bit data alternatively transmitted from the shifter output latch 30, the general-purpose register file 20, or the outside via the ALU input (1) line, or the constant "0". The second input is n-bit data alternatively transmitted from the cumulative addition register ΣR60 or the general-purpose register file 20 via the ALU input (2) line, or the constant "0". The third input is 1-bit data that the selector 90 selects from the 1-bit latch r40 or the latch CO70.
At the same time, the carry (carry out) from the most significant bit is output to the 1-bit latch CO70. Here, although not shown, the selection of data on the ALU input (1) line and the ALU input (2) line, the selection of the constant 0, and the selection of the 1-bit data are performed by, for example, control signals given from the outside.
The ALU 50 provides addition and subtraction as arithmetic operations, as well as logical operations, and although not shown, performs the operation designated by, for example, an external control signal. The arithmetic operations are signed addition / subtraction, in which the first data and the second data are binary numbers in two's complement format, or unsigned addition / subtraction, in which the first data and the second data are unsigned binary numbers. For both kinds of addition / subtraction, a carry-in function that simultaneously adds 1-bit data to the least significant bit (LSB) can be selected.
Although not particularly limited, the logical operation is a logical sum (OR), logical product (AND), exclusive logical sum (XOR), inverted logical sum (NOR), inverted logical product (NAND), or inverted exclusive logical sum (XNOR) of each pair of corresponding bits of the first data and the second data.
Although not shown, the n-bit data output from the ALU 50 is written to the cumulative addition register ΣR60 in response to, for example, a control signal supplied from the outside.
Although not particularly limited, the general-purpose register file 20 has one input and two outputs for a register group of n bits × multiple words. Although not shown, the register designated by, for example, an externally supplied control signal is written or read.
That is, in response to the control signal, the n-bit data alternatively transmitted from the shifter output latch 30 or the ALU output latch 80 via the RF write line is written to the register specified by general-purpose register address 1 (RFA1).
One of the two outputs of the general-purpose register file 20 outputs the data of the register specified by general-purpose register address 0 (RFA0) included in the control signal, and this data is transmitted to the shifter input line and the ALU input (1) line. The other output outputs the data of the register designated by general-purpose register address 1 (RFA1), and this data is transmitted to the ALU input (2) line and to the outside. When the above-described writing is performed on the register designated by general-purpose register address 1 (RFA1), the output data is the data before writing.
Next, an operation using the rounding process according to the circuit configuration example of FIG. 1 will be described.
First, as an example, an operation in the case where three binary numbers A, B, and C having different fixed-point positions stored in the general-purpose register file 20 are added and stored in the general-purpose register file 20 will be described.
Here, for simplicity of explanation, the data width is n = 32 bits, and the fixed-point position of data is written as data name (number of integer bits.number of fractional bits). The fixed-point positions of the three data are A (10.22), B (1.31), and C (15.17), and the fixed-point position of the addition result X is X (16.16).
[Processing step 1]
For the general register file 20, the address where A (10.22) is stored is designated by RFA0 and A (10.22) is output.
At the same time, the general-purpose register file 20 is selected for the shifter input line, and A (10.22) is transmitted to the shifter.
At the same time, a right 6-bit arithmetic shift operation is designated for the shifter 10.
With the above control, in processing step 1, data obtained by arithmetically shifting A (10.22) to the right by 6 bits is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 6 discarded bits is output to the latch r40.
[Processing step 2]
An address where B (1.31) is stored is designated by RFA0 for the general-purpose register file 20, and B (1.31) is output.
At the same time, the general-purpose register file 20 is selected for the shifter input line, and B (1.31) is transmitted to the shifter.
At the same time, a right 15-bit arithmetic shift operation is designated for the shifter 10.
At the same time, the shifter output latch 30 is selected for the ALU input (1) line, and the A shift result in the processing step 1 is transmitted as the first data of the ALU 50.
At the same time, the constant 0 is selected for the ALU input (2) line, and the second data of the ALU 50 is set to 0.
At the same time, a signed addition operation using the carry-in function, with the latch r40 as the carry-in, is designated for the ALU 50.
At the same time, write is designated to the cumulative addition register ΣR60.
With the above control, in processing step 2, data obtained by arithmetically shifting B (1.31) to the right by 15 bits is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 15 discarded bits is output to the latch r40. The ALU 50 rounds off the 6 LSB-side bits discarded when A (10.22) is digit-aligned to A (16.16), and outputs the result to the cumulative addition register ΣR60.
[Processing step 3]
For the general-purpose register file 20, the address where C (15.17) is stored is designated by RFA0, and C (15.17) is output.
At the same time, the general-purpose register file 20 is selected for the shifter input line, and C (15.17) is transmitted to the shifter 10.
At the same time, a right 1-bit arithmetic shift operation is designated for the shifter 10.
At the same time, the shifter output latch 30 is selected for the ALU input (1) line, and the B shift result in the processing step 2 is transmitted as the first data of the ALU 50.
At the same time, the cumulative addition register ΣR60 is selected for the ALU input (2) line, and the rounding result of A in the processing step 2 is transmitted as the second data of the ALU 50.
At the same time, a signed addition operation using the carry-in function, with the latch r40 as the carry-in, is designated for the ALU 50.
At the same time, write is designated to the cumulative addition register ΣR60.
With the above control, in processing step 3, data obtained by arithmetically shifting C (15.17) to the right by 1 bit is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 1 discarded bit is output to the latch r40. In addition, the ALU 50 rounds off the 15 LSB-side bits discarded when B (1.31) is digit-aligned to B (16.16), simultaneously adds the result to A (16.16) digit-aligned in processing step 2, and outputs the sum to the cumulative addition register ΣR60.
[Processing step 4]
The shifter output latch 30 is selected for the ALU input (1) line, and the C shift result in the processing step 3 is transmitted as the first data of the ALU 50.
At the same time, the cumulative addition register ΣR60 is selected for the ALU input (2) line, and the rounding / addition result of A and B in the processing step 3 is transmitted as the second data of the ALU 50.
At the same time, a signed addition operation using the carry-in function, with the latch r40 as the carry-in, is designated for the ALU 50.
With the above control, in processing step 4, the ALU 50 rounds off the 1 LSB-side bit discarded when C (15.17) is digit-aligned to C (16.16), simultaneously adds the result to the sum of A (16.16) and B (16.16) obtained in processing step 3, and outputs the total to the ALU output latch 80.
[Processing step 5]
ALU output latch 80 is selected for the RF write line.
For the general-purpose register file 20, an address for storing X (16.16) is designated by RFA1, and writing is designated.
With the above control, in processing step 5, the result of rounding A, B, and C to (16.16) and adding them is output to the general-purpose register file 20, and the target X (16.16) is obtained.
As described above, the addition of a plurality of data having different fixed-point positions can be processed in a number of processing steps equal to the number of data plus an overhead of two steps for the pipeline, and separate processing steps for rounding off the bits discarded by digit alignment can be eliminated.
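Processing steps 1 to 5 can be traced with a small software model. The sketch below is an illustration under stated assumptions, not the disclosed circuit: the function names `shifter` and `add_aligned`, the operand values, the use of round-to-nearest as the default mode, and the use of unbounded Python integers (so that no overflow occurs) are all chosen here for the example. In each pass the shifter handles the current operand while the ALU adds the previous shifter output, the accumulator, and the latched rounding bit as carry-in.

```python
def shifter(value, k, mode=0b11):
    """Arithmetic right shift by k >= 1 bits plus the 1-bit rounding evaluation R.

    value is a signed Python int standing in for a two's complement word.
    """
    low = value & ((1 << k) - 1)                  # the k bits that are discarded
    top = (low >> (k - 1)) & 1                    # most significant discarded bit
    rest = int(low & ((1 << (k - 1)) - 1) != 0)   # OR of the other discarded bits
    lsb = (value >> k) & 1                        # LSB of the shifted result
    r = {0b00: 0,
         0b01: int(low != 0),
         0b10: int(value < 0 and low != 0),
         0b11: top & (rest | lsb)}[mode]
    return value >> k, r

def add_aligned(operands, out_frac=16, mode=0b11):
    """Sum (raw, frac_bits) operands at out_frac fractional bits, pipelined."""
    sigma = 0        # constant 0 in step 2, then the cumulative register
    latches = None   # (shifter output latch, latch r) written in the last step
    steps = 0
    for op in list(operands) + [None]:            # the extra pass drains the pipeline
        if latches is not None:
            sigma += latches[0] + latches[1]      # ALU add with carry-in = r
        latches = shifter(op[0], op[1] - out_frac, mode) if op else None
        steps += 1
    return sigma, steps + 1                       # +1 for the register-file write

A = 3 * 2**22 + 33        # A (10.22), slightly above 3.0
B = -(2**30) - 2**14      # B (1.31), a tie case after alignment
C = 5                     # C (15.17), a tie case after alignment
X, steps = add_aligned([(A, 22), (B, 31), (C, 17)])
print(X, steps)           # 163843 5
```

With three operands the model takes five steps, matching the count of data plus two steps of pipeline overhead stated above.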
Next, a detailed circuit configuration of the shifter 10 in FIG. 1 will be described.
FIG. 2 is a block diagram showing a detailed circuit configuration of the shifter of FIG. 1.
In FIG. 2, a barrel shifter 100 shifts n-bit data of a shifter input line according to a control signal and outputs n bits.
Of the control signals, the arithmetic shift / logical shift selection signal indicates to the barrel shifter 100, although not particularly limited, an arithmetic shift with 0 and a logical shift with 1.
The number of shift bits is, although not particularly limited, a signal (value m) encoding a value in the range ±n in two's complement format; a positive value indicates the number of left shift bits and a negative value indicates the number of right shift bits.
The rounding mode selection signal instructs the rounding evaluation circuit 110 which rounding mode to use. Although not particularly limited, it is a 2-bit signal: B'00 indicates rounding in the −∞ direction, B'01 rounding in the +∞ direction, B'10 rounding in the 0 direction, and B'11 rounding to the nearest value (B' is a prefix representing a binary number).
Here, rounding in the −∞ direction is a rounding mode that rounds to the nearest value not larger than the true value (the value represented by the input data), where "nearest value" means the closest value satisfying the condition among the values that can be expressed at the fixed-point position of the output data.
Rounding in the +∞ direction is a rounding mode that rounds to the nearest value not smaller than the true value.
Rounding in the 0 direction is a rounding mode that rounds to the nearest value whose absolute value is not larger than the absolute value of the true value.
Rounding to the nearest value is a rounding mode that rounds to the nearest value unconditionally. However, when the two nearest values are equidistant from the true value, the one whose LSB is "0" is chosen.
When the shift bit number m indicates a right shift (m < 0), the rounding evaluation circuit 110 evaluates, in the rounding mode indicated by the rounding mode selection signal, whether 1 must be added to the shift result (the output of the barrel shifter 100) in order to round off the lower |m| bits of the input data (that is, the bits to be truncated), and outputs the 1-bit evaluation result.
FIG. 3 shows a truth table of operations in the rounding evaluation circuit 110 of FIG. 2.
Based on the truth table shown in FIG. 3, the 1-bit output signal R as an evaluation result is expressed as a logical expression for each rounding mode as follows.
(1) Rounding in the −∞ direction: R = 0
(2) Rounding in the +∞ direction: R = (logical sum of the bits to be truncated)
= (logical sum of the (|m|−1)-th to 0th bits of the input data)
(3) Rounding in the 0 direction: R = (input data is negative) ∩ (logical sum of the bits to be truncated)
= (MSB of the input data) ∩ (logical sum of the (|m|−1)-th to 0th bits of the input data)
(4) Rounding to the nearest value: R = (most significant bit of the bits to be truncated) ∩ {(logical sum of the bits to be truncated other than the most significant one) ∪ (bit that becomes the LSB of the output data)}
= ((|m|−1)-th bit of the input data) ∩ {(logical sum of the (|m|−2)-th to 0th bits of the input data) ∪ (|m|-th bit of the input data)}
Here, in the above logical expression, “∪” means a logical sum, and “∩” means a logical product. The “i-th bit” indicates a bit at a position of number i when the LSB of the input data is 0 and the MSB is n−1 and each bit is numbered with an integer from 0 to n−1.
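The four logical expressions can be checked against exact arithmetic. The following Python sketch is an illustration only: the function names, the argument `k` (standing for |m|), and the small width N = 8 (chosen so that every input can be tried; the text uses n = 32) are assumptions. `round_eval` follows the expressions for R, and `reference` rounds the true value x / 2^k exactly with the standard library.

```python
import math
from fractions import Fraction

N = 8  # small width so the check can be exhaustive; the text uses n = 32

def to_signed(x, n=N):
    """Interpret the low n bits of x as a two's complement integer."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >> (n - 1) else x

def round_eval(x, k, mode, n=N):
    """1-bit result R for a right shift by k bits (1 <= k < n)."""
    x &= (1 << n) - 1
    bit = lambda i: (x >> i) & 1
    or_all = int(x & ((1 << k) - 1) != 0)        # OR of bits k-1 .. 0
    or_low = int(x & ((1 << (k - 1)) - 1) != 0)  # OR of bits k-2 .. 0
    if mode == 0b00:                             # toward -infinity
        return 0
    if mode == 0b01:                             # toward +infinity
        return or_all
    if mode == 0b10:                             # toward 0
        return bit(n - 1) & or_all
    return bit(k - 1) & (or_low | bit(k))        # nearest, ties to LSB = 0

def reference(x, k, mode, n=N):
    """Exact rounding of the true value x / 2**k, for comparison."""
    q = Fraction(to_signed(x, n), 1 << k)
    return [math.floor, math.ceil, math.trunc, round][mode](q)

def shifted_plus_r(x, k, mode, n=N):
    """Arithmetic right shift followed by adding R, as the ALU carry-in does."""
    return (to_signed(x, n) >> k) + round_eval(x, k, mode, n)
```

For example, with x = 22 and k = 2 the true value is 5.5; the shift gives 5, R is 1 in nearest mode, and the sum 6 is the neighbor whose LSB is 0.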
As described above with reference to FIGS. 2 to 3, since the rounding evaluation function is provided in the shifter 10, the only terminals added to the shifter block are the rounding mode selection signal and the 1-bit output signal as the evaluation result. This makes it possible, for example, to minimize inter-block wiring when the circuit is divided into blocks and mounted on a semiconductor substrate. Furthermore, even when a plurality of rounding modes are supported, logic such as (logical sum of the (|m|−2)-th to 0th bits) in the above logical expressions can be shared, which suppresses an increase in mounting area.
Next, a configuration of a SIMD type parallel DSP including a plurality of processors configured to include the circuit configuration of FIG. 1 and a single control unit that controls these processors will be described.
FIG. 4 is a diagram showing a configuration of a SIMD type parallel DSP including the circuit configuration of FIG. 1, and is configured on, for example, a semiconductor substrate.
In FIG. 4, the control unit 200 is configured to include a program execution control unit 220 including a program memory 210 and a data control unit 240 including a data memory 230, although not particularly limited.
These program memory 210 and data memory 230 can input / output data from the outside, and can perform information processing according to a calculation procedure (calculation algorithm) set from the outside.
The processor array 250 includes a plurality of processors 260 having the same configuration, and all the processors 260 are connected to the control unit 200 by an instruction bus, a broadcast data bus, and, via tri-state buffers, a common data bus.
Further, each processor 260 receives a processor selection signal, which is also used to control the tri-state buffer provided for that processor 260, and the control unit 200 always selects a single processor 260.
The processor 260 includes a processor control unit 270 and a data operation execution unit 280.
The data operation execution unit 280 can have, for example, the circuit configuration illustrated in FIG. 1.
The processor control unit 270 controls the data operation execution unit based on an instruction given from the control unit 200 via the instruction bus.
The instruction that the program execution control unit 220 in the control unit 200 outputs in synchronization with each processing step is, for example, of the VLIW (Very Long Instruction Word) type, and controls the operation of each unit horizontally. That is, the instruction output by the control unit 200 at each step has a field for each unit, so that a plurality of functional blocks can be controlled horizontally in one step.
Of the above instructions, the instruction for the processor 260 (processor instruction field) is transmitted to all the processors 260 via the instruction bus. That is, the control unit 200 controls all the processors 260 with the same instruction.
The processor instruction field transmitted to the processor 260 is transmitted to the processor control unit 270 in the processor 260.
The processor control unit 270 operates in accordance with the instruction for the processor control unit 270 (processor control unit instruction field) in the transmitted processor instruction field, and transmits the instruction for the data operation execution unit 280 (data operation execution unit instruction field) to the data operation execution unit 280.
At this time, if the processor control unit instruction field contains an address mask execution instruction, the processor control unit 270 masks all the write instructions in the data operation execution unit instruction field with the negation of the processor selection signal before transmitting them to the data operation execution unit 280. Thus, one of the processors 260 can be selected and operated. For example, data specific to each processor can be set in the general-purpose register file 20 in the data operation execution unit 280.
In addition, a condition code signal (CC) is input to the processor control unit 270 from the data operation execution unit 280. When the processor control unit instruction field contains a group mask execution instruction, the processor control unit 270 masks all the write instructions in the data operation execution unit instruction field with the specified condition code signal before transmitting them to the data operation execution unit 280.
Although not particularly limited, the condition code signals indicate the operation state of the data operation execution unit 280 and are: the carry (carry out) signal output from the ALU 50 in FIG. 1 via the latch CO70; and, although not shown in FIG. 1, the sign signal (MSB of the data) output from the ALU 50 via the latch 80, the zero signal (asserted when the operation result is 0), and the overflow signal (exclusive OR of the carry signal of the MSB and the carry signal of the (MSB−1)-th bit).
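The four condition codes can be illustrated with a software model of an n-bit addition with carry-in. This is a sketch under assumptions: the function name `alu_add` and the dictionary keys are hypothetical, and only addition is modeled. The overflow code is formed as the exclusive OR of the carry out of the MSB and the carry out of the bit below it, as the text describes.

```python
def alu_add(a, b, carry_in=0, n=32):
    """n-bit addition with carry-in; returns (result, condition codes)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = a + b + carry_in
    result = total & mask
    carry_out = total >> n                        # carry out of the MSB
    low = mask >> 1                               # all bits below the MSB
    carry_into_msb = ((a & low) + (b & low) + carry_in) >> (n - 1)
    cc = {"carry": carry_out,
          "sign": result >> (n - 1),
          "zero": int(result == 0),
          "overflow": carry_out ^ carry_into_msb}
    return result, cc
```

For instance, adding H'7fffffff and 0 with a carry-in of 1 yields H'80000000 with the sign and overflow codes set and the carry code clear.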
As a result, operations that depend on the internal state of each processor 260 are possible. For example, conditional execution is possible in which only the processors 260 whose operation result is negative are selected by the group mask, and the absolute value is obtained by inverting the sign of their data through subtraction from 0.
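The absolute-value example can be modeled as follows. This is a simplified sketch, assuming one register value per processor and a hypothetical function name `group_masked_negate`; every processor receives the same "subtract from 0" instruction, and the write is enabled only where the sign condition code is set.

```python
def group_masked_negate(regs, n=32):
    """Apply 'subtract from 0' on all processors, writing only where sign = 1."""
    mask = (1 << n) - 1
    out = []
    for r in regs:                           # one entry per processor
        r &= mask
        sign = r >> (n - 1)                  # condition code of the earlier result
        negated = (0 - r) & mask             # computed on every processor
        out.append(negated if sign else r)   # write is masked where sign == 0
    return out

print(group_masked_negate([5, 0xFFFFFFFD, 0]))   # [5, 3, 0]
```

The value H'fffffffd (−3) becomes 3, while the non-negative entries are left unchanged because their writes are masked.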
The data operation execution unit instruction field transmitted to the data operation execution unit 280 in the processor 260 includes a shifter instruction field for the shifter 10, an ALU instruction field for the ALU 50, and a general-purpose register file instruction field for the general-purpose register file 20. By controlling each functional block of the circuit configuration described with reference to FIG. 1, the rounding process exemplified in the above processing steps 1 to 5 can be realized.
Here, the rounding mode selection signal for the rounding evaluation circuit in the shifter shown in FIG. 2 is controlled by the rounding mode selection register RMR 290 provided in the processor control unit 270. Data on the broadcast data bus is written to the rounding mode selection register RMR 290 when the processor control unit instruction field contains an RMR write instruction.
At this time, although not particularly limited, since the rounding mode selection signal shown in FIGS. 2 to 3 is 2 bits, the rounding mode selection register RMR 290 has a 2-bit configuration, and the lower two bits of the n-bit (for example, 32-bit) data input from the broadcast data bus are written to it.
Since few applications need to change the rounding mode at every processing step, no rounding mode selection field needs to be provided in the instruction bus; setting the mode through the rounding mode selection register RMR 290 suffices, which has the effect of limiting the increase in bus width.
The calculation result of a processor 260 is fetched to the outside by selecting the desired processor 260 with the processor selection signal, which puts the tri-state buffer 300 of that processor 260 in the drive state so that the result is output to the control unit 200 via the common data bus.
As described above, in the SIMD parallel DSP semiconductor device shown in FIG. 4, a single instruction and data output from one control system are applied in parallel to the specific data held by the plurality of processors 260. This makes it possible to process parallel algorithms based on addition and subtraction at high speed, with operations that use appropriate rounding.
In the embodiment described above, the explanation has been made focusing on the rounding process of the present invention, but it goes without saying that various modifications can be made according to the purpose of use.
For example, in each processor of the above SIMD type parallel DSP, the number of shift bits can be made data unique to each processor by adding a storage means that designates the number of shift bits of the shift operation.
FIG. 5 shows an embodiment in which a storage means for designating the number of shift bits is added to the shifter according to the present invention described with reference to FIGS.
In FIG. 5, the shift bit register SBR400 serving as the storage means has the number of bits necessary to express the number of shift bits in the shifter shown in FIG. 2 (for example, 6 bits when n = 32), and data is set in it in response to a write instruction added to the data operation execution unit instruction field.
The data written to the shift bit register SBR 400 is, although not particularly limited, the output data of the ALU 50 transmitted via the overflow processing circuit 410. The overflow processing circuit 410 takes the n-bit data output from the ALU 50 as a binary integer in two's complement format, reduces it with saturation, and outputs a binary integer in two's complement format with the number of bits constituting the shift bit register SBR400.
That is, when the shift bit register SBR400 has 6 bits, the output of the ALU 50 is passed through unchanged when it is −31 to 31; 31 is output when the output of the ALU 50 is 32 or more (positive overflow), and −31 is output when it is −32 or less (negative overflow).
The number of shift bits supplied to the shifter 10 is selected by the selector 420 from the value specified by the instruction in the data operation execution unit instruction field and the output of the shift bit register SBR400. Although the control of the selector 420 is not particularly limited, for example, the SBR output is selected when the number of shift bits specified by the instruction in the data operation execution unit instruction field is the negative maximum value (−32 when the number of shift bits is 6 bits), and the number of shift bits specified by the instruction is selected otherwise; the selector 420 is switched by the "0" / "1" output of the data compare circuit 430.
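The saturation and selection behavior can be summarized in a few lines. The sketch below assumes the 6-bit example from the text; the names `overflow_clamp`, `shift_count`, `SBR_BITS`, and `SELECT_SBR` are hypothetical and the functions take already-signed integers rather than bit patterns.

```python
SBR_BITS = 6                               # 6 bits when n = 32, as in the text
SELECT_SBR = -(1 << (SBR_BITS - 1))        # -32: negative maximum of a 6-bit field

def overflow_clamp(alu_out):
    """Saturate a signed ALU result to the range -31 .. 31 before writing SBR."""
    limit = (1 << (SBR_BITS - 1)) - 1
    return max(-limit, min(limit, alu_out))

def shift_count(instr_field, sbr):
    """Selector model: the reserved code -32 in the instruction selects SBR."""
    return sbr if instr_field == SELECT_SBR else instr_field
```

Because the saturation never produces −32, that code stays free to act as the "use SBR" marker in the instruction field; a per-processor count such as `shift_count(-32, overflow_clamp(-40))` evaluates to −31.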
In the embodiment described above, in order to process parallel algorithms such as image processing and neural networks at high speed, the processor may further be provided with a multiplier for high-speed processing of product-sum operations and a local memory for distributed placement of processor-specific data.
FIG. 6 shows a circuit configuration in which a multiplier and a local memory LM are added to the circuit configuration shown in FIG. 1 as another circuit configuration example of the data operation execution unit in the processor in FIG.
In FIG. 6, a multiplier 500 multiplies first n-bit data alternatively transmitted from the shifter output latch 30, the general-purpose register file 20, or the outside via a multiplier input (1) line by second n-bit data alternatively transmitted from the general-purpose register file 20 or the output latch 520 of the local memory LM510 via a multiplier input (2) line, and transfers the resulting 2n-bit data (64 bits when n is 32) to the multiplier output latch 530. Here, although not shown, the data on the multiplier input (1) line and the multiplier input (2) line is selected by, for example, a control signal given from the outside.
The output of the multiplier output latch 530 is transmitted to the shifter input line.
The local memory LM510 is an n-bit × multiple-word RAM (Random Access Memory). Although not shown, in response to, for example, an external control signal, n-bit data alternatively transmitted from the shifter output latch 30, the ALU output latch 80, or the outside is written to the address transmitted from the local memory pointer register LMPR 550 in the LM address control unit 540, or the data stored at that address is read out to the local memory output latch 520 at the subsequent stage.
The output of the local memory output latch 520 is transmitted to the multiplier input (2) line, shifter input line, and ALU input (2) line.
The LM address control unit 540 includes a local memory pointer register LMPR 550 that outputs the write / read address of the local memory LM510. Data transmitted from the ALU output latch 80 is one of the alternatives that can be written to this register.
The output of the local memory pointer register LMPR 550 is transmitted to the RF write line in addition to being supplied as the address to the local memory LM 510, so that it can be saved to the general-purpose register file 20 and operated on by the ALU 50.
With the addition of the multiplier 500, the local memory LM 510, and the LM address control unit 540, the change to the functional blocks shown in FIG. 1 is that the input data width of the shifter 600 is increased to match the output data width of the multiplier 500.
Of the 2n-bit data width of the shifter input line, when the data selected by the control signal is n-bit data from the general-purpose register file 20, the outside, or the local memory output latch 520, that data is placed on the n bits on the MSB side of the shifter input line, and the n bits on the LSB side of the shifter input line become 0.
FIG. 7 is a block diagram showing a detailed circuit configuration of the shifter of FIG. 6.
In FIG. 7, a barrel shifter 610 performs a shift operation similar to that of the barrel shifter 100 shown in FIG. 2, taking the MSB of the 2n-bit input data as the reference, and outputs n bits. The difference from the barrel shifter 100 described with reference to FIG. 2 is that, when no overflow occurs in a left shift, the LSB-side output bits vacated by the shift are 0 in FIG. 2, whereas the barrel shifter 610 in FIG. 7 fills them with the upper bits of the lower n bits of the input data.
FIG. 8 shows a truth table of operations in the rounding evaluation circuit shown in FIG. 7. Since the above-described barrel shifter 610 in FIG. 7 shifts the input 2n-bit data with reference to the MSB and outputs n bits, truncated bits are generated even in a left shift (including a 0-bit shift). Accordingly, the rounding evaluation circuit 620 in FIG. 7 differs from the rounding evaluation circuit 110 in FIG. 2 in that it evaluates the truncated bits of the 2n-bit input data regardless of the direction of the shift operation (that is, always).
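One way to read the MSB-referenced shift is sketched below. This is an interpretation, not a statement of the circuit in FIG. 7: it assumes that the n-bit output is the upper n bits of the shifted 2n-bit word, so that for a shift count m (left positive) the lower n − m input bits are the truncated bits, and it omits left-shift overflow handling. The names `shift_2n` and `to_signed` and the 16.16 operand values are chosen for illustration.

```python
N = 32

def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def shift_2n(x, m, mode=0b11, n=N):
    """x: 2n-bit input pattern; m: signed shift count (left > 0), |m| < n.

    Returns (result as a signed int, rounding bit R). Overflow is not modeled.
    """
    t = n - m                                   # input bits below the output LSB
    v = to_signed(x, 2 * n)
    low = v & ((1 << t) - 1)                    # truncated bits (t >= 1 always)
    top = (low >> (t - 1)) & 1                  # most significant truncated bit
    rest = int(low & ((1 << (t - 1)) - 1) != 0) # OR of the other truncated bits
    lsb = (v >> t) & 1                          # LSB of the output
    r = {0b00: 0,
         0b01: int(low != 0),
         0b10: int(v < 0 and low != 0),
         0b11: top & (rest | lsb)}[mode]
    return v >> t, r

a = int(1.5 * 2**16)                 # 1.5 in 16.16
b = int(2.25 * 2**16)                # 2.25 in 16.16
value, r = shift_2n(a * b, 16)       # 2n-bit product in 32.32, realigned to 16.16
print(value / 2**16, r)              # 3.375 0
```

Under this reading, an n-bit operand placed on the MSB side with a right shift behaves like the FIG. 2 shifter, while a multiplier product can be realigned and rounded in the same step.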
As described above, in an information processing apparatus that processes parallel algorithms such as image processing or neural networks at high speed with a SIMD parallel DSP including one control system and a plurality of processors, the rounding function can be incorporated by adding a relatively small circuit and without increasing the number of processing steps. This achieves the object of the present invention, which is to efficiently improve the numerical calculation accuracy of an information processing apparatus using a fixed-point arithmetic unit.
FIG. 9 shows a configuration in which the rounding process of the present invention is applied to a conventional data processing apparatus.
In FIG. 9, the processor constituting the data processing device includes an arithmetic unit 11, a local memory unit 12, and a tristate buffer 13.
The arithmetic unit 11 includes a processor control circuit 1101 including the rounding mode selection register RMR290 of the present embodiment, latch circuits 1102, 1105, 1107, 1112, and 1114, a general register file 1103, a multiplier 1104, the CO register 70 of the present embodiment, a condition code register (CCR) 1106, an accumulation register 1108, the ALU (arithmetic logic unit) 50 and selector 90 of the present embodiment, an SBR (shift bit register) 1110, the shifter 10 and latch r40 of the present embodiment, and a difference absolute value calculator 1113.
The local memory unit 12 includes local memories (LM0, LM1) 1202 and 1204, an address calculation circuit 1201 for generating an address signal for the local memory 1202, an address calculation circuit 1203 for generating an address signal for the local memory 1204, and selectors 1205, 1206, 1207, 1208, and 1209.
In the operation of the data processing apparatus shown in FIG. 9, the shifter 10, the latch r40, the ALU (arithmetic logic unit) 50, and the selector 90 operate as described in this embodiment. The other operations are the same as those of the data processing apparatus disclosed in Japanese Patent Application No. 2003-23076.
As mentioned above, the invention made by the present inventors has been specifically described based on the embodiments. However, needless to say, the invention is not limited to the embodiments, and various modifications can be made without departing from the scope of the invention.
The effects obtained by typical ones of the inventions disclosed in the present application will be briefly described as follows.
Since the shifter has a rounding evaluation function, the only terminals added to the shifter block are the rounding mode selection signal and the 1-bit output signal as the evaluation result, and the overhead of the rounding process can be reduced with an efficient circuit configuration. As a result, the numerical calculation accuracy of the information processing apparatus using the fixed-point arithmetic unit is efficiently improved.

The present invention can be applied to a single processor, that is, a general DSP, and can be applied to a semiconductor device equipped with a DSP such as a SIMD type parallel DSP.
Further, the present invention can be applied to a circuit configuration for processing floating point arithmetic.
Furthermore, the present invention can be applied to an execution unit (Execution Unit) of a CPU represented by an SH microcomputer.
In addition, the present invention can be applied to all digital signal processing apparatuses that require rounding.

Claims

1. A data processing apparatus comprising: a processor including a shifter having a right bit shift operation function, and an ALU having a carry-in function for adding 1-bit data to the least significant bit in the addition of two input data; and
a control unit for controlling the processor with a single instruction, wherein
the shifter includes rounding evaluation means that, at the same time as the data of the shift operation result is output, performs rounding evaluation on the bits discarded in a right bit shift and outputs 1-bit data indicating whether or not addition of "1" to the least significant bit of the shift operation result is necessary, and
in the ALU, one of the two input data is the output data of the shifter, and the data used in the carry-in function is the 1-bit data output from the rounding evaluation means of the shifter.

2. The data processing apparatus according to claim 1, wherein the right shift operation in the shifter and the addition of the carry-in function in the ALU are operations corresponding to two's complement data.

3. The data processing apparatus according to claim 1, further comprising rounding mode selection means, wherein the rounding evaluation means performs rounding evaluation corresponding to each of a plurality of rounding modes selectable by the rounding mode selection means.

4. The data processing apparatus according to any one of claims 1 to 3, wherein the processor comprises a plurality of processors, and the rounding mode selection means selects the rounding mode by data storage means provided in each of the plurality of processors.

5. The data processing apparatus according to claim 1, wherein the rounding evaluation is a logical sum of bits rounded down by the right bit shift.

6. The data processing apparatus according to claim 2, wherein the rounding evaluation is a logical product of a logical sum of bits rounded down by the right bit shift and a sign of shift operation data.

7. The data processing apparatus according to claim 1, wherein the rounding evaluation is a logical product of (i) a logical sum of the bits other than the most significant bit among the bits truncated by the right bit shift and the least significant bit of the shift operation result, and (ii) the most significant bit among the bits truncated by the right bit shift.

8. The data processing apparatus according to claim 1, wherein the data processing apparatus is configured on one semiconductor substrate.