WO2005024625A1

WO2005024625A1 - Data processor

Info

Publication number: WO2005024625A1
Application number: PCT/JP2003/010976
Authority: WO
Inventors: Tatsuo Ochiai; Osamu Akahira
Original assignee: Hitachi Ulsi Systems Co., Ltd.
Priority date: 2003-08-28
Filing date: 2003-08-28
Publication date: 2005-03-17
Also published as: JPWO2005024625A1; JP4243277B2

Abstract

A data processor that reduces the overhead of rounding with an efficient circuit configuration and efficiently improves the numerical accuracy of an information processor using a fixed-point arithmetic unit. The data processor comprises a processor including a shifter (10) having a right bit-shift function and an ALU (50) having a carry-in function that adds one-bit data to the least significant bit when two input data are added, and a control unit (200) for controlling the processor by a single instruction. The shifter (10) has a rounding evaluation circuit (110) that, while the shifter outputs the shift result, performs rounding evaluation on the bits truncated by a right bit shift and outputs one-bit data indicating whether “1” must be added to the least significant bit of the shift result. The ALU (50) takes the output data of the shifter (10) as one of its two input data and takes the one-bit data output from the rounding evaluation circuit (110) as the data used by the carry-in function.

Description

Data processing device

Technical field

The present invention relates to a data processing device, and more particularly to a technique for implementing a rounding operation with an efficient hardware configuration in a fixed-point arithmetic unit in a processor element.

Background art

Image processing, including video compression, is a repetition of relatively simple calculation algorithms and has high data parallelism for the same instruction. Therefore, a SIMD (Single Instruction Multiple Data stream) parallel computing method is suitable for speeding up image processing.

The MPEG standards (ISO/IEC 14496-2 (MPEG-4), ISO/IEC 13818-2 (MPEG-2), and ISO/IEC 11172-2 (MPEG-1)) are known as video compression standards. According to these standards, a digitized image is divided into blocks, a motion vector is detected for each block, DCT (Discrete Cosine Transform) and quantization are performed, and the image data is compressed by Huffman coding.

If a SIMD-type parallel computing architecture is applied to MPEG video compression, multiple processors can be used with the image block as the unit of calculation. In motion vector detection, the shifted blocks in the detection range are placed in the local memories of the processors, the block to be compressed is broadcast from the control system to all processors, and the frame differences are calculated in parallel, so a speedup proportional to the number of processors can be expected. Likewise, the block difference data between the block image to be compressed and the block at the position of the detected motion vector is stored in the local memories of the processors, and DCT (or IDCT) and quantization (or inverse quantization) are calculated in parallel, so a speedup proportional to the number of processors can again be expected.

Disclosure of the invention

According to the study of the present inventors, in the above-described technique, fixed-point data must be adjusted in a shifter within the processor, either to correct the number of bits changed by multiplication or to change the number of significant digits of the integer part for addition or subtraction. When the LSB-side bits are deleted, or the data is shifted right by the shifter to reduce a multiplication result to a predetermined data width, the LSB of the data must be rounded to avoid degrading calculation accuracy.

For rounding, IEEE 754, the floating-point format standard, defines a method of rounding an intermediate result of an operation to a predetermined bit of the mantissa and outputting the result. This standard provides four selectable rounding modes: round to nearest, round toward -∞, round toward +∞, and round toward 0.

In the above technique, the data is simply truncated in hardware, so it is rounded toward -∞. To use round to nearest, round toward +∞, or round toward 0 instead, the rounding must be written as a program. That is, the program checks the bits to be truncated according to the desired rounding method, determines their state, and, depending on that state, adds 1 to the least significant bit of the rounded data.

However, rounding implemented by a program requires a shift operation to evaluate the bits to be truncated and an ALU (arithmetic operation unit) operation to judge their state from the shifted result. Therefore, while rounding toward -∞, which the hardware provides by simple truncation, can appear to process one data item per step through pipeline processing, round to nearest, round toward +∞, and round toward 0 require additional processing steps for rounding, which degrades speed. Moreover, because a SIMD processor operates on multiple data simultaneously, it is difficult to speed up processing by branching according to the state of the data, and the time for the rounding steps is required in all cases.

An object of the present invention is to provide a data processing device capable of reducing the overhead of rounding with an efficient circuit configuration and of efficiently improving the numerical calculation accuracy of an information processing device using a fixed-point arithmetic unit.

The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.

The following is a brief description of an outline of typical inventions disclosed in the present application.

That is, the data processing device of the present invention has the following features.

(1) A data processing device comprising a processor, which includes a shifter having a right bit shift function and an ALU having a carry-in function that adds 1-bit data to the least significant bit when adding two input data, and a control unit that controls the processor by a single instruction. The shifter has rounding evaluation means that, while the shifter outputs the shift result, performs rounding evaluation on the bits truncated by a right bit shift and outputs 1-bit data indicating whether "1" must be added to the least significant bit of the shift result. The ALU takes the output of the shifter as one of its two input data, and takes the 1-bit data output by the rounding evaluation means of the shifter as the data used by the carry-in function.

When data with different fixed-point positions are added, the data with fewer integer digits is shifted right to align the digits and secure the integer digits before the addition. With the configuration described above, the shifter first outputs, together with the right-shifted result for digit alignment, 1-bit data indicating whether 1 must be added for rounding. That 1-bit data is then added at the same time as the two data are added, so the time required to round the right-shifted data is eliminated. In addition, because the addition of 1 for rounding is performed by the add-with-carry-in function of a conventional ALU, the only additions needed are a rounding evaluation circuit in the shifter and a 1-bit data output for its result.

(2) The right shift operation in the shifter and the addition of the carry-in function in the ALU are operations corresponding to two's complement data.

Thus, by performing the operations of the shifter and the ALU as conventional two's complement arithmetic, rounding for signed arithmetic can be realized with the same configuration as described above. For example, if the operation of the shifter and the ALU can be switched between unsigned and signed operation by a control signal from the control system, rounding with the same effect can be performed even when unsigned and signed data are mixed.

(3) Rounding mode selection means is provided, and the rounding evaluation means performs rounding evaluation corresponding to each of a plurality of rounding modes selectable by the rounding mode selection means.

Thus, by making the rounding evaluation circuit support a plurality of rounding modes, rounding in a desired rounding mode can be realized with the same configuration as described above. For example, if the rounding mode used by the rounding evaluation circuit can be selected by a control signal from the control system, rounding with the same effect can be performed even in applications that change the rounding mode dynamically.

(4) The processor comprises a plurality of processors, and the rounding mode selecting means selects a rounding mode by data storage means provided for each of the plurality of processors.

By implementing the rounding mode selection means as storage means, such as a register, provided for each processor, the number of control signals from the control system to the plurality of processors can be reduced.

(5) The rounding evaluation is a logical sum of bits rounded down by the right bit shift.

Thereby, the rounding evaluation is obtained by taking the logical sum of the bits truncated by the right bit shift in the shifter, which realizes +∞-direction rounding (rounding up) for unsigned data or signed data in two's complement format. For data in signed absolute value format, it realizes rounding in the direction of the sign.

(6) The rounding evaluation is a logical product of the logical sum of the bits rounded down by the right bit shift and the sign of the shift operation data.

Accordingly, the rounding evaluation is the logical product of the logical sum of the bits truncated by the right bit shift in the shifter and the sign of the shifted data, which realizes 0-direction rounding for signed data in two's complement format. For data in signed absolute value format, it realizes -∞-direction rounding.

(7) The rounding evaluation is the logical product of two terms: first, the logical sum of the truncated bits other than their most significant bit together with the least significant bit of the shift result; and second, the most significant bit of the truncated bits.

Accordingly, the rounding evaluation is obtained by taking the logical sum of the bits truncated by the right shift in the shifter, excluding their most significant bit, together with the least significant bit of the shift result, and then taking the logical product of that value with the most significant bit of the truncated bits. This realizes round-to-nearest for unsigned data, signed two's complement data, or signed absolute value data.

(8) It is configured on one semiconductor substrate.

In the above data processing device, to realize truncation as in the related art, the rounding mode selection means may provide a mode in which the 1-bit data output from the rounding evaluation circuit to the ALU is always negated. This mode operates as -∞-direction rounding for signed data in two's complement format, and as 0-direction rounding for unsigned data or data in signed absolute value format.
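The statements above about data in signed absolute value (sign-magnitude) format can be checked with a small model. The following Python sketch is only an illustration under assumptions: the function names, the representation of a value as a separate sign bit and magnitude, and the rule labels are all hypothetical and do not come from the embodiment. It applies each 1-bit evaluation to the magnitude after a right shift.

```python
import math
from fractions import Fraction


def truncated_or(mag, m):
    """Logical sum (OR) of the low m bits of mag, as a 0/1 value."""
    return 1 if mag & ((1 << m) - 1) else 0


def round_sign_magnitude(sign, mag, m, rule):
    """Shift the magnitude right by m (m >= 1) and add the 1-bit evaluation.

    sign: 0 for positive, 1 for negative; mag: non-negative magnitude.
    rule: "none" (always 0), "or" as in (5), "or_and_sign" as in (6),
          "nearest" as in (7). These labels are hypothetical.
    Returns the rounded value as an ordinary Python int.
    """
    shifted = mag >> m
    if rule == "none":
        r = 0
    elif rule == "or":
        r = truncated_or(mag, m)
    elif rule == "or_and_sign":
        r = truncated_or(mag, m) & sign
    elif rule == "nearest":
        top = (mag >> (m - 1)) & 1          # MSB of the truncated bits
        rest = truncated_or(mag, m - 1)     # OR of the other truncated bits
        r = top & (rest | (shifted & 1))
    else:
        raise ValueError(rule)
    rounded = shifted + r
    return -rounded if sign else rounded


# -11/4 = -2.75 in sign-magnitude form: sign = 1, magnitude = 0b1011, m = 2
exact = Fraction(-11, 4)
print("floor of exact value:", math.floor(exact))
for rule in ("none", "or", "or_and_sign", "nearest"):
    print(rule, round_sign_magnitude(1, 0b1011, 2, rule))
```

In this model, "none" behaves as rounding toward 0, "or" as rounding away from 0 (in the direction of the sign), "or_and_sign" as rounding toward -∞, and "nearest" as round-to-nearest with ties to even.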

By providing the rounding evaluation circuit as described above, the only terminals added to the shifter are the rounding mode designation signal and the 1-bit data output signal to the ALU. Also, even when the rounding evaluation circuit added to the shifter supports multiple rounding modes, the increase in circuit size is small because the logical sum of the truncated bits can be shared among the modes. In this way, the overhead of rounding is reduced with an efficient circuit configuration.
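As a rough illustration of the arrangement summarized above, the following Python sketch models a shifter that returns the shifted data together with a 1-bit rounding evaluation, and an ALU that consumes that bit as its carry-in. The names `shift_right_with_flag` and `alu_add`, the 32-bit width, and the choice of the logical sum of the truncated bits as the evaluation (that is, +∞-direction rounding) are assumptions made for the example, not details of the embodiment.

```python
N = 32                      # assumed data width in bits
MASK = (1 << N) - 1


def to_signed(x):
    """Interpret an N-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << N) if x >> (N - 1) else x


def shift_right_with_flag(data, m):
    """Model of the shifter: arithmetic right shift by m bits.

    Returns (shifted N-bit pattern, r). Here r is the OR of the truncated
    bits, which corresponds to rounding toward +infinity (an assumption
    made to keep the sketch short).
    """
    data &= MASK
    truncated = data & ((1 << m) - 1)
    r = 1 if truncated else 0
    return (to_signed(data) >> m) & MASK, r


def alu_add(a, b, carry_in):
    """Model of the ALU: N-bit addition with a 1-bit carry-in at the LSB."""
    return (a + b + carry_in) & MASK


# Step 1: shift. Step 2: add the other operand and the rounding bit at once.
shifted, r = shift_right_with_flag(0b1011, 2)   # 11 / 4 = 2.75
total = alu_add(shifted, 10, r)                 # rounded-up 2.75, plus 10
print(to_signed(total))
```

The point of the sketch is that no separate step is spent on the "+1": it rides along with an addition that has to be performed anyway.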

Brief description of the drawings

FIG. 1 is a block diagram showing a circuit configuration for implementing a rounding process according to an embodiment of the present invention.

FIG. 2 is a block diagram showing a detailed circuit configuration of the shifter of FIG. 1.

FIG. 3 is a diagram showing a truth table of the operation in the rounding evaluation circuit of FIG. 2.

FIG. 4 is a diagram showing a configuration of a SIMD parallel DSP including the circuit configuration of FIG. 1.

FIG. 5 is a diagram showing a configuration in which storage means for designating the number of shift bits is added to the shifters of FIGS.

FIG. 6 is a diagram showing another configuration of the data operation execution unit of FIG.

FIG. 7 is a block diagram showing a detailed circuit configuration of the shifter of FIG.

FIG. 8 is a diagram showing a truth table of the operation in the rounding evaluation circuit of FIG.

FIG. 9 is a diagram showing a configuration in which a rounding process of the present invention is applied to a conventional data processing device.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same members are denoted by the same reference numerals, and the description thereof will not be repeated.

In FIG. 1, n represents the number of bits of the data signal and may be, for example, 32. The shifter 10 performs a shift operation on n-bit data selected from the general-purpose register file 20 or from the outside and transmitted via the shifter input line, and outputs the result to the n-bit shifter output latch 30. At the same time, it outputs the rounding evaluation result to the 1-bit latch r40.

Here, although not shown, the selection of data in the shifter input line is performed by, for example, an externally applied control signal or the like.

The shifter 10 provides arithmetic and logical shifts, to the left and to the right, by m (< n) bits. Although not shown, the operation specified by, for example, an externally applied control signal is performed.

The difference between an arithmetic shift and a logical shift is that an arithmetic shift treats the input data as a binary number in two's complement format and does not change its sign (MSB). That is, in a right shift, the vacated upper bits of the output data, as many as the shift count, are filled with the sign bit (the input MSB) in an arithmetic shift and with 0 in a logical shift.

In a left shift, the vacated lower bits of the output data, as many as the shift count, are filled with 0 in both arithmetic and logical shifts. In an arithmetic shift, however, overflow processing is performed when the upper (shift count + 1) bits of the input data are not all identical: the maximum value with the same sign as the input data is output (for 32 bits, the maximum positive value is H'7fffffff and the maximum negative value is H'80000000, where H' denotes hexadecimal).

Further, the shifter 10 has a rounding evaluation function: when 1 must be added to the output data of a right shift in order to round the bits shifted out to the right, it outputs 1 to the latch r40; otherwise it outputs 0 to the latch r40.
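The shift behavior described above can be sketched in Python as follows. This is a minimal model under assumptions: a 32-bit width, hypothetical function names `shift_right` and `shift_left`, and an `arithmetic` flag standing in for the externally applied control signal.

```python
N = 32
MASK = (1 << N) - 1
MAX_POS = 0x7FFFFFFF        # H'7fffffff
MAX_NEG = 0x80000000        # H'80000000


def shift_right(data, m, arithmetic=True):
    """Right shift an N-bit pattern by m bits (0 <= m < N)."""
    data &= MASK
    result = data >> m                        # logical: upper bits become 0
    if arithmetic and data >> (N - 1):        # negative: replicate the sign bit
        result |= MASK & ~(MASK >> m)
    return result


def shift_left(data, m, arithmetic=True):
    """Left shift an N-bit pattern by m bits (0 <= m < N)."""
    data &= MASK
    if arithmetic:
        upper = data >> (N - 1 - m)           # upper (m + 1) bits of the input
        all_zero = upper == 0
        all_one = upper == (1 << (m + 1)) - 1
        if not (all_zero or all_one):         # sign would change: saturate
            return MAX_NEG if data >> (N - 1) else MAX_POS
    return (data << m) & MASK                 # lower bits are filled with 0


print(hex(shift_right(0x80000000, 4)))                    # arithmetic
print(hex(shift_right(0x80000000, 4, arithmetic=False)))  # logical
print(hex(shift_left(0x40000000, 1)))                     # saturates
```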

The ALU 50 performs an arithmetic or logical operation on three inputs: first n-bit data or the constant "0", selected from the shifter output latch 30, the general-purpose register file 20, or the outside and transmitted via the ALU input ① line; second n-bit data or the constant "0", selected from the cumulative addition register R60 or the general-purpose register file 20 and transmitted via the ALU input ② line; and 1-bit data selected by the selector 90 from the 1-bit latch r40 or the latch CO70. It outputs the result to the n-bit ALU output latch 80 and the cumulative addition register R60.

At the same time, the carry out of the most significant bit is output to the 1-bit latch CO70. Here, although not shown, the selection of data or the constant 0 on the ALU input ① line and the ALU input ② line, and the selection of the 1-bit data, are performed by, for example, an externally applied control signal.

The ALU 50 provides addition and subtraction as arithmetic operations, as well as logical operations. Although not shown, the operation specified by, for example, an externally applied control signal is performed. The arithmetic operation is either signed addition and subtraction, in which the first and second data are binary numbers in two's complement format, or unsigned addition and subtraction, in which they are unsigned binary numbers. For both, a carry-in function that simultaneously adds 1-bit data to the least significant bit (LSB) can be selected.

The logical operations are not particularly limited, but include the logical sum, logical product, exclusive logical sum, inverted logical sum, inverted logical product, and inverted exclusive logical sum of the corresponding bits of the first data and the second data.

Although not shown, the n-bit data output from the ALU 50 is written to the cumulative addition register R60 in response to, for example, an externally applied control signal. The general-purpose register file 20 has, although not particularly limited to this, one input and two outputs for a register group of n bits × multiple words (not shown), and registers designated by, for example, an externally applied control signal are written and read.

That is, in response to the control signal, n-bit data selected from the shifter output latch 30 or the ALU output latch 80 and transmitted via the RF write line is written to the register specified by the general-purpose register address 1 (RFA1) included in the control signal.

Of the two outputs of the general-purpose register file 20, one outputs the data of the register specified by the general-purpose register address 0 (RFA0) included in the control signal and transmits it to the shifter input line and the ALU input ① line. The other outputs the data of the register specified by the general-purpose register address 1 (RFA1) and transmits it to the ALU input ② line and the outside. When the write described above is performed on the register specified by RFA1, the output data is the data before the write.

Next, the operation using the rounding process according to the circuit configuration example of FIG. 1 will be described with an example below.

First, as an example, the operation of adding three two's complement binary numbers A, B, and C with different fixed-point positions, stored in the general-purpose register file 20, and storing the result in the general-purpose register file 20 will be described.

Here, for simplicity, the data width is n = 32 bits, and the fixed-point position of each data item is written as name (number of integer bits.number of fractional bits). The three data are A (10.22), B (1.31), and C (15.17), and the addition result X has the fixed-point position X (16.16).
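The right shift counts used in the processing steps that follow come directly from these formats: each operand is shifted right by the difference between its number of fractional bits and that of the target. A minimal Python sketch, in which the helper name `alignment_shift` and the tuple representation of a format are assumptions for illustration:

```python
def alignment_shift(src, dst):
    """Right shift count that converts src=(int_bits, frac_bits) to dst."""
    src_int, src_frac = src
    dst_int, dst_frac = dst
    if src_int + src_frac != dst_int + dst_frac:
        raise ValueError("formats must have the same total width")
    return src_frac - dst_frac


operands = {"A": (10, 22), "B": (1, 31), "C": (15, 17)}
target = (16, 16)
shifts = {name: alignment_shift(fmt, target) for name, fmt in operands.items()}
print(shifts)  # {'A': 6, 'B': 15, 'C': 1}
```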

[Processing step 1]

For the general-purpose register file 20, specify the address where A (10.22) is stored by RFA0 and output A (10.22).

At the same time, select general-purpose register file 20 for the shifter input line and transmit A (10.22) to the shifter.

At the same time, the right 6-bit arithmetic shift operation is specified for the shifter 10.

With the above control, in processing step 1, the data obtained by arithmetically shifting A (10.22) right by 6 bits is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 6 truncated bits is output to the latch r40.

[Processing step 2]

For the general-purpose register file 20, specify the address where B (1.31) is stored by RFA0 and output B (1.31).

At the same time, select the general-purpose register file 20 for the shifter input line and transmit B (1.31) to the shifter 10.

At the same time, the right 15-bit arithmetic shift operation is specified for the shifter 10.

At the same time, the shifter output latch 30 is selected for the ALU input ① line, and the shift result of A in processing step 1 is transmitted as the first data of the ALU 50.

At the same time, the constant 0 is selected for the ALU input ② line, and the second data of the ALU 50 is set to 0.

At the same time, a signed addition operation with the carry-in function, taking the latch r40 as the carry-in, is specified for the ALU 50.

At the same time, write is specified for the cumulative addition register R60.

With the above control, in processing step 2, the data obtained by arithmetically shifting B (1.31) right by 15 bits is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 15 truncated bits is output to the latch r40. The ALU 50 rounds the 6 LSB-side bits truncated when A (10.22) was digit-aligned to A (16.16), and outputs the result to the cumulative addition register R60.

[Processing step 3]

Specify the address where C (15.17) is stored in general-purpose register file 20 by RFA0 and output C (15.17).

At the same time, select the general-purpose register file 20 for the shifter input line, and transmit C (15.17) to the shifter 10.

At the same time, the right 1-bit arithmetic shift operation is specified for the shifter 10.

At the same time, the shifter output latch 30 is selected for the ALU input ① line, and the shift result of B in processing step 2 is transmitted as the first data of the ALU 50.

At the same time, the cumulative addition register R60 is selected for the ALU input ② line, and the rounding result of A in processing step 2 is transmitted as the second data of the ALU 50.

At the same time, a signed addition operation with the carry-in function, taking the latch r40 as the carry-in, is specified for the ALU 50.

At the same time, write is specified for the cumulative addition register R60.

With the above control, in processing step 3, the data obtained by arithmetically shifting C (15.17) right by 1 bit is output to the shifter output latch 30, and at the same time the rounding evaluation result for the 1 truncated bit is output to the latch r40. The ALU 50 rounds the 15 LSB-side bits truncated when B (1.31) was digit-aligned to B (16.16), adds at the same time the A (16.16) that was digit-aligned in processing step 2, and outputs the result to the cumulative addition register R60.

[Processing Step 4]

The shifter output latch 30 is selected for the ALU input ① line, and the shift result of C in processing step 3 is transmitted as the first data of the ALU 50.

At the same time, the cumulative addition register R60 is selected for the ALU input ② line, and the rounded sum of A and B from processing step 3 is transmitted as the second data of the ALU 50.

With the above control, in processing step 4, the ALU 50 rounds the 1 LSB-side bit truncated when C (15.17) was digit-aligned to C (16.16), adds at the same time the sum of the digit-aligned A (16.16) and B (16.16) accumulated through processing step 3, and outputs the result to the ALU output latch 80.

[Processing step 5]

Select ALU output latch 80 for RF write line.

Specify the address for storing X (16.16) in RFA1 for general register file 20, and specify writing.

With the above control, in processing step 5, the sum of A, B, and C, each rounded and digit-aligned to (16.16), is output to the general-purpose register file 20, yielding the target X (16.16).

As described above, multiple data with different fixed-point positions can be added in a number of processing steps equal to the number of data plus a two-step pipeline overhead, and the processing steps for rounding the truncated bits generated by digit alignment are eliminated.
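Processing steps 1 to 5 can be followed in a small simulation. The Python sketch below is an illustration under assumptions rather than a model of the actual circuit: the variable names mirror the reference numerals (`latch30`, `r40`, `r60`, `latch80`) only for readability, round-to-nearest is assumed as the rounding mode, and the operand values are invented for the example.

```python
from fractions import Fraction

N = 32
MASK = (1 << N) - 1


def to_signed(x):
    x &= MASK
    return x - (1 << N) if x >> (N - 1) else x


def shifter(data, m):
    """Arithmetic right shift by m with a round-to-nearest evaluation bit."""
    value = to_signed(data)
    top = (value >> (m - 1)) & 1
    rest = 1 if value & ((1 << (m - 1)) - 1) else 0
    lsb = (value >> m) & 1
    return (value >> m) & MASK, top & (rest | lsb)


def alu_add(a, b, carry_in):
    return (a + b + carry_in) & MASK


def add_three(a, b, c):
    """Add A(10.22), B(1.31), C(15.17) into X(16.16) in five steps."""
    # Step 1: shift A right by 6.
    latch30, r40 = shifter(a, 6)
    # Step 2: the ALU rounds A (A + 0 + r) while the shifter handles B.
    r60 = alu_add(latch30, 0, r40)
    latch30, r40 = shifter(b, 15)
    # Step 3: the ALU adds rounded B to the sum while the shifter handles C.
    r60 = alu_add(latch30, r60, r40)
    latch30, r40 = shifter(c, 1)
    # Step 4: the ALU adds rounded C to the sum.
    latch80 = alu_add(latch30, r60, r40)
    # Step 5: write X(16.16) back to the register file.
    return latch80


def reference(a, b, c):
    """Round each operand to 16 fractional bits exactly, then add."""
    return (round(Fraction(to_signed(a), 1 << 6))
            + round(Fraction(to_signed(b), 1 << 15))
            + round(Fraction(to_signed(c), 1 << 1)))


a = 14680064 + 33             # slightly above 3.5 in (10.22)
b = (1 << 29) + (1 << 14) + 1  # slightly above 0.25 in (1.31)
c = -229375 & MASK            # slightly above -1.75 in (15.17)
x = add_three(a, b, c)
print(to_signed(x), reference(a, b, c))
```

Each ALU step performs one addition, and the rounding of the previously shifted operand is absorbed into that addition through the carry-in, which is why three operands need only three shifter/ALU steps plus the two steps of pipeline overhead.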

Next, a detailed circuit configuration of the shifter 10 of FIG. 1 will be described.

In FIG. 2, a parallel shifter 100 performs a shift operation on n-bit data of a shifter input line according to a control signal and outputs n-bit data.

Among the control signals, the arithmetic shift / logical shift selection signal is, although not particularly limited to this, 0 for an arithmetic shift and 1 for a logical shift of the parallel shifter 100. The shift bit count is, although not particularly limited to this, a signal (±m) encoded in two's complement format, where a positive value indicates the number of left shift bits and a negative value indicates the number of right shift bits.
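Decoding such a shift count field might look as follows in Python. The field width of 6 bits (enough for ±31 with n = 32) and the function name `decode_shift` are assumptions; the text does not state the width of the signal.

```python
def decode_shift(field, width=6):
    """Decode a two's complement shift-count field (width is an assumption).

    Returns ("left", m) for a positive or zero count and ("right", m) for a
    negative count, where m is the number of bits to shift.
    """
    field &= (1 << width) - 1
    if field >> (width - 1):                 # sign bit set: negative count
        return "right", (1 << width) - field
    return "left", field


print(decode_shift(0b111010))   # -6 encoded in 6 bits
print(decode_shift(0b000011))   # +3
```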

The rounding mode selection signal indicates the rounding mode to the rounding evaluation circuit 110. Although not particularly limited to this, it is 2 bits wide: B'00 indicates -∞-direction rounding, B'01 indicates +∞-direction rounding, B'10 indicates 0-direction rounding, and B'11 indicates round-to-nearest. (B' is a prefix denoting a binary number.)

Here, -∞-direction rounding is a rounding mode that rounds to the nearest value not greater than the true value (the value represented by the input data), where "nearest value" means the closest value satisfying the condition among the values that can be expressed at the fixed-point position of the output data.

+∞-direction rounding is a rounding mode that rounds to the nearest value not less than the true value.

0-direction rounding is a rounding mode that rounds to the nearest value whose absolute value is not greater than the absolute value of the true value.

Round-to-nearest is a rounding mode that rounds to the nearest value unconditionally. However, if the two nearest values are equidistant from the true value, it rounds to the one whose LSB is "0".

When the specified shift bit count m is a right shift (m < 0), the rounding evaluation circuit 110 evaluates, in the rounding mode specified by the rounding mode selection signal, whether 1 must be added to the shift result (the output of the parallel shifter 100) in order to round the lower |m| bits of the input data (that is, the bits to be truncated). It outputs 1 bit: 1 if the addition of 1 is required, and 0 otherwise.

FIG. 3 shows a truth table of the operation in the rounding evaluation circuit 110 of FIG.

Based on the truth table shown in FIG. 3, the 1-bit output signal R as the evaluation result is expressed by the following logical expression for each rounding mode.

① -∞-direction rounding: $R = 0$

② +∞-direction rounding: $R = (\text{logical sum of the truncated bits}) = (\text{logical sum of bit } |m|-1 \text{ to bit } 0 \text{ of the input data})$

③ 0-direction rounding: $R = (\text{input data is negative}) \cap (\text{logical sum of the truncated bits}) = (\text{MSB of the input data}) \cap (\text{logical sum of bit } |m|-1 \text{ to bit } 0 \text{ of the input data})$

④ Round-to-nearest: $R = (\text{MSB of the truncated bits}) \cap \{(\text{logical sum of the truncated bits excluding their MSB}) \cup (\text{bit that becomes the LSB of the output data})\} = (\text{bit } |m|-1 \text{ of the input data}) \cap \{(\text{logical sum of bit } |m|-2 \text{ to bit } 0 \text{ of the input data}) \cup (\text{bit } |m| \text{ of the input data})\}$

Here, in the above logical expressions, "$\cup$" denotes a logical sum and "$\cap$" denotes a logical product, and "bit i" denotes the bit at position i when the bits of the input data are numbered consecutively from 0 (the LSB) to n-1 (the MSB).
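The four logical expressions can be written out and checked against exact arithmetic. The following Python sketch is a software model under assumptions, not the circuit of FIG. 2: the function names are hypothetical, the data is treated as a two's complement pattern, and the shift count is passed as the positive magnitude |m|. The mode codes follow the 2-bit encoding described above.

```python
import math
from fractions import Fraction


def rounding_flag(data, m, mode, n=32):
    """1-bit evaluation R for an arithmetic right shift by m (1 <= m < n).

    data: n-bit pattern read as two's complement.
    mode: 0b00 toward -inf, 0b01 toward +inf, 0b10 toward 0, 0b11 nearest.
    """
    data &= (1 << n) - 1
    sign = (data >> (n - 1)) & 1                        # MSB of the input data
    top = (data >> (m - 1)) & 1                         # bit |m|-1
    low_or = 1 if data & ((1 << (m - 1)) - 1) else 0    # OR of bits |m|-2..0
    any_or = top | low_or                               # OR of bits |m|-1..0
    out_lsb = (data >> m) & 1                           # bit |m|
    if mode == 0b00:
        return 0
    if mode == 0b01:
        return any_or
    if mode == 0b10:
        return sign & any_or
    return top & (low_or | out_lsb)


def shift_and_round(value, m, mode, n=32):
    """Arithmetic right shift of a signed Python int, plus the evaluation bit."""
    return (value >> m) + rounding_flag(value, m, mode, n)


def reference(value, m, mode):
    """The same rounding computed with exact rational arithmetic."""
    exact = Fraction(value, 1 << m)
    return [math.floor, math.ceil, math.trunc, round][mode](exact)


for mode in range(4):
    print(bin(mode), shift_and_round(-11, 2, mode), reference(-11, 2, mode))
```

Note that `low_or` is computed once and reused by the +∞, 0, and nearest modes, which mirrors the sharing of the logical sum among modes.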

As described above with reference to FIGS. 2 and 3, providing the rounding function in the shifter 10 adds only the rounding mode selection signal and the 1-bit output signal of the evaluation result to the shifter block. This minimizes, for example, the wiring between blocks when the circuit is divided into blocks and mounted on a semiconductor substrate. Further, even when a plurality of rounding modes are supported, logic such as the logical sum of bit |m|-2 to bit 0 can be shared, which suppresses the increase in mounting area.

Next, a description will be given of a configuration of an SIMD type parallel DSP composed of a plurality of processors including the circuit configuration of FIG. 1 and one control unit for controlling these processors.

FIG. 4 is a diagram showing a configuration of a SIMD parallel DSP including the circuit configuration of FIG. 1, and is configured on, for example, a semiconductor substrate.

In FIG. 4, the control unit 200 includes, although not particularly limited to this, a program execution control unit 220 including a program memory 210, and a data control unit 240 including a data memory 230.

The program memory 210 and the data memory 230 can input and output data from outside, and can perform information processing according to a calculation procedure (calculation algorithm) set from outside.

The processor array 250 is composed of a plurality of identically configured processors 260, and all processors 260 are connected by an instruction path, a broadcast data bus, and a common data bus via tri-state buffers. In addition, a processor selection signal, which is shared with the control of the tri-state buffer provided for each processor 260, is input to each processor 260, and the control unit 200 always selects only one processor 260.

The processor 260 includes a processor control unit 270 and a data operation execution unit 280.

The data operation execution unit 280 can have, for example, the circuit configuration shown in FIG.

The processor control unit 270 controls the data operation execution unit 280 based on an instruction given from the control unit 200 via the instruction path.

The instruction output by the program execution control unit 220 in the control unit 200 in synchronization with each processing step is, for example, a VLIW (Very Long Instruction Word), and the operation of each unit is controlled horizontally. In other words, the instruction of each step output by the control unit 200 contains a field for each unit, so that a plurality of functional blocks can be controlled horizontally in one step.

Among the above instructions, an instruction to the processor 260 (processor instruction field) is transmitted to all the processors 260 via the instruction path. That is, the control unit 200 controls all the processors 260 with the same instruction.

The processor instruction field transmitted to the processor 260 is transmitted to the processor controller 270 in the processor 260.

The processor control unit 270 operates in accordance with the instruction (processor control unit instruction field) for the processor control unit 270 in the transmitted processor instruction field, and executes the instruction (data operation) for the data operation execution unit 280. (Execution section instruction field) to the data operation execution section 280.

At this time, if there is an address mask execution instruction in the instruction in the processor control unit instruction field, the processor control unit 270 negates all the write instructions in the data operation execution unit instruction field to the processor selection signal. Then, it is transmitted to the data operation execution unit 280. This makes it possible to select and operate one of the processors 260. For example, data specific to each processor is stored in the general-purpose register file 20 in the data operation execution unit 28 #. Can be set.

The processor control unit 270 receives a condition code signal (CC) from the data operation execution unit 280. If the instruction in the processor control unit instruction field contains a group mask execution instruction, the processor control unit 270 masks all write instructions in the data operation execution unit instruction field with the designated condition code signal before transmitting the field to the data operation execution unit 280. Although not particularly limited, the condition code signal indicates the operation state of the data operation execution unit 280. It comprises a carry (carry-out) signal output from the ALU 50 in FIG. 1 via the latch CO 70 and, although not shown in FIG. 1, a sign signal (the MSB of the data) output from the ALU 50 via the latch 80, a zero signal (asserted when the operation result is 0), and an overflow signal (the exclusive OR of the carry signal of the MSB and the carry signal of bit MSB-1).

This makes it possible to operate in accordance with the internal state of each processor 260. For example, conditional execution becomes possible: the group mask can restrict an operation to the processors 260 whose operation result is negative, and in those processors the data is subtracted from 0 to invert its sign, which yields the absolute value.
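The group-mask behavior in the absolute-value example can be illustrated with a small software model. The following Python sketch is an illustration only: the function name `simd_abs`, the representation of the processors as a list of integers, and the choice of the sign flag as the designated condition code are assumptions made here, not details taken from the specification.

```python
def simd_abs(lanes):
    """Model an absolute-value step across SIMD processors using a group mask.

    Each list element stands for the data held by one processor 260
    (hypothetical software model, not the circuit itself).
    """
    # Step 1: every processor derives its sign condition code
    # (True when the operation result is negative, i.e. MSB set).
    sign_cc = [value < 0 for value in lanes]

    # Step 2: the single instruction "0 - data" is broadcast to all
    # processors, but the register write is masked by the condition code,
    # so only processors with the sign flag set update their data.
    return [0 - value if cc else value for value, cc in zip(lanes, sign_cc)]


print(simd_abs([5, -3, 0, -128]))  # [5, 3, 0, 128]
```

Every processor receives the same instruction; only the masked write differs from lane to lane, which is what allows a data-dependent result without a per-processor branch.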

The data operation execution unit instruction field transmitted to the data operation execution unit 280 in the processor 260 is composed of a shifter instruction field for the shifter 10, an ALU instruction field for the ALU 50, and a general-purpose register file instruction field for the general-purpose register file 20. These fields control the function blocks of the circuit configuration shown in FIG. 1, so that the rounding processes exemplified in the above processing steps 1 to 5 can be realized.

Here, the rounding mode selection signal for the rounding evaluation circuit in the shifter shown in FIG. 2 is controlled by a rounding mode selection register RMR 290 provided in the processor control unit 270. The data on the broadcast data path is written into the rounding mode selection register RMR 290 when the instruction in the processor control unit instruction field contains an RMR write instruction.

At this time, although not particularly limited, since the rounding mode selection signal shown in FIGS. 2 and 3 is 2 bits wide, the rounding mode selection register RMR 290 has a 2-bit configuration, and the lower 2 bits of the n-bit (for example, 32-bit) data input from the broadcast data path are written into it.
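As a minimal sketch of this register write, assuming n = 32 and a hypothetical helper name `write_rmr`, the value latched into the 2-bit register is simply the low two bits of the broadcast word:

```python
RMR_BITS = 2  # register width stated in the text


def write_rmr(broadcast_data, n=32):
    """Return the value latched into RMR for an n-bit broadcast word.

    `write_rmr` is a hypothetical name used only for this illustration.
    """
    bus_value = broadcast_data & ((1 << n) - 1)   # value on the n-bit data path
    return bus_value & ((1 << RMR_BITS) - 1)      # only the lower 2 bits are kept


print(write_rmr(0xFFFFFFFE))  # 2
```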

Since few applications change the rounding mode at every processing step, there is no need to provide a rounding mode selection field in the instruction path; the mode can instead be set through the above-mentioned rounding mode selection register RMR 290. This has the effect of limiting the increase in instruction path width.

The operation result of a processor 260 is taken out to the outside by selecting the desired processor 260 with the processor selection signal; the tri-state buffer 300 of that processor 260 then enters the drive state, and the data is output to the control unit via the common data path.

As described above, in the semiconductor device of the SIMD-type parallel DSP shown in FIG. 4, a plurality of processors 260 operate in parallel on the unique data they hold, in response to a single instruction and data output from one control system, so that a parallel algorithm based on addition and subtraction can be processed at high speed with appropriately rounded operations.

In the embodiment described above, the rounding processing of the present invention has been focused on, but it is needless to say that various changes can be made according to the purpose of use.

For example, in each processor of the SIMD-type parallel DSP described above, the number of shift bits of the shift operation can be made data unique to each processor by adding storage means for specifying it.

FIG. 5 shows an embodiment in which storage means for designating the number of shift bits is added to the shifter according to the present invention described in FIGS.

In FIG. 5, the shift bit register SBR 400 as the storage means has the number of bits required to express the number of shift bits in the shifter shown in FIG. 2 (for example, 6 bits when n = 32). The data is set in response to a write instruction added in the data operation execution unit instruction field.

The data to be written to the shift bit register SBR 400 is, although not particularly limited, the data transmitted from the ALU 50 via the overflow processing circuit 410. The overflow processing circuit 410 takes the n-bit data output from the ALU 50 as a two's complement binary integer, saturates it to a two's complement binary integer of the number of bits constituting the shift bit register SBR 400, and outputs the result.

That is, when the shift bit register SBR 400 is 6 bits wide, the value is output unchanged if the output of the ALU 50 is in the range -31 to 31; 31 is output if the output of the ALU 50 is 32 or more (positive overflow); and -31 is output if the output of the ALU 50 is less than -32 (negative overflow).

The number of shift bits supplied to the shifter 10 is selected by the selector 420 from either the instruction in the data operation execution unit instruction field or the shift bit register SBR 400. Although the control of the selector 420 is not particularly limited, for example, when the number of shift bits specified by the instruction in the data operation execution unit instruction field is the negative maximum value (-32 when the number of shift bits is 6 bits), the SBR output is selected; otherwise, the number of shift bits specified by the instruction in the data operation execution unit instruction field is selected. The switching is performed according to whether the output of the comparator circuit 30 is "0" or "1".
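The overflow processing and the selector control can be sketched in software as follows. This is a hedged illustration that assumes a 6-bit shift bit register, saturation limits of -31 and 31, and -32 as the instruction value that selects the SBR output; the function names `overflow_process` and `select_shift_count` are hypothetical, and the handling of an ALU output of exactly -32 (clamped to -31 here) is an assumption of this sketch.

```python
SBR_BITS = 6
USE_SBR = -(1 << (SBR_BITS - 1))       # -32: instruction value that selects SBR
SAT_MAX = (1 << (SBR_BITS - 1)) - 1    # 31: output on positive overflow
SAT_MIN = -SAT_MAX                     # -31: output on negative overflow (assumed)


def overflow_process(alu_out):
    """Saturate a signed ALU output to the range held by the SBR."""
    return max(SAT_MIN, min(SAT_MAX, alu_out))


def select_shift_count(instr_shift, sbr_value):
    """Model the selector: the reserved value -32 picks the SBR output."""
    return sbr_value if instr_shift == USE_SBR else instr_shift


sbr = overflow_process(40)             # positive overflow saturates to 31
print(sbr)                             # 31
print(select_shift_count(-32, sbr))    # 31 (taken from SBR)
print(select_shift_count(3, sbr))      # 3  (taken from the instruction)
```

Reserving one encoding of the instruction's shift field as the "use SBR" marker avoids a separate select bit in the instruction path, at the cost of one unusable immediate shift count.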

In the embodiment described above, in order to process parallel algorithms such as image processing and neural networks at high speed, it suffices to provide a multiplier for performing high-speed product-sum operations and a local memory for distributing more processor-specific data to the processors.

FIG. 6 shows a circuit configuration in which a multiplier and a local memory LM are added to the circuit configuration shown in FIG. 1 as another circuit configuration example of the data operation execution unit in the processor in FIG.

In FIG. 6, the multiplier 500 multiplies first n-bit data, selectively transmitted via one multiplier input line from the shifter output latch 30, the general-purpose register file 20, or the outside, by second n-bit data, transmitted via the other multiplier input line from the general-purpose register file 20 or the output latch 520 of the local memory LM510, and transmits the 2n-bit product (64 bits when n is 32) to the multiplier output latch 530. The selection of data on the two multiplier input lines is not shown, but is made by, for example, a control signal given from the outside.

The output of the multiplier output latch 530 is transmitted to the shifter input line.

The local memory LM510 is composed of an n-bit × multiple-word RAM (Random Access Memory). Although not shown, in response to, for example, an externally applied control signal, n-bit data selectively transmitted from the shifter output latch 30, the ALU output latch 80, or the outside is written to the address supplied by the local memory pointer register LMPR 550 in the LM address control unit 540. Alternatively, the data stored at that address is read out to the local memory output latch 520 at the subsequent stage.

The output of the local memory output latch 520 is transmitted to the multiplier input line, the shifter input line and the ALU input line.

The LM address control unit 540 includes a local memory pointer register LMPR 550 that outputs the write/read address of the local memory LM510. The register can be incremented, decremented, or written with the data transmitted from the ALU output latch 80.

The output of the local memory pointer register LMPR550 is transmitted to the RF write line in addition to the address to the local memory LM510, and can be saved to the general-purpose register file 20 and operated by the ALU 50.
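The interaction between the local memory and its pointer register can be modeled roughly as below. This is a sketch under stated assumptions: the class name `LocalMemoryModel`, its method names, and the wrap-around of the pointer at the memory size are all choices made for this illustration and are not described in the specification.

```python
class LocalMemoryModel:
    """Software model of local memory LM with pointer register LMPR."""

    def __init__(self, words, n=32):
        self.mask = (1 << n) - 1      # n-bit word width
        self.cells = [0] * words      # n-bit x multiple-word RAM
        self.lmpr = 0                 # local memory pointer register
        self.output_latch = 0         # local memory output latch

    def write(self, data):
        """Write n-bit data to the address held in LMPR."""
        self.cells[self.lmpr] = data & self.mask

    def read(self):
        """Read the word at the LMPR address into the output latch."""
        self.output_latch = self.cells[self.lmpr]
        return self.output_latch

    def increment(self):
        self.lmpr = (self.lmpr + 1) % len(self.cells)   # wrap-around assumed

    def decrement(self):
        self.lmpr = (self.lmpr - 1) % len(self.cells)   # wrap-around assumed

    def load_pointer(self, alu_out):
        """Write LMPR with data coming from the ALU output latch."""
        self.lmpr = alu_out % len(self.cells)


lm = LocalMemoryModel(words=16)
lm.write(0x1234)
lm.increment()
lm.write(0x5678)
lm.decrement()
print(hex(lm.read()))  # 0x1234
```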

With the addition of the multiplier 500, the local memory LM510, and the LM address control unit 540, the input data width of the shifter 600, among the functional blocks described in FIG. 1, has been increased to match the output data width of the multiplier 500.

The shifter input line is 2n bits wide. When the data selected by the above control signal is n-bit data from the general-purpose register file 20, the outside, or the local memory output latch 520, it is placed on the n bits on the MSB side of the shifter input line, and the n bits on the LSB side of the shifter input line become 0 at this time.

FIG. 7 is a block diagram showing a detailed circuit configuration of the shifter of FIG. 6. In FIG. 7, the parallel shifter 610 performs the same shift operation as the parallel shifter 100 shown in FIG. 2, taking the MSB of the 2n-bit input data as the reference, and outputs n bits. The difference from the parallel shifter 100 described with reference to FIG. 2 is that, when no overflow occurs in a left shift, the LSB-side output bits equal in number to the shift count are 0 in FIG. 2, whereas in the parallel shifter 610 of FIG. 7 they are the upper bits of the lower n bits of the input data.

FIG. 8 shows a truth table of the operation in the rounding evaluation circuit shown in FIG. Since the above-described parallel shifter 610 of FIG. 7 shifts the 2n-bit input data with reference to the MSB and outputs n bits, truncated bits arise even in a left shift (including a 0-bit shift). Accordingly, the rounding evaluation circuit 620 in FIG. 7 differs from the rounding evaluation circuit 110 in FIG. 2 in that it evaluates the truncated bits of the 2n-bit input data regardless of the shift direction of the shift operation (that is, always).
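The following Python sketch illustrates an MSB-referenced shift of a 2n-bit input whose truncated bits are always evaluated. It is an illustration under several assumptions: the function name `shift_msb_referenced` is hypothetical, a positive `shift` is taken to mean a right shift and a negative one a left shift, left-shift overflow is ignored, and the rounding evaluation used is the logical sum (OR) of the truncated bits, which is only one of the evaluations the document mentions. The truth table of FIG. 8 is not reproduced here.

```python
def shift_msb_referenced(x, n, shift):
    """Shift a signed 2n-bit value and keep the n bits on the MSB side.

    Returns (result, carry_in): the shifted n-bit value and the 1-bit
    rounding evaluation that an ALU would add at its least significant bit.
    Assumed convention: shift > 0 is a right shift, shift < 0 a left shift.
    """
    drop = n + shift                      # low-order bits that fall off the output
    if not 0 <= drop <= 2 * n:
        raise ValueError("shift count out of range for this sketch")
    result = x >> drop                    # arithmetic shift, rounds toward -infinity
    truncated = x & ((1 << drop) - 1)     # the bits that were discarded
    carry_in = 1 if truncated else 0      # logical sum (OR) of the truncated bits
    return result, carry_in


# n = 8: the 16-bit input 0x0180 represents 1.5 in units of 2**8.
result, carry = shift_msb_referenced(0x0180, 8, 0)
print(result, carry, result + carry)      # 1 1 2
```

Even with a shift count of 0, the lower n bits of the input are discarded, so the evaluation bit can be 1; adding it through the carry-in turns the truncating shift into a round-up in this particular mode.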

As described above, by adding a relatively small-scale circuit to an information processing device that can process parallel algorithms such as image processing or neural networks at high speed with an SIMD-type parallel DSP composed of one control system and multiple processors, the rounding function can be incorporated without increasing the number of processing steps. This achieves the object of the present invention of efficiently improving the numerical calculation accuracy of an information processing device using a fixed-point arithmetic unit.

FIG. 9 shows a configuration in which the rounding process of the present invention is applied to a conventional data processing device.

In FIG. 9, the processor constituting the data processing device includes an arithmetic unit 11, a local memory unit 12, and a tri-state buffer 13. The arithmetic unit 11 includes a processor control circuit 1101 including the rounding mode selection register RMR 290 of the present embodiment, latch circuits 1102, 1105, 1107, 1112, and 1114, a general-purpose register file 1103, a multiplier 1104, a condition code register (CCR) 1106 including the CO register 70 of the present embodiment, an accumulation register 1108, the ALU (arithmetic logic unit) 50 and the selector 90 of the present embodiment, an SBR (shift bit register) 110, the shifter 10 and the latch r40 of the present embodiment, and an absolute difference calculator 113.

The local memory unit 12 includes local memories (LM0, LM1) 1202 and 1204, an address operation circuit for generating an address signal for the local memory 1202, an address operation circuit 1203 for generating an address signal for the local memory 1204, and selectors 1205, 1206, 1207, 1208, and 1209.

The operation of the data processing device shown in FIG. 9 is as follows. The shifter 10, the latch r40, the ALU (arithmetic logic unit) 50, and the selector 90 perform the operations described in the present embodiment. The other operations are the same as those of the data processing device described in Japanese Patent Application No. 2003_23076.

Although the invention made by the inventor has been specifically described based on the embodiment, it goes without saying that the invention is not limited to the embodiment and can be variously modified without departing from the gist of the invention.

The effects obtained by typical aspects of the invention disclosed in the present application are briefly described as follows.

Adding the rounding evaluation function to the shifter block requires only the rounding mode selection signal and a 1-bit output signal carrying the evaluation result, so the overhead of the rounding process can be reduced with an efficient circuit configuration. As a result, the numerical calculation accuracy of an information processing device using a fixed-point arithmetic unit is efficiently improved.

Industrial applicability

The present invention can be applied to a single processor, that is, a general DSP, and can be applied to a semiconductor device equipped with a DSP such as an SIMD type parallel DSP.

Further, it can be applied to a circuit configuration for processing floating-point operations. Furthermore, the present invention can also be applied to the execution unit of a CPU, typified by the SH microcomputer.

In addition, the present invention can be applied to all digital signal processing devices that require rounding.

Claims

The scope of the claims

1. A data processing device comprising: a processor including a shifter having a right bit shift operation function and an ALU having a carry-in function of adding 1-bit data to the least significant bit in an addition of two input data; and

a control unit for controlling said processor with a single instruction, wherein

the shifter has rounding evaluation means which, while the shifter outputs the data of the shift operation result, performs rounding evaluation on the bits rounded down in a right bit shift and outputs 1-bit data indicating whether addition of "1" to the least significant bit of the shift operation result is necessary, and

the ALU takes the output data of the shifter as one of the two input data and uses the 1-bit data output by the rounding evaluation means of the shifter as the data for the carry-in function.

2. The data processing device according to claim 1, wherein the right shift operation in the shifter and the addition of the carry-in function in the ALU are operations corresponding to two's complement data.

3. The data processing device according to claim 1, further comprising rounding mode selection means, wherein the rounding evaluation means performs rounding evaluation corresponding to each of a plurality of rounding modes selectable by the rounding mode selection means.

4. The data processing device according to any one of claims 1 to 3, wherein the processor comprises a plurality of processors, and the rounding mode selection means selects a rounding mode by data storage means provided individually for the plurality of processors.

5. The data processing device according to claim 1, wherein the rounding evaluation is a logical sum of bits rounded down by the right bit shift.

6. The data processing device according to any one of claims 2 to 4, wherein the rounding evaluation is a logical product of a logical sum of the bits rounded down by the right bit shift and the sign of the shift operation data.

7. The data processing device according to claim 1, wherein the rounding evaluation is a logical product of (a) a logical sum of the bits rounded down by the right bit shift, excluding the most significant bit thereof, and the least significant bit of the shift operation result, and (b) the most significant bit of the bits rounded down by the right bit shift.

8. The data processing device according to claim 1, wherein the data processing device is formed on one semiconductor substrate.