KR100434391B1

KR100434391B1 - The architecture and the method to process image data in real-time for DSP and Microprocessor

Info

Publication number: KR100434391B1
Application number: KR10-2001-0043711A
Authority: KR
Inventors: 선우명훈; 전영섭
Original assignee: 학교법인대우학원
Priority date: 2001-07-20
Filing date: 2001-07-20
Publication date: 2004-06-04
Also published as: KR20030011978A

Abstract

The present invention relates to a computation method, and a circuit implementing it, that efficiently performs vector operations such as the discrete cosine transform (DCT) of image data and computes the sum of absolute differences (SAD) values used in motion estimation algorithms, for image signal processing on a digital signal processor (DSP) or microprocessor.

Description

The architecture and the method to process image data in real-time for DSP and Microprocessor

The present invention relates to a computation method, and a circuit implementing it, that efficiently performs vector operations such as the discrete cosine transform (DCT) of image data and computes the sum of absolute differences (SAD) values used in motion estimation algorithms, for image signal processing on a digital signal processor (DSP) or microprocessor.

With the recent demand for multimedia data processing and communication, compression algorithms have come to play a major role in handling the vast amount of image data involved. Image compression requires a very complex multi-stage process and is the most computation-intensive part of real-time image processing. Because existing DSP chips could not perform image compression in real time, new architectures became essential, and special-purpose multimedia DSP chips were developed.

The DCT is the image transform most widely used in image compression; in JPEG (Joint Photographic Experts Group) and MPEG (Moving Picture Experts Group) it is applied before Huffman coding. The DCT of a two-dimensional M × N image is given by Equation 1.

[Equation 1 and its symbol definitions are not reproduced in this text.]
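For orientation, the two-dimensional DCT of an M × N image is commonly written in the orthonormal textbook form below. This is an assumed notation, not the patent's own Equation 1: the symbols f, F, C, x, y, u, and v are chosen here for illustration, and the patent's normalization may differ.

```latex
F(u,v) = \frac{2}{\sqrt{MN}}\, C(u)\, C(v)
         \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\,
         \cos\frac{(2x+1)u\pi}{2M}\,
         \cos\frac{(2y+1)v\pi}{2N},
\qquad
C(k) =
\begin{cases}
  1/\sqrt{2}, & k = 0 \\
  1,          & k \neq 0
\end{cases}
```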

As Equation 1 shows, the DCT requires 2 × M × N multiplications and (M-1)(N-1) additions when processed one-dimensionally. Because a multiplier is large in hardware and slow, row-column decomposition and Chen's algorithm are widely used to reduce the amount of computation. Row-column decomposition obtains the two-dimensional DCT by performing a one-dimensional DCT on each row and then performing a one-dimensional DCT on each column of that result. Chen's algorithm uses the periodicity of the cosine to eliminate redundant calculations, so that the DCT matrix equation, Equation 2, can be expressed as Equations 3 and 4. The coefficient matrix of Equation 2 is symmetric about its center column, with the form of the symmetry depending on whether the row is odd or even; it can therefore be split into Equations 3 and 4, which require fewer operations than Equation 2. Equations 5 and 6 give the corresponding IDCT results.

[Equations 2 through 6 and their symbol definitions are not reproduced in this text.]
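Row-column decomposition can be checked numerically. The following is a minimal Python sketch, assuming the orthonormal DCT-II normalization (the patent's own scaling is not given in this text); the function names `dct_1d`, `dct_2d_direct`, and `dct_2d_row_column` are hypothetical. It computes the 2-D DCT once from the double sum and once as row transforms followed by column transforms.

```python
import math


def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence (assumed normalization)."""
    n = len(v)
    out = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        acc = sum(v[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                  for x in range(n))
        out.append(scale * acc)
    return out


def dct_2d_direct(block):
    """2-D DCT evaluated directly from the double sum."""
    m, n = len(block), len(block[0])
    result = [[0.0] * n for _ in range(m)]
    for u in range(m):
        cu = math.sqrt(1.0 / m) if u == 0 else math.sqrt(2.0 / m)
        for v in range(n):
            cv = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
            acc = 0.0
            for x in range(m):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * m))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            result[u][v] = cu * cv * acc
    return result


def dct_2d_row_column(block):
    """2-D DCT as a 1-D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]            # transform along each row
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # transform along each column
    return [list(row) for row in zip(*cols)]          # transpose back to [u][v]
```

Both functions return the same coefficients up to floating-point rounding, but the decomposed version replaces one M × N double sum per coefficient with short one-dimensional sums, which is where the saving in multiplications comes from.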

The most computation-intensive step of MPEG encoding is finding the motion vectors, which is why encoding requires more computation than decoding. Motion vectors can be found in several ways and no method is standardized, but the most common is the block matching algorithm (BMA): as in Equation 7, the macroblock of the picture currently being encoded is moved across the reference picture while the SAD value is computed at each position, and the motion vector (MV) is the displacement that yields the minimum SAD value, as in Equation 8. This step, too, is difficult to perform in real time on a conventional DSP chip.
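The block matching procedure can be illustrated with a short Python sketch. It assumes an exhaustive (full) search over a square window, which is one common choice rather than something the patent prescribes; the names `sad`, `full_search`, and all parameters are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size, search):
    """Try every displacement within +/-search pixels and return
    (minimum SAD, dx, dy). Candidates that leave the picture are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + size <= height and
                      0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Each candidate costs size × size subtract, absolute-value, and accumulate steps, and the search repeats this for every displacement, which is why a dedicated SAD instruction pays off.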

The conventional fixed-point DSP chips DSP56100, DSP1610, ADSP2100, and TMS320C6x are general-purpose DSP processors with no special arithmetic structure for vector operations or for the SAD operation of the BMA. They therefore carry out the large number of operations by repeatedly processing data with their built-in multiplier, adder, and shifter and storing the results in registers. New high-performance multimedia DSP chips were consequently developed to process, in real time and at high speed, video and multimedia data, which is large in volume and demands heavy computation.

The VIS (Visual Instruction Set) multimedia extension of Sun's UltraSPARC includes the Pdist instruction, which computes the absolute values of the differences between two 64-bit operands in 8-bit units to support motion estimation algorithms, and the Philips TriMedia has the ume8ii instruction, which performs the same kind of operation. The arithmetic structure that executes these instructions consists of eight 8-bit subtractors, adders, and units that take the absolute value of negative results after the subtraction. However, because the registers that hold the input data and the result are 64 bits wide, they are inefficient for processing (32-bit) multimedia data. The MAX-2 instructions of Hewlett-Packard's PA-RISC also support computing the absolute value of the difference between two data items for motion estimation. However, because each operation is performed separately on each input data item, processing many data items requires a very large number of operations.

PMADDWD, an instruction in the MMX instruction set of Intel's Pentium processor, uses a structure that is more efficient for vector operations than conventional DSP processors (two 16 × 16 multipliers, a 32-bit adder, and three 64-bit registers for data storage) and is designed to reduce the cycle count and operate at high speed. Its drawback is that the two results must still be added together, which increases the number of cycles the operation requires.

Conventional multimedia DSP processors also pack data by byte or word so that one register holds several data items, improving register utilization. Figure 1 shows a vector operation using the MMX instruction PMADDWD of Intel's Pentium processor. The packing instruction PACKSS stores four data items in 64-bit register 0 (S[63:48], T[47:32], U[31:16], V[15:0]). The data they are to be combined with is likewise packed word by word into register 1 (W[63:48], X[47:32], Y[31:16], Z[15:0]). The words of register 0 and register 1 are fed pairwise to the inputs of 16 × 16 multipliers and multiplied. The four products are fed two at a time to 32-bit adders, and the two sums are stored in the upper and lower 32-bit halves of a 64-bit register, which completes the instruction. Completing the vector operation requires a further 32-bit addition: the two results held in the 64-bit register are fed to a 32-bit adder, and the sum is written back to a register. The prior art therefore needs two additional instructions to process a vector operation.
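The extra step can be seen in a small software model. The sketch below is a Python emulation, not Intel's implementation: `pmaddwd` follows the documented behavior of the instruction (four signed 16-bit multiplies, adjacent products added into two 32-bit lanes), while `to_signed`, `pack_words`, and `finish_dot_product` are hypothetical helpers introduced here.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack_words(words):
    """Pack four signed 16-bit words into a 64-bit value, first word highest."""
    reg = 0
    for w in words:
        reg = (reg << 16) | (w & 0xFFFF)
    return reg


def pmaddwd(reg0, reg1):
    """Multiply the four signed word pairs and add adjacent products.
    Returns a 64-bit value holding two 32-bit sums (upper and lower lanes)."""
    products = []
    for shift in (48, 32, 16, 0):
        a = to_signed(reg0 >> shift, 16)
        b = to_signed(reg1 >> shift, 16)
        products.append(a * b)
    upper = (products[0] + products[1]) & 0xFFFFFFFF
    lower = (products[2] + products[3]) & 0xFFFFFFFF
    return (upper << 32) | lower


def finish_dot_product(packed):
    """The additional work after PMADDWD: separate the upper lane
    (a shift) and add it to the lower lane to get the final sum."""
    upper = to_signed(packed >> 32, 32)
    lower = to_signed(packed, 32)
    return to_signed(upper + lower, 32)
```

For example, `finish_dot_product(pmaddwd(pack_words([1, -2, 3, 4]), pack_words([5, 6, -7, 8])))` combines the lane sums -7 and 11 into the dot product 4; the shift and the final add are the two additional instructions the text refers to.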

Similarly, when computing the SAD value for motion estimation, the Pdist instruction of Sun's UltraSPARC, which supports conventional multimedia operations, and the ume8ii instruction of the Philips TriMedia32 multimedia DSP processor operate as shown in Figure 2. As above, eight data items are packed byte by byte for register efficiency and stored in 64-bit register 0 as a1[63:56], a2[55:48], a3[47:40], a4[39:32], a5[31:24], a6[23:16], a7[15:8], and a8[7:0]. These are fed, byte by byte, together with the data packed in register 1 (b1, b2, b3, b4, b5, b6, b7, b8), to 8 × 8 subtractors. Negative differences are converted to their absolute values (abs), the eight resulting non-negative values are fed to 8 × 8 adders and added, and the four partial sums are then summed and stored in a register, completing the SAD operation. Because the prior art stores data in 64-bit registers, the register is wider than the output data requires, which lowers space utilization and enlarges the hardware.

Thus Sun's UltraSPARC, a processor that supports conventional multimedia operations, and the Philips TriMedia multimedia DSP processor must, as shown in Figure 2, use very large 64-bit registers to hold input and result data before and after processing large amounts of data for motion estimation; as a result, register utilization falls and the hardware grows.

Likewise, Intel's MMX instructions, which support vector operations, must, as shown in Figure 1, read back the two results packed in one register after the operation and add them to obtain the final sum, so two more instruction cycles are consumed to finish the vector operation.

To solve these problems, an object of the present invention is to provide a computation method that efficiently performs vector operations on data and computes SAD values for motion estimation in image signal processing on a DSP processor, together with a circuit for executing that method, so that digital filtering and DCT computation are handled efficiently, the number of computation cycles and the hardware burden are reduced, and data can be processed in real time.

Another object is to provide an arithmetic circuit with a packing network for storing processed data, which reduces how often registers are used and so raises their utilization.

To achieve these objects, the present invention is characterized by an arithmetic circuit comprising: a first register that stores four input data items byte by byte; a second register that stores another four input data items byte by byte; four 8 × 8 multipliers that simultaneously multiply the data stored in the registers byte by byte; two 16-bit adders that add the four products two at a time; one 32-bit adder that adds the results of the two adders; and a 40-bit third register, sized to allow for overflow of the adder, that stores the result of the 32-bit adder; and by a computation method for image data processing that uses this circuit.

The present invention is further characterized by another arithmetic circuit comprising: a 32-bit first register that stores four input data items byte by byte; a 32-bit second register that stores another four input data items byte by byte; four subtractors that subtract the byte pairs output from the two registers; four abs units that take the two's complement of the subtractor results; first and second adders that add the outputs of the four abs units two at a time; a third adder that adds the results of the first and second adders; and a 32-bit third register that stores the 16-bit result of the third adder; and by a computation method for image data processing that uses this circuit.

Figure 1 shows the operation flow of the PMADDWD instruction of the conventional Intel Pentium MMX processor.

Figure 2 shows the operation flow of the Pdist instruction of the conventional Sun UltraSPARC.

Figure 3a shows an arithmetic circuit for executing the PMADB instruction according to the present invention.

Figure 3b shows an arithmetic circuit for executing the PMADB16 instruction according to the present invention.

Figure 4a shows an arithmetic circuit for executing the PSADB instruction according to the present invention.

Figure 4b shows an arithmetic circuit for executing the PSADB16 instruction according to the present invention.

Figure 5 shows the abs unit used in the present invention.

Figure 6a shows Pack16 mode according to the present invention.

Figure 6b shows Pack mode according to the present invention.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

Figures 3a and 3b show arithmetic circuits for executing vector operations according to the present invention; the operations they execute are called the PMADB instruction and the PMADB16 instruction. Figure 3a shows the circuit that executes the PMADB instruction. It consists of: a first register that stores four input data items byte by byte; a second register that stores another four input data items byte by byte; four 8 × 8 multipliers that simultaneously multiply the data stored in the registers byte by byte; a first pipeline stage that shortens the delay path of the multiplier outputs; two 16-bit adders that add the four products two at a time; a second pipeline stage that shortens the delay path of the adder outputs; one 32-bit adder that adds the results of the two adders; and a 40-bit third register, sized to allow for overflow of the adder, that stores the result of the 32-bit adder. The circuit that executes the PMADB16 instruction, shown in Figure 3b, is the PMADB circuit with the components of Figure 6, described below, added.

In the computation method using the circuit that executes the PMADB instruction, the four data items stored byte by byte in the first register (A[31:24], B[23:16], C[15:8], D[7:0]) and the data stored in the second register (E[31:24], F[23:16], G[15:8], H[7:0]) are fed byte by byte to the four 8 × 8 multipliers, which compute A[31:24] × E[31:24], B[23:16] × F[23:16], C[15:8] × G[15:8], and D[7:0] × H[7:0]. The four 16-bit products are fed two at a time to the two 16-bit adders and added; the two resulting 16-bit values are then fed to the 32-bit adder, and their sum is stored in the 40-bit third register, completing the vector operation.
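The PMADB data path can be modeled in a few lines of Python. This is a sketch under stated assumptions: the bytes are treated as unsigned (the text does not say whether they are signed), the intermediate sums are kept at full precision rather than truncated to the stated adder widths, and the names `unpack_bytes` and `pmadb` are hypothetical.

```python
def unpack_bytes(reg):
    """Split a 32-bit register into its four bytes, [31:24] first."""
    return [(reg >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pmadb(reg1, reg2):
    """Model of the PMADB data path: four byte products, two pairwise
    sums, one final sum, stored in a 40-bit result register.
    Assumption: bytes are unsigned and no intermediate value is truncated."""
    a = unpack_bytes(reg1)            # A, B, C, D
    e = unpack_bytes(reg2)            # E, F, G, H
    products = [x * y for x, y in zip(a, e)]   # four 8 x 8 multipliers
    sum0 = products[0] + products[1]           # first 16-bit adder
    sum1 = products[2] + products[3]           # second 16-bit adder
    total = sum0 + sum1                        # 32-bit adder
    return total & ((1 << 40) - 1)             # 40-bit third register
```

With `reg1 = 0x01020304` and `reg2 = 0x05060708`, the result is 1·5 + 2·6 + 3·7 + 4·8 = 70, obtained in a single operation instead of a multiply-add followed by a separate lane addition.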

When only 16 bits of the result are significant, the PMADB16 instruction, which uses the Pack network as shown in Figure 3b, stores the sum in part of the register that already holds the previous result. Storing data this way uses fewer registers and frees the remaining registers for other purposes, raising the utilization of the register file and memory. These operations are pipelined so that four data items are processed per cycle, and the pipelined structure allows the arithmetic structure of the present invention to process data at high speed.

Figures 4a and 4b show arithmetic circuits for executing the SAD operation; the operations they execute are called the PSADB (Sum of Absolute Difference Byte) instruction and the PSADB16 (SAD with packing) instruction, respectively. Figure 4a shows the circuit that executes the PSADB instruction. It consists of: a 32-bit first register that stores four input data items byte by byte; a 32-bit second register that stores another four input data items byte by byte; four subtractors that subtract the byte pairs output from the two registers; a first pipeline stage that shortens the delay path of the subtractor outputs; four abs units that take the two's complement of the values passed through the first pipeline stage; a second pipeline stage that shortens the delay path of the abs unit outputs; first and second adders that add, two at a time, the abs unit outputs passed through the second pipeline stage; a third pipeline stage that shortens the delay path of the outputs of the two adders; a third adder that adds the results of the first and second adders passed through the third pipeline stage; and a 32-bit third register that stores the 16-bit result of the third adder. The circuit that executes the PSADB16 instruction, shown in Figure 4b, is the circuit of Figure 4a with the components of Figure 6 added.

In the computation method using the PSADB circuit, to obtain the sum of absolute differences, the four data items packed in the first register and the four packed in the second register are fed byte by byte to four 8-bit subtractors, which compute A[31:24]-E[31:24], B[23:16]-F[23:16], C[15:8]-G[15:8], and D[7:0]-H[7:0]. Each of the four differences is then fed to one of the four abs units, where its sign bit, the most significant bit ([31]), determines whether the two's complement is taken, so that a negative value (sign bit '1') is converted to a positive value (sign bit '0').

As shown in Figure 5, the abs unit consists of one 32-bit NOT gate, one 2-to-1 multiplexer, one 32-bit adder, and one 3-state buffer. The NOT gate toggles every bit of the data ('0'→'1', '1'→'0'), and applying a '1' to the carry input of the following adder completes the two's complement. The sign bit of the data controls the operation: it serves as the select signal of the multiplexer, which decides whether the one's complement is used, and as the enable signal of the 3-state buffer that drives the '1' onto the adder's carry input.
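The behavior of the abs unit can be modeled bit for bit. The following Python sketch mirrors the four components named in the text; the function name `abs_unit` and the `bits` parameter are hypothetical, and the 3-state buffer is modeled simply as a carry-in that is 1 when enabled and 0 otherwise.

```python
def abs_unit(value, bits=32):
    """Absolute value by conditional two's complement.
    value is a two's complement pattern that is `bits` wide."""
    mask = (1 << bits) - 1
    value &= mask
    sign = (value >> (bits - 1)) & 1          # most significant bit
    inverted = ~value & mask                  # NOT gate: toggle every bit
    selected = inverted if sign else value    # 2-to-1 multiplexer, select = sign
    carry_in = 1 if sign else 0               # 3-state buffer, enable = sign
    return (selected + carry_in) & mask       # adder completes the two's complement
```

A positive input passes through unchanged because both the multiplexer and the carry-in are controlled by the same sign bit. As with any two's complement negation, the most negative pattern (only the sign bit set) maps to itself.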

The four outputs of the abs units are fed two at a time to the 8-bit first and second adders, and the two sums are finally added by the 16-bit adder and stored in the lower 16 bits ([15:0]) of the 32-bit register.
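Putting the subtract, abs, and add stages together gives the following Python model of PSADB. It is a sketch under stated assumptions: the bytes are treated as unsigned pixel values, each difference is held in a 9-bit two's complement field so that its sign is not lost (the text describes a 32-bit abs unit), and the name `psadb` is hypothetical.

```python
def psadb(reg1, reg2):
    """Model of the PSADB data path for two packed 32-bit registers.
    Assumption: unsigned bytes, 9-bit two's complement differences."""
    a = [(reg1 >> shift) & 0xFF for shift in (24, 16, 8, 0)]   # A, B, C, D
    e = [(reg2 >> shift) & 0xFF for shift in (24, 16, 8, 0)]   # E, F, G, H

    # Four subtractors: keep each difference as a 9-bit pattern.
    diffs = [(x - y) & 0x1FF for x, y in zip(a, e)]

    # Four abs units: invert and add 1 only when the sign bit (bit 8) is set.
    mags = [((~d & 0x1FF) + 1) & 0x1FF if d >> 8 else d for d in diffs]

    sum0 = mags[0] + mags[1]          # first adder
    sum1 = mags[2] + mags[3]          # second adder
    total = (sum0 + sum1) & 0xFFFF    # third adder, 16-bit result

    return total                      # occupies bits [15:0] of the 32-bit register
```

For `reg1 = 0x10FF0005` and `reg2 = 0x2000FF05` the byte differences are 16, 255, 255, and 0, so the result is 526. The largest possible result, 4 × 255 = 1020, fits comfortably in 16 bits, which is what leaves the upper half of the register free for packing.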

As with PMADB16, depending on the result of the operation, the PSADB16 instruction stores the value in part of a register, so that fewer registers are needed for storage and the remaining registers can be used for other purposes, raising the utilization of the register file and memory.

Figures 6a and 6b show the circuit that implements the Pack network used to raise register utilization according to the present invention; it uses one 16-bit multiplexer and operates according to the pack mode.

The inputs to the data storage register of Figure 6 are packed data [15:0] and unpacked data [31:0]. The Pack network has three modes: No Pack mode, in which the Pack network performs no operation; Pack16 mode, in which the previous result is shifted left by 16 bits and the current result is stored in the lower 16 bits; and Pack mode, in which a 32-bit result is reduced to 16 bits before being stored. No Pack mode is used for ordinary instructions, for which the Pack network does nothing. Pack16 mode is used when only 16 bits of the result are significant: the previous 16-bit result is shifted into the upper 16 bits and the current result is stored in the lower 16 bits. Pack mode is used when the intermediate value of a 16-bit operation has grown to 32 bits and must be packed back into 16 bits so that it can be combined with 16-bit data again.

Figures 6a and 6b show the operation of Pack16 mode and Pack mode: a multiplexer, driven by a select signal that depends on the pack mode, controls the data output of the data storage register. Because the form of each instruction's result is predictable and the part needed by the next operation is known, the pack mode can be chosen according to the form of the data. That is, the processor decodes the instruction and generates the select signal from its characteristics, which determines the pack mode. As described above, this packing network at the output stage stores two results in one data storage register, halving the number of registers needed for storage and freeing unused registers for other purposes, which raises utilization.
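The three pack modes can be summarized as a small selection function. The Python sketch below is an illustrative model only: the mode names as strings and the function name `pack_network` are hypothetical, and for Pack mode it assumes the lower 16 bits are the ones kept, since the text does not say which 16 bits of the 32-bit result survive.

```python
def pack_network(mode, previous, result):
    """Return the 32-bit value written to the result register.

    previous: current contents of the result register
    result:   value produced by the arithmetic unit
    """
    if mode == "no_pack":
        # Ordinary instructions: the result is stored unchanged.
        return result & 0xFFFFFFFF
    if mode == "pack16":
        # Shift the earlier result into the upper half and place the
        # new 16-bit result in the lower half.
        return ((previous << 16) | (result & 0xFFFF)) & 0xFFFFFFFF
    if mode == "pack":
        # Reduce a 32-bit result to 16 bits.
        # Assumption: the lower 16 bits are kept.
        return result & 0xFFFF
    raise ValueError("unknown pack mode: " + mode)
```

Calling `pack_network("pack16", 0x0000ABCD, 0x00001234)` yields `0xABCD1234`: two consecutive 16-bit results, for example two SAD values, share one 32-bit register.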

The multipliers used in the present invention are Booth multipliers, which halve the number of partial products for high-speed processing and so reduce both hardware size and execution time. A pipeline structure between stages also reduces processing delay so that operations run at high speed.
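Halving the number of partial products is the property of radix-4 (modified) Booth recoding. The Python sketch below assumes that this is the variant meant and that the operands are signed 8-bit values; neither is stated in the text, and the name `booth_radix4` is hypothetical.

```python
# Radix-4 Booth digit for each 3-bit group (b[i+1], b[i], b[i-1]).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4(multiplicand, multiplier, bits=8):
    """Multiply two signed integers using radix-4 Booth recoding.
    Returns (product, number of partial products)."""
    pattern = multiplier & ((1 << bits) - 1)   # two's complement bit pattern
    extended = pattern << 1                    # append the implicit 0 below bit 0
    partials = []
    for i in range(0, bits, 2):                # one digit per pair of bits
        group = (extended >> i) & 0b111
        digit = BOOTH_DIGIT[group]
        partials.append((digit * multiplicand) << i)
    return sum(partials), len(partials)
```

For 8-bit operands the loop produces four partial products instead of the eight a simple shift-and-add multiplier would generate, and each is one of 0, ±1, or ±2 times the multiplicand, which hardware can form with a shift and an optional negation.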

The present invention uses the Pack network at both the input and output stages, converting data into the required form and storing results left-shifted in units of bytes or words. Registers are therefore used efficiently, eliminating byte-level addressing reduces register file complexity, and processing many operations simultaneously in few clock cycles reduces the computational load and allows high-speed operation.

Moreover, because the vector and SAD arithmetic architecture of the present invention uses 32-bit registers, it is efficient for processing 32-bit multimedia data and removes unnecessary cycles, allowing high-speed data processing. The packing structure at the output of the arithmetic unit packs result data where appropriate before storing it, halving the register space used at the output stage. The freed space can hold other data, which raises register utilization and, because fewer registers are used, leaves the remaining registers available for other purposes.

In short, by processing multiple data items and operations simultaneously, the present invention requires two fewer operation cycles for vector operations than conventional multimedia DSP processors. This reduces the computational burden on the processor, improves the data processing speed, and increases register utilization.

Claims

(Deleted)

A first register for storing four input data items, one per byte;

A second register for storing another four input data items, one per byte;

Four 8x8 multipliers for simultaneously multiplying, byte by byte, the data stored in the registers;

A first pipeline stage for reducing a delay path of the result data of the multipliers;

Two 16-bit adders for adding, two at a time, the four products of the multipliers that have passed through the first pipeline stage;

A second pipeline stage for reducing a delay path of the result data of the adders;

One 32-bit adder for adding the sums of the two adders that have passed through the second pipeline stage;

A multiplexer for handling a 16-bit packed operation result and a 32-bit unpacked operation result;

A result storage register for storing the operation result;

And a 40-bit third register, sized to allow for overflow of the adder, for storing the result of the 32-bit adder.
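The datapath recited in the claim above can be modelled in software as a four-element multiply-and-add. The following is a behavioural sketch only, not the claimed circuit: it assumes unsigned bytes, keeps the carries of the intermediate sums, treats the 40-bit register as an accumulator, and models the multiplexer as a choice between a 16-bit and a 32-bit result. The names `unpack_bytes` and `vector_mac` are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK40 = (1 << 40) - 1


def unpack_bytes(reg):
    """Split a 32-bit register into four unsigned bytes, low byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def vector_mac(reg_a, reg_b, acc=0, packed=False):
    """Behavioural model: returns (new 40-bit accumulator, selected result)."""
    a, b = unpack_bytes(reg_a), unpack_bytes(reg_b)

    # Stage 1: four 8x8 multiplications performed side by side.
    products = [x * y for x, y in zip(a, b)]

    # Stage 2: the four products are added two at a time.
    sum_lo = products[0] + products[1]
    sum_hi = products[2] + products[3]

    # Stage 3: a single wider addition combines the two partial sums.
    total = sum_lo + sum_hi

    # Assumption: the 40-bit register accumulates successive totals.
    acc = (acc + total) & MASK40

    # Assumption: the multiplexer selects a 16-bit or a 32-bit result.
    result = total & (MASK16 if packed else MASK32)
    return acc, result
```

With `reg_a = 0x04030201` and `reg_b = 0x08070605`, the byte vectors are (1, 2, 3, 4) and (5, 6, 7, 8), and the model returns 70 for both the accumulator and the result.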

(Deleted)

A 32-bit first register for storing four input data items, one per byte;

A 32-bit second register for storing another four input data items, one per byte;

Four subtractors for subtracting, byte by byte, the two data items output from the two registers;

A first pipeline stage for reducing a delay path of the result data of the subtractors;

Four abs (absolute value) calculators for computing, using two's complement, the absolute values of the differences from the four subtractors that have passed through the first pipeline stage;

First and second adders for adding, two at a time, the result values output from the four abs calculators;

A second pipeline stage for reducing a delay path of the result data of the adders;

A third adder for adding the sums produced by the first and second adders that have passed through the second pipeline stage;

And a 32-bit third register for storing the 16-bit result added by the third adder.
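The SAD datapath recited above can likewise be sketched in software. This is a behavioural illustration under stated assumptions (unsigned pixel bytes, low byte first, a 16-bit final result), not the claimed hardware; `unpack_bytes` and `sad4` are names introduced here.

```python
def unpack_bytes(reg):
    """Split a 32-bit register into four unsigned bytes, low byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def sad4(reg_a, reg_b):
    """Sum of absolute differences over four byte pairs."""
    a, b = unpack_bytes(reg_a), unpack_bytes(reg_b)

    # Stage 1: four byte-wise subtractions.
    diffs = [x - y for x, y in zip(a, b)]

    # Stage 2: absolute value by two's-complement negation (invert, add 1)
    # of the negative differences.
    abs_vals = [(~d + 1) if d < 0 else d for d in diffs]

    # Stage 3: two adders each combine a pair of absolute values.
    sum_1 = abs_vals[0] + abs_vals[1]
    sum_2 = abs_vals[2] + abs_vals[3]

    # Stage 4: a third adder produces the final sum, which fits in 16 bits.
    return (sum_1 + sum_2) & 0xFFFF
```

For example, `sad4(0x0A141E28, 0x0F0F0F0F)` compares the bytes (40, 30, 20, 10) with (15, 15, 15, 15) and returns 25 + 15 + 5 + 5 = 50. A block-matching search in motion estimation would call such a primitive repeatedly over each candidate block.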

The circuit of claim 4,

A computing circuit for real-time arithmetic data processing in a DSP processor and a microprocessor, further comprising, between the third adder and the 32-bit register, a multiplexer for handling a 16-bit packed operation result and a 32-bit unpacked operation result, and a result storage register for storing the operation result, so as to increase the utilization efficiency of the registers.

The circuit of claim 2,

A computing circuit for real-time arithmetic data processing in a DSP processor and a microprocessor, characterized in that it is configured in the same way as the circuit of claim 2 even if the number of bits of the input/output data of the circuit changes.

The circuit of claim 2,

A computing circuit for real-time arithmetic data processing in a DSP processor and a microprocessor, characterized in that it is configured in the same way as the circuit of claim 2 even if the size of the datapath in the circuit changes.

The circuit of claim 2,

A computing circuit for real-time arithmetic data processing in a DSP processor and a microprocessor, characterized in that it is configured in the same way as the circuit of claim 2 even if the operand size of the data is changed when a program is written to perform the operation.

Dividing eight input data items into two groups of four and storing the data byte by byte;

Performing subtraction, byte by byte, on the data stored in the two groups;

Calculating the absolute value of each subtraction result using two's complement;

Adding the four absolute values two at a time and outputting two sums;

And adding the two sums once more to output and store a single result value.