KR100246033B1

KR100246033B1 - A real-time high speed full search block matching motion estimation processor

Info

Publication number: KR100246033B1
Application number: KR1019970030135A
Authority: KR
Inventors: 유재희
Original assignee: 유재희
Priority date: 1997-06-25
Filing date: 1997-06-25
Publication date: 2000-03-02
Also published as: KR19990005915A

Abstract

The present invention relates to a computation method for high-speed real-time motion estimation, and a computing device therefor, which in hierarchical block search improves hardware utilization by eliminating the delay caused by dependencies between the motion estimation layers, offers scalability and flexibility on the basis of a unified architecture, and can also be extended to computing devices for the full search algorithm. The device comprises motion estimation processors (101), (102), (103) for layers 1, 2, and 3, a half-pixel motion estimation processor (104), past-frame storage memories (105-111) that store the data needed to process one macroblock, and current-frame macroblock data (112-118) corresponding to the macroblock of each past frame. The layer 1, 2, and 3 search motion estimation processors (101), (102), (103) perform the motion estimation of their respective layers in parallel, and the past-frame search area of each processor is determined by the motion vector of the upper layer. Because computation can start only after the past-frame data selected by that motion vector has been input to the lower-layer motion estimation processor, a delay arises. The motion estimation processors (101-104) operate on the past-frame macroblocks stored in the past-frame storage memories (105), (107), (109), (111), respectively. The invention can implement various motion estimation modes on a unified architecture, supports high-speed computation, and raises the reuse of input data, thereby minimizing the number of I/O pins and the amount of hardware.

Description

Computation Method for Fast Real-time Processing Motion Estimation and Computing Device Therefor

The present invention relates to a computation method for high-speed real-time motion estimation, and a computing device therefor, for efficiently implementing the motion estimation unit, which is the hardest part of an image compression codec system to implement, together with the various functional modules that make it up. More specifically, for the hierarchical search block matching algorithm, the invention improves hardware utilization by eliminating the delay caused by dependencies between the motion estimation layers, and it resolves the I/O bottleneck and minimizes the chip pin count by increasing the reuse of image data. In addition, because every motion estimation layer is built by repeating the same array hardware structure, VLSI (very large scale integration) implementation is eased and high-speed image compression is achieved. For motion estimation applicable to many uses (full-pixel and half-pixel accuracy, frame and field modes, various block sizes), the invention provides scalability and flexibility on the basis of a unified architecture, and it can also be extended to computing devices for the full search algorithm.

A conventional motion estimation algorithm, in order to compress the amount of data when image data is transmitted or stored, computes the difference between past image data and current image data and finds the motion vector that minimizes this difference. It is the operation that allows the decoder to reconstruct the current image data from the transmitted past image data and motion vector.

Among such motion estimation algorithms, the full search block matching algorithm (FBMA) compares all pixels in a macroblock of size N in the current frame with every macroblock in a search area (p) of fixed size in the previous frame, and then outputs as the motion vector the displacement between the two macroblocks that give the smallest difference. The full search algorithm is accurate, but when the search area grows, as in high-definition television, the number of pixels to compare increases enormously, so that real-time processing becomes nearly impossible or the hardware becomes very large. To address this, the hierarchical search block algorithm (HSA) reduces the amount of computation by performing motion estimation, layer by layer, on selected macroblocks in the search area or on selected pixels within a macroblock. The first layer samples macroblocks, or pixels within macroblocks, over a wide search area to find a low-accuracy motion vector. Building on that vector, each later layer performs more accurate motion estimation over a smaller search area than the preceding stage, and the final stage finds the most accurate motion vector.
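The two search strategies described above can be illustrated in software. The following is a minimal sketch, not the patented hardware: the function names, the use of the sum of absolute differences (SAD) as the difference measure, and the two-layer step-2/step-1 scheme are all assumptions chosen for illustration.

```python
# Illustrative sketch only; names, SAD cost, and the two-layer scheme are assumptions.

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p, center=(0, 0), step=1):
    """Test every displacement within +/-p of `center` (stride `step`) and
    return (best_sad, (dx, dy)). Candidates that leave the frame are skipped;
    the sketch assumes at least one candidate lies inside the frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(center[1] - p, center[1] + p + 1, step):
        for dx in range(center[0] - p, center[0] + p + 1, step):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


def hierarchical_search(cur, ref, bx, by, n, p):
    """Two-layer coarse-to-fine search: a stride-2 scan over +/-p, followed by
    a stride-1 refinement over +/-1 around the coarse vector."""
    _, coarse = full_search(cur, ref, bx, by, n, p, step=2)
    return full_search(cur, ref, bx, by, n, 1, center=coarse, step=1)
```

With a search range of +/-4, the full search evaluates 81 candidate positions, whereas this two-layer variant evaluates at most 25 + 9. The price is that the coarse layer can settle on a wrong neighborhood, which is the accuracy trade-off the text describes.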

In general, such motion estimation involves an enormous amount of computation and must run at high speed in real time, so its performance needs to be improved through parallel processing. Because the full search algorithm has a relatively regular data flow compared with the hierarchical block search algorithm, the prior art has generally implemented it with systolic arrays and similar structures. However, although the full search algorithm is relatively easy to implement in hardware, it is difficult to process wide search areas and large frame sizes at high speed in real time. Recent implementations such as high-definition television have therefore used hierarchical block search algorithms.

Looking at conventional motion estimation architectures in more detail, systolic architectures mostly start from an architecture designed for the full search algorithm and propose ways of supplying data as the motion estimation layer changes, so that the hierarchical search motion estimation algorithm can be implemented. Hierarchical search has been implemented with one-dimension-oriented systolic array structures, two-dimensional systolic array structures, and pipelined tree architectures, and recursive architectures based on tree cutting have been used to reduce the number of computation units.

To solve the delay problem between motion estimation layers, a partial motion vector is first obtained from the current-frame macroblock and some of the past-frame macroblocks, and a smaller search area is then searched on that basis. In parallel, the remaining motion estimation vectors, for which motion estimation has not yet finished, are obtained, and the final motion vector is found by comparing the results of the two operations.

To implement the irregular input and output of image data efficiently, the concept of a double buffer, which can handle data input and output at the same time, has generally been used for moving image data in memory.

Most motion estimation architectures are designed for a fixed image block size and search area. Because motion estimation requires high-speed real-time processing, the hardware structure has been optimized for each layer, or different architectures have mostly been used for different layers.

Such a motion estimation system needs not only higher computational performance but also a huge amount of image data input and output, so a way to implement that I/O efficiently is essential. To reduce the amount of I/O per unit time, various methods have therefore been used to reuse input image data, such as efficient data movement between shift registers and internal memory.

In hardware, motion estimation requires computing the absolute value of the difference between a large number of past and current image data values. Because motion estimation requires high-speed real-time processing, an absolute-value circuit that performs this operation quickly is needed. Three conventional absolute-value circuits have been used. In FIG. 7a, to compute the absolute value of (a-b), for example, subtractor (701) computes (a-b), subtractor (702) computes (b-a), and selector (703) takes whichever of the two is positive. In FIG. 7b, a is stored in register (704) and b in register (705), the magnitudes of a and b are compared, and subtractor (706) subtracts the smaller number from the larger. In FIG. 7c, subtractor (708) computes (a-b); if negative detector (709) finds the result negative, mux (707) takes that result as its input in place of the external input, and the result is converted to a positive number by passing through subtractor (708) again.
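The three conventional schemes can be modeled behaviorally to make their costs visible. This is a hedged software sketch, not a circuit description: the function names are hypothetical, and the way the FIG. 7c scheme negates a negative result (computing 0 - d on the second pass) is an assumption, since the text only says the result passes through the subtractor again.

```python
# Behavioral models of the three conventional absolute-difference schemes.
# Function names and the 0 - d negation in the third model are assumptions.

def abs_diff_7a(a, b):
    """FIG. 7a style: two subtractors work side by side and a selector keeps
    the non-negative result. Fast, but costs two subtractors."""
    d1 = a - b                      # subtractor 701
    d2 = b - a                      # subtractor 702
    return d1 if d1 >= 0 else d2    # selector 703


def abs_diff_7b(a, b):
    """FIG. 7b style: compare magnitudes first, then subtract the smaller
    from the larger. Needs comparison hardware besides the subtractor."""
    big, small = (a, b) if a >= b else (b, a)   # magnitude comparison
    return big - small                          # subtractor 706


def abs_diff_7c(a, b):
    """FIG. 7c style: one subtractor that may be used twice.
    Returns (absolute difference, number of subtractor passes)."""
    d = a - b          # first pass through subtractor 708
    passes = 1
    if d < 0:          # negative detector 709
        d = 0 - d      # mux 707 feeds the result back for a second pass
        passes = 2
    return d, passes
```

All three produce the same value; they differ in how many subtractors, comparators, or passes they need, which is the hardware and timing cost weighed in the text.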

However, in a conventional hierarchical search motion estimation processor, owing to the nature of the algorithm, the stages do not compute independently. A lower layer can accept its past-frame macroblock and begin computing only on the basis of the motion vector produced after the upper layer has finished. The resulting delay lowers hardware utilization and throughput.

Motion estimation requires high-speed real-time processing and must take in a huge amount of external image data. Moreover, in the hierarchical search motion estimation algorithm in particular, the data structure is irregular. This increases the amount of hardware and the bandwidth of the memory that supplies the data, especially the number of ports, or causes I/O to limit the computation speed. In particular, a two-dimensional systolic array needs more ports than a one-dimensional one. With a one-dimensional systolic array, however, hardware utilization drops or high-speed computation is difficult. In addition, a large amount of image data must be input to the motion estimation processor all at once at the start of the computation for the next current-frame macroblock, which greatly increases the number of ports. This problem is especially serious when a tree structure is used. For this reason the number of ports has been reduced by memory interleaving, but at high speeds the computation rate is then limited by the memory speed. Multiple memories have been used to overcome the memory speed limit, but this complicates the interconnect, making the hardware difficult to implement and increasing the delay, and it lacks the flexibility to adjust the number of ports to the speeds of different memories and of the motion estimation unit.

Meanwhile, when a systolic array is used, the number of computation units is fixed by the block size and the search area. This limits extension to the various motion estimation modes: the per-layer computations of the hierarchical search algorithm, half-pixel accuracy, and frame-based and field-based estimation. That is, such an array is applicable only to the special case in which the number of absolute-value units is determined by the macroblock size and the search area is half the size of the macroblock used for motion estimation. The computing capacity of the architecture is therefore fixed, and it lacks the flexibility to be applied to motion estimation systems for diverse applications. As a result, hardware utilization drops or several architectures are needed, which increases system size and cost and lengthens design time. A unified architecture is therefore needed that can handle diverse applications with different block sizes and search areas, each layer of the hierarchical search motion estimation, and field, frame, and half-pixel modes.

A point that must be considered in such a unified architecture is how to reduce the number of I/O ports discussed above by handling two things efficiently: the various input data structures required by the hierarchical search algorithm, and the reuse, through data movement, of the same pixel data needed by different computation units. For this purpose the prior art uses a double-buffer scheme inside the processor, storing input pixel data inside the computing unit for the interval over which it is used. However, because a double buffer must accept data from outside while output proceeds inside the computing unit, it requires twice the storage capacity and greatly increases complexity and the amount of hardware. An efficient circuit implementation that solves this is therefore needed. Furthermore, motion estimation involves a vast amount of data, and especially when the hierarchical search motion estimation algorithm is implemented, the interval between uses of the same data is long and irregular, so that large internal buffers and shift registers and complicated external memory address calculation are needed. In general, in conventional motion estimation processors there is a trade-off among the amount of internal buffer and shift register storage, the degree of data reuse, and the number of ports, and a unified architecture for all layers with little hardware and few ports has not been possible.

To parallelize motion estimation efficiently, many hardware units that compute the absolute value of the difference between past and current pixels are needed, so an efficient circuit implementation is required. Among the conventional implementations shown in FIG. 7, the device of FIG. 7a, consisting of subtractors and a mux, needs two subtractors, which increases the amount of hardware. The device of FIG. 7b, consisting of a mux and a subtractor, needs hardware in addition to the subtractor to compare the magnitudes of the two inputs, which also increases the amount of hardware. The device of FIG. 7c, consisting of a mux, a subtractor, and a negative detector, must pass through the subtractor twice to compute the absolute value, which increases the computation time.

Accordingly, the present invention was devised to solve the above problems. Its object is to provide a macroblock search method using inter-layer pipelining that is easy to implement in hardware through repeated use of the same regular hardware structure, that reduces hardware idle time between successive image macroblocks by means of an efficient past-frame buffer, and that reduces the number of I/O pins by allowing external image data to be input serially. Another object of the invention is to provide a unified device for computing per-layer pixel difference values that can implement diverse motion estimation modes (each motion estimation layer, block size, field and frame operation, half-pixel operation) on a unified architecture and is capable of high-speed computation. Yet another object is to provide a unified device for computing per-layer pixel difference values that raises the reuse of input data so as to minimize the number of I/O pins and the amount of hardware. Yet another object is to provide a switchable merged add-and-compare computing device that allows the number of I/O pins to be adjusted, and operation with a variety of external image memories, by adjusting the external and internal clock speeds. Yet another object is to provide a memory cell connection structure that makes it possible to implement an image memory in which input and output proceed simultaneously with a small storage capacity.

FIG. 1 is a configuration diagram of the external image memory according to the present invention;

FIG. 2 is a configuration diagram of the device that supplies data from the external memory into the processor according to the present invention;

FIG. 3 is a configuration diagram of the unified computing device that can compute every layer of the hierarchical search motion estimation according to the present invention;

FIGS. 4a and 4b are configuration diagrams of the shift registers for data movement within the processor according to the present invention;

FIG. 5 is a configuration diagram of the device that can serve as both comparator and adder according to the present invention;

FIG. 6 is a configuration diagram of the pixel absolute-difference memory according to the present invention;

FIG. 7 is a configuration diagram of conventional absolute-value circuits;

FIG. 8 is a configuration diagram of the pixel absolute-value circuit according to the present invention;

FIGS. 9 and 10 show the positions of the current-frame (CF) and past-frame (PF) pixels in two dimensions, with the origin at the upper left.

* Explanation of reference numerals for the main parts of the drawings *

101-104: motion estimation processors
105-111: past-frame storage memories
112-118: current-frame macroblock data
201: external memory
202: past-frame buffer
203: shift register array
301-316: pixel absolute-difference groups
317: adder tree
318: field/frame generator
401-404, 421-439, 441-444, 449-452, 454-457, 462-465, 467-470: shift registers
405-420, 445-448, 458-461: pixel absolute-difference computation units and muxes
440, 453, 466, 703, 707: muxes
501: input mux
502, 802: adders
503, 805: output muxes
601-604: memory cells
606, 607: memory cell column selectors for reading
608, 609: memory cell column selectors for writing
610: adder-accumulator
701, 702, 706, 708: subtractors
704, 705: registers
709: negative detector
801, 803: inverters
804: absolute value generator

To achieve the above objects, the macroblock search method using inter-layer pipelining according to the present invention is a method of computation between the layers of a motion estimator that searches successive macroblocks (k_1, k_2, ..., k_{n-1}, k_n, k_{n+1}, ...), comprising: a first step in which the upper layer simultaneously searches for the input macroblock (k_n) within its associated search area and reads out the data in the associated search area of the next macroblock (k_{n+1}); a second step in which the lower layer simultaneously searches for the previous macroblock (k_{n-1}) within the search area narrowed by the upper-layer search performed before the first step, and reads out the data in the search area narrowed by the search for macroblock (k_n) in the first step; and a third step in which the lower layer searches for macroblock (k_n) within the search area read out in the second step. The read-out process comprises a step of storing the relevant data in storage means, and an input step of moving, while the processor is searching, the initial amount of data needed to search the next macroblock into an internal buffer in the processor.

The unified device for computing per-layer pixel difference values according to the present invention is a pixel difference computing device for motion estimation comprising: unit difference computing means that decomposes the macroblock of each layer into macroblock units of the minimum size and sequentially sums and outputs the absolute values of the pixel differences; and adding means that adds the sums output by the unit difference computing means and outputs the pixel difference value of the macroblock for each layer.

In one form, the unit difference computing means comprises: a predetermined number of latches that store, one to one, the pixel data of a row or column of the current macroblock of minimum size (M×M); a plurality of shift registers that store, one to one, the pixel data of a row or column in the search area; and a predetermined number (kM, where k is a natural number with 1 ≤ k ≤ M) of operators that output the absolute value of the pixel difference between the pixel data in the latches and the pixel data in the shift registers. The operators simultaneously compute the group of row or column pixel data in the latches against the corresponding group of pixel data in the shift registers and against each of the k successive adjacent groups of pixel data, and after the operators have computed, the shift registers move data between positions a predetermined pixel distance (M) apart.

In another form, the unit difference computing means comprises: a plurality of shift registers that store a group of row or column pixel data in the search area; and a plurality of operators that replicate an arbitrary pixel datum of the current macroblock and each compute the absolute value of its difference from the pixel data in the shift registers. After the operators have computed, the shift registers shift to the adjacent pixel data, and the operators repeat the computation with the next pixel datum of the current macroblock.

The switchable merged add-and-compare computing device according to the present invention comprises: input value selecting means that, according to an external function select signal, selects and outputs an arbitrary input value (A) or the corresponding negative value; adding means that adds the output of the input value selecting means to another arbitrary input value (B) and outputs the sum; and output value selecting means that, according to the external function select signal, either outputs the output of the adding means or selects and outputs the minimum of the input value (A) and the other input value (B), as determined by the output of the adding means.

The memory cell connection structure according to the present invention is characterized in that, among a plurality of memory cells arranged in rows and columns, the memory cells of a series of adjacent rows or columns are selected by the same select signal line, so that data bits are written and read at the same time.
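The merged add-and-compare unit described above can be modeled in a few lines. This is only a behavioral sketch under stated assumptions: the function name and the boolean `compare` flag standing in for the external function select signal are hypothetical, and Python integers stand in for the fixed-width two's-complement values a real circuit would use.

```python
# Behavioral sketch of a shared adder that either adds or picks a minimum.
# The name `add_compare_unit` and the `compare` flag are illustrative assumptions.

def add_compare_unit(a, b, compare):
    """If `compare` is False, return a + b.
    If `compare` is True, return min(a, b), decided by the sign of b - a."""
    selected = -a if compare else a   # input value selector: A or its negative
    s = selected + b                  # single adder, shared by both functions
    if not compare:
        return s                      # output selector passes the sum through
    # In compare mode the adder produced b - a; a negative sum means b < a.
    return b if s < 0 else a          # output selector picks the minimum
```

For example, `add_compare_unit(5, 9, False)` accumulates two partial sums, while `add_compare_unit(5, 9, True)` keeps the smaller of two candidate difference values. The point of the design, as described in the text, is that one adder serves both the accumulation and the minimum-selection roles.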

In the computing method and apparatus for high-speed real-time motion estimation according to the present invention, configured as described above, the hierarchical search motion estimation apparatus solves the problem of computation delay between different current-frame image data or macroblocks by not using consecutive current-frame and past-frame macroblock data, as conventional motion estimation does, but instead using data that skips one macroblock in time. The delay needed to generate the address of the next layer's past-frame macroblock data from the motion vector produced when a motion estimation finishes is shorter than the time needed to compute one motion estimation layer, so it is hidden, and hardware utilization is maximized.

A second delay that must be removed to raise hardware utilization in hierarchical search motion estimation arises mainly when, after motion estimation on one past-frame macroblock has finished, the past-frame macroblock data for the next motion estimation is read and loaded into the array of absolute difference units. To remove it, an internal buffer reads and stores the past-frame macroblock data needed for the next motion estimation while the operation on the current past-frame macroblock is still in progress. Because the number of I/O pins on the motion estimation chip and the I/O rate each pin must sustain are inversely related, the external and internal clock rates can be adjusted so that the chip can be connected to a variety of memories and I/O devices. At the start of the next motion estimation, the buffered data is loaded into the array of absolute difference units all at once and the operation begins. To resolve the I/O bottleneck between the external memory and the motion estimation unit, computation time and I/O time are overlapped: the image data needed for an operation is input serially from the external memory during the preceding operation, while inside the architecture the computation on the input data is performed in parallel for high speed.
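The benefit of overlapping the serial input with the preceding computation can be illustrated with a simple timing model. This is a minimal sketch under assumed conditions, not the patent's timing: the function name and the cycle counts are hypothetical, and a single prefetch buffer is assumed.

```python
def total_cycles(num_blocks, load_cycles, compute_cycles, prefetch):
    """Cycles needed to process num_blocks macroblocks (hypothetical model).

    Without prefetch, every block is loaded and then computed.
    With prefetch, the load of block i+1 overlaps the computation of
    block i, so only the very first load is exposed.
    """
    if num_blocks == 0:
        return 0
    if not prefetch:
        return num_blocks * (load_cycles + compute_cycles)
    # Each overlapped step lasts as long as the slower of the two activities.
    steady = max(load_cycles, compute_cycles)
    return load_cycles + (num_blocks - 1) * steady + compute_cycles


# Assumed numbers: 16 cycles to load a block, 64 cycles to compute it.
sequential = total_cycles(4, 16, 64, prefetch=False)
overlapped = total_cycles(4, 16, 64, prefetch=True)
```

As long as loading is no slower than computing, the load time disappears from the steady state and the array of absolute difference units stays busy.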

In addition, to compute the hierarchical search motion estimation algorithm on a unified architecture, the following property is used. As the motion estimation layer increases, the search area shrinks, but the amount of computation needed for the absolute differences between a current-frame macroblock and a past-frame macroblock varies inversely with the sampling distance and therefore grows as the sampling distance shrinks. The parameters are chosen so that the amount of computation needed for each layer always stays the same, as expressed by Equation (1). Consequently, with the same number of absolute difference units, every motion estimation layer takes the same time to compute.
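Equation (1) itself is not reproduced in this text, so the following is only a sketch of the balancing idea under an assumed cost model: the cost of a layer is taken to be the number of candidate positions, (2p/S)^2, times the number of sampled pixels per candidate, (N/S)^2. The function name and the three-layer parameter schedule are hypothetical and were chosen so that p/S^2 stays constant.

```python
def layer_cost(n, p, s):
    """Absolute-difference operations for one layer (assumed cost model).

    n: macroblock side in pixels
    p: search range, candidates span 2p pixels in each direction
    s: sampling distance
    """
    positions = (2 * p // s) ** 2   # candidate positions, sampled every s pixels
    pixels = (n // s) ** 2          # sampled pixels compared per candidate
    return positions * pixels


# Hypothetical (p_st, S_st) schedule for layers 1 to 3 with N = 16.
LAYERS = [(32, 4), (8, 2), (2, 1)]
costs = [layer_cost(16, p, s) for p, s in LAYERS]
```

Under these assumptions every layer costs the same number of operations, which is the condition that lets one fixed array of units serve all layers in equal time.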

The quantities p_st and S_st defined in Equation (1) denote the search area and the sampling distance of each motion estimation layer. Specifically, the hardware configuration and the data input scheme are made programmable with respect to the sampling distance and the search area, which removes the need to design different hardware for each layer when the hierarchical search motion estimation algorithm is implemented in hardware. In the implementation, the array of absolute difference units operates on a pixel group of size N/S1 × N/S1 (S1: sampling distance in layer 1), which corresponds to layer 1, the smallest block. Each of the N/S1 × N/S1 pixel absolute difference units holds a different pixel of the current-frame macroblock; the units compute in parallel and their results are summed and accumulated, which completes the operation on one macroblock. When the layer increases and the number of pixels in the macroblock that must actually be processed exceeds N/S1 × N/S1, the macroblock is divided into several blocks of size N/S1 × N/S1, the absolute difference of each block is computed in successive passes, and the result is stored in the pixel absolute difference memory. The memory keeps adding each new result to the previously stored block sums, producing the per-macroblock absolute difference for the data size that each layer requires. The higher the layer, the more N/S1 × N/S1 pixel groups must be summed. So that field and frame operations can also be handled by the unified architecture, the summation covers all pixel groups in frame mode and is performed separately for even and odd rows in field mode. If n denotes the number of pixel absolute differences that each unit computes, then n is the number of past-frame macroblocks divided by the total number of pixel absolute difference units, and each unit processes the past-frame data by n-way interleaving. In the architecture of the present invention n can be adjusted to the application, so the amount of hardware and the computation speed can be traded off freely and a variety of image processing standards can be implemented with the same architecture. To raise the performance of the motion estimation apparatus, the group of N/S1 × N/S1 pixel absolute difference units can be extended by u horizontally and by v vertically, so that u macroblocks in the horizontal direction and v macroblocks in the vertical direction of the past-frame search area are computed simultaneously. Conversely, when the amount of hardware must be reduced at the cost of computing power, the number of past-frame macroblocks assigned to each group of N/S1 × N/S1 units is increased and they are processed by interleaving. In this way the architecture can be adapted to a wide range of hardware budgets and computation speeds.
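The pass-by-pass accumulation over N/S1 × N/S1 pixel groups, with frame and field modes, can be sketched as follows. This is an illustrative software model rather than the patented circuit: the function name, the list-of-lists block layout, and the choice to return both field sums as a tuple are assumptions.

```python
def sad_by_groups(cur, ref, g, field_mode=False):
    """Sum of absolute differences accumulated group by group.

    cur, ref: square 2-D lists of sampled pixels with identical size
    g: side of the pixel group handled in one pass (N/S1 in the text)
    Returns the frame SAD, or (even_rows_sad, odd_rows_sad) in field mode.
    """
    size = len(cur)
    frame_acc = 0
    field_acc = [0, 0]                      # index 0: even rows, 1: odd rows
    for by in range(0, size, g):            # one pass of the g x g unit array
        for bx in range(0, size, g):
            for y in range(by, by + g):
                row_sum = sum(abs(cur[y][x] - ref[y][x])
                              for x in range(bx, bx + g))
                frame_acc += row_sum        # frame mode: all rows together
                field_acc[y % 2] += row_sum  # field mode: split by row parity
    return tuple(field_acc) if field_mode else frame_acc


# An 8 x 8 block processed as four 4 x 4 groups.
cur = [[y * 8 + x for x in range(8)] for y in range(8)]
ref = [[0] * 8 for _ in range(8)]
frame_sad = sad_by_groups(cur, ref, 4)
even_sad, odd_sad = sad_by_groups(cur, ref, 4, field_mode=True)
```

The two field sums always add up to the frame sum, which is why one accumulation path can serve both modes.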

As the computation proceeds, the pixel absolute difference units operate on overlapping past-frame inputs, so the past-frame data used by the motion estimation apparatus should be reused efficiently. Together with the method described above for resolving the I/O bottleneck between the external image memory and the motion estimation processor, increasing the reuse of input data minimizes the amount of data exchanged with the outside per unit time. To this end, the image data supplied to the two-dimensional array of pixel absolute difference units is moved by shift registers over time, so that it is reused while the computation proceeds. For any block size and search area, the shift register data movement and storage structure of the present invention needs only as many shift registers as there are pixels required for the computation; unlike the conventional art, no separate shift registers are needed solely for moving data. With this method, the absolute differences between pixels at the same positions are computed for an arbitrary macroblock in the current frame and a past-frame macroblock located in the search area set on the basis of the motion vector. In addition, for high-speed operation the pixels are processed and shifted group by group, which enables parallel processing. When the architecture is applied to the full search algorithm instead of the hierarchical search motion estimation algorithm, the same current-frame data is used in every operation, so the current-frame data can be broadcast to all absolute difference units.
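The broadcast scheme for full search can be modelled in software by giving every candidate displacement its own accumulator and feeding one current-frame pixel to all of them per step. This is a minimal sketch; the function name, the window layout, and the displacement range of -p to +p are assumptions made for illustration.

```python
def full_search_broadcast(cur, window, p):
    """Full-search block matching with a broadcast current-frame pixel.

    cur: n x n current block
    window: (n + 2p) x (n + 2p) past-frame search window
    Returns ((dy, dx), sad) for the best displacement in -p..+p.
    """
    n = len(cur)
    # One accumulator per candidate displacement (one unit each in hardware).
    acc = {(dy, dx): 0
           for dy in range(-p, p + 1) for dx in range(-p, p + 1)}
    for y in range(n):
        for x in range(n):
            c = cur[y][x]                   # this pixel is broadcast to all units
            for dy, dx in acc:              # the units run in parallel in hardware
                acc[(dy, dx)] += abs(c - window[y + p + dy][x + p + dx])
    best = min(acc, key=acc.get)
    return best, acc[best]


# Block size 4, search range 2: the block is cut from the window at (+1, -1).
window = [[8 * y + x for x in range(8)] for y in range(8)]
cur = [[window[y + 3][x + 1] for x in range(4)] for y in range(4)]
best_mv, best_sad = full_search_broadcast(cur, window, 2)
```

Because each current-frame pixel is used by every accumulator before the next one is needed, it only has to be delivered once, which matches the reuse argument made above.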

When the block size differs, a motion estimation operation differs both in the number of pixel absolute differences to be summed and in the number of accumulated sums that must be searched for the minimum in order to obtain the motion vector. To handle these cases on a unified architecture, the required number of absolute difference accumulators and the required number of comparators for their results therefore vary. The present invention solves this by proposing a unit that can be switched between a comparator and an adder. The switching is possible because, in hardware, a comparator can decide which operand is larger from the sign bit of a subtraction result, and in two's complement arithmetic a subtractor is obtained from an adder by complementing one input and feeding 1 into the least significant carry input.

In the motion estimation system, the pixel absolute difference memory is read and written simultaneously, and the read address and the write address always differ by one row. The conventional art handles this with independent read and write ports and twice the storage capacity. The present invention exploits the fact that the memory cells being read and written are adjacent and controls the memory cell rows with a common signal, which simplifies the control structure of the memory and reduces its storage capacity.

In addition, a motion estimation system needs a very large number of pixel absolute difference calculators for high-speed parallel processing. An absolute difference operation basically requires two adders: when the output of adder 1 is negative, its two's complement must be taken to obtain the absolute value, so the data passes through two adders in series, which can slow down the absolute value unit. In the present invention, the adder that negates a negative result is simplified, unlike in the conventional art, which raises the computation speed and reduces the amount of hardware. The adder to be simplified, adder 2, is a CLA (carry lookahead adder). The two's complement is obtained by adding 1 to the bitwise complement of the output of adder 1. If the 1 is fed into the least significant carry input Cin of adder 2, the other input (Bi) of adder 2 is always 0. With Bi = 0 and Cin = 1, the adder equations reduce to Si = Ai XOR Ci and Ci+1 = Ai AND Ci.

Consequently, the circuits for Gi and Pi that a CLA normally requires are unnecessary and the circuit that computes Ci becomes simpler. Although two adders are connected in series, adder 2 is simplified, so the computation speed of the absolute value unit is not greatly reduced.
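The simplification of adder 2 can be checked with a small bit-level model. This sketch assumes 8-bit unsigned pixels and a 9-bit two's complement difference; the function name and the bit width are illustrative choices, not values taken from the patent.

```python
def abs_diff(a, b, width=9):
    """|a - b| using adder 1 followed by a simplified adder 2.

    a, b: unsigned values that fit in width - 1 bits.
    """
    mask = (1 << width) - 1
    # Adder 1: a + ~b + 1 is a - b in two's complement.
    d = (a + (~b & mask) + 1) & mask
    if not (d >> (width - 1)) & 1:
        return d                      # sign bit clear: result already positive
    inv = ~d & mask                   # inverter stage
    # Adder 2 with B_i = 0 and carry-in 1:
    #   S_i = A_i xor C_i,  C_{i+1} = A_i and C_i
    carry, out = 1, 0
    for i in range(width):
        bit = (inv >> i) & 1
        out |= (bit ^ carry) << i
        carry &= bit
    return out
```

Since the carry is just the AND of all lower bits, no generate or propagate terms are needed, which is the hardware saving described above.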

Hereinafter, the configuration and operation of a preferred embodiment of the computing method and apparatus for high-speed real-time motion estimation according to the present invention are described in detail with reference to the accompanying drawings.

Fig. 1 is a block diagram of the complete hierarchical search motion estimation system. It consists of the layer 1 motion estimation processor (101), the layer 2 motion estimation processor (102), the layer 3 motion estimation processor (103), the half-pixel motion estimation processor (104), the past-frame storage memories (105), (106), (107), (108), (109), (110), (111), which hold the data needed for the operation on one macroblock and are numbered in reverse order of time of occurrence, and the current-frame macroblock data (112), (113), (114), (115), (116), (117), (118) corresponding to each past-frame macroblock. The layer 1, 2, and 3 hierarchical search motion estimation processors (101), (102), (103) perform the motion estimation of their respective layers in parallel, and the past-frame search area of each processor is determined by the motion vector of the higher layer. Because the operation can start only after the past-frame data selected by this motion vector has been input to the lower-layer motion estimation processor, a delay occurs. The motion estimation processors (101), (102), (103), (104) operate on the past-frame macroblocks stored in the past-frame storage memories (105), (107), (109), (111), respectively.

Fig. 2 shows the interface that inputs data from the external memory into the computing device while minimizing the I/O bottleneck. The interface consists of the external memory (201), the past-frame buffer and interpolator (202), and the shift register array (203).

While the operation on the current-frame macroblock is being performed, part of the past-frame data needed for the next operation is received in advance into the past-frame buffer (202), which keeps the I/O volume per unit time constant. To minimize the I/O bottleneck, the input to the past-frame buffer (202) is serial, which minimizes the number of processor pins. In half-pixel mode, the output of the interpolator is stored in the past-frame buffer (202). The number of ports of the past-frame buffer and interpolator (202) can be adjusted to suit the application: with more ports the I/O speed can be lowered, and with fewer ports faster I/O is required. In this way the number of I/O pins and the I/O speed can be tuned to the application and the implementation environment. The stored image data for the successive past-frame macroblocks is loaded all at once into the shift register array (203) inside the motion estimation apparatus at the start of the operation, which removes the delay.

The image data loaded at once as described above is processed by the array of absolute difference units, which computes in parallel the absolute differences between the pixels of the current-frame and past-frame macroblocks.

Fig. 3 shows the internal structure of the device that performs the various layer, field, frame, and half-pixel motion estimation operations of the hierarchical search motion estimation algorithm on a unified architecture. Reference numerals (301), (302), (303), (304), (305), (306), (307), (308), (309), (310), (311), (312), (313), (314), (315), (316) denote the pixel absolute difference groups of size N/S1 × N/S1 (S1: sampling distance in layer 1) described above. Each such group of pixel absolute differences is input to the adder tree (317), which sums them in tree form. The output of the adder tree, which is the accumulated sum of the pixel absolute differences within each group, is then accumulated sequentially in the field/frame generator (318), by means of the pixel absolute difference memory, into macroblock units of the size appropriate to each motion estimation layer. In this step the sums produced by the adder tree are accumulated according to the mode: over odd and even rows separately in field mode, and over all rows in frame mode.

Fig. 4 shows two data storage and movement structures that, through data reuse, minimize the I/O volume per unit time and reduce the amount of internal buffer memory. The internal computing device consists of pixel absolute difference units, which compute the absolute differences, and shift registers, which supply them with past-frame macroblock data. In general, in the unified architecture described above the device computes v macroblocks simultaneously in the vertical direction, so it consists of (N/S1 + v - 1) rows of pixel absolute difference units, and within each row the units are arranged in a two-dimensional form whose dimensions are given by an expression that is not reproduced in this text. In addition, (N/S1 + 2p/S1 - 1) shift registers are arranged in rows (the row dimensions are likewise not reproduced in this text), with a final row of m shift registers, where m = (u - 1) when u < N/S1 and m = (N/S1 - 1) when u ≥ N/S1. As the computation proceeds, the past-frame data in these m shift registers is either reused by successive operations or used only once; accordingly, every (2p/S1)/u clocks the data path switches between the adjacent row and the row one further away.

Fig. 4a applies this scheme to the hierarchical search motion estimation algorithm for block size 16, search area 32, and sampling distance 4. The parallel processing configuration is u = 4, v = 1, so the shift registers form a 4 × 4 arrangement plus a final row of 3. In Fig. 4a, (401), (402), (403), (404), (421), (422), (423), (424), (425), (426), (427), (428), (429), (430), (431), (432), (433), (434), (435) denote shift registers, and (405), (406), (407), (408), (409), (410), (411), (412), (413), (414), (415), (416), (417), (418), (419), (420) denote pixel absolute difference units. At the start of the operation each shift register receives past-frame data from the past-frame buffer (202) and passes it to the absolute difference unit connected to it; on each clock it also shifts the data, in the direction of the arrows, to the shift register adjacent above it. The pixel absolute difference operation is thus performed by using the same data repeatedly.
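The register counts quoted for Fig. 4a can be cross-checked against the general expressions. The sketch below assumes that "search area 32" means p = 32 and that (N/S1 + 2p/S1 - 1) is the total number of shift registers including the final row of m; both readings are assumptions, and the function name is hypothetical.

```python
def shift_register_layout(n, p, s1, u):
    """Return (total, last_row, remaining) shift register counts.

    n: block size, p: search range, s1: layer 1 sampling distance,
    u: number of macroblocks processed in parallel horizontally.
    """
    side = n // s1                           # N/S1
    total = side + 2 * p // s1 - 1           # N/S1 + 2p/S1 - 1
    m = u - 1 if u < side else side - 1      # size of the final row
    return total, m, total - m


total, last_row, remaining = shift_register_layout(16, 32, 4, 4)
```

Under these assumptions the result is 19 registers in total, split into 16 (the 4 × 4 arrangement) and a final row of 3, which agrees with the 19 shift register reference numerals listed for Fig. 4a.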

Fig. 4b shows the data movement and storage scheme for the full search motion estimation algorithm. It consists of the shift registers (436), (437), (438), (439), (441), (442), (443), (444), (449), (450), (451), (452), (454), (455), (456), (457), (462), (463), (464), (465), (467), (468), (469), (470), the multiplexers (440), (453), (466), and the pixel absolute difference units with their multiplexers (445), (446), (447), (448), (458), (459), (460), (461).

The multiplexer attached to each computing unit switches every (2pN / number of horizontally arranged pixel absolute difference units) clocks, connecting the unit alternately to the upper and the lower past-frame pixel data. The other multiplexers (440), (453), (466) alternately select whether the data entering each shift register row comes from the row directly above it or from the row three rows below it. The difference between Fig. 4a and Fig. 4b is as follows. Fig. 4a, based on the hierarchical search motion estimation algorithm, processes and shifts data in parallel group by group, so it can achieve higher computing performance than Fig. 4b, and it is not bound by the condition N = 2p that is needed to eliminate the extra shift registers used only for moving data. Fig. 4b, based on the full search motion estimation algorithm, uses the same current-frame pixel data in all pixel absolute difference units for several clocks before the next current-frame data is needed, so the current-frame data can be broadcast at a low rate, which is advantageous for the hardware implementation.

Fig. 5 shows the structure of the unit that can be switched between an adder and a comparator so that various block sizes can be handled. The unit consists of an input multiplexer (501), an adder (502), and an output multiplexer (503). The input multiplexer (501) passes the normal input to the adder (502) when the unit is used as an adder and passes the complemented input when it is used as a comparator. When a comparator is required, the adder (502) receives the complemented value from the input multiplexer (501) together with a 1 on its least significant carry input and thereby performs a subtraction. The output multiplexer (503) outputs the result of the adder (502) in adder mode and, in comparator mode, outputs the smaller of the two inputs, selected by the sign bit of the output of the adder (502).
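The behaviour of the switchable unit can be modelled as a single adder whose inputs and outputs are steered by multiplexers. This is a minimal sketch for unsigned operands; the function name, the default bit width, and the use of one extra bit to hold the sign are assumptions made for illustration.

```python
def add_or_min(a, b, compare, width=16):
    """Model of one adder shared between addition and minimum selection.

    a, b: unsigned operands that fit in `width` bits.
    compare=False: return a + b (truncated to `width` bits).
    compare=True: return min(a, b), decided by the sign of b - a.
    """
    if not compare:
        # Adder mode: the input multiplexer passes A unchanged.
        return (a + b) & ((1 << width) - 1)
    # Comparator mode: the input multiplexer supplies ~A and the carry
    # input is 1, so the same adder computes B - A in two's complement.
    wide_mask = (1 << (width + 1)) - 1
    diff = ((~a & wide_mask) + b + 1) & wide_mask
    negative = (diff >> width) & 1           # sign bit of B - A
    # Output multiplexer: B when B - A is negative, otherwise A.
    return b if negative else a
```

For example, `add_or_min(5, 3, compare=False)` yields the sum and `add_or_min(5, 3, compare=True)` yields the smaller operand, so the same hardware can serve either as an accumulator stage or as a stage of the minimum search.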

Fig. 6 shows the structure of the image absolute difference memory used to accumulate the pixel absolute differences in the field/frame generator (318). The image memory consists of as many memory cell columns as the bit width of the data. To explain the operating principle simply, only the memory cells (601), (602), (603), (604), the read column selectors (606), (607), the write column selectors (608), (609), and the accumulator (610) are shown. In the architecture of the present invention, which can compute every layer, the accumulation is carried out sequentially over adjacent pixel absolute difference groups. For example, for the single bit shown in the first column of Fig. 6, during one clock cycle the memory cell (601) and the memory cell (603) are selected by the same word line and perform the read and the write for adjacent pixel blocks. The value read from the memory cell (603) is fed to the input of the accumulator (610), and the output of the accumulator is written to the memory cell (603) through the write column selector (608).

Fig. 8 shows the structure of the device that computes the pixel absolute difference. The device consists of the inverters (801), (803), the adder (802), the simplified absolute value generator (804), and the output multiplexer (805). First, the adder (802) computes the difference between the current-frame pixel and the past-frame pixel, and its sign bit is examined. If the result is positive, it is output unchanged through the output multiplexer (805). If the output of the adder (802) is negative, the inverter (803) produces the complement of the result, a 1 is fed into the least significant carry bit of the absolute value generator (804), which is the CLA simplified by the method described earlier, and the resulting absolute value is output through the output multiplexer (805).

An embodiment of the data reuse scheme and of the shift register based data movement and computation scheme of the present invention is described for three-layer hierarchical search motion estimation, based on the structure of the device shown in Fig. 4a. The following series of tables shows, for each pixel absolute difference unit, the computation carried out at each clock. Fig. 9 shows the positions of the current-frame (CF) and past-frame (PF) pixels in two dimensions, with the origin at the upper left. In the timing diagram, the first row gives the number of each computing unit.

In addition, based on the structure of the apparatus shown in FIG. 4B, an embodiment of the data reuse scheme, the movement of data through shift registers, and the computation scheme according to the present invention is shown using full search motion estimation. The following series of tables shows an embodiment of the computation performed at each clock in each pixel absolute-difference computation unit. The embodiment is for full search motion estimation with a block size of 4 and a search area of 2, and FIG. 10 shows the positions of the current-frame (CF) and past-frame (PF) pixels in two dimensions, with the origin at the upper left. In the timing diagram, the first row gives the number of each pixel absolute-difference computation unit.
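As a purely functional reference for what such a full search computes, the following minimal sketch evaluates the sum of absolute differences (SAD) at every candidate displacement and keeps the smallest. It does not model the shift-register data flow or the clock-by-clock schedule of the tables. The function names, the list-of-lists frame layout, and the reading of "search area 2" as displacements from -2 to +2 in each direction are assumptions for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n=4, r=2):
    """Return (best_sad, dx, dy) over all displacements in [-r, r] x [-r, r]
    that keep the candidate block inside the reference frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With `n=4` and `r=2` this evaluates up to 25 candidate positions of 16 pixel differences each, which is the workload that the pixel absolute-difference computation units share.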

In addition, the computation schemes and apparatus proposed in the present invention, such as the data storage and movement method, are applicable to various motion estimation algorithms, numbers of layers, macroblock sizes, and sampling distances. The apparatus proposed in the present invention that can be switched between adder and comparator functions is applicable, beyond motion estimation apparatus, to any operation that finds a minimum or maximum value by comparison after accumulating sums. The image memory structure proposed in the present invention is applicable to any apparatus that requires simultaneous input and output. The absolute value calculating apparatus proposed in the present invention is applicable, beyond motion estimation, to any operation that requires absolute value calculation.
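The switchable adder/comparator can be illustrated with a short sketch that follows the structure recited in the claims: an input selector passes A or its negative, a single adder adds B, and an output selector returns either the sum or the minimum of A and B chosen from the sign of the adder output. The names `add_or_min` and `best_sad` are hypothetical, and unbounded Python integers stand in for fixed-width hardware words.

```python
def add_or_min(a: int, b: int, compare: bool) -> int:
    """One shared adder used either to add (a + b) or to select min(a, b)."""
    operand = -a if compare else a   # input value selection
    total = operand + b              # the single shared adder
    if not compare:
        return total                 # add mode: output the sum
    # Compare mode: total is b - a, so a negative sign means b is smaller.
    return b if total < 0 else a


def best_sad(candidates):
    """Accumulate each candidate's differences, then keep the smallest total,
    using the same unit for both the accumulation and the comparison."""
    best = None
    for diffs in candidates:
        total = 0
        for d in diffs:
            total = add_or_min(total, d, compare=False)
        best = total if best is None else add_or_min(best, total, compare=True)
    return best
```

Selecting a maximum instead of a minimum would only require inverting the choice made by the output selector.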

The computation method for high-speed real-time motion estimation according to the present invention, configured as described above, and the apparatus for it, solve the computation delay caused by inter-layer dependency. They are easy to implement in hardware through repeated use of a regular, identical hardware structure, and the efficient past-frame buffer reduces hardware idle time between consecutive image macroblocks. By allowing external image data to be input serially, they also reduce the number of input/output pins and eliminate the input/output bottleneck.

In addition, various motion estimation modes, including each motion estimation layer, block size, field and frame operation, and half-pixel operation, can be implemented on a unified architecture, and high-speed operation is possible. Greater reuse of input data minimizes both the number of input/output pins and the amount of hardware. Adjusting the external and internal clock speeds allows the number of input/output pins to be varied and allows operation with various external image memories. An image memory that performs input and output simultaneously can be implemented with only a small storage capacity. The problem of architectural performance being limited by a particular factor is resolved, so the invention is very useful in that it can uniformly accommodate various hierarchical search motion estimation algorithms, layer operations, operation speeds, hardware amounts, degrees of parallelism, macroblock sizes, and so on.

Claims

1. An image macroblock searching method using an inter-layer pipeline, for use in an image motion estimator of an image data processing system when searching consecutive image macroblocks k1, k2, ..., kn-1, kn, kn+1, ..., the method comprising: a first step of simultaneously performing, in an upper layer, an operation of searching for the input image macroblock kn in its associated search area and an operation of reading pixel data in the search area associated with the next image macroblock kn+1; a second step of simultaneously performing, in a lower layer, an operation of searching for the previous image macroblock kn-1 in the search area reduced by the search operation performed in the upper layer before the first step, and an operation of reading pixel data in the search area reduced in size by the search operation for the image macroblock kn in the first step; and a third step of searching, in the lower layer, for the image macroblock kn in the search area whose pixel data was read in the second step.

2. The method of claim 1, wherein the reading operation comprises: a storing step of storing the read pixel data in a storage means; and an input step of temporarily storing, during the search operation, the amount of initial data needed to search for the next image macroblock.

3. The method of claim 2, wherein the input step further comprises an interpolation step of generating half-pixel data from the stored pixel data when the lower layer is the lowest layer, and wherein the amount of the initial data is equal to the initial amount of the generated half-pixel data.

4. A pixel difference value calculating apparatus for image motion estimation performed in an image data processing system, comprising: unit difference calculating means for dividing the image macroblock of each layer into pixel blocks of a minimum unit and then sequentially summing and outputting the pixel differences between the pixel blocks; and adding means for adding the sums output from the unit difference calculating means to output the pixel difference value between the image macroblocks of each layer.

5. The apparatus of claim 4, wherein the unit difference calculating means comprises: a predetermined number of latches for storing, in one-to-one correspondence, the pixel data of a row or column of a current image macroblock of minimum size (MxM); a plurality of shift registers for storing, in one-to-one correspondence, the pixel data of rows or columns in the search area; and a predetermined number (kM, where k is a natural number satisfying 1 ≤ k ≤ M) of calculators for outputting absolute values of the pixel differences between the pixel data in the latches and the pixel data in the shift registers, wherein the calculators operate on the group of pixel data of a row or column in the latches together with the k successive, contiguous groups of pixel data in the shift registers that correspond to that group, and wherein, after each calculation by the calculators, the shift registers are shifted simultaneously so that data moves between positions a predetermined pixel distance (M) apart.

6. The apparatus of claim 4, wherein the unit difference calculating means comprises: a plurality of shift registers for storing groups of pixel data of rows or columns in a search area; and a plurality of calculators, each receiving a copy of an arbitrary pixel datum of the current image macroblock and calculating the absolute value of its difference from the pixel data in a respective one of the plurality of shift registers, wherein, after the calculation by the calculators, the shift registers shift to adjacent pixel data and the calculators repeat the calculation with the next pixel datum following said arbitrary pixel datum in the current image macroblock.

7. A switchable merged add-and-compare operation apparatus for an image motion estimator used in an image data processing system, comprising: input value selection means for selectively outputting, according to an external function selection signal, an arbitrary input value A or its corresponding negative value; adding means for adding the output value of the input value selection means to another arbitrary input value B and outputting the result; and output value selection means for selecting and outputting, on the basis of the external function selection signal, either the output value of the adding means or the minimum of the input value A and the other input value B as determined by the output value of the adding means.

8. An image memory of an image motion estimator used in an image data processing system, the image memory comprising a plurality of memory cells arranged in a matrix, wherein the memory cells of a series of adjacent rows or columns are selected by the same selection signal line, and the image memory has a connection structure that allows writing and reading of data bits to be performed simultaneously.

9. An image motion estimator for use in an image data processing system, comprising: a first adder for calculating and outputting a pixel difference between a current frame and a past frame of the image data; an inverter for obtaining the one's complement of a number output from the first adder; and a second adder that adds 1 to the obtained complement solely through its least-significant-bit carry input, thereby calculating and outputting the two's complement.