KR20030080958A

KR20030080958A - Global elimination algorithm for motion estimation and the hardware architecture thereof

Info

Publication number: KR20030080958A
Application number: KR1020020035608A
Authority: KR
Inventors: Liang-Gee Chen; Yu-Wen Huang; Shao-Yi Chien
Original assignee: National Taiwan University
Priority date: 2002-04-12
Filing date: 2002-06-25
Publication date: 2003-10-17
Also published as: TW526657B; JP2003319396A

Abstract

The present invention relates to a global elimination algorithm for motion estimation and to its hardware architecture. The algorithm effectively removes branches from the computational data flow, which makes the data flow smoother and better suited to hardware implementation. Because the processing time for each motion vector is fixed, no preliminary prediction is required. The elimination rate of search positions does not vary over time, so the elimination rate can be increased. The global elimination algorithm produces search results as highly accurate as those of the full-search block matching algorithm, and its peak signal-to-noise ratio (PSNR) is sometimes even higher than that of the full-search block matching algorithm. Compared with other hardware architectures based on the full-search block matching algorithm, the hardware architecture of the invention provides the highest computational capability per logic gate, while the power consumption of the logic gates is the lowest for the same motion vector throughput.

Description

GLOBAL ELIMINATION ALGORITHM FOR MOTION ESTIMATION AND THE HARDWARE ARCHITECTURE THEREOF

The present invention relates to a block-matching motion estimation algorithm for use in multimedia video compression systems, and more particularly to a highly efficient global elimination algorithm for motion estimation, and its hardware architecture, that reduces the temporal redundancy inherent in video sequences in order to achieve video compression.

With the rapid progress of video compression technology in the high-technology industry, the volume of data to be processed and the transmission quality of video sequences have become increasingly important. Because a video sequence requires a very large amount of storage, it is highly desirable to reduce the storage space it occupies. Video sequences must therefore be compressed, and video compression has become a basic building block of image processing systems. Video compression generally works by reducing the redundancy inherent in a video sequence. Motion estimation is a well-known video compression technique that serves this need.

A motion estimation algorithm generally finds the candidate block in a reference frame that best matches the current block in the current frame. Among the many motion estimation algorithms, the most widely used is the full-search block matching algorithm. It requires so much computation that current general-purpose microprocessors cannot handle it in real-time applications. Because its computational data flow is regular, various parallel or pipelined hardware architectures have been proposed for it. Unfortunately, among these architectures the 1-D array is too slow in terms of required clock cycles, so its operating frequency must be raised considerably for large frames and wide search ranges. The 2-D array needs fewer clock cycles than the 1-D array, but it uses too many logic gates, which makes it excessively costly. The tree architecture performs well in both speed and area, but it requires a wider memory bit width, which reduces its feasibility.

To reduce the heavy computation of the full-search block matching algorithm, the successive elimination algorithm (SEA), which obtains the same result as the full-search block matching algorithm, has been proposed. The SEA gives better results than other fast search algorithms that speed up the block search at the expense of peak signal-to-noise ratio (PSNR), such as the three-step search, the diamond search, or the 2-D logarithmic search. Fig. 1 shows the computational flow of the SEA. First, in step S10, the successive elimination value sea(m,n) of the search position is computed. Next, in step S12, sea(m,n) is compared with the current minimum sum of absolute differences, SAD_min, to determine whether sea(m,n) is greater than SAD_min. If sea(m,n) > SAD_min, the algorithm proceeds to step S14, skips the search position (m,n), and goes directly to step S22. If sea(m,n) < SAD_min, the algorithm proceeds to step S16 and computes the sum of absolute differences SAD(m,n) of the search position. Once SAD(m,n) is available, the algorithm proceeds to step S18 and compares SAD(m,n) with SAD_min. If SAD(m,n) > SAD_min, the algorithm proceeds to step S22; if SAD(m,n) < SAD_min, it proceeds to step S20, updates SAD_min, and then proceeds to step S22. In step S22, it is determined whether the current search position (m,n) is the last one. If yes, the position with the minimum SAD value has been found, and the algorithm proceeds to step S26, generates the estimated motion vector (MV), and terminates. If no, other positions remain to be searched, so the algorithm proceeds to step S24, updates (m,n) to the next search position, and returns to step S10 to repeat the above steps.
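The flow of Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`block_sum`, `sad`, `sea_search`), the list-of-lists image layout, the raster scan order, and the choice to evaluate the SAD when sea(m,n) equals SAD_min are all assumptions made for the example and are not taken from the patent.

```python
def block_sum(img, top, left, n):
    """Sum of the n x n block of img whose top-left pixel is (top, left)."""
    return sum(sum(row[left:left + n]) for row in img[top:top + n])


def sad(cur, ref, top, left):
    """Sum of absolute differences between cur and the same-sized block of ref."""
    n = len(cur)
    return sum(abs(cur[y][x] - ref[top + y][left + x])
               for y in range(n) for x in range(n))


def sea_search(cur, ref, p):
    """Successive elimination over displacements -p .. p-1 (hypothetical helper).

    ref is the (n + 2p - 1)-square search area; its top-left pixel corresponds
    to displacement (-p, -p). Returns (motion_vector, sad_min, sad_evaluations).
    """
    n = len(cur)
    k = block_sum(cur, 0, 0, n)                # pixel sum of the current block
    sad_min, best_mv, sad_count = float("inf"), None, 0
    for dn in range(-p, p):                    # raster scan (assumed order)
        for dm in range(-p, p):
            sea = abs(k - block_sum(ref, dn + p, dm + p, n))   # step S10
            if sea > sad_min:                  # step S12
                continue                       # step S14: skip this position
            s = sad(cur, ref, dn + p, dm + p)  # step S16
            sad_count += 1
            if s < sad_min:                    # step S18
                sad_min, best_mv = s, (dm, dn) # step S20
    return best_mv, sad_min, sad_count         # step S26
```

The `continue` statement is the data-dependent branch discussed in the text: how many SAD evaluations are performed depends on the image content and on the scan order, so the run time per block is not fixed.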

After the sea value of each search position has been computed, a branch that cannot be predicted in advance may occur in the computational flow, which makes the computational data flow highly irregular. As a result, a systolic array architecture cannot be used in the hardware design. The multilevel successive elimination algorithm was developed later, but it still suffers from the same problem.

In addition, the successive elimination algorithm must perform a preliminary prediction of the motion vector (MV) in order to reduce the amount of computation effectively. However, preliminary prediction of motion vectors is very difficult in regions with irregular motion. Furthermore, when the actual motion vector lies outside the search range, the elimination rate of search positions becomes so low that the successive elimination algorithm takes longer than the full-search block matching algorithm. Also, to increase the number of SAD computations that can be eliminated, the successive elimination algorithm typically uses a spiral scan to set the priority of the search positions. Under this condition, the hardware circuit generally costs more than one that uses the conventional raster scan.

It is therefore desirable to provide a global elimination algorithm, and a hardware architecture for it, that effectively overcome the drawbacks of the conventional successive elimination algorithm.

An object of the present invention is to provide a global elimination algorithm for motion estimation, and its hardware architecture, that suitably remove the branches in the computational data flow so that the data flow is more regular, smoother, and better suited to hardware implementation.

Another object of the present invention is to provide a global elimination algorithm that is highly reliable and sometimes achieves a better peak signal-to-noise ratio (PSNR), and whose search results are highly similar to those of the full-search block matching algorithm.

A further object of the present invention is to provide a hardware architecture for the global elimination algorithm for motion estimation in which the computational capability per logic gate is the highest among architectures based on the full-search block matching algorithm, while the power consumption of the logic gates is the lowest for the same motion vector throughput.

Yet another object of the present invention is to provide a global elimination algorithm for motion estimation, and its hardware architecture, that support the advanced prediction mode.

Fig. 1 is a diagram showing the computational flow of the conventional successive elimination algorithm.

Fig. 2 is a flowchart showing the global elimination algorithm according to the present invention.

Fig. 3 is a diagram showing the percentage of identical motion vectors between the global elimination algorithm and the full-search block matching algorithm for the Mobile Calendar CIF video sequence.

Fig. 4 is a diagram showing the PSNR curves of the global elimination algorithm and the full-search block matching algorithm for the Mobile Calendar CIF video sequence.

Fig. 5 is a diagram showing the hardware architecture of the present invention.

Fig. 6 is a diagram showing the structure of the systolic module according to the present invention.

Fig. 7 is a diagram showing the structure of the parallel adder tree according to the present invention.

Fig. 8 is a diagram showing the structure of the parallel comparator tree according to the present invention.

Fig. 9 is a diagram showing how the hardware architecture according to the present invention can support the advanced prediction mode.

<Explanation of reference numerals for the main parts of the drawings>

10: systolic module

12: SAD tree

14: parallel comparator tree

18: control unit

20: multiplexer (MUX)

22, 24: MUX networks

To achieve these objects, the present invention proposes a global elimination algorithm for motion estimation comprising the steps of: representing the current block in the current frame, and the candidate blocks in the reference frame at each search position, by coarse patterns; comparing the coarse pattern of the current block with the coarse patterns of the candidate blocks; finding the M candidate blocks whose coarse patterns are most similar to that of the current block; comparing the fine patterns of the M candidate blocks with the fine pattern of the current block; and selecting, among the M candidate blocks, the candidate block whose fine-pattern difference is the smallest.

Another embodiment of the present invention relates to a hardware architecture for performing the global elimination algorithm for motion estimation, comprising: (1) a systolic module that computes the coarse patterns of the respective sub-blocks in parallel; (2) an adder tree that compares the coarse pattern of the current block with the coarse pattern of each candidate block, and that can be reused to compare the fine pattern of the current block with the fine pattern of each candidate block; (3) at least one comparator tree that finds the M candidate blocks whose coarse patterns are most similar to that of the current block; (4) a control device that controls the operation of the systolic module, the adder tree, and the comparator tree; and (5) at least one memory that stores the data of the current block and of the candidate blocks.

The present invention will become clearer from the following detailed description taken with reference to the accompanying drawings.

(Detailed Description of the Preferred Embodiments)

Those skilled in the art already understand that motion estimation is an important component of video compression technology and is applicable to multimedia electronic products such as digital camcorders. The present invention relates to a novel global elimination algorithm for motion estimation, and its hardware architecture, that suitably reduce the branches in the computational data flow so that the data flow is more regular and better suited to hardware implementation. The invention offers high reliability, high computation speed, and high efficiency, while overcoming the drawbacks of the conventional multilevel successive elimination algorithm.

Fig. 2 is a flowchart showing the global elimination algorithm according to the present invention. As Fig. 2 shows, the algorithm first computes, in step S30, the multilevel successive elimination value msea(m,n) of the search position. In step S32, it is determined whether the search position (m,n) is the last one. If it is not, the algorithm proceeds to step S34, updates (m,n) to the next search position, and returns to step S30 to repeat the above steps. In step S34, the order in which the search positions are visited can be chosen arbitrarily and does not affect the final search result, so the conventional raster scan can be used. If the search position (m,n) is the last one, the algorithm proceeds directly to step S36. Assuming a search range of -p to p-1, there are (2p)² search positions in total; in step S36, the M search positions with the smallest msea values are found and the other (2p)² - M search positions are eliminated. After step S36, the algorithm proceeds to step S38 and computes the sum of absolute differences SAD(m,n) for each of the M remaining search positions. Finally, the algorithm proceeds to step S40 and selects the minimum among the SAD values of the M search positions. The search position with the minimum SAD value gives the motion vector estimated by the global elimination algorithm.
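The steps of Fig. 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `subblock_sums`, `sad`, `gea_search`, and `m_keep`, the list-of-lists image layout, the use of `heapq.nsmallest` for step S36, and the tie-breaking by displacement are all choices made for the example.

```python
import heapq


def subblock_sums(img, top, left, n, s):
    """Pixel sums of the s x s sub-blocks of the n x n block at (top, left)."""
    return [sum(img[top + by + y][left + bx + x]
                for y in range(s) for x in range(s))
            for by in range(0, n, s) for bx in range(0, n, s)]


def sad(cur, ref, top, left):
    """Sum of absolute differences between cur and the same-sized block of ref."""
    n = len(cur)
    return sum(abs(cur[y][x] - ref[top + y][left + x])
               for y in range(n) for x in range(n))


def gea_search(cur, ref, p, m_keep, s=4):
    """Global elimination over displacements -p .. p-1 (hypothetical helper).

    ref is the (n + 2p - 1)-square search area; m_keep (M in the text) must be
    at least 1. Returns (motion_vector, sad_of_best_candidate).
    """
    n = len(cur)
    cur_sums = subblock_sums(cur, 0, 0, n, s)
    scored = []
    # Steps S30 to S34: msea for every search position, with no early exit.
    for dn in range(-p, p):
        for dm in range(-p, p):
            cand_sums = subblock_sums(ref, dn + p, dm + p, n, s)
            msea = sum(abs(a - b) for a, b in zip(cur_sums, cand_sums))
            scored.append((msea, dm, dn))
    # Step S36: keep the M positions with the smallest msea values.
    survivors = heapq.nsmallest(m_keep, scored)
    # Steps S38 and S40: SAD for the survivors only, then take the minimum.
    best = min((sad(cur, ref, dn + p, dm + p), dm, dn)
               for _, dm, dn in survivors)
    return (best[1], best[2]), best[0]
```

Unlike the loop of Fig. 1, the first loop here contains no data-dependent branch, and exactly `m_keep` SAD evaluations are performed regardless of image content, which is why the processing time per motion vector is fixed.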

The reason the algorithm is called "global elimination" can be understood from step S32 of Fig. 2. Unlike the multilevel successive elimination algorithm, which examines the search positions one by one to decide whether each can be eliminated, the global elimination algorithm decides which search positions to eliminate only after the msea values (multilevel successive elimination values) of all search positions have been computed. While the msea value of each search position is being computed, the computation follows the right branch, and the computational data flow is continuous and regular. A systolic array architecture can therefore be used in the hardware design.

The choice of M is a trade-off between computation speed and encoding efficiency. M is preferably chosen within a range such as 1 to 63. In general, the larger M is, the slower the computation and the higher the encoding efficiency; conversely, the smaller M is, the faster the computation and the lower the encoding efficiency. Whatever the value of M, the processing time required for each motion vector is fixed and predictable, which is advantageous for task scheduling in a hardware-based encoding system.

Unlike the multilevel successive elimination algorithm, the global elimination algorithm cannot guarantee that its search result is 100% identical to that of the full-search block matching algorithm; nevertheless, it is still highly reliable. The invention was tested extensively under two common conditions. The first condition uses QCIF (176 × 144) frames with 16 × 16 blocks, a search range of -16 to +15, third-level msea values, and M = 7, so that 99.31% of the search positions skip the SAD computation. The second condition uses CIF (352 × 288) frames with 16 × 16 blocks, a search range of -32 to +31, third-level msea values, and M = 7, so that 99.83% of the search positions skip the SAD computation. The remaining results are shown in Table 1. The tests were run on a number of standard test video sequences, and they show that the average PSNR of frames compensated with the global elimination algorithm is very close to that obtained with the full-search block matching algorithm. The largest difference, which is still minor, occurs for Hall Monitor in CIF format, where the global elimination algorithm is 0.08 dB lower than the full-search block matching algorithm. Moreover, the average PSNR obtained with the global elimination algorithm is sometimes higher than that obtained with the full-search block matching algorithm, for example for Foreman QCIF, Silent QCIF, and Table Tennis QCIF. It would be wrong to assume that the average PSNR of the full-search block matching algorithm is the maximum attainable, because the minimum SAD does not guarantee the minimum mean square error; for example, 1 + 9 < 5 + 6 while 1² + 9² > 5² + 6². In most cases the result of the global elimination algorithm is very close to that of the full-search block matching algorithm, as can best be seen from Figs. 3 and 4. Fig. 3 shows the percentage of motion vectors that are identical between the global elimination algorithm and the full-search block matching algorithm for the Mobile Calendar CIF video sequence; on average over 300 frames, 98.1% of the motion vectors are identical. Fig. 4 shows the PSNR curves of the two algorithms for the same sequence. The two curves are so close to each other that they are hard to tell apart. The statistics listed in the table and shown in the figures thus indicate that the global elimination algorithm according to the present invention is highly reliable.
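Two of the numerical points in this passage can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper name `skipped_fraction` and the variable names are assumptions introduced for the example. The skipped fraction follows from keeping M of the (2p)² search positions, and the second part reproduces the example showing that a smaller SAD does not imply a smaller squared error.

```python
def skipped_fraction(p, m_keep):
    """Fraction of the (2p)**2 search positions whose SAD is never computed."""
    total = (2 * p) ** 2
    return (total - m_keep) / total


# Search range -16..+15 (p = 16) and -32..+31 (p = 32), both with M = 7.
print(skipped_fraction(16, 7))   # 1017/1024, about 0.9932
print(skipped_fraction(32, 7))   # 4089/4096, about 0.9983

# Minimum SAD does not guarantee minimum squared error.
errors_a = [1, 9]
errors_b = [5, 6]
sad_a, sad_b = sum(errors_a), sum(errors_b)   # 10 and 11
sse_a = sum(e * e for e in errors_a)          # 82
sse_b = sum(e * e for e in errors_b)          # 61
print(sad_a < sad_b, sse_a > sse_b)           # True True
```

The computed fractions agree with the quoted 99.31% and 99.83% up to rounding in the last digit.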

Table 1 (unit: dB)

                          (a)                                (b)
Standard video sequence   Full search   Global elimination   Full search   Global elimination
Coastguard                32.93         32.93                31.59         31.55
Container                 43.11         43.11                38.53         38.53
Foreman                   32.21         32.22                32.85         32.82
Hall Monitor              32.98         32.97                34.90         34.82
Mobile Calendar           26.15         26.15                25.20         25.16
Silent                    35.14         35.16                36.12         36.11
Stefan                    24.71         24.67                25.73         25.71
Table Tennis              32.10         32.11                33.03         32.96
Weather                   38.42         38.42                37.45         37.45

Having described the global elimination algorithm according to the present invention, the corresponding hardware architecture is now described in more detail. With reference to Fig. 5, the invention is described using as an example a block size of 16 × 16, third-level msea values, and M = 7, in sufficient detail for those skilled in the art to practice the invention from the embodiment disclosed herein. As shown in Fig. 5, the hardware architecture for the motion estimation algorithm includes a systolic module 10, a parallel adder tree 12, a parallel comparator tree 14, a control device for controlling the operation of each element, a memory 16 for storing the candidate blocks in the reference frame, and a memory 16' for storing the current block in the current frame. The control device includes a control unit 18 and a control circuit composed of a multiplexer (MUX) 20, MUX network 1 (22), and MUX network 2 (24).

As shown in Fig. 5, the systolic module 10 computes the sums of pixel intensities in the 16 sub-blocks of a 16 × 16 block, i.e., the coarse pattern, and outputs the results in parallel in the same cycle. Fig. 6 shows the computational data flow in the systolic module 10, where C_{l,k} and S_{l,k} denote the current block data c(k,l) and the search area data s(k,l), respectively. The rectangles in Fig. 6 are shift registers 26, and the search range is set to, for example, -16 to +15. Block data are loaded into the systolic module 10 in parallel, one column at a time. For t = 0 to 15, the current block data are loaded into the systolic module 10. At t = 15, the sums of pixel intensities in each of the sixteen 4 × 4 sub-blocks of the 16 × 16 current block (denoted sum_00 to sum_33 and csum_00 to csum_33 in Fig. 6) are computed, and at t = 16 they are stored in sixteen 12-bit registers on the rising edge of the clock. The search area data are then loaded into the systolic module 10. For t = 16 to 62, the candidate blocks at search positions (-16,-16) to (+15,-16) are loaded, and for t = 31 to 62 the sums of pixel intensities in each of the 16 sub-blocks of the candidate blocks at search positions (-16,-16) to (+15,-16) (denoted sum_00 to sum_33 and rsum_00 to rsum_33 in Fig. 6) are computed. The next row of search positions is processed in the same way: for t = 63 to 109, the candidate block data at search positions (-16,-15) to (+15,-15) are loaded, and for t = 78 to 109 the sums of pixel intensities in each of the 16 sub-blocks of the candidate blocks at search positions (-16,-15) to (+15,-15) are computed. It follows that each row of search positions requires (2p + N - 1) clock cycles to compute the sums of pixel intensities, in addition to the N clock cycles needed once to load the current block data. The systolic module 10 therefore requires N + 2p(2p + N - 1) clock cycles to compute the sub-block pixel-intensity sums (coarse patterns) of all blocks.
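The cycle count can be worked through for the example values N = 16 and p = 16. The sketch below is only an arithmetic illustration; the function names `systolic_cycles` and `row_schedule` are hypothetical, and the schedule assumes that the first sub-block sums of a row become available N - 1 cycles after loading of that row starts, as in the t = 16 to 62 and t = 31 to 62 example for the first row.

```python
def systolic_cycles(n, p):
    """Total clock cycles: N to load the current block, then 2p rows of
    search positions at (2p + N - 1) cycles each."""
    return n + 2 * p * (2 * p + n - 1)


def row_schedule(n, p, row):
    """Clock indices (load_start, first_sum, last) for search row `row` (0-based)."""
    per_row = 2 * p + n - 1
    start = n + row * per_row
    return start, start + n - 1, start + per_row - 1


print(systolic_cycles(16, 16))   # 1520
print(row_schedule(16, 16, 0))   # (16, 31, 62)
print(row_schedule(16, 16, 1))   # (63, 78, 109)
```

Under these assumptions the last row ends at t = 1519, which is consistent with the total of 1520 cycles counted from t = 0.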

The sub-block pixel-intensity sums computed by the systolic module 10 are sent to the parallel adder tree 12. Referring to Figs. 6 and 7, the parallel adder tree 12 computes the msea value according to Equation 1 below.
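Equation 1 itself is not reproduced in this text. Based on the definitions given in the paragraph that follows, it presumably has the form below; the exact notation is an assumption.

```latex
\begin{aligned}
\mathrm{sea}(m,n)  &= \bigl| K - SB(m,n) \bigr| \\
\mathrm{msea}(m,n) &= \sum_{q=1}^{L} \bigl| K_q - SB_q(m,n) \bigr|
\end{aligned}
```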

In Equation 1, K denotes the sum of the pixels in the current block, and SB(m,n) denotes the sum of the pixels in the candidate block at search position (m,n). The absolute difference between K and SB is the sea value, which is, strictly speaking, the first-level msea value. When a block is divided into L sub-blocks, K_q denotes the pixel sum of the q-th sub-block of the current block, SB_q(m,n) denotes the pixel sum of the q-th sub-block of the candidate block at search position (m,n), and the msea value is obtained by summing the L absolute differences between K_q and SB_q. Dividing a block into 4^(Level-1) sub-blocks of equal size is called successive elimination at level Level. For example, level-3 successive elimination divides a 16 × 16 block into sixteen 4 × 4 sub-blocks. As shown in Fig. 7, each element labeled ADxx computes the absolute difference between csum_xx, the pixel-intensity sum of a sub-block of the current block, and rsum_xx, the pixel-intensity sum of the corresponding sub-block of the candidate block. The parallel adder tree 12 adds the results of AD00 to AD33 to obtain the msea value.
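The definitions above can be illustrated with a short sketch. The function names and the list-of-rows block layout are assumptions for the example; the inequality checked at the end is the standard property that makes msea usable as a lower bound on the SAD.

```python
def sub_block_sums(block, sub):
    """Return the pixel sums of all sub x sub sub-blocks, in raster order."""
    n = len(block)
    g = n // sub
    sums = [0] * (g * g)
    for r in range(n):
        for c in range(n):
            sums[(r // sub) * g + (c // sub)] += block[r][c]
    return sums


def msea(cur, cand, level):
    """Sum of |K_q - SB_q| over the 4**(level - 1) sub-blocks."""
    per_side = 2 ** (level - 1)   # 4**(level - 1) sub-blocks in total
    sub = len(cur) // per_side    # side length of one sub-block
    k = sub_block_sums(cur, sub)
    sb = sub_block_sums(cand, sub)
    return sum(abs(a - b) for a, b in zip(k, sb))


def sad(cur, cand):
    """Sum of absolute pixel differences (the fine pattern difference)."""
    return sum(abs(a - b) for ra, rb in zip(cur, cand) for a, b in zip(ra, rb))


# Two arbitrary 16 x 16 test blocks (made-up data).
cur = [[(3 * r + 5 * c) % 256 for c in range(16)] for r in range(16)]
cand = [[(7 * r + 2 * c + 11) % 256 for c in range(16)] for r in range(16)]

# Level 1 is the sea value; finer levels can only tighten the bound on SAD.
assert msea(cur, cand, 1) <= msea(cur, cand, 3) <= sad(cur, cand)
```

With level = 3 and a 16 × 16 block, `msea` sums sixteen absolute differences, matching the outputs AD00 to AD33 that the adder tree accumulates.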

As the msea value of each candidate block is obtained in turn, it is fed into the parallel comparator tree 14 to find the M search positions with the smallest msea values. The parallel comparator tree 14 keeps the smallest msea values found so far, together with the corresponding motion vectors, in registers. If the input msea value is less than or equal to at least one of the M stored msea values, the largest stored msea value is replaced by the input value. If two or more of the M stored msea values equal the maximum, only one of them may be replaced by the input value.

Fig. 8 shows the circuit of the parallel comparator tree according to the present invention, where elements labeled "_reg" are shift registers and elements labeled "MAX" are comparators. In the part of the circuit shown in Fig. 8(a), the registers msea1_reg to msea7_reg are initialized to 0xFFFF (65535) before any valid msea value arrives from the parallel adder tree 12. This part of the circuit computes msea_max, the maximum of the registers msea_in_reg and msea1_reg to msea7_reg; each comparator MAX outputs the larger of its two inputs. In the circuit shown in Fig. 8(b), the elements EQU_x compare the registers mseax_reg (x = 1 to 7) with msea_max, and the CHECK circuit selects just one register when two or more registers mseax_reg all hold the maximum value msea_max. In other words, while the replace signal replace_x is active, the registers mseax_reg and mvx_reg must be overwritten with msea_in_reg and mv_in_reg, respectively, and only one replace signal replace_x may be active at a time. The part of the parallel comparator tree circuit shown in Fig. 8(c) performs the replacement, where each element MUX is a multiplexer controlled by the replace signal replace_x.
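The replacement rule of the comparator tree can be modeled behaviorally as follows. This is a software sketch under assumptions, not the circuit itself: the class name `ComparatorTree` is hypothetical, and the tie-break (lowest-numbered register wins) is one possible choice for what the CHECK circuit does.

```python
INIT = 0xFFFF  # initial register value, 65535


class ComparatorTree:
    """Keeps the M smallest msea values seen so far with their motion vectors."""

    def __init__(self, m=7):
        self.msea = [INIT] * m   # msea1_reg .. mseaM_reg
        self.mv = [None] * m     # mv1_reg .. mvM_reg

    def push(self, msea_in, mv_in):
        """Feed one msea value; return the index replaced, or None."""
        # Fig. 8(a): maximum over the input and all stored values.
        msea_max = max([msea_in] + self.msea)
        # Fig. 8(b): find registers equal to the maximum and pick exactly one
        # (here the lowest index, which is an assumption).
        for x, value in enumerate(self.msea):
            if value == msea_max:
                # Fig. 8(c): overwrite that register pair with the input.
                self.msea[x] = msea_in
                self.mv[x] = mv_in
                return x
        return None  # the input itself was the strict maximum: keep registers


tree = ComparatorTree(m=3)
for i, value in enumerate([50, 10, 40, 30, 20, 60, 5, 70, 1]):
    tree.push(value, mv_in=i)
assert sorted(tree.msea) == [1, 5, 10]
```

Because the registers start at 0xFFFF, the first M valid inputs simply fill the registers, and every later input displaces the current maximum only when it is no larger than that maximum.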

In this way, the M smallest msea values and their corresponding motion vectors are held in the registers at all times. Once the msea values of all search positions (candidate blocks) have been fed into the parallel comparator tree 14, the registers contain the M smallest msea values among the (2p)² search positions and the corresponding motion vectors. The SAD values at these M search positions are then computed, the minimum is found, and its motion vector is output, which completes the motion estimation. Note that when the data for each row of search positions is fed into the systolic module 10, the msea values produced by the parallel adder tree 12 during the first (N - 1) clock cycles are invalid. The msea values fed to the parallel comparator tree 14 during those cycles must therefore be replaced by 0xFFFF (65535) to produce a correct result.
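The two-stage search described above (coarse elimination by msea, then SAD on the M survivors) can be sketched end to end as follows. All names are hypothetical, `heapq.nsmallest` stands in for the comparator tree, and the small block size and search range are chosen only to keep the example short.

```python
import heapq
import random


def block_at(frame, top, left, n):
    """Cut an n x n block out of a frame given as a list of rows."""
    return [row[left:left + n] for row in frame[top:top + n]]


def sub_sums(block, sub):
    g = len(block) // sub
    sums = [0] * (g * g)
    for r, row in enumerate(block):
        for c, pixel in enumerate(row):
            sums[(r // sub) * g + (c // sub)] += pixel
    return sums


def msea(cur, cand, sub):
    return sum(abs(a - b) for a, b in zip(sub_sums(cur, sub), sub_sums(cand, sub)))


def sad(cur, cand):
    return sum(abs(a - b) for ra, rb in zip(cur, cand) for a, b in zip(ra, rb))


def gea_search(cur, ref, n, p, m, sub=4):
    """ref is the (2p + n - 1)-square search area; returns (SAD, (dx, dy))."""
    coarse = []
    for dy in range(-p, p):
        for dx in range(-p, p):
            cand = block_at(ref, dy + p, dx + p, n)
            coarse.append((msea(cur, cand, sub), (dx, dy)))
    survivors = heapq.nsmallest(m, coarse)          # M smallest msea values
    return min((sad(cur, block_at(ref, dy + p, dx + p, n)), (dx, dy))
               for _, (dx, dy) in survivors)         # fine comparison


# Made-up data: plant the current block at displacement (dx, dy) = (1, -1).
n, p, m = 8, 2, 3
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(2 * p + n - 1)]
       for _ in range(2 * p + n - 1)]
cur = block_at(ref, -1 + p, 1 + p, n)
best = gea_search(cur, ref, n, p, m)
print(best)
```

Every search position goes through the same fixed amount of work in the coarse stage, which is the property the text relies on for a branch-free data flow.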

To output the column data of the candidate blocks in parallel, the hardware must operate as follows. The data within the search range comprises (2p + N - 1) rows in total. According to the present invention, the rows are numbered from 0 to (2p + N - 2), and, as shown in Fig. 5, a row whose number leaves a remainder of 0 when divided by N is stored in RAM00 of the memory 16, a row whose number leaves a remainder of 1 is stored in RAM01, and so on. Column data can thus be output in parallel by N RAM modules driven by N appropriate addresses. As for the current block, its column data is stored in a separate 128-bit (assuming N = 16) memory 16' so that it too can be output in parallel. When the column data of the candidate blocks is output, it must pass through multiplexer network 1 (22) so that it reaches the correct sub-blocks before being fed into the systolic module 10. For N = 16 and Level = 3, multiplexer network 1 (22) comprises sixteen 4-to-1 8-bit multiplexers. The control signals of multiplexer network 1 (22) must be adjusted appropriately for each row of search positions.
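The row-to-RAM assignment can be checked with a small sketch. The function names are hypothetical, and the in-bank address `row // n` is an assumption, since the text only says that "appropriate addresses" are used. The point illustrated is that any N consecutive rows fall into N different RAM modules, so one column of a candidate block can be read in a single access.

```python
def ram_bank(row, n=16):
    """RAM module index for a search-area row: RAM00 for remainder 0, and so on."""
    return row % n


def ram_address(row, n=16):
    """Row slot inside the bank (assumed layout, not stated in the text)."""
    return row // n


def banks_for_column(first_row, n=16):
    """Banks touched when reading one column of a candidate block."""
    return [ram_bank(first_row + i, n) for i in range(n)]


p, n = 16, 16
for first_row in range(2 * p):            # one entry per row of search positions
    banks = banks_for_column(first_row, n)
    assert sorted(banks) == list(range(n))  # no two rows share a RAM module
```

The bank order is rotated by `first_row % n` from one row of search positions to the next, which is consistent with the requirement that the multiplexer control signals change per row.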

Similarly, while the SAD values of the M search positions are being computed, the candidate block data must pass through multiplexer network 2 (24), which consists of sixteen 16-to-1 8-bit multiplexers, before being fed into the parallel adder tree 12. The control signals of multiplexer network 2 (24) must be adjusted for each row of search positions. The present invention thus requires N + 2p(2p + N - 1) clock cycles to find the M search positions with the smallest msea values. The resources of the parallel adder tree 12 can be reused to compute the SAD values at these M search positions. Each search position requires N clock cycles for its SAD value, so the M search positions require (M × N) clock cycles in total. In conclusion, taking N = 16 and Level = 3 as an example, the hardware architecture according to the present invention requires N + 2p(2p + N - 1) + (M × N) clock cycles to compute one motion vector.
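The cycle count above is easy to evaluate numerically. The sketch below uses hypothetical function names; the CIF figure of 396 macroblocks per frame (352 × 288 pixels in 16 × 16 blocks) is standard, and the frequency helper simply multiplies cycles per motion vector by motion vectors per second.

```python
def gea_cycles(n, p, m):
    """Clock cycles per motion vector: N + 2p(2p + N - 1) + M*N."""
    load_current = n                      # load the current block
    coarse = 2 * p * (2 * p + n - 1)      # 2p rows of search positions
    fine = m * n                          # SAD for the M surviving positions
    return load_current + coarse + fine


def required_mhz(cycles_per_mv, blocks_per_frame=396, fps=30):
    """Clock frequency needed to sustain CIF at 30 frames per second."""
    return cycles_per_mv * blocks_per_frame * fps / 1e6


print(gea_cycles(16, 16, 7))   # search range -16 to +15
print(gea_cycles(16, 32, 7))   # search range -32 to +31
print(round(required_mhz(1635), 2))
```

For p = 16 and p = 32 the formula gives 1632 and 5184 cycles, while the comparison tables list 1635 and 5187 for the present invention. The text does not explain the three extra cycles; attributing them to pipeline latency would be a guess.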

The spirit and principles of the present invention have been described above. A specific experimental embodiment is presented below to verify these principles and effects. To analyze its performance, the hardware architecture according to the present invention is compared with hardware architectures based on the full-search block-matching algorithm, taken from references [1] to [7] listed at the end of this specification. The results are shown in Tables 2 and 3: Table 2 compares the architectures for a 16 × 16 block, a search range of -16 to +15, Level = 3 and M = 7, and Table 3 compares them for a 16 × 16 block, a search range of -32 to +31, Level = 3 and M = 7.

The comparison is made on the processing-element arrays; the control circuits are not implemented in hardware, because they account for an insignificant part of these architectures. The processing-element arrays are synthesized with SYNOPSYS Design Analyzer using the AVANT! 0.35 µm cell library, and the critical path constraint is set to 20 ns, so the circuits can operate at 50 MHz or more. The architectures marked with an asterisk (*) in Tables 2 and 3 need, in addition to the processing elements, a large amount of extra logic, mostly shift registers, to increase data reuse. As a result, the actual gate count and power consumption of these architectures are much higher than the simulated figures. Note that in Tables 2 and 3 the memory, multiplexer network 2 and the control unit are not covered by the simulation, whereas the other components are. In addition, the three-stage pipeline is not suited to simulation.

To compare these hardware architectures fairly, they must be compared at the same throughput (motion vectors per second). The normalized processing capability per gate (NPCPG) and the normalized power (NP) are therefore defined by Equations 2 and 3 below, respectively.

In general, the 1-D array architectures are not fast enough in terms of the clock cycles they require, so their operating frequency must be raised for large-frame and wide-search-range applications. The 2-D array architectures are faster than the 1-D array architectures, but they need many logic gates and are expensive. Although the architecture of reference [6] is a kind of 1-D array architecture, it uses data interlacing and two-dimensional data reuse and therefore has the same problem as the 2-D array architectures, namely a large number of logic gates. The tree architecture performs well in both speed and area, but the memory bit width it requires is very large, which makes it impractical. The hardware architecture according to the present invention is somewhat slower than the 2-D array and tree architectures (although the architecture of reference [3] is slower than the present invention), but it needs far fewer logic gates than those architectures. Compared with the 1-D array architectures, the present invention is much faster, and for a wide search range the 1-D array architectures need far more logic gates than the present invention. In fact, the present invention outperforms the other architectures in both normalized processing capability per gate and normalized power.

Table 2

| Architecture | Description | Number of PEs | Cycles per MV | Required memory I/O | Required frequency for CIF 30 fps | Gate count @ 50 MHz | NPCPG | Gate-level power @ 50 MHz | NP |
|---|---|---|---|---|---|---|---|---|---|
| [1] Yang | 1-D semi-systolic | 32 | 8192 | 24 bits | 97.32 MHz | 28.0K | 0.13 | 26.0 mW | 2.99 |
| [2] AB1 | 1-D systolic | 16 | 24064 | 256 bits | 285.88 MHz | 3.8K | 0.32 | 11.7 mW | 3.95 |
| [2] AB2 | 2-D systolic | 256 | 1504 | 128 bits | 17.87 MHz | 95.1K | 0.20 | 227.8 mW | 4.82 |
| [3] Hsieh* | 2-D systolic | 256 | 2209 | 8 bits | 26.24 MHz | 100.6K | 0.13 | 147.2 mW | 4.57 |
| [4] Tree | Tree | 256 | 1024 | 2048 bits | 12.17 MHz | 56.1K | 0.51 | 179.5 mW | 2.59 |
| [5] Yeo | 2-D semi-systolic | 1024 | 256 | 24 bits | 3.04 MHz | 447.4K | 0.26 | 1052.6 mW | 3.79 |
| [6] Lai | 1-D semi-systolic | 1024 | 256 | 24 bits | 3.04 MHz | 387.6K | 0.30 | 845.6 mW | 3.04 |
| [7] SA* | 2-D systolic | 256 | 1024 | 16 bits | 12.17 MHz | 126.5K | 0.23 | 258.0 mW | 3.72 |
| [7] SSA* | 2-D semi-systolic | 256 | 1024 | 16 bits | 12.17 MHz | 106.0K | 0.27 | 280.1 mW | 4.04 |
| Present invention | GEA-based | 16 | 1635 | 256 bits | 19.42 MHz | 17.9K | 1.00 | 43.4 mW | 1.00 |

Table 3

| Architecture | Description | Number of PEs | Cycles per MV | Required memory I/O | Required frequency for CIF 30 fps | Gate count @ 50 MHz | NPCPG | Gate-level power @ 50 MHz | NP |
|---|---|---|---|---|---|---|---|---|---|
| [1] Yang | 1-D semi-systolic | 32 | 16384 | 24 bits | 194.64 MHz | 56.0K | 0.10 | 52.0 mW | 3.78 |
| [2] AB1 | 1-D systolic | 16 | 80896 | 256 bits | 961.04 MHz | 3.8K | 0.30 | 11.7 mW | 4.20 |
| [2] AB2 | 2-D systolic | 256 | 5056 | 128 bits | 60.07 MHz | 95.1K | 0.19 | 227.8 mW | 5.12 |
| [3] Hsieh* | 2-D systolic | 256 | 6241 | 8 bits | 74.14 MHz | 100.6K | 0.15 | 147.2 mW | 4.08 |
| [4] Tree | Tree | 256 | 4096 | 2048 bits | 48.66 MHz | 56.1K | 0.40 | 179.5 mW | 3.27 |
| [5] Yeo | 2-D semi-systolic | 1024 | 256 | 24 bits | 3.04 MHz | 1790.0K | 0.20 | 4210.3 mW | 4.79 |
| [6] Lai | 1-D semi-systolic | 1024 | 256 | 24 bits | 3.04 MHz | 1550.4K | 0.23 | 3382.4 mW | 3.84 |
| [7] SA* | 2-D systolic | 256 | 4096 | 16 bits | 48.66 MHz | 126.5K | 0.18 | 258.0 mW | 4.69 |
| [7] SSA* | 2-D semi-systolic | 256 | 4096 | 16 bits | 48.66 MHz | 106.0K | 0.21 | 280.1 mW | 5.09 |
| Present invention | GEA-based | 16 | 5187 | 256 bits | 61.62 MHz | 17.9K | 1.00 | 43.4 mW | 1.00 |
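Equations 2 and 3 are not reproduced in this text. The tabulated NPCPG and NP values are, however, consistent to within rounding with the following reading, which is an inference from the numbers and not the patent's stated definition: NPCPG is the inverse of (cycles per motion vector × gate count), and NP is (cycles per motion vector × power), each normalized to the present invention. The sketch below checks this against a few rows for the -16 to +15 search range; the function names are hypothetical.

```python
# Figures for the present invention, search range -16 to +15.
INVENTION = {"cycles": 1635, "gates_k": 17.9, "power_mw": 43.4}


def npcpg(cycles, gates_k, ref=INVENTION):
    """Assumed form: throughput per gate relative to the present invention."""
    return (ref["cycles"] * ref["gates_k"]) / (cycles * gates_k)


def normalized_power(cycles, power_mw, ref=INVENTION):
    """Assumed form: power at equal throughput relative to the invention."""
    return (cycles * power_mw) / (ref["cycles"] * ref["power_mw"])


# (cycles per MV, gate count in K, power in mW, tabulated NPCPG, tabulated NP)
rows = {
    "Yang": (8192, 28.0, 26.0, 0.13, 2.99),
    "AB2": (1504, 95.1, 227.8, 0.20, 4.82),
    "Tree": (1024, 56.1, 179.5, 0.51, 2.59),
}
for name, (cycles, gates_k, power_mw, t_npcpg, t_np) in rows.items():
    print(name, round(npcpg(cycles, gates_k), 2),
          round(normalized_power(cycles, power_mw), 2), t_npcpg, t_np)
```

Under this reading, a higher NPCPG and a lower NP are better, and the present invention scores 1.00 on both by construction.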

Next-generation video compression standards such as H.263+ and MPEG-4 provide additional motion estimation modes. The blocks used for motion estimation in these standards are not limited to the conventional 16 × 16 size: four motion vectors can be produced for the four 8 × 8 sub-blocks of a 16 × 16 pixel block. If the video compression algorithm can properly decide which motion vectors to use, coding efficiency can be improved considerably. This motion estimation mode is called the "advanced prediction mode". The hardware architecture according to the present invention can readily support the advanced prediction mode by adding four parallel comparator trees, as shown in Fig. 9. If the architecture is to support the advanced prediction mode, designing the circuit topology with Level = 4 yields better coding efficiency.
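The text does not spell out how the four additional comparator trees are fed. One plausible reading, assumed here, is that the sixteen absolute differences AD00 to AD33 are summed per 8 × 8 quadrant as well as over the whole block, so that each quadrant gets its own msea value. The function name below is hypothetical.

```python
def quadrant_mseas(abs_diff):
    """Group a 4 x 4 grid of |csum_xx - rsum_xx| values by 8 x 8 quadrant.

    Returns ([top-left, top-right, bottom-left, bottom-right], total), where
    total is the msea value of the whole 16 x 16 block.
    """
    quads = []
    for qr in (0, 2):
        for qc in (0, 2):
            quads.append(sum(abs_diff[r][c]
                             for r in (qr, qr + 1)
                             for c in (qc, qc + 1)))
    return quads, sum(quads)


# Made-up absolute differences, numbered 1 to 16 in raster order.
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
quads, total = quadrant_mseas(grid)
assert quads == [14, 22, 46, 54] and total == 136
```

At Level = 3 each 8 × 8 quadrant is described by only four sub-block sums; at Level = 4 it would be described by sixteen, which may be why the text recommends Level = 4 for this mode.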

The present invention thus makes the computational data flow more regular and smooth and better suited to hardware implementation, and it overcomes the drawbacks of the conventional multilevel successive elimination algorithm. In addition, at the same motion-vector throughput, the present invention provides high reliability, high computing capability, and minimal power consumption for its logic gates.

Although the present invention has been described and illustrated in detail, it should be understood that this is by way of illustration only and is not intended as a limitation; the spirit and scope of the present invention are limited only by the appended claims.

[References]

[1] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. on Circuits and Systems, vol. 36, no. 2, pp. 1317-1358, Oct. 1989.

[2] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. on Circuits and Systems, vol. 36, no. 2, pp. 1301-1308, Oct. 1989.

[3] C. H. Hsieh and T. P. Lin, "VLSI architectures for block-matching motion estimation algorithm," IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, no. 2, pp. 169-175, Jun. 1992.

[4] Y. S. Jehng, L. G. Chen, and T. D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms," IEEE Trans. on Signal Processing, vol. 41, no. 2, pp. 889-900, Feb. 1993.

[5] H. Yeo and Y. H. Hu, "A novel modular systolic array architecture for full-search block matching motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 5, no. 5, pp. 407-416, Oct. 1995.

[6] Y. K. Lai and L. G. Chen, "A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 2, pp. 124-127, Apr. 1998.

[7] Y. H. Yeh and C. Y. Lee, "Cost-effective VLSI architectures and buffer size optimization for full-search block matching algorithms," IEEE Trans. on VLSI Systems, vol. 7, no. 3, pp. 345-358, Sep. 1999.

Claims

A global elimination algorithm for motion estimation, comprising:

representing the current block in the current frame and the candidate blocks in the reference frame at each search position by coarse patterns;

comparing the coarse patterns of the current block and the candidate blocks;

finding M candidate blocks whose coarse patterns are similar to that of the current block;

comparing the fine patterns of the M candidate blocks with the fine pattern of the current block; and

selecting the candidate block having the minimum value among the fine-pattern differences of the M candidate blocks.

The global elimination algorithm of claim 1, wherein M has a value in the range of 1 to 63.

The global elimination algorithm of claim 1, wherein the motion vector corresponding to the minimum value among the fine-pattern differences of the candidate blocks is the estimated motion vector.

The global elimination algorithm of claim 1, wherein the coarse pattern is one of a successive elimination algorithm value and a multilevel successive elimination algorithm value.

The global elimination algorithm of claim 1, wherein the fine-pattern difference of a candidate block is a sum of absolute differences.

The global elimination algorithm of claim 1, wherein the M candidate blocks are located at the M search positions having the minimum values among the fine patterns.

A hardware architecture for performing a global elimination algorithm for motion estimation, comprising:

a systolic module that computes the coarse patterns of the respective sub-blocks in parallel;

an adder tree that compares the coarse pattern of the current block with the coarse pattern of each candidate block, the adder tree being reusable for comparing the fine pattern of the current block with the fine pattern of each candidate block;

at least one comparator tree for finding M candidate blocks whose coarse patterns are similar to that of the current block;

a control unit for controlling the operation of the systolic module, the adder tree and the comparator tree; and

at least one memory for storing the data of the current block and the candidate blocks.

The hardware architecture of claim 7, wherein the systolic module includes processing units for computing the coarse patterns of the current block and the candidate blocks.

The hardware architecture of claim 7, wherein the comparator tree stores the similarity of the M candidate blocks and their corresponding motion vectors in registers, compares the similarity of the M candidate blocks with the similarity of an input candidate block, finds the candidate block least similar to the current block among the M candidate blocks and the input candidate block, and replaces that least similar candidate block in the registers with the input candidate block, replacing only one of them when several candidate blocks in the registers are equally dissimilar to the current block.

The hardware architecture of claim 7, wherein M has a value in the range of 1 to 63.

The hardware architecture of claim 9, wherein M has a value in the range of 1 to 63.

The hardware architecture of claim 7, further comprising four additional adder trees coupled to the adder tree, whereby the advanced prediction mode can be supported by slightly modifying the configuration of the control unit.

The hardware architecture of claim 7, wherein the coarse pattern is one of a successive elimination algorithm value and a multilevel successive elimination algorithm value.

The hardware architecture of claim 7, wherein the fine-pattern difference of a candidate block is a sum of absolute differences.

The hardware architecture of claim 7, wherein the M candidate blocks are located at the M search positions having the minimum fine patterns.