KR100777753B1

KR100777753B1 - Data processing using a coprocessor

Info

Publication number: KR100777753B1
Application number: KR1020037008516A
Authority: KR
Inventors: Paul Matthew Carpenter; Peter James Aldworth
Original assignee: ARM Limited
Priority date: 2001-02-20
Filing date: 2001-12-13
Publication date: 2007-11-19
Also published as: GB0104160D0; GB2372848A; RU2275678C2; IL155662A0; KR20030078063A; WO2002067113A1; IL155662A; TWI285322B; JP2004519768A; EP1362286B1; CN1254740C; JP3729809B2; EP1362286A1; CN1491383A; MY124779A; GB2372848B; US20020116580A1; US7089393B2

Abstract

A data processing system using a main processor 8 and a coprocessor 10 provides a coprocessor load instruction (USALD) that loads a variable number of data values into the coprocessor 10, depending on alignment, and also specifies a data processing operation to be performed on operands within the loaded data words to generate result data words. The coprocessor processing operation so specified may be a sum of absolute differences calculation for a row of pixel byte values. The result of this calculation is accumulated in an accumulation register 22. A coprocessor memory 18 is provided within the coprocessor 10 to give the coprocessor 10 local storage of frequently used operand values.

Data processing apparatus, coprocessor, main processor, load instruction, sum of absolute differences

Description

DATA PROCESSING USING A COPROCESSOR

The present invention relates to data processing systems. In particular, the present invention relates to data processing systems incorporating both a main processor and a coprocessor.

Data processing systems incorporating both a main processor and a coprocessor are known. One example is the data processing systems of ARM Limited of Cambridge, England, which provide a main processor such as an ARM7 or ARM9 combined with a coprocessor, such as the Piccolo coprocessor, that performs functions such as specialised digital signal processing operations. Another example of a coprocessor is a floating point arithmetic coprocessor.

Coprocessors are often used to provide additional functionality within a data processing system that is not needed in the base system but is useful in particular circumstances where the additional cost of providing a suitable coprocessor is justified. Particularly demanding data processing environments are those involving digital signal processing, such as video image manipulation. The volume of data to be processed in such applications can be high. This makes it difficult to provide a data processing system that can cope with the required volume of processing while also having a low cost and low power consumption.

One approach to such computationally intensive applications is to provide dedicated digital signal processing circuits. Such dedicated circuits can have an architecture specifically adapted to perform a very restricted range of processing operations at very high speed. As an example, multiple data channels may be provided so that data streams into and out of the relevant circuits in parallel. Whilst such an arrangement may be able to cope with the high data throughput required, it has the disadvantage that it is generally inflexible. This inflexibility can mean that even a small change in the algorithm to be executed requires correspondingly expensive hardware changes. This contrasts with a general purpose processor, which is designed from the outset to be able to execute a wide variety of different algorithms.

Viewed from one aspect, the present invention provides a data processing apparatus comprising:

a main processor that performs data processing operations in response to program instructions; and

a coprocessor coupled to the main processor and responsive to a coprocessor load instruction on the main processor to load one or more loaded data words into the coprocessor and to perform, using the one or more loaded data words to provide operand data, at least one coprocessor processing operation specified by the coprocessor load instruction to generate at least one result data word;

wherein, in response to the coprocessor load instruction, a variable number of loaded data words is loaded into the coprocessor depending on whether the start address of the operand data within the one or more loaded data words is aligned with a word boundary.

The invention recognises that, within a system including a general purpose main processor, a coprocessor may be provided with very specific functions in mind. In particular, considerable advantages in speed and code density may be gained by arranging that a coprocessor load instruction also triggers data processing operations to be performed on operands within the loaded data words to generate result data words. Whilst such a coprocessor has a highly specific role within the system, it has been found that, when combined with a general purpose main processor, the combination can provide an advantageous increase in processing throughput while retaining the ability of the general purpose processor to adapt to different algorithms and circumstances.

Memory systems and bus structures are often arranged to operate conveniently only at certain alignments with the address base of the system, whereas the operand values that the coprocessor is to manipulate may have a different alignment. Accordingly, to improve performance, the number of data words loaded into the coprocessor depends on the alignment. As an example, it may be desired to load eight 8-bit operands in response to a coprocessor load instruction using word-aligned 32-bit data words. This may be achieved with two data words if the operands are aligned with a word boundary, or with three data words if they are not.
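The two-word versus three-word case can be checked with a short calculation. The following Python sketch is illustrative only: the function name `words_to_load` and the constants are hypothetical and do not appear in the specification. It assumes byte addressing and the 32-bit word size of the example above.

```python
WORD_BYTES = 4      # 32-bit data words, as in the example above
OPERAND_BYTES = 8   # eight 8-bit operands per coprocessor load

def words_to_load(start_address: int) -> int:
    """Count the word-aligned 32-bit words that cover the eight operand
    bytes beginning at start_address (hypothetical helper)."""
    first_word = start_address // WORD_BYTES
    last_word = (start_address + OPERAND_BYTES - 1) // WORD_BYTES
    return last_word - first_word + 1
```

An aligned start address needs two words; any of the three unaligned byte offsets within a word needs three.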

Particularly preferred embodiments of the invention provide a coprocessor memory within the coprocessor to locally store data words to be used as operands in combination with the loaded data words. This arrangement recognises that, in many real-life computational situations, a relatively small subset of data words is needed frequently, for use in combination with a much larger set of data words that are each needed less frequently. The feature exploits this by storing the frequently needed data words locally, which advantageously reduces the data channel capacity required between the main processor and the coprocessor. Compared with conventional digital signal processing systems, it is generally more difficult simply to add further data channels between a main processor and a coprocessor as desired, since the main processor architecture is constrained by other factors.

The performance of the system is improved in embodiments in which loaded data words retrieved from a memory coupled to the main processor are passed to the coprocessor without being stored in registers of the main processor. In this case, the main processor can be seen to act as an address generator and memory access mechanism for the coprocessor.

It is particularly convenient when the main processor includes a register operable to store an address value pointing to the data words to be loaded into the coprocessor. This gives the main processor control of the address pointer, which increases flexibility in the types of algorithm that can be supported.

It will be appreciated that the data words loaded and manipulated may have a variety of widths. As an example, a data word may be a 32-bit data word containing four 8-bit operands, each representing an 8-bit pixel value. However, data words and operands may take a wide variety of other sizes, such as 16 bits, 64 bits, 128 bits and the like.

In many real-life situations, whilst the alignment can vary, it is generally the same for a large number of sequential accesses. In this circumstance, preferred embodiments use a register within the coprocessor to store an alignment value that specifies the alignment between the data operands and the data words. The coprocessor is responsive to this value to control how many data words are loaded for each coprocessor load instruction.

Whilst the coprocessor could perform a wide variety of processing operations upon the operands within the loaded data words, depending on the particular system, the invention is particularly useful in systems that need to compute a sum of absolute differences between a plurality of operand values. Computing a sum of absolute differences over a large quantity of data often represents a significant portion of the processing load of a general purpose processor when carrying out operations such as pixel block matching during image processing. Offloading the bulk of the low-level calculation of the sum of absolute differences to a coprocessor gives a significant increase in performance while retaining the flexibility of the general purpose processor to use the special purpose sum of absolute differences calculation as part of a wide variety of different algorithms.

Within a sum of absolute differences system, an accumulation register is preferably provided within the coprocessor to accumulate the total of the calculated sums of absolute differences. This accumulation register can be read back into the main processor for further manipulation as required, but holding it locally within the coprocessor speeds operation and reduces the need for data transfer between the coprocessor and the main processor.

To enhance the role of the main processor as an address generator for the coprocessor, the coprocessor load instruction preferably includes an offset value to be applied to the address value stored as a pointer within the main processor. The offset value may optionally be used to update the pointer value either before or after the pointer value is used.

The invention also provides a computer program product incorporating coprocessor load instructions as described above. The computer program product may take the form of a distributable medium, such as a compact disc or a floppy disk, may be part of firmware embedded within a device, or may be dynamically downloaded, for example via a network link.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates a desired sum of absolute differences calculation;

FIG. 2 schematically illustrates a combination of a main processor and a coprocessor;

FIG. 3 is a flow diagram schematically illustrating an example of the type of operation that may be performed by the system of FIG. 2;

FIG. 4 illustrates four examples of coprocessor load instructions; and

FIGS. 5 to 7 provide further details of a coprocessor in accordance with one embodiment.

FIG. 1 shows a current block of pixels 2 for which it is desired to find the best match within a reference image. The current block of pixels comprises an 8*8 block of 8-bit pixel byte values. The current block of pixels 2 is compared with reference blocks of pixels 4 located at different vector displacements v from the current block of pixels 2. At each vector displacement to be tested for a match, the sum of absolute differences expression shown in FIG. 1 is calculated. This expression determines the absolute difference between each pair of corresponding pixel values in the two blocks and sums the 64 absolute difference values obtained. A good image match is generally indicated by a low sum of absolute differences. Within an image data processing system, such as one performing MPEG type processing, such a sum of absolute differences calculation is often required and can represent a disadvantageously large processing overhead for a general purpose processor.
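The expression of FIG. 1 is not reproduced in the text. As a plain software reference for the calculation described, the following Python sketch sums the 64 absolute differences between two 8*8 blocks. The function name `sad_8x8` and the list-of-rows representation are assumptions made for illustration.

```python
def sad_8x8(current, reference):
    """Sum of absolute differences between two 8*8 blocks, each given
    as 8 rows of 8 pixel values in the range 0..255."""
    total = 0
    for cur_row, ref_row in zip(current, reference):
        for cur_pixel, ref_pixel in zip(cur_row, ref_row):
            total += abs(cur_pixel - ref_pixel)
    return total
```

A block matching search would evaluate this for each candidate displacement v and keep the displacement giving the lowest total.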

FIG. 2 shows a data processing system 6 comprising a main processor 8, a coprocessor 10, a cache memory 12 and a main memory 14. The main processor 8 includes a register bank 16 that stores general purpose register values for use by the main processor 8. The main processor 8 may, for example, be one of the main processors designed by ARM Limited of Cambridge, England.

The main processor 8 is coupled to the cache memory 12, which serves to provide high speed access to the most frequently required data values. The lower speed but higher capacity main memory 14 is provided beyond the cache memory 12.

The coprocessor 10 is coupled to a coprocessor bus of the main processor 8 and performs desired operations in response to coprocessor instructions received and executed by the main processor 8. Within the ARM architecture, load coprocessor instructions are provided that serve to load data values into a coprocessor. The coprocessor 10 shown in FIG. 2 extends the functionality of such coprocessor load instructions so that they also specify to the coprocessor that it should perform certain predetermined processing operations upon operand values within the data words loaded into the coprocessor 10.

More specifically, the coprocessor 10 includes a coprocessor memory 18, an alignment register 20, an accumulation register 22 and control and arithmetic function logic 24. Specific coprocessor load instructions may be used to load sixteen 32-bit data words into the coprocessor memory 18. Each of these sixteen data words contains four 8-bit pixel values, and together they correspond to an 8*8 pixel block, namely the current block 2 shown in FIG. 1. The pixel values within the current block 2 are compared as a block, using a sum of absolute differences, with reference blocks 4 taken from a variety of different positions within the reference image, in order to find the reference block 4 that gives the lowest sum of absolute differences and so corresponds to the best image match. Storing the frequently used pixel values of the current block 2 locally within the coprocessor memory 18 is an efficient use of processing resources.

Once the coprocessor memory 18 has been loaded with the current block 2, special coprocessor load instructions (USALD instructions) are executed by the main processor 8. Each serves to load either two or three data words into the coprocessor 10 and to calculate a sum of absolute differences for the eight pixel operands within those loaded data words. A USALD instruction within the instruction stream of the main processor 8 is also passed (either directly or in the form of one or more control signals) to the coprocessor 10, where it triggers the control and arithmetic function logic 24 to control the loading of the required number of data words from the cache memory 12 or the main memory 14 via the main processor 8, and then to carry out the sum of absolute differences calculation using these loaded values and values from the coprocessor memory 18. The alignment register 20 holds an alignment value set up in advance by a coprocessor register load instruction executed by the main processor 8. The control and arithmetic function logic 24 is responsive to this alignment value to load two 32-bit data words when the operands are aligned with the word boundaries, or three 32-bit data words when they are not.

Alongside the cache memory 12 in FIG. 2 are illustrated eight desired pixel operand values stored within the cache memory that are not aligned with the word boundaries and have an alignment offset CR_BY0. In the example shown, three 32-bit data words are loaded, starting from the word-aligned address specified by the address value [Rn] held in one of the registers of the register bank 16. When the three 32-bit data words have been retrieved, the control and arithmetic function logic 24 performs a multiplexing operation to select the required operands from within the loaded data words in dependence upon the specified alignment value. The operand values extracted from the loaded data words are then subject to a sum of absolute differences calculation, which forms part of the calculation shown in FIG. 1, using standard arithmetic processing logic such as adders and subtractors. It will be appreciated that the eight pixel byte operands in the example shown effectively represent a single row within the block comparison between the current block of pixels 2 and the reference block of pixels 4 shown in FIG. 1. To perform the full calculation shown in FIG. 1, eight such coprocessor load instructions need to be executed in turn. The sum of absolute differences calculated by each of these coprocessor load instructions is accumulated in the accumulation register 22. Accordingly, after all eight coprocessor load instructions, each specifying a row sum of absolute differences, have been executed, the block sum of absolute differences has been computed and the result is held in the accumulation register 22.

This stored value may then be returned to the main processor 8, for example by a main processor instruction that moves a coprocessor register value back to one of the registers within the register bank 16.
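The operand selection step can be pictured in software as slicing eight bytes out of the two or three loaded words. The following Python sketch is a behavioural illustration only, not a description of the hardware: the function name `extract_operands` is hypothetical, and little-endian byte order within each 32-bit word is an assumption that the text does not state.

```python
def extract_operands(words, byte_offset):
    """Select eight pixel bytes from the loaded 32-bit words.

    words       -- two words (aligned case) or three words (unaligned case)
    byte_offset -- alignment value 0..3, the position of the first operand
                   within the first word
    Little-endian byte order is assumed (not stated in the text).
    """
    if byte_offset not in range(4):
        raise ValueError("alignment value must be 0..3")
    raw = b"".join(word.to_bytes(4, "little") for word in words)
    if len(raw) < byte_offset + 8:
        raise ValueError("not enough words loaded for this alignment")
    return list(raw[byte_offset:byte_offset + 8])
```

With an alignment value of 1 and three loaded words, for example, the selected operands are bytes 1 to 8 of the twelve bytes loaded.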

In the example above, the address pointer held in a register of the register bank 16 pointed directly to the start address of the first data word to be retrieved. However, it is also possible for the stored pointer value to be subject to an offset, such as a 10-bit offset, that is applied to it to indicate the actual address to be accessed. In some circumstances it is convenient for such an offset also to be used to update the pointer value on each use. This effectively allows the coprocessor load instructions to step through the data specifying the reference image by an appropriate amount, picking up the different 8-pixel rows required for a particular reference block 4, without necessarily requiring additional main processor instructions to modify the pointer.

FIG. 3 is a flow diagram schematically illustrating one example of the processing that may be performed using the system of FIG. 2. At step 26, sixteen words representing the current pixel block 2 are loaded from the cache memory 12 or the main memory 14 into the coprocessor memory 18. At step 28, a register Rn within the main processor 8 is loaded with a pointer value pointing to the start of the reference block 4 in memory. At step 30, the alignment value is loaded into the alignment register 20 of the coprocessor 10 using a coprocessor register load instruction on the main processor 8. It will be appreciated that steps 26, 28 and 30 set up the data processing environment for executing the coprocessor load and sum of absolute differences instructions. In many circumstances this setup need only be performed once and then remains in place while a large number of reference blocks 4 are tested against a particular current block 2. In these circumstances, the processing overhead associated with steps 26, 28 and 30 is relatively reduced.

Step 32 represents the execution of eight USALD coprocessor load instructions as described above. Each of these instructions calculates the sum of absolute differences for one row of the current block 2 and the reference block 4, and updates the accumulated value in the accumulation register 22.

At step 34, the sum of absolute differences calculated for the entire reference block 4 is retrieved from the accumulation register 22 into the main processor 8 using a move coprocessor register to main processor register instruction. This accumulated value may then be compared with previously calculated accumulated values, or with other parameters, to identify the best image match or for other purposes.
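The flow of steps 26 to 34 can be mimicked with a toy software model. The Python sketch below is a minimal illustration under stated assumptions, not the coprocessor itself: the class `CoprocessorModel`, the function `block_sad`, a flat byte-addressable `memory`, and a reference image row stride that is a multiple of four (so that the alignment stays the same from row to row) are all hypothetical.

```python
class CoprocessorModel:
    """Toy model of the coprocessor state used in the flow of FIG. 3."""

    def __init__(self):
        self.block = []   # coprocessor memory: 8 rows of 8 current-block pixels
        self.align = 0    # alignment register: byte offset within a word
        self.acc = 0      # accumulation register

    def load_block(self, memory, address):
        # Step 26: load the 64 bytes (sixteen words) of the current block.
        self.block = [list(memory[address + 8 * r:address + 8 * r + 8])
                      for r in range(8)]

    def set_alignment(self, value):
        # Step 30: store the alignment value.
        self.align = value & 3

    def usald(self, memory, word_address, row):
        # One load of step 32: fetch 2 or 3 words, select 8 bytes, accumulate.
        n_words = 2 if self.align == 0 else 3
        raw = memory[word_address:word_address + 4 * n_words]
        pixels = raw[self.align:self.align + 8]
        self.acc += sum(abs(c - p) for c, p in zip(self.block[row], pixels))


def block_sad(memory, cur_addr, ref_addr, stride):
    """Run steps 26 to 34 for one reference block; stride % 4 == 0 assumed."""
    cp = CoprocessorModel()
    cp.load_block(memory, cur_addr)     # step 26
    rn = ref_addr & ~3                  # step 28: word-aligned pointer in Rn
    cp.set_alignment(ref_addr & 3)      # step 30
    for row in range(8):                # step 32: eight USALD-style loads
        cp.usald(memory, rn, row)
        rn += stride                    # advance the pointer to the next row
    return cp.acc                       # step 34: read back the accumulator
```

In this model the main processor side only supplies addresses and advances the pointer; the pixel data never passes through a modelled main processor register, which mirrors the address-generator role described earlier.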

FIG. 4 illustrates three variants of the USALD instruction. The first variant uses no offset: it simply specifies the address via the pointer held in register Rn and is subject to conditional execution in dependence upon the condition code {cond}. The second variant uses an address pointer subject to a 10-bit offset value, which may be added to or subtracted from the initial value either before or after that value is used, depending on the flag {!}. The third variant again uses an offset value, which is applied to the pointer value held in register Rn before the address is used, with the pointer value itself being left unchanged.
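The effect of the offset on the accessed address and on the pointer can be sketched as follows. This Python fragment is an interpretation, not the instruction encoding: the function `resolve_address` and its `pre_indexed` and `writeback` parameters are hypothetical names, and the mapping of the {!} flag onto them is an assumption modelled on conventional pre-indexed and post-indexed addressing.

```python
def resolve_address(rn, offset=0, pre_indexed=True, writeback=False):
    """Return (access_address, updated_rn) for one coprocessor load.

    rn          -- pointer value held in the main processor register
    offset      -- signed offset, magnitude limited to 10 bits (assumed)
    pre_indexed -- apply the offset before the access (True) or after (False)
    writeback   -- for pre-indexed accesses, also update the pointer
    """
    if abs(offset) >= 1 << 10:
        raise ValueError("offset does not fit in 10 bits")
    if pre_indexed:
        address = rn + offset
        new_rn = address if writeback else rn
    else:
        # Post-indexed: access at the pointer, then step the pointer.
        address = rn
        new_rn = rn + offset
    return address, new_rn
```

With `pre_indexed=False` and the offset set to the reference image row stride, repeated calls step through successive rows without any separate pointer update.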

A more detailed description of an embodiment of the invention is given below:

1.1 Terms and Abbreviations

This document uses the following terms and abbreviations.

Term    Meaning
ASIC    Application Specific Integrated Circuit
BIST    Built In Self Test
JTAG    Joint Test Action Group

Scope

This document contains the technical details of an ARM9x6 coprocessor for improving the performance of MPEG4 encoder applications. It contains the functional specification from both a hardware and a software point of view. It does not contain details of the hardware or software implementation.

Introduction

The Urchin CoProcessor (UCP) is an ARM9x6 coprocessor designed to accelerate the execution of Sum of Absolute Differences (SAD) operations. The SAD operation is used in the MPEG4 motion estimation algorithm when comparing an 8x8 block from a reference frame with an 8x8 block of the current frame. This is part of an MPEG4 video encoding application.

1.2 UCP Structure

The UCP interprets a set of coprocessor instructions that are part of the ARM instruction set. The ARM coprocessor instructions allow the ARM processor to:

· transfer data between the UCP and memory (using the load coprocessor, LDC, and store coprocessor, STC, instructions);

· transfer data between the UCP and ARM registers (using the move to coprocessor, MCR, and move to ARM, MRC, instructions).

The ARM acts as an address generator and data pump for the UCP.

The UCP consists of a register bank, a datapath and control logic. This is shown in the UCP schematic diagram of FIG. 5.

1.3 Coprocessor Interface

The only connection between the ARM and the UCP is the coprocessor interface. All other system connectivity (AMBA or interrupts) is handled via the ARM.

1.4 Design Constraints

The initial implementation of the UCP is a point solution targeted at a 0.18um library. It will be integrated with an ARM926. The UCP is the only coprocessor in the system. The main design constraint is the strict timescale requirement, which drives all design decisions.

Other important design constraints are gate count, maximum operating frequency (worst case) and power consumption.

1.4.1 게이트 카운트1.4.1 Gate Count

이 섹션은 특허출원에 관련이 없기 때문에 삭제되었다. This section has been deleted because it is not relevant to the patent application.

1.4.2 동작 주파수1.4.2 Operating Frequency

이 섹션은 특허출원에 관련이 없기 때문에 삭제되었다.This section has been deleted because it is not relevant to the patent application.

1.4.3 전력 소비1.4.3 Power Consumption

프로그래머의 모델Programmer's model

1.5 Registers

The UCP coprocessor contains two types of data storage:

· Registers: these are used to transfer data directly between ARM registers and the coprocessor. They can be accessed by MCR and MRC operations.

· Block buffer: this buffer stores 8 lines of 8 bytes (64 bits) each and can only be loaded or stored directly from memory-mapped regions (i.e. not from ARM registers). The block buffer is accessed by a set of special UCP instructions (the ARM sees these instructions as LDC and STC operations).

The following is a summary of the registers inside the UCP:

Reserved or undefined register bits should be masked on register reads and set to zero on register writes. The exceptions are CR_ACC and CR_BY0: these registers always return zero in their unused bits when read, and any value may be written to their unused bit positions (the UCP ignores these values).

1.5.1 CR_ACC

This is a 14-bit read/write register. It can be updated directly by an MCR and indirectly by a SAD operation.

1.5.2 CR_IDX

This is a 3-bit read/write register. It can be updated directly by an MCR, or indirectly by a block buffer load/store or SAD operation that increments the line index.

This register holds the line index of the block buffer. It selects the line that the next operation accessing the block buffer will refer to (these operations always use a single line of the block buffer).

If this register is incremented beyond the value 7, it wraps around to 0.

1.5.3 CR_BY0

This is a 2-bit read/write register. It can only be updated by an MCR.

The UCP supports accesses to the reference frame from byte-aligned addresses. The lower two bits of the address for these loads are stored in this register.

Note that the coprocessor does not directly see the address value used by the ARM for the memory access, which is why software must program the byte offset into CR_BY0 separately. Also, because the ARM does not directly support non-word-aligned loads for coprocessors, the UCP performs three word loads and extracts the eight bytes required.
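The byte extraction described above can be illustrated with a small software model. This is only a sketch under stated assumptions: little-endian byte order within each word is assumed (the text does not state the endianness), and the names `words_needed` and `extract_line` are hypothetical helpers, not part of the UCP.

```python
def words_needed(byte_offset):
    """Number of word loads for one 8-byte line: 2 if aligned, 3 otherwise."""
    return 2 if byte_offset == 0 else 3


def extract_line(words, byte_offset):
    """Return the 8 operand bytes that start byte_offset bytes into the words.

    words       -- consecutive word-aligned 32-bit values, lowest address first
    byte_offset -- value held in CR_BY0 (lower two address bits, 0..3)
    Little-endian byte order is an assumption of this sketch.
    """
    if not 0 <= byte_offset <= 3:
        raise ValueError("CR_BY0 is a 2-bit register")
    needed = words_needed(byte_offset)
    if len(words) < needed:
        raise ValueError("not enough words loaded for this alignment")
    # Concatenate the loaded words into a byte stream, then slice out 8 bytes.
    raw = b"".join(w.to_bytes(4, "little") for w in words[:needed])
    return raw[byte_offset:byte_offset + 8]
```

With an offset of 0 the two loaded words are used as they are; with any other offset the eight bytes straddle three words, which is why the third load is needed.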

1.5.4 CR_CFG

This is a single-bit read/write register. It can only be updated by an MCR.

The IDX_INC bit controls whether the CR_IDX register is incremented after a block buffer load/store or SAD operation. If the bit is clear, no increment occurs. If the bit is set, CR_IDX is incremented after the block buffer or SAD operation completes.

1.5.5 CR_ID

This 14-bit read-only register holds the UCP architecture and revision code.

Bits [3:0] contain the revision number of the implementation.

Bits [7:4] are set to the value 0xF for an ARM-designed implementation.

Bits [13:8] contain the UCP architecture version: 0x00 = version 1.
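As an illustration of the CR_ID layout, the following sketch splits a register value into its three fields. The function name `decode_cr_id` and the dictionary keys are hypothetical; only the bit positions come from the description above.

```python
def decode_cr_id(value):
    """Split a 14-bit CR_ID value into its documented bit fields."""
    value &= 0x3FFF  # CR_ID is 14 bits wide; ignore anything above bit 13
    return {
        "revision": value & 0xF,             # bits [3:0]
        "implementer": (value >> 4) & 0xF,   # bits [7:4], 0xF for an ARM design
        "architecture": (value >> 8) & 0x3F, # bits [13:8], 0x00 = version 1
    }
```

For example, a value of 0x00F2 would decode as revision 2 of an ARM-designed implementation of architecture version 1.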

1.6 Instruction Set

The assembler syntax of the UCP coprocessor uses the same format as the ARM.

{ } indicates an optional field.

cond is the ARM instruction condition code field.

dest specifies the UCP destination register.

Rn is an ARM register.

CRn is a UCP register.

! indicates that the calculated address is written back to the base register.

UCP is the coprocessor number of the UCP.

10_Bit_Offset is an expression that evaluates to a 10-bit word offset. This offset is added to the base register to form the load address. Note that the offset must be a multiple of four.

6_Bit_Offset is an expression that evaluates to a 6-bit word offset. This offset is added to the base register to form the load address. Note that the offset must be a multiple of four.

Note: post-indexing is allowed with or without write-back.

The following pages describe the instruction set of the UCP.

1.6.1 Instruction Set Summary

1.6.2 Instruction Encoding

The UCP instructions fall into the following coprocessor instruction classes:

Coprocessor instructions are encoded as follows:

cond: condition code

UCP: UCP coprocessor number

Rn: ARM register source; CRn: UCP register source

Rd: ARM register destination; CRd: UCP register destination

8_bit_offset: 8-bit number (0-255) used to specify the address offset

P: pre/post-indexing bit. 0 = post (add offset after the transfer); 1 = pre (add offset before the transfer)

U: up/down bit. 0 = down (subtract offset from base); 1 = up (add offset to base)

W: write-back bit. 0 = no write-back; 1 = write the address back to the base

Note that only the opcodes above are valid for the UCP. Any other variation of the opcodes (that carry the UCP coprocessor number in bits 11 down to 8) will result in unpredictable behavior.
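The effect of the P, U and W bits on the transfer address can be sketched as follows. This sketch assumes the standard ARM LDC/STC rule that the 8-bit offset field is scaled by four to give a byte offset, which is consistent with the requirement above that offsets be multiples of four; the function name `transfer_address` is hypothetical.

```python
def transfer_address(base, offset8, p, u, w):
    """Return (address used for the transfer, base register value afterwards).

    base    -- value of the ARM base register Rn
    offset8 -- 8_bit_offset field (0..255), assumed to be scaled by 4 to bytes
    p, u, w -- the P (pre/post), U (up/down) and W (write-back) bits
    """
    if not 0 <= offset8 <= 255:
        raise ValueError("8_bit_offset must be in the range 0..255")
    delta = offset8 * 4 if u else -(offset8 * 4)
    target = (base + delta) & 0xFFFFFFFF  # 32-bit address arithmetic
    address = target if p else base       # pre-indexing applies the offset first
    new_base = target if w else base      # write-back updates the base register
    return address, new_base
```

For example, a post-indexed access with write-back and a byte offset of 320 (field value 80) uses the original base for the transfer and leaves the base register advanced by 320.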

1.6.3 UMCR

The UMCR instruction is used to write to a UCP register. UMCR moves data from ARM register Rn to UCP register CRd.

Action

1 Move ARM register Rn to UCP register CRd

Mnemonic

UMCR CRd, Rn, {cond}

Example

UMCR CR_ACC, R0 ; Load the contents of R0 into CR_ACC

1.6.4 UMRC

The UMRC instruction is used to read from a UCP register. UMRC moves data from UCP register CRn to ARM register Rd.

Action

1 Move UCP register CRn to ARM register Rd

Mnemonic

UMRC Rd, CRn, {cond}

Example

UMRC R6, CR_ID ; Load the ID register into R6

1.6.5 UBBLD

The UBBLD instruction is used to load data into the block buffer.

Action

1 Load word to Block_Buffer(CR_IDX,0)

2 Load word to Block_Buffer(CR_IDX,1)

3 if (IDX_INC == 1)

4 ++CR_IDX

Mnemonic

UBBLD [Rn], #0, {cond}

UBBLD [Rn, #+/-10_Bit_Offset]{!}, {cond}

UBBLD [Rn], #+/-10_Bit_Offset{!}, {cond}

Example

UBBLD [R0], #320! ; Load two words from mem(R0) into the block buffer and post-increment R0 to the next line.
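The UBBLD action steps, together with the CR_IDX wrap-around and the IDX_INC bit described earlier, can be modeled in a few lines. This is a behavioral sketch only: the class `BlockBufferModel` is hypothetical, and little-endian byte order within each word is an assumption.

```python
class BlockBufferModel:
    """Behavioral model of the 8-line by 8-byte block buffer."""

    LINES = 8

    def __init__(self, idx_inc=True):
        self.lines = [bytes(8)] * self.LINES  # 8 lines of 8 bytes each
        self.cr_idx = 0                       # 3-bit line index (CR_IDX)
        self.idx_inc = idx_inc                # IDX_INC bit of CR_CFG

    def ubbld(self, word0, word1):
        """Load two words into the line selected by CR_IDX."""
        self.lines[self.cr_idx] = (
            word0.to_bytes(4, "little") + word1.to_bytes(4, "little")
        )
        if self.idx_inc:
            # Incrementing past 7 wraps around to 0.
            self.cr_idx = (self.cr_idx + 1) % self.LINES
```

With IDX_INC set, eight consecutive loads fill the whole buffer and leave CR_IDX back at line 0, ready for eight SAD operations against the same block.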

1.6.6 UBBST

The UBBST instruction is used to store data from the block buffer. This operation is intended for validation purposes only and is not used by application code.

Action

1 Store word from Block_Buffer(CR_IDX,0)

2 Store word from Block_Buffer(CR_IDX,1)

3 if (IDX_INC == 1)

4 ++CR_IDX

Mnemonic

UBBST [Rn], #0, {cond}

UBBST [Rn, #+/-10_Bit_Offset]{!}, {cond}

UBBST [Rn], #+/-10_Bit_Offset{!}, {cond}

Example

UBBST [R3], #8! ; Store two words from the block buffer into memory starting at mem(R3).

Note: the UCP does not support back-to-back UBBST instructions when IDX_INC == 1. Each UBBST must be separated from the next by at least one NOP. Using UBBST instructions back to back will result in unpredictable behavior.

1.6.7 USALD

The USALD instruction loads data from the reference block and performs a SAD accumulate operation.

Action

1 if (CR_BY0 == 0)

2 Load word to first_word_tmp

3 else

4 Load word to mux1_tmp; load word to mux2_tmp; combine mux1_tmp and mux2_tmp according to CR_BY0 to form the unaligned word and load it into first_word_tmp.

5 Perform the SAD on each byte in first_word_tmp and the corresponding byte in Block_Buffer(CR_IDX,0). The result is accumulated in CR_ACC.

6 if (CR_BY0 == 0)

7 Load word to second_word_tmp

8 else

9 Load word to mux3_tmp; combine mux2_tmp and mux3_tmp according to CR_BY0 to form the unaligned word and load it into second_word_tmp.

10 Perform the SAD on each byte in second_word_tmp and the corresponding byte in Block_Buffer(CR_IDX,1). The result is accumulated in CR_ACC.

11 if (IDX_INC == 1)

12 ++CR_IDX

Note: the arithmetic is unsigned. There is no provision for detecting or handling overflow of CR_ACC. The size of CR_ACC is chosen so that overflow cannot occur from an 8x8 block comparison.

Mnemonic

USALD [Rn], #0, {cond}

USALD [Rn, #+/-10_Bit_Offset]{!}, {cond}

USALD [Rn], #+/-10_Bit_Offset{!}, {cond}

Example

USALD [R0], #320! ; Load two words from mem(R0), perform the SAD and post-increment R0 to the next line.
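The SAD accumulation performed by USALD, and the reason a 14-bit accumulator is sufficient for an 8x8 block, can be checked with a short model. This is a sketch of the arithmetic only, not of the hardware; the names `sad_line` and `sad_block` are hypothetical, and each line is represented as 8 already-aligned bytes.

```python
ACC_BITS = 14  # width of CR_ACC


def sad_line(reference, current):
    """Sum of absolute differences over one 8-byte line (unsigned bytes)."""
    return sum(abs(r - c) for r, c in zip(reference, current))


def sad_block(reference_lines, buffer_lines, acc=0):
    """Accumulate the SAD of eight lines, as eight USALD operations would."""
    for reference, current in zip(reference_lines, buffer_lines):
        acc += sad_line(reference, current)
    return acc


# Worst case: every one of the 64 byte pairs differs by the maximum of 255.
WORST_CASE = 8 * 8 * 255
```

The worst case is 64 x 255 = 16320, which is below 2^14 = 16384, so a single 8x8 comparison starting from a cleared accumulator cannot overflow CR_ACC.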

1.7 Instruction Cycle Timing

The table below shows the number of cycles required to complete each instruction.

Note: N = 2 cycles if CR_BY0 is 0; otherwise N = 3 cycles.

1.8 Data Hazards

A data hazard is a case where a data dependency between instructions can cause unpredictable behavior.

The UCP has no hardware interlocks to resolve data hazards. Instead, software must insert NOPs or non-coprocessor instructions wherever they are needed.

[Table 1]

Functional Description

1.9 Coprocessor Overview

The coprocessor is tightly coupled to the ARM9x6 processor. Each instruction destined for the UCP is processed immediately, as if the ARM core were executing the instruction itself. The UCP is not a "fire and forget" coprocessor: UCP instructions execute immediately and their results are available to the next instruction (subject to certain conditions; see section 1.8). This means that the ARM does not need to poll the state of the coprocessor and the coprocessor does not need to interrupt the ARM.

1.10 Functional Diagram

FIG. 6 shows the connections of the UCP.

1.11 Block Diagram

FIG. 7 shows the main blocks of the UCP.

The control logic reads each instruction as it arrives from memory and contains a pipeline follower that keeps the UCP in step with the ARM. The coprocessor pipeline consists of the following stages: fetch, decode, execute, memory, write and SAD-accumulate.

The register bank holds the block buffer and the registers accessible by MCR/MRC. It also holds the logic that increments CR_IDX.

The data path contains the logic for the USALD operation, including byte manipulation, SAD and accumulation. It also handles register read selection and forwards the results of the internal operations UMRC and UBBST.

System Connection

The UCP is designed to be connected directly to an ARM9x6 processor. The CHSDE_C and CHSEX_C outputs are not used in this configuration. The CPABORT input should be tied low for ARM966 and ARM946 processors. In this configuration, the UCP will require special abort handler code to recover from data aborts during load operations.

1.12 Connection to an ARM920T

The ARM920 coprocessor interface is functionally identical to the 9x6 coprocessor interface. However, the two interfaces differ significantly in signal timing. To use the UCP with an ARM920, a separate external re-timing block is required. The detailed requirements of this re-timing block are beyond the scope of this document, but a brief overview of its construction is given here.

The clock provided by the ARM920 is inverted and then passed to the UCP as CPCLK. The CHSDE and CHSEX outputs are not used; instead, CHSDE_C and CHSEX_C are passed to transparent latches enabled by the ARM920 clock.

The CPABORT signal is obtained from the ETM interface.

1.13 Connecting Additional Coprocessors

By default, the UCP is not configured for systems that use multiple external coprocessors. Additional logic is required to support this.

When multiple coprocessors are attached to the interface, the handshaking signals can be combined by ANDing bit 1 and ORing bit 0. For two coprocessors with handshaking signals CHSDE1, CHSEX1 and CHSDE2, CHSEX2 respectively:

CHSDE[1] <= CHSDE1[1] AND CHSDE2[1]

CHSDE[0] <= CHSDE1[0] OR CHSDE2[0]

CHSEX[1] <= CHSEX1[1] AND CHSEX2[1]

CHSEX[0] <= CHSEX1[0] OR CHSEX2[0]
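The combining rule above applies identically to CHSDE and CHSEX, so it can be sketched as one function over 2-bit values. The function name `combine_handshake` is hypothetical; the sketch models only the AND/OR logic and makes no assumption about what each 2-bit code means.

```python
def combine_handshake(first, second):
    """Combine two 2-bit handshake values: AND on bit 1, OR on bit 0."""
    bit1 = (first >> 1) & (second >> 1) & 1
    bit0 = (first | second) & 1
    return (bit1 << 1) | bit0
```

The same function would be applied once to the CHSDE pair and once to the CHSEX pair to produce the signals presented to the ARM.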

AC Timing

This section is currently incomplete. AC timing is available for the 920T on the Philips process.

Appendix

1.14 Signal Description

The tables below describe the signals through which the UCP interfaces with the ARM9TDMI. Note that these are subject to change once implementation begins, so the final signals may differ from the lists shown below.

1.14.1 UCP Instruction Fetch Interface Signals

1.14.2 UCP Data Bus

1.14.3 UCP Coprocessor Interface Signals

1.14.4 UCP Miscellaneous Signals

Claims

A main processor that performs data processing operations in response to program instructions; and

a coprocessor connected to the main processor and responsive to a coprocessor load instruction on the main processor to load one or more loaded data words into the coprocessor and to perform at least one coprocessor processing operation specified by the coprocessor load instruction, using the one or more loaded data words to provide operand data, so as to generate at least one result data word,

wherein, in response to the coprocessor load instruction, a variable number of loaded data words is loaded into the coprocessor, depending on whether the start address of the operand data within the one or more loaded data words is aligned with a word boundary,

and the coprocessor includes an alignment register for storing a value specifying the alignment between the operand data and the one or more loaded data words.

The apparatus of claim 1,

wherein the coprocessor comprises a coprocessor memory for storing one or more locally stored data words that are used, in combination with the one or more loaded data words, as operands in the at least one coprocessor processing operation.

The apparatus of claim 1 or 2,

further comprising a memory coupled to the main processor, wherein the one or more loaded data words are retrieved from the memory to the coprocessor via the main processor without being stored in a register within the main processor.

The apparatus of claim 1 or 2,

wherein the main processor includes a register operable to store an address value pointing to the one or more data words.

The apparatus of claim 1 or 2,

wherein the at least one coprocessor processing operation calculates a sum of absolute differences between a plurality of byte values.

The apparatus of claim 5,

wherein the coprocessor includes a coprocessor memory for storing one or more locally stored data words that are used, in combination with the one or more loaded data words, as operands in the at least one coprocessor processing operation, and the sum of absolute differences is calculated as the sum of the absolute differences between a plurality of byte values in the one or more loaded data words and the corresponding byte values in the one or more locally stored data words.

The apparatus of claim 6,

wherein the sum of absolute differences is accumulated in an accumulate register of the coprocessor.

(Deleted)

The apparatus of claim 4,

wherein the coprocessor load instruction includes an offset value that is added to the address value upon execution.

The apparatus of claim 1 or 2,

wherein the at least one coprocessor processing operation calculates a sum of absolute differences as part of block pixel value matching.

A data processing method comprising: performing data processing operations in a main processor in response to program instructions; and

in response to a coprocessor load instruction on the main processor, loading one or more loaded data words into a coprocessor connected to the main processor and performing at least one coprocessor processing operation specified by the coprocessor load instruction, using the one or more loaded data words to provide operand data, so as to generate at least one result data word,

wherein, in response to the coprocessor load instruction, a variable number of loaded data words is loaded into the coprocessor, depending on a value stored in an alignment register in the coprocessor that indicates whether the start address of the operand data within the one or more loaded data words is aligned with a word boundary.

A computer program that controls a computer to, in response to a coprocessor load instruction on a main processor, load one or more loaded data words into a coprocessor connected to the main processor and perform at least one coprocessor processing operation specified by the coprocessor load instruction, using the one or more loaded data words to provide operand data, so as to generate at least one result data word,

wherein, in response to the coprocessor load instruction, a variable number of loaded data words is loaded into the coprocessor, depending on a value stored in an alignment register in the coprocessor that indicates whether the start address of the operand data within the one or more loaded data words is aligned with a word boundary.