KR20060040597A

KR20060040597A - Result partitioning within SIMD data processing systems

Info

Publication number: KR20060040597A
Application number: KR1020057024017A
Authority: KR
Inventors: Daniel Kershaw
Original assignee: ARM Limited
Priority date: 2003-06-16
Filing date: 2003-12-18
Publication date: 2006-05-10
Also published as: US20040255100A1; EP1634163A1; WO2004114127A1; CN100378651C; EP1634163B1; CN1791857A; TWI266204B; KR101042647B1; IL169374A0; IL169374A; MY135903A; US7668897B2; JP4402654B2; AU2003290285A1; RU2005139390A; JP2006527868A; TW200500879A

Abstract

Within a processor (2) providing single instruction multiple data (SIMD) type operation, single data processing instructions can serve to control processing logic (4, 6, 8, 10) to perform SIMD-type processing operations upon multiple independent input values to generate multiple independent result values having a greater data width than the corresponding input values. A repartitioner in the form of appropriately controlled multiplexers serves to partition these result data values into high order bit portions and low order bit portions that are stored into separate registers (38, 40). The required SIMD width preserved result values can be read from the desired high order (38) result register or low order result register (40) without further processing being required. Furthermore, the preservation of the full result facilitates improvements in accuracy, such as over extended accumulate operations and the like.

Description

RESULT PARTITIONING WITHIN SIMD DATA PROCESSING SYSTEMS

The present invention relates to data processing systems. More particularly, it relates to the partitioning of a result comprising multiple result data values within a single instruction multiple data (SIMD) data processing system.

Data processing systems with single instruction multiple data capability are known. In such systems, a register typically contains a plurality of independent data values to be manipulated. For example, a 32-bit register may contain two independent 16-bit data values, each of which is separately added to, multiplied by, or otherwise combined with one of two other 16-bit data values stored in another 32-bit register. Such SIMD operations are common in digital signal processing and have many advantages, including increased processing speed and reduced code size.
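The lane-wise behaviour described above can be illustrated with a minimal sketch. The function name `simd_add_2x16` is hypothetical and is not taken from the patent; it simply models a 32-bit register holding two independent 16-bit values, with the carry between lanes discarded.

```python
MASK16 = 0xFFFF

def simd_add_2x16(x: int, y: int) -> int:
    """Add two 32-bit registers as two independent 16-bit lanes."""
    # Lane 0: any carry out of bit 15 is discarded, not passed to lane 1.
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    # Lane 1: added separately from lane 0.
    hi = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

# Lane 0 wraps around to 0; lane 1 is unaffected by that overflow.
print(hex(simd_add_2x16(0x0001FFFF, 0x00010001)))
```

A plain 32-bit addition of the same operands would let the carry from the low half leak into the high half, which is exactly what SIMD lane separation prevents.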

An example of a known SIMD technique is the MMX instructions of processors produced by Intel Corporation. The MMX instructions include instructions that multiply together two registers, each containing four 16-bit data values. Multiplying one 16-bit data value by another produces a 32-bit result. Thus, multiplying together the four pairs of 16-bit data values specified by an MMX SIMD instruction produces four 32-bit result data values. In many cases it is desirable to preserve the SIMD format and data size across such an operation. To this end, the MMX instructions include a form in which the result is four 16-bit result data values, each being the 16 most significant bits of the corresponding 32-bit result, combined into a single 64-bit register, i.e. a SIMD-format result. Alternatively, a separate instruction can produce the least significant 16 bits of each of the four multiplication results, combined into a 64-bit register.
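As a rough model of the prior-art behaviour just described, the following sketch treats the high-half and low-half multiplies as two separate operations on four signed 16-bit lanes (in MMX these correspond to the PMULHW and PMULLW instructions). The helper names are hypothetical and the code is only an illustration of the arithmetic, not of any real instruction encoding.

```python
def to_signed16(v: int) -> int:
    """Interpret the low 16 bits of v as a two's complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def lanes16(reg64: int) -> list:
    """Split a 64-bit register into four 16-bit lanes, lane 0 lowest."""
    return [(reg64 >> (16 * i)) & 0xFFFF for i in range(4)]

def pack16(vals) -> int:
    """Pack four values (truncated to 16 bits) into a 64-bit register."""
    reg = 0
    for i, v in enumerate(vals):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def mul_high_4x16(a: int, b: int) -> int:
    """Keep the upper 16 bits of each signed 32-bit product."""
    return pack16([(to_signed16(x) * to_signed16(y)) >> 16
                   for x, y in zip(lanes16(a), lanes16(b))])

def mul_low_4x16(a: int, b: int) -> int:
    """Keep the lower 16 bits of each signed 32-bit product."""
    return pack16([to_signed16(x) * to_signed16(y)
                   for x, y in zip(lanes16(a), lanes16(b))])
```

Obtaining both halves in this model takes two calls, each repeating the multiplication, which is the situation the single-instruction partitioning described in this document is intended to avoid.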

Viewed from one aspect, the present invention provides apparatus for performing a data processing operation in response to a data processing instruction, the apparatus comprising:

processing logic operable, in response to the data processing instruction, to generate a respective plurality of result data values from a plurality of independent data values stored in one or more input stores; and

a result partitioner operable, in response to the data processing instruction, to store a high order bit portion of each result data value in a high order result store and a low order bit portion of each result data value in a low order result store.

The present invention recognises that, while in many cases it is desirable to produce a SIMD-format result, in some cases it is important to preserve the full precision of the result in order to avoid adverse consequences such as inappropriate rounding errors. Accordingly, the present technique provides a system in which a single data processing instruction, giving high code density, performs a SIMD-type operation on a plurality of independent data values and stores the plurality of result data values in SIMD form, with their high order portions in one store and their low order portions in another store. The SIMD-format result is therefore immediately available, without further processing, when it is needed, while the full precision of the result is preserved in the combination of the two stores and can be manipulated from there, so that full precision can be carried forward.

It will be appreciated that the data processing operation performed by the processing logic to generate the result data values from the plurality of independent input data values may take a variety of different forms. The input to the processing logic may be the contents of a single store, with the result being, for example, the square of each stored independent data value, or its square root computed to a given accuracy by a given calculation technique. However, in preferred embodiments of the invention, the processing logic is operable to multiply together respective pairs of independent data values, the first independent data value of each pair being taken from a first input store and the second independent data value of each pair being taken from a second input store.

Such SIMD multiplication operations are common, and they increase the data width of the result. This is what calls for the present technique when full precision is to be preserved while a SIMD-format result is still produced directly.

The technique is particularly well suited to situations in which an accumulate operation is associated with the multiplication, because the additional precision it preserves helps to resist the cumulative effect of the multiple rounding errors that can arise in accumulate-type operations.

It will be appreciated that the high order bit portion and the low order bit portion could have a variety of different relationships, but it is most efficient, and preferred, for them to be non-overlapping contiguous portions of the result data value concerned.

The data processing instruction may specify various forms of multiplication, such as integer multiplication or multiplication of signed fractional values. The invention is particularly well suited to the case in which the specified multiplication is of signed fractional values and the processing logic is operable to double each result data value so as to take account of the presence of a sign bit in each input data value. This doubling can be folded efficiently into the other operations with little additional overhead.
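The doubling step can be illustrated for the common Q15 signed fractional format, in which a 16-bit value v represents v / 2^15. This is only a sketch under that assumed format; the names `q15_mul_q31` and `split_q31` are hypothetical.

```python
def q15_mul_q31(a: int, b: int) -> int:
    """Multiply two signed Q15 values, returning a signed Q31 product.

    The raw integer product carries two sign bits (Q30 scaling); doubling
    it removes the redundant one so the upper half is again a Q15 value.
    """
    return (a * b) << 1

def split_q31(p: int):
    """Split a 32-bit product into (high 16 bits, low 16 bits)."""
    p &= 0xFFFFFFFF
    return p >> 16, p & 0xFFFF

half = 0x4000  # 0.5 in Q15
high, low = split_q31(q15_mul_q31(half, half))
print(hex(high), hex(low))  # high is 0x2000, i.e. 0.25 in Q15
```

Without the doubling, the high half would be 0x1000, which reads as 0.125 in Q15 rather than the correct 0.25. Note that (-1.0) x (-1.0) does not fit in this format after doubling, which is one situation where a saturating operation is relevant.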

The data width of the independent SIMD data values may vary, and in preferred embodiments the data processing instruction specifies the data width concerned.

The multiplier may take many forms depending on the particular circumstances, but a particularly preferred form is an integer multiplier, which is relatively simple and fast and which, suitably configured, can carry out a variety of different types of operation.

As an example of the types of processing operation that may be specified by the data processing instruction, the processing may optionally be a saturating operation.

The result partitioner serves to divide the result data values between the different stores, and in preferred embodiments a plurality of multiplexers is used for this purpose. The technique can be applied to many different types of data processing system, such as DSPs, but is particularly well suited to use within processor cores.

It will be appreciated that the input stores, the high order result store, the low order result store and other stores within the system may take a variety of different forms, and may be one or more of register bank registers, dedicated registers, buffer memory, first-in-first-out buffers and portions of memory (e.g. cache, main, bulk, etc.). These different forms may be combined, so that different stores take different forms. Using a memory or buffer rather than a register as a store conveniently allows a stream of data values to be manipulated.

As a way of extending the range of results that can be calculated, in a manner that fits readily with the present technique, preferred embodiments also generate one or more high order guard bits, such as may be used in the context of saturating operations. The result partitioner may store these guard bits in a guard bit store of their own.

Viewed from another aspect, the present invention provides a method of performing a data processing operation in response to a data processing instruction, the method comprising the steps of:

in response to the data processing instruction, generating a respective plurality of result data values from a plurality of independent data values stored in one or more input stores; and

in response to the data processing instruction, partitioning the result data values by storing a high order bit portion of each result data value in a high order result store and a low order bit portion of each result data value in a low order result store.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates a processor core of a type that may utilise the present technique;

FIG. 2 schematically illustrates various SIMD data formats;

FIG. 3 schematically illustrates the relationship between input data values and output data values in accordance with the invention for various data widths;

FIG. 4 schematically illustrates a portion of the data processing path within the processor core of FIG. 1;

FIG. 5 illustrates a multiplexer arrangement for partitioning result data values in accordance with the present technique; and

FIG. 6 schematically illustrates another form of multiply-accumulate operation in accordance with the present technique.

FIG. 1 shows a processor core 2, such as those produced by ARM Limited of Cambridge, England. The processor core 2 includes a register bank 4, a multiplier 6, a shifter 8 and an adder 10, which together form part of a data processing data path. Data processing instructions are received into an instruction pipeline 12, from where they are decoded by an instruction decoder 14 to generate control signals that control the operation of the other circuit elements within the processor 2. It will be appreciated that the processor 2 will typically include many further circuit elements, which are not shown for the sake of simplicity. In the example of FIG. 1, input data values are read from registers within the register bank 4 and results are written back to registers of the register bank 4. In other embodiments, input values and result values may be read from and written to different types of store, such as dedicated registers, buffer memory, first-in-first-out buffers and general purpose memory. These may be used as alternatives and in various mixed combinations. These alternatives are not shown in FIG. 1.

FIG. 2 illustrates several different SIMD data formats. The data path shown in FIG. 1 may be 64 bits wide, in a version of an ARM processor modified to support this data width. This data path can manipulate a full 64-bit word in a non-SIMD mode. In this example, the various SIMD modes manipulate two 32-bit data values, four 16-bit data values, or eight 8-bit data values. In the SIMD modes the data values are independent of one another, and the data path of the processor 2 of FIG. 1 is configured according to the size of the SIMD data values so as to process them separately, for example by breaking the carry chain at the appropriate positions. The adaptation of a data path to perform SIMD-type operations is known in itself and is not described further here.

FIG. 3 illustrates the relationship between input data values and result data values for the different SIMD data width modes in accordance with the present technique. In example (i), the input data values are two 32-bit input values A0, A1 stored in a first 64-bit register and two 32-bit input values B0, B1 stored in a second register. In this example the data processing operation specified by the instruction is a SIMD multiplication, so the 32-bit value A0 is multiplied by the 32-bit value B0 and the 32-bit value A1 is multiplied by the 32-bit value B1. These multiplications yield the 64-bit results A0B0 and A1B1 respectively. The most significant 32 bits of each of these two results are written to a high order result register 17. The least significant 32 bits of each of these two results are written to a low order result register 18. The two portions written to the different registers 17, 18 are non-overlapping contiguous portions.
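Example (i) can be modelled with a short sketch. It assumes unsigned lanes for simplicity, and the function name `simd_mul_partition` is hypothetical; the lane width `n` is a parameter so the same code also covers the 16-bit and 8-bit cases.

```python
def simd_mul_partition(a: int, b: int, n: int, reg_bits: int = 64):
    """Multiply unsigned n-bit lanes of a and b.

    Returns (high_reg, low_reg): lane i of high_reg holds the upper n bits
    of product i, and lane i of low_reg holds the lower n bits.
    """
    mask = (1 << n) - 1
    high = low = 0
    for i in range(reg_bits // n):
        x = (a >> (n * i)) & mask
        y = (b >> (n * i)) & mask
        p = x * y                        # full 2n-bit product
        high |= (p >> n) << (n * i)      # upper half, same lane position
        low |= (p & mask) << (n * i)     # lower half, same lane position
    return high, low

A0, A1 = 0xFFFFFFFF, 0x00010000
B0, B1 = 0xFFFFFFFF, 0x00010000
high, low = simd_mul_partition((A1 << 32) | A0, (B1 << 32) | B0, n=32)

# The two halves are contiguous and non-overlapping, so the full product
# of lane 0 can be rebuilt exactly from the two result registers.
m = 0xFFFFFFFF
assert ((high & m) << 32) | (low & m) == A0 * B0
```

Either register can be used directly as a SIMD operand of the original lane width, while the pair together still holds every bit of each product.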

Examples (ii) and (iii) are similar, and relate to 16-bit input values and 8-bit input values respectively. In each case the input values are multiplied in response to a SIMD multiply instruction, and each result data value is split between different registers, one holding the upper half of the full result and the other holding the lower half.

If it is desired to continue processing with the result of the multiplication using further SIMD-type operations of the same data width, the high order result register 17 can be read directly and used as an input to those further operations. No shifting or rearrangement is required, which benefits code density, speed and power consumption. In a particularly preferred case, the high order result register 17 and the low order result register 18 are used as the destination of an accumulate operation, so that the results of successive multiplications are accumulated in these registers and the low order values held in the low order result register 18 are continuously updated, yielding a more accurate result and resisting rounding errors. The technique thus gives direct access to values of the correct data width using a single instruction, while still retaining precision by preserving the full data width of the result.
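The benefit of accumulating into both result registers can be seen in a small single-lane sketch. The functions `mac_full` and `mac_truncated` are hypothetical illustrations, assuming unsigned 16-bit inputs: the first keeps all 2n bits across the accumulation, the second discards the low half of every product before adding.

```python
def mac_full(pairs, n: int = 16):
    """Accumulate full 2n-bit products, then split into (high, low)."""
    acc = 0
    for x, y in pairs:
        acc = (acc + x * y) & ((1 << (2 * n)) - 1)
    return acc >> n, acc & ((1 << n) - 1)

def mac_truncated(pairs, n: int = 16) -> int:
    """Accumulate only the high n bits of each product."""
    acc = 0
    for x, y in pairs:
        acc = (acc + ((x * y) >> n)) & ((1 << n) - 1)
    return acc

pairs = [(0x00FF, 0x0101)] * 4     # each product is 0xFFFF
print(mac_full(pairs))             # high half reflects the summed low parts
print(mac_truncated(pairs))        # low parts were thrown away each time
```

Each product here is just below 2^16, so its high half is zero. Truncating every time yields 0, whereas keeping the low halves lets their sum carry into the high half and gives 3.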

FIG. 4 schematically illustrates a portion of the data path of FIG. 1 in more detail. A SIMD integer multiplier 20 is supplied with two 64-bit input values taken from respective registers of the register bank 4. These input values may represent a single 64-bit by 64-bit non-SIMD operation, or one of the three SIMD-type operations described above. The SIMD multiplier 20 provides appropriate breaks in its carry chains and the like so as to separate the independent input values and the result values produced. The output from the SIMD multiplier 20 is in carry-save format. When the system is operating in signed fractional mode, a fractional mode signal supplied to multiplexers 22, 24 causes the carry-save output to be shifted by one bit position, doubling its numerical value so as to compensate for the additional sign bit in the most significant position. An adder 26 adds the carry-save output from the SIMD multiplier 20 either to a partially accumulated value recirculated from the save and carry registers 28, 30, or to a 128-bit value from registers D, C of the register bank 4, as selected by multiplexers 32, 34.

The multiplexers 32, 34 are controlled by an accumulate control signal whose various values are shown in the table at the bottom of FIG. 4. The system may be arranged to accumulate from the source register file, to multiply without accumulating, to accumulate onto a previous partially computed result (such as during a vector-type operation), or to bypass the register bank as the source of the accumulate value.

When the multiply and add operations of a given processing operation are complete, the final 128-bit save and carry values from the registers 28, 30 are passed to an adder 36, where they are added together to give a conventional 128-bit representation of the result. The multiplication and addition may be pipelined operations. The output of the adder 36 has twice the bit width of the 64-bit input values from registers A, B. Accordingly, each SIMD result value has twice the width of the independent SIMD input values. The output of the adder 36 is supplied to the result partitioner, which in this embodiment takes the form of the various multiplexers shown in FIG. 5.
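The carry-save arrangement described above can be sketched with the standard 3:2 compressor identity, in which a running (save, carry) pair absorbs each new value without propagating carries, and a single final addition resolves the pair into a conventional number. This is a generic illustration of carry-save arithmetic, not the patent's circuit; the function names and the non-SIMD 128-bit width are assumptions.

```python
W = 128
MASK = (1 << W) - 1

def carry_save_add(s: int, c: int, x: int):
    """Fold x into the (save, carry) pair without carry propagation."""
    new_s = s ^ c ^ x                              # bitwise sum, no carries
    new_c = ((s & c) | (s & x) | (c & x)) << 1     # carries, moved up one bit
    return new_s & MASK, new_c & MASK

def resolve(s: int, c: int) -> int:
    """Final carry-propagating add, as a last-stage adder would perform."""
    return (s + c) & MASK

s = c = 0
values = [3, 5, 7, MASK]
for v in values:
    s, c = carry_save_add(s, c, v)
assert resolve(s, c) == sum(values) & MASK
```

Deferring carry propagation to one final addition is what allows the intermediate accumulation steps to be fast. In a SIMD mode the carry shifts would additionally have to be blocked at lane boundaries, which this sketch does not model.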

In FIG. 5, a high order result register 38 receives the selected portion of each result value that forms its high order portion. A low order result register 40 receives the corresponding low order portion of each result value. Control signals B, H, W and L indicate the SIMD data width in use (byte, half word, word and long word). At any given time one of these signals is asserted as "1" and the others are "0". These width-specifying signals control the multiplexers shown in FIG. 5, in accordance with the logical expression given adjacent to each multiplexer, so as to select between the various inputs of the multiplexer concerned. The overall action of the multiplexers of FIG. 5 under this control is to select and repartition among the 128 bits output by the adder 36 so as to form the contents of the high order result register 38 and the low order result register 40 as shown in the various examples of FIG. 3.
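The overall selection performed by the multiplexers can be modelled in software as a function of the width signal. The sketch below assumes that lane i's double-width result occupies bits [2n·i, 2n·(i+1)) of the 128-bit adder output; that layout, the `WIDTHS` table and the name `repartition` are assumptions made for illustration and are not taken from FIG. 5.

```python
WIDTHS = {"B": 8, "H": 16, "W": 32, "L": 64}   # lane width in bits

def repartition(result128: int, width: str):
    """Split a 128-bit result into (high_reg, low_reg) for a given width."""
    n = WIDTHS[width]
    mask = (1 << n) - 1
    high = low = 0
    for i in range(64 // n):
        r = (result128 >> (2 * n * i)) & ((1 << (2 * n)) - 1)
        low |= (r & mask) << (n * i)     # lower half of lane i's result
        high |= (r >> n) << (n * i)      # upper half of lane i's result
    return high, low

# Bytes numbered 0x00 (lowest) to 0x0F (highest) make the routing visible.
r = 0x0F0E0D0C0B0A09080706050403020100
print([hex(v) for v in repartition(r, "H")])
```

With width "L" the function simply returns the top and bottom 64 bits, which corresponds to the non-SIMD case; narrower widths interleave the selection more finely.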

The program instructions that are supplied to the decoder 14 of FIG. 1 and that control the circuits of FIGS. 4 and 5 in the manner illustrated have a syntax that includes parameters specifying whether the data width in use is the full non-SIMD data width or one of the various SIMD data widths. The program instructions also specify whether an accumulate is to be performed and whether that accumulate uses a further register value or the "internal" partial result.

In addition to the two result registers 38, 40 of FIG. 5, a guard register may also be provided. Guard bits computed from an extended-width version of the accumulate result are supplied to this guard register. For example, if 16-bit SIMD data values are used in a multiply-accumulate operation, the accumulator may be wider than 32 bits, for example 34 or 36 bits depending on whether two or four guard bits are provided, so that an overflow of the accumulated value is accommodated in the guard bits. In this embodiment, the guard bits may be partitioned off into a separate guard bit register. Viewed this way, the guard bit register provides guard bits at the top end of the result, the low order result register provides guard bits at the bottom end of the result, and the high order result register provides the normally required SIMD-width-preserved data values.
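The three-way split described above can be sketched for a single lane with 16-bit inputs and a 36-bit accumulator (four guard bits). The function name `split_with_guard` and the unsigned arithmetic are assumptions for illustration only.

```python
def split_with_guard(acc: int, n: int = 16, guard_bits: int = 4):
    """Split an extended accumulator into (guard, high, low) fields."""
    acc &= (1 << (2 * n + guard_bits)) - 1   # keep 2n + guard_bits bits
    low = acc & ((1 << n) - 1)               # bottom n bits
    high = (acc >> n) & ((1 << n) - 1)       # SIMD-width result
    guard = acc >> (2 * n)                   # overflow beyond 2n bits
    return guard, high, low

acc = 0
for _ in range(3):
    acc += 0xFFFF * 0xFFFF                   # each product is 0xFFFE0001

guard, high, low = split_with_guard(acc)
print(guard, hex(high), hex(low))
```

Three such products overflow a 32-bit accumulator, but the overflow is captured in the guard field (value 2 here) instead of being lost, so the true sum can still be reconstructed or used to drive saturation.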

FIG. 6 schematically illustrates a multiply-accumulate operation on a multiple data format that provides a stacked register result.

Registers A and B are in this case 64-bit SIMD registers, each holding four 16-bit quantities (A0-A3 and B0-B3). The result of multiplying these registers together is a vector of four results, each of which can be up to 32 bits wide.

The four 32-bit multiplication results can be accumulated with four 32-bit values held in two further registers C and D, each of which holds two 32-bit quantities.

The result of the addition can be stored in registers RL and RH in a stacked format.
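The FIG. 6 operation can be modelled roughly as follows. Several details are assumptions rather than statements from the text: lanes are treated as unsigned, register C is taken to hold accumulators 0 and 1 and register D accumulators 2 and 3, sums wrap modulo 2^32, and "stacked" is read as RH holding the four high halves and RL the four low halves. The name `mac_stacked` is hypothetical.

```python
M16, M32 = 0xFFFF, 0xFFFFFFFF

def mac_stacked(a: int, b: int, c: int, d: int):
    """Four-lane 16x16 multiply-accumulate returning (RH, RL)."""
    # Assumed layout: C holds accumulators 0-1, D holds accumulators 2-3.
    accs = [c & M32, (c >> 32) & M32, d & M32, (d >> 32) & M32]
    rh = rl = 0
    for i in range(4):
        x = (a >> (16 * i)) & M16
        y = (b >> (16 * i)) & M16
        s = (accs[i] + x * y) & M32          # 32-bit accumulate (wraps)
        rh |= (s >> 16) << (16 * i)          # high half into RH, lane i
        rl |= (s & M16) << (16 * i)          # low half into RL, lane i
    return rh, rl

a = 0x0004000300020001                       # lanes 1, 2, 3, 4
b = 0x0028001E0014000A                       # lanes 10, 20, 30, 40
c = 0x0000FFFF00010000                       # accumulators 0x10000, 0xFFFF
d = 0x1234567800000000                       # accumulators 0, 0x12345678
rh, rl = mac_stacked(a, b, c, d)
print(hex(rh), hex(rl))
```

RH can then feed a further 16-bit SIMD operation directly, while RL retains the low-order bits for later accumulation.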

Claims

Apparatus for performing a data processing operation in response to a data processing instruction, the apparatus comprising:

processing logic operable, in response to said data processing instruction, to generate a respective plurality of result data values from a plurality of independent data values stored in one or more input stores; and

a result partitioner operable, in response to said data processing instruction, to store a high order bit portion of each result data value in a high order result store and a low order bit portion of each result data value in a low order result store.

The apparatus of claim 1,

wherein said processing logic is operable to multiply together respective pairs of independent data values, a first independent data value of each pair being taken from a first input store and a second independent data value of each pair being taken from a second input store.

The apparatus of claim 2,

wherein said processing logic is operable to accumulate values already stored in the high order result store and the low order result store with values generated from the respective pairs of independent data values, thereby generating said plurality of result data values.

The apparatus of claim 1,

wherein the high order bit portion and the low order bit portion of each result data value are non-overlapping contiguous portions of that result data value.

The apparatus of claim 2,

wherein said data processing instruction specifies that said independent data values are signed fractional values, and said processing logic is operable to double each value obtained by multiplying a first independent data value by a second independent data value.

The apparatus of any of the preceding claims,

wherein each input store stores M independent N-bit data values.

The apparatus of claim 6,

wherein said data processing instruction specifies the data width of said independent data values.

The apparatus of claim 2,

wherein said processing logic comprises an integer multiplier operable to multiply together each said pair of independent data values.

The apparatus of any of the preceding claims,

wherein said processing logic is operable to perform a saturating data processing operation on said independent data values.

The apparatus of any of the preceding claims,

wherein the result partitioner comprises a plurality of multiplexers controlled in dependence upon said data processing instruction.

The apparatus of any of the preceding claims,

wherein said apparatus is a processor core.

The apparatus of any of the preceding claims,

wherein the one or more input stores are at least one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The apparatus of any of the preceding claims,

wherein the high order result store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The apparatus of any of the preceding claims,

wherein the low order result store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The apparatus of any of the preceding claims,

wherein said processing logic is operable to generate at least one high order guard bit for each result data value, and said result partitioner is operable to store said guard bit in a guard bit store.

The apparatus of claim 15,

wherein the guard bit store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

A method of performing a data processing operation in response to a data processing instruction, the method comprising the steps of:

in response to said data processing instruction, generating a respective plurality of result data values from a plurality of independent data values stored in one or more input stores; and

in response to said data processing instruction, partitioning said result data values by storing a high order bit portion of each result data value in a high order result store and a low order bit portion of each result data value in a low order result store.

The method of claim 17,

wherein respective pairs of independent data values are multiplied together, a first independent data value of each pair being taken from a first input store and a second independent data value of each pair being taken from a second input store.

The method of claim 18,

wherein values already stored in the high order result store and the low order result store are accumulated with values generated from the respective pairs of independent data values to generate said plurality of result data values.

The method of claim 17,

wherein the high order bit portion and the low order bit portion of each result data value are non-overlapping contiguous portions of that result data value.

The method of claim 18,

wherein said data processing instruction specifies that said independent data values are signed fractional values, and each value obtained by multiplying a first independent data value by a second independent data value is doubled.

The method of claim 17,

wherein each input store stores M independent N-bit data values.

The method of claim 22,

wherein said data processing instruction specifies the data width of said independent data values.

The method of claim 18,

wherein each said pair of independent data values is multiplied together by an integer multiplier.

The method of claim 17,

wherein a saturating data processing operation is performed on said independent data values.

The method of claim 17,

wherein said partitioning is performed at least in part by a plurality of multiplexers controlled in dependence upon said data processing instruction.

The method of claim 17,

wherein said method is performed in a processor core.

The method of claim 17,

wherein the one or more input stores are at least one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The method of claim 17,

wherein the high order result store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The method of claim 17,

wherein the low order result store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.

The method of any one of claims 17 to 30,

wherein at least one high order guard bit is generated for each result data value, and said guard bit is stored in a guard bit store.

The method of claim 31,

wherein the guard bit store is one of:

a register bank register;

a dedicated register;

a buffer memory;

a first-in-first-out buffer; and

a memory.