KR100788500B1

KR100788500B1 - Vertex processing apparatus and method having multi-stage pipeline structure

Info

Publication number: KR100788500B1
Application number: KR1020060066586A
Authority: KR
Inventors: 박기현; 정형기; 이광엽
Original assignee: 엠텍비젼 주식회사
Priority date: 2006-07-14
Filing date: 2006-07-14
Publication date: 2007-12-24

Abstract

A vertex processing apparatus and method having a multi-stage pipeline structure are provided, which reduce the maximum latency between opcodes to three stages by using an ALU (Arithmetic Logic Unit) with a multi-stage pipeline structure in which operation units are connected sequentially. A first register (310a) stores data for vertex processing, and a second register (310b) stores operation results. An instruction fetch unit (300) sequentially fetches instructions, each including an opcode (OP code) and one or more operands. A decoding unit (320) decodes the instruction and reads the required data from the first register according to the operands. The ALU (330) comprises a plurality of operation units, sequentially processes the data through one or more of them according to the type of the opcode, and outputs the result. A write-back unit (340) stores the operation result of the highest-priority opcode in the second register (310b) and temporarily holds the other results internally so that they are processed in the next stage.

Description

Vertex processing apparatus and method having multi-stage pipeline structure

FIG. 1 is a diagram illustrating the pipeline flow of conventional opcodes.

FIG. 2 is a diagram illustrating the opcode classification of instructions according to a preferred embodiment of the present invention.

FIG. 3 is a block diagram of a vertex processing apparatus according to a preferred embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of pipeline flow in the vertex processing apparatus shown in FIG. 3.

FIG. 5 is a diagram illustrating another example of multi-stage pipeline flow in the vertex processing apparatus shown in FIG. 3.

FIG. 6 is a flowchart of an arithmetic operation method in the arithmetic logic module according to a preferred embodiment of the present invention.

FIG. 7 is a flowchart of an operation result processing method in the write-back module according to a preferred embodiment of the present invention.

FIG. 8 is a flowchart of a vertex processing method that reduces the delay caused by register replacement, according to another preferred embodiment of the present invention.

FIG. 9 is a diagram illustrating the delay caused by conventional register replacement.

FIG. 10 is a diagram illustrating a vertex processing method with reduced delay according to the present invention.

<Description of reference numerals for the main parts of the drawings>

300: instruction fetch module

310a: first register

310b: second register

320: decoding module

330: arithmetic logic unit

340: write-back unit

350: forwarding unit

The present invention relates to a vertex processing apparatus and, more particularly, to a vertex processing apparatus and method that organize the stages into a multi-stage pipeline within the one cycle running from instruction fetch to write-back when a vertex processing instruction is executed, thereby minimizing the number of stages required.

OpenGL ES (Open Graphics Library Embedded System) is a cross-platform application program interface (API) for 2D/3D graphics functions on embedded systems, including automobiles, various appliances and portable devices. It is a subset of OpenGL (Open Graphics Library), the 3D graphics standard for PC environments, and provides a flexible yet powerful low-level interface between software applications and a hardware or software graphics engine.

OpenGL ES is a software solution that processes 3D graphics operations in order to provide 3D games and a variety of advanced 3D graphics functions on embedded systems, including mobile devices such as mobile communication terminals, personal digital assistants (PDAs) and portable multimedia players (PMPs), as well as automotive controllers, refrigerator controllers and factory robot controllers.

Graphics hardware that supports OpenGL ES implements in hardware the 3D algorithms provided by OpenGL ES, and is a device for processing 3D graphics operations in real time. Conventional graphics hardware processed 3D data according to fixed algorithms.

A representative embedded system is the portable terminal, a mobile device. Portable terminals currently shipped with a 3D graphics engine (that is, graphics hardware) process graphics operations through the following steps.

The shape of the object to be rendered is divided into a set of triangular polygons. The three corner points of each polygon are called vertices. The graphics hardware receives data such as the position, color, normal vector and texture coordinates of the three vertices from the application program interface.

In the vertex processing step, matrix operations determine the on-screen coordinates of the input vertices, and the brightness of each point is determined according to an illumination model (for example, the Phong illumination model).

Next, in the primitive assembly step, the points whose coordinate transformation and lighting calculation are complete are gathered to form triangles. The rasterizer step then determines which pixels each triangle occupies on the screen.

Then, according to the specified state information, the pixel data determined by the rasterizer step passes through pixel processing (that is, texture operations, color sum and fog effects) to determine its final color, and the rendered pixels are output.

In more detail, vertex processing is divided into converting vertex coordinates from the model coordinate system to the screen coordinate system, and calculating lighting.

The vertex coordinates contained in the vertex data are converted from the coordinate system in which each model is defined (the origin is generally the center of the model) to the world coordinate system, the coordinate system of the virtual world in which multiple models coexist. That is, points in the model coordinate system are translated, rotated and scaled to obtain points in the world coordinate system. The points in the world coordinate system then undergo a view transformation into the view coordinate system, which is centered on the camera and computed through translation and rotation, followed by a projection transformation into the projection coordinate system, which corresponds to the result of perspective projection. The projection transformation makes the x and y coordinates of points in the view coordinate system smaller the farther they are from the origin. Finally, a size conversion (viewport scale) according to the size of the actual target screen converts the points into the screen coordinate system.
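The chain of transformations described above can be sketched as a sequence of 4x4 matrix multiplications followed by a perspective divide and a viewport scale. This is only an illustrative sketch, not part of the disclosed apparatus: the function names, the example projection matrix and the screen size are all assumptions.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def to_screen(v_model, world, view, proj, width, height):
    """Model -> world -> view -> projection -> screen coordinates."""
    v = mat_vec(world, v_model)          # model to world
    v = mat_vec(view, v)                 # world to view (camera)
    x, y, _z, w = mat_vec(proj, v)       # view to projection
    ndc_x, ndc_y = x / w, y / w          # perspective divide
    # Viewport scale: map [-1, 1] onto the screen size.
    return (ndc_x + 1.0) * 0.5 * width, (ndc_y + 1.0) * 0.5 * height

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
# Hypothetical projection that copies z into w, so x and y shrink with depth.
SIMPLE_PROJ = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]]

near = to_screen([1.0, 1.0, 2.0, 1.0], IDENTITY, IDENTITY, SIMPLE_PROJ, 640, 480)
far = to_screen([1.0, 1.0, 4.0, 1.0], IDENTITY, IDENTITY, SIMPLE_PROJ, 640, 480)
```

With this assumed projection, the point at the greater depth lands closer to the screen center, which matches the description of the projection transformation.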

Then a lighting calculation determines the vertex color by summing three components: ambient lighting, the component of light contributed indirectly by reflection from other surrounding objects; diffuse lighting, the component scattered and reflected from the surface of the object; and specular lighting, the component reflected from the surface of the object in a specific direction (taking the eye position into account).
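The sum of the three lighting components can be illustrated with a Phong-style calculation for a single light and a single intensity channel. This is a simplified sketch under stated assumptions: the coefficient names `ka`, `kd`, `ks` and `shininess` and the use of unit-length direction vectors are illustrative choices, not values taken from the patent.

```python
def dot3(a, b):
    """3-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def vertex_intensity(normal, to_light, to_eye, ka, kd, ks, shininess):
    """Ambient + diffuse + specular for unit vectors (assumed normalized)."""
    n_dot_l = dot3(normal, to_light)
    diffuse = kd * max(n_dot_l, 0.0)
    # Reflection of the light direction about the normal: R = 2(N.L)N - L
    reflect = [2.0 * n_dot_l * n - l for n, l in zip(normal, to_light)]
    r_dot_v = max(dot3(reflect, to_eye), 0.0)
    specular = ks * r_dot_v ** shininess if n_dot_l > 0.0 else 0.0
    return ka + diffuse + specular

lit = vertex_intensity([0, 0, 1], [0, 0, 1], [0, 0, 1], 0.1, 0.5, 0.4, 8)
unlit = vertex_intensity([0, 0, 1], [0, 0, -1], [0, 0, 1], 0.1, 0.5, 0.4, 8)
```

A surface facing away from the light receives only the ambient term in this sketch.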

An instruction for the vertex processing described above consists of an opcode and one or more operands. The set of opcodes is shown in Table 1 below.

There are 20 conventional opcodes for vertex processing, as shown in Table 1, and the latency of executing them varies from 1 to 8 when measured in operation stages. Here, the number of operation stages means the number of stages that the arithmetic logic module needs for the operation within the one cycle from instruction fetch to write-back when an instruction is executed.

When the opcodes of Table 1 are executed in various orders to process each piece of vertex data, many stalls occur, which degrades performance and lowers the efficiency of vertex processing.

FIG. 1 is a diagram illustrating the pipeline flow of conventional opcodes.

For example, assume that the opcodes RSQ, ADD and DP3 are executed in that order for vertex processing. The first opcode, RSQ, needs 8 stages for its operation. The second opcode, ADD, is not computed until the write-back step, in which the result of RSQ is stored in the register after its operation completes. ADD itself needs only 1 stage, but because RSQ has a long latency, a stall occurs over clocks 1 to 8 of the clock (CLK) shown in FIG. 1.

That is, a hazard arises from the data dependency between the current opcode, RSQ, and the next opcode, ADD, so that either the next opcode cannot be executed or the entire pipeline must be designed with 8 stages. In addition, computing the opcodes in the order RSQ, ADD, DP3 takes a total of 14 clocks (14 stages).

Furthermore, in processing vertex data, the input register holding the current vertex data is replaced to receive the next vertex data only after the last instruction for the current vertex data has been decoded. As a result, the interval between processing the current vertex data and the next vertex data grows in proportion to the number of instructions used for vertex processing.

Accordingly, the present invention provides a vertex processing apparatus and method having a multi-stage pipeline structure that can reduce the maximum latency between opcodes to 3 stages through an arithmetic logic module with a multi-stage pipeline structure in which operation units are connected sequentially.

The present invention also provides a vertex processing apparatus and method having a multi-stage pipeline structure that can reduce the maximum latency through an arithmetic logic module containing a special operation module, which produces the result of a long-latency special opcode within 1 stage.

The present invention also provides a vertex processing apparatus and method having a multi-stage pipeline structure that decode opcodes using an opcode lookup table, thereby defining the operations of the arithmetic logic module efficiently and making it easy to extend and change the opcodes later.

The present invention also provides a vertex processing method having a multi-stage pipeline structure in which the delay in processing each piece of vertex data is constant regardless of the number of opcodes.

Other objects of the present invention will be readily understood from the following description.

To achieve the above objects, according to one aspect of the present invention, there may be provided a vertex processing apparatus having a multi-stage pipeline structure, comprising: a first register that stores data for vertex processing; a second register that stores operation results; an instruction fetch module that sequentially fetches instructions, each including an opcode (OP code) and one or more operands; a decoding module that decodes the instruction and reads the required data from the first register according to the operands; an arithmetic logic module (ALU) that consists of a plurality of sequential operation units and sequentially processes the data through one or more of the operation units according to the type of the opcode, then outputs the result; and a write-back module (write-back unit) that stores in the second register the operation result of the opcode with the highest priority among the operation results, and temporarily holds the other operation results internally so that they are processed in the next stage.

Preferably, in the vertex processing apparatus according to the present invention, the arithmetic logic module may include: a first operation unit that performs one or more of a basic operation and a multiplication, and outputs one or more of the resulting basic operation result and multiplication result, a predetermined value, and the input data; a second operation unit that outputs a first addition result obtained by adding the outputs of the first operation unit; and a third operation unit that outputs a second addition result obtained by adding the outputs of the second operation unit. Here, the first, second and third operation units form a pipeline, and each operation unit can operate on different data at the same time.
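As an illustration of how three sequential operation units could cooperate, the following sketch routes a 4-component dot product through a multiply stage and two addition stages. The pairing of products in the second stage and all function names are assumptions made for this example; the text only states that the second and third units add the outputs of the preceding unit.

```python
def stage1_multiply(a, b):
    """First operation unit: component-wise multiplication."""
    return [x * y for x, y in zip(a, b)]

def stage2_add(products):
    """Second operation unit: add the products pairwise (assumed pairing)."""
    return [products[0] + products[1], products[2] + products[3]]

def stage3_add(partials):
    """Third operation unit: add the two partial sums."""
    return partials[0] + partials[1]

def dp4(a, b):
    """4-component dot product using all three operation units in order."""
    return stage3_add(stage2_add(stage1_multiply(a, b)))

result = dp4([1, 2, 3, 4], [5, 6, 7, 8])
```

Because each function depends only on the output of the previous one, three different instructions could occupy the three units in the same clock, which is the property the pipeline structure relies on.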

The vertex processing apparatus according to the present invention may further include a source modifier that performs a swizzle operation, changing the order of the data input to the arithmetic logic module as needed.

The vertex processing apparatus according to the present invention may further include a source modifier that performs a negate operation, inverting the sign of the data input to the arithmetic logic module as needed.
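The swizzle and negate operations of the source modifier can be sketched as follows. The string-based swizzle pattern (such as "wzyx") and the function name are assumptions borrowed from common shader notation, not an encoding given in the patent.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def modify_source(vec, swizzle="xyzw", negate=False):
    """Reorder the components of vec per the swizzle pattern, then optionally negate."""
    if len(swizzle) != 4 or any(c not in COMPONENT_INDEX for c in swizzle):
        raise ValueError("swizzle must be four characters from 'xyzw'")
    reordered = [vec[COMPONENT_INDEX[c]] for c in swizzle]
    return [-v for v in reordered] if negate else reordered

reversed_vec = modify_source([1.0, 2.0, 3.0, 4.0], "wzyx")
negated_vec = modify_source([1.0, 2.0, 3.0, 4.0], negate=True)
```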

The vertex processing apparatus according to the present invention may further include a forwarding module (forwarding unit) that passes, from the write-back module to the decoding module, an operation result that the decoding module needs and that is held inside the write-back module but has not yet been stored in the second register.
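The forwarding idea can be sketched as an operand read that first checks results still held in the write-back module and only then falls back to the register file. The data structures and names below are hypothetical and chosen only to make the behavior concrete.

```python
def read_operand(name, register_file, pending_results):
    """Return the newest value for a register, preferring un-committed results.

    pending_results is a list of (dest, value) pairs in issue order,
    representing results held in the write-back module (an assumed model).
    """
    for dest, value in reversed(pending_results):
        if dest == name:
            return value          # forwarded from the write-back module
    return register_file[name]    # normal read from the register

registers = {"r0": 1.0, "r1": 2.0}
pending = [("r0", 10.0), ("r0", 20.0)]
forwarded = read_operand("r0", registers, pending)
plain = read_operand("r1", registers, pending)
```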

In the vertex processing apparatus according to the present invention, the first register may include an input register that stores vertex data, a constant register that stores constant data, and a temporary register that stores the operation results. Here, the decoding module may read one or more of the vertex data, the constant data and the operation results from one or more of the input register, the constant register and the temporary register.

In the vertex processing apparatus according to the present invention, the second register may include a temporary register that stores the operation results and an output register that stores output data. Here, the write-back module may store the operation result in either the temporary register or the output register.

In the vertex processing apparatus according to the present invention, the priority of the opcodes may be determined on a first-in, first-out (FIFO) basis.

In the vertex processing apparatus according to the present invention, when two or more operation results have the same destination address, the write-back module may store at that destination address the operation result of the opcode that is later in priority.
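The write-back behavior described above (commit the highest-priority result, hold the rest, FIFO priority, later result wins for a shared destination) can be modeled with a small sketch. The class, its one-commit-per-step policy and the sequence numbers are assumptions made for illustration, not the circuit disclosed in the patent.

```python
class WriteBack:
    """Toy model: commits one result per step, oldest instruction first."""

    def __init__(self):
        self.pending = []     # (sequence number, destination, value)
        self.registers = {}   # stands in for the second register

    def accept(self, seq, dest, value):
        """Receive a result from the ALU; it may arrive out of issue order."""
        self.pending.append((seq, dest, value))

    def step(self):
        """Commit the highest-priority (oldest) result and hold the others."""
        if not self.pending:
            return None
        oldest = min(self.pending, key=lambda r: r[0])
        self.pending.remove(oldest)
        _seq, dest, value = oldest
        self.registers[dest] = value
        return dest

wb = WriteBack()
wb.accept(1, "r0", 2.0)   # later instruction, result arrived first
wb.accept(0, "r0", 1.0)   # earlier instruction, same destination
first = wb.step()
second = wb.step()
```

Committing in FIFO order means that, for a shared destination, the value of the later instruction is the one left in the register.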

In the vertex processing apparatus according to the present invention, the arithmetic logic module may include a special operation module that brings the number of operation stages to 1 for predetermined special opcodes. Here, the special opcodes may be EXP, LOG, EX2, LG2, RCP and RSQ.

The number of operation stages of each opcode is 1, 2 or 3, and the arithmetic logic module may be divided into three operation units forming a pipeline. Here, among the opcodes, POW, DPH, DP3 and DP4 may have 3 operation stages, MAD and XPD may have 2 operation stages, and the other opcodes may have 1 operation stage.
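The classification above maps directly onto a small lookup. The sketch below encodes only the stage counts stated in the text; the function and constant names are illustrative.

```python
THREE_STAGE_OPCODES = {"POW", "DPH", "DP3", "DP4"}
TWO_STAGE_OPCODES = {"MAD", "XPD"}

def operation_stages(opcode):
    """Return the number of operation stages (1, 2 or 3) for an opcode."""
    name = opcode.upper()
    if name in THREE_STAGE_OPCODES:
        return 3
    if name in TWO_STAGE_OPCODES:
        return 2
    return 1   # all other opcodes, including RSQ via the special operation module

stages = [operation_stages(op) for op in ("RSQ", "ADD", "DP3")]
```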

In the vertex processing apparatus according to the present invention, the decoding module may decode the opcode using an opcode lookup table that determines a different decoding method according to the type of the opcode.

To achieve the above objects, according to another aspect of the present invention, there may be provided a vertex processing method having a multi-stage pipeline structure, comprising: (a) receiving an opcode and data; (b) performing a first operation; (c) determining whether the number of operation stages of the opcode is 1; (d) outputting the result of step (b) as the final result if the number of operation stages is 1, and otherwise performing a second operation; (e) determining whether the number of operation stages of the opcode is 2; (f) outputting the result of step (d) as the final result if the number of operation stages is 2, and otherwise performing a third operation; and (g) outputting the result of step (f) as the final result.
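Steps (a) to (g) amount to running up to three operations with an early exit once the stage count of the opcode is reached. A minimal sketch, assuming the three operations are passed in as plain functions (a hypothetical interface):

```python
def run_alu(stage_count, data, first_op, second_op, third_op):
    """Follow steps (a)-(g): stop after the stage the opcode needs."""
    if stage_count not in (1, 2, 3):
        raise ValueError("stage count must be 1, 2 or 3")
    result = first_op(data)        # step (b)
    if stage_count == 1:           # steps (c)-(d): final output after stage 1
        return result
    result = second_op(result)     # step (d): second operation
    if stage_count == 2:           # steps (e)-(f): final output after stage 2
        return result
    return third_op(result)        # steps (f)-(g): third operation and output

triple = lambda x: x * 3
plus_one = lambda x: x + 1
plus_ten = lambda x: x + 10
outputs = [run_alu(n, 2, triple, plus_one, plus_ten) for n in (1, 2, 3)]
```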

Preferably, the vertex processing method according to the present invention may further comprise, after step (g): (h) determining the priority of the opcode corresponding to each final result; and (i) storing the result of the opcode with the highest priority in a destination register and bypassing the results of the other opcodes, wherein steps (h) and (i) may be repeated until all of the results have been stored in the destination register.

To achieve the above objects, according to yet another aspect of the present invention, there may be provided a vertex processing method having a multi-stage pipeline structure, comprising: (a) storing vertex data for vertex processing in an input register; (b) sequentially receiving instructions for processing the vertex data and reading the vertex data from the input register; (c) decoding the instruction into machine code; (d) replacing the input register if the instruction is the last instruction for processing the vertex data, and otherwise performing vertex processing according to the machine code and repeating steps (b) and (c); and (e) storing the next vertex data in the input register while performing vertex processing according to the last instruction, and repeating steps (b) to (d).

Preferably, in the vertex processing method according to the present invention, step (d) may be delayed by 2 clocks for the replacement of the input register.

Hereinafter, preferred embodiments of the vertex processing apparatus and method having a multi-stage pipeline structure according to the present invention are described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the subject matter of the invention. Ordinal numbers used in this specification (for example, first, second, and so on) are merely identifiers for distinguishing identical or similar entities in sequence.

In the present invention, a cycle means the period in which instruction fetch, instruction decode, operation and write-back are each executed once in sequence when an instruction for vertex processing is carried out. A stage is one step among the actions needed to execute a single instruction: instruction fetch, instruction decode and write-back each correspond to 1 stage, while the operation corresponds to 1 to 3 stages depending on the case. That is, one cycle, the period in which one instruction is executed, contains an operation made up of one or more stages. The number of operation stages means the number of stages needed to perform the operation within the one cycle in which an instruction is executed.
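Under these definitions, the number of stages in one cycle is the three fixed stages plus the operation stages of the instruction. The following sketch simply restates that arithmetic; the names are illustrative.

```python
FIXED_STAGES = ("instruction fetch", "instruction decode", "write-back")

def stages_per_cycle(operation_stage_count):
    """Total stages in one cycle: fetch + decode + operation (1-3) + write-back."""
    if operation_stage_count not in (1, 2, 3):
        raise ValueError("operation stages must be 1, 2 or 3")
    return len(FIXED_STAGES) + operation_stage_count

shortest = stages_per_cycle(1)
longest = stages_per_cycle(3)
```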

FIG. 2 is a diagram illustrating the opcode classification of instructions according to a preferred embodiment of the present invention. The opcodes of Table 1 vary widely in latency, that is, in the number of operation stages, from 1 to 8, whereas the vertex processing apparatus of the present invention limits the opcodes of vertex processing instructions to at most 3 operation stages.

In the vertex processing apparatus of the present invention, the opcodes for vertex processing are, as indicated by reference numeral 210, ABS, ADD, ARL, DP3, DP4, DPH, DST, EX2, EXP, FLR, FRC, LG2, LIT, LOG, MAD, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT, SUB, SWZ and XPD.

The basic syntax of an instruction is given by Equation 1 below.

opcode destination operand, source operand0, (source operand1, source operand2)

An instruction consists of an opcode, a destination operand (hereinafter dest) and source operands (hereinafter src).
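The syntax above can be illustrated with a small parser that splits an instruction into its opcode, dest and src fields. The register names in the example (r0, v0, c0, c1) and the textual form of the instruction are assumptions; the patent refers to operands by register index address.

```python
def parse_instruction(text):
    """Split 'OPCODE dest, src0[, src1[, src2]]' into its parts."""
    opcode, _, rest = text.strip().partition(" ")
    operands = [part.strip() for part in rest.split(",") if part.strip()]
    # One destination operand plus one to three source operands.
    if not opcode or not 2 <= len(operands) <= 4:
        raise ValueError("expected an opcode, a dest and 1 to 3 src operands")
    return {"opcode": opcode.upper(), "dest": operands[0], "src": operands[1:]}

mad = parse_instruction("MAD r0, v0, c0, c1")
mov = parse_instruction("mov r1, v0")
```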

The destination operand and/or the source operands are written using the name of a register or of a variable bound to a register, and in the present invention the index address of each register is used to reference an operand. The destination operand points to the destination register that stores the operation result. A source operand points to an input register, constant register, temporary register, address register or the like, which hold data for the operation such as vertex data, constant data, intermediate operation results and address values.

Each instruction basically has a destination operand and, depending on the operation, 1 to 3 source operands. A 4-component vector has the form [x, y, z, w], in which each datum is a set of four real numbers. An s scalar is a value that has only an x component, and an ssss scalar has the four components x, y, z and w, all with the same value.

Of the opcodes, the 24 basic opcodes are listed in Table 2 below. Here, a denotes the address register, v a 4-component vector, s a scalar, ssss a 4-component scalar, normal a basic opcode, and macro a macro opcode.

(1) ABS (absolute value) assigns the absolute value of each of the x, y, z, and w components of src0 (a 4-component vector) to the x, y, z, and w components of dest, respectively.

(2) ADD (add two vectors) adds the x, y, z, and w components of src0 and src1 (4-component vectors) and assigns the sums to the x, y, z, and w components of dest, respectively.

(3) ARL (address register load) assigns the largest integer less than or equal to the value stored in src0 (an s scalar) to a0, the address register. The value assigned to a0 becomes the base address.

(4) DP3 (3-component dot product) computes the dot product of the x, y, and z components of src0 and src1 (4-component vectors) and assigns the result to all four components of dest (an ssss scalar).

(5) DP4 (4-component dot product) computes the dot product of the x, y, z, and w components of src0 and src1 (4-component vectors) and assigns the result to all four components of dest (an ssss scalar).

(6) DPH (homogeneous dot product) computes the dot product of the x, y, and z components of src0 and src1 (4-component vectors), assigns the result to the x, y, and z components of dest (a 4-component vector), and assigns the value of the w component of src1 to the w component of dest. The result is the same as setting the w component of src0 to 1.0 and applying DP4 to src0 and src1.

(7) DST (distance vector) computes a distance vector from two operands that have a special format. For the 4-component vectors src0 = [NA, d², d², NA] and src1 = [NA, 1/d, NA, 1/d], where NA is irrelevant to the calculation and d is the vector length, the x component of dest (a 4-component vector) is set to 1.0, the y component to the product of the y components of src0 and src1, the z component to the z component of src0, and the w component to the w component of src1, so that dest takes the form [1.0, d, d², 1/d].

(8) EX2 (exponential base 2) takes 2 as the base and src0 (an s scalar) as the exponent and assigns 2^src0 to each of the four components of dest (an ssss scalar).

(9) EXP (exponential base 2, approximate) takes 2 as the base and src0 (an s scalar) as the exponent and assigns a partially accurate value of 2^src0 to dest (a 4-component vector). The x component of dest receives 2 raised to the integer part of src0 (the largest integer less than or equal to src0), the y component receives the fractional part of src0 (src0 minus the largest integer less than or equal to src0), the z component receives an approximation of 2^src0, and the w component receives 1.0.

(10) FLR (floor) assigns the integer part (the largest integer less than or equal to the value) of each component of src0 (a 4-component vector) to the corresponding component of dest (a 4-component vector).

(11) FRC (fraction) assigns the fractional part (the value minus the largest integer less than or equal to it) of each component of src0 (a 4-component vector) to the corresponding component of dest (a 4-component vector).

(12) LG2 (logarithm base 2) assigns the base-2 logarithm of src0 (an s scalar) to each component of dest (an ssss scalar).

(13) LOG (logarithm base 2, approximate) assigns an approximation of the base-2 logarithm of src0 (an s scalar) to dest (a 4-component vector). The x component of dest receives the integer part of the base-2 logarithm of src0, the y component receives the value of src0 shifted right by the x component of dest, and the w component receives 1.0.

(14) MAD (multiply and add) is the only instruction in the present invention that uses all three source operands. The product of the corresponding components of src0 and src1 (4-component vectors), plus the corresponding component of src2, is assigned to each component of dest (a 4-component vector).

(15) MAX (maximum) assigns the larger of the corresponding components of src0 and src1 (4-component vectors) to each component of dest (a 4-component vector).

(16) MIN (minimum) assigns the smaller of the corresponding components of src0 and src1 (4-component vectors) to each component of dest (a 4-component vector).

(17) MOV (move) assigns the value of each component of src0 (a 4-component vector) to the corresponding component of dest (a 4-component vector).

(18) MUL (multiply) assigns the product of the corresponding components of src0 and src1 (4-component vectors) to each component of dest (a 4-component vector).

(19) RCP (reciprocal) assigns the reciprocal of src0 (an s scalar) to each component of dest (an ssss scalar).

(20) RSQ (reciprocal square root) assigns the reciprocal of the square root of src0 (an s scalar) to each component of dest (an ssss scalar).

(21) SGE (set on greater than or equal) compares the corresponding components of src0 and src1 (4-component vectors) and assigns to each component of dest (a 4-component vector) 1.0 if src0 is greater than or equal to src1, and 0.0 if it is less.

(22) SLT (set on less than) compares the corresponding components of src0 and src1 (4-component vectors) and assigns to each component of dest (a 4-component vector) 1.0 if src0 is less than src1, and 0.0 if it is greater than or equal.

(23) SWZ (extended swizzle) loads the components of src0 (a 4-component vector) together with 1.0 and 0.0, and for each component of dest (a 4-component vector) assigns one of these six values (the four components of src0, 1.0, or 0.0), combined with an optional sign inversion (negation).

(24) XPD (cross product) computes the cross product of the x, y, and z components of src0 and src1 (4-component vectors) and assigns it to the x, y, and z components of dest (a 4-component vector). The w component is undefined.
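The per-component descriptions above can be restated as a short Python sketch for a few of the execution codes. This is only an illustration under the assumption that a 4-component vector is a Python list [x, y, z, w]; the function names are hypothetical and the code models the arithmetic only, not the hardware.

```python
import math

def dp3(src0, src1):
    """DP3: dot product of x, y, z, replicated to all four components (ssss)."""
    d = src0[0] * src1[0] + src0[1] * src1[1] + src0[2] * src1[2]
    return [d, d, d, d]

def dp4(src0, src1):
    """DP4: dot product of x, y, z, w, replicated to all four components."""
    d = sum(p * q for p, q in zip(src0, src1))
    return [d, d, d, d]

def mad(src0, src1, src2):
    """MAD: component-wise src0 * src1 + src2."""
    return [p * q + r for p, q, r in zip(src0, src1, src2)]

def dst(src0, src1):
    """DST: src0 = [NA, d^2, d^2, NA], src1 = [NA, 1/d, NA, 1/d]."""
    return [1.0, src0[1] * src1[1], src0[2], src1[3]]

def exp_approx(s):
    """EXP: [2^floor(s), frac(s), 2^s, 1.0]; z is exact here, approximate in hardware."""
    fl = math.floor(s)
    return [2.0 ** fl, s - fl, 2.0 ** s, 1.0]

def sge(src0, src1):
    """SGE: 1.0 where src0 >= src1, otherwise 0.0."""
    return [1.0 if p >= q else 0.0 for p, q in zip(src0, src1)]
```

For example, with d = 2 the call dst([0, 4, 4, 0], [0, 0.5, 0, 0.5]) yields a vector of the form [1.0, d, d², 1/d].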

The three macro execution codes are as follows.

(1) LIT (compute light coefficients) is an operation for accelerating the per-vertex lighting effects of ambient, diffuse, and specular light. The LIT operation is composed of a set of basic execution codes, as shown in Equation 2 below.

LIT f, a, b

Clamp tmp, a.0, b

rLG2 tmp.w tmp.w

MUL tmp.w tmp.w tmp.y

rEX2 tmp.w tmp.w

mulz f, tmp1.1xz1, tmp.w

(2) POW (exponentiate) computes src0^src1 for the s scalars src0 and src1 and assigns the result to dest (an ssss scalar). The POW operation is composed of a set of basic execution codes, as shown in Equation 3 below.

POW f, a, b (f = a^b)

LG2 tmp, a

MUL tmp, tmp, b

EX2 f, tmp

(3) SUB (subtract) subtracts each component of src1 from the corresponding component of src0 (4-component vectors) and assigns the difference to each component of dest (a 4-component vector). The SUB operation is composed of a set of basic execution codes, as shown in Equation 4 below.

SUB f, a, b (f = a - b)

ADD f, a, -b

The three macro execution codes can be recomposed from the 24 basic execution codes, and during vertex processing the vertex processing apparatus interprets each macro execution code as a set of basic execution codes such as those expressed in Equations 2 to 4. The execution codes actually used by the vertex processing apparatus of the present invention are therefore the 24 basic execution codes, and the set can be extended with other execution codes.
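As a rough illustration of how a macro execution code can be interpreted as a set of basic execution codes, the sketch below rewrites POW and SUB following Equations 3 and 4. The tuple layout (opcode, dest, sources), the register name "tmp", and the function names are assumptions made for the example; LIT is left out because its expansion in Equation 2 uses steps that are not spelled out in the text.

```python
import math

def expand_macro(op, dest, srcs):
    """Rewrite one instruction as a list of basic (opcode, dest, sources) tuples."""
    if op == "POW":                      # Equation 3: f = a^b
        a, b = srcs
        return [("LG2", "tmp", [a]),
                ("MUL", "tmp", ["tmp", b]),
                ("EX2", dest, ["tmp"])]
    if op == "SUB":                      # Equation 4: f = a - b
        a, b = srcs
        return [("ADD", dest, [a, "-" + b])]
    return [(op, dest, list(srcs))]      # basic codes pass through unchanged

def pow_via_basic(a, b):
    """Numeric check of Equation 3 using scalar values."""
    tmp = math.log2(a)    # LG2 tmp, a
    tmp = tmp * b         # MUL tmp, tmp, b
    return 2.0 ** tmp     # EX2 f, tmp
```

The numeric helper relies on the identity a^b = 2^(b · log2 a), which holds for a > 0.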

The execution codes described above are basically classified into general execution codes (220), whose number of operation stages is 3 or less, and special execution codes (225), whose number of operation stages exceeds 3. The number of operation cycles is the delay, expressed as a number of clocks, during which an execution code performs its operation on the data designated by the source operands and stores the result in the register designated by the destination operand.

The general execution codes (220) are classified by their number of operation stages into a first group (number of operation stages = 1), a second group (number of operation stages = 2), and a third group (number of operation stages = 3). The first group (230, 232, 238) includes ADD, MUL, DST, MOV, MAX, MIN, SGE, SLT, ABS, ARL, FLR, and FRC; the second group (236) includes MAD and XPD; and the third group (234) includes DP3, DP4, and DPH.

The special execution codes (242) include EXP, LOG, EX2, LG2, RCP, and RSQ, whose number of operation stages is converted to 1 by a special operation module (described later with reference to FIG. 3).

The macro execution codes (240) consist of LIT and POW, which are interpreted through Equations 2 and 3 above as a set of general execution codes (220) and/or special execution codes (242).

Hereinafter, a vertex processing apparatus having a multi-stage pipeline structure is described, which performs vertex processing using the execution codes described above, whose number of operation stages is at most 3.

FIG. 3 is a block diagram of a vertex processing apparatus according to a preferred embodiment of the present invention, and FIG. 4 shows an example of pipeline flow in the vertex processing apparatus shown in FIG. 3.

The vertex processing apparatus includes an instruction fetch module (300), registers (310a, 310b), a decoding module (320), an arithmetic logic module (330), and a write-back module (340). A forwarding module (350) and/or a source modification module (360) may further be included as needed.

The instruction fetch module (300) sequentially fetches instructions, each including an execution code and operands, for vertex processing. The order in which execution codes are fetched by the instruction fetch module (300) determines their priority in the write-back module (340) described later.

The registers include a first register (310a), which stores the data for vertex processing (for example, vertex data, constant data, intermediate operation results, and address data), and a second register (310b), which stores the operation results output from the arithmetic logic module (330) (for example, output data for which vertex processing is complete, intermediate operation results, and address data).

The first register (310a) includes an input register (311) that stores vertex data, a temporary register (312) that stores intermediate operation results, and a constant register (313) that stores the constant data needed for vertex processing (or an address register (314) that stores the address at which constant data is stored, so that the constant data can be referenced).

The second register (310b) includes an output register (316) that stores, among the results computed by the arithmetic logic module (330), the output data for which vertex processing is complete, the temporary register (312) that stores intermediate operation results, and the address register (314) that stores address data.

Here, the temporary register (312) and the address register (314) are common to the first register (310a) and the second register (310b).

Vertex data is the stream data of the vertices of the polygons that make up an object to be displayed on the screen. To display an object three-dimensionally on the screen, the shape of the object is divided into a set of triangular polygons, and the data of the three vertices that make up each polygon is processed through vertex processing to produce a three-dimensional image.

Vertex data includes attribute data for each vertex, such as position, color, normal vector, and texture coordinate. Each attribute consists of a set of four real numbers. For example, the position attribute holds four real values: x, y, and z, which represent the three dimensions, and w, which represents the degree of projection. The color attribute holds four real values: r (red), g (green), and b (blue), which are the brightness values of the three primary colors, and alpha (a), which represents opacity.

In the OpenGL ARB extension 1.0 architecture, vertex data supports at least 16 attributes: eight attributes such as position, primary color, secondary color, normal vector, vertex weight, and fog coordinate, plus up to eight texture coordinates (first texture coordinate, second texture coordinate, third texture coordinate, and so on).

The input register (311) stores either the values of the vertex data or an address through which the vertex data can be referenced. The decoding module (320) receives the vertex data by reading the value stored in the input register (311) or by following the stored address. The input register (311) is read-only and cannot be modified or written.

The constant register (313) stores either constant data or an address through which the constant data can be referenced. Constant data are the values used in performing vertex processing, for example values for matrix calculation, specific color values, or values for lighting calculation, and each constant consists of a set of four real numbers. The OpenGL ARB extension 1.0 architecture supports at least 96 constants. The constant register (313) is read-only and cannot be modified or written.

The temporary register (312) stores the temporary variables used during vertex processing, that is, intermediate operation results. Because the vertex data and constant data used for vertex processing all consist of sets of four real numbers, the temporary register (312) is likewise basically organized as a set of four real numbers.

The OpenGL ARB 1.0 architecture requires support for at least 12 temporary registers (312). Because the execution codes (225) of the present invention include the macro execution codes (240), additional temporary storage is needed, so four more are provided for a total of 16. While vertex processing is being performed, the vertex processing apparatus can write arbitrary data to, and read it from, the temporary register (312).

The address register (314) stores the base address that serves as the reference point when constant data is read through the constant register (313). The base address is taken as the reference, and the constant data is read by referring to its relative location, using the offset from the base address to the address at which the constant data needed for vertex processing is stored.
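The base-plus-offset lookup can be pictured with a small Python sketch. It assumes that the constant register file is a list of 96 four-component entries and that a0 is set by ARL as the floor of a scalar; the names `arl`, `read_constant`, and `constants` are hypothetical.

```python
import math

def arl(src0_x):
    """ARL: largest integer less than or equal to the scalar source."""
    return math.floor(src0_x)

def read_constant(constants, a0, offset):
    """Read the constant located `offset` entries away from base address a0."""
    return constants[a0 + offset]

# Hypothetical constant file: 96 entries, each a set of four real numbers.
constants = [[float(i)] * 4 for i in range(96)]
a0 = arl(10.7)                         # base address 10
value = read_constant(constants, a0, 3)  # entry 13
```

Keeping the base in a0 lets the same instruction sequence address different constants, for example one row of a per-vertex matrix, by changing only the value loaded with ARL.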

The output register (316) stores the output data, which is the final operation result of vertex processing. The output data, that is, the output variables, then proceeds through the rest of the graphics pipeline, such as primitive assembly, the rasterizer, and pixel processing, as described above. The OpenGL ARB extension 1.0 architecture requires support for at least 13 output registers (316), and each output register (316) is configured to store a set of four real numbers. The output register (316) is write-only.

In the present invention, the decoded execution codes, vertex data, and constant data may be referenced directly through pointers to the data stored in memory, without the separate storage provided by the registers described above. This reduces the time delay incurred by allocating separate storage for each datum and copying it, and it is possible because the decoded execution codes, vertex data, and constant data are used read-only by the arithmetic logic module (330) described later.

The temporary register (312), the address register (314), and the output register (316) are the devices to which data is written during vertex processing. Memory equal to the size of each register is allocated for them in the step that initializes the configuration of the vertex processing apparatus.

The decoding module (320) decodes the input instruction into machine language. The instruction has the syntax given in Equation 1 above, and the value stored in the register designated by each source operand is read. That is, the data decoding modules (322, 324, 326) read the data stored in whichever of the first registers (310a) the instruction requires (one or more of the input register (311), temporary register (312), constant register (313), and address register (314)).

When data is read from the registers, it is sometimes necessary to perform a swizzle and/or sign inversion (negate) operation and to input the converted data to the arithmetic logic module (330) described later. For this purpose the vertex processing apparatus may further include a source modification module (360), which performs the swizzle operation or the sign inversion operation.

The swizzle operation rearranges the component values of a datum that has, for example, the four components x, y, z, and w. The data can be converted by operations such as moving the value of the x component to the y component and the value of the y component to the z component. The sign inversion operation inverts the sign of each component value, so that, for example, a positive value is converted to a negative value. The swizzle and/or sign inversion operations are used when needed for efficient vertex processing.
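A minimal Python sketch of the two source modifications follows. It assumes a datum is a list [x, y, z, w] and that a swizzle is written as a four-letter pattern naming, for each output component, the input component it takes; this notation and the function names are assumptions for illustration.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(v, pattern):
    """Build a new vector whose i-th component is the input component named by pattern[i]."""
    return [v[COMPONENT_INDEX[c]] for c in pattern]

def negate(v):
    """Invert the sign of every component."""
    return [-c for c in v]

v = [1.0, 2.0, 3.0, 4.0]
rotated = swizzle(v, "yzxw")            # [2.0, 3.0, 1.0, 4.0]
broadcast = swizzle(v, "xxxx")          # replicate x into all four components
flipped = negate(swizzle(v, "wzyx"))    # reverse the order, then invert signs
```

Applying negate to the second source before an ADD is how the SUB macro of Equation 4 can be realized without a separate subtractor.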

The execution code is passed through the execution code lookup table (315), and the corresponding value is sent to the execution code decoding module (328). The execution code lookup table (315) is a lookup table used, when the execution code of an instruction is decoded into machine language, to define the operations of the arithmetic logic module (330) efficiently and to make later extension or modification of the execution codes easy. Because the number of source operands required depends on the kind of execution code, the execution codes are grouped by the number of source operands they require and by the kind of destination address, and for execution codes in the same group only the fields of the source operand area that are actually needed are decoded, which improves decoding efficiency. That is, the kind of execution code is identified through the execution code lookup table (315), and the execution code decoding module (328) decodes only the fields of the area predetermined for the group to which that execution code belongs.

The arithmetic logic module (330) receives the execution codes, data, and so on that have been decoded, that is, converted into machine language, by the decoding module (320). The arithmetic logic module (330) is divided into three operation units, and each operation unit performs its operation in sequence, one stage at a time, and outputs the result.

The first operation unit performs basic operations (addition, multiplication, comparison, fraction, floor, and so on) on the input data and outputs the basic operation result. It also performs the multiplications needed to compute a cross product and outputs the multiplication result. The basic operation result, the multiplication result, a predetermined value (0 or 1), and the input constant data can each become the output value of a component. For each component, a multiplexer (MUX) selects the final output value of the first operation unit according to the execution code.

The second operation unit adds the basic operation results of the first operation unit and the output values output from the first operation unit, for the dot product operations (DPH, DP3, DP4) and the cross product operation.

The third operation unit adds the x-component output value and the z-component output value of the second operation unit for the dot product operation.

The outputs of the first, second, and third operation units are written back to the write-back module (340) through the first stage (332), the second stage (334), and the third stage (336), respectively, or are passed on to the next operation unit.
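One way to picture how the three operation units cooperate on a dot product is the Python sketch below. It assumes that the second unit pairs the products as (x + y) and (z + w), which the text does not state explicitly, and it follows the DP4-equivalent reading of DPH (src0.w treated as 1.0). All function names are hypothetical.

```python
def stage1_multiply(src0, src1, op):
    """First unit: component-wise products; the MUX overrides w for DP3 and DPH."""
    m = [p * q for p, q in zip(src0, src1)]
    if op == "DP3":
        m[3] = 0.0          # select the predetermined value 0 for w
    elif op == "DPH":
        m[3] = src1[3]      # same as multiplying src1.w by 1.0
    return m

def stage2_add(m):
    """Second unit: pairwise sums (assumed pairing: x with y, z with w)."""
    return m[0] + m[1], m[2] + m[3]

def stage3_add(x_out, z_out):
    """Third unit: add the x and z outputs of the second unit."""
    return x_out + z_out

def dot(src0, src1, op="DP4"):
    """Run the three stages in sequence and replicate the scalar result."""
    s = stage3_add(*stage2_add(stage1_multiply(src0, src1, op)))
    return [s, s, s, s]
```

Under this arrangement a dot product needs all three stages, while MAD and XPD finish after the second unit's addition and single-operation codes such as ADD or MUL finish after the first.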

Each stage (332, 334, 336) outputs an operation result at the point when the operation of the corresponding execution code completes. After a decoded execution code has been sent to the arithmetic logic module (330), the first stage (332) outputs the results of execution codes whose number of operation stages is 1, the second stage (334) outputs the results of execution codes whose number of operation stages is 2, and the third stage (336) outputs the results of execution codes whose number of operation stages is 3. That is, the first stage (332) covers ADD, MUL, DST, MOV, MAX, MIN, SGE, SLT, ABS, ARL, FLR, and FRC, together with the special execution codes EXP, LOG, EX2, LG2, RCP, and RSQ computed by the special operation module described later; the second stage (334) covers MAD and XPD; and the third stage (336) covers DP3, DP4, and DPH. Each stage performs its processing independently.

Referring to FIG. 4, consider the case in which execution codes are fetched in the order DP3, MAD, ADD, and assume that the stages for the operation start at clock (CLK) 0, as shown in FIG. 4. At clock 0, DP3, the first-priority execution code sent to the arithmetic logic module (330), is an execution code of the third stage (336), so its result is output three stages later, at clock 3. At clock 1, MAD, the second-priority execution code sent to the arithmetic logic module (330), is an execution code of the second stage (334), so its result is output two stages later, at clock 3. At clock 2, ADD, the third-priority execution code sent to the arithmetic logic module (330), is an execution code of the first stage (332), so its result is output one stage later, at clock 3. Because each stage performs its processing independently, the results of sequentially input execution codes may, as in this example, be output simultaneously at clock 3. In that case the results are handled by the write-back module (340) described later.
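The timing in this example can be reproduced with a few lines of Python. The sketch assumes one execution code enters the arithmetic logic module per clock and that a result appears exactly "number of operation stages" clocks after issue; the table holds only the three opcodes used in the example, and the names are hypothetical.

```python
# Number of operation stages for the opcodes in the FIG. 4 example.
STAGES = {"DP3": 3, "MAD": 2, "ADD": 1}

def completion_clocks(program, start_clock=0):
    """Return (opcode, issue clock, completion clock) for each fetched opcode."""
    schedule = []
    for i, op in enumerate(program):
        issue = start_clock + i              # one opcode issued per clock
        schedule.append((op, issue, issue + STAGES[op]))
    return schedule

schedule = completion_clocks(["DP3", "MAD", "ADD"])
# [("DP3", 0, 3), ("MAD", 1, 3), ("ADD", 2, 3)]: all three finish at clock 3
```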

The arithmetic logic module 330 includes a special operation module. The special operation module designates as special execution codes those execution codes whose conventional latency, that is, number of operation cycles, was 4 or more, and computes each of them separately so that their number of operation cycles within the vertex processing apparatus becomes 1. The special execution codes computed separately in the special operation module are EXP, LOG, EX2, LG2, RCP, and RSQ.

The writeback module 340 stores the operation results of the execution codes output from the stages 332, 334, 336 of the arithmetic logic module 330 in the second register 310b. Results are stored on a first in, first out (FIFO) basis: the result of the execution code with the highest priority, that is, the one input to the arithmetic logic module 330 earliest, is stored in its destination register. Depending on its type, the result is stored in the output register 316, the temporary register 312, the address register 314, or the like, as described above.

However, as shown in FIG. 4, because the multi-stage pipeline structure lets each stage of the arithmetic logic module 330 operate independently, two or more operation results may be output at the same time. In this case, the writeback module 340 stores the result of the first-priority execution code (DP3 in FIG. 4) in its destination register, and temporarily stores the results of the lower-priority execution codes (MAD and ADD in FIG. 4) internally, bypassing them so that they are processed in the next stage. In the next stage, that is, at clock 4, the bypassed results (in the order MAD, ADD in FIG. 4) take the highest priority, and any results newly output from the arithmetic logic module 330 rank after them.
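The arbitration described for FIG. 4 can be sketched as a FIFO that commits one result per clock. The function `writeback` and the `arrivals` layout are hypothetical; the one-commit-per-clock limit is an assumption taken from the DP3, MAD, ADD example rather than a stated hardware constraint.

```python
from collections import deque


def writeback(arrivals):
    """arrivals maps a clock to the list of (fetch_order, opcode) results
    that the ALU outputs at that clock. Returns (clock, opcode) commits,
    assuming one commit per clock, oldest instruction first."""
    if not arrivals:
        return []
    pending = deque()          # results held (bypassed) inside writeback
    commits = []
    clock, last = min(arrivals), max(arrivals)
    while clock <= last or pending:
        # Held results stay ahead of results arriving in this clock.
        pending.extend(sorted(arrivals.get(clock, [])))
        if pending:
            _, opcode = pending.popleft()
            commits.append((clock, opcode))
        clock += 1
    return commits


commits = writeback({3: [(0, "DP3"), (1, "MAD"), (2, "ADD")]})
print(commits)
```

With the three results of FIG. 4 arriving together at clock 3, this model commits DP3 at clock 3, MAD at clock 4, and ADD at clock 5.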

In addition, when the arithmetic logic module 330 outputs results for two or more execution codes and their destination addresses are the same (that is, they are to be stored at the same address of the second register 310b), the writeback module 340 compares the priorities of the execution codes and stores the result of the later (lower-priority) execution code at that destination address. If the result of the earlier (higher-priority) execution code were stored first, the result of the later execution code would overwrite the same storage area in the next stage. The earlier result would therefore carry no meaning, and storing it would only add a one-stage delay to the overall vertex processing.
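A minimal sketch of this same-destination rule follows. The name `resolve_same_destination` and the tuple layout are hypothetical, and a smaller `fetch_order` is assumed to mean an earlier, higher-priority execution code.

```python
def resolve_same_destination(results):
    """results: (fetch_order, destination, value) tuples output in the same
    clock. For each destination keep only the latest-fetched result, since
    an earlier one would be overwritten one stage later anyway."""
    latest = {}
    for fetch_order, dest, value in sorted(results):
        latest[dest] = (fetch_order, dest, value)   # later entry replaces earlier
    return sorted(latest.values())


kept = resolve_same_destination([(0, "R1", 10), (1, "R1", 20), (2, "R2", 5)])
print(kept)
```

Here the earlier write to `R1` is dropped, so only two results remain to be committed instead of three.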

According to another preferred embodiment of the present invention, the vertex processing apparatus further includes a forwarding module 350.

When data that the decoding module 320 needs in order to decode an execution code and its operands (for example, an intermediate operation result) has not yet been stored in the corresponding register but is held in the writeback module 340, the forwarding module 350 transfers the data directly from the writeback module 340 to the decoding module 320, so that no hazard arises from the data dependency.
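The forwarding path can be illustrated as an operand read that checks the results still held in writeback before falling back to the register file. The function `read_operand` and both data layouts are hypothetical; real hardware would make this selection with a multiplexer rather than a loop.

```python
def read_operand(reg, register_file, writeback_pending):
    """Return the value of `reg`, preferring a result that is still held
    in writeback (not yet stored) over the stale register-file copy.
    writeback_pending is a list of (destination, value), oldest first."""
    for dest, value in reversed(writeback_pending):   # newest result wins
        if dest == reg:
            return value
    return register_file[reg]


register_file = {"R0": 1.0, "R1": 2.0}
pending = [("R1", 7.5)]            # result computed but not yet written back
forwarded = read_operand("R1", register_file, pending)
direct = read_operand("R0", register_file, pending)
print(forwarded, direct)
```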

In the present invention, the writeback module 340 and/or the forwarding module 350 may be implemented as a multiplexer that receives the outputs of the first stage 332, the second stage 334, and the third stage 336 of the arithmetic logic module 330 and, according to a control signal, stores the output of the selected stage in the designated second register 310b.

FIG. 5 shows another example of the multi-stage pipeline flow in the vertex processing apparatus of FIG. 3. When the execution codes RSQ, ADD, and DP3 are to be executed in that order, as in the case described above with reference to FIG. 1, RSQ completes at clock 1 because the special operation module has reduced its number of operation stages to 1. ADD starts at clock 1, unlike in FIG. 1, where it was delayed until clock 8. DP3 starts at clock 2, when ADD completes, and completes at clock 5, three clocks (its number of operation stages) later.

That is, when RSQ, ADD, and DP3 are executed in that order for vertex processing, the operations conventionally complete at clock 14, as shown in FIG. 1, whereas according to a preferred embodiment of the present invention they complete at clock 5, as shown in FIG. 5. The latency is markedly reduced and stalls are greatly reduced, which improves the performance of the vertex processing apparatus.
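The clock 5 figure for FIG. 5 follows from adding the stage counts of a chain in which each execution code starts when the previous one completes. The sketch below assumes exactly that serial dependency; `chain_schedule` is a hypothetical helper, and the conventional cycle counts behind the clock 14 figure are not modeled because the text does not list them.

```python
def chain_schedule(opcodes, stages):
    """Schedule execution codes so that each one starts when the previous
    one completes. Returns (opcode, start_clock, done_clock) tuples."""
    clock = 0
    schedule = []
    for op in opcodes:
        schedule.append((op, clock, clock + stages[op]))
        clock += stages[op]
    return schedule


# Stage counts taken from the description (RSQ is a special execution code).
stages = {"RSQ": 1, "ADD": 1, "DP3": 3}
fig5 = chain_schedule(["RSQ", "ADD", "DP3"], stages)
print(fig5)
```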

FIG. 6 is a flowchart of an arithmetic operation method in the arithmetic logic module according to a preferred embodiment of the present invention, and FIG. 7 is a flowchart of a method of processing operation results in the writeback module according to a preferred embodiment of the present invention.

Referring to FIG. 6, in step S600 the arithmetic logic module 330 receives the decoded execution code from the decoding module 320 and the data read from the first register 310a.

In step S610, the first operation unit of the arithmetic logic module 330 performs a basic operation and a multiplication operation for the execution code. It then outputs the basic operation result, the multiplication result, a predetermined value (0 or 1), and constant data.

In step S620, it is determined whether the number of operation stages of the execution code is 1. If it is 1, the process proceeds to step S660, and a value selected from the outputs of the first operation unit is output as the final result.

If the number of operation stages is not 1, the process proceeds to step S630, where the second operation unit of the arithmetic logic module 330 performs an addition on the basic operation result and the outputs of the first operation unit for a dot product (inner product) or cross product (outer product) operation.

In step S640, it is determined whether the number of operation stages of the execution code is 2. If it is 2, the process proceeds to step S660, and a value selected from the outputs of the second operation unit is output as the final result.

If the number of operation stages is not 2, the process proceeds to step S650, where the third operation unit of the arithmetic logic module 330 adds the x-component value and the z-component value of the output of the second operation unit for the dot product operation.

Then, in step S660, the output of the third operation unit is output as the final operation result.
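For the three-stage dot product path, steps S610 to S660 can be sketched as three chained functions. Several details here are assumptions rather than statements from the description: the second operation unit is assumed to place x+y in the x slot and z+w in the z slot, the predetermined value 0 is assumed to suppress the w term for DP3, and the value 1 is assumed to stand in for the first operand's w in DPH (the usual homogeneous dot product). All function names are hypothetical.

```python
def stage1(op, a, b):
    """First operation unit (S610): component-wise multiply, with a preset
    0 or 1 substituted for the first operand's w where the opcode needs it."""
    a = list(a)
    if op == "DP3":
        a[3] = 0          # assumption: preset 0 removes the w term
    elif op == "DPH":
        a[3] = 1          # assumption: preset 1 makes the w term equal b.w
    return [x * y for x, y in zip(a, b)]


def stage2(p):
    """Second operation unit (S630): pairwise sums kept in the x and z slots."""
    return [p[0] + p[1], 0, p[2] + p[3], 0]


def stage3(s):
    """Third operation unit (S650): add the x and z components."""
    return s[0] + s[2]


def dot(op, a, b):
    return stage3(stage2(stage1(op, a, b)))


a, b = (1, 2, 3, 4), (5, 6, 7, 8)
print(dot("DP4", a, b), dot("DP3", a, b), dot("DPH", a, b))
```

Splitting the four-term sum into two levels of two-input additions is what lets each operation unit finish within one stage while a different execution code occupies each of the other units.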

Referring to FIG. 7, in step S700 the writeback module 340 receives operation results from the stages 332, 334, 336 of the arithmetic logic module 330. A result may be output from only one stage, or results may be output from two or more stages simultaneously.

In step S710, it is determined whether the machine code or instruction corresponding to a received operation result has the highest priority. The result of the machine code or instruction with the highest priority proceeds to step S720 and is stored in the destination register, that is, the second register 310b. The result of a machine code or instruction that does not have the highest priority is bypassed for one stage (step S730). Once the highest-priority result has been stored in the destination register in step S720, the priority of the results that were not highest is raised (step S740), and the process returns to step S710 and repeats the subsequent steps.

FIG. 8 is a flowchart of a vertex processing method that reduces the delay caused by register replacement according to still another preferred embodiment of the present invention, FIG. 9 shows the delay caused by register replacement in the conventional art, and FIG. 10 shows the vertex processing method with reduced delay according to the present invention.

In step S800, the vertex processing apparatus stores vertex data for vertex processing in the input register. There are one or more items of vertex data, and vertex processing is performed on each of them. Accordingly, when vertex processing of one item of vertex data is complete, the input register is replaced so that it holds the next vertex data.

In step S810, the vertex processing apparatus receives one or more instructions for vertex processing of the vertex data, reads the vertex data from the input register, and reads the data used in the arithmetic operations for vertex processing (constant data, address data, intermediate operation results, and so on) from the rest of the first register 310a.

In step S820, the received instruction is decoded into machine code.

In step S830, it is determined whether the instruction is the last of the one or more instructions for vertex processing of the vertex data. If it is not the last instruction, the process proceeds to step S840, performs vertex processing through the arithmetic operation corresponding to the instruction, and returns to step S810 to receive the next instruction for vertex processing of the same vertex data.

If the determination in step S830 is that the instruction is the last one, the input register is replaced after the instruction has been decoded into machine code (step S850), because the values held in the input register are no longer needed once the last instruction has been decoded. A delay of two clocks is unavoidable when replacing the input register.

In step S860, the vertex processing apparatus performs vertex processing according to the last instruction and, at the same time, stores the next vertex data through the input register replacement. The process then returns to step S810 and repeats the steps described above for the next vertex data.

Conventionally, before the output data for the first vertex data is produced, there is an output data delay (T1 to T6) 910 consisting of instruction input (T1), decoding into machine code (T2), source modification as needed (T3), and the arithmetic operation of the execution code in the arithmetic logic module according to its number of operation cycles (ALU_0 (T4), ALU_1 (T5), ALU_2 (T6)), and a first vertex processing time (T1 to T8) 920 elapses from instruction input to completion of the arithmetic operations. Only after the first vertex processing is complete (T8) and the register replacement time (T8 to T9) has elapsed can vertex processing of the second vertex data begin. Consequently, the vertex processing time of the first vertex data grows in proportion to the number of instructions for its vertex processing, and the input time (T9) of the instructions for the second vertex data is delayed accordingly.

However, referring to FIG. 10, in the present invention using the method of FIG. 8, replacement of the input register begins immediately at T17, when instruction input is complete, and after a delay of two clocks, at T19, the second vertex data is received and its vertex processing begins. Thus, while vertex processing of the first vertex data proceeds in the arithmetic logic module, the decoding module 320 decodes the instructions for the second vertex data. The delay between the first vertex data and the second vertex data can therefore be reduced to two clocks regardless of the number of instructions for vertex processing. The same naturally applies between any items of vertex data that must be processed in sequence.

As described above, the vertex processing apparatus and method having a multi-stage pipeline structure according to the present invention can reduce the maximum delay between execution codes to three stages by means of an arithmetic logic module with a multi-stage pipeline structure in which the operation units are connected in sequence.

In addition, the maximum delay can be reduced by an arithmetic logic module having a special operation module that produces the result of a special execution code, which would otherwise have a long latency, within one stage.

In addition, decoding the execution code with an execution code lookup table defines the operations of the arithmetic logic module efficiently and makes later extension and modification of the execution codes easier.

In addition, each item of vertex data is processed with a constant delay regardless of the number of execution codes.

Although the present invention has been described above with reference to preferred embodiments, those of ordinary skill in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A vertex processing apparatus having a multi-stage pipeline structure, comprising:

A first register for storing data for vertex processing;

A second register for storing an operation result;

An instruction fetch module for sequentially fetching an instruction including an execution code (OP code) and one or more operands;

A decoding module for decoding the instruction and reading required data from the first register according to the operand;

An arithmetic logic module (ALU) comprising a plurality of sequentially connected operation units, which processes the data through one or more of the operation units in sequence according to the type of the execution code and outputs the result; and

A writeback module which stores, among the operation results, the result of the execution code having the highest priority in the second register, and temporarily stores the other results internally so that they are processed in the next stage.

The apparatus of claim 1, wherein the arithmetic logic module comprises:

A first operation unit configured to perform at least one of a basic operation and a multiplication operation, and to output at least one of a basic operation result and a multiplication operation result, a predetermined value, and input data;

A second operation unit configured to output a first addition result obtained by adding the outputs of the first operation unit; and

A third operation unit configured to output a second addition result obtained by adding the outputs of the second operation unit.

The apparatus of claim 2,

wherein the first operation unit, the second operation unit, and the third operation unit have a pipelined structure, and each operation unit can simultaneously operate on different data.

Claim 4 was abandoned when the registration fee was paid.

The apparatus of claim 1,

further comprising a source modifier for performing, as necessary, a swizzle operation that changes the input order of the data input to the arithmetic logic module.

Claim 5 was abandoned upon payment of a set-up fee.

The apparatus of claim 1,

further comprising a forwarding unit which, when an operation result required by the decoding module is held in the writeback module and has not yet been stored in the second register, transfers the operation result from the writeback module to the decoding module.

Claim 7 was abandoned upon payment of a set-up fee.

The apparatus of claim 1, wherein the first register comprises:

An input register for storing vertex data;

A constant register for storing constant data; and

A temporary register for storing the operation result.

Claim 8 was abandoned when the registration fee was paid.

The method of claim 7, wherein

Claim 9 was abandoned upon payment of a set-up fee.

The apparatus of claim 1, wherein the second register comprises:

A temporary register for storing the operation result; and

An output register for storing output data.

Claim 10 was abandoned upon payment of a setup registration fee.

The apparatus of claim 9,

wherein the writeback module stores the operation result in one of the temporary register and the output register.

The apparatus of claim 1,

wherein the priority of the execution code is determined on a first in, first out (FIFO) basis.

The apparatus of claim 1,

wherein, when two or more operation results have the same destination address, the writeback module stores the operation result of the execution code having the lower priority at the destination address.

The apparatus of claim 1,

wherein the arithmetic logic module includes a special operation module that makes the number of operation stages 1 for predetermined special execution codes.

Claim 14 was abandoned when the registration fee was paid.

The apparatus of claim 13,

wherein the special execution codes are EXP, LOG, EX2, LG2, RCP, and RSQ.

The apparatus of claim 13,

wherein the number of operation stages of the execution code is one of 1, 2, and 3, and the arithmetic logic module is divided into three operation units having a pipelined structure.

Claim 16 was abandoned upon payment of a setup registration fee.

The apparatus of claim 15,

wherein, among the execution codes, the number of operation stages of POW, DPH, DP3, and DP4 is 3, the number of operation stages of MAD and XPD is 2, and the number of operation stages of the other execution codes is 1.

The apparatus of claim 1,

wherein the decoding module decodes the execution code using an execution code lookup table that determines a different decoding method according to the type of the execution code.

A vertex processing method for a multi-stage pipeline structure, comprising:

(a) receiving an execution code and data;

(b) performing a first operation;

(c) determining whether the number of operation stages of the execution code is 1;

(d) outputting the result of step (b) as the final result when the number of operation stages of the execution code is 1, and performing a second operation when it is not 1;

(e) determining whether the number of operation stages of the execution code is 2;

(f) outputting the result of step (d) as the final result when the number of operation stages of the execution code is 2, and performing a third operation when it is not 2; and

(g) outputting the result of step (f) as the final result.

The method of claim 18, further comprising, after step (g):

(h) determining the priority of the execution code corresponding to the final output result; and

(i) storing the operation result of the execution code having the highest priority in a destination register, and bypassing the operation results of the other execution codes,

wherein steps (h) to (i) are repeated until all of the results are stored in the destination register.

(a) storing vertex data for vertex processing in an input register;

(b) sequentially receiving instructions for vertex processing of the vertex data and reading the vertex data from the input register;

(c) decoding the instructions into machine code;

(d) replacing the input register if the instruction is the last instruction for vertex processing of the vertex data, otherwise performing vertex processing according to the machine language and repeating steps (b) to (c) ; And

(e) storing the next vertex data in the input register while performing the vertex processing according to the last instruction, and repeating steps (b) to (d).

The method of claim 20,

wherein step (d) incurs a delay of two clocks for replacement of the input register.