KR100414152B1

KR100414152B1 - The Processing Method and Circuits for Viterbi Decoding Algorithm on Programmable Processors

Info

Publication number: KR100414152B1
Application number: KR10-2001-0043712A
Authority: KR
Inventors: 선우명훈; 이재성
Original assignee: 학교법인대우학원
Priority date: 2001-07-20
Filing date: 2001-07-20
Publication date: 2004-01-07
Also published as: KR20030008794A

Abstract

The present invention relates to a Viterbi decoding arithmetic circuit and computation method that allow Viterbi decoding, one of the algorithms most widely used for error correction in communications, to be processed efficiently on a programmable processor.

Description

Viterbi Decoding Computation Method for Programmable Processors and Computation Circuit for Performing the Method

The present invention relates to a Viterbi decoding computation method and computation circuit that allow Viterbi decoding, one of the algorithms most widely used for error correction in communications, to be processed efficiently on a programmable processor (a digital signal processor, a microprocessor, and the like).

To date, dedicated Viterbi processors have been used in most cases to correct communication errors. However, to reduce power consumption and make modification easier, there is a growing need to move away from dedicated hardware structures and implement Viterbi decoders on programmable processors.

In general, a Viterbi decoding algorithm consists of three main blocks: a branch metric calculation (BMC) block, which computes the difference between the received codeword and the codeword generated for each state of the encoder's shift register; an add-compare-select (ACS) block, which adds each branch metric (BM) produced by the BMC block to the current state value, compares the two resulting values, and selects the smaller; and a survivor path memory block, which stores and decodes the minimum-distance paths selected by the ACS block.

The BMC block calculates BM values from data obtained by hard or soft decision. For a convolutional code of rate 1/2, each generated codeword is 2 bits, so there are four kinds of BM, denoted BM00, BM01, BM10, and BM11 according to the code generated by the encoder. Table 1 lists the BM values for hard decision.

Table 1

Codeword   BM00   BM01   BM10   BM11
00         0      1      1      2
01         1      0      2      1
10         1      2      0      1
11         2      1      1      0

With hard decision, the Hamming distance between the input codeword and each BM codeword is calculated. With soft decision, the Euclidean distance to the input data quantized to m bits is calculated and output. With hard decision the maximum difference between BMs is 2, whereas with soft decision it can be as large as 14. Consequently, soft decision is less likely to select the wrong survivor path when an error occurs, and so has a lower decoding error probability than hard decision.
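The hard-decision case can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `hamming_branch_metrics` and the dictionary layout are assumptions made here, not part of the patent. It computes the Hamming distance between a received 2-bit codeword and each of the four possible encoder codewords, which reproduces the rows of Table 1.

```python
def hamming_branch_metrics(received: int) -> dict:
    """Return BM00..BM11 for a 2-bit hard-decision codeword (0..3).

    Hypothetical helper for illustration; the patent only gives the table.
    """
    metrics = {}
    for expected in range(4):
        # Hamming distance = number of bit positions that differ.
        distance = bin(received ^ expected).count("1")
        metrics[f"BM{expected:02b}"] = distance
    return metrics


if __name__ == "__main__":
    for codeword in range(4):
        print(f"{codeword:02b}", hamming_branch_metrics(codeword))
```

The largest gap between any two metrics in a row is 2, which is the hard-decision limit mentioned in the text.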

In a soft-decision Viterbi decoder, the soft-decision data S1 and S2 derived from the received signal are input to the BMC block. If S1 and S2 are 3-bit soft-decision data, the value of each BM is given by Equation 1, which is the method of calculating the BM values when soft decision is used.

Here, m is the number of soft-decision bits, so m = 3. For example, assuming inputs S1 = 5 (101₂) and S2 = 2 (010₂), the BM values are as listed in Table 2.

Table 2

BM      BM1   BM2   Euclidean distance
BM00    5     2     7
BM01    5     5     10
BM10    2     2     4
BM11    2     5     7

As Equation 1 shows, when S1 and S2 are both 0 (000₂) the maximum difference is 14, so the spread between BMs is larger than with hard decision.
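Equation 1 itself is not reproduced in this text, so the following Python sketch is an inference from Table 2 and from the stated maximum difference of 14, not the patent's own formula. It assumes that the distance of a soft symbol S from an expected bit 0 is S, and from an expected bit 1 is (2^m - 1) - S, and that a BM is the sum of the two per-symbol distances. The function name `soft_branch_metrics` is hypothetical.

```python
def soft_branch_metrics(s1: int, s2: int, m: int = 3) -> dict:
    """Soft-decision branch metrics inferred from Table 2 (assumption).

    s1, s2: m-bit soft-decision symbols in the range 0 .. 2**m - 1.
    """
    full_scale = (1 << m) - 1          # 7 for 3-bit soft decision
    d1_zero, d1_one = s1, full_scale - s1   # distance of S1 from bit 0 / bit 1
    d2_zero, d2_one = s2, full_scale - s2   # distance of S2 from bit 0 / bit 1
    return {
        "BM00": d1_zero + d2_zero,
        "BM01": d1_zero + d2_one,
        "BM10": d1_one + d2_zero,
        "BM11": d1_one + d2_one,
    }


if __name__ == "__main__":
    print(soft_branch_metrics(5, 2))   # the S1 = 5, S2 = 2 example of Table 2
    bms = soft_branch_metrics(0, 0)
    print(max(bms.values()) - min(bms.values()))  # spread when S1 = S2 = 0
```

Under these assumptions the S1 = 5, S2 = 2 example yields the distances 7, 10, 4, and 7 listed in Table 2, and the S1 = S2 = 0 case gives a spread of 14.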

The value of each BM on the trellis diagram is thus calculated by the BMC block. The calculated BMs are input to the ACS block, which finds the smaller of the paths entering the current state. In a rate-1/2 code, each state has two incoming paths; the previous path metric (PM) and the current BM are added for each path, and the smaller sum becomes the PM of the next state.

FIG. 1 shows the butterfly structure of the state transitions on the trellis diagram. S is the minimum distance up to the current stage, used to calculate the PM, and the memory in which it is stored is called the survivor memory (SM). The ACS operations of a pair of butterfly structures performed in succession are expressed by Equation 2, which means selecting the smaller of two values.

Taking the first expression of Equation 2 as an example, the sum of S₀ and BM_l is compared with the sum of S₁ and BM_h, and the smaller of the two becomes the new value of S₀.
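A single add-compare-select step, as described for the first expression of Equation 2, can be sketched as follows. Equation 2 is not reproduced in this text, so this is only a minimal model of the sentence above. The function name `acs`, the encoding of the decision bit (0 for the lower branch, 1 for the upper branch), and the tie-breaking rule are assumptions.

```python
def acs(s_low: int, bm_low: int, s_high: int, bm_high: int):
    """Add-compare-select for one state.

    Returns (new_metric, decision_bit). The decision bit is 0 when the
    S_low + BM_low path survives and 1 when the S_high + BM_high path
    survives; ties go to the lower branch (an arbitrary choice here).
    """
    candidate_low = s_low + bm_low      # add
    candidate_high = s_high + bm_high   # add
    if candidate_low <= candidate_high: # compare
        return candidate_low, 0         # select lower branch
    return candidate_high, 1            # select upper branch


if __name__ == "__main__":
    # New S0 from S0 + BM_l versus S1 + BM_h.
    print(acs(s_low=3, bm_low=1, s_high=2, bm_high=4))
```

The decision bit is what the survivor path calculation later consumes; the new metric replaces the stored state metric.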

For a constraint length K, there are 2^(K-1) states, and the same number of ACS operations is required. That is, N = 2^(K-1).

FIG. 2 shows the ACS structure used in a dedicated Viterbi decoder. Since one processing element (PE) performs the ACS operation for one state, using a total of 2^(K-1) PEs would in theory allow all states to be computed simultaneously, because there is no data dependency between PEs. In practice, however, DSP chips and dedicated ASIC chips cannot carry that many PEs because of the hardware cost. Each PE handles one of the states on the trellis diagram: it performs the ACS operation, comparing the sums of the current path values entering the state and the accumulated path values, and selects the path with the smaller value. The current path values are obtained by the BMC block from the 3-bit input data. For a constraint length K = 7, one stage has 2^(K-1) = 64 states, so 64 PEs can process one stage of the trellis diagram every clock. Each PE compares the sum of the path value from the upper branch, BM_h, and its accumulated path value, SM_h, with the sum of BM_l and SM_l.

FIG. 3 shows the structure of a traceback survivor path calculation circuit for a dedicated Viterbi decoder. The traceback method stores the decision bits produced by the ACS operation in memory for as many stages as the decoding depth, then reads the stored values back sequentially and traces the survivor path from the current state to the previous stages.

As shown, the traceback circuit consists of a memory, a multiplexer, and a shift register. With a convolutional encoder of constraint length K = 7, the current state value is used as the multiplexer's select bits to pick the bit for one state out of the 64 bits of data stored in memory. The 1 bit selected at the multiplexer output is fed into the shift register and is used in the multiplexer select bits that determine the next state value. The state at which traceback starts can be chosen arbitrarily, but it is usually set to 0 to match the initialization of the shift register. Previous states are traced back sequentially in this way, and the restored data is obtained once the decoding depth has been traversed. A decoding depth of roughly four to five times the constraint length K or more generally yields a stable and reliable bit error rate. However, the longer the constraint length, the more operations are needed, which severely limits throughput, because decoding each single bit requires repeating the operation as many times as the decoding depth. In practice, dedicated Viterbi decoders use the register-exchange method to overcome this speed limitation, but that method is not easy to implement on a programmable DSP chip because of its hardware structure.
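The memory, multiplexer, and shift register loop described above can be modeled roughly as follows. This is a sketch under stated assumptions: the function name `trace_states` is hypothetical, each stage's decision bits are assumed to be packed so that bit i of the word belongs to state i, and the selected bit is assumed to enter the shift register at the low end. The patent's figures may use a different bit ordering.

```python
def trace_states(decision_words, start_state: int = 0, k: int = 7):
    """Follow the survivor path back through the stored decision words.

    decision_words: one 2**(k-1)-bit integer per trellis stage, ordered
    from the most recent stage backwards (assumed layout).
    Returns the list of states visited, starting with start_state.
    """
    state_mask = (1 << (k - 1)) - 1    # 6 state bits when k = 7
    state = start_state
    visited = [state]
    for word in decision_words:
        # Multiplexer: the current state selects one bit of the stored word.
        bit = (word >> state) & 1
        # Shift register: the selected bit shifts in and forms the next state.
        state = ((state << 1) | bit) & state_mask
        visited.append(state)
    return visited


if __name__ == "__main__":
    all_ones = (1 << 64) - 1
    print(trace_states([all_ones] * 3))
```

One loop iteration is needed per stored stage, which is why the work per decoded bit grows with the decoding depth.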

DSP chip vendors are introducing new instruction structures in line with the trend toward faster Viterbi computation, but they support only a limited set of instructions covering part of the Viterbi algorithm and so do not greatly improve execution speed. Commercial DSP chips that support acceleration of Viterbi decoding include DSP Group's OakDSP, Texas Instruments' TMS320C55x, and Motorola's DSP56600 family, all of which accelerate only the ACS operation.

FIG. 4 shows the hardware that performs the ACS operation in a conventional programmable processor (including a DSP). The ACS operation must go through a pair of arithmetic operations, a comparison, a selection, and a store. In step ①, two PM/BM pairs are added. In step ②, the comparison is performed and its result signal is stored in the flag register. In step ③, the 1-bit comparison result is stored in memory. In step ④, the minimum distance value determined by the comparison is shifted and stored in memory. Each of these steps can take several clock cycles, depending on the ALU and register file of the DSP chip. Step ③ in particular can take several cycles, because the selection signals from the ACS results of all states must be shifted and aligned before they are stored in memory in word units.

For the Viterbi decoding described above, the instructions of commercial DSPs with such dedicated features support only a compare instruction that can compare two pairs of data at once. That is, the arithmetic before the comparison and the selection and storage of the minimum distance after it are handled with several general DSP instructions and consume many clock cycles. These instructions therefore only somewhat reduce the cycles taken in steps ② and ③ of FIG. 4. In addition, the VSL instruction of the DSP56600, which shifts in 1 bit when storing the survivor path, has no data dependency on the MAX instruction that performs the comparison, so pipelining the two to execute in the same clock cycle would be efficient. The hardware structure does not support this, however, so the two are computed in separate cycles and the advantage of the VSL instruction is lost. In particular, no dedicated instruction is provided for the survivor path calculation, which is the hardest part to handle on a general-purpose DSP chip. However fast the ACS operation is, the bit rate at which data is restored depends on how fast the survivor path calculation is processed; since no computation structure supports it, the bit recovery rate is inevitably very slow. Conventional DSP chips are therefore restricted to low-speed data communication below 15 kbps, such as voice services, and are difficult to apply to high-speed data communication.

Accordingly, an object of the present invention is to provide a computation method that enables high-speed Viterbi decoding on a programmable processor, and a circuit for executing the method, so as to remove the inefficiencies of conventional processor-based implementations of the Viterbi decoding algorithm.

A further object of the present invention is to improve the performance of a programmable processor by adding a minimal amount of circuitry, so that the high-speed error correction algorithm of a portable terminal can be implemented on the programmable processor and a one-chip multimedia terminal can be realized.

To achieve the above objects, the present invention provides a Viterbi decoding computation method that processes four pairs of input data as two ACS operations, the method comprising: a first step of adding each pair of input data and storing the results in four 9-bit registers; a second step of comparing, for each ACS operation, the results output from two of the 9-bit registers, storing the smaller value in a shifter, and storing the 2 selection bits produced by the comparisons in a 64-bit shift register, 2 bits at a time; a third step, performed at the same time, of repeating the first step so that four new pairs of data are input, added, and stored in the 9-bit registers; a fourth step of outputting from the shifters the two minimum values produced by the ACS operation on the first four pairs of data and storing them in a dual-port memory via a bus; a fifth step of repeating the first to fourth steps until the 64-bit shift register is completely filled and then transferring the resulting 64-bit value to a register file via the bus; and a sixth step of, when the register file has room to store thirty-two 64-bit values, using a 64x1 multiplexer to select, with the 6-bit data of a destination register, 1 bit out of the 64 bits at the first address of the register file, inserting it into the LSB of the destination register while shifting the existing 6 bits left by 1 bit, and outputting the 1 MSB bit as the Viterbi-decoded bit.

In addition, to execute the Viterbi decoding computation method, the present invention provides a Viterbi decoding arithmetic circuit for a programmable processor, comprising: four adders for adding the four pairs of 8-bit input data; four 9-bit registers for storing the results computed by the respective adders; two comparators, each of which compares the values output from two of the registers and outputs a selection signal for selecting the smaller value; two multiplexers, each of which receives the same two 9-bit values that are input to the corresponding comparator, takes the comparator's 1-bit result as its select bit, and selects the smaller value; two shifters for shifting the multiplexer outputs; a 64-bit shift register that stores the selection signals output from the comparators and outputs its contents when full; a bus that carries the values output from the shifters and, for a constraint length of 7, the output of the 64-bit shift register; a dual-port memory that stores the shifted values passed over the bus; a 32-bit register file that stores the values output from the 64-bit shift register via the bus and, when full, outputs them starting with the best value; a 64x1 multiplexer that performs the multiplexing operation on the 64-bit data output from the 32-bit register file; and a destination register that uses its 6-bit data with the 64x1 multiplexer to select 1 bit out of the 64 bits at the first address of the register file, inserts the selected bit, shifts the existing 6 bits left by 1 bit, and outputs the 1 MSB bit.

FIG. 1 shows a butterfly structure representing the state transitions on a general trellis diagram.

FIG. 2 shows the ACS structure used in a general dedicated Viterbi decoder.

FIG. 3 shows the structure of a general traceback survivor path calculation circuit for a dedicated Viterbi decoder.

FIG. 4 shows how the ACS operation is performed in a conventional programmable processor (including a DSP).

FIG. 5 shows the architecture of a conventional dedicated Viterbi decoder implemented as an ASIC.

FIG. 6 shows a block for executing the PACS instruction according to the present invention.

FIG. 7 shows the operation of the SLBIT instruction according to the present invention.

FIG. 8 shows the hardware structure for executing the PACS and SLBIT instructions according to the present invention.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 5 shows the architecture of a dedicated Viterbi decoder implemented as an ASIC.

The programmable processor instruction set according to the present invention has a structure similar to that of a dedicated Viterbi decoder implemented as an ASIC and therefore delivers comparable performance. The instructions of the present invention are PACS (Packed Add and Compare and Select) and SLBIT (Shift Left BIT); these two instructions alone implement most of the Viterbi decoding algorithm. That is, one instruction for the ACS block and one for the survivor path calculation block of FIG. 5 are executed simultaneously in one clock cycle, so that performance is comparable to that of the dedicated Viterbi structure. Note that, as Equation 1 shows, the BMC block performs only simple arithmetic and can easily be implemented with the general-purpose ALU of a conventional programmable processor without special instructions or hardware, so no instruction for the BMC block is considered.

FIG. 6 is a block diagram of the PACS instruction, which processes the ACS block of the Viterbi algorithm as a single instruction according to the present invention. For a 32-bit DSP chip, it implements two of the PE structures shown in FIG. 2 using the computation units inside a single ALU. Conventional computation units can be used; only a data path circuit is added so that addition and comparison can be performed back to back by pipelining. FIG. 6 assumes that the register file is byte-addressable (all recent DSP chips support byte access) and processes two 32-bit words byte by byte so as to act as two ACS PEs. This instruction structure can output four ACS results every two cycles.

In operation, a, b, c, and d are source operands in 8-bit registers holding BM values, and a', b', c', and d' are source operands in other 8-bit registers holding PM values. These values pass through the adders in stage Ex1, and the PM + BM results are stored in pipeline registers 1, 2, 3, and 4. The stored results then pass through the comparators (>/s), which compare them, select the smaller value, and generate a 1-bit selection signal, so that a selection result and a selection signal are output. Because PM + BM is a 9-bit value, shifters 1 and 2 normalize each selection result by shifting it right by 1 bit, which prevents overflow from successive additions, and the result is sent to the SM. The selection signals are stored in one large 64-bit shift register, which is shifted left by the number of PEs every clock cycle, and are then sent to the path memory (PM).

On programmable processors with several ALUs, or on 32-bit processors and the 64-bit and 128-bit programmable processors to be developed in the future, the basic structure of FIG. 6 can simply be extended in parallel; because ACS operations have no data dependency on one another, high-speed parallel processing is possible. For example, a single ALU can form four ACS PEs on a 64-bit chip and eight ACS PEs on a 128-bit chip. With two or more ALUs, the number of ACS PEs likewise increases by a factor of two or more.

Since a pair of ACS butterfly structures requires four ACS PEs, the ACS source code listing of the commercial programmable processors described above can be replaced by two PACS instructions. The ACS block can therefore be processed three to five times faster, or more, in terms of execution cycles than on commercial programmable processors. In the ALU of the conventional programmable processor shown in FIG. 4, the computation units inside the ALU each perform their own function separately, but adding data flow paths between those units makes the computation flow of FIG. 6 possible. In other words, the structure of FIG. 6 can be implemented by adding to the computation units of a conventional programmable processor only a data path that enables chained operation (a comparison immediately after a pair of additions) and a 2^(K-1)-bit shift register (shifter 0, which stores the selection signals). The hardware overhead is therefore small, and other DSP algorithms can still be executed without difficulty.

Next, SLBIT, the other instruction according to the present invention, directly implements the output register of the traceback survivor path calculation circuit of FIG. 3; FIG. 7 shows its operation. For a constraint length K = 7, the 1-bit selection signals produced by the ACS operations accumulate until they fill the 2^(7-1) = 64-bit shift register, and the 64-bit data is then stored in the register file, which serves as the path memory (PM). The stored values are used as the source operand of the operation shown in FIG. 7. Since the lower 6 bits of the destination are enough to address any one of 2⁶ = 64 bits, they select 1 bit of the source operand; that bit enters the least significant bit (LSB) of the destination while the destination is shifted left, forming a new 6-bit value, and the same operation is repeated with this new value. The operation must be repeated from the start of the PM (address 0) to its end (an address four to five times K). Tracing back through the whole PM in this way outputs 1 bit of decoding result, and this output value is the decoded original data.

The operation of FIG. 7 is as follows. First, the bit of the source operand addressed by the lower K bits of the destination operand is selected. The original value of the destination operand is then shifted by 1 bit, and the selected bit is copied into the LSB of the destination.
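A single SLBIT step can be modeled as a pure function on integers. This is a hedged sketch: the function name `slbit` and its return convention are hypothetical, and the index width is taken as K - 1 bits (6 bits for K = 7, matching the 64x1 multiplexer described for this instruction), which is an interpretation rather than a confirmed encoding.

```python
def slbit(dest: int, src: int, k: int = 7):
    """Model of one SLBIT step (illustrative, not the real encoding).

    dest: destination register; its low (k - 1) bits hold the current state.
    src:  2**(k-1)-bit source operand holding one stage of selection bits.
    Returns (new_dest, msb_out), where msb_out is the bit shifted out of
    the (k - 1)-bit state field.
    """
    index_bits = k - 1
    mask = (1 << index_bits) - 1
    index = dest & mask                        # select lines of the multiplexer
    selected = (src >> index) & 1              # 2**(k-1) x 1 multiplexer
    msb_out = (dest >> (index_bits - 1)) & 1   # bit leaving the state field
    new_dest = ((dest << 1) | selected) & mask # shift left, insert at LSB
    return new_dest, msb_out


if __name__ == "__main__":
    dest = 0
    path_memory = [(1 << 64) - 1] * 7      # toy data: every selection bit is 1
    for word in path_memory:
        dest, out = slbit(dest, word)
    print(dest, out)
```

Applying the function once per stored path memory word, from the first address to the last, gives the traceback loop; in this model the MSB shifted out on the final step plays the role of the decoded bit.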

The operation of FIG. 7 can be implemented simply by adding a 2^(K-1)x1 multiplexer to the register file of a conventional DSP. The registers in the register file that hold the selection signals must be accessible in units of 2^(K-1) bits or more (for K = 7 on a 32-bit chip, two registers must be concatenated so that they can be accessed as a 64-bit unit). A constraint length of K = 7 is the most common choice, but it is advisable to build the hardware for a sufficiently large K so that other values can also be handled easily.

In this way, an operation that takes about 5 cycles on a conventional DSP, which must move data using shifts and indexed addressing, is reduced to 1 cycle. However, producing 1 bit of decoded data for K = 7 generally requires tracing back through all of the 28 to 35 (four to five times K) PM values of 2^(7-1) = 64 bits each, so high-speed operation is difficult no matter how fast the ACS operation is. A conventional DSP keeps the PM values in internal memory and moves them one at a time into a temporary register to perform the traceback, and it cannot move 64 bits at a time from memory to a register, so many cycles are consumed. For example, with a memory access time of 3 cycles and 8-bit memory data, roughly 8 x 3 + 5 = 29 cycles are needed. The PM values must therefore be handled in a register file with a 1-cycle access time, and if the register file has fewer than 28 to 35 64-bit registers, the data path between internal memory and the register file must also transfer data in parallel.
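The cycle arithmetic in this paragraph can be checked with a short calculation. The sketch below only restates the figures given in the text (64-bit PM words, 8-bit memory data, 3-cycle access, about 5 cycles for the selection step); the function names are hypothetical, and multiplying by the decoding depth to get a per-bit cost is an inference from the statement that all 28 to 35 PM values must be traced back for each decoded bit.

```python
def conventional_cycles_per_step(word_bits: int = 64, mem_bits: int = 8,
                                 access_cycles: int = 3,
                                 select_cycles: int = 5) -> int:
    """Cycles for one traceback step on a conventional DSP (rough estimate)."""
    transfers = -(-word_bits // mem_bits)    # ceiling division: 8 transfers
    return transfers * access_cycles + select_cycles


def traceback_cycles_per_bit(depth: int, cycles_per_step: int) -> int:
    """Cycles to trace back `depth` stages for one decoded bit (inferred)."""
    return depth * cycles_per_step


if __name__ == "__main__":
    step = conventional_cycles_per_step()
    for depth in (28, 35):                    # four to five times K for K = 7
        print(depth,
              traceback_cycles_per_bit(depth, step),   # conventional estimate
              traceback_cycles_per_bit(depth, 1))      # 1-cycle step from a register file
```

Under these assumptions the conventional path costs 29 cycles per step, against 1 cycle per step when the PM values sit in a register file with single-cycle access.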

FIG. 8 shows a circuit, for the case K = 7, that efficiently processes the PACS and SLBIT instructions according to the present invention. It consists of: four adders for adding four pairs of 8-bit input data; four 9-bit registers for storing the results of the additions performed in the respective adders; two comparators, each comparing the values output from two of the registers and outputting a selection signal for selecting the smaller value; two multiplexers, each receiving the same two 9-bit values that are input to the corresponding comparator and taking the 1-bit comparison result as its selection bit, so as to select the smaller value; two shifters for shifting the outputs of the multiplexers; a 64-bit shift register that stores the selection signal values output from the comparators and outputs its contents when full; a bus that carries the result values output from the shifters and, for a constraint length of 7, the output of the 64-bit shift register; a dual-port memory for storing the shifted result values passed through the bus; a 32-bit register file that stores the values output from the 64-bit shift register through the bus and, when full, outputs them starting from the first value; a 64x1 multiplexer that selects one bit from the 64-bit data output from the 32-bit register file; and a destination register that uses its 6-bit data to select, through the 64x1 multiplexer, 1 bit of the 64 bits at the first address of the register file, inserts the selected bit, shifts its existing 6 bits left by 1 bit, and outputs the 1-bit MSB.

The structure above the system bus is the detailed circuit for processing the PACS instruction. It uses the PE structure of FIG. 6 and has two PEs, so it can be implemented with a single ALU of a 32-bit machine. The circuit is drawn for K = 7 only as an example, and the ACS unit can run faster if it is extended in parallel as described for the PACS instruction. In the first clock cycle, 8-bit data pairs arrive from a pair of BMs and a pair of SMs and are added, which gives the PMs of the two paths up to the current stage. In the second clock cycle, the comparators compare the PM values of the two pairs and output selection signals that indicate the smaller value; each selection signal is fed to a multiplexer, which selects the smaller of its inputs. The selection results, 1 bit each, are stored at the LSB end of the 64-bit register, and the survivor found to be smaller is shifted right by 1 bit for normalization and stored in the SM.
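The two-cycle behaviour of one PE described above (add, then compare, select, and shift) can be sketched in software as follows. This is a minimal model under stated assumptions: the polarity of the selection bit (0 for the first path, 1 for the second) and the handling of ties are not specified in the text and are chosen here only for illustration, as is the name `acs_pe`.

```python
def acs_pe(bm_a, sm_a, bm_b, sm_b):
    """Model of one ACS processing element for a pair of competing paths.

    Returns (normalized survivor metric, selection bit).
    """
    # Cycle 1: two 8-bit additions, each producing a 9-bit path metric.
    pm_a = (bm_a & 0xFF) + (sm_a & 0xFF)
    pm_b = (bm_b & 0xFF) + (sm_b & 0xFF)

    # Cycle 2: the comparator produces the selection bit (assumed polarity),
    # and the multiplexer passes the smaller path metric.
    select = 1 if pm_b < pm_a else 0
    survivor = pm_b if select else pm_a

    # The survivor is shifted right by 1 bit for normalization before it
    # is written back as the new state metric.
    return survivor >> 1, select


print(acs_pe(10, 20, 5, 40))
```

In FIG. 8 two such PEs run side by side, so each clock produces two survivors and two selection bits.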

In FIG. 8, the two ACS operations are processed independently and the two resulting smaller values are stored in memory over the system bus. This memory serves as the SM that holds the survivors, and using a dual-port memory reduces the number of instruction cycles. The 64-bit shift register that receives the 2 selection bits shifts left by 2 bits on every clock until all 64 bits are filled. With two ACS PEs, as in FIG. 8, it shifts in units of 2 bits (that is, by the number of ACS PEs), and once 64 bits are filled, the contents are split across two 32-bit registers of the register file.
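The filling of the 64-bit shift register and the split into two 32-bit registers can be illustrated with the following sketch. The ordering of the two selection bits within each 2-bit group, and which 32-bit register receives the upper half, are assumptions made for the example; the text does not fix them.

```python
MASK64 = (1 << 64) - 1


def fill_select_register(select_pairs):
    """Shift in two selection bits per clock; returns (register value, clocks)."""
    reg = 0
    clocks = 0
    for first, second in select_pairs:
        # Shift left by 2 (the number of ACS PEs) and insert the new bits.
        reg = ((reg << 2) | ((first & 1) << 1) | (second & 1)) & MASK64
        clocks += 1
    return reg, clocks


def split_into_32bit_registers(value64):
    """Split a full 64-bit value into (upper, lower) 32-bit register contents."""
    return (value64 >> 32) & 0xFFFFFFFF, value64 & 0xFFFFFFFF


reg, clocks = fill_select_register([(1, 0)] * 32)
print(clocks, [hex(part) for part in split_into_32bit_registers(reg)])
```

With two PEs the register fills in 32 clocks, which matches the 64 states of a K = 7 code.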

The stored PM is processed by the SLBIT instruction hardware shown at the lower left of FIG. 8: the 64x1 multiplexer selects 1 bit of the 64-bit data, that bit is inserted into the destination register as the LSB, and the register is simultaneously shifted left by 1 bit. When this operation has been repeated through the last PM entry, the (7-1) = 6th bit of the destination register is the final decoded bit.
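The SLBIT step can be modelled in a few lines. This is a sketch under assumptions rather than the patented circuit itself: it assumes the 6-bit destination register value indexes the 64-bit PM word from the LSB (bit 0 for state 0), and it assumes the PM words are supplied in the order in which the traceback visits them. The names `slbit` and `trace_back` are illustrative.

```python
STATE_MASK = 0x3F  # 6-bit destination register for K = 7


def slbit(dest, pm_word):
    """One SLBIT step: returns (new destination register, bit shifted out)."""
    state = dest & STATE_MASK
    selected = (pm_word >> state) & 1   # 64x1 multiplexer (assumed bit order)
    shifted_out = (dest >> 5) & 1       # MSB leaving the 6-bit register
    new_dest = ((dest << 1) | selected) & STATE_MASK
    return new_dest, shifted_out


def trace_back(pm_words, start_state=0):
    """Apply SLBIT to every PM word; returns (final register, its 6th bit)."""
    dest = start_state & STATE_MASK
    for word in pm_words:
        dest, _ = slbit(dest, word)
    return dest, (dest >> 5) & 1


print(trace_back([0b10, 0b1000], start_state=1))
```

Each call to `slbit` corresponds to one cycle, so a traceback over 28 to 35 PM words takes 28 to 35 cycles in this model.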

The PACS instruction hardware can be implemented regardless of the word width of the programmable processor, provided the data can be accessed in 8-bit units. On a wider processor, such as a 64-bit or 128-bit machine, the number of PEs increases, so a proportional gain from parallel processing is obtained without adding special circuitry. High-speed operation is possible, however, only if the ACS cycle count is balanced against that of the survivor path calculation.

As noted above, however fast the ACS is processed, a fast ACS unit brings no benefit if the survivor path calculation lags behind. The survivor path calculation must therefore operate on storage with 1-cycle access, such as a register file. If, as shown in the lower part of FIG. 8, the register file is too small to serve as the entire PM memory, the design needs a structure that makes parallel transfers between internal RAM or cache memory and the register file easy, so that no bottleneck arises. The 6-bit shift register at the very bottom of FIG. 8, used to compute the survivor path, may be an ordinary shift register, but it must operate in parallel with the ACS operation for high-speed execution.

With a constraint length of K = 7 there are 64 states, and since the arithmetic circuit of the present invention has two ACS PEs, the ACS operation takes 32 cycles. At the same time, the selection signals are shifted left 2 bits at a time until the 64-bit shift register is full. The full 64-bit value is stored in two 32-bit registers of the register file, and the PACS operation continues in the same sequence.

If the register file has room for the PM depth of four to five times K, that is, 28 to 35 registers, or has a structure that allows parallel data transfer with memory, the SLBIT operation completes in 28 to 35 cycles and outputs 1 decoded bit. Because PACS and SLBIT are independent operations and take nearly the same number of cycles, the two instructions can be executed simultaneously in the form PACS∥SLBIT, using the parallel-operation symbol '∥' as on the TMS320C54x.
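The cycle budget implied by running PACS and SLBIT in parallel can be checked with a short calculation. The model below is an assumption for illustration: when the two operations overlap, the cost per decoded bit is taken as the larger of the two cycle counts, and when they run one after the other it is their sum.

```python
def acs_cycles(k=7, num_pe=2):
    """ACS cycles per trellis stage: 2^(K-1) states shared among the PEs."""
    return (1 << (k - 1)) // num_pe


def cycles_per_decoded_bit(acs, traceback, parallel=True):
    """Cost per decoded bit when PACS and SLBIT overlap or run serially."""
    return max(acs, traceback) if parallel else acs + traceback


acs = acs_cycles()  # 64 states / 2 PEs
print(acs, cycles_per_decoded_bit(acs, 35), cycles_per_decoded_bit(acs, 35, parallel=False))
```

With a traceback depth of 35, the overlapped cost is bounded by the traceback rather than by the ACS, which is why the text stresses balancing the two.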

The shifter can be implemented by adding a data path to the shifter already present in a general arithmetic unit. The multiplexer takes the 6-bit selection signal and picks one bit of the PM register that holds the path-memory data; the bit is inserted as the low-order bit and the register contents shift left as a whole, which carries out the traceback. Since 1 bit is output about every 32 cycles, even a DSP chip with a 100 MHz operating frequency achieves a decoding throughput of 100,000,000/32 = 3.125 Mbps, and because the code rate is 1/2, the decoder can be used in communication environments with transmission rates as high as 6.25 Mbps.
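The throughput figures follow directly from the clock frequency, the cycles per decoded bit, and the code rate. A minimal sketch of that arithmetic, with illustrative function names, is:

```python
def decoded_rate_bps(clock_hz, cycles_per_bit):
    """Decoded (information) bit rate for a given clock and per-bit cycle cost."""
    return clock_hz / cycles_per_bit


def channel_rate_bps(decoded_bps, code_rate):
    """Coded channel bit rate that the decoder can keep up with."""
    return decoded_bps / code_rate


decoded = decoded_rate_bps(100_000_000, 32)
print(decoded / 1e6, channel_rate_bps(decoded, 0.5) / 1e6)  # rates in Mbps
```

The same functions show how the rate scales with a faster clock or with more ACS PEs, provided the traceback keeps pace.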

Table 3 compares the Viterbi decoding cycle counts of the present invention with those of existing DSP chips for a constraint length of K = 7. For the ACS operation, the benchmark programs of commercial DSP chips describe only the ACS butterfly routine and its cycle count. The additional delay cycles spent on storing result data, branching, and similar overhead when the loop actually repeats the 64/4 = 16 pairs of butterfly operations are therefore denoted by α, which amounts to roughly one third of the butterfly cycle count.

| Operation block | OakDSP™ | TMS320C55x | DSP56600 | Present invention |
| --- | --- | --- | --- | --- |
| ACS operation | approx. 152 + α | approx. 96 + α | approx. 128 + α | 32 |
| Survivor path calculation | approx. 910 (= 26 × 7 × 5) | approx. 910 (= 26 × 7 × 5) | approx. 910 (= 26 × 7 × 5) | 35 |
| Total | approx. 1062 + α | approx. 1006 + α | approx. 1038 + α | 35 |

Table 3 above compares the execution cycles for a constraint length of K = 7.

Moreover, existing programmable processors perform the ACS operation and the survivor path calculation separately, whereas the present invention can perform them simultaneously, so an overall performance improvement of about 30 times or more can be obtained. Apart from the 2^(K-1)-bit shift register and the 2^(K-1)×1 multiplexer, which are marked with the @ symbol in FIG. 8, the Viterbi-specific instruction circuit of the present invention consists of functional units that are widely used in existing DSPs.
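The size of the improvement can be estimated from the totals in Table 3. The sketch below ignores the loop overhead α, which the table does not quantify, so the ratios it prints are lower bounds on the speedup under that assumption. The dictionary and function names are illustrative.

```python
# Total cycles from Table 3 with the unquantified overhead alpha left out.
BASELINE_TOTALS = {"OakDSP": 1062, "TMS320C55x": 1006, "DSP56600": 1038}
PROPOSED_TOTAL = 35


def speedups(baselines, proposed):
    """Ratio of each baseline total to the proposed total, to one decimal."""
    return {name: round(total / proposed, 1) for name, total in baselines.items()}


print(speedups(BASELINE_TOTALS, PROPOSED_TOTAL))
```

Adding α to the baseline totals only increases these ratios.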

As described above, the present invention is also economical in terms of design cost: most existing functional modules can be reused when the chip is built, and only the pipelining data-path circuits that allow continuous operation need to be added. The ACS block can easily be extended in parallel by simply replicating the same structure. For the survivor path calculation, allowing for a larger K only requires a wider 2^(K-1)×1 multiplexer and more registers concatenated to store the PM value of one state (on a 32-bit chip, 2 for K = 7, 4 for K = 8, and 8 for K = 9), which makes the structure very flexible.
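The register counts quoted above follow from dividing the 2^(K-1)-bit PM value by the register width. A short sketch of that sizing rule, with an illustrative function name and return format, is:

```python
def survivor_hw_size(k, register_bits=32):
    """Multiplexer width and concatenated register count for constraint length k."""
    pm_bits = 1 << (k - 1)                       # one selection bit per state
    registers = -(-pm_bits // register_bits)     # ceiling division
    return {"mux_inputs": pm_bits, "select_bits": k - 1, "registers": registers}


for k in (7, 8, 9):
    print(k, survivor_hw_size(k))
```

Each increment of K doubles both the multiplexer width and the number of concatenated registers, while the selection signal grows by only one bit.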

In addition, whereas existing DSP technology achieves rates in the 15 kbps range, the structure proposed by the present invention, running on a DSP chip with a 100 MHz operating frequency, supports a rate of 6.25 Mbps and can therefore be used even at the 2 Mbps required by wireless multimedia services such as IMT-2000. Recent DSP process technology already allows operation at several hundred MHz or more, and register-file density is growing rapidly. As semiconductor process technology advances, the achievable rate can rise further, and because the proposed structures can be extended in parallel, rates from several to several tens of Mbps should be attainable. DSP technology, which is currently limited to voice data transmission, could thus be raised to a level at which it can implement multimedia communication services such as CATV and VOD that satisfy standards such as LMDS (Local Multipoint Distribution Systems) and MCNS (Multimedia Cable Network System).

Furthermore, forward error correction (FEC) for communications, which has so far been dominated by dedicated communication ASICs, can instead be handled by a DSP that also performs modulation and demodulation, which reduces cost and enables lower power consumption and smaller size in portable terminals.

Claims

A Viterbi decoding arithmetic circuit in a programmable processor for executing the Viterbi decoding algorithm, comprising:

four adders for adding four pairs of 8-bit input data;

four 9-bit registers for storing the results of the additions performed in the respective adders;

two comparators, each comparing the values output from two of the registers and outputting a selection signal value for selecting the smaller value;

two multiplexers, each receiving the same two 9-bit values that are input to the corresponding comparator and receiving the 1-bit comparison result of that comparator as a selection bit, so as to select the smaller value;

two shifters for shifting the outputs of the multiplexers;

a 64-bit shift register that stores the selection signal values output from the comparators and outputs its contents when full;

a bus for passing the result values output from the shifters and, when the constraint length is 7, the output value of the 64-bit shift register;

a dual-port memory for storing the shifted result values passed through the bus;

a 32-bit register file that stores the values output from the 64-bit shift register through the bus and, when full, outputs them starting from the first value;

a 64x1 multiplexer for multiplexing the 64-bit data output from the 32-bit register file; and

a destination register that uses its 6-bit data to select, through the 64x1 multiplexer, 1 bit of the 64 bits at the first address of the register file, inserts the selected bit, shifts its existing 6 bits left by 1 bit, and outputs the 1-bit MSB.

The circuit of claim 1,

wherein, when the number of bits of the input data increases, the circuit can be replicated and used in parallel.

The circuit of claim 1,

wherein, when the storage space of the 32-bit register file is insufficient, internal RAM or cache memory is used, and the circuit has a structure that can move several values at the same time.

The circuit of claim 1,

wherein the circuit keeps the same configuration as the circuit of claim 1 even if the number of bits of the initial input/output data changes.

The circuit of claim 1,

wherein the circuit keeps the same configuration even if the constraint length changes.

delete

A Viterbi decoding operation method that processes four pairs of input data with two ACS operations using the Viterbi decoding arithmetic circuit of claim 1, the method comprising:

a first step of adding each pair of input data and storing the results in the four 9-bit registers;

a second step of comparing, two at a time, the results stored in the four 9-bit registers, storing the smaller values in the shifters, and storing the two selection bits resulting from the comparisons in the 64-bit shift register, 2 bits at a time;

a third step of simultaneously repeating the first step by inputting four new pairs of data, performing the additions, and storing the results in the 9-bit registers;

a fourth step of outputting from the shifters the two minimum values produced by the ACS operations on the first four pairs of data and storing them in the dual-port memory through the bus;

a fifth step of repeating the first to fourth steps until the 64-bit shift register is completely full and then transferring the resulting 64-bit value to the register file through the bus; and

a step of, if the register file has space to store 32 64-bit values, using the 64x1 multiplexer to select, with the 6-bit data of the destination register, 1 bit of the 64 bits at the first address of the register file, inserting the selected bit into the LSB, shifting the existing 6 bits left by 1 bit, and outputting the 1-bit Viterbi-decoded MSB.