KR20220146937A

KR20220146937A - Parallel processing apparatus for supporting variable bit number

Info

Publication number: KR20220146937A
Application number: KR1020210053858A
Authority: KR
Inventors: 김태형
Original assignee: 주식회사 모르미
Priority date: 2021-04-26
Filing date: 2021-04-26
Publication date: 2022-11-02
Also published as: KR102574824B1

Abstract

According to an embodiment of the present invention, a parallel processing device includes: a pre-processing part that includes pre-processing units and receives first signals and second signals; and a main processing part that includes adders. Each pre-processing unit includes a shift operation part. The shift operation part transmits signals obtained by shifting the corresponding first signal to the corresponding adder, according to the bits of the corresponding second signal. According to a division selection signal, it also forces some bits of the shifted signals to 0.

Description

Parallel processing apparatus supporting a variable number of bits

The technology described below relates to a parallel processing apparatus.

Much research has been conducted on parallel processing apparatuses in pursuit of high data processing performance. One example of a parallel processing apparatus is a multi-core processor, which has a plurality of cores (processing units). Multi-core processors are used because increasing the number of cores is expected to improve the performance of the processor as a whole. For various reasons, however, overall performance does not grow in proportion to the number of cores.

The inventor of the present invention has been continuously developing solutions to these problems. On that basis, the inventor has made the inventions disclosed in Korean Patent Publication Nos. 10-2019-0132295, 10-2018-0057950, 10-2018-0058166, 10-2018-0058167, 10-2018-0007523, and 10-2018-0007652, and in Korean Patent Registration No. 10-1859294.

Korean Patent Publication Nos.: 10-2019-0132295, 10-2018-0057950, 10-2018-0058166, 10-2018-0058167, 10-2018-0007523, 10-2018-0007652
Korean Patent Registration No.: 10-1859294

The prior-art parallel processing apparatus has separate adders and multipliers, which lowers its efficiency. When many multiplications are required, all multipliers are used but some adders sit idle. When many additions are required, all adders are used but some multipliers sit idle. In addition, data exchange between processing units is not easy in the prior-art apparatus, which degrades the performance of the apparatus as a whole.

The present disclosure addresses these problems of the prior art. It aims to increase the efficiency of the parallel processing apparatus by designing each processing unit to perform both multiplication and addition. It further aims to raise overall efficiency by making data exchange between processing units easy and by allowing each processing unit to perform different operations over time. Finally, it aims to achieve these improvements without significantly increasing the overall hardware complexity.

The present disclosure also aims to support inputs having a variable number of bits, which are needed for several reasons. In deep learning algorithms such as ANNs (Artificial Neural Networks), DNNs (Deep Neural Networks), CNNs (Convolutional Neural Networks), and RNNs (Recurrent Neural Networks), the number of bits of the input data and weights affects performance, processing speed, required memory capacity, and so on. It therefore needs to be adjusted to the requirements at hand. Likewise, a parallel processing apparatus that performs many mathematical operations may suit 16-bit operations in some cases and 32-bit operations in others. If the processor is designed for the maximum number of bits, a significant part of it stays idle whenever operations with fewer bits are performed, which reduces its efficiency. For example, performing a 16-bit multiplication on a 32-bit multiplier uses only about 25% of the hardware.
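The 25% figure can be reproduced with a rough model. This sketch assumes a simple array multiplier whose partial-product cells (one AND gate per pair of operand bits) scale with the product of the operand widths. The helper names are hypothetical, and real multiplier designs such as Booth-encoded or Wallace-tree multipliers differ in detail.

```python
def array_multiplier_cells(a_bits: int, b_bits: int) -> int:
    """Partial-product cells in a simple a_bits x b_bits array multiplier (one per bit pair)."""
    return a_bits * b_bits


def utilization(op_bits: int, hw_bits: int) -> float:
    """Fraction of a hw_bits x hw_bits array exercised by an op_bits x op_bits multiplication."""
    return array_multiplier_cells(op_bits, op_bits) / array_multiplier_cells(hw_bits, hw_bits)


print(utilization(16, 32))  # 0.25
print(utilization(8, 32))   # 0.0625
```

Under this model, utilization falls quadratically with operand width, which is why halving the precision leaves three quarters of the array idle.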

A parallel processing apparatus according to an embodiment includes: a preprocessor that includes preprocessing units and receives first signals and second signals; and a main processing unit that includes summers. Each preprocessing unit includes a shift operation unit. The shift operation unit transfers signals obtained by shifting the corresponding first signal to the corresponding summer, according to the bits of the corresponding second signal. According to a division selection signal, it also forces some bits of the shifted signals to 0.

Each unit of the parallel processing apparatus according to the present disclosure can perform both multiplication and addition, so the apparatus achieves high parallel processing efficiency. It can additionally perform displacement and shift operations. It also makes data exchange between processing units easy and lets each processing unit perform different operations over time. Despite these improvements, hardware complexity does not increase significantly.

In addition, the parallel processing apparatus according to the present disclosure can support inputs having a variable number of bits.

FIG. 1 is a diagram showing a parallel processing apparatus according to a first embodiment.
FIG. 2 is a diagram for explaining an example of the i-th preprocessing unit of the first embodiment.
FIG. 3 is a diagram showing a parallel processing apparatus according to a second embodiment.
FIG. 4 is a diagram showing a parallel processing apparatus according to a third embodiment.
FIG. 5 is a diagram for explaining an example of the i-th preprocessing unit of the third embodiment.
FIG. 6 is a diagram showing a parallel processing apparatus according to a fourth embodiment.

The technology described below can be modified in various ways and can have various embodiments, so specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the technology to those specific embodiments. It should be understood to include all modifications, equivalents, and substitutes within its spirit and scope.

Terms such as first, second, A, and B may be used to describe various components. The components are not limited by these terms, which serve only to distinguish one component from another. For example, a first component may be called a second component, and a second component may likewise be called a first component, without departing from the scope of the technology described below. The term "and/or" covers any combination of a plurality of related listed items, or any one of them.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprises" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof. They do not exclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before the drawings are described in detail, note that the components in this specification are distinguished only by the main function each one performs. Two or more components described below may be combined into one, or one component may be divided into two or more by more specific function. Each component may also perform some or all of the functions of other components in addition to its own main function. Conversely, some of a component's main functions may be performed exclusively by another component.

When a method or operating method is performed, its steps may occur in an order different from the stated one unless the context clearly requires a specific order. That is, the steps may occur in the stated order, substantially simultaneously, or in the reverse order.

FIG. 1 is a diagram showing a parallel processing apparatus according to the first embodiment. Referring to FIG. 1, the parallel processing apparatus receives first to N-th inputs ({X1, Y1}, {X2, Y2}, ... {XN, YN}) and outputs first to N-th outputs (M1, M2, ... MN). Here, N is a natural number equal to or greater than 4; for example, N may be 32. The first to N-th inputs consist of first signals (X1, X2, ... XN) and second signals (Y1, Y2, ... YN). The parallel processing apparatus includes a preprocessor 100 and a main processing unit 200. It may further include a delay unit 300 and a selection unit 400.

In the summing mode, the preprocessor 100 transfers the first signal Xi of the i-th input {Xi, Yi} to the summers SUM1, SUM2, ... SUMN of the main processing unit 200. Whether Xi reaches each summer is determined by the i-th bits Y1[i], Y2[i], ... YN[i] of the second signals Y1, Y2, ... YN, respectively. Here, i is a natural number from 1 to N. Let the inputs of the first summer SUM1 be S1_1, S1_2, ... S1_N, the inputs of the second summer SUM2 be S2_1, S2_2, ... S2_N, the inputs of the third summer SUM3 be S3_1, S3_2, ... S3_N, and so on. The summing-mode operation of the preprocessor 100 can then be expressed, for example, in the following pseudo code.

[Equation 1]

(Y1[1] ? X1 : 0) => S1_1,
(Y1[2] ? X2 : 0) => S1_2,
......
(Y1[N] ? XN : 0) => S1_N,
(Y2[1] ? X1 : 0) => S2_1,
(Y2[2] ? X2 : 0) => S2_2,
......
(Y2[N] ? XN : 0) => S2_N,
......
(YN[1] ? X1 : 0) => SN_1,
(YN[2] ? X2 : 0) => SN_2,
......
(YN[N] ? XN : 0) => SN_N

In the above equation, [(YN[N] ? XN : 0) => SN_N] means that XN is transferred to SN_N when YN[N] is 1, and 0 is transferred to SN_N when YN[N] is 0. YN[1] denotes the 1st bit (least significant bit), and YN[N] denotes the N-th bit (most significant bit).
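Equation 1 can be read as an N x N routing table. The Python sketch below is a behavioral model only, not a description of the hardware. `bit` implements the document's 1-indexed notation Y[k], and `summing_mode_routing` (a hypothetical name) returns the value delivered to every summer input Sj_k.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def summing_mode_routing(xs: list[int], ys: list[int]) -> list[list[int]]:
    """S[j-1][k-1] is the value at input Sj_k of summer SUMj.

    Equation 1: (Yj[k] ? Xk : 0) => Sj_k, so row j picks first signals by the bits of Yj.
    """
    n = len(xs)
    return [[xs[k - 1] if bit(ys[j - 1], k) else 0 for k in range(1, n + 1)]
            for j in range(1, n + 1)]


S = summing_mode_routing([10, 20, 30, 40], [0b1011, 0b1100, 0b0010, 0b0111])
for row in S:
    print(row)
# [10, 20, 0, 40]
# [0, 0, 30, 40]
# [0, 20, 0, 0]
# [10, 20, 30, 0]
```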

In the multiplication mode, the preprocessor 100 shifts the first signals by (i-1) bits, producing (X1<<(i-1)), (X2<<(i-1)), ... (XN<<(i-1)). It transfers these shifted signals to the summers SUM1, SUM2, ... SUMN according to the i-th bits Y1[i], Y2[i], ... YN[i] of the second signals Y1, Y2, ... YN, respectively. The multiplication-mode operation of the preprocessor 100 can be expressed, for example, in the following pseudo code.

[Equation 2]

(Y1[1] ? (X1 << 0) : 0) => S1_1,
(Y1[2] ? (X1 << 1) : 0) => S1_2,
......
(Y1[N] ? (X1 << (N-1)) : 0) => S1_N,
(Y2[1] ? (X2 << 0) : 0) => S2_1,
(Y2[2] ? (X2 << 1) : 0) => S2_2,
......
(Y2[N] ? (X2 << (N-1)) : 0) => S2_N,
......
(YN[1] ? (XN << 0) : 0) => SN_1,
(YN[2] ? (XN << 1) : 0) => SN_2,
......
(YN[N] ? (XN << (N-1)) : 0) => SN_N

In the above equation, [(XN << (N-1))] means shifting XN to the left (toward the most significant bit) by (N-1) bits.
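Equation 2 is the standard shift-and-add decomposition of a product. Summing the shifted copies of Xj selected by the bits of Yj gives Xj*Yj, provided Yj fits in N bits. A behavioral sketch in Python follows; the function names are hypothetical.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def multiplication_mode_routing(xs: list[int], ys: list[int]) -> list[list[int]]:
    """S[j-1][k-1] is the value at input Sj_k: (Yj[k] ? (Xj << (k-1)) : 0), per Equation 2."""
    n = len(xs)
    return [[(xs[j - 1] << (k - 1)) if bit(ys[j - 1], k) else 0 for k in range(1, n + 1)]
            for j in range(1, n + 1)]


xs = [3, 5, 7, 9]
ys = [0b0110, 0b1111, 0b0001, 0b1010]  # each Yj fits in N = 4 bits
S = multiplication_mode_routing(xs, ys)
print(S[0])                     # [0, 6, 12, 0]
print([sum(row) for row in S])  # [18, 75, 7, 90], i.e. X1*Y1, ..., X4*Y4
```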

For example, the preprocessor 100 operates in the summing mode or the multiplication mode according to the operation mode selection signals SF1, SF2, ... SFN. A single operation mode selection signal may be allocated to the whole parallel processing apparatus, in which case the entire apparatus operates in either the summing mode or the multiplication mode. Alternatively, N operation mode selection signals SF1, SF2, ... SFN may be allocated. In that case, some of the N outputs M1, M2, ... MN can be configured as summing-mode results and the rest as multiplication-mode results. For example, when N is 4, the operation mode selection signals SF1, SF2, SF3, and SF4 may be set so that M1, M2, and M3 are obtained in the multiplication mode and M4 in the summing mode.

For example, the preprocessor 100 includes a plurality of preprocessing units 150_1, 150_2, ... 150_N. These include selection operation units 110_1, 110_2, ... 110_N and shift operation units 120_1, 120_2, ... 120_N. The preprocessing unit 150_i includes the selection operation unit 110_i and the shift operation unit 120_i.

The selection operation unit 110_i operates when the preprocessing unit 150_i is in the summing mode. It transfers the first signals X1, X2, ... XN to the summer SUMi according to the bits Yi[1], Yi[2], ... Yi[N] of the second signal Yi. Its operation can be expressed, for example, in the following pseudo code.

[Equation 3]

(Yi[1] ? X1 : 0) => Si_1,
(Yi[2] ? X2 : 0) => Si_2,
......
(Yi[N] ? XN : 0) => Si_N

The shift operation unit 120_i operates when the preprocessing unit 150_i is in the multiplication mode. It shifts the first signal Xi by 0, 1, ... (N-1) bits, producing (Xi<<0), (Xi<<1), ... (Xi<<(N-1)). It transfers these shifted signals to the summer SUMi according to the bits Yi[1], Yi[2], ... Yi[N] of the second signal Yi. Its operation can be expressed, for example, in the following pseudo code.

[Equation 4]

(Yi[1] ? (Xi << 0) : 0) => Si_1,
(Yi[2] ? (Xi << 1) : 0) => Si_2,
......
(Yi[N] ? (Xi << (N-1)) : 0) => Si_N

The preprocessing unit 150_i activates either the selection operation unit 110_i or the shift operation unit 120_i according to the operation mode selection signal SFi. For example, suppose SFi = 0 selects the selection operation unit 110_i and SFi = 1 selects the shift operation unit 120_i. Then SF1 = 0, SF2 = 0, and SFN = 1 mean that the first, second, and N-th preprocessing units 150_1, 150_2, and 150_N activate the selection operation unit 110_1, the selection operation unit 110_2, and the shift operation unit 120_N, respectively.
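A preprocessing unit 150_i can be modeled as a function that dispatches on SFi. This sketch assumes the example encoding given above (SFi = 0 for the selection operation unit, SFi = 1 for the shift operation unit). The function name is hypothetical.

```python
SELECT, SHIFT = 0, 1  # example encoding of SFi from the text


def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def preprocessing_unit(i: int, xs: list[int], y_i: int, sf_i: int) -> list[int]:
    """Return Si_1..Si_N produced by preprocessing unit 150_i (i is 1-indexed)."""
    n = len(xs)
    if sf_i == SELECT:  # selection operation unit 110_i, Equation 3
        return [xs[k - 1] if bit(y_i, k) else 0 for k in range(1, n + 1)]
    if sf_i == SHIFT:   # shift operation unit 120_i, Equation 4
        return [(xs[i - 1] << (k - 1)) if bit(y_i, k) else 0 for k in range(1, n + 1)]
    raise ValueError("sf_i must be 0 (SELECT) or 1 (SHIFT)")


xs = [1, 2, 4, 8]
print(preprocessing_unit(1, xs, 0b0101, SELECT))  # [1, 0, 4, 0]
print(preprocessing_unit(4, xs, 0b0101, SHIFT))   # [8, 0, 32, 0]
```

With the same Yi = 0101, the selection unit picks X1 and X3 from all first signals, while the shift unit uses only its own X4 and produces partial products that sum to X4*Yi = 40.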

The main processing unit 200 includes the summers SUM1, SUM2, ... SUMN. The i-th summer SUMi adds the transferred signals Si_1, Si_2, ... Si_N and outputs the result as the i-th output Mi. The operation of the main processing unit 200 can be expressed, for example, in the following pseudo code.

[Equation 5]

S1_1 + S1_2 + ... + S1_N => M1,
S2_1 + S2_2 + ... + S2_N => M2,
......
SN_1 + SN_2 + ... + SN_N => MN
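Combining the preprocessing units with the summers of Equation 5 gives a behavioral model of the FIG. 1 datapath, without the delay and selection units. The sketch below uses hypothetical names and the example SFi encoding (0 = summing, 1 = multiplication). The usage reproduces the earlier N = 4 case in which M1 to M3 are obtained in the multiplication mode and M4 in the summing mode.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def preprocess(j: int, xs: list[int], y_j: int, sf_j: int) -> list[int]:
    """Sj_1..Sj_N from preprocessing unit 150_j (Equations 3 and 4); j is 1-indexed."""
    n = len(xs)
    if sf_j == 0:  # summing mode: route Xk when Yj[k] = 1
        return [xs[k - 1] if bit(y_j, k) else 0 for k in range(1, n + 1)]
    # multiplication mode: route Xj << (k-1) when Yj[k] = 1
    return [(xs[j - 1] << (k - 1)) if bit(y_j, k) else 0 for k in range(1, n + 1)]


def main_processing(s_rows: list[list[int]]) -> list[int]:
    """Summers SUM1..SUMN: Mj = Sj_1 + Sj_2 + ... + Sj_N (Equation 5)."""
    return [sum(row) for row in s_rows]


def parallel_process(xs: list[int], ys: list[int], sf: list[int]) -> list[int]:
    """Outputs M1..MN for one turn, given first signals, second signals and SF1..SFN."""
    return main_processing([preprocess(j, xs, ys[j - 1], sf[j - 1])
                            for j in range(1, len(xs) + 1)])


xs = [2, 3, 4, 5]
ys = [0b0011, 0b0100, 0b1001, 0b1111]
print(parallel_process(xs, ys, sf=[1, 1, 1, 0]))  # [6, 12, 36, 14]
```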

The delay unit 300 delays the first to N-th outputs M1, M2, ... MN according to the clock signal CLK and then outputs them. To this end, the delay unit 300 includes a plurality of delay units DU1, DU2, ... DUN. The signals D1, D2, ... DN output from the delay unit 300 correspond to the first to N-th outputs M1, M2, ... MN, respectively.

The selection unit 400 chooses among the signals R1, R2, ... RN transferred from a memory (not shown) and the signals D1, D2, ... DN output from the delay unit 300, according to input control signals SI1, SI2, ... SIN. It outputs the selected signals as the first signals X1, X2, ... XN. For example, as shown in the drawing, the input control signal SIi may select the first signal Xi from two candidates: the signal Ri transferred from the memory and the signal Di output from the delay unit 300. In another example, SIi selects Xi from three candidates: the memory signal Ri, the delay-unit output Di corresponding to the i-th output Mi, and the delay-unit output D(i-1) corresponding to the (i-1)-th output M(i-1). The memory (not shown) may include, for example, a plurality of banks.
For example, the memory may have N banks connected to the N inputs {X1, Y1}, {X2, Y2}, ... {XN, YN}, respectively. Alternatively, the memory may have 2*N banks. N of these banks are connected to the N first signals X1, X2, ... XN, respectively, and the remaining N banks to the N second signals Y1, Y2, ... YN, respectively.
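The feedback path through the delay unit 300 and the selection unit 400 lets the results of one clock cycle become the first signals of the next. The sketch below rests on assumptions the text does not state. The encoding of SIi (0 = Ri, 1 = Di, 2 = D(i-1)) and all function names are hypothetical. The two-cycle dot product is only one possible use of the loop, chosen for illustration.

```python
FROM_MEMORY, FROM_OWN_DELAY, FROM_PREV_DELAY = 0, 1, 2  # hypothetical SIi encoding


def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def parallel_process(xs: list[int], ys: list[int], sf: list[int]) -> list[int]:
    """Behavioral FIG. 1 datapath: sf[j] == 0 -> summing mode, 1 -> multiplication mode."""
    n = len(xs)
    out = []
    for j in range(n):
        if sf[j] == 0:
            s = [xs[k - 1] if bit(ys[j], k) else 0 for k in range(1, n + 1)]
        else:
            s = [(xs[j] << (k - 1)) if bit(ys[j], k) else 0 for k in range(1, n + 1)]
        out.append(sum(s))
    return out


def select_first_signals(rs: list[int], ds: list[int], sis: list[int]) -> list[int]:
    """Selection unit 400: choose Xi from Ri, Di or D(i-1) according to SIi."""
    xs = []
    for i, si in enumerate(sis):  # i is 0-indexed here
        if si == FROM_MEMORY:
            xs.append(rs[i])
        elif si == FROM_OWN_DELAY:
            xs.append(ds[i])
        elif si == FROM_PREV_DELAY and i > 0:
            xs.append(ds[i - 1])
        else:
            raise ValueError(f"unsupported SI value {si} at position {i + 1}")
    return xs


rs = [2, 3, 4, 5]  # operands read from memory
n = len(rs)
# Cycle 1: N products Xi*Yi from memory operands (multiplication mode everywhere).
d = parallel_process(select_first_signals(rs, [0] * n, [FROM_MEMORY] * n),
                     [6, 7, 8, 9], [1] * n)
# Clock edge: the delay units DU1..DUN latch M1..MN, so D = M.
# Cycle 2: feed D back and let SUM1 add all products (Y1 = 1111, summing mode).
m = parallel_process(select_first_signals(rs, d, [FROM_OWN_DELAY] * n),
                     [0b1111, 0, 0, 0], [0] * n)
print(d)     # [12, 21, 32, 45]
print(m[0])  # 110, the dot product of [2, 3, 4, 5] and [6, 7, 8, 9]
```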

With this configuration, the parallel processing apparatus can perform various operations with a single piece of hardware. One of these is the partial summation operation, which here means summing all or some of the first signals X1, X2, ... XN. Partial summation requires the preprocessor 100 to operate in the summing mode. The i-th output Mi then equals the sum of the first signals selected by the bits Yi[1], Yi[2], ... Yi[N] of the second signal Yi. For example, if N is 4 and the second signals Y1, Y2, Y3, Y4 are 1011, 1100, 0010, 0111 in binary, the outputs M1, M2, M3, M4 are X4+X2+X1, X4+X3, X2, and X3+X2+X1, respectively. In the summing mode, the preprocessor 100 can thus perform N partial summation operations simultaneously.
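The partial-summation example can be checked numerically. Choosing X1..X4 = 1, 10, 100, 1000 (an illustrative choice, not from the text) makes each decimal digit of Mj show whether Xk was included. The outputs therefore read back as the binary patterns of Y1..Y4, with leading zeros dropped. The function name is hypothetical.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def partial_sums(xs: list[int], ys: list[int]) -> list[int]:
    """All units in summing mode: Mj is the sum of the Xk with Yj[k] = 1."""
    n = len(xs)
    return [sum(xs[k - 1] for k in range(1, n + 1) if bit(y, k)) for y in ys]


xs = [1, 10, 100, 1000]                # X1..X4
ys = [0b1011, 0b1100, 0b0010, 0b0111]  # Y1..Y4 from the example
print(partial_sums(xs, ys))            # [1011, 1100, 10, 111]
```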

A displacement operation can also be performed in the summing mode. Here, a displacement operation means changing the positions of the first signals Xi as they are transferred to the outputs M1, M2, ... MN. For example, if N is 4 and the second signals Y1, Y2, Y3, Y4 are 1000, 0100, 0010, 0001 in binary, the outputs M1, M2, M3, M4 are X4, X3, X2, X1, respectively. The first signals X1, X2, X3, X4 thus reach the outputs M1, M2, M3, M4 in rearranged positions.
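More generally, a displacement (any permutation) is obtained by making each Yj one-hot: setting only bit k of Yj routes Xk to Mj. The sketch below uses a hypothetical helper that builds the second signals from a target permutation. It reproduces the reversal example above.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def partial_sums(xs: list[int], ys: list[int]) -> list[int]:
    """All units in summing mode: Mj is the sum of the Xk with Yj[k] = 1."""
    n = len(xs)
    return [sum(xs[k - 1] for k in range(1, n + 1) if bit(y, k)) for y in ys]


def displacement_controls(perm: list[int]) -> list[int]:
    """perm[j-1] = k means output Mj should receive Xk; returns one-hot Y1..YN."""
    return [1 << (k - 1) for k in perm]


ys = displacement_controls([4, 3, 2, 1])
print([format(y, "04b") for y in ys])      # ['1000', '0100', '0010', '0001']
print(partial_sums([11, 22, 33, 44], ys))  # [44, 33, 22, 11]
```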

When the parallel processing apparatus performs partial summation and displacement operations as described above, data exchange between processing units becomes easy. Here, the i-th processing unit consists of the i-th preprocessing unit and the i-th summer. For example, the first processing unit (150_1, SUM1) can receive not only its own first signal X1 but also the second to N-th first signals X2, ... XN, and perform partial summation on them. Likewise, the second processing unit (150_2, SUM2) can receive a first signal other than X2, for example the N-th first signal XN.

The parallel processing apparatus can also perform multiplication, which requires the preprocessor 100 to operate in the multiplication mode. The i-th output Mi then equals the product Xi*Yi of the first signal Xi and the second signal Yi. For example, if N is 4, the outputs M1, M2, M3, M4 are X1*Y1, X2*Y2, X3*Y3, X4*Y4, respectively. In the multiplication mode, the preprocessor 100 can thus perform N multiplications simultaneously.
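The following is a behavioral check of the multiplication example for N = 4, assuming each Yj fits in N bits. The function name and operand values are illustrative.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def multiply_all(xs: list[int], ys: list[int]) -> list[int]:
    """All units in multiplication mode: Mj = sum of (Xj << (k-1)) over k with Yj[k] = 1."""
    n = len(xs)
    out = []
    for x, y in zip(xs, ys):
        if not 0 <= y < (1 << n):
            raise ValueError("each Yj must fit in N bits")
        out.append(sum(x << (k - 1) for k in range(1, n + 1) if bit(y, k)))
    return out


print(multiply_all([3, 5, 7, 9], [2, 4, 6, 8]))  # [6, 20, 42, 72]
```

The range check reflects the fact that only the N bits Yj[1]..Yj[N] take part in the routing. A wider Yj would be silently truncated rather than multiplied.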

A shift operation can also be performed in the multiplication mode. For example, if N is 4 and the second signals Y1, Y2, Y3, Y4 are 1000, 0100, 0010, 0001 in binary, the outputs M1, M2, M3, M4 are (X1<<3), (X2<<2), (X3<<1), (X4<<0), respectively.
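The shift operation is the special case of the multiplication mode in which Yj is one-hot: Yj = 1 << s gives Mj = Xj << s. The sketch below applies a behavioral model (hypothetical names, illustrative X values) to the example.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def multiply_all(xs: list[int], ys: list[int]) -> list[int]:
    """All units in multiplication mode: Mj = sum of (Xj << (k-1)) over k with Yj[k] = 1."""
    n = len(xs)
    return [sum(x << (k - 1) for k in range(1, n + 1) if bit(y, k)) for x, y in zip(xs, ys)]


xs = [3, 5, 7, 9]
ys = [0b1000, 0b0100, 0b0010, 0b0001]
print(multiply_all(xs, ys))                        # [24, 20, 14, 9]
print([x << s for x, s in zip(xs, [3, 2, 1, 0])])  # same values, computed directly
```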

The parallel processing apparatus can also perform different operations at the same time. For example, when N is 4, the following operations can be performed simultaneously.

M1 = X2 + X3 + X4 [partial sum operation]

M2 = X1 [displacement operation]

M3 = X3 * Y3 [multiplication operation]

M4 = (X4 << 2) [shift operation]

To do this, SF1 and SF2 must be set so that the first and second preprocessing units 150_1 and 150_2 are in the summing mode. SF3 and SF4 must be set so that the third and fourth preprocessing units 150_3 and 150_4 are in the multiplication mode. In addition, Y1 must be set to 1110 for the partial summation, Y2 to 0001 for the displacement, and Y4 to 0100 for the shift.
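The first-turn configuration can be checked with a behavioral model of the FIG. 1 datapath. The operand values (X1..X4 = 3, 5, 7, 9 and Y3 = 6) are arbitrary illustrative choices. SFi uses the example encoding 0 = summing mode, 1 = multiplication mode, and the function names are hypothetical.

```python
def bit(y: int, k: int) -> int:
    """Return Y[k]: the k-th bit of y, counted from 1 at the least significant bit."""
    return (y >> (k - 1)) & 1


def parallel_process(xs: list[int], ys: list[int], sf: list[int]) -> list[int]:
    """Behavioral FIG. 1 datapath: sf[j] == 0 -> summing mode, 1 -> multiplication mode."""
    n = len(xs)
    out = []
    for j in range(n):
        if sf[j] == 0:  # selection operation unit: pick Xk where Yj[k] = 1
            s = [xs[k - 1] if bit(ys[j], k) else 0 for k in range(1, n + 1)]
        else:           # shift operation unit: Xj << (k-1) where Yj[k] = 1
            s = [(xs[j] << (k - 1)) if bit(ys[j], k) else 0 for k in range(1, n + 1)]
        out.append(sum(s))  # summer SUMj
    return out


X1, X2, X3, X4 = 3, 5, 7, 9
Y3 = 6
m = parallel_process([X1, X2, X3, X4], [0b1110, 0b0001, Y3, 0b0100], sf=[0, 0, 1, 1])
print(m)  # [21, 3, 42, 36]: X2+X3+X4, X1, X3*Y3, X4<<2
```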

In addition, changing the selection signals SFi and the second signals Yi changes the operation of the parallel processing apparatus independently on every turn. For example, suppose M1, M2, M3, and M4 perform partial summation, displacement, multiplication, and shift operations, respectively, in the first turn as described above. In the second turn they can then perform multiplication, multiplication, partial summation, and partial summation, as follows.

M1 = X1 * Y1 [multiplication operation]

M2 = X2 * Y2 [multiplication operation]

M3 = X1 + X2 + X3 + X4 [partial sum operation]

M4 = X2 + X4 [partial sum operation]

To this end, the selection signals SF1 and SF2 must be set so that the first and second preprocessing units 150_1 and 150_2 are in the multiplication mode, and the selection signals SF3 and SF4 must be set so that the third and fourth preprocessing units 150_3 and 150_4 are in the summing mode. In addition, Y3 and Y4 must be set to 1111 and 1010, respectively, so that the partial sum operations above are performed.
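The Y settings used in both turns follow one rule: bit j of Yi (counting from 1 at the least significant end) is set for every first signal Xj to be added, and in multiplication mode a single set bit j selects the shift Xi << (j-1). The helper below is a hypothetical sketch of that rule, again assuming the MSB-first reading of the bit strings; it reproduces every pattern quoted in the text.

```python
from __future__ import annotations


def selection_pattern(indices: set[int], n: int = 4) -> str:
    """MSB-first n-bit string with bit j (1 = least significant) set for every j in indices.

    In summing mode this selects the first signals Xj to be added;
    in multiplication mode a single bit j selects the shift Xi << (j-1).
    """
    value = sum(1 << (j - 1) for j in indices)
    return format(value, f"0{n}b")


def shift_pattern(k: int, n: int = 4) -> str:
    """Pattern that makes a multiplication-mode unit output Xi << k."""
    return selection_pattern({k + 1}, n)


print(selection_pattern({2, 3, 4}))     # 1110 -> X2 + X3 + X4 (first turn, M1)
print(selection_pattern({1}))           # 0001 -> X1 (first turn, M2)
print(shift_pattern(2))                 # 0100 -> X4 << 2 (first turn, M4)
print(selection_pattern({1, 2, 3, 4}))  # 1111 -> X1 + X2 + X3 + X4 (second turn, M3)
print(selection_pattern({2, 4}))        # 1010 -> X2 + X4 (second turn, M4)
```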

As described above, the parallel processing apparatus according to the first embodiment can perform N independent operations simultaneously, and each of the N operations can be changed independently at every turn. This can maximize the efficiency of the parallel processing apparatus.

Suppose, for comparison, a parallel processing apparatus having a partial-sum parallel processor that performs a plurality of partial sum operations, a displacement parallel processor that performs a plurality of displacement operations, a multiplication parallel processor that performs a plurality of multiplication operations, and a shift parallel processor that performs a plurality of shift operations. When many multiplication operations are required, the multiplication parallel processor may be fully (100%) utilized, but utilization of the partial-sum, displacement and shift parallel processors will be low. Likewise, when many partial sum operations are required, the partial-sum parallel processor may be fully utilized, but utilization of the displacement, multiplication and shift parallel processors will be low.

In contrast, the parallel processing apparatus according to the first embodiment can maximize its utilization by changing, at every moment, the operations performed by the preprocessing units 150_1, 150_2, ... 150_N. For example, when many multiplication operations are required, most of the preprocessing units 150_1, 150_2, ... 150_N can be set to perform multiplication while the rest perform other operations, so that most of the preprocessing units are in use. Likewise, when many partial sum operations are required, most of the preprocessing units can be set to perform partial sums while the rest perform other operations, again keeping most of the preprocessing units 150_1, 150_2, ... 150_N in use.

FIG. 2 is a diagram illustrating an example of the i-th preprocessing unit of the first embodiment. Referring to FIG. 2, the preprocessing unit includes a selection operation unit 110_i and a shift operation unit 120_i.

The selection operation unit 110_i includes a plurality of demultiplexers DM1, DM2, ... DMN. Each demultiplexer outputs either its first signal (X1, X2, ... XN) or 0, as selected by the corresponding bit Yi[1], Yi[2], ... Yi[N] of the second signal Yi. For example, the first demultiplexer DM1 outputs X1 or 0 according to the first bit Yi[1] of the second signal Yi, the second demultiplexer DM2 outputs X2 or 0 according to the second bit Yi[2], and the N-th demultiplexer DMN outputs XN or 0 according to the N-th bit Yi[N].

The shift operation unit 120_i includes a plurality of shift units SH1, SH2, ... SHN. Each shift unit outputs either a shifted version of the first signal Xi or 0, as selected by the corresponding bit Yi[1], Yi[2], ... Yi[N] of the second signal Yi. For example, the first shift unit SH1 outputs Xi shifted by 0 bits (Xi<<0) or 0 according to the first bit Yi[1]; the second shift unit SH2 outputs Xi shifted by 1 bit (Xi<<1) or 0 according to the second bit Yi[2]; and the N-th shift unit SHN outputs Xi shifted by (N-1) bits (Xi<<(N-1)) or 0 according to the N-th bit Yi[N].

Only one of the selection operation unit 110_i and the shift operation unit 120_i operates, according to the selection signal SFi. For example, when SFi is 0, the selection operation unit 110_i operates and the shift operation unit 120_i does not. In this case, the selection operation unit 110_i transfers the signals output from the demultiplexers DM1, DM2, ... DMN to the summer SUMi as the summer inputs Si_1, Si_2, ... Si_N, while the shift operation unit 120_i outputs high-impedance signals. Conversely, when SFi is 1, the selection operation unit 110_i does not operate and the shift operation unit 120_i does. In this case, the selection operation unit 110_i outputs high-impedance signals, and the shift operation unit 120_i transfers the signals output from the shift units SH1, SH2, ... SHN to the summer SUMi as the summer inputs Si_1, Si_2, ... Si_N.
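The structure of FIG. 2 can be sketched at the level of the summer inputs Si_1 ... Si_N. In this hypothetical model, `None` stands in for a high-impedance output, and the bus resolves to whichever unit is enabled by SFi (0 for the selection operation unit, 1 for the shift operation unit, as in the example above). The operand values are illustrative only.

```python
HIGH_Z = None  # stands in for a high-impedance (undriven) output


def selection_unit(X, y, enabled):
    """Demultiplexers DM1..DMN: Xj or 0 for each bit Yi[j]; all HIGH_Z when disabled."""
    if not enabled:
        return [HIGH_Z] * len(y)
    return [x if bit else 0 for x, bit in zip(X, y)]


def shift_unit(xi, y, enabled):
    """Shift units SH1..SHN: Xi << (j-1) or 0 for each bit Yi[j]; all HIGH_Z when disabled."""
    if not enabled:
        return [HIGH_Z] * len(y)
    return [(xi << shift) if bit else 0 for shift, bit in enumerate(y)]


def summer_inputs(i, X, y, sf):
    """Si_1..Si_N on the bus shared by both units; only the enabled unit drives it."""
    sel = selection_unit(X, y, enabled=(sf == 0))
    sh = shift_unit(X[i - 1], y, enabled=(sf == 1))
    return [a if a is not HIGH_Z else b for a, b in zip(sel, sh)]


X = [3, 5, 7, 11]   # hypothetical X1..X4
y = [0, 1, 1, 0]    # Y2 = 0110 written MSB-first, i.e. Y2[2] = Y2[3] = 1
print(summer_inputs(2, X, y, sf=0))  # [0, 5, 7, 0]   -> SUM2 gives 12 = X2 + X3
print(summer_inputs(2, X, y, sf=1))  # [0, 10, 20, 0] -> SUM2 gives 30 = X2 * 6
```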

Unlike the drawing, separate demultiplexers may be added to select, according to the selection signal SFi, between the outputs of the selection operation unit 110_i and those of the shift operation unit 120_i. For example, when SFi indicates the summing mode, the separate demultiplexers pass the outputs of the selection operation unit 110_i to the summer SUMi, and when SFi indicates the multiplication mode, they pass the outputs of the shift operation unit 120_i to the summer SUMi.

FIG. 3 is a diagram showing a parallel processing apparatus according to the second embodiment. Referring to FIG. 3, the parallel processing apparatus receives first to P-th inputs {X1, Y1}, ... {X(p-1), Y(p-1)}, {Xp, Yp}, {X(p+1), Y(p+1)}, ... {XP, YP} and outputs first to P-th outputs M1, ... M(p-1), Mp, M(p+1), ... MP. Here, P is a natural number of 4 or more (for example, P may be 1024), and p is a natural number from 1 to P. The first to P-th inputs consist of the first signals X1, ... X(p-1), Xp, X(p+1), ... XP and the second signals Y1, ... Y(p-1), Yp, Y(p+1), ... YP. The parallel processing apparatus includes a preprocessor 100A and a main processing unit 200A, and may further include a delay unit 300A and a selector 400A.

The preprocessor 100A includes a plurality of preprocessing units ... 150_(p-1), 150_p, 150_(p+1), .... These preprocessing units include selection operation units ... 110_(p-1), 110_p, 110_(p+1), ... and shift operation units ... 120_(p-1), 120_p, 120_(p+1), .... The preprocessing unit 150_p includes a selection operation unit 110_p and a shift operation unit 120_p.

The selection operation unit 110_p operates when the preprocessing unit 150_p operates in the summing mode. It transfers the first signal Xp corresponding to the preprocessing unit 150_p, together with the adjacent first signals (e.g., X(p-Q/2+1), ... X(p-1), X(p+1), ... X(p+Q/2)), to the summer SUMp according to the bits Yp[1], Yp[2], ... Yp[Q] of the second signal Yp. Here, Q is an even number of 4 or more (for example, Q may be 32), and q is a natural number from 1 to Q. The operation of the selection operation unit 110_p can be expressed, for example, in the following pseudo code.

[Equation 6]

(Yp[1] ? X(p-Q/2+1) : 0) => Sp_1,

(Yp[2] ? X(p-Q/2+2) : 0) => Sp_2,

......

(Yp[Q] ? X(p+Q/2) : 0) => Sp_Q

The shift operation unit 120_p operates when the preprocessing unit 150_p operates in the multiplication mode. It transfers the signals (Xp<<0), (Xp<<1), ... (Xp<<(Q-1)), obtained by shifting the first signal Xp by 0, 1, ... (Q-1) bits, to the summer SUMp according to the bits Yp[1], Yp[2], ... Yp[Q] of the second signal Yp. The operation of the shift operation unit 120_p can be expressed, for example, in the following pseudo code.

[Equation 7]

(Yp[1] ? (Xp << 0) : 0) => Sp_1,

(Yp[2] ? (Xp << 1) : 0) => Sp_2,

......

(Yp[Q] ? (Xp << (Q-1)) : 0) => Sp_Q

The preprocessing unit 150_p operates either the selection operation unit 110_p or the shift operation unit 120_p according to the operation mode selection signal SFp.

The main processing unit 200A includes summers ... SUM(p-1), SUMp, SUM(p+1), .... The p-th summer SUMp adds the transferred signals Sp_1, Sp_2, ... Sp_Q and outputs the result as the p-th output Mp. The operation of the main processing unit 200A can be expressed, for example, in the following pseudo code.

[Equation 8]

......

S(p-1)_1 + S(p-1)_2 + ... + S(p-1)_Q => M(p-1),

Sp_1 + Sp_2 + ... + Sp_Q => Mp,

S(p+1)_1 + S(p+1)_2 + ... + S(p+1)_Q => M(p+1),

......
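Equations 6 to 8 can be combined into one behavioral sketch of the p-th lane of the second embodiment. The text does not say what the window X(p-Q/2+1) ... X(p+Q/2) contains near the ends of X1 ... XP, so this hypothetical function simply rejects such windows instead of guessing; the operand values are illustrative only.

```python
def unit_output_windowed(p, X, y, sf, Q):
    """Output Mp of summer SUMp in the second embodiment (sketch).

    X holds X1..XP (0-indexed in Python); y = [Yp[1], ..., Yp[Q]].
    Summing mode (Eq. 6):        Sp_q = X(p - Q/2 + q) if Yp[q] else 0
    Multiplication mode (Eq. 7): Sp_q = Xp << (q - 1)  if Yp[q] else 0
    Eq. 8:                       Mp = Sp_1 + ... + Sp_Q
    """
    if Q % 2 or Q < 4 or len(y) != Q:
        raise ValueError("Q must be an even number >= 4 and y must have Q bits")
    if sf == 0:
        lo = p - Q // 2 + 1  # 1-indexed position of the first signal in the window
        if lo < 1 or p + Q // 2 > len(X):
            # Boundary handling is not specified in the text; this sketch rejects it.
            raise IndexError("window extends past X1..XP")
        s = [X[lo + q - 2] if y[q - 1] else 0 for q in range(1, Q + 1)]
    else:
        s = [(X[p - 1] << (q - 1)) if y[q - 1] else 0 for q in range(1, Q + 1)]
    return sum(s)


X = list(range(10, 18))  # hypothetical X1..X8 = 10..17, so P = 8
y = [1, 0, 1, 1]         # Yp[1], Yp[2], Yp[3], Yp[4]
print(unit_output_windowed(4, X, y, sf=0, Q=4))  # 41 = X3 + X5 + X6
print(unit_output_windowed(4, X, y, sf=1, Q=4))  # 169 = X4 * 0b1101 = 13 * 13
```

With Q much smaller than P (for example Q = 32 and P = 1024), each summer needs only Q inputs, while summing mode can still reach the Q first signals centered on lane p.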

The delay unit 300A delays the outputs ... M(p-1), Mp, M(p+1), ... according to the clock signal CLK and outputs them. To this end, the delay unit 300A includes a plurality of delay units ... DU(p-1), DUp, DU(p+1), .... The signals ... D(p-1), Dp, D(p+1), ... output from the delay unit 300A correspond to the outputs ... M(p-1), Mp, M(p+1), ..., respectively.

The selector 400A selects, according to the input control signals ... SI(p-1), SIp, SI(p+1), ..., between the signals ... R(p-1), Rp, R(p+1), ... transferred from a memory (not shown) and the signals ... D(p-1), Dp, D(p+1), ... output from the delay unit 300A, and outputs the selected signals as the first signals ... X(p-1), Xp, X(p+1), .... For example, the memory may include P banks connected to the P inputs {X1, Y1}, {X2, Y2}, ... {XP, YP}, respectively. Alternatively, the memory may include 2P banks, of which P banks are connected to the P first signals X1, X2, ... XP, respectively, and the remaining P banks are connected to the P second signals Y1, Y2, ... YP, respectively.
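The feedback path formed by the delay unit and the selector can be sketched over two clock cycles. Several points here are assumptions rather than statements from the text: the polarity SIp = 1 selecting the delayed output Dp, the class name `DelayUnits`, and the stand-in `compute`, which plays the role of the preprocessor and main processing unit with every lane in multiplication mode and Yp = 0010 (so Mp = Xp << 1).

```python
def selector(R, D, SI):
    """Selector sketch: Xp = Dp (fed-back output) if SIp else Rp (memory)."""
    return [d if s else r for r, d, s in zip(R, D, SI)]


class DelayUnits:
    """Delay units DU1..DUP sketch: latch the outputs M on a clock edge as D."""

    def __init__(self, width):
        self.D = [0] * width

    def clock(self, M):
        self.D = list(M)


def compute(X):
    """Hypothetical stand-in for preprocessor + main processing unit:
    every lane in multiplication mode with Yp = 0010, i.e. Mp = Xp << 1."""
    return [x << 1 for x in X]


delay = DelayUnits(4)

# Cycle 1: all first signals come from memory.
X = selector([1, 2, 3, 4], delay.D, SI=[0, 0, 0, 0])
delay.clock(compute(X))          # D becomes [2, 4, 6, 8]

# Cycle 2: lanes 1 and 3 reuse their previous outputs instead of memory data.
X = selector([10, 20, 30, 40], delay.D, SI=[1, 0, 1, 0])
print(X)           # [2, 20, 6, 40]
print(compute(X))  # [4, 40, 12, 80]
```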

FIG. 4 is a diagram showing a parallel processing apparatus according to the third embodiment. Referring to FIG. 4, the parallel processing apparatus receives first to N-th inputs {X1, Y1}, {X2, Y2}, ... {XN, YN} and outputs first to N-th outputs M1, M2, ... MN. Here, N is a natural number of 8 or more; for example, N may be 32. The first to N-th inputs consist of the first signals X1, X2, ... XN and the second signals Y1, Y2, ... YN. The parallel processing apparatus includes a preprocessor 100B and a main processing unit 200, and may further include a delay unit 300 and a selector 400.

The preprocessor 100B includes a plurality of preprocessing units 150B_1, 150B_2, ... 150B_N, which include selection operation units 110B_1, 110B_2, ... 110B_N and shift operation units 120B_1, 120B_2, ... 120B_N. The preprocessing unit 150B_i includes a selection operation unit 110B_i and a shift operation unit 120B_i.

The selection operation unit 110B_i operates when the preprocessing unit 150B_i operates in the summing mode. It transfers the first signals X1, X2, ... XN to the summer SUMi according to the bits Yi[1], Yi[2], ... Yi[N] of the second signal Yi, but first shifts some bits of the first signals X1, X2, ... XN according to the division selection signal DSi.

When the division selection signal DSi corresponds to one division (i.e., no division), the selection operation unit 110B_i transfers all bits of the first signals X1, X2, ... XN to the summer SUMi without shifting them.

When the first signals X1, X2, ... XN have N bits (N being an even number of 8 or more), the second signal Yi has N bits, and the division selection signal DSi corresponds to two divisions, the selection operation unit 110B_i shifts the most significant (N/2) bits of the first signals X1, X2, ... XN by (N/2) bits, leaves the least significant (N/2) bits unshifted (a 0-bit shift), and transfers the result to the summer SUMi.

When the first signals X1, X2, ... XN have N bits (N being 8 or more and a multiple of 4), the second signal Yi has N bits, and the division selection signal DSi corresponds to four divisions, the selection operation unit 110B_i shifts bits N to (N*3/4+1) of the first signals X1, X2, ... XN by (N*3/4) bits, bits (N*3/4) to (N*2/4+1) by (N*2/4) bits, and bits (N*2/4) to (N*1/4+1) by (N*1/4) bits, leaves bits (N*1/4) to 1 unshifted (a 0-bit shift), and transfers the result to the summer SUMi.

The operation of the selection operation unit 110B_i can be expressed, for example, in the following pseudo code.

[Equation 9]

(Yi[1] ? XX1 : 0) => Si_1,

(Yi[2] ? XX2 : 0) => Si_2,

......

(Yi[N] ? XXN : 0) => Si_N

In the above equation, XX1, XX2, ... XXN are determined by the first signals X1, X2, ... XN and the division selection signal DSi. Examples of XXi for N = 8 and N = 16 are shown in Tables 1 and 2, respectively.

[Table 1] (N = 8)

| Division selection | XXi |
|---|---|
| 1 division (no division) | {00000000, Xi[8:1]} |
| 2 divisions | {0000, Xi[8:5], 0000, Xi[4:1]} |
| 4 divisions | {00, Xi[8:7], 00, Xi[6:5], 00, Xi[4:3], 00, Xi[2:1]} |

[Table 2] (N = 16)

| Division selection | XXi |
|---|---|
| 1 division (no division) | {0000000000000000, Xi[16:1]} |
| 2 divisions | {00000000, Xi[16:9], 00000000, Xi[8:1]} |
| 4 divisions | {0000, Xi[16:13], 0000, Xi[12:9], 0000, Xi[8:5], 0000, Xi[4:1]} |

In the tables above, {0000, Xi[8:5], 0000, Xi[4:1]} denotes a 16-bit number whose most significant 4 bits are 0000, whose next 4 bits are Xi[8:5], whose next 4 bits are 0000, and whose least significant 4 bits are Xi[4:1]. A person of ordinary skill in the art can readily infer the operation of the selection operation unit 110B_i for 8, 16, or more divisions from the above description, so a description thereof is omitted for convenience.
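The construction of XXi in Tables 1 and 2 follows a single rule: Xi is cut into `divisions` chunks of w = N/divisions bits, and chunk k (k = 0 being the least significant) is shifted left by k*w, so that every chunk sits at the bottom of its own 2w-bit lane with w zero bits above it. The function name below is hypothetical; the sketch is checked against the table rows.

```python
def pack_xx(x, n, divisions):
    """Build the 2n-bit XXi from an n-bit Xi as in Tables 1 and 2 (sketch).

    Chunk k of w = n/divisions bits (k = 0 is the least significant) is
    shifted left by k*w, i.e. placed at bit 2*k*w of the result.
    """
    if n % divisions:
        raise ValueError("n must be a multiple of the number of divisions")
    w = n // divisions
    chunk_mask = (1 << w) - 1
    xx = 0
    for k in range(divisions):
        chunk = (x >> (k * w)) & chunk_mask
        xx |= chunk << (2 * k * w)
    return xx


print(hex(pack_xx(0xAB, 8, 2)))     # 0xa0b     -> {0000, 1010, 0000, 1011}
print(hex(pack_xx(0xE4, 8, 4)))     # 0x3210    -> {00, 11, 00, 10, 00, 01, 00, 00}
print(hex(pack_xx(0x1234, 16, 4)))  # 0x1020304 -> {0000, 0001, 0000, 0010, 0000, 0011, 0000, 0100}
```

The empty w bits above each chunk are what let the summer add several XXj values without the lanes carrying into one another.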

The shift operation unit 120B_i operates when the preprocessing unit 150B_i operates in the multiplication mode. It transfers the signals (Xi<<0), (Xi<<1), ... (Xi<<(N-1)), obtained by shifting the first signal Xi by 0, 1, ... (N-1) bits, to the summer SUMi according to the bits Yi[1], Yi[2], ... Yi[N] of the second signal Yi, but forces some bits of the shifted signals to 0 according to the division selection signal DSi.

When the division selection signal DSi corresponds to one division (i.e., no division), no bits of the shifted signals (Xi<<0), (Xi<<1), ... (Xi<<(N-1)) are set to 0.

When the first signal Xi has N bits (N being an even number of 8 or more), the second signal Yi has N bits, and the division selection signal DSi corresponds to two divisions, the shift operation unit 120B_i sets to 0 the most significant (N/2) bits of the signals (Xi<<0), (Xi<<1), ... (Xi<<(N/2-1)), in which Xi is shifted by 0 to (N/2-1) bits, and sets to 0 the least significant (N/2) bits of the signals (Xi<<(N/2)), (Xi<<(N/2+1)), ... (Xi<<(N-1)), in which Xi is shifted by (N/2) to (N-1) bits.

When the first signal Xi has N bits (N being 8 or more and a multiple of 4), the second signal Yi has N bits, and the division selection signal DSi corresponds to four divisions, the shift operation unit 120B_i sets to 0 the most significant (N*3/4) bits of the signals (Xi<<0), (Xi<<1), ... (Xi<<(N*1/4-1)), in which Xi is shifted by 0 to (N*1/4-1) bits; sets to 0 the most significant (N*2/4) bits and the least significant (N*1/4) bits of the signals (Xi<<(N*1/4)), (Xi<<(N*1/4+1)), ... (Xi<<(N*2/4-1)), in which Xi is shifted by (N*1/4) to (N*2/4-1) bits; sets to 0 the most significant (N*1/4) bits and the least significant (N*2/4) bits of the signals (Xi<<(N*2/4)), (Xi<<(N*2/4+1)), ... (Xi<<(N*3/4-1)), in which Xi is shifted by (N*2/4) to (N*3/4-1) bits; and sets to 0 the least significant (N*3/4) bits of the signals (Xi<<(N*3/4)), (Xi<<(N*3/4+1)), ... (Xi<<(N-1)), in which Xi is shifted by (N*3/4) to (N-1) bits.

The operation of the shift operation unit 120B_i can be expressed, for example, in the following pseudo code.

[Equation 10]

(Yi[1] ? ((Xi & DC1) << 0) : 0) => Si_1,

(Yi[2] ? ((Xi & DC2) << 1) : 0) => Si_2,

......

(Yi[N] ? ((Xi & DCN) << (N-1)) : 0) => Si_N

In the above equation, (Xi & DCj), for j = 1, ..., N, denotes a bitwise AND operation between Xi and DCj. The division constants DC1 to DCN are determined by the division selection signal DSi. Examples of the division constants DC1 to DCN for N = 8 and N = 16 are shown in Tables 3 and 4, respectively.

[Table 3] (N = 8)

| | 1 division (no division) | 2 divisions | 4 divisions |
|---|---|---|---|
| DC1 | 11111111 | 00001111 | 00000011 |
| DC2 | 11111111 | 00001111 | 00000011 |
| DC3 | 11111111 | 00001111 | 00001100 |
| DC4 | 11111111 | 00001111 | 00001100 |
| DC5 | 11111111 | 11110000 | 00110000 |
| DC6 | 11111111 | 11110000 | 00110000 |
| DC7 | 11111111 | 11110000 | 11000000 |
| DC8 | 11111111 | 11110000 | 11000000 |

[Table 4] (N = 16)

| | 1 division (no division) | 2 divisions | 4 divisions |
|---|---|---|---|
| DC1 | 1111111111111111 | 0000000011111111 | 0000000000001111 |
| DC2 | 1111111111111111 | 0000000011111111 | 0000000000001111 |
| DC3 | 1111111111111111 | 0000000011111111 | 0000000000001111 |
| DC4 | 1111111111111111 | 0000000011111111 | 0000000000001111 |
| DC5 | 1111111111111111 | 0000000011111111 | 0000000011110000 |
| DC6 | 1111111111111111 | 0000000011111111 | 0000000011110000 |
| DC7 | 1111111111111111 | 0000000011111111 | 0000000011110000 |
| DC8 | 1111111111111111 | 0000000011111111 | 0000000011110000 |
| DC9 | 1111111111111111 | 1111111100000000 | 0000111100000000 |
| DC10 | 1111111111111111 | 1111111100000000 | 0000111100000000 |
| DC11 | 1111111111111111 | 1111111100000000 | 0000111100000000 |
| DC12 | 1111111111111111 | 1111111100000000 | 0000111100000000 |
| DC13 | 1111111111111111 | 1111111100000000 | 1111000000000000 |
| DC14 | 1111111111111111 | 1111111100000000 | 1111000000000000 |
| DC15 | 1111111111111111 | 1111111100000000 | 1111000000000000 |
| DC16 | 1111111111111111 | 1111111100000000 | 1111000000000000 |
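Every row of Tables 3 and 4 fits one rule: the N shift units are split into `divisions` groups of w = N/divisions consecutive units, and DCj keeps only the w bits of Xi that belong to the same group as j. The sketch below generates the constants from that inferred rule (the function names are hypothetical) and evaluates Equation 10 followed by the summer. Under this reading, the output Mi holds `divisions` independent w-bit by w-bit products, each in its own 2w-bit lane; a w-bit product never exceeds 2w bits, so the lanes cannot overflow into each other.

```python
def division_constant(j, n, divisions):
    """DCj for shift unit j (1-indexed), matching Tables 3 and 4 (sketch)."""
    if n % divisions:
        raise ValueError("n must be a multiple of the number of divisions")
    w = n // divisions
    group = (j - 1) // w          # which group of w shift units j belongs to
    return ((1 << w) - 1) << (group * w)


def shift_unit_output(xi, yi, n, divisions):
    """Eq. 10 plus the summer: sum of ((Xi & DCj) << (j-1)) over the set bits j of Yi."""
    total = 0
    for j in range(1, n + 1):
        if (yi >> (j - 1)) & 1:
            total += (xi & division_constant(j, n, divisions)) << (j - 1)
    return total


print(hex(division_constant(3, 8, 4)))           # 0xc (00001100, Table 3)
print(hex(shift_unit_output(0xAB, 0x34, 8, 2)))  # 0x1e2c: lanes 0xA*0x3 = 0x1e, 0xB*0x4 = 0x2c
print(hex(shift_unit_output(0xE4, 0xE7, 8, 4)))  # 0x9410: lanes 3*3, 2*2, 1*1, 0*3
```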

A person of ordinary skill in the art can readily infer the operation of the shift operation unit 120B_i for 8, 16, or more divisions from the above description, so a description thereof is omitted for convenience. The preprocessing unit 150B_i operates either the selection operation unit 110B_i or the shift operation unit 120B_i according to the operation mode selection signal SFi. For example, if SFi = 0 indicates operation of the selection operation unit 110B_i and SFi = 1 indicates operation of the shift operation unit 120B_i, then SF1 = 0, SF2 = 0 and SFN = 1 mean that the first preprocessing unit 150B_1, the second preprocessing unit 150B_2 and the N-th preprocessing unit 150B_N operate the selection operation unit 110B_1, the selection operation unit 110B_2 and the shift operation unit 120B_N, respectively.

The operations of the main processing unit 200, the delay unit 300 and the selector 400 are the same as those described with reference to FIG. 1, and are therefore omitted for convenience.

With this configuration, the parallel processing apparatus can perform various operations with a single piece of hardware. For example, it can perform partial sum, displacement, multiplication and shift operations. These operations have already been described with reference to FIG. 1 and are omitted here for convenience.

This configuration also allows the parallel processing apparatus to perform multiplication and addition operations on inputs having various numbers of bits. For example, suppose that the preprocessing unit 150B_i operates in the summing mode, that the 8-bit first signals X1, X2, ... X8 are {a1_8, a1_7, a1_6, a1_5, a1_4, a1_3, a1_2, a1_1}, {a2_8, a2_7, a2_6, a2_5, a2_4, a2_3, a2_2, a2_1}, ... {a8_8, a8_7, a8_6, a8_5, a8_4, a8_3, a8_2, a8_1}, and that the 8-bit second signal Yi is {b8, b7, b6, b5, b4, b3, b2, b1}. If the division selection signal DSi corresponds to one division, then according to Equation 9 and Table 1 the output Mi[16:1] has the value b1*{a1_8, a1_7, a1_6, a1_5, a1_4, a1_3, a1_2, a1_1} + b2*{a2_8, a2_7, a2_6, a2_5, a2_4, a2_3, a2_2, a2_1} + ... + b8*{a8_8, a8_7, a8_6, a8_5, a8_4, a8_3, a8_2, a8_1}.

If the division selection signal DSi corresponds to division into 2, then according to Equation 9 and Table 1 the output Mi[16:9] has the value b1*{a1_8, a1_7, a1_6, a1_5} + b2*{a2_8, a2_7, a2_6, a2_5} + ... + b8*{a8_8, a8_7, a8_6, a8_5}, and the output Mi[8:1] has the value b1*{a1_4, a1_3, a1_2, a1_1} + b2*{a2_4, a2_3, a2_2, a2_1} + ... + b8*{a8_4, a8_3, a8_2, a8_1}. In general, if the first signals X1, X2, ... XN are N bits wide (N is an even number of 8 or more), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 2, then the most significant N bits of the output Mi of the summer SUMi correspond to the sum of the most significant (N/2) bits of the first signals selected from X1, X2, ... XN by the second signal Yi, and the least significant N bits of the output Mi correspond to the sum of the least significant (N/2) bits of the same selected first signals. Compared with a conventional approach (e.g., using a summer that accepts eight 8-bit first signals to add eight 4-bit first signals, i.e., receiving {0000, a1_4, a1_3, a1_2, a1_1}, {0000, a2_4, a2_3, a2_2, a2_1}, ... {0000, a8_4, a8_3, a8_2, a8_1} as the first signals X1, X2, ... XN and {b8, b7, b6, b5, b4, b3, b2, b1} as the second signal Yi), the parallel processing device of this embodiment increases the hardware utilization of the summer SUMi and increases the number of effective bits of the first signal Xi, the second signal Yi, and the output Mi.
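
A hypothetical Python sketch of the two-way split for N = 8, assuming LSB-first bit numbering: each selected 8-bit first signal has its upper nibble moved up by N/2 = 4 positions, and one ordinary 16-bit addition then produces both lane sums at once. Because eight 4-bit values sum to at most 120, which fits in an 8-bit lane, no carry crosses from Mi[8:1] into Mi[16:9].

```python
N = 8                 # width of each first signal (example value from the text)
HALF = N // 2         # width of each field after a two-way split
LANE = (1 << N) - 1   # mask for one N-bit output lane

def pack_2div(x):
    """Upper N/2 bits shifted up by N/2; lower N/2 bits left in place."""
    hi = x >> HALF
    lo = x & ((1 << HALF) - 1)
    return (hi << N) | lo

def sum_mode_2div(xs, y):
    """Return (Mi[16:9], Mi[8:1]) for the first signals selected by the bits of y."""
    total = 0
    for j, x in enumerate(xs):
        if (y >> j) & 1:
            total += pack_2div(x)   # a single 2N-bit addition serves both lanes
    return (total >> N) & LANE, total & LANE
```

For example, `pack_2div(0xAB)` places 0xA in bits 8 to 11 and 0xB in bits 0 to 3.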

If the division selection signal DSi corresponds to division into 4, then according to Equation 9 and Table 1 the output Mi[16:13] has the value b1*{a1_8, a1_7} + b2*{a2_8, a2_7} + ... + b8*{a8_8, a8_7}, the output Mi[12:9] has the value b1*{a1_6, a1_5} + b2*{a2_6, a2_5} + ... + b8*{a8_6, a8_5}, the output Mi[8:5] has the value b1*{a1_4, a1_3} + b2*{a2_4, a2_3} + ... + b8*{a8_4, a8_3}, and the output Mi[4:1] has the value b1*{a1_2, a1_1} + b2*{a2_2, a2_1} + ... + b8*{a8_2, a8_1}. In general, if the first signals X1, X2, ... XN are N bits wide (N is 8 or more and a multiple of 4), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 4, then, for the first signals selected from X1, X2, ... XN by the second signal Yi, bits 2N to (N*3/2+1) of the output Mi of the summer SUMi correspond to the sum of bits N to (N*3/4+1) of the selected first signals; bits (N*3/2) to (N*2/2+1) of Mi correspond to the sum of bits (N*3/4) to (N*2/4+1); bits (N*2/2) to (N*1/2+1) of Mi correspond to the sum of bits (N*2/4) to (N*1/4+1); and bits (N*1/2) to 1 of Mi correspond to the sum of bits (N*1/4) to 1. Compared with a conventional approach (e.g., using a summer that accepts eight 8-bit first signals to add eight 2-bit first signals, i.e., receiving {000000, a1_2, a1_1}, {000000, a2_2, a2_1}, ... {000000, a8_2, a8_1} as the first signals X1, X2, ... XN and {b8, b7, b6, b5, b4, b3, b2, b1} as the second signal Yi), the parallel processing device of this embodiment increases the hardware utilization of the summer SUMi and increases the number of effective bits of the first signal Xi, the second signal Yi, and the output Mi. However, when the division selection signal DSi corresponds to division into 4 or more (e.g., 8 or 16 divisions), an overflow may occur: for example, b1*{a1_2, a1_1} + b2*{a2_2, a2_1} + ... + b8*{a8_2, a8_1} may exceed 4 bits and thereby affect not only the output Mi[4:1] but also the output Mi[8:5]. The number of 1s in the second signal Yi must therefore be managed so that it does not exceed a predetermined number.
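
The text leaves the "predetermined number" unspecified. One way to derive a safe bound, shown here as an illustrative assumption rather than the patent's rule, is to require that the worst-case lane sum (every selected field at its maximum value) still fits in its 2N/D-bit output lane. The helper name `max_ones` is hypothetical.

```python
def max_ones(n_bits, divisions):
    """Largest popcount of Yi that cannot overflow a lane (illustrative bound).

    Each first signal is split into `divisions` fields of n_bits/divisions bits;
    the sum of each field lands in an output lane twice as wide.
    """
    field_bits = n_bits // divisions
    lane_bits = 2 * field_bits
    worst_field = (1 << field_bits) - 1    # largest value a field can hold
    lane_capacity = (1 << lane_bits) - 1   # largest value a lane can hold
    return min(n_bits, lane_capacity // worst_field)

for d in (1, 2, 4, 8):
    print(f"N=8, {d}-way split: at most {max_ones(8, d)} ones in Yi")
```

For N = 8 this bound imposes no restriction for one or two divisions, but limits Yi to five 1s for a four-way split (5 * 3 = 15 still fits in a 4-bit lane) and three 1s for an eight-way split, consistent with the remark that overflow becomes a concern from four divisions upward.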

As another example, assume that the pre-processing unit 150B_i operates in the multiplication mode, that the 8-bit first signal Xi is {a8, a7, a6, a5, a4, a3, a2, a1}, and that the 8-bit second signal Yi is {b8, b7, b6, b5, b4, b3, b2, b1}. If the division selection signal DSi corresponds to division into 1, then according to Equation 10 and Table 3 the output Mi[16:1] has the value {a8, a7, a6, a5, a4, a3, a2, a1}*{b8, b7, b6, b5, b4, b3, b2, b1}. In this case, 100% of the hardware (e.g., adders) constituting the summer SUMi is utilized.
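
Multiplication mode without division is ordinary shift-and-add multiplication. A hypothetical Python model (the name `mul_mode_1div` and LSB-first bit ordering are assumptions) in which the shift unit for bit b_j contributes Xi shifted left by (j-1) bits when b_j is 1, and 0 otherwise:

```python
def mul_mode_1div(x, y, n=8):
    """Shift-and-add model: SUMi adds x << j for every set bit j of y (j from 0)."""
    total = 0
    for j in range(n):
        if (y >> j) & 1:       # the shift unit passes the shifted signal ...
            total += x << j    # ... otherwise it passes 0
    return total               # fits in 2n bits, i.e. Mi[2n:1]
```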

If the division selection signal DSi corresponds to division into 2, then according to Equation 10 and Table 3 the output Mi[16:9] has the value {a8, a7, a6, a5}*{b8, b7, b6, b5}, and the output Mi[8:1] has the value {a4, a3, a2, a1}*{b4, b3, b2, b1}. In general, if the first signal Xi is N bits wide (N is an even number of 8 or more), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 2, then the most significant N bits of the output Mi of the summer SUMi correspond to the product of the most significant (N/2) bits of the first signal Xi and the most significant (N/2) bits of the second signal Yi, and the least significant N bits of the output Mi correspond to the product of the least significant (N/2) bits of Xi and the least significant (N/2) bits of Yi. In this case, 50% of the hardware constituting the summer SUMi is utilized, and 100% of the effective bits of the first signal Xi, the second signal Yi, and the output Mi are utilized. In contrast, with a conventional approach (e.g., performing a 4-bit multiplication on an 8-bit multiplier, i.e., inputting {0000, a4, a3, a2, a1} as the first signal Xi and {0000, b4, b3, b2, b1} as the second signal Yi), 25% of the hardware constituting the summer SUMi is utilized, and 50% of the effective bits of Xi, Yi, and Mi are utilized. Therefore, with division into 2, the parallel processing device of this embodiment can double the hardware utilization of the summer SUMi compared with the prior art, and can also double the number of effective bits of the first signal Xi, the second signal Yi, and the output Mi.
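
The same shift-and-add loop yields two independent 4-bit products once Xi is masked per shift amount. In this hypothetical Python sketch (N = 8, LSB-first bit numbering, helper name assumed), shifts 0 to N/2-1 see only the low half of Xi and shifts N/2 to N-1 see only the high half:

```python
N = 8                       # width of Xi and Yi (example value from the text)
HALF = N // 2
LO_MASK = (1 << HALF) - 1   # keeps the least significant N/2 bits of Xi
HI_MASK = LO_MASK << HALF   # keeps the most significant N/2 bits of Xi

def mul_mode_2div(x, y):
    """Return (Mi[16:9], Mi[8:1]) = (hi(x) * hi(y), lo(x) * lo(y))."""
    total = 0
    for j in range(N):
        if not (y >> j) & 1:
            continue                      # the shift unit outputs 0 for this bit
        # Shifts 0..N/2-1 see only the low half of Xi, shifts N/2..N-1 only the high half.
        masked = x & (LO_MASK if j < HALF else HI_MASK)
        total += masked << j
    return total >> N, total & ((1 << N) - 1)
```

Since each 4-bit product is at most 15 * 15 = 225, the low product never carries into Mi[16:9], so the two results stay independent.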

If the division selection signal DSi corresponds to division into 4, then according to Equation 10 and Table 3 the output Mi[16:13] has the value {a8, a7}*{b8, b7}, the output Mi[12:9] has the value {a6, a5}*{b6, b5}, the output Mi[8:5] has the value {a4, a3}*{b4, b3}, and the output Mi[4:1] has the value {a2, a1}*{b2, b1}. In general, if the first signal Xi is N bits wide (N is 8 or more and a multiple of 4), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 4, then bits 2N to (N*3/2+1) of the output Mi of the summer SUMi correspond to the product of bits N to (N*3/4+1) of the first signal Xi and bits N to (N*3/4+1) of the second signal Yi; bits (N*3/2) to (N*2/2+1) of Mi correspond to the product of bits (N*3/4) to (N*2/4+1) of Xi and the same bits of Yi; bits (N*2/2) to (N*1/2+1) of Mi correspond to the product of bits (N*2/4) to (N*1/4+1) of Xi and the same bits of Yi; and bits (N*1/2) to 1 of Mi correspond to the product of bits (N*1/4) to 1 of Xi and the same bits of Yi. In this case, 25% of the hardware constituting the summer SUMi is utilized, and 100% of the effective bits of the first signal Xi, the second signal Yi, and the output Mi are utilized. In contrast, with a conventional approach (e.g., performing a 2-bit multiplication on an 8-bit multiplier, i.e., inputting {000000, a2, a1} as the first signal Xi and {000000, b2, b1} as the second signal Yi), 6.25% of the hardware constituting the summer SUMi is utilized, and 25% of the effective bits of Xi, Yi, and Mi are utilized. Therefore, with division into 4, the parallel processing device of this embodiment can increase the hardware utilization of the summer SUMi fourfold compared with the prior art, and can also increase the number of effective bits of the first signal Xi, the second signal Yi, and the output Mi fourfold.
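
The utilization percentages above can be reproduced by counting, as an assumption about how they were derived, the share of the N x N partial-product array that carries useful data: a D-way split computes D products of (N/D) x (N/D) bits, while the zero-padded conventional approach computes only one such product. A small Python check using exact fractions (the helper `utilization` is hypothetical):

```python
from fractions import Fraction

def utilization(n_bits, divisions):
    """Return (this device, conventional) shares of an n x n partial-product array."""
    w = n_bits // divisions                     # width of each sub-operand
    full = n_bits * n_bits
    split = Fraction(divisions * w * w, full)   # D independent w x w products
    padded = Fraction(w * w, full)              # one w x w product, the rest is zero
    return split, padded

for d in (1, 2, 4):
    split, padded = utilization(8, d)
    print(f"{d}-way: {float(split):.2%} vs {float(padded):.2%}")
```

Under this counting, the ratio of the two shares is exactly D, matching the twofold and fourfold improvements stated above.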

In this way, in the parallel processing device according to the third embodiment, N independent operations can be performed simultaneously, each of the N operations can be changed independently at every cycle (i.e., over time), and operations can be performed on inputs of various bit widths. This can maximize the efficiency of the parallel processing device.

FIG. 5 is a diagram for explaining an example of the i-th pre-processing unit of the third embodiment. Referring to FIG. 5, the pre-processing unit includes a selection operation unit 110B_i and a shift operation unit 120B_i.

The selection operation unit 110B_i includes a plurality of first conversion units T1_1, T1_2, ... T1_N and a plurality of demultiplexers DM1, DM2, ... DMN. The first conversion units T1_1, T1_2, ... T1_N shift some bits of the first signals X1, X2, ... XN according to the division selection signal DSi. If the division selection signal DSi corresponds to division into 1 (i.e., no division), the first conversion units T1_1, T1_2, ... T1_N output all bits of the first signals X1, X2, ... XN without shifting them.

If the first signals X1, X2, ... XN are N bits wide (N is an even number of 8 or more), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 2, the first conversion units T1_1, T1_2, ... T1_N shift the most significant (N/2) bits of the first signals X1, X2, ... XN by (N/2) bits and shift the least significant (N/2) bits by 0 bits.

If the first signals X1, X2, ... XN are N bits wide (N is 8 or more and a multiple of 4), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 4, the first conversion units T1_1, T1_2, ... T1_N shift bits N to (N*3/4+1) of the first signals X1, X2, ... XN by (N*3/4) bits, bits (N*3/4) to (N*2/4+1) by (N*2/4) bits, bits (N*2/4) to (N*1/4+1) by (N*1/4) bits, and bits (N*1/4) to 1 by 0 bits.
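
A hypothetical Python model of the first conversion units for a general D-way split (D = 1, 2 or 4 in the text; the name `spread_fields` and LSB-first bit numbering are assumptions): field k of an N-bit first signal, counted from the least significant end, is shifted left by k*N/D bits, so it lands at the bottom of its 2N/D-bit output lane. Adding such values for the first signals selected by Yi then gives the lane sums described earlier, provided no lane overflows.

```python
def spread_fields(x, n_bits, divisions):
    """Shift field k of x (n_bits/divisions wide) left by k * n_bits/divisions bits."""
    w = n_bits // divisions
    field_mask = (1 << w) - 1
    out = 0
    for k in range(divisions):
        field = (x >> (k * w)) & field_mask   # field k, originally at offset k*w
        out |= field << (2 * k * w)           # after a k*w shift it sits at offset 2*k*w
    return out
```

With D = 4 and N = 8, for instance, the top two bits (shifted by N*3/4 = 6) end up at bits 12 and 13, the bottom of lane Mi[16:13].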

The demultiplexers DM1, DM2, ... DMN each output a signal selected, according to the corresponding bit Yi[1], Yi[2], ... Yi[N] of the second signal Yi, from between the output of the corresponding first conversion unit T1_1, T1_2, ... T1_N and 0.

The shift operation unit 120B_i includes a plurality of second conversion units T2_1, T2_2, ... T2_N and a plurality of shift units SH1, SH2, ... SHN. The second conversion units T2_1, T2_2, ... T2_N set some bits of the first signal Xi to 0 according to the division selection signal DSi.

If the division selection signal DSi corresponds to division into 1 (i.e., no division), the second conversion units T2_1, T2_2, ... T2_N do not set any bits of the first signal Xi to 0.

If the first signal Xi is N bits wide (N is an even number of 8 or more), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 2, the second conversion units T2_1, ... T2_(N/2) force the most significant (N/2) bits of the first signal Xi to 0, and the second conversion units T2_(N/2+1), ... T2_N force the least significant (N/2) bits of the first signal Xi to 0.

If the first signal Xi is N bits wide (N is 8 or more and a multiple of 4), the second signal Yi is N bits wide, and the division selection signal DSi corresponds to division into 4, the second conversion units T2_1, ... T2_(N*1/4) force the most significant (N*3/4) bits of Xi to 0; the second conversion units T2_(N*1/4+1), ... T2_(N*2/4) force the most significant (N*2/4) bits and the least significant (N*1/4) bits of Xi to 0; the second conversion units T2_(N*2/4+1), ... T2_(N*3/4) force the most significant (N*1/4) bits and the least significant (N*2/4) bits of Xi to 0; and the second conversion units T2_(N*3/4+1), ... T2_N force the least significant (N*3/4) bits of Xi to 0.
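
This masking pattern generalizes to any D-way split. In the hypothetical Python sketch below (helper names and LSB-first bit numbering are assumptions), the second conversion unit that feeds shift amount j keeps only the field of Xi whose index matches the group that j belongs to, and the shift units plus the summer then produce D independent products, one per 2N/D-bit lane:

```python
def t2_mask(shift, n_bits, divisions):
    """Bits of Xi that survive the second conversion unit feeding `shift` (0-based)."""
    w = n_bits // divisions
    group = shift // w                    # which Yi field this shift amount belongs to
    return ((1 << w) - 1) << (group * w)  # keep the matching field of Xi, zero the rest

def mul_mode(x, y, n_bits, divisions):
    """Return the D lane products of Mi, least significant lane first."""
    total = 0
    for j in range(n_bits):
        if (y >> j) & 1:   # shift unit SH_(j+1) selects the masked, shifted Xi
            total += (x & t2_mask(j, n_bits, divisions)) << j
    lane = 2 * n_bits // divisions
    return [(total >> (k * lane)) & ((1 << lane) - 1) for k in range(divisions)]
```

For D = 4 each lane holds a 2-bit by 2-bit product (at most 9), so the lanes never interfere in multiplication mode.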

The shift units SH1, SH2, ... SHN each output a signal selected, according to the corresponding bit Yi[1], Yi[2], ... Yi[N] of the second signal Yi, from between 0 and the shifted output of the corresponding second conversion unit T2_1, T2_2, ... T2_N.

Either the selection operation unit 110B_i or the shift operation unit 120B_i operates, depending on the selection signal SFi.

FIG. 6 is a diagram showing a parallel processing device according to the fourth embodiment. Referring to FIG. 6, the parallel processing device receives first to P-th inputs {X1, Y1}, ... {X(p-1), Y(p-1)}, {Xp, Yp}, {X(p+1), Y(p+1)}, ... {XP, YP} and outputs first to P-th outputs M1, ... M(p-1), Mp, M(p+1), ... MP. Here, P is a natural number of 4 or more; for example, P may be 1024. In addition, p is a natural number from 1 to P. The first to P-th inputs consist of the first signals X1, ... X(p-1), Xp, X(p+1), ... XP and the second signals Y1, ... Y(p-1), Yp, Y(p+1), ... YP. The parallel processing device includes a pre-processing part 100C and a main processing part 200A, and may further include a delay unit 300A and a selection unit 400A.

The pre-processing part 100C includes a plurality of pre-processing units ... 150C_(p-1), 150C_p, 150C_(p+1), .... These pre-processing units include selection operation units ... 110C_(p-1), 110C_p, 110C_(p+1), ... and shift operation units ... 120C_(p-1), 120C_p, 120C_(p+1), .... For example, the pre-processing unit 150C_p includes the selection operation unit 110C_p and the shift operation unit 120C_p.

The selection operation unit 110C_p operates when the pre-processing unit 150C_p operates in the summing mode. The selection operation unit 110C_p transfers the first signal Xp corresponding to the pre-processing unit 150C_p and the first signals adjacent to it (e.g., X(p-Q/2+1), ... X(p-1), X(p+1), ... X(p+Q/2)) to the summer SUMp according to the bits Yp[1], Yp[2], ... Yp[Q] of the second signal Yp, shifting some bits of these first signals according to the division selection signal DSi before transferring them. Here, Q is an even number of 4 or more; for example, Q may be 32. In addition, q is a natural number from 1 to Q.
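
The neighbor window can be illustrated with a hypothetical Python sketch for the no-division case. The mapping of Yp[q] to X(p-Q/2+q), and the treatment of indices that fall outside 1..P as zero inputs, are assumptions; the text above specifies neither, and the helper names are illustrative only.

```python
def window_indices(p, total_p, q_width):
    """1-based indices X(p-Q/2+1) .. X(p+Q/2); None where an index leaves 1..P."""
    start = p - q_width // 2 + 1
    return [i if 1 <= i <= total_p else None for i in range(start, start + q_width)]

def select_window(xs, p, y, q_width):
    """Sum of the window members whose Yp bit is 1 (Yp[q] taken as bit q-1 of y)."""
    total = 0
    for q, idx in enumerate(window_indices(p, len(xs), q_width)):
        if idx is not None and (y >> q) & 1:
            total += xs[idx - 1]   # xs[0] holds X1
    return total
```

With Q = 32 and P = 1024 as in the examples above, each unit p would see X(p-15) through X(p+16).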

The shift operation unit 120C_p operates when the pre-processing unit 150C_p operates in the multiplication mode. The shift operation unit 120C_p transfers the signals (Xp<<0), (Xp<<1), ... (Xp<<(Q-1)), obtained by shifting the first signal Xp by 0, 1, ... (Q-1) bits, to the summer SUMp according to the bits Yp[1], Yp[2], ... Yp[Q] of the second signal Yp, and forces some bits of these shifted signals to 0 according to the division selection signal DSi.

The pre-processing unit 150C_p operates either the selection operation unit 110C_p or the shift operation unit 120C_p according to the operation mode selection signal SFp.

Those of ordinary skill in the art can readily infer the detailed operations of the selection operation unit 110C_p and the shift operation unit 120C_p from the second and third embodiments, so a detailed description is omitted. The operations of the main processing part 200A, the delay unit 300A, and the selection unit 400A are the same as those described with reference to FIG. 3, so their description is also omitted.

Claims

A parallel processing device comprising: a pre-processing part including pre-processing units and receiving first signals and second signals; and
a main processing part including summers,
wherein each of the pre-processing units includes a shift operation unit, and
the shift operation unit transfers signals obtained by shifting a corresponding first signal among the first signals to a corresponding summer among the summers according to bits of a corresponding second signal among the second signals, and controls some bits of the shifted signals to be 0 according to a division selection signal.

The parallel processing device of claim 1, wherein
when the corresponding first signal has N bits (N is an even number of 8 or more), the corresponding second signal has N bits, and the division selection signal corresponds to division into 2, the shift operation unit controls the most significant (N/2) bits of the signals obtained by shifting the corresponding first signal by 0 to (N/2-1) bits to be 0, and controls the least significant (N/2) bits of the signals obtained by shifting the corresponding first signal by (N/2) to (N-1) bits to be 0.

The parallel processing device of claim 2, wherein
when the corresponding first signal has N bits (N is an even number of 8 or more), the corresponding second signal has N bits, and the division selection signal corresponds to division into 2, the most significant N bits of the output of the corresponding summer correspond to the product of the most significant (N/2) bits of the corresponding first signal and the most significant (N/2) bits of the corresponding second signal, and the least significant N bits of the output of the corresponding summer correspond to the product of the least significant (N/2) bits of the corresponding first signal and the least significant (N/2) bits of the corresponding second signal.

The parallel processing device of claim 1, wherein
when the corresponding first signal has N bits (N is 8 or more and a multiple of 4), the corresponding second signal has N bits, and the division selection signal corresponds to division into 4, the shift operation unit controls the most significant (N*3/4) bits of the signals obtained by shifting the corresponding first signal by 0 to (N*1/4-1) bits to be 0; controls the most significant (N*2/4) bits and the least significant (N*1/4) bits of the signals obtained by shifting the corresponding first signal by (N*1/4) to (N*2/4-1) bits to be 0; controls the most significant (N*1/4) bits and the least significant (N*2/4) bits of the signals obtained by shifting the corresponding first signal by (N*2/4) to (N*3/4-1) bits to be 0; and controls the least significant (N*3/4) bits of the signals obtained by shifting the corresponding first signal by (N*3/4) to (N-1) bits to be 0.

The parallel processing device of claim 4, wherein
when the corresponding first signal has N bits (N is 8 or more and a multiple of 4), the corresponding second signal has N bits, and the division selection signal corresponds to division into 4, bits 2N to (N*3/2+1) of the output of the corresponding summer correspond to the product of bits N to (N*3/4+1) of the corresponding first signal and bits N to (N*3/4+1) of the corresponding second signal; bits (N*3/2) to (N*2/2+1) of the output correspond to the product of bits (N*3/4) to (N*2/4+1) of the corresponding first signal and bits (N*3/4) to (N*2/4+1) of the corresponding second signal; bits (N*2/2) to (N*1/2+1) of the output correspond to the product of bits (N*2/4) to (N*1/4+1) of the corresponding first signal and bits (N*2/4) to (N*1/4+1) of the corresponding second signal; and bits (N*1/2) to 1 of the output correspond to the product of bits (N*1/4) to 1 of the corresponding first signal and bits (N*1/4) to 1 of the corresponding second signal.

The parallel processing device of claim 1, wherein
the shift operation unit includes conversion units and shift units,
the conversion units set some bits of the corresponding first signal to 0 according to the division selection signal, and
the shift units output signals selected, according to bits of the corresponding second signal, from among 0 and signals obtained by shifting the outputs of the conversion units.

The parallel processing device of claim 1, wherein
each of the pre-processing units further includes a selection operation unit,
the selection operation unit operates when a corresponding pre-processing unit among the pre-processing units operates in a summing mode, and transfers the first signals to the corresponding summer according to bits of the corresponding second signal, shifting some bits of the first signals according to the division selection signal before transferring them to the corresponding summer, and
the shift operation unit operates when the corresponding pre-processing unit operates in a multiplication mode.

The parallel processing device of claim 7, wherein
when the first signals have N bits (N is an even number of 8 or more), the corresponding second signal has N bits, and the division selection signal corresponds to division into 2, the selection operation unit shifts the most significant (N/2) bits of the first signals by (N/2) bits and the least significant (N/2) bits of the first signals by 0 bits, and transfers them to the corresponding summer.

9. The parallel processing device of claim 8, wherein
when the number of bits of the first signals is N (N is an even number greater than or equal to 8), the number of bits of the corresponding second signal is N, and the division selection signal is a signal corresponding to division by 2, the most significant N bits of the output of the corresponding summer correspond to the sum of the most significant (N/2) bits of the first signals selected, according to the corresponding second signal, from among the first signals, and the least significant N bits of the output of the corresponding summer correspond to the sum of the least significant (N/2) bits of the selected first signals.
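The result stated above can be illustrated with a sketch in which bit i of the second signal selects first signal i, each selected value is spread into two N-bit fields, and one ordinary addition produces both half-sums at once. The selection-by-index convention and the function name are assumptions made for illustration.

```python
def packed_sum_halves(first_signals: list[int], y: int, n: int) -> tuple[int, int]:
    """Summation mode with division by 2 (illustrative sketch).

    Bit i of the second signal y selects first_signals[i]. Each selected
    n-bit value is split into halves placed in separate n-bit fields (high
    half shifted up by n/2, low half by 0), and one addition sums them all.
    Returns (most significant n bits, least significant n bits) of the sum.
    """
    h = n // 2
    half_mask = (1 << h) - 1
    total = 0
    for i, x in enumerate(first_signals):
        if (y >> i) & 1:
            total += ((x & (half_mask << h)) << h) | (x & half_mask)
    return total >> n, total & ((1 << n) - 1)


signals = [0xB6, 0x3F, 0xFF]
print(packed_sum_halves(signals, 0b101, 8))  # (26, 21): 0xB + 0xF and 0x6 + 0xF
```

Assuming at most N selected signals (one per bit of the second signal), the low field's sum is at most N(2^(N/2) - 1), which is below 2^N for every even N of 8 or more, so no carry crosses from the low field into the high field.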

10. The parallel processing device of claim 7, wherein
when the number of bits of the first signals is N (N is greater than or equal to 8 and is a multiple of 4), the number of bits of the corresponding second signal is N, and the division selection signal is a signal corresponding to division by 4, the selection operation unit shifts bits N to (N*3/4+1) of the first signals by (N*3/4) bits, shifts bits (N*3/4) to (N*2/4+1) of the first signals by (N*2/4) bits, shifts bits (N*2/4) to (N*1/4+1) of the first signals by (N*1/4) bits, and shifts bits (N*1/4) to 1 of the first signals by 0 bits, and transfers them to the corresponding summer.
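The four shift amounts above follow a single rule. The derivation below states it for a general division factor k, under the assumption that m = N/k is an integer and that x_s denotes segment s of a first signal x, i.e. bits (s+1)m down to sm+1 counted from 1 at the least significant end; the superscript (i) indexes the selected first signals.

```latex
\begin{aligned}
x &= \sum_{s=0}^{k-1} x_s \, 2^{sm}, \qquad 0 \le x_s < 2^{m}, \quad m = N/k \\
\tilde{x} &= \sum_{s=0}^{k-1} \bigl(x_s \, 2^{sm}\bigr)\, 2^{sm}
          = \sum_{s=0}^{k-1} x_s \, 2^{2sm}
  \qquad \text{(segment } s \text{ shifted by } sm \text{ bits)} \\
k = 4:\quad sm &\in \{0,\ N/4,\ N/2,\ 3N/4\} \quad \text{for } s = 0, 1, 2, 3 \\
\sum_{i} \tilde{x}^{(i)} &= \sum_{s=0}^{k-1} \Bigl(\sum_{i} x^{(i)}_s\Bigr) 2^{2sm}
\end{aligned}
```

Field s of the summer output therefore spans bits 2(s+1)m down to 2sm+1; for k = 4 and s = 3 that is bits 2N to (N*3/2+1), and each field holds its lane sum exactly as long as that sum is below 2^(2m).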

11. The parallel processing device of claim 10, wherein
when the number of bits of the first signals is N (N is greater than or equal to 8 and is a multiple of 4), the number of bits of the corresponding second signal is N, and the division selection signal is a signal corresponding to division by 4: bits 2N to (N*3/2+1) of the output of the corresponding summer correspond to the sum of bits N to (N*3/4+1) of the first signals selected, according to the corresponding second signal, from among the first signals; bits (N*3/2) to (N*2/2+1) of the output of the corresponding summer correspond to the sum of bits (N*3/4) to (N*2/4+1) of the selected first signals; bits (N*2/2) to (N*1/2+1) of the output of the corresponding summer correspond to the sum of bits (N*2/4) to (N*1/4+1) of the selected first signals; and bits (N*1/2) to 1 of the output of the corresponding summer correspond to the sum of bits (N*1/4) to 1 of the selected first signals.
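A general-k version of the summation mode is sketched below to show how the four output fields described above can be read back. As before, selection of first signal i by bit i of the second signal is an assumption, and the function name is hypothetical.

```python
def packed_sum(first_signals: list[int], y: int, n: int, k: int) -> list[int]:
    """Summation mode with division by k (illustrative sketch).

    Bit i of y selects first_signals[i]; segment s (m = n // k bits) of each
    selected value is shifted left by s*m bits so that it lands in field s
    (2m bits wide) of a 2n-bit word, and a single addition sums everything.
    Returns the k fields, least significant first. A field is only a correct
    lane sum while that sum stays below 2**(2*m); a larger sum carries into
    the next field.
    """
    if n % k:
        raise ValueError("n must be a multiple of k")
    m = n // k
    seg_mask = (1 << m) - 1
    total = 0
    for i, x in enumerate(first_signals):
        if (y >> i) & 1:
            for s in range(k):
                total += (x & (seg_mask << (s * m))) << (s * m)
    field_mask = (1 << (2 * m)) - 1
    return [(total >> (2 * m * s)) & field_mask for s in range(k)]


# N = 16, division by 4: four 4-bit lanes summed into four 8-bit fields.
print(packed_sum([0x1234, 0xFFFF, 0x0F0F], 0b111, 16, 4))  # [34, 18, 32, 16]
```

With N = 16 each 8-bit field can absorb sixteen 4-bit addends (at most 240), whereas with N = 8 the fields are only 4 bits wide, so eight addends of 3 (a lane sum of 24) would carry into the next field in this model.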

12. The parallel processing device of claim 7, wherein
the selection operation unit includes converters and demultiplexers,
the converters shift some bits of the first signals according to the division selection signal, and
the demultiplexers output signals selected, according to the bits of the corresponding second signal, from among the outputs of the converters and 0.
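The converter and demultiplexer structure above can be sketched as two small functions, one instance of each per first signal. The names `converter`, `demultiplexer`, and `selection_operation` are hypothetical, and the division factor k stands in for the division selection signal, whose actual encoding the claim does not specify.

```python
def converter(x: int, n: int, k: int) -> int:
    """Hypothetical converter: moves segment s (m = n // k bits) of x up by s*m bits."""
    m = n // k
    seg_mask = (1 << m) - 1
    return sum((x & (seg_mask << (s * m))) << (s * m) for s in range(k))


def demultiplexer(converted: int, select_bit: int) -> int:
    """Hypothetical demultiplexer: outputs the converter output if the bit is 1, else 0."""
    return converted if select_bit else 0


def selection_operation(first_signals: list[int], y: int, n: int, k: int) -> list[int]:
    """One converter and one demultiplexer per first signal; bit i of y drives demultiplexer i."""
    return [demultiplexer(converter(x, n, k), (y >> i) & 1)
            for i, x in enumerate(first_signals)]


outs = selection_operation([0xB6, 0x3F], 0b10, 8, 2)
print([hex(v) for v in outs])  # ['0x0', '0x30f']
print(hex(sum(outs)))          # 0x30f (the summer adds the demultiplexer outputs)
```

With k = 1 the converter passes the first signal through unchanged, so the same structure also performs an undivided summation.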

The parallel processing device of claim 1, wherein
each of the pre-processing units further includes a selection operation unit,
the selection operation unit operates when the corresponding pre-processing unit among the pre-processing units operates in a summation mode, transfers corresponding ones of the first signals to the corresponding summer according to the bits of the corresponding second signal, and shifts some bits of those corresponding first signals according to the division selection signal before transferring them to the corresponding summer, and
the shift operation unit operates when the corresponding pre-processing unit operates in a multiplication mode.

The parallel processing device of claim 1, further comprising:
a delay unit that delays the outputs of the summers according to a clock signal and outputs them; and
a selector that outputs, as the first signals, signals respectively selected according to input control signals from among the signals transmitted from the memory and the signals output from the delay unit.
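The claim does not say what the feedback path through the delay unit is used for. One plausible use, assumed here purely for illustration, is multi-cycle accumulation, where the summer outputs of one clock cycle become the first signals of the next. In the cycle-level sketch below, plain integers stand in for the summer outputs, per-summer bit masks stand in for the second signals, no division is modelled, and the function name `simulate` is hypothetical.

```python
def simulate(memory_rows: list[list[int]], schedule: list[tuple[bool, list[int]]]) -> list[list[int]]:
    """Cycle-level sketch of the selector / delay-unit feedback path.

    schedule[c] = (use_feedback, masks) for clock cycle c. The selector takes
    the first signals from the delay unit when use_feedback is True and
    otherwise from the next memory row. Summer j adds the first signals whose
    index bits are set in masks[j] (a stand-in for the second signals).
    """
    rows = iter(memory_rows)
    delayed = None                         # delay unit: previous cycle's summer outputs
    trace = []
    for use_feedback, masks in schedule:
        if use_feedback:
            if delayed is None:
                raise ValueError("no summer outputs have been latched yet")
            first = delayed                # selector picks the delay-unit outputs
        else:
            first = next(rows)             # selector picks the memory outputs
        outs = [sum(v for i, v in enumerate(first) if (mask >> i) & 1) for mask in masks]
        delayed = outs                     # latched at the clock edge, used next cycle
        trace.append(outs)
    return trace


# Two-cycle reduction of four values: pairwise sums, then the sum of the pair.
schedule = [(False, [0b0011, 0b1100, 0, 0]), (True, [0b0011, 0, 0, 0])]
print(simulate([[1, 2, 3, 4]], schedule))  # [[3, 7, 0, 0], [10, 0, 0, 0]]
```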