KR100442434B1

KR100442434B1 - Accumulation Method of Array Structure node for Pre-trained Neural Network Design

Info

Publication number: KR100442434B1
Application number: KR10-2002-0012122A
Authority: KR
Inventors: 임국찬
Original assignee: 엘지전자 주식회사
Priority date: 2002-03-07
Filing date: 2002-03-07
Publication date: 2004-07-30
Also published as: KR20030072860A

Abstract

The present invention relates to the efficient design of a pre-trained neural network: it uses the fixed weights to express the complex computation as an array structure and then simplifies that structure. The array structure computation method for pre-trained neural network design according to the invention comprises: a first step of expressing the processing element (PE) operations of the neural network as a matrix-vector multiplication and expressing the relationship between the fixed weights and the input data as a weight hexahedron (m, n, l) in a bit-level array structure; a second step of computing, in the weight hexahedron, the logical AND of each node's input value and weight, adding the result sequentially to that of the previously computed node, and outputting the total sum at the final output node; and a third step in which an output layer indexed by (m, l), the index of each processing element and the extension index, receives the output values of the weight hexahedron, adds them in the l direction with a right shift, and then performs addition according to the clock to output the final output value.

According to the invention, the relationship between the fixed weights and the input data is expressed as a bit-level array structure, and the nodes needed for the whole computation are minimized through node elimination that does not affect the result, an add-tree representation, and node sharing based on optimization over the bit patterns of the weights.

Description

Accumulation method of array structure node for pre-trained neural network design

The present invention relates to the efficient design of a pre-trained neural network. More specifically, it relates to an array structure computation method for pre-trained neural network design that exploits the fact that the weights of a trained network are fixed: the complex computation is expressed as an array structure and simplified, which minimizes the number of nodes needed for implementation and thereby reduces hardware cost.

In general, neural networks offer a variety of solutions to problems in speech and image recognition, prediction, adaptive processing, and optimization. To this end, processing elements (PEs), which are artificial models of nerve cells, are connected in various configurations suited to each application and perform the appropriate computation. Because no optimal means of implementation exists and the amount of computation is vast, these network models are rarely implemented in practice and are mostly used only at the simulation level. The complex arithmetic of a neural network makes the hardware cost of implementation high and the operating speed low.

A processing element (PE), the model of a nerve cell, is also called an artificial neuron model, a unit, or a cell. Most neural networks use elements with multiple inputs and a single output.

FIG. 1 shows the structure of a processing element; its computation can be expressed as Equation 1.
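Equation 1 itself is missing from this text. Judging from the symbol definitions that accompany it, it presumably reads as follows (a reconstruction under that assumption, not the original typeset equation):

```latex
O_m = \sum_{n=0}^{N-1} w_{n,m}\, I_n , \qquad y_m = f(O_m), \qquad m = 0, 1, \ldots, M-1
```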

Here I_n denotes the synaptic connection between neurons, and w_{n,m} denotes the weight, that is, the connection strength of the neural network. n (n = 0, 1, 2, ..., N-1) is the input index of the processing element, and m (m = 0, 1, 2, ..., M-1) is the index identifying each PE. O_m is the total sum of the inputs, f is the response function, and y_m is the final output of the processing element after transformation by the response function.

As Equation 1 shows, obtaining the output of a processing element requires multiplying the input vector by the weight vector. This multiplication directly raises hardware cost, so an effective algorithm is needed to address it. Various multiplier structures developed to improve efficiency in digital signal processing (DSP) can be applied, such as the MAC (multiply-accumulator), Wallace tree, Baugh-Wooley, distributed arithmetic, CSD (canonical signed digit), and MCM (multiple constant multiplication). Most of these algorithms are implemented at the bit level to increase operating speed and to allow fine-grained parallel structures. To this end, expressing the input value I_n of Equation 1 at the bit level as in Equation 2 yields Equation 3.
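Equations 2 and 3 did not survive in this text. A plausible reconstruction is sketched below, assuming an unsigned T-bit input whose bit I_{n,t} carries the weight 2^t. The source states only that the least significant bit is input first; the exact scaling and any sign handling are assumptions.

```latex
% Equation 2 (assumed form): bit-level expansion of the input
I_n = \sum_{t=0}^{T-1} I_{n,t}\, 2^{t}, \qquad I_{n,t} \in \{0, 1\}

% Equation 3 (assumed form): Equation 2 substituted into O_m = \sum_n w_{n,m} I_n
O_m = \sum_{n=0}^{N-1} w_{n,m} \sum_{t=0}^{T-1} I_{n,t}\, 2^{t}
    = \sum_{t=0}^{T-1} 2^{t} \sum_{n=0}^{N-1} w_{n,m}\, I_{n,t}
```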

FIG. 2 shows a conventional MAC structure for implementing Equation 3. Building one processing element requires N MACs (101 to 101n) and large-capacity (log₂N) adders (102, 103). Each MAC (101 to 101n) adds its input value (I_{0,t}, ..., I_{N-1,t}) and the fed-back weight (w_{0,m}, ..., w_{N-1,m}) in an adder (111) and shifts the result through a shift register (112) before outputting it. The outputs of each pair of MACs are added in one adder (102, 103), and the outputs of the two adders (102, 103) are added again in the adder (104) of the next stage. This design therefore incurs a large hardware overhead.

The weights of a trained neural network are fixed. Designs based on distributed arithmetic and on CSD exploit this to obtain a more efficient hardware structure. In distributed arithmetic, because the weights are fixed, the output value is determined by the input values alone; the output values are therefore computed in advance and stored in a ROM (read-only memory). FIG. 3 shows the structure of distributed arithmetic using a ROM.

Here the ROM (120) has 2^N addresses and a maximum bit width of (bit width of w_{n,m}) + log₂N, so the ROM needed for implementation holds 2^N × ((bit width of w_{n,m}) + log₂N) bits. Because distributed arithmetic requires a ROM that grows exponentially with N, it is inefficient for implementing neural networks in which processing elements are heavily interconnected.
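The ROM-based scheme and its size formula can be illustrated with a minimal Python sketch. It assumes unsigned integer weights and inputs, and the function names (`build_rom`, `da_dot`, `rom_size_bits`) are hypothetical; the patent does not give an implementation.

```python
from math import ceil, log2

def build_rom(weights):
    """Precompute, for every N-bit address, the sum of the weights whose
    address bit is 1. The table has 2**N entries."""
    n = len(weights)
    return [sum(w for i, w in enumerate(weights) if (addr >> i) & 1)
            for addr in range(2 ** n)]

def da_dot(weights, inputs, t_bits):
    """Bit-serial dot product: one ROM lookup per input bit position t."""
    rom = build_rom(weights)
    acc = 0
    for t in range(t_bits):
        # Bit t of every input forms the ROM address for this clock.
        addr = sum(((x >> t) & 1) << i for i, x in enumerate(inputs))
        acc += rom[addr] << t
    return acc

def rom_size_bits(n_inputs, weight_bits):
    """ROM size from the text: 2^N x (weight bit width + log2 N) bits."""
    return 2 ** n_inputs * (weight_bits + ceil(log2(n_inputs)))

print(da_dot([3, 5, 7], [2, 4, 6], 3))           # same value as 3*2 + 5*4 + 7*6
print(rom_size_bits(8, 8), rom_size_bits(16, 8))  # doubling N squares the address count
```

Doubling N from 8 to 16 multiplies the number of ROM addresses by 256, which is the exponential growth the text objects to.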

CSD, by contrast, expands the weight w_{n,m} rather than the input I_n to the bit level, as in Equation 4; rearranging gives Equation 5. In Equation 5, the additions for positions where the bit of the fixed weight is '0' can be eliminated, and if the weight bits are written in CSD form, the number of nonzero digits can be reduced to at most L/2, allowing a still more efficient implementation. FIG. 4 shows an example PE design using CSD coding for a weight w_{n,m} of 15, whose 6-bit binary and CSD representations are '001111' and '01000-1', respectively. The operation that needs three adders (141 to 143) in FIG. 4(a) can be implemented with a single subtractor (152) using CSD coding, as in FIG. 4(b).
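The weight-15 example can be checked with a short Python sketch of CSD recoding. The functions `to_csd` and `from_csd` are hypothetical helpers written for this illustration and handle non-negative integers only; the recoding rule used is the standard non-adjacent-form conversion, not something spelled out in the patent.

```python
def to_csd(value, width):
    """Return the CSD digits of a non-negative integer, most significant
    digit first. Each digit is -1, 0, or +1 and no two nonzero digits
    are adjacent."""
    digits = []  # collected least significant digit first
    v = value
    while v != 0:
        if v & 1:
            d = 2 - (v & 3)  # +1 when v mod 4 == 1, -1 when v mod 4 == 3
            v -= d
        else:
            d = 0
        digits.append(d)
        v >>= 1
    if len(digits) > width:
        raise ValueError("value does not fit in the requested width")
    digits += [0] * (width - len(digits))
    return digits[::-1]

def from_csd(digits):
    """Evaluate a most-significant-first CSD digit list."""
    return sum(d * 2 ** i for i, d in enumerate(reversed(digits)))

csd = to_csd(15, 6)
print(csd)                                # digits of '01000-1'
print(bin(15).count("1"), sum(d != 0 for d in csd))  # nonzero digits: binary vs CSD
```

For 15, the binary form has four nonzero digits (three additions) while the CSD form has two (one subtraction, 16 - 1), matching the adder counts described for FIG. 4.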

However, the PE design method based on distributed arithmetic in FIG. 3 requires a ROM that grows exponentially with N and is therefore inefficient for neural networks with many interconnections between processing elements. The design method based on CSD coding in FIG. 4 makes appropriate pipelining difficult, and because its inputs are handled in word units, the additions take a long time.

The present invention was devised to solve these problems of the prior art. Its object is to provide an array structure computation method for pre-trained neural network design that uses the fixed weights of the trained network to express the operations of the processing elements collectively as a matrix-vector multiplication and performs that multiplication fully in parallel at the bit level.

Another object is to provide an array structure computation method for pre-trained neural network design that expresses the relationship between the fixed weights and the input data as a bit-level array structure and minimizes the nodes needed for the whole computation through node elimination that does not affect the result, an add-tree representation, and node sharing based on optimization over the bit patterns of the weights.

FIG. 1 shows the structure of a processing element, the artificial model used in a typical neural network.

FIG. 2 shows a conventional processing element design using MACs.

FIG. 3 shows a conventional processing element design using distributed arithmetic.

FIG. 4 shows a conventional processing element design using CSD.

FIG. 5 shows an example of a trained neural network according to an embodiment of the present invention.

FIG. 6 shows an example of the bit-level expansion of the weights according to an embodiment of the present invention.

FIG. 7 shows the input plane, weight hexahedron, and output plane according to an embodiment of the present invention.

FIG. 8 shows the structure of a shift adder for bit-level addition according to an embodiment of the present invention.

FIG. 9 shows the structure of an n,l weight plane according to an embodiment of the present invention.

FIG. 10 shows the node elimination structure according to an embodiment of the present invention.

FIG. 11 shows the structure of an add-tree according to an embodiment of the present invention.

FIG. 12 shows an example of setting shared nodes according to an embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

201...weight hexahedron 202...input plane

203...output plane 210, 220...node

211...AND gate 212, 222...adder

221...right shift

To achieve the above objects, the array structure computation method for pre-trained neural network design according to the present invention comprises:

a first step of expressing the processing element (PE) operations of the neural network as a matrix-vector multiplication, and expressing the relationship between the fixed weights and the input data as a weight hexahedron (m, n, l) in a bit-level array structure;

a second step of computing, in the weight hexahedron, the logical AND of each node's input value and weight, adding the result sequentially to that of the previously computed node, and outputting the total sum at the final output node; and

a third step in which an output layer indexed by (m, l), the index of each processing element and the extension index, receives the output values of the weight hexahedron, adds them in the l direction with a right shift, and then performs addition according to the clock to output the final output value.

Preferably, in the first step, the input value is extended in the l direction to form an input plane, and the least significant bit of the input value is fed into the weight hexahedron first, with the most significant bit fed in at the last clock.

Preferably, the second step comprises: a node elimination step in which, so that all nodes of the weight plane indexed by the input index and the extension index of every processing element yield one output, a node whose fixed weight value is 0 or 1 passes its input value through unchanged; an add-tree setting step of mapping one add-tree to each row of an n,l weight plane; and a step of setting shared nodes by representing, among the add-trees so formed, those portions that receive the same input and have the same weight bit pattern as a single add-tree for the same output.

Preferably, in the node elimination step, because the weight value input to each node is fixed at 0 or 1, the AND gate and flip-flop on a pass-through path, whose behavior does not depend on the input value, are eliminated.

Preferably, the critical path in the add-tree setting step consists of as many full adders as the depth of the add-tree, and the carry generated at each node is stored in the node's flip-flop and added in the full adder at the next clock.

The array structure computation method for pre-trained neural network design according to the present invention is described below with reference to the accompanying drawings.

FIG. 5 shows an example of a trained neural network. It is a layered network that runs in a single direction from input to output and consists of input-layer nodes (I_0, ..., I_{N-1}) and output-layer nodes (O_0, ..., O_{M-1}); w_{n,m} are the weights obtained through training. Obtaining the sum of each processing element, O_m, which corresponds to the dotted region in FIG. 5, requires the computation in Equation 6. Only the relationship among the input vector (I_n), the weight matrix (w_{n,m}), and the result (O_m) is considered here; the nonlinear function f is left out. Written as a matrix operation, this relationship is Equation 7.
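Equations 6 and 7 are not reproduced in this text. From the description they presumably take the following form, given here as a reconstruction rather than the original typeset equations:

```latex
% Equation 6 (assumed form): sum of each processing element
O_m = \sum_{n=0}^{N-1} w_{n,m}\, I_n , \qquad m = 0, 1, \ldots, M-1

% Equation 7 (assumed form): the same relationship as a matrix-vector product
\begin{bmatrix} O_0 \\ O_1 \\ \vdots \\ O_{M-1} \end{bmatrix}
=
\begin{bmatrix}
w_{0,0}   & w_{1,0}   & \cdots & w_{N-1,0} \\
w_{0,1}   & w_{1,1}   & \cdots & w_{N-1,1} \\
\vdots    & \vdots    &        & \vdots    \\
w_{0,M-1} & w_{1,M-1} & \cdots & w_{N-1,M-1}
\end{bmatrix}
\begin{bmatrix} I_0 \\ I_1 \\ \vdots \\ I_{N-1} \end{bmatrix}
```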

Expressing the weight w_{n,m} and the input I_n of Equation 6 at the bit level, using Equation 4 and Equation 2 respectively, gives Equation 8. The weights in Equation 8 can be extended into a hexahedral form (201, 202, 203) with the three indices m, n, and l, as shown in FIG. 6.
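The bit-level expansion can be checked numerically. The Python sketch below assumes weight bits weighted by 2^-l (as the description of the output plane suggests) and unsigned input bits weighted by 2^t; the input scaling, the integer encoding of the weights, and all function names are assumptions made for this illustration.

```python
from fractions import Fraction

def weight_bits(w_int, L):
    """Bits w_l of a fixed-point weight w = sum_l w_l * 2**-l, where the
    weight is stored as the integer w_int = w * 2**(L - 1)."""
    return [(w_int >> (L - 1 - l)) & 1 for l in range(L)]

def input_bits(x, T):
    """Bits I_t of an unsigned input x = sum_t I_t * 2**t (LSB first)."""
    return [(x >> t) & 1 for t in range(T)]

def expanded_sum(w_ints, xs, L, T):
    """Triple sum over n, l, t of (w_{n,l} AND I_{n,t}) * 2**t * 2**-l."""
    total = Fraction(0)
    for w_int, x in zip(w_ints, xs):
        wb, xb = weight_bits(w_int, L), input_bits(x, T)
        for l in range(L):
            for t in range(T):
                total += Fraction((wb[l] & xb[t]) * 2 ** t, 2 ** l)
    return total

def direct_sum(w_ints, xs, L):
    """Ordinary word-level sum of weight times input."""
    return sum(Fraction(w, 2 ** (L - 1)) * x for w, x in zip(w_ints, xs))

# Weights 5/4 and 3/4 (L = 3), inputs 6 and 7 (T = 3).
print(expanded_sum([5, 3], [6, 7], L=3, T=3), direct_sum([5, 3], [6, 7], L=3))
```

Both calls print the same fraction, which shows that replacing each word-level product by AND operations on individual bits leaves the result unchanged.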

FIG. 7 shows the detailed structure of FIG. 6 with parallelism taken into account; the black nodes are the nodes that carry the final output of each array structure.

FIG. 7(a) is the input plane, where the input bit I_{n,t} is fed in, extended in the l direction. The input itself is indexed by n and t, but it is extended in the l direction to match the input of the weight hexahedron. t increases at every clock, and the corresponding I_{n,t} values are extended in the l direction to form an n,l input plane that is fed into the weight hexahedron. The LSB (least significant bit) is input first, and the MSB (most significant bit) is input at the last clock (T-1).

FIG. 7(b) is the weight hexahedron. Each node (210) computes the product of input and weight (I_{n,t}·w_{n,m,l}) with an AND gate (211) and adds it, in an adder (212), to the previously computed input-weight value (I_{n-1,t}·w_{n-1,m,l}). The final node therefore outputs the sum of I_{n,t}·w_{n,m,l} over all n. Each node (210) consists of one AND gate (211) for the bit multiplication and an adder (212) for adding the output of the previous node.

FIG. 7(c) is the output plane, indexed by m and l. It receives the output values of the weight hexahedron and adds them in the l direction. Because the l direction was expanded in powers of 2^-l, the output of the previous node (220) is shifted right by one bit by a right shift (221) before the addition. Forming the final output of FIG. 7 with shift adders (231, 232) as in FIG. 8 and feeding in the inputs I_{n,t} clock by clock computes Equation 8.
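The interplay of input plane, weight hexahedron, and output plane can be sketched as a clock-by-clock simulation in Python. This is a behavioral model under stated assumptions, not the circuit of FIG. 7 and FIG. 8: weights are unsigned fixed-point values with bit l weighted by 2^-l, inputs are unsigned with the LSB first, the cross-clock shift adder is modeled as a multiplication by 2^t, and the name `simulate` is hypothetical.

```python
from fractions import Fraction

def simulate(W, xs, L, T):
    """W[n][m] is an integer holding L weight bits (bit l = 0 is the most
    significant and carries weight 2**-l). xs[n] is an unsigned T-bit input.
    Returns the list of outputs O_m."""
    N, M = len(W), len(W[0])
    # Weight hexahedron: w[n][m][l] is a single fixed bit.
    w = [[[(W[n][m] >> (L - 1 - l)) & 1 for l in range(L)]
          for m in range(M)] for n in range(N)]
    out = [Fraction(0)] * M
    for t in range(T):                      # one clock per input bit, LSB first
        i_t = [(x >> t) & 1 for x in xs]    # input plane: same bit for every l
        for m in range(M):
            acc = Fraction(0)
            for l in reversed(range(L)):
                # Column of the hexahedron: AND at each node, summed over n.
                col = sum(i_t[n] & w[n][m][l] for n in range(N))
                # Output plane: right-shift the previous node, then add.
                acc = acc / 2 + col
            out[m] += acc * 2 ** t          # shift adder across clocks
    return out

W = [[5, 3], [2, 7]]    # weights 5/4, 3/4 (n = 0) and 2/4, 7/4 (n = 1)
print(simulate(W, [6, 7], L=3, T=3))
```

With these numbers the two outputs equal (5/4)·6 + (2/4)·7 and (3/4)·6 + (7/4)·7, the same values a word-level matrix-vector product would give.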

The array structure of FIG. 7(b) is simplified in three processes: a first process that eliminates nodes that perform no operation, a second process that builds an add-tree structure, and a third process that shares patterns which receive the same input and produce the same result, reducing the number of nodes.

A first process for eliminating nodes:

Obtaining one output (O_m) in FIG. 7(b) requires the operations associated with every index n and l. That is, as shown in FIG. 9, all nodes of one n,l weight plane (240), indexed by n and l, are computed to obtain one output (O_m).

In FIG. 9, each node consists of one AND gate, an adder, and a flip-flop (FF) that stores w_{n,m,l}. Because w_{n,m,l} is fixed, a node whose weight bit is '0' simply passes its input value through; it performs no operation and can be omitted. When the weight bit is '1', the node's input value is itself the result of the AND gate, so the node's AND gate and flip-flop can also be eliminated. FIG. 10 shows an example of node elimination; the number under each node is that node's w_{n,m,l} bit value. FIG. 10(a) is the structure before node elimination, and FIG. 10(b) is the structure after.
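A small Python sketch can make the effect of node elimination concrete. The component counts follow the node description above (one AND gate, one adder, one weight flip-flop per node); the sample weight plane and the function names are made up for this illustration.

```python
from itertools import product

def row_sum_full(weight_row, input_bits):
    """Unpruned row: every node ANDs its input bit with its stored weight bit."""
    return sum(i & w for i, w in zip(input_bits, weight_row))

def row_sum_pruned(weight_row, input_bits):
    """Pruned row: only nodes whose fixed weight bit is 1 keep an adder, and
    they take the input bit directly (no AND gate, no weight flip-flop)."""
    kept = [n for n, w in enumerate(weight_row) if w == 1]
    return sum(input_bits[n] for n in kept)

def count_hardware(plane):
    """plane[l][n] holds the fixed bit w_{n,m,l} of one n,l weight plane."""
    cells = sum(len(row) for row in plane)
    ones = sum(sum(row) for row in plane)
    before = {"and_gates": cells, "weight_ffs": cells, "adders": cells}
    after = {"and_gates": 0, "weight_ffs": 0, "adders": ones}
    return before, after

plane = [[0, 1, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
print(count_hardware(plane))

# The pruned rows give the same sums as the full rows for every input pattern.
same = all(row_sum_full(r, bits) == row_sum_pruned(r, bits)
           for r in plane for bits in product([0, 1], repeat=4))
print(same)
```

Only the adders at positions with a '1' weight bit remain, and the exhaustive comparison confirms that pruning does not change any row sum.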

A second process for the add-tree representation:

In the structure described so far, the addition result of the left node in an n,l weight plane must be passed to the right and accumulated successively, which increases the required adder capacity and the number of signal lines between nodes. To solve this problem, an add-tree structure is built as in FIG. 11, with one add-tree mapped to each row of the n,l weight plane.

FIG. 11(a) is the structure before add-tree mapping, and FIG. 11(b) is the structure after. In this add-tree, each node consists of one full adder (FA) and one flip-flop (FF), so all connections between nodes are 1 bit wide. The carry generated at each node is stored in the flip-flop and added at the next clock, so the number of clocks needed for the whole computation increases somewhat.
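The node behavior described here (a full adder plus a flip-flop holding the carry for the next clock) is that of a bit-serial adder. The Python sketch below models a binary tree of such nodes summing LSB-first bit streams. The tree shape, the power-of-two input count, and the names `SerialAdder` and `serial_tree_sum` are assumptions made for this illustration.

```python
class SerialAdder:
    """One add-tree node: a full adder plus a flip-flop that holds the carry."""

    def __init__(self):
        self.carry = 0  # flip-flop contents

    def clock(self, a, b):
        """Consume one bit from each input and emit one sum bit."""
        s = a ^ b ^ self.carry
        self.carry = (a & b) | (self.carry & (a ^ b))  # saved for the next clock
        return s

def serial_tree_sum(values, t_bits):
    """Sum unsigned t_bits-wide values with a binary tree of serial adders.
    Assumes len(values) is a power of two."""
    levels, width = [], len(values) // 2
    while width >= 1:
        levels.append([SerialAdder() for _ in range(width)])
        width //= 2
    depth = len(levels)
    result = 0
    # 'depth' extra clocks flush the carries still held in the flip-flops.
    for t in range(t_bits + depth):
        bits = [(v >> t) & 1 for v in values]  # LSB first, zeros after the MSB
        for level in levels:                   # 1-bit links between nodes
            bits = [node.clock(bits[2 * i], bits[2 * i + 1])
                    for i, node in enumerate(level)]
        result |= bits[0] << t
    return result

print(serial_tree_sum([5, 3, 7, 6], t_bits=3))
```

Within one clock a bit passes through as many full adders as the tree is deep, which matches the critical path stated earlier, and the extra flush clocks correspond to the somewhat longer total clock count noted in the text.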

A third process for setting shared nodes:

FIG. 12(a) is the structure before identical patterns are shared, and FIG. 12(b) is the structure after. Referring to FIG. 12, M add-trees are generated for one n,l weight plane. Since they all receive the same input (I_{n,t}), portions with the same w_{n,m,l} pattern output the same result. Representing the portions with identical patterns as a single add-tree and sharing its result, as in FIG. 12, therefore reduces the total number of nodes.
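Pattern sharing amounts to grouping rows by their weight bit pattern over n. The Python sketch below shares whole rows only; the patent also speaks of sharing identical portions of a pattern, which this sketch does not attempt. The data layout `w[m][l]` and the name `share_trees` are assumptions made for this illustration.

```python
def share_trees(w):
    """w[m][l] is the list of fixed bits w_{n,m,l} over n for one row.
    Returns (shared, mapping): 'shared' assigns an index to each distinct
    nonzero pattern, and 'mapping' tells each row (m, l) which shared
    add-tree supplies its sum."""
    shared, mapping = {}, {}
    for m, plane in enumerate(w):
        for l, row in enumerate(plane):
            pattern = tuple(row)
            if not any(pattern):
                continue  # all-zero row: no add-tree is needed at all
            shared.setdefault(pattern, len(shared))
            mapping[(m, l)] = shared[pattern]
    return shared, mapping

w = [
    [[1, 0, 1], [0, 1, 1]],   # rows l = 0, 1 of the plane for m = 0
    [[1, 0, 1], [0, 0, 0]],   # rows l = 0, 1 of the plane for m = 1
]
shared, mapping = share_trees(w)
print(len(mapping), "rows served by", len(shared), "add-trees")
print(mapping)
```

Rows (m = 0, l = 0) and (m = 1, l = 0) carry the same pattern and receive the same input bits, so one add-tree can feed both.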

Consequently, expressing the fixed weights of a fully trained network at the bit level and minimizing the nodes of the resulting array structure through the proposed node elimination, add-tree representation, and shared-node setting reduces the hardware resources needed for implementation.

As another embodiment of the present invention, many algorithms in signal processing likewise take the form of multiplying an input vector by a fixed coefficient matrix, as in Equation 7. Expressing such an algorithm as an array structure and applying the proposed node elimination, add-tree representation, and shared-node setting can therefore reduce the hardware cost of its implementation.

As described above, the array structure computation method for pre-trained neural network design according to the present invention expresses the complex arithmetic of each PE of a trained neural network as an array structure and optimizes its constituent nodes. This minimizes the hardware cost of implementation and permits a small chip, making on-chip neural network implementation on an FPGA possible.

Claims

An array structure computation method for pre-trained neural network design, comprising: a first step of expressing the processing element (PE) operations of a neural network as a matrix-vector multiplication, and expressing the relationship between the fixed weights of the neural network and the input data as a weight hexahedron (m, n, l) in a bit-level array structure;

a second step of computing, for the bit multiplication at each node of the weight hexahedron, the logical AND of the node's input value and weight, adding the result sequentially to that of the previously computed node, and outputting the total sum at the final output node; and

a third step in which an output layer indexed by the index of each processing element and the extension index (m, l) receives the output values of the weight hexahedron, adds them in the l direction, and performs shift addition according to the clock to output the final output value;

wherein the second step comprises: a node elimination step in which, so that all nodes of the weight plane indexed by the input index and the extension index of every processing element yield one output, a node whose fixed weight value is 0 or 1 passes its input value through unchanged; an add-tree setting step of mapping one add-tree to each row of an n,l weight plane; and a step of setting shared nodes by representing, among the add-trees so formed, those portions that receive the same input and have the same weight bit pattern as a single add-tree for the same output, thereby minimizing the number of nodes.

The method of claim 1,

wherein, in the first step, the input value is extended in the l direction at each clock to form an input plane that matches the input of the weight hexahedron, and the least significant bit of the input value is fed into the weight hexahedron first, with the most significant bit fed in at the last clock.

(Deleted)

The method of claim 1,

wherein, in the node elimination step, because the weight value input to each node is fixed at 0 or 1, the AND gate and flip-flop on a pass-through path, whose behavior does not depend on the input value, are eliminated.

The method of claim 1,

wherein the critical path in the add-tree setting step consists of as many full adders as the depth of the add-tree, and the carry generated at each node is stored in the node's flip-flop and added in the full adder at the next clock.