KR20210070262A

KR20210070262A - Deep learning apparatus for ANN with pipeline architecture

Info

Publication number: KR20210070262A
Application number: KR1020210072762A
Authority: KR
Inventors: 김녹원
Original assignee: 주식회사 딥엑스
Priority date: 2018-08-16
Filing date: 2021-06-04
Publication date: 2021-06-14
Also published as: KR102396447B1; KR20200020117A; KR102263598B1

Abstract

The present invention relates to a computational accelerator for an artificial neural network having a pipeline structure. More specifically, it relates to a computational accelerator that computes the output value, the corrected input data, and the corrected output value, and performs weight correction, input bias correction, and output bias correction concurrently in a pipeline structure, thereby reducing both the computation time required for learning and the memory used.

Description

Computational acceleration device for artificial neural network having a pipeline structure {Deep learning apparatus for ANN with pipeline architecture}

The present invention relates to a computational acceleration device for an artificial neural network having a pipeline structure. More specifically, it relates to a computational acceleration device that computes the output value, the corrected input data, and the corrected output value, and performs weight correction, input bias correction, and output bias correction concurrently in a pipeline structure, thereby reducing the computation time required for learning and the memory used.

An artificial neural network (ANN) is a statistical learning algorithm, used in machine learning and cognitive science, that is inspired by biological neural networks (the central nervous system of animals, in particular the brain). The term refers broadly to models in which artificial neurons (nodes), connected into a network through synapses, acquire problem-solving ability by changing the strength of those synaptic connections through learning.

Among these artificial neural network models, the restricted Boltzmann machine (RBM) is a model proposed by Geoff Hinton. It is an algorithm that can be used for dimensionality reduction, classification, linear regression analysis, collaborative filtering, feature learning, and topic modelling.

In the RBM model, every hidden node of the hidden layer is connected to the input nodes of the input (visible) layer, and every input node of the input layer is connected to the hidden nodes of the hidden layer. Nodes within the same layer, however, are not connected to one another at all.

In other words, the RBM model has no connections within a layer, and this structure is why it is called a 'restricted' Boltzmann machine. An input node of the input layer receives data and makes a stochastic decision about how much of the received data to pass to the hidden layer. That is, it decides according to a probability whether to pass the input (expressed as 1) or not (expressed as 0).

FIG. 1 is a diagram for explaining the concept of the restricted Boltzmann machine model.

As shown in FIG. 1, when input data (v) is input to the input nodes (i) of the input layer, the products of each input node's input data and the weight (w_ij) are summed, passed through an activation function such as the sigmoid function, sampled to a value of 0 or 1, and output from the hidden node (j) as the output value (h).

The restricted Boltzmann machine model learns the important features of the input data by adjusting its weights through unsupervised learning. The weights are adjusted by computing an error value (v'h' - vh) from the input data (v), the output value (h) of the hidden nodes, the corrected input data (v') computed in the reconstruction process, and the corrected output value (h') computed in the regeneration process.

FIG. 2 is a diagram for explaining an example of adjusting the weights in the restricted Boltzmann machine model.

Referring to FIG. 2, as shown in FIG. 2(a), in the reconstruction process the products of the output value (h) of each hidden node (j) of the hidden layer and the weight (w_ji) are all summed, passed through the activation function, sampled to a value of 0 or 1, and output from the input node (i) as the corrected input data (v').

As shown in FIG. 2(b), in the regeneration process the products of the corrected input data (v') of each input node (i) and the weight (w_ij) are summed, passed through the activation function, sampled to a value of 0 or 1, and output from the hidden node (j) as the corrected output value (h').

The output value (h), the corrected input data (v'), and the corrected output value (h') described above are computed according to Equation (1), Equation (2), and Equation (3) below, respectively.

[Equation 1]

$$h_{cj} = P\left(\sigma\left(\sum_{i=1}^{N_v} v_{ci}\, w_{ij} + b^{h}_{j}\right)\right)$$

[Equation 2]

$$v'_{ci} = P\left(\sigma\left(\sum_{j=1}^{N_h} h_{cj}\, w_{ji} + b^{v}_{i}\right)\right)$$

[Equation 3]

$$h'_{cj} = P\left(\sigma\left(\sum_{i=1}^{N_v} v'_{ci}\, w_{ij} + b^{h}_{j}\right)\right)$$

Here, P is the sampling function, h_cj is the output value of the hidden node (j) for the input case (c), N_v is the number of input nodes, v_ci is the input data of the input case (c) that is input to the input node (i), w_ij is the weight between the input node (i) and the hidden node (j), b^h_j is the input bias value of the hidden node (j), and σ is the activation function, for example the logistic function.

Likewise, N_h is the number of hidden nodes, h_cj is the output value of the input case (c) output from the hidden node (j), w_ji is the weight between the hidden node (j) and the input node (i), b^v_i is the output bias value of the input node (i), v'_ci is the corrected input data of the input node (i) for the input case (c), and h'_cj is the corrected output value of the hidden node (j) for the input case (c).
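The generation, reconstruction, and regeneration steps of Equations (1) to (3) can be sketched in plain Python. This is a minimal illustration only: the function names, the weight values, the zero biases, and the random seed are assumptions made for the example, and a single weight table `w[i][j]` is assumed to serve as both w_ij and w_ji.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic activation sigma(x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample(p, rng):
    """Sampling function P: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def generate(v, w, b_h, rng):
    """Hidden outputs h_j from input data v_i (also used for regeneration)."""
    n_v, n_h = len(w), len(w[0])
    return [
        sample(logistic(sum(v[i] * w[i][j] for i in range(n_v)) + b_h[j]), rng)
        for j in range(n_h)
    ]


def reconstruct(h, w, b_v, rng):
    """Corrected input data v'_i from hidden outputs h_j."""
    n_v, n_h = len(w), len(w[0])
    return [
        sample(logistic(sum(h[j] * w[i][j] for j in range(n_h)) + b_v[i]), rng)
        for i in range(n_v)
    ]


# Illustrative values only: 3 input nodes, 2 hidden nodes.
rng = random.Random(0)
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b_h = [0.0, 0.0]
b_v = [0.0, 0.0, 0.0]
v = [1, 0, 1]

h = generate(v, w, b_h, rng)              # output value h
v_prime = reconstruct(h, w, b_v, rng)     # corrected input data v'
h_prime = generate(v_prime, w, b_h, rng)  # corrected output value h'
```

Because the sampling step is stochastic, the values of `h`, `v_prime`, and `h_prime` depend on the seed; only their lengths and the 0/1 value range are fixed by the structure.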

Using the output value (h), the corrected input data (v'), and the corrected output value (h') computed in this way, the weight (w_ij), the output bias (b^h), and the input bias (b^v) are corrected in order to learn the features of the input data. The weight, the output bias, and the input bias are corrected according to Equation (4), Equation (5), and Equation (6) below.

[Equation 4]

[Equation 5]

[Equation 6]

Here, N_c is the number of input cases constituting the input batch, and ε is the learning rate.
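As a hedged illustration, the batch-averaged update commonly used for restricted Boltzmann machines can be written with the symbols defined above. This is an assumed standard form and not necessarily the exact content of Equations (4) to (6); in particular, the sign convention and the use of b^h_j for the bias of hidden node (j) and b^v_i for the bias of input node (i) are assumptions of this sketch.

```latex
\begin{aligned}
\Delta w_{ij} &= \frac{\varepsilon}{N_c} \sum_{c=1}^{N_c} \left( v_{ci}\, h_{cj} - v'_{ci}\, h'_{cj} \right) \\
\Delta b^{v}_{i} &= \frac{\varepsilon}{N_c} \sum_{c=1}^{N_c} \left( v_{ci} - v'_{ci} \right) \\
\Delta b^{h}_{j} &= \frac{\varepsilon}{N_c} \sum_{c=1}^{N_c} \left( h_{cj} - h'_{cj} \right)
\end{aligned}
```

The bracket in the first line is the negative of the error value (v'h' - vh) used in this description, so the same update can be read as subtracting the batch-averaged error, scaled by ε, from the weight.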

In an artificial neural network, to improve computation time and learning efficiency, weight correction is performed per input batch, that is, per set of input cases (c), each of which consists of a plurality of input data.

Artificial neural network devices that implement artificial neural networks fall broadly into two types. The first is a device based on a general-purpose processor such as a CPU (central processing unit) or GPU (graphics processing unit). The second implements the device by building circuits for synapses, neurons, and the like in the form of an FPGA (field-programmable gate array) or ASIC (application-specific integrated circuit). A device based on a general-purpose processor occupies a small area relative to the number of synapses implemented, is easy to design because it uses an existing processor as is, and can implement various forms of artificial neural network device merely by changing the program. Compared with a device implemented as an FPGA or ASIC, however, it is less efficient at the parallel and distributed processing that characterize artificial neural networks, so its computation is slower; it is also difficult to build as a single chip and consumes more power.

When an artificial neural network is implemented using FPGA or ASIC technology, various forms of artificial neural network device can be designed to suit the intended purpose, and the neural network can be implemented in a form close to the theoretical model.

However, even when an artificial neural network device is implemented using FPGA or ASIC technology, the amount of computation or memory required to implement the artificial neural network algorithm can differ depending on the design. There is therefore strong demand for an optimized computing device for artificial neural networks that increases computation speed and reduces memory.

The present invention is intended to solve the above problems of conventional computing devices for artificial neural networks. An object of the present invention is to provide an optimized artificial neural network computation acceleration device that has a pipeline structure and implements an artificial neural network algorithm.

Another object of the present invention is to provide an artificial neural network computation acceleration device that, through an optimized design, improves computational efficiency while reducing the memory required.

To achieve these objects, a computational acceleration device for an artificial neural network according to an embodiment of the present invention comprises: an input memory unit that stores an input batch, which is a set of input cases each consisting of a plurality of input data; a data processing module in which a plurality of stages, each consisting of a combinational logic circuit and a register, are connected, and in which each stage uses the input batch to correct the weights between input nodes and hidden nodes; and a control unit that generates control signals for feeding the input data of an input case into the data processing module or for controlling the operation of each stage of the data processing module.

Preferably, the data processing module according to the present invention comprises: a first combinational logic circuit that applies the weight (w_ij) to the input data input to the input nodes (i) of the input layer and outputs the output value (h) at the hidden nodes (j) of the hidden layer; a second combinational logic circuit that applies the weight to the output value (h) and outputs the corrected input data (v'); a third combinational logic circuit that applies the weight to the corrected input data (v') and outputs the corrected output value (h'); and a fourth combinational logic circuit that corrects the weight error from the input data (v), the output value (h), the corrected input data (v'), and the corrected output value (h') for all input cases of the input batch.

Preferably, the data processing module according to the present invention further comprises: a fifth combinational logic circuit that corrects the input bias (b^h) from the input data (v) and the corrected input data (v') for all input cases of the input batch; and a sixth combinational logic circuit that corrects the output bias (b^v) from the output value (h) and the corrected output value (h') for all input cases of the input batch.

Preferably, the second combinational logic circuit according to the present invention comprises: a multiplier that multiplies the output value of a hidden node (j) by the weight (w_ji) to compute a product; an accumulator; and an adder that, each time the multiplier computes a product, adds the product to the previous accumulated value stored in the accumulator to compute the current accumulated value. The previous accumulated value stored in the accumulator is then updated to the current accumulated value.

Preferably, the second combinational logic circuit according to the present invention comprises: a first buffer unit that stores a first buffered value obtained by adding the output bias value to the final accumulated value computed by the multiplier and the adder; an activation function unit that sequentially receives the first buffered values stored in the first buffer unit and computes activation function values; a sampling unit that samples the activation function values output from the activation function unit to compute sampled values; and a second buffer unit that stores the sequentially computed sampled values.
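The multiplier, adder, accumulator, buffer, activation, and sampling elements of the second combinational logic circuit can be sketched as a streaming multiply-accumulate loop. This is a software illustration under assumptions, not the circuit itself: the function name, the weight layout (one vector over input nodes per hidden node, here called `w_columns`), and the sample values are hypothetical.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic activation."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reconstruct_streaming(h_stream, w_columns, b_v, rng):
    """Accumulate h_j * w_ji as each h_j arrives, then bias, activate, sample.

    w_columns[j][i] is the weight between input node i and hidden node j.
    """
    n_v = len(b_v)
    acc = [0.0] * n_v                       # one accumulator per input node i
    for h_j, w_col in zip(h_stream, w_columns):
        for i in range(n_v):
            product = h_j * w_col[i]        # multiplier
            acc[i] = acc[i] + product       # adder: previous + product -> current
    # First buffer: final accumulated value plus the bias of input node i.
    first_buffer = [acc[i] + b_v[i] for i in range(n_v)]
    # Activation and sampling, stored in the second buffer.
    second_buffer = [1 if rng.random() < logistic(x) else 0 for x in first_buffer]
    return acc, second_buffer


# Illustrative values: 3 hidden nodes streaming into 2 input nodes.
rng = random.Random(1)
h_stream = [1, 0, 1]
w_columns = [[0.5, 0.1], [9.0, 9.0], [-0.2, 0.3]]
acc, v_prime = reconstruct_streaming(h_stream, w_columns, [0.0, 0.0], rng)
```

In this sketch the weight vector for each hidden node is consumed in the same order in which the h_j values arrive, so no transposed copy of the weight table is built.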

Preferably, the fourth combinational logic circuit according to the present invention comprises: a partial weight calculation unit that, when the corrected output value (h'_j) of a hidden node (j) is output from the third combinational logic circuit, computes from that corrected output value (h'_j) a partial weight correction value between the input node (i) and the hidden node (j); a partial weight storage unit that sequentially stores the partial weight correction value between the input node (i) and the hidden node (j) each time it is computed for the input cases of the input batch; and a correction unit that, once the partial weight correction values between the input node (i) and the hidden node (j) have been computed for all input cases of the input batch, corrects the weight between the input node (i) and the hidden node (j) from the completed partial weight correction value.

Here, the partial weight calculation unit computes the partial weight correction value between the input node (i) and the hidden node (j) along a calculation path selected by a control signal of the control unit, which is generated according to the sequence of input cases in the input batch.

Preferably, the partial weight calculation unit according to an embodiment of the present invention comprises: a first calculation path that computes the partial weight correction value between the input node (i) and the hidden node (j) by multiplying the previous partial weight correction value stored in the partial weight storage unit by a momentum coefficient; a second calculation path that computes the partial weight correction value by adding the weight error value to the previous partial weight correction value stored in the partial weight storage unit; and a third calculation path that computes the partial weight correction value by adding the weight error value to the previous partial weight correction value stored in the partial weight storage unit, subtracting from that sum a first product obtained by multiplying the weight decay coefficient by the previous weight between the input node (i) and the hidden node (j), and multiplying the resulting difference by the learning rate coefficient.

Here, the control unit selects the first calculation path for the first input case of the input batch, the third calculation path for the last input case of the input batch, and the second calculation path for all remaining input cases between the first and the last.
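The path selection described above can be sketched as a single function that picks one of the three calculation paths from the position of the input case in the batch. The function name, parameter names, and numeric values are hypothetical, and the weight error value is passed in as an argument so that the sketch makes no assumption about how it is formed.

```python
def partial_weight_correction(case_index, n_cases, prev_partial, error,
                              prev_weight, momentum, decay, learning_rate):
    """Return the new partial weight correction value for one (i, j) pair.

    case_index is the zero-based position of the input case in the batch.
    If the batch holds a single case, this sketch treats it as the first case.
    """
    if case_index == 0:
        # Path 1: previous partial correction scaled by the momentum coefficient.
        return momentum * prev_partial
    if case_index == n_cases - 1:
        # Path 3: (previous partial + error - decay * previous weight) * learning rate.
        return learning_rate * ((prev_partial + error) - decay * prev_weight)
    # Path 2: previous partial correction plus the weight error value.
    return prev_partial + error


# Illustrative values chosen as exact binary fractions.
first = partial_weight_correction(0, 4, 2.0, 1.0, 1.0, 0.5, 0.5, 0.5)
middle = partial_weight_correction(1, 4, 1.0, 0.5, 1.0, 0.5, 0.5, 0.5)
last = partial_weight_correction(3, 4, 1.5, 0.5, 1.0, 0.5, 0.5, 0.5)
```

With these inputs, `first` is 0.5 × 2.0, `middle` is 1.0 + 0.5, and `last` is 0.5 × ((1.5 + 0.5) − 0.5 × 1.0).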

Preferably, the computational acceleration device for an artificial neural network having a pipeline structure according to the present invention further comprises a standby memory unit for storing the output value (h) generated by the first combinational logic circuit or the corrected input data (v') generated by the second combinational logic circuit.

Preferably, the standby memory unit according to an embodiment of the present invention comprises: a first standby memory unit that stores output values in an amount corresponding to the difference between the combinational logic circuit in which the output value is generated and the combinational logic circuit in which it is used; and a second standby memory unit that stores corrected input data in an amount corresponding to the difference between the combinational logic circuit in which the corrected input data is generated and the combinational logic circuit in which it is used.
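One way to picture a standby memory unit is as a fixed-depth first-in, first-out delay line whose depth equals the stage difference between producer and consumer. The class below is a hedged sketch only: the class name, the stage numbers (output value produced at stage 1 and consumed at stage 4), and the string labels are assumptions made for illustration.

```python
from collections import deque


class StandbyMemory:
    """FIFO that holds a value until the consuming stage is ready for it."""

    def __init__(self, produce_stage, use_stage):
        self.depth = use_stage - produce_stage   # number of values held back
        self.slots = deque()

    def push(self, value):
        """Store a new value; return the value that is now due, or None."""
        self.slots.append(value)
        if len(self.slots) > self.depth:
            return self.slots.popleft()
        return None


# Hypothetical case: h is produced at stage 1 and consumed at stage 4.
mem = StandbyMemory(produce_stage=1, use_stage=4)
released = [mem.push(x) for x in ["h0", "h1", "h2", "h3", "h4"]]
```

Here the first three pushes return `None` while the memory fills, after which each push releases the value stored three pushes earlier.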

The computational acceleration device for an artificial neural network according to the present invention has the following effects.

First, the computational acceleration device for an artificial neural network according to the present invention computes the output value, the corrected input data, and the corrected output value, and performs weight correction, input bias correction, and output bias correction concurrently in a pipeline structure, and can therefore reduce the computation time required for learning.

Second, the computational acceleration device for an artificial neural network according to the present invention computes the corrected input data (v'_i) by accumulating the products of the sequentially generated output values (h) and the weights (w_ji). It therefore needs neither an operation to transpose the weight (w_ij) matrix nor memory to store the transposed weight matrix.

Third, when the fourth combinational logic circuit corrects the weights, the computational acceleration device for an artificial neural network according to the present invention generates a partial weight correction value between the input node (i) and the hidden node (j) each time a corrected output value (h'_j) is generated at the hidden node (j), and corrects the weights by accumulating the partial weight correction values over all sequentially generated input cases. This reduces the computation time and removes the need for memory to store all of the corrected output values for all input cases.

FIG. 1 is a diagram for explaining the concept of the restricted Boltzmann machine model.
FIG. 2 is a diagram for explaining an example of adjusting the weights in the restricted Boltzmann machine model.
FIG. 3 is a diagram for explaining the computational acceleration device for an artificial neural network according to the present invention.
FIG. 4 is a diagram for explaining the data processing module according to the present invention.
FIG. 5 is a functional block diagram for explaining the first combinational logic circuit, which performs the generation process according to the present invention.
FIG. 6 is a functional block diagram for explaining the second combinational logic circuit 220, which performs the reconstruction process according to the present invention.
FIG. 7 is a functional block diagram for explaining the fourth combinational logic circuit according to an embodiment of the present invention.
FIG. 8 is a functional block diagram for explaining an example of the partial weight calculation unit according to the present invention.
FIG. 9 shows an implementation example of the first combinational logic circuit according to the present invention.
FIG. 10 shows an implementation example of the second combinational logic circuit according to the present invention.
FIG. 11 shows an implementation example of the third combinational logic circuit according to the present invention.
FIG. 12 shows an implementation example of the fourth combinational logic circuit according to the present invention.
FIG. 13 shows an implementation example of the fifth combinational logic circuit according to the present invention.
FIG. 14 is a diagram for explaining the timeline of the first to sixth combinational logic circuits according to the present invention.
FIG. 15 is a diagram for explaining the production stage and the use stage of the output value.
FIG. 16 is a diagram for explaining the improved performance of the computational acceleration device for an artificial neural network having a pipeline structure according to the present invention.
FIG. 17 is a diagram for explaining a performance comparison of the artificial neural network computation acceleration device having a pipeline structure according to the present invention.

It should be noted that the technical terms used herein are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise herein, the technical terms used herein should be interpreted as having the meaning generally understood by a person of ordinary skill in the art to which the present invention belongs, and should be interpreted neither in an excessively broad sense nor in an excessively narrow sense. Where a technical term used herein is an incorrect term that does not accurately express the idea of the present invention, it should be understood as replaced by a technical term that a person skilled in the art can correctly understand.

In addition, singular expressions used herein include plural expressions unless the context clearly indicates otherwise. Terms such as "consists of" or "comprises" should not be construed as necessarily including all of the elements or steps described herein; some of those elements or steps may be omitted, and additional elements or steps may be included.

It should also be noted that the accompanying drawings are provided only to make the idea of the present invention easy to understand, and that the idea of the present invention should not be construed as limited by the accompanying drawings.

Hereinafter, the computational acceleration device for an artificial neural network having a pipeline structure according to the present invention is described in more detail with reference to the accompanying drawings.

FIG. 3 is a diagram for explaining the computational acceleration device for an artificial neural network according to the present invention.

Referring to FIG. 3 in more detail, the input memory unit 100 stores the training data to be learned. The training data stored in the input memory unit 100 is divided into input batches, each a set of input cases consisting of a plurality of input data, and is input to the data processing module 200.

In the data processing module 200, a plurality of stages, each consisting of a combinational logic circuit and a register, are connected, and each stage uses the supplied input batch to correct the weights between the input nodes and the hidden nodes.

The control unit 300, in synchronization with a clock signal, generates control signals for feeding the input data of an input case into the data processing module 200 or for controlling the operation of each stage of the data processing module 200. That is, in synchronization with the clock signal, the control unit 300 generates a control signal for sequentially inputting to the data processing module 200 the input data of each input case of the input batch stored in the input memory unit 100, or a control signal for controlling the operation of the combinational logic circuit that constitutes each stage of the data processing module 200.

According to the control signals of the control unit 300, the data processing module 200 computes the error value (v'h' - vh) from the output value (h) of the hidden nodes computed in the generation process, the corrected input data (v') computed in the reconstruction process, and the corrected output value (h') computed in the regeneration process, and corrects the weights used to extract the features of the input data.

FIG. 4 is a diagram for explaining the data processing module according to the present invention.

Referring to FIG. 4, the data processing module according to the present invention includes: a first combinational logic circuit 210 that applies the weights (w_ij) to the input data supplied to the input nodes (i) of the input layer and outputs the output value (h) of the hidden nodes (j) of the hidden layer; a second combinational logic circuit 220 that applies the weights to the output value (h) and outputs the corrected input data (v'); a third combinational logic circuit 230 that applies the weights to the corrected input data (v') and outputs the corrected output value (h'); a fourth combinational logic circuit 240 that corrects the weights from the input data (v), output value (h), corrected input data (v'), and corrected output value (h') of all input cases of the input batch; a fifth combinational logic circuit 250 that corrects the input bias (b^h) from the input data (v) and corrected input data (v') of all input cases of the input batch; and a sixth combinational logic circuit 260 that corrects the output bias (b^v) from the output value (h) and corrected output value (h') of all input cases of the input batch.

To raise performance, the data processing module 200 according to the present invention uses a pipeline structure in which the module is divided into a plurality of combinational logic circuits (210, 220, 230, 240, 250, 260). Depending on the designed stages, a buffer unit that stores the inputs and outputs of the combinational logic circuits may be placed between them. A buffer unit between two combinational logic circuits serves as the output register of the preceding circuit and the input register of the following circuit.

As shown in FIG. 4, the data processing module 200 according to the present invention is a synchronous digital system with a pipeline structure. It performs the data processing of the generation process, the reconstruction process, the regeneration process, the weight correction process, the input bias correction process, and the output bias correction process simultaneously, and it also raises the operating speed (clock speed). This is possible because successive processing steps can run at the same time in each of the six combinational logic circuits (210, 220, 230, 240, 250, 260) into which the data processing module is divided, as shown in FIG. 4.

FIG. 5 is a functional block diagram for explaining the first combinational logic circuit, which performs the generation process according to the present invention.

Referring to FIG. 5 in more detail, the input data (v₁, v₂, ..., v_N) of an input case (c) stored in the first buffer unit 211 are supplied to the respective input nodes. The operation unit 213 multiplies the input data of each input node by the weight (w_ij) between the input node (i) and the hidden node (j), and sums the resulting products over all input data to output a summed value for the hidden node (j). The weights (w_ij) are stored in the memory (w) in matrix form, organized by input case and by input data.

The first summing unit 215 adds the input bias (b^h) for the hidden node (j) to the summed value for the hidden node (j). The activation function unit 217 applies a sigmoid-shaped activation function (for example, the logistic function) to this sum to calculate a probability value between 0 and 1.

The sampling unit 219 compares the probability value calculated by the activation function unit 217 with a reference value and samples it to either 0 or 1, generating the output value (h) for the hidden node (j).
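The generation process described above (multiply and sum, add the bias, apply the logistic function, then sample against a reference value) can be sketched in software as follows. This is a minimal illustration only, not the circuit of FIG. 5: the function names are hypothetical, and the reference values are assumed to be supplied by the caller, since the text does not say how they are produced.

```python
import math

def logistic(x):
    """Logistic activation, mapping a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def generate_hidden(v, w, b_h, reference):
    """Return (probabilities, samples) for every hidden node j.

    v         : input data of one input case, length N
    w         : weights, w[i][j] between input node i and hidden node j
    b_h       : bias added to the weighted sum of each hidden node
    reference : reference values compared with the probabilities
                (assumption: provided by the caller)
    """
    probs, samples = [], []
    for j in range(len(b_h)):
        total = sum(v[i] * w[i][j] for i in range(len(v)))  # multiply and sum
        p = logistic(total + b_h[j])                        # add bias, activate
        probs.append(p)
        samples.append(1 if p > reference[j] else 0)        # sample to 0 or 1
    return probs, samples

probs, h = generate_hidden(
    v=[1, 0],
    w=[[0.0, 2.0], [5.0, 5.0]],
    b_h=[0.0, -2.0],
    reference=[0.4, 0.6],
)
```

In this small example both hidden nodes reach a probability of 0.5, so the sampled output depends only on the reference value each is compared with.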

FIG. 6 is a functional block diagram for explaining the second combinational logic circuit 220, which performs the reconstruction process according to the present invention.

As soon as the first combinational logic circuit 210 generates the output value (h), the second combinational logic circuit 220 generates the corrected input data (v') for the input node (i) from the output value (h). To generate the corrected input data for the input node (i) in the reconstruction process, the transposed weights (w_ji) must be multiplied by the output value of each hidden node (h_j). Ordinarily this means reading the weights between each hidden node (j) and the input node (i) from the memory (w) and keeping a separate memory for the transposed weights. The second combinational logic circuit 220 of the present invention instead uses a multiply-accumulate implementation (L), so it can generate the corrected input data without a separate memory for the transposed weights.

Referring to FIG. 6 in more detail, at the moment the first combinational logic circuit outputs the output value (h), the multiplier 221 multiplies the output value of the hidden node (j) by the weight (w_ji) between the hidden node (j) and the input node (i), and the second summing unit 222 adds this product to the accumulated value previously stored in the storage unit 223 and stores the result back in the storage unit 223. This multiply-accumulate process continues until the output value of the last hidden node has been multiplied by the weight between the last hidden node and the input node and added to the accumulated value stored in the storage unit 223, yielding the final accumulated value.

The third summing unit 224 adds the output bias (b^v) for the input node (i) to the final accumulated value and stores the result in the second buffer unit 225. The values stored in the second buffer unit 225 are fed sequentially to the activation function unit 226 and the sampling unit 227 to generate the corrected input data (v'). The sequentially generated corrected input data (v') are stored in the third buffer unit 229.
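The multiply-accumulate idea can be illustrated with a short software sketch. As each hidden output h_j arrives, its contribution is added to one running total per input node, so the weights are read in their stored layout and no transposed copy is built. The function name, the per-node accumulator list, and the caller-supplied reference values are assumptions made for illustration; the sketch does not reproduce the circuit of FIG. 6.

```python
import math

def reconstruct(h_stream, w, b_v, reference):
    """Return (final accumulated values, corrected input data v').

    h_stream  : hidden outputs h_j, consumed one at a time as they arrive
    w         : weights, w[i][j] between input node i and hidden node j
    b_v       : bias added to the final accumulated value of each input node
    reference : reference values for sampling (assumption: caller-supplied)
    """
    n_inputs = len(b_v)
    acc = [0.0] * n_inputs                  # one stored accumulated value per input node
    for j, h_j in enumerate(h_stream):
        for i in range(n_inputs):
            acc[i] += h_j * w[i][j]         # multiply, add to stored value, store back
    v_corr = []
    for i in range(n_inputs):
        p = 1.0 / (1.0 + math.exp(-(acc[i] + b_v[i])))  # add bias, logistic activation
        v_corr.append(1 if p > reference[i] else 0)     # sample to 0 or 1
    return acc, v_corr

w = [[1, 2, 3], [4, 5, 6]]                  # 2 input nodes, 3 hidden nodes
acc, v_corr = reconstruct(iter([1, 0, 1]), w, b_v=[-4, 0], reference=[0.5, 0.5])
```

The accumulated values equal the product of the transposed weight matrix with h, which is what the reconstruction requires, even though the transposed matrix is never stored.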

The third combinational logic circuit 230 takes the corrected input data (v') stored in the third buffer unit 229 and repeats the same generation process performed in the first combinational logic circuit 210 to generate the corrected output value (h') of each hidden node.

FIG. 7 is a functional block diagram for explaining the fourth combinational logic circuit according to an embodiment of the present invention.

As shown in Equation (4) above, correcting the weights requires the input data (v), output value (h), corrected input data (v'), and corrected output value (h') of every input case in the input batch. The weight calculation must therefore wait until these values have been calculated for all input cases, and memory is needed to store the output value (h), corrected input data (v'), and corrected output value (h') of all input cases.

In the present invention, as soon as the corrected output value (h') is generated at the hidden node (j) for an input case, the fourth combinational logic circuit 240 calculates the partial weight correction values (Δw) between all input nodes and the hidden node for that input case. It does this sequentially for each input case and performs the weight correction once the partial weight correction value of the last input case has been added. This reduces the computation time for the weight correction and makes the weight correction possible without memory for storing the output value (h), corrected input data (v'), and corrected output value (h').

Looking at the fourth combinational logic circuit according to the present invention in more detail with reference to FIG. 7, when the corrected output value (h') of the hidden node (j) is generated, the weight error value calculation unit 241 calculates the weight error values for all input nodes of the input case. The weight error value (E_c) for input case (c) is obtained by subtracting the product of the corrected input data and the corrected output value from the product of the input data and the output value, as in Equation (7) below.

[Equation 7]

E_c = v_ci × h_cj - v'_ci × h'_cj
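As a small worked instance of Equation (7), the sketch below evaluates the weight error value for every pair of input node i and hidden node j of a single input case. The function name and the list-of-lists layout are illustrative assumptions.

```python
def weight_error(v, h, v_corr, h_corr):
    """E[i][j] = v[i]*h[j] - v_corr[i]*h_corr[j] for one input case."""
    return [
        [v[i] * h[j] - v_corr[i] * h_corr[j] for j in range(len(h))]
        for i in range(len(v))
    ]

# One input case with two input nodes and two hidden nodes.
E = weight_error(v=[1, 0], h=[1, 1], v_corr=[1, 1], h_corr=[0, 1])
```

An entry is zero wherever the corrected pair reproduces the original product, and nonzero where the two differ.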

The partial weight calculation unit 242 calculates the partial weight correction value (Δw) between the input node (i) and the hidden node (j) from the weight error value, based on the per-input-case control signal provided by the control unit.

The partial weight storage unit 243 sequentially stores the partial weight correction value each time one is calculated between the input node (i) and the hidden node (j) for an input case of the input batch.

Once the partial weight correction values between the input node (i) and the hidden node (j) have been calculated for all input cases of the input batch, the correction unit 244 corrects the weight between the input node (i) and the hidden node (j) from the completed partial weight correction value.

FIG. 8 is a functional block diagram for explaining an example of the partial weight calculation unit according to the present invention.

Referring to FIG. 8 in more detail, the partial weight calculation unit uses a different calculation path for the partial weight correction value depending on the position of the input case in the sequence. It has three paths: a first calculation path 242-1, which multiplies the previous partial weight correction value stored in the partial weight storage unit 243 by a momentum coefficient (c_m) to calculate the partial weight correction value between the input node (i) and the hidden node (j); a second calculation path 242-2, which adds the weight error value to the previous partial weight correction value stored in the partial weight storage unit to calculate the partial weight correction value for the input node (i); and a third calculation path 242-3, which adds the weight error value to the previous partial weight correction value stored in the partial weight storage unit, subtracts from that sum the product of a weight decay coefficient (W_d) and the previous weight between the input node (i) and the hidden node (j), and multiplies the result by a learning rate coefficient (ε) to calculate the partial weight correction value.

The partial weight calculation unit calculates the partial weight correction value between the input node (i) and the hidden node (j) along the calculation path selected by the control signal of the control unit, which is generated according to the input case sequence of the input batch.

The control unit selects the first calculation path 242-1 for the first input case of the input batch, the third calculation path 242-3 for the last input case of the input batch, and the second calculation path 242-2 for all remaining input cases.
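The path selection can be sketched for a single weight as follows. The sketch follows the wording of the three paths literally. In particular, the first path only scales the stored value by the momentum coefficient, because the text does not state whether the first case's error value is also added there. The function names, the default coefficients, and the handling of a batch with only one input case (treated here as the first path) are assumptions.

```python
def partial_weight_step(case_index, n_cases, dw_prev, error, w_prev,
                        c_m, w_d, eps):
    """Return the new partial weight correction value for one (i, j) pair."""
    if case_index == 0:
        # First path (242-1): momentum coefficient times the stored value.
        return c_m * dw_prev
    if case_index == n_cases - 1:
        # Third path (242-3): add the error, subtract the weight decay term,
        # then scale by the learning rate coefficient.
        return eps * (dw_prev + error - w_d * w_prev)
    # Second path (242-2): add the error to the stored value.
    return dw_prev + error

def run_batch(errors, dw_stored, w_prev, c_m=0.5, w_d=0.1, eps=0.5):
    """Feed the error values of one input batch through the selected paths."""
    dw = dw_stored
    for c, e in enumerate(errors):
        dw = partial_weight_step(c, len(errors), dw, e, w_prev, c_m, w_d, eps)
    return dw

dw = run_batch(errors=[1.0, 2.0, 3.0], dw_stored=4.0, w_prev=10.0)
```

With these example values the stored value 4.0 becomes 2.0 after the first case, 4.0 after the second, and 0.5 × (4.0 + 3.0 - 0.1 × 10.0) = 3.0 after the last. Only one running value per weight is kept, which matches the point that h, v', and h' need not be stored for the whole batch.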

FIG. 9 shows an implementation example of the first combinational logic circuit according to the present invention, and FIG. 10 shows an implementation example of the second combinational logic circuit. The code for generating the output value through the implementation examples of the first combinational logic circuit and part 2A of the second combinational logic circuit shown in FIG. 9 is as follows.

As shown in FIG. 10, the second combinational logic circuit is divided into parts 2A and 2B, with the second buffer unit placed between them. The sum of the final accumulated value (tem₂) stored in the second buffer unit and the output bias (b^v) for the input node (i) is fed sequentially to the activation function unit 226 and the sampling unit 227 to generate the corrected input data (v'). Dividing the second combinational logic circuit into 2A and 2B with the second buffer unit between them makes it possible to generate all corrected input data with a single activation function unit 226 and a single sampling unit 227.

FIG. 11 shows an implementation example of the third combinational logic circuit according to the present invention, FIG. 12 shows an implementation example of the fourth combinational logic circuit, and FIG. 13 shows an implementation example of the fifth combinational logic circuit. The sixth combinational logic circuit, which updates the output bias, can be implemented in a similar way to the fifth combinational logic circuit.

The code for performing the weight correction through the implementation example of the fourth combinational logic circuit shown in FIG. 12 is as follows.

The code for correcting the input bias through the implementation example of the fifth combinational logic circuit shown in FIG. 13 is as follows.

FIG. 14 is a diagram for explaining the timeline of the first to sixth combinational logic circuits according to the present invention.

As shown in FIG. 14, the first to third combinational logic circuits each take N×N_C clocks, and the fourth and fifth combinational logic circuits take (N+1)×N_C clocks, so the weight correction takes (N+2)×N_C clocks overall. Here N is the number of input nodes or output nodes (the two numbers may differ, but for convenience of explanation both are assumed to equal N), and N_C is the number of input cases in the input batch.
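To make these clock counts concrete, the sketch below simply evaluates the three expressions quoted for FIG. 14. The values N = 784 and N_C = 100 are hypothetical and chosen only for illustration.

```python
def pipeline_clocks(n_nodes, n_cases):
    """Evaluate the clock counts quoted for FIG. 14 for given N and N_C."""
    return {
        "stages_1_to_3_each": n_nodes * n_cases,
        "stages_4_and_5_each": (n_nodes + 1) * n_cases,
        "weight_correction_total": (n_nodes + 2) * n_cases,
    }

clocks = pipeline_clocks(n_nodes=784, n_cases=100)  # hypothetical sizes
```

For these sizes the total is 78,600 clocks, only slightly more than the 78,400 clocks of a single one of the first three circuits, which reflects the circuits working in overlap rather than one after another.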

In a conventional artificial neural network computation accelerator, by contrast, separate memories are needed to store all of the input data, output values, corrected input data, and corrected output values, which raises the cost of implementing the accelerator. For example, storing the output value (h) requires N_C × N_h × BW_data-path of memory space.

In the computational acceleration device for an artificial neural network with a pipeline structure according to the present invention, the input data, output values, and corrected input data are stored only from the stage in which each is generated to the stage in which it is used, which reduces the memory space needed to store them. For example, as shown in FIG. 15, the output value (h) is generated in the first combinational logic circuit and used in the second, fourth, and sixth combinational logic circuits. The second and sixth combinational logic circuits use the output value as soon as it is generated, so they need no separate memory space to store it. The stage difference between the first combinational logic circuit, where the output value (h) is generated, and the fourth combinational logic circuit, where it is used, is 2, so overall only enough memory space to store three columns of the output value (h) is needed.
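The memory comparison for the output value (h) can be worked through numerically. The sketch assumes that one "column" holds the N_h output values of a single input case, and the sizes used (100 input cases, 500 hidden nodes, a 16-bit data path) are hypothetical.

```python
def h_storage_bits(columns, n_hidden, bits_per_value):
    """Storage for `columns` columns of h, each holding n_hidden values."""
    return columns * n_hidden * bits_per_value

# Conventional case: one column per input case (N_C x N_h x BW).
conventional = h_storage_bits(columns=100, n_hidden=500, bits_per_value=16)
# Pipelined case: only the three columns mentioned in the text.
pipelined = h_storage_bits(columns=3, n_hidden=500, bits_per_value=16)
```

Under these assumptions the requirement falls from 800,000 bits to 24,000 bits, and the pipelined figure no longer grows with the number of input cases.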

FIG. 16 is a diagram for explaining the improved performance of the computational acceleration device for an artificial neural network with a pipeline structure according to the present invention.

The X axis is the number of input cases used for one weight update, and the Y axis is the resulting performance improvement.

Even with a single input case there is already a marked performance improvement over other techniques. As the number of input cases increases, the performance improvement rises steeply and then begins to saturate. Because the present invention learns in units of an input batch made up of multiple input cases, training with multiple input cases yields a higher performance improvement.

This performance improvement comes from the pipeline structure of the present invention. With one input case, the later pipeline stages are idle while the first input is being received. When the first pipeline stage finishes its computation on the first input, only the second pipeline stage computes and the other stages are idle. With two input cases, by contrast, two stages compute after the first stage's computation. In other words, as the number of input cases increases, more hardware operates at the same time and overall computational performance increases.
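The occupancy argument can be illustrated with a toy model in which every stage takes one equal time slot per input case. That is a simplifying assumption, since the actual stages take different numbers of clocks, and the function names are hypothetical.

```python
def active_stages_per_step(n_stages, n_cases):
    """Number of busy stages in each time slot of an idealized pipeline."""
    steps = n_stages + n_cases - 1
    return [
        sum(1 for s in range(n_stages) if 0 <= t - s < n_cases)
        for t in range(steps)
    ]

def utilization(n_stages, n_cases):
    """Fraction of stage-slots that do useful work over the whole run."""
    active = active_stages_per_step(n_stages, n_cases)
    return sum(active) / (n_stages * len(active))

one_case = active_stages_per_step(6, 1)    # only one stage busy at a time
two_cases = active_stages_per_step(6, 2)   # up to two stages busy at a time
```

In this model the utilization of six stages is 1/6 for one input case and rises toward 1 as cases are added (about 0.95 for 100 cases), which is consistent with the steep rise followed by saturation described for FIG. 16.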

FIG. 17 is a diagram for explaining a performance comparison of the artificial neural network computation acceleration device with a pipeline structure according to the present invention.

Taking the result of implementing the RBM computation unit on an Intel CPU as the baseline (performance 1), the GPU implementation of the RBM computation unit provides a 30-fold improvement in computational performance over the CPU. A prior FPGA implementation of the RBM computation unit provides a 61-fold improvement over the CPU, while the RBM computation accelerator according to the present invention, implemented on the same FPGA device, provides up to 755 times the performance of the CPU, up to 25 times that of the GPU, and up to 12 times or more that of the prior technique on the same FPGA device.
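The relative figures follow from the CPU-normalized numbers by simple division, as the short check below shows. The dictionary keys are illustrative labels, not names used in FIG. 17.

```python
# Performance relative to the CPU baseline, as quoted in the text.
speedup_vs_cpu = {"cpu": 1, "gpu": 30, "fpga_prior": 61, "fpga_proposed": 755}

def relative_speedup(target, baseline):
    """Speedup of `target` over `baseline`, both expressed relative to the CPU."""
    return speedup_vs_cpu[target] / speedup_vs_cpu[baseline]

vs_gpu = relative_speedup("fpga_proposed", "gpu")          # 755 / 30
vs_fpga = relative_speedup("fpga_proposed", "fpga_prior")  # 755 / 61
```

The quotients are roughly 25.2 and 12.4, in line with the stated "up to 25 times" and "up to 12 times or more".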

100: input memory unit 200: data processing module
300: control unit
210: first combinational logic circuit 220: second combinational logic circuit
230: third combinational logic circuit 240: fourth combinational logic circuit
250: fifth combinational logic circuit 260: sixth combinational logic circuit

Claims

an input memory unit configured to store an input batch that is a set of Nc input cases, each comprising a plurality of input data;
a data processing module in which a plurality of stages, each composed of a combinational logic circuit and a memory, are connected, the data processing module being configured to apply weights to the input batch supplied to the input nodes of an input layer, using the input batch; and
a control unit configured to input the input data of the input cases to the data processing module or to generate a control signal for controlling an operation of each stage of the data processing module.

The computational acceleration device for an artificial neural network of claim 1,
wherein the control signal is configured to sequentially input the Nc input data stored in the input memory unit to the data processing module.

The computational acceleration device for an artificial neural network of claim 1,
wherein the performance improvement value increases as the number Nc of input cases in the input batch increases.

The computational acceleration device for an artificial neural network of claim 1,
wherein the data processing module has a pipeline structure.

The computational acceleration device for an artificial neural network of claim 1,
wherein the combinational logic circuit comprises:
a multiplier configured to multiply the output value of a hidden node by a weight; and
a summer configured to update an accumulated value by adding the value calculated by the multiplier to a previous accumulated value stored in an accumulator.

The computational acceleration device for an artificial neural network of claim 1,
wherein the combinational logic circuit comprises:
a first buffer unit configured to store a summed value obtained by adding an output bias value to a final accumulated value calculated through a multiplier and a summer;
an activation function unit configured to sequentially receive the summed values stored in the first buffer unit and calculate an activation function value;
a sampling unit configured to calculate a sampling value of the activation function value output from the activation function unit; and
a second buffer unit configured to store the value calculated by the sampling unit.

The computational acceleration device for an artificial neural network of claim 1,
wherein the memory is configured to have Nc times the memory space so as to store the output values of each of the plurality of stages for processing the Nc input cases.

A computational acceleration device for an artificial neural network, comprising:
an input memory unit configured to store an input batch that is a set of N input cases;
a plurality of stages, each comprising a combinational logic circuit and a memory having N times the memory space corresponding to the input batch, the combinational logic circuit including a multiplier configured to multiply an input value by a weight to calculate a multiplied value, a summer configured to add the multiplied value of the multiplier to a previous accumulated value stored in an accumulator to calculate a current accumulated value, a first buffer unit configured to store a first buffering value obtained by adding an output bias value to the final accumulated value of the summer, an activation function unit configured to receive the first buffering value stored in the first buffer unit and calculate an activation function value, a sampling unit configured to sample the activation function value output from the activation function unit, and a second buffer unit configured to store an output value of the sampling unit;
a data processing module configured to apply weights, in each of the plurality of stages, to the input batch supplied to the input nodes of an input layer, using the input batch; and
a control unit configured to input the input data of the input cases to the data processing module or to generate a control signal for controlling an operation of the data processing module,
wherein N is an integer greater than 2.

The computational acceleration device for an artificial neural network of claim 8,
wherein the performance improvement value increases as the number N of input cases in the input batch increases.

The computational acceleration device for an artificial neural network of claim 8,
wherein the control unit is configured to control an operation of sequentially inputting the input batch stored in the input memory unit to the data processing module.

an input memory unit storing N input batches and having N times the memory space;
a data processing module including a combinational logic circuit and a buffer unit configured to store inputs and outputs of the combinational logic circuit, the combinational logic circuit comprising a multiplier configured to multiply an input value of the input batch by a weight, a summer configured to sum the accumulated products of the multiplier, and a sampling unit configured to apply an activation function to the summed products and sample the result; and
a control unit configured to generate a control signal for sequentially inputting input data into the data processing module for each input case of the input batch stored in the input memory unit,
wherein the performance improvement value increases as the number N of input batches increases.