KR102614909B1

KR102614909B1 - Neural network operation method and apparatus using sparsification

Info

Publication number: KR102614909B1
Application number: KR1020210035050A
Authority: KR
Inventors: 이세환; 심현욱; 이종은
Original assignee: 삼성전자주식회사; 울산과학기술원
Priority date: 2021-03-04
Filing date: 2021-03-18
Publication date: 2023-12-19
Also published as: KR20220125115A

Abstract

A neural network operation method and apparatus using sparsification are disclosed. A neural network operation method according to an embodiment includes: receiving a first activation gradient and a first threshold corresponding to a layer included in a neural network; sparsifying the first activation gradient based on the first threshold; obtaining a second activation gradient by performing a neural network operation based on the sparsified first activation gradient; calculating a second threshold by updating the first threshold based on the second activation gradient; and performing a neural network operation based on the second activation gradient and the second threshold.

Description

Neural network operation method and apparatus using sparsification {NEURAL NETWORK OPERATION METHOD AND APPARATUS USING SPARSIFICATION}

The following embodiments relate to a neural network operation method and apparatus using sparsification.

To process the large computational load of deep neural networks (DNNs) quickly and efficiently, research is being conducted on a wide range of hardware, from graphics processing units (GPUs), multi-GPU systems, field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs) to emerging devices, as well as on software techniques such as model compression and sparsification. Much research is also being conducted on co-design, which considers hardware and software together.

However, applying conventional architectures and algorithms to training is problematic. The first problem is that training requires far more computation and memory than inference. Training also requires higher precision than inference: a drop in inference accuracy can often be recovered through re-training, whereas reduced accuracy during training leads directly to slower training and a lower recognition rate.

DNNs exhibit very high sparsity, which is useful for reducing computation and memory requirements. Sparsity has therefore been widely exploited in inference, but rarely in training.

Conventional gradient sparsification techniques (e.g., top-k) do not reduce memory requirements and require additional preprocessing time (e.g., for sorting). In addition, in conventional sparse matrix multiplication architectures (e.g., the Efficient Inference Engine (EIE)), the irregularity of the elements causes hardware overhead, for example in the controller, that is large relative to the reduction in computation.

A neural network operation method according to an embodiment includes: receiving a first activation gradient and a first threshold corresponding to a layer included in a neural network; sparsifying the first activation gradient based on the first threshold; obtaining a second activation gradient by performing a neural network operation based on the sparsified first activation gradient; calculating a second threshold by updating the first threshold based on the second activation gradient; and performing a neural network operation based on the second activation gradient and the second threshold.

The first activation gradient and the second activation gradient may include a gradient with respect to an input activation, a gradient with respect to a weight, or a gradient with respect to an output activation.

The calculating of the second threshold may include calculating the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

The calculating of the second threshold may include calculating the second threshold by updating the first threshold based on a target sparsity and a sparsity corresponding to a current iteration.

The calculating of the second threshold may include calculating the second threshold by multiplying the first threshold by a value obtained by dividing the target sparsity by the sparsity corresponding to the current iteration.

The calculating of the second threshold may include determining whether the second threshold exceeds a preset limit range, and correcting a second threshold that exceeds the limit range to a value within the limit range.
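The multiplicative update and the range correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `update_threshold`, the bounds `lower` and `upper`, and the `eps` guard against a zero current sparsity are all assumptions introduced here.

```python
def update_threshold(threshold, target_sparsity, current_sparsity,
                     lower=1e-8, upper=1.0, eps=1e-12):
    """Scale the threshold by target/current sparsity, then clamp it."""
    # eps guards against division by zero when nothing was zeroed out
    # (an assumption; the source does not describe this case).
    ratio = target_sparsity / max(current_sparsity, eps)
    new_threshold = threshold * ratio
    # Correct an out-of-range value back into [lower, upper].
    return min(max(new_threshold, lower), upper)


# The current sparsity (0.45) is half the target (0.9), so the threshold doubles.
print(update_threshold(0.01, target_sparsity=0.9, current_sparsity=0.45))
```

When the measured sparsity falls short of the target the ratio exceeds 1 and the threshold grows, so more elements are zeroed in the next iteration; when it overshoots, the threshold shrinks.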

The calculating of the second threshold may include calculating the second threshold by initializing the first threshold based on the second activation gradient.

The performing of the neural network operation may include generating sparse data by sparsifying the second activation gradient based on the second threshold, and performing the neural network operation using the sparse data and dense data.
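As a rough illustration of generating sparse data from a gradient and a threshold, the sketch below keeps only the elements whose magnitude reaches the threshold. Two points are assumptions rather than statements from the source: that the comparison is made on the absolute value, and that the sparse data is represented as (index, value) pairs. The name `sparsify` is hypothetical.

```python
def sparsify(gradient, threshold):
    """Return (index, value) pairs for elements whose magnitude reaches the threshold."""
    # Elements below the threshold are dropped, i.e. treated as zero downstream.
    return [(i, g) for i, g in enumerate(gradient) if abs(g) >= threshold]


grad = [0.5, -0.001, 0.0, -0.2, 0.003]
print(sparsify(grad, threshold=0.01))  # [(0, 0.5), (3, -0.2)]
```

Each element needs a single comparison, so no sorting pass over the gradient is required.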

The dense data may be stored in a plurality of parallelized dense buffers.

The performing of the neural network operation may include performing at least one multiply-accumulate (MAC) operation based on the second activation gradient.

A neural network operation apparatus according to an embodiment may include: a receiver configured to receive a first activation gradient and a first threshold corresponding to a layer included in a neural network; and a processor configured to sparsify the first activation gradient based on the first threshold, obtain a second activation gradient by performing a neural network operation based on the sparsified first activation gradient, calculate a second threshold by updating the first threshold based on the second activation gradient, and perform a neural network operation based on the second activation gradient and the second threshold.

The processor may calculate the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

The processor may calculate the second threshold by updating the first threshold based on a target sparsity and a sparsity corresponding to a current iteration.

The processor may calculate the second threshold by multiplying the first threshold by a value obtained by dividing the target sparsity by the sparsity corresponding to the current iteration.

The processor may determine whether the second threshold exceeds a preset limit range, and correct a second threshold that exceeds the limit range to a value within the limit range.

The processor may calculate the second threshold by initializing the first threshold based on the second activation gradient.

The processor may generate sparse data by sparsifying the second activation gradient based on the second threshold, and perform the neural network operation using the sparse data and dense data.

The processor may perform at least one multiply-accumulate (MAC) operation based on the second activation gradient.

FIG. 1 is a schematic block diagram of a neural network operation apparatus according to an embodiment.
FIG. 2 shows an example implementation of the neural network operation apparatus shown in FIG. 1.
FIG. 3 shows a process of updating a threshold.
FIG. 4 shows another example implementation of the neural network operation apparatus shown in FIG. 1.
FIG. 5 shows an example of a terminal to which the neural network operation apparatus shown in FIG. 1 is applied.
FIG. 6 is a flowchart of the operation of the neural network operation apparatus shown in FIG. 1.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, the actual implementation is not limited to the specific embodiments disclosed, and the scope of this specification includes changes, equivalents, or substitutes that fall within the technical idea described by the embodiments.

Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, identical components are assigned the same reference numerals regardless of the figure numbers, and redundant descriptions thereof are omitted.

FIG. 1 is a schematic block diagram of a neural network operation apparatus according to an embodiment.

Referring to FIG. 1, the neural network operation apparatus 10 may perform a neural network operation using sparsification. The neural network operation apparatus 10 may output a neural network operation result by processing data using a neural network.

Sparsity may refer to the ratio of elements having a value of 0 to all elements. Sparsification may mean skipping specific data when performing a neural network operation. For example, sparsification may mean excluding elements, whether zero or non-zero, from the neural network operation. Sparsity may be found in tensors during the training process. Sparsity may differ depending on the layer included in the neural network.
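The definition of sparsity given above, the share of zero-valued elements among all elements, translates directly into code. The sketch below assumes the tensor has been flattened into a Python list; the function name `sparsity` and the convention of returning 0.0 for an empty input are choices made here for illustration.

```python
def sparsity(tensor):
    """Fraction of elements equal to zero in a flat list."""
    if not tensor:
        return 0.0  # convention chosen here for an empty tensor
    return sum(1 for x in tensor if x == 0) / len(tensor)


# Three of the five elements are zero.
print(sparsity([0.0, 1.5, 0.0, -2.0, 0.0]))
```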

The neural network operation apparatus 10 may sparsify part of the data used in a neural network operation, thereby reducing the amount of computation and resolving load imbalance.

The neural network operation apparatus 10 may train a neural network. The neural network operation apparatus 10 may perform inference based on the trained neural network.

The neural network operation apparatus 10 may perform a neural network operation using an accelerator. The neural network operation apparatus 10 may be implemented inside or outside the accelerator.

The accelerator may include a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or an application processor (AP). The accelerator may also be implemented as a software computing environment, such as a virtual machine.

A neural network (or artificial neural network) may include a statistical learning algorithm, used in machine learning and cognitive science, that mimics biological neurons. A neural network may refer to any model in which artificial neurons (nodes), forming a network through synaptic connections, acquire problem-solving capability by changing the strength of the synaptic connections through learning.

A neuron of a neural network may include a combination of weights or biases. A neural network may include one or more layers, each composed of one or more neurons or nodes. A neural network may infer a desired result from an arbitrary input by changing the weights of its neurons through learning.

The neural network may include a deep neural network (DNN). The neural network may include a convolutional neural network (CNN), a recurrent neural network (RNN), a perceptron, a multilayer perceptron, a feed-forward (FF) network, a radial basis function (RBF) network, a deep feed-forward (DFF) network, a long short-term memory (LSTM), a gated recurrent unit (GRU), an autoencoder (AE), a variational autoencoder (VAE), a denoising autoencoder (DAE), a sparse autoencoder (SAE), a Markov chain (MC), a Hopfield network (HN), a Boltzmann machine (BM), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a deep convolutional network (DCN), a deconvolutional network (DN), a deep convolutional inverse graphics network (DCIGN), a generative adversarial network (GAN), a liquid state machine (LSM), an extreme learning machine (ELM), an echo state network (ESN), a deep residual network (DRN), a differentiable neural computer (DNC), a neural Turing machine (NTM), a capsule network (CN), a Kohonen network (KN), and an attention network (AN).

The neural network operation apparatus 10 may be implemented as a printed circuit board (PCB) such as a motherboard, an integrated circuit (IC), or a system on chip (SoC). For example, the neural network operation apparatus 10 may be implemented as an application processor.

The neural network operation apparatus 10 may also be implemented in a personal computer (PC), a data server, or a portable device.

The portable device may be implemented as a laptop computer, a mobile phone, a smartphone, a tablet PC, a mobile internet device (MID), a personal digital assistant (PDA), an enterprise digital assistant (EDA), a digital still camera, a digital video camera, a portable multimedia player (PMP), a personal navigation device or portable navigation device (PND), a handheld game console, an e-book, or a smart device. The smart device may be implemented as a smart watch, a smart band, or a smart ring.

The neural network operation apparatus 10 includes a receiver 100 and a processor 200. The neural network operation apparatus 10 may further include a memory 300.

The receiver 100 may include a receiving interface. The receiver 100 may receive an activation gradient related to a neural network operation and a threshold for sparsification. For example, the receiver 100 may receive a first activation gradient and a first threshold corresponding to a layer included in the neural network. The receiver 100 may output the received activation gradient and threshold to the processor 200.

The processor 200 may process data stored in the memory 300. The processor 200 may execute computer-readable code (e.g., software) stored in the memory 300 and instructions triggered by the processor 200.

The "processor 200" may be a hardware-implemented data processing device having a circuit with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program.

For example, the hardware-implemented data processing device may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).

The processor 200 may sparsify the first activation gradient based on the first threshold. The processor 200 may obtain a second activation gradient by performing a neural network operation based on the sparsified first activation gradient.

The processor 200 may calculate a second threshold by updating the first threshold based on the second activation gradient. The processor 200 may calculate the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

The processor 200 may calculate the second threshold by initializing the first threshold based on the second activation gradient. The process of initializing the first threshold is described in detail with reference to FIG. 3.

The processor 200 may calculate the second threshold by updating the first threshold based on a target sparsity and a sparsity corresponding to the current iteration. The processor 200 may calculate the second threshold by multiplying the first threshold by a value obtained by dividing the target sparsity by the sparsity corresponding to the current iteration.

The processor 200 may determine whether the second threshold exceeds a preset limit range. The processor 200 may correct a second threshold that exceeds the limit range to a value within the limit range.

The processor 200 may perform a neural network operation based on the second activation gradient and the second threshold. The processor 200 may generate sparse data by sparsifying the second activation gradient based on the second threshold.

The processor 200 may perform the neural network operation using the sparse data and dense data. The dense data may be stored in a plurality of parallelized dense buffers. The sparse data and the dense data are described in detail with reference to FIG. 2.

The processor 200 may perform at least one multiply-accumulate (MAC) operation based on the second activation gradient.

The first activation gradient and the second activation gradient may include a gradient with respect to an input activation, a gradient with respect to a weight, or a gradient with respect to an output activation.

The memory 300 may store instructions (or programs) executable by the processor 200. For example, the instructions may include instructions for executing the operation of the processor and/or the operation of each component of the processor.

The memory 300 may be implemented as a volatile memory device or a non-volatile memory device.

The volatile memory device may be implemented as dynamic random access memory (DRAM), static random access memory (SRAM), thyristor RAM (T-RAM), zero capacitor RAM (Z-RAM), or twin transistor RAM (TTRAM).

The non-volatile memory device may be implemented as electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic RAM (MRAM), spin-transfer torque MRAM (STT-MRAM), conductive bridging RAM (CBRAM), ferroelectric RAM (FeRAM), phase change RAM (PRAM), resistive RAM (RRAM), nanotube RRAM, polymer RAM (PoRAM), nano floating gate memory (NFGM), holographic memory, a molecular electronic memory device, or insulator resistance change memory.

FIG. 2 shows an example implementation of the neural network operation apparatus shown in FIG. 1.

Referring to FIG. 2, the processor 200 may sparsify data related to a neural network. The processor 200 may train the neural network or perform a neural network operation using the sparsified data.

The training process of a neural network may include a forward propagation (feedforward) operation and a backpropagation operation. The forward propagation operation may refer to the process of calculating the value of a loss function by moving from the input layer through the hidden layers toward the output layer while performing operations using the input and weights of each layer.

The backpropagation operation may refer to the process of adjusting (or updating) the weights of the neural network so that the value of the loss function is minimized, given that the input and output of the neural network are known.

While performing the backpropagation operation, the processor 200 may train the neural network while sparsifying the activation gradient by comparing it with a threshold. Using sparsification, the processor 200 may reduce the amount of computation for backpropagation and weight updates, which account for about two thirds of training.

By comparing the activation gradient with the threshold in real time, the processor 200 may substantially reduce memory accesses and computation compared with conventional neural network training methods that require preprocessing.

The processor 200 may determine the threshold so that sparsity remains high and accuracy does not degrade. In an iteration that determines the threshold, the processor 200 may dynamically adjust the threshold for the next iteration using the sparsification distribution of the previous iteration.
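The dynamic adjustment can be pictured as a feedback loop: each iteration sparsifies with the current threshold, measures the sparsity that resulted, and rescales the threshold for the next iteration. The following Python sketch is illustrative only. It assumes the multiplicative target/current rule, uses the measured sparsity as a stand-in for the "sparsification distribution", and reuses one hypothetical gradient in every iteration so the behaviour is easy to follow; `run_threshold_feedback` is a name introduced here.

```python
def run_threshold_feedback(gradient, threshold, target_sparsity, iterations):
    """Re-tune the threshold once per iteration from the sparsity it just produced."""
    history = []
    for _ in range(iterations):
        # Sparsify on the fly: one comparison per element, no sorting pass.
        kept = [g if abs(g) >= threshold else 0.0 for g in gradient]
        current = sum(1 for g in kept if g == 0.0) / len(kept)
        history.append((threshold, current))
        # Multiplicative update; the floor avoids dividing by zero (assumption).
        threshold *= target_sparsity / max(current, 1e-12)
    return history


# Hypothetical gradient: 100 evenly spaced magnitudes, reused every iteration.
gradient = [i / 100 for i in range(1, 101)]
for threshold, current in run_threshold_feedback(gradient, 0.1, 0.9, 6):
    print(f"threshold={threshold:.4f} sparsity={current:.2f}")
```

With this toy input the first threshold zeroes only 9% of the elements, the update overshoots once, and the measured sparsity then settles near the 90% target. Real gradients change between iterations, so the threshold would keep tracking rather than converge to a fixed value.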

뉴럴 네트워크 연산 장치(10)는 희소 버퍼(sparse buffer), MAC 어레이(220), 출력 버퍼(230), 제1 밀집 버퍼(240), 제2 밀집 버퍼(250) 및 희소화기(sparsificator)(210)를 포함할 수 있다. 예를 들어, 희소 버퍼(210), 출력 버퍼(230), 제1 밀 집 버퍼(240) 및 제2 밀집 버퍼(250)는 메모리(예: 도 1의 메모리(300))에 포함될 수 있다.The neural network operation unit 10 includes a sparse buffer, a MAC array 220, an output buffer 230, a first dense buffer 240, a second dense buffer 250, and a sparsificator 210. ) may include. For example, the sparse buffer 210, the output buffer 230, the first dense buffer 240, and the second dense buffer 250 may be included in a memory (eg, memory 300 in FIG. 1).

The MAC array 220 and the sparsifier 260 may be included in a processor (e.g., the processor 200 of FIG. 1). Alternatively, the MAC array 220 may be implemented separately, outside the neural network computing device 10.

The MAC array 220 may process the sparse input received from the sparse buffer 210 serially and parallelize the dense inputs from the first dense buffer 240 and the second dense buffer 250, thereby eliminating the drop in utilization caused by load imbalance.

The processor 200 may perform a neural network operation based on an activation gradient (e.g., a second activation gradient) and a threshold (e.g., a second threshold). The processor 200 may generate sparse data by sparsifying the activation gradient based on the threshold. The sparse data may be stored in the sparse buffer 210.

The processor 200 may perform the neural network operation using the sparse data and dense data. The dense data may be stored in a plurality of parallelized dense buffers (e.g., the first dense buffer 240 and the second dense buffer 250).

The example of FIG. 2 may represent the dataflow in the process of performing the backpropagation operation and the weight update. In the example of FIG. 2, IA, W, GI, GO, and GW denote the input activation, the weight, the gradient with respect to the input activation, the gradient with respect to the output activation, and the gradient with respect to the weight, respectively.

The data written on the left, such as GO, W, and GI, may represent the dataflow during backpropagation, and the data written on the right, such as GO, IA, and GW, may represent the dataflow during the weight update. GI, which is the output of the sparsifier 260, may represent the dataflow during backpropagation.
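As a rough illustration of these two dataflows, the following sketch assumes a fully connected layer whose forward pass is OA = IA · W. The source does not state the layer type, so this is an assumption. Under it, backpropagation computes GI = GO · Wᵀ and the weight update computes GW = IAᵀ · GO. The helper names `matmul`, `transpose`, and `backward_dataflows` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def backward_dataflows(go, w, ia):
    """Assumed fully connected layer: return (GI, GW) from GO, W, and IA."""
    gi = matmul(go, transpose(w))   # backpropagation: GO, W -> GI
    gw = matmul(transpose(ia), go)  # weight update:   GO, IA -> GW
    return gi, gw


ia = [[1.0, 2.0]]                          # 1 sample, 2 inputs
w = [[0.5, -1.0, 0.0], [0.25, 0.0, 2.0]]   # 2 inputs x 3 outputs
go = [[1.0, 0.0, -1.0]]                    # gradient w.r.t. the 3 outputs
gi, gw = backward_dataflows(go, w, ia)
print(gi)  # gradient w.r.t. the 2 inputs
print(gw)  # gradient w.r.t. the 2 x 3 weights
```

Both products share GO as one operand, which is consistent with GO appearing in both the left and right label groups of FIG. 2.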

Although the example of FIG. 2 illustrates an embodiment that sparsifies the input activation gradient, the output activation gradient or the weight gradient may also be sparsified, depending on the embodiment.

The data shown in FIG. 2 may be transferred through DMA (Direct Memory Access). The activation gradients may be loaded into the sparse buffer 210, and the input activations and weights may be loaded into the dense buffers (e.g., the first dense buffer 240 and the second dense buffer 250).

The MAC array 220 may sequentially receive gradients (e.g., the input activation gradient, the output activation gradient, or the weight gradient) from the sparse buffer 210 and perform matrix multiplication with the parallelized dense data.
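The serial-sparse, parallel-dense multiplication described above may be sketched in software as follows. The (row, column, value) storage format and the function names are assumptions made for illustration only. The inner loop stands in for work that the MAC array would perform across its parallel lanes.

```python
def to_sparse(matrix):
    """Keep only nonzero entries as (row, col, value) triples (assumed format)."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]


def sparse_dense_matmul(sparse_entries, n_rows, dense):
    """Compute sparse x dense by visiting the nonzero gradients one at a time."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r, k, v in sparse_entries:       # serial over nonzero gradients
        dense_row = dense[k]
        for j in range(n_cols):          # conceptually parallel across MAC lanes
            out[r][j] += v * dense_row[j]
    return out


grads = [[0.0, 2.0, 0.0], [0.0, 0.0, -1.0]]    # sparsified gradients
dense = [[1.0, 1.0], [3.0, 4.0], [5.0, 6.0]]   # dense operand
entries = to_sparse(grads)
result = sparse_dense_matmul(entries, len(grads), dense)
print(result)
```

Because every nonzero gradient triggers the same amount of dense work, the lanes stay equally loaded regardless of where the zeros fall, which is one way to read the load-balance remark in the description.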

The MAC array 220 may include T multipliers and adders. The value of T may be adjusted according to the hardware implementation.

The sparsifier 260 may calculate statistics of the activation gradient of the next layer, which is the result of the matrix multiplication. The statistics of the activation gradient may include a probability distribution of the activation gradient. For example, the statistics of the activation gradient may include the mean, variance, and standard deviation of the activation gradient.
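A minimal sketch of computing such statistics is shown below. The function name `gradient_statistics` is hypothetical, and the use of the population variance rather than the sample variance is an assumption, since the source does not specify which is meant.

```python
import math
import statistics


def gradient_statistics(gradients):
    """Return the mean, variance, and standard deviation of a flat list of gradients.

    Population (not sample) variance is an assumption made for this sketch.
    """
    mean = statistics.fmean(gradients)
    variance = statistics.pvariance(gradients, mu=mean)
    return {"mean": mean, "variance": variance, "std": math.sqrt(variance)}


stats = gradient_statistics([1.0, -1.0, 3.0, -3.0])
print(stats)
```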

The sparsifier 260 may update the threshold (e.g., the first threshold) based on the calculated statistics. The sparsifier 260 may update the activation gradient and output it to the memory 300 (e.g., DRAM).

FIG. 3 shows the process of updating the threshold.

Referring to FIG. 3, a processor (e.g., the processor 200 of FIG. 1) may set an initial threshold (310). For example, the processor 200 may set the initial threshold by assigning 0 to θ₁.

The processor 200 may calculate an activation gradient (e.g., an input activation gradient) based on the threshold (320). The processor 200 may calculate the activation gradient using a neural network operation as described with reference to FIG. 2.

The processor 200 may sparsify the activation gradient based on the threshold of the current iteration (330). The processor 200 may sparsify the first activation gradient based on the first threshold. For example, the processor 200 may sparsify the activation gradients whose absolute value is smaller than the first threshold.
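The thresholding step may be sketched as follows. The sketch assumes that sparsifying an element means replacing it with zero, which the source implies but does not state. The function name `sparsify` and the reporting of sparsity as the fraction of zero elements are likewise assumptions.

```python
def sparsify(gradients, threshold):
    """Zero out entries whose absolute value is below the threshold.

    Returns the sparsified list and its sparsity (fraction of zero entries).
    """
    kept = [g if abs(g) >= threshold else 0.0 for g in gradients]
    sparsity = kept.count(0.0) / len(kept)
    return kept, sparsity


grads = [0.5, -0.05, 0.2, -0.8, 0.01]
sparse_grads, sparsity = sparsify(grads, threshold=0.1)
print(sparse_grads, sparsity)

# With the initial threshold of 0, no nonzero element is removed.
unchanged, initial_sparsity = sparsify(grads, threshold=0.0)
```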

Before the threshold is updated, the processor 200 may sparsify the activation gradient based on the initial threshold (e.g., 0). After the threshold is updated, the processor 200 may sparsify the activation gradient based on the updated threshold.

The processor 200 may obtain a second activation gradient by performing a neural network operation based on the sparsified first activation gradient. For example, the processor 200 may obtain the second activation gradient through a neural network operation using a MAC array (e.g., the MAC array 220 of FIG. 2). The sparsification of the first activation gradient may refer to the sparsification performed in the previous iteration.

To start the iterations, the processor 200 may determine whether the index i of the iteration loop is 1 (340). When i is 1, the processor 200 may calculate the second threshold by initializing the first threshold (350).

The processor 200 may calculate the second threshold by initializing the first threshold based on the second activation gradient. For example, the processor 200 may calculate the second threshold by initializing the first threshold using Equation 1.

Here, θ_{i+1} may denote the threshold for the next iteration (e.g., the second threshold).

When i is not 1, the processor 200 may calculate the second threshold by updating the first threshold (360). The processor 200 may calculate the second threshold by updating the first threshold based on a target sparsity and the sparsity corresponding to the current iteration.

The processor 200 may calculate the second threshold by multiplying the first threshold by the value obtained by dividing the target sparsity by the sparsity corresponding to the current iteration. For example, the processor 200 may calculate the second threshold using Equation 2.

Here, s* may denote the target sparsity, and s_i may denote the sparsity of the current iteration.
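Equation 2 itself is not reproduced in this text. Based on the verbal description above (the first threshold multiplied by the target sparsity divided by the current sparsity), the update may be written as follows. This is a reconstruction and may differ in notation from the original equation.

```latex
\theta_{i+1} = \theta_i \cdot \frac{s^{*}}{s_i}
```

Under this form, the threshold rises when the current sparsity s_i is below the target s* and falls when it is above the target. A purely multiplicative update would also leave a zero threshold at zero, which may be why the first iteration initializes the threshold separately.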

The processor 200 may determine whether the second threshold falls outside a preset limit range (370). The processor 200 may correct a second threshold that falls outside the limit range to a value within the limit range. For example, the limit range may extend from a value 20% smaller than the first threshold to a value 20% larger than the first threshold. The limit range may differ depending on the embodiment.
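The ratio-based update together with the limit-range correction may be sketched as below. The function name `update_threshold`, the `limit` parameter, clamping to the nearest bound, and the handling of a current sparsity of zero are assumptions. The source states only that an out-of-range value is corrected to a value within the range.

```python
def update_threshold(theta, target_sparsity, current_sparsity, limit=0.2):
    """Scale theta by target/current sparsity, then clamp to within +/-limit of theta."""
    lower = theta * (1.0 - limit)
    upper = theta * (1.0 + limit)
    if current_sparsity <= 0.0:
        # Assumption: with no sparsity at all, raise the threshold as far as allowed.
        return upper
    proposed = theta * target_sparsity / current_sparsity
    return min(max(proposed, lower), upper)


raised = update_threshold(0.1, target_sparsity=0.9, current_sparsity=0.6)
lowered = update_threshold(0.1, target_sparsity=0.5, current_sparsity=0.9)
within = update_threshold(0.1, target_sparsity=0.9, current_sparsity=0.85)
print(raised, lowered, within)
```

In the first call the unclamped proposal would be 0.15, so the result is held at the upper bound of about 0.12. In the second the proposal of about 0.056 is held at the lower bound of about 0.08.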

Compared with a method that sorts the activation gradients and selects the top k, the threshold update process described above reduces the dependency between activation gradients. The processor 200 may therefore sparsify partial activation gradient results as soon as they are computed, which saves communication with the memory (e.g., the memory 300 of FIG. 1) and simplifies the hardware implementation.
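The difference in dependency may be illustrated with the following sketch, in which both function names are hypothetical. The top-k baseline must see every gradient before it can emit any result, whereas the threshold version decides each element on its own and can therefore run on a stream of partial results.

```python
import heapq


def topk_sparsify(gradients, k):
    """Baseline: keep the k largest-magnitude entries; needs the whole list first."""
    keep = set(heapq.nlargest(k, range(len(gradients)),
                              key=lambda i: abs(gradients[i])))
    return [g if i in keep else 0.0 for i, g in enumerate(gradients)]


def threshold_sparsify_stream(gradient_stream, threshold):
    """Decide each element independently, emitting results as they arrive."""
    for g in gradient_stream:
        yield g if abs(g) >= threshold else 0.0


grads = [0.5, -0.05, 0.2, -0.8, 0.01]
by_topk = topk_sparsify(grads, k=2)
by_threshold = list(threshold_sparsify_stream(iter(grads), threshold=0.3))
print(by_topk, by_threshold)
```

For this input the two approaches keep the same elements, but only the streaming version avoids buffering and ranking the full set of gradients.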

The processor 200 may determine whether i is equal to a predetermined iteration value (380). When i differs from the predetermined iteration value, the processor 200 may add 1 to i (390). The processor 200 may then perform step 320 again based on the updated threshold (the second threshold).

When i is equal to the predetermined iteration value, the processor 200 may complete the update of the threshold. The processor 200 may share the updated threshold with several components of the neural network. For example, the processor 200 may share the updated threshold with some of the layers of the neural network.
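The overall loop of FIG. 3 may be sketched as follows, under several stated assumptions. Equation 1 is not available in this text, so the initialization is passed in as a caller-supplied function `init_threshold`, and the mean absolute value used in the usage lines is only a hypothetical stand-in. The gradients of each iteration are supplied as a fixed list rather than being recomputed from the previously sparsified gradients, and the number of batches plays the role of the predetermined iteration value.

```python
def run_threshold_updates(gradient_batches, target_sparsity, init_threshold, limit=0.2):
    """Return the threshold chosen after each iteration of the FIG. 3 loop (sketch)."""
    theta = 0.0                                               # step 310
    history = []
    for i, grads in enumerate(gradient_batches, start=1):     # step 320 (given here)
        kept = [g if abs(g) >= theta else 0.0 for g in grads] # step 330
        sparsity = kept.count(0.0) / len(kept)
        if i == 1:                                            # steps 340 and 350
            theta = init_threshold(grads)                     # stand-in for Equation 1
        else:                                                 # steps 360 and 370
            ratio = target_sparsity / sparsity if sparsity > 0 else 1.0 + limit
            lower, upper = theta * (1.0 - limit), theta * (1.0 + limit)
            theta = min(max(theta * ratio, lower), upper)
        history.append(theta)                                 # steps 380 and 390
    return history


def mean_abs(grads):
    """Hypothetical initialization, NOT the patent's Equation 1."""
    return sum(abs(g) for g in grads) / len(grads)


batches = [
    [0.4, -0.2, 0.1, -0.1],
    [0.5, -0.1, 0.05, 0.3],
    [0.3, -0.1, 0.2, 0.1],
]
history = run_threshold_updates(batches, target_sparsity=0.75, init_threshold=mean_abs)
print(history)
```

In this run the second iteration observes a sparsity of 0.5, so the proposed threshold of 0.3 is limited to about 0.24, and the third iteration reaches the target sparsity and leaves the threshold unchanged.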

FIG. 4 shows another example of an implementation of the neural network computing device shown in FIG. 1.

Referring to FIG. 4, the neural network computing device 10 may include a first sparse buffer 410, a second sparse buffer 420, a controller 430, an operation tile 440, and a sparsifier 450.

The first sparse buffer 410 may include read and write ports. The first sparse buffer 410 may output the sparsified activation gradient to the operation tile 440 through a local bus.

The second sparse buffer 420 may include read and write ports. The second sparse buffer 420 may output the sparsified activation gradient to the operation tile 440 through a local bus.

The operations of the first sparse buffer 410 and the second sparse buffer 420 may be the same as that of the sparse buffer 210 of FIG. 2.

The controller 430 may distribute signals for data processing as well as DRAM and buffer addresses. The controller 430 may be included in a processor (e.g., the processor 200 of FIG. 1).

The operation tile 440 may include a plurality of parallel MAC operators. The operation of the plurality of parallel MAC operators included in the operation tile 440 may be the same as that of the MAC array 220 of FIG. 2.

The sparsifier 450 may sparsify the activation gradient while updating the threshold. The sparsifier 450 may be included in the processor 200. The operation of the sparsifier 450 may be the same as that of the sparsifier 260 of FIG. 2.

The processor 200 may sparsify one input or a plurality of inputs on which matrix multiplication is performed. Accordingly, the processor 200 may sparsify different types of data and perform matrix multiplication between the sparsified data.

The processor 200 may perform neural network inference that uses matrix multiplication between sparsified data. The processor 200 may perform various neural network operations including parallel multiplication and addition.

FIG. 5 shows an example of a terminal to which the neural network computing device shown in FIG. 1 is applied.

Referring to FIG. 5, the neural network computing device 540 may be implemented in any terminal 500. The terminal 500 may include a camera 510, a processor 520, an image classifier 530, and the neural network computing device 540. The processor 520 may control the operation of the terminal 500.

The neural network computing device 540 may perform neural network training and inference on lightweight hardware by using sparsification.

In general, because the terminal 500 (e.g., a smartphone) has a limited energy budget, it is difficult for it to handle training, which requires far more computation than inference. However, the neural network computing device 540 may save energy on mobile devices by using the sparsification method described above to greatly reduce the amount of computation and memory access.

The image classifier 530 may perform image inference. Although the image classifier 530 is shown separately in the example of FIG. 5, a separate image classifier 530 may not be needed when inference is performed using the same type of neural network operation as the neural network computing device 10 (e.g., matrix multiplication between sparse data and dense data).

FIG. 6 shows a flowchart of the operation of the neural network computing device shown in FIG. 1.

Referring to FIG. 6, the receiver 100 may receive a first activation gradient and a first threshold corresponding to a layer included in the neural network (610).

The processor 200 may sparsify the first activation gradient based on the first threshold (630).

The processor 200 may obtain a second activation gradient by performing a neural network operation based on the sparsified first activation gradient (650).

The processor 200 may calculate a second threshold by updating the first threshold based on the second activation gradient (670). The processor 200 may calculate the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

The processor 200 may calculate the second threshold by initializing the first threshold based on the second activation gradient.

The processor 200 may perform a neural network operation based on the second activation gradient and the second threshold (690). The processor 200 may generate sparse data by sparsifying the second activation gradient based on the second threshold.

The processor 200 may perform the neural network operation using the sparse data and dense data. The dense data may be stored in a plurality of parallelized dense buffers.

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device; however, those skilled in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and constructed for the embodiment or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those skilled in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A neural network operation method comprising:
receiving a first activation gradient and a first threshold corresponding to a layer included in a neural network;
sparsifying the first activation gradient based on the first threshold;
obtaining a second activation gradient by performing a neural network operation based on the sparsified first activation gradient;
calculating a second threshold by updating the first threshold based on the second activation gradient; and
performing a neural network operation based on the second activation gradient and the second threshold,
wherein the neural network operation includes a matrix multiplication operation, and
wherein the calculating of the second threshold includes calculating the second threshold by multiplying the first threshold by a value obtained by dividing a target sparsity by a sparsity corresponding to a current iteration.

The method of claim 1,
wherein the first activation gradient and the second activation gradient include a gradient with respect to an input activation, a gradient with respect to a weight, or a gradient with respect to an output activation.

The method of claim 1,
wherein the calculating of the second threshold includes calculating the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

delete

The method of claim 1,
wherein the calculating of the second threshold includes:
determining whether the second threshold falls outside a preset limit range; and
correcting the second threshold that falls outside the limit range to a value within the limit range.

The method of claim 1,
wherein the calculating of the second threshold includes calculating the second threshold by initializing the first threshold based on the second activation gradient.

The method of claim 1,
wherein the performing of the neural network operation includes:
generating sparse data by sparsifying the second activation gradient based on the second threshold; and
performing the neural network operation using the sparse data and dense data.

The method of claim 8,
wherein the dense data is stored in a plurality of parallelized dense buffers.

The method of claim 1,
wherein the performing of the neural network operation includes performing at least one MAC (Multiply Accumulate) operation based on the second activation gradient.

A neural network computing device comprising:
a receiver configured to receive a first activation gradient and a first threshold corresponding to a layer included in a neural network; and
a processor configured to:
sparsify the first activation gradient based on the first threshold,
obtain a second activation gradient by performing a neural network operation based on the sparsified first activation gradient,
calculate a second threshold by updating the first threshold based on the second activation gradient, and
perform a neural network operation based on the second activation gradient and the second threshold,
wherein the neural network operation includes a matrix multiplication operation, and
wherein the processor is configured to calculate the second threshold by multiplying the first threshold by a value obtained by dividing a target sparsity by a sparsity corresponding to a current iteration.

The device of claim 11,
wherein the first activation gradient and the second activation gradient include a gradient with respect to an input activation, a gradient with respect to a weight, or a gradient with respect to an output activation.

The device of claim 11,
wherein the processor is configured to calculate the second threshold by repeatedly updating the first threshold for a predetermined number of iterations.

delete

The device of claim 11,
wherein the processor is configured to:
determine whether the second threshold falls outside a preset limit range, and
correct the second threshold that falls outside the limit range to a value within the limit range.

The device of claim 11,
wherein the processor is configured to calculate the second threshold by initializing the first threshold based on the second activation gradient.

The device of claim 11,
wherein the processor is configured to:
generate sparse data by sparsifying the second activation gradient based on the second threshold, and
perform the neural network operation using the sparse data and dense data.

The device of claim 18,
wherein the dense data is stored in a plurality of parallelized dense buffers.

The device of claim 11,
wherein the processor is configured to perform at least one MAC (Multiply Accumulate) operation based on the second activation gradient.