KR102256288B1

KR102256288B1 - Pruning-based training method and system for acceleration hardware of an artificial neural network

Info

Publication number: KR102256288B1
Application number: KR1020210036119A
Authority: KR
Inventors: 오진욱
Original assignee: 리벨리온 주식회사
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2021-05-27
Also published as: KR20220131123A

Abstract

Disclosed is a pruning-based training method for acceleration hardware of an artificial neural network. The method comprises: a step in which a processing unit identifies a SIMD width of a processing element included in an accelerator; a step in which the processing unit initializes the weights of the artificial neural network; a step in which the processing unit groups weights arranged in an input channel and an output channel of the artificial neural network into weight groups according to the identified SIMD width; a step in which the processing unit prunes the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups; and a step in which the processing unit updates the plurality of pruned weights. The pruning step comprises: a step in which the processing unit extracts representative weights from each of the weight groups; a step in which the processing unit compares each of the representative weights with a corresponding threshold; and a step in which, when any one of the representative weights is less than its corresponding threshold, the processing unit sets all of the weights belonging to the weight group corresponding to that representative weight to zero. Accordingly, the method according to the present invention makes it possible to save energy in the accelerator.

Description

{Pruning-based training method and system for acceleration hardware of an artificial neural network}

The present invention relates to a pruning-based training method and system for acceleration hardware of an artificial neural network and, more particularly, to a pruning-based training method and system for acceleration hardware of an artificial neural network that improve computation speed and reduce energy consumption in the acceleration hardware.

Artificial neural networks are used in a variety of artificial intelligence applications such as computer vision, natural language processing, and speech recognition. The operation of an artificial neural network can be divided into a training stage and an inference stage. In the training stage, the parameters of the artificial neural network are learned using training data. In the inference stage, new input data is applied to the trained artificial neural network, and the prediction result of the artificial neural network is output.

The training operation is performed on a CPU or a GPU. The inference operation is performed on an AI accelerator, which is acceleration hardware specially designed to accelerate artificial intelligence applications.

As the size of an artificial neural network increases in order to perform more complex operations and improve accuracy, energy consumption may increase and bottlenecks may occur.

Korean Patent Application Publication No. 10-2020-0093404 (2020.08.05.)

The technical problem to be solved by the present invention is to address the conventional problems described above by providing a pruning-based training method and system for acceleration hardware of an artificial neural network that can reduce energy consumption and improve computation speed by reducing the size of the artificial neural network in its training stage.

A pruning-based training method for acceleration hardware of an artificial neural network according to an embodiment of the present invention includes: identifying, by a processing unit, a SIMD (Single Instruction Multiple Data) width of a processing element included in an accelerator; initializing, by the processing unit, weights of the artificial neural network; grouping, by the processing unit, weights arranged in an input channel and an output channel of the artificial neural network into weight groups according to the identified SIMD width; pruning, by the processing unit, the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups; and updating, by the processing unit, the pruned plurality of weights.

The processing unit may be a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

When the SIMD width is 2, two weights among the plurality of weights form one weight group.

When the SIMD width is 4, four weights among the plurality of weights form one weight group.

When the SIMD width is 8, eight weights among the plurality of weights form one weight group.
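
The grouping rule for SIMD widths of 2, 4, and 8 can be illustrated with a minimal sketch. It assumes that the weights along one channel direction are held in a flat Python list whose length is a multiple of the SIMD width; the function name `group_weights` and the sample values are hypothetical and do not come from the patent.

```python
def group_weights(weights, simd_width):
    """Split a flat list of weights into consecutive groups of simd_width."""
    if simd_width <= 0:
        raise ValueError("simd_width must be positive")
    if len(weights) % simd_width != 0:
        # Assumption made for this sketch: no partial trailing group.
        raise ValueError("number of weights must be a multiple of simd_width")
    return [weights[i:i + simd_width]
            for i in range(0, len(weights), simd_width)]


# Hypothetical weights along one channel direction.
weights = [0.5, -0.1, 0.03, 0.8, -0.02, 0.01, 0.4, -0.6]

groups_w2 = group_weights(weights, 2)  # four groups of two weights
groups_w4 = group_weights(weights, 4)  # two groups of four weights
groups_w8 = group_weights(weights, 8)  # one group of eight weights
```

The same eight weights yield four, two, or one group depending on the width, which is why the SIMD width has to be identified before grouping.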

The step of pruning, by the processing unit, the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups includes: extracting, by the processing unit, representative weights from each of the weight groups; comparing, by the processing unit, each of the representative weights with a corresponding one of the thresholds; and, when any one of the representative weights is less than its corresponding threshold, setting, by the processing unit, all of the weights belonging to the weight group corresponding to that representative weight to 0.

The pruning step may further include, when another one of the representative weights is greater than its corresponding threshold, maintaining, by the processing unit, all of the weights belonging to the weight group corresponding to that other representative weight.
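
The extract, compare, and zero-or-keep steps can be sketched as follows. This passage does not say how a representative weight is chosen, so the sketch assumes the largest absolute value in the group; it also assumes that a group whose representative equals the threshold is kept. The names `max_abs` and `prune_groups` and the sample values are hypothetical.

```python
def max_abs(group):
    """Assumed representative weight: the largest magnitude in the group."""
    return max(abs(w) for w in group)


def prune_groups(groups, thresholds, representative=max_abs):
    """Zero every group whose representative is below its threshold."""
    if len(groups) != len(thresholds):
        raise ValueError("one threshold per weight group is required")
    pruned = []
    for group, threshold in zip(groups, thresholds):
        if representative(group) < threshold:
            # Representative is below the threshold: zero the whole group.
            pruned.append([0.0] * len(group))
        else:
            # Otherwise every weight in the group is kept unchanged.
            pruned.append(list(group))
    return pruned


# Hypothetical groups for a SIMD width of 2, with one threshold per group.
groups = [[0.5, -0.1], [0.03, -0.02], [0.01, 0.4], [-0.04, 0.02]]
thresholds = [0.05, 0.05, 0.05, 0.05]
pruned = prune_groups(groups, thresholds)
```

Note that the small weight 0.01 in the third group survives because its group's representative (0.4) is above the threshold: the decision is made per group, not per weight.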

In the input channel and the output channel of the artificial neural network, weights having a value of 0 appear consecutively in runs whose length corresponds to the identified SIMD width.
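
A short sketch shows what SIMD-aligned runs of zeros look like in practice. It flags each aligned group that is entirely zero; the premise that such a group lets the accelerator skip one SIMD operation is an assumption of this illustration, not a statement taken from this passage. The function name `zero_group_flags` and the sample values are hypothetical.

```python
def zero_group_flags(weights, simd_width):
    """Return one flag per aligned group: True if every weight in it is 0."""
    if simd_width <= 0 or len(weights) % simd_width != 0:
        raise ValueError("length must be a positive multiple of simd_width")
    return [all(w == 0 for w in weights[i:i + simd_width])
            for i in range(0, len(weights), simd_width)]


# Hypothetical channel weights after group-aware pruning with width 2.
channel_weights = [0.5, -0.1, 0.0, 0.0, 0.01, 0.4, 0.0, 0.0]

flags_w2 = zero_group_flags(channel_weights, 2)
zero_groups_w2 = sum(flags_w2)  # groups that are entirely zero

# The same weights viewed with width 4: no aligned group is entirely zero.
flags_w4 = zero_group_flags(channel_weights, 4)
```

The comparison between the two widths illustrates why pruning has to use the width of the target processing element: zeros produced for one width do not necessarily form whole groups for another.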

The processing element includes a weight register.

The weight register is implemented as a matrix.

The weights arranged in the output channel of the artificial neural network are inserted in order along the rows of the matrix. The weights arranged in the input channel of the artificial neural network are inserted in order along the columns of the matrix.

The pruning-based training method for acceleration hardware of an artificial neural network may further include updating, by the processing unit, the weights of the artificial neural network and the thresholds based on the gradient of the loss function and the gradient of the loss function related to the thresholds.
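
As a rough illustration of updating both the weights and the thresholds from gradients, the sketch below applies one plain gradient-descent step to each. The update rule, the learning rate, and the way the gradients are obtained are not specified in this passage, so all of them are assumptions; the gradient values are placeholders rather than results of any real loss function, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(values, grads, learning_rate):
    """Move each value one step against its gradient (plain gradient descent)."""
    if len(values) != len(grads):
        raise ValueError("values and grads must have the same length")
    return [v - learning_rate * g for v, g in zip(values, grads)]


weights = [0.5, -0.1, 0.03, -0.02]
thresholds = [0.05, 0.05]  # one threshold per group of two weights

# Placeholder gradients of the loss with respect to weights and thresholds.
grad_weights = [0.1, -0.2, 0.0, 0.3]
grad_thresholds = [-0.5, 1.0]

learning_rate = 0.1  # assumed hyperparameter
new_weights = sgd_step(weights, grad_weights, learning_rate)
new_thresholds = sgd_step(thresholds, grad_thresholds, learning_rate)
```

Treating the thresholds as trainable values means that the amount of pruning per group is learned together with the weights instead of being fixed in advance.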

A pruning-based training system for acceleration hardware of an artificial neural network according to an embodiment of the present invention includes a memory that stores instructions and a processing unit that executes the instructions.

The instructions are implemented to identify the SIMD width of a processing element included in an accelerator, initialize the weights of the artificial neural network, group the weights arranged in the input channel and the output channel of the artificial neural network into weight groups according to the identified SIMD width, prune the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups, and update the pruned plurality of weights.

The processing unit may be a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

The instruction for pruning the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups extracts a representative weight from each of the weight groups and compares each of the representative weights with a corresponding one of the thresholds.

When any one of the representative weights is less than its corresponding threshold, all of the weights belonging to the weight group corresponding to that representative weight are set to 0.

The instruction for pruning the plurality of weights included in the artificial neural network in a manner that is aware of the weight groups is implemented such that, when another one of the representative weights is greater than its corresponding threshold, all of the weights belonging to the weight group corresponding to that other representative weight are maintained.

The instructions are implemented to update the weights of the artificial neural network and the thresholds based on the gradient of the loss function and the gradient of the loss function related to the thresholds.

In the pruning-based training method and system for acceleration hardware of an artificial neural network according to an embodiment of the present invention, pruning is not performed in the accelerator after training of the weights of the artificial neural network has finished. Instead, pruning is performed in the training stage of the weights in consideration of the SIMD (Single Instruction Multiple Data) width of the processing unit included in the accelerator, which makes it possible to provide weights that flexibly fit the hardware structure of the accelerator.

In addition, the pruning-based training method and system for acceleration hardware of an artificial neural network according to an embodiment of the present invention group the weights arranged in the input channel and the output channel of the artificial neural network and train the grouped weights according to the SIMD width of the processing unit included in the accelerator, which makes it possible to improve computation speed and reduce energy consumption in the accelerator.

To provide a fuller understanding of the drawings cited in the detailed description of the present invention, a description of each drawing is provided.
FIG. 1 is a block diagram of a system according to an embodiment of the present invention.
FIG. 2 is a block diagram of an artificial neural network for explaining the operation of the processing unit shown in FIG. 1.
FIG. 3 shows input activations and weights of a hidden layer of the artificial neural network shown in FIG. 2.
FIG. 4 shows the 4D weight tensor shown in FIG. 3 and a 2D weight matrix.
FIG. 5 is a block diagram of the accelerator shown in FIG. 1.
FIG. 6 is an internal block diagram of the processing element shown in FIG. 5.
FIG. 7 is a flowchart illustrating a method of operating the system shown in FIG. 1.
FIG. 8 is another internal block diagram of the processing element shown in FIG. 5.
FIG. 9 is an internal block diagram of the plurality of processing elements shown in FIG. 5.
FIG. 10 is a flowchart illustrating another method of operating the system shown in FIG. 1.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only for the purpose of describing those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described in this specification.

Because the embodiments according to the concept of the present invention may be variously modified and may take many forms, the embodiments are illustrated in the drawings and described in detail in this specification. However, this is not intended to limit the embodiments according to the concept of the present invention to the specific disclosed forms, and the embodiments include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or coupled to the other component, but that intervening components may also be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this specification.

Hereinafter, the present invention will be described in detail by describing preferred embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system according to an embodiment of the present invention. FIG. 2 is a block diagram of an artificial neural network for explaining the operation of the processing unit shown in FIG. 1.

Referring to FIGS. 1 and 2, the system 100 is an electronic device, such as a smartphone, a notebook computer, a tablet PC, a PC, or a server, that is capable of improving computation speed and reducing energy consumption. The system 100 includes a CPU (Central Processing Unit) 10, a GPU (Graphics Processing Unit) 20, a memory 30, and an accelerator 200. The system 100 shown in FIG. 1 represents one embodiment for describing the present invention. Depending on the embodiment, the system 100 may be implemented in various ways. For example, the system 100 may further include an input device (not shown) or a transceiver (not shown). The CPU 10, the GPU 20, the memory 30, and the accelerator 200 are connected to a bus 101 and exchange instructions and data through the bus 101.

The CPU 10 executes instructions for training the parameters of the artificial neural network 103, namely its weights and bias vectors. Depending on the embodiment, training of the artificial neural network 103 may be performed on the GPU 20.

The memory 30 stores the instructions executed by the CPU 10 or the GPU 20. In the present invention, the processing unit refers to the CPU 10 or the GPU 20.

The accelerator 200 performs inference of the artificial neural network 103 after training of the artificial neural network 103 has finished. Depending on the embodiment, the accelerator 200 may be referred to as an AI accelerator. Depending on the embodiment, the accelerator 200 may also be referred to as acceleration hardware.

Referring to FIG. 2, the artificial neural network 103 includes an input layer 105, a hidden layer 107, and an output layer 109.

The input layer 105 receives each value. The input layer 105 duplicates each value and transmits each duplicated value to the hidden layer 107. These values may be referred to as input activations. The memory 30 stores each value.

The hidden layer 107 multiplies each value by the weights and outputs a weighted sum. The weighted sum may be referred to as an output activation. The memory 30 stores the weights and the weighted sum.
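
The weighted sum computed by the hidden layer can be written out in a few lines. This is a minimal sketch with hypothetical names and values; the bias vectors mentioned earlier and any activation function are left out for brevity.

```python
def weighted_sum(inputs, weights):
    """Multiply each input activation by its weight and accumulate."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = 0.0
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


def hidden_layer(inputs, weight_rows):
    """Compute one weighted sum (output activation) per hidden node."""
    return [weighted_sum(inputs, row) for row in weight_rows]


inputs = [1.0, 2.0, 3.0]          # input activations
weight_rows = [
    [0.5, 0.0, -1.0],             # weights of the first hidden node
    [0.0, 0.0, 0.0],              # a node whose weights were all pruned
]
outputs = hidden_layer(inputs, weight_rows)
```

A node whose weights are all zero contributes only multiplications by zero, which is the kind of work that pruning aims to avoid.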

Depending on the embodiment, the hidden layer 107 may apply the weighted sum to an activation function. In the training stage of the artificial neural network 103, the multiplication of each value by the weights is performed on the CPU 10 or the GPU 20. In the inference stage of the artificial neural network 103, the multiplication of each value by the weights is performed on the accelerator 200.

The output layer 109 generates an output value by applying the weighted sum received from the hidden layer 107 to an activation function.

Each of the input layer 105, the hidden layer 107, and the output layer 109 includes nodes. Depending on the embodiment, the number of nodes in the input layer 105, the hidden layer 107, and the output layer 109, and the number of layers in the hidden layer 107, may vary. As the number of nodes in the input layer 105, the hidden layer 107, and the output layer 109 and the number of layers in the hidden layer 107 increase, the size of the artificial neural network 103 increases. An increase in the size of the artificial neural network 103 causes increased energy consumption and bottlenecks. Therefore, the size of the artificial neural network 103 needs to be reduced in the training stage in order to improve energy efficiency and computation speed. In addition, as the number of values and weights of the artificial neural network 103 increases, the storage space required in the memory 30 increases. Therefore, storage space in the memory 30 needs to be saved in the training stage of the artificial neural network 103.

FIG. 3 shows input activations and weights of a hidden layer of the artificial neural network shown in FIG. 2.

Referring to FIGS. 2 and 3, the input activations may be expressed as a 3D tensor, which is a data structure used in the artificial neural network 103. The 3D tensor of the input activations may be expressed as an H x W x C matrix.

The weights may be expressed as a 4D tensor. The 4D tensor of the weights may be expressed as an R x S x C x K matrix. In the 4D tensor of the weights, 'C' denotes the input channel and 'K' denotes the output channel.

The output activations may be expressed as a 3D tensor. The 3D tensor of the output activations may be expressed as a P x Q x K matrix. The output activations are generated by multiplying the input activations by the weights and accumulating the products.
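
The relationship between the H x W x C input activations, the R x S x C x K weights, and the P x Q x K output activations can be sketched as a direct multiply-accumulate loop. The text does not state the stride or padding, so this sketch assumes a stride of 1 and no padding, which gives P = H - R + 1 and Q = W - S + 1. The function name `conv2d` and the sample data are hypothetical.

```python
def conv2d(inputs, weights):
    """Direct convolution: inputs[h][w][c], weights[r][s][c][k] -> out[p][q][k]."""
    H, W, C = len(inputs), len(inputs[0]), len(inputs[0][0])
    R, S, K = len(weights), len(weights[0]), len(weights[0][0][0])
    P, Q = H - R + 1, W - S + 1  # assumes stride 1 and no padding
    out = [[[0.0] * K for _ in range(Q)] for _ in range(P)]
    for p in range(P):
        for q in range(Q):
            for k in range(K):
                acc = 0.0
                for r in range(R):
                    for s in range(S):
                        for c in range(C):
                            # Multiply an input activation by a weight and accumulate.
                            acc += inputs[p + r][q + s][c] * weights[r][s][c][k]
                out[p][q][k] = acc
    return out


# 3 x 3 x 1 input and a 2 x 2 x 1 x 1 all-ones weight tensor:
# each output activation is the sum of one 2 x 2 window.
inputs = [[[1.0], [2.0], [3.0]],
          [[4.0], [5.0], [6.0]],
          [[7.0], [8.0], [9.0]]]
weights = [[[[1.0]], [[1.0]]],
           [[[1.0]], [[1.0]]]]
outputs = conv2d(inputs, weights)
```

The innermost loop over `c` runs along the input channel direction and the loop over `k` runs along the output channel direction, the two directions in which the weights are grouped.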

FIG. 4 shows the 4D weight tensor shown in FIG. 3 and a 2D weight matrix.

Referring to FIG. 4, the 4D weight tensor corresponds to the 4D weight tensor shown in FIG. 3. The weights are arranged in the matrices of the 4D tensor. For example, the coefficients W_(1,1,1,1), W_(1,1,2,1), and W_(1,1,3,1) are arranged in the first row of the R x S x 1 x 1 matrix. In the coefficient W_(1,1,1,1), the first 1 denotes the first output channel, the second 1 denotes the first input channel, the third 1 denotes the first column, and the fourth 1 denotes the first row. This convention is defined for describing an embodiment of the present invention, and the coordinates of a coefficient may be defined differently depending on the embodiment. The coefficients W_(1,2,1,1), W_(1,2,2,1), and W_(1,2,3,1) are arranged in the first row of the R x S x 2 x 1 matrix. The coefficients W_(1,C,1,1), W_(1,C,2,1), and W_(1,C,3,1) are arranged in the first row of the R x S x C x 1 matrix. C denotes the number of input channels. The coefficients W_(K,1,1,1), W_(K,1,2,1), and W_(K,1,3,1) are arranged in the first row of the R x S x C x K matrix. K denotes the number of output channels.

Referring to FIGS. 1 and 4, the CPU 10 or the GPU 20 converts the 4D weight tensor into a plurality of 2D weight matrices in consideration of the input channel and the output channel of the 4D weight tensor. Specifically, the CPU 10 or the GPU 20 converts the 4D weight tensor into a 2D weight matrix such that the weights arranged along the input channel direction in the 4D tensor are entered in order along the column direction of the 2D weight matrix. In FIG. 4, the weights W_(1,1,1,1), W_(1,2,1,1), W_(1,3,1,1), ..., and W_(1,C,1,1), which are arranged along the input channel direction in the 4D tensor, are arranged in the first column (C1) of the 2D weight matrix. Likewise, the weights W_(2,1,1,1), W_(2,2,1,1), W_(2,3,1,1), ..., and W_(2,C,1,1), which are arranged along the input channel direction in the 4D tensor, are arranged in the second column (C2) of the 2D weight matrix.

In addition, the CPU 10 or the GPU 20 converts the 4D weight tensor into a 2D weight matrix such that the weights arranged along the output channel direction in the 4D tensor are entered in order along the row direction of the 2D weight matrix. In FIG. 4, the weights W_(1,1,1,1), W_(2,1,1,1), ..., and W_(K,1,1,1), which are arranged along the output channel direction in the 4D tensor, are arranged in the first row (R1) of the 2D weight matrix. Likewise, the weights W_(1,2,1,1), W_(2,2,1,1), ..., and W_(K,2,1,1), which are arranged along the output channel direction in the 4D tensor, are arranged in the second row (R2) of the 2D weight matrix.

In the 4D tensor of the R x S x C x K matrix, the weights located in the same input channel and the same output channel belong to different 2D weight matrices. For example, the weight W_(1,1,1,1) belongs to the first 2D weight matrix, and the weight W_(1,1,2,1) belongs to the second 2D weight matrix.
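
The conversion from the 4D weight tensor to 2D weight matrices can be made concrete with labeled entries. The sketch below assumes the tensor is stored as nested lists indexed `w4d[r][s][c][k]`, and that the 2D matrices are ordered with the kernel column varying fastest, which is consistent with W_(1,1,2,1) falling in the second matrix. The helper names are hypothetical, and the labels follow the coordinate order (output channel, input channel, column, row) used in the description.

```python
def make_labeled_tensor(R, S, C, K):
    """Build w4d[r][s][c][k] whose entries are labels 'W(k,c,s,r)' (1-based)."""
    return [[[[f"W({k},{c},{s},{r})" for k in range(1, K + 1)]
              for c in range(1, C + 1)]
             for s in range(1, S + 1)]
            for r in range(1, R + 1)]


def to_2d_matrices(w4d):
    """Return R*S matrices, each with C rows (input ch.) and K columns (output ch.)."""
    matrices = []
    for r_plane in w4d:           # kernel row r
        for matrix in r_plane:    # kernel column s
            matrices.append([list(row) for row in matrix])
    return matrices


w4d = make_labeled_tensor(R=2, S=2, C=3, K=2)
matrices = to_2d_matrices(w4d)

first_row = matrices[0][0]                          # row R1 of the first matrix
first_column = [row[0] for row in matrices[0]]      # column C1 of the first matrix
second_matrix_corner = matrices[1][0][0]            # top-left of the second matrix
```

Each 2D matrix holds the weights of one kernel position, so moving down a column walks through the input channels and moving along a row walks through the output channels.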

도 5는 도 1에 도시된 가속기의 블록도를 나타낸다. 5 shows a block diagram of the accelerator shown in FIG. 1.

도 1과 도 5를 참고하면, 가속기(200)는 컨트롤러(210), 입력 활성화 버퍼(220), 가중치 버퍼(230), 출력 버퍼(240), 제1레지스터(250), 제2레지스터(260), 복수의 프로세싱 엘리먼트들(processing elements; 270), 및 누산기(accumulator; 280)를 포함한다. 1 and 5, the accelerator 200 includes a controller 210, an input activation buffer 220, a weight buffer 230, an output buffer 240, a first register 250, and a second register 260. ), a plurality of processing elements 270, and an accumulator 280.

The controller 210 receives, from the CPU 10 or the GPU 20, instructions for executing the operations of the accelerator 200. For example, the instructions may include a load instruction, a multiplication instruction, a summation instruction, and a store instruction. When the controller 210 receives the load instruction, it controls the first register 250 or the second register 260 so that the input activations stored in the first register 250, or the weights stored in the second register 260, are loaded into the plurality of processing elements 270. The controller 210 controls the operations of the components 220, 230, 240, 250, 260, 270, and 280 of the accelerator 200 according to the instructions received from the CPU 10 or the GPU 20.

The input activation buffer 220 receives input activations from the memory 30 and stores them. The weight buffer 230 receives weights from the memory 30 and stores them.

The output buffer 240 stores the result value output from the accumulator 280 and transmits the result value to the CPU 10, the GPU 20, or the memory 30.

The first register 250 receives input activations from the input activation buffer 220 and stores them. The second register 260 receives weights from the weight buffer 230 and stores them. The first register 250 and the second register 260 may be FIFO (First In First Out) registers.

Each of the plurality of processing elements 270 performs a multiplication operation that multiplies the input activations received from the first register 250 by the weights received from the second register 260. The plurality of processing elements 270 may output partial sums.

The plurality of processing elements 270 may be referred to as a processing element array 271.

The accumulator 280 stores the partial sums output from the plurality of processing elements 270 and updates the weighted sum by adding the partial sums.

Depending on the embodiment, the structure of the accelerator 200 may be implemented in various ways.

FIG. 6 shows an internal block diagram of the processing element shown in FIG. 5.

Referring to FIGS. 5 and 6, the plurality of processing elements 270 all have the same structure. FIG. 6 shows one processing element 270 as a representative of the plurality of processing elements 270.

The processing element 270 includes a weight register 271, an input activation register 273, a plurality of multipliers 275, and an adder 277.

The weight register 271 receives weights from the second register 260. The weight register 271 is implemented as a matrix MX for storing the weights.

When the SIMD width is 8, the width of the matrix MX may be 8. The height of the matrix MX may be greater than 8 and less than 16. That is, eight weights are arranged along the width of the matrix MX, and eight or more weights may be arranged along its height.

Referring to FIGS. 4 and 6, the CPU 10 or the GPU 20 converts the 4D weight tensor into a 2D weight matrix in consideration of the input channels and output channels of the 4D weight tensor. In the converted 2D weight matrix, the weights are arranged according to their input channel and output channel.

The CPU 10 or the GPU 20 identifies the SIMD (Single Instruction Multiple Data) width of the processing element 270 included in the accelerator 200. For example, when the SIMD width of the processing element 270 is 2, the CPU 10 or the GPU 20 identifies a SIMD width of 2. When the SIMD width of the processing element 270 is 4, the CPU 10 or the GPU 20 identifies a SIMD width of 4.

The CPU 10 or the GPU 20 groups the weights arranged in the input channel and the output channel of the artificial neural network 103 into weight groups according to the identified SIMD width. The input channel of the artificial neural network 103 corresponds to 'C' shown in FIG. 4, and the output channel of the artificial neural network 103 corresponds to 'K' shown in FIG. 4.

In the converted 2D weight matrix, the CPU 10 or the GPU 20 groups the weights arranged in the input channel and the output channel of the artificial neural network 103 into weight groups according to the identified SIMD width. For example, when the SIMD width is 2, two weights form one group. When the SIMD width is 4, four weights form one group.

When the SIMD width is 8, the weights may be divided into groups of eight. In the rows of the converted 2D weight matrix, the weights are arranged in output channel order. In FIG. 6, the weight group GH1 may include the eight weights W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1). The weight group GH2 may include the eight weights W_(9,1,1,1), W_(10,1,1,1), ..., and W_(16,1,1,1).

The weight group GH3 may include the eight weights W_(17,1,1,1), W_(18,1,1,1), ..., and W_(24,1,1,1).

In the columns of the converted 2D weight matrix, the weights are arranged in input channel order. The weight group GV1 may include the eight weights W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1). The weight group GV2 may include the eight weights W_(1,9,1,1), W_(1,10,1,1), ..., and W_(1,16,1,1). The weight group GV3 may include the eight weights W_(1,17,1,1), W_(1,18,1,1), ..., and W_(1,24,1,1).
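As a minimal sketch of the grouping just described, the following Python splits the rows and columns of a 2D weight matrix into groups of SIMD-width consecutive weights. The function names `group_rows` and `group_columns` are hypothetical, and the sketch assumes the matrix dimensions are multiples of the SIMD width (otherwise the last group in a row or column is simply shorter).

```python
def group_rows(matrix, simd_width):
    """Split each row into consecutive horizontal groups (GH1, GH2, ...)."""
    return [[row[i:i + simd_width] for i in range(0, len(row), simd_width)]
            for row in matrix]


def group_columns(matrix, simd_width):
    """Split each column into consecutive vertical groups (GV1, GV2, ...)."""
    columns = [list(col) for col in zip(*matrix)]  # transpose: one list per column
    return group_rows(columns, simd_width)


# Toy matrix: entry at row c, column k is labelled (k, c), both 1-based,
# so rows run along the output channel and columns along the input channel.
toy = [[(k + 1, c + 1) for k in range(24)] for c in range(24)]

gh = group_rows(toy, 8)      # gh[0][0] plays the role of GH1, gh[0][1] of GH2
gv = group_columns(toy, 8)   # gv[0][0] plays the role of GV1, gv[0][1] of GV2
```

Here `gh[0][0]` holds the labels (1,1) through (8,1) and `gv[0][0]` holds (1,1) through (1,8), mirroring the first two subscripts of the weights listed for GH1 and GV1.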

Under the control of the CPU 10 or the GPU 20, the grouped weights are stored in the matrix MX included in the weight register 271 in consideration of the input channel and the output channel. For example, the weight group GH1, which includes the weights W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1), is stored in the first row (R1) of the matrix MX. The weight group GV1, which includes the weights W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1), is stored in the first column (C1) of the matrix MX. Because the grouped weights are stored in the matrix MX in consideration of the input channel and the output channel, more weights can be stored in the weight register 271 than in the prior art. With more weights stored in the weight register 271, more partial sum operations can be performed at a time. That is, computation speed can be improved.

The weights W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1), which are arranged in the output channel (K) of the artificial neural network 103, are inserted in order along a row of the matrix MX. The number of inserted weights is equal to the SIMD width.

The weights W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1), which are arranged in the input channel (C) of the artificial neural network 103, are inserted in order along a column of the matrix MX. The number of inserted weights is equal to the SIMD width. By grouping the weights arranged in the input channel (C) and the output channel (K) of the artificial neural network 103 according to the SIMD width of the accelerator 200 and storing them in the weight register 271, the accelerator 200 can achieve higher computation speed and lower energy consumption.

The input activation register 273 stores input activations.

Each of the plurality of multipliers 275 multiplies a weight output from the weight register 271 by an input activation output from the input activation register 273. When the SIMD width is 8, there are eight multipliers 275, and eight multiplications are performed simultaneously in response to a single instruction. Depending on the embodiment, the number of multipliers may vary. The values output from the plurality of multipliers 275 are input to the adder 277, which generates a partial sum. The partial sum generated by the adder 277 may be stored in the matrix MX included in the weight register 271.
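The multiply-then-add behaviour of one processing element can be modelled in software as follows. This is a functional sketch only, not a description of the hardware: the function name `pe_partial_sum` is hypothetical, and the eight "simultaneous" multiplications are simply computed in a list comprehension.

```python
def pe_partial_sum(weights, activations, simd_width=8):
    """Model one SIMD step of a processing element.

    The element-wise products stand in for the multipliers 275 and the
    final sum stands in for the adder 277.
    """
    if len(weights) != simd_width or len(activations) != simd_width:
        raise ValueError("expected exactly simd_width weights and activations")
    products = [w * a for w, a in zip(weights, activations)]
    return sum(products)


# One group of eight weights against eight input activations.
partial = pe_partial_sum([1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7])
```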

FIG. 7 is a flowchart illustrating a method of operating the system shown in FIG. 1. Specifically, FIG. 7 is a flowchart illustrating a method of training the artificial neural network shown in FIG. 2.

Referring to FIGS. 1 to 7, the processing unit identifies the SIMD width of the processing element 270 included in the accelerator 200 (S100). The processing unit is the CPU 10 or the GPU 20. When the SIMD width of the processing element 270 included in the accelerator 200 is 8, the processing unit may access the accelerator 200 and identify a SIMD width of 8.

The processing unit initializes the weights of the artificial neural network 103 (S200). The plurality of weights may be initialized arbitrarily. Alternatively, depending on the embodiment, the plurality of weights may be initialized to predetermined values obtained from another artificial neural network.

The processing unit groups the weights arranged in the input channel and the output channel of the artificial neural network 103 into weight groups according to the identified SIMD width (S300). That is, the processing unit divides the weights arranged in the input channel and the output channel of the artificial neural network 103 into weight groups according to the identified SIMD width. The input channel of the artificial neural network 103 is denoted by 'C' in FIG. 3, and the output channel is denoted by 'K' in FIG. 3. The weights arranged in the input channel of the artificial neural network 103 are the weights W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,C,1,1) arranged sequentially along the input channel (C) in FIG. 4. The weights arranged in the output channel of the artificial neural network 103 are the weights W_(1,1,1,1), W_(2,1,1,1), ..., and W_(K,1,1,1) arranged sequentially along the output channel (K) in FIG. 4. When the SIMD width is 2, two weights form one group.

When the SIMD width is 4, four weights form one group.

When the SIMD width is 8, eight weights (e.g., W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) form one group.

To group the weights arranged in the input channel and the output channel of the artificial neural network 103 according to the identified SIMD width, the processing unit may convert the 4D weight tensor into a plurality of 2D weight matrices in consideration of the input channels and output channels of the 4D weight tensor. Specifically, the processing unit may convert the 4D weight tensor into a 2D weight matrix such that weights arranged along the input channel direction of the 4D tensor are placed in order along the column direction of the 2D weight matrix. In addition, the processing unit converts the 4D weight tensor into a 2D weight matrix such that weights arranged along the output channel direction of the 4D weight tensor are placed in order along the row direction of the 2D weight matrix.

The processing unit may group the weights in the converted 2D weight matrices into weight groups according to the SIMD width. For example, assuming a SIMD width of 8, the processing unit may form one weight group from the eight weights placed in a row of any one of the converted 2D weight matrices (W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1)). Likewise, the processing unit may form one weight group from the eight weights placed in a column of any one of the converted 2D weight matrices (W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)). The remaining weights arranged in the 2D weight matrices are grouped in a similar manner.

The processing unit prunes the plurality of weights included in the artificial neural network 103 while being aware of the grouped weights (S400). Being aware means that the grouped weights are taken into account when the pruning operation is performed.

Pruning aims to reduce the number of connections in the artificial neural network 103 by removing unimportant weights. Unimportant connections can be removed by setting the corresponding weights of the artificial neural network 103 to 0. The pruning operation proceeds as follows.

The processing unit extracts a representative weight from each of the weight groups.

Each representative weight may be selected arbitrarily from among the weights in its group. For example, the weight W_(1,1,1,1) may be arbitrarily selected as the representative weight of the group W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1).

Alternatively, depending on the embodiment, the RMS (Root Mean Square) of the grouped weights (W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may be calculated, and the RMS value may be used as the representative weight.

Alternatively, depending on the embodiment, the average of the grouped weights (W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may be calculated, and the average value may be used as the representative weight.

Alternatively, depending on the embodiment, the minimum value (e.g., W_(1,2,1,1)) or the maximum value (e.g., W_(1,8,1,1)) of the grouped weights (W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may be used as the representative weight.
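The alternatives for choosing a representative weight (arbitrary selection, RMS, average, minimum, or maximum) can be illustrated with the small Python sketch below. The function name `representative` and its `method` argument are hypothetical. The sketch applies each statistic to the signed weights exactly as listed; whether magnitudes should be used instead is not stated in the text and is left open here.

```python
import math
import random


def representative(group, method="rms", rng=None):
    """Return one representative value for a weight group."""
    if method == "random":
        # Arbitrary selection of one weight from the group.
        return (rng or random).choice(group)
    if method == "rms":
        return math.sqrt(sum(w * w for w in group) / len(group))
    if method == "mean":
        return sum(group) / len(group)
    if method == "min":
        return min(group)
    if method == "max":
        return max(group)
    raise ValueError("unknown method: " + method)


group = [3.0, -4.0, 0.0, 0.0]
rms_value = representative(group, "rms")    # sqrt((9 + 16) / 4) = 2.5
mean_value = representative(group, "mean")  # (3 - 4) / 4 = -0.25
```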

The processing unit compares each of the representative weights with a corresponding threshold. Each of the thresholds may be set arbitrarily.

When any one of the representative weights is less than its corresponding threshold, the processing unit sets all of the weights in the corresponding weight group to 0 in order to remove them. For example, the grouped weights (W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may all be set to 0. The zeroing may be carried out as a product of the weight matrix and a sparse matrix, where a sparse matrix is a matrix that contains many zero values. Accordingly, in the input channel (C) and the output channel (K) of the artificial neural network 103, zero-valued weights (e.g., W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may appear consecutively in runs whose length equals the identified SIMD width (e.g., 8). Also, in the matrix MX implemented in the processing element 270, the weights arranged in any one row may all be 0, or the weights arranged in any one column may all be 0. The number of zeros equals the SIMD width. For example, when the SIMD width is 8, the number of zeros is 8. Depending on the embodiment, the zero-valued weights (e.g., W_(1,1,1,1), W_(1,2,1,1), ..., and W_(1,8,1,1)) may be excluded from the matrix MX implemented in the processing element 270. That is, the zero-valued weights may not appear in the matrix MX implemented in the processing element 270.

Conversely, when another one of the representative weights is greater than its corresponding threshold, the processing unit keeps all of the weights in that group (e.g., W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1)). Keeping means that none of the grouped weights (e.g., W_(1,1,1,1), W_(2,1,1,1), ..., and W_(8,1,1,1)) is changed. That is, the multiplication of the weight matrix by the sparse matrix is not performed.
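The group-level zeroing and keeping described above can be sketched as follows. The names `rms` and `prune_groups` are hypothetical, RMS is used as the representative only as one of the listed options, and keeping a group whose representative equals its threshold is an assumption, since the text covers only the "less than" and "greater than" cases. The returned 0/1 mask plays the role of the sparse matrix: multiplying a group element-wise by its mask row gives the same result as the conditional used here.

```python
import math


def rms(group):
    """Root mean square of a weight group (one possible representative)."""
    return math.sqrt(sum(w * w for w in group) / len(group))


def prune_groups(groups, thresholds, representative=rms):
    """Zero every group whose representative is below its threshold.

    Returns the pruned groups and a 0/1 mask with one row per group.
    Assumption: a representative equal to the threshold keeps the group.
    """
    pruned, mask = [], []
    for group, threshold in zip(groups, thresholds):
        keep = 0 if representative(group) < threshold else 1
        mask.append([keep] * len(group))
        pruned.append([w if keep else 0.0 for w in group])
    return pruned, mask


groups = [[0.01, -0.02, 0.01, 0.0],   # small weights: whole group is zeroed
          [0.5, -0.4, 0.3, 0.2]]      # larger weights: group is kept unchanged
pruned, mask = prune_groups(groups, thresholds=[0.1, 0.1])
```

Because a whole group is either kept or zeroed, the zeros always appear in runs of SIMD-width length, which is the property the hardware relies on.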

The processing unit updates the plurality of pruned weights (S500). The weights that are not 0 are updated.

Referring to FIG. 2, a processing unit such as the CPU 10 or the GPU 20 propagates activation data from training data through the artificial neural network 103. That is, the processing unit inputs the training data to the input layer 105 and outputs activation data through the hidden layer 107 and the output layer 109.

The processing unit calculates the value of a loss function based on the activation data and target data. The value of the loss function may be calculated using the MSE (Mean Squared Error) or the cross-entropy between the activation data and the target data.
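For reference, the two loss measures named above can be written in plain Python as below. This is a generic sketch of the standard definitions, not the patent's specific implementation; the function names and the small `eps` guard against log(0) are illustrative choices.

```python
import math


def mse(outputs, targets):
    """Mean squared error between output activations and target values."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def cross_entropy(probs, targets, eps=1e-12):
    """Cross-entropy between predicted probabilities and target labels.

    eps is an illustrative guard that avoids taking log(0).
    """
    return -sum(t * math.log(p + eps) for p, t in zip(probs, targets))


loss_mse = mse([1.0, 2.0], [1.0, 4.0])        # (0 + 4) / 2 = 2.0
loss_ce = cross_entropy([0.5, 0.5], [1, 0])   # about ln 2
```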

The processing unit backpropagates the value of the loss function in order to calculate the gradient of the loss function with respect to the weights of the artificial neural network 103 and the gradient of the loss function with respect to the thresholds. The backpropagation operation includes calculating the gradient of the loss function with respect to the weights of the artificial neural network 103 and the gradient of the loss function with respect to the thresholds. The backpropagation aims to minimize the value of the loss function by adjusting the weights and the thresholds.

The gradient of the loss function with respect to the weights represents the change in the value of the loss function caused by a change in the weights. The gradient of the loss function with respect to the thresholds represents the change in the value of the loss function caused by a change in the thresholds. The processing unit updates the weights of the artificial neural network 103 and the thresholds according to the gradient of the loss function with respect to the weights and the gradient of the loss function with respect to the thresholds. As the thresholds are updated, particular weight groups may be reset entirely to 0, or their zeroing may be undone.
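A single weight-update step consistent with "the weights that are not 0 are updated" might look like the sketch below. It is an assumption-laden illustration: plain gradient descent is used as the update rule, the function name `masked_sgd_step` and the learning rate are hypothetical, and the threshold update (which can re-zero a group or undo its zeroing) is not modelled.

```python
def masked_sgd_step(weights, grads, mask, lr):
    """Apply one gradient-descent step to the weights that survived pruning.

    mask holds 1 for kept weights and 0 for pruned ones; pruned weights
    stay at zero for this step regardless of their gradient.
    """
    return [(w - lr * g) if m else 0.0
            for w, g, m in zip(weights, grads, mask)]


updated = masked_sgd_step(weights=[1.0, 0.0, -2.0],
                          grads=[0.5, 3.0, -1.0],
                          mask=[1, 0, 1],
                          lr=0.5)
```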

The processing unit performs structured sparsity regularization on the artificial neural network 103 to avoid overfitting of the artificial neural network 103. Structured sparsity regularization means that the regularization is performed in consideration of the weight groups. The thresholds are updated in the course of performing the structured sparsity regularization. The optimization target of the structured sparsity regularization may be expressed as the following equation.

[Equation 1]

The terms of Equation 1 denote, respectively, the optimization target, the threshold, the weights, the classification loss, the penalty term (or regularization parameter) that determines how strongly the weights belonging to any one group are penalized, and the structured sparsity regularization term. N denotes the number of convolution layers, and n denotes the index of the weight tensor. Specifically, the norm used may be an L1 norm or an L2 norm.
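One form that is consistent with the terms listed for Equation 1 is given below, only as an assumed sketch. All symbol names are placeholders chosen for illustration and are not taken from the source: E for the optimization target, W for the weights, T for the thresholds, L for the classification loss, λ for the penalty (regularization) parameter, R_g for the structured sparsity regularization term, and W^(n) for the weight tensor of the n-th convolution layer.

```latex
% Assumed form; symbol names are illustrative placeholders.
E(W, T) \;=\; L(W, T) \;+\; \lambda \sum_{n=1}^{N} R_g\!\left(W^{(n)}\right)
```

Under this reading, R_g would apply a group-wise L1 or L2 norm to the SIMD-width weight groups of each layer, so that entire groups are driven toward zero together.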

The classification loss is updated so as to minimize the overall loss. The structured sparsity regularization term is learned in the direction that minimizes the loss. The threshold is updated in the direction that increases sparsity.

The weight groups are the groups formed by grouping the weights arranged in the input channel (C) and the output channel (K) of the artificial neural network 103 in consideration of the SIMD width. Through the structured sparsity regularization, unimportant weight groups are set to 0 in order to remove them.

In the present invention, the weights arranged in the input channel (C) and the output channel (K) of the artificial neural network 103 are grouped in consideration of the SIMD width, and the plurality of weights included in the artificial neural network 103 are pruned with awareness of the grouped weights, which makes it possible to improve computation speed and reduce energy consumption in the accelerator 200. The processing element 270 of the accelerator 200 performs the multiplication of input activations and weights. Pruning the weights of the artificial neural network 103 in consideration of this multiplication in the processing element 270 improves computation speed and reduces energy consumption. These benefits follow from handling runs of consecutive zero weights all at once, and from skipping the computation of consecutive zero weights. Pruning in consideration of the SIMD width, which is part of the hardware structure of the accelerator 200, likewise improves computation speed and reduces energy consumption, and it also makes it possible to provide weights that adapt flexibly to the hardware structure of the accelerator 200. In the prior art, weights were set in a manner dependent on the hardware structure of the accelerator 200, so that when the hardware structure changed, the parameters of the artificial neural network 103 could not be optimized. In the present invention, by contrast, the parameters of the artificial neural network 103 are set in consideration of the SIMD width, so they can be changed flexibly even when the hardware structure of the accelerator 200 changes. In addition, storage space in the memory 30 can be saved during the training stage of the artificial neural network 103.
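The benefit of skipping consecutive zero weights can be made concrete with a small software model. The sketch below is an illustration under assumptions, not the accelerator's actual control logic: the function name `dot_skipping_zero_groups` is hypothetical, and `issued` merely counts how many SIMD-width multiply steps would still be needed once all-zero groups are skipped.

```python
def dot_skipping_zero_groups(weights, activations, simd_width):
    """Dot product that skips every SIMD-width group of all-zero weights.

    Returns the result and the number of SIMD multiply steps issued.
    """
    total, issued = 0.0, 0
    for i in range(0, len(weights), simd_width):
        w_group = weights[i:i + simd_width]
        if all(w == 0 for w in w_group):
            continue  # pruned group: contributes nothing, so no work is issued
        issued += 1
        a_group = activations[i:i + simd_width]
        total += sum(w * a for w, a in zip(w_group, a_group))
    return total, issued


weights = [1, 2, 3, 4,  0, 0, 0, 0,  5, 6, 7, 8]   # middle group was pruned
result, steps = dot_skipping_zero_groups(weights, [1] * 12, simd_width=4)
```

In this toy case the result is unchanged by skipping, while only two of the three SIMD steps are issued. Zeros scattered at arbitrary positions would not allow a whole step to be skipped, which is why the pruning is aligned to the SIMD width.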

Pruning weights in groups can reduce inference accuracy, whereas pruning weights without grouping them degrades pruning performance. In the present invention, the weights arranged in the input channel (C) and the output channel (K) of the artificial neural network 103 are grouped in consideration of the SIMD width, and the plurality of weights included in the artificial neural network 103 are pruned with awareness of the grouped weights, so that accuracy is maintained while performance is improved. In addition, because the pruning takes both the input channel (C) and the output channel (K) of the artificial neural network 103 into account, inference accuracy can be maintained.

FIG. 8 shows another internal block diagram of the processing element shown in FIG. 5.

Referring to FIGS. 1, 5, and 8, the plurality of processing elements 270 all have the same structure. FIG. 5 representatively shows one processing element 270 among the plurality of processing elements 270.

The weight register 271 receives weights (e.g., W₁₁ to W₄₄) from the second register 260. The weight register 271 is implemented as a matrix MX for storing the weights. In FIG. 8, four weights (W₁₁ to W₁₄, W₂₁ to W₂₄, W₃₁ to W₃₄, or W₄₁ to W₄₄) form one weight group (G1, G2, G3, or G4). The weight groups G1 to G4 are each arranged in a row of the matrix MX.

The processing unit extracts representative weights from the weight groups G1 to G4. The processing unit compares each of the representative weights with a corresponding one of the thresholds of the pruning masks. According to the comparison, the processing unit may set all of the weights (W₂₁ to W₂₄, or W₃₁ to W₃₄) belonging to a specific group (e.g., G2 or G3) among the weight groups G1 to G4 to 0.

The input activation register 273 receives input activations (e.g., a₁₁ to a₄₁) from the first register 250. The input activations a₁₁ to a₄₁ are implemented as a matrix.

Each of the plurality of multipliers 275 multiplies the weights output from the weight register 271 by the input activations output from the input activation register 273. For example, in the first cycle, the multipliers 275 multiply the weights W₁₁ to W₁₄ included in the first weight group G1 in the weight register 271 by the input activations a₁₁ to a₁₄ included in the first column CL1 in the input activation register 273. One multiplier 275 multiplies the weight W₁₁ by the input activation a₁₁. Another multiplier 275 multiplies the weight W₁₂ by the input activation a₁₂. The adder 275 generates a partial sum.

In the second cycle, the multipliers 275 multiply the weights W₂₁ to W₂₄ included in the second weight group G2 in the weight register 271 by the input activations a₂₁ to a₂₄ included in the second column CL2 in the input activation register 273. The adder 275 generates a partial sum.

In the third cycle, the multipliers 275 multiply the weights W₃₁ to W₃₄ included in the third weight group G3 in the weight register 271 by the input activations a₃₁ to a₃₄ included in the third column CL3 in the input activation register 273. The adder 275 generates a partial sum.

In the fourth cycle, the multipliers 275 multiply the weights W₄₁ to W₄₄ included in the fourth weight group G4 in the weight register 271 by the input activations a₄₁ to a₄₄ included in the fourth column CL4 in the input activation register 273. The adder 275 generates a partial sum.

If a specific weight group (G2 or G3) among the weight groups G1 to G4 in the weight register 271 is a zero group, that is, a group whose weights are all set to 0, the multiplication of the weights W₂₁ to W₂₄ by the input activations a₂₁ to a₂₄, or the multiplication of the weights W₃₁ to W₃₄ by the input activations a₃₁ to a₃₄, can be skipped. This improves the operation speed of the accelerator 200.
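
The cycle-by-cycle behavior described above can be sketched in software as follows. This is only an illustrative model, not the accelerator's actual logic: the function name `pe_partial_sums`, the example weight and activation values, and the assumption that a skipped zero group costs no cycle are all hypothetical.

```python
def pe_partial_sums(weight_groups, activation_columns):
    """Model one processing element.

    weight_groups[i] is the i-th weight group (one row of the weight register);
    activation_columns[i] is the matching column of input activations.
    Returns the partial sums and the number of multiply cycles actually spent.
    """
    partial_sums = []
    cycles_used = 0
    for group, acts in zip(weight_groups, activation_columns):
        if all(w == 0 for w in group):
            # Zero group: the multiplication is skipped (assumed to cost no cycle).
            partial_sums.append(0)
            continue
        partial_sums.append(sum(w * a for w, a in zip(group, acts)))
        cycles_used += 1
    return partial_sums, cycles_used


# Hypothetical register contents: G2 and G3 have been pruned to zero.
weights = [
    [1, 2, 3, 4],  # G1
    [0, 0, 0, 0],  # G2 (zero group)
    [0, 0, 0, 0],  # G3 (zero group)
    [2, 0, 1, 0],  # G4
]
activations = [
    [1, 1, 1, 1],  # CL1
    [5, 6, 7, 8],  # CL2
    [9, 9, 9, 9],  # CL3
    [1, 2, 3, 4],  # CL4
]
sums, cycles = pe_partial_sums(weights, activations)
print(sums, cycles)
```

In this model, only two of the four groups consume a multiply cycle, which is the source of the speed-up the description attributes to zero groups.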

FIG. 9 shows an internal block diagram of the plurality of processing elements shown in FIG. 5. For convenience of description, FIG. 9 shows only some of the processing elements, and only the weight register of each processing element. FIG. 9(a) shows an internal block diagram of the plurality of processing elements before a load balancing operation is applied. FIG. 9(b) shows an internal block diagram of the plurality of processing elements after the load balancing operation is applied.

Referring to FIGS. 1 to 9(a), the processing elements 270-1 to 270-3 include weight registers 271-1 to 271-3. In the weight registers 271-1 to 271-3, a hatched block denotes a zero group. That is, all weights included in a zero group are 0. In the weight registers 271-1 to 271-3, a blank block denotes a non-zero group. Depending on the embodiment, the number of weights forming one group may vary. For example, four weights may form one weight group.

Under the control of the controller 210, the processing elements 270 execute a multiplication instruction. According to the multiplication instruction, the first processing element 270-1 multiplies the weights belonging to its first weight group G1 by the input activations. However, since the first weight group G1 of the first processing element 270-1 is a zero group, this multiplication is skipped. At the same time, the second processing element 270-2 multiplies the weights belonging to its first weight group G1 by the input activations according to the multiplication instruction, and then generates a partial sum. Likewise, the third processing element 270-3 multiplies the weights belonging to its first weight group G1 by the input activations according to the multiplication instruction, and then generates a partial sum.

After the multiplication for the first weight group G1 is finished, the processing elements 270-1 to 270-3 sequentially execute the multiplication instruction for the remaining weight groups G2 to G8.

Assume that, in the weight register 271-1 of the first processing element 270-1, the first weight group G1, the third weight group G3, the fifth weight group G5, and the sixth weight group G6 are zero groups. The more zero groups the weight register 271-1 contains, the more the operation speed of the accelerator 200 can be improved, because the multiplication of weights by input activations can be skipped for a zero group.

Assume that, in the weight register 271-2 of the second processing element 270-2, the third weight group G3 and the fifth weight group G5 are zero groups. Assume that, in the weight register 271-3 of the third processing element 270-3, the sixth weight group G6 is a zero group.

The first processing element 270-1 contains more zero groups than the other processing elements 270-2 and 270-3. Accordingly, the first processing element 270-1 can complete its multiplication instructions sooner than the other processing elements 270-2 and 270-3. The third processing element 270-3 contains the fewest zero groups and is therefore the slowest of the processing elements 270-1 to 270-3. However, the first processing element 270-1 must wait until the multiplication instructions of the third processing element 270-3 are completely finished. That is, the operation speed of the accelerator 200 is determined by the processing element 270-3, which contains the fewest zero groups. Load balancing is therefore required to reduce this latency.
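
The imbalance described above can be made concrete with a small calculation. The sketch below assumes a simplified cost model that the description does not state explicitly: each non-zero group costs one cycle and each skipped zero group costs none. The variable names are illustrative only.

```python
TOTAL_GROUPS = 8  # G1..G8 in each weight register, as in FIG. 9

# Zero groups per processing element before load balancing (FIG. 9(a)).
zero_groups = {
    "270-1": {"G1", "G3", "G5", "G6"},
    "270-2": {"G3", "G5"},
    "270-3": {"G6"},
}

# Assumed cost model: one cycle per non-zero group, none for a skipped group.
cycles = {pe: TOTAL_GROUPS - len(zg) for pe, zg in zero_groups.items()}

# Every processing element waits for the slowest one.
latency = max(cycles.values())
idle = {pe: latency - c for pe, c in cycles.items()}

print(cycles)   # busy cycles per processing element
print(latency)  # cycles until all elements have finished
print(idle)     # cycles each element spends waiting
```

Under this model the element 270-1 finishes in 4 cycles but idles for 3 more while 270-3 works through its 7 non-zero groups, so the overall latency is set by 270-3.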

Load balancing is performed in the training stage in order to improve the operation speed of the accelerator 200 in the inference stage.

Referring to FIGS. 1 to 9(b), the processing unit analyzes the arrangement of the weights to which the pruning masks have been applied in each of the weight registers 271-1, 271-2, and 271-3 of the processing elements 270-1, 270-2, and 270-3 of the accelerator 200. That is, in each of the weight registers 271-1, 271-2, and 271-3, the processing unit counts the number of zero weight groups, in which the weights arranged in each column are all 0. The zero weight groups are the zero groups described above.

For example, the processing unit counts the number (four) of zero weight groups G1, G3, G5, and G6 in the weight register 271-1 of the first processing element 270-1. The processing unit counts the number (two) of zero weight groups G3 and G5 in the weight register 271-2 of the second processing element 270-2. The processing unit counts the number (one) of zero weight groups G6 in the weight register 271-3 of the third processing element 270-3.
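
The counting step above can be sketched as follows. The helper names `count_zero_groups` and `make_register`, the group width of 4, and the placeholder value 1 for non-zero weights are assumptions made only for this example.

```python
def count_zero_groups(register):
    """Count the groups whose weights are all 0; register is a list of groups."""
    return sum(1 for group in register if all(w == 0 for w in group))


def make_register(zero_ids, total_groups=8, width=4):
    """Build a hypothetical register: groups in zero_ids are all 0, others all 1."""
    return [
        [0] * width if g in zero_ids else [1] * width
        for g in range(1, total_groups + 1)
    ]


registers = {
    "271-1": make_register({1, 3, 5, 6}),  # G1, G3, G5, G6 are zero groups
    "271-2": make_register({3, 5}),        # G3, G5
    "271-3": make_register({6}),           # G6
}
counts = {name: count_zero_groups(reg) for name, reg in registers.items()}
print(counts)
```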

According to the analysis of the arrangement, the processing unit maintains, for an arbitrary number of epochs, the thresholds of the pruning mask corresponding to the weights arranged in a specific weight register (e.g., 271-1) among the weight registers 271-1, 271-2, and 271-3, so that the weights arranged in that specific weight register are not updated. One epoch means that the data is propagated and backpropagated once through the artificial neural network 103.

The processing unit determines which weight register (e.g., 271-1) of which processing element (e.g., 270-1) has the largest number of zero weight groups among the weight registers 271-1, 271-2, and 271-3 of the processing elements 270-1, 270-2, and 270-3. As a result of analyzing the arrangement, the processing unit determines that the weight register 271-1 of the first processing element 270-1 has the most zero weight groups. Accordingly, so as not to update the weights arranged in the weight register 271-1 of the first processing element 270-1, the processing unit maintains, for an arbitrary number of epochs, the thresholds of the pruning mask corresponding to the weights arranged in the first weight register 271-1. Therefore, during those epochs, the first weight group G1, the third weight group G3, the fifth weight group G5, and the sixth weight group G6 in the first weight register 271-1 remain zero groups, and the remaining weight groups G2, G4, G7, and G8 in the first weight register 271-1 remain non-zero groups.

Depending on the embodiment, the processing unit determines whether the number of zero weight groups in a specific weight register among the weight registers 271-1, 271-2, and 271-3 of the processing elements 270-1, 270-2, and 270-3 is greater than the sum of the number of zero weight groups in another weight register and an arbitrary value. When the number of zero weight groups (e.g., 4) in the specific weight register (e.g., 271-1) is greater than the sum of the number of zero weight groups (e.g., 1) in the weight register 271-3 of another processing element (e.g., 270-3) and the arbitrary value (e.g., 1), the processing unit maintains, for an arbitrary number of epochs, the thresholds of the pruning masks corresponding to the weights arranged in the specific weight register (e.g., 271-1).
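
The comparison in this embodiment can be sketched as a simple selection rule. The function name `registers_to_freeze` and the parameter name `margin` (standing in for the "arbitrary value") are hypothetical; comparing against the smallest count is one way to test whether some other register satisfies the inequality.

```python
def registers_to_freeze(zero_counts, margin=1):
    """Return the registers whose zero-group count exceeds the count of
    some other register by more than `margin`."""
    lowest = min(zero_counts.values())
    return [name for name, n in zero_counts.items() if n > lowest + margin]


zero_counts = {"271-1": 4, "271-2": 2, "271-3": 1}
frozen = registers_to_freeze(zero_counts, margin=1)
print(frozen)
```

With the counts from FIG. 9(a), only register 271-1 satisfies 4 > 1 + 1, so only its pruning-mask thresholds would be held fixed.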

While the weights arranged in the first weight register 271-1 are not being updated, the processing unit adjusts the thresholds of the pruning masks corresponding to the weights arranged in the remaining weight registers 271-2 and 271-3 so that those weights are pruned more heavily. Specifically, the processing unit resets the thresholds of the pruning masks corresponding to the weights arranged in the remaining weight registers 271-2 and 271-3 to smaller values so that more zero groups are generated.

FIG. 9(b) shows an internal block diagram of the plurality of processing elements after the load balancing operation is applied.

Referring to FIG. 9(b), the weights arranged in the first weight register 271-1 are not updated. The zero groups and non-zero groups of the first weight register 271-1 do not change.

However, the weights arranged in the second weight register 271-2 and the third weight register 271-3 do change. The thresholds of the pruning masks corresponding to the weights arranged in the weight registers 271-2 and 271-3 are adjusted, and the weight groups included in the second weight register 271-2 and the third weight register 271-3 change. The seventh weight group G7 and the eighth weight group G8 of the second weight register 271-2 change from non-zero groups to zero groups. The second weight group G2, the third weight group G3, and the eighth weight group G8 of the third weight register 271-3 change from non-zero groups to zero groups.

After load balancing, the weight registers 271-1, 271-2, and 271-3 all contain the same number of zero groups. Accordingly, the latency of the accelerator 200 can be reduced.
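
The before-and-after states of FIG. 9 can be tracked with simple set bookkeeping, as in the sketch below. The latency figures rest on an assumed cost model (one cycle per non-zero group, none for a zero group) that is not stated in the description; the variable names are illustrative.

```python
TOTAL_GROUPS = 8  # G1..G8 in each weight register

before = {
    "271-1": {"G1", "G3", "G5", "G6"},
    "271-2": {"G3", "G5"},
    "271-3": {"G6"},
}
# Groups that turn from non-zero to zero during load balancing (FIG. 9(b)).
newly_zeroed = {
    "271-1": set(),               # thresholds held fixed, nothing changes
    "271-2": {"G7", "G8"},
    "271-3": {"G2", "G3", "G8"},
}

after = {name: before[name] | newly_zeroed[name] for name in before}
counts_after = {name: len(groups) for name, groups in after.items()}

# Assumed model: the slowest register has the fewest zero groups.
latency_before = TOTAL_GROUPS - min(len(g) for g in before.values())
latency_after = TOTAL_GROUPS - min(len(g) for g in after.values())
print(counts_after, latency_before, latency_after)
```

All three registers end up with four zero groups, so under this model no processing element waits on another.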

Depending on the embodiment, the numbers of zero groups in the weight registers 271-1, 271-2, and 271-3 may not be equal. For example, when the first weight register 271-1 has four zero groups, the number of zero groups in the second weight register 271-2 and in the third weight register 271-3 may be five. Even when the numbers of zero groups in the weight registers 271-1, 271-2, and 271-3 are not equal, the latency of the accelerator 200 can be lower than it was initially.

FIG. 10 is a flowchart illustrating another method of operating the system shown in FIG. 1.

Referring to FIG. 1 and FIGS. 8 to 10, the processing unit applies pruning masks to the weights of the artificial neural network 103 (S10). Specifically, the processing unit groups the weights into weight groups. Each of the pruning masks is a sparse matrix, that is, a matrix containing many zero values. Applying the pruning masks means multiplying the weight matrix by the sparse matrix. The specific operation of applying the pruning masks to the weights of the artificial neural network 103 is as follows.

The processing unit groups the weights into weight groups. Specifically, the processing unit may divide the weights into weight groups in consideration of channels or filters. For example, the processing unit may divide the weights into weight groups in consideration of the weights arranged in an input channel or the weights arranged in an output channel.

In addition, the processing unit may group the weights into weight groups according to the SIMD (Single Instruction Multiple Data) width. For example, when the SIMD width is 2, the processing unit may divide the weights into weight groups such that two weights form one weight group. When the SIMD width is 4, the processing unit may divide the weights into weight groups such that four weights form one weight group. When the SIMD width is 8, the processing unit may divide the weights into weight groups such that eight weights form one weight group.
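
A minimal sketch of SIMD-width grouping is shown below. It assumes the weights along a channel are given as a flat list whose length is a multiple of the SIMD width; the description does not say how a remainder would be handled, so the sketch simply rejects that case. The function name `group_weights` is hypothetical.

```python
def group_weights(weights, simd_width):
    """Split a flat list of weights into consecutive groups of simd_width."""
    if simd_width <= 0:
        raise ValueError("simd_width must be positive")
    if len(weights) % simd_width != 0:
        # Assumption: remainders are not covered here.
        raise ValueError("number of weights must be a multiple of simd_width")
    return [weights[i:i + simd_width] for i in range(0, len(weights), simd_width)]


weights = list(range(1, 9))  # stand-ins for W1..W8
for width in (2, 4, 8):
    print(width, group_weights(weights, width))
```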

The processing unit compares each of the weight groups with a corresponding one of the thresholds of the pruning masks. The pruning masks include corresponding thresholds. Specifically, the processing unit extracts representative weights from the weight groups. The representative weights may be extracted arbitrarily from the weight groups. For example, when one of the weight groups includes eight weights W1 to W8, the first weight W1 may be arbitrarily extracted as the representative weight. Depending on the embodiment, when a weight group includes eight weights W1 to W8, the root mean square (RMS) of the eight weights may be calculated and the RMS value may be used as the representative weight. Also depending on the embodiment, when a weight group includes eight weights W1 to W8, the minimum value or the maximum value among the eight weights may be extracted as the representative weight.
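
The alternatives for choosing a representative weight can be sketched as one function with a selectable method. The function name `representative`, the method labels, and the example values are hypothetical; the minimum and maximum are taken over the raw weight values, since the description does not say whether magnitudes are intended.

```python
import math


def representative(group, method="first"):
    """Return a representative weight for one weight group."""
    if method == "first":
        return group[0]  # arbitrary pick, e.g. W1
    if method == "rms":
        return math.sqrt(sum(w * w for w in group) / len(group))
    if method == "min":
        return min(group)
    if method == "max":
        return max(group)
    raise ValueError(f"unknown method: {method}")


group = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]  # hypothetical W1..W8
for method in ("first", "rms", "min", "max"):
    print(method, representative(group, method))
```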

The processing unit compares each of the representative weights extracted from the weight groups with a corresponding one of the thresholds of the pruning masks. Each of the thresholds may be set arbitrarily. The thresholds may also be set to differ from one another.

When any one of the representative weights is less than its corresponding threshold, the processing unit sets all of the weights belonging to the weight group corresponding to that representative weight to 0. For example, when the representative weight of a weight group consisting of eight weights W1 to W8 is less than the corresponding threshold, the processing unit sets all of the weights W1 to W8 belonging to that weight group to 0. The setting to 0 may be performed as a product of the weight matrix and the sparse matrix.

Conversely, when another one of the representative weights is greater than its corresponding threshold, the processing unit maintains all of the weights belonging to the weight group corresponding to that representative weight. Maintaining means that none of the weights W1 to W8 belonging to the weight group is changed. That is, the multiplication of the weight matrix by the sparse matrix is not performed.
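
The prune-or-keep decision can be sketched as below. This example assumes the RMS of each group as its representative weight (one of the options the description lists) and treats a representative exactly equal to the threshold as "keep", a case the description leaves open. The mask multiplication is not modeled; zeroed groups are written directly. All names and values are hypothetical.

```python
import math


def rms(group):
    return math.sqrt(sum(w * w for w in group) / len(group))


def prune_groups(groups, thresholds):
    """Zero every group whose representative (RMS) is below its threshold."""
    pruned = []
    for group, threshold in zip(groups, thresholds):
        if rms(group) < threshold:
            pruned.append([0.0] * len(group))  # whole group set to 0
        else:
            pruned.append(list(group))         # group left unchanged
    return pruned


groups = [
    [0.1, -0.1, 0.1, -0.1],  # small weights, RMS 0.1
    [0.9, -0.8, 0.7, 0.6],   # larger weights, RMS about 0.76
]
thresholds = [0.5, 0.5]
pruned = prune_groups(groups, thresholds)
print(pruned)
```

Because the decision is made once per group, zeros always appear in runs matching the group size, which is what lets the hardware skip a whole SIMD-wide multiplication.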

The processing unit propagates activation data from training data through the artificial neural network 103 (S20). That is, the processing unit inputs the training data to the input layer 105 of the artificial neural network 103 and outputs the activation data through the hidden layer 107 and the output layer 109.

The processing unit calculates the value of a loss function based on the activation data and target data (S30). The value of the loss function may be calculated using the mean squared error (MSE) or the cross-entropy between the activation data and the target data.
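
For reference, the two loss functions named above can be written out as follows. This is a sketch using their standard textbook definitions; the cross-entropy version assumes the activation data are probabilities and the target is one-hot, which the description does not specify. The example values are hypothetical.

```python
import math


def mse(outputs, targets):
    """Mean squared error between output activations and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def cross_entropy(probs, one_hot_targets):
    """Cross-entropy for a one-hot target; probs must be positive."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot_targets) if t > 0)


outputs = [0.7, 0.2, 0.1]  # hypothetical activation data
targets = [1, 0, 0]        # hypothetical target data
mse_value = mse(outputs, targets)
ce_value = cross_entropy(outputs, targets)
print(mse_value, ce_value)
```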

The processing unit backpropagates the value of the loss function in order to calculate the gradient of the loss function with respect to the weights of the artificial neural network 103 and the gradient of the loss function with respect to the thresholds of the pruning masks (S40). The backpropagation operation includes calculating the gradient of the loss function with respect to the weights of the artificial neural network 103 and the gradient of the loss function with respect to the thresholds. The backpropagation aims to minimize the value of the loss function by adjusting the weights and the thresholds.

The value of the loss function changes according to the weights of the artificial neural network 103. Accordingly, the processing unit calculates the gradient of the loss function with respect to the weights by using the change in the loss function according to the weights.

The value of the loss function also changes according to the thresholds of the pruning masks, because weights may be set to 0 depending on the thresholds. Accordingly, the processing unit calculates the gradient of the loss function with respect to the thresholds of the pruning masks by using the change in the loss function according to those thresholds.

The processing unit updates the weights of the artificial neural network 103 and the thresholds of the pruning masks based on the gradient of the loss function with respect to the weights of the artificial neural network 103 and the gradient of the loss function with respect to the thresholds of the pruning masks (S50). As the thresholds are updated, specific weight groups may be newly set entirely to 0, or their setting to 0 may be cancelled.
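
The effect of a threshold update on which groups are zeroed can be illustrated as follows. The sketch assumes a plain gradient-descent step with a fixed learning rate, which the description does not specify, and takes the threshold gradients as given inputs, since the description does not say how they are obtained. The representative weights are held fixed here for clarity, although in training the weights would be updated as well. All names and numbers are hypothetical.

```python
def gradient_step(values, grads, learning_rate):
    """Plain gradient-descent update: value - learning_rate * gradient."""
    return [v - learning_rate * g for v, g in zip(values, grads)]


def is_zero_group(representative, threshold):
    # A group is zeroed when its representative weight is below the threshold.
    return representative < threshold


representatives = [0.4, 0.9]   # hypothetical representative weights
thresholds = [0.5, 0.5]        # hypothetical thresholds before the update
threshold_grads = [2.0, -5.0]  # hypothetical gradients of the loss w.r.t. thresholds

before = [is_zero_group(r, t) for r, t in zip(representatives, thresholds)]
thresholds = gradient_step(thresholds, threshold_grads, learning_rate=0.1)
after = [is_zero_group(r, t) for r, t in zip(representatives, thresholds)]
print(before, after)
```

In this example the first group's zero setting is cancelled because its threshold drops below the representative weight, while the second group becomes a zero group because its threshold rises above the representative weight.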

The processing unit performs structured sparsity regularization on the artificial neural network 103 in order to avoid overfitting of the artificial neural network 103. Structured sparsity regularization means performing the regularization operation in consideration of the weight groups.
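
The description does not give the form of the regularizer. One widely used group-aware choice is the group-lasso penalty, which sums the L2 norm of each weight group; the sketch below shows that form purely as an assumed example, with a hypothetical function name and hypothetical values.

```python
import math


def group_lasso_penalty(groups, strength):
    """Assumed regularizer: strength * sum of the L2 norms of the weight groups."""
    return strength * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


groups = [[3.0, 4.0], [0.0, 0.0]]  # hypothetical weight groups
penalty = group_lasso_penalty(groups, strength=0.1)
print(penalty)
```

A penalty of this kind would be added to the loss value; because it acts on whole groups, it tends to drive entire groups toward zero rather than scattered individual weights.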

The processing unit analyzes the arrangement of the weights to which the pruning masks have been applied in the weight registers 271 of the processing elements 270 of the accelerator 200. According to the analysis of the arrangement, the processing unit maintains, for an arbitrary number of epochs, the thresholds of the pruning mask corresponding to the weights arranged in a specific weight register among the weight registers, so that the weights arranged in that specific weight register are not updated.

The processing unit adjusts the thresholds of the pruning masks corresponding to the weights arranged in the remaining weight registers, other than the specific weight register, in order to update those remaining weights.

The processing unit trains the artificial neural network 103 according to the adjusted thresholds.

In the present invention, latency in the accelerator 200 can be reduced by pruning in consideration of the zero weight groups arranged in the weight registers 271 of the processing elements 270 of the accelerator 200.

Although the present invention has been described with reference to an embodiment illustrated in the drawings, this is merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the technical idea of the appended claims.

100: system;
10: CPU;
20: GPU;
30: memory;
101: bus;
200: accelerator;

Claims

A pruning-based training method for acceleration hardware of an artificial neural network, the method comprising:
identifying, by a processing unit, a SIMD (Single Instruction Multiple Data) width of a processing element included in an accelerator;
initializing, by the processing unit, weights of the artificial neural network;
grouping, by the processing unit, the weights arranged in an input channel and an output channel of the artificial neural network into weight groups according to the identified SIMD width;
pruning, by the processing unit, the plurality of weights included in the artificial neural network with awareness of the weight groups; and
updating, by the processing unit, the pruned plurality of weights,
wherein the pruning of the plurality of weights with awareness of the weight groups comprises:
extracting, by the processing unit, representative weights from each of the weight groups;
comparing, by the processing unit, each of the representative weights with a corresponding one of thresholds; and
setting, by the processing unit, all weights belonging to the weight group corresponding to any one of the representative weights to zero when that representative weight is less than its corresponding threshold.

The method of claim 1, wherein the processing unit is a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

The method of claim 1, wherein:
when the SIMD width is 2, two weights among the plurality of weights form one weight group;
when the SIMD width is 4, four weights among the plurality of weights form one weight group; and
when the SIMD width is 8, eight weights among the plurality of weights form one weight group.

(Deleted)

The method of claim 1, wherein the pruning of the plurality of weights while being aware of the weight groups further comprises:
when another one of the representative weights is greater than its corresponding threshold value, maintaining, by the processing unit, all weights belonging to the weight group corresponding to that other representative weight.

The method of claim 1, wherein weights having a value of 0 in the input channel and the output channel of the artificial neural network are present consecutively, in a number corresponding to the identified SIMD width.

The method of claim 1, wherein:
the processing element includes a weight register;
the weight register is implemented as a matrix;
the weights arranged in the output channel of the artificial neural network are sequentially inserted along a row of the matrix; and
the weights arranged in the input channel of the artificial neural network are sequentially inserted along a column of the matrix.

The method of claim 1, further comprising:
updating, by the processing unit, the weights of the artificial neural network and the threshold values based on a gradient of a loss function with respect to the weights and a gradient of the loss function with respect to the threshold values.
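One way to read this joint update, assuming plain gradient descent, is given below. The claim does not name an optimizer, and the symbols are notation introduced here for illustration: $w$ is a weight, $\tau_g$ is the threshold value of weight group $g$, $L$ is the loss function, and $\eta$ is a learning rate.

```latex
w \leftarrow w - \eta \, \frac{\partial L}{\partial w},
\qquad
\tau_g \leftarrow \tau_g - \eta \, \frac{\partial L}{\partial \tau_g}
```

Under this reading, the threshold values are trained together with the weights instead of being fixed in advance.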

A pruning-based training system for acceleration hardware of an artificial neural network, the system comprising:
a memory that stores instructions; and
a processing unit that executes the instructions,
wherein the instructions are implemented to:
identify a SIMD width of a processing element included in an accelerator;
initialize weights of an artificial neural network;
group the weights arranged in an input channel and an output channel of the artificial neural network into weight groups according to the identified SIMD width;
prune the plurality of weights included in the artificial neural network while being aware of the weight groups; and
update the plurality of pruned weights,
wherein the instructions for pruning the plurality of weights while being aware of the weight groups are implemented to:
extract a representative weight from each of the weight groups;
compare each of the representative weights with a corresponding one of threshold values; and
when any one of the representative weights is less than its corresponding threshold value, set all weights belonging to the weight group corresponding to that representative weight to zero.

The system of claim 9, wherein the processing unit is a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

The system of claim 9, wherein:
when the SIMD width is 2, two weights among the plurality of weights form one weight group;
when the SIMD width is 4, four weights among the plurality of weights form one weight group; and
when the SIMD width is 8, eight weights among the plurality of weights form one weight group.

(Deleted)

The system of claim 9, wherein the instructions for pruning the plurality of weights while being aware of the weight groups are further implemented to:
when another one of the representative weights is greater than its corresponding threshold value, maintain all weights belonging to the weight group corresponding to that other representative weight.

The system of claim 9, wherein weights having a value of 0 in the input channel and the output channel of the artificial neural network are present consecutively, in a number corresponding to the identified SIMD width.

The system of claim 9, wherein:
the processing element includes a weight register;
the weight register is implemented as a matrix;
the weights arranged in the output channel of the artificial neural network are sequentially inserted along a row of the matrix; and
the weights arranged in the input channel of the artificial neural network are sequentially inserted along a column of the matrix.

The system of claim 9, wherein the instructions are further implemented to update the weights of the artificial neural network and the threshold values based on a gradient of a loss function with respect to the weights and a gradient of the loss function with respect to the threshold values.