KR102519210B1

KR102519210B1 - Method and apparatus for accelerating artificial neural network using double-stage weight sharing

Info

Publication number: KR102519210B1
Application number: KR1020200110520A
Authority: KR
Inventors: 김종태; 유태관; 김덕용
Original assignee: 성균관대학교산학협력단
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2023-04-06
Also published as: KR20220028895A

Abstract

The present invention relates to a method and apparatus for accelerating an artificial neural network using a double-stage weight sharing scheme. An artificial neural network acceleration method using double-stage weight sharing according to an embodiment of the present invention includes: calculating a hash result value through hashing that uses an obtained input activation value; calculating a position index of a centroid based on the calculated hash result value; storing the input activation value in an array according to the calculated position index of the centroid; and performing an operation on the stored input activation value and the centroid and outputting the result.

Description

Method and apparatus for accelerating artificial neural network using double stage weight sharing scheme

The present invention relates to a method and apparatus for accelerating an artificial neural network using a double-stage weight sharing scheme.

Deep neural networks have recently grown deeper and larger, and fully connected layers are increasingly used in fields such as natural language processing. However, a fully connected layer not only computes over a large number of parameters but, unlike a convolutional layer, reuses them very little, so it requires a large amount of memory. Compensating for this requires a large amount of on-chip SRAM (static random access memory), and demand is also growing for neural network accelerators that can process networks efficiently enough for use on edge devices. Nevertheless, in conventional neural network accelerators, much of the area and power is consumed by the SRAM that stores the parameters.

Most accelerators reduce their use of low-bandwidth DRAM (dynamic random access memory) and rely on comparatively fast SRAM instead. However, most of the power and area is then attributable to SRAM. For example, in EIE (Efficient Inference Engine), SRAM accounts for 93% of the area and 59% of the power. In TETRIS, which unlike EIE does include DRAM, the on-chip buffers still occupy up to 70% of the chip area. This increases cost and reduces computational density.

In addition, if SRAM is used inefficiently in the PEs (processing elements), the basic computing units of an accelerator, the chip area may grow substantially with the number of PEs. It is therefore necessary to reduce SRAM usage at design time and to design efficiently so as to improve performance per unit area.

Embodiments of the present invention aim to provide a method and apparatus for accelerating an artificial neural network using double-stage weight sharing, which use a double-stage parameter sharing technique to reduce the required data to a minimum and thereby efficiently improve performance per unit area even when all data is handled in SRAM.

However, the problems to be solved by the present invention are not limited to the above and may be extended in various ways without departing from the spirit and scope of the present invention.

According to an embodiment of the present invention, there may be provided an artificial neural network acceleration method using double-stage weight sharing, performed by an artificial neural network acceleration apparatus, the method including: calculating a hash result value through hashing that uses an obtained input activation value; calculating a position index of a centroid based on the calculated hash result value; storing the input activation value in an array according to the calculated position index of the centroid; and performing an operation on the stored input activation value and the centroid and outputting the result.

The method may further include storing the obtained input activation value in an input buffer, which is a memory space.

The method may further include finding the sparsity of the stored input activation values, that is, detecting their non-zero values.

In calculating the hash result value, a centroid index for each weight coordinate of the input activation value may be calculated through hashing that uses the obtained input activation value.

The position index of the centroid may be a real centroid index.

In storing the input activation value in the array, the input activation value may be stored based on the calculated position index of the centroid.

In the outputting, a multiplication may be performed on the stored input activation value and the centroid.

In the outputting, the input activation values stored at each centroid coordinate position may be multiplied by the centroid corresponding to that centroid coordinate.

In performing the operation, the products of the multiplication may be accumulated into a total sum, which is passed through an activation function and emitted as the output.

The activation function may be a ReLU (Rectified Linear Unit).
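
By way of illustration only, the steps summarized above can be sketched for a single output neuron of a fully connected layer. The hash function, the table contents, and the names `hash_index`, `fc_output`, `real_index_table`, and `real_centroids` are assumptions made for this sketch and are not taken from the specification; the sketch hashes the weight coordinate (output row, input column) associated with each input activation.

```python
NUM_SLOTS = 1024  # centroid index range 0..1023 (from the description)

def hash_index(row: int, col: int) -> int:
    """Hypothetical hash of a weight coordinate to a centroid index."""
    return (3 * row + col) % NUM_SLOTS

def relu(value: float) -> float:
    """Rectified Linear Unit."""
    return value if value > 0 else 0.0

def fc_output(activations, out_row, real_index_table, real_centroids):
    """Compute one output value from the input activations."""
    # One accumulator per real centroid: the "array" that stores activations.
    bins = [0.0] * len(real_centroids)
    for col, act in enumerate(activations):
        if act == 0:                        # non-zero detection: skip zeros
            continue
        slot = hash_index(out_row, col)     # step 1: hash result value
        real_idx = real_index_table[slot]   # step 2: position index of the centroid
        bins[real_idx] += act               # step 3: store by position index
    # Step 4: multiply each accumulated bin by its centroid, sum, apply ReLU.
    total = sum(b * c for b, c in zip(bins, real_centroids))
    return relu(total)

# Made-up tables for demonstration only.
real_index_table = [i % 4 for i in range(NUM_SLOTS)]
real_centroids = [-0.5, -0.25, 0.25, 0.5]
print(fc_output([1.0, 0.0, 2.0, 4.0], 0, real_index_table, real_centroids))
```

In this sketch the zero activation is never hashed or accumulated, which is one plausible way to exploit input sparsity.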

Meanwhile, according to another embodiment of the present invention, there may be provided an artificial neural network acceleration apparatus using double-stage weight sharing, the apparatus including: a memory storing one or more programs; and a processor executing the stored one or more programs, wherein the processor calculates a hash result value through hashing that uses an obtained input activation value, calculates a position index of a centroid based on the calculated hash result value, stores the input activation value in an array according to the calculated position index of the centroid, and performs an operation on the stored input activation value and the centroid and outputs the result.

The apparatus may further include an input buffer that stores the obtained input activation value in a memory space.

The apparatus may further include a non-zero detection module that finds the sparsity of the stored input activation values, that is, detects their non-zero values.

The processor may calculate a centroid index for each weight coordinate of the input activation value through hashing that uses the obtained input activation value.

The processor may store the input activation value based on the calculated position index of the centroid.

The processor may perform a multiplication on the stored input activation value and the centroid.

The processor may multiply the input activation values stored at each centroid coordinate position by the centroid corresponding to that centroid coordinate.

The processor may accumulate the products of the multiplication into a total sum, pass the sum through an activation function, and emit it as the output.

Meanwhile, according to another embodiment of the present invention, there may be provided a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to execute a method, the method including: calculating a hash result value through hashing that uses an obtained input activation value; calculating a position index of a centroid based on the calculated hash result value; storing the input activation value in an array; and performing an operation on the stored input activation value and the centroid and outputting the result.

The disclosed technology may have the following effects. However, this does not mean that a particular embodiment must include all of the following effects or only the following effects, and the scope of the disclosed technology should not be understood as being limited thereby.

Embodiments of the present invention focus on storing all data for the FC (fully connected) layers in SRAM, without DRAM access, and can compress information efficiently through a double-stage parameter sharing scheme that uses the hashing trick and index mapping, which are model compression techniques that do not degrade accuracy or performance per unit area.

In addition, embodiments of the present invention can realize a highly efficient accelerator that occupies a much smaller design area than conventional designs while maintaining high performance. Embodiments of the present invention can thereby satisfy the demand for neural network accelerators that process networks efficiently for use on edge devices, and can contribute greatly to the advancement of the related technical field.

Among deep learning accelerators, which are designed to be more efficient than CPUs and GPUs, embodiments of the present invention can show higher area efficiency and performance than conventional accelerators.

Embodiments of the present invention can replace most conventional artificial neural network accelerator systems, in which SRAM takes up a large share, with the double-stage weight sharing scheme, and can reduce area and power consumption at the same performance.

Embodiments of the present invention can increase the performance-per-area efficiency of an accelerator beyond that of representative neural network accelerators that use conventional parameter sharing techniques.

Implemented in a 32 nm CMOS process, embodiments of the present invention can reduce area by 50.23% compared with a conventional accelerator while improving performance.

FIG. 1 is a diagram comparing the data formats of conventional methods and of the double-stage weight sharing scheme according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams illustrating an example of the double-stage weight sharing scheme according to an embodiment of the present invention.
FIG. 4 is a diagram showing the possible design exploration approaches in terms of sparsity and parallelization.
FIG. 5 is a diagram introducing the post-multiplication technique and showing a case where it cannot be applied.
FIG. 6 is a block diagram of an artificial neural network acceleration apparatus using double-stage weight sharing according to an embodiment of the present invention.
FIGS. 7 and 8 are block diagrams showing the detailed operation of the artificial neural network acceleration apparatus using double-stage weight sharing according to an embodiment of the present invention.
FIG. 9 is a flowchart of an artificial neural network acceleration method using double-stage weight sharing according to an embodiment of the present invention.
FIG. 10 is a diagram comparing training accuracy and required memory according to the number of centroids.
FIG. 11 is a diagram comparing the area and required SRAM size of conventional parameter sharing techniques and of an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to cover all modifications, equivalents, and substitutes falling within its technical spirit and scope. In describing the present invention, detailed descriptions of related known technologies are omitted where they could obscure the gist of the invention.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

The terms used herein are used only to describe specific embodiments and are not intended to limit the present invention. The terms were chosen, as far as possible, from general terms in wide current use, with the functions in the present invention taken into account, but they may vary with the intent of those skilled in the art, precedent, or the emergence of new technology. In certain cases a term has been arbitrarily selected by the applicant, and its meaning is then described in detail in the corresponding part of the description. Terms used herein should therefore be defined by their meaning and by the overall content of the present invention, not simply by their names.

Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a diagram comparing the data formats of conventional methods and of the double-stage weight sharing scheme according to an embodiment of the present invention.

An embodiment of the present invention provides a new data format based on a double-stage parameter sharing technique. As shown in FIG. 1, approaches that use a conventional sparse matrix or a conventional parameter sharing technique need, in order to maintain their format, an additional array that records the positions of the centroids, which are the shared values serving as the actual network parameters. Conventional methods therefore cannot use the most compact format, and the accelerator architecture needs additional logic to handle them.

In contrast, an embodiment of the present invention provides a minimal data format by using the structure based on the double-stage parameter sharing technique, as in the last drawing of FIG. 1. The embodiment first trains with a large number of centroids using the hashing trick, which was proposed as a conventional parameter sharing technique. It then uses an additional stage to cluster those centroids and shares the clustered centroids. Although the embodiment uses a data format that requires a minimal amount of data, it can achieve comparable accuracy.

Hashing, the centroid coordinate values, the centroids, and related terms are described below with reference to blocks 101 and 102 of FIG. 1.

In general, in the matrix multiplication used for deep learning, each position of the weight matrix holds a 16-bit value. Because this structure requires an enormous amount of memory, weight sharing is performed as shown in the left drawing of FIG. 2.

FIGS. 2 and 3 are diagrams illustrating an example of the double-stage weight sharing scheme according to an embodiment of the present invention.

When training a network that uses the data format according to an embodiment of the present invention, the shared centroids are first trained with the conventional hashing trick, as in the first drawing of FIG. 2. Then, as in the second drawing of FIG. 2, K-means clustering is used to learn an array of real centroids and an index array holding their positions. Training thus proceeds in two stages.

In more detail, an embodiment of the present invention first takes conventionally pre-trained parameters and, as in the hashing trick, initializes each centroid to the average of the values that share that centroid. Next, the trained centroids are retrained through K-means clustering to produce new centroids. Additional fine-tuning is then performed to build an array holding the indices used to share these new centroids. In both training stages, gradients are applied as in [Equation 1]: weights that share the same centroid position receive the average of their gradient values.
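
As a rough illustration of the two training stages described above, the following sketch averages values that share a group (usable both for initializing centroids from pre-trained weights and for averaging gradients) and clusters the resulting centroids with a simple one-dimensional K-means. It is a minimal sketch under assumptions: the helper names `group_mean` and `kmeans_1d`, the even-spacing initialization, and the fixed iteration count are illustrative choices and are not taken from the specification, and the sketch does not reproduce [Equation 1].

```python
def group_mean(values, group_of, num_groups):
    """Average the values that fall into the same group.

    Can be used to initialize each centroid from the pre-trained weights
    hashed to it, or to average the gradients of weights sharing a centroid.
    """
    sums = [0.0] * num_groups
    counts = [0] * num_groups
    for value, group in zip(values, group_of):
        sums[group] += value
        counts[group] += 1
    # Empty groups default to 0.0 (an assumption of this sketch).
    return [sums[g] / counts[g] if counts[g] else 0.0 for g in range(num_groups)]

def kmeans_1d(values, k, iters=20):
    """Cluster scalar centroids into k real centroids (plain Lloyd iterations)."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        assign = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Move every non-empty center to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign

# Stage 1 initialization: four pre-trained weights hashed to two centroid slots.
stage1 = group_mean([1.0, 3.0, 10.0, 12.0], [0, 0, 1, 1], 2)
# Stage 2: cluster stage-1 centroids into k real centroids plus an index array.
real_centroids, real_index = kmeans_1d([0.0, 0.25, 1.0, 1.25], 2)
print(stage1, real_centroids, real_index)
```

Here `real_index` plays the role of the index array that maps each first-stage centroid to its real centroid.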

In the basic weight sharing scheme, as shown in FIG. 2, each element is not given a 16-bit value; instead it holds only a small index, for example a 10-bit value in the range 0 to 1023. The actual 16-bit values that each coordinate should have (the centroids) are stored in an additional temporary memory with entries 0 to 1023, so that memory is used efficiently.
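
The saving from basic weight sharing can be checked with simple arithmetic. The sketch below compares a dense 16-bit weight matrix with a 10-bit index per element plus a 1024-entry table of 16-bit centroids. The 1024 x 1024 layer size is a hypothetical example chosen for this sketch; the specification does not state a layer size.

```python
VALUE_BITS = 16                   # bits per weight value (from the description)
INDEX_BITS = 10                   # bits per element index, range 0..1023
NUM_CENTROIDS = 2 ** INDEX_BITS   # 1024 table entries

def dense_bits(rows: int, cols: int) -> int:
    """Storage when every element holds its own 16-bit value."""
    return rows * cols * VALUE_BITS

def shared_bits(rows: int, cols: int) -> int:
    """Storage with a 10-bit index per element plus the centroid table."""
    index_array = rows * cols * INDEX_BITS
    centroid_table = NUM_CENTROIDS * VALUE_BITS
    return index_array + centroid_table

ROWS, COLS = 1024, 1024           # hypothetical layer size
print(dense_bits(ROWS, COLS), shared_bits(ROWS, COLS))
```

For this hypothetical layer the index array still dominates the storage, which is the cost that the hashing scheme described next avoids.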

The hashing scheme goes a step beyond weight sharing. Instead of storing, for each coordinate, an index to its centroid, it uses a hash function that derives the index from the coordinate values (x, y), so the centroid is found efficiently without spending memory on indices. For example, if the hash function is f(x, y) = 3x + y, coordinate (1, 1) has centroid index 4 and coordinate (2, 3) has centroid index 9. Referring to the arrow (201) in the left drawing of FIG. 2, the values 0 to 1023 (202) written in each element can be regarded as the values produced by hashing.
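
The example hash function can be written directly in code. In the sketch below, the function name `centroid_index` and the reduction modulo 1024 (to keep the result within the 0 to 1023 index range) are assumptions added for illustration; the text gives only f(x, y) = 3x + y.

```python
NUM_CENTROIDS = 1024  # centroid index range 0..1023

def centroid_index(x: int, y: int) -> int:
    """Example hash f(x, y) = 3x + y, wrapped into the index range (assumed)."""
    return (3 * x + y) % NUM_CENTROIDS

print(centroid_index(1, 1))  # coordinate (1, 1) -> 4
print(centroid_index(2, 3))  # coordinate (2, 3) -> 9
```

Because the index is recomputed from the coordinates whenever it is needed, no per-element index array has to be stored.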

To reduce memory further, the double-stage weight sharing scheme according to an embodiment of the present invention does not store an actual 16-bit value at the centroid index computed by hashing. Instead, it stores there, as the centroid coordinate value (212), a real centroid index (2 bits) that points to a "real centroid". The 16-bit real centroid value (213) is then found through the real centroid index. The double-stage weight sharing scheme is further illustrated in FIG. 3. In this scheme, the centroid index (211) for a weight ranges from 0 to 1023 and holds a 2-bit centroid coordinate value (212), and the centroid (213) is the 16-bit value corresponding to that centroid coordinate value (212).
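
The two-stage lookup can be sketched as follows, using the bit widths given in the text (1024 centroid indices, 2-bit real centroid indices, 16-bit real centroids). The hash function, the randomly filled index table, the four centroid values, and all names are made up for illustration and are not taken from the specification.

```python
import random

NUM_SLOTS = 1024          # centroid index (211) range 0..1023
REAL_INDEX_BITS = 2       # bits per real centroid index (212)
VALUE_BITS = 16           # bits per real centroid value (213)
REAL_CENTROIDS = [-0.5, -0.25, 0.25, 0.5]   # made-up values, 2**2 entries

random.seed(0)
# One 2-bit real centroid index per slot; random contents for illustration.
REAL_INDEX_TABLE = [random.randrange(len(REAL_CENTROIDS)) for _ in range(NUM_SLOTS)]

def centroid_index(x: int, y: int) -> int:
    """Hypothetical coordinate hash producing the centroid index (211)."""
    return (3 * x + y) % NUM_SLOTS

def lookup_weight(x: int, y: int) -> float:
    """Resolve a weight coordinate through both stages."""
    slot = centroid_index(x, y)           # stage 1: hashed centroid index
    real_idx = REAL_INDEX_TABLE[slot]     # stage 2: 2-bit real centroid index
    return REAL_CENTROIDS[real_idx]       # 16-bit real centroid value

def single_stage_bits() -> int:
    """A 16-bit value stored directly at every hashed slot."""
    return NUM_SLOTS * VALUE_BITS

def double_stage_bits() -> int:
    """A 2-bit index per slot plus the small real centroid table."""
    return NUM_SLOTS * REAL_INDEX_BITS + len(REAL_CENTROIDS) * VALUE_BITS

print(lookup_weight(1, 1), single_stage_bits(), double_stage_bits())
```

Under these assumed sizes, parameter storage falls from 16384 bits to 2112 bits.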

FIG. 4 is a diagram showing the possible design exploration approaches in terms of sparsity and parallelization.

An embodiment of the present invention demonstrates its efficiency by actually implementing a deep learning accelerator. To design the PE (processing element), the basic unit of a deep learning accelerator, efficiently, the design was chosen by considering two factors: sparsity and parallelization.

First, as shown in FIG. 4, the design of an accelerator differs depending on whether it is parallelized along the input activation direction or along the parameter direction, and the design space was explored with all of these cases considered. Next, the result also depends on whether the sparsity of the input activations or of the parameters is exploited, so every feasible sparsity option was explored for each parallelization direction. As a result, the candidates include models that cannot be implemented and models that actually require more memory. Some of the latter suffer from a LUT (look-up table) problem: parameter sharing requires a look-up table, and the memory it needs grows in proportion to the degree of parallelization. To solve this, an embodiment of the present invention uses a technique called post-multiplication, which is described with reference to FIG. 5.

FIG. 5 is a diagram introducing the post-multiplication technique and showing a case where it cannot be applied.

The post-multiplication technique, shown in the first drawing (301) of FIG. 5, does not perform a separate MAC (multiply-and-accumulate) operation for every pair of activation and parameter. Instead, all activations are first accumulated per centroid, and each accumulated sum is then multiplied only once by its corresponding centroid to obtain the result. However, when the design is parallelized along the output direction among the parameter directions, the array that collects the activation values grows proportionally, as shown in the second drawing (302) of FIG. 5, so post-multiplication cannot be applied in that case.
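
The following sketch, given for illustration only, compares a conventional MAC loop with post-multiplication for one output value and shows that both give the same result, since weights that share a centroid can be factored out of the sum. The function names and the sample values are assumptions chosen for this sketch.

```python
def direct_mac(activations, real_idx_per_weight, real_centroids):
    """Conventional MAC: one multiplication per (activation, weight) pair."""
    return sum(a * real_centroids[k]
               for a, k in zip(activations, real_idx_per_weight))

def post_multiplication(activations, real_idx_per_weight, real_centroids):
    """Accumulate activations per centroid, then multiply once per centroid."""
    bins = [0.0] * len(real_centroids)
    for a, k in zip(activations, real_idx_per_weight):
        bins[k] += a                       # additions only in the inner loop
    return sum(b * c for b, c in zip(bins, real_centroids))

# Made-up example: five inputs whose weights map to four real centroids.
activations = [1.0, 2.0, 0.0, 3.0, 4.0]
real_idx_per_weight = [0, 1, 1, 0, 3]
real_centroids = [-0.5, -0.25, 0.25, 0.5]

print(direct_mac(activations, real_idx_per_weight, real_centroids))
print(post_multiplication(activations, real_idx_per_weight, real_centroids))
```

In this sketch the number of multiplications per output equals the number of real centroids rather than the number of inputs, which is where the saving comes from when a layer has many more inputs than centroids.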

Accordingly, as shown in FIG. 5, design (iv) and design (v) were identified as the implementable models. A further comparison of the two showed that design (v), which exploits parameter sparsity, offers little potential for memory reduction, while its performance drops when the double-stage parameter sharing technique according to an embodiment of the present invention is applied. Design (iv) was therefore determined to be the most suitable.

FIG. 6 is a block diagram of an artificial neural network acceleration apparatus using double-stage weight sharing according to an embodiment of the present invention.

As shown in FIG. 6, the artificial neural network acceleration apparatus (100) using double-stage weight sharing according to an embodiment of the present invention includes a memory (110) and a processor (120). However, not all of the illustrated components are essential. The apparatus (100) may be implemented with more components than illustrated, or with fewer.

이하, 도 6의 인공 신경망 가속 장치(100)의 각 구성요소들의 구체적인 구성 및 동작을 설명한다.Hereinafter, detailed configuration and operation of each component of the artificial neural network accelerator 100 of FIG. 6 will be described.

메모리(110)는 하나 이상의 프로그램을 저장한다. Memory 110 stores one or more programs.

프로세서(120)는 메모리(110)에 저장된 하나 이상의 프로그램을 실행한다. 프로세서(120)는 획득된 입력 액티베이션(Activation) 값을 이용한 해싱(Hashing) 방식을 통해 해쉬 결과 값을 산출하고, 산출된 해쉬 결과 값을 기준으로 중앙값의 위치 인덱스를 계산하고, 입력 액티베이션 값을 배열에 저장하고, 저장된 입력 액티베이션 값과 중앙값에 대한 연산을 수행하여 출력한다. Processor 120 executes one or more programs stored in memory 110 . The processor 120 calculates a hash result value through a hashing method using the obtained input activation value, calculates a location index of a median value based on the calculated hash result value, and arranges the input activation values. It is stored in , performs an operation on the saved input activation value and the median value, and outputs it.

According to embodiments, the artificial neural network accelerator 100 may further include an input buffer that stores the obtained input activation value in a memory space.

According to embodiments, the artificial neural network accelerator 100 may further include a non-zero detection module that finds the sparsity of the stored input activation values, that is, which of the values are non-zero.

According to embodiments, the processor 120 may calculate a median value index (centroid index) for each weight coordinate of the input activation value through a hashing scheme using the obtained input activation value.

According to embodiments, the position index of the median value may be a real centroid index.

According to embodiments, the processor 120 may store the input activation value based on the calculated position index of the median value.

According to embodiments, the processor 120 may perform a multiplication operation on the stored input activation value and the median value.

According to embodiments, the processor 120 may perform a multiplication operation on the input activation values stored at each median value coordinate position and the median value corresponding to that coordinate.

According to embodiments, the processor 120 may accumulate the values resulting from the multiplication operation to obtain a total sum, and output the total sum through an activation function.

According to embodiments, the activation function may be a Rectified Linear Unit (ReLU).

FIGS. 7 and 8 are block diagrams showing the specific operation of an artificial neural network accelerator using a double-stage weight sharing scheme according to an embodiment of the present invention.

The resulting configuration according to a specific embodiment of the present invention may be as shown in FIG. 7.

According to a specific embodiment, the artificial neural network accelerator may include an input buffer (InBuffer), a leading non-zero detection (LNZD) module, a central control unit (CCU), and a processing element (PE) module.

The input buffer (InBuffer) is a memory space for receiving input activation values.

The leading non-zero detection (LNZD) module detects the sparsity of the incoming input activations and provides it to all PEs in parallel.
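The functional effect of non-zero detection can be sketched in Python as follows. The text does not describe the internal structure of the LNZD module, so the function name and the (position, value) output format are assumptions made for illustration only.

```python
def leading_nonzero_detect(in_buffer):
    """Return (position, value) pairs for the non-zero activations only.

    Hypothetical functional model: downstream processing elements can then
    skip every zero activation instead of spending a cycle on it.
    """
    return [(i, a) for i, a in enumerate(in_buffer) if a != 0]


# Illustrative input buffer contents with sparse activations.
nonzero = leading_nonzero_detect([0, 3, 0, 0, 7])
```

Each pair keeps the original position so that a PE can still compute the weight coordinate that belongs to the activation.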

In addition, the central control unit (CCU) centrally manages the result values of all PEs and controls the pipeline.

The processing element (PE) module may consist of multiple processing elements (PEs) that handle the actual operations of the double-stage weight sharing scheme.

Here, the processing element (PE) module may include a hash operation module, a hash processing module, an input accumulation module, and a final operation module.

The hash operation module may store data within the PE using the hashing scheme.

The hash processing module may estimate the index of the median value from the hash result value within the PE.

The input accumulation module may sum the activation values within the PE using the index of the median value.

The final operation module may compute the products of the activation values accumulated within the PE and the parameters, and combine the resulting values.

The internal structure of the processing element (PE), which plays the most central role in the artificial neural network accelerator, operates as a five-stage pipeline.

The first stage is step S101, which obtains the hash result value.

The second stage is step S102, which obtains the position index of the median value.

The third stage is step S103, which stores all input activation values in an array.

The fourth stage is step S104, which multiplies the stored activation values by the internal median values.

The final stage is step S105, which accumulates the result values of the previous stage to obtain a total sum and outputs it through an activation function. The dotted lines in FIG. 7 separate the pipeline stages, and the solid lines represent the data paths.

These steps are described in detail below.

Step S101 obtains the hash result value. In step S101, the median value index (centroid index) for the coordinates (row, col) of each weight is calculated through the hashing scheme. This corresponds to calculating the value 211, denoted as 0 to 1023 in the right-hand diagram of FIG. 2.

Step S102 obtains the position index of the median value. In step S102, the real centroid index is obtained based on the centroid index calculated in the first step. This corresponds to the lower part of the right-hand diagram of FIG. 2, and yields a value from 0 to 3.
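The two lookup stages of steps S101 and S102 can be sketched as follows. The ranges 0 to 1023 and 0 to 3 come from the description; the hash function, the contents of the index table, and the four weight values are placeholders assumed for illustration, since the text does not specify them.

```python
NUM_INDEX_ENTRIES = 1024   # centroid index range 0..1023 (from the description)
NUM_REAL_CENTROIDS = 4     # real centroid index range 0..3 (from the description)


def hash_coord(row, col):
    """Hypothetical hash of a weight coordinate; the patent's hash is not given."""
    return (row * 2654435761 + col * 40503) % NUM_INDEX_ENTRIES


# Placeholder tables; in practice both would be produced by training.
index_table = [i % NUM_REAL_CENTROIDS for i in range(NUM_INDEX_ENTRIES)]
real_centroids = [-3, -1, 1, 3]  # stands in for the 16-bit shared weight values


def lookup_weight(row, col):
    centroid_index = hash_coord(row, col)       # stage 1 (S101): 0..1023
    real_index = index_table[centroid_index]    # stage 2 (S102): 0..3
    return centroid_index, real_index, real_centroids[real_index]
```

Only the small index table and the four shared values need to be stored; the weight for any (row, col) coordinate is recovered on demand through the two lookups.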

Step S103 stores all input activation values in an array. Using the real centroid index obtained for each activation in step S102, step S103 stores (accumulates) each activation value at the corresponding position from 1 to N. The post-multiplication process of FIG. 5, used in an embodiment of the present invention, is applied in step S103.

Step S104 multiplies the stored activation values by the internal median values. In step S104, the activation values accumulated (stored) at each median value coordinate position in step S103 are multiplied by the median value corresponding to that coordinate. Here, the median values are the values represented with 16 bits in the lower-right diagram of FIG. 2. This is faster and requires less memory than serially processing the multiplication with each actual activation.

Step S105 accumulates the result values of the previous step into a total sum and outputs it through an activation function. Once the products of the accumulated activations and the weights have been obtained in the previous step, step S105 computes the sum of all products, which in the end equals the sum of the products of all activations and all weights. Then, following the basic operation of a machine learning accelerator, the sum passes through an activation function called ReLU (Rectified Linear Unit) and is output to the next neuron as an activation for the next operation.
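Steps S101 to S105 can be put together in a small software model of one PE computing one output neuron. This is a sketch under stated assumptions, not the hardware design: the hash function, table contents, weight values, and function names are hypothetical, the bins are indexed from 0 rather than 1 to N, and the stages run sequentially rather than as a pipeline.

```python
NUM_INDEX_ENTRIES = 1024
NUM_REAL_CENTROIDS = 4


def hash_coord(row, col):
    """Hypothetical hash of a weight coordinate (not specified in the text)."""
    return (row * 131 + col * 7) % NUM_INDEX_ENTRIES


def pe_output(row, activations, index_table, real_centroids):
    """Software model of the five stages for output neuron `row`."""
    bins = [0] * len(real_centroids)
    for col, a in enumerate(activations):
        if a == 0:
            continue                                # zero activations are skipped
        centroid_index = hash_coord(row, col)       # S101: hash result value
        real_index = index_table[centroid_index]    # S102: real centroid index
        bins[real_index] += a                       # S103: accumulate activation
    products = [b * c for b, c in zip(bins, real_centroids)]  # S104: multiply once
    total = sum(products)                           # S105: accumulate
    return max(0, total)                            # S105: ReLU


def dense_reference(row, activations, index_table, real_centroids):
    """Reference: multiply every activation by its own weight, then ReLU."""
    total = sum(
        a * real_centroids[index_table[hash_coord(row, col)]]
        for col, a in enumerate(activations)
    )
    return max(0, total)


# Placeholder tables and inputs for illustration.
index_table = [(i * 5 + 1) % NUM_REAL_CENTROIDS for i in range(NUM_INDEX_ENTRIES)]
real_centroids = [-3, -1, 1, 3]
activations = [0, 2, 0, 5, 1, 0, 4]
```

Comparing `pe_output` with `dense_reference` for the same inputs is a simple way to check that the accumulate-then-multiply order leaves the output unchanged.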

FIG. 9 is a flowchart of an artificial neural network acceleration method using a double-stage weight sharing scheme according to an embodiment of the present invention.

In step S201, the artificial neural network accelerator calculates a hash result value through a hashing scheme using the obtained input activation value.

In step S202, the artificial neural network accelerator calculates the position index of the median value based on the calculated hash result value.

In step S203, the artificial neural network accelerator stores the input activation value in an array.

In step S204, the artificial neural network accelerator performs a multiplication operation on the stored input activation value and the median value.

In step S205, the artificial neural network accelerator accumulates the values resulting from the multiplication operation to obtain a total sum, and outputs the total sum through an activation function.

FIG. 10 is a diagram comparing training accuracy and required memory according to the number of median values.

FIG. 10 shows the result of comparing training results in order to determine the lengths of the index array and the real median value array of the double-stage parameter sharing technique according to an embodiment of the present invention. Because the training targets fully connected layers, an MLP (Multi-Layer Perceptron) with two fully connected layers was trained on the MNIST dataset.

The first bar graph 410 represents accuracy, and the fourth bar graph 420 represents the memory size required for implementation. Within the first bar graph 410, the leftmost black bar, the second bar graph 411, shows the accuracy of the pre-trained model before this training is performed. The next, dark gray bar, the third bar graph 412, shows the accuracy after training with the conventional hashing trick as the second training stage. For example, 1531/250 means that training used 1531 median values in the first layer and 250 median values in the second layer. The accuracy after the final training stage is shown by the light gray bar graph 413, and the memory required at that point is shown by the orange fourth bar graph 420. The number of median values given there is the number of real median values used in the final training stage. The length of the index array needed to use them equals the number of median values of the preceding dark gray bar graph 412. Based on the experimental results, the model requiring the least memory was selected among those whose accuracy does not fall to 90% or below: the model with an index array length of 1024/250 and a real median value array length of 4.
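A rough storage estimate for the selected configuration can be sketched as below. The array lengths (1024/250 and 4) and the 16-bit value width come from the description; the assumption that each index entry needs only enough bits to address the real median value array, and that each layer has its own array of 4 values, is made here for illustration and is not a figure reported in FIG. 10.

```python
def shared_weight_bits(index_len, num_real, value_bits=16):
    """Estimated bits to store one layer's index array and real centroid array.

    Assumes each index entry holds just enough bits to address `num_real` values.
    """
    bits_per_index = (num_real - 1).bit_length()  # 4 values -> 2 bits
    return index_len * bits_per_index + num_real * value_bits


layer1_bits = shared_weight_bits(1024, 4)  # first fully connected layer
layer2_bits = shared_weight_bits(250, 4)   # second fully connected layer
```

Under these assumptions the stored data depends on the index array length and the number of shared values, not on the number of weights in the layer.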

FIG. 11 is a diagram comparing the area and required SRAM size of a conventional parameter sharing technique and an embodiment of the present invention.

The finally selected model was implemented according to an embodiment of the present invention to examine how much the performance per area improved over conventional accelerators. FIG. 11 shows that it achieves higher performance per area than an accelerator using a conventional parameter sharing technique. The embodiment was implemented in Verilog at the register transfer level (RTL), synthesized with Design Compiler (DC) using a 32 nm CMOS technology, and its area was measured with Synopsys IC Compiler.

EIE, which uses a conventional parameter sharing technique, showed 3 times the area efficiency (Frames/mm²) and 2.9 times the throughput (Frames/s) of a conventional accelerator. In comparison, the artificial neural network accelerator according to an embodiment of the present invention (labeled Hashed) improved throughput (Frames/s) by 60% and reduced area (mm²) by 50.23% relative to EIE, improving performance per area by a factor of 3.21.
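The reported factor of 3.21 is consistent with the two other figures if performance per area is taken as throughput divided by area. The short check below uses only the percentages stated above; the function name is an assumption.

```python
def area_efficiency_gain(throughput_gain, area_reduction):
    """Relative gain in Frames/s per mm^2 versus the baseline (EIE here)."""
    return (1 + throughput_gain) / (1 - area_reduction)


# 60% higher throughput on 50.23% less area.
gain = area_efficiency_gain(0.60, 0.5023)
```

That is, 1.6 times the throughput on about 0.4977 times the area gives roughly 3.21 times the performance per area.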

As described above, the artificial neural network accelerator according to an embodiment of the present invention can minimize the data needed to represent parameters through a data format based on the double-stage parameter sharing scheme. An embodiment of the present invention can address the required data volume and the problems of the data formats of conventional parameter sharing techniques such as the hashing trick. The results were confirmed by directly implementing and training an embodiment of the present invention using TensorFlow.

Furthermore, to propose an optimal accelerator that actually uses this scheme, various designs were examined in terms of two factors, sparsity and parallelization. Feasible designs were identified, and a double-stage weight sharing method according to an embodiment of the present invention is provided to solve the anticipated problems. On this basis, the accuracy as a function of the number of median values (that is, shared parameter values) of the parameter sharing technique and the minimum memory needed for an accelerator implementation were determined, and the resulting accelerator was directly verified using Verilog.

Meanwhile, a non-transitory computer-readable storage medium may be provided for storing instructions that, when executed by a processor, cause the processor to execute a method, the method comprising: calculating a hash result value through a hashing scheme using an obtained input activation value; calculating a position index of a median value based on the calculated hash result value; storing the input activation value in an array; and performing an operation on the stored input activation value and the median value and outputting the result.

Meanwhile, according to an embodiment of the present invention, the various embodiments described above may be implemented as software including instructions stored in a machine-readable storage medium (readable by a machine such as a computer). The machine is a device capable of calling a stored instruction from the storage medium and operating according to the called instruction, and may include an electronic device (e.g., electronic device (A)) according to the disclosed embodiments. When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. An instruction may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' only means that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily in the storage medium.

In addition, according to an embodiment of the present invention, the methods according to the various embodiments described above may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least part of the computer program product may be at least temporarily stored, or temporarily created, in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

In addition, according to an embodiment of the present invention, the various embodiments described above may be implemented in a recording medium readable by a computer or a similar device, using software, hardware, or a combination thereof. In some cases, the embodiments described herein may be implemented by the processor itself. In a software implementation, embodiments such as the procedures and functions described herein may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described herein.

Meanwhile, computer instructions for performing the processing operations of a device according to the various embodiments described above may be stored in a non-transitory computer-readable medium. When executed by a processor of a specific device, the computer instructions stored in such a non-transitory computer-readable medium cause the specific device to perform the processing operations of the device according to the various embodiments described above. A non-transitory computer-readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specific examples of non-transitory computer-readable media include a CD, DVD, hard disk, Blu-ray disc, USB, memory card, and ROM.

In addition, each of the components (e.g., a module or a program) according to the various embodiments described above may consist of a single entity or multiple entities, and some of the sub-components described above may be omitted, or other sub-components may be further included in various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, identically or similarly, the functions performed by each of those components before integration. Operations performed by a module, program, or other component according to various embodiments may be executed sequentially, in parallel, iteratively, or heuristically; at least some operations may be executed in a different order or omitted; or other operations may be added.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the present invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

100: artificial neural network accelerator
110: memory
120: processor

Claims

An artificial neural network acceleration method using a double-stage weight sharing scheme, performed by an artificial neural network accelerator, the method comprising:
calculating a median value index (centroid index) for each weight coordinate of an input activation value through a hashing scheme using the obtained input activation value;
calculating a position index of a median value based on the calculated median value index;
storing the input activation value in an array according to the calculated position index of the median value; and
performing an operation on the stored input activation value and the median value corresponding to the position index of the median value, and outputting a result of the operation.

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 1, further comprising storing the obtained input activation value in an input buffer, which is a memory space.

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 2, further comprising finding the sparsity of the stored input activation value, the sparsity being its non-zero values.

(deleted)

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 1,
wherein the position index of the median value is a real centroid index.

(deleted)

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 1,
wherein the outputting comprises performing a multiplication operation on the stored input activation value and the median value.

(deleted)

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 7,
wherein the outputting further comprises accumulating the values resulting from the multiplication operation to obtain a total sum, and outputting the total sum through an activation function.

The artificial neural network acceleration method using a double-stage weight sharing scheme of claim 9,
wherein the activation function is a Rectified Linear Unit (ReLU).

An artificial neural network accelerator using a double-stage weight sharing scheme, comprising:
a memory storing one or more programs; and
a processor executing the stored one or more programs,
wherein the processor:
calculates a median value index (centroid index) for each weight coordinate of an input activation value through a hashing scheme using the obtained input activation value;
calculates a position index of a median value based on the calculated median value index;
stores the input activation value in an array according to the calculated position index of the median value; and
performs an operation on the stored input activation value and the median value corresponding to the position index of the median value, and outputs a result of the operation.

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 11, further comprising an input buffer storing the obtained input activation value in a memory space.

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 12, further comprising a non-zero detection module finding the sparsity of the stored input activation value, the sparsity being its non-zero values.

(deleted)

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 11,
wherein the position index of the median value is a real centroid index.

(deleted)

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 11,
wherein the processor performs a multiplication operation on the stored input activation value and the median value.

(deleted)

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 17,
wherein the processor accumulates the values resulting from the multiplication operation to obtain a total sum, and outputs the total sum through an activation function.

The artificial neural network accelerator using a double-stage weight sharing scheme of claim 19,
wherein the activation function is a Rectified Linear Unit (ReLU).

A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to execute a method, the method comprising:
calculating a median value index (centroid index) for each weight coordinate of an input activation value through a hashing scheme using the obtained input activation value;
calculating a position index of a median value based on the calculated median value index;
storing the input activation value in an array; and
performing an operation on the stored input activation value and the median value corresponding to the position index of the median value, and outputting a result of the operation.