KR20220109181A

KR20220109181A - Apparatus and method for inferring capsule network with heterogeneous core architecture

Info

Publication number: KR20220109181A
Application number: KR1020210012533A
Authority: KR
Inventors: 유회준; 박광태
Original assignee: 한국과학기술원
Priority date: 2021-01-28
Filing date: 2021-01-28
Publication date: 2022-08-04
Also published as: KR102541462B1

Abstract

According to the present invention, a capsule neural network inference apparatus and method achieve high-speed processing of a capsule neural network that includes a convolution layer and a dynamic routing layer by implementing a heterogeneous core architecture in which each core is optimized for its own processing, in view of the operating characteristics of the convolution layer and the dynamic routing layer. In addition, the present invention supports group convolution and implements a convolution core that performs pipelined parallel processing in units of groups, thereby speeding up the operation of the convolution layer. In addition, the present invention identifies, at an early stage of capsule neural network inference, dynamic routing operations that can be omitted, and implements a dynamic routing core that loads only the selected input values from an external memory based on that result, thereby reducing the amount of memory access, shortening computation time, and reducing power consumption. In addition, the present invention integrates a lookup-table-based squash function block into the dynamic routing core, which reduces operation cycles and shortens capsule neural network inference time, enabling real-time capsule neural network inference.

Description

Capsule neural network inference device and method based on heterogeneous core architecture

The present invention relates to a neural network inference apparatus and method, and more particularly, to a heterogeneous core architecture-based capsule neural network inference apparatus and method for accelerating inference of a capsule neural network that includes a convolution layer and a dynamic routing layer.

Deep neural networks (DNNs) are used in various fields such as computer vision and natural language processing, and in areas such as image recognition and question answering they achieve performance exceeding that of the average person (human performance).

However, convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are types of deep neural network, are exposed to adversarial attacks, which maliciously induce faults in the network, and must cope with domain shift when the network operates in a user environment different from the environment in which it was trained.

Conventional neural networks have addressed these problems through learning, whereas a capsule neural network can address them at the inference stage without a separate learning process.

A capsule neural network is a neural network in which a dynamic routing layer is added to the intermediate layers of a general deep neural network (DNN), and it allows performance to be maintained even in a user environment different from the training environment, without a separate dataset for tuning. Because of this advantage, capsule neural networks are used in various fields such as image and 3D point cloud processing and natural language processing.

Both the inputs and the outputs of the dynamic routing layer in a capsule neural network take the form of multidimensional neurons called capsules. An output neuron (hereinafter, "output capsule") is computed as the weighted sum of the input neurons (hereinafter, "input capsules"), each weighted by its coupling coefficient. The coupling coefficients are initially set to the same value across the plurality of layers, but can be updated over a plurality of iterations using the relationship between the input capsules of the previous layer and the computed output capsule. The process of computing the output capsule from the input capsules is called the feed-forward (FF) process, and the process of updating the coupling coefficients is called the feedback (FB) process. Repeating the FF and FB processes improves the performance of the capsule neural network; that is, the capsule neural network reaches the desired performance by performing the FF and FB processes repeatedly.

Meanwhile, whereas the convolution layers that make up most of a deep neural network are compute-intensive, the dynamic routing layer in a capsule neural network is memory-intensive. There are two main reasons why a capsule neural network containing a dynamic routing layer requires a large amount of memory access.

First, to generate the input capsules of the dynamic routing layer, a capsule neural network passes through a fully connected layer that converts the feature map output by the convolution layer into capsules. Because feature maps located in intermediate layers tend to be large, the fully connected layer has a very large number of parameters, which increases the amount of memory access.

Second, to improve the performance of the capsule neural network, the feed-forward (FF) and feedback (FB) processes must be repeated several times, and all of the input capsules must be retained throughout, which increases the amount of memory access. For example, if dynamic routing is repeated three times, the input capsules must be accessed a total of six times.
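The access count in the example above can be reproduced with a short Python sketch. It assumes, as the example implies, that every routing iteration reads the input capsules once in the FF pass and once in the FB pass; the function name is hypothetical.

```python
def input_capsule_accesses(routing_iterations: int) -> int:
    """Number of times the input capsules are read during dynamic routing.

    Assumption: each iteration reads them once for feed-forward (FF)
    and once for feedback (FB).
    """
    passes_per_iteration = 2  # FF + FB
    return routing_iterations * passes_per_iteration


print(input_capsule_accesses(3))  # 3 iterations -> 6 accesses, as in the text
```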

Thus, because a conventional capsule neural network requires a large amount of memory access, real-time inference with a capsule neural network has been difficult in mobile environments, where memory bandwidth is scarce and power supply is limited.

In addition, in the feed-forward (FF) process, a capsule neural network uses an activation function called the squash function to normalize the intermediate capsule. The squash function requires computing the L2-norm, which involves not only multiplication and addition but also square-root and division operations. Because square root and division consume many more cycles than multiplication and addition, they slow down the inference of the capsule neural network.
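For reference, the conventional squash function can be sketched in Python as below. The formula is the one given by Sabour et al. (2017); the function name and the example vector are illustrative only. The comments mark where the square root and the divisions occur.

```python
import math


def squash(s):
    """Squash activation: keeps the direction of s, maps its length into [0, 1)."""
    squared_norm = sum(x * x for x in s)        # multiplications and additions
    if squared_norm == 0.0:
        return [0.0 for _ in s]                 # avoid dividing by zero
    norm = math.sqrt(squared_norm)              # square root (costly in hardware)
    scale = squared_norm / (1.0 + squared_norm) / norm  # two divisions (costly)
    return [scale * x for x in s]


v = squash([3.0, 4.0])  # input length 5, output length 25/26
```

The output length is always below 1, so it can be read as a probability-like activation, while the direction of the vector is preserved.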

That is, conventionally, the use of the squash function, which consumes many operation cycles, has slowed down capsule neural network inference.

S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic Routing Between Capsules," Conference on Neural Information Processing Systems (NIPS), 2017.

Accordingly, an object of the present invention is to provide a heterogeneous core architecture-based capsule neural network inference apparatus and method capable of high-speed processing, by implementing, for a capsule neural network including a convolution layer and a dynamic routing layer, a heterogeneous core architecture in which each core is optimized for its own processing in view of the operating characteristics of the convolution layer and the dynamic routing layer, respectively.

Another object of the present invention is to provide a heterogeneous core architecture-based capsule neural network inference apparatus and method that accelerate the operation of the convolution layer by implementing a convolution core that supports group convolution and performs pipelined parallel processing in units of groups.

Another object of the present invention is to provide a heterogeneous core architecture-based capsule neural network inference apparatus and method that reduce the amount of memory access by implementing a dynamic routing core that identifies, at an early stage of capsule neural network inference, dynamic routing operations that can be omitted and, based on that result, loads only the selected input values from an external memory.

Another object of the present invention is to provide a heterogeneous core architecture-based capsule neural network inference apparatus and method that integrate a lookup-table-based squash function block into the dynamic routing core, thereby reducing the operation cycles of the squash function and shortening capsule neural network inference time so that real-time capsule neural network inference becomes possible.

To achieve the above objects, the heterogeneous core architecture-based capsule neural network inference apparatus provided by the present invention is an apparatus for inference of a capsule neural network including a convolution layer and a dynamic routing layer, the apparatus comprising: a convolution core device that divides input channels into a predetermined number of groups and performs a convolution operation for each group, performing the per-group convolution operations in parallel to generate n primary capsules (where n is a natural number); and a dynamic routing core device that converts the n primary capsules, through a matrix multiplication operation on each of them, into m x n input capsules having a multidimensional form, and then generates an output capsule based on operations on the input capsules and coupling coefficients, wherein the dynamic routing core device selectively updates the coupling coefficients based on the similarity between the input capsules and the output capsule in order to reduce the number of accesses to the input capsules.

Preferably, the convolution core device may include: a first input memory that stores input data; a weight memory that stores weights; at least one convolution processing unit that is configured in a pipeline structure and performs convolution operations by applying the weights to the data supplied through the first input memory, performing the convolution operations in parallel to generate the primary capsules; and an output memory for outputting the primary capsules. The convolution processing unit may include: a first group processing unit that divides the initial channels supplied through the first input memory into a predetermined number of groups and performs a first convolution operation for each group; a channel shuffle operation unit that performs a channel shuffle operation, randomly mixing the grouped channels, to compensate for the accuracy loss of the first convolution operation; and a second group processing unit that divides the output of the channel shuffle operation unit into a predetermined number of groups again and performs a second convolution operation for each group.
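The sequence of first group convolution, channel shuffle, and second group convolution can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the disclosed hardware: it models a 1x1 convolution at a single pixel, and the helper names, channel counts, and identity weights are all hypothetical.

```python
import random


def split_into_groups(channels, groups):
    """Split a channel list into `groups` equal, contiguous groups."""
    size = len(channels) // groups
    return [channels[g * size:(g + 1) * size] for g in range(groups)]


def grouped_pointwise_conv(channels, weights, groups):
    """1x1 group convolution at one pixel: each group sees only its own channels.

    weights[g][o][i] is the weight from input i to output o inside group g.
    """
    out = []
    for g, group in enumerate(split_into_groups(channels, groups)):
        # Groups are independent of each other, so hardware can process them in parallel.
        for row in weights[g]:
            out.append(sum(w * x for w, x in zip(row, group)))
    return out


def channel_shuffle(channels, rng):
    """Randomly permute channels so the next group convolution mixes across groups."""
    shuffled = list(channels)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical sizes: 4 channels, 2 groups, identity weights inside each group.
identity = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
x = [1.0, 2.0, 3.0, 4.0]
stage1 = grouped_pointwise_conv(x, identity, groups=2)      # first group convolution
mixed = channel_shuffle(stage1, random.Random(0))           # channel shuffle
stage2 = grouped_pointwise_conv(mixed, identity, groups=2)  # second group convolution
```

Without the shuffle, a channel in one group could never influence outputs computed from another group; the permutation between the two stages is what restores that cross-group information flow.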

Preferably, the dynamic routing core device may include: a second input memory that receives the n primary capsules generated by the convolution core device; a weight/coupling coefficient memory that stores weights and coupling coefficients; at least one dynamic routing processing unit that generates multidimensional input capsules from the primary capsules and then performs processing to generate an output capsule through operations on the input capsules and the coupling coefficients; a squash function operation unit that normalizes the intermediate capsule produced during the operation of the dynamic routing processing unit; a skip control unit that determines whether to update the coupling coefficients based on the similarity between the input capsules and the output capsule; and a coupling coefficient update unit that updates the coupling coefficients under the control of the skip control unit and stores the result in the weight/coupling coefficient memory.

Preferably, the dynamic routing processing unit may pair each input capsule with its coupling coefficient and perform a vector-scalar product, and then sum the vector-scalar product results of all the input capsules to generate an intermediate capsule.
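The vector-scalar product followed by summation can be sketched in Python as follows. The function name and the example values are illustrative assumptions.

```python
def intermediate_capsule(input_capsules, coupling):
    """Weighted sum s = sum_i c_i * u_i over the input capsule vectors u_i."""
    dim = len(input_capsules[0])
    s = [0.0] * dim
    for u, c in zip(input_capsules, coupling):
        scaled = [c * x for x in u]               # vector-scalar product
        s = [a + b for a, b in zip(s, scaled)]    # accumulate into the running sum
    return s


u = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # three 2-dimensional input capsules
c = [0.5, 0.25, 0.25]                     # one coupling coefficient per input capsule
s = intermediate_capsule(u, c)
print(s)  # [1.0, 1.0]
```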

Preferably, the squash function operation unit may perform a squash function operation to normalize the norm of the intermediate capsule generated by the dynamic routing processing unit, and may do so using approximate computing and a lookup table.

Preferably, the squash function operation unit may store a lookup table in which normalization data is stored for each L2-norm value of an arbitrary capsule, approximate the L2-norm of the intermediate capsule by a linear combination of its L1-norm and its L-infinity norm, and then normalize the intermediate capsule by using the approximated L2-norm as an index into the lookup table.
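A minimal Python sketch of this idea follows. The specification does not give the combination coefficients, the table size, or the table contents, so everything numeric here is a hypothetical choice: `ALPHA` and `BETA` are picked so that the approximation is exact for a one-hot vector and for a uniform 16-dimensional vector, and the table stores the factor n / (1 + n^2), which equals the exact squash scaling when multiplied by the vector.

```python
ALPHA, BETA = 0.8, 0.2          # hypothetical weights for the L-infinity and L1 norms
LUT_SIZE, NORM_MAX = 256, 8.0   # hypothetical table size and largest indexed norm
STEP = NORM_MAX / (LUT_SIZE - 1)

# Precomputed scale n / (1 + n^2) for each quantized norm n.
# squash(s) = (n^2 / (1 + n^2)) * (s / n) = s * n / (1 + n^2), so no sqrt or
# division is needed at run time once the table is filled.
SQUASH_LUT = [(i * STEP) / (1.0 + (i * STEP) ** 2) for i in range(LUT_SIZE)]


def approx_l2_norm(s):
    """Approximate ||s||_2 as ALPHA * ||s||_inf + BETA * ||s||_1."""
    l1 = sum(abs(x) for x in s)
    linf = max(abs(x) for x in s)
    return ALPHA * linf + BETA * l1


def lut_squash(s):
    """Squash s using the approximated norm as an index into the table."""
    n = approx_l2_norm(s)
    # Quantize to a table index; hardware would likely use a fixed-point shift here.
    index = min(LUT_SIZE - 1, int(round(n / STEP)))
    scale = SQUASH_LUT[index]
    return [scale * x for x in s]


one_hot = lut_squash([2.0, 0.0, 0.0, 0.0])  # exact squash would give 0.8 in the first slot
```

With these illustrative coefficients the approximation error depends on how the vector's energy is distributed, so an actual design would tune the coefficients and table resolution to the capsule dimension and the accuracy target.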

Preferably, the dynamic routing processing unit may include first and second registers and may repeat the output capsule generation process a number of times preset to determine the accuracy of the capsule neural network, storing first intermediate capsules, generated from coupling coefficients updated by the coupling coefficient update unit, and second intermediate capsules, generated from coupling coefficients that were not updated, in the first and second registers respectively, and summing the values stored in the first and second registers at the last iteration.
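One plausible reading of the two-register scheme is sketched below in Python. It assumes that contributions from input capsules whose coefficients were skipped stay unchanged between iterations, so they can be held in a separate partial sum instead of being recomputed from re-fetched input capsules. The function names and the flag list are hypothetical.

```python
def accumulate_split(input_capsules, coupling, updated_flags):
    """Accumulate c_i * u_i into one of two partial sums.

    reg_updated models the first register (coefficients that were updated);
    reg_static models the second register (coefficients that were skipped).
    """
    dim = len(input_capsules[0])
    reg_updated = [0.0] * dim
    reg_static = [0.0] * dim
    for u, c, flag in zip(input_capsules, coupling, updated_flags):
        target = reg_updated if flag else reg_static
        for k, x in enumerate(u):
            target[k] += c * x
    return reg_updated, reg_static


def final_intermediate_capsule(reg_updated, reg_static):
    """At the last iteration, the two partial sums are added together."""
    return [a + b for a, b in zip(reg_updated, reg_static)]


u = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
c = [0.5, 0.25, 0.25]
flags = [True, False, True]   # hypothetical skip-control decisions
r1, r2 = accumulate_split(u, c, flags)
total = final_intermediate_capsule(r1, r2)
```

The combined result equals the ordinary weighted sum over all input capsules; only the bookkeeping differs.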

Preferably, the skip control unit may calculate the cosine similarity between the input capsule and the output capsule, compare the cosine similarity with a preset threshold, and cause the coupling coefficient to be updated selectively, only when the cosine similarity exceeds the threshold.
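The skip decision can be illustrated with the following Python sketch. The threshold value and the function names are assumptions; the specification only states that a preset threshold is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_update(input_capsule, output_capsule, threshold):
    """True only when the similarity strictly exceeds the threshold."""
    return cosine_similarity(input_capsule, output_capsule) > threshold


THRESHOLD = 0.5  # hypothetical preset threshold
v = [1.0, 0.0]   # output capsule
candidates = [[2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
flags = [should_update(u, v, THRESHOLD) for u in candidates]
print(flags)  # [True, False, False]
```

Only the first candidate points in nearly the same direction as the output capsule, so only its coupling coefficient would be updated and only that input capsule would need to be read again.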

Meanwhile, to achieve the above objects, the capsule neural network inference method provided by the present invention is a method that uses a heterogeneous core architecture-based capsule neural network inference apparatus for inference of a capsule neural network including a convolution layer and a dynamic routing layer, the method comprising: a convolution operation step of dividing input channels into a predetermined number of groups and performing a convolution operation for each group, performing the per-group convolution operations in parallel through a pipeline to generate n primary capsules (where n is a natural number); an input capsule generation step of generating m x n input capsules having a multidimensional form through a matrix multiplication operation on each of the n primary capsules; an output capsule generation (FF) step of generating an output capsule based on operations on the input capsules and coupling coefficients; and a coupling coefficient update (FB) step of calculating the similarity between the input capsules and the output capsule and selectively updating the coupling coefficients only when the similarity exceeds a preset threshold.

Preferably, the convolution operation step may include: a first group operation step of dividing the initial input channels into a predetermined number of groups and performing a first convolution operation for each group; a channel shuffle operation step of, when the first convolution operation is finished, randomly mixing the grouped channels to compensate for the accuracy loss of its result; and a second group operation step of, when the channel shuffle operation is finished, dividing its result into a predetermined number of groups again and performing a second convolution operation for each group.

Preferably, the output capsule generation (FF) step may include: a multiplication step of pairing each input capsule with its coupling coefficient and performing a vector-scalar product; an addition step of summing all the results of the multiplication step to generate an intermediate capsule; and a squash function operation step of normalizing the norm of the intermediate capsule through a squash function operation based on approximate computing and a lookup table.

Preferably, the squash function operation step may include: an approximate computing step of approximating the L2-norm of the intermediate capsule by a linear combination of its L1-norm and its L-infinity norm; and a normalization step in which a lookup table storing normalization data for each L2-norm value of an arbitrary capsule is stored in advance, and the intermediate capsule is normalized by using the approximated L2-norm of the intermediate capsule as an index into the lookup table.

Preferably, the coupling coefficient update (FB) step may include: a similarity calculation step of calculating the cosine similarity between the input capsule and the output capsule; a similarity comparison step of comparing the cosine similarity with a preset threshold; and a softmax operation step of updating the coupling coefficient, only when the comparison shows that the cosine similarity exceeds the preset threshold, by a softmax operation and storing the result in an internal memory.
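The softmax-based update can be sketched in Python as follows. The accumulation of agreement into routing logits follows the routing algorithm of Sabour et al. (2017); the per-entry gating by `update_flags` and all names are assumptions about how the skip decision might be applied, not the disclosed circuit.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(b - peak) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_coupling(logits, agreements, update_flags):
    """Update routing logits for one primary capsule and recompute its coefficients.

    logits[j]: routing logit toward output capsule j.
    agreements[j]: agreement between the input capsule for j and output capsule j.
    update_flags[j]: skip-control decision; False leaves that logit untouched.
    """
    new_logits = [b + a if flag else b
                  for b, a, flag in zip(logits, agreements, update_flags)]
    return new_logits, softmax(new_logits)


logits = [0.0, 0.0, 0.0]   # equal initial values give uniform coupling coefficients
new_logits, coupling = update_coupling(
    logits, [2.0, 0.5, -1.0], [True, False, False])
```

After this update, the first coefficient is larger than the other two, which remain equal to each other because their logits were skipped.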

Preferably, the output capsule generation (FF) step and the coupling coefficient update (FB) step may be repeated a number of times preset to determine the accuracy of the capsule neural network, and the addition step may store first intermediate capsules, generated from the updated coupling coefficients, and second intermediate capsules, generated from coupling coefficients that were not updated, in different registers, and sum the values stored in the different registers at the last iteration.

The heterogeneous core architecture-based capsule neural network inference apparatus and method of the present invention have the advantage of enabling high-speed processing by implementing, for a capsule neural network including a convolution layer and a dynamic routing layer, a heterogeneous core architecture in which each core is optimized for its own processing in view of the operating characteristics of the convolution layer and the dynamic routing layer, respectively.

In addition, the present invention has the advantage of speeding up the operation of the convolution layer by implementing a convolution core that supports group convolution and performs pipelined parallel processing in units of groups.

In addition, the present invention has the advantage of reducing the amount of memory access, and thereby shortening computation time and reducing power consumption, by implementing a dynamic routing core that identifies, at an early stage of capsule neural network inference, dynamic routing operations that can be omitted and, based on that result, loads only the selected input values from an external memory.

In addition, the present invention has the advantage of reducing operation cycles by integrating a lookup-table-based squash function block into the dynamic routing core, thereby shortening capsule neural network inference time and enabling real-time capsule neural network inference.

FIG. 1 is a schematic block diagram of a heterogeneous core architecture-based capsule neural network inference apparatus according to an embodiment of the present invention.
FIGS. 2 to 6 are processing flowcharts of a capsule neural network inference method according to an embodiment of the present invention.
FIGS. 7 and 8 are diagrams for explaining a convolution operation process according to an embodiment of the present invention.
FIG. 9 is a diagram for explaining a process of generating multidimensional input capsules through a dynamic routing operation according to an embodiment of the present invention.
FIG. 10 is a diagram for explaining a process of generating an output capsule through a dynamic routing operation according to an embodiment of the present invention.
FIG. 11 is a diagram for explaining a general squash function operation process and the latency it incurs.
FIGS. 12 and 13 are diagrams for explaining the configuration and operation of a squash function operation unit according to an embodiment of the present invention.
FIG. 14 is a diagram for explaining the relative change in coupling coefficients after dynamic routing has been repeated once.
FIG. 15 is a diagram for explaining a process of updating coupling coefficients through a dynamic routing operation according to an embodiment of the present invention.
FIG. 16 is a diagram for explaining a process of selectively updating coupling coefficients according to an embodiment of the present invention.
FIG. 17 is a diagram for explaining a pipeline structure in units of input channels according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification. Descriptions of matters that those skilled in the art can readily understand without detailed explanation are also omitted.

Throughout the specification and claims, when a part is said to include a certain component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a schematic block diagram of a heterogeneous core architecture-based capsule neural network inference apparatus according to an embodiment of the present invention. Referring to FIG. 1, the capsule neural network inference apparatus (100) according to an embodiment of the present invention includes a plurality of gateways (10, 20), a convolution core device (110), and a dynamic routing core device (120).

The gateways (10, 20) may connect an external memory (not shown) to the capsule neural network inference apparatus (100). The gateways (10, 20) may be used to transfer weights stored in the external memory (not shown) to the capsule neural network inference apparatus (100), and to transfer processing results generated by the capsule neural network inference apparatus (100) to the external memory (not shown).

The convolution core device (110) divides input channels into a predetermined number of groups and performs a convolution operation for each group, performing the per-group convolution operations in parallel to generate n primary capsules (where n is a natural number).

To this end, the convolution core device 110 includes an input memory 111, a weight memory 112, at least one convolution processing unit 113, an aggregation core (Aggr. Core) 114, an output memory 115, a BN block 116, a Max Pool block 117, and a ReLU block 118.

The input memory 111 stores input data, and the weight memory 112 stores the weights required for the convolution operation.

The at least one convolution processing unit 113 has a pipeline structure and performs the convolution operation by applying weights to the data supplied through the input memory, carrying out the convolution operations in parallel to generate the primary capsules. To this end, the at least one convolution processing unit 113 may include a plurality of processing elements (PEs).

The aggregation core (Aggr. Core) 114 aggregates the primary capsules output from the at least one convolution processing unit 113 and passes them to the output memory 115.

The output memory 115 outputs the primary capsules.

The dynamic routing core device 120 converts the n primary capsules, through a matrix multiplication on each of them, into m x n multi-dimensional input capsules, and then generates output capsules from an operation on those input capsules and coupling coefficients. The dynamic routing core device 120 is connected to an external memory (not shown) through the gateways 10 and 20 and stores the input capsules there, so the operation that generates the output capsules must repeatedly access the external memory to read new input capsules. To reduce the number of such accesses, the dynamic routing core device 120 updates the coupling coefficients selectively, based on the similarity between the input capsules and the output capsules. By reducing the number of input capsule accesses needed to update the coupling coefficients, the dynamic routing core device 120 can accelerate inference of the capsule neural network.

To this end, the dynamic routing core device 120 includes an input memory 121, a weight/coupling coefficient memory 122, at least one dynamic routing processing unit 123, a squash function operation unit 124, a skip control unit 125, and a coupling coefficient update unit 126.

The input memory 121 receives the n primary capsules generated by the convolution core device 110.

The weight/coupling coefficient memory 122 stores the weights and coupling coefficients required for the dynamic routing operation.

The at least one dynamic routing processing unit 123 generates m x n multi-dimensional input capsules from the n primary capsules and then performs the processing that generates output capsules from the input capsules and the coupling coefficients stored in the weight/coupling coefficient memory 122. To this end, the dynamic routing processing unit 123 pairs each input capsule with its coupling coefficient, performs a vector-scalar product, and sums the vector-scalar products of the input capsules to generate an intermediate capsule. In particular, the dynamic routing processing unit 123 repeats the output capsule generation process a preset number of times, which determines the accuracy of the capsule neural network, and generates the intermediate capsule from the partial sums produced in each iteration.
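The vector-scalar product and summation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: capsules are modeled as plain lists of floats, and the function name `intermediate_capsule` is hypothetical, not taken from the specification.

```python
def intermediate_capsule(input_capsules, coefficients):
    """Return sum_i c_i * u_i for capsules u_i and scalar coefficients c_i."""
    if len(input_capsules) != len(coefficients):
        raise ValueError("one coupling coefficient is needed per input capsule")
    dim = len(input_capsules[0])
    partial_sum = [0.0] * dim
    for capsule, coeff in zip(input_capsules, coefficients):
        # Vector-scalar product, accumulated into the running partial sum.
        for k in range(dim):
            partial_sum[k] += coeff * capsule[k]
    return partial_sum


capsules = [[1.0, 2.0], [3.0, 4.0]]
uniform = [0.5, 0.5]  # coupling coefficients start out equal
print(intermediate_capsule(capsules, uniform))  # [2.0, 3.0]
```

With equal coefficients the intermediate capsule is a scaled sum of the input capsules; later iterations would change only the coefficients, not this accumulation step.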

The coupling coefficients are initially set to the same value, but as iterations repeat they may be updated using the relationship between the generated input capsules and the computed output capsules. The coupling coefficient update unit 126 determines, under the control of the skip control unit 125, whether to update each coupling coefficient. The dynamic routing processing unit 123 stores the intermediate capsule generated from the updated coupling coefficients (hereinafter, the first intermediate capsule) and the intermediate capsule generated from the non-updated coupling coefficients (hereinafter, the second intermediate capsule) in different registers, and in the last iteration sums the values stored in those registers to generate the final capsule. That is, the second intermediate capsules are no longer computed after the first dynamic routing iteration because their operation is skipped, whereas the first intermediate capsule changes in every dynamic routing iteration. The two must therefore be stored separately, and for this purpose the dynamic routing processing unit 123 may include first and second registers.

The squash function operation unit 124 normalizes the intermediate capsules produced during the operation of the dynamic routing processing unit 123.

To this end, the squash function operation unit 124 performs a squash function operation to normalize the norm of the intermediate capsule generated by the dynamic routing processing unit 123, and does so using approximate computing together with a lookup table.

Here, a norm generally refers to a method (function) for measuring the length or magnitude of a vector, and different norms can be defined according to their order; in the present invention it means a method of measuring the magnitude of a capsule. For example, the L1-norm, the norm of order 1, is obtained by summing the absolute values of the capsule's elements. The L2-norm, the norm of order 2, is obtained by squaring each element, summing the squares, and taking the square root of the sum. The L-infinity norm is the norm as the order goes to infinity: relative to the component with the largest absolute value, the contribution of every other component vanishes when the root of that order is taken, so the norm equals the largest absolute value among the capsule's components.

Because dividing a capsule by its L2-norm always yields a capsule of magnitude 1, the L2-norm is the norm best suited to bringing arbitrary capsules to the same magnitude. The squash function operation unit 124 therefore uses the L2-norm to normalize the norm of the intermediate capsule.
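The three norms and the unit-magnitude property can be checked with a short Python sketch. The function names are illustrative only, and capsules are assumed to be non-empty lists of floats.

```python
import math


def l1_norm(capsule):
    # Sum of the absolute values of the elements.
    return sum(abs(x) for x in capsule)


def l2_norm(capsule):
    # Square each element, sum, then take the square root.
    return math.sqrt(sum(x * x for x in capsule))


def linf_norm(capsule):
    # Largest absolute value among the elements.
    return max(abs(x) for x in capsule)


def to_unit_length(capsule):
    # Dividing by the L2-norm gives a capsule of magnitude 1 (non-zero input assumed).
    norm = l2_norm(capsule)
    return [x / norm for x in capsule]


capsule = [3.0, -4.0]
print(l1_norm(capsule), l2_norm(capsule), linf_norm(capsule))  # 7.0 5.0 4.0
```

For this capsule the ordering L-infinity <= L2 <= L1 is visible directly (4 <= 5 <= 7).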

To this end, the squash function operation unit 124 stores a lookup table that maps each L2-norm value of an arbitrary capsule to its normalization data, approximates the L2-norm of the intermediate capsule as a linear combination of its L1-norm and L-infinity norm, and normalizes the intermediate capsule by using the approximated L2-norm as the index into the lookup table.

The skip control unit 125 determines whether to update the coupling coefficient based on the similarity between the input capsule and the output capsule.

To this end, the skip control unit 125 calculates the cosine similarity between the input capsule and the output capsule, compares it with a preset threshold, and controls the update so that the coupling coefficient is selectively updated only when the cosine similarity exceeds the threshold.

The coupling coefficient update unit 126 updates the coupling coefficient under the control of the skip control unit 125 and stores the result in the weight/coupling coefficient memory 122.

FIGS. 2 to 6 are processing flowcharts of a capsule neural network inference method according to an embodiment of the present invention. The method is described below with reference to FIGS. 1 to 6.

First, in step S100, a variable CNT for counting the dynamic routing operations of the capsule neural network inference apparatus 100 is initialized. The capsule neural network inference apparatus 100 repeats the output capsule generation process (the feed-forward, FF, stage) and the coupling coefficient update process (the feedback, FB, stage) a preset number of times, which determines the accuracy of the capsule neural network, and the variable CNT counts these repetitions.

In step S200, the convolution core device 110 performs a convolution operation to generate primary capsules. That is, the convolution core device 110 divides the input channels into a predetermined number of groups and performs a convolution operation on each group, carrying out the per-group convolution operations in parallel through a pipeline to generate n primary capsules (where n is a natural number).

FIG. 3 illustrates the specific process of generating the primary capsules by the convolution operation in step S200.

Referring to FIG. 3, first, in step S210, the convolution core device 110 performs a first group operation. That is, the convolution core device 110 divides the initial channels supplied through the input memory 111 into a predetermined number of groups and performs a first convolution operation on each group.

In step S220, the convolution core device 110 performs a channel shuffle operation. That is, the convolution core device 110 applies to the result of the first convolution operation of step S210 a channel shuffle that randomly mixes the channels that were divided into groups. This compensates for the accuracy loss in the result of the first convolution operation.
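As a rough illustration of mixing channels across groups, the sketch below uses the deterministic interleaving shuffle known from ShuffleNet-style networks. This is a swapped-in technique chosen for reproducibility: the specification describes the mixing as random and does not give the permutation, so the function `channel_shuffle` and its ordering are assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each new group draws from every old group."""
    count = len(channels)
    if count % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    per_group = count // groups
    # View the list as a (groups x per_group) grid and read it column by column.
    return [
        channels[g * per_group + k]
        for k in range(per_group)
        for g in range(groups)
    ]


# Channels 0-2 belong to group 0 and channels 3-5 to group 1.
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, splitting the list into two groups again gives `[0, 3, 1]` and `[4, 2, 5]`, so each group of the second convolution sees channels from both original groups.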

In step S230, the convolution core device 110 performs a second group operation. That is, the convolution core device 110 divides the result of the channel shuffle operation into a predetermined number of groups again and performs a second convolution operation on each group.

Dividing the input channels into a predetermined number of groups in step S200 is one way of making the convolution layer lightweight; doing so reduces the amount of computation in proportion to the number of groups.
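The proportional saving can be made concrete by counting multiply-accumulate (MAC) operations. The sketch below assumes a standard square-kernel convolution without bias; the function `conv_macs` and the layer sizes are illustrative and not taken from the specification.

```python
def conv_macs(c_in, c_out, kernel, out_h, out_w, groups=1):
    """MAC count of a (grouped) convolution with a square kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the group count")
    # Each output channel only sees the c_in / groups input channels of its group.
    return (c_in // groups) * c_out * kernel * kernel * out_h * out_w


dense = conv_macs(64, 128, 3, 8, 8)
grouped = conv_macs(64, 128, 3, 8, 8, groups=4)
print(dense, grouped, dense // grouped)  # 4718592 1179648 4
```

With four groups the MAC count drops by a factor of four, matching the statement that the computation shrinks in proportion to the number of groups.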

However, simply replacing a convolution with a group convolution causes a loss of accuracy, so a channel shuffling layer must be added after the group convolution layer to compensate.

FIG. 7 illustrates the situation in which a convolution layer is replaced with a group convolution layer in this way.

When the channel shuffle operation is performed last in this way, however, the next operation cannot start until all group convolution operations have finished, because the channel shuffle operation creates dependencies between different channel groups (channel group dependency).

Accordingly, as illustrated in FIG. 8, the present invention replaces one convolution layer with two group convolution layers and adds a channel shuffle layer between them. The first group convolution layer has very few output channels and so accounts for only a small part of the total computation, while the second group convolution layer, which accounts for most of the computation, follows the channel shuffle layer, so no dependency between channel groups remains.

As a result, step S200 prevents the large computational load of the convolution operation from delaying capsule neural network inference.

In step S300 of FIG. 2, the dynamic routing core device 120 generates multi-dimensional input capsules. That is, the dynamic routing processing unit 123 receives from the input memory 121 the n primary capsules generated in step S200 and, through a matrix multiplication on each of them, generates m x n multi-dimensional input capsules.

FIG. 9 is a diagram for explaining the process of generating multi-dimensional input capsules through the dynamic routing operation according to an embodiment of the present invention, and illustrates how a primary capsule is converted into input capsules through matrix multiplication. One primary capsule is multiplied by a total of m matrices to produce m capsules, and this is done for each of the n primary capsules to produce a total of m x n input capsules. Because this multiplication must produce multiple input capsules for every initial capsule, and because the capsule that serves as the basic unit of the operation has multiple dimensions, the operation requires a large volume of memory accesses for its parameters, unlike other layers that are processed neuron by neuron. These parameters can, however, be greatly reduced by existing techniques such as quantization applied during neural network training.
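The expansion from n primary capsules to m x n input capsules can be sketched as follows. The layout `weights[i][j]` (one matrix per pair of primary capsule i and output capsule j) and the function names are assumptions made for illustration; the specification does not state how the matrices are organized or whether they are shared.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_input_capsules(primary_capsules, weights):
    """Return a grid where entry [i][j] is weights[i][j] applied to primary capsule i."""
    return [
        [matvec(matrix, capsule) for matrix in weights[i]]
        for i, capsule in enumerate(primary_capsules)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

primary = [[1.0, 2.0], [3.0, 4.0]]          # n = 2 primary capsules
weights = [[identity, double, swap]] * 2    # m = 3 matrices per primary capsule

grid = make_input_capsules(primary, weights)
print(len(grid) * len(grid[0]))  # 6 input capsules in total
print(grid[1][2])                # [4.0, 3.0]
```

Each of the m x n matrices is read once per inference, which is why the parameter traffic of this step is large compared with neuron-level layers.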

In step S400 of FIG. 2, the dynamic routing core device 120 generates output capsules. That is, the dynamic routing core device 120 generates the output capsules from an operation on the input capsules generated in step S300 and the coupling coefficients. The coupling coefficients are initially set to the same value across the plurality of layers, but may be updated over a plurality of iterations using the relationship between the input capsules of the previous layer and the computed output capsules.

FIG. 4 illustrates the specific process by which the dynamic routing core device 120 generates the output capsules in step S400.

Referring to FIG. 4, first, in step S410, the dynamic routing core device 120 performs a vector-scalar product of the input capsules and the coupling coefficients. That is, the dynamic routing processing unit 123 pairs each input capsule generated in step S300 with the corresponding coupling coefficient supplied from the weight/coupling coefficient memory 122 and performs the vector-scalar product.

In step S420, the dynamic routing processing unit 123 sums all the vector-scalar products resulting from step S410 to generate an intermediate capsule. In particular, the intermediate capsule is generated from the partial sums produced in each iteration. The intermediate capsules generated from coupling coefficients that are updated as iterations repeat (hereinafter, the first intermediate capsules) and the intermediate capsules generated from non-updated coupling coefficients (hereinafter, the second intermediate capsules) are stored in different registers, and in the last iteration the values stored in those registers are summed to generate the final capsule. That is, the second intermediate capsules are no longer computed after the first dynamic routing iteration because their operation is skipped, whereas the first intermediate capsule changes in every dynamic routing iteration. The two must therefore be stored separately, and for this purpose the dynamic routing processing unit 123 may include first and second registers.
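The split accumulation into two registers can be sketched as below. The per-capsule skip flags, the list-based registers, and the function names are assumptions for illustration; in the described hardware the second register would be filled once and reused, while only the first is recomputed per iteration.

```python
def accumulate(input_capsules, coefficients, skipped):
    """Accumulate c_i * u_i into one of two registers according to the skip flags."""
    dim = len(input_capsules[0])
    first_register = [0.0] * dim   # capsules whose coefficients are still updated
    second_register = [0.0] * dim  # capsules whose coefficient updates are skipped
    for capsule, coeff, is_skipped in zip(input_capsules, coefficients, skipped):
        target = second_register if is_skipped else first_register
        for k in range(dim):
            target[k] += coeff * capsule[k]
    return first_register, second_register


def final_capsule(first_register, second_register):
    # Performed in the last iteration: add the two partial results.
    return [a + b for a, b in zip(first_register, second_register)]


capsules = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
coeffs = [0.5, 0.5, 0.25]
first, second = accumulate(capsules, coeffs, skipped=[False, True, False])
print(first, second, final_capsule(first, second))
# [1.5, 1.0] [0.0, 1.0] [1.5, 2.0]
```

Because the second register never changes after it is filled, the input capsules that feed it do not need to be fetched from external memory again.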

In step S430, the squash function operation unit 124 performs the squash function operation. That is, the squash function operation unit 124 performs the squash function operation to normalize the norm of the intermediate capsule, using approximate computing together with a lookup table.

FIG. 10 is a diagram for explaining the process of generating output capsules through the dynamic routing operation according to an embodiment of the present invention. It illustrates how each input capsule generated for each initial capsule in step S300 is multiplied by its coupling coefficient (vector-scalar product) and added to the results computed from the other input capsules, after which a squash function operation, whose purpose is to normalize the norm of the intermediate capsule, finally produces the output capsule.

FIG. 5 illustrates the specific process by which the squash function operation unit 124 normalizes the intermediate capsule in step S430.

Referring to FIG. 5, first, in step S431, the squash function operation unit 124 approximates the L2-norm of the intermediate capsule by approximate computing. That is, the squash function operation unit 124 approximates the L2-norm of the intermediate capsule as a linear combination of its L1-norm and L-infinity norm.

In step S432, the squash function operation unit 124 normalizes the intermediate capsule by the lookup table method. That is, a lookup table that maps each L2-norm value of an arbitrary capsule to its normalization data is stored in advance, and the approximated L2-norm of the intermediate capsule is used as the index into the lookup table to normalize the intermediate capsule.
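Steps S431 and S432 can be sketched together as follows. Several things here are assumptions rather than details from the specification: the squash formula is the commonly used one, in which a capsule s is scaled by |s| / (1 + |s|^2); the coefficients `ALPHA` and `BETA`, the table size, and the table range are placeholder values; and the table stores one scale factor per quantized norm.

```python
import math

ALPHA, BETA = 0.4, 0.6   # hypothetical linear-combination coefficients
TABLE_SIZE = 64          # hypothetical number of lookup table entries
MAX_NORM = 8.0           # hypothetical largest norm covered by the table
STEP = MAX_NORM / TABLE_SIZE


def squash_scale(norm):
    # Assumed squash: output = (norm^2 / (1 + norm^2)) * s / norm = s * norm / (1 + norm^2).
    return norm / (1.0 + norm * norm)


# One precomputed scale factor per norm bucket, sampled at the bucket centre.
LUT = [squash_scale((i + 0.5) * STEP) for i in range(TABLE_SIZE)]


def approx_l2(capsule):
    # Step S431: no squaring or square root, only sums, a maximum, and two multiplies.
    l1 = sum(abs(x) for x in capsule)
    linf = max(abs(x) for x in capsule)
    return ALPHA * l1 + BETA * linf


def squash_lut(capsule):
    # Step S432: the approximated norm indexes the table; norms above MAX_NORM
    # saturate to the last entry, so the range must cover the expected inputs.
    index = min(int(approx_l2(capsule) / STEP), TABLE_SIZE - 1)
    scale = LUT[index]
    return [scale * x for x in capsule]


out = squash_lut([3.0, 4.0])
print(math.sqrt(sum(x * x for x in out)))  # below 1, close to the exact 25/26
```

The only per-capsule arithmetic left after the table read is one multiplication per element, which is the kind of work the existing processing elements already perform.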

FIG. 11 is a diagram for explaining a conventional squash function operation and the latency it incurs. Referring to FIG. 11, the squash operation mainly used as the activation of the dynamic routing layer consists of two stages: the first computes the L2-norm of the capsule (Calculate L2-norm), and the second normalizes the capsule using that L2-norm (Normalize Input). The first stage involves squaring, summation, and square root operations, and the second involves squaring, summation, and division. The conventional squash function operation therefore requires many operation cycles. Lookup tables have been used in the past to reduce this latency, but because the squash function shows a different data distribution for each operation and each input, the area and power consumed by the lookup table become large.

Accordingly, the present invention combines approximate computing with the lookup table method to speed up the squash function at a low hardware cost. Examples of the configuration and operation of this squash function are illustrated in FIGS. 12 and 13.

FIGS. 12 and 13 are diagrams for explaining the configuration and operation of the squash function operation unit according to an embodiment of the present invention.

First, referring to FIG. 12, the L2-norm is never larger than the L1-norm and never smaller than the L-infinity norm, so the L2-norm can be approximated by a linear combination of these two norms. The coefficients of the linear combination are determined experimentally, and may be chosen as the values that give the smallest mean square error on a known data set. The hardware unit that performs this L2-norm approximation needs to include only comparators, arithmetic units, and multipliers.
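One way to choose the two coefficients is an ordinary least-squares fit over sample capsules, sketched below. The specification only says the coefficients are determined experimentally to minimize the error on a known data set; the closed-form fit, the absence of an intercept term, and the function names are assumptions.

```python
import math


def norms(capsule):
    l1 = sum(abs(x) for x in capsule)
    linf = max(abs(x) for x in capsule)
    l2 = math.sqrt(sum(x * x for x in capsule))
    return l1, linf, l2


def fit_coefficients(capsules):
    """Least-squares (alpha, beta) such that alpha * L1 + beta * Linf approximates L2."""
    s11 = s1i = sii = s12 = si2 = 0.0
    for capsule in capsules:
        l1, linf, l2 = norms(capsule)
        s11 += l1 * l1
        s1i += l1 * linf
        sii += linf * linf
        s12 += l1 * l2
        si2 += linf * l2
    # Solve the 2x2 normal equations with Cramer's rule; the determinant is
    # zero if L1 is proportional to Linf on every sample.
    det = s11 * sii - s1i * s1i
    alpha = (s12 * sii - si2 * s1i) / det
    beta = (s11 * si2 - s1i * s12) / det
    return alpha, beta


alpha, beta = fit_coefficients([[1.0, 0.0], [1.0, 1.0]])
print(alpha, beta)  # about 0.414 and 0.586 for this two-sample set
```

With only two samples the fit is exact (alpha = sqrt(2) - 1, beta = 2 - sqrt(2)); on a realistic data set the same routine returns the coefficients with the smallest mean square error.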

Referring to FIG. 13, capsule normalization can be implemented using the lookup table together with the processing elements used for matrix multiplication: the L2-norm approximation obtained above is used as the index into the lookup table, and the value read out is fed to the processing elements. In this way, an operation that used to take several cycles because of its squaring, addition, and division steps can be processed in a single cycle, and reusing processing elements already used for other operations reduces wasted area. Accordingly, with the squash function operation unit illustrated in FIG. 13, the unit accounts for only 1.25% of the total power and 0.81% of the total area.

In step S500 of FIG. 2, the dynamic routing core device 120 updates the coupling coefficients. That is, the dynamic routing core device 120 calculates the similarity between the input capsule and the output capsule and selectively updates the coupling coefficient based on that similarity, updating it only when the similarity exceeds a preset threshold.

FIG. 6 illustrates the specific process by which the dynamic routing core device 120 updates the coupling coefficients in step S500.

Referring to FIG. 6, first, in step S510, the skip control unit 125 calculates the cosine similarity between the input capsule and the output capsule.

In step S520, the skip control unit 125 compares the cosine similarity with a preset threshold.

If the comparison shows that the cosine similarity exceeds the preset threshold, the coupling coefficient update unit 126 updates the coupling coefficient by a softmax operation in step S530, and the result is stored in the weight/coupling coefficient memory 122 in step S540.

This exploits the observation, illustrated in FIG. 14, that when the change of the coupling coefficients is tracked during dynamic routing, coefficients whose cosine similarity lies within a certain range fluctuate far less than the others. By not updating a coupling coefficient when the cosine similarity is within the preset threshold, the power consumption problem of the capsule neural network inference apparatus is addressed.

FIG. 15 is a diagram for explaining the process of updating the coupling coefficient through the dynamic routing operation according to an embodiment of the present invention. It illustrates how the relationship between an input capsule and an output capsule is calculated as a cosine similarity, and how the value obtained by passing that similarity through a softmax function is added to the existing coupling coefficient to produce a new coupling coefficient. Through this process, the connection to an output capsule that is strongly influenced by an input capsule is strengthened, and the connection is weakened otherwise.
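The update described for FIG. 15 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the array shapes, the function names (`cosine_similarity`, `softmax`, `update_coupling`), and the choice to apply the softmax across output capsules are all assumptions made for the example.

```python
import numpy as np

def cosine_similarity(u_hat, v, eps=1e-12):
    """Cosine similarity between every input capsule and every output capsule.

    u_hat: (num_in, num_out, dim) input capsules (assumed layout)
    v:     (num_out, dim) output capsules
    """
    dot = np.einsum("iod,od->io", u_hat, v)
    norms = np.linalg.norm(u_hat, axis=-1) * np.linalg.norm(v, axis=-1)[None, :]
    return dot / (norms + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def update_coupling(c, u_hat, v):
    """Add the softmax of the similarity to the existing coupling coefficients.

    Assumption: the softmax is taken over the output-capsule axis.
    """
    sim = cosine_similarity(u_hat, v)
    return c + softmax(sim, axis=1)
```

With uniformly initialized coefficients, a connection whose input capsule points in the same direction as the output capsule receives the larger increment, so it gains weight relative to the other connections of the same input capsule.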

By repeating steps S400 and S500, the connections between capsules are dynamically adjusted.

FIG. 16 is a diagram for explaining the process of selectively updating the coupling coefficients according to an embodiment of the present invention. It shows how the decision whether to skip dynamic routing is made when the coupling coefficients are selectively updated, and how the skipping takes place in subsequent steps.

Referring to FIG. 16, in the first feedforward step of dynamic routing (1st iteration FF), all coupling coefficients are initialized to the same value, and the output capsules are computed from them.

After this step, the feedback step begins and the cosine similarity between the input and output capsules is calculated. The cosine similarity result emerges from an adder tree as a single scalar value, which is fed to the skip index generator and compared with a predetermined threshold. Whether the cosine similarity exceeds the threshold determines whether the update can be skipped, and this information is stored in the form of a bitmap. That is, the update can be skipped when the cosine similarity does not exceed the threshold.
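The skip decision can be illustrated with a short NumPy sketch. The function name `skip_bitmap`, the flat one-dimensional layout of the similarities, and the bit packing order are assumptions made for illustration; the source states only that the comparison result is stored as a bitmap.

```python
import numpy as np

def skip_bitmap(similarity, threshold):
    """Mark coupling coefficients whose update can be skipped.

    A coefficient is skippable when its cosine similarity does not
    exceed the threshold. Returns the boolean mask and the same mask
    packed into bytes (most significant bit first, zero padded).
    """
    skippable = similarity <= threshold
    return skippable, np.packbits(skippable.ravel())

# Hypothetical similarities for four coupling coefficients.
sim = np.array([0.1, 0.9, 0.3, 0.2])
mask, packed = skip_bitmap(sim, threshold=0.5)
skipped_fraction = mask.mean()  # share of coefficients that need no update
```

In this example only the second coefficient exceeds the threshold, so only its input capsule would have to be fetched again in later iterations.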

During the cycles in which the skip decision is made, the processing element array of the dynamic routing core proceeds in advance with the feedforward process of the next routing iteration, under the assumption that every coupling coefficient will be skipped (always-skip prediction). The coupling coefficient values that have not yet been updated are used, and the input capsules that were stored during the preceding cosine similarity operation are reused as they are.

If, in a later cycle, the coupling coefficient turns out to be skippable, all subsequent operations for it are skipped, and the partial sum produced by the feedforward operation is stored in a local register file. If it cannot be skipped, the cosine similarity result is passed to the coefficient update unit, a softmax operation is performed, and the newly calculated coupling coefficient value is stored in on-chip SRAM. In the following iterations, feedforward and feedback are performed only for the non-skippable coefficients.

The final output capsule is completed by adding the partial sum accumulated in the last iteration to the partial sum obtained from the skipped coefficients, which was stored earlier in the local register file and is reused.
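The partial-sum reuse can be checked numerically with the sketch below. It is an illustration under assumptions: the function names, the array shapes, and the omission of the squash normalization are choices made for the example, and the cached array stands in for the local register file.

```python
import numpy as np

def cache_skipped_partial(c_old, u_hat, skippable):
    """Partial sum over skipped connections, computed once with the
    not-yet-updated coefficients (stands in for the local register file)."""
    return np.einsum("io,iod->od", c_old * skippable, u_hat)

def final_weighted_sum(c_new, u_hat, skippable, cached_partial):
    """Weighted sum of the last iteration: only non-skipped connections
    are recomputed, then the cached partial sum is added."""
    active_partial = np.einsum("io,iod->od", c_new * ~skippable, u_hat)
    return active_partial + cached_partial

# Hypothetical sizes: 6 input capsules, 3 output capsules, dimension 4.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))
c_old = np.full((6, 3), 1.0 / 3.0)
c_new = rng.random((6, 3))
skippable = rng.random((6, 3)) < 0.5

cached = cache_skipped_partial(c_old, u_hat, skippable)
out = final_weighted_sum(c_new, u_hat, skippable, cached)
```

The result equals the full weighted sum in which skipped connections keep their old coefficients, so the input capsules of skipped connections do not need to be read again.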

As a result of dynamic routing skipping, only the input capsules for non-skippable coupling coefficients need to be loaded from external memory, so external memory access is reduced by 39.1%. In addition, the sign mismatch ratio between output capsules calculated with dynamic routing skipping and output capsules calculated without it is only 0.07%, so the accuracy of the capsule neural network remains almost the same as without skipping.

Meanwhile, in step S600 of FIG. 2, the number of times (CNT) that steps S400 and S500 have been performed is counted, and in step S700, steps S400 and S500 are repeated until the count (CNT) exceeds a number (N) preset to determine the accuracy of the capsule neural network.

The capsule neural network inference apparatus of the present invention described above is implemented with a heterogeneous core architecture, as illustrated in FIG. 1. That is, two different cores, which support the convolution operation and the dynamic routing and fully connected operations, are arranged as a pipeline so that their operations can be processed in parallel. FIG. 17 is a diagram for explaining this pipeline structure. Referring to FIG. 17, the computation of one channel of the group convolution overlaps with the weight loading time for the fully connected operation, so the memory access time is hidden. As a result, latency is reduced by 35.4% compared with the configuration before pipelining. That is, the inference time of the capsule neural network can be shortened by 35.4%.
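The effect of hiding weight loading behind computation can be illustrated with a simple timing model. This is a sketch under assumptions, not a model of the actual hardware: the two-function interface and the cycle counts are hypothetical, and the numbers are not intended to reproduce the 35.4% figure reported above.

```python
def serial_latency(load, compute):
    """Total cycles when every weight load finishes before its compute starts
    and nothing overlaps."""
    return sum(load) + sum(compute)

def pipelined_latency(load, compute):
    """Total cycles when the compute of stage i overlaps with the weight
    load of stage i + 1 (assumed double buffering)."""
    total = load[0]  # the first load cannot be hidden
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total

# Hypothetical cycle counts for three stages.
load = [4, 4, 4]
compute = [6, 6, 6]
saved = serial_latency(load, compute) - pipelined_latency(load, compute)
```

With these hypothetical values the serial schedule takes 30 cycles and the overlapped schedule takes 22, because every load except the first is hidden behind the preceding computation.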

As described above, the present invention proposes a hardware-friendly capsule neural network optimization method and a pipelined heterogeneous core architecture that reduce latency by 55%, and it reduces external memory access by 39.1% through a dynamic routing skipping algorithm and a dynamic routing core that implements it. Finally, the squash function block operates at high speed while occupying only about 1% of the total area and power.

Therefore, the capsule neural network inference apparatus of the present invention enables real-time inference with a capsule neural network in a mobile environment where memory bandwidth and battery supply are limited.

In addition, the present invention is applicable to all fields in which existing deep neural networks are used. In particular, because a capsule neural network can maintain accuracy under distortion or deformation of the input without retraining the network, it can be used where the surrounding environment or the shapes of objects change continuously.

Although preferred embodiments of the present invention have been presented and described above, the present invention is not necessarily limited to them, and those of ordinary skill in the art to which the present invention pertains will readily appreciate that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

10, 20: gateway 100: capsule neural network inference apparatus
110: convolution core device 120: dynamic routing core device

Claims

1. A heterogeneous core architecture-based capsule neural network inference apparatus for inferring a capsule neural network including a convolution layer and a dynamic routing layer, the apparatus comprising:
a convolution core device that divides input channels into a predetermined number of groups, performs a convolution operation for each group, and performs the per-group convolution operations in parallel to generate n primary capsules (where n is a natural number); and
a dynamic routing core device that transforms each of the n primary capsules, through a matrix multiplication operation, into m×n input capsules having a multidimensional form, and then generates an output capsule based on an operation between the input capsules and coupling coefficients,
wherein the dynamic routing core device
selectively updates the coupling coefficient based on the similarity between the input capsule and the output capsule in order to reduce the number of accesses to the input capsule.

2. The apparatus of claim 1, wherein the convolution core device comprises:
a first input memory for storing input data;
a weight memory for storing weights;
at least one convolution operation processing unit configured in a pipeline structure to perform a convolution operation by applying the weights to data input through the first input memory, and to generate the primary capsules by performing the convolution operations in parallel; and
an output memory for outputting the primary capsules,
wherein the convolution operation processing unit comprises:
a first group processing unit that divides the initial channels input through the first input memory into predetermined groups and then performs a first convolution operation for each group;
a channel shuffle operation unit that performs a channel shuffle operation for randomly mixing the channels divided into the groups, in order to compensate for the loss of accuracy of the first convolution operation; and
a second group processing unit that divides the processing result of the channel shuffle operation unit into predetermined groups and then performs a second convolution operation for each group.

3. The apparatus of claim 1, wherein the dynamic routing core device comprises:
a second input memory for receiving the n primary capsules generated by the convolution core device;
a weight/coupling coefficient memory for storing weights and coupling coefficients;
at least one dynamic routing processing unit that generates multidimensional input capsules from the primary capsules and then generates an output capsule by an operation between the input capsules and the coupling coefficients;
a squash function calculation unit for normalizing an intermediate capsule generated during the operation of the dynamic routing processing unit;
a skip control unit that determines whether to update the coupling coefficient based on the similarity between the input capsule and the output capsule; and
a coupling coefficient update unit that updates the coupling coefficient under the control of the skip control unit and stores the result in the weight/coupling coefficient memory.

4. The apparatus of claim 3, wherein the dynamic routing processing unit
performs a vector-scalar multiplication by matching each of the input capsules with its coupling coefficient, and then generates an intermediate capsule by summing the vector-scalar multiplication results of the input capsules.

5. The apparatus of claim 4, wherein the squash function calculation unit
performs a squash function operation to normalize the norm value of the intermediate capsule generated by the dynamic routing processing unit, the squash function operation being performed based on approximate computing and a lookup table.

6. The apparatus of claim 5, wherein the squash function calculation unit
stores a lookup table that maps each L2-norm of an arbitrary capsule to its normalized data, and,
after approximating the L2-norm value of the intermediate capsule by a linear combination of the L1-norm value and the L-infinity norm value of the intermediate capsule, normalizes the intermediate capsule by using the approximated L2-norm value as an index into the lookup table.

7. The apparatus of claim 4, wherein the dynamic routing processing unit
further comprises first and second registers,
repeats the output capsule generation process a preset number of times in order to determine the accuracy of the capsule neural network,
stores, in the first and second registers respectively, first intermediate capsules generated based on the coupling coefficients updated by the coupling coefficient update unit and second intermediate capsules generated based on the non-updated coupling coefficients, and
sums the values stored in the first and second registers in the last iteration.

8. The apparatus of claim 3, wherein the skip control unit
calculates the cosine similarity between the input capsule and the output capsule, compares the cosine similarity with a preset threshold, and controls the coupling coefficient to be updated only when the cosine similarity exceeds the threshold.

9. A capsule neural network inference method using a heterogeneous core architecture-based capsule neural network inference apparatus for inferring a capsule neural network including a convolution layer and a dynamic routing layer, the method comprising:
a convolution operation step of dividing input channels into a predetermined number of groups and performing a convolution operation for each group, the per-group convolution operations being performed in parallel through a pipeline to generate n primary capsules (where n is a natural number);
an input capsule generation step of generating m×n input capsules having a multidimensional form through a matrix multiplication operation for each of the n primary capsules;
an output capsule generation (FF) step of generating an output capsule based on an operation between the input capsules and the coupling coefficients; and
a coupling coefficient update (FB) step of calculating the similarity between the input capsule and the output capsule and selectively updating the coupling coefficient only when the similarity exceeds a preset threshold.

10. The method of claim 9, wherein the convolution operation step comprises:
a first group operation step of dividing the initial input channels into a predetermined number of groups and performing a first convolution operation for each group;
a channel shuffle operation step of, when the first convolution operation is completed, randomly mixing the channels divided into the groups in order to compensate for the loss of accuracy of its result; and
a second group operation step of, when the channel shuffle operation is completed, dividing its result into a predetermined number of groups and performing a second convolution operation for each group.

11. The method of claim 9, wherein the output capsule generation (FF) step comprises:
a multiplication step of performing a vector-scalar multiplication by matching each of the input capsules with its coupling coefficient;
an addition step of summing all the results of the multiplication step to generate an intermediate capsule; and
a squash function calculation step of normalizing the norm value of the intermediate capsule through a squash function operation based on approximate computing and a lookup table.

12. The method of claim 11, wherein the squash function calculation step comprises:
an approximate computing step of approximating the L2-norm value of the intermediate capsule by a linear combination of the L1-norm value and the L-infinity norm value of the intermediate capsule; and
a normalization step of normalizing the intermediate capsule by storing in advance a lookup table that maps each L2-norm of an arbitrary capsule to its normalized data and using the approximated L2-norm value of the intermediate capsule as an index into the lookup table.

13. The method of claim 11, wherein the coupling coefficient update (FB) step comprises:
a similarity calculation step of calculating a cosine similarity between the input capsule and the output capsule;
a similarity comparison step of comparing the cosine similarity with a preset threshold; and
a step of updating the coupling coefficient only when the comparison shows that the cosine similarity exceeds the preset threshold, the coupling coefficient being updated by a softmax operation and the result being stored in an internal memory.

14. The method of claim 13, wherein
the output capsule generation (FF) step and the coupling coefficient update (FB) step are repeated a preset number of times in order to determine the accuracy of the capsule neural network, and
the addition step comprises:
storing the first intermediate capsules generated based on the updated coupling coefficients and the second intermediate capsules generated based on the non-updated coupling coefficients in different registers, respectively; and
summing, at the last iteration, the values stored in the different registers.