KR20230160177A

KR20230160177A - Apparatus and method for multiplication operation based on outer product

Info

Publication number: KR20230160177A
Application number: KR1020230057744A
Authority: KR
Inventors: 김혜지
Original assignee: 한국전자통신연구원
Priority date: 2022-05-16
Filing date: 2023-05-03
Publication date: 2023-11-23

Abstract

An outer product-based multiplication operation device and method are disclosed. The outer product-based multiplication operation device according to an embodiment may include first internal operators that each perform a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value, second internal operators that each generate a chunk accumulation value (chunking accumulation value) using the intermediate accumulation value, and accumulation data transmission paths through which the output of any one of the first internal operators is input to any one of the second internal operators.

Description

Outer product-based multiplication operation device and method {APPARATUS AND METHOD FOR MULTIPLICATION OPERATION BASED ON OUTER PRODUCT}

The described embodiment relates to a multiple-accumulation operator processing technique for compensating the accuracy of low-precision operations in an artificial intelligence processor.

Quantization techniques are widely used to accelerate the training and inference of artificial neural networks and to reduce memory usage. Quantization processes neural network operations in a low-precision integer or floating point data format of 8 bits or fewer instead of the 32-bit floating point format.

Here, a floating point format consists of a sign (S), an exponent (E), and a mantissa (M). Of these, the bit width of the mantissa determines the precision of the number. That is, a format with many exponent bits and few mantissa bits represents a wide range of numbers coarsely. Conversely, a format with few exponent bits and many mantissa bits represents a narrow range of numbers finely.

However, when quantization is applied to the accumulation operations of a neural network, the information loss caused by quantization becomes more severe. That is, the narrower the bit width of the mantissa, which sets the precision of the floating point format used for accumulation, the larger the operation error caused by information loss can become. This phenomenon is called swamping.
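Swamping can be reproduced in a few lines. The following is a minimal sketch, not part of the disclosure: it assumes IEEE 754 half precision (10 mantissa bits) as the low-precision accumulation format, and the helper names `fp16` and `naive_sum_fp16` are hypothetical. Once the running sum reaches 2048, adding 1.0 no longer changes it, because the spacing between representable values at that magnitude is 2.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (assumed format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_sum_fp16(values):
    """Accumulate every value into a single half-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = fp16(acc + fp16(v))  # each addition is rounded back to fp16
    return acc


# The exact sum is 4096, but the accumulator stalls at 2048.
print(naive_sum_fp16([1.0] * 4096))  # 2048.0
```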

To mitigate swamping, chunking accumulation is typically used. A prior document in which chunking accumulation is applied to a deep learning accelerator is "Deep learning accelerator architecture with chunking GEMM" (US Patent Application Publication US 2019-0325301).

According to the prior document, chunking accumulation divides the data to be accumulated into chunks of a predetermined size and accumulates each chunk separately in a first stage; the per-chunk accumulation results are then accumulated in a second stage.

This process keeps the accumulated values evenly distributed, which minimizes the information loss caused by insufficient precision during floating point addition.
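The two stages described above can be sketched as follows. This is an illustrative sketch only: half precision is an assumed accumulation format, and `fp16` and `chunked_sum_fp16` are hypothetical names rather than anything defined in the prior document.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (assumed format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def chunked_sum_fp16(values, chunk_size):
    """Two-stage accumulation: per-chunk partial sums, then a sum of the partials."""
    total = 0.0  # second-stage accumulator
    for start in range(0, len(values), chunk_size):
        partial = 0.0  # first-stage accumulator, restarted for every chunk
        for v in values[start:start + chunk_size]:
            partial = fp16(partial + fp16(v))
        total = fp16(total + partial)
    return total


# With 64-element chunks the addends in each stage have comparable magnitude,
# so the half-precision result matches the exact sum.
print(chunked_sum_fp16([1.0] * 4096, 64))  # 4096.0
```

Setting `chunk_size` to the full input length collapses the scheme back to a single accumulator, and the same input then stalls at 2048.0.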

However, because chunking accumulation divides the data to be accumulated into chunks, the first-stage accumulation and the second-stage accumulation cannot be performed simultaneously by the same operator.

Therefore, the operation result must be stored in an external memory every time the accumulation of a chunk finishes. The resulting frequent memory accesses delay the accumulation operation, which can reduce the overall speed of training and inference.

To address this, neural network accelerators have been implemented with built-in hardware dedicated to chunking accumulation, so that both accumulation stages are performed inside the operator. However, mounting a double accumulation circuit in every internal operator of a neural network acceleration circuit still increases the hardware area.

An object of the described embodiment is to compensate for the accuracy of neural network accumulation operations that use a low-precision data format in an artificial intelligence processor.

Another object of the described embodiment is to solve the problem that training and inference slow down because of frequent memory accesses when chunking accumulation is applied in an artificial intelligence processor.

Another object of the described embodiment is to solve the problem of hardware area expansion for chunking accumulation operations in an artificial intelligence processor.

An outer product-based multiplication operation device according to an embodiment may include first internal operators that each perform a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value, second internal operators that each generate a chunk accumulation value (chunking accumulation value) using the intermediate accumulation value, and accumulation data transmission paths through which the output of any one of the first internal operators is input to any one of the second internal operators.

Here, each of the second internal operators may receive the output of one of the first internal operators.

Here, each of the first internal operators may transfer the intermediate accumulation value to one of the second internal operators at every period corresponding to the chunk size associated with the intermediate accumulation value, and then perform a reset operation.

Here, each of the second internal operators may perform the operation for generating the chunk accumulation value at every period corresponding to the chunk size.

Here, the operation for generating the chunk accumulation value may be an operation that adds a newly input intermediate accumulation value to the intermediate accumulation value stored in the second internal operator.
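The interaction between one first internal operator and one second internal operator can be sketched as a small cycle-level simulation. This is a behavioral sketch under stated assumptions, not the disclosed circuit: half precision is an assumed accumulation format, the class and function names are hypothetical, and flushing a trailing partial chunk at the end of the stream is an added convenience that the text does not describe.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (assumed format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


class FirstOperator:
    """Performs one MAC per cycle and holds the intermediate accumulation value."""

    def __init__(self):
        self.acc = 0.0

    def mac(self, a, b):
        self.acc = fp16(self.acc + fp16(a * b))

    def transfer_and_reset(self):
        value, self.acc = self.acc, 0.0
        return value


class SecondOperator:
    """Adds each incoming intermediate value to its stored value."""

    def __init__(self):
        self.acc = 0.0
        self.active_cycles = 0  # number of cycles in which this operator did work

    def accumulate(self, incoming):
        self.acc = fp16(self.acc + incoming)
        self.active_cycles += 1


def run_pair(a_stream, b_stream, chunk_size):
    first, second = FirstOperator(), SecondOperator()
    cycles = 0
    for a, b in zip(a_stream, b_stream):
        first.mac(a, b)
        cycles += 1
        if cycles % chunk_size == 0:  # once per chunk-size period
            second.accumulate(first.transfer_and_reset())
    if cycles % chunk_size != 0:  # assumption: flush a trailing partial chunk
        second.accumulate(first.transfer_and_reset())
    return second.acc, second.active_cycles


print(run_pair([1.0] * 4096, [1.0] * 4096, 64))  # (4096.0, 64)
```

The second operator is busy in only 64 of the 4096 cycles, which is consistent with the idea of switching it between inactive and active states once per chunk-size period.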

Here, the device may further include a controller that inputs, to each of the first internal operators and the second internal operators, a control signal for adjusting the chunk size.

Here, each of the second internal operators may switch from an inactive state to an active state at every period corresponding to the chunk size.

An outer product-based multiplication operation method according to an embodiment may include generating, by each of first internal operators, an intermediate accumulation value by performing a MAC (Multiply-Accumulation) operation; transmitting the intermediate accumulation value to one of second internal operators through an accumulation data transmission path; and generating, by each of the second internal operators, a chunk accumulation value (chunking accumulation value) using the intermediate accumulation value.

Here, the transmitting may include transferring the intermediate accumulation value to one of the second internal operators at every period corresponding to the chunk size associated with the intermediate accumulation value, and then performing a reset operation.

Here, the generating of the chunk accumulation value may include performing the operation for generating the chunk accumulation value at every period corresponding to the chunk size.

Here, the method may further include, before the generating of the chunk accumulation value, switching each of the second internal operators from an inactive state to an active state at every period corresponding to the chunk size.

An outer product-based multiplication operation device according to an embodiment may include first internal operators that each perform a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value, second internal operators that each generate a first chunk accumulation value (chunking accumulation value) using the intermediate accumulation value, third internal operators that each generate a second chunk accumulation value (chunking accumulation value) using the first chunk accumulation value, accumulation data transmission paths through which the output of any one of the first internal operators is input to any one of the second internal operators, and accumulation data transmission paths through which the output of any one of the second internal operators is input to any one of the third internal operators.

Here, each of the second internal operators may receive the output of one of the first internal operators, and each of the third internal operators may receive the output of one of the second internal operators.

Here, each of the first internal operators may transfer the intermediate accumulation value to one of the second internal operators at every period corresponding to a first chunk size associated with the intermediate accumulation value and then perform a reset operation, and each of the second internal operators may transfer the first chunk accumulation value to one of the third internal operators at every period corresponding to the product of the first chunk size and a second chunk size, which is set for chunk accumulation of the first chunk accumulation value, and then perform a reset operation.

Here, each of the second internal operators may perform the operation for generating the first chunk accumulation value at every period corresponding to the first chunk size, and each of the third internal operators may perform the operation for generating the second chunk accumulation value at every period corresponding to the product of the first chunk size and the second chunk size.

Here, the operation for generating the first chunk accumulation value may be an operation that adds a newly input intermediate accumulation value to the intermediate accumulation value stored in the second internal operator, and the operation for generating the second chunk accumulation value may be an operation that adds a newly input first chunk accumulation value to the second chunk accumulation value stored in the third internal operator.

Here, the device may further include a controller that inputs a control signal for adjusting the chunk size, or the period of the reset operation, corresponding to each of the first to third internal operators.

Here, each of the second internal operators may switch from an inactive state to an active state at every period corresponding to the first chunk size, and each of the third internal operators may switch from an inactive state to an active state at every period corresponding to the product of the first chunk size and the second chunk size.
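The activation timing of the second and third internal operators follows directly from the two chunk sizes. The sketch below lists the cycles in which each would become active; the function name `activation_cycles` and the 1-based cycle numbering are assumptions made for illustration.

```python
def activation_cycles(total_cycles, first_chunk_size, second_chunk_size):
    """Return the cycles at which the second and third operators become active.

    Cycles are numbered from 1 (an assumption). The second operators fire once
    per first chunk size; the third operators fire once per product of the
    first and second chunk sizes.
    """
    second_period = first_chunk_size
    third_period = first_chunk_size * second_chunk_size
    cycles = range(1, total_cycles + 1)
    second = [t for t in cycles if t % second_period == 0]
    third = [t for t in cycles if t % third_period == 0]
    return second, third


second, third = activation_cycles(12, first_chunk_size=2, second_chunk_size=3)
print(second)  # [2, 4, 6, 8, 10, 12]
print(third)   # [6, 12]
```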

An outer product-based multiplication operation device according to an embodiment may be extended by repeatedly adding, as N increases (N being an integer that increases by 1 starting from 1), (N+3)-th internal operators that each perform an operation for generating an (N+2)-th chunk accumulation value (chunking accumulation value) using the (N+1)-th chunk accumulation value, and accumulation data transmission paths through which the output of any one of the (N+2)-th internal operators is input to any one of the (N+3)-th internal operators, wherein the operation for generating the (N+2)-th chunk accumulation value may be an operation that adds a newly input (N+1)-th chunk accumulation value to the (N+2)-th chunk accumulation value stored in the (N+3)-th internal operator.
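The repeated extension to deeper levels can be modeled with one accumulator per level. The following is a sketch under assumptions, not the disclosed hardware: half precision is an assumed accumulation format, `hierarchical_sum_fp16` is a hypothetical name, level k is assumed to hand its value upward every product of the first k chunk sizes, and the final drain of leftover values is an added convenience.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (assumed format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def hierarchical_sum_fp16(values, chunk_sizes):
    """Accumulate through len(chunk_sizes) + 1 chained accumulators."""
    levels = [0.0] * (len(chunk_sizes) + 1)
    periods, period = [], 1
    for size in chunk_sizes:
        period *= size
        periods.append(period)  # level k transfers every periods[k] cycles

    for cycle, v in enumerate(values, start=1):
        levels[0] = fp16(levels[0] + fp16(v))
        for k, p in enumerate(periods):
            if cycle % p == 0:
                levels[k + 1] = fp16(levels[k + 1] + levels[k])
                levels[k] = 0.0  # reset after the transfer

    for k in range(len(chunk_sizes)):  # assumption: drain any leftovers upward
        levels[k + 1] = fp16(levels[k + 1] + levels[k])
        levels[k] = 0.0
    return levels[-1]


ones = [1.0] * 16384
print(hierarchical_sum_fp16(ones, []))       # 2048.0, single accumulator
print(hierarchical_sum_fp16(ones, [4]))      # 8192.0, two levels still stall
print(hierarchical_sum_fp16(ones, [4, 64]))  # 16384.0, three levels are exact
```

In this example a small first chunk size leaves the second level adding 4 to a sum that eventually reaches 8192, where the half-precision spacing is 8, so a third level is what recovers the exact result.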

According to the described embodiment, the accuracy of neural network accumulation operations that use a low-precision data format can be compensated for in an artificial intelligence processor.

According to the described embodiment, the decrease in training and inference speed caused by frequent memory accesses when chunking accumulation is applied in an artificial intelligence processor can be resolved.

According to the described embodiment, the problem of hardware area expansion for chunking accumulation operations in an artificial intelligence processor can be solved. That is, the structure of a conventional outer product-based matrix multiplication operator can be reused as is.

According to the described embodiment, accelerator utilization of up to 100% can be achieved for vector-matrix multiplication. In a vector-matrix multiplication, internal operators that would otherwise be inactive (e.g., FPUs) are used for multiple accumulation, which can raise operator utilization to as much as 100%. Because the structure allows chunk accumulation to be nested two or more levels deep, it also reduces quantization error.

In AI semiconductors, low-precision operations can be widely used for high operation speed and low power consumption. Because chunk accumulation is an essential operation in that process, a hardware structure that accelerates it is expected to be used in a variety of future AI semiconductors and therefore has high marketability and commercialization potential.

FIG. 1 is an internal configuration diagram of a general outer product-based multiplication operation device.
FIG. 2 is a flowchart for explaining the chunking accumulation process in a general outer product-based multiplication operation device.
FIG. 3 is an internal configuration diagram of an outer product-based multiplication operation device according to an embodiment.
FIG. 4 is a flowchart for explaining an outer product-based multiplication operation method according to an embodiment.
FIG. 5 is an example of vector-matrix multiplication performed in a general outer product-based multiplication operation device.
FIG. 6 is an internal configuration diagram of an outer product-based multiplication operation device according to another embodiment.
FIG. 7 is an internal configuration diagram of an outer product-based multiplication operation device according to still another embodiment.
FIG. 8 is a diagram showing the configuration of a computer system according to an embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains is fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the text specifically states otherwise. As used in the specification, "comprises" or "comprising" means that a mentioned component or step does not exclude the presence or addition of one or more other components or steps.

Unless otherwise defined, all terms used in this specification may be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless they are clearly and specifically defined.

Hereinafter, an outer product-based multiplication operation device and method according to an embodiment are described in detail with reference to FIGS. 1 to 8.

FIG. 1 is an internal configuration diagram of a general outer product-based multiplication operation device, and FIG. 2 is a flowchart for explaining the chunking accumulation process in a general outer product-based multiplication operation device.

Referring to FIG. 1, a general outer product-based matrix operator 100 may consist of a plurality of internal operators 110 that perform MAC (Multiply-Accumulation) operations.

A matrix-matrix multiplication to which chunking accumulation is applied can be performed through these internal operators 110.
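For orientation, an outer product-based matrix multiplication can be written as a sum of rank-1 updates, with one accumulator per internal operator. The sketch below is a plain software model, not the operator 100 itself; the function name and the list-of-lists matrix layout are assumptions.

```python
def outer_product_matmul(A, B):
    """Compute C = A x B as a sum of outer products.

    At step t, column t of A and row t of B are broadcast to the grid, and
    the accumulator at position (i, j) performs one MAC: C[i][j] += A[i][t] * B[t][j].
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]  # one accumulator per internal operator
    for t in range(k):                 # one outer product per step
        column = [A[i][t] for i in range(m)]
        row = B[t]
        for i in range(m):
            for j in range(n):
                C[i][j] += column[i] * row[j]
    return C


print(outer_product_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```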

Referring to FIG. 2, matrix data for the outer product operation is loaded from an external memory 10 into the outer product-based matrix operator 100 (S210).

Here, the external memory 10 may include any of the memory 1030 and the storage 1060 installed outside the outer product-based multiplication operation device 1090 in the computer system 1000 shown in FIG. 8, described later, as well as external memory accessible through the network 1080.

Next, the outer product-based matrix operator 100 performs a matrix-matrix multiplication on a portion of the loaded matrix data equal to a predetermined chunk size (S220).

Next, the outer product-based matrix operator 100 reads the previously stored partial accumulation value from the external memory 10 (S230) and performs element-wise addition of it with the result of the matrix-matrix multiplication in S220 (S240).

When the matrix-matrix multiplication is performed for the first time, no previously stored partial accumulation value exists, so S230 and S240 may be omitted.

Next, the outer product-based matrix operator 100 stores the partial accumulation value updated by the element-wise addition back in the memory 10 (S250).

Because the matrix-matrix multiplication in S220 covers only a chunk-sized portion of the loaded matrix data rather than all of it, if more matrix data remains to be processed (S260), the outer product-based matrix operator 100 returns to S220 and repeats S220 through S250.

Alternatively, steps S250 and S260 may be performed before step S230 in FIG. 2. In that case, the outer product-based matrix operator 100 may perform the element-wise addition in S240 on all the stored partial accumulation values read in S230, and the result of the element-wise addition may be transferred back to the memory 10.

Alternatively, the outer product-based matrix operator 100 may perform only steps S210, S220, S250, and S260 in FIG. 2, and steps S230 and S240 may be performed by a separate external processor.

As described above, the conventional outer product-based matrix operator 100 must store the operation result in the external memory 10 every time the MAC (Multiply-Accumulation) operation on a chunk-sized portion of matrix data finishes. The frequent memory accesses in this process delay the operation, which can reduce the overall speed of training and inference. In addition, a separate external processor may be required for the element-wise addition.
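The cost of the conventional flow can be made concrete by counting external memory accesses. The following sketch models steps S220 through S260 in software; the dictionary standing in for the memory 10, the access counters, and the function name are all illustrative assumptions.

```python
def conventional_chunked_matmul(A, B, chunk_size):
    """Model the S220-S260 loop and count accesses to the external memory."""
    m, k, n = len(A), len(B), len(B[0])
    memory = {"partial": None, "reads": 0, "writes": 0}  # stands in for memory 10

    for start in range(0, k, chunk_size):  # S260: repeat while data remains
        stop = min(start + chunk_size, k)
        # S220: matrix-matrix multiplication over one chunk of the inner dimension
        chunk = [[sum(A[i][t] * B[t][j] for t in range(start, stop))
                  for j in range(n)] for i in range(m)]
        if memory["partial"] is not None:
            previous = memory["partial"]  # S230: read the stored partial sums
            memory["reads"] += 1
            # S240: element-wise addition with the new chunk result
            chunk = [[previous[i][j] + chunk[i][j] for j in range(n)]
                     for i in range(m)]
        memory["partial"] = chunk  # S250: write the updated partial sums back
        memory["writes"] += 1

    return memory["partial"], memory["reads"], memory["writes"]


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(conventional_chunked_matmul(A, B, chunk_size=2))
# ([[4, 6], [12, 14]], 1, 2)
```

The number of writes grows with the number of chunks, which is the memory traffic the embodiment aims to avoid by keeping partial sums inside the operator array.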

Accordingly, the embodiment proposes an outer product-based multiplication operation device and method that, without additional hardware, do not need to store the operation result in an external memory every time the accumulation of a chunk finishes.

FIG. 3 is an internal configuration diagram of an outer product-based multiplication operation device according to an embodiment.

Referring to FIG. 3, the outer product-based multiplication operation device 300 according to an embodiment includes first internal operators 310 and second internal operators 320, and accumulation data transmission paths 330 may be formed between the first internal operators 310 and the second internal operators 320.

Although an outer product-based multiplication operation device 300 whose internal operators are arranged in a 4x4 structure is shown here, this is only an example to aid understanding of the embodiment, and the present invention is not limited thereto.

In addition, the first internal operators 310 may be the operators in FIG. 3 that receive data from two external input ports, for example E0.0, E1.0, ..., E7.0, and the second internal operators 320 may be the operators in FIG. 3 that receive data not from an external input port but from adjacent first internal operators 310, that is, E0.1, E1.1, ..., E7.1.

The outer product-based multiplication operation device 600 may, for example, sequentially receive vector data of a first matrix through input ports A0 to A4, sequentially receive vector data of a second matrix through input ports B0 to B1, and perform a matrix-matrix multiplication.

Here, the vector data of the first matrix and the data of the second matrix may include, for example, intermediate neural network operation data such as feature map data, or neural network training parameters such as weights.

Here, the data input through the input ports may be data quantized in a floating point format by adjusting the bit width of the mantissa, which expresses precision.
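One simple way to picture mantissa bit-width adjustment is to clear the low-order mantissa bits of a 32-bit float. The sketch below is an assumption-laden illustration: the text does not say which base format or rounding mode is used, so the choice of float32 and of truncation (rather than rounding to nearest) is made here only for simplicity, and `truncate_mantissa` is a hypothetical name.

```python
import struct


def truncate_mantissa(x: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` of a float32 mantissa (truncation).

    Assumption: float32 input and truncation toward zero; a real quantizer
    may use a different base format and rounding mode.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits
    mask = (0xFFFFFFFF >> drop) << drop  # clears the low `drop` bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


print(truncate_mantissa(1.75, 2))  # 1.75 (binary 1.11 fits in two bits)
print(truncate_mantissa(1.75, 1))  # 1.5  (binary 1.1 after truncation)
```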

In its hardware configuration for the unit accumulation (chunking accumulation) operation, the outer product-based multiplication operation device 300 according to the embodiment minimizes additional components other than the accumulation data transmission paths 330. Therefore, the problems associated with the unit accumulation (chunking accumulation) operation can be solved while reusing the structure of the outer product-based multiplication operation device 300.

Each of the first internal operators 310 may generate an intermediate accumulation value by performing a MAC (Multiply-Accumulation) operation.

누적 데이터 전송 경로(accumulation data transmission path)들(330)은, 상기 제1 내부 연산기들(310) 중 어느 하나의 출력이 상기 제2 내부 연산기들(320) 중 어느 하나로 입력되도록 할 수 있다. Accumulation data transmission paths 330 may allow the output of one of the first internal operators 310 to be input to one of the second internal operators 320.

따라서, 상기 제1 내부 연산기들(310) 각각은, 중간 누적값을 별도의 메모리에 저장하지 않고, 근접한 제2 내부 연산기들(320) 중 하나에 전달하게 된다. Accordingly, each of the first internal operators 310 transfers the intermediate accumulated value to one of the adjacent second internal operators 320, rather than storing the intermediate accumulated value in a separate memory.

그러면, 상기 제2 내부 연산기들(320) 각각은, 상기 제1 내부 연산기들(310) 중 어느 하나의 출력을 입력 받을 수 있다. Then, each of the second internal operators 320 can receive the output of any one of the first internal operators 310.

Each of the second internal operators 320 may perform an operation to generate a unit accumulation value (chunking accumulation value) using the transferred intermediate accumulation value.

Here, the operation for generating the unit accumulation value may be an operation of adding the newly input intermediate accumulation value to the intermediate accumulation value stored in the second internal operator 320.

At this time, each of the first internal operators 310 may transfer the intermediate accumulation value to one of the second internal operators 320 at every period corresponding to the unit size (chunk size) of the intermediate accumulation value, and then perform a reset operation.

또한, 상기 제2 내부 연산기들(320)은 각각 상기 단위 크기에 대응되는 주기마다 상기 단위 누적값을 생성하기 위한 연산을 수행할 수 있다. Additionally, the second internal operators 320 may each perform an operation to generate the unit accumulation value at every cycle corresponding to the unit size.

That is, each of the second internal operators 320 may generate a final accumulation value by accumulating, at every period corresponding to the unit size, only the intermediate accumulation values received from one of the first internal operators 310.
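The interaction between one first internal operator and one second internal operator can be sketched as follows. This is a behavioral model under stated assumptions, not the hardware itself: the class names `FirstOperator` and `SecondOperator`, the helper `chunked_dot`, and the final flush of a partially filled chunk are all hypothetical.

```python
class FirstOperator:
    """Performs MAC operations and holds the intermediate accumulation value."""

    def __init__(self):
        self.acc = 0.0

    def mac(self, x, w):
        self.acc += x * w

    def flush(self):
        # Hand over the intermediate value, then reset the accumulator.
        value, self.acc = self.acc, 0.0
        return value


class SecondOperator:
    """Adds each received intermediate value to its stored accumulation value."""

    def __init__(self):
        self.acc = 0.0

    def accumulate(self, partial):
        self.acc += partial


def chunked_dot(xs, ws, chunk_size):
    first, second = FirstOperator(), SecondOperator()
    for cycle, (x, w) in enumerate(zip(xs, ws), start=1):
        first.mac(x, w)
        if cycle % chunk_size == 0:           # period set by the chunk size
            second.accumulate(first.flush())  # transfer, then reset
    second.accumulate(first.flush())          # assumption: flush any leftover chunk
    return second.acc


total = chunked_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], chunk_size=2)
```

Note that the second operator is only touched once per chunk, which matches the description of it being active at every period corresponding to the unit size.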

To this end, the outer product-based multiplication operation device 300 according to the embodiment may further include a control unit (not shown) that inputs a control signal for adjusting the unit size (chunk size) to each of the first internal operators and the second internal operators. According to another embodiment, the control signal may be transmitted from an external control unit.

The unit size (chunk size) may be adjusted according to at least one of various properties, including the type of data subject to the outer product-based multiplication operation, the bit-width of the mantissa that expresses the precision of data in a floating point format, and the characteristics of the application in which the data is used.

상기 제1 내부 연산기들(310)은 각각 단위 크기(chunk size)를 기반으로 중간 누적값 전달 시기 및 리셋 시기가 설정될 수 있다. The first internal operators 310 may each have their intermediate accumulated value transfer timing and reset timing set based on the unit size (chunk size).

For each of the second internal operators 320, the time at which the operation for generating the unit accumulation value is performed may be set based on the unit size.

At this time, each of the second internal operators 320 may be set to switch from an inactive state to an active state at every period corresponding to the unit size.

Figure 4 is a flowchart for explaining an outer product-based multiplication operation method according to an embodiment.

Referring to FIG. 4, the outer product-based multiplication operation device 300 according to the embodiment first loads matrix data from the external memory 10 (S410).

Here, the external memory 10 may include all of the memory 1030 and the storage 1060 installed outside the outer product-based multiplication operation device 1090 in the computer system 1000 shown in FIG. 8, which will be described later, as well as external memory accessible through the network 1080.

Then, each of the first internal operators 310 of the outer product-based multiplication operation device 300 performs a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value (S420).

그런 후, 제1 내부 연산기들(310) 각각에 의하여, 상기 중간 누적값을 누적 데이터 전송 경로를 통해 제2 내부 연산기들 중 하나로 전송한다(S430). Then, each of the first internal operators 310 transmits the intermediate accumulated value to one of the second internal operators through the accumulated data transmission path (S430).

Each of the second internal operators 320 generates a unit accumulation value (chunking accumulation value) using the intermediate accumulation value (S440).

그런 후, 연산 대상 데이터가 존재할 경우(S450), S420 내지 S440 단계가 반복 수행될 수 있다. Then, if calculation target data exists (S450), steps S420 to S440 may be repeatedly performed.

On the other hand, if no calculation target data remains (S450), each of the first internal operators 310 transmits an operation completion signal through the accumulation data transmission path (S460), and each of the second internal operators 320 transfers the unit accumulation value accumulated so far to the memory 10 as the final result value (S470).

In the embodiment described above, the intermediate accumulation values are accumulated by the second internal operators 320. Therefore, compared with the conventional method of FIG. 2, the method according to the embodiment of FIG. 4 does not require the outer product-based multiplication operation device 300 to access external memory in order to accumulate intermediate values.

또한, 원소별 덧셈(element-wise addition) 과정이 생략되므로, 전반적으로 간단하게 결과값을 도출할 수 있다. Additionally, since the element-wise addition process is omitted, the overall result can be derived simply.

However, in the conventional outer product-based matrix operator 100 shown in FIG. 1, the accumulated values of an LxL-dimensional matrix can be calculated every cycle. Here, L denotes the number of input ports. For example, since the outer product-based matrix operator 100 shown in FIG. 1 has four input ports, A0 to A3, the accumulated values of a 4x4-dimensional matrix can be calculated every cycle.

On the other hand, in the outer product-based matrix operator 300 according to the embodiment shown in FIG. 3, the accumulated values of an Lx(L/2)-dimensional matrix can be calculated every cycle. This is because each of the second internal operators 320 does not receive data from the outside but is instead used for unit accumulation of the intermediate values transferred from one of the first internal operators 310.

그러나, 전술한 바와 같이 외부 메모리 접근에 의한 지연 시간을 감소시키게 되므로, 전체적인 연산 속도는 향상될 수 있다. However, as described above, the overall computation speed can be improved because the delay time due to external memory access is reduced.

In addition, as described above, the unit accumulation (chunking accumulation) technique is applied to alleviate the swamping phenomenon that arises when quantization is applied to the accumulation operations of a neural network, so accurate operation results with reduced quantization error can be obtained.
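The swamping effect and its mitigation by chunking can be reproduced in software. The sketch below uses IEEE 754 half precision (via the `struct` format character `e`) purely as a stand-in for a reduced-mantissa format; the data format, the chunk size of 64, and the helper names are assumptions chosen for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_sum_fp16(values):
    """Accumulate in a single half-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def chunked_sum_fp16(values, chunk_size):
    """Accumulate per chunk first, then accumulate the chunk results."""
    total = 0.0
    for start in range(0, len(values), chunk_size):
        partial = naive_sum_fp16(values[start:start + chunk_size])
        total = to_fp16(total + partial)
    return total


ones = [1.0] * 4096
naive = naive_sum_fp16(ones)         # stalls once 1.0 is below the accumulator's resolution
chunked = chunked_sum_fp16(ones, 64)
```

In this setup the single accumulator stops growing at 2048, because adding 1.0 no longer changes a half-precision value of that magnitude, while the chunked version keeps both operands of every addition at comparable magnitudes and reaches 4096.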

도 5는 일반적인 외적 기반 곱셈 연산 장치에서 벡터-행렬 곱셈이 수행되는 예시도이다. Figure 5 is an example of vector-matrix multiplication performed in a general outer product-based multiplication operation device.

The general outer product-based multiplication operator 100 shown in FIG. 1 has a structure optimized for efficiently processing matrix-matrix multiplication. Therefore, as shown in FIG. 5, when a vector-matrix multiplication operation is performed on the same hardware structure as the conventional outer product-based multiplication operation device 100, the utilization of the internal operators falls far short of 100% and only a small fraction of them is used. For example, only the internal operators corresponding to E0 to E3 are used, and the remaining internal operators are left idle.

Therefore, in the embodiment, the internal operators that are not used in the vector-matrix multiplication operation are used for multiple accumulation operations. In other words, the unit accumulation (chunking accumulation) technique is applied using the otherwise idle internal operators.
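As a rough worked example of the utilization gap, the sketch below counts active operators in a rectangular array. It assumes the 4x4 array of FIG. 1 and assumes that a vector operand drives only one of the four ports on its side, so that four operators (E0 to E3) are active; the function `utilization` is hypothetical.

```python
def utilization(a_ports, b_ports, active_a, active_b):
    """Fraction of internal operators that receive operands on both inputs."""
    return (active_a * active_b) / (a_ports * b_ports)


matrix_matrix = utilization(4, 4, active_a=4, active_b=4)  # all 16 operators busy
vector_matrix = utilization(4, 4, active_a=1, active_b=4)  # only 4 of 16 busy
```

Under these assumptions, three quarters of the array is idle during vector-matrix multiplication, which is the capacity the embodiment reuses for accumulation.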

Figure 6 is an internal configuration diagram of an outer product-based multiplication operation device according to another embodiment.

Referring to FIG. 6, the outer product-based multiplication operation device 500 according to another embodiment includes first internal operators 510 and second internal operators 520, and accumulation data transmission paths 550-1 may be formed between the first internal operators 510 and the second internal operators 520.

In addition, the outer product-based multiplication operation device 500 according to another embodiment includes third internal operators 530, and accumulation data transmission paths 550-2 may be formed between the second internal operators 520 and the third internal operators 530.

In addition, the outer product-based multiplication operation device 500 according to another embodiment includes fourth internal operators 540, and accumulation data transmission paths 550-3 may be formed between the third internal operators 530 and the fourth internal operators 540.

In addition, the first internal operators 510 may refer to the operators in FIG. 6 that receive data from two external input ports, for example, E0.0, E1.0, ..., E3.0. The second internal operators 520 may refer to the operators in FIG. 6 that receive data not from an external input port but from the adjacent first internal operators 510, that is, E0.1, E1.1, ..., E3.1. The third internal operators 530 may refer to the operators in FIG. 6 that receive data not from an external input port but from the adjacent second internal operators 520, that is, E0.2, E1.2, ..., E3.2. The fourth internal operators 540 may refer to the operators in FIG. 6 that receive data not from an external input port but from the adjacent third internal operators 530, that is, E0.3, E1.3, ..., E3.3.

That is, in its hardware configuration for the unit accumulation (chunking accumulation) operation, the outer product-based multiplication operation device 500 according to the embodiment minimizes additional components other than the accumulation data transmission paths 550-1, 550-2, and 550-3. Therefore, the problems associated with the unit accumulation (chunking accumulation) operation can be solved while reusing the structure of the outer product-based multiplication operation device 300.

Furthermore, in the outer product-based multiplication operation device 500 according to another embodiment, the second internal operators 520 to the fourth internal operators 540 are used for higher-order accumulation, which not only increases the utilization of the internal operators but also further reduces the quantization error associated with the unit accumulation (chunking accumulation) operation.

For example, when a total of 10000 values are to be accumulated, the first internal operators 510 may transfer values accumulated in units of 10 to the second internal operators 520. The second internal operators 520 may accumulate, again in units of 10, the values accumulated in units of 10 by the first internal operators 510, and transfer values representing a total of 100 accumulated inputs to the third internal operators 530. The third internal operators 530 may accumulate, again in units of 10, the values accumulated by the second internal operators 520, and transfer values representing a total of 1000 accumulated inputs to the fourth internal operators 540. The fourth internal operators 540 may accumulate, again in units of 10, the values accumulated by the third internal operators 530, and output final values representing a total of 10000 accumulated inputs.
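The multi-stage accumulation in this example can be modeled with a chain of accumulators, each forwarding its value to the next stage and resetting once it has received a full chunk. The function `multi_stage_sum`, the treatment of each input as an already computed product, and the final flush of partially filled stages are assumptions of this sketch.

```python
def multi_stage_sum(values, chunk_sizes):
    """Accumulate values through len(chunk_sizes) + 1 chained accumulator stages."""
    stages = len(chunk_sizes) + 1
    acc = [0.0] * stages   # acc[0]: first operators, acc[-1]: last stage
    count = [0] * stages   # inputs received by each stage since its last reset
    for v in values:
        acc[0] += v
        count[0] += 1
        s = 0
        # Forward and reset every stage whose chunk has just been filled.
        while s < stages - 1 and count[s] == chunk_sizes[s]:
            acc[s + 1] += acc[s]
            count[s + 1] += 1
            acc[s], count[s] = 0.0, 0
            s += 1
    # Assumption: leftover partial chunks are flushed toward the last stage.
    for s in range(stages - 1):
        acc[s + 1] += acc[s]
        acc[s] = 0.0
    return acc[-1]


final_value = multi_stage_sum([1.0] * 10000, [10, 10, 10])
```

With three chunk sizes of 10, the last stage receives ten values that each represent 1000 inputs, matching the 10000-value example.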

Each of the first internal operators 510 may generate an intermediate accumulation value by performing a MAC (Multiply-Accumulation) operation.

누적 데이터 전송 경로(accumulation data transmission path)들(550-1)은, 상기 제1 내부 연산기들(510) 중 어느 하나의 출력이 상기 제2 내부 연산기들(520) 중 어느 하나로 입력되도록 할 수 있다. Accumulation data transmission paths 550-1 may allow the output of one of the first internal operators 510 to be input to one of the second internal operators 520. .

따라서, 상기 제1 내부 연산기들(510) 각각은, 중간 누적값을 별도의 메모리에 저장하지 않고, 근접한 제2 내부 연산기들(520) 중 하나에 전달하게 된다. Accordingly, each of the first internal operators 510 transfers the intermediate accumulated value to one of the adjacent second internal operators 520, rather than storing the intermediate accumulated value in a separate memory.

그러면, 상기 제2 내부 연산기들(520) 각각은, 상기 제1 내부 연산기들(510) 중 어느 하나의 출력을 입력 받을 수 있다. Then, each of the second internal operators 520 can receive the output of any one of the first internal operators 510.

Each of the second internal operators 520 may perform an operation to generate a first unit accumulation value (chunking accumulation value) using the transferred intermediate accumulation value.

Here, the operation for generating the first unit accumulation value may be an operation of adding the newly input intermediate accumulation value to the intermediate accumulation value stored in the second internal operator 520.

누적 데이터 전송 경로(accumulation data transmission path)들(550-2)은, 상기 제2 내부 연산기들(520) 중 어느 하나의 출력이 상기 제3 내부 연산기들(530) 중 어느 하나로 입력되도록 할 수 있다. Accumulation data transmission paths 550-2 may allow the output of one of the second internal operators 520 to be input to one of the third internal operators 530. .

Accordingly, each of the second internal operators 520 transfers the first unit accumulation value to one of the adjacent third internal operators 530 rather than storing it in a separate memory.

Then, each of the third internal operators 530 can receive the output of any one of the second internal operators 520.

Each of the third internal operators 530 may perform an operation to generate a second unit accumulation value (chunking accumulation value) using the transferred first unit accumulation value.

Here, the operation for generating the second unit accumulation value may be an operation of adding the newly input first unit accumulation value to the second unit accumulation value stored in the third internal operator 530.

누적 데이터 전송 경로(accumulation data transmission path)들(550-3)은, 상기 제3 내부 연산기들(530) 중 어느 하나의 출력이 상기 제4 내부 연산기들(540) 중 어느 하나로 입력되도록 할 수 있다. Accumulation data transmission paths 550-3 may allow the output of one of the third internal operators 530 to be input to one of the fourth internal operators 540. .

Accordingly, each of the third internal operators 530 transfers the second unit accumulation value to one of the adjacent fourth internal operators 540 rather than storing it in a separate memory.

Then, each of the fourth internal operators 540 can receive the output of any one of the third internal operators 530.

Each of the fourth internal operators 540 may perform an operation to generate a third unit accumulation value (chunking accumulation value) using the transferred second unit accumulation value.

Here, the operation for generating the third unit accumulation value may be an operation of adding the newly input second unit accumulation value to the third unit accumulation value stored in the fourth internal operator 540.

Here, the operation for generating the unit accumulation value is described as being performed in three stages, but this is only an example to aid understanding of the embodiment, and the present invention is not limited thereto. That is, the operation for generating the unit accumulation value may be performed in three or more stages depending on the number of internal operators, or may be performed in two stages without using all of the internal operators shown in FIG. 6.

To this end, the outer product-based multiplication operation device 500 according to the embodiment may further include a control unit (not shown) that inputs a control signal for adjusting the number of stages of the operation that generates the unit accumulation value. According to another embodiment, the control signal may be transmitted from an external control unit.

The number of stages of the operation that generates the unit accumulation value may be adjusted according to at least one of various properties, including the type of data subject to the outer product-based multiplication operation, the bit-width of the mantissa that expresses the precision of data in a floating point format, and the characteristics of the application in which the data is used.

Meanwhile, each of the first internal operators 510 may transfer the intermediate accumulation value to one of the second internal operators 520 at every period corresponding to the first unit size (chunk size) of the intermediate accumulation value, and then perform a reset operation.

For example, if a total of 10000 values are to be accumulated, the first internal operators 510 may be reset at every period in which 10 values have been accumulated, for a total of 1000 resets.

In addition, each of the second internal operators 520 may perform the operation for generating the first unit accumulation value at every period corresponding to the first unit size.

That is, each of the second internal operators 520 may generate the first unit accumulation value by accumulating, at every period corresponding to the first unit size, only the intermediate accumulation values received from one of the first internal operators 510.

In addition, each of the second internal operators 520 may transfer the first unit accumulation value to one of the third internal operators 530 at every period corresponding to the product of the first unit size and a second unit size (chunk size) set for unit accumulation of the first unit accumulation value, and then perform a reset operation.

For example, if a total of 10000 values are to be accumulated, the second internal operators 520 may accumulate, again in units of 10, the values accumulated in units of 10 by the first internal operators 510, and may be reset at every period in which a total of 100 values have been accumulated, for a total of 100 resets.

In addition, each of the third internal operators 530 may perform the operation for generating the second unit accumulation value at every period corresponding to the product of the first unit size and the second unit size (chunk size) set for unit accumulation of the first unit accumulation value.

That is, each of the third internal operators 530 may generate the second unit accumulation value by accumulating, at every period corresponding to the product of the first unit size (chunk size) and the second unit size (chunk size), only the first unit accumulation values received from one of the second internal operators 520.

In addition, each of the third internal operators 530 may transfer the second unit accumulation value to one of the fourth internal operators 540 at every period corresponding to the product of the first unit size (chunk size), the second unit size (chunk size), and the third unit size (chunk size), and then perform a reset operation.

For example, if a total of 10000 values are to be accumulated, the third internal operators 530 may accumulate, again in units of 10, the values accumulated by the second internal operators 520, and may be reset at every period in which a total of 1000 values have been accumulated, for a total of 10 resets.

또한, 상기 제4 내부 연산기들(540)은 각각 상기 제1 단위 크기, 제2 단위 크기 및 상기 제2 단위 누적값의 단위 누적을 위해 설정된 제3 단위 크기(chunk size)를 곱한 크기에 대응되는 주기마다 제3 단위 누적값을 생성하기 위한 연산을 수행할 수 있다. In addition, the fourth internal operators 540 each correspond to a size obtained by multiplying the first unit size, the second unit size, and the third unit size (chunk size) set for unit accumulation of the second unit accumulation value. An operation to generate a third unit cumulative value may be performed for each cycle.

That is, each of the fourth internal operators 540 may generate a final accumulation value by accumulating only the second unit accumulation values received from one of the third internal operators 530, at every period corresponding to the product of the first unit size, the second unit size, and the third unit size (chunk size) set for unit accumulation of the second unit accumulation value.

예컨대, 만약 전체적으로 누적되는 데이터가 10000개일 경우, 제4 내부 연산기들(540)은, 제3 내부 연산기들(530)에서 10개 단위로 누적한 값들을 다시 10개 단위로 누적하여 총 10000개가 누적된 최종 누적값을 생성할 수 있다. For example, if the total accumulated data is 10,000, the fourth internal operators 540 accumulate the values accumulated in units of 10 by the third internal operators 530 again in units of 10, for a total of 10,000. The final accumulated value can be generated.
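The transfer periods and reset counts quoted in the 10000-value example follow directly from cumulative products of the unit sizes. The short sketch below reproduces that arithmetic; the function name `stage_schedule` and the assumption that the total is an exact multiple of every period are illustrative only.

```python
from itertools import accumulate
from operator import mul


def stage_schedule(total, chunk_sizes):
    """Return (transfer period, number of resets) for each forwarding stage."""
    # Period of stage k is the product of the first k unit sizes.
    periods = list(accumulate(chunk_sizes, mul))
    return [(period, total // period) for period in periods]


schedule = stage_schedule(10000, [10, 10, 10])
# First, second, and third internal operators, in that order.
```

For unit sizes of 10, 10, and 10 this gives periods of 10, 100, and 1000 inputs, with 1000, 100, and 10 resets respectively.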

To this end, the outer product-based multiplication operation device 500 according to the embodiment may further include a control unit (not shown) that inputs a control signal for adjusting the unit size (chunk size), or the period at which the reset operation is performed, for each of the first internal operators to the third internal operators. According to another embodiment, the control signal may be transmitted from an external control unit.

The first unit size (chunk size) to the third unit size (chunk size) may be adjusted according to at least one of various properties, including the type of data subject to the outer product-based multiplication operation, the bit-width of the mantissa that expresses the precision of data in a floating point format, and the characteristics of the application in which the data is used.

상기 제1 내부 연산기들(510)은 각각 제1 단위 크기(chunk size)를 기반으로 중간 누적값 전달 시기 및 리셋 시기가 설정될 수 있다. Each of the first internal operators 510 may set an intermediate accumulated value transfer time and a reset time based on the first unit size (chunk size).

For each of the second internal operators 520, the time at which the operation for generating the first unit accumulation value is performed may be set based on the first unit size.

At this time, each of the second internal operators 520 may be set to switch from an inactive state to an active state at every period corresponding to the first unit size.

또한, 상기 제2 내부 연산기들(520)은 각각 제1 단위 크기 및 제2 단위 크기를 기반으로 제1 단위 누적값 전달 시기 및 리셋 시기가 설정될 수 있다. Additionally, the first unit accumulation value transfer time and reset time of the second internal operators 520 may be set based on the first unit size and the second unit size, respectively.

상기 제3 내부 연산기들(530)은 각각 상기 제1 단위 크기 및 제2 단위 크기를 기반으로 상기 제2 단위 누적값을 생성하기 위한 연산을 수행하는 시기가 설정될 수 있다. The third internal operators 530 may be set at a time when they perform an operation for generating the second unit accumulation value based on the first unit size and the second unit size, respectively.

Here, each of the third internal operators 530 may be set to switch from an inactive state to an active state at every period based on the first unit size and the second unit size.

In addition, for each of the third internal operators 530, the time at which the second unit accumulation value is transferred and the time of the reset may be set based on the first to third unit sizes.

For each of the fourth internal operators 540, the time at which the operation for generating the third unit accumulation value is performed may be set based on the first to third unit sizes.

Here, each of the fourth internal operators 540 may be set to switch from an inactive state to an active state at every period based on the first to third unit sizes.
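The activation timing described above can be sketched as a small scheduling function. This is only an illustration under stated assumptions: the claims state that the third internal operators work at a period equal to the first unit size multiplied by the second unit size, and the sketch assumes the same product rule extends to the fourth stage. The function name `stage_events` and the convention of counting cycles from 1 are hypothetical.

```python
def stage_events(cycle, chunk_sizes):
    """Return the accumulation stages that become active after `cycle` MAC steps.

    chunk_sizes = (c1, c2, c3) are the first to third unit sizes (chunk sizes).
    Assumption: stage k (k >= 2) wakes up every c1 * ... * c(k-1) cycles,
    which is also when stage k-1 forwards its value and resets.
    """
    active = []
    period = 1
    for stage, size in enumerate(chunk_sizes, start=2):
        period *= size
        if cycle % period == 0:
            active.append(stage)
        else:
            break  # a later stage can only fire when the earlier one did
    return active


# With unit sizes (4, 2, 2): stage 2 every 4 cycles,
# stage 3 every 8 cycles, stage 4 every 16 cycles.
for cycle in (3, 4, 8, 12, 16):
    print(cycle, stage_events(cycle, (4, 2, 2)))
```

In this sketch a stage that is not listed for a given cycle stays inactive, which mirrors the inactive-to-active switching described for the second to fourth internal operators.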

As described above, the second internal operators 520 to the fourth internal operators 540 are used to perform the chunking accumulation technique in multiple stages, which can improve hardware utilization. In addition, deep accumulation can reduce the information loss in the accumulated result that is caused by operation inputs that are abruptly large or small. Because an operation result with reduced quantization error can thus be obtained, the accuracy of training and inference that use the result can be improved.
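The information loss that chunking accumulation is meant to reduce can be reproduced with a small numerical experiment. The sketch below is not the device itself: it assumes IEEE 754 half precision as the low-precision accumulator format, emulates it with Python's `struct` module, and uses a single chunking level with an arbitrarily chosen chunk size of 64. The helper names are hypothetical.

```python
import struct


def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_sum(values):
    """Accumulate every value in one half-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = fp16(acc + v)
    return acc


def chunked_sum(values, chunk_size):
    """Accumulate per chunk first, then accumulate the chunk results."""
    total = 0.0
    partial = 0.0
    for i, v in enumerate(values, start=1):
        partial = fp16(partial + v)
        if i % chunk_size == 0:
            total = fp16(total + partial)  # forward the chunk result
            partial = 0.0                  # reset the first-level accumulator
    return fp16(total + partial)


values = [1.0] * 4096
print(naive_sum(values))        # 2048.0
print(chunked_sum(values, 64))  # 4096.0
```

The single accumulator stalls at 2048 because adding 1 to 2048 is rounded back to 2048 in half precision, whereas the chunked version keeps each addition between operands of similar magnitude and reaches the exact sum.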

FIG. 7 is an internal configuration diagram of an outer product-based multiplication operation device according to yet another embodiment.

Referring to FIG. 7, the outer product-based multiplication operation device 600 according to yet another embodiment applies the other embodiment, in which the operation for generating unit accumulation values is performed in multiple stages as shown in FIG. 6, to the embodiment that performs matrix-matrix multiplication like the outer product-based multiplication operation device shown in FIG. 3.

For example, the outer product-based multiplication operation device 600 may sequentially receive vector data of a first matrix through input ports A0 to A5, sequentially receive vector data of a second matrix through input ports B0 to B1, and perform a matrix-matrix multiplication operation.
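The matrix-matrix multiplication described here can be pictured as a sum of outer products: at each step one column of the first matrix and one row of the second matrix are fed in, and every accumulator adds one product. The following is a minimal sketch, assuming that the six A ports and two B ports map to a 6 x 2 grid of accumulators; the function name and the list-of-lists representation are hypothetical.

```python
def outer_product_matmul(a, b):
    """Compute a @ b by accumulating one outer product per step.

    a is M x K and b is K x N (lists of lists). At step k, column k of `a`
    and row k of `b` are fed in, and accumulator (i, j) adds a[i][k] * b[k][j].
    """
    m, depth, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    for k in range(depth):
        col = [a[i][k] for i in range(m)]  # data arriving on the A ports
        row = b[k]                         # data arriving on the B ports
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]  # one MAC per accumulator
    return acc


# A 6 x 3 matrix times a 3 x 2 matrix uses 6 * 2 = 12 accumulators.
a = [[i + k for k in range(3)] for i in range(6)]
b = [[1, 2], [3, 4], [5, 6]]
print(outer_product_matmul(a, b))
```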

The outer product-based multiplication operation device 600 according to yet another embodiment includes first internal operators 610 and second internal operators 620, and accumulation data transmission paths 640-1 may be formed between the first internal operators 610 and the second internal operators 620.

In addition, the outer product-based multiplication operation device 600 according to yet another embodiment includes third internal operators 630, and accumulation data transmission paths 640-2 may be formed between the second internal operators 620 and the third internal operators 630.

In addition, the first internal operators 610 may refer to the operators in FIG. 7 that receive data from two external input ports, for example E0.0, E1.0, ..., E11.0. The second internal operators 620 may refer to the operators in FIG. 7 that receive data not from an external input port but from the adjacent first internal operators 610, namely E0.1, E1.1, ..., E11.1. The third internal operators 630 may refer to the operators in FIG. 7 that receive data not from an external input port but from the adjacent second internal operators 620, namely E0.2, E1.2, ..., E11.2.

That is, in its hardware configuration for the chunking accumulation operation, the outer product-based multiplication operation device 600 according to the embodiment minimizes additional components other than the accumulation data transmission paths 640-1 and 640-2. Therefore, the problems associated with the chunking accumulation operation can be solved while reusing the structure of a conventional outer product-based multiplication operation device.

Furthermore, in the outer product-based multiplication operation device 600 according to the other embodiment, even when a matrix-matrix multiplication operation is performed, the second internal operators 620 and the third internal operators 630 are used for higher-order accumulation. This not only raises the utilization of the internal operators but also further reduces the quantization error, which is a problem associated with the chunking accumulation operation.

The operations of the first internal operators 610 to the third internal operators 630 are the same as those of the first internal operators 510 to the third internal operators 530 shown in FIG. 6, so a detailed description is omitted.

However, the third internal operators 630 shown in FIG. 7 do not transfer the second unit accumulation value to one of the adjacent internal operators but output it to an external memory as the final accumulation value.
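The three-level flow, in which the last accumulation stage holds the final value instead of forwarding it, can be sketched for a single output element as follows. This is a simplified software model, not the hardware: the function name, the integer arithmetic, and the draining of leftover partial sums at the end of the stream are assumptions made for illustration.

```python
def three_stage_accumulate(products, c1, c2):
    """Accumulate a stream of products through three accumulator levels.

    level0 models a MAC accumulator (forwarded and reset every c1 products),
    level1 models a first unit accumulator (forwarded and reset every
    c1 * c2 products), and level2 holds the value that is output at the end.
    Returns (final_value, forwards_to_level1, forwards_to_level2).
    """
    level0 = level1 = level2 = 0
    forwards1 = forwards2 = 0
    for step, p in enumerate(products, start=1):
        level0 += p
        if step % c1 == 0:
            level1 += level0      # forward the intermediate accumulation value
            level0 = 0            # reset after forwarding
            forwards1 += 1
            if step % (c1 * c2) == 0:
                level2 += level1  # forward the first unit accumulation value
                level1 = 0
                forwards2 += 1
    # Assumption: any leftover partial sums are drained when the stream ends.
    level2 += level1 + level0
    return level2, forwards1, forwards2


print(three_stage_accumulate(range(1, 13), c1=2, c2=3))  # (78, 6, 2)
```

With exact integer arithmetic the result equals the plain sum; the staging only changes which accumulator each addition happens in, which is what matters once the accumulators are low precision.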

FIG. 8 is a diagram showing the configuration of a computer system according to an embodiment.

A system to which the outer product-based multiplication operation device 1090 according to the embodiment is applied may be implemented in a computer system 1000, such as a computer-readable recording medium.

The computer system 1000 may be an artificial intelligence system that trains a neural network or performs inference through a neural network. Here, the outer product-based multiplication operation device 1090 may be used to accelerate neural network operations during training and inference. That is, as feature map data and neural network parameters, for example weight data, are input under the control of the processor 1010, the outer product-based multiplication operation device 1090 may perform neural network operations through outer product-based multiplication.

Here, the outer product-based multiplication operation device 1090 may perform the operations described above with reference to FIGS. 3, 4, 6, and 7.

The computer system 1000 may include one or more processors 1010, a memory 1030, a user interface input device 1040, a user interface output device 1050, and a storage 1060 that communicate with each other through a bus 1020. The computer system 1000 may further include a network interface 1070 connected to a network 1080. The processor 1010 may be a central processing unit or a semiconductor device that executes programs or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be storage media including at least one of a volatile medium, a non-volatile medium, a removable medium, a non-removable medium, a communication medium, and an information delivery medium. For example, the memory 1030 may include a ROM 1031 or a RAM 1032.

According to an embodiment, the processor 1010 may control the driving of the outer product-based multiplication operation device 1090, data input and output, and the quantization of input data.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

Claims

An outer product-based multiplication operation device comprising:
first internal operators each performing a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value;
second internal operators each generating a unit accumulation value (chunking accumulation value) using the intermediate accumulation value; and
accumulation data transmission paths that allow the output of one of the first internal operators to be input to one of the second internal operators.

The outer product-based multiplication operation device of claim 1,
wherein each of the second internal operators receives the output of one of the first internal operators.

The outer product-based multiplication operation device of claim 1,
wherein each of the first internal operators transfers the intermediate accumulation value to one of the second internal operators at every period corresponding to a unit size (chunk size) associated with the intermediate accumulation value, and then performs a reset operation.

The outer product-based multiplication operation device of claim 3,
wherein each of the second internal operators performs an operation to generate the unit accumulation value at every period corresponding to the unit size.

The outer product-based multiplication operation device of claim 4,
wherein the operation to generate the unit accumulation value is
an operation that adds a newly input intermediate accumulation value and an intermediate accumulation value stored in the second internal operator.

The outer product-based multiplication operation device of claim 4,
further comprising a control unit that inputs a control signal for adjusting the unit size (chunk size) to each of the first internal operators and the second internal operators.

The outer product-based multiplication operation device of claim 4,
wherein each of the second internal operators is switched from an inactive state to an active state at every period corresponding to the unit size.

An outer product-based multiplication operation method comprising:
generating, by each of first internal operators, an intermediate accumulation value by performing a MAC (Multiply-Accumulation) operation;
transmitting the intermediate accumulation value to one of second internal operators through an accumulation data transmission path; and
generating, by each of the second internal operators, a unit accumulation value (chunking accumulation value) using the intermediate accumulation value.

The outer product-based multiplication operation method of claim 8,
wherein the transmitting comprises
transferring the intermediate accumulation value to one of the second internal operators at every period corresponding to a unit size (chunk size) associated with the intermediate accumulation value, and then performing a reset operation.

The outer product-based multiplication operation method of claim 9,
wherein the generating of the unit accumulation value comprises
performing an operation to generate the unit accumulation value at every period corresponding to the unit size.

The outer product-based multiplication operation method of claim 10,
wherein the operation to generate the unit accumulation value is
an operation that adds a newly input intermediate accumulation value and an intermediate accumulation value stored in the second internal operator.

The outer product-based multiplication operation method of claim 10,
further comprising, before the generating of the unit accumulation value, switching each of the second internal operators from an inactive state to an active state at every period corresponding to the unit size.

An outer product-based multiplication operation device comprising:
first internal operators each performing a MAC (Multiply-Accumulation) operation to generate an intermediate accumulation value;
second internal operators each generating a first unit accumulation value using the intermediate accumulation value;
third internal operators each generating a second unit accumulation value using the first unit accumulation value;
accumulation data transmission paths that allow the output of one of the first internal operators to be input to one of the second internal operators; and
accumulation data transmission paths that allow the output of one of the second internal operators to be input to one of the third internal operators.

The outer product-based multiplication operation device of claim 13,
wherein each of the second internal operators receives the output of one of the first internal operators, and
each of the third internal operators receives the output of one of the second internal operators.

The outer product-based multiplication operation device of claim 13,
wherein each of the first internal operators transfers the intermediate accumulation value to one of the second internal operators at every period corresponding to a first unit size (chunk size) associated with the intermediate accumulation value, and then performs a reset operation, and
each of the second internal operators transfers the first unit accumulation value to one of the third internal operators at every period corresponding to a size obtained by multiplying the first unit size by a second unit size (chunk size) set for unit accumulation of the first unit accumulation value, and then performs a reset operation.

The outer product-based multiplication operation device of claim 13,
wherein each of the second internal operators performs an operation to generate the first unit accumulation value at every period corresponding to the first unit size, and
each of the third internal operators performs an operation to generate the second unit accumulation value at every period corresponding to a size obtained by multiplying the first unit size by a second unit size (chunk size) set for unit accumulation of the first unit accumulation value.

The outer product-based multiplication operation device of claim 16,
wherein the operation for generating the first unit accumulation value is
an operation that adds a newly input intermediate accumulation value and an intermediate accumulation value stored in the second internal operator, and
the operation for generating the second unit accumulation value is
an operation that adds a newly input first unit accumulation value and the second unit accumulation value stored in the third internal operator.

The outer product-based multiplication operation device of claim 16,
further comprising a control unit that inputs a control signal for adjusting the unit size (chunk size), or the period at which the reset operation is performed, corresponding to each of the first to third internal operators.

The outer product-based multiplication operation device of claim 16,
wherein each of the second internal operators is switched from an inactive state to an active state at every period corresponding to the first unit size, and
each of the third internal operators is switched from an inactive state to an active state at every period calculated based on the first unit size and the second unit size.

The outer product-based multiplication operation device of claim 16,
wherein (N+3)-th internal operators, each performing an operation to generate an (N+2)-th unit accumulation value (chunking accumulation value) using an (N+1)-th unit accumulation value, where N is an integer that increases by 1 starting from 1, and
accumulation data transmission paths that allow the output of one of the (N+2)-th internal operators to be input to one of the (N+3)-th internal operators
are repeatedly added each time N increases, and
the operation for generating the (N+2)-th unit accumulation value is
an operation that adds a newly input (N+1)-th unit accumulation value and the (N+2)-th unit accumulation value stored in the (N+3)-th internal operator.