KR20190114208A

KR20190114208A - In DRAM Bitwise Convolution Circuit for Low Power and Fast Computation

Info

Publication number: KR20190114208A
Application number: KR1020180036574A
Authority: KR
Inventors: 민경식
Original assignee: 국민대학교산학협력단
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2019-10-10
Also published as: KR102154834B1

Abstract

The present invention relates to a bitwise convolution circuit for DRAM for low-power, high-speed operation, which performs bitwise convolution efficiently by using a processing-in-memory circuit inside the DRAM. The circuit comprises a plurality of processing units that execute convolution operations on input data and partial results. Each processing unit performs bitwise convolution with a ternary kernel using a processing-in-memory circuit structure inside the DRAM, has a plurality of bit lines (BLs) crossing a single word line (WL), has pixels at the crossing regions, and has a sense amplifier (SA) corresponding to each pixel. The processing unit includes: a first bank to which a receptive field is assigned; a second bank assigned to a positive kernel and a negative kernel; and a feature map output means that outputs a feature map from the convolution operations for the positive kernel and the negative kernel.

Description

In-DRAM Bitwise Convolution Circuit for Low Power and Fast Computation

The present invention relates to an apparatus for performing convolution operations, and more particularly to a bitwise convolution circuit for DRAM for low-power, high-speed operation that efficiently performs bitwise convolution using a processing-in-memory circuit inside the DRAM.

A neural network is a data structure modeled on the biological brain. In a neural network, nodes represent neurons that are interconnected and operate collectively to process input data.

Examples of different kinds of neural networks include, but are not limited to, convolutional neural networks, recurrent neural networks, deep belief networks, and restricted Boltzmann machines.

Neural networks can be used to extract "features" from complex input data.

A neural network may include a plurality of layers. Each layer receives input data and processes it to generate output data.

The output data may be a feature map of the input data. A neural network can generate this feature map by convolving an input image or feature map with a convolution kernel.

In image recognition in particular, convolutional neural network (CNN) models are commonly used to determine the class of an image to be recognized.

Before a CNN model can classify an image, the model must first be trained.

Training of a CNN model is generally carried out as follows.

First, the model parameters of the CNN model to be trained are initialized. The model parameters include the initial convolution kernels of each convolution layer, the initial bias matrices of each convolution layer, and the initial weight matrix and initial bias vector of the fully connected layer.

Next, a region to be processed, with a fixed height and a fixed width, is obtained from each preselected training image. The fixed height and fixed width match the category of the image to be recognized, and the image to be recognized is preset as an image that the CNN model being trained can process.

The region to be processed for each training image is input to the CNN model being trained. In each convolution layer, a convolution operation and a max-pooling operation are then performed on each region using that layer's initial convolution kernel and initial bias matrix, yielding a feature image of each region at each convolution layer.

Each feature image is then processed using the initial weight matrix and initial bias vector of the fully connected layer to obtain a classification probability for each region.

A classification error is then calculated from the classification probability and the initial classification of each training image, and the mean classification error is calculated over all training images.

The model parameters of the CNN model being trained are then adjusted using the mean classification error.

In a CNN, which is one type of deep learning, image recognition is performed by applying convolution operations with kernels to an input image to extract features.

A CNN therefore requires an enormous number of operations, and CNN hardware must use high-speed, high-capacity computing circuits such as GPUs to perform them quickly and in bulk.

When a complex operation such as convolution is performed on a conventional architecture in which the CPU and the memory are separate, the CPU must repeatedly write data to and read data from the memory.

In addition, because the data width between the CPU and the memory is limited, a complex operation such as convolution requires a large number of operation cycles and inevitably consumes a large amount of energy.

Accordingly, there is a need for a new technology that greatly reduces the power consumption and the number of clock cycles of convolution operations and can be usefully applied to neural networks such as CNN models.

Korean Laid-Open Patent No. 10-2017-0099848
Korean Laid-Open Patent No. 10-2017-0091140
Korean Laid-Open Patent No. 10-2016-0143505

The present invention is intended to solve the above problems of conventional apparatuses for performing convolution operations. Its object is to provide a bitwise convolution circuit for DRAM for low-power, high-speed operation that greatly reduces the power consumption and the number of clock cycles of convolution operations.

Another object of the present invention is to provide a bitwise convolution circuit for DRAM for low-power, high-speed operation that efficiently performs bitwise convolution with a ternary kernel using a processing-in-memory circuit inside the DRAM.

A further object of the present invention is to provide a bitwise convolution circuit for DRAM for low-power, high-speed operation that, by having a processing-in-memory circuit, completes complex operations such as convolution more energy-efficiently and within fewer operation cycles.

The objects of the present invention are not limited to those mentioned above; other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve these objects, a bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention comprises a plurality of processing units that execute convolution operations on input data and partial results. Each processing unit performs bitwise convolution with a ternary kernel using a processing-in-memory circuit structure inside the DRAM, has a plurality of bit lines (BL) crossing one word line (WL), has pixels at the crossing regions, and has a sense amplifier (SA) corresponding to each pixel. The processing unit comprises bank #1, to which a receptive field is assigned; bank #2, which is allocated to a positive kernel and a negative kernel; and feature map output means for outputting a feature map resulting from the convolution operations for the positive kernel and the negative kernel.

The circuit further comprises current mirror providing means that supplies a mirrored current according to the pixel value and the kernel value during the convolution operation.

To multiply the pixel value X₀ by the ternary kernel W₀, the kernel W₀ is divided into a positive part (W₀⁺) and a negative part (W₀⁻), where W₀ is +1, -1, or 0.

During the convolution operation, a mirrored current is delivered when the pixel value and the kernel value are both 1, and is not delivered when either or both are 0.

When the convolution result is stored in a 16×16 feature map, the receptive field and the kernel are each 4×4, and the convolution for a 4×4 receptive field is computed as

$$\sum_{k=0}^{15} X_k W_k$$

where X_k and W_k are the pixel value and the kernel value, respectively.

When a bank has 1024 cells per word line, one receptive field consists of 16 pixels. If each pixel is represented with 8-bit precision, 256 DRAM cells are allocated to represent one receptive field; 128 of the 256 cells are for the positive kernel and the other half are for the negative kernel.

The bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention has the following effects.

First, it greatly reduces the power consumption and the number of clock cycles of convolution operations, enabling low-power, high-speed computation.

Second, it allows bitwise convolution with a ternary kernel to be performed efficiently using a processing-in-memory circuit inside the DRAM.

Third, by having a processing-in-memory circuit, it allows complex operations such as convolution to be completed more energy-efficiently and within fewer operation cycles.

FIGS. 1 and 2 are diagrams showing a convolution operation.
FIGS. 3A and 3B are diagrams showing an inter-bank bitwise operation.
FIG. 4 is a block diagram of a bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention.

Hereinafter, a preferred embodiment of the bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention is described in detail.

The features and advantages of the bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention will become apparent from the detailed description of each embodiment below.

FIGS. 1 and 2 are diagrams showing a convolution operation, and FIGS. 3A and 3B are diagrams showing an inter-bank bitwise operation.

FIG. 4 is a block diagram of the bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention.

The bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention greatly reduces the power consumption and the number of clock cycles of convolution operations. It does so by performing bitwise convolution with a ternary kernel using a processing-in-memory circuit inside the DRAM.

FIG. 1 shows an MNIST image of size 19×19, a 4×4 receptive field and kernel, and the convolution result stored in a 16×16 feature map.

FIG. 2 shows the in-DRAM bitwise processing of the convolution operation.

In-memory bitwise operations are very useful in many applications, such as deep-learning neural networks, that require a large amount of computation and frequent memory access.

Consider the convolutional neural network (CNN), one of the best-known deep learning algorithms for image recognition. In a CNN, the image pixels in the receptive field are multiplied by the kernel. This multiplication is repeated, shifting the receptive field by one pixel at a time, until the last pixel is reached.

FIG. 1 schematically shows the convolution operation for processing a handwritten digit from an MNIST (Modified National Institute of Standards and Technology) image with 19×19 pixels.

The receptive field and the kernel are each 4×4, and the convolution for a 4×4 receptive field is computed as

$$\sum_{k=0}^{15} X_k W_k$$

where X_k and W_k denote the pixel value and the kernel value, respectively.

The convolution results are stored in the 16×16 feature map of FIG. 1. Obtaining the 16×16 feature map requires 4096 multiplications, additions, and memory accesses.
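
The operation count above can be checked with a short sketch. The Python below is an illustration only, not the circuit of the invention: it slides a 4×4 ternary kernel over a 19×19 image one pixel at a time and counts the multiply-accumulate steps. The function name `convolve`, the random test image, and the random kernel are assumptions made for this example.

```python
import random

IMAGE_SIZE = 19                          # MNIST image size given for FIG. 1
FIELD_SIZE = 4                           # receptive field and kernel are 4x4
MAP_SIZE = IMAGE_SIZE - FIELD_SIZE + 1   # 16 output positions per axis


def convolve(image, kernel):
    """Return (feature_map, mac_count) for a stride-1 sliding sum of X_k * W_k."""
    feature_map = []
    mac_count = 0
    for r in range(MAP_SIZE):
        row = []
        for c in range(MAP_SIZE):
            acc = 0
            for i in range(FIELD_SIZE):
                for j in range(FIELD_SIZE):
                    acc += image[r + i][c + j] * kernel[i][j]
                    mac_count += 1  # one multiply, one add, one pixel access
            row.append(acc)
        feature_map.append(row)
    return feature_map, mac_count


# Hypothetical inputs: random 8-bit pixels and a random ternary kernel.
rng = random.Random(0)
image = [[rng.randrange(256) for _ in range(IMAGE_SIZE)] for _ in range(IMAGE_SIZE)]
kernel = [[rng.choice((-1, 0, 1)) for _ in range(FIELD_SIZE)] for _ in range(FIELD_SIZE)]

feature_map, mac_count = convolve(image, kernel)
print(len(feature_map), len(feature_map[0]), mac_count)  # 16 16 4096
```

The count is 16 × 16 output elements times 16 products per element, which gives the 4096 figure stated above.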

To perform this large amount of computation and frequent memory access more efficiently, in-DRAM bitwise processing as shown in FIG. 2 is needed.

FIG. 3A shows an inter-bank bitwise processing circuit for the convolution operation, and FIG. 3B shows the operating waveforms of the inter-bank bitwise processing circuit.

The present invention proposes an in-DRAM bitwise convolution circuit based on the inter-bank operation of FIG. 3A. FIG. 3B shows the operating waveforms of the circuit of FIG. 3A.

First, row A is activated in bank #1; then row B is activated in bank #2.

The bitwise operation can then be performed by the convolution circuit shown in detail in FIG. 4.

The inter-bank bitwise circuit can be more power-efficient than an intra-bank circuit, because it can complete a convolution within two cycles, as shown in FIG. 3B.

In particular, once the kernel is stored in bank #2, the convolution can be performed in one cycle, so the power consumption and convolution time of the circuit of FIG. 3A can be significantly reduced compared with an intra-bank circuit.

FIG. 4 schematically shows the inter-bank bitwise circuit that computes and compares $\sum_k X_k W_k^{+}$ and $\sum_k X_k W_k^{-}$ to obtain the 16×16 feature map.

In FIG. 4, RF denotes a receptive field and S denotes a sense amplifier (SA).

As shown in FIG. 4, the bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention includes a plurality of processing units that execute convolution operations on input data and partial results.

Each processing unit performs bitwise convolution with a ternary kernel using a processing-in-memory circuit structure inside the DRAM, has a plurality of bit lines (BL) crossing one word line (WL), has pixels at the crossing regions, and has a sense amplifier (SA) corresponding to each pixel.

The processing unit has bank #1 (40), to which a receptive field is assigned, and bank #2 (50), which is allocated to the positive kernel and the negative kernel.

It also includes feature map output means for outputting a feature map resulting from the convolution operations for the positive kernel and the negative kernel.

It further includes current mirror providing means that supplies a mirrored current according to the pixel value and the kernel value during the convolution operation.

FIG. 4 shows the inter-bank bitwise processing circuit for a bank that has 1024 cells per word line.

One receptive field consists of 16 pixels. If each pixel is represented with 8-bit precision, 256 DRAM cells are allocated to represent one receptive field.

Of the 256 cells, 128 are for the positive kernel and the other half are for the negative kernel.
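
The cell budget can be written out from the figures stated above. The reading that the 128 pixel bits are stored once for each kernel part, and the last line (how many receptive fields fit on one word line), are inferences from those figures rather than statements in the source:

```latex
\begin{aligned}
16\ \text{pixels} \times 8\ \text{bits} &= 128\ \text{cells} \\
128\ \text{(positive kernel)} + 128\ \text{(negative kernel)} &= 256\ \text{cells per receptive field} \\
1024\ \text{cells per word line} \div 256 &= 4\ \text{receptive fields per word line}
\end{aligned}
```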

The receptive field is stored in bank #1.

X₀ in FIG. 1a is represented in FIG. 4 by X_0,0, X_0,1, X_0,2, X_0,3, X_0,4, X_0,5, X_0,6, and X_0,7, where X_0,7 and X_0,0 are the most significant bit and the least significant bit, respectively, of the binary form of X₀.
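
As a small illustration of this bit ordering, the sketch below splits an 8-bit pixel into the list of its bits, with index 0 as the least significant bit and index 7 as the most significant bit. The function name `pixel_bits` and the sample value are assumptions made for this example.

```python
def pixel_bits(x):
    """Return [X_0, X_1, ..., X_7] for an 8-bit pixel x (index 0 = LSB, index 7 = MSB)."""
    if not 0 <= x <= 255:
        raise ValueError("pixel must fit in 8 bits")
    return [(x >> j) & 1 for j in range(8)]


bits = pixel_bits(0b11000001)  # hypothetical pixel value 193
print(bits)                    # [1, 0, 0, 0, 0, 0, 1, 1]
```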

To multiply X₀ by the ternary kernel W₀, the kernel must be divided into a positive part and a negative part.

In FIG. 4, W₀⁺ and W₀⁻ denote the positive and negative parts of W₀, respectively.

W₀ is +1, -1, or 0, because the present invention uses a ternary kernel.
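
One way to picture the split is the sketch below, which maps a ternary kernel value to a pair of binary values, one for the positive part and one for the negative part. The encoding (+1 as (1, 0), -1 as (0, 1), 0 as (0, 0)) and the function name `split_ternary` are assumptions made for this example; the source states only that the kernel is divided into a positive part and a negative part.

```python
def split_ternary(w):
    """Split w in {-1, 0, +1} into binary (w_pos, w_neg) such that w == w_pos - w_neg."""
    if w not in (-1, 0, 1):
        raise ValueError("kernel value must be -1, 0, or +1")
    w_pos = 1 if w == 1 else 0
    w_neg = 1 if w == -1 else 0
    return w_pos, w_neg


for w in (1, 0, -1):
    print(w, split_ternary(w))
```

With this encoding each part is a single bit, so it can gate a current branch in the same way as a pixel bit.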

To compute the positive part of the convolution, the 8-bit vector (X_0,0, X_0,1, X_0,2, X_0,3, X_0,4, X_0,5, X_0,6, and X_0,7) and W₀⁺ are first sensed and amplified.

If X_0,7 and W₀⁺ are both 1, the current mirror can deliver a ×128 current to R⁺.

Likewise, if X_0,6 and W₀⁺ are both 1, a ×64 current is delivered to R⁺. If either or both are 0, no current is delivered to R⁺.
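
The gating described here can be modeled numerically. The sketch below is a behavioral model, not a circuit description: it sums the current delivered to one node for one pixel, in units of the reference current. The source gives the ×128 and ×64 weights for bits 7 and 6; extending the pattern to a weight of 2^j for bit j is an assumption, as is the function name `mirrored_current`.

```python
def mirrored_current(x, w_part):
    """Current delivered for one pixel, in units of the reference current.

    x is an 8-bit pixel and w_part is the binary kernel part (0 or 1).
    Bit j contributes 2**j units only when the pixel bit and w_part are both 1.
    """
    if not 0 <= x <= 255:
        raise ValueError("pixel must fit in 8 bits")
    if w_part not in (0, 1):
        raise ValueError("kernel part must be 0 or 1")
    total = 0
    for j in range(8):
        if ((x >> j) & 1) and w_part:  # both gating signals are 1
            total += 1 << j            # x128 for bit 7, x64 for bit 6, ...
    return total


print(mirrored_current(0b11000000, 1))  # 192, i.e. x128 + x64
print(mirrored_current(0b11000000, 0))  # 0, kernel part is 0 so nothing is delivered
```

Under this assumption the summed current for one pixel equals the pixel value times the binary kernel part, which is why the analog sum can stand in for the multiplication.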

The R⁺ current can be computed as the positive sum $\sum_k X_k W_k^{+}$, and it is compared with the R⁻ current, which is the negative sum $\sum_k X_k W_k^{-}$.

If the R⁺ current is greater than the R⁻ current, C₀, the first element of the feature map, is 1. Otherwise, C₀ is 0.
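
The comparison can be sketched for one whole receptive field as follows. This is a behavioral model under stated assumptions: currents are treated as ideal integers in units of the reference current, and the function name `feature_bit` and the sample values are hypothetical.

```python
def feature_bit(pixels, kernel):
    """Return 1 if the positive sum exceeds the negative sum, else 0.

    pixels: 16 pixel values (8-bit); kernel: 16 ternary values in {-1, 0, +1}.
    """
    if len(pixels) != 16 or len(kernel) != 16:
        raise ValueError("a 4x4 receptive field needs 16 pixels and 16 kernel values")
    r_pos = sum(x for x, w in zip(pixels, kernel) if w == 1)   # models the R+ current
    r_neg = sum(x for x, w in zip(pixels, kernel) if w == -1)  # models the R- current
    return 1 if r_pos > r_neg else 0


pixels = [10] * 16
print(feature_bit(pixels, [1] * 9 + [-1] * 7))  # 1, since 90 > 70
print(feature_bit(pixels, [1] * 8 + [-1] * 8))  # 0, since 80 is not greater than 80
```

The output is a single bit per receptive field, so the result is equivalent to testing whether the signed sum of X_k·W_k is positive.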

In FIG. 4, M₁ is turned on by X_0,7 and M₂ is turned on by W₀⁺. Only when X_0,7 and the signal driving M₂ are both '1' is the ×128 current copied from M₄ to M₃.

Likewise, M₅ and M₆ are controlled by X_0,6 and W₀⁺, respectively.

When X_0,6 and W₀⁺ are both '1', the ×64 current can be copied from M₈ to M₇.

In addition, current mirror transistors such as M₄ and M₈ can be shared among the convolution circuits for other receptive fields.

The current mirror with the reference current in FIG. 4 is shared across the entire chip, and only the three transistors M₁, M₂, and M₃ are placed within the pitch of one or two DRAM cells.

Table 1 compares the intra-bank and inter-bank bitwise processing circuits.

In terms of computation time, the intra-bank bitwise circuit needs 192 cycles of row activation to complete the convolution for a 16×16 feature map.

The inter-bank bitwise circuit can complete the same task within 65 cycles. In terms of power consumption, the inter-bank circuit consumes 35% less power than the intra-bank circuit.
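
The cycle counts quoted above imply the following ratios. Only the 192-cycle and 65-cycle figures come from the source; the derived speedup and reduction are simple arithmetic on them:

```latex
\begin{aligned}
\text{speedup} &= \frac{192}{65} \approx 2.95 \\
\text{cycle reduction} &= \frac{192 - 65}{192} \approx 66\%
\end{aligned}
```

One accounting that is consistent with the 65-cycle figure, offered here as an assumption and not stated in the source, is 64 row activations in bank #1 (256 feature-map elements at four receptive fields per 1024-cell word line) plus one activation of the kernel row in bank #2.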

As described above, the bitwise convolution circuit for DRAM for low-power, high-speed operation according to the present invention greatly reduces the power consumption and the number of clock cycles of convolution operations by performing bitwise convolution with a ternary kernel using a processing-in-memory circuit inside the DRAM.

By having a processing-in-memory circuit, the present invention allows complex operations such as convolution to be completed more energy-efficiently and within fewer operation cycles.

It will be understood from the above description that the present invention may be implemented in modified forms without departing from its essential characteristics.

The described embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

40. Bank #1
50. Bank #2

Claims

1. A bitwise convolution circuit for DRAM for low-power, high-speed operation, comprising a plurality of processing units that execute convolution operations on input data and partial results,
wherein each processing unit performs bitwise convolution with a ternary kernel using a processing-in-memory circuit structure inside the DRAM, has a plurality of bit lines (BL) crossing one word line (WL), has pixels at the crossing regions, and has a sense amplifier (SA) corresponding to each pixel, and
wherein the processing unit comprises bank #1, to which a receptive field is assigned; bank #2, which is allocated to a positive kernel and a negative kernel; and
feature map output means for outputting a feature map resulting from the convolution operations for the positive kernel and the negative kernel.

2. The bitwise convolution circuit for DRAM for low-power, high-speed operation of claim 1, further comprising current mirror providing means that supplies a mirrored current according to a pixel value and a kernel value during the convolution operation.

3. The bitwise convolution circuit for DRAM for low-power, high-speed operation of claim 1, wherein, to multiply the pixel value X₀ by the ternary kernel W₀, the kernel W₀ is divided into a positive part (W₀⁺) and a negative part (W₀⁻), and
W₀ is +1, -1, or 0.

4. The bitwise convolution circuit for DRAM for low-power, high-speed operation of claim 3, wherein, during the convolution operation, a mirrored current is delivered when the pixel value and the kernel value are both 1, and is not delivered when either or both are 0.

5. The bitwise convolution circuit for DRAM for low-power, high-speed operation of claim 1, wherein, when the convolution result is stored in a 16×16 feature map,
the receptive field and the kernel are each 4×4, and the convolution for a 4×4 receptive field is computed as

$$\sum_{k=0}^{15} X_k W_k$$

where X_k and W_k are the pixel value and the kernel value, respectively.

6. The bitwise convolution circuit for DRAM for low-power, high-speed operation of claim 1, wherein, when the bank has 1024 cells per word line,
one receptive field consists of 16 pixels and, if each pixel is represented with 8-bit precision, 256 DRAM cells are allocated to represent one receptive field, and
128 of the 256 cells are for the positive kernel and the other half are for the negative kernel.