KR20210075007A - Cache for artificial intelligence processor

Info

Publication number: KR20210075007A
Application number: KR1020200162944A
Authority: KR
Inventors: 한진호; 최민석; 권영수
Original assignee: 한국전자통신연구원
Priority date: 2019-12-12
Filing date: 2020-11-27
Publication date: 2021-06-22
Also published as: KR102518057B1

Abstract

An embodiment of the present invention relates to a cache comprising: a dataflow controller for transmitting first data to a first processor and receiving second data from the first processor; an external direct memory access (DMA) controller for receiving the first data from an external memory and transmitting it to the dataflow controller, and for receiving the second data from the dataflow controller and transmitting it to the external memory; a scratchpad memory for storing the first data or the second data transferred between the dataflow controller and the external DMA controller; a compression/decompression device for compressing data to be transferred from the scratchpad memory to the external memory and decompressing data transferred from the external memory to the scratchpad memory; and a transfer state buffer for storing transfer state information related to the transfer of data between the dataflow controller and the external DMA controller. The present invention thus provides a cache for an artificial intelligence (AI) processor in which the feature data and kernel data operated on by an AI algorithm are compressed and stored in the external memory, and in which the number of reads and writes to the external memory can be reduced.

Description

CACHE FOR ARTIFICIAL INTELLIGENCE PROCESSOR

The present disclosure relates to a cache, and more particularly, to a cache for an artificial intelligence processor.

An artificial intelligence processor is a processor for processing artificial intelligence algorithms. An artificial intelligence algorithm performs operations on feature data and kernel data based on a specific network structure. Because the amount of data is large, artificial intelligence processors are developed as hardware for accelerating these operations. Not only the feature data and kernel data but also the output feature data, which is an intermediate result of the operations, is large, so it is difficult to keep the data only in the storage inside the artificial intelligence processor, and the data is frequently stored in an external memory. A large-capacity external memory is therefore required, and reading and writing the external memory must also be fast. However, because the speed of reading and writing the external memory is limited, a function that reduces the number of reads and writes to the external memory is required.

The present disclosure provides a cache for an artificial intelligence processor that compresses the feature data and kernel data operated on by an artificial intelligence algorithm, stores them in an external memory, and can reduce the number of reads and writes to the external memory.

A cache according to an embodiment of the present disclosure includes: a dataflow controller that transmits first data to a first processor and receives second data from the first processor; an external direct memory access (DMA) controller that receives the first data from an external memory and transmits it to the dataflow controller, and receives the second data from the dataflow controller and transmits it to the external memory; a scratchpad memory that stores the first data or the second data transferred between the dataflow controller and the external DMA controller; a compression/decompression device that compresses data to be transferred from the scratchpad memory to the external memory and decompresses data transferred from the external memory to the scratchpad memory; and a transfer state buffer that stores transfer state information related to data transfer between the dataflow controller and the external DMA controller.

According to an embodiment of the present disclosure, the number of times an artificial intelligence processor accesses the external memory to rearrange feature data and kernel data can be reduced.

FIG. 1 shows an exemplary configuration of a system including a cache for an artificial intelligence processor according to an embodiment of the present disclosure.
FIG. 2 shows an exemplary configuration of a system including a cache for an artificial intelligence processor according to an embodiment of the present disclosure.
FIG. 3 shows an exemplary operation of a cache for an artificial intelligence processor according to an embodiment of the present disclosure.
FIG. 4 shows the normal mode of operation of the scratchpad memory.
FIG. 5 shows the transpose mode of operation of the scratchpad memory.

Below, embodiments of the present disclosure are described clearly and in enough detail that a person of ordinary skill in the art can readily practice the present disclosure.

Components described in the detailed description with reference to terms such as part, unit, module, block, and "-or" or "-er", as well as the functional blocks shown in the drawings, may be implemented in the form of software, hardware, or a combination thereof. For example, the software may be machine code, firmware, embedded code, and application software. For example, the hardware may include an electrical circuit, an electronic circuit, a processor, a computer, an integrated circuit, integrated circuit cores, a pressure sensor, an inertial sensor, a microelectromechanical system (MEMS), a passive element, or a combination thereof.

FIG. 1 shows an exemplary configuration of a system 100 including a cache for an artificial intelligence processor according to an embodiment of the present disclosure. The system 100 may perform operations based on an artificial intelligence (AI) algorithm on feature data and kernel data, and may output the operation result of the AI algorithm. In addition, the system 100 may compress or rearrange the feature data and the kernel data in order to accelerate the AI-algorithm-based operations and increase their efficiency.

The system 100 may include an AI processor 110, a general-purpose processor 120, a cache 130, and a memory controller 140. In addition, the system 100 may further include one or more external memories for storing data related to the AI algorithm. The external memory is omitted from the drawing for simplicity of illustration.

The AI processor 110 may process an AI algorithm that performs operations on feature data and kernel data based on a specific network (that is, neural network) structure. For example, the AI algorithm may be at least one of a convolution neural network (CNN), a recurrent neural network (RNN), or a generative adversarial network (GAN) that performs machine learning, but the present disclosure is not limited thereto.

The AI processor 110 may store, in the cache 130, data to be processed according to the AI algorithm or data that has been processed. In addition, the AI processor 110 may transmit data to the memory controller 140 through the cache 130, and may receive data from the memory controller 140 through the cache 130. For example, the AI processor 110 may include at least one processing unit such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), an accelerated processing unit (APU), or a tensor processing unit (TPU), but the present disclosure is not limited thereto.

The general-purpose processor 120 may perform pre-processing on data to be processed by the AI processor 110 and post-processing on the operation result of the AI processor 110. Specifically, by pre-processing the feature data and the kernel data, the general-purpose processor 120 may convert data received from the outside into data having a format suitable for processing by the AI processor 110. For example, the pre-processing may include removing missing and/or unnecessary feature data, converting feature data in character form into a numeric format, adjusting the range of the feature data converted into the numeric format, and setting initial weight values.
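The pre-processing steps listed above can be illustrated with a minimal sketch. The function name `preprocess`, the choice of min-max scaling to the range [0, 1], and the representation of missing entries as `None` or empty strings are all assumptions made for illustration; the disclosure does not specify how the general-purpose processor 120 performs these steps.

```python
from typing import List, Optional, Sequence


def preprocess(raw: Sequence[Optional[str]]) -> List[float]:
    """Hypothetical pre-processing: drop missing entries, convert text to
    numbers, then rescale the values to the range [0, 1]."""
    # Remove missing feature data (None or blank strings).
    present = [item for item in raw if item is not None and item.strip() != ""]
    # Convert character-form feature data into a numeric format.
    values = [float(item) for item in present]
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so there is no range to rescale.
        return [0.0 for _ in values]
    # Adjust the range of the numeric feature data (min-max scaling).
    return [(v - low) / (high - low) for v in values]
```

For example, `preprocess(["2", None, "4", "", "6"])` drops the two missing entries and rescales the remaining values to `[0.0, 0.5, 1.0]`.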

In addition, by post-processing the operation result of the AI processor 110, the general-purpose processor 120 may correct errors included in the operation result or improve the final quality of the operation result. The general-purpose processor 120 may receive data processed by the AI processor 110 from the cache 130, and may store in the cache 130 the result of pre-processing data received from the outside or the result of post-processing data processed by the AI processor 110. For example, the general-purpose processor 120 may include at least one general-purpose processing unit such as a central processing unit (CPU), a graphics processing unit (GPU), or a data processing unit (DPU), but the present disclosure is not limited thereto.

The cache 130 may store data processed by the AI processor 110 or the general-purpose processor 120 and the operation results of the AI processor 110 or the general-purpose processor 120. In addition, the cache 130 may read data from the external memory through the memory controller 140 and transmit the read data to the AI processor 110 or the general-purpose processor 120. The cache 130 may include a dataflow controller 131, a scratchpad memory (shared scratchpad memory) 132, an AI external direct memory access (DMA) controller 133, a compression/decompression device 134, and a transfer state buffer 135.

The dataflow controller 131 may transmit data required by the AI processor 110 (for example, feature data and kernel data) to the AI processor 110, and may transmit the operation result of the AI processor 110 (for example, output feature data) to the external memory through the scratchpad memory 132.

Specifically, the dataflow controller 131 may be connected to the AI processor 110 and the scratchpad memory 132, and may transmit the operation result of the AI processor 110 to the scratchpad memory 132. In addition, the dataflow controller 131 may receive, from the scratchpad memory 132, data to be processed by the AI processor 110. Data exchange between the dataflow controller 131 and the scratchpad memory 132 may be performed based on transfer state information stored in the transfer state buffer 135.

The scratchpad memory 132 may store data exchanged between the dataflow controller 131 and the external DMA controller 133. Specifically, the scratchpad memory 132 may store feature data and kernel data to be processed by the AI processor 110. The scratchpad memory 132 may also store the operation result of the AI processor 110. Data stored in the scratchpad memory 132 may be provided to the memory controller 140 through the external DMA controller 133, or to the AI processor 110 through the dataflow controller 131.

Furthermore, the scratchpad memory 132 may exchange data with the general-purpose processor 120. The general-purpose processor 120 may store pre-processed or post-processed data in the scratchpad memory 132, and the scratchpad memory 132 may transmit data requiring pre-processing or post-processing to the general-purpose processor 120. For example, the scratchpad memory 132 may be implemented as an on-chip memory in which SRAMs are integrated.

The external DMA controller 133 may access the external memory through the memory controller 140 using DMA. Specifically, the external DMA controller 133 may receive the data required by the AI processor 110 from the external memory independently of the operation of the AI processor 110, and may transmit the received data to the scratchpad memory 132.

As a result, the AI processor 110 can be supplied with the data it needs while it is processing the AI algorithm, which may increase the efficiency of AI algorithm processing. Data exchange between the scratchpad memory 132 and the external DMA controller 133 may be performed based on the transfer state information stored in the transfer state buffer 135.

The compression/decompression device 134 may decompress data provided from the memory controller 140 or compress data to be provided to the memory controller 140. Specifically, the compression/decompression device 134 may compress the feature data and kernel data to be processed by the AI processor 110 and transmit the compressed feature data and kernel data to the memory controller 140. In addition, the compression/decompression device 134 may receive, from the memory controller 140, the portion of the compressed feature data and kernel data required by the AI processor 110 and decompress it.
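The round trip through the compression/decompression device can be sketched as follows. The disclosure does not name a compression scheme, so the standard-library `zlib` codec is used here purely as a stand-in, and the function names `store_to_external` and `load_from_external` are hypothetical.

```python
import zlib


def store_to_external(scratchpad_data: bytes) -> bytes:
    """Compress data leaving the scratchpad on its way to external memory.
    zlib is an assumed stand-in for the unspecified hardware codec."""
    return zlib.compress(scratchpad_data)


def load_from_external(external_data: bytes) -> bytes:
    """Decompress data arriving from external memory before it is placed
    in the scratchpad."""
    return zlib.decompress(external_data)
```

Data that compresses well occupies fewer bytes in the external memory, and the scratchpad always holds the uncompressed form, which is the behavior described for the device 134.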

The transfer state buffer 135 may store transfer state information related to data exchange between the dataflow controller 131 and the external DMA controller 133. Specifically, the transfer state buffer 135 may store transfer state information related to data transfer between the dataflow controller 131 and the scratchpad memory 132 and to data transfer between the external DMA controller 133 and the scratchpad memory 132. The transfer state information is described in more detail with reference to FIG. 3.

The memory controller 140 may control the external memory. For example, the memory controller 140 may control the external memory to obtain from it the feature data and kernel data required by the AI processor 110. In addition, the memory controller 140 may control the external memory to store the feature data, the kernel data, and the operation result of the AI processor 110 in the external memory.

FIG. 2 shows an exemplary configuration of a system 100 including a cache for an artificial intelligence processor according to an embodiment of the present disclosure. In particular, FIG. 2 shows a case in which the system 100 of FIG. 1 includes a plurality of external memories. To control the plurality of external memories, the system 100 of FIG. 2 may include a plurality of memory controllers 140_1 to 140_4.

In addition, to exchange data with the plurality of memory controllers 140_1 to 140_4, the cache 130 of FIG. 2 may include a plurality of scratchpad memories 132_1 to 132_4, a plurality of external DMA controllers 133_1 to 133_4, and a plurality of compression/decompression devices 134_1 to 134_4.

By way of example, the system 100 of FIG. 2 is shown as including four scratchpad memories 132_1 to 132_4, four external DMA controllers 133_1 to 133_4, four compression/decompression devices 134_1 to 134_4, and four memory controllers 140_1 to 140_4. However, the present disclosure is not limited thereto, and the system 100 may include numbers of scratchpad memories, external DMA controllers, compression/decompression devices, and memory controllers different from those shown in FIG. 2.

Furthermore, the scratchpad memory 132_4 of FIG. 2 is shown as exchanging data with the general-purpose processor 120 as well as with the AI processor 110. In this case, data pre-processed or post-processed by the general-purpose processor 120 may be stored only in the scratchpad memory 132_4 and may not be transmitted to the external memory, but the present disclosure is not limited thereto.

FIG. 3 shows an exemplary operation of a cache 200 for an artificial intelligence processor according to an embodiment of the present disclosure. For example, the cache 200 may be the cache 130 of FIG. 1. As described with reference to FIG. 1, the cache 200 may include a dataflow controller 210, a scratchpad memory 220, an external DMA controller 230, a compression/decompression device 240, and a transfer state buffer 250.

As described above, the scratchpad memory 220 may be implemented as an on-chip memory in which SRAMs are integrated. As shown in FIG. 3, the scratchpad memory 220 may include, by way of example, 32 memories 220_1 to 220_32. For example, each of the 32 memories 220_1 to 220_32 may store 1024 bits of data.

The dataflow controller 210 and the external DMA controller 230 may exchange data through the scratchpad memory 220 based on the transfer state information stored in the transfer state buffer 250. For example, the transfer state buffer 250 may include dataflow controller transfer state information (hereinafter referred to as first information) and external DMA controller transfer state information (hereinafter referred to as second information).

By way of example, the first information may indicate how much data the dataflow controller 210 will write to the scratchpad memory 220 and how much of the data written by the dataflow controller 210 the external DMA controller 230 will read. The second information may indicate how much data the external DMA controller 230 will write to the scratchpad memory 220 and how much of the data written by the external DMA controller 230 the dataflow controller 210 will read.

In addition, the first information and the second information may indicate the address at which data was last written and the address at which data was last read. That is, the first information may indicate the address at which the dataflow controller 210 will start writing data to the scratchpad memory 220 (hereinafter referred to as the first write address) and the address from which the external DMA controller 230 will start reading data from the scratchpad memory 220 (hereinafter referred to as the first read address). The second information may indicate the address at which the external DMA controller 230 will start writing data to the scratchpad memory 220 (hereinafter referred to as the second write address) and the address from which the dataflow controller 210 will start reading data from the scratchpad memory 220 (hereinafter referred to as the second read address).

Accordingly, based on the first information, the dataflow controller 210 may write data to be transmitted to the external DMA controller 230 at the first write address of the scratchpad memory 220. Based on the first information, the external DMA controller 230 may then read the data written by the dataflow controller 210 from the first read address of the scratchpad memory 220.

Likewise, based on the second information, the external DMA controller 230 may write data to be transmitted to the dataflow controller 210 at the second write address of the scratchpad memory 220. Based on the second information, the dataflow controller 210 may then read data from the second read address of the scratchpad memory 220. In this way, data read and write operations can be performed without interruption.
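The handshake through the first and second information can be sketched in software as a pair of write/read address records, one per transfer direction. This is a minimal model under stated assumptions, not the disclosed hardware: the class names, the wrap-around addressing, the "full" and "empty" checks, and the use of a separate scratchpad region for each direction are all hypothetical choices made to keep the sketch small.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TransferState:
    """One direction's state: where the producer writes next and where the
    consumer reads next (addresses count slots, assumed to wrap around)."""
    write_addr: int = 0
    read_addr: int = 0


@dataclass
class TransferStateBuffer:
    # first: dataflow controller writes, external DMA controller reads.
    first: TransferState = field(default_factory=TransferState)
    # second: external DMA controller writes, dataflow controller reads.
    second: TransferState = field(default_factory=TransferState)


class Scratchpad:
    """Assumed layout: one region of slots per transfer direction."""

    def __init__(self, slots_per_direction: int = 16) -> None:
        self.size = slots_per_direction
        self.regions: Dict[str, List[Optional[bytes]]] = {
            "first": [None] * self.size,
            "second": [None] * self.size,
        }

    def write(self, region: str, state: TransferState, data: bytes) -> bool:
        # Refuse the write when the consumer has not yet freed a slot.
        if state.write_addr - state.read_addr >= self.size:
            return False
        self.regions[region][state.write_addr % self.size] = data
        state.write_addr += 1
        return True

    def read(self, region: str, state: TransferState) -> Optional[bytes]:
        # Nothing to read when the consumer has caught up with the producer.
        if state.read_addr == state.write_addr:
            return None
        data = self.regions[region][state.read_addr % self.size]
        state.read_addr += 1
        return data
```

In this model the dataflow controller would call `write("first", buf.first, ...)` and `read("second", buf.second)`, while the external DMA controller would call `read("first", buf.first)` and `write("second", buf.second, ...)`. Because each side consults only the shared address records, neither has to wait for the other beyond the full and empty conditions.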

The data read and write operations of the scratchpad memory 220 are described below with reference to FIGS. 4 and 5. FIG. 4 shows the normal mode of operation of the scratchpad memory, and FIG. 5 shows the transpose mode of operation of the scratchpad memory.

As shown in FIG. 3, the scratchpad memory 220 may include a plurality of memories 220_1 to 220_32. For clarity of description, it is assumed that each of the memories 220_1 to 220_32 can store 1024 bits of data. It is also assumed that the dataflow controller 210 or the external DMA controller 230 can write 1024 bits of data to the scratchpad memory 220, or read 1024 bits of data from it, at a time.

The data write operation of the scratchpad memory 220 may be the same in the normal mode and in the transpose mode. The dataflow controller 210 or the external DMA controller 230 may write 1024 bits of data to each memory in order, from the first memory 220_1 to the 32nd memory 220_32. Accordingly, as shown in FIGS. 4 and 5, each of the memories 220_1 to 220_32 may store 1024 bits of data.

The data read operation of the scratchpad memory 220, on the other hand, may differ between the normal mode and the transpose mode. In the normal mode shown in FIG. 4, read operations may be performed in the order in which the data was stored in the memories 220_1 to 220_32. That is, when data is read in units of 1024 bits, the 1024 bits of data stored in the first memory 220_1 may be read first, as indicated by the hatched portion of FIG. 4, and the data of each of the remaining memories may then be read in order. It may therefore be impossible to read data in an arbitrary order required by the AI algorithm.

In the transpose mode shown in FIG. 5, by contrast, read operations may be performed regardless of the order in which the data was stored. That is, when data is read in units of 1024 bits, the read is not performed on the entire 1024 bits stored in a single memory; instead, as indicated by the hatched portion of FIG. 5, 32 bits are read from each of the memories 220_1 to 220_32. Data can therefore be read in an arbitrary order required by the AI algorithm.

In other words, when the scratchpad memory 220 operates in the transpose mode, the dataflow controller 210 or the external DMA controller 230 can read the required data from each of the memories 220_1 to 220_32 regardless of the order in which the data was stored. Accordingly, when the scratchpad memory 220 of the present disclosure operates in the transpose mode, the number of accesses to the external memory and the internal memory may be reduced.
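The difference between the two read modes can be sketched by modeling the scratchpad as 32 memories of 32 words each, where one word is 32 bits (32 x 32 = 1024 bits per memory). The function names are hypothetical, and the assumption that a transpose-mode read takes the word at the same index from every memory is an interpretation of the hatched portion of FIG. 5, not a detail stated in the text.

```python
from typing import List

NUM_MEMORIES = 32       # memories 220_1 to 220_32
WORDS_PER_MEMORY = 32   # 32 words x 32 bits = 1024 bits per memory


def make_scratchpad() -> List[List[int]]:
    """Create an empty scratchpad; each inner list models one 1024-bit memory."""
    return [[0] * WORDS_PER_MEMORY for _ in range(NUM_MEMORIES)]


def write_memory(sp: List[List[int]], index: int, words: List[int]) -> None:
    """Write one full 1024-bit line (same in both modes)."""
    if len(words) != WORDS_PER_MEMORY:
        raise ValueError("a write must supply exactly 32 words")
    sp[index] = list(words)


def read_normal(sp: List[List[int]], index: int) -> List[int]:
    """Normal mode: return the whole 1024-bit line of a single memory."""
    return list(sp[index])


def read_transpose(sp: List[List[int]], word_index: int) -> List[int]:
    """Transpose mode (assumed): return one 32-bit word from each of the
    32 memories, which again totals 1024 bits."""
    return [sp[m][word_index] for m in range(NUM_MEMORIES)]
```

If the stored lines are viewed as the rows of a 32 x 32 matrix of words, `read_normal` returns a row and `read_transpose` returns a column, so a consumer that needs column-ordered data does not have to write the data back out and re-read it in a different layout.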

The foregoing describes specific embodiments for carrying out the present disclosure. The present disclosure includes not only the embodiments described above but also embodiments obtained by simple design changes or easy modifications. The present disclosure also includes techniques that can be readily modified and practiced using the embodiments. Therefore, the scope of the present disclosure should not be limited to the embodiments described above, and should be determined by the claims below and their equivalents.

110: AI processor 120: general-purpose processor
130: cache 131: dataflow controller
132: scratchpad memory 133: external DMA controller
134: compression/decompression device 135: transmission state buffer
140: memory controller

Claims

1. A cache comprising:
a dataflow controller configured to transmit first data to a first processor and to receive second data from the first processor;
an external direct memory access (DMA) controller configured to receive the first data from an external memory and transmit the first data to the dataflow controller, and to receive the second data from the dataflow controller and transmit the second data to the external memory;
a scratchpad memory configured to store the first data or the second data transferred between the dataflow controller and the external DMA controller;
a compression/decompression device configured to compress data to be transmitted from the scratchpad memory to the external memory and to decompress data transmitted from the external memory to the scratchpad memory; and
a transmission state buffer configured to store transmission state information related to data transfer between the dataflow controller and the external DMA controller.

2. The cache of claim 1,
wherein the scratchpad memory transmits data requiring pre-processing or post-processing from the external memory to a second processor, and receives pre-processed data or post-processed data from the second processor.

3. The cache of claim 1,
wherein the first data is feature data or kernel data, and the second data is a result of an operation performed by the first processor.

4. The cache of claim 1,
wherein the transmission state buffer stores first information and second information,
the first information indicates an amount of data to be written to the scratchpad memory by the dataflow controller and a first write address at which the dataflow controller is to write data to the scratchpad memory, and
the second information indicates an amount of data to be written to the scratchpad memory by the external DMA controller and a second write address at which the external DMA controller is to write data to the scratchpad memory.

5. The cache of claim 4,
wherein the first information indicates an amount of data to be read from the scratchpad memory by the external DMA controller and a first read address from which the external DMA controller is to read data from the scratchpad memory, and
the second information indicates an amount of data to be read from the scratchpad memory by the dataflow controller and a second read address from which the dataflow controller is to read data from the scratchpad memory.

6. The cache of claim 5,
wherein the dataflow controller writes data to be transmitted to the external DMA controller to the first write address of the scratchpad memory, and reads data provided from the external DMA controller from the second read address of the scratchpad memory, and
wherein the external DMA controller writes data to be transmitted to the dataflow controller to the second write address of the scratchpad memory, and reads data received from the dataflow controller from the first read address of the scratchpad memory.

7. The cache of claim 1,
wherein the scratchpad memory includes a plurality of memories, and
data is sequentially stored in each of the plurality of memories.

8. The cache of claim 7,
wherein the scratchpad memory performs a data read operation in the order in which the data is stored.

9. The cache of claim 7,
wherein the scratchpad memory performs a data read operation regardless of the order in which the data is stored.

10. The cache of claim 2,
wherein the first processor comprises at least one of a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), an accelerated processing unit (APU), or a tensor processing unit (TPU), and
wherein the second processor is a general-purpose processor.