KR20220061828A

KR20220061828A - Computing system including accelerator and operation method thereof

Info

Publication number: KR20220061828A
Application number: KR1020210057850A
Authority: KR
Inventors: 김찬; 여준기; 권영수; 한진호
Original assignee: 한국전자통신연구원
Priority date: 2020-11-06
Filing date: 2021-05-04
Publication date: 2022-05-13

Abstract

According to an embodiment of the present disclosure, a method of operating an accelerator comprises the steps of: performing a cache flush operation in response to a first interrupt; performing an update section jump operation to jump the program counter to the update section; performing a program update operation to write a binary file compiled in a host to a memory module connected to the accelerator, after the update section jump operation; and performing a start section jump operation to jump a program counter to the start section, in response to the second interrupt, after the program update operation.

Description

COMPUTING SYSTEM INCLUDING ACCELERATOR AND OPERATION METHOD THEREOF

The present disclosure relates to a computing system and, more particularly, to a computing system including an accelerator and a method of operating the same.

A computing system may include a deep learning accelerator in the form of a PCIe card. A host included in the computing system may run an operating system, while the accelerator may execute an accelerator program without an operating system. The host may compile the accelerator program and transmit it to the accelerator. There is a need to update the accelerator program while the processor of the accelerator is operating.

An object of the present disclosure is to provide a computing system including an accelerator capable of performing a program update while the accelerator is in operation, and a method of operating the same.

A method of operating an accelerator according to an embodiment of the present disclosure includes: performing a cache flush operation in response to a first interrupt; after the cache flush operation, performing an update section jump operation to jump a program counter to an update section; after the update section jump operation, performing a program update operation to write a binary file compiled in a host to a memory module connected to the accelerator; and after the program update operation, performing a start section jump operation to jump the program counter to a start section in response to a second interrupt.
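
The ordering of the four steps above can be illustrated with a small behavioral model. This is only a sketch written in Python for readability; the class name `AcceleratorModel`, the state names, and the log strings are assumptions made for illustration and do not appear in the disclosure, and real accelerator firmware would not be written this way.

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()   # executing the general program sections
    WAITING = auto()   # program counter parked in the update section
    UPDATED = auto()   # new binary written, program counter still parked


class AcceleratorModel:
    """Toy model of the update sequence (hypothetical, for illustration only)."""

    def __init__(self):
        self.state = State.RUNNING
        self.log = []

    def first_interrupt(self):
        # Step 1 and step 2: flush the cache, then jump to the update section.
        if self.state is not State.RUNNING:
            raise RuntimeError("first interrupt is only valid while running")
        self.log.append("cache_flush")
        self.log.append("jump_to_update_section")
        self.state = State.WAITING

    def program_update(self, binary):
        # Step 3: the compiled binary is written to the memory module.
        if self.state is not State.WAITING:
            raise RuntimeError("update requires the update section jump first")
        self.log.append("write_binary")
        self.state = State.UPDATED

    def second_interrupt(self):
        # Step 4: jump to the start section of the new program.
        if self.state is not State.UPDATED:
            raise RuntimeError("second interrupt is only valid after the update")
        self.log.append("jump_to_start_section")
        self.state = State.RUNNING
```

The model rejects any step that arrives out of order, which mirrors the "after ..." conditions attached to each step in the method.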

In one embodiment, the accelerator communicates with the host through a Peripheral Component Interconnect Express (PCI Express) interface.

In one embodiment, the method of operating the accelerator further includes receiving the first interrupt from the host before the cache flush operation, and receiving the second interrupt from the host after the program update operation.

In one embodiment, the binary file includes, as arranged by a linker script, general program sections including the start section, and the update section.

In one embodiment, the binary file is generated by the host compiling an accelerator program.

In one embodiment, performing the program update operation includes reading, by a DMA engine included in the accelerator, the binary file from the host, and writing, by the DMA engine, the binary file to the memory module.

In one embodiment, writing the binary file to the memory module includes transmitting a write command and an address to the memory module through a command/address line, and transmitting the binary file to the memory module through a data line.

In one embodiment, performing the cache flush operation includes writing, to the memory module, data in a dirty state stored in a cache memory device included in the accelerator.
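
As a rough illustration of what flushing dirty data means, the following Python sketch models a write-back cache in front of a memory module. The class `WriteBackCache`, its per-address granularity, and the use of a dictionary as the memory module are all simplifying assumptions; the disclosure does not specify the cache organization.

```python
class WriteBackCache:
    """Minimal write-back cache model (hypothetical, one entry per address)."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for the memory module
        self.lines = {}        # dict: address -> (value, dirty flag)

    def read(self, addr):
        # On a miss, fill the line from memory in a clean state.
        if addr not in self.lines:
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        # Writes stay in the cache and mark the line dirty.
        self.lines[addr] = (value, True)

    def flush(self):
        """Write every dirty line back to memory; return how many were written."""
        written = 0
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)
                written += 1
        return written
```

Until `flush()` is called, a written value exists only in the cache, which is why the flush precedes the program update: the memory module must hold the latest data before the processor is parked.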

In one embodiment, each of the first and second interrupts indicates a write operation by the host to a memory-mapped I/O (MMIO) register.
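
One way to picture an interrupt that is raised by a register write is the doorbell-style model below. It is a sketch only: the class `MmioDoorbell`, the register offsets `UPDATE_START` and `UPDATE_END`, and the handler mechanism are hypothetical and are not taken from the disclosure or from any real device.

```python
class MmioDoorbell:
    """Toy MMIO register block in which a host write triggers a handler."""

    UPDATE_START = 0x00   # hypothetical offset: host signals update start
    UPDATE_END = 0x04     # hypothetical offset: host signals update end

    def __init__(self):
        self.registers = {}
        self.handlers = {}

    def on_write(self, offset, handler):
        # Accelerator side: register the routine to run for a given register.
        self.handlers[offset] = handler

    def write(self, offset, value):
        # Host side: the register write itself acts as the interrupt.
        self.registers[offset] = value
        handler = self.handlers.get(offset)
        if handler is not None:
            handler(value)
```

In this picture the host never calls accelerator code directly; it only writes a register, and the accelerator decides what that write means.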

In one embodiment, the accelerator includes any one of a field-programmable gate array (FPGA), a massively parallel processor array (MPPA), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural processing unit (NPU), a tensor processing unit (TPU), and a multi-processor system-on-chip (MPSoC).

In one embodiment, the method of operating the accelerator further includes waiting, with the program counter running in an infinite loop, while the program update operation is performed.

A computing device according to an embodiment of the present disclosure includes: a host configured to perform a compile operation on an accelerator program to generate a binary file including an update section and a start section, and to transmit first and second interrupts; and an accelerator configured to jump a program counter to the update section in response to the first interrupt and to jump the program counter to the start section in response to the second interrupt, wherein the first interrupt indicates the start of an accelerator program update and the second interrupt indicates the end of the accelerator program update.

In one embodiment, the computing system further includes a memory module connected to the accelerator and configured to store the binary file executed by the accelerator.

In one embodiment, the accelerator includes a DMA engine configured to read the binary file from the host and write the binary file to the memory module.

In one embodiment, the accelerator further includes a processor configured to execute the binary file, and a cache memory device configured to store data used by the processor and, in response to the first interrupt, to flush data in a dirty state to the memory module.

In one embodiment, the accelerator further includes a reconfigurable logic circuit configured to perform machine learning algorithms such as a convolutional neural network (CNN) or a recurrent neural network (RNN).

A computing system according to an embodiment of the present disclosure can perform a program update operation while the accelerator is operating. Accordingly, a computing system including an improved accelerator and a method of operating the same are provided.

FIG. 1 is a block diagram illustrating a computing system according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating the accelerator of FIG. 1.
FIG. 3 is a diagram illustrating a binary file of an accelerator program according to an embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating an operation of the accelerator of FIG. 1.
FIG. 5 is a diagram for explaining a program update operation of the accelerator of FIG. 1.
FIG. 6 is a flowchart illustrating an operation of the computing system of FIG. 1.
FIG. 7 is a block diagram illustrating a computing system according to an embodiment of the present disclosure.
FIG. 8 is a block diagram illustrating a data center to which a computing system according to an embodiment of the present disclosure is applied.

Hereinafter, embodiments of the present disclosure will be described clearly and in detail, to the extent that those of ordinary skill in the art can easily practice the present disclosure.

Hereinafter, the units, modules, layers, or functional blocks used in the detailed description or shown in the drawings may be implemented in software, hardware, or a combination thereof. For example, the software may be machine code, firmware, embedded code, or application software. For example, the hardware may include an electrical circuit, an electronic circuit, a processor, a computer, an integrated circuit, integrated circuit cores, a pressure sensor, an inertial sensor, a microelectromechanical system (MEMS), a passive element, or a combination thereof.

In addition, unless otherwise defined, all terms used herein, including technical and scientific terms, have the meanings understood by those skilled in the art to which the present disclosure belongs. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with their contextual meanings in the related technical field and, unless clearly defined herein, are not to be interpreted in an idealized or excessively formal sense.

FIG. 1 is a block diagram illustrating a computing system according to an embodiment of the present disclosure. Referring to FIG. 1, a computing system 100 may include a host 110, an accelerator 120, and a memory module 130. The computing system 100 may be any of various types of systems that process data and write the processed data to the memory module 130, or that process data read from the memory module 130.

In one embodiment, the computing system 100 may be implemented as a personal computer (PC), a data server, a cloud system, an artificial intelligence server, network-attached storage (NAS), an Internet of Things (IoT) device, or a portable electronic device. When the computing system 100 is a portable electronic device, the computing system 100 may be a laptop computer, a mobile phone, a smartphone, a tablet PC, a personal digital assistant (PDA), an enterprise digital assistant (EDA), a digital still camera, a digital video camera, an audio device, a portable multimedia player (PMP), a personal navigation device (PND), an MP3 player, a handheld game console, an e-book, a wearable device, or the like.

The host 110 may control the overall operation of the computing system 100. For example, the host 110 may control the accelerator 120 by providing commands and data. In one embodiment, the host 110 may offload a processing operation to the accelerator 120. That is, instead of performing a specific computation itself, the host 110 may transmit a command and data to the accelerator 120 so that the accelerator 120 performs that computation.

The accelerator 120 may correspond to various types of coprocessors. In one embodiment, the accelerator 120 may be any of various types of accelerators, such as a field-programmable gate array (FPGA), a massively parallel processor array (MPPA), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural processing unit (NPU), a tensor processing unit (TPU), and a multi-processor system-on-chip (MPSoC). For example, the accelerator 120 may be an accelerator that performs deep learning operations.

The memory module 130 may write or read data under the control of the accelerator 120. For example, in response to a command and an address provided from the accelerator 120, the memory module 130 may write data or provide read data to the accelerator 120.

In one embodiment, the computing system 100 may include the accelerator 120 and the memory module 130 in the form of a card inserted into a board (or main board). For example, an add-in card including one or more accelerators and one or more memory modules may be mounted in an expansion slot on the board. For example, a card mounted in the expansion slot may include one accelerator and one or more memory modules. As a variation, the card may include one accelerator and a plurality of memory modules. Alternatively, the card may include a plurality of accelerators and a plurality of memory modules.

The host 110 and the accelerator 120 may communicate with each other through a bus. In one embodiment, the bus may support communication between the components connected to it through various protocols such as Peripheral Component Interconnect (PCI), PCI Express (PCIe), BlueLink, and QuickPath Interconnect (QPI).

In one embodiment, when the bus supports communication according to the PCI protocol, the card may correspond to a PCI card, and when the card includes an accelerator, the card may be referred to as a graphics card or an accelerator card.

In one embodiment, the memory module 130 may be a DRAM such as a double data rate synchronous dynamic random access memory (DDR SDRAM), a low power double data rate (LPDDR) SDRAM, a graphics double data rate (GDDR) SDRAM, or a Rambus dynamic random access memory (RDRAM). However, embodiments of the present disclosure are not limited thereto. As an example, the memory module 130 may be implemented as a nonvolatile memory such as a flash memory, a magnetic RAM (MRAM), a ferroelectric RAM (FeRAM), a phase change RAM (PRAM), or a resistive RAM (ReRAM).

In one embodiment, the host 110 may update the program to be processed by the processor of the accelerator 120 (that is, the accelerator program). The host 110 may update the accelerator program by performing an initialization operation on the accelerator 120. That is, when updating the accelerator program, the host 110 may write the updated accelerator program to a nonvolatile memory of the accelerator 120 (for example, a flash memory or a Secure Digital (SD) card).

However, the host 110 according to an embodiment of the present disclosure may update the accelerator program while the accelerator 120 is operating, without performing an initialization operation on the accelerator 120. For example, the host 110 may compile an updated accelerator program. The host 110 may provide the compiled accelerator program to the accelerator 120. The accelerator 120 may update the program under the control of the host 110. The update operation of the accelerator 120 according to an embodiment of the present disclosure is described in more detail below with reference to the drawings.

FIG. 2 is a block diagram illustrating the accelerator of FIG. 1. Referring to FIGS. 1 and 2, the accelerator 120 may include a processor 121, a cache memory device 122, a ROM 123, a direct memory access (DMA) engine 124, a host interface circuit 125, a reconfigurable logic circuit 126, and a memory controller 127. The accelerator 120 may perform deep learning operations. In one embodiment, the accelerator 120 may be configured to run various deep neural networks (DNNs) such as YOLO, ResNet, ResNeXt, DenseNet, and a graph convolutional network (GCN).

The processor 121 may control the overall operation of the accelerator 120. The cache memory device 122 may store data, or provide stored data to the processor 121, in response to signals received from the processor 121. The cache memory device 122 may have a faster access speed than the memory module 130. That is, because some of the data stored in the memory module 130 is also held in the cache memory device 122, accesses requested by the processor 121 can be served faster. In one embodiment, the cache memory device 122 may be a static random access memory (SRAM) device, but the scope of the present disclosure is not limited thereto.

The ROM 123 may store, in the form of firmware, various information required for the accelerator 120 to operate. The DMA engine 124 may be a hardware device that controls direct memory access (DMA) operations between the host 110 and the memory module 130 without intervention of the processor 121.

For example, the accelerator 120 may operate in a DMA mode in order to improve the data transfer speed. The DMA mode refers to an operation mode in which data is transferred under the control of the DMA engine 124, without intervention of the processor 121 or a core included in the accelerator 120. That is, because no control or processing by the processor or core is required while data is being transferred, the data transfer speed can be improved.

In one embodiment, the DMA engine 124 may control or manage data transfer between the host 110 and the memory module 130. For example, the DMA engine 124 may read the binary file compiled by the host 110 (that is, the new accelerator program) from the host 110 and write the binary file to the memory module 130. The DMA engine 124 may transmit a write command and an address to the memory module 130 through a command/address line. The DMA engine 124 may transmit data (or the binary file) to the memory module 130 through a data line.
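
The transfer described above, a write command and address on one line and the payload on another, can be sketched as a burst-by-burst copy. The function name `dma_copy`, the burst size, and the use of a `bytearray` as the memory module are assumptions made for this illustration; the disclosure does not state how the DMA engine divides a transfer.

```python
def dma_copy(host_buffer, memory, base_addr, burst=4):
    """Copy host_buffer into memory starting at base_addr, one burst at a time.

    Returns the list of (command, address) pairs that a command/address
    line would carry; the payload of each burst models the data line.
    """
    if base_addr < 0 or base_addr + len(host_buffer) > len(memory):
        raise ValueError("transfer does not fit in the memory module")

    commands = []
    for offset in range(0, len(host_buffer), burst):
        chunk = host_buffer[offset:offset + burst]
        addr = base_addr + offset
        commands.append(("WRITE", addr))          # command/address line
        memory[addr:addr + len(chunk)] = chunk    # data line
    return commands
```

Nothing in the loop involves the processor model, which is the point of DMA mode: the copy proceeds while the processor is free to do something else, or to do nothing.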

The accelerator 120 may communicate with the host 110 through the host interface circuit 125. As described above, the host interface circuit 125 may be a PCIe interface. However, the scope of the present disclosure is not limited thereto, and the host interface circuit 125 may include at least one of various interfaces such as Universal Serial Bus (USB), multimedia card (MMC), embedded MMC (eMMC), Advanced Technology Attachment (ATA), Serial-ATA, Parallel-ATA, small computer system interface (SCSI), enhanced small disk interface (ESDI), Integrated Drive Electronics (IDE), FireWire, and Universal Flash Storage (UFS).

The reconfigurable logic circuit 126 may be configured to assist the host 110 by performing some of the operations that the host 110 would otherwise perform. For example, the reconfigurable logic circuit 126 may be a field-programmable gate array (FPGA). However, the present disclosure is not limited thereto, and the reconfigurable logic circuit 126 may be a programmable logic device (PLD), a complex PLD (CPLD), or the like.

In one embodiment, the reconfigurable logic circuit 126 may perform machine learning algorithms such as a convolutional neural network (CNN) or a recurrent neural network (RNN). For example, the reconfigurable logic circuit 126 may be configured to perform video transcoding, and may be configured to execute a CNN under the control of the host 110 or the processor 121.

For example, the reconfigurable logic circuit 126 may perform inline processing, pre-processing, pre-filtering, cryptography, compression, protocol bridging, and the like. The reconfigurable logic circuit 126 may perform one or more of a sorting operation, a searching operation, a logic operation, and an arithmetic operation.

For example, a logic operation may be an operation performed by any of various logic gates, such as an AND gate, an OR gate, an XOR gate, a NOR gate, or a NAND gate, or a combination of two or more such operations. The operations performed by the reconfigurable logic circuit 126 are not limited to the examples described above, and may be any operation corresponding to some of the operations performed by the host 110.

The accelerator 120 may communicate with the memory module 130 through the memory controller 127. The memory controller 127 may control the memory module 130. For example, the memory controller 127 may transmit an address ADDR, a command CMD, and a control signal CTRL for controlling the memory module 130 to the memory module 130, and may exchange data DATA with the memory module 130 through a data line DQ. For example, the memory controller 127 may transmit a command and an address to the memory module 130 through a command/address line.

FIG. 3 is a diagram illustrating a binary file of an accelerator program according to an embodiment of the present disclosure. Referring to FIG. 3, the binary file BF of the accelerator program may be in the Executable and Linking Format (ELF). The binary file BF may be loaded into the memory module 130. The binary file BF may be placed between a first address A1 and a fourth address A4 of the memory module 130.

The binary file BF according to an embodiment of the present disclosure may further include an update section P3 in addition to the general program sections P1 and P2. That is, the binary file may include the general program sections P1 and P2 and the update section P3. The general program sections P1 and P2 may be placed between the first address A1 and a third address A3 of the memory module 130, and the update section P3 may be placed between the third address A3 and the fourth address A4 of the memory module 130.

The general program sections P1 and P2 may include a .start section, a .startup section, a .text section, a .rodata section, a .data section, and a .bss section. For example, the .text section is the region into which the actual instructions (or code) are loaded, the .rodata section or the .data section is the region in which variables (or data) are stored, and the .bss section is the region for uninitialized data. The .start section or the .startup section is the region for code that is executed before the main function of the program.

In one embodiment, the .start section may be placed between the first address A1 and a second address A2, and the remaining sections (the .startup, .text, .rodata, .data, and .bss sections, and so on) may be placed between the second address A2 and the third address A3.

The linker script and the source program may be set up so that the .update section is placed starting at a predetermined address (for example, the third address A3). For example, when a specific function or variable is declared in the code, a section name may be given to it as an attribute, and the linker script may be set to place that section at a specific address. That is, when the update function is declared in the code, the name of the .update section may be given as an attribute, and the linker script may be set to place the .update section at the third address A3.
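
The effect of pinning one section to a fixed address can be modeled in a few lines. The sketch below is not a linker script and does not reproduce any linker's behavior in detail; it only mimics the placement rule described above. The function names, the section sizes, and the concrete addresses used in the usage note are all hypothetical.

```python
def place_sections(sizes, start_addr, pinned):
    """Assign [begin, end) address ranges to sections in declaration order.

    sizes  : dict of section name -> size in bytes (insertion order is kept)
    pinned : dict of section name -> fixed begin address
    Unpinned sections follow one another; a pinned section moves the
    cursor to its fixed address.
    """
    layout = {}
    cursor = start_addr
    for name, size in sizes.items():
        if name in pinned:
            if pinned[name] < cursor:
                raise ValueError(f"{name} would overlap the previous section")
            cursor = pinned[name]
        layout[name] = (cursor, cursor + size)
        cursor += size
    return layout


def section_of(layout, addr):
    """Return the name of the section containing addr, or None."""
    for name, (begin, end) in layout.items():
        if begin <= addr < end:
            return name
    return None
```

For instance, with A1 assumed to be `0x1000` and A3 assumed to be `0x3000`, calling `place_sections({".start": 0x100, ".text": 0x800, ".data": 0x200, ".update": 0x100}, 0x1000, {".update": 0x3000})` keeps `.update` at `0x3000` no matter how the sizes of the general program sections change between builds, which is what lets an old and a new binary agree on where the update routine lives.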

도 4는 도 1의 가속기의 동작을 보여주는 순서도이다. 도 5는 도 1의 가속기의 프로그램 업데이트 동작을 설명하기 위한 도면이다. 도 1 및 도 4를 참조하면, S110 단계에서, 가속기(120)는 제1 인터럽트에 응답하여, 캐시 플러시 동작을 수행하고, 프로그램 카운터(PC)는 업데이트 섹션(.update 섹션)으로 점프할 수 있다. 일 실시 예에서, 가속기(120)는 호스트(110)로부터 제1 인터럽트를 수신할 수 있다. 예를 들어, 제1 인터럽트는 가속기 프로그램 업데이트 시작을 가리키는 신호일 수 있다. 4 is a flowchart illustrating an operation of the accelerator of FIG. 1 . FIG. 5 is a diagram for explaining a program update operation of the accelerator of FIG. 1 . 1 and 4 , in step S110 , the accelerator 120 may perform a cache flush operation in response to the first interrupt, and the program counter PC may jump to an update section (.update section). . In an embodiment, the accelerator 120 may receive a first interrupt from the host 110 . For example, the first interrupt may be a signal indicating the start of the accelerator program update.

The accelerator 120 may perform the cache flush operation in response to the first interrupt provided from the host 110. For example, the accelerator 120 may flush data stored in the cache memory device 122 to the memory module 130 by means of the interrupt routine triggered by the first interrupt.
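The flush step can be illustrated with a toy write-back cache. This Python sketch is an assumption-laden model, not the cache memory device 122 itself: the class name `WriteBackCache`, the dictionary-based backing store standing in for the memory module 130, and the per-address granularity are all hypothetical. It shows only the behavior the text relies on, namely that data modified in the cache reaches memory when the flush runs.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict-based memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory  # backing store: address -> value
        self.lines = {}       # cached lines: address -> (value, dirty flag)

    def read(self, addr):
        # On a miss, fill the line from memory and mark it clean.
        if addr not in self.lines:
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        # Write-back policy: update only the cache and mark the line dirty.
        self.lines[addr] = (value, True)

    def flush(self):
        """Write every dirty line to memory; return how many lines were written."""
        flushed = 0
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)
                flushed += 1
        return flushed
```

Flushing before the update matters because lines left dirty in the cache could otherwise be written back later, on top of the newly written program image.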

The accelerator 120 may jump the program counter PC to an interrupt service routine (ISR) whose code is located in the .update section, and may wait there in an infinite loop. As shown in FIG. 5, the program counter PC, located in the general program sections P2, may jump to the third address A3 where the .update section P3 is located.

The accelerator 120 may perform no other operation while the new program is being updated. The accelerator 120 may jump the program counter PC to the .update section and wait until the update to the new program is complete. As a result, the program counter PC may not access the general program sections P1 and P2 while the program is being updated.

In step S120, the accelerator 120 may update to a new program. In one embodiment, the accelerator 120 may receive the new program from the host 110. For example, the host 110 may write the new program directly to the memory module 130. Alternatively, a direct memory access (DMA) engine included in the accelerator 120 may write the new program to the memory module 130.

In one embodiment, when the binary file is transferred from the host 110 to the accelerator 120 through the DMA engine 124, the accelerator 120 may receive a third interrupt related to the binary file. The interrupt service routine triggered by the third interrupt may be placed in the update section.

In step S130, the accelerator 120 may jump the program counter PC to the start section (.start section) in response to a second interrupt. For example, the accelerator 120 may receive the second interrupt from the host 110. The second interrupt may be a signal indicating the end of the program update.

The accelerator 120 may jump the program counter PC to the .start section by means of the interrupt routine triggered by the second interrupt. As shown in FIG. 5, the program counter PC, located in the .update section P3, may jump to the first address A1 where the .start section P1 is located. Thereafter, the accelerator 120 may execute the updated program.

As described above, the accelerator 120 may provide an additional .update section so that the program counter stays in the .update section while the program is being updated. As a result, the accelerator 120 may update to a new program while it is in operation.
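The sequence of steps S110 to S130 can be summarized as a small state machine. The Python sketch below is a behavioral model written for illustration only; the names `Section`, `AcceleratorModel`, `on_first_interrupt`, `on_second_interrupt`, and `can_overwrite_program` are assumptions and do not appear in the disclosure. It tracks only which section the program counter is in and the order of the actions.

```python
from enum import Enum


class Section(Enum):
    START = ".start"     # P1
    GENERAL = "general"  # P2
    UPDATE = ".update"   # P3


class AcceleratorModel:
    """Hypothetical model of the accelerator's update handshake."""

    def __init__(self):
        self.pc_section = Section.GENERAL  # running the general program
        self.log = []

    def on_first_interrupt(self):
        # S110: flush the cache, then park the program counter in .update.
        self.log.append("flush cache")
        self.pc_section = Section.UPDATE
        self.log.append("jump to .update")

    def can_overwrite_program(self):
        # S120 is safe only while the program counter stays in .update.
        return self.pc_section is Section.UPDATE

    def on_second_interrupt(self):
        # S130: restart from .start so the new program runs from its beginning.
        if self.pc_section is not Section.UPDATE:
            raise RuntimeError("second interrupt received outside an update")
        self.pc_section = Section.START
        self.log.append("jump to .start")
```

In this model the general program sections may be rewritten only while `can_overwrite_program()` returns True, which mirrors the requirement that the program counter does not access sections P1 and P2 during the update.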

FIG. 6 is a flowchart illustrating an operation of the computing system of FIG. 1. Referring to FIGS. 1 and 6, in step S210, the host 110 may perform a program compile operation. For example, the host 110 may compile an updated or new accelerator program. The host 110 may compile the accelerator program to generate a binary file.

In step S220, the host 110 may transmit the first interrupt to the accelerator 120. Through the first interrupt, the host 110 may notify the accelerator 120 to prepare for a program update. For example, the first interrupt may be a command that the host 110 transmits to the accelerator 120. Alternatively, the first interrupt may refer to a write operation by the host 110 to a memory-mapped I/O (MMIO) register of the accelerator 120.

In step S230, the accelerator 120 may perform a flush operation in response to the first interrupt and jump the program counter PC to the .update section. For example, the accelerator 120 may flush data in a dirty state stored in the cache memory device 122 to the memory module 130. The accelerator 120 may jump the program counter PC to the third address A3 where the .update section is placed.

In step S240, the host 110 may transmit the updated program to the accelerator 120. For example, the host 110 may transmit the compiled binary file directly to the accelerator 120. Alternatively, the DMA engine 124 included in the accelerator 120 may read the binary file from the host 110 and write it to the memory module 130.

In step S250, the accelerator 120 may update to the received new program. For example, the accelerator 120 may write the new program to the memory module 130. The processor 121 may wait in an infinite loop in the .update section. The DMA engine 124 included in the accelerator 120 may write the received binary file to the memory module 130.

In step S260, the host 110 may transmit the second interrupt to the accelerator 120. Through the second interrupt, the host 110 may notify the accelerator 120 that the program update has ended. For example, the second interrupt may be a command that the host 110 transmits to the accelerator 120. Alternatively, the second interrupt may refer to a write operation by the host 110 to a memory-mapped I/O (MMIO) register of the accelerator 120.
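The MMIO variant of the two interrupts can be modeled as register writes that trigger handlers. The following Python sketch is hypothetical throughout: the disclosure does not specify register offsets or values, so `UPDATE_BEGIN`, `UPDATE_END`, the class `MmioRegisters`, and its methods are assumptions made for illustration.

```python
class MmioRegisters:
    """Toy register file in which a write may trigger a handler (hypothetical)."""

    def __init__(self):
        self.values = {}
        self.handlers = {}

    def on_write(self, offset, handler):
        # Register the routine that runs when the host writes this offset.
        self.handlers[offset] = handler

    def write(self, offset, value):
        self.values[offset] = value
        handler = self.handlers.get(offset)
        if handler is not None:
            handler(value)


# Hypothetical register offsets; the disclosure does not define any.
UPDATE_BEGIN = 0x00
UPDATE_END = 0x04

events = []
regs = MmioRegisters()
regs.on_write(UPDATE_BEGIN, lambda value: events.append("first interrupt"))
regs.on_write(UPDATE_END, lambda value: events.append("second interrupt"))

# Host side: one write before sending the binary file, one write after.
regs.write(UPDATE_BEGIN, 1)
regs.write(UPDATE_END, 1)
```

After the two writes, `events` holds the two notifications in the order in which the host issued them.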

In step S270, the accelerator 120 may jump the program counter PC to the .start section. For example, the accelerator 120 may jump the program counter PC to the first address A1 where the .start section is placed. Thereafter, the accelerator 120 may start from the initialization of the new program.

As described above, when performing an accelerator program update, the computing system 100 may jump the program counter PC of the accelerator to the .update section, which is separate from the space in which the program is updated. As a result, the computing system 100 may update to a new program through the DMA engine 124 without reprogramming a ROM or flash memory of the accelerator.
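The whole exchange of steps S220 to S270 can be walked through on a tiny simulated memory. This Python sketch is a simplified model under stated assumptions: the address map (`A1`, `A3`, `MEM_SIZE`), the class `Accelerator`, the function `host_update`, and the idea that the new image is written only below A3 are all introduced here for illustration, and a plain byte copy stands in for the DMA engine 124.

```python
A1, A3, MEM_SIZE = 0x00, 0x40, 0x50  # hypothetical, deliberately tiny address map


class Accelerator:
    """Hypothetical model: memory module, program counter, and pending cache data."""

    def __init__(self):
        self.memory = bytearray(MEM_SIZE)
        self.pc = A1 + 4           # somewhere inside the running program
        self.dirty = {0x10: 0xAB}  # cached write not yet in memory

    def first_interrupt(self):
        # S230: flush dirty data, then jump to the update section at A3.
        for addr, value in self.dirty.items():
            self.memory[addr] = value
        self.dirty.clear()
        self.pc = A3

    def dma_write(self, image):
        # S250: copy the new image while the program counter waits in .update.
        if self.pc < A3:
            raise RuntimeError("program counter is not in the update section")
        if len(image) > A3 - A1:
            raise ValueError("image would overwrite the update section")
        self.memory[A1:A1 + len(image)] = image

    def second_interrupt(self):
        # S270: restart at the first address so the new program initializes.
        self.pc = A1


def host_update(acc, image):
    """Host-side order of operations: S220, S240, then S260."""
    acc.first_interrupt()
    acc.dma_write(image)
    acc.second_interrupt()
```

A usage example: create an `Accelerator`, fill `memory[A3:]` with a marker pattern to represent the update routine, call `host_update` with a 32-byte image, and observe that the marker bytes are unchanged while the program counter ends at `A1`.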

FIG. 7 is a block diagram illustrating a computing system according to an embodiment of the present disclosure. Referring to FIG. 7, the computing system 1000 includes a processor 1100, a memory device 1200, an accelerator 1300, a chipset 1400, a GPU 1500, an input/output device 1600, and a storage device 1700. The processor 1100 may control overall operations of the computing system 1000. The processor 1100 may perform various computations carried out in the computing system 1000.

The memory device 1200 may communicate through a double data rate (DDR) interface. For example, the processor 1100 may use the memory device 1200 as a working memory, a buffer memory, or a cache memory of the computing system 1000.

The accelerator 1300 may be connected to the chipset 1400. The accelerator 1300 may be the accelerator 120 described with reference to FIGS. 1 to 6. The chipset 1400 may be electrically connected to the processor 1100 and may control hardware of the computing system 1000 under the control of the processor 1100. For example, the chipset 1400 may be connected to each of the GPU 1500, the input/output device 1600, and the storage device 1700 through main buses, and may serve as a bridge for the main buses.

The GPU 1500 may perform a series of computations for outputting image data of the computing system 1000. For example, the GPU 1500 may be mounted in the processor 1100 in a system-on-chip form.

The input/output device 1600 includes various devices that input data or commands to the computing system 1000 or output data to the outside. For example, the input/output device 1600 may include user input devices such as a keyboard, a keypad, a button, a touch panel, a touch screen, a touch pad, a touch ball, a camera, a microphone, a gyroscope sensor, a vibration sensor, and a piezoelectric element, and user output devices such as a liquid crystal display (LCD), an organic light emitting diode (OLED) display device, an active matrix OLED (AMOLED) display device, an LED, a speaker, and a motor.

The storage device 1700 may be used as a storage medium of the computing system 1000. The storage device 1700 may include mass storage media such as a hard disk drive, an SSD, a memory card, and a memory stick.

FIG. 8 is a block diagram illustrating a data center to which a computing system according to an embodiment of the present disclosure is applied. Referring to FIG. 8, the data center 2000 may include a plurality of computing nodes 2100 to 2400 (or servers). The plurality of computing nodes 2100 to 2400 may communicate with each other through a network NT. In one embodiment, the network NT may be a storage-dedicated network such as a storage area network (SAN), or an Internet network such as TCP/IP. In one embodiment, the network NT may include at least one of various communication protocols such as Fibre Channel, iSCSI, FCoE, NAS, and NVMe-oF.

The plurality of computing nodes 2100 to 2400 may respectively include processors 2110, 2210, 2310, and 2410, memories 2120, 2220, 2320, and 2420, accelerators 2130, 2230, 2330, and 2430, and interface circuits 2140, 2240, 2340, and 2440.

For example, the first computing node 2100 may include a first processor 2110, a first memory 2120, a first accelerator 2130, and a first interface circuit 2140. In one embodiment, the first processor 2110 may be implemented as a single core or as multiple cores. The first memory 2120 may include a memory such as DRAM, SDRAM, SRAM, 3D XPoint memory, MRAM, PRAM, FeRAM, or ReRAM. The first memory 2120 may be used as a system memory, a working memory, or a buffer memory of the first computing node 2100. The first accelerator 2130 may be the accelerator 120 described with reference to FIGS. 1 to 6. The first interface circuit 2140 may be a network interface controller (NIC) configured to support communication through the network NT.

In one embodiment, the first processor 2110 of the first computing node 2100 may be configured to access the first memory 2120 based on a predetermined memory interface. Alternatively, in an embodiment with a shared memory architecture, the first processor 2110 of the first computing node 2100 may be configured to access the memories 2220, 2320, and 2420 of the other computing nodes 2200, 2300, and 2400 through the network NT. The first interface circuit 2140 may include a network switch (not shown) configured to control or support access by the first processor 2110 to the shared memory (that is, the memories of the other computing nodes).

In one embodiment, the first processor 2110 of the first computing node 2100 may be configured to access the first accelerator 2130 based on a predetermined interface. Alternatively, the first processor 2110 of the first computing node 2100 may be configured to access the accelerators 2230, 2330, and 2430 of the other computing nodes 2200, 2300, and 2400 through the network NT. The first interface circuit 2140 may include a network switch (not shown) configured to control or support the above-described access by the first processor 2110 to the other accelerators.

The second to fourth computing nodes 2200 to 2400 may perform operations similar to those of the first computing node 2100 described above, and a detailed description thereof is omitted.

In one embodiment, various applications may be executed in the data center 2000. The applications may be configured to execute instructions for moving or copying data between the computing nodes 2100 to 2400, or to execute instructions for combining, processing, and reproducing various information present on the computing nodes 2100 to 2400. In one embodiment, the applications may be executed by any one of the plurality of computing nodes 2100 to 2400 included in the data center 2000, or may be distributed across and executed by the plurality of computing nodes 2100 to 2400.

In one embodiment, the data center 2000 may be used for high-performance computing (HPC) (for example, finance, petroleum, materials science, and weather forecasting), enterprise applications (for example, scale-out databases), and big data applications (for example, NoSQL databases and in-memory replication).

In one embodiment, at least one of the plurality of computing nodes 2100 to 2400 may be an application server. The application server may be configured to run applications configured to perform various operations in the data center 2000. At least one of the plurality of computing nodes 2100 to 2400 may be a storage server. The storage server may be configured to store data generated or managed in the data center 2000.

In one embodiment, each of the plurality of computing nodes 2100 to 2400 included in the data center 2000, or parts thereof, may be located at the same site or at physically separate sites, and may communicate with each other through the network NT, which is based on wireless or wired communication. In one embodiment, the plurality of computing nodes 2100 to 2400 included in the data center 2000 may be implemented based on the same memory technology or on different memory technologies.

Although not shown in the drawings, at least some of the plurality of computing nodes 2100 to 2400 of the data center 2000 may communicate with an external client node (not shown) through the network NT or another communication interface (not shown). At least some of the plurality of computing nodes 2100 to 2400 may, at the request of the external client node, process a request (for example, data storage or data transmission) themselves or have another computing node process the request.

In one embodiment, the number of computing nodes 2100 to 2400 included in the data center 2000 and the numbers of processors, memories, and accelerators included in each computing node are examples, and the scope of the present disclosure is not limited thereto.

The foregoing describes specific embodiments for carrying out the present disclosure. The present disclosure includes not only the embodiments described above but also embodiments that can be obtained through simple or straightforward design changes. The present disclosure also includes techniques that can be readily modified and practiced using the embodiments. Therefore, the scope of the present disclosure should not be limited to the embodiments described above, but should be defined by the claims set forth below and by equivalents of those claims.

100: computing system
110: host
120: accelerator
130: memory module

Claims

A method of operating an accelerator, the method comprising:
performing a cache flush operation in response to a first interrupt;
performing an update section jump operation to jump a program counter to an update section after performing the cache flush operation;
performing a program update operation to write a compiled binary file to a memory module connected to the accelerator after the update section jump operation; and
performing, after the program update operation, a start section jump operation to jump the program counter to a start section in response to a second interrupt.

The method of claim 1,
wherein the accelerator communicates with the host via a Peripheral Component Interconnect express (PCI-express) interface.

The method of claim 1, further comprising:
receiving the first interrupt from a host before the cache flush operation; and
receiving the second interrupt from the host after the program update operation.

The method of claim 1,
wherein the binary file includes, as arranged by a linker script, the update section and general program sections that include the start section.

The method of claim 1,
wherein the binary file is generated by a host compiling an accelerator program.

The method of claim 1,
wherein performing the program update operation comprises: reading, by a DMA engine included in the accelerator, the binary file from a host; and
writing, by the DMA engine, the binary file to the memory module.

The method of claim 6,
wherein writing the binary file to the memory module comprises:
sending a write command and an address to the memory module through a command/address line; and
transferring the binary file to the memory module through a data line.

The method of claim 1,
wherein performing the cache flush operation comprises writing, to the memory module, data in a dirty state stored in a cache memory device included in the accelerator.

The method of claim 1,
wherein each of the first and second interrupts indicates a write operation by a host to a memory-mapped I/O (MMIO) register.

The method of claim 1,
wherein the accelerator includes any one of a field-programmable gate array (FPGA), a massively parallel processor array (MPPA), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural processing unit (NPU), a tensor processing unit (TPU), and a multi-processor system-on-chip (MPSoC).

The method of claim 1,
wherein, while the program update operation is performed, the program counter waits in an infinite loop.

A computing system comprising:
a host configured to perform a compile operation on an accelerator program to generate a binary file including an update section and a start section, and to transmit first and second interrupts; and
an accelerator configured to jump a program counter to the update section in response to the first interrupt and to jump the program counter to the start section in response to the second interrupt,
wherein the first interrupt indicates the start of an accelerator program update, and the second interrupt indicates the end of the accelerator program update.

The computing system of claim 12,
further comprising a memory module coupled to the accelerator and configured to store the binary file executed by the accelerator.

The computing system of claim 13,
wherein the accelerator includes a DMA engine configured to read the binary file from the host and write the binary file to the memory module.

The computing system of claim 13,
wherein the accelerator includes:
a processor configured to execute the binary file; and
a cache memory device configured to store data used by the processor and, in response to the first interrupt, to flush data in a dirty state to the memory module.

The computing system of claim 12,
wherein the accelerator further comprises a reconfigurable logic circuit configured to perform a machine learning algorithm such as a convolutional neural network (CNN) or a recurrent neural network (RNN).