KR102551726B1 - Memory system

Info

Publication number: KR102551726B1
Application number: KR1020180155681A
Authority: KR
Inventors: 김선웅; 임의철
Original assignee: 에스케이하이닉스 주식회사
Priority date: 2018-07-23
Filing date: 2018-12-06
Publication date: 2023-07-06
Also published as: KR20200018188A; TW202008172A

Abstract

An embodiment of the present invention relates to a memory system, and more particularly to an accelerator for a large-capacity memory device. The memory system includes a plurality of memories that store data, and a pooled memory controller that reads the data stored in the plurality of memories, performs a map operation, and stores result data of the map operation in the plurality of memories.

Description

Memory system {Memory system}

The present invention relates to a memory system, and more particularly to an accelerator for a large-capacity memory device.

Mobile communication terminals such as smartphones and tablet PCs have recently become widespread. In addition, the use of social network services (SNS), machine-to-machine (M2M) networks, sensor networks, and the like is increasing. As a result, the amount of data, the speed at which it is generated, and its variety are growing exponentially. Processing big data requires not only fast memory but also memory devices and memory modules with large storage capacity.

Accordingly, a memory system includes a plurality of integrated memories in order to increase data storage capacity while overcoming the physical limitations of a single memory. For example, the server architecture of cloud data centers is changing to a structure that executes big-data applications efficiently.

To process big data efficiently, a pooled memory in which a plurality of memories are integrated is used. A pooled memory can provide large capacity and high bandwidth, and is useful for in-memory databases and similar applications.

An embodiment of the present invention provides a memory system that includes an accelerator inside a pooled memory, thereby reducing the energy consumption of the system and improving its performance.

A memory system according to an embodiment of the present invention includes: a plurality of memories that store data; and a pooled memory controller that reads the data stored in the plurality of memories, performs a map operation, and stores result data of the map operation in the plurality of memories.

A memory system according to another embodiment of the present invention includes: a fabric network connected to a processor; and a pooled memory that exchanges packets with the processor through the fabric network and delivers data stored in a memory to the processor at the processor's request. The pooled memory includes a pooled memory controller that reads the data stored in the memory, performs an off-loaded map operation, and stores result data of the map operation in the memory.

Embodiments of the present invention improve system performance and save the energy required for data operations.

FIG. 1 is a diagram illustrating the concept of a memory system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a memory system according to an embodiment of the present invention.
FIG. 3 is a diagram showing a detailed configuration of the pooled memory controller of FIG. 2.
FIGS. 4 to 6 are diagrams illustrating the operation of a memory system according to an embodiment of the present invention.
FIG. 7 is a diagram showing the performance improvement of a memory system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the embodiments, when a part is said to be "connected" to another part, this covers not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" or "comprise" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. Furthermore, even where some components are described in the singular throughout the specification, the present invention is not limited thereto, and such components may be provided in plurality.

Data center applications require more hardware resources as data sizes continue to grow. Server architecture is evolving toward more efficient use of hardware resources.

For example, many machine learning applications, including deep learning, run in cloud data centers. Most of these applications have low temporal locality, so their computation is generally performed on a hardware accelerator such as a graphics processing unit (GPU) or a field programmable gate array (FPGA) rather than on a central processing unit (CPU).

Here, temporal locality means that data accessed once is accessed again within a relatively short time. In other words, the applications above use cold data that has not been accessed for a while rather than frequently accessed hot data.

The process by which a processor (for example, a central processing unit) off-loads a job to an accelerator is as follows. First, data is moved from the local memory of the processor to the local memory of the accelerator. Then, when the accelerator finishes its computation, the result data must be moved back to the processor.

If the cost of moving data is greater than the cost of computing on it, an architecture that moves data as little as possible reduces the overall cost. To this end, the memory-driven computing concept has recently been proposed.

FIG. 1 is a diagram illustrating the concept of a memory system according to an embodiment of the present invention.

The embodiment of FIG. 1 shows the architecture changing from a system-on-chip (SoC)-centric, that is, processor-centric, computing structure to a memory-centric computing structure. In the processor-centric computing structure, one SoC is connected to one memory in a one-to-one manner.

In memory-driven computing, several SoCs use a unified memory to which they are connected through a fabric network. When the SoCs exchange data, they do so at memory bandwidth.

In addition, with a single unified memory connected through a fabric network, exchanging data no longer requires a memory copy as it did before. For memory-driven computing of this kind to be commercialized, the memory semantic interconnect must support high bandwidth, low latency, coherency, and the like.

In this regard, interconnects for transaction-based memory systems are being actively researched in the technical field to which the embodiments of the present invention belong.

With regard to accelerators, where to place the accelerator according to the characteristics of the workload is also widely studied, as in near data processing (NDP) and processing in memory (PIM). Here, processing in memory refers to a memory in which processor logic is tightly coupled to the memory cells in order to increase data processing speed and data transfer speed.

An embodiment of the present invention relates to a pooled memory architecture in which a plurality of memories are integrated, and to in-memory database usage suited to that architecture. The following describes the characteristics of a map-reduce application according to an embodiment of the present invention, and the process by which a map operation is handled by an accelerator (described later) inside the pooled memory.

FIG. 2 is a diagram showing the configuration of a memory system according to an embodiment of the present invention.

The memory system 10 according to an embodiment of the present invention is based on the memory-driven computing structure described above. The memory system 10 includes a plurality of processors 20 (for example, central processing units, CPUs), a fabric network 30, a plurality of channels 40, and a plurality of pooled memories 100.

Here, the plurality of processors 20 are connected to the fabric network 30 through nodes CND, and are connected to the plurality of pooled memories 100 through the fabric network 30. The pooled memories 100 are connected to the fabric network 30 through the plurality of channels 40. That is, each of the plurality of pooled memories 100 may be connected to the fabric network 30 through N channels 40.

Each of the plurality of pooled memories 100 includes a plurality of memories 120 and a pooled memory controller (PMC) 110 that controls the plurality of memories 120. The pooled memory controller 110 is connected to each memory 120 through a bus BUS.

Each memory 120 may be connected directly to the fabric network 30. Alternatively, a plurality of memories 120 may be included in one integrated pooled memory 100, and the pooled memory 100 may be connected to the fabric network 30.

When the pooled memory 100 includes a plurality of memories 120, the pooled memory controller 110 sits between the fabric network 30 and the plurality of memories 120 and manages each of the memories 120.

Here, the pooled memory controller 110 performs memory interleaving to increase throughput, and supports address remapping and the like to improve reliability, availability, and serviceability.
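The interleaving mentioned above can be sketched as follows. The patent does not specify the mapping, so this is only an illustration that assumes a simple round-robin scheme with a fixed stripe size; `NUM_MEMORIES`, `STRIPE_BYTES`, `interleave`, and `memories_touched` are hypothetical names.

```python
NUM_MEMORIES = 4    # assumed number of memories 120 behind one controller
STRIPE_BYTES = 64   # assumed interleaving granularity in bytes


def interleave(addr: int) -> tuple[int, int]:
    """Map a pooled-memory address to (memory index, offset inside that memory)."""
    stripe = addr // STRIPE_BYTES
    memory_index = stripe % NUM_MEMORIES      # round-robin across memories
    local_stripe = stripe // NUM_MEMORIES     # stripe number within that memory
    offset = local_stripe * STRIPE_BYTES + addr % STRIPE_BYTES
    return memory_index, offset


def memories_touched(start: int, size: int) -> set[int]:
    """Return the set of memories that hold some byte of [start, start + size)."""
    first = start // STRIPE_BYTES
    last = (start + size - 1) // STRIPE_BYTES
    return {stripe % NUM_MEMORIES for stripe in range(first, last + 1)}
```

With these assumed parameters, a contiguous 256-byte region is spread over all four memories, which is why consecutive reads can proceed in parallel.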

An in-memory database is a database management system that stores the database in main memory rather than in storage so that it can be accessed quickly.

Current server systems face physical limits on how far memory capacity can be increased. As a result, an application cannot grow its database beyond the memory capacity of each server. When the database grows larger, it inevitably has to be split across several servers, and performance degrades in the process of tying those servers together. Because the pooled memory 100 provides large capacity and high bandwidth, it can be used to advantage for an in-memory database.

FIG. 3 is a diagram showing a detailed configuration of the pooled memory controller 110 of FIG. 2.

The pooled memory controller 110 includes an interface 111 and an accelerator 112. The interface 111 relays packets among the fabric network 30, the accelerator 112, and the memories 120. The interface 111 is connected to the accelerator 112 through a plurality of channels CN.

In an embodiment of the present invention, the interface 111 may include a switch for relaying packets among the fabric network 30, the accelerator 112, and the memories 120. Although the interface 111 is described here as including a switch by way of example, the embodiments are not limited thereto, and other means of relaying packets may be used.

The accelerator 112 performs computation on data supplied through the interface 111. For example, the accelerator 112 performs a map operation on data supplied from the memories 120 through the interface 111, and stores the result data of the map operation in the memories 120 through the interface 111.

In this embodiment, the pooled memory controller 110 is described as including one accelerator 112 by way of example. However, the embodiments are not limited thereto, and the pooled memory controller 110 may include a plurality of accelerators 112.

Map-reduce is a software framework designed to process large volumes of data with distributed parallel computing, and the map-reduce library is used by a wide variety of applications. In a map-reduce application, the map operation extracts intermediate information in the form of (key, value) pairs, and the reduce operation collects these pairs and outputs the desired final result.

For example, suppose a map-reduce application is used to find "the highest global temperature in each year". The map operation reads text files, extracts the year and temperature information, and outputs a list of (year, temperature) pairs. The reduce operation then collects these results, sorts them by temperature, and outputs the desired final result. The point to note here is that the data used by the map operation is generally very large, whereas the result data of the map operation is relatively small.
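The temperature example above can be sketched in a few lines. The record format (`"year,temperature"` text lines) and the function names `map_op` and `reduce_op` are assumptions made for illustration; the patent does not define them.

```python
def map_op(lines):
    """Map step: extract (year, temperature) pairs from raw text records."""
    pairs = []
    for line in lines:
        year, temp = line.split(",")   # assumed record format: "year,temperature"
        pairs.append((int(year), float(temp)))
    return pairs


def reduce_op(pairs):
    """Reduce step: keep the highest temperature seen for each year."""
    highest = {}
    for year, temp in pairs:
        if year not in highest or temp > highest[year]:
            highest[year] = temp
    return dict(sorted(highest.items()))


records = ["1990,14.2", "1990,15.1", "1991,14.8", "1991,13.9"]
result = reduce_op(map_op(records))
print(result)
```

The map step touches every input record, while the reduce step only sees the much smaller list of pairs, which mirrors the size asymmetry the text points out.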

In an embodiment of the present invention, operations that process large volumes of data but reuse little of it, such as the map operation of a map-reduce application, can be off-loaded to the accelerator 112 in the pooled memory controller 110. Here, off-loading refers to the sequence of receiving and interpreting a request from the processor 20, performing the operation, and outputting the result. Processing data inside the pooled memory 100 saves the energy that would otherwise be spent transferring the data to the node CND of the processor 20, and also improves performance.

The accelerator 112 according to an embodiment of the present invention may be provided in the pooled memory controller 110 or located in a memory 120. From the standpoint of near data processing, processing data inside each memory 120 is more efficient than processing it inside the pooled memory controller 110.

However, the pooled memory controller 110 performs memory interleaving to provide high bandwidth, in which case data is divided and stored across several memories 120. The data required by the accelerator 112 may therefore also be spread over several memories 120. For this reason, the embodiment is described with the accelerator 112 physically located in the pooled memory controller 110 by way of example.

The following examines how much the memory system 10 as a whole gains, in terms of performance and energy, from off-loading the map operation of a map-reduce application to the accelerator 112.

If the operation handled by the accelerator 112 is simple, as the map operation of a map-reduce application is, the computation time in the accelerator 112 is determined by the bandwidth at which data is read from memory. The computation time of the accelerator 112 can therefore be reduced by increasing the bandwidth of the accelerator 112.

As shown in FIG. 3, the nodes CND of the processors 20 are connected to the pooled memory 100 via the fabric network 30. Assume that each node CND has one link L1 per processor 20 and that the accelerator 112 inside the pooled memory controller 110 has four links L2. That is, more bandwidth is allocated to the links L2 of the accelerator 112 than to the link L1 of the processor 20. In that case, a map operation off-loaded to the accelerator 112 can be computed four times faster than when it is processed by the processor 20.

Assume that, when the processor 20 performs both the map operation and the reduce operation, the map operation takes 99% of the total execution time. Also assume that several applications run on one processor 20 and that the map-reduce application accounts for 10% of the execution time of all applications. When the map operation is off-loaded to the accelerator 112, the map operation time is reduced to 1/4, and the overall system performance can be improved by 81%.
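As a rough cross-check of the stated assumptions, the effect of accelerating only the map portion can be estimated with an Amdahl-style model. This is a sketch under the assumption that everything other than the map operation runs at unchanged speed; the function `overall_speedup` is a hypothetical helper, and the model is not necessarily the one used to obtain the figure quoted in the text.

```python
def overall_speedup(app_share: float, map_share: float, map_speedup: float) -> float:
    """Amdahl-style estimate of whole-system speedup.

    app_share:   fraction of total run time spent in the map-reduce application
    map_share:   fraction of that application's time spent in the map operation
    map_speedup: factor by which the map operation is accelerated
    """
    accelerated = app_share * map_share          # fraction of total time that speeds up
    new_time = (1.0 - accelerated) + accelerated / map_speedup
    return 1.0 / new_time


# Values taken from the stated assumptions: 10% share, 99% map, 4x faster map.
whole_system = overall_speedup(0.10, 0.99, 4.0)
# Same model if the map-reduce application were the only workload.
app_only = overall_speedup(1.00, 0.99, 4.0)
print(f"whole system: {whole_system:.3f}x, application alone: {app_only:.3f}x")
```

Under this simple model the whole-system gain is about 1.08x, and about 3.88x for the map-reduce application in isolation, so the quoted improvement presumably rests on further assumptions not spelled out here.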

FIGS. 4 to 6 are diagrams illustrating the operation of the memory system 10 according to an embodiment of the present invention.

First, as shown by path (1) in FIG. 4, the processor 20 sends a map operation request packet to the pooled memory 100. That is, the map operation request packet sent by the processor 20 travels through the fabric network 30 and the interface 111 of the pooled memory controller 110 to the accelerator 112. The map operation request packet may include information such as the address where the data to be used for the map operation is stored, the size of the data, and the address where the result data of the map operation is to be stored.
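The three pieces of information listed for the request packet can be pictured as a small record. The patent gives no wire format, so the field names, the 64-bit little-endian layout, and the `MapRequestPacket` class below are purely illustrative assumptions.

```python
import struct
from dataclasses import dataclass

PACKET_FORMAT = "<QQQ"  # assumed layout: three unsigned 64-bit little-endian fields


@dataclass(frozen=True)
class MapRequestPacket:
    src_addr: int   # address where the input data for the map operation is stored
    size: int       # size of the input data in bytes
    dst_addr: int   # address where the map result data should be stored

    def pack(self) -> bytes:
        """Serialize the request into the assumed 24-byte layout."""
        return struct.pack(PACKET_FORMAT, self.src_addr, self.size, self.dst_addr)

    @classmethod
    def unpack(cls, raw: bytes) -> "MapRequestPacket":
        """Rebuild a request from bytes produced by pack()."""
        return cls(*struct.unpack(PACKET_FORMAT, raw))


request = MapRequestPacket(src_addr=0x1000, size=4096, dst_addr=0x8000)
assert MapRequestPacket.unpack(request.pack()) == request
```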

Next, as shown by path (2) in FIG. 4, the pooled memory controller 110 sends a map operation response packet to the processor 20 through the fabric network 30. That is, the pooled memory controller 110 sends the processor 20 a signal indicating that the accelerator 112 has successfully received the map operation request packet.

Then, as shown by path (3) in FIG. 5, the pooled memory controller 110 reads the data required for the map operation from each memory 120 and passes it to the accelerator 112. The data required by the accelerator 112 may be spread over several memories 120, in which case the accelerator 112 reads the data from a plurality of memories 120. The accelerator 112 then performs the map operation on the data read from the memories 120.

Next, as shown by path (4) in FIG. 5, the pooled memory controller 110 reads the map operation result data computed by the accelerator 112 and passes it to each memory 120 for storage. The data computed by the accelerator 112 may be divided and stored across a plurality of memories 120.

Next, as shown by path (5) in FIG. 6, the pooled memory controller 110 sends an interrupt packet to the processor 20. That is, the pooled memory controller 110 sends an interrupt packet indicating that the map operation of the accelerator 112 has finished to the processor 20 through the fabric network 30.

Finally, as shown by path (6) in FIG. 6, the pooled memory controller 110 reads the map operation result data stored in the memories 120 and delivers it to the processor 20 through the interface 111 and the fabric network 30.
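Paths (1) to (6) above can be traced with a toy simulation. This is only a sketch: the pooled memory is modelled as a Python dictionary, packets are reduced to log entries, and `run_offloaded_map` and its parameters are hypothetical names rather than anything defined by the patent.

```python
def run_offloaded_map(memory: dict, src: int, size: int, dst: int, map_fn):
    """Simulate the six-step off-load sequence; return (result, event log)."""
    log = []
    log.append("(1) processor -> PMC: map operation request packet")
    log.append("(2) PMC -> processor: map operation response packet")

    # (3) The controller reads the input data from the memories for the accelerator.
    data = [memory[addr] for addr in range(src, src + size)]
    log.append("(3) memories -> accelerator: input data")

    # The accelerator performs the map operation, then (4) the result is stored.
    result = map_fn(data)
    for i, value in enumerate(result):
        memory[dst + i] = value
    log.append("(4) accelerator -> memories: result data")

    log.append("(5) PMC -> processor: interrupt packet (map finished)")

    # (6) The stored result is read back and delivered to the processor.
    delivered = [memory[dst + i] for i in range(len(result))]
    log.append("(6) memories -> processor: result data")
    return delivered, log


memory = {addr: addr for addr in range(8)}           # toy memory contents
delivered, log = run_offloaded_map(
    memory, src=0, size=8, dst=100,
    map_fn=lambda values: [v for v in values if v % 2 == 0],
)
print(delivered)
```

Only the small result list crosses back to the processor in step (6); the larger input never leaves the pooled memory in this model.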

FIG. 7 is a diagram showing the performance improvement of a memory system according to an embodiment of the present invention. It is a graph showing how much the performance of the entire system increases as the number of channels CN of the accelerator 112 is increased when the accelerator 112 performs the map operation.

The graph shows that performance increases as the number of channels CN of the accelerator 112 increases. However, the performance gain becomes progressively smaller relative to the cost of adding channels CN, so the embodiment is described with the number of channels CN of the accelerator 112 set to between two and four by way of example.

Assume that moving data through the node CND of the processor 20 consumes 1 pJ of energy per bit for each link L1. For the processor 20 to process data, the data must pass through the bus BUS of the memory 120, the channel 40 of the fabric network 30, and the node CND of the processor 20 in FIG. 3, that is, three links in total, so 3 pJ/bit is consumed. If the map operation is off-loaded to the accelerator 112, however, the data passes only through the bus BUS of the memory 120, so the energy consumed in moving the data is reduced to one third, or 1 pJ/bit. Calculating how much energy is saved in the system as a whole also requires taking the static power of each piece of hardware into account.
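The data-movement arithmetic above can be written out as a short calculation. It uses only the stated assumption of 1 pJ per bit per link and, as the text notes, ignores static power; `transfer_energy_pj` and the 1 GiB data size are illustrative choices.

```python
PJ_PER_BIT_PER_LINK = 1.0   # assumption stated in the text


def transfer_energy_pj(num_bits: int, links: int) -> float:
    """Energy in picojoules to move num_bits across the given number of links."""
    return num_bits * links * PJ_PER_BIT_PER_LINK


bits = 8 * 2**30                                      # 1 GiB of input data (illustrative)
processor_path = transfer_energy_pj(bits, links=3)    # bus BUS + channel 40 + node CND
accelerator_path = transfer_energy_pj(bits, links=1)  # bus BUS only
print(f"processor path:   {processor_path * 1e-12:.4f} J")
print(f"accelerator path: {accelerator_path * 1e-12:.4f} J")
```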

As described above, the pooled memory 100 according to an embodiment of the present invention can provide large capacity and high bandwidth, and is useful for in-memory databases and similar applications. By adding the accelerator 112 inside the pooled memory controller 110 and off-loading the map operation of a map-reduce application to the accelerator 112, the performance of the entire system can be improved while energy is saved.

Those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features, and that the embodiments described above are therefore illustrative in all respects and not limiting. The scope of the present invention is defined by the claims that follow rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A memory system comprising:
a plurality of memories for storing data; and
a pooled memory controller that reads the data stored in the plurality of memories, performs a map operation, and stores result data of the map operation in the plurality of memories,
wherein the pooled memory controller comprises:
an interface for relaying packets between a processor and the plurality of memories through a fabric network; and
an accelerator for performing the map operation on the data transmitted from the plurality of memories through the interface and storing the result data of the map operation in the plurality of memories through the interface,
wherein the accelerator is connected to the interface through a plurality of channels,
wherein the number of links of the plurality of channels is greater than the number of links of the processor, and
wherein the interface transmits the read data to the accelerator, receives the result data calculated by the accelerator, and transmits the received result data to the plurality of memories.

delete

The method of claim 1 , wherein the pulled memory controller
A memory system receiving a map operation request packet from the processor through the interface.

The memory system of claim 5, wherein the map operation request packet comprises information on at least one of an address where data to be used for the map operation is stored, a size of the data, and an address where result data of the map operation is to be stored.
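
As an illustration of the fields named in this claim, a request packet could be modeled as follows. The claim does not define a wire format, so the field names (`src_addr`, `size`, `dst_addr`), the 64-bit little-endian layout, and the class name `MapRequestPacket` are all assumptions made for this sketch. The claim also requires only at least one of the three fields, while the sketch carries all three.

```python
import struct
from dataclasses import dataclass

# Assumed layout: three little-endian unsigned 64-bit fields (not from the patent).
_FORMAT = "<QQQ"


@dataclass(frozen=True)
class MapRequestPacket:
    src_addr: int  # address where the input data is stored
    size: int      # size of the input data in bytes
    dst_addr: int  # address where the result data is to be stored

    def pack(self) -> bytes:
        """Serialize the packet into its assumed 24-byte form."""
        return struct.pack(_FORMAT, self.src_addr, self.size, self.dst_addr)

    @classmethod
    def unpack(cls, raw: bytes) -> "MapRequestPacket":
        """Rebuild a packet from its serialized form."""
        return cls(*struct.unpack(_FORMAT, raw))


request = MapRequestPacket(src_addr=0x1000, size=4096, dst_addr=0x8000)
wire = request.pack()
decoded = MapRequestPacket.unpack(wire)
```

Here `decoded` equals `request`, and `wire` is 24 bytes long under the assumed layout.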

The memory system of claim 1, wherein the pooled memory controller transmits a map operation response packet to the processor through the interface.

The memory system of claim 1, wherein the pooled memory controller reads data necessary for the map operation from the plurality of memories and transmits the data to the accelerator.

The memory system of claim 1, wherein the pooled memory controller reads result data of the map operation calculated in the accelerator and stores the result data in the plurality of memories.

The memory system of claim 1, wherein the pooled memory controller transmits an interrupt packet to the processor through the interface when the map operation is finished.

The memory system of claim 1, wherein the pooled memory controller reads result data of the map operation stored in the plurality of memories and transmits the result data to the processor through the interface.

The memory system of claim 1, wherein the pooled memory controller performs interleaving on the plurality of memories.
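
The claim does not specify an interleaving scheme. One common possibility, shown here purely as an assumption, is round-robin interleaving at a fixed granularity, in which consecutive blocks of the address space are spread across the memories in turn. The function name `interleave` and the 64-byte granule are hypothetical.

```python
from typing import Tuple


def interleave(addr: int, num_memories: int, granule: int = 64) -> Tuple[int, int]:
    """Map a flat address to (memory index, address within that memory).

    Assumed scheme: blocks of `granule` bytes are assigned to the memories
    in round-robin order.
    """
    block, offset = divmod(addr, granule)
    memory_index = block % num_memories
    local_addr = (block // num_memories) * granule + offset
    return memory_index, local_addr


first = interleave(0, num_memories=4)     # block 0 goes to memory 0
second = interleave(64, num_memories=4)   # block 1 goes to memory 1
wrapped = interleave(256, num_memories=4) # block 4 wraps back to memory 0
```

With four memories, sequential accesses land on different memories, so they can proceed in parallel instead of queuing on a single device.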

The memory system of claim 1, wherein the pooled memory controller performs address remapping on the plurality of memories.
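
Address remapping is likewise left unspecified by the claim. A minimal sketch, assuming a page-granular lookup table that redirects logical pages to physical pages, might look like this. The class name `RemapTable`, the 4096-byte page size, and the identity mapping for unlisted pages are assumptions of the sketch.

```python
class RemapTable:
    """Hypothetical page-granular address remapping table."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self.table = {}  # logical page number -> physical page number

    def remap(self, logical_page: int, physical_page: int) -> None:
        """Redirect one logical page to a different physical page."""
        self.table[logical_page] = physical_page

    def translate(self, addr: int) -> int:
        """Translate a logical address; unmapped pages map to themselves."""
        page, offset = divmod(addr, self.page_size)
        return self.table.get(page, page) * self.page_size + offset


remapper = RemapTable()
remapper.remap(logical_page=1, physical_page=7)
moved = remapper.translate(4096 + 5)  # logical page 1, offset 5
untouched = remapper.translate(5)     # logical page 0 is not remapped
```

Under these assumptions `moved` is `7 * 4096 + 5` and `untouched` stays `5`.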

The memory system of claim 1, wherein the pooled memory controller uses a map-reduce application during the map operation.
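
For readers unfamiliar with the map-reduce model, the following word-count sketch shows which part of such an application the map operation corresponds to. It is a generic software illustration, not the patent's implementation; the function names `map_phase` and `reduce_phase` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(records: Iterable[str]) -> List[Tuple[str, int]]:
    """Map step: emit a (key, 1) pair for every word in every record."""
    return [(word, 1) for record in records for word in record.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the values emitted for each key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


records = ["a b a", "b c"]
pairs = map_phase(records)
counts = reduce_phase(pairs)
```

The map step touches every input record once and is data-intensive, which is why it is a natural candidate for execution near the memories.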

A memory system comprising:
a fabric network associated with a processor; and
a pooled memory that relays packets with the processor through the fabric network and transmits data stored in a memory to the processor when requested by the processor,
wherein the pooled memory comprises
a pooled memory controller that reads the data stored in the memory, offloads a map operation, and stores result data of the map operation in the memory,
wherein the pooled memory controller comprises:
an interface relaying the packets between the processor and the pooled memory controller through the fabric network; and
an accelerator for offloading the map operation on the data transmitted from the memory through the interface and storing result data of the map operation in the memory through the interface,
wherein the accelerator
is connected to the interface through a plurality of channels,
the number of links of the plurality of channels being greater than the number of links of the processor, and
offloads the map operation by performing the map operation on the data received from the memory through the interface to generate the result data, and
wherein the interface transmits the read data to the accelerator, receives the result data calculated by the accelerator, and transmits the received result data to the memory.

(deleted)

The memory system of claim 15, wherein the pooled memory controller receives a map operation request packet from the processor through the interface and transmits a map operation response packet to the processor through the interface.

The memory system of claim 15, wherein the pooled memory controller reads data necessary for the map operation from the memory, transfers the data to the accelerator, and stores result data of the map operation calculated in the accelerator in the memory.

The memory system of claim 15, wherein the pooled memory controller, when the map operation is finished, transmits an interrupt packet to the processor through the interface, reads result data of the map operation stored in the memory, and transmits the result data to the processor through the interface.

The memory system of claim 15, wherein the pooled memory controller performs at least one of an interleaving operation and an address remapping operation on the memory.