JP2020017263A - Memory system

Info

Publication number: JP2020017263A
Application number: JP2019093313A
Authority: JP
Inventors: Sun Woong Kim; Eui Cheol Lim
Original assignee: SK Hynix Inc
Current assignee: SK Hynix Inc
Priority date: 2018-07-23
Filing date: 2019-05-17
Publication date: 2020-01-30
Also published as: US10915470B2; CN110750210B; US20200026669A1; CN110750210A

Abstract

To provide a memory system capable of reducing the energy consumption of the system while increasing its performance. SOLUTION: A memory system 10 includes processors 20, a fabric network 30, and pooled memories 100. Each pooled memory includes multiple memories 120 configured to store data, and a pooled memory controller 110 configured to perform a map computation by reading data stored in the multiple memories and to store the resultant data of the map computation in the multiple memories. SELECTED DRAWING: Figure 2

Description

The present invention relates to a memory system, and more particularly to an accelerator for a large-capacity memory device.

Recently, mobile communication terminals such as smartphones and tablet PCs have become widespread. The use of social network services (SNS), machine-to-machine (M2M) networks, and sensor networks is also increasing. As a result, the volume of data, the speed at which it is generated, and its variety are increasing exponentially. Processing big data requires not only fast memory but also memory devices and memory modules with large storage capacity.

For this reason, a memory system includes a plurality of integrated memories in order to increase data storage capacity while overcoming the physical limitations of a single memory. For example, the server architecture of cloud data centers is changing into a structure suited to running big-data applications efficiently.
To process big data efficiently, a pooled memory in which a plurality of memories are integrated is used. A pooled memory can provide large capacity and high bandwidth, and is useful for in-memory databases and similar applications.

Embodiments of the present invention provide a memory system that includes an accelerator inside a pool memory so as to reduce the energy consumption of the system and, at the same time, improve its performance.

A memory system according to an embodiment of the present invention includes a plurality of memories that store data, and a pool memory controller that reads the data stored in the plurality of memories, performs a map operation, and stores result data of the map operation in the plurality of memories.

A memory system according to another embodiment of the present invention includes a fabric network connected to a processor, and a pool memory that relays packets to and from the processor via the fabric network and transmits data stored in a memory to the processor when requested by the processor. The pool memory includes a pool memory controller that reads the data stored in the memory, off-loads a map operation, and stores result data of the map operation in the memory.

Embodiments of the present invention have the effect of improving system performance and saving the energy required for data operations.

FIG. 1 is a diagram illustrating the concept of a memory system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of a memory system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the detailed configuration of the pool memory controller of FIG. 2.
FIGS. 4 to 6 are diagrams illustrating the operation of the memory system according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the performance improvement of the memory system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the embodiments, when one part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Also, when a part is said to "include" or "comprise" a component, this means that, unless specifically stated to the contrary, other components are not excluded and may further be included. Further, throughout the specification, describing a component in the singular does not limit the present invention to that form, and the component may be plural.

Data center applications require more hardware resources as data sizes grow. Server architecture is evolving toward using hardware resources more efficiently.

As an example, cloud data centers run many machine learning applications, including deep learning. Because applications such as deep learning and machine learning mostly have low temporal locality, their computation is generally performed not on a central processing unit (CPU) but on a hardware accelerator such as a graphics processing unit (GPU) or a field-programmable gate array (FPGA).

Here, temporal locality means that data accessed once is accessed again within a relatively short time. In other words, the applications described above use cold data, which has not been accessed for a while, rather than hot data, which is accessed frequently.

The process by which a processor (for example, a central processing unit) off-loads a job to an accelerator is as follows. First, data is moved from the local memory of the processor to the local memory of the accelerator. Then, when the accelerator finishes its computation, the result data must be moved back to the processor.

If the cost of moving data is higher than the cost of computing on it, implementing the architecture so that as little data as possible is moved reduces the overall cost. To this end, the memory-driven computing concept has recently been proposed.

FIG. 1 is a diagram illustrating the concept of a memory system according to an embodiment of the present invention.
The embodiment of FIG. 1 shows the architecture changing from a system-on-chip (SoC) centric, that is, processor-centric, computing structure to a memory-centric computing structure. In the processor-centric computing structure, one system-on-chip is connected to one memory in a one-to-one manner.

In memory-driven computing, a plurality of systems-on-chip use a unified memory connected via a fabric network. When data is exchanged between systems-on-chip (SoCs), it is exchanged at memory bandwidth.

In addition, with a single unified memory connected by the fabric network, exchanging data no longer requires a memory copy as in existing systems. For such memory-driven computing to come into common use, the memory-semantic interconnect must support high bandwidth, low latency, coherency, and the like.

In this connection, research on interconnects for transaction-based memory systems is actively under way in the technical field to which the embodiments of the present invention pertain.

In connection with accelerators, research on where to place the accelerator according to workload characteristics, such as near-data processing (NDP) or processing-in-memory (PIM), is also being widely conducted. Here, processing-in-memory refers to a memory in which processor logic is tightly coupled to the memory cells in order to increase data processing speed and data transfer rate.

Embodiments of the present invention relate to a pooled memory architecture in which a large number of memories are integrated, and to in-memory database usage suited to it. The following describes the characteristics of a map-reduce application according to an embodiment of the present invention, and the process by which the map operation is handled by an accelerator (described later) in the pool memory.

FIG. 2 is a diagram illustrating the configuration of a memory system according to an embodiment of the present invention.
The memory system 10 according to an embodiment of the present invention is based on the memory-driven computing structure described above. The memory system 10 includes a plurality of processors (for example, central processing units, CPUs) 20, a fabric network 30, a plurality of channels 40, and a plurality of pool memories 100.

Here, the plurality of processors 20 are connected to the fabric network 30 via nodes CND, and are connected to the plurality of pool memories 100 via the fabric network 30. The pool memories 100 are connected to the fabric network 30 via the plurality of channels 40. That is, each of the plurality of pool memories 100 may be connected to the fabric network 30 via N channels 40.

Each of the plurality of pool memories 100 includes a plurality of memories 120 (or memory devices) and a pooled memory controller (PMC) 110 for controlling the plurality of memories 120. The pool memory controller 110 is connected to each memory 120 via a bus BUS.

Each memory 120 may be connected directly to the fabric network 30. Alternatively, a large number of memories 120 may be included in one integrated pool memory 100, and the pool memory 100 may be connected to the fabric network 30.

When the pool memory 100 includes a large number of memories 120, the pool memory controller 110 sits between the fabric network 30 and the memories 120 and manages each memory 120.

Here, the pool memory controller 110 performs memory interleaving to increase throughput, or supports address remapping to increase reliability, availability, serviceability, and the like.

An in-memory database is a database management system that, for fast access, stores the database (DB) in main memory rather than in storage.

Current server systems face physical limits on increasing memory capacity. As a result, an application cannot make its database larger than the memory capacity of each server. When the database grows, it must be split and stored across multiple servers, and performance can degrade in the process of combining those servers. Because the pool memory 100 provides large capacity and high bandwidth, it can be usefully employed for in-memory databases.

FIG. 3 is a diagram illustrating the detailed configuration of the pool memory controller 110 of FIG. 2.
The pool memory controller 110 includes an interface 111 and an accelerator 112. The interface 111 relays packets among the fabric network 30, the accelerator 112, and the memories 120. The interface 111 is connected to the accelerator 112 via a plurality of channels CN.

In an embodiment of the present invention, the interface 111 may include a switch for relaying packets among the fabric network 30, the accelerator 112, and the memories 120. Although the interface 111 is described here as including a switch by way of example, the means for relaying packets is not limited to this.

The accelerator 112 performs arithmetic processing on data supplied via the interface 111. For example, the accelerator 112 performs a map operation on data supplied from the memories 120 via the interface 111, and stores the data resulting from the map operation in the memories 120 via the interface 111.

The embodiment of the present invention is described using the example in which the pool memory controller 110 includes one accelerator 112. However, embodiments of the present invention are not limited to this, and the pool memory controller 110 may include a plurality of accelerators 112.

Map-reduce is a software framework created to process large volumes of data using distributed parallel computing. A wide range of applications use this map-reduce library. In a map-reduce application, the map operation extracts intermediate information in (key, value) form, and the reduce operation collects it and outputs the desired final result.

For example, suppose a map-reduce application is used to find "the highest global temperature of each year". The map operation reads text files, extracts the year and temperature information, and outputs a list of (year, temperature) pairs. The reduce operation then collects these results, sorts them by temperature, and outputs the desired final result. The point to note is that the data used by the map operation is generally very large, whereas the result data of the map operation is relatively small.
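The yearly-temperature example can be sketched in a few lines of Python. This is only an illustration: the record format "<year>,<temperature>", the function names, and the sample values are assumptions made here, and the reduce step simply keeps the maximum per year rather than sorting.

```python
def map_temperatures(lines):
    """Map step: extract one (year, temperature) pair from each text record."""
    pairs = []
    for line in lines:
        # Hypothetical record format: "<year>,<temperature>"
        year_text, temp_text = line.strip().split(",")
        pairs.append((int(year_text), float(temp_text)))
    return pairs


def reduce_max_by_year(pairs):
    """Reduce step: keep the highest temperature seen for each year."""
    highest = {}
    for year, temp in pairs:
        if year not in highest or temp > highest[year]:
            highest[year] = temp
    return highest


# Made-up sample records, for illustration only.
records = ["2016,14.8", "2016,15.1", "2017,14.9", "2017,14.7"]
pairs = map_temperatures(records)
result = reduce_max_by_year(pairs)
print(result)  # {2016: 15.1, 2017: 14.9}
```

The map step touches every input record, while the reduce step works on the much smaller list of pairs, which is the asymmetry the text points out.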

Embodiments of the present invention can off-load to the accelerator 112 in the pool memory controller 110 an operation that, like the map operation of a map-reduce application, processes a large volume of data but has little data reuse. Here, off-loading refers to the series of steps of receiving and interpreting a request from the processor 20, performing the operation, and then outputting the result. If data is processed inside the pool memory 100, the energy needed to transfer the data to the node CND of the processor 20 can be saved, and performance can also be improved.

The accelerator 112 according to an embodiment of the present invention may be provided in the pool memory controller 110 or may be located in a memory 120. From the viewpoint of near-data processing, processing data inside each memory 120 is more efficient than processing it inside the pool memory controller 110.

The pool memory controller 110 performs memory interleaving to provide high bandwidth. In this case, data is stored spread across the plurality of memories 120. Because the data required by the accelerator 112 may then also be spread across the plurality of memories 120, the embodiment of the present invention is described using the example in which the accelerator 112 is physically located in the pool memory controller 110.
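To see why interleaved data ends up spread over several memories, consider the following minimal sketch. The block-granular round-robin scheme, the block size, and the function names are assumptions chosen for illustration; the text does not specify how the interleaving is implemented.

```python
def interleave(address, num_memories, block_bytes):
    """Map a flat address to (memory index, offset inside that memory).

    Assumes simple round-robin interleaving at block granularity.
    """
    block = address // block_bytes
    memory_index = block % num_memories
    local_block = block // num_memories
    offset = local_block * block_bytes + address % block_bytes
    return memory_index, offset


def memories_touched(start, size, num_memories, block_bytes):
    """Return the sorted indices of the memories that hold [start, start + size)."""
    first_block = start // block_bytes
    last_block = (start + size - 1) // block_bytes
    return sorted({b % num_memories for b in range(first_block, last_block + 1)})


print(interleave(0, 4, 64))             # (0, 0)
print(interleave(64, 4, 64))            # (1, 0)
print(interleave(261, 4, 64))           # (0, 69)
print(memories_touched(0, 256, 4, 64))  # [0, 1, 2, 3]
```

Under this assumed scheme a 256-byte input already spans all four memories, so an accelerator inside any single memory would see only part of its input, whereas one in the controller can reach all of it.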

The following examines how much the memory system 10 as a whole gains, in terms of performance and energy, from off-loading the map operation of a map-reduce application to the accelerator 112.

If the operation processed by the accelerator 112 is simple, like the map operation of a map-reduce application, the operation time in the accelerator 112 is determined by the bandwidth at which data is read from memory. Therefore, increasing the bandwidth of the accelerator 112 reduces its operation time.

As shown in FIG. 3, the nodes CND of the processors 20 are connected to the pool memory 100 via the fabric network 30. Assume that each node CND has one link L1 per processor 20 and that the accelerator 112 inside the pool memory controller 110 has four links L2. That is, more bandwidth is allocated to the links L2 of the accelerator 112 than to the link L1 of the processor 20. Then, when the map operation is off-loaded to the accelerator 112, it can be performed four times faster than when it is processed by the processor 20.
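The four-times figure follows directly if the operation is bandwidth-bound and every link carries the same bandwidth. The sketch below makes that assumption explicit; the per-link bandwidth and the data size are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def transfer_time(data_bytes, links, bytes_per_second_per_link):
    """Time to stream data_bytes over `links` equal-bandwidth links."""
    return data_bytes / (links * bytes_per_second_per_link)


LINK_BANDWIDTH = 10e9  # hypothetical: 10 GB/s per link
DATA_SIZE = 40e9       # hypothetical: 40 GB of map input

t_processor = transfer_time(DATA_SIZE, links=1, bytes_per_second_per_link=LINK_BANDWIDTH)
t_accelerator = transfer_time(DATA_SIZE, links=4, bytes_per_second_per_link=LINK_BANDWIDTH)

print(t_processor)                  # 4.0
print(t_accelerator)                # 1.0
print(t_processor / t_accelerator)  # 4.0
```

The ratio depends only on the link counts, not on the assumed bandwidth or data size.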

Assume that, when the processor 20 performs both the map operation and the reduce operation, the map operation takes 99% of the total execution time. Assume also that a single processor 20 runs a plurality of applications, of which the map-reduce application accounts for 10% of the total application execution time. When the map operation is off-loaded to the accelerator 112, the map operation time is reduced to 1/4, and overall system performance can be improved by 81%.
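One common way to estimate the effect of accelerating part of a workload is Amdahl's law, which the text does not name. The sketch below applies it to the stated fractions under that assumption. It is not claimed to reproduce the 81% figure, whose derivation the text does not give and which may rest on additional assumptions.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Within the map-reduce application alone: 99% of its time is the map operation.
within_application = amdahl_speedup(0.99, 4)

# Across all applications: map-reduce is 10% of the total, and 99% of that is the map operation.
whole_workload = amdahl_speedup(0.10 * 0.99, 4)

print(round(within_application, 2))  # 3.88
print(round(whole_workload, 3))      # 1.08
```

The result is sensitive to which share of the run time is treated as accelerated, so any such estimate should state that share explicitly.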

FIGS. 4 to 6 are diagrams illustrating the operation of the memory system 10 according to an embodiment of the present invention.
First, as indicated by path 1 in FIG. 4, the processor 20 transmits a map operation request packet to the pool memory 100. That is, the map operation request packet sent from the processor 20 passes through the fabric network 30 and the interface 111 of the pool memory controller 110 and is delivered to the accelerator 112. The map operation request packet may include information such as the address at which the data used for the map operation is stored, the size of the data, and the address at which the map operation result data is to be stored.
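The fields named for the map operation request packet can be pictured as a small record. The class name, field names, and example addresses below are hypothetical; the text lists only the kinds of information carried, not a wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapRequestPacket:
    """Hypothetical in-memory view of a map operation request packet."""
    source_address: int  # address at which the input data is stored
    data_size: int       # size of the input data, in bytes
    result_address: int  # address at which the result data is to be stored

    def source_range(self):
        """Address range covered by the input data."""
        return range(self.source_address, self.source_address + self.data_size)


request = MapRequestPacket(source_address=0x1000, data_size=4096, result_address=0x8000)
print(hex(request.source_range().stop))  # 0x2000
```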

Next, as indicated by path 2 in FIG. 4, the pool memory controller 110 transmits a map operation response packet to the processor 20 via the fabric network 30. That is, the pool memory controller 110 sends the processor 20 a signal indicating that the accelerator 112 has successfully received the map operation request packet.

Thereafter, as indicated by path 3 in FIG. 5, the pool memory controller 110 reads the data required for the map operation from each memory 120 and transmits it to the accelerator 112. The data required by the accelerator 112 may be spread across a plurality of memories 120, in which case the accelerator 112 reads data from multiple memories 120. The accelerator 112 then performs the map operation on the data read from the memories 120.

Next, as indicated by path 4 in FIG. 5, the pool memory controller 110 reads the map operation result data computed by the accelerator 112 and transmits it to each memory 120 for storage. The data computed by the accelerator 112 may be stored spread across multiple memories 120.

Next, as indicated by path 5 in FIG. 6, the pool memory controller 110 transmits an interrupt packet to the processor 20. That is, the pool memory controller 110 sends the processor 20, via the fabric network 30, an interrupt packet indicating that the map operation of the accelerator 112 has finished.

Thereafter, as indicated by path 6 in FIG. 6, the pool memory controller 110 reads the map operation result data stored in the memories 120 and transmits it to the processor 20 via the interface 111 and the fabric network 30.
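The six paths described above can be traced with a toy software model. Everything in it is an illustrative assumption: the class and method names, the per-address interleaving, and the element-wise map function stand in for hardware behaviour that the text describes only at the level of packets and paths.

```python
class PoolMemoryController:
    """Toy model of a pool memory controller with one built-in accelerator."""

    def __init__(self, num_memories):
        # Each memory is modelled as a dict from address to stored value.
        self.memories = [dict() for _ in range(num_memories)]

    def _memory_for(self, address):
        # Hypothetical interleaving: consecutive addresses rotate over the memories.
        return self.memories[address % len(self.memories)]

    def write(self, address, value):
        self._memory_for(address)[address] = value

    def read(self, address):
        return self._memory_for(address)[address]

    def handle_map_request(self, src, size, dst, map_fn):
        """Paths 1 to 5: accept the request, run the map operation, signal completion."""
        events = ["path 1: map operation request received",
                  "path 2: map operation response sent"]
        data = [self.read(src + i) for i in range(size)]
        events.append("path 3: input data read from the memories")
        for i, value in enumerate(data):
            self.write(dst + i, map_fn(value))
        events.append("path 4: result data stored in the memories")
        events.append("path 5: interrupt packet sent")
        return events

    def read_result(self, dst, size):
        """Path 6: return the stored result data to the processor."""
        return [self.read(dst + i) for i in range(size)]


pmc = PoolMemoryController(num_memories=4)
for i, value in enumerate([3, 1, 4, 1, 5]):
    pmc.write(100 + i, value)

events = pmc.handle_map_request(src=100, size=5, dst=200, map_fn=lambda x: x * x)
print(len(events))              # 5
print(pmc.read_result(200, 5))  # [9, 1, 16, 1, 25]
```

Only the request, the response, the interrupt, and the final result cross the fabric in this model; the input data stays between the memories and the controller.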

FIG. 7 is a diagram illustrating the performance improvement of the memory system according to an embodiment of the present invention. It is a graph of how much the performance of the overall system increases as the number of channels CN of the accelerator 112 is increased when the map operation is performed by the accelerator 112.

The graph shows that performance increases as the number of channels CN of the accelerator 112 increases. However, the performance gain becomes progressively smaller relative to the cost of adding channels CN, so the embodiment of the present invention is described using the example in which the accelerator 112 has two to four channels CN.

Assume that moving data through the node CND of the processor 20 consumes 1 pJ/bit of energy per link L1. For the processor 20 to process data, the data must cross three links in total in FIG. 3: the bus BUS of the memory 120, the channel 40 of the fabric network 30, and the node CND of the processor 20, so 3 pJ/bit is consumed. If, however, the map operation is off-loaded to the accelerator 112, the data crosses only the bus BUS of the memory 120, so the energy consumed in moving the data can be reduced to 1 pJ/bit, one third of the original. Calculating how much energy is saved in the overall system requires also taking the static power of every piece of hardware into account.
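The data-movement energy comparison is straightforward arithmetic, sketched below. The 1 pJ/bit per link figure and the link counts come from the text; the 1 GB data size and the function name are assumptions, and static power is deliberately left out, as the text notes it must be considered separately.

```python
PJ_PER_BIT_PER_LINK = 1.0  # from the text: 1 pJ/bit for each link crossed


def movement_energy_joules(num_bits, links_crossed):
    """Dynamic energy spent moving num_bits across the given number of links."""
    return num_bits * links_crossed * PJ_PER_BIT_PER_LINK * 1e-12


bits = 8 * 10**9  # hypothetical input size: 1 GB

# Processor path: memory bus, fabric channel, and processor node (3 links).
energy_processor = movement_energy_joules(bits, links_crossed=3)
# Accelerator path: memory bus only (1 link).
energy_accelerator = movement_energy_joules(bits, links_crossed=1)

print(f"{energy_processor * 1e3:.1f} mJ")    # 24.0 mJ
print(f"{energy_accelerator * 1e3:.1f} mJ")  # 8.0 mJ
```

Whatever the data size, the ratio between the two paths stays at one third under these assumptions.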

As described above, the pool memory 100 according to an embodiment of the present invention can provide large capacity and high bandwidth, and is useful for in-memory databases and similar applications. By adding the accelerator 112 inside the pool memory controller 110 and off-loading the map operation of a map-reduce application to the accelerator 112, the performance of the overall system can be improved while saving energy.

Those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features, and that the embodiments described above are therefore illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: Memory system
20: Processors
30: Fabric network
40: Channels
100: Pool memories

Claims

1. A memory system comprising:
a plurality of memories that store data; and
a pool memory controller that performs a map operation by reading the data stored in the plurality of memories, and stores result data of the map operation in the plurality of memories.

2. The memory system according to claim 1, wherein the pool memory controller comprises:
an interface that relays packets between a processor and the pool memory controller via a fabric network; and
an accelerator that performs the map operation on the data transmitted through the interface, and stores the result data of the map operation in the plurality of memories via the interface.

3. The memory system according to claim 2, wherein the interface is connected to the accelerator through a plurality of channels.

4. The memory system according to claim 3, wherein the number of links of the plurality of channels is greater than the number of links of the processor.

5. The memory system according to claim 2, wherein the pool memory controller receives a map operation request packet from the processor via the interface.

6. The memory system according to claim 5, wherein the map operation request packet includes at least one of an address at which data used for the map operation is stored, a size of the data, and an address at which the result data of the map operation is to be stored.

7. The memory system according to claim 2, wherein the pool memory controller transmits a map operation response packet to the processor via the interface.

8. The memory system according to claim 2, wherein the pool memory controller reads data required for the map operation from the plurality of memories and transmits the data to the accelerator.

9. The memory system according to claim 2, wherein the pool memory controller reads the result data of the map operation calculated by the accelerator and stores the result data in the plurality of memories.

10. The memory system according to claim 2, wherein the pool memory controller transmits an interrupt packet to the processor via the interface when the map operation ends.

11. The memory system according to claim 2, wherein the pool memory controller reads the result data of the map operation stored in the plurality of memories and transmits the result data to the processor via the interface.

12. The memory system according to claim 1, wherein the pool memory controller performs interleaving on the plurality of memories.

13. The memory system according to claim 1, wherein the pool memory controller performs address remapping on the plurality of memories.

14. The memory system according to claim 1, wherein the pool memory controller uses a map-reduce application in the map operation.

15. A memory system comprising:
a fabric network connected to a processor; and
a pool memory that relays packets to and from the processor through the fabric network and transmits data stored in a memory to the processor when requested by the processor,
wherein the pool memory comprises a pool memory controller that reads the data stored in the memory, off-loads a map operation, and stores result data of the map operation in the memory.

16. The memory system according to claim 15, wherein the pool memory controller comprises:
an interface that relays packets between the processor and the pool memory controller via the fabric network; and
an accelerator that off-loads the map operation on the data transmitted through the interface, and stores the result data of the map operation in the memory via the interface.

17. The memory system according to claim 16, wherein the pool memory controller receives a map operation request packet from the processor via the interface, and transmits a map operation response packet to the processor via the interface.

18. The memory system according to claim 16, wherein the pool memory controller reads data required for the map operation from the memory and transmits the data to the accelerator, and stores the result data of the map operation calculated by the accelerator in the memory.

19. The memory system according to claim 16, wherein, when the map operation ends, the pool memory controller transmits an interrupt packet to the processor via the interface, reads the result data of the map operation stored in the memory, and transmits the result data to the processor via the interface.

20. The memory system according to claim 15, wherein the pool memory controller performs at least one of an interleaving operation and an address remapping operation on the memory.