KR101480954B1

KR101480954B1 - NUMA System Scheduling Apparatus and Scheduling Method Therefor

Info

Publication number: KR101480954B1
Application number: KR20130009094A
Authority: KR
Inventors: 박규호; 김철민
Original assignee: 한국과학기술원
Priority date: 2013-01-28
Filing date: 2013-01-28
Publication date: 2015-01-14
Also published as: KR20140096511A

Abstract

The scheduling apparatus of the present invention includes: a NUMA system that has a plurality of nodes, each including cores that execute processes, a shared cache, and a memory, and that acquires per-core memory access information; a NUMA system evaluation unit that evaluates the memory access performance of a current process-core mapping and a future process-core mapping of the NUMA system according to the per-core memory access information acquired in the NUMA system; and a process scheduler that applies either the current process-core mapping or the future process-core mapping as the process-core mapping of the NUMA system according to the evaluation result of the NUMA system evaluation unit.

Description

NUMA System Scheduling Apparatus and Method

The present invention relates to a scheduling apparatus and method for a NUMA system. More specifically, it relates to a scheduling apparatus and method capable of improving the optimization and/or fairness of memory access in a NUMA system.

A non-uniform memory access (NUMA) system is asymmetric in that the time taken to access physical memory from a particular central processing unit (CPU) varies depending on which core performs the access. A CPU may be referred to as a core. This behavior stems from the structural characteristics of a NUMA system: accessing memory that belongs to the local node takes a different amount of time than accessing memory that belongs to a remote node.

There is therefore a growing need for a technique that can optimize the overall memory access performance of such a NUMA system while keeping the memory access time fair across cores.

Korean Registered Patent Publication No. 10-1145144 (May 4, 2012)

The present invention has been devised to meet the above need, and aims to provide a scheduling apparatus and method capable of optimizing memory access performance for an entire NUMA system.

The present invention also aims to provide a scheduling apparatus and method that make memory access in a NUMA system fair across cores.

The present invention further aims to provide a scheduling apparatus and method capable of meeting the memory access requirements of each process in a NUMA system.

The technical objects of the present invention are not limited to those mentioned above, and other technical objects not mentioned will be clearly understood by those of ordinary skill in the art from the description of the present invention.

A scheduling apparatus according to an embodiment of the present invention includes: a NUMA system that has a plurality of nodes, each including cores that execute processes, a shared cache, and a memory, and that acquires per-core memory access information; a NUMA system evaluation unit that evaluates the memory access performance of a current process-core mapping and a future process-core mapping of the NUMA system according to the per-core memory access information acquired in the NUMA system; and a process scheduler that applies either the current process-core mapping or the future process-core mapping as the process-core mapping of the NUMA system according to the evaluation result of the NUMA system evaluation unit.

A NUMA system scheduling method according to an embodiment of the present invention includes: acquiring per-core memory access information in a NUMA system having a plurality of nodes, each including cores that execute processes, a shared cache, and a memory; evaluating, in a NUMA system evaluation unit, the memory access performance of a current process-core mapping and a future process-core mapping of the NUMA system according to the per-core memory access information acquired in the NUMA system; and applying, in a process scheduler, either the current process-core mapping or the future process-core mapping as the process-core mapping of the NUMA system according to the evaluation result of the NUMA system evaluation unit.
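The final step of the method, choosing between the current and the future mapping, can be illustrated with a minimal Python sketch. The function name `select_mapping`, the representation of a mapping as a dictionary from process id to core id, and the convention that the evaluator returns a cost where lower is better are all assumptions made for illustration; the description does not prescribe them.

```python
from typing import Callable, Dict

# Hypothetical representation: process id -> core id.
Mapping = Dict[int, int]


def select_mapping(
    current: Mapping,
    future: Mapping,
    evaluate: Callable[[Mapping], float],
) -> Mapping:
    """Return the mapping to apply to the NUMA system.

    `evaluate` stands in for the NUMA system evaluation unit and is assumed
    to return a predicted memory access cost (lower is better). On a tie the
    current mapping is kept, so no process is moved without a predicted gain.
    """
    return future if evaluate(future) < evaluate(current) else current
```

Keeping the current mapping on a tie is a design choice of this sketch, not something stated in the description.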

According to the present invention, it is possible to provide a scheduling apparatus and method capable of optimizing memory access performance for the entire NUMA system.

In addition, according to the present invention, it is possible to provide a scheduling apparatus and method that make memory access in the NUMA system fair across cores.

In addition, according to the present invention, it is possible to provide a scheduling apparatus and method capable of meeting the memory access requirements of each process in the NUMA system.

FIG. 1 illustrates a NUMA system according to an embodiment of the present invention.
FIG. 2 illustrates a scheduling apparatus according to an embodiment of the present invention that includes the NUMA system shown in FIG. 1.
FIG. 3 is a diagram illustrating the process of predicting memory access latency in a scheduling apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the process of identifying the cause of an increase in memory access latency in a scheduling apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the resolution process applied according to the cause of an increase in memory access latency in a scheduling apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the embodiments of the present invention may be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The shapes and sizes of elements in the drawings may be exaggerated for clarity, and the same elements are denoted by the same reference numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted when they could unnecessarily obscure the gist of the present invention.

Hereinafter, a NUMA system according to an embodiment of the present invention, and a scheduling apparatus and method for it, are described with reference to the accompanying drawings.

FIG. 1 illustrates a NUMA system 100 according to an embodiment of the present invention. The NUMA system 100 is a non-uniform memory access (NUMA) system, that is, a system in which the memory access time or memory access latency is not uniform across cores. The NUMA system according to an embodiment of the present invention may include a plurality of nodes; FIG. 1 illustrates a case with two nodes 11 and 12. In general, one processor corresponds to one node, so a plurality of nodes may mean that a plurality of processors is included. Depending on the embodiment, one processor may also have a plurality of nodes.

The number of cores per node 11, 12 is variable; in general, a node may include four to eight cores. FIG. 1 illustrates a case in which the first node 11 includes the first to fourth cores (Core 1 to Core 4) and the second node 12 includes the fifth to eighth cores (Core 5 to Core 8).

In the NUMA system 100 of the present invention, the memory area is denoted by reference numeral 20 and delimited by a dotted line, and may be referred to as the memory subsystem 20. A memory access request from a core means a request to access the memory subsystem 20.

The NUMA system 100 according to an embodiment of the present invention may include shared caches 31 and 32 within the memory subsystem 20. The first shared cache 31 may be shared by the four cores of the first node 11, and the second shared cache 32 by the four cores of the second node 12. In an embodiment of the present invention, the shared caches 31 and 32 may be level 3 caches. In general, each core has its own level 1 and level 2 caches, and each node may include a level 3 cache that is shared among the cores of that node. Such a level 3 cache may also be referred to as a last-level cache (LLC).

A memory access request issued by a core first searches the shared cache 31 or 32 of the same node; if a cache hit occurs, the desired data is obtained from that shared cache. Hereinafter, this is referred to as a primary memory access request.

Each of the nodes 11 and 12 includes a corresponding memory 61 or 62. That is, the first node 11 may include the first memory 61 and the second node 12 may include the second memory 62. For the first node 11, the first memory 61 is local memory and the second memory 62 is remote memory. Likewise, for the second node 12, the second memory 62 is local memory and the first memory 61 is remote memory. In an embodiment of the present invention, the first memory 61 and the second memory 62 may each be, for example, a dual in-line memory module (DIMM), a memory module containing a series of dynamic random-access memory (DRAM) chips.

When a cache miss occurs in the shared cache 31 or 32, an actual memory access request is delivered from that shared cache to the memory controller 41 or 42 shown in FIG. 1. Here, the actual memory access request is a request to access the first memory 61 or the second memory 62 through the memory controller 41 or 42, and is hereinafter referred to as a secondary memory access request.

That is, the memory controllers 41 and 42 can access the local memory included in the same node 11 or 12 in response to a secondary memory access request from the shared cache 31 or 32 of that node. For example, each memory controller 41, 42 can process memory accesses on the order of several gigabytes (GB) per second.

When a cache miss occurs in the shared cache 31 or 32, remote memory that is not included in the same node 11 or 12 can also be accessed. In this case, the secondary memory access request from the shared cache 31 or 32 may be delivered to the interconnection 51 or 52. For example, a secondary memory access request from the first shared cache 31 is delivered to the first interconnection 51 and then travels over the first bus 71 to the second interconnection 52 of the second node 12, after which the second memory 62 is accessed through the second memory controller 42.

Likewise, a secondary memory access request from the second shared cache 32 is delivered to the second interconnection 52 and then passes, in order, through the second bus 72, the first interconnection 51, and the first memory controller 41 to access the first memory 61.

Buses 71 and 72 are located between the interconnections 51 and 52 of the different nodes 11 and 12, and each bus 71, 72 may be unidirectional. Accordingly, two buses 71 and 72 may be provided between the first node 11 and the second node 12 to enable bidirectional communication. This is only one embodiment, however; bidirectional communication between the first interconnection 51 and the second interconnection 52 may also be achieved over a single bidirectional bus.

When remote memory is accessed through the interconnections 51 and 52, the memory access latency may be 1.5 times as large as when local memory is accessed through the memory controller 41 or 42. That is, accessing remote memory may take longer than accessing local memory. The interconnections 51 and 52 also have a fixed processing capacity per unit time, which is generally lower than the processing capacity of the memory controllers 41 and 42.
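As a rough illustration of the local versus remote gap, the following Python sketch computes the average latency of a process whose memory accesses are split between local and remote memory. It uses the 1.5 times factor quoted above and deliberately ignores queuing at the memory controllers and interconnections; the function name and parameters are hypothetical.

```python
REMOTE_FACTOR = 1.5  # remote/local latency ratio quoted in the description


def mean_access_latency(local_latency_ns: float, remote_fraction: float) -> float:
    """Average latency when `remote_fraction` of accesses go to remote memory.

    Simplifying assumption: no contention, so every remote access costs
    exactly REMOTE_FACTOR times a local access.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    local_fraction = 1.0 - remote_fraction
    return local_latency_ns * (local_fraction + REMOTE_FACTOR * remote_fraction)
```

For example, with a local latency of 100 ns and half of the accesses remote, the sketch gives an average of 125 ns.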

A process can generally be defined as an application program running in an operating system, or as a virtual CPU (VCPU) of virtualization system software. A process is the unit of execution on a core. Each process may have its memory in any memory area. When a core executes a particular process, the core generates memory access requests to the memory area that holds that process's memory. For example, the first process (Proc. 1) runs on the first core (Core 1), and the memory of the first process may reside in the first memory 61 or the second memory 62. Accordingly, when the first core executes the first process, it may generate memory access requests to the first memory 61 or the second memory 62.

A process can run on any core and can have its memory in the memory area of any node. Various mappings between processes and the cores that execute them are possible, and a process's memory can be migrated. For example, even if the memory of the first process (Proc. 1) resides in the first memory 61 of the first node 11, it can be migrated to the second memory 62 of the second node 12.

FIG. 1 illustrates a NUMA system with eight processes and eight cores. The first process (Proc. 1) is mapped to run on the first core (Core 1), the second process (Proc. 2) on the second core (Core 2), and so on up to the eighth process (Proc. 8) on the eighth core (Core 8). This is only an example; any process can be mapped to run on any core. At any given time, one process occupies one core while it runs.

In the NUMA system 100 according to an embodiment of the present invention as shown in FIG. 1, the factors that affect memory access latency typically include the following four.

First, whether the memory of a particular process is located in local memory or in remote memory. For example, the first process (Proc. 1) runs on the first core (Core 1), but its process memory may reside in either the first node 11 or the second node 12. If the first process's memory is in the first memory 61 of the first node 11, it is in local memory; if it is in the second memory 62 of the second node 12, it is in remote memory. The memory access latency can differ depending on whether the memory of the process running on a core is located in local or remote memory.

Second, whether the memory controllers 41 and 42 are overloaded. For example, if all of the first to eighth processes (Proc. 1 to Proc. 8) have their process memory in the first node 11, memory accesses occur only through the first memory controller 41 of the first node 11. In this case the first memory controller 41 becomes overloaded, which causes memory access requests to queue up.

Third, whether the interconnections 51 and 52 are overloaded. For example, if the process memory of the first to fourth processes (Proc. 1 to Proc. 4), which run on cores of the first node 11, is located in the second node 12, their memory access requests are delivered to the second memory controller 42 through the first interconnection 51 and the second interconnection 52. The interconnections 51 and 52 then become overloaded, which causes memory access requests to queue up.

Fourth, whether the shared caches 31 and 32 are overloaded. For example, if the first and second processes (Proc. 1 and Proc. 2), which run on cores of the first node 11, generate many memory access requests while the remaining processes generate none, the cache hit ratio of the first shared cache 31 of the first node 11 will be relatively lower than that of the second shared cache 32 of the second node 12, which means that more secondary memory access requests are generated. That is, an overloaded shared cache 31 or 32 leads to a high cache miss ratio, which increases the number of secondary memory access requests and can therefore increase latency.
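The first three factors can be made concrete with a small Python sketch that, for a given process-core mapping, classifies each process's memory as local or remote and tallies how many secondary (cache miss) requests land on each memory controller and each interconnection. The topology constant follows FIG. 1; the function name, the dictionary-based inputs, and the use of a per-process miss rate as the load measure are assumptions made for illustration only.

```python
from collections import Counter

# Topology of FIG. 1: cores 1-4 belong to node 1, cores 5-8 to node 2.
CORE_TO_NODE = {core: 1 if core <= 4 else 2 for core in range(1, 9)}


def summarize_access_paths(proc_to_core, proc_mem_node, proc_miss_rate):
    """Return (is_remote, controller_load, interconnect_load).

    proc_to_core:   process id -> core id (the process-core mapping)
    proc_mem_node:  process id -> node that holds the process memory
    proc_miss_rate: process id -> secondary memory requests per second
    """
    is_remote = {}
    controller_load = Counter()
    interconnect_load = Counter()
    for proc, core in proc_to_core.items():
        home = CORE_TO_NODE[core]
        target = proc_mem_node[proc]
        misses = proc_miss_rate[proc]
        is_remote[proc] = target != home
        # Every miss is served by the memory controller of the target node.
        controller_load[target] += misses
        if target != home:
            # A remote miss crosses the interconnections of both nodes.
            interconnect_load[home] += misses
            interconnect_load[target] += misses
    return is_remote, dict(controller_load), dict(interconnect_load)
```

With all eight processes keeping their memory on node 1, as in the second factor's example, the whole miss load falls on the memory controller of node 1, and the processes on node 2 additionally load both interconnections.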

As discussed above, memory access performance clearly depends on the access path to the process memory needed to execute the process assigned to each core. In other words, memory access performance can vary with the CPU scheduling method that decides which core a process runs on. CPU scheduling therefore needs to take into account the factors described above that affect memory access latency.

Embodiments of the present invention propose a scheduling apparatus and method capable of optimizing memory access performance in the NUMA system 100. They also propose a scheduling apparatus and method that keep memory access latency fair across cores, so that a normalized memory access latency can be provided for each core. In addition, embodiments of the present invention aim to provide a scheduling apparatus and method capable of meeting per-process requirements on memory access latency.

FIG. 2 illustrates a scheduling apparatus according to an embodiment of the present invention for the NUMA system shown in FIG. 1. The scheduling apparatus 400 according to an embodiment of the present invention may include the NUMA system 100, a process scheduler 200, and a NUMA system evaluation unit 300.

In the NUMA system 100, memory accesses from multiple cores occur in parallel, so differences in memory access latency arise when contention occurs on shared resources such as the shared caches 31 and 32, the memory controllers 41 and 42, and the interconnections 51 and 52. These latency differences affect both the memory access performance of the NUMA system 100 as a whole and the memory access performance of each core.

The scheduling apparatus 400 according to an embodiment of the present invention can produce a process-core mapping that optimizes memory access performance for the NUMA system 100 as a whole while keeping memory access performance fair across cores. In addition, the scheduling apparatus 400 can, for example, provide the memory access latency that a user requires for a particular process.

In the scheduling apparatus 400 according to an embodiment of the present invention, the NUMA system 100 may acquire memory access information for each core and provide it to the process scheduler 200. The per-core memory access information acquired in the NUMA system 100 may include the current mapping between cores and processes. It may also include, for each process, the location of the process memory. In addition, it may include, for each core, the number of accesses to the shared caches 31 and 32 together with the number of cache misses, as obtained through a hardware monitoring unit (not shown). From this information, the memory access pattern of the process assigned to each core can be determined.
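One way to picture the per-core memory access information listed above is as a small record per core. The Python sketch below is only an illustrative data structure; the class name and field names are hypothetical and do not come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreMemoryAccessInfo:
    """Per-core memory access information (illustrative field names)."""

    core_id: int              # core the sample belongs to
    process_id: int           # process currently mapped to this core
    process_memory_node: int  # node that holds the process memory
    llc_accesses: int         # accesses to the shared (level 3) cache
    llc_misses: int           # shared cache misses among those accesses

    @property
    def miss_ratio(self) -> float:
        """Fraction of shared cache accesses that miss (0.0 if idle)."""
        if self.llc_accesses == 0:
            return 0.0
        return self.llc_misses / self.llc_accesses
```

A record such as `CoreMemoryAccessInfo(1, 1, 2, 1000, 250)` would describe Core 1 running Proc. 1 with its memory on node 2 and a quarter of its shared cache accesses missing.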

The hardware monitoring unit according to an embodiment of the present invention may be the hardware performance monitoring unit (PMU) that is generally built into CPU chips. The hardware PMU uses hardware inside the CPU to count, during instruction execution, the occurrences of specific events configured by the user, stores the counts in a register area of the CPU, and returns them when the user requests them.

Using such a PMU, it is possible to collect the number of clock cycles, the number of executed instructions, the numbers of accesses and misses for the level 1 cache, the numbers of accesses and misses for the level 2 cache, and the numbers of accesses and misses for the level 3 cache. The NUMA system 100 according to an embodiment of the present invention may be configured to use, for scheduling, the numbers of accesses and misses for the level 3 cache, that is, the shared caches 31 and 32, as obtained by the PMU.

The NUMA system evaluation unit 300 included in the scheduling apparatus 400 according to an embodiment of the present invention evaluates the memory access performance of the NUMA system 100 under a particular schedule. This evaluation may be performed at the request of the process scheduler 200. The NUMA system evaluation unit 300 can evaluate the memory access performance of each process using the per-core memory access information passed from the process scheduler 200. The per-process memory access performance can be evaluated through modeling based on the per-core memory access information.

The functions of the NUMA system evaluation unit 300 according to an embodiment of the present invention can be broadly divided into the following two.

First, it evaluates the memory access performance of the current process-core mapping. This can be done by evaluating the per-process memory access latency under the current process-core mapping. Based on this, the memory access performance of the NUMA system 100 as a whole and the fairness of memory access across cores can be evaluated (1-1). The unit can also report the factors that are increasing memory access latency (1-2). This evaluation may be performed at the request of the process scheduler 200, and the result may be returned to the process scheduler 200.

FIG. 3 is a diagram illustrating the process of predicting memory access latency in the scheduling apparatus (400) according to an embodiment of the present invention. More specifically, FIG. 3 illustrates the process by which the NUMA system evaluation unit (300) predicts memory access latency. In FIG. 3, a primary memory access request issued from a core to the memory subsystem (20) is denoted Refs, and this primary memory access request is delivered to the public cache (30: 31, 32). When a cache miss occurs in the public cache (30), a secondary memory access request is delivered to the memory (60: 61, 62); this is denoted Miss in FIG. 3.

The memory access latency (mL) for an actual memory access request, that is, a secondary memory access request, may be obtained, for example, according to the following equation.

1 second = MPS * mL + (RPS - MPS) * cL + I * CPI    Equation (1)

Here, 1 second denotes a time interval of one second. MPS (Misses per Second) is the number of cache misses per second in the public cache (30: 31, 32), and RPS (Refs per Second) is the number of primary memory access requests per second; both may be obtained from the PMU in the NUMA system (100). cL (Cache Access Latency) is the latency of an access to the public cache (30: 31, 32) and may be taken from the CPU specification. I (# of Instructions) is the number of instructions executed on the core, which may also be obtained from the PMU. CPI (Cycles per No Memory Subsystem) represents the number of instructions that do not generate a memory access request and may take a fixed value determined through repeated experiments.
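Equation (1) can be rearranged to isolate mL. The following is a minimal sketch, assuming that cL and the I * CPI term are both expressed in seconds so that every term of Equation (1) is a time; the function name and the sample counter values are illustrative and do not come from the specification.

```python
def memory_access_latency(mps, rps, cl, instructions, cpi):
    """Solve Equation (1) for mL.

    1 = mps * mL + (rps - mps) * cl + instructions * cpi
    All time quantities are assumed to be in seconds (an assumption of this sketch).
    """
    if mps <= 0:
        raise ValueError("mps must be positive to solve for mL")
    cache_hit_time = (rps - mps) * cl        # time spent on requests that hit the cache
    non_memory_time = instructions * cpi     # time spent outside the memory subsystem
    return (1.0 - cache_hit_time - non_memory_time) / mps


# Hypothetical counter values for one core over a one-second window.
ml = memory_access_latency(mps=2e6, rps=5e7, cl=1e-8, instructions=1e9, cpi=3e-10)
```

With these sample values the cache-hit term accounts for 0.48 s and the non-memory term for 0.30 s, so the remaining 0.22 s is spread over the 2e6 misses.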

FIG. 4 is a diagram illustrating the process of identifying the cause of an increase in memory access latency in the scheduling apparatus according to an embodiment of the present invention. More specifically, FIG. 4 illustrates the process by which the NUMA system evaluation unit (300) predicts the cause of an increase in memory access latency.

As shown in FIG. 4, the NUMA system evaluation unit (300) according to an embodiment of the present invention can use the per-process counts of accesses to the public cache (30: 31, 32) and of cache misses to locate where a bottleneck on a shared resource occurs in the NUMA system (100). This may be done by modeling the shared resources. That is, the memory access path of each process can be determined from the execution location of the process (its executing core) and the location of its process memory. From this, the number of memory accesses per second applied to each shared resource can be determined. This information gives the load applied to each shared resource, and comparing that load with the maximum load the resource can sustain reveals whether the resource is a bottleneck.

FIG. 4 shows the predicted load on each shared resource caused by the processes (Proc. 1 to Proc. 4) running on the first node (11). That is, the loads applied to the first memory controller (41), the interconnection (51, 52), and the second memory controller (42) may be predicted as follows.

CM_SUM_MC1 = A    Equation (2)

CM_SUM_ICitok = B    Equation (3)

CM_SUM_MC2 = B    Equation (4)

Here, A is the CMPS (cache misses per second) for the first memory (61) included in the first node (11), that is, the number of cache misses per second in the first public cache (31) whose memory access requests must be delivered to the local memory. The first memory (61) is the local memory. B is the CMPS for the second memory (62) included in the second node (12), that is, the number of cache misses per second in the first public cache (31) that must be delivered to the remote memory. The second memory (62) is the remote memory from the viewpoint of the first node (11).

CM_SUM_MC1 represents the sum of the cache misses directed to the i-th memory controller (the first memory controller (41) in FIG. 4). CM_SUM_ICitok represents the sum of the cache misses that require a memory access to the remote memory. CM_SUM_MC2 represents the sum of the cache misses that require a memory access through the second memory controller (42). Because FIG. 4 illustrates a case with two nodes, CM_SUM_ICitok and CM_SUM_MC2 have the same value in this example.
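The load accumulation of Equations (2) to (4) and the comparison against each resource's maximum load can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the resource labels ("MC", "IC"), and the maximum-load figures are assumptions made for the example, not values given in the specification.

```python
from collections import defaultdict


def shared_resource_loads(processes):
    """Sum cache misses per second (CMPS) over each shared resource on the access path.

    Each process is a dict with hypothetical keys:
      'core_node': node whose core runs the process
      'mem_node':  node holding the process memory
      'cmps':      cache misses per second of the process
    """
    loads = defaultdict(float)
    for p in processes:
        src, dst, cmps = p["core_node"], p["mem_node"], p["cmps"]
        # Every miss reaches the memory controller of the node that holds the memory.
        loads[("MC", dst)] += cmps
        if src != dst:
            # Remote misses also cross the interconnection from src to dst.
            loads[("IC", src, dst)] += cmps
    return dict(loads)


def bottlenecks(loads, max_loads):
    """Return the resources whose estimated load exceeds the assumed maximum."""
    return [r for r, load in loads.items() if load > max_loads.get(r, float("inf"))]


# Four processes on node 1: two use local memory (sum A), two use node 2 (sum B).
procs = [
    {"core_node": 1, "mem_node": 1, "cmps": 3e6},
    {"core_node": 1, "mem_node": 1, "cmps": 1e6},
    {"core_node": 1, "mem_node": 2, "cmps": 2e6},
    {"core_node": 1, "mem_node": 2, "cmps": 4e6},
]
loads = shared_resource_loads(procs)
hot = bottlenecks(loads, {("MC", 1): 5e6, ("IC", 1, 2): 5e6, ("MC", 2): 8e6})
```

In this two-node example the interconnection load and the second memory controller load are equal, matching Equations (3) and (4), and only the interconnection exceeds its assumed limit.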

Second, the NUMA system evaluation unit (300) according to an embodiment of the present invention can also evaluate the memory access performance of a future process-core mapping. This may be done by evaluating the per-process memory access latency under the future process-core mapping. That is, at the request of the process scheduler (200), the NUMA system evaluation unit (300) can predict in advance the memory access performance that a future process-core mapping would yield. Based on this, the memory access performance of the entire NUMA system (100) and the per-core memory access fairness can be evaluated for the future process-core mapping (2-1). In addition, the factors that increase the memory access latency can be provided (2-2). Such an evaluation may be performed at the request of the process scheduler (200), and the result may be returned to the process scheduler (200).

For a future process-core mapping, the evaluation of memory access performance and per-core memory access fairness (2-1) and the factors increasing the memory access latency (2-2) can be predicted in the same manner as for the current process-core mapping, after the following preprocessing is performed.

The memory access information of a future process-core mapping must be inferred from the current memory access information, because the per-core memory access information obtained from the PMU reflects the current process-core mapping. It can be assumed that the same process always has a similar workload. Accordingly, the same RPKI (References Per 1000 Instructions) can be assumed regardless of the mapping. That is, the number of memory access requests to the memory subsystem (20) generated per 1000 instructions can be assumed to be the same.

As the first step of the preprocessing, the RPKI of each process is obtained as described above (first step). The future process-core mapping is then applied to calculate the number of accesses to the public cache (31, 32) for each process (second step). MPKI (Misses Per 1000 Instructions) is calculated using, for example, a table of cache miss counts versus public cache (31, 32) access counts compiled during actual instruction execution (third step). That is, in the third step the number of cache misses in the public cache (31, 32) per 1000 instructions is calculated. From this, MPS (Misses per Second) is calculated for each process (fourth step).
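The four preprocessing steps can be sketched as below. The specification does not give the layout of the miss table or the source of the instruction rate, so this sketch assumes (hypothetically) that each process has a known instructions-per-second figure and that the table maps the total access rate seen by a public cache to a miss ratio; all names and numbers are illustrative.

```python
import bisect


def estimate_mps(rpki, ips, future_cache_of, miss_table):
    """Estimate per-process MPS for a candidate mapping.

    rpki:            {process: references per 1000 instructions}   (step 1)
    ips:             {process: instructions per second}, assumed known
    future_cache_of: {process: public-cache id under the candidate mapping}
    miss_table:      sorted list of (accesses_per_second, miss_ratio) pairs,
                     a hypothetical stand-in for the measured table
    """
    # Step 2: accesses per second that each process sends to its cache.
    rps = {p: rpki[p] * ips[p] / 1000.0 for p in rpki}
    cache_total = {}
    for p, cache in future_cache_of.items():
        cache_total[cache] = cache_total.get(cache, 0.0) + rps[p]

    keys = [a for a, _ in miss_table]
    result = {}
    for p, cache in future_cache_of.items():
        # Step 3: take the first table row at or above the cache's total
        # access rate (clamped to the last row), then derive MPKI.
        idx = min(bisect.bisect_left(keys, cache_total[cache]), len(miss_table) - 1)
        mpki = rpki[p] * miss_table[idx][1]
        # Step 4: convert MPKI to misses per second.
        result[p] = mpki * ips[p] / 1000.0
    return result


table = [(1e7, 0.05), (5e7, 0.10), (1e8, 0.20)]
est = estimate_mps(
    rpki={"p1": 20, "p2": 10},
    ips={"p1": 1e9, "p2": 2e9},
    future_cache_of={"p1": "c1", "p2": "c1"},
    miss_table=table,
)
```

Here both processes would share cache c1 with a combined 4e7 accesses per second, which selects the 0.10 miss-ratio row for both.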

After the four preprocessing steps described above, the memory access performance of the entire NUMA system (100) and the per-core memory access fairness (2-1), as well as the factors increasing the memory access latency (2-2), can be predicted for the future process-core mapping. Functions (2-1) and (2-2) can be achieved in the same manner as functions (1-1) and (1-2) for the current process-core mapping. That is, just as the loads are summed per shared resource and compared with each resource's maximum capacity to determine contention for the actual process-core mapping of the NUMA system (100), the MPS is increased in proportion to the MPKI while the load on each shared resource is monitored. The MPS at the point where the load on any one resource approaches its maximum capacity is defined as the inferred MPS, and with it the same functions as (1-1) and (1-2) can be performed.

The process scheduler (200) according to an embodiment of the present invention determines the process-core mapping of the NUMA system (100) based on the results from the NUMA system evaluation unit (300) and applies it to the NUMA system (100). The operation of the process scheduler (200) may follow either of the two types below.

(A) Based on the performance evaluation results for the current process-core mapping and all possible future process-core mapping cases, the process scheduler (200) can select a mapping case that satisfies both optimization of the memory access performance of the entire NUMA system (100) and fairness of the per-core memory access performance. A performance score (Mcredit) may be assigned to each current and future process-core mapping case, and the NUMA system (100) may be mapped according to the process-core mapping with the highest score.

For example, if the current process-core mapping has the highest score in terms of fairness and optimization, the process scheduler (200) has the NUMA system (100) keep the current process-core mapping. If some future process-core mapping has the highest score, the process scheduler (200) may apply that future process-core mapping to the NUMA system (100).

The memory access performance of the entire NUMA system (100) can be evaluated from the average memory access latency over the entire system (100), expressed as follows.

Average value of memory access latency = [expression not reproduced in this text]    Equation (5)

Here, Li denotes the memory access latency of process i, and Mi denotes the number of accesses of process i to the memory subsystem (20). The smaller the average memory access latency given by Equation (5), the higher the score (Mcredit) obtained.

As for the fairness of the per-process memory access latency, the smaller the standard deviation of the per-process memory access latency Li, the higher the score (Mcredit) obtained. Fairness is ideal when the standard deviation of the per-process memory access latency is 0.
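The two quantities that drive the score can be sketched as follows. The expression of Equation (5) is not reproduced in this text, so the sketch assumes an access-weighted mean, sum(Li * Mi) / sum(Mi), which is one form consistent with the definitions of Li and Mi; how the two quantities are combined into Mcredit is not specified and is therefore left out. Sample values are hypothetical.

```python
from statistics import pstdev


def average_latency(latencies, accesses):
    """Access-weighted mean of per-process latencies (assumed form of Equation (5))."""
    total = sum(accesses)
    if total == 0:
        raise ValueError("no memory accesses recorded")
    return sum(l * m for l, m in zip(latencies, accesses)) / total


def fairness_spread(latencies):
    """Population standard deviation of per-process latencies; 0 is ideal fairness."""
    return pstdev(latencies)


L = [100e-9, 120e-9, 140e-9]   # hypothetical L_i in seconds
M = [1e6, 2e6, 1e6]            # hypothetical M_i
avg = average_latency(L, M)
spread = fairness_spread(L)
```

A mapping with a smaller `avg` would score better on optimization, and one with a smaller `spread` would score better on fairness.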

With type (A), it is possible to select the optimal mapping case that satisfies both the optimization and the fairness requirements. However, the amount of computation needed to determine the mapping may grow excessively.
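A type (A) search can be sketched as an exhaustive enumeration. In this sketch the `score` callable is a hypothetical stand-in for the evaluation unit's Mcredit, and the toy scoring rule (count of processes running on the node that holds their memory) is an assumption chosen only to make the example runnable. The number of candidate mappings grows factorially with the number of cores, which illustrates the computational cost noted above.

```python
from itertools import permutations


def best_mapping(processes, cores, score):
    """Score every assignment of processes to distinct cores and return the best.

    `score` is a caller-supplied function standing in for the evaluation unit;
    a higher return value means a better mapping.
    """
    best, best_score = None, float("-inf")
    for chosen in permutations(cores, len(processes)):
        mapping = dict(zip(processes, chosen))
        s = score(mapping)
        if s > best_score:
            best, best_score = mapping, s
    return best, best_score


# Toy topology: cores 0-1 on node 1, cores 2-3 on node 2.
node_of_core = {0: 1, 1: 1, 2: 2, 3: 2}
mem_node = {"p1": 2, "p2": 1}


def locality_score(mapping):
    # Count processes placed on the node that holds their memory.
    return sum(node_of_core[c] == mem_node[p] for p, c in mapping.items())


best, best_score = best_mapping(["p1", "p2"], [0, 1, 2, 3], locality_score)
```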

The second type is described below.

(B) The process scheduler (200) can determine the process-core mapping by progressively removing the factors, reported by the NUMA system evaluation unit (300), that increase the memory access latency of the current process-core mapping.

First, among the factors predicted by the NUMA system evaluation unit (300) to increase the per-process memory access latency of the current process-core mapping, the one with the greatest impact is identified. This identification may be performed in the NUMA system evaluation unit (300) and delivered to the process scheduler (200), or it may be performed in the process scheduler (200) from the latency-increasing factors delivered by the NUMA system evaluation unit (300).

Next, the process scheduler (200) changes the core-process mapping and/or the location of the process memory so as to remove the largest cause of the increase in memory access latency, thereby determining the future process-core mapping to be evaluated by the NUMA system evaluation unit (300). That is, a transformed core-process mapping is determined through an overload removal process.

FIG. 5 illustrates the overload removal process according to the cause of the increase in memory access latency in the scheduling apparatus according to an embodiment of the present invention.

First, FIG. 5(a) illustrates the overload removal (contention resolver) process for the case where overload of a memory controller is the largest cause of increased memory access latency. For example, among the processes using the overloaded memory controller, the one with the most memory accesses may have both its memory and its executing core migrated to another node. Because changing only the location of the core that runs the process cannot resolve the overload, the location of the process memory must be changed as well. In FIG. 5(a), the migrated core and process memory are located on node 1 (n1) and are migrated to node 4 (n4). As shown in FIG. 5(a), the core and process memory locations may be exchanged with those of the process that has the fewest memory accesses among the processes running on the destination node 4 (n4).

Second, FIG. 5(b) illustrates the overload removal process for the case where overload of the interconnection is the largest cause of increased memory access latency. In this case, the executing core of the process that generates the most remote memory accesses may be changed to one of the cores of the target node where that process's memory is located. The exchange may be made with the process that has the fewest memory accesses among the processes running on the target node.

Third, FIG. 5(c) illustrates the overload removal process for the case where overload of the public cache is the largest cause of increased memory access latency. In this case, the process that stands to gain the most from a higher cache hit probability in the public cache, even if remote memory accesses result, may be relocated from its current core to a core of another node. The exchange may be made with whichever of the processes running on the cores of the other node is expected to lose the least.
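Cases (a) and (b) above can be sketched as simple swap rules. This is an illustrative sketch, not the patented procedure itself: the process records and their keys are hypothetical, node-level placement stands in for individual cores, and the sketch additionally assumes that the light process chosen in case (a) is not itself a user of the overloaded controller. Case (c) is omitted because it depends on a cache-hit gain estimate that the text does not specify.

```python
def resolve_memory_controller(procs, hot_node, target_node):
    """Case (a): move the heaviest user of the overloaded controller (core and
    memory) to the target node, swapping with the lightest process there."""
    users = [p for p in procs if p["mem_node"] == hot_node]
    residents = [p for p in procs
                 if p["core_node"] == target_node and p["mem_node"] != hot_node]
    if not users or not residents:
        return None
    heavy = max(users, key=lambda p: p["accesses"])
    light = min(residents, key=lambda p: p["accesses"])
    old_core, old_mem = heavy["core_node"], heavy["mem_node"]
    heavy["core_node"] = heavy["mem_node"] = target_node
    light["core_node"], light["mem_node"] = old_core, old_mem
    return heavy["name"], light["name"]


def resolve_interconnect(procs):
    """Case (b): move the core of the process with the most remote accesses to
    the node holding its memory, swapping cores with the lightest process there."""
    remote = [p for p in procs if p["core_node"] != p["mem_node"]]
    if not remote:
        return None
    heavy = max(remote, key=lambda p: p["accesses"])
    target = heavy["mem_node"]
    residents = [p for p in procs if p["core_node"] == target]
    if not residents:
        return None
    light = min(residents, key=lambda p: p["accesses"])
    heavy["core_node"], light["core_node"] = target, heavy["core_node"]
    return heavy["name"], light["name"]


procs_a = [
    {"name": "p1", "core_node": 1, "mem_node": 1, "accesses": 9e6},
    {"name": "p2", "core_node": 1, "mem_node": 1, "accesses": 2e6},
    {"name": "p3", "core_node": 4, "mem_node": 4, "accesses": 1e6},
    {"name": "p4", "core_node": 4, "mem_node": 4, "accesses": 5e6},
]
swap_a = resolve_memory_controller(procs_a, hot_node=1, target_node=4)

procs_b = [
    {"name": "q1", "core_node": 1, "mem_node": 2, "accesses": 7e6},
    {"name": "q2", "core_node": 2, "mem_node": 2, "accesses": 3e6},
    {"name": "q3", "core_node": 2, "mem_node": 2, "accesses": 1e6},
]
swap_b = resolve_interconnect(procs_b)
```

The resulting placement is only a candidate; it would still be scored by the evaluation unit before being applied.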

The future process-core mapping produced by the overload removal process illustrated above is evaluated by the NUMA system evaluation unit (300), and a memory access performance score (Mcredit) is assigned in the same way as in type (A).

Then, the memory access performance score (Mcredit) of the current process-core mapping is compared with that of the future process-core mapping obtained by overload removal, and the mapping of the NUMA system (100) is determined according to the higher-scoring mapping. For example, if the score (Mcredit) of the current process-core mapping is higher, the current process-core mapping is kept; if the score of the future process-core mapping is higher, the future process-core mapping is applied to the NUMA system (100).

The procedure of determining a future process-core mapping through overload removal, evaluating the mapping, and applying it may be repeated periodically, for example approximately once per second. Type (B) has the drawback that the optimal mapping case cannot be selected in a single step, but because the mapping is transformed adaptively, the amount of computation can be reduced.
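The periodic type (B) procedure can be sketched as a keep-or-replace loop. In this sketch `evaluate` and `propose` are hypothetical placeholders for the evaluation unit's score and the scheduler's overload removal step, and the integer "mapping" with a quadratic score is a toy stand-in used only to make the loop observable.

```python
def adaptive_step(current, evaluate, propose):
    """One iteration of the type (B) procedure.

    evaluate(mapping) -> score (higher is better), standing in for Mcredit
    propose(mapping)  -> candidate mapping after contention resolution
    """
    candidate = propose(current)
    if evaluate(candidate) > evaluate(current):
        return candidate   # the future mapping scores higher: apply it
    return current         # otherwise keep the current mapping


def run(initial, evaluate, propose, rounds):
    """Repeat the step a fixed number of times; a real scheduler would instead
    wait roughly one second between iterations."""
    mapping = initial
    history = [mapping]
    for _ in range(rounds):
        mapping = adaptive_step(mapping, evaluate, propose)
        history.append(mapping)
    return history


# Toy stand-ins: the "mapping" is an integer and the best score is at 3.
history = run(0, evaluate=lambda x: -(x - 3) ** 2, propose=lambda x: x + 1, rounds=5)
```

The history shows the mapping improving step by step and then staying put once the proposed change no longer scores higher.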

In the above, the process scheduler (200) evaluated the performance of a process-core mapping on the basis of both the optimization of memory access performance for the entire NUMA system (100) and the fairness of per-core memory access latency. Depending on the embodiment, however, only the optimization of memory access performance, or only the fairness of per-core memory access latency, may be used as the basis for the decision. In addition, the performance evaluation may be made so as to satisfy a per-process memory access latency that meets the user's requirements.

In the scheduler apparatus (400) according to an embodiment of the present invention, the process scheduler (200) and/or the NUMA system evaluation unit (300) may be implemented as a hardware module and/or a software module that enables a computing system to perform the functions described above.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above are to be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the appended claims rather than by the foregoing detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as falling within the scope of the present invention.

11, 12: first node and second node
20: memory subsystem
30, 31, 32: public cache
40, 41, 42: memory controller
50, 51, 52: interconnection
60, 61, 62: memory
100: NUMA system
200: process scheduler
300: NUMA system evaluation unit
400: scheduler apparatus

Claims

1. A NUMA system scheduling apparatus comprising:
a NUMA system having a plurality of nodes each including a core executing a process, a public cache, and a memory, and obtaining per-core memory access information;
a NUMA system evaluation unit for evaluating memory access performance of a current process-core mapping and a future process-core mapping of the NUMA system according to the per-core memory access information obtained in the NUMA system; and
a process scheduler that applies one of the current process-core mapping and the future process-core mapping as the process-core mapping of the NUMA system according to an evaluation result of the NUMA system evaluation unit.

2. The NUMA system scheduling apparatus according to claim 1,
wherein the per-core memory access information includes information on the current process-core mapping, location information of a process memory for each process, and, for each core, an access count to the public cache and a miss count for the public cache, and
wherein the per-core memory access information is delivered to the NUMA system evaluation unit via the process scheduler.

3. The NUMA system scheduling apparatus according to claim 1,
wherein the memory access performance evaluation for the current process-core mapping includes at least one of an overall memory access performance evaluation for the NUMA system and a per-core memory access fairness evaluation, based on a result of calculating per-process memory access latency according to the current process-core mapping.

4. The NUMA system scheduling apparatus according to claim 1,
wherein the memory access performance evaluation for the future process-core mapping includes an overall memory access performance evaluation for the NUMA system and a per-core memory access fairness evaluation for the NUMA system, based on a result of calculating per-process memory access latency according to the future process-core mapping.

5. The NUMA system scheduling apparatus according to claim 4,
wherein, prior to the evaluation of the memory access performance for the future process-core mapping, the NUMA system evaluation unit performs:
a first step of acquiring the number of memory access requests generated per 1000 instructions for each process;
a second step of calculating an access count to the public cache for each process;
a third step of calculating the number of cache misses for the public cache per 1000 instructions; and
a fourth step of calculating the number of cache misses for the public cache per second for each process.

6. The NUMA system scheduling apparatus according to any one of claims 1 to 5,
wherein the future process-core mapping includes a plurality of future process-core mappings, and
wherein the process scheduler applies to the NUMA system the process-core mapping having the highest performance evaluation among the plurality of future process-core mappings and the current process-core mapping.

7. (Deleted)

8. The NUMA system scheduling apparatus according to any one of claims 1 to 5,
wherein the process scheduler determines the future process-core mapping through an overload removal process that removes the largest factor among the factors increasing the memory access latency.

9. The NUMA system scheduling apparatus according to claim 8,
wherein the overload removal process changes at least one of a location of a core executing a process that increases the memory access latency and a location of a memory of the process that increases the memory access latency.

10. A NUMA system scheduling method comprising:
acquiring per-core memory access information in a NUMA system having a plurality of nodes each including a core executing a process, a public cache, and a memory;
evaluating, in a NUMA system evaluation unit, memory access performance of a current process-core mapping and a future process-core mapping of the NUMA system according to the per-core memory access information obtained in the NUMA system; and
applying, in a process scheduler, one of the current process-core mapping and the future process-core mapping as the process-core mapping of the NUMA system according to an evaluation result of the NUMA system evaluation unit.

11. The NUMA system scheduling method according to claim 10,
wherein the per-core memory access information includes information on the current process-core mapping, location information of a process memory for each process, and, for each core, an access count to the public cache and a miss count for the public cache, and
wherein the per-core memory access information is delivered to the NUMA system evaluation unit via the process scheduler.

12. The NUMA system scheduling method according to claim 10,
wherein evaluating the memory access performance for the current process-core mapping comprises calculating per-process memory access latency according to the current process-core mapping to evaluate at least one of an overall memory access performance for the NUMA system and a per-core memory access fairness.

13. The NUMA system scheduling method according to claim 10,
wherein evaluating the memory access performance for the future process-core mapping comprises calculating per-process memory access latency according to the future process-core mapping to evaluate at least one of an overall memory access performance for the NUMA system and a per-core memory access fairness.

14. The NUMA system scheduling method according to claim 13,
wherein, prior to the evaluation of the memory access performance for the future process-core mapping, the NUMA system evaluation unit performs:
a first step of acquiring the number of memory access requests generated per 1000 instructions for each process;
a second step of calculating an access count to the public cache for each process;
a third step of calculating the number of cache misses for the public cache per 1000 instructions; and
a fourth step of calculating the number of cache misses for the public cache per second for each process.

15. The NUMA system scheduling method according to any one of claims 10 to 14,
wherein the future process-core mapping includes a plurality of future process-core mappings, and
wherein, in the applying step, the process scheduler applies to the NUMA system the process-core mapping having the highest performance evaluation among the plurality of future process-core mappings and the current process-core mapping.

16. (Deleted)

17. The NUMA system scheduling method according to any one of claims 10 to 14,
wherein the applying step further comprises determining, by the process scheduler, the future process-core mapping through an overload removal process that removes the largest factor among the factors increasing the memory access latency.

18. The NUMA system scheduling method according to claim 17,
wherein the overload removal process changes at least one of a location of a core executing a process that increases the memory access latency and a location of a memory of the process that increases the memory access latency.