KR102035843B1

KR102035843B1 - DATA TRANSMISSION SYSTEM AND METHOD CONSIDERING NON-UNIFORMED MEMORY ACCESS

Info

Publication number: KR102035843B1
Application number: KR1020180017760A
Authority: KR
Inventors: 김영재; 김태욱
Original assignee: 서강대학교 산학협력단
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2019-10-23
Also published as: KR20190097843A

Abstract

A data transmission system for transmitting data loaded in memory buffers of a transmitting end to memory buffers of a receiving end, wherein each of the transmitting end and the receiving end includes: CPU cores, each assigned one of a master thread, a communication thread, or an input/output (I/O) thread; CPU sockets, each including at least one of the CPU cores; and memory buffers, divided into as many buffers as there are CPU sockets, with one buffer placed on each CPU socket.

Description

DATA TRANSMISSION SYSTEM AND METHOD CONSIDERING NON-UNIFORMED MEMORY ACCESS

The present invention relates to data transmission technology.

In the emerging big data era, scientists increasingly need to share large volumes of data with, or transfer them to, research laboratories and computing centers scattered around the world. However, existing data transfer frameworks fail to reflect the characteristics of the parallel file systems (PFS) used at those laboratories and computing centers.

To address this problem, the LADS (Layout-Aware Data Scheduling) data transfer framework was designed. LADS transfers data in units of objects so that multiple threads can each work on an object, which improves the efficiency of end-to-end data transfer. It also schedules data with awareness of the data layout in the parallel file system, which optimizes data communication performance over terabit networks.

However, the LADS framework does not solve some of the problems that can arise in a non-uniform memory access (NUMA) architecture. In end-to-end data transfer, bottlenecks can occur mainly in storage, the CPU, and memory. Because LADS uses a layout-aware, multithreaded architecture, it can relieve the storage and CPU bottlenecks. However, LADS uses a single RMA buffer, which overloads the memory controller and forces threads to access memory attached to another socket, so it cannot resolve the memory bottleneck.

An object of the present invention is to provide a data transmission system that places a memory buffer on each CPU socket, and a data transmission method using that system.

Another object of the present invention is to provide a memory-aware thread scheduling technique in which the threads assigned to each CPU socket access the memory buffer placed on that CPU socket.

A data transmission system according to an embodiment of the present invention includes a plurality of CPU sockets and a memory buffer placed on each CPU socket. Each CPU socket includes at least one CPU core assigned a master thread, at least one CPU core assigned a communication thread, and at least one CPU core assigned an input/output (I/O) thread.

A thread assigned to a CPU core of a given CPU socket accesses only the memory buffer placed on that CPU socket.

The master thread, communication thread, and I/O threads assigned to the CPU cores of each CPU socket interact to process the files transferred through the memory buffer placed on that CPU socket.

When each CPU socket includes multiple NUMA nodes, each made up of CPU cores of that CPU socket, the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

An I/O thread is assigned to a CPU core adjacent to the CPU core to which the master thread is assigned.

A method by which a data transmission system according to an embodiment of the present invention transmits data includes: a step in which the master thread and the communication thread of the transmitting end interact to activate an I/O thread of the transmitting end; a step in which the activated I/O thread loads data into a first memory buffer among the memory buffers of the transmitting end; a step in which a communication thread of the receiving end accesses the data loaded in the first memory buffer and loads it into a second memory buffer among the memory buffers of the receiving end; and a step in which an I/O thread of the receiving end stores the data loaded in the second memory buffer in storage. The first memory buffer is the memory buffer placed on the CPU socket that includes the CPU cores to which the master thread, the communication thread, and the I/O thread of the transmitting end are assigned. The second memory buffer is the memory buffer placed on the CPU socket that includes the CPU cores to which the communication thread and the I/O thread of the receiving end are assigned.

When the CPU socket of the transmitting end includes multiple NUMA nodes, each made up of CPU cores of that CPU socket, the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

When the CPU socket of the receiving end includes multiple NUMA nodes, each made up of CPU cores of that CPU socket, the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

According to the present invention, a memory buffer is placed on each CPU socket, so the load on the memory controller is lower than when multiple CPU sockets share a single memory buffer.

In addition, because each CPU socket only needs to access the memory buffer placed on it, the memory access latency incurred by accessing a memory buffer on another CPU socket can be reduced.

FIG. 1 is a structural diagram of a data transmission system according to an embodiment.
FIG. 2 illustrates the process by which a data transmission system according to an embodiment transmits data.
FIG. 3 compares the multiple memory buffers implemented in a data transmission system according to an embodiment with a conventional single memory buffer.
FIG. 4 illustrates how socket-based thread scheduling is implemented in a data transmission system according to an embodiment.
FIG. 5 illustrates how NUMA node-based thread scheduling is implemented in a data transmission system according to an embodiment.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. Parts unrelated to the description are omitted from the drawings for clarity, and like parts are given like reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

A data transmission system and method according to an embodiment of the present invention are described below with reference to the drawings.

FIG. 1 is a structural diagram of a data transmission system according to an embodiment.

Referring to FIG. 1, the data transmission system 100 includes a transmitting end 100 and a receiving end 200. The transmitting end 100 and the receiving end 200 include first CPU sockets 110 and 210, respectively, and second CPU sockets 120 and 220, respectively.

Although FIG. 1 shows the transmitting end 100 and the receiving end 200 each including only two CPU sockets, each may of course include two or more CPU sockets depending on the embodiment.

Because the CPU sockets 110 and 120 of the transmitting end 100 and the CPU sockets 210 and 220 of the receiving end 200 have the same configuration and function, the description below refers to the first CPU socket 110 of the transmitting end 100 and the first CPU socket 210 of the receiving end 200.

The first CPU socket 110 of the transmitting end 100 includes a first NUMA node 111, a second NUMA node 112, and a first memory buffer 113. Likewise, the first CPU socket 210 of the receiving end 200 includes a first NUMA node 211, a second NUMA node 212, and a first memory buffer 213. The number of NUMA nodes per CPU socket shown in FIG. 1 is only an example; depending on the embodiment, each CPU socket may include two or more NUMA nodes.

The first NUMA node 111 and the second NUMA node 112 of the transmitting end 100 include CPU cores 111a to 111d and 112a to 112d, respectively, and each CPU core is assigned one of a master thread, a communication thread, or an I/O thread.

For example, as shown in FIG. 1, a master thread, a communication thread, and an I/O thread may each be assigned to one of three CPU cores among the CPU cores 111a to 111d of the first NUMA node 111 of the transmitting end 100, and an I/O thread may be assigned to each of three CPU cores among the CPU cores 112a to 112d of the second NUMA node 112 of the transmitting end 100.

A memory buffer is placed on each CPU socket of the transmitting end 100. For example, as shown in FIG. 1, the transmitting end 100 has two CPU sockets 110 and 120, so two memory buffers 113 and 123 of the same size are placed on the CPU sockets 110 and 120, respectively.

In this case, a thread assigned to a CPU core of either CPU socket 110 or 120 accesses only the memory buffer placed on its own socket.

For example, in FIG. 1, the threads assigned to the CPU cores 111a to 111d and 112a to 112d of the first CPU socket 110 access the first memory buffer 113, and likewise the threads assigned to the CPU cores 121a to 121d and 122a to 122d of the second CPU socket 120 access the second memory buffer 123.
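The per-socket buffer rule can be illustrated with a small model. The sketch below is not taken from the patent: the names `build_socket_buffers` and `buffer_for_core`, the subset of core labels, and the 64 MiB budget are assumptions chosen for illustration. It splits one buffer budget into equal parts, one per CPU socket, and resolves each core to the buffer of its own socket.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SocketBuffer:
    socket_id: int
    size_bytes: int


def build_socket_buffers(total_bytes: int, socket_ids: list[int]) -> dict[int, SocketBuffer]:
    """Split one buffer budget into equal-sized buffers, one per CPU socket."""
    if not socket_ids:
        raise ValueError("at least one CPU socket is required")
    per_socket = total_bytes // len(socket_ids)
    return {sid: SocketBuffer(sid, per_socket) for sid in socket_ids}


def buffer_for_core(core: str, core_to_socket: dict[str, int],
                    buffers: dict[int, SocketBuffer]) -> SocketBuffer:
    """Return the only buffer a thread on `core` may touch: its own socket's."""
    return buffers[core_to_socket[core]]


# Hypothetical topology loosely following FIG. 1: two sockets, a few labeled cores.
core_to_socket = {"111a": 110, "112d": 110, "121a": 120, "122d": 120}
buffers = build_socket_buffers(64 * 1024 * 1024, [110, 120])
```

With this assumed budget each socket receives a 32 MiB buffer, and a lookup for core `122d` can only ever return the buffer of socket 120.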

In other words, all threads and connections of each CPU socket operate independently of the other sockets, and the master thread, communication thread, and I/O threads assigned to the CPU cores of each CPU socket interact to process the files transferred through the memory buffer placed on that CPU socket.

According to the present invention, the threads of each CPU socket only need to access the memory buffer placed on that CPU socket, so the memory access latency incurred by accessing a memory buffer on another CPU socket can be reduced.

The receiving end 200 may have the same structure as the transmitting end 100. That is, the receiving end 200 has the same number of CPU sockets as the transmitting end 100, and each CPU socket of the receiving end 200 may be configured in the same way as the corresponding CPU socket of the transmitting end 100.

FIG. 2 illustrates the process by which a data transmission system according to an embodiment transmits data.

In FIG. 2, each CPU socket of the transmitting end 100 and the receiving end 200 includes one master thread and one communication thread and may include multiple I/O threads. Each thread performs the functions described below to transmit the data loaded in the memory buffers 113 and 123 of the transmitting end 100 to the receiving end 200.

Because the CPU sockets 110 and 120 of the transmitting end 100 and the CPU sockets 210 and 220 of the receiving end 200 have the same configuration and function, the description below focuses on data communication between the first CPU socket 110 of the transmitting end 100 and the first CPU socket 210 of the receiving end 200.

Referring to FIG. 2, the master thread 111a and the communication thread 111d of the transmitting end 100 interact to activate the I/O threads 111b, 112a, 112b, and 112d of the transmitting end 100.

Specifically, the master thread 111a of the transmitting end 100 determines the data chunk layout of the file to be transmitted (S100) and sends that layout to the work queue of the communication thread 111d (S101).

On receiving the layout, the communication thread 111d sends a new-file request to the communication thread 211d of the receiving end 200 (S103), and the communication thread 211d forwards the received request to the work queue of the master thread 211a (S105).

The master thread 211a creates the new file and an identifier for it (S107) and sends an identifier request for the new file to the work queue of the communication thread 211d (S109).

The communication thread 211d sends this request to the communication thread 111d of the transmitting end 100 (S111), the communication thread 111d forwards it to the work queue of the master thread 111a (S113), and the master thread 111a places the chunk information of the file in the OST queues (S115).

The master thread 111a activates as many of the I/O threads 111b, 112a, 112b, and 112d as there are OSTs across which the file is striped (S117).
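The queue hops in steps S100 to S117 can be replayed with plain FIFO queues. This is only a sketch under stated assumptions: the function name `run_handshake`, the message tuples, the queue names, and the counter used for file identifiers are hypothetical, and real threads and network transport are replaced by sequential code.

```python
from __future__ import annotations

from collections import deque
from itertools import count

_file_ids = count(1)  # hypothetical identifier source; the patent does not specify one


def run_handshake(file_name: str, chunk_layout: dict[int, list[int]]):
    """Replay steps S100 to S117 using one FIFO work queue per thread.

    chunk_layout maps an OST index to the chunk offsets stored on that OST.
    Returns the per-OST queues and the number of I/O threads to activate.
    """
    queues = {name: deque() for name in
              ("tx_master", "tx_comm", "rx_master", "rx_comm")}

    # S100-S101: the sender's master thread queues the layout for its comm thread.
    queues["tx_comm"].append(("layout", file_name))
    # S103: the sender's comm thread sends a new-file request to the receiver.
    _, name = queues["tx_comm"].popleft()
    queues["rx_comm"].append(("new_file", name))
    # S105: the receiver's comm thread hands the request to its master thread.
    queues["rx_master"].append(queues["rx_comm"].popleft())
    # S107-S109: the receiver's master creates the file and its identifier.
    _, name = queues["rx_master"].popleft()
    queues["rx_comm"].append(("file_id", next(_file_ids)))
    # S111-S113: the identifier travels back to the sender's master thread.
    queues["tx_comm"].append(queues["rx_comm"].popleft())
    queues["tx_master"].append(queues["tx_comm"].popleft())
    # S115: the sender's master fills one queue per OST with chunk information.
    _, file_id = queues["tx_master"].popleft()
    ost_queues = {ost: deque((file_id, off) for off in offsets)
                  for ost, offsets in chunk_layout.items() if offsets}
    # S117: one I/O thread is activated per OST that holds part of the file.
    return ost_queues, len(ost_queues)
```

For a file striped over four OSTs, the function returns four OST queues and an I/O thread count of four.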

The activated I/O threads 111b, 112a, 112b, and 112d cycle through the OST queues, take chunk information from them, and load the corresponding data chunks into the first memory buffer 113 (S119), so that the data chunks are stored in the first memory buffer 113 (S121). That is, the activated I/O threads load data into the first memory buffer 113 placed on the first CPU socket 110 and do not load data into the second memory buffer 123 placed on the second CPU socket 120.

The activated I/O threads 111b, 112a, 112b, and 112d also place, in the work queue of the communication thread 111d, a request to send the data chunk information to the receiving end 200 (S123).
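Steps S119 to S123 can be sketched with Python's standard `threading` and `queue` modules. The names `io_worker` and `load_file`, the dictionary standing in for the socket-local memory buffer, and the fake storage read are assumptions made for illustration; the patent does not prescribe how the queues are visited beyond saying the I/O threads cycle through them.

```python
import queue
import threading


def io_worker(worker_idx, ost_queues, local_buffer, buffer_lock, comm_queue):
    """Visit the OST queues in turn until every queue is empty (S119-S123)."""
    n = len(ost_queues)
    while True:
        progressed = False
        for step in range(n):
            ost = (worker_idx + step) % n  # each worker starts at a different OST
            try:
                chunk_id = ost_queues[ost].get_nowait()
            except queue.Empty:
                continue
            data = f"chunk-{chunk_id}".encode()  # stand-in for a storage read
            with buffer_lock:
                local_buffer[chunk_id] = data  # S119/S121: socket-local buffer only
            comm_queue.put(("send_chunk", chunk_id))  # S123
            progressed = True
        if not progressed:
            return


def load_file(chunks_per_ost):
    """Start one I/O worker per OST and load every chunk into one local buffer."""
    ost_queues = []
    for chunk_ids in chunks_per_ost:
        q = queue.Queue()
        for chunk_id in chunk_ids:
            q.put(chunk_id)
        ost_queues.append(q)
    local_buffer, lock, comm_queue = {}, threading.Lock(), queue.Queue()
    workers = [threading.Thread(target=io_worker,
                                args=(i, ost_queues, local_buffer, lock, comm_queue))
               for i in range(len(ost_queues))]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return local_buffer, comm_queue.qsize()
```

Because all chunk information is queued before the workers start, a worker can safely exit after one full pass that finds every OST queue empty.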

The communication thread 111d forwards the request to the communication thread 211d of the receiving end 200 (S125). The communication thread 211d of the receiving end 200 then accesses the data chunk in the first memory buffer 113 of the transmitting end 100 (S127) and loads it from the first memory buffer 113 of the transmitting end 100 into the first memory buffer 213 of the receiving end 200 (S129). That is, as on the transmitting end 100, the communication thread 211d accesses only data loaded in the first memory buffer 213. The data chunk is then stored in the first memory buffer 213 (S131).

The communication thread 211d then places the I/O threads 211b, 212a, 212b, and 212d on the respective OST queues (S133), and the I/O threads 211b, 212a, 212b, and 212d write the data chunk in the first memory buffer 213 to OST storage (S135). This completes the transfer of one data chunk, and the above steps are repeated until all data chunks have been transferred.
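The receiver-side half of the loop can be modeled in a few lines. In this sketch the function name `transfer_all`, the use of dictionaries for the two buffers and for storage, and the step trace are assumptions for illustration; the remote read of the sender's buffer is reduced to a dictionary lookup.

```python
from __future__ import annotations


def transfer_all(sender_buffer: dict[int, bytes]):
    """Move every chunk from the sender's buffer to receiver-side storage.

    Returns the resulting storage contents and a trace of (step, chunk_id).
    """
    receiver_buffer: dict[int, bytes] = {}
    storage: dict[int, bytes] = {}
    trace: list[tuple[str, int]] = []
    for chunk_id in sorted(sender_buffer):
        # S127-S131: the receiver's comm thread reads the chunk from the
        # sender's buffer and stores it in the receiver's own buffer.
        receiver_buffer[chunk_id] = sender_buffer[chunk_id]
        trace.append(("S129", chunk_id))
        # S133-S135: a receiver I/O thread writes the chunk to OST storage.
        storage[chunk_id] = receiver_buffer[chunk_id]
        trace.append(("S135", chunk_id))
    return storage, trace
```

The loop ends only when every chunk present in the sender's buffer has reached storage, which mirrors the repeat-until-complete behavior described above.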

FIG. 3 compares the multiple memory buffers implemented in a data transmission system according to an embodiment with a conventional single memory buffer.

Referring to FIG. 3, in the conventional single memory buffer system 300, a single memory buffer 325 is provided for multiple CPU sockets 310 and 320.

In this case, the threads 321 to 324 assigned to the CPU cores of the second CPU socket 320 only need to access the memory buffer 325 allocated on the second CPU socket 320, which poses no problem. The threads 311 to 314 assigned to the CPU cores of the first CPU socket 310, however, must access the remotely placed memory buffer 325, which incurs memory access latency.

Furthermore, when the threads 311 to 314 and 321 to 324 all access the single memory buffer 325, the second CPU socket 320 that hosts the memory buffer 325 becomes overloaded.

In contrast, in the multiple memory buffer system 400 of the present invention, a first memory buffer 415 is placed on the first CPU socket 410 and a second memory buffer 425 is placed on the second CPU socket 420.

In this case, the threads 411 to 414 assigned to the CPU cores of the first CPU socket 410 only need to access the first memory buffer 415, and the threads 421 to 424 assigned to the CPU cores of the second CPU socket 420 only need to access the second memory buffer 425. Unlike in the conventional single memory buffer system 300, the additional latency of cross-socket memory access therefore does not arise.

In particular, even as the number of I/O threads 413, 414, 423, and 424 grows, each I/O thread only needs to access the memory buffer on the CPU socket where it runs, so the load on the memory controller (not shown) can be much lower than in the single memory buffer system 300.
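The difference between the two layouts in FIG. 3 can be counted with a toy model. The function `access_profile` and the socket numbering 0 and 1 are assumptions, and the model counts accesses only; it does not estimate actual latencies, which depend on the hardware.

```python
def access_profile(thread_sockets, buffer_socket_of):
    """Count local and remote buffer accesses and the load per hosting socket.

    thread_sockets lists the socket each thread runs on; buffer_socket_of maps
    a thread's socket to the socket hosting the buffer that thread uses.
    """
    local = remote = 0
    controller_load = {}
    for thread_socket in thread_sockets:
        target = buffer_socket_of(thread_socket)
        controller_load[target] = controller_load.get(target, 0) + 1
        if target == thread_socket:
            local += 1
        else:
            remote += 1
    return local, remote, controller_load


# Four threads on each of two sockets, as in FIG. 3.
threads = [0] * 4 + [1] * 4
single = access_profile(threads, lambda socket: 1)      # one buffer on socket 1
multi = access_profile(threads, lambda socket: socket)  # one buffer per socket
```

Here `single` evaluates to `(4, 4, {1: 8})` and `multi` to `(8, 0, {0: 4, 1: 4})`: the per-socket layout removes the remote accesses and halves the number of threads hitting any one memory controller.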

FIG. 4 illustrates how socket-based thread scheduling is implemented in a data transmission system according to an embodiment.

Referring to FIG. 4, even when the multiple memory buffer system 500 includes memory buffers 515 and 525 on the CPU sockets 510 and 520, respectively, the threads 511 to 514 assigned to the CPU cores of the first CPU socket 510 may in some cases access the second memory buffer 525, and conversely the threads 521 to 524 assigned to the CPU cores of the second CPU socket 520 may access the first memory buffer 515.

To prevent this, each of the CPU sockets 510 and 520 includes at least one master thread, at least one communication thread, and at least one I/O thread.

For example, as shown in FIG. 4, the first CPU socket 510 includes a first master thread 511, a first communication thread 512, and first I/O threads 513 and 514, and the second CPU socket 520 includes a second master thread 521, a second communication thread 522, and second I/O threads 523 and 524.

The threads of each CPU socket interact to process the files transferred through that CPU socket and do not process files transferred through another CPU socket. In other words, every thread is pinned to its own CPU socket.

Specifically, the first master thread 511 of the first CPU socket 510 interacts with the first communication thread 512 of the same socket to process the files transferred through the first CPU socket 510, and the first communication thread 512 manages only the first memory buffer 515. The first I/O threads 513 and 514 of the first CPU socket 510 work only on objects assigned through the first CPU socket 510. The same applies to the second CPU socket 520.
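Socket-based scheduling amounts to giving each socket its own thread group and never moving a file between groups. The sketch below assumes a round-robin file assignment, which the patent does not specify; `SocketGroup`, `assign_files`, and the thread naming scheme are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SocketGroup:
    """One master, one communication thread, and some I/O threads per socket."""
    socket_id: int
    io_threads: int
    files: list = field(default_factory=list)

    def thread_names(self):
        names = [f"s{self.socket_id}-master", f"s{self.socket_id}-comm"]
        names += [f"s{self.socket_id}-io{i}" for i in range(self.io_threads)]
        return names


def assign_files(files, groups):
    """Assign each file to exactly one socket group (round robin, an assumption).

    Returns a mapping from file name to the socket that owns it; only that
    socket's threads and memory buffer are used for the file.
    """
    for i, name in enumerate(files):
        groups[i % len(groups)].files.append(name)
    return {name: group.socket_id for group in groups for name in group.files}


groups = [SocketGroup(0, io_threads=2), SocketGroup(1, io_threads=2)]
owner = assign_files(["a.dat", "b.dat", "c.dat"], groups)
```

Once `owner` is fixed, a file such as `b.dat` is handled only by the threads named in `groups[1].thread_names()`, so no thread has a reason to touch the other socket's buffer.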

FIG. 5 illustrates how NUMA node-based thread scheduling is implemented in a data transmission system according to an embodiment.

Referring to FIG. 5, when each CPU socket includes multiple non-uniform memory access nodes (hereinafter, NUMA nodes), each made up of CPU cores of that CPU socket, the interaction among the threads of the CPU socket and even use of the cores become important. If the master thread and the communication thread are assigned to different NUMA nodes, performance degrades because their interaction requires memory accesses across NUMA nodes.

Therefore, the master thread and the communication thread of a CPU socket are assigned to CPU cores in the same NUMA node. For example, as shown in FIG. 5, the master thread 611a and the communication thread 611b of the CPU socket 610 may each be assigned to a CPU core of the first NUMA node 611.

In addition, to place the master thread and the I/O threads as close together as possible and so optimize the performance of the CPU socket, each I/O thread is assigned to a CPU core adjacent to the CPU core to which the master thread is assigned. For example, as shown in FIG. 5, I/O threads may be assigned to the CPU cores 612a and 612c, which are the cores of the second NUMA node 612 closest to the master thread 611a.
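The two placement rules (master and communication threads in one NUMA node, I/O threads on the cores nearest the master) can be expressed as a small placement function. This is a sketch under assumptions: `place_threads` is a hypothetical name, and "adjacent" is modeled as distance in a linear core ordering, whereas the patent does not define how adjacency is measured.

```python
def place_threads(nodes, n_io):
    """Choose cores for the master, communication, and I/O threads of one socket.

    nodes is a list of NUMA nodes, each a list of core ids in physical order.
    The master and communication threads share the first NUMA node; I/O
    threads take the free cores closest to the master core.
    """
    first = nodes[0]
    if len(first) < 2:
        raise ValueError("first NUMA node needs at least two cores")
    master, comm = first[0], first[1]
    order = [core for node in nodes for core in node]
    position = {core: i for i, core in enumerate(order)}
    free = [core for core in order if core not in (master, comm)]
    # Assumed distance metric: gap between positions in the linear ordering.
    free.sort(key=lambda core: abs(position[core] - position[master]))
    if n_io > len(free):
        raise ValueError("not enough free cores for the requested I/O threads")
    return {"master": master, "comm": comm, "io": free[:n_io]}


placement = place_threads([[0, 1, 2, 3], [4, 5, 6, 7]], n_io=3)
```

For this assumed eight-core socket, `placement` is `{"master": 0, "comm": 1, "io": [2, 3, 4]}`: the first two I/O threads stay in the master's NUMA node and the third takes the nearest core of the second node.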

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within the scope of the present invention.

Claims

A data transmission system comprising:
a plurality of CPU sockets; and
a memory buffer placed on each CPU socket,
wherein each CPU socket includes at least one CPU core assigned a master thread, at least one CPU core assigned a communication thread, and at least one CPU core assigned an I/O thread, and
wherein each CPU socket includes a plurality of NUMA nodes, each made up of CPU cores of that CPU socket, and the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

The data transmission system of claim 1,
wherein a thread assigned to a CPU core of each CPU socket accesses only the memory buffer placed on that CPU socket.

The data transmission system of claim 1,
wherein the master thread, the communication thread, and the I/O thread assigned to the CPU cores of each CPU socket interact to process a file transferred through the memory buffer placed on that CPU socket.

(Deleted)

The data transmission system of claim 1,
wherein the I/O thread is assigned to a CPU core adjacent to the CPU core to which the master thread is assigned.

A method by which a data transmission system transmits data, the method comprising:
activating an I/O thread of a transmitting end through interaction between a master thread and a communication thread of the transmitting end;
loading, by the activated I/O thread, data into a first memory buffer among memory buffers of the transmitting end;
accessing, by a communication thread of a receiving end, the data loaded in the first memory buffer and loading the data into a second memory buffer among memory buffers of the receiving end; and
storing, by an I/O thread of the receiving end, the data loaded in the second memory buffer in storage,
wherein the first memory buffer is a memory buffer placed on a CPU socket that includes the CPU cores to which the master thread, the communication thread, and the I/O thread of the transmitting end are assigned,
wherein the second memory buffer is a memory buffer placed on a CPU socket that includes the CPU cores to which the communication thread and the I/O thread of the receiving end are assigned, and
wherein the CPU socket of the transmitting end includes a plurality of NUMA nodes, each made up of CPU cores of that CPU socket, and the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

(Deleted)

The data transmission method of claim 6,
wherein the I/O thread is assigned to a CPU core adjacent to the CPU core to which the master thread is assigned.

The data transmission method of claim 6,
wherein, when the CPU socket of the receiving end includes a plurality of NUMA nodes, each made up of CPU cores of that CPU socket, the master thread and the communication thread are assigned to CPU cores in the same NUMA node.

The data transmission method of claim 9,
wherein the I/O thread is assigned to a CPU core adjacent to the CPU core to which the master thread is assigned.