KR102479757B1 - Offloading method and system of network and file i/o operation, and a computer-readable recording medium

Info

Publication number: KR102479757B1
Application number: KR1020210058761A
Authority: KR
Inventors: 박경수; 데온드레 마틴 탄 응
Original assignee: 한국과학기술원
Priority date: 2020-11-24
Filing date: 2021-05-06
Publication date: 2022-12-22
Also published as: KR20220071858A

Abstract

A method for offloading network and file input/output operations according to the present invention is performed by a central processing unit (CPU) of a file-based content delivery server and a network interface card (NIC) having a computing function. The method includes: an offload step in which the CPU offloads data plane operations to the NIC; and an execution step in which the NIC performs the data plane operations while the CPU performs control plane operations, that is, the operations other than the offloaded data plane operations. The data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission. The control plane operations include at least one of TCP connection management, reliability management of data transmission, congestion management of data transmission, flow control, and error control.

Description

Network and file input/output operation offloading method, operation offloading system, and computer readable recording medium

The present invention relates to a method for offloading network and file input/output operations, an operation offloading system, and a computer-readable recording medium that can greatly improve content delivery performance per unit of CPU capacity by offloading file I/O and network I/O, the bottlenecks of a file-based content delivery server, from the CPU and having them performed by an auxiliary device with a computing function (e.g., a SmartNIC).

Recent advances in high-bandwidth I/O devices promise fast delivery of high-quality online content. Unfortunately, today's programming model is still tied to a CPU-centric approach, and the resulting CPU bottleneck severely limits the true potential of modern I/O devices. In online content delivery, more than 85% of CPU cycles are spent on simple but repetitive operations such as disk and network I/O.

Meanwhile, in recent computing hardware trends, the growth of I/O device capacity has substantially outpaced the growth of CPU capacity. Over the past 20 years, the bandwidth of network interface cards (NICs) and the speed of disk I/O have improved by a factor of about two to four. In contrast, improvements in CPU capacity have slowed, and it is widely accepted that Moore's Law for processors ended more than a decade ago. This trend calls for rethinking the CPU-centric programming model that underlies modern operating systems. With current OS APIs, an operation can be performed only after all data from an I/O device has been brought into the main memory directly attached to the CPU. For this reason, in a video content delivery server, for example, more than 85% of all CPU cycles are spent on simple I/O operations (reading content from a fast disk and sending it through a high-bandwidth NIC), so the CPU frequently becomes the performance bottleneck of I/O-intensive applications such as video content delivery. To make effective use of modern innovations in I/O devices, program structures must reduce their dependence on the CPU and its memory system for performing I/O operations.

Fortunately, programmable I/O devices such as SmartNICs and computational SSDs have recently emerged and help relieve the CPU of computational work without CPU involvement. Because the PCIe standard allows peer-to-peer DMA (P2PDMA) between two PCIe devices without CPU intervention, one can envision a server system in which the NIC takes disk I/O operations off the CPU entirely. Indeed, recent work such as DCS and DCS-Ctrl shows that an FPGA-based coordinator can perform all disk I/O operations for a video delivery server via P2PDMA. The major drawback of these systems, however, is that content delivery runs over a UDP-like protocol rather than the TCP-based transport commonly adopted in today's video streaming. Supporting TCP in an I/O-offloaded server raises a "function placement" problem: where should the TCP functionality reside? In other words, if all disk I/O is performed on the programmable NIC, where should the TCP stack run? One approach is to run it on the CPU side, but because the disk content is available only on the NIC, this creates a "missing data" problem for data packets. Moving the data back to the CPU side would negate the benefit of offloading disk I/O.

Another approach is to run the TCP stack on the NIC side. Implementing a full TCP stack on an FPGA board is not easy, but it is feasible on a SmartNIC. In fact, recent SmartNIC platforms include Arm-based embedded processors that run Linux with a full TCP stack. However, running the full TCP stack on the NIC has the drawback that the application must also run on that same platform, with its limited resources.

The present invention has been devised in view of the problems described above. Its object is to provide a method for offloading network and file input/output operations, an operation offloading system, and a computer-readable recording medium in which file I/O and network I/O, the bottlenecks of a content delivery server, are offloaded from the CPU, so that the CPU performs only the control plane operations, which are complex and must be executed quickly, while the simple but repetitive and mechanical data plane operations are performed by a peripheral device (such as a SmartNIC), all while remaining fully compatible with the TCP protocol.

To achieve the above object, a method for offloading file input/output operations according to the present invention is a method for offloading network and file input/output operations in a central processing unit (CPU) of a file-based content delivery server and a network interface card (NIC) having a computing function. The method includes: an offload step in which the CPU offloads data plane operations to the NIC; and an execution step in which the NIC performs the data plane operations while the CPU performs control plane operations, that is, the operations other than the offloaded data plane operations. The data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission. The control plane operations include at least one of TCP connection management, reliability management of data transmission, congestion management of data transmission, flow control, and error control.

The offload step may include: the CPU forwarding to the NIC an "offload_open()" command received from an application; the NIC opening the file in the NIC stack and returning the result to the CPU; and the CPU forwarding to the NIC an "offload_write()" command invoked by the application.

The method may further include a virtual operation step in which the CPU performs the data plane operations virtually when the actual data is absent from the host stack of the CPU.

The virtual operation step may include: when the application calls "offload_write()", the CPU performing a virtual write to the send buffer by updating only metadata; and the CPU determining the number of bytes to send in light of the congestion and flow control parameters, and then posting a "SEND" command to the NIC stack of the NIC.
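The virtual write described above can be illustrated with a minimal sketch. It assumes that the host stack keeps only byte counts for offloaded data and that the sendable amount is bounded by the smaller of the congestion window and the peer's receive window. The class `VirtualSendBuffer` and the functions `bytes_to_send` and `post_send` are hypothetical names chosen for illustration; only "offload_write()" and "SEND" come from the specification.

```python
from dataclasses import dataclass


@dataclass
class VirtualSendBuffer:
    """Host-side send buffer that holds metadata only; payload stays on the NIC side."""
    file_id: int
    next_offset: int = 0    # file offset of the first byte not yet posted to the NIC
    pending_bytes: int = 0  # bytes accepted from the application but not yet posted

    def offload_write(self, length: int) -> int:
        """Virtual write: only the byte counter changes, no payload is copied."""
        self.pending_bytes += length
        return length


def bytes_to_send(buf: VirtualSendBuffer, cwnd: int, peer_rwnd: int, in_flight: int) -> int:
    """Bytes the host may release now (assumed rule: min(cwnd, rwnd) minus in-flight bytes)."""
    window = min(cwnd, peer_rwnd) - in_flight
    return max(0, min(buf.pending_bytes, window))


def post_send(buf: VirtualSendBuffer, nbytes: int) -> dict:
    """Build a SEND command for the NIC stack and advance the host-side metadata."""
    cmd = {"cmd": "SEND", "file_id": buf.file_id, "offset": buf.next_offset, "length": nbytes}
    buf.next_offset += nbytes
    buf.pending_bytes -= nbytes
    return cmd
```

For example, after `offload_write(100_000)` with a 14,600-byte congestion window and nothing in flight, the sketch releases 14,600 bytes in one "SEND" command and keeps the remaining 85,400 bytes pending on the host side as a counter only.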

The CPU may deliver the "SEND" command using a TCP packet, and the TCP packet may include at least one of a file ID, a starting offset to read, and data length information.

Upon receiving the "SEND" command, the NIC may convert it into a plurality of TCP packets, based on the length information included in the TCP packet, and transmit them.
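As an illustration of the "SEND" command and its expansion into multiple TCP packets, the following sketch packs the three fields named above into a payload and then splits the requested byte range into segment-sized pieces. The field widths, the 4-byte "SEND" tag, and the 1,460-byte segment size are assumptions made for this example; the specification does not fix a wire format.

```python
import struct

# Hypothetical layout: 4-byte tag, 32-bit file ID, 64-bit start offset, 32-bit length,
# all in network byte order.
SEND_FORMAT = "!4sIQI"


def encode_send(file_id: int, offset: int, length: int) -> bytes:
    """Host side: serialize a SEND command into a packet payload."""
    return struct.pack(SEND_FORMAT, b"SEND", file_id, offset, length)


def decode_send(payload: bytes) -> tuple:
    """NIC side: parse the payload back into (file_id, offset, length)."""
    tag, file_id, offset, length = struct.unpack(SEND_FORMAT, payload)
    if tag != b"SEND":
        raise ValueError("not a SEND command")
    return file_id, offset, length


def segment(offset: int, length: int, mss: int = 1460) -> list:
    """Split [offset, offset + length) into (file_offset, size) pieces of at most mss bytes."""
    end = offset + length
    return [(off, min(mss, end - off)) for off in range(offset, end, mss)]
```

Each `(file_offset, size)` pair corresponds to one outgoing data packet whose payload the NIC would read from the opened file.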

When about to transmit the data packets for the "SEND" command, the client may send an echo packet to the CPU, and the CPU may refrain from starting the retransmission timer for a predetermined time, until it receives the echo packet.

The predetermined time may be determined by the packet processing delay in the NIC and the one-way processing time of the echo packet.

The packet processing delay in the NIC may be the delay between the arrival of the "SEND" command at the NIC and the departure of the data packets for that "SEND" command toward the client.
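The relationship between the two delays can be sketched as follows. The specification states only that the predetermined time is determined by the NIC packet processing delay and the one-way processing time of the echo packet; combining them as a simple sum is an assumption of this example, and both function names are hypothetical. Times are integer microseconds.

```python
def nic_processing_delay_us(send_arrival_us: int, data_departure_us: int) -> int:
    """Delay between the SEND command reaching the NIC and its data packets departing."""
    if data_departure_us < send_arrival_us:
        raise ValueError("data cannot depart before the SEND command arrives")
    return data_departure_us - send_arrival_us


def timer_hold_off_us(nic_delay_us: int, echo_one_way_us: int) -> int:
    """Time the host waits before starting the retransmission timer (assumed: plain sum)."""
    return nic_delay_us + echo_one_way_us
```

With a "SEND" command arriving at 1,000 µs, data departing at 1,350 µs, and a 20 µs one-way echo time, the hold-off under this assumption is 370 µs.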

The CPU may define an "OPEN" command used to open a file for an IO-TCP application, a "SEND" command used to send file content to the client, an "ACKD" command for handling retransmissions efficiently, and a "CLOS" command used to close a file for an IO-TCP application.

The CPU may offload only the "SEND" command to the NIC and directly process the packets received from the client.

Upon confirming an ACK for an offloaded sequence space range, the CPU may periodically deliver the "ACKD" packet to the NIC.
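One way to picture the role of "ACKD" is a NIC-side cache that keeps transmitted payloads until the host reports them as acknowledged, so that a retransmission does not require a second disk read. The sketch below is an assumption-laden illustration: the class `NicRetransmitCache` and its methods are hypothetical, and sequence-number wraparound is ignored for brevity.

```python
class NicRetransmitCache:
    """Keeps sent payloads, keyed by starting sequence number, until the host reports an ACK."""

    def __init__(self):
        self._chunks = {}  # start_seq -> payload bytes

    def store(self, start_seq: int, payload: bytes) -> None:
        """Remember a payload that was just transmitted."""
        self._chunks[start_seq] = payload

    def lookup(self, start_seq: int):
        """Return the cached payload for a retransmission, or None if already released."""
        return self._chunks.get(start_seq)

    def on_ackd(self, acked_seq: int) -> int:
        """Handle ACKD: release every chunk that ends at or before acked_seq.

        Returns the number of bytes freed.
        """
        freed = 0
        for start in sorted(self._chunks):  # sorted() copies the keys, so popping is safe
            if start + len(self._chunks[start]) <= acked_seq:
                freed += len(self._chunks.pop(start))
        return freed
```

Chunks that extend past the acknowledged sequence number stay cached, so they can still be retransmitted without touching the disk.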

Meanwhile, an operation offloading system for achieving the above object is a network and file input/output operation offloading system that includes a central processing unit (CPU) of a file-based content delivery server and a network interface card (NIC) having a computing function. The CPU offloads data plane operations to the NIC, the NIC performs the data plane operations, and the CPU performs control plane operations, that is, the operations other than the offloaded data plane operations. The data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission. The control plane operations include at least one of TCP connection management, reliability management of data transmission, congestion management of data transmission, flow control, and error control.

When the CPU forwards to the NIC an "offload_open()" command received from an application, the NIC may open the file in the NIC stack and return the result to the CPU, and the CPU may forward to the NIC an "offload_write()" command invoked by the application.

When the actual data is absent from the host stack of the CPU, the CPU may perform the data plane operations virtually.

When the application calls "offload_write()", the CPU may perform a virtual write to the send buffer by updating only metadata, determine the number of bytes to send in light of the congestion and flow control parameters, and then post a "SEND" command to the NIC stack of the NIC.

The CPU may deliver the "SEND" command using a TCP packet, the TCP packet including at least one of a file ID, a starting offset to read, and data length information. Upon receiving the "SEND" command, the NIC may convert it into a plurality of TCP packets, based on the length information included in the TCP packet, and transmit them.

When about to transmit the data packets for the "SEND" command, the client may send an echo packet to the CPU, and the CPU may refrain from starting the retransmission timer for a predetermined time, until it receives the echo packet. The predetermined time may be determined by the packet processing delay in the NIC and the one-way processing time of the echo packet. The packet processing delay in the NIC may be the delay between the arrival of the "SEND" command at the NIC and the departure of the data packets for that "SEND" command toward the client.

The CPU may define an "OPEN" command used to open a file for an IO-TCP application, a "SEND" command used to send file content to the client, an "ACKD" command for handling retransmissions efficiently, and a "CLOS" command used to close a file for an IO-TCP application. The CPU may offload only the "SEND" command to the NIC and directly process the packets received from the client. Upon confirming an ACK for an offloaded sequence space range, the CPU may periodically deliver the "ACKD" packet to the NIC.

Meanwhile, a computer-readable recording medium according to the present invention may store a program for executing the above-described operation offloading method on a computer.

According to the present invention, it is possible to design a network stack that offloads disk access using P2P DMA and fully supports the TCP protocol while the CPU side controls TCP operation. In the prior art, P2P DMA was implemented on an FPGA device so that NVMe disks were accessed directly without CPU intervention, but the TCP protocol was not supported. Given that most modern video streaming services are TCP-based, the prior art therefore had limited potential for commercial deployment. In contrast, the present invention supports the TCP protocol and can thus be applied directly to video services.

Figure 1 shows an architectural overview of the IO-TCP stack.
Figure 2 shows the NVMe disk utilization of a single CPU core, measured with fio, a widely used disk benchmark tool.
Figure 3 shows single-CPU-core results in a performance comparison of web servers.
Figure 4 shows the architecture of the Mellanox BlueField NIC used in the platform.
Figure 5 shows a subset of operations that use the API functions in the context of an HTTP server.
Figure 6 shows how a "SEND" command packet is processed.
Figure 7 shows the Ethernet round-trip time for communication between the host and the NIC stack.
Figure 8 shows the results of evaluating the effectiveness of IO-TCP in large-file content delivery.
Figure 9 shows (a) throughput as a function of the fraction of CPU cycles consumed by the host stack and (b) the throughput of a single SmartNIC as a function of the number of Arm cores.
Figure 10 shows (a) the effect of echo packets, which start the retransmission timer at the moment the actual data packet is sent, and (b) the effect of TCP timestamp correction in the NIC stack.
Figure 11 shows throughput results for varying numbers of concurrent connections.
Figure 12 shows the throughput of serving 4 KB files for varying numbers of concurrent connections.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted. The suffix "unit" attached to component names in the following description is assigned or used interchangeably only for ease of drafting the specification and does not by itself carry a distinct meaning or role. In describing the embodiments disclosed herein, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed herein; they do not limit the technical idea disclosed in this specification, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings.

To state the conclusion first, the present invention solves the problems described above with I/O Offloading TCP (IO-TCP), a split TCP stack design for I/O-intensive applications. The core idea of IO-TCP is to run only the control plane on the CPU while delegating all data plane operations to a SmartNIC that can access disks (e.g., PCIe-attached disks such as NVMe disks) via P2PDMA. The control plane refers to all core functions of the TCP protocol (connection management, reliability management of data transmission, flow control, and congestion management), and the data plane refers to every aspect of data packet generation and transmission, including fetching content from the NVMe disk.

In the present invention, the CPU side retains full control over all operations, while the actual disk and network I/O operations are internally offloaded to the SmartNIC. This reserves CPU cycles for complex control operations and spares them from simple but repetitive I/O operations.

IO-TCP offers great potential for saving CPU cycles, but it also brings a set of new challenges. First, the RTT measured by the host TCP stack fluctuates because of disk-induced delay. In IO-TCP, data packet I/O on the NIC is coupled with disk I/O, since packet content is fetched from the disk. Disk I/O can be much slower, however, and often adds significant delay to packet transmission even when the network path is not congested. IO-TCP addresses this by carefully removing the disk-induced delay from RTT measurements: it uses echo packets so that the host stack can accurately track packet departure times. Second, IO-TCP must handle packet retransmission in the NIC stack, which has no information about when packets will be retransmitted. Without care, the NIC stack could issue redundant disk I/O for retransmissions. To avoid this inefficiency, IO-TCP uses an internal ACK protocol that tells the NIC when it is safe to discard data packets. Third, IO-TCP must provide applications with a well-defined API for flexibly composing file and non-file content for data transfer. To this end, IO-TCP slightly extends the Berkeley socket API with a few "offload" functions that open a file and send file content from the NIC.
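The first challenge can be made concrete with a small sketch that contrasts a naive RTT sample, which includes disk delay, with one derived from the echo packet. The correction rule used here (departure time estimated as echo arrival minus the one-way echo delay) is an assumption for illustration, as are the function names. Times are integer microseconds.

```python
def naive_rtt_us(send_posted_us: int, ack_arrival_us: int) -> int:
    """RTT measured from the moment the host posts SEND; includes disk and NIC delay."""
    return ack_arrival_us - send_posted_us


def corrected_rtt_us(echo_arrival_us: int, echo_one_way_us: int, ack_arrival_us: int) -> int:
    """RTT measured from the estimated departure time of the data packet.

    Assumed rule: the data packet left the NIC one echo one-way delay before
    the echo packet reached the host.
    """
    departure_us = echo_arrival_us - echo_one_way_us
    return ack_arrival_us - departure_us
```

If the host posts "SEND" at 0 µs, the echo arrives at 920 µs with a 20 µs one-way delay, and the ACK arrives at 1,900 µs, the naive sample is 1,900 µs while the corrected sample is 1,000 µs; the 900 µs difference is the delay spent before the packet left the NIC.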

Figure 1 shows an architectural overview of the IO-TCP stack. IO-TCP is implemented on a Mellanox BlueField SmartNIC, which can access NVMe disks directly via P2PDMA. The host stack extends an existing user-level TCP stack to support I/O offloading, while the NIC stack is implemented with the DPDK library. The host stack required 1,793 lines of code modifications, and the NIC stack required 1,853 lines of C code. To evaluate its effectiveness with a real application, Lighttpd is ported to use IO-TCP with only about 10 lines of code changes.

In the evaluation, the IO-TCP port of Lighttpd delivers 44 Gbps of video content with a single CPU core. In contrast, the original Lighttpd on the Linux TCP stack needs six CPU cores to reach the same performance. The current bottleneck lies in the low memory bandwidth of the BlueField NIC, and future versions are expected to achieve better performance. The invention has the following technical significance.

(1) It analyzes how the CPU bottleneck on disk I/O affects the performance of modern content delivery systems.

(2) It presents the design and implementation of IO-TCP, a new TCP stack design that separates the TCP control plane from the data plane so that content delivery systems can take full advantage of recent I/O advances in SmartNICs.

(3) It shows how IO-TCP overcomes the CPU bottleneck and achieves far greater I/O bandwidth than the CPU could normally sustain.

Mismatch between I/O Device Advances and CPU Capacity

Twenty years ago, the fastest hard disks could achieve about 200 random I/O operations per second (IOPS), whereas recent NVMe disks can perform over 1 million IOPS, nearly 40 times faster. Over the same period, Ethernet NIC bandwidth has improved by more than 200 times (from 1 Gbps in 1997 to 200 Gbps in 2019), and 800 Gbps/1.6 Tbps Ethernet is expected to be standardized within a few years.

In contrast, improvements in CPU capacity have been greatly hampered by the end of Moore's Law and the breakdown of Dennard scaling. The first general-purpose multi-core CPU appeared in 2005, but the maximum number of cores per CPU remains at 32 to this day.

Figure 2 shows the NVMe disk utilization of a single CPU core, measured with fio, a widely used disk benchmark tool. The CPU is an Intel Xeon Silver 4210 (2.20 GHz), and the NVMe devices are Intel Optane 900P disks. Figure 2 shows that a single CPU core cannot saturate even two NVMe disks when the unit block size is 4 KB. Even with a 16 KB block size, a single core can drive at most three NVMe disks in parallel. Given that an NVMe disk typically occupies four PCIe lanes and that recent server-class CPUs support 40 to 48 PCIe lanes, a content delivery server with a single CPU can host up to 8 to 10 NVMe disks in addition to a high-speed NIC. This performance gap between the CPU and I/O devices calls for a reexamination of current OS abstractions for I/O operations. Existing operating systems require CPU intervention to perform I/O operations such as reading disk content and sending it through the NIC. This is because the current programming model requires the content of an I/O device to be moved into main memory before the next operation can be performed, so CPU cycles are wasted while stalled on memory operations.
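The lane budget behind the "8 to 10 NVMe disks" estimate can be checked with a short calculation. The four lanes per disk and the 40 to 48 total lanes come from the text above; the eight lanes reserved for the NIC is an assumption introduced here so that the arithmetic can be shown, and `max_nvme_disks` is a hypothetical helper.

```python
def max_nvme_disks(total_lanes: int, nic_lanes: int = 8, lanes_per_disk: int = 4) -> int:
    """Number of NVMe disks that fit in the PCIe lanes left over after the NIC.

    nic_lanes=8 is an assumed value (a x8 NIC slot), not a figure from the specification.
    """
    return (total_lanes - nic_lanes) // lanes_per_disk
```

Under that assumption, a 40-lane CPU leaves room for 8 disks and a 48-lane CPU for 10, which matches the stated range.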

Inefficiencies in Content Delivery System Stacks

Modern content delivery systems consist of a large number of geographically distributed content delivery web servers or reverse proxy servers. These systems underpin many applications such as video streaming and web page access. Video traffic alone accounts for about 60% of all Internet traffic, and its total volume has grown with the recent explosive increase in demand. To optimize performance, server design has traditionally focused on optimizing disk access and CPU utilization, because hard disk I/O used to be far slower. For fetching small objects such as web page content, servers are optimized to minimize disk seeks while keeping a small memory footprint for indexing. For large object accesses such as video downloads, servers exploit sequential disk reads to maximize disk throughput. They also typically support sendfile() to avoid frequent memory copies and context switches between the user process and the kernel.

To improve CPU utilization, servers typically adopt an event-driven architecture. Traditional disk-oriented optimizations have become largely obsolete with the advent of inexpensive large RAM and flash-based disks (e.g., SSDs and NVMe disks), which removed the limits imposed by seeks. With the major disk bottleneck resolved, the memory subsystem becomes the next bottleneck in today's servers. This is exacerbated by the multiple memory copies caused by disk and network I/O as well as by content scans for encryption and decryption. Recent work optimizes the disk access layer and leverages Intel Data Direct I/O (DDIO) so that all of these operations are performed on data in the CPU cache, but it does not offload computation from the CPU. Consequently, once the workload exceeds the CPU cache size, performance drops to that of the memory subsystem. To understand the performance of web-based content delivery, we run experiments with two popular web servers, Lighttpd (v1.4.32) and nginx (v1.16.1), on a disk-bound workload that simulates a typical setup for HTTP adaptive video streaming. The server setup is the same as in Section 5.1, and file sizes of 100KB, 300KB, and 500KB are used to simulate different video qualities. We use 1,600 persistent connections and confirm that the clients are not the bottleneck.

We compare the performance of the web servers with and without the sendfile() optimization. FIG. 3 shows the results for a single CPU core. In general, performance improves as the file size grows, and sendfile() improves performance by 12% to 52%. Without sendfile(), nginx performs better, while Lighttpd performs comparably when sendfile() is used. Given that a single NVMe disk achieves about 2.5GB/s (or 20Gbps) for random file reads, a single CPU core reaches only slightly more than half of the throughput of a single NVMe disk with 300KB files.

[Table 1] shows a breakdown of nginx's CPU cycles at the function-call level, as measured by perf.

Function                        %CPU
sendfile()                      71.59%
open()                          14.55%
recv()                          1.76%
ngx_http_finalize_request()     3.52%
ngx_http_send_header()          1.17%
others                          7.41%

Not surprisingly, sendfile() and open() take up most of the CPU cycles, together accounting for 86.1% of the cycles consumed. This clearly shows where most CPU cycles are spent on a content delivery server: disk and network I/O. Offloading these operations from the CPU therefore has great potential to improve performance.

Opportunities with SmartNIC

The core idea of the present invention is to offload data I/O from the CPU to a programmable I/O device while still supporting TCP-based content delivery. Ideally, any programmable device that can perform direct disk I/O and network packet I/O can serve this purpose. A recent SmartNIC platform with P2PDMA capability may be chosen for the implementation, and the design could also be implemented on an FPGA board. Recent SoC-based SmartNICs such as the Mellanox BlueField and Broadcom Stingray provide an Arm-based embedded system on top of the NIC data processing unit. These systems support direct access to NVMe disks in the same domain without involving the CPU or main memory. For example, the BlueField NIC supports P2PDMA through NVMe over Fabrics (NVMe-oF) target offload. All NVMe-oF operations are offloaded to a hardware accelerator, and the Arm processor can attach to this system to read directly from local NVMe disks. These disks are mounted directly in the Linux environment running on the Arm processor and use the same file system that is visible to the host OS. In terms of performance, fio on the BlueField NIC achieves 2.5GB/s per Intel Optane 900P NVMe disk, which is comparable to what is achieved on the host CPU side.

FIG. 4 shows the architecture of the Mellanox BlueField NIC used for the platform. It is equipped with 16 Armv8 cores and 16GB of DDR4 memory and runs Linux. The Arm subsystem can run DPDK applications to perform fast packet I/O to and from remote machines or the local host. Applications can also offload TCP/IP checksum calculation and TCP segmentation (i.e., TSO) to the ConnectX-5 NIC hardware. However, forcing every packet through the embedded system can incur significant packet forwarding overhead. To avoid this, the NIC (the eSwitch in FIG. 4) can be configured so that all incoming packets are delivered directly to the host CPU side and the NIC embedded system handles only outbound packets. The processor and memory of the embedded system are less powerful than those of the host CPU. The maximum memory bandwidth of the Arm subsystem on the BlueField NIC is only 19.2GB/s, which limits overall performance, but this is unlikely to be a fundamental limitation because Armv8-based SoCs can support memory bandwidth in excess of 100GB/s.

Design

The present invention presents the design of IO-TCP, which allows content delivery systems to take advantage of recent advances in SmartNIC I/O. The key design choice of IO-TCP is to separate the control plane and the data plane of the TCP stack so that the CPU side retains full control over all operations while the individual I/O operations are offloaded to the SmartNIC. The main rationale is to save most of the CPU cycles spent on I/O operations while keeping the NIC stack easy to implement. Simplicity is the key to performance scalability.

IO-TCP has three design goals. First, IO-TCP must conform to the TCP protocol and must be able to support various congestion control implementations. For example, handling disk I/O on the NIC stack must not break the congestion control logic of the host stack through inaccurate RTT measurements caused by disk access latency (discussed below). Second, migrating to IO-TCP should require minimal modification of existing applications. Apart from file I/O offloading (discussed below), applications should use the same socket API. Third, the IO-TCP host stack must communicate with the NIC stack for I/O offloading, and the overhead of doing so should be kept low. In addition, the host stack must be informed of any error in the NIC stack in a timely manner so that it can be handled properly (discussed below).

Separating TCP control and data planes

To save host CPU cycles, one must decide which operations benefit most from offloading, based on the capabilities of the SmartNIC and the CPU. The embedded processors of SoC-based or ASIC-based SmartNICs are better suited to simpler data plane operations, whereas x86 CPUs with advanced features can process complex control plane operations faster. The TCP stack is therefore divided into control plane and data plane operations.

Control plane functions refer to the key TCP protocol features such as connection management, reliable data transfer, congestion and flow control, and error control. Their behavior depends on the responses of the other side, so they generally require complex state management. For example, reliable data delivery on the receiver side requires tracking all disjoint ranges of received data for proper in-order delivery and ACK generation. It is also tightly coupled with congestion control, since loss detection and packet retransmission re-adjust the send window size for reliable delivery. Likewise, flow control affects the window size and therefore has to run together with congestion control. Error control cannot run alone either, because it must track detailed flow state to infer erroneous behavior. Each of these operations could be offloaded to the SmartNIC, but for efficiency it would be better to offload all of them together. However, as described later, a single CPU core is sufficient to handle thousands of connections, so offloading them would not save many CPU cycles. Connection management is the only function that can be offloaded independently, as described in AccelTCP. However, it is most beneficial only for small flows, where connection management incurs a relatively large overhead, and such flows may not match content delivery workloads. Therefore, the present invention keeps all of these functions on the CPU side.

Data plane operations refer to all operations involved in preparing and transmitting data packets in support of the control plane functions. They include data buffer management, segmentation of data into packets, TCP/IP checksum calculation, and so on. IO-TCP offloads the data plane operations on the transmit path because they are simple, repetitive, and stateless. IO-TCP also offloads file and disk I/O and couples it with the TCP data plane operations. The rationale for this decision is that these operations consume most of the CPU cycles in large-file content delivery, so offloading them saves a significant number of CPU cycles. In addition, SmartNICs tend to include hardware-based crypto accelerators, which could enable TLS data encryption and decryption at line rate in the near future.

IO-TCP Offload API Functions

First, we describe how to write an IO-TCP application. Porting an application to IO-TCP requires little modification of its core logic, but the API should remain flexible enough to support what the application needs.

For example, an IO-TCP application must be able to compose the content to send regardless of whether it is file content. To this end, the existing socket API is extended with only four functions for offloaded file and network I/O. The functions are as follows.

int offload_open(const char *filename, int mode) - opens a file in the NIC and returns a unique file id (fid).
int offload_close(int fid) - closes the file for fid in the NIC.
int offload_fstat(int fid, struct stat* buf) - retrieve the metadata for an opened file, fid.
size_t offload_write(int socket, int fid, off_t offset, size_t length) - sends the data of the given length starting at the offset value read from the file, fid, and returns the number of bytes virtually copied to the send buffer.

IO-TCP provides offload_open() and offload_close() to open and close files on the NIC stack. offload_open() asks the NIC stack to open a file and to report the result (success or error). It returns a file ID (rather than a file descriptor) that identifies the opened file on the NIC stack for subsequent operations. Because opening a file can fail for various reasons, offload_open() is an asynchronous function whose result must be checked with epoll(). When all file operations are complete, the application can ask the NIC stack to close the file with offload_close(). IO-TCP also adds offload_fstat(), which retrieves the metadata of an opened file (e.g., file size and permission information). With the ID of an opened file, the application can call offload_write() to send the file content as TCP data packets on a connection. In essence, offload_write() performs the same operation as sendfile() on Linux, except that it works with a file opened on the NIC embedded system. The application can use existing socket API calls such as write() to send custom data (e.g., an HTTP response header), and it can send content from multiple files opened by the NIC stack. Only plaintext data transfer is implemented at present, but cryptographic operations could readily be offloaded to the NIC stack (e.g., with SSL_offload_write()). FIG. 5 shows a subset of operations that use the API functions in the context of an HTTP server.
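The call sequence described above can be sketched as follows. This is only an illustration under stated assumptions: the four prototypes come from the list above, but the function bodies are hypothetical stand-ins (a fixed 300KB file, a fixed amount of send-buffer space per call, a synchronous open) so that the sketch compiles without a SmartNIC. Passing O_RDONLY as the mode argument and the helper name serve_file() are also assumptions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Prototypes as listed in the text. */
int offload_open(const char *filename, int mode);
int offload_close(int fid);
int offload_fstat(int fid, struct stat *buf);
size_t offload_write(int socket, int fid, off_t offset, size_t length);

/* Hypothetical stand-ins: a real IO-TCP stack would forward these
 * requests to the NIC stack and report the open result via epoll. */
#define MOCK_FILE_SIZE 300000   /* pretend every file is 300 KB */
#define MOCK_SNDBUF     65536   /* pretend send-buffer space per call */

int offload_open(const char *filename, int mode)
{
    (void)mode;
    return filename ? 1 : -1;
}

int offload_close(int fid) { return fid == 1 ? 0 : -1; }

int offload_fstat(int fid, struct stat *buf)
{
    if (fid != 1 || buf == NULL) return -1;
    buf->st_size = MOCK_FILE_SIZE;
    return 0;
}

size_t offload_write(int socket, int fid, off_t offset, size_t length)
{
    (void)socket; (void)fid; (void)offset;
    return length < MOCK_SNDBUF ? length : MOCK_SNDBUF;
}

/* Queue a whole file on `socket`; returns bytes queued, or -1 on error.
 * A real server would first send the HTTP response header with write(). */
long serve_file(int socket, const char *path)
{
    struct stat st = {0};
    int fid = offload_open(path, O_RDONLY);
    if (fid < 0) return -1;
    if (offload_fstat(fid, &st) != 0) { offload_close(fid); return -1; }
    assert(st.st_size >= 0);

    off_t off = 0;
    while (off < st.st_size) {
        size_t n = offload_write(socket, fid, off, (size_t)(st.st_size - off));
        if (n == 0) break;  /* buffer full: a real server would wait for an event */
        off += (off_t)n;
    }
    offload_close(fid);
    return (long)off;
}
```

Because offload_write() returns the number of bytes virtually copied, the caller advances the file offset exactly as it would with sendfile().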

In particular, as shown in FIG. 5, what is offloaded to the NIC is only the "SEND" function, which fills data packets with file content and transmits them. The ACK packets that the client returns for these data packets bypass the NIC processor and are delivered directly to the host stack on the CPU side, where they are processed. More specifically, IO-TCP offloads only the "SEND" command to the NIC, and all receive-side operations are implemented in the host stack on the CPU. In other words, every packet received from the client, whether an ACK or a packet with payload, bypasses the NIC processor and is delivered straight to the host stack for processing. Considering that ACK processing and message reception are far more complex operations, handling them on the CPU side improves efficiency.

With this design, the complex parts such as TCP ACK processing are handled on the CPU side and the SmartNIC performs only simple, repetitive work such as data transmission, so overall efficiency improves considerably. In other words, a processor with low computing performance is acceptable in the SmartNIC, which has the advantage of reducing the manufacturing cost of the SmartNIC.

IO-TCP Host Stack

The biggest challenge for an I/O-offloaded TCP stack is that the actual data to transmit may be missing from the host stack. Because file I/O is offloaded to the NIC stack, the host stack cannot generate a data packet when the data has to be read from a file. Likewise, the host stack cannot retransmit a data packet because it does not hold the data.

IO-TCP solves this problem by performing the data plane operations virtually on the host stack when the data is unavailable. The IO-TCP host stack tracks the "missing" data in the sequence number space and performs only the bookkeeping, while delegating the actual I/O operations to the NIC stack.
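The bookkeeping-only approach can be illustrated with a minimal sketch. The structure and function names below (vsendbuf, virtual_write, sendable_bytes) are hypothetical and not taken from the source; the sketch assumes the host stack keeps only byte counters per connection and sizes the next transmission by the smaller of the congestion and receive windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection bookkeeping; no payload bytes are stored. */
struct vsendbuf {
    uint32_t capacity;   /* send-buffer size in bytes */
    uint32_t queued;     /* virtual bytes written and not yet ACKed */
    uint32_t in_flight;  /* bytes already handed to the NIC, not yet ACKed */
};

/* Virtual write: only the metadata changes. Returns the number of
 * bytes "virtually" copied, limited by the free buffer space. */
uint32_t virtual_write(struct vsendbuf *b, uint32_t length)
{
    assert(b != NULL && b->queued <= b->capacity);
    uint32_t room = b->capacity - b->queued;
    uint32_t n = length < room ? length : room;
    b->queued += n;
    return n;
}

/* Bytes the next transmission may cover under congestion control (cwnd)
 * and flow control (rwnd). */
uint32_t sendable_bytes(const struct vsendbuf *b, uint32_t cwnd, uint32_t rwnd)
{
    assert(b != NULL && b->in_flight <= b->queued);
    uint32_t window = cwnd < rwnd ? cwnd : rwnd;
    if (window <= b->in_flight) return 0;
    uint32_t allowed = window - b->in_flight;
    uint32_t unsent = b->queued - b->in_flight;
    return unsent < allowed ? unsent : allowed;
}
```

The design point is that the host stack never touches file content: every quantity it needs for window management is a counter in the sequence number space.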

For example, when the application calls offload_write(), the host stack only updates metadata, as if the data had been written to the send buffer. The function returns immediately with the number of "virtual" bytes written to the send buffer. The host stack then determines how many bytes to send according to the congestion and flow control parameters and posts a "SEND" command to the NIC stack (see ③ and ④ in FIG. 5).

The "SEND" command is carried in a TCP packet addressed to the NIC stack (with the NIC's internal MAC address). The TCP/IP header of the command packet contains the full connection information for the data packets to follow (e.g., the connection 4-tuple and the sequence and ACK numbers), and the payload contains the "SEND" command, which is eventually replaced with the real content before transmission to the client. The "SEND" command specifies the file ID, the starting offset to read, and the data length. Using this information, the NIC stack reads the file content and generates and transmits the real data packets with the header information. Depending on the size of the file content, a single "SEND" command may be translated into multiple MTU-sized data packets. In other words, the "SEND" command is implemented by having the host stack on the CPU side pass an ordinary TCP packet to the NIC stack; the header of this packet is identical to that of an ordinary TCP packet, but the packet carries a virtual payload. The virtual payload holds the file ID of a specific file, the starting offset to read, and the data length; the NIC reads the content of this file and, according to the length, converts the command into n real TCP packets and transmits them. How the "SEND" command is processed is described in more detail below.
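The translation of one "SEND" command into n real TCP packets can be sketched as below. The source states only that the virtual payload carries a file ID, a starting offset, and a length; the field widths, the struct layout, and the function names here are assumptions made for illustration, and the payload size per packet (mss) is supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the virtual payload of a SEND command. */
struct send_cmd {
    uint32_t fid;      /* file ID returned by offload_open() */
    uint64_t offset;   /* starting offset to read */
    uint32_t length;   /* number of bytes to send */
};

/* Number of real TCP data packets generated for one SEND command. */
uint32_t send_cmd_packets(const struct send_cmd *cmd, uint32_t mss)
{
    assert(cmd != NULL && mss > 0);
    return cmd->length / mss + (cmd->length % mss != 0);
}

/* Sequence number, file offset, and payload size of the i-th packet
 * (0-based). `first_seq` is the sequence number in the command header. */
void send_cmd_packet(const struct send_cmd *cmd, uint32_t mss,
                     uint32_t first_seq, uint32_t i,
                     uint32_t *seq, uint64_t *file_off, uint32_t *len)
{
    assert(i < send_cmd_packets(cmd, mss));
    uint32_t done = i * mss;
    uint32_t left = cmd->length - done;
    *seq = first_seq + done;            /* wraps modulo 2^32 like TCP */
    *file_off = cmd->offset + done;
    *len = left < mss ? left : mss;
}
```

In the design described here, the final segmentation is delegated to the NIC hardware (TSO), so a software loop like this only illustrates the mapping between one command and its packets.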

FIG. 6 shows how a "SEND" command packet is processed. The host stack handles packet retransmission in the same way: it sends a "SEND" command with the file content information for the data to be retransmitted. The rationale for this design is to keep the NIC stack as simple as possible. One obvious alternative would be to let the NIC stack handle retransmissions and thereby guarantee reliable delivery of all data covered by a "SEND" command. The NIC stack would then have to track every ACK from the client and run the congestion control logic to decide when to retransmit a packet.

That would make the NIC stack stateful and more complex, and it could be difficult to implement efficiently on some SmartNIC platforms (e.g., FPGA-based platforms). For all other operations, IO-TCP behaves like an ordinary TCP stack. All complex operations, such as per-connection state and buffer management on the receive path, timer management, reliable data transfer, congestion and flow control, and error control, are executed in the host stack. In addition, for control packets or packets whose data is available in the host stack, the host stack generates them and sends them directly to the client, bypassing the NIC stack. All packets arriving from the client are likewise delivered directly to the host stack (see ② in FIG. 5 and the ACK sent by the client).

This is because passing through the NIC's embedded system not only incurs latency overhead, as shown in FIG. 7, but also places an unnecessary burden on the NIC stack. FIG. 7 shows the Ethernet round-trip time for communication between the host and the NIC stack. This packet steering is easily enforced in the separated mode of the Mellanox BlueField NIC, in which the NIC's embedded system has its own IP and MAC addresses.

IO-TCP NIC Stack

The IO-TCP NIC stack performs all of the actual data plane operations on behalf of the host stack and handles the offloaded file I/O and network I/O for data packet transmission. It operates by processing custom commands from the host stack, each of which is delivered in a special packet addressed to the NIC stack. Four commands are currently defined: "OPEN", "CLOS", "SEND", and "ACKD". "OPEN" and "CLOS" are used to open and close a file for the IO-TCP application. "SEND" is the main command for sending file content to the client. Finally, "ACKD" is used to handle retransmissions efficiently without redundant disk accesses. The "SEND" command is the key driver of I/O operations. Given a "SEND" command, the NIC stack verifies that the target file is open and reads the file content into a fixed-size memory buffer. The file read offset and length are aligned to the NVMe disk page boundary (e.g., 4KB), and the actual file I/O is executed asynchronously to avoid blocking the main thread. Once the file content is available in the memory buffer, the NIC stack creates a TSO packet using the TCP/IP header of the "SEND" command packet and sends it to the NIC hardware data plane. The NIC hardware data plane handles TCP packet segmentation and TCP/IP checksum calculation.
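Two steps from this paragraph lend themselves to a short sketch: recognizing the four commands and aligning a read request to the disk page boundary. The source does not specify the wire encoding, so treating each command as a four-character ASCII tag at the start of the payload is an assumption, as are the names parse_cmd and aligned_range; the 4KB page size is the example value given in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum nic_cmd { CMD_OPEN, CMD_CLOS, CMD_SEND, CMD_ACKD, CMD_UNKNOWN };

/* Map a four-character tag at the start of the payload to a command.
 * The tag-based encoding is an assumption made for illustration. */
enum nic_cmd parse_cmd(const char *payload, size_t len)
{
    static const char *const names[] = { "OPEN", "CLOS", "SEND", "ACKD" };
    if (payload == NULL || len < 4) return CMD_UNKNOWN;
    for (int i = 0; i < 4; i++)
        if (memcmp(payload, names[i], 4) == 0) return (enum nic_cmd)i;
    return CMD_UNKNOWN;
}

#define DISK_PAGE 4096u   /* example page size from the text */

/* Expand [offset, offset + length) outward to page boundaries so the
 * disk read starts and ends on a page. */
void aligned_range(uint64_t offset, uint64_t length,
                   uint64_t *aoff, uint64_t *alen)
{
    assert(aoff != NULL && alen != NULL);
    uint64_t mask = ~(uint64_t)(DISK_PAGE - 1);
    uint64_t start = offset & mask;
    uint64_t end = (offset + length + DISK_PAGE - 1) & mask;
    *aoff = start;
    *alen = end - start;
}
```

A request for 10,000 bytes at offset 5,000 would therefore read three whole pages starting at offset 4,096, and only the requested bytes would be placed in data packets.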

In addition, when the host stack on the CPU side receives ACK packets from the client, it periodically (e.g., every 32KB) notifies the NIC of the ACKed sequence number, so the NIC no longer needs to keep in memory the file content up to that sequence number. In TCP, a data packet that is lost in the network is retransmitted to ensure reliable delivery. To support retransmission, the NIC must keep in memory any data that has been sent but not yet ACKed, and the "ACKD" command allows it to discard such data once it has been acknowledged.
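The periodic notification can be sketched as follows, assuming the host stack tracks the sequence number carried in its last "ACKD" command and that the NIC frees a buffer once the notified sequence number reaches the end of that buffer. The names ackd_state, ackd_due, and buffer_releasable are hypothetical; the 32KB interval is the example value from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACKD_INTERVAL (32 * 1024)   /* example interval from the text */

struct ackd_state {
    uint32_t last_notified;   /* sequence number sent in the last ACKD */
};

/* Host side: returns true (and records the new value) when at least
 * ACKD_INTERVAL bytes have been ACKed since the last notification.
 * The signed difference tolerates 32-bit sequence wraparound and
 * ignores ACK numbers that move backward. */
bool ackd_due(struct ackd_state *s, uint32_t acked_seq)
{
    assert(s != NULL);
    int32_t delta = (int32_t)(acked_seq - s->last_notified);
    if (delta < ACKD_INTERVAL) return false;
    s->last_notified = acked_seq;
    return true;
}

/* NIC side: a buffer whose data ends at `buf_end_seq` can be freed
 * once the notified sequence number has reached it. */
bool buffer_releasable(uint32_t buf_end_seq, uint32_t ackd_seq)
{
    return (int32_t)(ackd_seq - buf_end_seq) >= 0;
}
```

Batching the notifications keeps the command traffic between host and NIC low while still bounding how much already-acknowledged data the NIC holds.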

Challenges with Integrated I/O

Coupling file I/O with network I/O on the NIC stack raises several unique challenges for the correctness of TCP stack operation.

(Retransmission timer and RTT measurement)

TCP relies on delay measurements for decisions such as the retransmission timer. However, the delay caused by disk I/O can distort RTT measurements. Even with fast NVMe disks, the disk access latency for reading a few KB of data is on the order of microseconds, and it can reach milliseconds when I/O requests to the same disk are backlogged.

An early implementation of IO-TCP often retransmitted packets even though the original packets had not yet been sent to the client. To address this, the NIC stack sends an echo packet back to the host stack when it is about to transmit the data packets for the corresponding "SEND" command. The host stack does not start the retransmission timer until it receives the echo packet for the "SEND" command. For higher accuracy, the host stack subtracts from the timeout value the one-way delay of the echo packet from the NIC stack (about 3.7 microseconds on the platform). The CPU overhead of echo packets is small because one echo packet is sent per "SEND" command, and a typical "SEND" command translates into tens to hundreds of MTU-sized packets when a large file is transferred over a lightly congested network path.
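A minimal sketch of this timer rule is shown below. It assumes a single timer per "SEND" command with times kept in nanoseconds so that the 3.7 microsecond one-way delay can be represented exactly; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rto_timer {
    bool     armed;
    uint64_t expires_ns;   /* absolute expiry time in nanoseconds */
};

/* Posting a SEND command does not start the timer. */
void on_send_posted(struct rto_timer *t)
{
    assert(t != NULL);
    t->armed = false;
    t->expires_ns = 0;
}

/* The timer starts only when the echo packet arrives at the host.
 * The echo's one-way delay is subtracted from the timeout because the
 * data packets left the NIC that long before the echo was received. */
void on_echo_received(struct rto_timer *t, uint64_t now_ns,
                      uint64_t rto_ns, uint64_t echo_delay_ns)
{
    assert(t != NULL && rto_ns > echo_delay_ns);
    t->expires_ns = now_ns + (rto_ns - echo_delay_ns);
    t->armed = true;
}

bool rto_expired(const struct rto_timer *t, uint64_t now_ns)
{
    return t->armed && now_ns >= t->expires_ns;
}
```

With this rule, time spent waiting for the disk before the echo is never charged against the retransmission timeout.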

Also, for accurate RTT measurement, the NIC stack reflects the actual "packet processing" delay in the TCP timestamp option value. This delay is the time between the arrival of a "SEND" command at the NIC stack and the departure of the corresponding data packets from the NIC stack. Specifically, the "SEND" command packet carries the TCP timestamp option filled in by the host stack, and the NIC stack updates the value before sending the packets. Because the timestamp option value is in milliseconds while the host stack's time feed is in microseconds, the host stack also sends additional time information in microseconds to the NIC stack. The NIC stack can then round the timestamp value if necessary.
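As a rough sketch of the timestamp update described above, assuming the host passes its millisecond timestamp plus a microsecond remainder and the NIC adds its own processing delay before rounding. The function name and parameters are hypothetical; the rounding rule (round half up) is one possible choice, since the text only says the value may be rounded.

```python
def corrected_tsval_ms(host_tsval_ms: int, host_extra_us: int, nic_delay_us: int) -> int:
    """Return the TCP timestamp value (ms) the NIC would place in the data packet.

    host_tsval_ms : timestamp option value filled in by the host stack
    host_extra_us : additional microsecond-level time information from the host
    nic_delay_us  : delay between SEND arrival and data packet departure on the NIC
    """
    total_us = host_tsval_ms * 1000 + host_extra_us + nic_delay_us
    # Round to the nearest millisecond; the timestamp field is 32 bits wide.
    return ((total_us + 500) // 1000) & 0xFFFFFFFF
```

For example, a disk read that holds a "SEND" command for 9,700 microseconds advances the timestamp by about 10 ms, so the echoed timestamp no longer charges that delay to the network path.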

In summary, the CPU does not start the retransmission timer until it receives the echo packet, deferring it for a predetermined time. The predetermined time is determined by the packet processing delay in the NIC and the one-way delay of the echo packet, where the packet processing delay in the NIC may be defined as the delay between the arrival of the "SEND" command at the NIC and the departure of the data packets for that "SEND" command toward the client. In other words, the echo packet lets the NIC convey information about the disk I/O delay to the host stack so that the host stack can measure the RTT of the flow accurately.

(Handling retransmission)

As described above, retransmission of I/O-offloaded packets is also implemented with the "SEND" command. However, a naive implementation would read the file content again and waste disk and memory bandwidth. To avoid this inefficiency, the NIC stack keeps the original data content in memory until the host stack confirms delivery to the client. When the host stack sees ACKs for an I/O-offloaded sequence space range, it periodically informs the NIC stack of the delivered portion with an "ACKD" command packet. The NIC stack can then recycle the memory buffers that hold the delivered data. To minimize overhead, the host stack notifies the NIC stack each time a threshold amount of data (currently 32KB) has been acknowledged by the client since the last notification. This buffer memory essentially serves as the socket send buffer of a regular TCP stack, but the memory actually required is roughly the bandwidth-delay product. For example, a 100Gbps NIC with an average connection RTT of 30ms may require 375MB of buffer memory.
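The threshold-based "ACKD" notification and the buffer sizing arithmetic above can be sketched as follows. The class `AckdNotifier` and the helper `bdp_bytes` are hypothetical names introduced for illustration; the sketch ignores TCP sequence number wrap-around, which a real stack would have to handle.

```python
ACKD_THRESHOLD = 32 * 1024  # bytes; the threshold mentioned in the text


class AckdNotifier:
    """Hypothetical host-side helper deciding when to emit an "ACKD" command."""

    def __init__(self, threshold: int = ACKD_THRESHOLD):
        self.threshold = threshold
        self.last_notified = 0  # highest sequence number already reported to the NIC

    def on_ack(self, acked_seq: int):
        # Notify the NIC only after at least `threshold` newly ACKed bytes.
        if acked_seq - self.last_notified >= self.threshold:
            self.last_notified = acked_seq
            return acked_seq  # sequence number to carry in the ACKD command
        return None


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: bits/s * ms / 1000 / 8."""
    return bandwidth_bps * rtt_ms // 8000
```

Here `bdp_bytes(100_000_000_000, 30)` evaluates to 375,000,000 bytes, which matches the 375MB figure given for a 100Gbps NIC with a 30ms RTT.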

Handling Errors

In IO-TCP, the host stack handles all TCP-level errors, such as packet loss, malformed packets, and abrupt connection failures. Because the NIC stack only sends packets on behalf of the host stack and all incoming packets bypass the NIC stack, the host stack can reason about TCP-level errors just like any other TCP stack.

However, the NIC stack must report file I/O errors to the host stack. For the "OPEN" command, the NIC stack responds to the host stack whether or not the file was opened successfully. The host stack then raises an event on the corresponding file ID so that the application learns the result. Because the host stack caches the metadata of offloaded files, offload_write() can return an error immediately when an invalid parameter value is passed. If an error occurs in the file read operation itself, the NIC stack reports it to the host stack with an "error" command packet carrying the file ID and the error code. The next time the application calls offload_write(), it returns -1 with the error code in errno.

Implementation

(IO-TCP host stack)

The IO-TCP host stack is implemented by modifying mTCP, a high-performance user-level TCP stack. mTCP is chosen because its socket API is similar to the Berkeley socket API and it supports event-driven programming with epoll.

The host stack includes all mTCP API functions and adds four offload functions (see Table 2) for applications. Each offload function is implemented by exchanging special command packets with the NIC stack. The NIC stack detects a command packet by checking for a special value in the ToS field of the IP header. Since a "SEND" command packet carries a valid TCP/IP header with the full connection information, only the payload (together with the checksum) needs to be swapped with the actual file content before the packet is sent to the client.
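Command packet detection by ToS value can be illustrated with the short sketch below. The marker value `COMMAND_TOS` is an assumption made for this example (the text does not state which value is used), and `is_command_packet` is a hypothetical helper operating on a raw IPv4 header.

```python
import struct

# Hypothetical marker; the text only says "a special value" in the ToS field.
COMMAND_TOS = 0xFC


def is_command_packet(ip_packet: bytes) -> bool:
    """Return True if the IPv4 header carries the command marker in its ToS byte."""
    if len(ip_packet) < 20:  # shorter than a minimal IPv4 header
        return False
    # Byte 0: version (high nibble) and header length; byte 1: ToS.
    version_ihl, tos = struct.unpack_from("!BB", ip_packet, 0)
    return (version_ihl >> 4) == 4 and tos == COMMAND_TOS
```

Packets that fail this check would follow the ordinary forwarding path, which is consistent with incoming client packets bypassing the NIC stack.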

For offload_open(), the host stack receives the metadata of the opened file (e.g., the output of fstat()) in the NIC stack's response to a successful operation. The host stack caches this metadata so that offload_stat() can be handled locally, which reduces round trips to the NIC stack. For file operations, the host and the NIC stack share the same file system: the host OS mounts the file system on the NVMe disks as read/write, and the NIC stack mounts it as read-only. One issue is that updates to the host-side file system are not automatically propagated to the NIC stack, because the NIC stack runs on a separate operating system. It is currently assumed that files do not change while the content delivery service is running, but support for dynamic synchronization of the two file system views may be added in the future.
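The metadata caching described above might look like the following sketch. Only the function names offload_open(), offload_stat(), and offload_write() come from the text; the class `OffloadFileTable`, its method signatures, and the use of the file size as the only cached field are assumptions for illustration.

```python
class OffloadFileTable:
    """Hypothetical host-side cache of metadata returned by the NIC stack."""

    def __init__(self):
        self._size_by_id = {}  # file ID -> file size in bytes

    def on_open_reply(self, file_id: int, ok: bool, size: int = 0) -> None:
        # Called when the NIC stack answers an "OPEN" command.
        if ok:
            self._size_by_id[file_id] = size

    def offload_stat(self, file_id: int):
        # Served from the cache, so no round trip to the NIC stack is needed.
        return self._size_by_id.get(file_id)

    def write_args_valid(self, file_id: int, offset: int, length: int) -> bool:
        # Lets offload_write() reject bad parameters before contacting the NIC.
        size = self._size_by_id.get(file_id)
        return (size is not None and offset >= 0 and length > 0
                and offset + length <= size)
```

The same cache serves both purposes mentioned in the text: answering stat requests locally and validating write parameters on the host.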

In the present invention, the TCP header can be reused for communication between the host stack and the NIC in order to implement the offload API efficiently. Specifically, when the host stack decides to transmit data packets after "offload_write()" has been called, it piggybacks the leading part of the TCP/IP header to be sent onto the "SEND" command and transmits it to the NIC. The NIC reuses this header as is, which makes the implementation simple and efficient. Piggybacking refers to redefining the structure of an information frame so that the frame performs a response function while carrying information, which reduces the number of separate response frames and improves transmission efficiency.

(IO-TCP NIC stack)

The NIC stack is implemented as a DPDK application. Its basic operations are to read command packets sent by the host stack, read data from files, generate data packets, and send them to the client.

Multiple main threads perform these operations in parallel, and each thread is pinned to a separate Arm core and runs independently of the others. Command packets from the host stack are distributed to these threads by RSS (receive-side scaling) in the NIC hardware, so that data packets belonging to different "SEND" command packets of the same connection are delivered in order. Because command packets between the host and the NIC stack are localhost communication, they are delivered reliably and without reordering.

For efficient memory buffer management, the NIC stack pre-allocates all buffers for file content at startup. All buffers have a configurable fixed size. Each main thread owns 1/n of these buffers to avoid lock contention, and a simple user-level memory manager allocates and frees buffers at low cost. When the per-thread buffers are exhausted, the memory manager allocates buffers dynamically.

Even with fast NVMe disks, reading a file is slower than memory operations, so each main thread uses a few disk-reading threads to avoid blocking itself. The disk-reading threads use direct I/O (i.e., open() with the O_DIRECT flag) to bypass the inefficiency of the operating system's file system cache, and they communicate efficiently with the main thread through shared memory. An alternative is to use a user-level disk I/O library such as Intel SPDK. In practice, SPDK reaches its maximum performance with up to 2x fewer Arm processor cycles than direct I/O, but, given SPDK's file system support, this implementation stays with a regular file system (i.e., XFS on Linux).
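The per-thread buffer management described above can be sketched as follows. `ThreadBufferPool` is a hypothetical name, and the sketch models only one thread's share of the pre-allocated buffers plus the dynamic fallback; it is not the DPDK-based implementation.

```python
class ThreadBufferPool:
    """Hypothetical per-thread pool owning 1/n of the pre-allocated buffers."""

    def __init__(self, total_buffers: int, n_threads: int, buf_size: int):
        self.buf_size = buf_size
        share = total_buffers // n_threads  # this thread's 1/n share
        # Pre-allocate fixed-size buffers once at startup.
        self.free = [bytearray(buf_size) for _ in range(share)]
        self.dynamic_allocs = 0

    def alloc(self) -> bytearray:
        if self.free:
            return self.free.pop()  # no lock needed: the pool is thread-private
        # Share exhausted: fall back to dynamic allocation.
        self.dynamic_allocs += 1
        return bytearray(self.buf_size)

    def release(self, buf: bytearray) -> None:
        # Called once the data in the buffer no longer has to be retained.
        self.free.append(buf)
```

Because each pool is private to one thread, allocation and release are plain list operations with no synchronization, which is the point of giving every thread its own share.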

(Porting Lighttpd to IO-TCP)

To evaluate the effectiveness of IO-TCP in a real application, Lighttpd v1.4.32 is ported to IO-TCP: the mTCP-ported Lighttpd code is taken from GitHub and modified to support offloaded I/O operations. Porting to IO-TCP is straightforward, as only about 10 of the 41,871 lines of Lighttpd code needed to be modified.

Evaluation

The evaluation of IO-TCP in the present invention addresses the following questions: (1) whether the new architecture improves the throughput of a content delivery system, (2) whether it saves a significant number of CPU cycles, and (3) whether the design and implementation serve their purpose well. Before the evaluation experiments were run, it was verified with the IO-TCP-ported Lighttpd server that the IO-TCP stack remains correct even under heavy packet loss and with a large number of concurrent connections.

Experiment Setup

The experimental setup consists of one server and two clients. The server machine has two Intel Xeon Gold 6142 CPUs @ 2.60GHz (16 cores), of which only one CPU is used for the experiments. It is equipped with two 25G Mellanox BlueField SmartNICs and four Intel Optane 900P NVMe disks. One SmartNIC and two NVMe disks are attached to each NUMA domain, and NVMe-oF target offload is set up so that each NIC can directly access the two NVMe disks in the same domain.

The host CPU runs Linux 4.14 and the Arm subsystem runs Linux 4.20. Each client machine is equipped with an Intel E5-2683v4 CPU @ 2.10GHz (16 cores) and a 100Gbps Mellanox ConnectX-5 NIC. The client NICs are configured to use LRO (Large Receive Offload). All clients run Linux 4.14, and it was confirmed that the clients are not the bottleneck in the experiments.

All NICs are connected to a 100Gbps Dell EMC Networking Z9100-ON switch. The NVMe disks are filled with 300KB files, which is the average size of the video chunks used in recent work. The working set size is made to exceed the main memory size so that the workload is disk-bound. Each disk has an advertised read throughput of 2500MB/s, which implies a theoretical limit of 80Gbps when reading from all four disks.
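The 80Gbps figure follows directly from the advertised per-disk throughput. The helper below is a hypothetical one-line calculation that assumes decimal units (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits per second).

```python
def aggregate_read_gbps(num_disks: int, mb_per_s_per_disk: int) -> float:
    """Aggregate read throughput in Gbps: disks * MB/s * 8 bits / 1000."""
    return num_disks * mb_per_s_per_disk * 8 / 1000
```

With four disks at 2500MB/s each, `aggregate_read_gbps(4, 2500)` gives 80.0, so the disks can in principle supply more data than the two 25G SmartNICs can send.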

IO-TCP Throughput with Lighttpd

This experiment evaluates the effectiveness of IO-TCP in delivering large file content. The throughput of the IO-TCP-ported Lighttpd is compared with that of the stock version, which uses sendfile(), for varying numbers of CPU cores.

The workload issues 1,600 concurrent requests over persistent connections, each requesting a different 300KB video chunk. Each experiment runs for 5 minutes, and the average throughput is reported. FIG. 8 shows the results of evaluating the effectiveness of IO-TCP in large file content delivery.

Lighttpd on IO-TCP uses only a single CPU core on the host side and all 16 Arm cores on each SmartNIC. A single CPU core is sufficient to handle the control plane operations for all 1,600 clients, and because the throughput is already close to the NIC bandwidth limit, using more CPU cores does not improve performance.

Each NIC achieves about 22Gbps, which shows that performance scales with the number of NICs. In contrast, Lighttpd on Linux TCP achieves about 11Gbps with a single CPU core, which is 4x lower than IO-TCP (about 44Gbps over the two NICs). Its performance increases non-linearly with the number of CPU cores and reaches a level similar to the IO-TCP-ported Lighttpd with 6 CPU cores. This clearly shows that IO-TCP effectively offloads I/O operations from the CPU and provides a significant saving in CPU cycles.

It is also evaluated whether IO-TCP provides bandwidth fairness among competing connections. Jain's fairness index for IO-TCP ranges from 0.91 to 0.97 depending on the number of concurrent connections. In the same experiment, Linux TCP shows a similar range (0.90 to 0.97). This is a good result considering the heavy disk I/O contention in the experiment.
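For reference, Jain's fairness index over per-connection throughputs x_1..x_n is (sum of x_i)^2 / (n * sum of x_i^2). It equals 1 when all connections receive the same throughput and 1/n when a single connection receives everything. A minimal sketch, with the function name chosen for this example:

```python
def jain_index(throughputs) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    xs = list(throughputs)
    if not xs:
        raise ValueError("at least one throughput value is required")
    total = sum(xs)
    squares = sum(x * x for x in xs)
    if squares == 0:
        raise ValueError("all throughputs are zero")
    return (total * total) / (len(xs) * squares)
```

An index between 0.91 and 0.97, as reported above, therefore indicates that throughput is spread fairly evenly across the concurrent connections.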

Analysis on CPU Cycle Saving

The CPU cycle saving from I/O offloading is analyzed further. First, it is investigated whether a full CPU core is required to achieve the peak performance in the experiment of the previous section. A custom program that steals a configured amount of CPU cycles is written and pinned to the same CPU core on which the IO-TCP host stack runs.

FIG. 9(a) shows the throughput as a function of the fraction of CPU cycles consumed by the host stack. 70% of the cycles of a single CPU core achieves 97.3% of the peak performance, and 80% of the cycles is sufficient for the maximum throughput. Even with half of a CPU core, the performance loss is only 23% of the peak throughput. This shows that a full CPU core is not needed to handle the control plane operations for 1,600 concurrent connections. Offloading data plane operations to the SmartNIC saves a significant number of CPU cycles.

FIG. 9(b) shows the throughput of a single SmartNIC as a function of the number of Arm cores used. The maximum achievable performance is 22.1Gbps, which is reached with just 12 Arm cores. Even with only 6 Arm cores, the performance loss is only about 10% of the maximum throughput. This indicates that the Arm cores are not the bottleneck in the prototype. Rather, the bottleneck of the SmartNIC lies in its memory bandwidth, which is discussed in detail below under the memory bandwidth limit.

Evaluation of IO-TCP Design Choices

The effect of the design choices made for IO-TCP is evaluated here.

(Retransmission timer correction)

First, the effect of the echo packet, which starts the retransmission timer at the moment the actual data packets are sent, is evaluated. FIG. 10(a) compares the throughput of Lighttpd with and without the timer correction. Without echo packets, many premature timeouts occur in the IO-TCP host stack ("no RTO correction"), and performance stays at 37.9Gbps. With the timer correction, IO-TCP improves performance by 16.2%. In fact, IO-TCP without proper timer correction generates 97.6 times more timeouts than with the correction. The performance degradation caused by premature timeouts would be much more severe in a wide area network (WAN), where the end-to-end RTT is larger.

(RTT measurement correction)

The effect of TCP timestamp correction at the NIC stack is also evaluated. The average RTT value recorded by the TCP stack is measured every second and plotted in FIG. 10(b). When TCP timestamp correction is disabled, the average RTT with 1,600 concurrent connections reaches 42-44ms even though the normal RTT is below 1ms. This is because the RTT then includes the disk access delay, which can reach 10 milliseconds under contention. When timestamp correction is enabled, the average RTT drops to 3-5ms. Although this is much lower, the RTT is still above 1ms because of heavy contention at the NIC. More accurate latency measurement is important for triggering timeouts when packets are lost.

Overhead Evaluation

The split architecture of IO-TCP can suffer from the communication overhead between the host and the NIC stack as well as from the low computing capacity of the Arm-based subsystem on the NIC. For this reason, the CPU-only approach of Linux TCP performs better than IO-TCP with a small number of concurrent connections, since the CPU can comfortably handle the connections without this overhead.

However, this trend changes as the number of connections increases. FIG. 11 shows the results of throughput experiments with different numbers of concurrent connections. With a single persistent connection, Linux TCP outperforms IO-TCP by more than 2x. However, Linux TCP reaches its peak performance with as few as 4 connections, and its performance stays the same beyond that. In contrast, the throughput of IO-TCP increases slowly because of the overhead, but it exceeds that of Linux TCP at 8 connections and reaches its peak at 128 connections. The same performance trend holds for smaller file sizes.

FIG. 12 shows the throughput of serving 4KB files with different numbers of connections. As in the previous case, Linux TCP and IO-TCP reach their peak performance at 4 and 128 connections, respectively, but the performance is much lower than that shown in FIG. 11 because of the increased overhead of file operations. Nevertheless, IO-TCP outperforms Linux TCP with 16 or more concurrent connections. To evaluate the latency overhead, the latency of downloading small files with IO-TCP is compared with that of Linux TCP. IO-TCP shows an additional delay of about 200-500us when downloading files of 1KB to 16KB. This additional latency is unavoidable given the overhead of the current prototype, but it is negligible when content is delivered over a wide area network.

The present invention also has strong potential for future use, as discussed in the following.

(Memory bandwidth limit)

The performance bottleneck of the current prototype lies in the low memory bandwidth of the BlueField NIC. This is confirmed by running the same test as in FIG. 8 with 100Gbps BlueField NICs. The NIC stack implementation has two main memory copies: (1) copying from the NVMe disk to a memory buffer, and (2) copying from the memory buffer to the TSO packet payload buffer. Disabling (1) raises the performance to 36Gbps, and disabling both raises it to 83Gbps per NIC. The current BlueField NIC has only 1/6 of the memory bandwidth of the Intel Xeon Gold 6142.

Nevertheless, the method according to the present invention remains promising for the future. First, Arm-based SoCs can be designed with much higher memory bandwidth, and future SmartNICs can benefit from this. For example, the Cavium ThunderX2, an Armv8-based server SoC, has a maximum memory bandwidth of 166GB/s, which is much larger than that of the Intel Xeon CPU. Second, improving the memory bandwidth of SmartNICs is more cost-effective, because Arm SoCs are cheaper than server-class x86 CPUs and overall performance can easily be scaled by using multiple NICs.

(Encryption support)

The current prototype does not implement SSL/TLS, but it could be supported by performing the cryptographic operations in the NIC stack. The key question is whether the performance would be better than, or at least comparable to, a CPU-based implementation. The Arm subsystem of the NIC platform supports AES-NI, but its AES-GCM performance is only 8.1% to 12.7% of that of the Intel CPU (Gold 6142), as shown in Table 3 below.

Key Size         64B      256B     1025B    8092B
128 bit (Arm)    212      324      365      377
256 bit (Arm)    192      287      321      333
128 bit (CPU)    1,632    3,063    4,649    5,657
256 bit (CPU)    1,507    2,550    3,572    4,103

AES-GCM is the most popular symmetric-key cipher in TLS today, and it determines the performance of large file transfers. Interestingly, when 256-bit-key AES-GCM is run over the TSO payload in IO-TCP, the current prototype achieves 19Gbps per NIC (or 38Gbps for both NICs), losing only 14% of the plaintext transfer performance. In the same setup, Lighttpd on the Linux TCP stack produces only 5.5Gbps with a single core and needs 8 cores to reach 38Gbps. This performance is reasonable, because 6 Arm cores are enough to achieve 19Gbps for plaintext transfer (see FIG. 9(b)) and the remaining 10 cores can perform AES-GCM at 19Gbps. Since BlueField-2 has a hardware accelerator for AES-GCM, TLS support in the NIC stack will make even more sense in the near future.

(QUIC support)

The design of the present invention can easily be extended to support other transport layer protocols such as QUIC. To support QUIC in file-based content delivery, only offload_write() needs to be updated for QUIC, while the other API functions remain unchanged. Internally, each "SEND" command packet must be converted at the NIC stack into QUIC data packets with the appropriate headers. In addition, since QUIC runs over TLS, cryptographic operations must be supported.

(Support for other applications)

Although the present invention focuses mainly on HTTP-based video streaming applications that use IO-TCP, any application that needs to serve large files can benefit from the stack. This includes software downloads, web browsing, and file-level synchronization. Even for small file transfers, the IO-TCP stack still provides a benefit as long as there are enough concurrent connections, as described above.

Work related to the present invention is described below.

(PCIe P2P communication)

Enabling PCIe P2P communication between external devices can greatly reduce the CPU overhead of transferring data between them. NVIDIA GPUDirect RDMA, GPUDirect Async, and AMD DirectGMA expose GPU memory directly in the PCIe memory space, which gives other devices a way to access data on the GPU directly. EXTOLL proposes direct communication between the Intel Xeon Phi coprocessor (accelerator) and the NIC so that accelerators can communicate with each other over the network without CPU involvement. Morpheus enables communication between NVMe devices and other PCIe devices. DCS and DCS-ctrl propose a hardware-based framework that enables P2P communication between various types of external PCIe devices. However, all of these P2P solutions consider only data communication at the hardware level and do not take the kernel stack into account. As a result, they still suffer from kernel stack overhead when running content delivery applications.

(Accelerated networking stacks)

Several existing works aim to improve the performance of the networking stack. Some of them try to improve the performance of the existing kernel stack. Fastsocket improves TCP stack performance by achieving table-level connection partitioning, increasing connection locality, and eliminating lock contention. StackMap provides a dedicated, zero-copy, low-overhead network interface for applications. MegaPipe improves performance by using partitioned lightweight sockets and batching system calls.

Another approach is to bypass the heavy kernel stack and run the entire stack at user level. mTCP, IX, Sandstorm, and F-Stack use user-level packet I/O libraries and multiple CPU cores to process incoming flows concurrently, which increases throughput and reduces the latency of kernel calls. ZygOS, Shinjuku, and Shenango further improve the tail latency of packet processing by improving load balancing across CPU cores. Arrakis completely removes kernel involvement from the data path while preserving the security properties of the kernel. TAS builds a TCP fast path as a separate OS service with the goal of improving the performance of RPC calls in data centers. Disk|Crypt|Net builds a new video streaming stack that combines a new kernel-bypass storage stack with an existing kernel-bypass network stack to achieve low latency and high throughput for video streaming applications.

However, all of these solutions still require many CPU cycles for packet processing and therefore waste considerable CPU power when transferring data between external devices. A recent work called AccelTCP offloads TCP connection management and connection relaying to the SmartNIC, relieving the host CPU cores of part of the packet processing work. However, it focuses only on improving throughput for short-lived connections and L7 proxies.

(NIC Offload)

Traditionally, NIC offload schemes have ranged from TCP segmentation offload (TSO), large receive offload (LRO), and TCP/IP checksum offload to stateful systems such as the TCP offload engine (TOE) and Microsoft Chimney Offload. Stateless offload schemes have become common in modern NICs because they effectively reduce CPU cycles for large message transfers. Stateful schemes, by contrast, have not been very popular because their complexity raises various security and maintenance issues. Despite the potential dependency on disk I/O, the IO-TCP NIC stack is designed to be stateless so that it can be implemented easily and perform well on a variety of NIC platforms. More recently, several works have focused on offloading various tasks to SmartNICs to improve the performance of specific applications. AccelTCP improves TCP performance by offloading short-lived connection management and L7 proxying to the SmartNIC. KV-Direct uses FPGA-based SmartNICs to improve the performance of in-memory key-value stores. Floem, ClickNP, and UNO use SmartNICs to accelerate packet processing for network applications. Metron reduces packet processing latency for network functions by offloading packet tagging to the NIC. iPipe builds a general framework for offloading distributed applications to SmartNICs. IO-TCP is clearly the first attempt to use SmartNICs to accelerate both disk and packet I/O for content delivery systems.

The present invention presents IO-TCP, a split TCP stack design that offloads I/O operations from the CPU for scalable content delivery. IO-TCP provides a new abstraction that uses the SmartNIC processor to perform I/O operations, greatly reducing the load on the CPU and the main memory system. The present invention also keeps the NIC stack design simple so that it can be implemented easily on the low-power processors of I/O devices. According to the evaluation described above, IO-TCP saves a significant number of CPU cycles and benefits even small-file transfers when enough concurrent connections are present.

In summary, the present invention divides the TCP stack implementation into a data plane, which processes data, and a control plane, which handles all other control functions. The data plane, which performs simple but repetitive functions, runs on the SmartNIC to avoid wasting CPU, while the control plane functions, which are complex but must be executed quickly, run on the CPU, maximizing transfer performance per unit of CPU capacity. The invention proposes a communication protocol between the host stack (the TCP stack running on the CPU) and the NIC stack (running on the NIC), and defines an API that directs file access on the NIC. By giving the NIC stack the in-file position and size of the data to be sent at the TCP level, the host stack controls the operation in which file content read by the NIC is placed in TCP packets and sent. All other control functions (connection management, reliable data transfer, congestion/flow control, and error control) are designed to run in the host stack, so that only a very small part of the existing TCP stack is handed over to the NIC stack. This keeps the implementation easy while maximizing the resulting CPU savings.

This design can also be implemented on various kinds of SmartNICs. The protocol between the host stack and the NIC stack keeps disk access delay out of the packet RTT measurement in the host stack, which improves its accuracy. To avoid repeated disk accesses on retransmission, the host stack periodically informs the NIC stack of the sequence space that the client has ACKed, so that file content whose transfer has completed is released at the appropriate time. In a performance evaluation in which a prototype system served random downloads of 300KB files, it reached 50Gbps with only one CPU core, matching the performance of a server using the conventional approach with only 1/6 of the CPU capacity.

Online video delivery, which is only one part of content distribution system services, already accounts for more than 60% of all Internet traffic, and the volume of traffic keeps growing as contact-free daily life continues during the coronavirus pandemic. Video quality has also been improving through HD and UHD, but bandwidth demand grows with it, so the number of clients that one server can support concurrently shrinks in relative terms and the cost of video delivery rises. By using SmartNICs to raise transfer performance per unit of CPU capacity, the present invention has the potential to greatly increase the number of clients each server can support. SmartNIC technology is under active development and SmartNIC compute performance is trending upward, so SmartNICs are expected to spread across server platforms in the near future. The present invention has the advantage that not only bandwidth but also content delivery performance can scale linearly with the number of SmartNICs, and the advantage that the offloaded operations are simple and therefore easy to implement.

The present invention presents IO-TCP, a split TCP stack design that dramatically reduces the CPU burden of online content delivery. IO-TCP lets the CPU focus only on the core functions of content delivery while disk and network I/O are offloaded to the SmartNIC. This division of labor realizes a separation of the control plane and the data plane of the TCP stack: the CPU side retains full control of stack operations, and only data plane operations are offloaded to the SmartNIC. IO-TCP can saturate two 25Gbps NICs with a single CPU core for disk-bound workloads.

Operation offloading system

The operation offloading system according to the present invention is the operation offloading method described above implemented as a system, and includes a CPU (Central Processing Unit) of a file-based content transmission server and a NIC (Network Interface Card) having a computing function. Since the specific operating method has been described in detail above, only the main operations are summarized here.

The CPU offloads data plane operations to the NIC. The NIC performs the data plane operations, and the CPU performs the control plane operations, that is, everything other than the offloaded data plane operations.

Here, the data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission.
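As an illustration of one of the listed data plane operations, the following is a minimal sketch of the 16-bit one's-complement checksum used by TCP and IP, in the style of RFC 1071. The function name `internet_checksum` is chosen here for illustration, and the sketch omits the TCP pseudo-header that a real TCP checksum also covers.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word to the running sum.
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Example words taken from RFC 1071: 0001 f203 f4f5 f6f7.
example = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(example)
```

A receiver can verify a segment by running the same function over the data with the checksum appended; a result of zero indicates that no error was detected.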

The control plane operations include at least one of TCP connection management, reliability management of data transfer, congestion management of data transfer, flow control, and error control, so the two sets of operations are clearly separated.

The CPU forwards the "offload_open()" command received from the application program to the NIC. The NIC opens the file on the NIC stack and returns the result to the CPU of the transmission server, and the CPU then forwards the "offload_write()" command called by the application program to the NIC.
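The open and write exchange described above can be sketched as follows. This is only an illustrative model: the classes `MockNicStack` and `HostStack`, the argument lists of `offload_open` and `offload_write`, and the use of -1 as a failure result are all assumptions, since the text names the two calls but does not specify their signatures.

```python
class MockNicStack:
    """Hypothetical stand-in for the NIC stack; files live on NIC-attached storage."""

    def __init__(self, files):
        self.files = files      # file name -> content bytes
        self.open_files = {}    # file ID -> file name
        self.next_id = 1
        self.writes = []        # (file ID, offset, length) requests received

    def handle_open(self, name):
        if name not in self.files:
            return -1           # assumed failure result reported to the host
        fid = self.next_id
        self.next_id += 1
        self.open_files[fid] = name
        return fid

    def handle_write(self, fid, offset, length):
        if fid not in self.open_files:
            return -1
        self.writes.append((fid, offset, length))
        return length


class HostStack:
    """Hypothetical host stack that forwards the application's calls to the NIC."""

    def __init__(self, nic):
        self.nic = nic

    def offload_open(self, name):
        # The file is opened on the NIC; only the result comes back to the CPU.
        return self.nic.handle_open(name)

    def offload_write(self, fid, offset, length):
        # Only the request is forwarded; no file content crosses the host.
        return self.nic.handle_write(fid, offset, length)


nic = MockNicStack({"video.mp4": b"\x00" * 4096})
host = HostStack(nic)
fid = host.offload_open("video.mp4")
sent = host.offload_write(fid, 0, 4096)
```

In this model the application never touches the file bytes; it only names the file and the range to be sent.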

In this case, because the actual data is absent from the host stack of the CPU, the CPU performs the data plane operations virtually.

When the application program calls "offload_write()", the CPU updates only the metadata, writing virtually to the send buffer. It then determines the number of bytes to send according to the congestion and flow control parameters, and posts a "SEND" command to the NIC stack of the NIC.
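The virtual write described above can be sketched as a send buffer that tracks only counters. The class `VirtualSendBuffer`, its field names, and the rule "usable window = min(cwnd, rwnd) minus bytes in flight" are assumptions made for illustration; the text only states that the byte count is determined from the congestion and flow control parameters.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualSendBuffer:
    """Hypothetical host-side send buffer that holds metadata only, no payload."""
    file_id: int
    next_offset: int = 0   # file offset of the next byte not yet handed to the NIC
    queued: int = 0        # bytes "written" by the application but not yet sent
    in_flight: int = 0     # bytes sent but not yet ACKed
    posted: list = field(default_factory=list)  # SEND commands posted to the NIC

    def offload_write(self, length: int) -> int:
        # Virtual write: only the byte counter changes, nothing is copied.
        self.queued += length
        return length

    def try_send(self, cwnd: int, rwnd: int):
        # Assumed rule: send what the congestion and receive windows allow.
        window = min(cwnd, rwnd) - self.in_flight
        n = max(0, min(window, self.queued))
        if n == 0:
            return None
        cmd = ("SEND", self.file_id, self.next_offset, n)
        self.posted.append(cmd)
        self.next_offset += n
        self.queued -= n
        self.in_flight += n
        return cmd


buf = VirtualSendBuffer(file_id=7)
buf.offload_write(10000)
first = buf.try_send(cwnd=4000, rwnd=65535)    # limited by the congestion window
blocked = buf.try_send(cwnd=4000, rwnd=65535)  # window is full, nothing to post
buf.in_flight -= 4000                          # model an ACK for the first 4000 bytes
second = buf.try_send(cwnd=8000, rwnd=3000)    # limited by the receive window
```

The host stack keeps the same bookkeeping it would keep for a real send buffer, which is what lets congestion and flow control stay on the CPU unchanged.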

Meanwhile, the CPU delivers the "SEND" command in a TCP packet that includes at least one of a file ID, a starting offset to read from, and a data length. On receiving the "SEND" command, the NIC may convert it into a plurality of TCP packets, based on the length information in that TCP packet, and transmit them.
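A sketch of how a "SEND" command might be carried and expanded follows. The byte layout `SEND_FMT` (a 4-byte tag, 32-bit file ID, 64-bit offset, 32-bit length) and the segment size of 1460 bytes are assumptions for illustration; the text specifies only which fields the command may contain.

```python
import struct

# Hypothetical wire layout: tag, file ID, start offset, length (network byte order).
SEND_FMT = "!4sIQI"


def encode_send(file_id: int, offset: int, length: int) -> bytes:
    """Build the command payload the host stack would place in a TCP packet."""
    return struct.pack(SEND_FMT, b"SEND", file_id, offset, length)


def decode_send(payload: bytes):
    """Parse a SEND command on the NIC side."""
    tag, file_id, offset, length = struct.unpack(SEND_FMT, payload)
    if tag != b"SEND":
        raise ValueError("not a SEND command")
    return file_id, offset, length


def split_into_segments(offset: int, length: int, mss: int = 1460):
    """Return (file_offset, size) pairs, one per TCP data packet the NIC emits."""
    segments = []
    end = offset + length
    while offset < end:
        size = min(mss, end - offset)
        segments.append((offset, size))
        offset += size
    return segments


payload = encode_send(file_id=3, offset=4096, length=4000)
file_id, start, length = decode_send(payload)
segments = split_into_segments(start, length)
```

One small command from the host thus stands in for several full-size data packets, which is where the CPU savings on the send path come from.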

In addition, the client sends an echo packet to the CPU when it is about to transmit the data packet for the "SEND" command, and the CPU does not start the retransmission timer for a predetermined time, until it receives the echo packet. The predetermined time is determined by the packet processing delay at the NIC and the one-way processing time of the echo packet, where the packet processing delay at the NIC is the delay between the arrival of the "SEND" command at the NIC and the departure of the corresponding data packet to the client.
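The timing rule above can be sketched as follows, under several assumptions: the hold time is taken to be the sum of the NIC packet processing delay and the echo packet's one-way time, the time for the "SEND" command to reach the NIC is ignored, and the class `OffloadedSend` with its fields is hypothetical. Times are integer microseconds on the host clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffloadedSend:
    """Hypothetical host-side timing state for one posted SEND command."""
    posted_at: int                        # when the SEND was posted (treated as NIC arrival)
    departed_at: Optional[int] = None     # estimated time the data packet left the NIC
    timer_deadline: Optional[int] = None  # None means the timer is not armed yet

    def on_echo(self, now: int, echo_one_way: int, rto: int) -> int:
        """Arm the retransmission timer on echo arrival; return the hold time."""
        # The data left the NIC one echo trip before the echo reached the host.
        self.departed_at = now - echo_one_way
        self.timer_deadline = self.departed_at + rto
        nic_delay = self.departed_at - self.posted_at
        # Assumed form of the predetermined time: NIC delay plus echo one-way time.
        return nic_delay + echo_one_way

    def rtt_sample(self, ack_at: int) -> int:
        """RTT measured from the departure estimate, excluding the NIC/disk delay."""
        if self.departed_at is None:
            raise RuntimeError("echo packet not received yet")
        return ack_at - self.departed_at


send = OffloadedSend(posted_at=0)
armed_before_echo = send.timer_deadline is not None
hold = send.on_echo(now=12_000, echo_one_way=2_000, rto=200_000)
rtt = send.rtt_sample(ack_at=50_000)
```

In this example the naive RTT measured from the posting time would be 50,000 microseconds, while the sample taken from the departure estimate is 40,000 microseconds, since the 10,000 microseconds spent at the NIC are excluded.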

The CPU defines four commands: "OPEN", used to open a file for an IO-TCP application program; "SEND", used to send file content to the client; "ACKD", for efficient handling of retransmissions; and "CLOS", used to close a file for an IO-TCP application program. The CPU offloads only the "SEND" command to the NIC and processes packets received from the client directly. Once the CPU confirms ACKs for an offloaded sequence space range, it may periodically deliver an "ACKD" packet to the NIC. "ACKD" makes efficient handling of retransmissions possible.
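The role of "ACKD" can be sketched as a NIC-side cache of already-read file content that is kept for possible retransmission and released once the host reports it as ACKed. The class `NicRetransmitCache` and its methods are hypothetical, and the sketch ignores sequence number wraparound.

```python
class NicRetransmitCache:
    """Hypothetical NIC-side store of sent content, keyed by starting sequence number."""

    def __init__(self):
        self.chunks = {}  # start sequence number -> payload bytes

    def store(self, seq: int, payload: bytes) -> None:
        # Keep content read from disk so a retransmission needs no second disk access.
        self.chunks[seq] = payload

    def fetch(self, seq: int):
        # Serve a retransmission from memory; None means the content was released.
        return self.chunks.get(seq)

    def on_ackd(self, acked_upto: int):
        """Release every chunk that lies entirely below the ACKed sequence number."""
        released = sorted(
            seq for seq, payload in self.chunks.items()
            if seq + len(payload) <= acked_upto
        )
        for seq in released:
            del self.chunks[seq]
        return released


cache = NicRetransmitCache()
cache.store(0, b"a" * 1000)
cache.store(1000, b"b" * 1000)
cache.store(2000, b"c" * 1000)
released = cache.on_ackd(acked_upto=2000)  # host reports bytes below 2000 as ACKed
```

Sending "ACKD" periodically rather than for every ACK keeps the host-to-NIC command traffic low while still bounding how much content the NIC must hold.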

IO-TCP offloads only "SEND"; all receive-side operations are implemented in the host stack on the CPU. That is, ACKs and payload-carrying packets arriving from the client all bypass the NIC's processor and are delivered straight to the host stack for processing. This is because ACK processing and message reception are far more complex operations, so handling them on the CPU is more efficient.

The operation offloading system according to the present invention implements offloading in which, when a network web server such as a video server system transmits a file on disk over TCP, the NIC rather than the CPU reads and sends the file. Its technical contribution lies in how the TCP implementation must be redesigned for this case. The CPU performs all functions needed to run TCP but offloads only data input/output to the NIC, so that simple, repetitive, and time-consuming work is done on the NIC. The TCP redesign that makes this possible is the main feature of the present invention, and it dramatically reduces the CPU burden of online content delivery compared with the prior art.

The method described above may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The features, structures, effects, and the like described in the embodiments above are included in an embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, effects, and the like illustrated in each embodiment may be combined or modified for other embodiments by a person of ordinary skill in the field to which the embodiments belong. Accordingly, content related to such combinations and modifications should be construed as falling within the scope of the present invention.

Host Stack: host stack
NIC Stack: NIC (Network Interface Card) stack

Claims

A method for offloading network and file input/output operations between a CPU (Central Processing Unit) of a file-based content transmission server and a NIC (Network Interface Card) having a computing function, the method comprising:
an offload step in which the CPU offloads data plane operations to the NIC; and
a performing step in which the NIC performs the data plane operations and the CPU performs control plane operations excluding the offloaded data plane operations,
wherein the data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission, and the control plane operations include at least one of TCP connection management, reliability management of data transfer, congestion management of data transfer, flow control, and error control, and
wherein the offload step comprises:
the CPU forwarding an "offload_open()" command received from an application program to the NIC;
the NIC opening a file on the NIC stack and returning a result to the CPU; and
the CPU forwarding an "offload_write()" command called by the application program to the NIC.

(Deleted)

The method of claim 1,
further comprising a virtual operation step in which the CPU virtually performs the data plane operations when actual data is absent from the host stack of the CPU.

The method of claim 3,
wherein the virtual operation step comprises:
the CPU, when the application program calls the "offload_write()" command, updating only metadata and writing virtually to a send buffer; and
the CPU determining the number of bytes to be sent according to congestion and flow control parameters and then posting a "SEND" command to the NIC stack of the NIC.

The method of claim 4,
wherein the CPU delivers the "SEND" command in a TCP packet, and the TCP packet includes at least one of a file ID, a starting offset to read from, and a data length.

The method of claim 5,
wherein the NIC, on receiving the "SEND" command, converts it into a plurality of TCP packets based on the length information included in the TCP packet and transmits them.

The method of claim 6,
wherein the client sends an echo packet to the CPU when it is about to transmit a data packet for the "SEND" command, and
the CPU does not start a retransmission timer for a predetermined time, until it receives the echo packet.

The method of claim 7,
wherein the predetermined time is determined by a packet processing delay at the NIC and a one-way processing time of the echo packet.

The method of claim 8,
wherein the packet processing delay at the NIC is the delay between the arrival of the "SEND" command at the NIC and the departure of the data packet for the "SEND" command to the client.

The method of claim 1,
wherein the CPU defines an "OPEN" command used to open a file for an IO-TCP application program, a "SEND" command used to send file content to a client, an "ACKD" command for efficient handling of retransmissions, and a "CLOS" command used to close a file for the IO-TCP application program.

The method of claim 10,
wherein the CPU offloads only the "SEND" command to the NIC and directly processes packets received from the client.

The method of claim 10,
wherein the CPU periodically delivers the "ACKD" command to the NIC upon confirming ACKs for an offloaded sequence space range.

A system for offloading network and file input/output operations, including a CPU (Central Processing Unit) of a file-based content transmission server and a NIC (Network Interface Card) having a computing function, wherein:
the CPU offloads data plane operations to the NIC;
the NIC performs the data plane operations, and the CPU performs control plane operations excluding the offloaded data plane operations;
the data plane operations include at least one of content fetching, data buffer management, data packet segmentation, TCP/IP checksum calculation, and data packet generation and transmission;
the control plane operations include at least one of TCP connection management, reliability management of data transfer, congestion management of data transfer, flow control, and error control; and
when the CPU forwards an "offload_open()" command received from an application program to the NIC, the NIC opens a file on the NIC stack and returns a result to the CPU, and the CPU forwards an "offload_write()" command called by the application program to the NIC.

(Deleted)

The system of claim 13,
wherein, when actual data is absent from the host stack of the CPU, the CPU virtually performs the data plane operations.

The system of claim 15,
wherein, when the application program calls the "offload_write()" command, the CPU updates only metadata and writes virtually to a send buffer, determines the number of bytes to be sent according to congestion and flow control parameters, and then posts a "SEND" command to the NIC stack of the NIC.

The system of claim 16,
wherein the CPU delivers the "SEND" command in a TCP packet, the TCP packet includes at least one of a file ID, a starting offset to read from, and a data length, and
the NIC, on receiving the "SEND" command, converts it into a plurality of TCP packets based on the length information included in the TCP packet and transmits them.

The system of claim 17,
wherein the client sends an echo packet to the CPU when it is about to transmit a data packet for the "SEND" command;
the CPU does not start a retransmission timer for a predetermined time, until it receives the echo packet;
the predetermined time is determined by a packet processing delay at the NIC and a one-way processing time of the echo packet; and
the packet processing delay at the NIC is the delay between the arrival of the "SEND" command at the NIC and the departure of the data packet for the "SEND" command to the client.

The system of claim 13,
wherein the CPU defines an "OPEN" command used to open a file for an IO-TCP application program, a "SEND" command used to send file content to a client, an "ACKD" command for efficient handling of retransmissions, and a "CLOS" command used to close a file for the IO-TCP application program;
the CPU offloads only the "SEND" command to the NIC and directly processes packets received from the client; and
the CPU periodically delivers the "ACKD" command to the NIC upon confirming ACKs for an offloaded sequence space range.

A computer-readable recording medium on which is recorded a program for causing a computer to execute the operation offloading method according to any one of claims 1 and 3 to 12.