KR102287774B1

KR102287774B1 - Method for Processing Data in Database Based on Log Structured Merge Tree Using Non-Volatile Memory

Info

Publication number: KR102287774B1
Application number: KR1020190138684A
Authority: KR
Inventors: 박상현; 이지환; 최원기; 성한승; 김도영
Original assignee: 연세대학교 산학협력단
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2021-08-06
Also published as: WO2021085717A1; KR20210052981A

Abstract

According to the present embodiments, a database is provided that stores data exceeding a predetermined capacity of a volatile memory in a non-volatile memory and performs a flush operation and a compaction operation through a list structure and a persistent buffer of the non-volatile memory, thereby minimizing write latency and read latency while maintaining data persistence.

Description

{Method for Processing Data in Database Based on Log Structured Merge Tree Using Non-Volatile Memory}

The present invention relates to a log structured merge tree based database using a non-volatile memory, and to a data processing method thereof.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

Key-value databases are useful for handling unstructured data such as sensor data and social network data. Key-value databases mainly use a log structured merge tree.

The Log Structured Merge Tree (LSM-Tree) is designed for workloads that perform continuous write operations. The LSM-Tree structure consists of one in-memory data structure and several append-only data structures for storage on block devices (e.g., disks).

The LSM-Tree efficiently performs the insertions and modifications that occur frequently in key-value databases. It is a write-friendly structure: data is first stored in log form, and merge work such as sorting the logged data and processing modifications is deferred. The merge operations that occur later, however, cause write amplification and affect system performance and the lifespan of the storage device.

The LSM-Tree writes data sequentially rather than in random order. When data is looked up, the position of the given data within the tree is not known, so the search must proceed sequentially starting from the upper level. Even when the requested data does not exist on disk, every file at every level must be read.

US Patent Application Publication No. US 2017-0344619 (2017.11.30.)
Korean Patent Application Publication No. KR 10-2016-0121819 (2016.10.21.)
US Patent Application Publication No. US 2018-0121121 (2018.05.03.)

A primary object of embodiments of the present invention is to minimize write latency and read latency while maintaining data persistence, in a database including a volatile memory and a non-volatile memory, by storing data exceeding a predetermined capacity of the volatile memory in the non-volatile memory and by performing a flush operation and a compaction operation through a list structure and a persistent buffer of the non-volatile memory.

Other objects of the present invention that are not explicitly stated may be additionally considered within the scope that can be easily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment, there is provided a data processing method of a database, the method including: storing data in a volatile memory of the database; and performing a flush operation by creating, in a non-volatile memory of the database, a list structure in which a plurality of nodes are connected, and storing the data in the list structure.

The database may store data in a key-value format, and the list structure may be a skip list having multiple next pointers.

The method may include performing a compaction operation in which the database, without writing data, creates a new list structure, and the new list structure points to the key-values assigned to the nodes of the existing list structure.

In performing the flush operation, key-values may be sequentially copied to a persistent buffer of the non-volatile memory, the persistent buffer may prevent random access to the key-values assigned to the nodes, and the list structure may point to the offset of the persistent buffer corresponding to the list structure.

In performing the compaction operation, a new list structure may be created in the non-volatile memory, and the new list structure may point to the offset of the persistent buffer corresponding to the previous list structure.

When the data stored in the non-volatile memory of the database exceeds a preset capacity range, the data may be evicted to a block drive of the database according to a tiering policy.

The tiering policy may be set to (i) a first tiering policy that selects data at a specific level and stores it in the block drive, (ii) a second tiering policy that selects the least recently accessed data and stores it in the block drive, (iii) a third tiering policy that stores all data in the non-volatile memory, or a combination thereof.

The method may include, when a system error occurs in the database, sequentially recovering data based on the result of looking up the data pointed to by the list structure stored in the non-volatile memory of the database.

According to another aspect of the present embodiment, there is provided a database including a processor, a volatile memory, and a non-volatile memory, wherein the database stores data in the volatile memory and performs a flush operation by creating, in the non-volatile memory, a list structure in which a plurality of nodes are connected, and storing the data in the list structure.

According to yet another aspect of the present embodiment, there is provided a computer program for data processing, recorded on a non-transitory computer-readable medium and including computer program instructions executable by a processor, wherein, when the computer program instructions are executed by at least one processor of a database, the computer program performs operations including: storing data in a volatile memory of the database; and performing a flush operation by creating, in a non-volatile memory of the database, a list structure in which a plurality of nodes are connected, and storing the data in the list structure.

As described above, according to embodiments of the present invention, a database including a volatile memory and a non-volatile memory stores data exceeding a predetermined capacity of the volatile memory in the non-volatile memory and performs a flush operation and a compaction operation through a list structure and a persistent buffer of the non-volatile memory, which has the effect of minimizing write latency and read latency while maintaining data persistence.

Even effects not explicitly mentioned here, namely the effects described in the following specification that are expected from the technical features of the present invention and their provisional effects, are treated as if they were described in the specification of the present invention.

FIG. 1 is a diagram illustrating a conventional log structured merge tree based database.
FIG. 2 is a diagram illustrating a conventional log structured merge tree based database performing a data compaction operation.
FIG. 3 is a block diagram illustrating a database according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an internal data structure of a database according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a data processing method of a database according to another embodiment of the present invention.
FIG. 6 is a diagram illustrating a skip list created in a non-volatile memory by a database according to another embodiment of the present invention.
FIG. 7 is a diagram illustrating a database performing compaction on skip lists according to another embodiment of the present invention.
FIG. 8 is a diagram illustrating a database performing sequential copying through a persistent buffer according to another embodiment of the present invention.
FIG. 9 is a diagram illustrating a database performing byte-addressing compaction through a persistent buffer according to another embodiment of the present invention.
FIGS. 10 to 12 show the results of simulations performed according to embodiments of the present invention.

Hereinafter, in describing the present invention, detailed descriptions of related known functions are omitted where they are obvious to those skilled in the art and are judged likely to obscure the gist of the present invention unnecessarily, and some embodiments of the present invention are described in detail with reference to exemplary drawings.

FIG. 1 is a diagram illustrating a conventional Log Structured Merge Tree (LSM-Tree) based database, and FIG. 2 is a diagram illustrating a conventional LSM-Tree based database performing a data compaction operation.

Representative databases that use the LSM-Tree include LevelDB and RocksDB.

When an insert operation is performed, the LSM-Tree first stores the data in the memory area. Once data has accumulated up to a certain capacity of the memory, the contents of the memory are flushed to disk. The flushed data is merge-sorted with the existing data stored on disk and then written. When a level of the disk area exceeds its threshold, a merge sort is executed to create a lower level.

An LSM-Tree based database stores data in key-value form. When a request to insert data arrives at an LSM-Tree based database, a log record is first written to the log file before the data is written to memory. After the log record is written, the data is stored in a memtable in the memory area. As write requests continue and the memtable fills to a certain capacity, the memtable is converted into an immutable memtable (a read-only memtable) that can no longer be modified. When the immutable memtable becomes full, a flush to the block (disk) area occurs.
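By way of a non-limiting illustration, the conventional write path described above can be sketched as follows. The class name `TinyLSM`, the `memtable_limit` parameter, and the use of in-memory lists in place of the log file and SST files are assumptions made for this sketch only. For brevity the sketch flushes the immutable memtable as soon as it is created.

```python
class TinyLSM:
    """Toy LSM-tree write path: log -> memtable -> immutable memtable -> sorted run."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical capacity, counted in entries
        self.log = []          # stands in for the log file on disk
        self.memtable = {}     # mutable in-memory table
        self.immutable = None  # read-only memtable waiting to be flushed
        self.runs = []         # flushed sorted runs (stand-ins for SST files), oldest first

    def put(self, key, value):
        self.log.append((key, value))  # 1. record the operation in the log first
        self.memtable[key] = value     # 2. then store it in the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._rotate_and_flush()

    def _rotate_and_flush(self):
        # The full memtable becomes immutable and a fresh one takes new writes.
        self.immutable, self.memtable = self.memtable, {}
        # Flush: sort by key and persist the contents as one run.
        self.runs.append(sorted(self.immutable.items()))
        self.immutable = None
        self.log.clear()  # every logged entry is now contained in a flushed run

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run first
            for k, v in run:
                if k == key:
                    return v
        return None
```

With `memtable_limit=2`, two calls to `put` produce one sorted run and empty the log, and a later `put` of an existing key is served from the memtable ahead of the older run.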

When a flush operation is performed, the contents of the memtable are sorted in key order and converted into an SST (Sorted String Table) file. An SST file has a plurality of blocks. Examples of blocks include a data block that stores data, an index block that indexes the positions of the data blocks, and a footer block that handles the position of the index block.

SST files are updated through compaction in the disk area. Once created, an SST file may not disappear. SST files residing at lower levels may hold older data than SST files at upper levels.

If a problem such as a system error or a power failure occurs while a transaction is executing, data that remains in the buffer and has not yet been reflected on disk is lost. When the database performs recovery after the system reboots, it uses a log that records which update operations each transaction performed. One logging scheme is the WAL (Write-Ahead Logging) rule. WAL is the rule that the log records related to data changed by a transaction are written to the log file before the changed data is written to disk.
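As a minimal sketch of the WAL rule, assuming a log kept as a Python list and records of the hypothetical form `(op, key, value)` with `op` being `"SET"` or `"DEL"`, the ordering constraint and the replay-based recovery might look like this:

```python
def apply(table, record):
    """Apply one logged operation to an in-memory table."""
    op, key, value = record
    if op == "SET":
        table[key] = value
    elif op == "DEL":
        table.pop(key, None)


def write(log, table, record):
    """Write-ahead rule: the log record is appended before the state changes."""
    log.append(record)
    apply(table, record)


def recover(log):
    """Rebuild the lost in-memory state by replaying the log in order."""
    table = {}
    for record in log:
        apply(table, record)
    return table
```

Because every change reaches the log before it reaches the table, replaying the log after a crash reproduces the state that was held in memory.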

An LSM-Tree based database executes two commands. One is the flush command, which moves data from memory to disk, and the other is the compaction command, which adjusts the levels on disk.

When the flush command is executed, the immutable memtable is converted into a single SST file. When a large amount of data arrives at once, the flush rate is deliberately throttled in order to balance the flush rate and keep each SST file level within its capacity limit. This intentional delay is called a 'Write Stall'. Table 1 illustrates accumulated write stalls.

Each level of the LSM-Tree has distinct characteristics. Compaction cost and disk writes are concentrated at particular levels. The hierarchical storage structure causes data to accumulate gradually from upper levels to lower levels. The lifetime of an SST file refers to the number of compactions performed while the file exists at its level. If an SST file is not deleted from a given level while compactions are performed, the lifetime of that SST file is high. Table 2 illustrates the SST file lifetime, the number of compaction files, the ratio of compaction files, and the amount written during compaction.

The per-level results of executing the compaction command show that SST files at the upper levels (e.g., L₀ to L₃) are created and deleted within the short time during which data compaction is performed. The number and ratio of compaction files show that, because of the hierarchical structure of the LSM-Tree, the size of the compaction files is not small.

The amount written at the upper levels, which have spare resources during compaction, may seem insignificant, but the disk I/O cannot be avoided. Even though a considerable amount of data was input, the number of compaction files and the write amount at the lower level L₄ are smaller than those at L₃.

The database according to the present embodiment focuses on the high SST file lifetime at lower levels such as L₄ and uses NVM to reduce the disk I/O wasted at the upper levels. Using byte addressing, the NVM addresses write amplification and enables persistent compaction.

FIG. 3 is a block diagram illustrating a database according to an embodiment of the present invention, and FIG. 4 is a diagram illustrating an internal data structure of a database according to an embodiment of the present invention.

As shown in FIG. 3, the database 10 includes a processor 100, a volatile memory 200, and a non-volatile memory 300. The database 10 may omit some of the components exemplarily shown in FIG. 3 or may additionally include other components. For example, the database 10 may further include a block device 400 depending on the tiering policy.

The database 10 is a device that processes and stores data. The database 10 can store and read data in a key-value format. Key-value processing commands may be defined as (SET K, V), (DEL K, V), and so on.

The processor 100 transmits predefined commands to the volatile memory 200, the non-volatile memory 300, and the block device 400 to control the flow of various signals and data.

The volatile memory 200 is a memory that requires a power supply in order to retain stored information. For example, the volatile memory 200 may be a DRAM (Dynamic Random Access Memory).

The non-volatile memory 300 is a memory that retains stored information even when power is not supplied. The non-volatile memory 300 may include a list structure 310. The list structure 310 has a head and a rear, and holds the address of a key and the address of a value. The list structure 310 may be implemented as a skip list having multiple next pointers. Each node of the skip list has a key length, a key, a value length, and a value. The non-volatile memory 300 may include a persistent buffer 320.
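For illustration only, a skip list whose nodes carry a key length, a key, a value length, a value, and multiple next pointers could be sketched as below. The names `Node`, `SkipList`, and `MAX_LEVEL`, the coin-flip level selection, and the seeded random generator are assumptions of this sketch. The sketch keeps the list in ordinary process memory and does not model placement in non-volatile memory.

```python
import random

MAX_LEVEL = 4  # hypothetical upper bound on the number of next pointers per node


class Node:
    def __init__(self, key, value, level):
        self.key_len = len(key)
        self.key = key
        self.value_len = len(value)
        self.value = value
        self.next = [None] * level  # multiple next pointers, one per level


class SkipList:
    def __init__(self, seed=0):
        self.head = Node(b"", b"", MAX_LEVEL)  # sentinel; its key is never compared
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key):
        """Return, for every level, the last node whose key is smaller than `key`."""
        preds = [self.head] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
            preds[i] = node
        return preds

    def insert(self, key, value):
        preds = self._predecessors(key)
        found = preds[0].next[0]
        if found is not None and found.key == key:
            found.value, found.value_len = value, len(value)  # update in place
            return
        new = Node(key, value, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = preds[i].next[i]
            preds[i].next[i] = new

    def get(self, key):
        node = self._predecessors(key)[0].next[0]
        return node.value if node is not None and node.key == key else None

    def items(self):
        node = self.head.next[0]  # the lowest level links every node in key order
        while node is not None:
            yield node.key, node.value
            node = node.next[0]
```

Walking the lowest level yields all entries in key order, which is the property a flush or compaction over such a list relies on.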

The block device 400 is a storage medium that allows random access in units of blocks. For example, the block device 400 may be an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

The database 10 stores data primarily in the volatile memory 200. When the stored data exceeds a preset capacity, some of the data is stored secondarily in the non-volatile memory 300. The database 10 may perform a flush operation by creating, in the non-volatile memory 300, a list structure 310 in which a plurality of nodes are connected, and storing the data in the list structure 310.

FIG. 5 is a flowchart illustrating a data processing method of a database according to another embodiment of the present invention.

In step S210, the database stores data in the volatile memory.

In step S220, the database performs a flush operation to the non-volatile memory. A flush is an operation of copying from a first storage to a second storage. For example, data is copied from the volatile memory to the non-volatile memory.

In step S230, the database performs a compaction operation. Compaction is a merge process: when a level fills with data up to its threshold, the data of that level is moved down to the next lower level.
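The threshold-triggered merge of step S230 can be pictured with the following simplified sketch. Representing each level as a Python dictionary and the per-level thresholds as a `limits` list are assumptions made for illustration, and the function name `compact_levels` is hypothetical.

```python
def compact_levels(levels, limits):
    """Push every over-full level down into the next lower level.

    levels: list of dicts, upper (newest) level first.
    limits: maximum number of entries allowed at each level.
    """
    for i in range(len(levels) - 1):
        if len(levels[i]) > limits[i]:
            merged = dict(levels[i + 1])
            merged.update(levels[i])  # the upper level is newer, so its values win
            levels[i + 1] = merged
            levels[i] = {}
    return levels
```

Because the loop proceeds from the upper level downward, a level that overflows after receiving data is itself compacted in the same pass.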

In step S240, the database evicts data to the block drive according to a policy.

In step S250, the database performs a recovery operation to restore data when an error occurs.

FIG. 6 is a diagram illustrating a skip list created in a non-volatile memory by a database according to another embodiment of the present invention.

The non-volatile memory (NVM) serves as an intermediate, flexible level, and the skip list can replace the existing block file format (e.g., SST files).

The non-volatile memory (NVM) is byte-addressable. Examples of byte-addressable non-volatile memory include STT-MRAM (Spin-Transfer Torque Magnetic Random Access Memory) and PCM (Phase-Change Memory).

For the NVM to operate consistently in a database system, the database system must use the 'clflush' and 'mfence' instructions. The 'clflush' instruction flushes a cache line to memory and ensures that the data is completely stored in the NVM. The 'mfence' instruction is a memory barrier instruction that guards against the processor reordering instructions.

The PMDK (Persistent Memory Development Kit) API makes it possible to provide a key-value database using NVM.

FIG. 7 is a diagram illustrating a database performing compaction on skip lists according to another embodiment of the present invention.

The database performs the compaction operation without writing data: it creates a new list structure, and the new list structure points to the key-values assigned to the nodes of the existing list structure.
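A minimal sketch of such pointer-only compaction, assuming two key-sorted lists of hypothetical `KeyValue` objects in which the first list is the newer one, is shown below. The merged list holds references to the existing objects, so no key or value is copied.

```python
class KeyValue:
    """A stored key-value pair; compaction must not copy or rewrite it."""

    def __init__(self, key, value):
        self.key = key
        self.value = value


def pointer_compaction(newer, older):
    """Merge two key-sorted lists into a new list that references the same objects."""
    merged = []
    i = j = 0
    while i < len(newer) and j < len(older):
        if newer[i].key < older[j].key:
            merged.append(newer[i])
            i += 1
        elif newer[i].key > older[j].key:
            merged.append(older[j])
            j += 1
        else:
            merged.append(newer[i])  # same key: keep the newer entry, drop the older
            i += 1
            j += 1
    merged.extend(newer[i:])
    merged.extend(older[j:])
    return merged
```

Only the list of references is new; each element of the result is the very object that the input lists already pointed to.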

FIG. 8 is a diagram illustrating a database performing sequential copying through a persistent buffer according to another embodiment of the present invention.

The algorithm for the flush operation of the database is shown in Table 3.

In the flush operation, key-values are sequentially copied to the persistent buffer of the non-volatile memory, and the persistent buffer prevents random access to the key-values assigned to the nodes. The list structure points to the offset of the persistent buffer corresponding to the list structure.
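The following sketch illustrates one way the sequential copy and the offset pointers could fit together. It is not the algorithm of Table 3. The `PersistentBuffer` class, the 4-byte little-endian length prefixes, and the key-sorted list of `(key, offset)` pairs standing in for the skip list are all assumptions of this sketch, and a `bytearray` stands in for the non-volatile memory region.

```python
import struct

U32 = struct.Struct("<I")  # hypothetical 4-byte little-endian length prefix


class PersistentBuffer:
    """Append-only byte region holding [key_len][key][value_len][value] records."""

    def __init__(self):
        self.data = bytearray()

    def append(self, key, value):
        offset = len(self.data)  # records are laid out one after another
        self.data += U32.pack(len(key)) + key + U32.pack(len(value)) + value
        return offset

    def read(self, offset):
        (key_len,) = U32.unpack_from(self.data, offset)
        key_start = offset + U32.size
        key = bytes(self.data[key_start:key_start + key_len])
        (value_len,) = U32.unpack_from(self.data, key_start + key_len)
        value_start = key_start + key_len + U32.size
        return key, bytes(self.data[value_start:value_start + value_len])


def flush(memtable, buffer):
    """Copy the memtable into the buffer in key order; return (key, offset) pointers."""
    return [(key, buffer.append(key, memtable[key])) for key in sorted(memtable)]
```

Each index entry stores only an offset, so a lookup follows the pointer into the buffer and decodes a single record there.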

FIG. 9 is a diagram illustrating a database performing byte-addressing compaction through a persistent buffer according to another embodiment of the present invention.

The algorithm for the byte-addressing compaction operation of the database is shown in Table 4.

In the compaction operation, a new list structure is created in the non-volatile memory, and the new list structure points to the offset of the persistent buffer corresponding to the previous list structure. During compaction, no writes are performed on the key length, key, value length, or value in the persistent buffer. Persistence on the NVM is ensured by using the 'clflush' and 'mfence' instructions. Unlike an SST file, the persistent buffer contains no index block or footer block. Data lookup performance is improved through the offsets of the persistent buffer. Before compaction is performed, iterators over the lowest level of the skip lists are created, and the iterators are merged during the byte-addressing compaction process.
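As a hedged sketch of the iterator merge, and not the algorithm of Table 4, suppose each existing list is represented by a key-sorted sequence of `(key, offset)` pairs and the lists are supplied newest first. The function names and this representation are assumptions. The compaction then yields a new sequence of offsets and never touches the records themselves.

```python
import heapq


def _tagged(index, age):
    """Iterate one list, tagging each entry with the age of its list (0 = newest)."""
    for key, offset in index:
        yield key, age, offset


def byte_addressing_compaction(indexes):
    """Merge key-sorted (key, offset) lists, newest first, into one new list."""
    iterators = [_tagged(index, age) for age, index in enumerate(indexes)]
    merged = []
    # Tuples sort by key first and age second, so for equal keys the entry
    # from the newest list is seen first.
    for key, _age, offset in heapq.merge(*iterators):
        if merged and merged[-1][0] == key:
            continue  # an entry from a newer list already covers this key
        merged.append((key, offset))
    return merged
```

The result reuses the offsets of the input lists, which corresponds to the new list structure pointing at the persistent buffer of the previous list structures.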

The database performs storage tiering. The database includes a block drive. When the data stored in the non-volatile memory exceeds a preset capacity range, the database evicts data to the block drive of the database according to a tiering policy.

The tiering policy may be set to (i) a first tiering policy (Leveled Tiering) that selects data at a specific level and stores it in the block drive, (ii) a second tiering policy (LRU Tiering) that selects the least recently accessed data and stores it in the block drive, (iii) a third tiering policy (No Tiering) that stores all data in the non-volatile memory, or a combination thereof. The specific level may be calculated statistically and set according to the design being implemented.
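The three tiering policies can be contrasted with the following illustrative sketch. The `Entry` fields, the policy names `"leveled"`, `"lru"`, and `"none"`, and the parameters `nvm_capacity` and `evict_level` are hypothetical choices made for this sketch and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    level: int        # level at which the data currently resides
    last_access: int  # logical timestamp of the most recent access
    size: int         # bytes occupied in non-volatile memory


def select_for_eviction(entries, policy, nvm_capacity, evict_level=None):
    """Return the entries to move from non-volatile memory to the block drive."""
    used = sum(e.size for e in entries)
    if policy == "none" or used <= nvm_capacity:
        return []  # No Tiering, or the capacity has not been exceeded
    if policy == "leveled":
        # Leveled Tiering: evict everything that sits at the chosen level.
        return [e for e in entries if e.level == evict_level]
    if policy == "lru":
        # LRU Tiering: evict the least recently accessed data until it fits.
        victims = []
        for e in sorted(entries, key=lambda e: e.last_access):
            if used <= nvm_capacity:
                break
            victims.append(e)
            used -= e.size
        return victims
    raise ValueError(f"unknown tiering policy: {policy}")
```

Under the leveled policy the victims are determined by level alone, whereas under the LRU policy the number of victims depends on how far the capacity is exceeded.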

The algorithm for the recovery operation of the database is shown in Table 5.

When a system error occurs in the database, data is recovered sequentially based on the result of looking up the data pointed to by the list structure stored in the non-volatile memory of the database. A persistent pointer is obtained from the NVM, and the skip list is restored using the persistent pointer. Nodes are located in the skip list, the metadata (ID) and an iterator for the ID are obtained, data that has already been inserted is skipped, and any remaining byte-addressing compaction is taken into account. The metadata of the skip list is restored, and the metadata is inserted into the list.
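The sequential part of recovery, in which the surviving list is walked and each pointer is followed into the persistent buffer, might be sketched as below. This is a simplification and not the algorithm of Table 5. The record layout with 4-byte length prefixes, the `(key, offset)` representation of the persisted list, and the rule of skipping entries whose stored key does not match are assumptions of this sketch, and metadata and unfinished compactions are not modeled.

```python
import struct

U32 = struct.Struct("<I")  # hypothetical 4-byte little-endian length prefix


def encode(key, value):
    """Lay out one record as [key_len][key][value_len][value]."""
    return U32.pack(len(key)) + key + U32.pack(len(value)) + value


def read_record(buffer, offset):
    """Decode the record that starts at `offset` in the persistent buffer."""
    (key_len,) = U32.unpack_from(buffer, offset)
    key_start = offset + U32.size
    key = buffer[key_start:key_start + key_len]
    (value_len,) = U32.unpack_from(buffer, key_start + key_len)
    value_start = key_start + key_len + U32.size
    return key, buffer[value_start:value_start + value_len]


def recover(persisted_list, buffer):
    """Walk the surviving list in order and rebuild an in-memory table."""
    table = {}
    for key, offset in persisted_list:
        stored_key, value = read_record(buffer, offset)
        if stored_key == key:  # skip pointers that do not lead to a matching record
            table[key] = value
    return table
```

No log replay is involved in this sketch: the table is rebuilt purely from the list entries and the records they point to.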

FIGS. 10 to 12 show the results of simulations performed according to embodiments of the present invention.

As shown in FIG. 10, the database according to the present embodiment shows improved performance in terms of write latency and read latency.

Referring to FIG. 11, performance in terms of write latency and read latency improves in the order of the third tiering policy (No Tiering), the second tiering policy (LRU Tiering), and the first tiering policy (Leveled Tiering).

Referring to FIG. 12, the database according to the present embodiment (TLSM) shows improved performance in terms of write stalls and the amount written during compaction.

데이터 베이스에 포함된 구성요소들이 도 3에서는 분리되어 도시되어 있으나, 복수의 구성요소들은 상호 결합되어 적어도 하나의 모듈로 구현될 수 있다. 구성요소들은 장치 내부의 소프트웨어적인 모듈 또는 하드웨어적인 모듈을 연결하는 통신 경로에 연결되어 상호 간에 유기적으로 동작한다. 이러한 구성요소들은 하나 이상의 통신 버스 또는 신호선을 이용하여 통신한다.Although the components included in the database are illustrated separately in FIG. 3 , a plurality of components may be combined with each other and implemented as at least one module. The components are connected to a communication path connecting a software module or a hardware module inside the device to operate organically with each other. These components communicate using one or more communication buses or signal lines.

The database may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer. The device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like. In addition, the device may be implemented as a system on chip (SoC) including one or more processors and controllers.

The database may be mounted, in the form of software, hardware, or a combination thereof, on a computing device provided with hardware elements. A computing device may refer to any of various devices that include all or some of the following: a communication device, such as a communication modem, for communicating with various devices or with a wired/wireless communication network; a memory for storing data for executing a program; and a microprocessor for executing the program to perform operations and issue commands.

Although FIG. 5 describes the processes as being executed sequentially, this is merely illustrative. A person skilled in the art could apply various modifications and variations, such as changing the order described in FIG. 5, executing one or more processes in parallel, or adding other processes, without departing from the essential characteristics of the embodiments of the present invention.

The operations according to the present embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. A computer-readable medium refers to any medium that participates in providing instructions to a processor for execution. The computer-readable medium may include program instructions, data files, data structures, or a combination thereof. Examples include magnetic media, optical recording media, and memory. The computer program may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments can be readily inferred by programmers in the technical field to which the present embodiments pertain.

The present embodiments are intended to explain the technical idea of the present embodiments, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiments should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiments.

10: database 100: processor
200: volatile memory 300: non-volatile memory
310: list structure 320: persistence buffer
400: block device

Claims

1. A data processing method of a database, the method comprising:
storing data in a volatile memory of the database; and
generating, in a non-volatile memory of the database, a list structure in which a plurality of nodes are connected, and performing a flush operation such that the data is stored in the list structure,
wherein the database stores the data in a key-value format, and
wherein the list structure is a skip list having a plurality of next pointers.
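As an illustration only, a skip list whose nodes carry a plurality of next pointers can be sketched in Python as follows. The class names, the `MAX_LEVEL` cap, the coin-flip probability of 0.5, and the seeded random generator are assumptions made for this example; the source does not specify them, and the sketch keeps the list in ordinary memory rather than in non-volatile memory.

```python
import random
from typing import List, Optional

MAX_LEVEL = 4  # assumed cap on the number of next pointers per node


class Node:
    def __init__(self, key, value, level: int):
        self.key = key
        self.value = value
        # One next pointer per level; level 0 links every node in key order.
        self.next: List[Optional["Node"]] = [None] * level


class SkipList:
    def __init__(self, seed: int = 0):
        self.head = Node(None, None, MAX_LEVEL)  # sentinel, key never compared
        self.rng = random.Random(seed)

    def _random_level(self) -> int:
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key, value) -> None:
        # update[i] is the last node at level i whose key is below `key`.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
            update[i] = node
        candidate = update[0].next[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # overwrite an existing key
            return
        new = Node(key, value, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def get(self, key):
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
        node = node.next[0]
        return node.value if node is not None and node.key == key else None
```

The higher-level pointers let a lookup skip over runs of nodes, while the level-0 chain still supports an ordered scan.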

2. (Deleted)

3. The method of claim 1, further comprising:
performing a compaction operation in which the database does not rewrite the data but instead creates a new list structure and makes the new list structure point to the key-values assigned to nodes of the existing list structure.

4. The method of claim 3,
wherein performing the flush operation includes sequentially copying key-values to a persistence buffer of the non-volatile memory,
wherein the persistence buffer prevents random access to the key-values assigned to the nodes, and
wherein the list structure points to an offset of the persistence buffer corresponding to the list structure.
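The flush behavior recited above can be pictured with a small Python sketch, offered under stated assumptions rather than as the claimed implementation. A Python list stands in for the persistence buffer in non-volatile memory, a dict stands in for the in-memory table, and the returned list of (key, offset) pairs stands in for the list structure; `flush`, `memtable`, and `persistence_buffer` are hypothetical names.

```python
from typing import Dict, List, Tuple


def flush(memtable: Dict[str, str],
          persistence_buffer: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Append key-values to the buffer in key order and index them by offset."""
    index: List[Tuple[str, int]] = []
    for key in sorted(memtable):
        offset = len(persistence_buffer)  # next free slot, sequential append
        persistence_buffer.append((key, memtable[key]))
        index.append((key, offset))       # the index stores offsets, not data
    return index
```

Each flush only appends to the end of the buffer, so writes stay sequential and the index remains a set of offsets into one contiguous region.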

5. The method of claim 3,
wherein performing the compaction operation includes creating a new list structure in the non-volatile memory and making the new list structure point to an offset of a persistence buffer corresponding to the previous list structure.
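A pointer-only compaction of this kind can be sketched in Python as shown below. This is an illustrative assumption, not the claimed implementation: each list structure is modeled as a sorted list of (key, buffer offset) pairs, and the hypothetical function `compact` merges two such indexes into a new one without touching the buffered key-values.

```python
from typing import List, Tuple

Index = List[Tuple[str, int]]  # sorted (key, persistence-buffer offset) pairs


def compact(older: Index, newer: Index) -> Index:
    """Build a new index that reuses existing offsets; no data is rewritten."""
    merged = dict(older)
    merged.update(newer)  # for duplicate keys, the newer offset wins
    return sorted(merged.items())
```

For example, merging `[("a", 0), ("b", 1)]` with `[("b", 2), ("c", 3)]` yields an index in which `"b"` refers to offset 2, while the record at offset 1 simply becomes unreferenced.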

6. The method of claim 1, further comprising:
when the data stored in the non-volatile memory of the database exceeds a preset capacity range, evicting the data to a block device of the database according to a tiering policy.

7. The method of claim 6,
wherein the tiering policy is set to
(i) a first tiering policy that selects data at a certain level and stores it in the block device, (ii) a second tiering policy that selects data that has not been accessed recently and stores it in the block device, (iii) a third tiering policy that stores all data in the non-volatile memory, or a combination thereof.
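The three tiering policies can be compared with a short Python sketch. It is a simplified model under assumptions that the source does not state: each unit of data in non-volatile memory is a `Table` with a level, a logical last-access timestamp, and a size; `evict_level` is a hypothetical parameter naming the level chosen by the first policy; and eviction stops as soon as usage falls back to the capacity.

```python
from enum import Enum
from typing import List, NamedTuple


class Policy(Enum):
    LEVELED = 1  # first policy: evict data at a chosen level
    LRU = 2      # second policy: evict the least recently accessed data
    NONE = 3     # third policy: keep all data in non-volatile memory


class Table(NamedTuple):
    name: str
    level: int
    last_access: int  # logical timestamp, larger means more recent
    size: int


def select_victims(tables: List[Table], capacity: int,
                   policy: Policy, evict_level: int = 0) -> List[str]:
    """Return the names of tables to move to the block device."""
    used = sum(t.size for t in tables)
    if policy is Policy.NONE or used <= capacity:
        return []
    if policy is Policy.LEVELED:
        candidates = [t for t in tables if t.level == evict_level]
    else:
        candidates = sorted(tables, key=lambda t: t.last_access)
    victims: List[str] = []
    for t in candidates:
        if used <= capacity:
            break
        victims.append(t.name)
        used -= t.size
    return victims
```

Under the leveled policy in this sketch, usage may remain above capacity if the chosen level holds too little data; a fuller model would need a rule for moving on to another level.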

8. The method of claim 1, further comprising:
when a system error occurs in the database, sequentially recovering data by reading the data pointed to by the list structure stored in the non-volatile memory of the database.

9. A database comprising a processor, a volatile memory, and a non-volatile memory, wherein the database:
stores data in the volatile memory; and
generates, in the non-volatile memory, a list structure in which a plurality of nodes are connected, and performs a flush operation such that the data is stored in the list structure,
wherein the database stores the data in a key-value format, and
wherein the list structure is a skip list having a plurality of next pointers.

10. (Deleted)

11. The database of claim 9,
wherein the database performs a compaction operation in which the database does not rewrite the data but instead creates a new list structure and makes the new list structure point to the key-values assigned to nodes of the existing list structure.

12. The database of claim 11,
wherein the flush operation sequentially copies key-values to a persistence buffer of the non-volatile memory,
wherein the persistence buffer prevents random access to the key-values assigned to the nodes, and
wherein the list structure points to an offset of the persistence buffer corresponding to the list structure.

13. The database of claim 11,
wherein the compaction operation creates a new list structure in the non-volatile memory and makes the new list structure point to an offset of a persistence buffer corresponding to the previous list structure.

14. The database of claim 9,
wherein the database further includes a block device, and
wherein, when the data stored in the non-volatile memory exceeds a preset capacity range, the database evicts the data to the block device according to a tiering policy.

15. The database of claim 14,
wherein the tiering policy is set to
(i) a first tiering policy that selects data at a certain level and stores it in the block device, (ii) a second tiering policy that selects data that has not been accessed recently and stores it in the block device, (iii) a third tiering policy that stores all data in the non-volatile memory, or a combination thereof.

16. The database of claim 9,
wherein, when a system error occurs in the database, the database sequentially recovers data by reading the data pointed to by the list structure stored in the non-volatile memory of the database.