WO2023146332A1 - Technique and apparatus for intermediate storage of transaction data for massive transactions - Google Patents

Technique and apparatus for intermediate storage of transaction data for massive transactions

Info

Publication number
WO2023146332A1
Authority
WO
WIPO (PCT)
Prior art keywords
transaction
operating system
data
page
block
Prior art date
Application number
PCT/KR2023/001251
Other languages
English (en)
Korean (ko)
Inventor
원유집 (Youjip Won)
오준택 (Joontaek Oh)
Original Assignee
한국과학기술원 (KAIST)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국과학기술원 (KAIST)
Publication of WO2023146332A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/06 — Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0647 — Migration mechanisms (horizontal data movement in storage systems)
    • G06F12/02 — Addressing or allocation; Relocation
    • G06F12/0253 — Garbage collection, i.e. reclamation of unreferenced memory
    • G06F12/1009 — Address translation using page tables, e.g. page table structures
    • G06F16/11 — File system administration, e.g. details of archiving or snapshots
    • G06F16/17 — Details of further file system functions
    • G06F16/174 — Redundancy elimination performed by the file system
    • G06F9/466 — Transaction processing
    • G06F2212/1016 — Performance improvement

Definitions

  • the present invention relates to computing technology, and more particularly, to a technique for executing a large-capacity transaction in which the amount of write data included in one transaction is larger than the capacity of the page cache.
  • FIG. 1 shows the configuration of a host provided according to the prior art and an embodiment of the present invention.
  • Host 10 may be a computing device.
  • the host 10 may include a memory 13, a power supply unit 16, a storage unit 17, a processing unit 18, and a communication unit 19.
  • the power supply unit 16 supplies operating power to the host 10.
  • the storage unit 17 may store a program including instructions for causing the processing unit 18 to execute the application 11 and/or the operating system 12 .
  • the storage unit 17 may be a non-volatile memory such as SSD or HDD.
  • the processing unit 18 may be configured to execute the application 11 and/or the operating system 12 by executing the program.
  • Processing unit 18 may be, for example, an AP or a CPU.
  • the memory 13, such as DRAM, may be directly controlled by the operating system 12.
  • the memory 13 may include a page cache and a shadow page cache.
  • the communication unit 19 may be configured to drive a communication medium that delivers commands of the operating system 12 to the storage 20.
  • the communication medium may transmit a baseband communication signal or
  • An application program (hereinafter simply 'application') running on the host may be the subject that stores data in a storage device (e.g., an SSD).
  • the application cannot directly control the storage.
  • applications can write data to the storage through the operating system, which can directly control the storage.
  • the part of the operating system that handles storage can be referred to as a file system.
  • Errors can occur if data is not stored in storage in the way the application requires. Transactions can be used to ensure that data is stored in storage in the manner required by the application.
  • a transaction is a set of write data that must satisfy properties such as atomicity, consistency, isolation between transactions, and durability maintained even in the event of a power outage.
  • a set of write data that satisfies these characteristics can be called a transaction.
  • applications require transactions, but operating systems do not require transactions.
  • the application may be configured to implement the transaction in some way, which may be inefficient.
  • Inside the storage is a fast cache.
  • Write data is usually stored first in the storage cache. If the application itself controls the transaction, the storage cache may be rarely used, and because write data is cached there, the transaction order desired by the application can be broken.
  • flushing data X may mean storing data X in a non-volatile memory of storage.
  • a transaction contains special data called a commit block.
  • Existence of the commit block in the storage guarantees that all transaction data associated with that commit block has been completely written. To guarantee that the commit block of a transaction is written after all other blocks, conventional applications first write all other blocks to the storage cache and flush them, and only then write and flush the commit block.
  • A journal file may be stored in the non-volatile memory of the storage. A first transaction is written to the journal file, and only after it is completely written there is it reflected in the original file. In the event of a power outage, only transactions completely written to the journal file can be recovered.
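The journal-then-apply protocol described above can be sketched as a toy model. All names here (`JournaledStore`, `tx_begin`, etc.) are hypothetical illustration, not an API of any real file system; real systems operate on disk blocks and explicit flushes.

```python
# Illustrative model of journal-based commit: write the whole transaction
# to a durable journal first, then apply it to the original file.

class JournaledStore:
    def __init__(self):
        self.journal = []      # durable journal file (fully written transactions)
        self.original = {}     # durable original file contents
        self.pending = None    # transaction being assembled in volatile memory

    def tx_begin(self):
        self.pending = {}

    def tx_write(self, key, value):
        self.pending[key] = value

    def tx_commit(self):
        # Step 1: the complete transaction is written (and flushed) to the journal.
        self.journal.append(dict(self.pending))
        # Step 2: only then is it reflected in the original file.
        self.original.update(self.pending)
        self.pending = None

    def recover_after_crash(self):
        # Only transactions completely present in the journal are replayed.
        for record in self.journal:
            self.original.update(record)
```

A crash that discards the volatile `pending` state before the journal write loses the in-flight transaction but leaves every committed transaction recoverable, which is the guarantee the text describes.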
  • An object of the present invention is to provide a technique for solving the problem that the size of a transaction in a file system of an operating system supporting transactions is limited by the size of a memory managed by the operating system.
  • the present invention proposes an intermediate storage device and technique for executing large-capacity transactions. In the present invention, this may be referred to as stealing.
  • the large-capacity transaction may refer to a case in which the amount of write data included in one transaction is greater than the amount of the page cache.
  • the Tx start call is a system call indicating that the application is starting a transaction.
  • the write call is a system call indicating that a file is to be written to storage.
  • the Tx end call is a system call indicating that a transaction is to be terminated.
  • write data (pages) related to the write call are pinned to the page cache, a memory used by the operating system. If all of the write data cannot be pinned to the page cache, that is, when the page cache is too small, the prior art has the problem of aborting the transaction.
  • the prior art may refer to a technology in which an operating system supporting a transaction executes a transaction.
  • In the present invention, the operating system first intermediately stores some of the write data (some pages) held in the page cache it directly manages into storage; this is the evict step. The page-cache blocks that held the intermediately stored write data then become usable again. Next, the operating system pins all remaining write data requested by the application through the transaction in the page cache. Afterwards, when the Tx end call is issued by the application, the operating system writes all data remaining in the page cache to storage.
  • evict may mean selecting a specific page, writing the contents stored in the specific page to storage, and making the selected specific page a blank page so that it can be used for other purposes.
  • the intermediate storage may be regarded as the final storage of that part of the write data; that is, the intermediately stored partial data (partial pages) are, from the storage's perspective, finally stored.
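The stealing idea above can be sketched as follows: when a transaction's write set exceeds the page cache, some pinned pages are written out to storage early (evicted) to free cache slots, and only the remainder is flushed at commit. `StealingCache` and its eviction policy (oldest-first) are illustrative assumptions, not the patent's actual mechanism.

```python
# Minimal sketch of "stealing": intermediate storage of evicted pages
# lets a transaction larger than the page cache proceed.

class StealingCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}        # page id -> data (pinned dirty pages)
        self.storage = {}      # page id -> data already written to storage
        self.evicted = []      # pages intermediately stored before commit

    def tx_write(self, page_id, data):
        if len(self.cache) >= self.capacity:
            # Evict the oldest pinned page: write it to storage, free the slot.
            victim, vdata = next(iter(self.cache.items()))
            self.storage[victim] = vdata
            self.evicted.append(victim)
            del self.cache[victim]
        self.cache[page_id] = data

    def tx_commit(self):
        # Remaining cached pages are flushed; evicted pages are already there.
        self.storage.update(self.cache)
        self.cache.clear()
```

With a capacity of 2 pages, a 4-page transaction completes instead of aborting, which is exactly the failure mode of the prior art described above.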
  • the 'block' may mean the unit of storage in which data is stored. A plurality of blocks may be collectively referred to as a segment. In a log-structured file system, even if old data A already recorded in storage is modified into new data A', the old data A remains in the storage for a certain time. This point is important in the present invention.
  • File mapping information indicating a mapping relationship between a certain data A recorded in the storage and a location of a block in which the data A is stored is stored in a file called an inode.
  • the inode file is also stored in storage and can be temporarily stored in the page cache managed by the operating system.
  • a block considered valid in one segment may be referred to as a live block.
  • a block declared as a block in which certain valid data is stored by predetermined mapping information may be referred to as a live block.
  • a process of moving the data stored in all live blocks of a first segment to a second segment and then making all blocks of the first segment usable may be referred to as garbage collection.
  • the purpose of garbage collection is to free up free segments.
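A single garbage-collection step as described above can be modeled like this; the structures (`segments` as dicts, a `live_map` set) are illustrative stand-ins for the segment layout and validity information of a real log-structured file system.

```python
# Illustrative garbage collection for a log-structured layout: copy the
# live blocks of a victim segment to a fresh segment, then free the victim.

def garbage_collect(segments, free_segments, victim_id, live_map):
    """segments: dict segment_id -> {block_id: data}
    free_segments: list of reusable segment ids
    live_map: set of block_ids currently considered valid (live blocks)
    """
    target_id = free_segments.pop()
    segments[target_id] = {}
    for block_id, data in segments[victim_id].items():
        if block_id in live_map:            # dead blocks are simply dropped
            segments[target_id][block_id] = data
    del segments[victim_id]                 # the whole victim is reusable now
    segments.setdefault(target_id, {})
    free_segments.append(victim_id)
    return target_id
```

The net effect matches the stated purpose: one more free segment, with only live data relocated.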
  • the log-structured filesystem supports an operation called checkpointing.
  • the checkpoint operation flushes the file mapping information (inodes) stored in the page cache to the storage.
  • the file mapping information stored in the non-volatile memory of the storage may be cached in the cache of the storage, and the file mapping information stored in the cache of the storage may in turn be cached in the memory managed by the host's operating system, that is, the page cache.
  • the file mapping information stored in the storage may be first cached in the page cache.
  • some of the pages stored in the page cache may be intermediately stored in storage; that is, a page A can be evicted.
  • the old page of page A may be written to a first block of storage
  • the new page of page A may be written to a second block of storage.
  • new file mapping information indicating that the new page is written in the second block may be recorded in the page cache and the cache of the storage. In this case, the old file mapping information indicating that the old page is written in the first block disappears.
  • the new file mapping information is flushed to storage, and as a result the old file mapping information in storage disappears. If a system crash (e.g., power failure) then occurs before the transaction is committed, the new file mapping information recorded in storage can be used during recovery, but the old file mapping information, having already disappeared, cannot. Therefore, even if recovery is executed, the state prior to the unfinished transaction is not restored. To solve this problem, the present invention introduces the following additional configuration.
  • the new file mapping information for the page A may be pinned to the page cache until the transaction is committed. In this way, even if a checkpoint operation is called, the new file mapping information is not flushed, and thus, the old file mapping information in the storage can be maintained as it is until the transaction is committed.
  • Calls to checkpoint operations can be performed periodically by the operating system or just before and after garbage collection is called.
  • for a page that has been evicted, invalidation of its old block in storage is not performed before the transaction commits.
  • exF2FS is a log-structured file system with transaction support.
  • The proposed file system consists of three main components: Membership-Oriented Transactions, Stealing-Enabled Transactions, and Shadow Garbage Collection.
  • the membership-oriented transaction allows transactions to span multiple files where applications can explicitly specify the files involved in the transaction.
  • the stealing-enabled transaction allows application programs to execute transactions with small amounts of memory and encapsulate many updates (eg, hundreds of files with a total size of tens of GB) into a single transaction.
  • the log structure file system can perform garbage collection without affecting error atomicity of ongoing transactions.
  • The transactional support of exF2FS is carefully tuned to meet the critical needs of application programs while minimizing code complexity and avoiding performance side effects.
  • exF2FS increases SQLite multifile transaction throughput by a factor of 24 compared to stock SQLite multifile transactions.
  • Implementing compaction as a file system transaction increases RocksDB throughput by 87%.
  • For transactions that update multiple database files, SQLite, a library-based embedded DBMS, maintains a separate journal file for each database file, resulting in excessive fdatasync() calls and massive write amplification.
  • The key-value storage engine flushes output files individually and flushes the global state of the compaction to a manifest file.
  • The file system's transaction support allows application programs to replace the multiple fsync() calls for each output file and the manifest file with a single file system transaction, yielding higher performance by eliminating redundant IO.
  • System-level support for transactions can be largely classified into four types: native operating-system support, kernel-level file system, user-level file system, and transactional block device. Supporting transactions as a first-class citizen of the operating system is ideal; however, it requires major changes to the operating system.
  • User-level filesystem transaction support uses a user-level DBMS to provide full ACID transactions. ACID support sacrifices performance.
  • Transaction support in kernel-level filesystems can be further classified according to the degree of ACID support: full ACID semantics, ACD without isolation support, or AC without isolation and durability support. F2FS transactions only support atomicity and do not support isolation and durability.
  • Transactions in F2FS cannot span multiple files. Ironically, despite having minimal transaction support, F2FS is the only file system to have successfully rolled out transactional support to the masses. F2FS's transactional support targets a specific application, SQLite: F2FS's atomic writes allow SQLite to implement transactions without a rollback journal file and eliminate excessive flushing overhead.
  • the present invention revisits the problem of providing filesystem level transaction support.
  • Most of the prior art on transactional file systems uses a journaling file system as the base file system and uses its journaling layer to provide transactional functionality.
  • F2FS is a log-structured file system designed for flash storage.
  • flash storage has recently become widespread on smartphone platforms and is starting to expand to cloud platforms.
  • Few studies have dealt with transactional support in log-structured file systems, and those prior studies that do are limited in terms of transaction support.
  • the prior art does not support multi-file transactions, transaction stealing, and conflict processing between transactions and garbage collection.
  • a file system maintains a kernel object, which is a transaction file group that designates a set of files related to a transaction, including directories. Membership-oriented transactions allow applications to explicitly specify a file for a transaction.
  • Delayed invalidation and relocation records are proposed to realize stealing in file system transactions.
  • Deferred invalidation prevents the old disk locations of stolen (evicted) pages from being garbage-collected until the transaction commits.
  • the relocation record maintains undo and redo information for aborting and committing an evicted page, respectively.
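The relocation record described above can be written down directly as a data structure. The field names follow the description (file ID, file offset, old and new disk location); the `resolve` helper showing undo-on-abort / redo-on-commit is an illustrative assumption.

```python
# A relocation record keeps both disk locations of an evicted page:
# old location = undo information (abort), new location = redo (commit).

from dataclasses import dataclass

@dataclass
class RelocationRecord:
    file_id: int
    file_offset: int
    old_disk_loc: int   # undo target: mapping restored here on abort
    new_disk_loc: int   # redo target: mapping moved here on commit

def resolve(mapping, records, committed):
    """Apply relocation records to a (file_id, offset) -> disk_loc mapping."""
    for r in records:
        key = (r.file_id, r.file_offset)
        mapping[key] = r.new_disk_loc if committed else r.old_disk_loc
    return mapping
```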
  • Shadow garbage collection allows filesystems to perform garbage collection transparently to in-flight transactions.
  • exF2FS (extended F2FS)
  • exF2FS improves SQLite performance by a factor of 24 over stock SQLite and reduces the write volume to 1/6 compared to SQLite's PERSIST journal mode.
  • Multi-file transactions are an essential part of modern software. The following are some examples of multi-file transaction methods currently in use.
  • Chrome browser keeps user browsing activities such as visited URLs, list of downloaded files, access history for each URL and list of most frequently visited URLs. Chrome keeps each of these as separate files and updates these files in a failure-atomic fashion. For failure-atomicity, Chrome uses SQLite to update these files rendering excessive IO. SQLite transactions are inefficient.
  • Compaction is the process of merge-sorting multiple SSTables with overlapping key ranges into a sequence of output files with non-overlapping key ranges.
  • For failure atomicity, the compaction operation calls fsync() on each output file and its parent directory and flushes the global state of the operation to a special file called the manifest file.
  • A single compaction in RocksDB can produce up to 198 output files (over 200 fsync() calls) for a total of 13.3 GB.
  • the MAILDIR IMAP format maintains mailboxes as directories and messages as files within those directories. Email clients update message files and associated directories in a transactional fashion. Without transactional support in the underlying file system, mail clients manage mailboxes and messages transactionally using expensive atomic renames.
  • SQLite is a serverless embedded DBMS widely used in a variety of applications, including mobile applications such as Android Mail and the Facebook app, desktop applications such as Gmail and Apple iWork, and distributed file systems such as Lustre and Ceph. These applications use SQLite to persistently manage updates to multiple files in a failure-atomic fashion. To understand how SQLite can benefit from the underlying file system's transactional support, the IO behavior of SQLite's multi-file transactions is instrumented.
  • Using SQLite to manage data persistently can make application programs simpler, but SQLite's file-based journaling and page-granularity physical logging result in significant write amplification and excessive flushing.
  • a single insert() in SQLite results in 5 fdatasync()s with a write() of 40KB.
  • SQLite organizes a multi-file transaction as a collection of single-file transactions plus several flushes that record the global state of the multi-file transaction in the master journal file. SQLite implements multi-file transactions in the four steps listed below: steps 1 and 3 update the master journal file; steps 2 and 4 execute a series of single-file transactions.
  • Figure 2 shows, via measured experiments, how each of these steps maps to IO operations.
  • the transaction consists of three inserts into three different database files.
  • Step 1: Initialize the master journal file. SQLite records the names of the journal files in the master journal file. Then the master journal file (S1 in Fig. 2) and the updated directory (S2 in Fig. 2) are flushed to disk.
  • Step 2: Logging and database updates. SQLite writes an undo record to each journal file and updates each database file; each file is updated in the same way as in a single-database transaction (S3 in Fig. 2). In Figure 2 there are three S3s, each corresponding to a single insert().
  • Step 3: Delete the master journal file. SQLite deletes the master journal file and makes the associated directory durable (S4 in Figure 2).
  • Step 4: Reset the journal. SQLite resets and flushes each rollback journal file (S5 in Fig. 2).
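The four steps above can be sketched as a flush-count tally. The per-step counts are assumptions read off the text (two flushes at step 1, three fdatasync() per insert at step 2, one directory flush at step 3, one journal reset flush per database at step 4); the function name is hypothetical.

```python
# Hedged sketch: count the flush (fsync/fdatasync) operations of an
# SQLite multi-file commit, per the four-step description above.

def multifile_commit_flushes(n_databases):
    flushes = 0
    flushes += 2                  # step 1: master journal file (S1) + directory (S2)
    flushes += 3 * n_databases    # step 2: three fdatasync() per insert (S3)
    flushes += 1                  # step 3: unlink master journal, flush directory (S4)
    flushes += n_databases        # step 4: reset each rollback journal (S5)
    return flushes
```

For the three-database transaction of Figure 2, this tally illustrates why a single file-system transaction replacing all of these flushes is attractive.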
  • the X axis and the Y axis represent time and LBA, respectively.
  • Three areas of F2FS, namely the metadata area, the data area of the main area, and the node area of the main area, are explicitly designated.
  • the underlying F2FS flushes not only the data blocks, but also the associated node blocks into the data area and node area, respectively.
  • Flushing the master journal file (fd(mj)) at S1 in Figure 2 renders two separate 4 KB IOs to disk: one for flushing the data block and one for flushing the node block. Both data blocks and their related node blocks must be durable to ensure the integrity of the file system.
  • Each insert() issues three fdatasync() calls (S3).
  • the first and second fdatasync() are for flushing the rollback journal file.
  • the third is to flush the database files.
  • SQLite deletes the master journal file and makes the parent directory durable. If the unlink of the master journal file becomes durable, the transaction is committed.
  • SQLite resets the transaction's rollback journal file.
  • F2FS is used as the underlying log-structured file system.
  • F2FS has several key design features that differentiate it from the original log-structured file system design. Two of them are the focus of the present invention: the block allocation bitmap and the dual log partition layout. To make stealing and shadow garbage collection a reality, one must examine how F2FS manipulates and updates the block allocation bitmap and the two logs.
  • the first is the block allocation bitmap.
  • the original log-structured filesystem design had no explicit data structures specifying whether a given block of a filesystem partition was allocated.
  • the filesystem determines that a block on a filesystem partition is allocated if it can be reached via file mapping.
  • F2FS maintains a block allocation bitmap to indicate whether a given block in the filesystem is valid.
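A block-allocation bitmap of this kind is simply one validity bit per block of the partition. The sketch below is a generic bitmap, not F2FS's actual on-disk layout (which F2FS keeps per segment in its metadata area).

```python
# Minimal block-allocation bitmap: one bit per block, set when the
# block holds valid data.

class BlockBitmap:
    def __init__(self, n_blocks):
        self.bits = bytearray((n_blocks + 7) // 8)

    def set_valid(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def clear(self, block):
        # Invalidation: clearing the bit marks the block reusable.
        self.bits[block // 8] &= ~(1 << (block % 8))

    def is_valid(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))
```

Delayed invalidation, described earlier, amounts to postponing the `clear()` of an evicted page's old block until the transaction commits.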
  • the second is a dual log partition layout.
  • The legacy log-structured file system treats a file system partition as a single log: it clusters data blocks and their associated file maps together and flushes them as a single unit.
  • F2FS organizes the file system partition into two separate logs, a data area and a node area, placing data blocks and node blocks in their respective areas. Unlike the legacy log-structured file system, F2FS writes data blocks and node blocks separately. To maintain file system integrity against system crashes, F2FS ensures that data blocks are durable before their connected node blocks. Due to this ordering mechanism, in each pair of writes for data blocks and node blocks, the block trace for the data-block write appears before that for the node-block write, as shown in Fig. 2.
  • F2FS provides atomic write functionality.
  • An application can write multiple blocks to a single file in an error-atomic fashion. This feature is primarily intended to address the excessive IO overhead of SQLite's single file transactions.
  • For atomic writes, F2FS maintains a list of dirty pages in the inode. When a transaction updates a file block, it inserts the dirty page into the per-inode list of dirty pages and pins the dirty page in memory. When the transaction commits, the file system unpins the dirty pages from the per-inode list and flushes the dirty pages and the associated node blocks that hold the updated file mapping to disk. Because atomic writes pin dirty pages in memory until commit, F2FS by design cannot support stealing in atomic-write transactions. When a transaction is committed, F2FS sets the FSYNC_BIT flag on the node block; if more than one node block is flushed, the atomic write places the FSYNC_BIT flag on the last node block. The FSYNC_BIT flag indicates that the node block is eligible for roll-forward recovery.
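The no-stealing limitation above can be made concrete with a toy per-inode model: every dirty page of an atomic write stays pinned until commit, so a transaction larger than available memory has no choice but to abort. `AtomicWriteInode` and its memory accounting are illustrative assumptions.

```python
# Sketch of F2FS-style atomic writes: dirty pages are pinned per inode
# until commit, so a transaction exceeding memory must abort.
# This is precisely the limitation that stealing removes.

class AtomicWriteInode:
    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.dirty_pages = {}       # page -> data, pinned in memory
        self.on_disk = {}

    def write(self, page, data):
        if page not in self.dirty_pages and len(self.dirty_pages) >= self.memory_limit:
            raise MemoryError("page cache exhausted: transaction must abort")
        self.dirty_pages[page] = data

    def commit(self):
        self.on_disk.update(self.dirty_pages)  # flush data + node blocks
        self.dirty_pages.clear()               # unpin
```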
  • Log-structured file systems periodically checkpoint state such as updated file mappings, updated bitmaps (F2FS only), and the disk location of the last block of each log.
  • the recovery module restores the state of the filesystem with respect to the most recent checkpoint information.
  • the recovery module scans the log at the last location and finds node blocks with FSYNC_BIT, i.e. transactions successfully completed since the most recent checkpoint, and recovers the related files.
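The roll-forward scan just described reduces to: start at the last checkpointed log position and pick up every node block carrying FSYNC_BIT. The log representation below is a simplifying assumption for illustration.

```python
# Sketch of roll-forward recovery: scan the log from the checkpoint
# position and recover files whose node blocks carry FSYNC_BIT,
# i.e. transactions completed since the most recent checkpoint.

def roll_forward(log, checkpoint_pos):
    """log: list of (node_block_id, fsync_bit) entries, append-only."""
    recovered = []
    for node_block, fsync_bit in log[checkpoint_pos:]:
        if fsync_bit:
            recovered.append(node_block)
    return recovered
```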
  • exF2FS is a log-structured file system with transaction support that satisfies these constraints.
  • the key technical components of exF2FS are membership-oriented transactions, stealing-enabled transactions, and shadow garbage collection. Each component is summarized below.
  • Membership-oriented transactions: transactions in F2FS cannot span multiple files because a transaction's dirty pages are maintained per inode.
  • A new transaction model called the membership-oriented transaction is therefore developed.
  • In membership-oriented transactions, the file system defines transactional filegroups, which are sets of files whose updates must be transacted, and maintains a transactional dirty page list for each transactional filegroup.
  • With membership-oriented transactions, transactions can span multiple files, and application programs can explicitly specify the files to which a transaction applies.
  • Garbage collection can make dirty pages of uncommitted transactions durable, and it can checkpoint updated file mappings early, before the transactions commit. Shadow garbage collection is developed to decouple garbage collection from uncommitted transactions.
  • the present invention proposes a new transaction model called membership-oriented transaction.
  • This model defines a new kernel entity, the transactional filegroup.
  • A transactional filegroup is a set of files whose updates must be processed as one transaction, and is composed of a transaction membership (inode set), a dirty page list, a relocation list, and a master commit block, as shown in FIG. 3. The namespace of transactional filegroup objects uses a hash table, as is widely done for the namespaces of kernel objects such as semaphores and pipes.
  • a dirty page list is a set of dirty pages for transactional member files.
  • the dirty page list has two separate dirty page lists: a dirty data page list and a dirty node page list.
  • a relocation list is a set of relocation records. The relocation record contains information about the page that was evicted: file ID, file offset, old disk location and new disk location.
  • the master commit block holds the disk location of the last node block for each file within the transactional membership. Using transactional filegroups with master commit blocks allows transactions to span multiple files. Relocation lists are used for stealing and shadow garbage collection.
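The filegroup components listed above can be written down as a plain data structure. The field names follow the description; the layout is a sketch, not the patent's actual kernel object.

```python
# Sketch of the transactional filegroup: membership (inode set),
# dirty data/node page lists, relocation list, and master commit block.

from dataclasses import dataclass, field

@dataclass
class TransactionalFilegroup:
    membership: set = field(default_factory=set)             # inode numbers
    dirty_data_pages: list = field(default_factory=list)
    dirty_node_pages: list = field(default_factory=list)
    relocation_list: list = field(default_factory=list)      # relocation records
    master_commit_block: dict = field(default_factory=dict)  # inode -> last node block loc

    def add_file(self, inode):
        self.membership.add(inode)

    def remove_file(self, inode):
        self.membership.discard(inode)
```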
  • An application creates a transactional filegroup with an explicit call.
  • the ID of the transactional filegroup is returned to the application.
  • Applications can add or remove files from a transactional filegroup.
  • Membership inheritance saves transaction-created files from transaction conflicts because newly created files are added to the transaction filegroup before they are visible to the outside world.
  • exF2FS provides APIs for transaction abort and transaction deletion.
  • If an application requests deletion of a transactional filegroup and no transactions are in progress for it, the transactional filegroup and related objects are deallocated.
  • If a transaction is in progress, exF2FS first aborts the transaction and then deletes the transactional filegroup.
  • transactions can include directory updates such as rename(), unlink() and create().
  • F2FS transactions, by contrast, do not support directory updates.
• When a transaction updates a file in the transactional filegroup, it inserts the updated page cache entry into the list of dirty data pages of the transactional filegroup.
• When committing a transaction, the filesystem prepares dirty data pages, dirty node pages, and a master commit block. First, the filesystem inserts dirty data pages from the dirty page list into the active data segment and obtains the disk location of each dirty data page. Second, the filesystem updates the associated node page with the new disk location of each data page, inserts the updated node page into the list of dirty node pages, and determines the disk location of each dirty node page. Third, the filesystem allocates a master commit block and stores in it the disk location of each node page in the list of dirty node pages. The filesystem then sets the FSYNC_BIT flag in the master commit block.
  • exF2FS flushes dirty data pages, dirty node pages, and master commit blocks.
  • Master commit blocks become durable only after data blocks and node blocks become durable.
• the master commit block is the key component that binds dirty pages from multiple files into a single multi-file transaction. Once the master commit block persists, the filesystem scans the relocation list and invalidates the old disk location of each relocation record.
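The three-step commit preparation and its ordering constraint (data and node blocks become durable before the master commit block) can be sketched as below. This is an illustrative model under stated assumptions: block allocation is simulated with a simple counter, and `node_of` is a hypothetical map from data page to its node page:

```python
def commit_transaction(dirty_data_pages, node_of, next_free_block):
    """Simulate the write ordering: data pages -> node pages -> master commit block."""
    write_order = []

    # Step 1: place dirty data pages in the active data segment; learn each location.
    data_loc = {}
    for page in dirty_data_pages:
        data_loc[page] = next_free_block
        write_order.append(("data", page, next_free_block))
        next_free_block += 1

    # Step 2: update each associated node page with the new data location.
    for page, loc in data_loc.items():
        node = node_of[page]
        write_order.append(("node", node, next_free_block))
        next_free_block += 1

    # Step 3: the master commit block stores the disk location of each dirty
    # node page and is written last, with FSYNC_BIT set.
    mcb = {"FSYNC_BIT": True,
           "node_locs": [loc for kind, _, loc in write_order if kind == "node"]}
    write_order.append(("mcb", mcb, next_free_block))
    return write_order

order = commit_transaction(["A", "B"], {"A": "nA", "B": "nB"}, 100)
```

The invariant the sketch enforces is simply that the `"mcb"` entry is always last, which is why a durable master commit block implies the whole transaction is durable.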
  • the recovery module performs a rollback recovery and sets the filesystem state to the most recent checkpoint.
  • exF2FS then performs a roll-forward recovery. The log is scanned starting at the last logging offset recorded in the checkpoint.
• When the recovery module finds a master commit block, it examines it and identifies the node block disk locations of the files in the transaction.
  • the recovery module of exF2FS then uses stock F2FS' roll-forward recovery routines to recover the files associated with each node block. If the system crashes before the master commit block persists, the temporary state of the transaction in memory is completely lost. Through this recovery mechanism, exF2FS guarantees the atomicity and durability of transactions.
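The recovery logic above can be sketched with a simplified model in which the log is a list of records and the checkpoint is an offset into it. This is illustrative only: record shapes are assumptions, and stock F2FS's actual roll-forward routines are far more involved. The key property shown is that only transactions whose master commit block reached the log are recovered:

```python
def recover(log, checkpoint_offset):
    """Rollback to the checkpoint, then roll forward over durable commits."""
    recovered_files = {}
    pending = {}  # node block locations seen since the last master commit block

    # Everything before checkpoint_offset is the base (checkpointed) state;
    # scanning starts at the last logging offset recorded in the checkpoint.
    for rec in log[checkpoint_offset:]:
        kind = rec[0]
        if kind == "node":
            _, file_id, node_loc = rec
            pending[file_id] = node_loc
        elif kind == "mcb":
            # A durable master commit block makes the transaction win:
            # roll-forward recovers the files it references.
            recovered_files.update(pending)
            pending = {}
    # Transactions without a durable MCB (still in `pending`) are lost entirely,
    # which is exactly what atomicity requires.
    return recovered_files

log = [("node", "f1", 5), ("mcb",), ("node", "f2", 9)]  # crash before f2's MCB
state = recover(log, 0)
```

Here `f1` is recovered because its master commit block is in the log, while `f2`'s in-flight state is discarded.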
  • a file can belong to only one transactional filegroup at a time.
• When a file is added, exF2FS checks whether the file is already in another transactional filegroup. If the file is already in another transactional filegroup, add_tx_file_group returns an error.
  • the present invention leaves isolation support up to the application, as other transactional filesystems do.
• As a general-purpose filesystem, it is difficult to meet the requirements of all the different isolation levels of various applications at the same time. Unless the isolation level supported by the filesystem matches the isolation level required by the application, the filesystem's limited support for isolation is redundant at best. Isolation is not required for text editors, application installers, git compression, or LSM-based key-value stores. SQLite and MySQL implement various levels of isolation on their own. For these applications, the filesystem's limited support for isolation does not help much. TxFS supports "repeatable reads" isolation, which is too strong for text editors and too loose for applications such as SQLite that require "serializable" isolation.
  • SQLite must implement "Serializable Read" isolation in its own database layer using shared locks, even when using TxFS as the underlying filesystem.
  • Filesystem support for isolation has a cost.
  • TxFS's isolation support renders a 10% performance overhead due to the overhead of creating shadow copies of pages updated in a transaction.
  • one limitation introduced by the lack of isolation support is that different processes cannot concurrently add, delete, or rename files in a directory that is part of another process's transaction. Concurrent directory modification support is reserved for future work.
  • Stealing proposed in the present invention represents a buffer management policy that allows dirty page eviction of uncommitted transactions.
• the DBMS's steal policy and the operating system's (or filesystem's) page reclamation are different manifestations of the same essential behavior: evicting dirty pages to disk to free up physical memory. The two share this essential behavior but sit at opposite extremes.
• the DBMS prevents an evicted dirty page from becoming visible externally (isolation) and undoes the steal if the transaction aborts (atomicity).
• When the operating system reclaims a file-backed dirty page, the result of the page eviction is visible externally and cannot be undone.
• the evicted page overwrites the old file block, and in a log-structured filesystem the old file block of the evicted page can no longer be accessed due to the file mapping update.
• TxFS, F2FS, Isotope, and Libnvmmio do not support transaction stealing.
  • TxFS cannot support transaction stealing due to a fundamental design limitation.
  • TxFS's support for transactions is built on top of EXT4 journaling.
  • EXT4 journaling pins log blocks in memory until journal transactions are committed.
  • EXT4 limits the size of journal transactions (256 MB by default).
  • the EXT4 journaling module commits the journal transaction.
  • dirty pages associated with a single system call can be split into two or more journal commits.
• If this happened, TxFS could break the atomicity of a transaction by prematurely persisting its transient state, so it must not happen. To ensure atomicity, TxFS aborts a transaction when the transaction size exceeds the limit.
  • F2FS pins the transaction's dirty pages into memory until they are committed. F2FS aborts all outstanding transactions when the dirty pages of uncommitted transactions exceed a certain threshold (15% of the total physical page frame by default). An example related to this will be described with reference to FIG. 4 .
  • log blocks '1', '2', '3', '4', '5', and '6' may be pinned to the memory 13 .
  • F2FS pins the transaction's dirty pages to the page cache 133 in memory until committed.
• F2FS aborts all pending transactions when the number of dirty pages of uncommitted transactions exceeds a specific threshold (ex: the 6 blocks shown in FIG. 4).
  • a rectangle 133 shown in FIG. 4 may mean a page cache of a memory.
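Stock F2FS's behavior under memory pressure, as described above, amounts to a threshold check. A minimal sketch, assuming a simple page-count comparison (the 15% default comes from the text; everything else is illustrative):

```python
def must_abort_outstanding_transactions(pinned_dirty_pages,
                                        total_page_frames,
                                        threshold=0.15):
    """True when stock F2FS would abort all outstanding transactions.

    F2FS pins uncommitted transactions' dirty pages in memory and aborts
    every outstanding transaction once those pinned pages exceed the
    threshold (15% of total physical page frames by default).
    """
    return pinned_dirty_pages > threshold * total_page_frames

# With 40 page frames, up to 6 pinned pages (15%) are tolerated;
# pinning a 7th page pushes past the threshold.
assert not must_abort_outstanding_transactions(6, 40)
assert must_abort_outstanding_transactions(7, 40)
```

This is the limitation that stealing removes: with stealing, dirty pages of an uncommitted transaction can be evicted instead of forcing an abort.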
  • CFS supports stealing.
  • CFS relies on a non-existent transactional block device for stealing support.
  • AdvFS supports stealing with commodity hardware.
• AdvFS uses writable file copies for transactional updates. When the transaction commits, the file map is updated to reference the updated file blocks, which were written out-of-place. Due to these characteristics, AdvFS can freely support stealing.
• transactions in AdvFS can fragment files, as the filesystem frees old file blocks each time a transaction commits.
  • the file defragmentation overhead of AdvFS is not yet known. Analysis of AdvFS is limited because it is a proprietary filesystem and the transaction module's source code is not publicly available.
  • the page cache 133 may exist in a memory managed by an operating system of a host, for example, DRAM.
  • the storage 20 shown in FIG. 4 may be a separate device distinct from the host.
  • Data stored in non-volatile memory in storage 20 may be cached in volatile memory in storage 20 .
  • the content cached in the volatile memory may be cached in the host's DRAM.
  • the unit of caching may be referred to as a block in the storage 20 .
  • the block is not only a unit of caching but also a unit of writing/reading of the storage 20 .
  • Storage can also be thought of as an arrangement of blocks. A number is assigned to each block, and this number can be regarded as a disk location.
  • the host may use a page cache, which is part of DRAM, when caching the content of the block.
  • the caching unit of the page cache is the page.
  • the contents of blocks of storage can be cached in pages of the page cache.
• In the present invention, stealing is enabled within a filesystem transaction.
  • the rectangle shown within the memory 13 represents the page cache 133.
  • FIG. 5 illustrates a page eviction method of a log structured file system provided according to an embodiment.
  • a space in a memory in which data “X” is stored may be referred to as “page”, and a page in which data “X” is stored may be referred to as “page X”.
  • the “page X” may refer to data “X” stored in page X depending on the context.
  • the memory may refer to a memory (ex: DRAM) in the host managed by an operating system of the host.
  • a space in storage in which data “X” is stored may be referred to as “block”, and a block in which data “X” is stored may be referred to as “block X”.
  • the “block X” may refer to data “X” stored in block X depending on the context.
  • data X may mean “data X”.
  • page X may mean a page in which data X is stored
  • block X may mean a block in which data X is stored.
• the log-structured filesystem evicts dirty pages as follows: the evicted page is written to the new disk location 232, the old disk location 231 of the evicted page is invalidated, and the file mapping (the node page in F2FS) is updated to reference the new location 232 of the associated file block.
• This page eviction routine cannot be used with the stealing provided according to an embodiment of the present invention, for two important reasons.
  • the first is the invalidation of the old disk location 231. If the old disk location 231 is invalidated, the old file block 231 can be garbage collected and recycled before the transaction is committed. If the previous file block 231 is recycled before the transaction is committed, the transaction cannot be undone when the transaction is aborted.
  • the file blocks and node pages shown in storage 20 in FIG. 5 may be in volatile memory or non-volatile memory in storage 20 .
  • a block obtained during a transaction may be written to the non-volatile memory of the storage 20 even before the transaction is committed.
  • the transaction may modify data A to data A'.
  • the contents of data A' may be cached in a volatile memory in the storage 20, and this caching is to hide the low speed of the storage 20.
  • Contents cached in the volatile memory in the storage 20 may be cached once more in the volatile memory managed by the host. This caching is to reduce the effort of the host to access the storage 20.
  • A' can be pinned to the host memory by the above transaction.
• data A' may be transmitted to the storage 20 by evicting page A', as shown in step S12 of FIG. 5 .
  • the transmitted content may be stored in a volatile memory of the storage 20 .
  • Data A' transmitted to the storage 20 may be written to the non-volatile memory in the storage 20 when the host calls a flush command or when the volatile memory in the storage 20 runs out of free space.
• Data A' is written to block 232 instead of block 231 by the above eviction, and the host can treat the data written to block 231 as invalidated, so that block 231 can then be cleaned by garbage collection. Step S13 of FIG. 5 may be executed in a state in which commit has not been called for the transaction. That is, when the host tries to write data B to the storage 20, the non-volatile memory in the storage 20 may run out of space. At this time, the host knows that the data recorded in block 231 is invalid, so data B can be written to block 231. As a result, data A recorded in block 231 is overwritten by data B, and data A is lost. Now, if the transaction is aborted, a problem arises in that the data recovered by the abort is B, not A.
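The failure scenario of FIG. 5 — eager invalidation followed by block reuse before commit — can be reproduced with a small simulation. This is an illustrative model (not the F2FS code); it shows concretely why the data available for rollback on abort would be B rather than A:

```python
class EagerInvalidationDisk:
    """Log-structured disk model that invalidates the old block on eviction."""

    def __init__(self):
        self.blocks = {}     # block number -> stored data
        self.valid = set()   # block numbers currently holding valid data

    def write(self, blk, data):
        self.blocks[blk] = data
        self.valid.add(blk)

    def evict(self, old_blk, new_blk, data):
        # Out-of-place write plus immediate invalidation of the old location.
        self.write(new_blk, data)
        self.valid.discard(old_blk)   # old block becomes reusable -- the bug

    def reuse_free_block(self, data):
        # Garbage collection hands out any invalidated block for new data.
        blk = next(b for b in self.blocks if b not in self.valid)
        self.write(blk, data)
        return blk

disk = EagerInvalidationDisk()
disk.write(231, "A")                 # pre-transaction version of the page
disk.evict(231, 232, "A'")           # steal: evict uncommitted A' to block 232
reused = disk.reuse_free_block("B")  # block 231 is recycled before commit
# On abort we would want to roll back to "A", but block 231 now holds "B".
```

Delayed invalidation, introduced below in the document, breaks this scenario by keeping block 231 valid (and hence not reusable) until the transaction commits.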
  • FIG. 6 illustrates a checkpoint method of a log structured file system provided according to an embodiment.
• the second is a premature checkpoint of updated node pages. If a dirty page is evicted, the updated node page 135 containing the updated file mapping information may be checkpointed when the filesystem executes a periodic checkpoint operation before the transaction is committed. The updated node block 235 checkpointed to disk 20 then refers to the new disk location (A:2) of the evicted page of the uncommitted transaction. If the system crashes before the transaction is committed, the recovery module may recover the evicted page (A') of the uncommitted transaction using the most recent file mapping information found on the disk 20 (S241). As a result, the filesystem may be recovered to an invalid state.
• Resolving the first issue requires preventing the old disk location 231 from being garbage collected until the transaction is committed.
• The second issue relates to preventing the evicted block 232 of an uncommitted transaction from being recovered after a system crash. System crashes can occur in both the host and the storage.
  • a delayed invalidation method is proposed to solve the first issue.
  • the delayed invalidation method is described with reference to FIG. 7 .
  • the rectangle shown in the memory 13 in FIG. 7 represents the page cache 133 .
• After evicting dirty pages of an uncommitted transaction, the filesystem does not invalidate the old disk location 231 until the transaction is committed.
  • a node page pinning method is proposed.
  • the node page pinning method will be described with reference to FIG. 8 .
  • the rectangle shown in the memory 13 in FIG. 8 represents the page cache 133 .
  • the file system pins the updated node page 135 into memory until the transaction is committed to prevent the updated node page 135 from being prematurely checkpointed.
  • an icon representing pinning is displayed in the pinned page cache.
  • the pinning icon is displayed on the upper right of the rectangle indicated by reference numeral 135 in FIG. 8 .
• For the delayed invalidation method and the node page pinning method, a new in-memory object, the relocation record, is introduced.
  • Relocation records hold information related to page evict.
  • the relocation record contains the file block ID (inode number and file offset), the old disk location 231 and the new disk location 232 of the file block of the evicted page.
  • the file system asynchronously invalidates the old disk location 231 when the transaction is committed, not when the dirty page is ejected (S33).
  • the relocation record may be written to DRAM of the host. There is no problem even if the relocation record disappears due to a system crash. The reason is that when rebooting after a system crash, the node page and block bitmap are not changed, so they are restored to their pre-transaction state.
  • Each transactional filegroup maintains a set of relocation records called a relocation list .
  • the filesystem creates a relocation record and adds it to the relocation list when a transaction evicts a dirty page.
  • the rectangle shown in the memory 13 in FIG. 9 represents the page cache 133 .
  • the dirty page of file block A is initially mapped to LBA 1 (S51).
• File block A is evicted to LBA 8 (S52).
  • the node page 250 of the memory 13 is updated to map file block A to LBA 8 (S53).
  • the block bitmap 260 of LBA 8 is set (S54).
  • the block bitmap 260 for LBA 1 is not invalidated upon eviction due to delayed invalidation.
  • the file system creates a relocation record 271 and inserts the newly created record 271 into the relocation list 270 (S55).
• the newly created relocation record 271 contains the file block ID (file block A), the old location (LBA 1), and the new location (LBA 8) of the evicted block.
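Steps S51–S55 above can be condensed into a single eviction routine that defers invalidation and records the relocation. A minimal sketch under stated assumptions (the `fs` dict stands in for the node page, block bitmap, and relocation list of the figure; names are illustrative):

```python
def evict_with_delayed_invalidation(fs, block_id, new_lba):
    """Evict a dirty page out-of-place WITHOUT invalidating its old LBA."""
    old_lba = fs["node_page"][block_id]        # S51: current mapping
    fs["node_page"][block_id] = new_lba        # S53: remap block to new LBA
    fs["block_bitmap"].add(new_lba)            # S54: mark the new LBA in use
    # Delayed invalidation: old_lba deliberately STAYS set in the bitmap,
    # so garbage collection cannot recycle it before commit.
    fs["relocation_list"].append(              # S55: remember both locations
        {"block": block_id, "old": old_lba, "new": new_lba})

fs = {"node_page": {"A": 1}, "block_bitmap": {1}, "relocation_list": []}
evict_with_delayed_invalidation(fs, "A", 8)    # S52: file block A -> LBA 8
```

After the call, both LBA 1 and LBA 8 are marked valid, matching the figure's state between eviction and commit.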
• When the page is evicted to disk 20, it is removed from the dirty page list of the associated transactional filegroup.
• When the transaction commits, the old LBA is invalidated (S57) and the updated node page is persisted (S58).
• When the transaction is committed (S56), the filesystem makes the previous location (LBA 1) of the evicted page unreachable. Before starting the dirty page flush, the filesystem traverses the relocation list 270 in chronological order and invalidates the old disk location (LBA 1) of each evicted block (delayed invalidation) (S57). When this operation is complete, the transaction's dirty data pages are flushed (S59). When the dirty pages become durable, the filesystem unpins the node page 250 updated by the eviction and inserts it into the dirty node page list. The filesystem then flushes the dirty node pages. The transaction commits successfully only when the master commit block becomes durable.
• When a transaction aborts, the filesystem traverses the relocation list 270 in reverse chronological order. For each relocation record, the filesystem invalidates the new disk location (ex: LBA 8) and restores the node page 250 in memory 13 to map the file block back to the old disk location (LBA 1). After restoring the node page 250, it is unpinned.
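Commit and abort thus traverse the same relocation list in opposite orders. A sketch built on the same illustrative model as the figure (the `fs` dict and record fields are assumptions; node page pinning and the flush steps are reduced to comments):

```python
def commit(fs):
    # S57: delayed invalidation of old locations, in chronological order.
    for rec in fs["relocation_list"]:
        fs["block_bitmap"].discard(rec["old"])
    fs["relocation_list"].clear()
    # (Dirty data pages are then flushed, node pages unpinned and flushed,
    #  and the commit succeeds only once the master commit block is durable.)

def abort(fs):
    # Undo each eviction, newest first (reverse chronological order).
    for rec in reversed(fs["relocation_list"]):
        fs["block_bitmap"].discard(rec["new"])      # invalidate the new LBA
        fs["node_page"][rec["block"]] = rec["old"]  # restore the old mapping
    fs["relocation_list"].clear()
    # (The restored node page is then unpinned.)

fs = {"node_page": {"A": 8}, "block_bitmap": {1, 8},
      "relocation_list": [{"block": "A", "old": 1, "new": 8}]}
abort(fs)   # transaction aborted: file block A maps to LBA 1 again
```

After `abort`, only LBA 1 remains valid and the mapping is back to its pre-transaction state; `commit` on the same starting state would instead leave only LBA 8 valid.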
• Delayed invalidation can leave filesystem blocks that are allocated but unreachable in the event of a system crash. Delayed invalidation temporarily leaves both the old disk location 231 and the new disk location 232 valid from the time the page is evicted until the transaction is committed. If the system crashes during this period, the filesystem can be restored in a state where both the old disk location 231 and the new disk location 232 are valid, but only the old disk location 231 is mapped to a file. In this case, the new disk location 232 must be reclaimed via fsck (offline) or an online mechanism.
• When the size of the page cache is smaller than the total size of a plurality of write pages belonging to a first transaction, the transaction execution method may include an evict step in which the operating system of the host writes some of the plurality of write pages to the storage; and a commit step in which the operating system commits the first transaction.
• the data pinned to the page cache immediately before the commit is executed may be the other write pages, excluding the partial write pages, among the plurality of write pages.
  • the commit step may be executed when the operating system receives a commit call of the first transaction from the application program.
  • the transaction execution method may include, before the evict step, receiving, by the operating system, a set of write operation calls for the plurality of write pages from an application program; and receiving, by the operating system, a commit call of the first transaction from the application program between the evict step and the commit step.
  • the file system of the operating system may be a log structure file system.
• the data of the partial write pages is data that replaces old data stored in a first set of old blocks in the storage.
  • the operating system may be configured to invalidate the old blocks of the first group simultaneously with execution of the commit or after execution of the commit.
• In the evict step, the data of the partial write pages may be stored in a first set of new blocks in the storage, and the evict step may include storing, by the operating system, in the page cache, new mapping information indicating a mapping relationship between the partial write pages and the first set of new blocks.
  • the operating system may be configured not to perform a checkpoint operation on the new mapping information stored in the page cache before the commit is executed.
  • a checkpoint operation for the new mapping information may be executed simultaneously with execution of the commit or after execution of the commit.
• A transaction execution method may include: starting, by an operating system of a host, a first transaction including first data and second data; an evict step of writing the first data, pinned at a first page of a page cache of a memory, to a first new block of storage before the operating system receives a commit call for the first transaction from an application; pinning, by the operating system, the second data at the first page; and a commit step of committing the first transaction by the operating system.
• The transaction execution method may further include, before the evict step, receiving, by the operating system, a set of write operation calls for a plurality of pieces of data belonging to the first transaction from an application program.
• the evict step may be executed only when the size of the page cache is smaller than the total size of the plurality of pieces of data.
• The transaction execution method may include: pinning, by the operating system, first file mapping information indicating that the first data is mapped to the first new block, at a first node page of the memory (S53); modifying, by the operating system, a block bitmap 260 so that the first new block is in an in-use state (S54); generating, by the operating system, a first relocation record 271 including an identifier of the first data, a location of a first old block, which is the old block of the first data, and a location of the first new block of the first data; and inserting, by the operating system, the generated first relocation record into the relocation list 270 (S55).
  • the transaction execution method may include, between the inserting step and the committing step, checking the location of the first old block of the first data by searching the relocation list, by the operating system; invalidating, by the operating system, the identified first old block of the first data; flushing, by the operating system, dirty pages of the page cache; unpinning the first node page, by the operating system, if the dirty pages are determined to be durable; inserting, by the operating system, the first node page into a dirty node page list; and flushing the dirty node page list by the operating system.
  • the transaction execution method may include, if the first transaction is aborted, checking, by the operating system, the location of the first new block of the first data by searching the relocation list; invalidating, by the operating system, a first new block of the first data; recording, by the operating system, file mapping information indicating that the first data is mapped to the first old block in the first node page; and unpinning, by the operating system, the first node page.
  • the first transaction may be committed only when the commit block is determined to be sustainable.
• A host including a memory and a processing unit executing an operating system may be provided.
  • the operating system may be configured to execute each step included in the transaction execution method described above.
• According to the present invention, it is possible to provide a technique for solving the problem that the size of a transaction in a transaction-supporting filesystem of an operating system is limited by the size of the memory managed by the operating system.
  • FIG. 1 shows the configuration of a host provided according to the prior art and an embodiment of the present invention.
  • Figure 2 shows the relationship between steps performed by SQLite to execute a multi-file transaction and IO operations.
  • FIG. 3 shows the configuration of a transactional file group.
  • FIG. 5 illustrates a page eviction method of a log structured file system provided according to an embodiment.
  • FIG. 6 illustrates a checkpoint method of a log structured file system provided according to an embodiment.
  • FIG. 8 is for explaining a node page pinning method provided according to an embodiment of the present invention.
  • FIG. 10 illustrates a method of handling a transaction suspension situation using a relocation list according to an embodiment of the present invention.
  • FIG. 11 shows an example of a shadow garbage collection execution method provided according to an embodiment of the present invention.
  • FIG. 12 illustrates a method of migrating a victim block when a victim block of garbage collection corresponds to a previous version of a cached block of an uncommitted transaction, according to an embodiment of the present invention.
  • FIG. 13 illustrates a method of migrating a victim block for garbage collection when the victim block corresponds to a previous version of an evicted page, according to an embodiment of the present invention.
  • FIG. 14 illustrates a method of migrating a victim block for garbage collection when the victim block corresponds to a new version of an evicted page.
• FIG. 15 illustrates a concept in which a host writes information to a storage according to an embodiment.
• FIG. 17 is a flowchart illustrating a method of committing a multi-file transaction provided according to an embodiment of the present invention.
  • FIG. 18 shows the structure of a master commit block provided according to one aspect of the present invention.
  • FIG. 19 illustrates the relationship between MCB and inode information stored in storage by the multi-file transaction commit method described in FIG. 17 .
• FIGS. 20A and 20B are flowcharts illustrating a method of interim storage of transaction data for large-capacity transactions according to an embodiment of the present invention.
• FIG. 21 is a flowchart illustrating a method for an operating system to execute a transaction according to an embodiment of the present invention.
• FIG. 22 is a flowchart illustrating a method for an operating system to commit a transaction according to an embodiment of the present invention.
  • FIG. 23 is a flowchart illustrating a method of restoring a state of a file system to a state prior to the start of a canceled transaction when a transaction is cancelled, according to an embodiment of the present invention.
• FIG. 15 illustrates a concept in which a host writes information to a storage according to an embodiment.
  • the host 10 and the storage 20 may each be a computing device that operates by supplying power from a power supply.
  • the host 10 and the storage 20 may exchange data and commands through one or more transmission channels 30 .
  • the transmission channels 30 may be wireless channels or wired channels.
  • the host 10 and the storage 20 may share power provided from one power supply or receive power from two different power supplies.
  • the host 10 may include a CPU, a memory, a power supply, and a communication device.
  • the storage 20 may include a controller 21 , a volatile memory 22 , and a non-volatile memory 23 .
  • the host 10 may transmit various commands and data to the storage 20 through the transmission channels 30 .
  • the command may include a write command.
  • the controller 21 of the storage 20 may store data received from the transmission channels 30 in the volatile memory 22 based on commands received from the transmission channels 30 .
  • Data stored in the volatile memory 22 may be stored in the non-volatile memory 23 according to a rule followed by the controller 21 .
• Data stored in the volatile memory 22 may be deleted when power supplied to the storage 20 is cut off, but data stored in the non-volatile memory 23 is not deleted even if power supplied to the storage 20 is cut off.
  • the host 10 may execute an application 11 and an operating system 12 .
  • the application 11 and the operating system 12 may be executed by executing predetermined command codes stored in a memory accessed by the host 10 by a CPU included in the host 10 .
  • the application 11 may be a program that is executed or terminated when a user using the host 10 provides a user input through a user interface provided by the host 10 .
  • the operating system 12 may be a program automatically executed by the host 10 when power is applied to the host 10 or a reset is performed.
  • the application 11 may send various system calls to the operating system 12 .
  • the operating system 12 may execute a task corresponding to the system call.
  • the host 10 may execute transactions. It may take a certain amount of time from the start of a particular transaction to its end.
  • the application 11 can control the start and commit of a transaction. In addition, the application 11 may control one or more operations to be executed during the transaction.
  • the application 11 may transmit system calls including a start call, a set of operation calls, and a commit call to the operating system 12 .
• A specific transaction is started by the start call, commands to be transferred from the host 10 to the storage 20 are prepared by the set of operation calls, and the prepared commands may be delivered to the storage 20 through the transmission channels 30 by the commit call.
  • a set of multiple operations may constitute one transaction.
  • the first transaction 41 may include four write operations (WO#1 to WO#4). Although FIG. 16 shows an example in which only write operations are included in each transaction, other types of operations may also be included.
  • 17 is a flowchart illustrating a method of committing a multi-file transaction provided according to an embodiment of the present invention.
  • step S110 the application 11 may transmit a start call to the operating system 12.
  • the first transaction may be started.
  • the operating system 12 may start a process for the first transaction.
  • step S121 the application 11 may call a write operation call (WO#1) for the first page of the first file to the operating system 12.
  • step S122 the application 11 may call the write operation call (WO#2) for the first inode of the first file to the operating system 12.
  • step S131 the application 11 may call the write operation call (WO#3) for the second page of the second file to the operating system 12.
  • step S132 the application 11 may call a write operation call (WO#4) for the second inode of the second file to the operating system 12.
  • step S140 the application 11 may call a commit call to the operating system 12.
  • step S150 the operating system 12 may process the first transaction in response to the commit call. That is, pages of all files included in the first transaction may be reflected in the storage 20 .
  • Step S150 may include steps S151 to S158.
  • the operating system 12 may create one MCB (Master Commit Block) having a structure according to an embodiment of the present invention.
  • FIG. 18 shows the structure of a master commit block provided according to one aspect of the present invention.
• In the master commit block, N/N1 - 1 block addresses, each having a size of N1 bytes, may be stored.
  • block positions (block numbers) of inodes written by the multi-file transaction according to the first transaction may be stored.
  • the block location of each of the inodes may be an address indicating a location of a block in which the inode is stored in the storage 20 .
• The block position of the first inode of the first file (File#1) 401 may be stored in the first part 302, and the block position of the second inode of the second file (File#2) 402 may be stored in the second part 303.
  • the first file (File#1) 401 and the second file (File#2) 402 are files included in the first transaction.
  • the FSYNC_BIT flag (307) can be attached to the master commit block.
  • the ordering between inodes and transaction contents may not be guaranteed either.
  • the master commit block 300 with the FSYNC_BIT flag 307 can be found in the storage.
  • the master commit block 300 is recorded in a state in which the writing order with the rest of the transaction is guaranteed. Therefore, finding the master commit block 300 means that all other transaction details have been recorded.
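As an illustration, the master commit block described above can be modeled as a small data structure. This is a hedged sketch only: the block size (4096 bytes), the per-address size (N1 = 8 bytes), and all names are assumptions, not the disclosure's actual layout.

```python
class MasterCommitBlock:
    """Sketch of the master commit block (MCB) described above.

    Holds the block addresses (block numbers) of the inodes written by a
    multi-file transaction, plus an FSYNC_BIT-style flag. Because the MCB
    is written only after the rest of the transaction reaches storage,
    finding a flagged MCB during recovery implies the whole transaction
    was durably recorded.
    """

    BLOCK_SIZE = 4096   # assumed MCB size N in bytes
    ADDR_SIZE = 8       # assumed size N1 of one block address in bytes

    def __init__(self):
        # N/N1 - 1 inode block addresses fit; one slot is assumed to be
        # reserved for the flag/header.
        self.capacity = self.BLOCK_SIZE // self.ADDR_SIZE - 1
        self.inode_blocks = []   # block numbers of committed inodes
        self.fsync_bit = False   # ordering-guarantee flag

    def add_inode_block(self, block_no):
        if len(self.inode_blocks) >= self.capacity:
            raise OverflowError("MCB is full")
        self.inode_blocks.append(block_no)

mcb = MasterCommitBlock()
mcb.add_inode_block(231)   # block holding the first inode (Inode#1)
mcb.add_inode_block(232)   # block holding the second inode (Inode#2)
mcb.fsync_bit = True       # attach the ordering flag before the MCB write
```

With these assumed sizes, one MCB can reference up to 511 inode blocks, which bounds how many files a single multi-file transaction can span.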
  • In step S152, the operating system 12 may transmit write commands (WC#1, WC#2) corresponding to the write operation calls (WO#1, WO#2) to the storage 20.
  • The write commands WC#1 and WC#2 may include information about the first inode (Inode#1) of the first file (File#1).
  • In step S153, the operating system 12 may store the block location (block number) of the first inode (Inode#1) in a part of the master commit block 300.
  • Step S153 of FIG. 17 corresponds to reflecting the contents of the i-th file among the files included in the transaction to the storage.
  • The file contents include the inode of the i-th file. That is, since step S153 of FIG. 17 is an operation executed immediately after the inode is stored in the non-volatile memory, the operating system 12 can know the block location of the inode.
  • Steps S152 and S153 may be repeatedly executed for all other files included in the first transaction.
  • In step S154, the operating system 12 may transmit write commands (WC#3, WC#4) corresponding to the write operation calls (WO#3, WO#4) to the storage 20.
  • Information on the second inode (Inode#2) of the second file (File#2) may be included in the write commands (WC#3, WC#4).
  • In step S155, the operating system 12 may store the block location (block number) of the second inode (Inode#2) in a part of the master commit block 300.
  • In step S157, the operating system 12 may transmit a flush command FC to the storage 20.
  • In step S158, the operating system 12 may transmit, to the storage 20, an MCB write command to write the master commit block 300.
  • The MCB write command may include the contents of the master commit block 300.
  • The storage 20 may perform the following steps in response to commands received from the operating system 12.
  • In step S161, when the storage 20 receives the write commands WC#1 and WC#2, it may store the first page (Page#1) and the first inode (Inode#1) in the volatile memory 22.
  • In step S162, when the storage 20 receives the write commands WC#3 and WC#4, it may store the second page (Page#2) and the second inode (Inode#2) in the volatile memory 22.
  • In step S163, when the storage 20 receives the flush command FC, it may store the information about the first transaction stored in the volatile memory 22 in the non-volatile memory 23.
  • In step S164, the storage 20 may store the master commit block 300 in the non-volatile memory 23.
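The command ordering of steps S152-S158 (per-file page and inode writes, then a flush, then the MCB write) can be sketched as follows. The function name, the dict-based file descriptions, and the log tuples are illustrative assumptions, not the disclosure's interface.

```python
def commit_transaction(files, storage_log):
    """Sketch of the multi-file commit protocol described above.

    For each file: append write commands for its page and inode
    (S152/S154), then record the inode's block location in the MCB
    (S153/S155). After all files, append a flush command (S157) and
    finally the MCB write (S158), so the MCB lands after everything else.
    """
    mcb = {"inode_blocks": [], "fsync_bit": True}
    for f in files:
        storage_log.append(("write", f["page"]))        # page write command
        storage_log.append(("write", f["inode"]))       # inode write command
        mcb["inode_blocks"].append(f["inode_block"])    # S153/S155
    storage_log.append(("flush", None))                 # FC (S157)
    storage_log.append(("write_mcb", mcb))              # MCB write (S158)
    return mcb

log = []
mcb = commit_transaction(
    [{"page": "Page#1", "inode": "Inode#1", "inode_block": 231},
     {"page": "Page#2", "inode": "Inode#2", "inode_block": 232}],
    log,
)
```

The key invariant this sketch preserves is that the MCB write is the last entry in the command log, after the flush that makes the transaction content durable.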
  • FIG. 19 illustrates the relationship between MCB and inode information stored in storage by the multi-file transaction commit method described in FIG. 17 .
  • reference number 231 denotes the first block 231 in which the first inode (Inode#1) is stored
  • reference number 232 denotes the second block 232 in which the second inode (Inode#2) is stored.
  • reference number 233 indicates a third block 233 in which the master commit block 300 is stored.
  • The first inode pointer 2331 included in the third block 233 has a value related to the address of the first block 231, and the second inode pointer 2332 has a value related to the address of the second block 232.
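Because the MCB is written last and in order, crash recovery can use its presence as the commit indicator. A minimal sketch of that check, assuming a toy dict-based description of on-storage blocks (this layout is purely illustrative):

```python
def transaction_committed(blocks):
    """Return True if a master commit block with its FSYNC_BIT-style
    flag set is found among the given blocks.

    `blocks` maps block numbers to dicts describing on-storage blocks.
    Since the MCB is recorded only after the rest of the transaction,
    finding a flagged MCB means every page and inode of the transaction
    is already durable, so recovery can treat the transaction as committed.
    """
    for block in blocks.values():
        if block.get("type") == "MCB" and block.get("fsync_bit"):
            return True
    return False

# Storage state after a complete commit (mirrors FIG. 19's layout):
storage = {
    231: {"type": "inode", "inode": 1},
    232: {"type": "inode", "inode": 2},
    233: {"type": "MCB", "fsync_bit": True, "inode_ptrs": [231, 232]},
}
# Storage state after a crash before the MCB write:
partial = {
    231: {"type": "inode", "inode": 1},
}
```

In the complete state the check succeeds; in the partial state it fails, and recovery would discard the incomplete transaction.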
  • FIGS. 20A and 20B are flowcharts illustrating a method of interim storage of transaction data for large-capacity transactions according to an embodiment of the present invention.
  • FIGS. 20A and 20B may be collectively referred to as FIG. 20.
  • In step S210, the operating system 12 may receive a transaction start call from the application 11.
  • In step S220, the operating system 12 may receive, from the application 11, a first write operation call (WO#1) for the first to third pages (pages 1 to 3) of the first file (fd#1).
  • In step S310, the operating system 12 may pin the data of the first to third pages in a page cache of a memory (e.g., DRAM) managed by the operating system 12.
  • Step S310 may be executed based on the condition that space for storing the pages designated by the first write operation call (WO#1) (the first to third pages in the example of FIG. 20) is already available in the page cache.
  • In step S230, the operating system 12 may receive, from the application 11, a second write operation call (WO#2) for the fourth to eighth pages (pages 4 to 8) of the first file (fd#1).
  • The operating system 12 may then execute step S320.
  • Step S320 may include the following steps S321, S322, and S323.
  • In step S321, the operating system 12 may pin all pages, among the pages designated by the second write operation call (WO#2), that can be stored in the remaining space of the page cache (the fourth to sixth pages in the example of FIG. 20).
  • In step S322, in order to secure space to store the standby pages (the seventh to eighth pages in the example of FIG. 20), the operating system 12 may evict, to the storage 20, evict pages (the first to second pages in the example of FIG. 20), which are some of the pages already pinned in the page cache. To this end, the operating system 12 may transmit an evict command for the evict pages to the storage 20.
  • the size of the standby pages and the size of the evict pages may be equal to each other.
  • The storage 20 may store the evict pages in the non-volatile memory of the storage 20 when receiving an evict command for the evict pages from the operating system 12.
  • In step S323, the operating system 12 may pin, to the page cache, new file mapping information, which is mapping information between the evict pages and the blocks in the storage 20 in which the evict pages are stored.
  • In step S330, the operating system 12 may pin the standby pages in the space occupied by the evict pages in the page cache.
  • the write priority of the Evict pages may be higher than the write priority of pages pinned to the page cache after the execution of the Evict command.
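The evict-then-pin behavior of steps S321-S330 can be sketched as a fixed-size page cache. This is a hedged illustration: the class names, the FIFO choice of eviction victims, and the list standing in for the storage device are all assumptions.

```python
class PageCache:
    """Sketch of a fixed-size page cache with pre-commit eviction.

    When a write operation pins more pages than the remaining space can
    hold, the oldest pinned pages are evicted (written to storage early)
    and the freed slots are reused for the standby pages, mirroring
    steps S321-S330 above.
    """

    def __init__(self, capacity, storage):
        self.capacity = capacity
        self.pinned = []          # pages currently pinned, oldest first
        self.storage = storage    # list standing in for the device
        self.evicted = []         # evict pages written early

    def pin(self, pages):
        for page in pages:
            if len(self.pinned) == self.capacity:
                # Evict the oldest pinned page to make room (S322).
                victim = self.pinned.pop(0)
                self.storage.append(victim)
                self.evicted.append(victim)
            self.pinned.append(page)  # pin the standby page (S330)

device = []
cache = PageCache(capacity=6, storage=device)
cache.pin([1, 2, 3])          # WO#1: pages 1-3 fit directly (S310)
cache.pin([4, 5, 6, 7, 8])    # WO#2: pages 7-8 force eviction of pages 1-2
```

With a six-page cache, this reproduces the example of FIG. 20: pages 1-2 are evicted to storage before the commit call, and pages 3-8 remain pinned.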
  • In step S240, the operating system 12 may receive a commit call from the application 11.
  • In step S340, the operating system 12 may process transaction #1; step S340 may specifically include steps S341 to S345.
  • In step S341, the operating system 12 may transmit a write command to the storage 20.
  • The write command may be a write command for the pages of the first set pinned to the page cache.
  • The storage 20 may store the first set of pages in a volatile memory in the storage 20.
  • In the example of FIG. 20, the pages of the first set are the third to eighth pages.
  • In step S342, the operating system 12 may transmit a flush command to the storage 20.
  • The storage 20 may store (flush) the information stored in the volatile memory in the storage 20 to the non-volatile memory in the storage 20.
  • In step S343, the operating system 12 may transmit, to the storage 20, a command to invalidate the old blocks that were occupied by the evict pages in the storage 20 before execution of the evict command.
  • In step S440, the storage 20 may invalidate the old blocks.
  • In step S344, the operating system 12 may cancel the pinning of the new file mapping information.
  • In step S345, the operating system 12 may transmit a checkpoint command for the new file mapping information to the storage 20.
  • In step S450, the storage 20 may store the new file mapping information in the non-volatile memory within the storage 20.
  • Steps S345 and S450 may be omitted.
  • FIG. 21 is a flowchart illustrating a method for an operating system to execute a transaction according to an embodiment of the present invention.
  • In step S810, the operating system of the host may initiate a first transaction including first data and second data.
  • In step S820, the operating system may receive a set of write operation calls regarding a plurality of data belonging to the first transaction from the application program.
  • In step S830, before the operating system receives a commit call for the first transaction from the application, the operating system may execute an evict that writes the first data, pinned in the first page of the page cache of the memory, to a first new block of the storage.
  • the evict step may be executed only when the size of the page cache is smaller than the total size of the plurality of pieces of data.
  • a step of pinning the second data to the first page by the operating system may be executed.
  • In step S841, the operating system may pin first file mapping information, indicating that the first data is mapped to the first new block, to the first node page of the memory.
  • In step S842, the operating system may modify the block bitmap 260 so that the state of the first new block becomes in-use.
  • In step S843, the operating system may create a first relocation record 271 including an identifier of the first data, a location of a first old block, which is the old block of the first data, and a location of the first new block of the first data.
  • In step S844, the operating system may insert the generated first relocation record into the relocation list 270.
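Steps S843-S844 can be sketched with a minimal relocation record and list. The dict layout and function name are assumptions for illustration; only the three fields (data identifier, old block location, new block location) come from the description above.

```python
def record_relocation(relocation_list, data_id, old_block, new_block):
    """Create a relocation record (S843) and insert it into the
    relocation list (S844).

    Each record keeps the data identifier, the old block location, and
    the new block location, so that a later commit can invalidate the
    old block while an abort can invalidate the new one instead.
    """
    record = {"id": data_id, "old": old_block, "new": new_block}
    relocation_list.append(record)
    return record

relocation_list = []
# Hypothetical example: first data moved from old block 100 to new block 200.
record_relocation(relocation_list, data_id="data#1", old_block=100, new_block=200)
```

Keeping both block locations in one record is what lets the commit and abort paths described next operate symmetrically on the same list.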
  • FIG. 22 is a flowchart illustrating a method for an operating system to commit a transaction according to an embodiment of the present invention.
  • Steps shown in FIG. 22 may be executed after step S844 of FIG. 21 .
  • In step S851, the operating system may search the relocation list to confirm the location of the first old block of the first data.
  • In step S852, the operating system may invalidate the confirmed first old block of the first data.
  • In step S853, the operating system may flush dirty pages of the page cache.
  • In step S854, if the operating system determines that the dirty pages are sustainable, it may unpin the first node page.
  • In step S855, the operating system may insert the first node page into a dirty node page list.
  • In step S856, the operating system may flush the dirty node page list.
  • In step S870, the operating system may commit the first transaction.
  • The first transaction may be committed only when the commit block is determined to be sustainable.
  • FIG. 23 is a flowchart illustrating a method of restoring the state of a file system to the state prior to the start of a canceled transaction, when a transaction is canceled, according to an embodiment of the present invention.
  • The steps shown in FIG. 23 may be executed after step S844 of FIG. 21 under the condition that the first transaction is aborted.
  • In step S861, the operating system may search the relocation list to confirm the position of the first new block of the first data.
  • In step S862, the operating system may invalidate the first new block of the first data.
  • In step S863, the operating system may record, in the first node page, file mapping information indicating that the first data is mapped to the first old block.
  • In step S864, the operating system may unpin the first node page.
  • pinning may refer to an operation of preventing content written in the page cache from being written to storage.
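The commit path (FIG. 22) and the abort path (FIG. 23) are symmetric over the relocation list: commit invalidates each record's old block, while abort invalidates each record's new block and restores the old mapping. A hedged sketch, with the set/dict bookkeeping as an assumed stand-in for block invalidation and file mapping:

```python
def commit(relocation_list, invalid_blocks):
    """On commit (S851-S852): look up each record and invalidate its
    old block; the new block stays mapped, so no mapping change is needed."""
    for rec in relocation_list:
        invalid_blocks.add(rec["old"])

def abort(relocation_list, invalid_blocks, mapping):
    """On abort (S861-S863): invalidate each record's new block and
    restore the file mapping of the data to its old block."""
    for rec in relocation_list:
        invalid_blocks.add(rec["new"])
        mapping[rec["id"]] = rec["old"]

records = [{"id": "data#1", "old": 100, "new": 200}]

# Abort: block 200 is invalidated and data#1 points back at block 100.
aborted_invalid, aborted_map = set(), {"data#1": 200}
abort(records, aborted_invalid, aborted_map)

# Commit: only the old block 100 is invalidated.
committed_invalid = set()
commit(records, committed_invalid)
```

This symmetry is why a single relocation record per data item suffices to support both outcomes of the transaction.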
  • The present invention was developed in the course of conducting a research project on a contention-free scalable input/output subsystem for an ultra-low-latency storage device, supported by the Ministry of Science and ICT (the project performing organization) and the National Research Foundation of Korea (project identification number 2020R1A2C300852513, research period 2022.03.01 to 2023.02.28).


Abstract

Disclosed is a transaction execution method comprising: an eviction step in which an operating system of a host writes some write pages of a plurality of write pages to a storage unit when a size of a page cache is smaller than the sizes of the plurality of write pages belonging to a first transaction; and a commit step in which the operating system commits the first transaction.
PCT/KR2023/001251 2022-01-27 2023-01-27 Technique et appareil de stockage intermédiaire de données de transaction destinés à des transactions massives WO2023146332A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0012633 2022-01-27
KR20220012633 2022-01-27

Publications (1)

Publication Number Publication Date
WO2023146332A1 true WO2023146332A1 (fr) 2023-08-03

Family

ID=87472035

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/001251 WO2023146332A1 (fr) 2022-01-27 2023-01-27 Technique et appareil de stockage intermédiaire de données de transaction destinés à des transactions massives

Country Status (2)

Country Link
KR (1) KR20230115930A (fr)
WO (1) WO2023146332A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064647A1 (en) * 2002-06-27 2004-04-01 Microsoft Corporation Method and apparatus to reduce power consumption and improve read/write performance of hard disk drives using non-volatile memory
KR20120104364A (ko) * 2009-12-15 2012-09-20 Intel Corporation Performing mode switching in an unbounded transactional memory (UTM) system
US20130166816A1 (en) * 2011-02-25 2013-06-27 Fusion-Io, Inc. Apparatus, System, and Method for Managing Contents of a Cache
US20160179687A1 (en) * 2014-12-22 2016-06-23 Intel Corporation Updating persistent data in persistent memory-based storage
US20180095887A1 (en) * 2016-09-30 2018-04-05 International Business Machines Corporation Maintaining cyclic redundancy check context in a synchronous i/o endpoint device cache system

Also Published As

Publication number Publication date
KR20230115930A (ko) 2023-08-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23747373

Country of ref document: EP

Kind code of ref document: A1