WO2016122710A1 - Byte addressable non-volatile random access memory for storing log record - Google Patents

Byte addressable non-volatile random access memory for storing log record

Info

Publication number
WO2016122710A1
Authority
WO
WIPO (PCT)
Prior art keywords
transaction
log
undo
log record
redo
Prior art date
Application number
PCT/US2015/039771
Other languages
French (fr)
Inventor
Sathyanarayanan MANAMOHAN
Shastry LINGADAHALLI
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Publication of WO2016122710A1 publication Critical patent/WO2016122710A1/en

Classifications

    • G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing
    • G06F 11/1474: Saving, restoring, recovering or retrying in transactions
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1471: Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G06F 16/2343: Locking methods, e.g. distributed locking or locking implementation details
    • G06F 12/0866: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems such as disk caches
    • G06F 2201/82: Solving problems relating to consistency
    • G06F 2212/1024: Latency reduction
    • G06F 2212/222: Employing cache memory using non-volatile memory technology
    • G06F 2212/466: Caching metadata and control data in disk cache

Definitions

  • ARIES: Algorithms for Recovery and Isolation Exploiting Semantics
  • WAL: write-ahead logging
  • DRAM: dynamic random-access memory
  • NVRAM: non-volatile random-access memory
  • LSN: last sequence number
  • ACID: atomicity, consistency, isolation, durability
  • the logging system of the present disclosure supports both undo and redo logs, hence a transaction manager of the present logging system can be used to perform a recovery operation.
  • the primary performance bottleneck with a recovery operation is the time spent performing many random IOs to bring the buffer pool back to a state where undo information can be used.
  • processing and converting the block-based redo logs on the disk to a format that is usable in the DRAM will impact the performance of the recovery operation. It should also be noted that the system is not available for transactions until the buffer pool is restored by the redo log.
  • the present hash partitioning of the redo log based on page ID enables the parallel recovery of the pages. Instead of reading a serial redo log, recovery threads can be assigned to process every hash bucket in parallel.
  • the logging system of the present disclosure also utilizes a single format in which the redo logs are stored, hence the cost of converting a disk-based structure to a DRAM-based structure is completely avoided. Additionally, undo of in-flight transactions can be parallelized because the undo log records are hash partitioned on transaction ID. These improvements can significantly reduce the recovery time of a system that uses the transaction manager of the present disclosure.
  • FIG. 4 illustrates a flowchart of an example method 400 for performing a transaction.
  • the method 400 may be performed by the processor of a logging system, e.g., serving as a transaction manager, or a computer as illustrated in FIG. 6 and discussed below.
  • the method 400 begins.
  • a transaction is started.
  • the transaction will impact data stored in a database, e.g., located on a persistent storage.
  • method 400 writes a log record to the NVRAM associated with the transaction.
  • the log record may comprise a redo log record or an undo log record.
  • the undo log record is partitioned based on a transaction ID, whereas the redo log record is partitioned based on a page ID.
  • method 400 commits the transaction. It should be noted that the commit operation completes without the log record needing to be flushed to a persistent storage (a minimal sketch of this flow appears after this list). Method 400 ends in block 495.
  • FIG. 5 illustrates a flowchart of example method 500 for performing a recovery operation.
  • the method 500 may be performed by the processor of a logging system, e.g., serving as a transaction manager, or a computer as illustrated in FIG. 6 and discussed below.
  • method 500 begins.
  • method 500 starts a recovery operation, e.g., a crash recovery operation.
  • For example, a system crash may have occurred that requires a recovery operation to be performed.
  • the method 500 applies a plurality of threads to access a plurality of log records simultaneously.
  • each separate thread can be used to access a different log record or page simultaneously.
  • the plurality of log records is stored on a non-volatile random access memory, where the non-volatile random access memory is byte addressable.
  • method 500 does not have to perform a sequential logging operation.
  • the method 500 performs the recovery operation using the data obtained from the plurality of log records. For example, the redo phase and undo phase can be performed as discussed above. Method 500 ends in block 595.
  • one or more blocks, functions, or operations of the methods 400 and 500 described above may include a storing, displaying and/or outputting block as required for a particular application.
  • any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or outputted to another device as required for a particular application.
  • blocks, functions, or operations in FIG. 4 and FIG. 5 that recite a determining operation, or involve a decision do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step.
  • FIG. 6 depicts a high-level block diagram of a computer that can be transformed into a machine capable of performing the functions described herein. Notably, no computer or machine currently exists that performs the functions as described herein. As a result, the examples of the present disclosure improve the operation and functioning of the computer to perform a transaction or a recovery operation, as disclosed herein.
  • the computer 600 comprises a hardware processor element 602, e.g., a central processing unit (CPU), a microprocessor, or a multi-core processor, a memory 604, e.g., random access memory (RAM), NVRAM, and/or read only memory (ROM), a module 605 for performing a transaction or a recovery operation, and various input/output devices 606, e.g., storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive, a receiver, a transmitter, a speaker, a display, a speech synthesizer, an output port, an input port and a user input device, such as a keyboard, a keypad, a mouse, a microphone, and the like.
  • a hardware processor element 602 e.g., a central processing unit (CPU), a microprocessor, or a multi-core processor
  • the computer may employ a plurality of processor elements.
  • if the method(s) as discussed above are implemented in a distributed or parallel manner for a particular illustrative example, i.e., the blocks of the above method(s) or the entire method(s) are implemented across multiple or parallel computers, then the computer of this figure is intended to represent each of those multiple computers.
  • one or more hardware processors can be utilized in supporting a virtualized or shared computing environment.
  • the virtualized computing environment may support one or more virtual machines representing computers, servers, or other computing devices.
  • hardware components such as hardware processors and computer-readable storage devices may be virtualized or logically represented.
  • the present disclosure can be implemented by machine readable instructions and/or in a combination of machine readable instructions and hardware, e.g., using application specific integrated circuits (ASIC), a programmable logic array (PLA), including a field-programmable gate array (FPGA), or a state machine deployed on a hardware device, a general purpose computer or any other hardware equivalents, e.g., computer readable instructions pertaining to the method(s) discussed above can be used to configure a hardware processor to perform the blocks, functions and/or operations of the above disclosed methods.
  • ASIC application specific integrated circuits
  • PLA programmable logic array
  • FPGA field-programmable gate array
  • instructions and data for the present module or process 605 for performing a transaction or a recovery operation can be loaded into memory 604 and executed by hardware processor element 602 to implement the blocks, functions or operations as discussed above in connection with the exemplary methods 400 and 500.
  • the module 605 may include one or more programming code components, including a database updating component 608, e.g., a transaction manager performing the various functions as discussed above. These programming code components may be included on one or more of the processing nodes of a computing system, such as system 100.
  • a hardware processor executes instructions to perform "operations"
  • the processor executing the machine readable instructions relating to the above described method(s) can be perceived as a programmed processor or a specialized processor.
  • the present module 605 for performing a transaction or a recovery operation, including associated data structures, of the present disclosure can be stored on a tangible or physical (broadly non-transitory) computer-readable storage device or medium, e.g., volatile memory, non-volatile memory, ROM memory, RAM memory, magnetic or optical drive, device or diskette and the like.
  • the computer-readable storage device may comprise any physical devices that provide the ability to store information such as data and/or instructions to be accessed by a processor or a computing device such as a computer or an application server.
  • FIG. 7 illustrates a block diagram of an example system for performing a transaction.
  • System 700 may include at least one computing device that is capable of communicating with at least one remote system.
  • System 700 may be similar to system 100 of FIG. 1 or system 600 of FIG. 6, for example.
  • system 700 includes a processor 710 and a machine-readable storage medium 720.
  • although the following descriptions refer to a single processor and a single machine-readable storage medium, the descriptions may also apply to a system with multiple processors and multiple machine-readable storage mediums.
  • the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and the instructions may be distributed (e.g., executed by) across multiple processors.
  • Processor 710 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 720.
  • processor 710 may fetch, decode, and execute instructions 722, 724, and 726 to perform a transaction.
  • processor 710 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in machine-readable storage medium 720.
  • Machine-readable storage medium 720 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions.
  • machine-readable storage medium 720 may be, for example, Random Access Memory (RAM) or NVRAM, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like.
  • Machine-readable storage medium 720 may be disposed within system 700, as shown in FIG. 7.
  • machine-readable storage medium 720 may be a portable, external or remote storage medium, for example, that allows system 700 to download the instructions from the portable/external/remote storage medium.
  • the executable instructions may be part of an "installation package."
  • machine-readable storage medium 720 may be encoded with executable instructions for performing a transaction or a recovery operation.
  • Starting a transaction instructions 722, when executed by a processor (e.g., 710), may cause system 700 to start a transaction.
  • Writing a log instructions 724, when executed by a processor (e.g., 710), may cause system 700 to write a log record associated with the transaction to a non-volatile random access memory, wherein the non-volatile random access memory is byte addressable.
  • Committing instructions 726, when executed by a processor (e.g., 710), may cause system 700 to commit the transaction.
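As referenced above, the flow of method 400 (start a transaction, write its log record straight into byte-addressable NVRAM, then commit without any disk flush) can be illustrated with a minimal self-contained C sketch. Everything below, including the names, the record layout and the array simulating an NVRAM region, is an assumption made for illustration and is not taken from the patent.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of method 400: begin a transaction, write its log
 * records directly into a byte-addressable NVRAM region, then commit
 * without flushing anything to disk. NVRAM is simulated with ordinary memory. */

struct log_record {
    uint64_t txn_id;
    uint16_t type;              /* 1 = undo information, 2 = redo information */
    char     payload[48];       /* variable-length details in a real system */
};

static struct log_record nvram_log[64];   /* stand-in for an NVRAM log area */
static int tail;

static uint64_t txn_begin(void) { static uint64_t next = 1; return next++; }

static void txn_log(uint64_t id, uint16_t type, const char *detail)
{
    struct log_record *r = &nvram_log[tail++ % 64];
    r->txn_id = id;
    r->type = type;
    strncpy(r->payload, detail, sizeof r->payload - 1);
    /* byte-addressable: the record is written in place, in native format */
}

static void txn_commit(uint64_t id)
{
    /* the commit completes here; no log buffer is forced to disk first */
    printf("txn %llu committed\n", (unsigned long long)id);
}

int main(void)
{
    uint64_t t = txn_begin();                      /* start a transaction  */
    txn_log(t, 1, "undo: before-image of row");    /* write log records to */
    txn_log(t, 2, "redo: after-image of row");     /* the NVRAM            */
    txn_commit(t);                                 /* commit, no disk flush */
    return 0;
}
```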

Abstract

A method is described in which a transaction is started. The method writes a log record associated with the transaction to a non-volatile random access memory, wherein the non-volatile random access memory is byte addressable, and commits the transaction.

Description

BYTE ADDRESSABLE NON-VOLATILE RANDOM ACCESS MEMORY FOR
STORING LOG RECORD
BACKGROUND
[0001] Database systems are designed and optimized for disks and memory hierarchies. For example, transactions are an essential part of a data management system. Database systems may use Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)-style write-ahead logging (WAL) to implement the transactions. However, ARIES techniques are optimized for disk-based systems and tuned for the sequential write performance of disks. Most transaction processing system implementations rely on dynamic random-access memory (DRAM) for performance and on disks for persistent storage. Such database systems have varying levels of storage latency, e.g., when reading and writing data to and from the DRAM and the hard disks.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates a block diagram of an example logging system of the present disclosure;
[0003] FIG. 2 illustrates an example structure of undo log records;
[0004] FIG. 3 illustrates an example structure of redo log records;
[0005] FIG. 4 illustrates a flowchart of an example method for performing a transaction;
[0006] FIG. 5 illustrates a flowchart of an example method for performing a recovery operation;
[0007] FIG. 6 depicts a high-level block diagram of a computer that can be transformed into a machine capable of performing the functions described herein; and
[0008] FIG. 7 illustrates a block diagram of an example system for performing a transaction.
DETAILED DESCRIPTION
[0009] The present disclosure broadly discloses a method, apparatus and non-transitory computer-readable medium for utilizing non-volatile random-access memory (NVRAM) for logging. In one example of the present disclosure, write-ahead logging (WAL) is optimized through the use of NVRAM-based systems.
[0010] Database systems may use ARIES-style write-ahead logging (WAL) to implement transactions. Generally, ARIES techniques are optimized for disk-based systems and tuned for the sequential write performance of disks. However, sequential logging creates significant bottlenecks.
[0011] Proper management of transactions is an essential part of a data management system. Strong transactional support is important for supporting the operational activities of all businesses. Thus, transaction processing systems are designed to support the ACID (atomicity, consistency, isolation, durability) properties. The atomicity property requires that the transaction be all or nothing, i.e., if one part of the transaction fails, then the entire transaction fails. The consistency property ensures that any changes to values in an instance are consistent with changes to other values in the same instance. The isolation property ensures that the concurrent execution of transactions will result in a system state that would be obtained if the transactions were executed serially, i.e., the effect on the database is the same whether the transactions are executed in serial order or in an interleaved fashion. Finally, the durability property ensures the ability of the system to recover committed transaction updates if either the system or the storage medium fails. For example, transaction systems may use write-ahead logging or shadow copies, and optimistic or pessimistic concurrency control techniques, to support the ACID properties.
[0012] In one example, NVRAM combines the persistence properties of media like hard disks with the ability to access content like DRAM. It is noted that a software stack modeled on disks, which are block accessed and slow due to mechanical components, is inefficient. More specifically, in one example, "byte addressable" NVRAM alleviates the need for the deep memory hierarchies that were built around disk-based systems to match disk speeds with DRAM speeds.
[0013] In one example, the use of NVRAM may reduce the complexity of data storage. The large memory capacity of NVRAM enables a large amount of data to be stored in a single flat memory bank. This eliminates the complexities involved in distributing and arranging the data across different storage layers (e.g., a DRAM layer and a disk layer) with varying access latencies. Partitioning can be applied to the working data structures to improve concurrency and throughput. This will provide order-of-magnitude improvements in data processing compared to storing the data in disk storage and bringing the data into main memory on demand.
[0014] The byte addressable nature of NVRAM enables the data and metadata to be stored in their native format without converting the data into block-oriented formats for storage and transmission. The present disclosure leverages this property to provide optimal data structures for transaction management that are not hampered by the need to serialize the data into a block-oriented storage format.
[0015] Transaction support is essential in all data management systems to handle failures and to reduce the impact of failures on the overall system's behavior. Data management systems have to support solid transaction semantics and robust failure recovery. ARIES-style WAL is the de facto standard of transactional logging. ARIES adopts centralized or global logging to leverage the sequential write performance of disks. To hide the performance difference between the DRAM and the disk, the log records are cached in DRAM and forced to disk at the time of commit. However, the centralized logging can be a significant bottleneck. Logging is tightly coupled with recovery, and the logging design has to consider its impact on recovery algorithms.
[0016] To illustrate, a transactional storage engine can be a pluggable software module that performs various data management operations, such as create, insert, update and delete operations, on the data it manages in a transactionally consistent manner. The transactional storage engine may use WAL to manage transactions. For example, the transactional storage engine may maintain the transaction logs in DRAM as a circular buffer and a persistent copy of the same content in a flat file on disk. All database operations performed by the transactional storage engine that manipulate database pages are logged in the redo logs prior to actual execution. Contents of the redo logs are flushed to the disk prior to transaction commit, in line with WAL semantics. The redo log is used during system recovery. For example, the system recovery may start by replaying the redo log onto the buffer pool until all database pages are recovered. This replay starts from the last successful checkpoint. Once this replay is over, the undo log is used to roll back all partially completed transactions.
[0017] For example, the transactional storage engine may maintain the undo log in memory and on the disk. The undo logs may contain before-images of database records that are being modified by a transaction. The undo logs can be used to support transaction rollback. As discussed above, the redo log and buffer pool are flushed periodically to the disk.
[0018] In contrast, the present disclosure provides an NVRAM-optimized logging system that writes log records directly to an NVRAM. In one example, the NVRAM is byte-addressable, similar to a DRAM. Byte addressing refers to an architecture that supports accessing individual bytes of data.
[0019] In one example, there are two types of information fields within log records: (a) undo information, which stores information about how to undo a change, and (b) redo information, which stores information about how to reproduce a change. As discussed above, in database systems the log records can be cached in memory and persisted on disk as a large sequential file. The log files are flushed to disk before the transaction commits.
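As a minimal illustration of these two field types, the sketch below records a before-image (undo) and an after-image (redo) for a single integer update; the record layout is invented for this example and is not taken from the patent.

```c
#include <stdio.h>

/* Tiny illustration of the two log-field types: undo information (how to
 * reverse a change) and redo information (how to reproduce it). */

struct update_log {
    int undo_value;   /* before-image: restores the old value on rollback */
    int redo_value;   /* after-image: reapplies the change on recovery */
};

int main(void)
{
    int row = 10;
    struct update_log rec = { .undo_value = row, .redo_value = 42 };

    row = rec.redo_value;                    /* apply the change */
    printf("after update: %d\n", row);       /* prints 42 */

    row = rec.undo_value;                    /* roll back using undo info */
    printf("after rollback: %d\n", row);     /* prints 10 */
    return 0;
}
```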
[0020] In contrast, the present disclosure provides an NVRAM-optimized logging system that writes log records on NVRAM directly and synchronously. FIG. 1 illustrates a block diagram of an example logging system 100 of the present disclosure. For example, the logging system 100 comprises a memory 110, e.g., a random access memory 110, a non-volatile random access memory (NVRAM) 120, a persistent storage 130, e.g., a disk, and a processor 140, e.g., a central processing unit (CPU) implementing a transaction manager of the present disclosure. In one example, the redo log portion 122 and the undo log portion 124 receive log records 123 and 125 directly and synchronously on the NVRAM 120. The log records stored in the redo log portion 122 and the undo log portion 124 are not required to be flushed to the disk 130. In one example, each of the log records 123 contains redo information and each of the log records 125 contains undo information.
[0021] The logging system 100 of the present disclosure simplifies log buffer management. By writing log records directly to the NVRAM 120, the transaction manager of the logging system 100 of the present disclosure reduces the input/output (IO)-related delay, which in turn reduces extensive context switching. The context switching is reduced because persistence of the redo data is now synchronous with the commit operation. A commit operation is the final step in the successful completion of a previously started database change as part of handling a transaction in a computing system. Hence, the logging system 100 of the present disclosure does not need to wait for a flush to disk, which is usually performed by another thread, to finish the commit operation. As discussed above, log records can be cached in the DRAM in an in-memory format and stored on disk in a block-oriented format. The log records are converted from the DRAM format to the disk format while persisting log records, and vice versa while reading log records from disk during recovery. The logging system 100 of the present disclosure avoids the multiple formats and implements a single unified format of log records on the NVRAM 120, thereby avoiding the extra memory copy and log record conversion complexity. Since the logging system 100 of the present disclosure writes log records directly on the NVRAM 120, there is no need to force the log buffers to disk before the commit operation of a transaction. The transactional locks are held until the log records are persisted. Writing the log to the NVRAM avoids the IO and hence reduces the lock duration. This in turn reduces the lock contention.
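The commit path described here can be sketched as follows. This is an illustrative C fragment, not the patent's implementation: the NVRAM is simulated with ordinary memory, and nvram_persist is a hypothetical barrier standing in for whatever cache write-back and fencing a real platform or its NVRAM APIs would require.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative commit path: the commit record is written into an
 * NVRAM-resident structure and persisted in the same thread, so commit
 * does not wait for a log flush performed by another thread. */

/* Hypothetical persistence barrier. On real hardware this would write the
 * affected cache lines back to the NVRAM media and fence; here it is a
 * no-op because the NVRAM is simulated with ordinary memory. */
static void nvram_persist(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
}

struct commit_record {
    unsigned long txn_id;
    int committed;                     /* durable the moment it is persisted */
};

static struct commit_record nvram_log[16];   /* stand-in NVRAM region */

static void commit_txn(unsigned long txn_id)
{
    struct commit_record *rec = &nvram_log[txn_id % 16];
    rec->txn_id = txn_id;
    rec->committed = 1;
    nvram_persist(rec, sizeof *rec);   /* synchronous with the commit */
    /* transactional locks can be released right here; nothing to wait for */
}

int main(void)
{
    commit_txn(42);
    printf("txn 42 committed: %d\n", nvram_log[42 % 16].committed);
    return 0;
}
```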
[0022] In one example, to overcome log-related contention, the present disclosure provides redo logs and undo logs that have different models of parallelism. The undo operations apply to a transaction and hence can be parallelized at the transaction level, whereas the redo operations apply to a page and hence can be parallelized at the page level. For example, the logging system 100 of the present disclosure distributes undo log records based on the transaction identification (ID) and implements each transaction's undo log as a linked list of undo records belonging to that transaction. In one example, the logging system 100 of the present disclosure implements a hash-based distribution of transaction IDs.
[0023] FIG. 2 illustrates an example structure 200 of undo log records 210. In one example, the information about the transaction state and the pointer to the undo log records is maintained in a hash table 220 keyed by the transaction identification (ID) 205. In one example, the details pertaining to the transaction ID 205 and the pointer to the undo chain are stored in the hash bucket header 207. Undo logging can be used to cancel the effects of incomplete transactions. In one example, each of the undo log records 210 may contain a header 212 that comprises a previous record field, a next record field, a page number field, a last sequence number (LSN) field, an undo number field, and a type field. Each of the undo log records 210 may also contain a payload 214 that comprises undo operation details 216. The undo log records 210 are implemented as a linked list. Hash partitioning of the undo logs eliminates the contention for writing undo log records from multiple transactions. Only at the beginning of a transaction does the present logging system need to acquire a lock to claim the corresponding hash slot. Thus, the present logging system improves the concurrency of undo log operations.
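The layout of FIG. 2 can be approximated with plain C structures, with one bucket header per hash slot pointing at a per-transaction chain of records. The concrete types and field names below are assumptions for this sketch; the patent names the fields but gives no types.

```c
#include <stdint.h>

/* Sketch of the undo-log structure of FIG. 2: a hash table keyed by
 * transaction ID whose bucket headers point at per-transaction linked
 * lists of undo records. Types and names are illustrative assumptions. */

#define UNDO_BUCKETS 2053              /* prime; see the sizing discussion below */

struct undo_record {
    /* header 212 */
    struct undo_record *prev;          /* previous record field */
    struct undo_record *next;          /* next record field */
    uint64_t page_no;                  /* page number field */
    uint64_t lsn;                      /* last sequence number (LSN) field */
    uint32_t undo_no;                  /* undo number field */
    uint16_t type;                     /* type field */
    /* payload 214 */
    uint32_t payload_len;
    unsigned char payload[];           /* undo operation details 216 */
};

struct undo_bucket {                   /* hash bucket header 207 */
    uint64_t txn_id;                   /* transaction ID 205 */
    int      txn_state;                /* transaction state */
    struct undo_record *chain;         /* pointer to the undo chain */
};

static struct undo_bucket undo_table[UNDO_BUCKETS];   /* hash table 220 */

int main(void)
{
    struct undo_bucket *b = &undo_table[7 % UNDO_BUCKETS];
    b->txn_id = 7;
    struct undo_record rec = { .lsn = 100, .type = 1 };  /* empty payload */
    rec.next = b->chain;
    b->chain = &rec;                   /* prepend to transaction 7's chain */
    return 0;
}
```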
[0024] Similarly, the logging system 100 of the present disclosure distributes the redo log records 310 based on the page identification (ID) 305. In one example, the information about the page state and the pointer to the redo log records is maintained in a hash table 320 keyed by the page identification (ID) 305. FIG. 3 illustrates an example structure 300 of redo log records 310. In one example, the details pertaining to the page number or page ID 305, the start and end LSNs of the page, and the pointer to the redo chain are stored in the hash bucket header 307. In one example, each of the redo log records 310 may contain a header 312 that comprises a previous record field, a next record field, a last sequence number (LSN) field, and a type field. Each of the redo log records 310 may also contain a payload 314 that comprises redo operation details 316. More specifically, the redo log records 310 are implemented as a linked list. Namely, the present logging system stores redo records that consist of a transaction ID, record type, LSN, record payload, etc., as a linked list. This reduces the redo log write contention across the pages. Only parallel transactions that operate on the same page will contend for the redo log chain. With the customized distribution of redo and undo log records, the present logging system can implement more granular latches and increase the parallelism of logging operations. This reduces the log buffer contention and improves the performance.
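The redo side of FIG. 3 can be sketched similarly: appends hash on the page ID, so only writers to the same page contend for a chain. The per-bucket pthread mutex below stands in for the "more granular latches" mentioned above and is an assumption of this sketch.

```c
#include <pthread.h>
#include <stdint.h>

/* Sketch of the redo-log structure of FIG. 3: records are distributed over
 * a hash table keyed by page ID. Names and types are illustrative. */

#define REDO_BUCKETS 1024

struct redo_record {
    struct redo_record *prev, *next;   /* previous and next record fields */
    uint64_t lsn;                      /* last sequence number (LSN) field */
    uint16_t type;                     /* type field */
    /* a payload with the redo operation details would follow */
};

struct redo_bucket {                   /* hash bucket header 307 */
    uint64_t page_id;                  /* page ID 305 */
    uint64_t start_lsn, end_lsn;       /* start and end LSNs of the page */
    struct redo_record *chain;         /* pointer to the redo chain */
    pthread_mutex_t latch;             /* granular latch: one per bucket */
};

static struct redo_bucket redo_table[REDO_BUCKETS];

/* Append contends only with other writers of the same page's bucket. */
static void redo_append(uint64_t page_id, struct redo_record *rec)
{
    struct redo_bucket *b = &redo_table[page_id % REDO_BUCKETS];
    pthread_mutex_lock(&b->latch);
    rec->next = b->chain;
    if (b->chain) b->chain->prev = rec;
    b->chain = rec;
    b->end_lsn = rec->lsn;
    pthread_mutex_unlock(&b->latch);
}

int main(void)
{
    for (int i = 0; i < REDO_BUCKETS; i++)
        pthread_mutex_init(&redo_table[i].latch, NULL);
    struct redo_record r = { .lsn = 101, .type = 2 };
    redo_append(7, &r);
    return 0;
}
```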
[0025] It should be noted that ARIES recommends periodic checkpointing to accelerate the recovery operation. The checkpointing flushes the log records and dirty pages in the buffer pool to the disks. The checkpoint log record holds information about the active transactions, their states and the flushed LSN. In the present disclosure, since the active transactions and their states are directly written to the NVRAM along with the undo logging, the checkpoint log record only has to record the flushed LSN. This improves the performance of the checkpointing operation.
[0026] In one example, the present parallel hash-based distribution of the redo and undo log records provides the opportunity to parallelize recovery operations as well. ARIES recommends recovery in three phases: (i) the analysis phase, during which the algorithm reads the flushed LSN information from the checkpoint and scans the log records sequentially to gather the required redo and undo operation information; (ii) the redo phase, during which redo operations are applied to bring the database back to the state before the crash; and (iii) the undo phase, during which undo operations are applied to reverse the effects of in-flight transactions. The distributed redo log records of the present disclosure will enable parallelism in building and applying redo operations. Similarly, the distributed undo log records will enable parallelism in rolling back the in-flight transactions. Thus the logging system of the present disclosure improves the performance of recovery operations.
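The recovery parallelism this enables can be illustrated by assigning one worker thread per hash bucket instead of scanning a single sequential log. The bucket count and the replay stub below are placeholders for illustration, not details from the patent.

```c
#include <pthread.h>
#include <stdio.h>

/* Illustrative parallel redo: one worker per hash bucket, each replaying
 * its own redo chain independently. The replay body is a stub. */

#define BUCKETS 8

static void *recovery_worker(void *arg)
{
    int b = *(int *)arg;
    /* walk bucket b's redo chain and apply each record to its page */
    printf("replayed redo chain of bucket %d\n", b);
    return NULL;
}

int main(void)
{
    pthread_t tid[BUCKETS];
    int ids[BUCKETS];

    /* redo phase: all buckets recover in parallel */
    for (int b = 0; b < BUCKETS; b++) {
        ids[b] = b;
        pthread_create(&tid[b], NULL, recovery_worker, &ids[b]);
    }
    for (int b = 0; b < BUCKETS; b++)
        pthread_join(tid[b], NULL);

    /* the undo phase would likewise roll back in-flight transactions in
     * parallel, one worker per transaction-ID bucket */
    return 0;
}
```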
[0027] The logging system of the present disclosure depends on system primitives and programming APIs to read and write the NVRAM. These APIs shall: (a) support a namespace for the NVRAM, (b) support dynamic memory management, and (c) support variable-length read/write operations, which shall guarantee atomicity and durability.
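As a sketch only, an API meeting these three requirements might take the following shape. The signatures and stub bodies are invented for illustration; they are not from the patent or any particular NVRAM library, and the stubs do not actually provide atomicity or durability.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical API shape for requirements (a)-(c) above. */

typedef struct nv_region { const char *name; } nv_region;

/* (a) namespace support: attach to a named NVRAM region (stub) */
static nv_region *nv_open(const char *name)
{
    static nv_region r;
    r.name = name;
    return &r;
}

/* (b) dynamic memory management within the region (stubbed with malloc) */
static void *nv_alloc(nv_region *r, size_t len) { (void)r; return malloc(len); }
static void  nv_free(nv_region *r, void *p)     { (void)r; free(p); }

/* (c) variable-length write that must guarantee atomicity and durability:
 * either the whole record becomes durable or none of it does. The stub only
 * copies bytes; a matching nv_read would mirror this signature. */
static int nv_write(nv_region *r, void *dst, const void *src, size_t len)
{
    (void)r;
    memcpy(dst, src, len);
    return 0;
}

int main(void)
{
    nv_region *log = nv_open("undo-log");
    char *rec = nv_alloc(log, 64);
    nv_write(log, rec, "before-image bytes", 19);
    nv_free(log, rec);
    return 0;
}
```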
[0028] Partitioning the log structures as disclosed in the present disclosure addresses several bottlenecks that are seen with sequential logs. A global log is needed in disk-based systems to ensure that persistence of the log records is done in an optimal manner. This enables optimizations like group commits on disk-based systems. This, in turn, forces the undo and redo components of the logs to be persisted on the same global log, creating a synchronization bottleneck. This is because, if any transaction wants to append an undo or redo record to the log, that transaction's thread has to acquire the semaphore on the global log and reserve space in the log to perform the update. During this time, other threads have to go into a wait state until the current thread releases the log. This increases contention at the head of the sequential log. It also increases the amount of context switching performed by the system.
[0029] Another implication of the global log design is that during crash recovery, the system has to process the sequential log to extract the information into some hash-based structure to enable parallelism in the recovery operation. Otherwise, the system would scan the sequential logs one record at a time, which significantly increases the time the system takes to become ready to accept new transactions after the crash recovery operation.
[0030] The logging system of the present disclosure avoids these problems by having separate hash-based structures for the undo and redo logs in the NVRAM. By the use of hashing, the logging system of the present disclosure breaks up the single global log head into a number of streams equal to the hash table's bucket count. This eliminates the bottleneck on the global log by parallelizing access to the logs. This enables several threads to write to the log simultaneously, consequently reducing log-induced contention and context switching.
[0031] Additionally, having separate log structures for the undo and redo logs allows each log to be partitioned in the most optimal way for its use case. For example, since undo logs are closely associated with a transaction, the logging system of the present disclosure partitions the undo logs based on transaction ID. This allows the logging system of the present disclosure to create sufficiently large hash buckets, to the extent where one can completely eliminate the need for synchronization constructs and make the undo logging practically lock free. As an example, to optimize a system to handle 2048 concurrent transactions, one can create an undo log hash table having more than 2048 buckets (or the closest prime number above that). In this way every transaction will receive its own exclusive hash bucket, thereby eliminating the need for a synchronization construct to manage the undo log. This also simplifies the search of the undo logs during a recovery operation.
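A small worked example of this sizing rule: searching upward from 2049 for the nearest prime gives 2053 buckets, and a transaction ID then maps to its bucket by a simple modulo. The code is illustrative; the patent specifies only "more than 2048 (or the closest prime number)" buckets.

```c
#include <stdbool.h>
#include <stdio.h>

/* Worked example of the undo-table sizing rule: the first prime above 2048
 * (primes spread a modulo hash more evenly) gives one bucket per concurrent
 * transaction when at most 2048 transactions are live. */

static bool is_prime(int n)
{
    if (n < 2) return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

int main(void)
{
    int buckets = 2049;
    while (!is_prime(buckets))
        buckets++;
    printf("undo hash table size: %d buckets\n", buckets);   /* prints 2053 */

    /* each transaction then claims its slot without further locking */
    unsigned long txn_id = 1234567UL;
    printf("txn %lu maps to bucket %lu\n", txn_id, txn_id % buckets);
    return 0;
}
```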
[0032] Similarly, the redo logs of the present disclosure are partitioned on page ID, which allows several database threads to operate in parallel as long as these threads do not attempt to append redo log records for the same page. The grouping of redo logs is done in traditional systems to improve IO performance. The IO from multiple transactions is grouped to make one combined IO. Writing the redo log on NVRAM avoids IO, and thus optimizations like grouping redo logs are not needed. Transactions can append log records directly into the log structure as they are generated.
[0033] The present disclosure also simplifies the process of releasing locks that were held by a transaction. To illustrate, a transaction acquires a lock to protect the data it is using against possible corruption from concurrent access. This ensures the isolation guarantee of the database is maintained. Lock release occurs at the end of a transaction, when the commit status of the transaction is flushed to a durable medium. In some systems, one must wait for the flushing of the commit record to complete, which involves an I/O bottleneck. Transactions are made to wait for the grouped log data to be sufficient to overcome the cost of doing a serial I/O to disk. The logging system of the present disclosure eliminates this completely because the logs are directly written to persistent media in their native form. The logging system of the present disclosure does not maintain two distinct data structures, one where the system buffers the logs and another that is used to perform bulk I/O to the disk. Thus, locks can be released as soon as the commit record is posted into the data structure.
[0034] The logging system of the present disclosure also reduces context switching. Many databases are multi-threaded to take advantage of the abundance of compute cores available in state-of-the-art CPUs. However, even though this is largely beneficial, due to the I/O and synchronization bottlenecks, many compute cycles are wasted in context switching and spin locks. The logging system of the present disclosure addresses the context switching part of the problem. In the present disclosure, persisting a log record is reduced to a write operation to an NVRAM-resident data structure. This can be performed in the same thread, without waiting for an I/O operation that would usually be performed by other I/O threads. This optimization allows the logging system of the present disclosure to drive the cores to perform more user work rather than waiting for I/O operations to complete.
[0035] The logging system of the present disclosure also eliminates log multiplexing, which combines log records from various transactions to achieve a volume that makes flush operations to disk more efficient. In disk-based logging systems, such a method is used to reduce the I/O overhead of writing to the disk; in flash-based systems, it is used to reduce erase-unit overheads. In both systems it is quite possible that more data is written than was actually updated, due to the block-oriented nature of writes on these media. On disk-based systems the block is usually 4-16 KB, and on flash-based systems the block size (erase unit size) is 128 KB. Since the logging system of the present disclosure relies on byte-addressability, it writes variable-length log data without having to worry about block boundaries. This results in faster commit times and better utilization of cores to perform more useful work.
[0036] In data management systems, a checkpoint can be treated as a marker that indicates the extent to which state information has been transferred to secure persistent storage. For performance reasons, modifications to data management system pages are not necessarily flushed to disk synchronously. Checkpointing is a costly operation and has a serious impact on the throughput of a system. Checkpoints are classified into two categories: full checkpoints and fuzzy checkpoints. In a full checkpoint, the data management system writes all dirty information to the disk. A fuzzy checkpoint, which is commonly used for performance reasons, writes only a certain number of dirty pages.
[0037] In sum, whenever a checkpoint occurs, the system may suffer a performance hit. In the logging system of the present disclosure, due to the NVRAM, the flushing of redo logs is completely eliminated, and a fuzzy checkpoint need only maintain the state of the pages that were flushed from the buffer pool. This also simplifies page stealing: because the logs are already persisted, the buffer pool manager has the freedom to pick up dirty pages on demand, making the checkpoint process and page stealing simpler and faster.
[0038] A recovery or crash recovery operation rebuilds the internal data structures of the data management system to a consistent state from which the storage engine can start processing transactions again. The recovery process also ensures that the overall consistency of the data is maintained. Recovery may occur in several steps or tasks, which are semantically similar for any data management system that supports crash recovery. The first task is to bring the data pages that were present in the buffer pool but were not flushed to the persistent medium back into a consistent state. This is performed by applying the redo log from the last checkpoint forward until all of the redo logs are exhausted. This process brings the buffer pool up to the state just prior to the point of failure.

[0039] However, the buffer pool also contains dirty data pages that are part of incomplete transactions. These incomplete transactions need to be rolled back, and the undo logs are used to perform this operation.
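Assuming the illustrative UndoLog and RedoLog structures sketched earlier, a commit_records set of committed transaction IDs, and a plain dictionary standing in for the page table, these two recovery tasks might look as follows. This is a sketch of the logic only; a real engine would begin the redo pass from the last checkpoint rather than from the start of each bucket.

```python
def recover(redo_log, undo_log, commit_records, pages):
    # Task 1 (redo): replay after-images forward to restore the buffer
    # pool to the state just prior to the failure.
    for bucket in redo_log.buckets:
        for page_id, after_image in bucket:
            pages[page_id] = after_image
    # Task 2 (undo): roll back transactions that never posted a commit
    # record, applying their before-images in reverse order.
    for bucket in undo_log.buckets:
        for txn_id, (page_id, before_image) in reversed(bucket):
            if txn_id not in commit_records:
                pages[page_id] = before_image
```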
[0040] The logging system of the present disclosure supports both undo and redo logs, hence a transaction manager of the present logging system can be used to perform a recovery operation. The primary performance bottleneck in a recovery operation is the time spent performing many random I/Os to bring the buffer pool back to a state where undo information can be used. Furthermore, processing and converting the block-based redo logs on the disk to a format that is usable in DRAM will impact the performance of the recovery operation. It should also be noted that the system is not available for transactions until the buffer pool is restored by the redo log.
[0041] In contrast, the present hash partitioning of the redo log based on page ID enables parallel recovery of the pages. Instead of reading a serial redo log, recovery threads can be assigned to process each hash bucket in parallel. The logging system of the present disclosure also stores the redo logs in a single format, hence the cost of converting a disk-based structure to a DRAM-based structure is completely avoided. Additionally, undo of in-flight transactions can be parallelized because the undo log records are hash partitioned on transaction ID. These improvements can significantly reduce the recovery time of a system that uses the transaction manager of the present disclosure.
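Because each hash bucket is independent, the redo pass above parallelizes naturally. The sketch below, with an illustrative worker count, assigns one recovery task per bucket using a thread pool; undo buckets, keyed by transaction ID, can be drained the same way.

```python
from concurrent.futures import ThreadPoolExecutor


def redo_bucket(bucket, pages):
    # Records in one bucket concern only pages that hash to it, so
    # buckets can be replayed independently of one another. (A real
    # engine would also partition the page table it writes into.)
    for page_id, after_image in bucket:
        pages[page_id] = after_image


def parallel_redo(redo_log, pages, workers=8):
    # One recovery task per hash bucket instead of one serial log scan.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for bucket in redo_log.buckets:
            pool.submit(redo_bucket, bucket, pages)
```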
[0042] FIG. 4 illustrates a flowchart of an example method 400 for performing a transaction. In one example, the method 400 may be performed by the processor of a logging system, e.g., serving as a transaction manager, or a computer as illustrated in FIG. 6 and discussed below.
[0043] At block 405, the method 400 begins. At block 410, a transaction is started. For example, the transaction will impact data stored in a database, e.g., located on a persistent storage.
[0044] At block 420, method 400 writes a log record to the NVRAM associated with the transaction. It should be noted that one or more log records can be written to the NVRAM. For example, the log record may comprise a redo log record or an undo log record. In one example, the undo log record is partitioned based on a transaction ID, whereas the redo log record is partitioned based on a page ID.
[0045] At block 430, method 400 commits the transaction. It should be noted that the commit operation completes without the log record needing to be flushed to a persistent storage. Method 400 ends at block 495.
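Pulling the blocks of method 400 together under the same illustrative assumptions (the hypothetical UndoLog and RedoLog classes sketched earlier), an end-to-end toy transaction might read:

```python
undo_log = UndoLog()
redo_log = RedoLog()
commit_records = set()


def run_transaction(txn_id, page_id, old_value, new_value):
    # Block 410: start the transaction.
    # Block 420: write undo and redo log records to the NVRAM-resident
    # structures (one or more records may be written).
    undo_log.append(txn_id, (page_id, old_value))
    redo_log.append(page_id, new_value)
    # Block 430: commit with a single store; nothing is flushed to disk.
    commit_records.add(txn_id)


run_transaction(txn_id=7, page_id=42, old_value=b"old", new_value=b"new")
```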
[0046] FIG. 5 illustrates a flowchart of an example method 500 for performing a recovery operation. In one example, the method 500 may be performed by the processor of a logging system, e.g., serving as a transaction manager, or a computer as illustrated in FIG. 6 and discussed below.
[0047] At block 505, the method 500 begins. At block 510, method 500 starts a recovery operation, e.g., a crash recovery operation. For example, a system crash may have occurred that requires a recovery operation to be performed.
[0048] At block 520, the method 500 applies a plurality of threads to access a plurality of log records simultaneously. In other words, each separate thread can be used to access a different log record or page simultaneously. In one example, the plurality of log records is stored on a non-volatile random access memory, where the non-volatile random access memory is byte addressable. Thus, method 500 does not have to scan the log sequentially.
[0049] At block 530, the method 500 performs the recovery operation using the data obtained from the plurality of log records. For example, the redo phase and undo phase can be performed as discussed above. Method 500 ends at block 595.
[0050] It should be noted that although not explicitly specified, one or more blocks, functions, or operations of the methods 400 and 500 described above may include a storing, displaying and/or outputting block as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or outputted to another device as required for a particular application. Furthermore, blocks, functions, or operations in FIG. 4 and FIG. 5 that recite a determining operation, or involve a decision, do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step.
[0051] FIG. 6 depicts a high-level block diagram of a computer that can be transformed into a machine capable of performing the functions described herein. Notably, no computer or machine currently exists that performs the functions as described herein. As a result, the examples of the present disclosure improve the operation and functioning of the computer to perform a transaction or a recovery operation, as disclosed herein.
[0052] As depicted in FIG. 6, the computer 600 comprises a hardware processor element 602, e.g., a central processing unit (CPU), a microprocessor, or a multi-core processor, a memory 604, e.g., random access memory (RAM), NVRAM, and/or read only memory (ROM), a module 605 for performing a transaction or a recovery operation, and various input/output devices 606, e.g., storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive, a receiver, a transmitter, a speaker, a display, a speech synthesizer, an output port, an input port and a user input device, such as a keyboard, a keypad, a mouse, a microphone, and the like. Although only one processor element is shown, it should be noted that the computer may employ a plurality of processor elements. Furthermore, although only one computer is shown in the figure, if the method(s) as discussed above is implemented in a distributed or parallel manner for a particular illustrative example, i.e., the blocks of the above method(s) or the entire method(s) are implemented across multiple or parallel computers, then the computer of this figure is intended to represent each of those multiple computers. Furthermore, one or more hardware processors can be utilized in supporting a virtualized or shared computing environment. The virtualized computing environment may support one or more virtual machines representing computers, servers, or other computing devices. Within such virtual machines, hardware components such as hardware processors and computer-readable storage devices may be virtualized or logically represented.
[0053] It should be noted that the present disclosure can be implemented by machine readable instructions and/or in a combination of machine readable instructions and hardware, e.g., using application specific integrated circuits (ASIC), a programmable logic array (PLA), including a field-programmable gate array (FPGA), or a state machine deployed on a hardware device, a general purpose computer or any other hardware equivalents, e.g., computer readable instructions pertaining to the method(s) discussed above can be used to configure a hardware processor to perform the blocks, functions and/or operations of the above disclosed methods.
[0054] In one example, instructions and data for the present module or process 605 for performing a transaction or a recovery operation, e.g., machine readable instructions can be loaded into memory 604 and executed by hardware processor element 602 to implement the blocks, functions or operations as discussed above in connection with the exemplary methods 400 and 500. For instance, the module 605 may include one or more programming code components, including a database updating component 608, e.g., a transaction manager performing the various functions as discussed above. These programming code components may be included on one or more of the processing nodes of a computing system, such as system 100.
[0055] Furthermore, when a hardware processor executes instructions to perform "operations", this could include the hardware processor performing the operations directly and/or facilitating, directing, or cooperating with another hardware device or component, e.g., a co-processor and the like, to perform the operations.
[0056] The processor executing the machine readable instructions relating to the above described method(s) can be perceived as a programmed processor or a specialized processor. As such, the present module 605 for performing a transaction or a recovery operation, including associated data structures, of the present disclosure can be stored on a tangible or physical (broadly non-transitory) computer-readable storage device or medium, e.g., volatile memory, non-volatile memory, ROM memory, RAM memory, magnetic or optical drive, device or diskette and the like. More specifically, the computer-readable storage device may comprise any physical devices that provide the ability to store information such as data and/or instructions to be accessed by a processor or a computing device such as a computer or an application server.
[0057] FIG. 7 illustrates a block diagram of an example system for performing a transaction. System 700 may include at least one computing device that is capable of communicating with at least one remote system. System 700 may be similar to system 100 of FIG. 1 or system 600 of FIG. 6, for example. In the example of FIG. 7, system 700 includes a processor 710 and a machine-readable storage medium 720. Although the following descriptions refer to a single processor and a single machine-readable storage medium, the descriptions may also apply to a system with multiple processors and multiple machine-readable storage mediums. In such examples, the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and may be distributed across (e.g., executed by) multiple processors.
[0058] Processor 710 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 720. In the particular example shown in FIG. 7, processor 710 may fetch, decode, and execute instructions 722, 724, and 726 to perform a transaction. As an alternative or in addition to retrieving and executing instructions, processor 710 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in machine-readable storage medium 720. With respect to the executable instruction representations (e.g., boxes) described and shown herein, it should be understood that part or all of the executable instructions and/or electronic circuits included within one box may, in alternate examples, be included in a different box shown in the figures or in a different box not shown.

[0059] Machine-readable storage medium 720 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 720 may be, for example, Random Access Memory (RAM) or NVRAM, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Machine-readable storage medium 720 may be disposed within system 700, as shown in FIG. 7. In this situation, the executable instructions may be "installed" on the system 700. Alternatively, machine-readable storage medium 720 may be a portable, external or remote storage medium, for example, that allows system 700 to download the instructions from the portable/external/remote storage medium. In this situation, the executable instructions may be part of an "installation package." As described herein, machine-readable storage medium 720 may be encoded with executable instructions for performing a transaction or a recovery operation.
[0060] Referring to FIG. 7, starting a transaction instructions 722, when executed by a processor (e.g., 710), may cause system 700 to start a transaction. Writing a log record instructions 724, when executed by a processor (e.g., 710), may cause system 700 to write a log record associated with the transaction to a non-volatile random access memory, wherein the non-volatile random access memory is byte addressable. Committing instructions 726, when executed by a processor (e.g., 710), may cause system 700 to commit the transaction.
[0061] It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.

Claims

What is claimed is:
1. A method, comprising:
starting, by a processor, a transaction;
writing, by the processor, a log record associated with the transaction to a non-volatile random access memory, wherein the non-volatile random access memory is byte addressable; and
committing, by the processor, the transaction.
2. The method of claim 1, wherein the log record comprises an undo log record.
3. The method of claim 2, wherein the undo log record is partitioned based on a transaction identification as a linked list.
4. The method of claim 3, wherein the undo log record is not flushed to a persistent storage.
5. The method of claim 3, wherein the transaction identification is maintained in a hash bucket header.
6. The method of claim 1, wherein the log record comprises a redo log record.
7. The method of claim 6, wherein the redo log record is partitioned based on a page identification as a linked list.
8. The method of claim 7, wherein the redo log record is not flushed to a persistent storage.
9. The method of claim 7, wherein the page identification is maintained in a hash bucket header.
10. A method, comprising:
starting, by a processor, a recovery operation;
applying, by the processor, a plurality of threads to access a plurality of log records simultaneously, wherein the plurality of log records is stored on a non-volatile random access memory, wherein the non-volatile random access memory is byte addressable; and
performing, by the processor, the recovery operation.
11. The method of claim 10, wherein the plurality of log records comprises a plurality of redo log records.
12. The method of claim 11, wherein the plurality of redo log records is partitioned based on a page identification as a linked list.
13. The method of claim 10, wherein the plurality of log records comprises a plurality of undo log records, wherein the plurality of undo log records is partitioned based on a transaction identification as a linked list.
14. A non-transitory machine-readable storage medium storing instructions executable by a processor, the machine-readable storage medium comprising:
instructions to start a transaction;
instructions to write a log record associated with the transaction to a nonvolatile random access memory, wherein the non-volatile random access memory is byte addressable; and
instructions to commit the transaction.
15. The non-transitory machine-readable storage medium of claim 14, wherein the log record comprises an undo log record, and wherein the undo log record is partitioned based on a transaction identification as a linked list.
PCT/US2015/039771 2015-01-30 2015-07-09 Byte addressable non-volatile random access memory for storing log record WO2016122710A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN0465/CHE/2015 2015-01-30
IN465CH2015 2015-01-30

Publications (1)

Publication Number Publication Date
WO2016122710A1 true WO2016122710A1 (en) 2016-08-04

Family

ID=56544113

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/039771 WO2016122710A1 (en) 2015-01-30 2015-07-09 Byte addressable non-volatile random access memory for storing log record

Country Status (1)

Country Link
WO (1) WO2016122710A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030061537A1 (en) * 2001-07-16 2003-03-27 Cha Sang K. Parallelized redo-only logging and recovery for highly available main memory database systems
US20060206538A1 (en) * 2005-03-09 2006-09-14 Veazey Judson E System for performing log writes in a database management system
US20100082529A1 (en) * 2008-05-08 2010-04-01 Riverbed Technology, Inc. Log Structured Content Addressable Deduplicating Storage
US20150019792A1 (en) * 2012-01-23 2015-01-15 The Regents Of The University Of California System and method for implementing transactions using storage device support for atomic updates and flexible interface for managing data logging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GOETZ GRAEFE ET AL.: "Database software for non-volatile byte-addressable memory", NON-VOLATILE MEMORIES WORKSHOP *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226877B2 (en) * 2015-09-30 2022-01-18 EMC IP Holding Company LLC Hybrid NVRAM logging in filesystem namespace
US11797397B2 (en) 2015-09-30 2023-10-24 EMC IP Holding Company LLC Hybrid NVRAM logging in filesystem namespace
US20220050755A1 (en) * 2015-09-30 2022-02-17 EMC IP Holding Company Hybrid nvram logging in filesystem namespace
CN111480149B (en) * 2017-12-15 2023-09-08 微软技术许可有限责任公司 Pre-written logging in persistent memory devices
CN111480149A (en) * 2017-12-15 2020-07-31 微软技术许可有限责任公司 Pre-written logging in persistent memory devices
US10997153B2 (en) 2018-04-20 2021-05-04 Hewlett Packard Enterprise Development Lp Transaction encoding and transaction persistence according to type of persistent storage
US11243703B2 (en) 2018-04-27 2022-02-08 Hewlett Packard Enterprise Development Lp Expandable index with pages to store object records
US11163625B2 (en) 2018-08-21 2021-11-02 Red Hat, Inc. Optimizing logging of decision outcomes in distributed transactions
US11720429B2 (en) 2018-08-21 2023-08-08 Red Hat, Inc. Optimizing logging of decision outcomes in distributed transactions
CN112416654A (en) * 2020-11-26 2021-02-26 上海达梦数据库有限公司 Database log replay method, device, equipment and storage medium
CN112416654B (en) * 2020-11-26 2024-04-09 上海达梦数据库有限公司 Database log replay method, device, equipment and storage medium
WO2022161170A1 (en) * 2021-01-29 2022-08-04 International Business Machines Corporation Database log writing based on log pipeline contention
US11797522B2 (en) 2021-01-29 2023-10-24 International Business Machines Corporation Database log writing based on log pipeline contention
GB2617999A (en) * 2021-01-29 2023-10-25 Ibm Database log writing based on log pipeline contention

Similar Documents

Publication Publication Date Title
WO2016122710A1 (en) Byte addressable non-volatile random access memory for storing log record
US10360149B2 (en) Data structure store in persistent memory
US11556396B2 (en) Structure linked native query database management system and methods
EP3827347B1 (en) Constant time database recovery
US11132350B2 (en) Replicable differential store data structure
CN103092903B (en) Database Log Parallelization
EP2572296B1 (en) Hybrid oltp and olap high performance database system
US6981004B2 (en) Method and mechanism for implementing in-memory transaction logging records
US10430298B2 (en) Versatile in-memory database recovery using logical log records
US6976022B2 (en) Method and mechanism for batch processing transaction logging records
US9069704B2 (en) Database log replay parallelization
US20200050692A1 (en) Consistent read queries from a secondary compute node
CN105159818A (en) Log recovery method in memory data management and log recovery simulation system in memory data management
KR20160023871A (en) Latch-free, log-structured storage for multiple access methods
CN103198088B (en) Log segment directory based on Shadow paging
US9652491B2 (en) Out-of-order execution of strictly-ordered transactional workloads
US10185630B2 (en) Failure recovery in shared storage operations
CN116529726A (en) Method, device and medium for synchronizing data among cloud database nodes
Saxena et al. Hathi: durable transactions for memory using flash
US10482013B2 (en) Eliding memory page writes upon eviction after page modification
WO2019008715A1 (en) Data loading program, data loading method, and data loading device
US20200125457A1 (en) Using non-volatile memory to improve the availability of an in-memory database
Son et al. Design and implementation of an efficient flushing scheme for cloud key-value storage
Manamohan et al. BaSE (Byte addressable Storage Engine) Transaction Manager.
Sul et al. montage: NVM-based scalable synchronization framework for crash-consistent file systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15880597

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15880597

Country of ref document: EP

Kind code of ref document: A1