WO2021061173A1

WO2021061173A1 - Data integrity validation on lsm tree snapshots

Info

Publication number: WO2021061173A1
Application number: PCT/US2019/064484
Authority: WO
Inventors: Valentin KUZNETSOV; Ou JIAXIN; Li Yi
Original assignee: Futurewei Technologies, Inc.
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2021-04-01

Abstract

A method of validating data integrity of a distributed key value database includes taking snapshots of static sorted table (SST) files and write ahead logs (WALs) of a log structure merge tree of the database, calculating a record set checksum of the snapshot, and storing the snapshot and metadata (including checksums) for each partition in a separate database. The snapshot is incrementally updated by backing up each partitions newly created WAL files and periodic check-pointing of a partitions state by backing up the partitions SST files and updating partition snapshot data to align with a current partition LSM manifest state. A record set checksum is added to WALs and SSTs to reflect the stream of partition state mutations (puts and deletes). Before applying a new partition checkpoint, the record set checksums of the checkpoint snapshot are checked against the record set checksum of the current snapshot to identify data losses.

Description

DATA INTEGRITY VALIDATION ON LSM TREE SNAPSHOTS

TECHNICAL FIELD

[0001] This application is related to data integrity validation for databases and, in particular, to systems and methods for validating data integrity of databases that use log structured merge trees as indexing data structures by using checksums to detect data losses during database state updates.

BACKGROUND

[0002] Log Structured Merge (LSM) trees are widely used as an indexing data structures for write-heavy distributed key value (KV)-stores like Google BigTable, HBase, LevelDB, SQLite4, Tarantool, RocksDB, WiredTiger, Apache Cassandra, InfluxDB, ScyllaDB, VictoriaMetrics, X-Engine, or Huawei OBS Indexing Layer. LSM trees are data structures that provide indexed access to files with high insert volume, such as transactional log data. The LSM trees maintain KV pairs and maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium. Data is synchronized between the two or more separate structures in batches. For example, in a two-level LSM tree, new records are inserted into a first memory resident component. If the insertion causes the first memory resident component to exceed a certain size threshold, a contiguous segment of entries is removed and merged into the second memory component on a disk, for example. The data is efficiently migrated between the storage media in rolling batches using an algorithm similar to a merge sort.

[0003] Many LSM trees employ multiple levels, where level 0 may be represented using a tree. The on-disk data may be organized into sorted runs of data where each run contains data sorted by an index key. A run may be represented on disk as a single file or as a collection of files with non-overlapping key ranges. The level 0 tree and each run are searched to perform a query on a particular key to get its associated value.

[0004] Records inserted into an LSM tree are first appended into a Write Ahead Log (WAL) and cached in memory. Changes to the records are first recorded in the WAL in a time sequence before they are applied and then written to cache before the changes are written to the database. When enough records are accumulated in the cache, the records are flushed into a Static Sorted Table (SST) file that sorts files using the key. After that, the WAL is sealed and later truncated. Since several versions of the same record may coexist in different SST files, a periodic compaction process is required to remove stale versions by merging and rewriting SSTs. Many distributed KV stores also support repartitioning, which is implemented through splits and merges of the LSM tree structure. The main change of this structure also often happens during the compaction phase. This makes the compaction process crucial for high performance and availability of the KV store. In turn, this results in developers bringing many complicated ad hoc optimizations into the compaction process that make it tremendously hard to verify data for all possible scenarios.

[0005] Data persisted in SST files and WAL files is protected from loss or corruption by block and record level cyclic redundancy checks (CRCs), respectively, that detect accidental changes to raw data. However, there is no protection from a loss of a set of records that may be caused by software bugs or memory bit-flips taking place during memory table flush, background compaction or LSM tree split/merge processes. Due to the high level of complexity and frequency of these processes, there may be a significant probability of such events happening.

[0006] To make data recovery possible, database archiving may be implemented. Archiving is a universal solution for protecting database service from data loss occurring on an underlying file storage level, software -bug-caused data losses or just customer mistakes. Archiving may be done by periodic migration of WALs to backup storage. Since WALs are append-only and append logic and code are simple and practically never get changed, WALs may be considered as an ultimate source of truth about the database state. However, restoring data from a set of WALs may be computationally expensive. To speed up the recovery process, a backup may include a full database snapshot (set of SST files and WAL files), which comprises a snapshot checkpoint - a set of SST files corresponding to the database state at some point in time, and WAL files containing all updates (mutations) happening after that checkpoint. Thus, the archiving process is reduced to backing-up newly created WALs and, when the number of WALs reaches some threshold, backing up database recent SST files and installing them as a latest checkpoint while truncating WALs storing changes fully reflected in the checkpoint. However, to make sure that the checkpoint is correct (no mutations logged in WALs are lost, as a result of erroneous compaction or repartitioning), a snapshot integrity verification mechanism is needed.

SUMMARY

[0007] Various examples are now described to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. The Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0008] In sample embodiments, a snapshot integrity verification method is provided that validates the data integrity of a distributed key value database by taking a snapshot of a static sorted table (SST) files and write ahead logs (WALs) of a log structure merge tree of the database, calculating a record set checksum of the snapshot, and storing the snapshot and metadata (including checksums) for each partition in a separate database. The snapshot is incrementally updated by backing up each partition’s newly created WAL files and periodic check-pointing of a partition’s state by backing up the partition’s SST files and updating partition snapshot data to align with a current partition LSM manifest state. A record set checksum is added to WALs and SSTs to reflect the stream of partition state mutations (puts and deletes). Before applying a new partition checkpoint, the record set checksums of the checkpoint snapshot are checked against the record set checksum of the current snapshot to identify data losses.

[0009] According to a first aspect of the present disclosure, there is provided a method of validating data integrity of a distributed key value database. The method includes taking a partition snapshot of the distributed key value database. The partition snapshot comprises a static sorted table (SST) and a write ahead log (WAL) of a log structure merge (LSM) tree of the distributed key value database. A record set checksum of the partition snapshot is calculated. A record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal.

The snapshot files including SST files and WAL files of the partition snapshot are stored in a snapshot storage and a file fist and record set checksum are stored in a snapshot metadata database. The partition snapshot is updated incrementally by appending newly created WAL files. Snapshot checkpoints are periodically generated by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of a LSM manifest of a partition represented by the partition snapshot. The record set checksum is added to the WAL files and the SST files of the partition snapshot. Before installing a new snapshot checkpoint, the record set checksums of the new snapshot checkpoint are checked against the record set checksum of a previous snapshot checkpoint to identify data losses. When data losses are identified, an alert is generated and installation of the new snapshot checkpoint is aborted. Otherwise, the new snapshot checkpoint including a current SST and WAL of the LSM tree of the distributed key value database is installed.

[0010] According to a second aspect of the present disclosure, there is provided a system comprising a distributed key value database, at least one processor, and a memory that stores instructions that when executed by the at least one processor validates data integrity of the distributed key value database. The operations for validating data integrity include taking a partition snapshot of the distributed key value database, the partition snapshot comprising a static sorted table (SST) and a write ahead log (WAL) of a log structure merge tree of the distributed key value database; calculating a record set checksum of the partition snapshot, wherein a record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal; storing snapshot files including SST files and WAL files of the partition snapshot in a snapshot storage and storing a file fist and record set checksum in a snapshot metadata database; updating the partition snapshot incrementally by appending newly created WAL files; periodically generating snapshot checkpoints by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of a LSM manifest of a partition represented by the partition snapshot; adding the record set checksum to the WAL files and the SST files of the partition snapshot; and before installing a new snapshot checkpoint, checking the record set checksums of the new snapshot checkpoint against the record set checksum of a previous snapshot checkpoint to identify data losses.

[0011] According to a third aspect of the present disclosure, there is provided a non-transitory computer readable storage medium comprising instructions that when executed by at least one processor validate data integrity of a distributed key value database. The operations for validating data integrity include taking a partition snapshot of the distributed key value database, the partition snapshot comprising a static sorted table (SST) and a write ahead log (W AL) of a log structure merge tree of the distributed key value database; calculating a record set checksum of the partition snapshot, wherein a record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal; storing snapshot files including SST files and WAL files of the partition snapshot in a snapshot storage and storing a file list and record set checksum in a snapshot metadata database; updating the partition snapshot incrementally by appending newly created WAL files; periodically generating snapshot checkpoints by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of a LSM manifest of a partition represented by the partition snapshot; adding the record set checksum to the WAL files and the SST files of the partition snapshot; and before installing a new snapshot checkpoint, checking the record set checksums of the new snapshot checkpoint against the record set checksum of a previous snapshot checkpoint to identify data losses. [0012] In a first implementation of any of the preceding aspects, the record set checksum comprises a hash function that satisfies the properties of associativity and commutativity.

[0013] In a second implementation of any of the preceding aspects, the record set checksum comprises a 128-bit checksum wherein a first 64 bits are calculated by applying a 64-bit version of an RFC 1071 checksum function to checksums of records of the partition snapshot, and a second 64 bits comprise a count of elements in a record set of the partition snapshot taken by modulo (2²⁴-l), a count of all turned on bits of a cyclical redundancy check (CRC) of all elements in the record set by modulo 255, and a l’s complement addition of an exclusive OR operation of bit parts of CRCs of the record set.

[0014] In a third implementation of any of the preceding aspects, a key value checksum is generated for each key value pair of the distributed key value database as each key value record structure is formed, the key value checksum is added to each key value pair, the key value checksums are checked before a record containing the key value pairs is written to the WAL or the SST, and when the key value checksums are correct, WAL checksums (WPCS/WDCS) are updated for all insertion records and all deletion records in the WAL up to a current record and a result of the insertions or deletions and the updated WAL checksums for all insertion records and all deletion records are written into the WAL.

[0015] In a fourth implementation of any of the preceding aspects, SST checksums are updated for all deletions in the SST or a memory table of the distributed key value database (ADCS) and the SST checksums are updated for all insertions and deletions in the SST or the memory table that have been replaced with a newer version (RMCS).

[0016] In a fifth implementation of any of the preceding aspects, the WAL checksums are checked for all deletions in the SST or the memory table of the distributed key value database (ADCS) and the SST checksums for all insertions and deletions in the SST or the memory table that have been replaced with RMCS and an alert is generated in the event the WAL checksums or SST checksums are incorrect.

[0017] In a sixth implementation of any of the preceding aspects, when the WAL checksums and SST checksums are correct, the memory table is updated with any insertions or deletions in the SST or the memory table of the distributed key value database represented by the WAL checksums or the SST checksums. [0018] In a seventh implementation of any of the preceding aspects, the memory table is flushed into the SST by calculating a first checksum (APCS) for all insertion records in the SST or memory table and a second checksum (AMCS) for all insertion and deletion records in the SST or memory table, checking that the WAL checksums (WACS+WDCS) equal a checksum of the first and second checksums and the SST checksums (APCS+ADCS+RMCS), saving the memory table, the second checksum and RMCS into the SST when the WAL checksums equal the checksum of the first and second checksums and the SST checksums, and updating the LSM manifest state of the partition represented by the partition snapshot to reflect the saved memory table.

[0019] In an eighth implementation of any of the preceding aspects, during a compaction of the SST, the second checksum is updated, the SST checksums are updated for all insertions and deletions in the SST or the memory table that have been replaced with RMCS, and checksums are updated for all insertions or deletions that have become invalid after a split operation (IMCS). [0020] In a ninth implementation of any of the preceding aspects, a mutation checksum (TSMCS) is calculated across a set of SST files input for the compaction. The mutation checksum comprises a sum of the second checksum (AMCS), RMCS, and IMCS for at least some of the SST files input for compaction.

[0021] In a tenth implementation of any of the preceding aspects, after compaction, result files and block level CRCs of the result files resulting from the compaction are written to the snapshot storage, a variation of record set checksums is checked between the set of SST files input to the compaction and record set checksums of the result files of the compaction, and when the checksums of the set of SST files input to the compaction and the checksums of the result files of the compaction are not equal, an indication is provided that the compaction has failed.

[0022] In an eleventh implementation of any of the preceding aspects, when the record set checksums of the set of SST files input to the compaction and the record set checksums of the result files of the compaction are equal, the LSM manifest state of the partition represented by the partition snapshot is updated to reflect results of the compaction.

[0023] In a twelfth implementation of any of the preceding aspects, after a split of the database into further partitions, all updates as a result of the split of the database are removed that do not belong to the partition and checksums of the removed updates are added to determine checksums for all insertions or deletions that have become invalid after the split of the database. All updates that get replaced with newer versions and deletion of tombstones add up to the SST checksums for RMCS.

[0024] In a thirteenth implementation of any of the preceding aspects, when deleting a second record of the distributed key value database, an additional parameter is added to a “delete” application programming interface (API) call that contains a checksum of the second record to be deleted, the second record to be deleted is read before the second record is deleted, the second record to be deleted is validated, a second record checksum of the second record to be deleted is generated, a delete request is issued with the second record checksum of the second record to be deleted and a sequence number of the second record to be deleted, the second record checksum of the second record to be deleted and the sequence number of the second record to be deleted are added to a delete tombstone of the log structure merge tree, when the delete tombstone of the log structure merge tree collapses during compaction of the second record to be deleted is verified, and when a mismatch in the second record checksum of the second record to be deleted is found or when the second record to be deleted is not found, the alert is generated and deletion of the second record is terminated.

[0025] In a fourteenth implementation of any of the preceding aspects, snapshots for each partition of the distributed key value database are maintained in the snapshot storage and a WAL file is sealed. When the WAL frle is sealed, the sealed WAL frle is removed from the LSM manifest of the partition represented by the partition snapshot and metadata of the sealed WAL file is added to the partition snapshot and stored in the snapshot storage.

[0026] In a fifteenth implementation of any of the preceding aspects, an archived LSM manifest of an archived partition of the distributed key value database is compared with a current LSM manifest of the partition represented by the partition snapshot, SST files are copied that are only present in the current LSM manifest of the partition represented by the partition snapshot to the snapshot storage, and SST files are marked for deletion that are only present in the archived LSM manifest of the archived partition of the distributed key value database. [0027] In a sixteenth implementation of any of the preceding aspects, integrity of the archived LSM manifest of the archived partition is checked by checking that a total checksum for the archived partition, including checksums in the SST files and WAL files, are invariant across snapshots. When the total checksum is not invariant across snapshots, an alert is generated and a current operation is aborted. When the total checksum is invariant across snapshots, a new archived snapshot checkpoint is created.

[0028] The method may be performed and the instructions on the computer readable media may be processed by the apparatus, and further features of the method and instructions on the computer readable media result from the functionality of the apparatus. Also, the explanations provided for each aspect and its implementation apply equally to the other aspects and the corresponding implementations. The different embodiments may be implemented in hardware, software, or any combination thereof. Also, any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment within the scope of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS [0029] In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.

[0030] FIG. 1 illustrates a sample snapshot update of database partitions in a sample embodiment.

[0031] FIG. 2 illustrates migration of a snapshot of log structured merge (LSM) tree manifests (set of static sorted tables (SSTs) and write ahead logs (WALs)) that are migrated to backup storage and snapshots (and checksums) that are archived in a snapshots database.

[0032] FIG. 3 illustrates backing up of WALs and a snapshot as a result of data updates (mutations).

[0033] FIG. 4 illustrates a diagram of an LSM tree manifest showing what fields have been added to standard LSM data structures (WALs and SSTs) to implement the checksum snapshot validation in sample embodiments.

[0034] FIG. 5 illustrates a flowchart for using checksums to monitor data loss during record mutations (e.g., insertion, deletion, and update (replacement)) in a sample embodiment.

[0035] FIG. 6 illustrates a flowchart of a memory table flush using checksums to monitor data loss in a sample embodiment.

[0036] FIG. 7 illustrates a flowchart of compaction of SSTs using checksums to monitor data loss in a sample embodiment.

[0037] FIG. 8 illustrates a flowchart of a WAL backup process that maintains an archived snapshot for each partition in a sample embodiment.

[0038] FIG. 9 illustrates a flow chart of a checkpoint process where with some periodicity the background process compares an archived partition LSM manifest with a current partition LSM manifest in a sample embodiment.

[0039] FIG. 10 illustrates a method of validating data integrity of a distributed key value database in a sample embodiment.

[0040] FIG. 11 is a block diagram illustrating circuitry for performing the methods according to sample embodiments. DETAILED DESCRIPTION

[0041] It should be understood at the outset that although an illustrative implementation of one or more embodiments are provided below, the disclosed systems and/or methods described with respect to FIGS. 1-11 may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.

[0042] The functions or algorithms described herein may be implemented in software in one embodiment. The software may include computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware- based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.

[0043] Prior art methods of data integrity checks for databases rely on maintaining several active replicas of database partitions and comparing them with each other. Since direct record-to-record comparison is very expensive, typically a Merkle tree is maintained and used for replica comparison. However, this approach has several problems. First, databases built on a compute-and-storage-separation model (Hbase, Google Big Table, Huawei OBS Indexing Layer, etc.), do not perform replication on the record level. Instead, they rely on the underlying blob storage system (HDFS, GFS or Huawei Plog). This blob storage does replication at the data blocks level, while getting updates from a single partition instance. Maintaining a separate partition replica under such models is very expensive. Second, replica comparison does not give reliable protection from bugs in data restructuring tasks (like splits or compactions) that may happen simultaneously on all replicas. Third, under heavy- write scenarios, maintaining and comparing Merkle trees between replicas becomes inefficient and problematic. Fourth, since comparison happens in on online system, this may impact user traffic.

[0044] The main problem addressed by the methods described herein is how to check that the current snapshot checkpoint is not affected by some data losses introduced by the compaction or split/merge processes. The methods check that all changes (data mutations) contained in write ahead log (WAL) files of a previous snapshot are reflected in a new snapshot and that no existing records are lost. Database snapshots and mutation flow checksums are constructed and applied to underlying data to allow detection of data losses during database state updates (mutations). No separate online replica is required; snapshots may be stored in backup storage; background compaction bugs related data loss may be detected; and the integrity checks do not intervene with user traffic.

[0045] FIG. 1 illustrates a sample snapshot update of database partitions in a sample embodiment. As illustrated, partitions 1-4 are present in a sample snapshot 100. After the sample snapshot 100, several database updates 110 cause the partitions 1-4 to be merged/compacted to form updated partitions 120 including partitions 1 and 5-7. In this example, the database updates 110 cause partitions 2-4 to be merged and split to form partitions 5-7. In the example illustrated, the WAL files are flushed out of all of the partitions except for partition 6. In accordance with the techniques described herein, data integrity check 130 is provided before the updated partitions 120 may replace the sample snapshot 100 in the database. The data integrity check 130 includes checking record set checksums between the sample snapshot 100 and updated partitions 120 before committing the updated partitions 120 to the database as a new checkpoint.

[0046] The data integrity check 130 verifies that two log structured merge (LSM) tree snapshots (set of SST files and WAL files) correspond to the same database state and allows detection of a location of data loss or corruption. As will be explained in more detail below, the method uses a special record set checksum for efficient snapshot comparison that supports database repartitioning, which is done either by a database partition split or merge.

[0047] FIG. 2 illustrates migration of a snapshot of log structured merge (LSM) tree manifests (a set of static sorted tables (SSTs) and write ahead logs (WALs) of respective partitions at a particular point in time) 200 that represent snapshots of respective partitions 210 of a database 220. As illustrated, the LSM tree manifests 200 are migrated to backup storage 230. As also illustrated, the snapshots (and checksums) for particular partitions are also archived in a snapshots database 240. The snapshots database 240 may store the snapshots with their associated checksums at 250, the SST files and WAL files with their associated checksums at 260, and aggregated checksums at 270.

[0048] FIG. 3 illustrates backing up of WALs and a snapshot of a database partition 300 as a result of data updates (mutations). As illustrated, during an update, the WALs 310 are flushed to the SSTs 320 of the LSM tree of the database 220. As described above with respect to FIG. 1, the SSTs 320 of the partition 300 may be compacted and/or split as a result of the database updates 110. The updated WALs 310 are backed up to the backup storage 230 for each new WAL 330. Also, the resulting snapshot after any periodic compaction and verification of the SSTs is also stored as updated SSTs 340 for the current checkpoint. Thus, the snapshot files including the SSTs 340 and WALs 330 of the snapshot of partition 300 are stored in backup storage 230, which functions as a snapshot storage. Also, an SST and WAL file list and record set checksum are stored in a global snapshots metadata database 350 for retrieval for checksum generation and comparisons. [0049] Thus, in sample embodiments, an archived database partition snapshot is presented as LSM tree manifests 200, and the SST files WAL files are migrated to backup storage. A snapshot is updated incrementally by appending newly created WALs and also installing a snapshot checkpoint during periodic checkpointing processes (in order to speed up recovery). In sample embodiments, special record set checksums are added to each WAL file and SST file to reflect the mutation flow. [0050] FIG. 4 illustrates a diagram of an LSM tree manifest 400 showing what fields have been added to standard LSM data structures (WALs and SSTs) to implement the checksum snapshot validation in sample embodiments. The LSM tree manifest 400 shows the set of WAL files and SST files of the respective partitions at a particular point in time. As illustrated, record set checksums 410 for record blocks 420 are added to the WAL files 430 and SST files 440 to enable the database system to check with high probability that no updates to the database 220 were lost while an old database snapshot is replaced with a new one. The checksums may be verified during compaction and a memory table flush to also help in finding SST population related bugs. In sample embodiments, the last WAL entry (N) before a put of the key value pair to the database becomes part of the checksum snapshot 450 for the data that is flushed to the SST files 440.

[0051] Also, as indicated in FIG. 4, when the WAL file entry indicates that data is to be deleted, a cyclic redundancy check (CRC) code (OldValueCRC) 460 may also be inserted into a “delete” application programming interface (API) call to enable validation of data deleted since a previous snapshot. It will be appreciated by those skilled in the art that CRCs generally require an exact order for the data verification while checksums, such as the record set checksums used herein, may be used to verify the data even when the processing sequences vary, as when the data is sorted in a different order.

[0052] In sample embodiments, checksums are updated during insertion, deletion, and compaction of database files. Checksums are checked after compaction and before the update of snapshots. When a checksum mismatch is found, an alert is generated and an operation is aborted. Instead of checking integrity of database snapshot states, the system described herein periodically compares the checksums to validate mutation (change) flows that reflect the snapshots. This allows efficient utilization of the incremental nature of LSM-trees and avoids expensive comparisons required to identify lost record location and to restore them from a log of updates.

[0053] As will be apparent from the description of FIGS. 5-9 below, key features of the data integrity verification process include a method of incremental backup of LSM trees in the presence of repartitioning, a model that verifies correctness of background data restructuring tasks based on a set of invariant checks, and a cost-effective method of continuous data integrity validation for snapshots of LSM trees based on checking mutation flow checksums.

[0054] FIG. 5 illustrates a flowchart 500 for using checksums to monitor data loss during record mutations (e.g., insertion, deletion, and update (replacement)) in a sample embodiment. In sample embodiments, to protect data during transfer between system layers, a CRC code such as CRC-64 is added to each KV pair of the database. The CRC is generated on the client at the same moment the KV record structure is formed. The CRCs are checked at operation 510 before the record is written to the WAL file or the SST file (and flushed or compacted) and after reading from disk. In the case of a CRC mismatch at operation 510, an alert is generated and the corresponding operation that caused the record mutation is rejected at operation 520. Supported record mutations include insertion, deletion, and update (replacement).

[0055] In the case of a deletion, an additional parameter (e.g., OldValueCRC 460 in FIG. 4) may be added to the “Delete” API call which contains the CRC of a deleted record that identifies the deleted record, before record removal, and informs the client that it needs to read the current record, validate it, generate a CRC, and issue a delete request with the CRC and a sequence number of the deleted record attached. This CRC and sequence number are added to an LSM delete tombstone and verified when the tombstone collapses the deleted record during compaction. In case of mismatch or if a delete record is not found, an alert is generated.

[0056] In the embodiments described herein, a record set checksum function H validates that concatenations of non-overlapping sets of KV-pairs (i.e., no common elements) are equal if unions of such sets are equal. This may be any strong enough hash function satisfying the properties of associativity and commutativity. The checksum function RFC 1071 is a potential fit. To decrease collision probability for large sets, a 128-bit combined checksum function may be used. The first 64 bits may be calculated by applying a 64-bit version of the checksum function RFC 1071 to the CRCs of records. The next 24-bits may represent a count of elements in a set taken by modulo (2²⁴-l). The next 8 bits may represent a count of all turned on bits of the CRCs of all elements in a set taken by modulo 255. The remaining 32 bits may be a l's complement addition of the exclusive OR (XOR) of 32-bit parts of record CRCs. Considering that a record set checksum function is commutative and associative, a ‘+’ operation may be defined for two results of applying H to two sets where the result of such operation is equal to the result of H applied to a union of sets. It is easy to show that such an operation may be calculated in 0(1): just use either addition by modulo or l's complement addition for different parts of the 128-bit checksum. [0057] If the CRCs before and after the mutation match at operation 510, the checksums calculated using the record set checksum function H are updated at operation 520. In particular, the WAL put checksum WPCS for all inserted records and the WAL delete checksum WDCS for all deleted records are updated at operation 520. WPCS and WDCS and the mutation are periodically written into the WAL at operation 530 to avoid the necessity of recalculating them for huge amount of records in case of process crash.

[0058] While adding records to the LSM memory table in the memory record cache of the database, two checksums may be calculated at operation 540: all delete checksum ADCS for deleted records and removed checksum RMCS for all records (mutations) that were replaced with newer versions (typically 0 during MemTable population since by default all versions are kept for multi- version concurrency control). If the mutation CRCs including the checksums do not match at operation 550, the user is alerted to a system crash at operation 560. Otherwise, the mutation is inserted into the LSM memory table at operation 570. If the transfer is successful, a success message is provided at operation 580.

[0059] FIG. 6 illustrates a flowchart of a memory table flush 600 using checksums to monitor data loss in a sample embodiment. While transferring the memory table into the SST, the all put checksum APCS for all active records and the all mutation checksum AMCS for all puts and deletes (all mutations) in the SST or the memory table are calculated and updated at operation 610. Before dropping the WAL into the SST, the invariant is checked at operation 620: WPCS+WDCS=APCS+ADCS+RPCS (replace by put checksum), where ‘+’ is the checksum operation H defined above and the checksums are defined in Table I below: TABLE I

[0060] The naming convention for the simple checksums (does not include aggregates) in Table I includes 4 or 5 letters specifying {Record location {{Record type }CS (checksum). The record locations include:

A - SST file active records; W - WAL file;

R - Checksum for records that originally were in the SST but were replaced; and

I - Checksum for records that originally were in the SST but were deleted because they did not below to the partition any more. [0061] On the other hand, the record types include:

P - put (insert), which means record is active;

D - deletion marker; and

M - all possible mutations (deletions and insertions).

These checksums collectively account for the changes that may occur to the WAL files and SST files between snapshots as a result of inserts, deletions, changes, merges, splits, and the like.

[0062] Referring back to FIG. 6, if the checksums are not invariant at operation 620, then the user is alerted to a system crash at operation 630. On the other hand, if the checksums are invariant at operation 620, then the memory table and checksums are saved into the SST for the partition at operation 640 and the LSM tree manifest (set of SSTs and WALs) 400 for the partition is updated at operation 650. A success message is provided at operation 660.

[0063] In sample embodiments, three checksums are stored for each SST file: checksum AMCS for all mutations in the SST or memory table (including deletion tombstones); checksum IMCS for all mutations that become invalid after a split operation, meaning that the record key does not belong to the partition key range anymore; and checksum RMCS of all mutations (puts or deletes) that got replaced with newer versions and the deletion tombstones reached the LSM-tree bottom level (and thus removed). It will be appreciated that after a memory table flush 600 (FIG. 6), the following is true: AMCS=APCS+ADCS and IMCS={0}.

[0064] FIG. 7 illustrates a flowchart of compaction 700 of SST s using checksums to monitor data loss in a sample embodiment. In the case of compaction, the checksums for deleted files need to be accounted for to account for all data since the last checkpoint. The checksums AMCS, RMCS and IMCS are updated during compaction whereby all mutations not belonging to a partition (e.g. after a split) get removed and their CRC adds up to the checksum IMCS, and all mutations getting replaced with newer versions and deletion tombstones that reached the LSM-tree bottom level (and thus getting removed) add up to RMCS. All compaction output SST files except the last one receive RMCS and IMCS values equal to 0. Only the last compaction output file will contain a calculated RMCS and IMCS.

[0065] For compaction, TSMCS is calculated at operation 710 as a sum of AMCS+RMCS+IMCS for some set of input SST files to be compacted. After compaction, the CRCs for input records are verified at operation 720, and the checksums AMCS, RMCS, and IMCS are updated at operation 730. Result files (output SSTs) are written (including their block level CRCs) at operation 740, and the TSMCS is calculated at operation 750 for the output SST files after compaction. The compacted invariant for input TSMCS versus output TSMCS is checked at operation 760. If the input checksum TSMCS differs from the output checksum TSMCS, the compaction fails at operation 770. However, if the input checksum TSMCS matches the output checksum TSMCS, and hence the checksums are equal among input and output SST file sets, the LSM tree manifest 400 for the resulting partitions are updated at operation 780, and a success message is provided at operation 790. [0066] FIG. 8 illustrates a flowchart of a WAL backup process 800 that maintains an archived snapshot for each partition in a sample embodiment. A snapshot is presented as a set of WAL and SST files metadata (including checksums), similar to an LSM-tree manifest. Snapshot metadata (especially, a list of SST and WAL files in the snapshot) for each partition is stored in a global snapshots metadata database 350 as described above with respect to FIG. 3. This data plus the checkpoint data may be used to recover the database state in the event of a failure. Each time a new WAL file is sealed at operation 810, the background task copies it to backup storage 230 and it is registered in the snapshot metadata database 350. Each time an LSM manifest file is rotated, it is not removed, but its metadata is added to the snapshot and the background task migrates it to backup storage 230. As illustrated in FIG. 8, after the WAL is sealed at operation 810, the WAL is removed from the LSM manifest at operation 820 and copy of the WAL is provide to the backup storage 230 and added to a backup manifest registry at operation 830. The CRCs and set checksums for the WAL files are checked at operation 840. If the CRCs and checksums are invariant, the WAL is added to the backup snapshot LSM manifest for each partition at operation 850. A success message is provided at operation 860. However, if the CRCs and checksums are not invariant at operation 840, the WAL backup process 800 is aborted at operation 870.

[0067] FIG. 9 illustrates a flow chart of a checkpoint process 900 where with some periodicity the background process compares an archived partition LSM manifest with a current partition LSM manifest in a sample embodiment. As noted above, the snapshots may be used for fast restoration of data in the event of a failure. If there are some SST files that are only presented in the current LSM manifest, they are copied into backup storage. If there are some files that are only presented in the archived manifest, they are marked for deletion. After all required SST files are copied to backup storage, snapshot manifest integrity validation is started. The total backup WAL size is compared to a threshold at operation 910 to determine if there are too many WAL files such that a new checkpoint snapshot is desirable. If the total backup WAL size does not exceed the threshold, additional WAL files are added for additional mutations until the WAL size exceeds the threshold at operation 910. The difference between snapshots is populated in the WAL files at operation 920, and new SSTs are added to the backup snapshot LSM manifest registry at operation 930. Also, the new SSTs are copied to backup storage 230 at operation 940.

[0068] As noted in Table I above, TWMCS is defined as a sum of WPCS+WDSCS across WAL files in some partition snapshot and TMCS=TSMCS+TWMCS for some snapshot. TMCS checksums for new and old snapshots are calculated at operation 950. Invariants in the new and old TMCS checksums are checked at operation 960. The TMCS must be equal among snapshots. If there is an invariance at operation 960, an alert is generated at operation 970, and the checkpoint process 900 is aborted. The checkpoint process returns to operation 910 to collect new WAL files. Otherwise, a new archived manifest snapshot version gets created at operation 980, replacing previous snapshot metadata. The process then repeats with the new checkpoint.

[0069] In sample embodiments, archived files (WAL, SST, LSM manifests) get removed from backup storage 230 by a garbage collection process if they do not have references from any of the current partition manifests and/or they may be removed based on a retention policy (which depends on the current amount of archived data).

[0070] During a partition split as illustrated in FIG. 1, two partitions are created both sharing all SSTs of the original parent partitions (all WALs are flushed to SST before a split). Each partition serves its own range and, during compaction, keys not belonging to the partition get removed. A sequence number of the last written record before a split is recorded. At the first checkpoint after the split, the following invariant is checked: {TMCS of origin partition snapshot] + {TWMCS across all WALs created after the split] = {new snapshot TMCS}. During such a partition split, all checksums should remain equal to the checksums of the parent partition immediately after the split and will change only after compaction or if new records are added.

[0071] To check the split time LSM tree manifest, during the split a reference to the LSM tree current manifest position is saved. As noted above, the LSM tree manifest 400 is an append-only log of tree file updates (add/delete for SST or WAL). A further whole LSM manifest file is migrated to backup storage 230. Each time a split partitions’ snapshots get checkpointed, a check is made to determine whether there are still some SST files shared between partitions. Since shared files get re-compacted in the background, at some point in time the split partitions will not have such files. The first time this condition is observed, the following invariant is checked: {sum of AMCS of original partition} + 2 x{sums oflMCS of original partition } = { sums of IMCS of both split partitions }.

[0072] On the other hand, during a partition merge as illustrated in FIG. 1, all SST files of the original partitions get merged into a new partition. At the first checkpoint after the merge, the following invariant is checked: { sum of TMCS of original partitions } + { TWMCS across all WALs created after the merge } = {new snapshot TMCS}. After the merge, the TMCS of the new partition is equal to the sum of TMCS of the parent partitions.

[0073] FIG. 10 illustrates a method of validating data integrity of a distributed key value database in a sample embodiment. This figure summarizes the data flow for the techniques described above. As illustrated, the data integrity validation process starts by taking a partition snapshot of the distributed key value database at operation 1000. As noted above, the partition snapshot includes a set including an SST and a WAL of a log structure merge tree of the distributed key value database. A record set checksum of the partition snapshot is calculated at operation 1010. The record set checksum validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal.

Snapshot files including SST files and WAL files of the partition snapshot are stored in a snapshot storage and a file list and record set checksum are stored in a snapshot metadata database at operation 1020.

[0074] The partition snapshot is updated incrementally as a result of mutations by appending newly created WAL files at operation 1030. As the newly created WAL files accumulate, snapshot checkpoints are periodically generated at operation 1040 by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with an LSM manifest state of the partition represented by the partition snapshot. The record set checksum is added to the WAL files and the SST files of the partition snapshot at operation 1050. Before installing a new snapshot checkpoint, the record set checksums of the new snapshot checkpoint are checked at operation 1060 against the record set checksum of a previous snapshot checkpoint to identify data losses. When data losses are identified, an alert is generated at operation 1070 and installation of the new snapshot checkpoint is aborted. Otherwise (no data losses are identified), the new snapshot checkpoint including a current SST and WAL of the log structure merge tree of the distributed key value database is installed at operation 1080.

[0075] It will be appreciated by those skilled in the database arts that relying on a record set checksum allows the database system to avoid costly record-by-record comparison by instead comparing checksums, which is much faster. The snapshot may be stored in the form of SST files and WAL files (instead of a full copy of all records in a latest database snapshot). This reduces read, write, and space amplification needed to maintain the database backup. Correctness invariants embedded into compaction, repartitioning and snapshot update processes allows the system to continuously check for data losses, localize it on the file level, and easily point to sources of valid data to use for recovery.

[0076] FIG. 11 illustrates a general-purpose computer 1100 suitable for implementing one or more embodiments of the methods disclosed herein. For example, the computer 1100 in FIG. 11 may be implemented on background servers that manage the indexed database of the type described herein. The components described above may be implemented on any general-purpose network component, such as a computer 1100 with sufficient processing power, memory resources, and network throughput capability to handle the necessary workload placed upon it. The computer 1100 includes a processor 1110 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 1120, read only memory (ROM) 1130, random access memory (RAM) 1140, input/output (I/O) devices 1150, and network connectivity devices 1160. In sample embodiments, the network connectivity devices 1160 further connect the processor 1110 to a database 1170 of the type described herein. The processor 1110 may be implemented as one or more CPU chips, or may be part of one or more application specific integrated circuits (ASICs).

[0077] The secondary storage 1120 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over- flow data storage device if RAM 1140 is not large enough to hold all working data. Secondary storage 1120 may be used to store programs that are loaded into RAM 1140 when such programs are selected for execution. The ROM 1130 is used to store instructions and perhaps data that are read during program execution. ROM 1130 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of secondary storage 1120. The RAM 1140 is used to store volatile data and perhaps to store instructions. Access to both ROM 1130 and RAM 1140 is typically faster than to secondary storage 1120.

[0078] It should be understood that computer 1100 may execute instructions from computer-readable non-transitory media storing computer readable instructions and one or more processors coupled to the memory, and when executing the computer readable instructions, the computer 1100 is configured to perform method steps and operations described in the disclosure with reference to FIG. 1 to FIG. 10. The computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, flash media and solid state storage media.

[0079] It should be further understood that software including one or more computer-executable instructions that facilitate processing and operations as described above with reference to any one or all of steps of the disclosure may be installed in and sold with one or more servers or databases. Alternatively, the software may be obtained and loaded into one or more servers or one or more databases in a manner consistent with the disclosure, including obtaining the software through physical medium or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software may be stored on a server for distribution over the Internet, for example.

[0080] Also, it will be understood by one skilled in the art that this disclosure is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The embodiments herein are capable of other embodiments, and capable of being practiced or carried out in various ways. Also, it will be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. [0081] The components of the illustrative devices, systems and methods employed in accordance with the illustrated embodiments may be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components also mayn be implemented, for example, as a computer program product such as a computer program, program code or computer instructions tangibly embodied in an information carrier, or in a machine -readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.

[0082] A computer program may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. Also, functional programs, codes, and code segments for accomplishing the systems and methods described herein may be easily construed as within the scope of the disclosure by programmers skilled in the art to which the present disclosure pertains. Method steps associated with the illustrative embodiments may be performed by one or more programmable processors executing a computer program, code or instructions to perform functions (e.g., by operating on input data and generating an output). Method steps may also be performed by, and apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), for example.

[0083] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC, a FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. [0084] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example, semiconductor memory devices, e.g., electrically programmable read-only memory or ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory devices, and data storage disks (e.g., magnetic disks, internal hard disks, or removable disks, magneto-optical disks, CD-ROM disks, or DVD-ROM disks). The processor and the memory may be supplemented by, or incorporated in special purpose logic circuitry.

[0085] Those of skill in the art understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

[0086] Those skilled in the art may further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. A software module may reside in random access memory (RAM), flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A sample storage medium is coupled to the processor such the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. In other words, the processor and the storage medium may reside in an integrated circuit or be implemented as discrete components.

[0087] As used herein, “machine-readable medium” means a device able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Erasable Programmable Read-Only Memory (EEPROM)), and any suitable combination thereof. The term “machine -readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store processor instructions. The term “machine -readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions for execution by one or more processors, such that the instructions, when executed by one or more processors cause the one or more processors to perform any one or more of the methodologies described herein. Accordingly, a “machine -readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” as used herein excludes signals per se.

[0088] Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.

Claims

CLAIMS What is claimed is:

1. A method of validating data integrity of a distributed key value database, comprising: taking a partition snapshot of the distributed key value database, the partition snapshot comprising a static sorted table (SST) and a write ahead log (WAL) of a log structure merge (LSM) tree of the distributed key value database; calculating a record set checksum of the partition snapshot, wherein a record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal; storing snapshot files including SST files and WAL files of the partition snapshot in a snapshot storage and storing a file list and record set checksum in a snapshot metadata database; updating the partition snapshot incrementally by appending newly created WAL files; periodically generating snapshot checkpoints by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of an LSM manifest of a partition represented by the partition snapshot; adding the record set checksum to the WAL files and the SST files of the partition snapshot; and before installing a new snapshot checkpoint, checking the record set checksum of the new snapshot checkpoint against the record set checksum of a previous snapshot checkpoint to identify data losses.

2. The method of claim 1, further comprising, upon data losses being identified, generating an alert and aborting installing the new snapshot checkpoint; otherwise, installing the new snapshot checkpoint including a current SST and WAL of the LSM tree of the distributed key value database.

3. The method of claim 1, wherein the record set checksum comprises a hash function that satisfies the properties of associativity and commutativity.

4. The method of claim 3, wherein the record set checksum comprises a 128- bit checksum wherein a first 64 bits are calculated by applying a 64-bit version of an RFC 1071 checksum function to checksums of records of the partition snapshot, and a second 64 bits comprise a count of elements in a record set of the partition snapshot taken by modulo (2²⁴-l), a count of all turned on bits of a cyclical redundancy check (CRC) of all elements in the record set by modulo 255, and a l’s complement addition of an exclusive OR operation of bit parts of CRCs of the record set.

5. The method of claim 1, further comprising generating a key value checksum for each key value pair of the distributed key value database as each key value record structure is formed, adding the key value checksum to each key value pair, checking the key value checksums before a record containing the key value pairs is written to the WAL or the SST, and based on the key value checksums being correct, updating WAL checksums for all insertion records and all deletion records in the WAL up to a current record and writing a result of the insertions or deletions and the updated WAL checksums for all insertion records and all deletion records into the WAL.

6. The method of claim 5, further comprising updating SST checksums for all deletions in the SST or a memory table of the distributed key value database (ADCS) and updating the SST checksums for all insertions and deletions in the SST or the memory table that have been replaced with a newer version (RMCS).

7. The method of claim 6, further comprising checking the WAL checksums for ADCS and RMCS and generating an alert in the event the WAL checksums or SST checksums are incorrect.

8. The method of claim 7, upon the WAL checksums and SST checksums being correct, further comprising updating the memory table with any insertions or deletions in the SST or the memory table of the distributed key value database represented by the WAL checksums or the SST checksums.

9. The method of claim 8, further comprising flushing the memory table into the SST by calculating a first checksum for all insertion records in the SST or memory table and a second checksum for all insertion and deletion records in the SST or memory table, checking that the WAL checksums equal a checksum of the first and second checksums and the SST checksums, saving the memory table, the second checksum and RMCS into the SST based on the WAL checksums being equal to the checksum of the first and second checksums and the SST checksums, and updating the LSM manifest state of the partition represented by the partition snapshot to reflect the saved memory table.

10. The method of claim 9, further comprising, during a compaction of the SST, updating the second checksum, updating the SST checksums for all insertions and deletions in the SST or the memory table that have been replaced with RMCS, and updating checksums for all insertions or deletions that have become invalid after a split operation (IMCS).

11. The method of claim 10, further comprising calculating a mutation checksum (TSMCS) across a set of SST files input for the compaction, the mutation checksum comprising a sum of the second checksum, RMCS, and IMCS for at least some of the SST fries input for compaction.

12. The method of claim 11, further comprising, after compaction, writing result files and block level CRCs of the result fries resulting from the compaction to the snapshot storage, checking a variation of record set checksums between the set of SST fries input to the compaction and record set checksums of the result files of the compaction, and based on the checksums of the set of SST files input to the compaction and the checksums of the result files of the compaction being not equal, indicating that the compaction has failed.

13. The method of claim 12, wherein based on the record set checksums of the set of SST fries input to the compaction and the record set checksums of the result files of the compaction being equal, updating the LSM manifest state of the partition represented by the partition snapshot to reflect results of the compaction.

14. The method of claim 9, further comprising, after a split of the database into further partitions, removing all updates as a result of the split of the database that do not belong to the partition and adding checksums of the removed updates to determine checksums for all insertions or deletions that have become invalid after the split of the database , wherein all updates that get replaced with newer versions and deletion of tombstones add up to the SST checksums for RMCS.

15. The method of claim 5, further comprising, upon deleting a second record of the distributed key value database, adding an additional parameter to a “delete” application programming interface (API) call that contains a checksum of the second record to be deleted, reading the second record to be deleted before the second record is deleted, validating the second record to be deleted, generating a second record checksum of the second record to be deleted, issuing a delete request with the second record checksum of the second record to be deleted and a sequence number of the second record to be deleted, adding the second record checksum of the second record to be deleted and the sequence number of the second record to be deleted to a delete tombstone of the log structure merge tree, verifying that the delete tombstone of the log structure merge tree has collapsed during compaction of the second record to be deleted, and based on a mismatch in the second record checksum of the second record to be deleted being found or based on the second record to be deleted being not found, generating an alert and terminating deletion of the second record.

16. The method of claim 1 , further comprising maintaining snapshots for each partition of the distributed key value database in the snapshot storage and sealing a WAL file, wherein based on the WAL file being sealed, the sealed WAL file is removed from the LSM manifest of the partition represented by the partition snapshot and metadata of the sealed WAL file is added to the partition snapshot and stored in the snapshot storage.

17. The method of claim 16, further comprising comparing an archived LSM manifest of an archived partition of the distributed key value database with a current LSM manifest of the partition represented by the partition snapshot, copying SST files that are only present in the current LSM manifest of the partition represented by the partition snapshot to the snapshot storage, and marking for deletion SST files that are only present in the archived LSM manifest of the archived partition of the distributed key value database.

18. The method of claim 17, further comprising checking integrity of the archived LSM manifest of the archived partition by checking that a total checksum for the archived partition, including checksums in the SST files and WAL files, are invariant across snapshots; based on the total checksum being not invariant across snapshots, generating an alert and aborting a current operation; and based on the total checksum being invariant across snapshots, creating a new archived snapshot checkpoint.

19. A system comprising: a distributed key value database; at least one processor; and a memory that stores instructions that upon execution by the at least one processor validates data integrity of the distributed key value database by performing operations comprising: taking a partition snapshot of the distributed key value database, the partition snapshot comprising a static sorted table (SST) and a write ahead log (WAL) of a log structure merge tree of the distributed key value database; calculating a record set checksum of the partition snapshot, wherein a record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal; storing snapshot files including SST files and WAL files of the partition snapshot in a snapshot storage and storing a file list and record set checksum in a snapshot metadata database; updating the partition snapshot incrementally by appending newly created WAL files; periodically generating snapshot checkpoints by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of a LSM manifest of a partition represented by the partition snapshot; adding the record set checksum to the WAL files and the SST files of the partition snapshot; and before installing a new snapshot checkpoint, checking the record set checksum of the new snapshot checkpoint against the record set checksum of a previous snapshot checkpoint to identify data losses.

20. A computer readable storage medium comprising instructions that upon execution by at least one processor validate data integrity of a distributed key value database by performing operations comprising: taking a partition snapshot of the distributed key value database, the partition snapshot comprising a static sorted table (SST) and a write ahead log (WAL) of a log structure merge tree of the distributed key value database; calculating a record set checksum of the partition snapshot, wherein a record set checksum function validates that concatenations of sets of key value pairs are equal if unions of the sets of key value pairs are equal; storing snapshot files including SST files and WAL files of the partition snapshot in a snapshot storage and storing a file list and record set checksum in a snapshot metadata database; updating the partition snapshot incrementally by appending newly created WAL files; periodically generating snapshot checkpoints by copying SST files of the partition snapshot to the snapshot storage and updating metadata of the partition snapshot to align the partition snapshot with a state of a LSM manifest of a partition represented by the partition snapshot; adding the record set checksum to the WAL files and the SST files of the partition snapshot; and before installing a new snapshot checkpoint, checking the record set checksum of the new snapshot checkpoint against the record set checksum of a previous snapshot checkpoint to identify data losses.