CN116635849A - System, method, and medium for implementing conflict-free replicated data types in an in-memory data structure

System, method, and medium for implementing conflict-free replicated data types in an in-memory data structure

Info

Publication number
CN116635849A
CN116635849A
Authority
CN
China
Prior art keywords
key
copy
determining
deleted
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180086189.9A
Other languages
Chinese (zh)
Inventor
Yuval Inbar
Yossi Gottlieb
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Letis Co ltd
Original Assignee
Letis Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Letis Co ltd filed Critical Letis Co ltd
Priority claimed from PCT/IL2021/051244 external-priority patent/WO2022085000A1/en
Publication of CN116635849A publication Critical patent/CN116635849A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Mechanisms, including systems, methods, and non-transitory computer-readable media, are provided for implementing a conflict-free replicated data type in an in-memory data structure, the mechanisms including: a memory; and at least one hardware processor coupled to the memory, the at least one hardware processor and the memory being collectively configured to: mark a first key of a conflict-free replicated data type to be deleted; send an update message to a first copy of the in-memory data structure reflecting that the first key is to be deleted; receive a plurality of messages, each message confirming that the first key is to be deleted; determine that the plurality of messages includes a message for each of a plurality of shards of the first copy; and delete the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.

Description

System, method, and medium for implementing conflict-free replicated data types in an in-memory data structure
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 63/094,328, filed October 20, 2020, and U.S. provisional patent application No. 63/094,797, filed October 21, 2020, each of which is incorporated herein by reference in its entirety.
Background
In-memory data structures (e.g., REDIS) are widely used to store data that needs to be accessed quickly for applications such as games, advertising, financial services, healthcare, and many other applications. In many instances, the in-memory data structure may be deployed in a distributed system.
A conflict-free replicated data type (CRDT) is a data structure that can be replicated across multiple nodes in a distributed system network to achieve strong eventual consistency without requiring consensus, and without the delays and reduced availability that consensus introduces.
It is therefore desirable to implement CRDTs in in-memory data structures.
Disclosure of Invention
According to some embodiments, systems, methods, and media for implementing conflict-free replicated data types in an in-memory data structure are provided.
In some embodiments, a system for implementing a conflict-free replicated data type in an in-memory data structure is provided, the system comprising: a memory; and at least one hardware processor coupled to the memory and collectively configured to: mark a first key of a conflict-free replicated data type to be deleted; send an update message to a first copy of the in-memory data structure reflecting that the first key is to be deleted; receive a plurality of messages, each message confirming that the first key is to be deleted; determine that the plurality of messages includes a message for each of a plurality of shards of the first copy; and delete the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
In some of these embodiments, the at least one hardware processor is further configured to: maintain a counter, wherein the counter tracks an interval value and a logical clock for each of a plurality of intervals.
In some of these embodiments, the at least one hardware processor is further configured to: determine that a second copy most recently updated a second key, and set the second copy as an eviction owner of the second key; determine that memory usage of the second copy exceeds a threshold; determine that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and delete the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of a least frequently used key and a least recently used key of the plurality of keys stored by the second copy.
In some of these embodiments, the at least one hardware processor is further configured to: associate a third key with an expiration data owner; determine that the third key has expired; and delete, by the expiration data owner, the third key in response to determining that the third key has expired and that the third key is associated with the expiration data owner.
In some of these embodiments, the at least one hardware processor is further configured to, by each copy of a plurality of copies: create an append-only stream of updates for the updates of that copy; and replicate the stream to each of the plurality of copies.
In some of these embodiments, the at least one hardware processor is further configured to: identify a fourth key created by a fourth copy as having a first value and a first type; identify a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and apply a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.
In some embodiments, a method for implementing a conflict-free replicated data type in an in-memory data structure is provided, the method comprising: marking a first key of a conflict-free replicated data type to be deleted; sending an update message to a first copy of the in-memory data structure reflecting that the first key is to be deleted; receiving a plurality of messages, each message confirming that the first key is to be deleted; determining that the plurality of messages includes a message for each of a plurality of shards of the first copy; and deleting the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
In some of these embodiments, the method further comprises: maintaining a counter, wherein the counter tracks an interval value and a logical clock for each of a plurality of intervals.
In some of these embodiments, the method further comprises: determining that a second copy most recently updated a second key, and setting the second copy as an eviction owner of the second key; determining that memory usage of the second copy exceeds a threshold; determining that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and deleting the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of a least frequently used key and a least recently used key of the plurality of keys stored by the second copy.
In some of these embodiments, the method further comprises: associating a third key with an expiration data owner; determining that the third key has expired; determining that the third key is associated with the expiration data owner; and deleting, by the expiration data owner, the third key in response to determining that the third key has expired and that the third key is associated with the expiration data owner.
In some of these embodiments, the method further comprises, by each copy of a plurality of copies: creating an append-only stream of updates for the updates of that copy; and replicating the stream to each of the plurality of copies.
In some of these embodiments, the method further comprises: identifying a fourth key created by a fourth copy as having a first value and a first type; identifying a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and applying a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.
In some embodiments, a non-transitory computer-readable medium containing computer-executable instructions is provided that, when executed by a processor, cause the processor to perform a method for implementing a conflict-free replicated data type in an in-memory data structure, the method comprising: marking a first key of a conflict-free replicated data type to be deleted; sending an update message to a first copy of the in-memory data structure reflecting that the first key is to be deleted; receiving a plurality of messages, each message confirming that the first key is to be deleted; determining that the plurality of messages includes a message for each of a plurality of shards of the first copy; and deleting the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
In some of these embodiments, the method further comprises: maintaining a counter, wherein the counter tracks an interval value and a logical clock for each of a plurality of intervals.
In some of these embodiments, the method further comprises: determining that a second copy most recently updated a second key, and setting the second copy as an eviction owner of the second key; determining that memory usage of the second copy exceeds a threshold; determining that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and deleting the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of a least frequently used key and a least recently used key of the plurality of keys stored by the second copy.
In some of these embodiments, the method further comprises: associating a third key with an expiration data owner; determining that the third key has expired; determining that the third key is associated with the expiration data owner; and deleting, by the expiration data owner, the third key in response to determining that the third key has expired and that the third key is associated with the expiration data owner.
In some of these embodiments, the method further comprises, by each copy of a plurality of copies: creating an append-only stream of updates for the updates of that copy; and replicating the stream to each of the plurality of copies.
In some of these embodiments, the method further comprises: identifying a fourth key created by a fourth copy as having a first value and a first type; identifying a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and applying a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.
Drawings
Fig. 1 is an example flow diagram of a process for deleting a key, according to some embodiments.
FIG. 2 is an example block diagram of a system according to some embodiments.
FIG. 3 is a block diagram of an example of hardware according to some embodiments.
Detailed Description
According to some embodiments, mechanisms, including systems, methods, and media, are provided for implementing a conflict-free replicated data type (CRDT) in an in-memory data structure.
In some embodiments, these mechanisms provide enhancements over known, proven CRDTs and allow an in-memory data structure (e.g., REDIS) to be deployed in an active-active deployment with minimal modification to the known behavior of the in-memory data structure (e.g., REDIS) and its APIs.
In some embodiments, a unique attribute of the mechanisms described herein is that they apply CRDT semantics to the pre-existing system of in-memory data structure (e.g., REDIS) data types and commands to create an eventually consistent distributed system that provides application and user experiences very similar to those of a non-distributed in-memory data structure.
Sharding is a well-known and general method of distributing data among nodes (referred to as "shards") in a system to achieve scalability (e.g., storing more data or processing transactions faster). When sharding is used, the system may need to store configuration information about the shard topology, such as a mapping between data elements and the nodes that store and process them. Because such systems are dynamic in nature, this configuration information may change; for example, if a new node is added to the system and some data should be migrated to the new node, the configuration information for the system should be updated. This process is called re-sharding.
CRDTs are typically modeled around replicas, i.e., systems that hold copies of the same data. In systems that use sharding, this typically means that the copies are symmetric and use the same sharding topology. To maintain this invariant, coordination across the different copies is required if re-sharding is performed. However, this coordination requires some form of consensus and defeats the motivation for employing CRDTs in distributed systems: achieving strong eventual consistency without consensus and without the delays that consensus introduces.
According to some embodiments, the mechanisms described herein may allow copies that include different numbers of shards and use different shard topologies. Thus, in some embodiments, the copies may also be re-sharded independently, without the need for a cross-copy consensus mechanism.
In some embodiments, a Vector Clock (VC) may be used to register the causal order of operations in and between copies. In some embodiments, the VC may include a unique Identifier (ID) that identifies the copy and/or the shard of the copy that performs the operation. In some embodiments, such VCs may be used in different contexts, such as when a VC is attached to a message indicating its logical time or to a data element indicating its logical time of creation, modification, or deletion.
In some embodiments, the VC may also be used as part of an Observed VC (OVC). The observed vector clock is a structure that describes the latest set of updates received by a single copy from other copies by referencing the vector clock associated with those updates.
For example, in some embodiments, given a copy A with an ID of 1 and a local clock of 100 (denoted {1, 100}), and a copy B with an ID of 2 and a local clock of 110 (denoted {2, 110}), a copy C that has received the most recent updates from both copies will announce an OVC that may be denoted [{1, 100}, {2, 110}].
To support a copy with multiple shards, in some embodiments, the ID may be structured such that one portion of the ID describes the copy and another portion describes the shard. For example, in some embodiments, 0x0501 and 0x0502 may be two unique IDs describing shards 0x1 and 0x2 belonging to the same copy 0x05. In some embodiments, each shard has its own dedicated logical clock and is represented separately in the VC. In some embodiments, vector clock operations may be performed in a copy-wise or shard-wise manner to address the fact that copy A is unaware of, and cannot assume anything about, the shard topology or even the number of shards used by copy B.
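By way of illustration only, the shard-aware ID layout and OVC bookkeeping described above might be sketched as follows. This is a minimal Python sketch under assumed conventions (the 8-bit split between copy and shard, and names such as make_shard_id and ovc_c, are illustrative and not taken from any particular embodiment):

def make_shard_id(copy_id: int, shard_id: int) -> int:
    # Pack a copy ID and a shard ID into one unique ID, e.g. copy 0x05, shard 0x1 -> 0x0501.
    return (copy_id << 8) | shard_id

def copy_of(packed_id: int) -> int:
    # Recover the copy portion, so vector clock operations can be copy-wise rather than shard-wise.
    return packed_id >> 8

# Vector clocks map packed IDs to logical clock values, e.g. {1, 100} and {2, 110} above.
vc_a = {make_shard_id(0x01, 0x1): 100}
vc_b = {make_shard_id(0x02, 0x1): 110}

# Copy C's OVC: the latest update it has observed from each source shard.
ovc_c = {}
for vc in (vc_a, vc_b):
    for source, clock in vc.items():
        ovc_c[source] = max(ovc_c.get(source, 0), clock)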
As shown in the example process 100 of FIG. 1, in some embodiments, a deletion involving two copies, copy A and copy B, may proceed as follows:
1) At 102, copy A marks the key as to be deleted and sends an update message indicating that the key is to be deleted. Although the key is not actually deleted at this point (it has become a tombstone), once it is marked as such it is no longer visible to the user.
The key cannot be completely deleted immediately because the local copy (copy A) needs to maintain information about the fact that the key existed and when it was deleted (relative to other operations). For some data types, the local copy also needs to maintain additional information about the value(s) assigned to the key at the time of deletion, in case a different copy performs another update operation on the same key at the same time.
Thus, when a key is marked to be deleted, it is replaced with a "tombstone" (e.g., a key modified with a flag set to indicate that the key is to be deleted). The tombstone stores information about the key to be deleted and acts as a placeholder until OVCs received in the future indicate that all copies have confirmed and handled the deletion, at which point the key may be permanently deleted. Only when such acknowledgements are received can garbage collection and tombstone removal take place.
2) At 104, copy B, which may have (for example) five different shards, receives the delete message and performs a local delete operation, replacing the key with a tombstone (which will also be deleted later, during garbage collection, once copy B sees that all copies have confirmed the delete operation).
3) At 106, at some later point in time, the shards transmit their latest OVCs in periodic OVC update messages. These messages indicate that each shard has received all updates from copy A, including the update message indicating that the key is to be deleted.
4) Copy A can perform garbage collection only after the specific copy B shard holding the deleted key has confirmed the operation. However, copy A cannot make assumptions about the identity of that shard. It therefore needs to wait until it has received OVC updates acknowledging the delete operation from all shards that are part of copy B. This is possible thanks to the structure of the OVC update message, in which each shard provides additional information about the number of other shards that are part of the copy's shard topology (a simplified sketch of this acknowledgement check is given following this list).
5) At 108, copy A receives an OVC update message from a copy B shard.
6) At 110, copy A determines whether it has received OVC update messages from all copy B shards. If not, process 100 loops back to 108.
7) Otherwise, at 112, copy A deletes the key.
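By way of illustration only, the acknowledgement check of steps 4) to 7) might be sketched as follows. This is a simplified Python sketch, not the implementation of any embodiment; the class name Tombstone and its fields are hypothetical, and it assumes that each OVC update message carries the sender's shard count, as described in step 4):

class Tombstone:
    # Placeholder kept after a key is marked for deletion (steps 1 and 2).
    def __init__(self, key, delete_clock):
        self.key = key
        self.delete_clock = delete_clock   # logical time of the delete operation
        self.acked_shards = {}             # copy ID -> set of shard IDs that acknowledged
        self.shard_counts = {}             # copy ID -> number of shards it reports having

    def record_ovc(self, copy_id, shard_id, shard_count, observed_clock):
        # Handle a periodic OVC update message from one shard of a remote copy.
        self.shard_counts[copy_id] = shard_count
        if observed_clock >= self.delete_clock:   # this shard has observed the delete
            self.acked_shards.setdefault(copy_id, set()).add(shard_id)

    def can_garbage_collect(self, copy_ids):
        # Steps 5 to 7: remove the tombstone only when every shard of every copy acknowledged.
        return all(
            len(self.acked_shards.get(c, ())) == self.shard_counts.get(c, -1)
            for c in copy_ids
        )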
According to some embodiments, causal consistency may be provided, which is an optional guarantee that defines which states of the system may or may not be observed.
According to some embodiments, consider the following example:
Copy A creates a data element with a value of 'cause'. Copy B receives the update from copy A, reads the value 'cause', and writes the added value 'effect'. Copy C receives the update from copy B, but has not yet received the original update from copy A. Causal consistency requires that when the data element is read from copy C, it will consist of both 'cause' and 'effect', or neither, but never just 'effect'.
In some embodiments, causal consistency may make a distributed system look like a simpler, non-distributed system by making more stringent guarantees. This is an advantage for users and application developers. On the other hand, causal consistency is costly, as discussed in the examples below.
In the above examples according to some embodiments, causal consistency may be achieved in two ways:
1) Copy B needs to send larger updates including not only 'effect' but also 'cause'. This means that replication consumes more resources.
2) Copy C retains the update from copy B until it receives the original update from copy A. This means that more memory/storage resources are required and that updates take longer to propagate.
In some embodiments, causal consistency may be omitted in order to obtain performance and resource utilization advantages. To allow this, in some embodiments, CRDT data types may be implemented with additional attributes to enable strong eventual consistency even in cases where updates are received in non-causal order. Updates in non-causal order may occur in a case such as the following: copy A adds element X to a collection; copy B receives the update from copy A and modifies element X to X'; and copy C receives the update for X' from copy B before receiving the update for X from copy A.
To support strong eventual consistency in a system without causal consistency, the CRDT data types are extended in type-specific ways.
In some embodiments, the CRDT may be deployed as a master database or as a cache. In a cache configuration, data stored in memory may be removed and replaced with updated, fresher data in some embodiments. The process of discarding old data may be referred to as "eviction" and may be initiated in some embodiments when the system reaches a set memory usage limit.
According to some embodiments, the distributed eviction process may be managed as follows.
In some embodiments, when evicting data, the mechanisms described herein may follow an eviction policy that defines criteria for selecting which data to evict. In some embodiments, the eviction policy may select data to evict based on which data is least frequently used (LFU) or based on which data is least recently used (LRU). In some embodiments, data may be considered to have been used if it is written to (updated) or read from.
In some embodiments, information about writes may be propagated between copies as part of a replication mechanism. However, in some embodiments, reads may be handled locally and never replicated, as they do not mutate the dataset. In some embodiments, non-replicating read operations may greatly contribute to the scalability of the system, as replication bandwidth and computing resources may be preserved.
In some embodiments, because reads are not replicated, the eviction process does not have all the information needed to properly select keys for eviction. For example, in some embodiments, if copy A handles many read operations on key K but does not write to it, then copy B does not have this information and may consider key K to be infrequently used or not recently used, and thus a candidate for eviction.
In some embodiments, this problem may be solved by utilizing the locality attribute of the data: in many cases, portions of the data will be read from and written to mostly by the same copy. Based on this, in some embodiments, the mechanisms described herein may assign eviction ownership to a key. For example, in some embodiments, each key may have exactly one copy that is considered its eviction owner. In some embodiments, this may be the copy that most recently updated the key, and which copy most recently updated the key may be derived from the underlying vector clock, which provides an ordering of operations.
In some embodiments of the mechanisms described herein, the copy considers only locally owned keys for eviction. In some embodiments, an eviction may be initiated each time the memory usage of a copy exceeds a certain threshold. In some embodiments, the threshold may be lower than full memory capacity because in some cases memory can only be reclaimed after an eviction operation (key deletion) has propagated to all copies and garbage collection can be initiated.
In some embodiments, this form of eviction may be insufficient and the copy may continue to consume more memory. When this occurs, in some embodiments, the copy may initiate a more aggressive memory eviction process and evict data that it does not own, in addition to the data that it does own.
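By way of illustration only, eviction ownership and the two-phase eviction described above might be sketched as follows. The key representation (a dict with a per-copy clock map and a last-access time), the scalar comparison standing in for the full vector-clock ordering, and the two thresholds are assumptions made for the example:

def eviction_owner(key_clocks):
    # key_clocks: copy ID -> logical clock of that copy's last update to the key.
    # The copy with the latest update owns the key (scalar clocks are a simplification).
    return max(key_clocks, key=key_clocks.get)

def eviction_candidates(local_copy, keys, memory_used, soft_limit, hard_limit):
    # Normal eviction considers only locally owned keys; under heavier pressure
    # (hard_limit) the copy also evicts keys it does not own.
    if memory_used < soft_limit:
        return []
    candidates = [k for k in keys if eviction_owner(k["clocks"]) == local_copy]
    if memory_used >= hard_limit:
        candidates = list(keys)
    # Least-recently-used ordering; an LFU policy would sort by access frequency instead.
    return sorted(candidates, key=lambda k: k["last_access"])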
In some embodiments, a volatile key may be used, which is a key that is set to expire and which may be automatically removed after a given time-to-live (TTL).
In some embodiments, TTL information can be maintained as additional key metadata and can be updated at any time. In some embodiments, non-volatile keys may have TTLs attached to them to become volatile, volatile keys may become non-volatile, and the TTLs of existing volatile keys may be modified.
In some embodiments, the replica can periodically perform an active expiration process in which keys with already expired TTLs are removed. In some embodiments, the mechanisms described herein may be configured to avoid multiple copies performing the process at the same time. In some embodiments, this may prevent excessive replication traffic that may result when all replicas expire and delete the same key.
In some embodiments, to avoid this, a volatile key may be associated with an expiration data owner, e.g., the copy that last updated the key's TTL. In some embodiments, the owner may be determined by relying on the causal ordering provided by the vector clock. In some embodiments, a tie-break mechanism may also be used where multiple copies update the TTL at the same time. In this case, in some embodiments, the owner may be set based on any suitable predetermined rule that resolves to a consistent owner across all copies.
In some embodiments, a volatile key can only be actively expired by the copy that owns its TTL, and never by other copies. In some embodiments, active expiration is a process in which keys are scanned and keys with expired TTLs are removed. In some embodiments, this process proactively removes expired keys even if the user does not actively access them.
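By way of illustration only, expiration ownership and owner-only active expiration might be sketched as follows. The tie-break rule (lowest copy ID wins) is just one example of a predetermined rule, and the scalar per-copy clocks again stand in for the full vector clock ordering; the field names are assumptions:

def ttl_owner(ttl_updates):
    # ttl_updates: copy ID -> logical clock of that copy's last TTL update for the key.
    latest = max(ttl_updates.values())
    tied = [c for c, clock in ttl_updates.items() if clock == latest]
    return min(tied)   # deterministic tie break, so all copies resolve the same owner

def should_actively_expire(local_copy, key, now):
    # Active expiration runs only on the copy that owns the key's TTL.
    return (key["expire_at"] is not None
            and now >= key["expire_at"]
            and ttl_owner(key["ttl_updates"]) == local_copy)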
In some embodiments, on-demand expiration may additionally or alternatively be provided. To learn about expiration on demand, consider the following example scenario:
1. copy a creates a key with a TTL of 10 seconds.
2. The key is copied to copy B, which now also stores the key with its TTL.
3. The duplicate link is broken.
4. About 10 seconds later, copy A actively expires the key. However, the expiration message to delete the key cannot reach copy B at this time.
5. The user tries to read the key from copy B. Because copy B did not receive the expiration message, the key still exists, although it should obviously have expired.
In some embodiments, to address this, the key may expire as needed. In some embodiments, such expiration may occur regardless of whether the local copy has a TTL for the key.
In some embodiments, attempting to read a key that should have expired may result in the key appearing to have expired, although it remains in memory.
In some embodiments, attempting to write (modify) a key that should have expired may result in the actual expiration of the key. In some embodiments, in response thereto, the key may be removed and the expiration message may be copied to all copies.
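By way of illustration only, on-demand expiration on the read and write paths might be sketched as follows (the store layout and the replicate callback are assumptions made for the example):

def read_key(store, name, now):
    # Read path: a key past its TTL merely appears expired, even though it stays in memory.
    key = store.get(name)
    if key is None or (key["expire_at"] is not None and now >= key["expire_at"]):
        return None
    return key["value"]

def write_key(store, name, value, now, replicate):
    # Write path: touching a key past its TTL actually expires it and replicates the expiration.
    key = store.get(name)
    if key is not None and key["expire_at"] is not None and now >= key["expire_at"]:
        del store[name]
        replicate(("expire", name))
    store[name] = {"value": value, "expire_at": None}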
In some embodiments of the mechanisms described herein:
1) The counter may be set, rather than merely incremented or decremented;
2) The counter may be deleted and recreated with a different value over time; and/or
3) The counter may use a mixture of integer and floating point values.
In some embodiments, a deletable counter key may be provided.
According to some embodiments, consider the following example:
Copy A deletes a counter key holding the value "10". Logically, this operation can be considered as two operations: resetting the value to 0 and deleting the key. If copy B performs a concurrent increment on the same key, the result should be 1, due to add-wins and observed-remove semantics (e.g., behavior).
Add-wins semantics means that when an add or update operation is performed concurrently with a remove operation on the same data, the add or update operation takes precedence. For example, if copy A adds element X to a collection, and copy B deletes the same element X from the same collection, the result of the operations (after propagation and merging) is a collection with element X as part of it.
Observed-remove semantics means that a copy can only perform remove operations on data that it has observed. For example, copy A and copy B both have a set of elements [X, Y]. Copy A adds element Z to the set, and copy B concurrently deletes the entire set. After propagation and merging, the result of the operations is a set containing element Z, because copy B cannot delete an element that it has not (yet) observed.
According to some embodiments, consider another example:
Copy A increments key K by 1 and delivers the update to copies B and C. At this point, all copies hold the value 1, together with the logical clock of the operation performed by copy A. Next, copy B deletes key K and, at the same time (before the delete update message is received), copy A increments K by 1 again. Copy C sees two updates: the delete update from copy B and the increment update from copy A. These two operations are concurrent, so no total order applies to them. However, copy C still must be able to determine that copy B had only observed the value "1" when it performed the delete. For this purpose, a counter may be provided that is composed of a plurality of intervals, each containing an interval value and a logical clock. The total value of the counter is the sum of the intervals. Delete or set operations also carry a logical clock, so there is no ambiguity as to the "portion of the value" they refer to.
Thus, for example, a counter may have the following entries in some embodiments:
[{0:0},{1:+1},{3:+1},{6:-1},{9:+1}]
wherein each interval of the counter may be defined by {t: v}, where t is the start time of the interval and v is a value (if there is no + or -) or a change in value (if there is a + or -).
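By way of illustration only, the interval-based counter and its observed-remove behavior might be sketched as follows; the tuple representation is an assumption, and the numbers match the example entries above:

# Each interval pairs a logical start time with either an absolute value ("set")
# or a delta ("add"); the counter's value is the running combination of intervals.
intervals = [(0, "set", 0), (1, "add", 1), (3, "add", 1), (6, "add", -1), (9, "add", 1)]

def counter_value(intervals):
    total = 0
    for _, op, v in intervals:
        total = v if op == "set" else total + v
    return total

def observed_remove(intervals, delete_clock):
    # A delete that observed the counter at delete_clock removes only the intervals
    # it could have seen; later, concurrent increments survive.
    return [(t, op, v) for (t, op, v) in intervals if t > delete_clock]

print(counter_value(intervals))                        # 2
print(counter_value(observed_remove(intervals, 6)))    # 1, only the {9:+1} interval remains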
In some embodiments, the in-memory data structure (e.g., REDIS) stream may be a monotonically increasing list of uniquely identified immutable entries.
Managing in-memory data structure (e.g., REDIS) compliant streams in an eventually consistent distributed system introduces several challenges. One challenge is ordering: if the stream is monotonic (a new entry is appended only at its tail), different copies may end up with inconsistent ordering of entries, depending on the order in which updates arrive. On the other hand, maintaining a consistent order across all copies means that the stream is no longer monotonic, as entries may be added at different positions, not just at its tail.
According to some embodiments, the mechanisms described herein may solve this problem by maintaining, for each copy, a sub-stream for the changes made at that copy, and replicating that sub-stream to each other copy. In some embodiments, each copy appends entries to its sub-stream based only on changes made at that copy. In some embodiments, each entry of a sub-stream may have an ID that identifies the copy that produced it. For example, in some embodiments, the ID assigned to an entry by a copy may be assigned a value X such that X modulo Y is equal to Z, where X is the assigned ID value, Y is the maximum number of copies, and Z is the local copy ID. In some embodiments, the value X assigned to each entry in the sub-stream may be unique.
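By way of illustration only, the per-copy entry ID assignment described above (X modulo Y equal to Z) might be sketched as follows; the function names are hypothetical:

def next_entry_id(last_id: int, local_copy_id: int, max_copies: int) -> int:
    # Smallest ID greater than last_id that falls in this copy's residue class,
    # so IDs are unique across sub-streams yet still identify the producing copy.
    candidate = last_id + 1
    return candidate + (local_copy_id - candidate) % max_copies

def producing_copy(entry_id: int, max_copies: int) -> int:
    return entry_id % max_copies

print(next_entry_id(17, local_copy_id=2, max_copies=5))   # 22, and 22 % 5 == 2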
In some embodiments, a distinction may be made between different stream read modes that exhibit different semantics (e.g., behaviors). By doing so, the mechanisms described herein are able to overcome the inherent limitations described above relating to guaranteeing that monotonicity and total order are maintained in a distributed, eventually consistent stream.
According to some embodiments, one read mode that may be provided is a cursor-like XREAD from the tail of the stream. In this mode, the reader requests that the most recent message be read from the tail of the stream. The reader receives the message and its stream ID, and then uses this ID for subsequent reads in order to read the next message in the stream. To service this read mode, in some embodiments the stream may be merge-read from the already-ordered sub-streams. In such a merged read, the reader is provided with entries from the stream based on the order in which the entries were appended to the stream (e.g., oldest to newest), and duplicate entries across sub-streams are ignored. Thus, in some embodiments, the resulting entries may be guaranteed to be in a single monotonic total order.
In some embodiments, one or more updates received from the replica may update the substreams after the later/newer entries have been read from the different substreams, in which case those received updates may not be visible to the reader.
In some embodiments, a reader attempting to read a particular entry by issuing a more specific read request is able to observe and read all corresponding entries regardless of their order of arrival. This is the case because each entry in each substream is identified by a corresponding unique ID and the stream is allowed to randomly access the entry based on that ID. For example, a reader that maintains a reference to a particular entry in a stream using its unique entry ID will be able to read that entry as long as it has been copied to a local copy, regardless of the order of other reads.
In some embodiments, another read mode is consumer reading. In some embodiments, the consumer may be managed in a consumer group that maintains information about the delivery of the item to the consumer and its acknowledgement of receipt.
In some embodiments, the streams of the mechanisms described herein may make more relaxed delivery guarantees to consumers. For example, in some embodiments, a consumer may receive and process a newer entry and later receive an older entry that was not acknowledged and was therefore reclaimed for redistribution to the consumer group.
In some embodiments, consumer reads in the streams of the mechanisms described herein follow the same semantics and may therefore return entries in any order, rather than following any monotonic or total order, delivering older entries after newer entries have been delivered.
In some embodiments, the entire stream itself may also be a memory data structure (e.g., REDIS) key. In some embodiments, this key may be identified by name and may be deleted or created by the copy any number of times, effectively creating different versions of the same stream over time.
In settings that are not causal, this may result in ambiguity regarding the association of elements to a particular version of the stream key.
For example, assume that copy A appends entry X with ID 100 to the stream. The change is replicated to copy B and copy C. Copy B deletes the stream key and propagates the deletion to copy A. Copy A recreates the stream key with the same name, appends entry Y with ID 50 to the stream, and replicates the change to copy C. From the perspective of copy C, this may appear to be an invalid change because it violates monotonicity (appending ID 50 after ID 100).
To address this, and many other similar issues, a sub-stream may be associated with a local logical clock when it is created. When an update for a sub-stream is received and the logical clock of the locally held sub-stream is less than the logical clock of the update, this indicates that the local sub-stream key is outdated and can be replaced.
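By way of illustration only, the creation-clock versioning of sub-stream keys might be sketched as follows (the field names created_clock and entries are assumptions):

def merge_substream(local, incoming):
    # A higher creation clock on the incoming side means the local sub-stream key
    # belongs to an older, since-deleted version of the stream and can be replaced.
    if incoming["created_clock"] > local["created_clock"]:
        return incoming
    if incoming["created_clock"] < local["created_clock"]:
        return local            # stale update for a previous version of the stream key
    merged = dict(local)
    merged["entries"] = sorted(set(local["entries"]) | set(incoming["entries"]))
    return merged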
In some embodiments, efficient acknowledgement propagation may be provided using aggregated, sequential acknowledgements.
In some embodiments, replication between replicas may be managed by a peer-to-peer replication protocol, as described below.
In some embodiments, efficiency is an important aspect of replication between copies. In some embodiments, efficiency has many aspects, such as the network traffic required to deliver replication messages, the memory resources on the copies used for storing additional replication-related state, the CPU resources used for processing replication messages, and so on.
In some embodiments, optimizing one aspect of the replication mechanism may trade off efficiency in another aspect. For example, in some embodiments, a compression algorithm may be applied to the replication stream to trade CPU cycles for network bandwidth.
In some embodiments, the peer-to-peer replication protocol may use a combination of three different methods to maintain replication efficiency:
1) Partial backlog replication. The source copy may continually maintain a replication backlog holding the most recent replication messages while the replication link is active. The backlog may be allocated a limited amount of memory, and old data may be discarded as new data is appended.
In some embodiments, if the replication link is dropped and reestablished, the target copy may request that the source copy deliver replication messages starting from the last message (byte offset) it has received. In some embodiments, if the requested data is still present in the source's replication backlog, it may be delivered and the replication process resumed. In some embodiments, this method of reestablishing the replication link may be very efficient because it requires very few resources on the source copy side.
In some embodiments, if the source copy cannot satisfy the partial backlog replication request (because the backlog data has been replaced with newer data), the two copies may fall back to negotiating a partial state replication.
2) Partial state replication. The target copy may determine the logical vector clock time of the last update it has received and may request that the source copy send only information about keys that have been modified after that logical time. Because the vector clock tracks the update order of each copy, this request can be further refined to request only keys that have been updated by one or more particular copies after particular times (expressed by the vector clocks of those copies). For example, the target copy may request from the source copy only keys updated by copy X after time t_X and keys updated by copy Y after time t_Y. Further, the target copy may request to skip entirely the keys updated by particular copies (e.g., if they have already been replicated directly from those copies). A sketch of such a request appears after this list.
In some embodiments, the source replica may then iterate through the entire dataset looking for a key with a vector clock indicating that the key has been modified after the requested logical time. In some embodiments, those keys may be serialized and delivered to the target copy.
In some embodiments, the target copy receives the serialized keys, deserializes them and loads them into its dataset. If a key already exists, the target copy may use the CRDT metadata to identify new changes and incorporate those changes into its existing key.
In some embodiments, a locally non-existent key may simply be created based on the received state information.
3) Full state replication. This is similar to (2) above, but involves delivering all keys from the source copy to the target copy. It is essentially a subset and special case of partial state replication.
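By way of illustration only, the key-selection step of a partial state replication request (method 2 above) might be sketched as follows; the request and key layouts are assumptions made for the example:

def needs_sync(key_clocks, requested_since, skip_copies=frozenset()):
    # key_clocks: copy ID -> logical clock of that copy's last change to the key.
    # requested_since: copy ID -> logical time the target already has for that copy.
    for copy_id, clock in key_clocks.items():
        if copy_id in skip_copies:
            continue                      # the target already got these updates directly
        if clock > requested_since.get(copy_id, 0):
            return True
    return False

def partial_state_payload(dataset, requested_since, skip_copies=frozenset()):
    # The source iterates over its dataset and serializes only the keys that changed
    # after the requested logical times.
    return {name: key for name, key in dataset.items()
            if needs_sync(key["clocks"], requested_since, skip_copies)}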
In some embodiments, when a key is updated by a copy, there are two ways in which information about the update can be propagated to the other copies: a CRDT effect message and a CRDT state merge message.
In some embodiments, the CRDT effect message describes the operation applied to the key. In order to be handled properly, in some embodiments, the recipient of a CRDT effect message needs to have prior information about the updated key. For example, an effect may describe an operation such as "add element X to a collection". As such, CRDT effect messages may be used between copies after initial state synchronization has been established, in some embodiments.
In some embodiments, on the other hand, the CRDT state merge message carries all the information that a copy holds for a given key. In some embodiments, this includes both the actual data, as provided by the user, and CRDT metadata, such as the vector clocks associated with the different operations performed on the key. In some embodiments, the CRDT state merge message has the following attributes:
1) In some embodiments, a copy receiving a CRDT state merge message for a particular key can reconstruct the full state of the key, which will be the same as the key in the source copy.
2) A CRDT state merge message can always be applied to an existing key and, in some embodiments, yields consistent results. For example, copy A holds a key K with some changes (Ka) applied to it, and copy B holds key K with other changes (Kb) applied to it. Copy A sends a CRDT state merge message for Ka to copy B, and copy B sends a CRDT state merge message for Kb to copy A. After the two copies perform the merge, key K is guaranteed to converge to a common state Kcommon that follows the conflict resolution semantics applicable to the data type and the type of modifications applied to the key.
In some embodiments, the replication link may remain open once the target copy has been synchronized with the source copy using one of the methods described above. In some embodiments, the source copy may use the replication link to continuously deliver the following messages:
1) CRDT effect messages describing data set changes.
2) OVC update messages, which the source copy uses to periodically announce the logical time of the latest operations it has observed as replicated from the other copies.
In some embodiments, the changes to the data set of the replica may be assumed to be persistent and monotonic.
According to some embodiments, consider the following example:
Copy A updates a counter key and increments its value from X to Y (with the operation being assigned the local vector clock time T). After replicating this update to all other copies, copy A must not lose its state and revert to the value X. Furthermore, it must not lose its current vector clock time T; any future operation should have a clock time greater than T.
In some embodiments, the data set may be stored in process memory, so a partial loss of data is not possible (data loss would result only from a process crash or a full system failure). Furthermore, in some embodiments, all updates may optionally be written to an Append Only File (AOF) that the copy may read in the event of a process restart or a full system restart.
In some embodiments, a copy according to the mechanisms described herein may begin in one of three states:
1) There is no data set loaded from disk (either because no persistence is used, or because of a failure).
2) A portion of the data set loaded from disk. For example, if an AOF file is truncated or not fully synchronized to disk.
3) The entire data set loaded from disk, including the most current local write.
In some embodiments, before becoming active, a copy that has loaded data may need to verify that it has managed to recover the data set and that there are no local writes that were replicated to other copies but not loaded locally. In some embodiments, this state may be referred to as a stale state. In some embodiments, a copy may remain in the stale state until it has negotiated replication links with the other copies and can confirm that no lost updates exist.
In some embodiments, if lost updates do exist, they may be identified using the vector clock times carried by updates received from other copies: an update that refers to an operation of this copy at a time more advanced than the copy's current vector clock time indicates a lost local write.
In some embodiments, such a hybrid recovery method may allow a copy according to the mechanisms described herein to restore the entire data set faster than by relying solely on re-replicating the entire data set from a remote copy over the network. This may be the case, for example, because accessing local storage is faster than geographically distributed data center links.
In some embodiments, another concern for eventual consistency and recovery relates to lost garbage collection information.
According to some embodiments, consider the following example:
Copy A deletes a key. The delete operation is replicated to copy B, and the key remains as a tombstone until OVC update messages are received from all participating copies confirming that the delete operation has been handled. Only then can the key be garbage collected and the tombstone removed. Copy B receives the delete operation, arranges for it to be written to the AOF file, and performs a local delete (leaving a tombstone). Later, copy B issues an OVC update message confirming that it has seen the delete operation, signaling that the tombstone can safely be garbage collected. Immediately after that, copy B fails without completing the write to the AOF file. At the same time, copy A receives the OVC message and performs garbage collection, removing the tombstone. Copy B then restarts and reloads the data from its AOF file. The result is a significant inconsistency: copy B holds a key that copy A has deleted and garbage collected in the past, and this state will never be reconciled.
In some embodiments, to address this situation, the mechanisms described herein may include a delayed OVC mechanism.
The purpose of this mechanism is to ensure that the OVCs that a copy announces to other copies are consistent with the information that has been written to, and committed to, its AOF file. To do so, the copy first writes the data to the AOF file and requests that the operating system commit it to disk. Only after this operation has completed successfully does the copy announce the updated OVC. Before that, it continues to announce the previous OVC information (which was previously committed to disk).
In this way, even if a copy crashes and needs to be restarted from disk, it is guaranteed that the OVC announcements it made in the past are never more up-to-date than the OVC information it reloads from disk.
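By way of illustration only, the delayed-OVC rule might be sketched as follows. The class is hypothetical; the point it shows is that only OVC values already flushed and committed (fsynced) to the AOF are ever announced:

import os

class DelayedOVC:
    def __init__(self, aof_path):
        self.aof = open(aof_path, "ab")
        self.pending = {}     # observed, but not yet durable on disk
        self.committed = {}   # durable, and therefore safe to announce

    def apply_remote_update(self, source_id, clock, record):
        self.aof.write(record)
        self.pending[source_id] = max(self.pending.get(source_id, 0), clock)

    def commit(self):
        # Flush and fsync the AOF, then promote the pending OVC to the announced one.
        self.aof.flush()
        os.fsync(self.aof.fileno())
        self.committed.update(self.pending)

    def announce(self):
        # Announcements never run ahead of what a restart could reload from disk.
        return dict(self.committed)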
In some embodiments, the mechanisms described herein may use a single key space to store all keys, which may be of any type. In some embodiments, the key may be implicitly created with a type inferred from the write operation that created the key. In some embodiments, key access may be type checked so, for example, attempting to add a collection member to a key previously created as a list will fail.
In some embodiments, such checks cannot be relied upon in an eventually consistent distributed system, as keys may be created or manipulated concurrently.
In some embodiments, the mechanisms described herein may employ a mechanism that defines strict type priorities when handling key type conflict resolution. For example, in some embodiments, copy A may create a list key K and append an element to it. Copy B may create a string key K having a particular value. These two operations may be concurrent (e.g., copies A and B are disconnected from each other). When the copies are reconnected and the updates converge, key K may be of the string type and the update performed by copy A may be discarded. In some embodiments, this may occur because the string type is prioritized over other types, which is derived from non-distributed in-memory data structure (e.g., REDIS) behavior. Any suitable priority rules may be used in some embodiments.
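By way of illustration only, strict type priority might be sketched as follows. Only the fact that the string type wins over other types follows from the example above; the full priority order and the key representation are assumptions:

TYPE_PRIORITY = {"string": 3, "hash": 2, "set": 1, "list": 0}   # assumed total order

def resolve_type_conflict(local_key, remote_key):
    # Concurrently created keys of different types converge to the higher-priority type;
    # the lower-priority key's updates are discarded, as in the list-versus-string example.
    if TYPE_PRIORITY[remote_key["type"]] > TYPE_PRIORITY[local_key["type"]]:
        return remote_key
    return local_key

k_list = {"type": "list", "value": ["element"]}       # created by copy A
k_string = {"type": "string", "value": "some value"}  # created by copy B
print(resolve_type_conflict(k_list, k_string))        # the string key wins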
In some embodiments, the in-memory data structure (e.g., REDIS) may also use an implicit key deletion mechanism. In some embodiments, a container key (such as a hash, list, ordered set, or collection) may be implicitly deleted when the last element in the container key is removed.
In some embodiments, the mechanisms described herein can distinguish between these two operations and track them individually with different vector clocks. According to some embodiments, consider, for example, that both copy A and copy B hold a set key with three elements [1, 2, 3]. Copy A performs a write operation that removes elements [1, 2, 3], leaving the set empty; it therefore also registers the implicit operation of deleting the key. At the same time, copy B performs a write operation that appends [4] to the key. After all updates have propagated and converged, both copies A and B will hold the set key with [4], due to add-wins and observed-remove semantics.
In some embodiments, any suitable hardware may be used to implement the mechanisms described herein. For example, in some embodiments, copies in a memory data structure (e.g., REDIS) system may each reside in one or more hardware servers coupled to each other via one or more local or wide area networks (including the Internet).
In some embodiments, any of these servers may be implemented using any suitable general purpose or special purpose computer.
Turning to fig. 2, an example 200 of hardware that may be used in accordance with some embodiments of the disclosed subject matter is shown. As shown, in some embodiments, hardware 200 may include multiple data centers, each in a different area. Although three data centers are shown and each data center is in its own area, any suitable number of data centers and any suitable number of areas may be used in some embodiments.
As also illustrated, in some embodiments, each data center may have five nodes (or any other suitable number of nodes). A node may be implemented on one or more physical or virtual servers. In some embodiments, a physical or virtual server may host any suitable number of nodes.
In some embodiments, the physical or virtual server may be implemented on any suitable general purpose or special purpose computer.
Any such general purpose or special purpose computer may include any suitable hardware. For example, as shown in the example hardware 300 of fig. 3, such hardware may include a hardware processor 302, memory and/or storage 304, an input device controller 306, an input device 308, a display/audio driver 310, a display and audio output circuit 312, a communication interface 314, an antenna 316, and a bus 318.
In some embodiments, hardware processor 302 may include any suitable hardware processor, such as a microprocessor, a microcontroller, a digital signal processor, dedicated logic, and/or any other suitable circuitry for controlling the functionality of a general purpose or special purpose computer.
In some embodiments, memory and/or storage 304 may be any suitable memory and/or storage for storing programs, data, and/or any other suitable information. For example, memory and/or storage 304 may include random access memory, read-only memory, flash memory, hard disk storage, optical media, and/or any other suitable memory.
In some embodiments, the input device controller 306 may be any suitable circuitry for controlling and receiving input from the input device 308. For example, the input device controller 306 may be circuitry for receiving input from an input device 308, such as a touch screen, from one or more buttons, from voice recognition circuitry, from a microphone, from a camera, from an optical sensor, from a temperature sensor, from a near field sensor, and/or any other type of input device.
In some embodiments, the display/audio driver 310 may be any suitable circuitry for controlling and driving output to one or more display/audio output circuits 312. For example, the display/audio driver 310 may be circuitry for driving one or more display/audio output circuits 312, such as an LCD display, speakers, LEDs, or any other type of output device.
The communication interface 314 may be any suitable circuitry for interfacing with one or more communication networks. For example, interface 314 may include network interface card circuitry, wireless communication circuitry, and/or any other suitable type of communication network circuitry.
In some embodiments, antenna 316 may be any suitable antenna or antennas for wireless communication with a communication network. In some embodiments, antenna 316 may be omitted when not needed.
Bus 318 may be any suitable mechanism for communicating between two or more components 302, 304, 306, 310, and 314 in some embodiments.
Any other suitable component may additionally or alternatively be included in hardware 300 according to some embodiments.
In some embodiments, any suitable computer readable medium may be utilized to store instructions for performing the functions and/or processes described herein. For example, in some embodiments, the computer readable medium may be transitory or non-transitory. For example, non-transitory computer-readable media may include media such as non-transitory magnetic media (such as hard disks, floppy disks, and/or any other suitable magnetic media), non-transitory optical media (such as compact discs, digital video discs, Blu-ray discs, and/or any other suitable optical media), non-transitory semiconductor media (such as flash memory, electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and/or any other suitable semiconductor media), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer-readable media may include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
While the invention has been described and illustrated in the foregoing illustrative embodiments, it is to be understood that this disclosure is made only by way of example and that numerous changes in the details of the embodiments of the invention may be resorted to without departing from the spirit and scope of this invention, as the invention is limited only by the claims that follow. The features of the disclosed embodiments can be combined and rearranged in various ways.
It should be particularly noted that features described with reference to one or more embodiments are described by way of example and not by way of limitation to those embodiments. Thus, unless otherwise indicated or unless a particular combination is clearly unacceptable, optional features described with reference to only some embodiments are assumed to be equally applicable to all other embodiments.

Claims (18)

1. A system for implementing a conflict-free replicated data type in an in-memory data structure, comprising:
a memory; and
at least one hardware processor coupled to the memory and collectively configured to:
mark a first key of a conflict-free replicated data type to be deleted;
send an update message to a first copy of the in-memory data structure reflecting that the first key is to be deleted;
receive a plurality of messages, each message acknowledging that the first key is to be deleted;
determine that the plurality of messages includes a message for each of a plurality of shards of the first copy; and
delete the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
2. The system of claim 1, wherein the at least one hardware processor is further configured to:
maintain a counter, wherein the counter tracks an interval value and a logical clock for each of a plurality of intervals.
3. The system of claim 1 or 2, wherein the at least one hardware processor is further configured to:
determine that a second copy most recently updated a second key, and set the second copy as an eviction owner of the second key;
determine that memory usage of the second copy exceeds a threshold;
determine that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and
delete the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of a least frequently used key and a least recently used key of the plurality of keys stored by the second copy.
4. The system of any of the preceding claims, wherein the at least one hardware processor is further configured to:
associate a third key with an expired data owner;
determine that the third key has expired;
determine that the third key is associated with the expired data owner; and
delete, by the expired data owner, the third key in response to determining that the third key has expired and determining that the third key is associated with the expired data owner.
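By way of non-limiting illustration, the expired-data-owner behaviour recited in claim 4 might look as follows; the field names and the use of wall-clock TTLs are assumptions for illustration only:

```python
# Illustrative sketch: each key is associated with an expired-data owner at write
# time, and only that owner actively deletes the key once it has expired.
import time


class ExpiringReplica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.store = {}        # key -> value
        self.expiry = {}       # key -> (absolute deadline, owner replica id)

    def set_with_ttl(self, key, value, ttl_seconds, owner_id):
        self.store[key] = value
        # Associate the key with its expired-data owner.
        self.expiry[key] = (time.time() + ttl_seconds, owner_id)

    def expire_pass(self):
        now = time.time()
        deleted = []
        for key, (deadline, owner_id) in list(self.expiry.items()):
            if now < deadline:
                continue                   # not yet expired
            if owner_id != self.replica_id:
                continue                   # only the owner deletes the key
            self.store.pop(key, None)
            del self.expiry[key]
            deleted.append(key)
        return deleted                     # the owner would replicate these deletes
```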
5. The system of any of the preceding claims, wherein the at least one hardware processor is further configured to:
by each copy of a plurality of copies:
create an append-only stream of the updates of that copy; and
replicate the stream to each of the plurality of copies.
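By way of non-limiting illustration, the append-only update streams recited in claim 5 might be sketched as follows; the transport (direct method calls) and the entry format are assumptions for illustration only:

```python
# Illustrative sketch: each copy appends its local updates to an append-only
# stream and replicates the stream entries to every other copy.
class StreamingReplica:
    def __init__(self, replica_id, peers=None):
        self.replica_id = replica_id
        self.peers = peers or []
        self.stream = []           # append-only list of (replica_id, seq, op)
        self.seq = 0
        self.store = {}

    def local_update(self, key, value):
        self.seq += 1
        entry = (self.replica_id, self.seq, ("set", key, value))
        self.stream.append(entry)          # append-only: entries are never rewritten
        self.store[key] = value
        for peer in self.peers:            # replicate the stream entry to every peer
            peer.receive(entry)

    def receive(self, entry):
        _, _, (op, key, value) = entry
        self.stream.append(entry)
        if op == "set":
            self.store[key] = value
```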
6. The system of any of the preceding claims, wherein the at least one hardware processor is further configured to:
identify a fourth key created by a fourth copy as having a first value and a first type;
identify a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and
apply a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.
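By way of non-limiting illustration, the type-based priority recited in claim 6 might be sketched as follows; the specific priority ordering shown is an assumption and is not the ordering of the claimed system:

```python
# Illustrative sketch: when the same key is created on two copies with different
# types and values, a fixed type-priority order decides which value the key keeps.
TYPE_PRIORITY = {"string": 0, "counter": 1, "set": 2, "hash": 3}  # assumed order, higher wins


def resolve_type_conflict(entry_a, entry_b):
    """Each entry is a (value, type_name) pair created on a different copy."""
    value_a, type_a = entry_a
    value_b, type_b = entry_b
    if TYPE_PRIORITY[type_a] >= TYPE_PRIORITY[type_b]:
        return value_a, type_a
    return value_b, type_b


# Example: one copy created the key as a string, another as a set;
# under the assumed priority the set value wins on both copies.
winner = resolve_type_conflict(("hello", "string"), ({"a", "b"}, "set"))
```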
7. A method for implementing a conflict-free replicated data type in a memory data structure, comprising:
marking a first key of a conflict-free replicated data type to be deleted;
sending an update message to a first copy of the memory data structure reflecting that the first key is to be deleted;
receiving a plurality of messages, each message acknowledging that the first key is to be deleted;
determining that the plurality of messages includes a message for each of a plurality of shards of the first copy; and
deleting the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
8. The method of claim 7, further comprising:
maintaining a counter, wherein the counter tracks an interval value and a logical clock for a plurality of intervals.
9. The method of claim 7 or 8, further comprising:
determining that a second copy of a second key has been recently updated, and setting the second copy as an eviction owner of the second key;
determining that memory usage of the second copy exceeds a threshold;
determining that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and
deleting the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of the least frequently used key and the least recently used key of the plurality of keys stored by the second copy.
10. The method of any of claims 7 to 9, further comprising:
associating a third key with an expired data owner;
determining that the third key has expired;
determining that the third key is associated with the expired data owner; and
deleting, by the expired data owner, the third key in response to determining that the third key has expired and determining that the third key is associated with the expired data owner.
11. The method of any of claims 7 to 10, further comprising:
by each copy of a plurality of copies:
creating an append-only stream of the updates of that copy; and
replicating the stream to each of the plurality of copies.
12. The method of any of claims 7 to 11, further comprising:
identifying a fourth key created by a fourth copy as having a first value and a first type;
identifying a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and
applying a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.
13. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for implementing a conflict-free replicated data type in a memory data structure, the method comprising:
marking a first key of a conflict-free replicated data type to be deleted;
sending an update message to a first copy of the memory data structure reflecting that the first key is to be deleted;
receiving a plurality of messages, each message acknowledging that the first key is to be deleted;
determining that the plurality of messages includes a message for each of a plurality of shards of the first copy; and
deleting the first key in response to determining that the plurality of messages includes a message for each of the plurality of shards of the first copy.
14. The non-transitory computer-readable medium of claim 13, wherein the method further comprises:
maintaining a counter, wherein the counter tracks an interval value and a logical clock for a plurality of intervals.
15. The non-transitory computer-readable medium of claim 13 or 14, wherein the method further comprises:
determining that a second copy of a second key has been recently updated, and setting the second copy as an eviction owner of the second key;
determining that memory usage of the second copy exceeds a threshold;
determining that the second key is at least one of a least frequently used key and a least recently used key of a plurality of keys stored by the second copy; and
deleting the second key in response to determining that the memory usage of the second copy exceeds the threshold and that the second key is at least one of the least frequently used key and the least recently used key of the plurality of keys stored by the second copy.
16. The non-transitory computer-readable medium of any one of claims 13 to 15, wherein the method further comprises:
associating a third key with an expired data owner;
determining that the third key has expired;
determining that the third key is associated with the expired data owner; and
deleting, by the expired data owner, the third key in response to determining that the third key has expired and determining that the third key is associated with the expired data owner.
17. The non-transitory computer-readable medium of any one of claims 13 to 16, wherein the method further comprises:
by each copy of a plurality of copies:
creating an append-only stream of the updates of that copy; and
replicating the stream to each of the plurality of copies.
18. The non-transitory computer-readable medium of any one of claims 13 to 17, wherein the method further comprises:
identifying a fourth key created by a fourth copy as having a first value and a first type;
identifying a fifth key created by a fifth copy as having a second value different from the first value and a second type different from the first type; and
applying a priority to the fourth key and the fifth key based on the first type and the second type such that the fourth key is assigned the second value.