WO2023125412A1 - Method and system for synchronous data replication - Google Patents

Method and system for synchronous data replication

Info

Publication number
WO2023125412A1
Authority
WO
WIPO (PCT)
Prior art keywords
region server
server
region
record
cluster
Prior art date
Application number
PCT/CN2022/141930
Other languages
French (fr)
Other versions
WO2023125412A9 (en)
Inventor
Kanaka Kumar AVVARU
Chetan Jagatkishore Kothari
Pankaj Kumar
Wei ZHI
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Publication of WO2023125412A1 publication Critical patent/WO2023125412A1/en
Publication of WO2023125412A9 publication Critical patent/WO2023125412A9/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275Synchronous replication

Definitions

  • The present disclosure, in general, relates to mechanisms for managing data in distributed database management systems and, in particular, to data recovery and synchronization methods employed in distributed databases such as Apache HBase™.
  • HBase stands for Hadoop database.
  • The HBase data model stores semi-structured data having different data types, varying column sizes and field sizes. Data files are stored in the backend using a distributed file system such as HDFS.
  • The HBase data model consists of several logical components: row key, column family, table name, timestamp, etc. The row key is used to uniquely identify the rows in HBase tables. Column families in HBase are static, whereas the columns themselves are dynamic.
  • Every entry in an HBase table is identified and indexed by a RowKey.
  • Fig. 1 illustrates an overview of the HBase architecture and its components.
  • The simplest and foundational unit of horizontal scalability in HBase is a Region.
  • A continuous, sorted set of rows that are stored together is referred to as a region (a subset of table data).
  • The HBase architecture may have many HBase masters, and only one may be active at any point in time.
  • Each region server (slave) serves a set of regions, and a region can be served only by a single region server.
  • HMaster receives the request and forwards it to the corresponding region server.
  • High availability of the Master service is managed through the Apache ZooKeeper™ (not shown in Fig. 1) coordination framework.
  • SCP (Server Crash Procedure)
  • A region server process runs on every node/data node in the cluster managed by the HMaster; see, for example, DataNode 1 and DataNode N in Fig. 1.
  • A region server broadly comprises the following components:
  • In-Memory Store (MemStore): this is the write cache and stores new data that is not yet written to disk. Every column family in a region has an in-memory store/MemStore.
  • WAL (Write Ahead Log): databases like HBase ensure strong consistency using the WAL, which records all changes to data in HBase to file-based storage. Under normal operations, the WAL is not needed because data changes move from the MemStore to StoreFiles. However, if a region server crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed. If writing to the WAL fails, the entire operation to modify the data fails.
  • HFile: the actual storage file that stores the rows as sorted key-values on disk.
  • HBase uses ZooKeeper as a distributed coordination service for region assignments and to recover from region server crashes by loading the affected regions onto other region servers that are functioning.
  • ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization.
  • The HMaster and region servers are registered with the ZooKeeper service, and a client needs to access the ZooKeeper quorum in order to connect with the region servers and the HMaster.
  • Fig. 1 also depicts a data write process in a region server:
  • Client sends insert data request to region server
  • The region server first commits the record to the write-ahead log (WAL) file.
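  • The client-side insert described above may be sketched with the public Apache HBase client API as follows; the table name, column family and values are illustrative placeholders and not taken from the disclosure:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InsertExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("demo_table"))) {
                // The client sends the insert; the serving region server first
                // appends the record to its WAL and then updates the MemStore.
                Put put = new Put(Bytes.toBytes("row-100"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                table.put(put);
            }
        }
    }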
  • Fig 2 illustrates an overview of the region server fault recovery mechanism and the components of the master server (HMaster) and the region server involved in handling the region server fault.
  • Fig 2b depicts the task of WALSplitter implemented by the assigned region server.
  • the WALSplitter retrieves the records from the recovered WAL file and splits the records per region. See, for example in Fig 2b, records segregated for Region 1, Region 2 and Region N. In addition, for each split record per region, a corresponding edit file is also created.
  • The assigned region server collects the records of each region from the WAL file and filters out the records that are already flushed, that belong to a deleted region, or whose edit details are not related to put data.
  • the Master re-assigns each region from the crashed region server, each region being assigned to one region server of the cluster.
  • Fig 2c further depicts the implementation of the Open Region Handler procedure in the assigned region server which takes up the task from the Master upon being assigned a region.
  • the respective assigned region server reads the recovered edits files (see Fig. 2b) for the region.
  • Source region server replicates batch edits to a sink region server.
  • Multi-cluster deployments work in different modes of operation:
  • One of the technical problems associated with data transfer from the Active to the Standby cluster is the asynchronous mode, which means that client applications connecting to the Standby cluster cannot see the latest data and may therefore see inconsistent results. Further, when the active cluster is down in a disaster or power-failure scenario, the data that has not been transferred to the Standby cluster is lost (high RPO, Recovery Point Objective). Moreover, to recover all the data and achieve data consistency, it might take a long time to recover the cluster and replay the pending data or compare the original data against the records available in the standby cluster (high RTO, Recovery Time Objective). However, with compromised data consistency, the cluster can be made available to client applications immediately.
  • the drawbacks of the Asynchronous replication methods employed in Active-Standby mode can be summarized as:
  • The solution is based on replication from a primary data store to a secondary data store; writes are not supported in the secondary data store.
  • “Mixed mode synchronous and asynchronous replication system” referred to as prior art 3 describes: A replication system that includes an asynchronous replication mode and a synchronous replication mode replicates data associated with a plurality of transactions. Each transaction has one or more transaction steps or operations. A first set of transaction steps or operations are allocated to the plurality of appliers on an object-by-object basis when the replication system operates in asynchronous replication mode. A second set of transaction steps or operations are allocated to the plurality of appliers on a transaction-by-transaction basis when the replication system operates in synchronous replication mode. The replication system further includes one or more originating nodes, and the requests for the first and second sets of transaction steps or operations to execute on an originating node can be initiated during the same time period.
  • the disadvantages associated with the methods disclosed in the prior art 3 are as follows:
  • The transaction is originated and committed by the client and is not automatically managed by the distributed database.
  • The client needs to take care of waiting until synchronization to the secondary/standby database is completed.
  • the method for data synchronization comprises: when a data modification of a source database is determined by a data synchronization source end, the data synchronization source end generates a real-time notification with respect to the present instance of data modification and transmits the real-time notification to a data synchronization target end; when the real-time notification is received by the data synchronization target end, the data synchronization target end parses information related to the data modification from the real-time notification and updates a target end database buffer on the basis of the parsing result.
  • the target end can update directly the local database buffer on the basis of the information carried in the real-time notification or start monitoring of the local database so as to update buffered data as soon as synchronization of local data is completed, thus achieving the effect of reducing buffer synchronization and update delay.
  • A main objective of the present disclosure is to provide a method of synchronizing data from one cluster to another in an active-active mode deployment of a distributed database system, such as HBase, and also to provide recovery of data from a failed cluster during a disaster.
  • the main objectives of the present disclosure may be summarized as:
  • the present disclosure provides a synchronous data replication method from a first cluster to a second cluster.
  • the first cluster and the second cluster work in active-active cooperation with each other.
  • The method comprises receiving a write command from a client server in a first region server of the first cluster, and replicating, by the first region server, the received write command in a second region server of the second cluster; the step of replicating includes committing, by the second region server, a record associated with the write command in a second write ahead log (WAL) of the second region server and, subsequent to committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server.
  • The method comprises committing, by the first region server, the record associated with the write command to a first write ahead log (WAL) of the first region server and, subsequent to committing the record to the first WAL, committing, by the first region server, the record to a first memory of the first region server.
  • The first region server commits in the second phase, i.e., the commit is first performed in the remote cluster and only then in the local cluster.
  • the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
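  • The ordering summarized above may be sketched in simplified form as follows; the class and method names are illustrative only and are not the claimed implementation. The record is committed first in the second (remote) region server's WAL and memory, and only then in the first (local) region server; a failure in the remote cluster keeps the record out of subsequent reads in either cluster:

    import java.util.ArrayList;
    import java.util.List;

    public class SyncReplicationSketch {

        /** Illustrative stand-in for one region server's WAL, MemStore and commit path. */
        static class RegionServer {
            final List<String> wal = new ArrayList<>();
            final List<String> memStore = new ArrayList<>();

            boolean commit(String record) {
                wal.add(record);             // commit to the WAL first
                return memStore.add(record); // then commit to the in-memory store
            }
        }

        /** Replicates to the remote (second) cluster first, then commits locally (first). */
        static boolean write(RegionServer first, RegionServer second, String record) {
            if (!second.commit(record)) {
                return false; // failed in the remote cluster: the record is not visible anywhere
            }
            return first.commit(record); // second phase: commit in the local cluster
        }

        public static void main(String[] args) {
            RegionServer first = new RegionServer();
            RegionServer second = new RegionServer();
            System.out.println("write committed in both clusters: " + write(first, second, "row-100=value"));
        }
    }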
  • a system for synchronous data replication comprises of a first cluster and a second cluster.
  • the first cluster comprising a first region server
  • The first region server comprises a first memory, a first write ahead log (WAL) and a first file store.
  • the second cluster comprising a second region server
  • The second region server comprises a second memory, a second WAL and a second file store.
  • the first cluster and the second cluster work in active-active cooperation with each other.
  • The first region server is configured to receive a write command from a client server and replicate the received write command in the second region server, wherein the step of replicating includes committing, by the second region server, a record associated with the write command in the second WAL and, subsequent to committing the record to the second WAL, committing, by the second region server, the record to the second memory. Further, the first region server is configured to commit the record associated with the write command to the first WAL and, subsequent to committing the record to the first WAL, commit the record to the first memory of the first region server. And the second region server is configured to perform replication of the received write command in the second region server.
  • Fig. 1 illustrates an overview of the HBase architecture and its components in the prior art solutions.
  • Fig. 2 illustrates a server crash procedure employed in HBase in the prior art solutions.
  • Fig. 3 illustrates an overview of an active-active cluster architecture and its components in accordance with the teachings of the present disclosure.
  • Fig. 4 illustrates a synchronous data replication method from a first cluster to a second cluster, the first cluster and the second cluster working in active-active cooperation with each other, in accordance with the embodiments of the present disclosure.
  • Fig. 5 illustrates a method of transaction logging and management of records in accordance with an embodiment of the present disclosure.
  • Fig. 6 illustrates a method of committing the record in the first region server in accordance with an embodiment of the present disclosure.
  • Fig. 7 is a schematic illustration of a transaction management mechanism using coordinated lock between clusters, in accordance with an embodiment of the present disclosure.
  • Fig. 8 is a schematic illustration of a transaction management mechanism using coordinated lock between clusters, in accordance with an embodiment of the present disclosure.
  • Fig. 9 is a schematic illustration of a region recovery mechanism in accordance with an embodiment of the present disclosure.
  • Fig. 10 is a schematic illustration of a health monitoring mechanism and a single cluster write mode, in accordance with an embodiment of the present disclosure.
  • Fig. 11 is a schematic illustration of an automatic recovery and set back to sync replication mechanism, in accordance with an embodiment of the present disclosure.
  • Fig. 12 is a schematic illustration of an offline bulk load data scenario, in accordance with an embodiment of the present disclosure.
  • Figs. 13-15 illustrate a system for synchronous data replication comprising a first cluster and a second cluster in active-active cooperation with each other, in accordance with the embodiments of the present disclosure.
  • Fig. 16 illustrates a schematic representation of data node of an HBase cluster as discussed in the present application.
  • the present disclosure can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links.
  • these implementations, or any other form that the present disclosure may take, may be referred to as techniques.
  • the order of the steps of disclosed processes may be altered within the scope of the present disclosure.
  • the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more” .
  • the terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like.
  • the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
  • the present disclosure relates to solutions that can support two active clusters and provide strong consistency to achieve high read and write availability by using synchronous replication of data records.
  • the synchronous replication is achieved through partition level coordinated locks.
  • Fig. 3 illustrates an overview of active-active cluster architecture and its components in accordance with the teachings of the present disclosure.
  • A cluster generally refers to a group of computer systems, also referred to as nodes/servers, that have been linked or interconnected to operate closely together, such that in many respects they form a single computer.
  • Distributed database systems such as HBase, can support multi-active clusters.
  • A multi-cluster architecture is a common solution for applications that need guaranteed read speeds and must avoid intermittent spikes in read performance in the event of network or disk issues. It also helps with load balancing and provides backup clusters for disaster recovery. When deployed in active-active mode, applications can use the other cluster as a fallback when the first cluster is not accessible, which makes the deployment highly available.
  • Cluster 1 and cluster 2 may be part of a distributed database system.
  • the distributed database system is HBase.
  • The technical solutions of the present application may be applicable to any distributed database system that can partition tables and employ WAL files to synchronize and recover the table data in the target cluster.
  • Cluster 1 is also referred to as ‘a first cluster’, ‘a first active cluster’, or ‘a source cluster’.
  • Cluster 2 is also referred to as ‘a second cluster’, ‘a second active cluster’, ‘a sink cluster’, or ‘a destination cluster’.
  • a typical HBase cluster may include one or more region servers, each of the region servers serves a set of regions and a region can be served only by one region server.
  • Cluster 1 in Fig. 3 is shown with respect to one region server, also referred to as the first region server, which hosts a region of a table managed by the HBase cluster, in the illustrated example.
  • The first region server handles read and write requests from an application hosted by a client.
  • the first region server includes a corresponding in-memory store and a WAL, which may be also referred to as the first memory and the first WAL respectively.
  • The first region server also comprises a transaction status log, hereinafter referred to as ‘a first transaction status log’ in reference to the first region server of the first cluster. Additionally, the first region server also comprises a health sensor.
  • the replication of data I/O operation such as a write command from the first region server of cluster 1 to a corresponding region server of cluster 2, which is referred to as ‘a second region server’ , is managed by a replication manager of the first region server.
  • the replication manager also manages the health sensor of the first region server.
  • Similar to cluster 1, i.e., the first cluster, cluster 2, i.e., the second cluster, is shown in Fig. 3 and works in active-active operation with the first cluster.
  • Like cluster 1, cluster 2 in Fig. 3 is shown with respect to one region server, also referred to as the second region server, which hosts a region of a table also hosted by the first region server of cluster 1, in the illustrated example.
  • the second region server can support read and write at the same time as the first region server.
  • The second region server owns a corresponding in-memory store, a WAL, a transaction status log and a corresponding health sensor.
  • the components of the second region server are referred to as the second WAL, the second memory, the second transaction status log and the second health sensor, respectively.
  • The second region server includes a row replication sink and an auto recovery handler, which shall be explained in more detail in the following description.
  • Cluster 2 uses the same mechanism as cluster 1 for writes. For simplicity, only one operation is described.
  • Cluster 1 uses the same mechanism as cluster 2 for reads; a read command received for a respective region at cluster 2 is served from the persistent store files of the second region server hosting the corresponding region.
  • Both clusters, cluster 1 and cluster 2, can write data concurrently, and reads can be performed from either cluster; the results are strongly consistent with each other.
  • the other cluster supports read and write.
  • Collisions of writes between the two clusters are managed by using a coordinated partition-level locking mechanism.
  • A write that failed in either cluster is not considered in query results.
  • The health of the clusters is monitored, and a cluster may be switched to local-write or synchronous-write mode automatically. Also, reads and writes may be disabled on an unhealthy cluster.
  • A crashed cluster is recovered based on the last sync point.
  • the present disclosure provides a synchronous data replication method from a first cluster to a second cluster.
  • the first cluster and the second cluster work in active-active cooperation with each other.
  • Fig. 4 illustrates a synchronous data replication method in accordance with the present embodiment.
  • the first cluster includes the cluster 1 and the second cluster includes the cluster 2, represented in Fig. 3.
  • the first region server of the first cluster receives a write command from a client server.
  • the client server may be hosting an application, similar to the depiction in Fig. 3.
  • the write command is replicated from the first region server to the second region server where a row replication sink at the second region server handles the replication mechanism.
  • first region server replicates the received write command, in a second region server of the second cluster.
  • the replication step includes committing, by the second region server, a record associated with the write command in a second write ahead log WAL of the second region server and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server.
  • the step of replication involving committing the write command to the second WAL and the second memory is handled by the row replication sink.
  • Upon committing the write command in the second region server, at step 404, the first region server commits the record associated with the write command to the first WAL of the first region server and, subsequent to committing the record to the first WAL, commits the record to a first memory of the first region server.
  • The write operation is performed first in the remote cluster, i.e., the second cluster, and then in the local cluster, i.e., in the first cluster. That is to say, the committing in the local cluster is done in a second phase, after committing in the remote cluster in the first phase.
  • a transaction status of the data records associated with the received write command is maintained at both the first region server and the second region server.
  • the transaction log of the first region server also referred to as the first transaction log maintains the transaction status of the data records being committed in the first WAL and the first memory
  • the transaction log of the second region server also referred to as the second transaction log maintains the transaction status of the data records being committed in the second WAL and the second memory.
  • Fig. 5a illustrates WAL sequence identification numbers (IDs), which correspond to the sequential data records written to a local WAL at a region server that receives a write command for a region hosted by that region server.
  • the WAL in the example may include the first WAL of the first region server referred to in the Fig. 4 and Fig. 3 and the WAL herein may also include the second WAL of the second region server referred to in the Fig. 4 and Fig. 3.
  • Each entry in the WAL, identified by its corresponding sequence ID (1, 2, 3, 4, 5 in the example of Fig. 5a), has one or more corresponding entries in the local transaction log of the same region server. That is to say, an entry in the first WAL of the first region server will have one or more corresponding entries in the first transaction log of the first region server, and an entry in the second WAL of the second region server will have one or more corresponding entries in the second transaction log of the second region server.
  • the entries in the transaction log are hereinafter referred to as transaction status since they indicate a status/progress of a data record being committed to a WAL.
  • The entries of the first transaction log may be referred to as the first transaction status.
  • The entries of the second transaction log may be referred to as the second transaction status.
  • The transaction status is one of three types: in-progress, success, or failed/timed-out, wherein the in-progress status indicates that the record is not yet committed to the WAL, the success status indicates that a record has been successfully committed to the WAL, and the timeout/failed status indicates that the record has failed to be committed to the WAL.
  • Each record entry in the WAL may have at least two transaction status entries in the transaction log, the last updated entry being considered the latest transaction status of the record.
  • a record with sequence ID 1 has initially an in-progress status in the transaction log, which is later updated to success in the transaction log once the committing to the WAL is successful.
  • the last updated entry of the record with sequence ID 1, i.e., Success will be considered as the latest transaction status of that record.
  • The transaction status in the transaction log is re-updated to failed/timeout, reflecting that the commit failed while committing in the local cluster.
  • the last entry from the transaction log is always considered as latest transaction status.
  • All locally timed-out records are reported to the remote cluster to correct the status in the remote memory as well.
  • The commit of a record in the in-memory store of the remote cluster is considered successful only when the commit is also successful in the local cluster, thus maintaining strong consistency during synchronous replication mode between the remote cluster and the local cluster.
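  • The transaction status log described above may be sketched as follows, assuming an illustrative in-memory structure keyed by WAL sequence ID in which every status change is appended and the last entry is always read as the latest status:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TransactionLogSketch {
        enum TxStatus { IN_PROGRESS, SUCCESS, FAILED_TIMEOUT }

        /** Illustrative transaction status log keyed by WAL sequence ID. */
        static class TransactionLog {
            private final Map<Long, List<TxStatus>> entries = new HashMap<>();

            void update(long seqId, TxStatus status) {
                entries.computeIfAbsent(seqId, k -> new ArrayList<>()).add(status);
            }

            /** The last updated entry is always taken as the latest transaction status. */
            TxStatus latest(long seqId) {
                List<TxStatus> history = entries.get(seqId);
                return history == null ? null : history.get(history.size() - 1);
            }
        }

        public static void main(String[] args) {
            TransactionLog log = new TransactionLog();
            log.update(1, TxStatus.IN_PROGRESS);    // record not yet committed to the WAL
            log.update(1, TxStatus.SUCCESS);        // commit to the WAL succeeded
            log.update(1, TxStatus.FAILED_TIMEOUT); // re-updated after the local commit failed
            System.out.println("latest status of sequence ID 1: " + log.latest(1));
        }
    }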
  • Fig. 6 illustrates a further embodiment of the method illustrated in Fig. 4.
  • Fig. 6 discloses the method comprising committing the record in the second region server during replication of the received write command at the second region server, illustrated and explained with reference to Fig. 4.
  • the step of committing the record to the second memory of the second region server comprises, at step 602 of Fig. 6, updating, by the second region server, a second transaction status of the record in a second transaction log maintained in the second region server.
  • The term ‘second’ used here signifies that the transaction log belongs to the second region server and that the transaction statuses belong to the entries of such second transaction log.
  • the second transaction status is indicative of at least one of following current status of the record in the second region server:
  • the in-progress status indicates that the record is not yet committed to the second WAL
  • the success status indicates a record being successfully committed to the second WAL
  • the timeout/failed status indicates that the record has failed to be committed to the second WAL.
  • Each record has at least two second transaction status entries in the second transaction log.
  • the last updated entry for a record is the last entry in the second transaction log indicative of the latest transaction status of the record.
  • the second region server sends the latest transaction status of the record to the first region server.
  • the first region server receives the latest second transaction status of the record from the second region server.
  • the first region server may commit the record in the first memory of the first region server when the latest second transaction status received for the record from the second region server, is success.
  • The first region server may commit the record in the first memory only when the latest second transaction status received for the record from the second region server is success. In another embodiment, the first region server may have committed the record in the first memory and may subsequently receive the latest second transaction status for the record from the second region server as failed/timeout. This is the scenario illustrated in the example of Fig. 5a. Thus, the transaction status in the first transaction log is re-updated to failed/timeout, reflecting that the commit for the record failed in the local cluster.
  • the method comprising committing the record in the first region server comprises updating, by the first region server, a first transaction status of the record in a first transaction log maintained in the first region server.
  • first used herein signifies that the transaction log belongs to the first region server and the transaction status belong to the entries of such first transaction log.
  • the first transaction status is indicative of at least one of the current status of the record in the first region server.
  • Both the first region server and the second region server maintain local transaction logs, the first transaction log and the second transaction log, respectively. All local timeouts are reported to the remote cluster to correct the transaction status in the remote cluster, thereby also correcting the transaction status in the remote in-memory store.
  • the second region server has committed the record to the second memory and thereby updated the second transaction status to ‘success’ .
  • the latest first transaction status for the record in the first transaction log of the first region server is ‘timeout/failed’ .
  • the first region server sends the latest first transaction status of the record to the second region server.
  • the second region server detects that the first transaction status of the record is a timeout/failed status.
  • the second region server re-updates the second transaction status of the record in the second region server to reflect the timeout/failed status in the second region server, same as the first region server.
  • the record in the second memory also has to be updated to reflect the timeout/failed status corresponding to the transaction status of the transaction log.
  • the second region server had initially committed the record to the second memory.
  • The second region server may have to fail or time out the record in its second memory.
  • a method of rewriting files in the file store of the second region server is disclosed.
  • Since the second region server had initially committed the record to the second memory and then re-updated the second transaction status of the record to a fail/timeout status, it is possible that the record has already been flushed from the second memory to the file store (HFiles) of the second region server, referred to as the second file store.
  • the second region server rewrites files in the second file store to remove all transactions of the record from the second region server.
  • transactions of the record involve all the transactions corresponding to the write command received in the first region server and being replicated at the second region server.
  • the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
  • The first region server comprises the in-memory store ‘MemStore’, also referred to as the first memory, and the file store (HFiles), also referred to as the first file store.
  • The transaction status of all the records is updated according to the entries in the corresponding transaction log. While querying a record from the in-memory store, or while flushing records from the in-memory store to the HFiles, only successful transactions are considered, i.e., the records which have a successful transaction status in memory.
  • Color coding may be used to refer to the different types of transaction status. In Fig. 5b, green may correspond to a success status, yellow to an in-progress status, and red to a fail/timeout status.
  • the record with a transaction ID 3 has the latest transaction status as failed /timeout.
  • A Scanner scans the records of the MemStore of the first region server and reads or flushes only the successful transactions, i.e., the green color-coded records with transaction IDs 1 and 4 from the MemStore (a sketch of this filtering follows below).
  • the Scanner (query scanner) is a component of the first region server.
  • the Scanner may be outside the first region server, for example, the HMaster.
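  • A transaction-aware scan of the MemStore may be sketched as follows; the data structures are illustrative stand-ins for the MemStore and the transaction log, and only entries whose latest status is success are returned for reads or flushes:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class TxAwareScannerSketch {
        enum TxStatus { IN_PROGRESS, SUCCESS, FAILED_TIMEOUT }

        /** Returns only the MemStore entries whose latest transaction status is SUCCESS. */
        static Map<Long, String> scanSuccessful(Map<Long, String> memStore,
                                                Map<Long, TxStatus> latestStatus) {
            return memStore.entrySet().stream()
                    .filter(e -> latestStatus.get(e.getKey()) == TxStatus.SUCCESS)
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                            (a, b) -> a, LinkedHashMap::new));
        }

        public static void main(String[] args) {
            Map<Long, String> memStore = new LinkedHashMap<>();
            memStore.put(1L, "row-A"); memStore.put(3L, "row-B"); memStore.put(4L, "row-C");
            Map<Long, TxStatus> status = new LinkedHashMap<>();
            status.put(1L, TxStatus.SUCCESS);
            status.put(3L, TxStatus.FAILED_TIMEOUT); // failed/timed-out transaction is skipped
            status.put(4L, TxStatus.SUCCESS);
            // Only transactions 1 and 4 are read or flushed; transaction 3 is ignored.
            System.out.println(scanSuccessful(memStore, status).keySet());
        }
    }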
  • the transaction status of the failed/timeout record with transaction ID 3 is reported to the remote cluster to correct status in remote memory as well.
  • the remote region server rewrites files in its file store to remove transactions of the record with transaction ID 3.
  • a repairing mechanism is implemented by the remote region server.
  • an HFile Repair Chore is implemented in the Remote Region that will rewrite the files with invalid transactions, which in this example case is the transaction ID 3.
  • Initially, the HFile in the remote region had the successful transactions in the sequential order 1, 3 and 4, which were initially the successful in-memory transactions in the remote region.
  • The HFile Repair Chore repairs the files in the background and removes the transactions of transaction ID 3 from the files, after which the HFile in the remote region contains the successful transactions in the sequential order 1 and 4, with 3 removed.
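  • The repair performed by the HFile Repair Chore may be sketched as follows, with a simple list standing in for the store file and the invalid transaction IDs assumed to have been reported by the origin cluster; all names are illustrative:

    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class HFileRepairChoreSketch {
        /**
         * Rewrites the (in-memory stand-in for an) HFile so that entries belonging
         * to invalid transaction IDs are removed, e.g. [1, 3, 4] minus {3} gives [1, 4].
         */
        static List<Long> rewriteWithoutInvalid(List<Long> storedTxIds, Set<Long> invalidTxIds) {
            return storedTxIds.stream()
                    .filter(txId -> !invalidTxIds.contains(txId))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Long> hfile = Arrays.asList(1L, 3L, 4L);
            Set<Long> invalid = new LinkedHashSet<>(Arrays.asList(3L)); // reported by the origin cluster
            System.out.println(rewriteWithoutInvalid(hfile, invalid)); // prints [1, 4]
        }
    }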
  • the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server. And if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
  • the method of concurrently committing the record to the first memory of the first region server and the second memory of the second region server comprises a coordinated row lock management in the first cluster and the second cluster.
  • the first arrived transaction is given priority and the other one is queued next.
  • The method of concurrently committing the record to the first memory of the first region server and the second memory of the second region server comprises locking a first row, by the first region server, the first row belonging to a region hosted by the first region server and corresponding to the write command received at the first region server.
  • The first region server replicates the received write command in the second region server (see Fig. 7, described later).
  • the row 100 is locked in the region server of cluster 1 and the transaction is sent to the row replication sink handler 7A of the second region server of cluster 2, for replication.
  • The method comprises locking a second row, by the second region server, the second row belonging to the region hosted by the second region server and corresponding to the write command replicated at the second region server.
  • The term ‘second’ used before the term ‘row’ implies that the row belongs to a region hosted by the second region server of cluster 2. It is to be understood that the first row is the same as the second row. Referring to the example of Fig. 7, on locking row 100 for TX1 (write command) in the first region server of cluster 1, row 100 is locked for TX1 (replication) in the second region server of cluster 2. Further, when the locking of the second row is successful, the method further comprises performing, in parallel, the committing, by the second region server, of the record in the second WAL and the committing, by the first region server, of the record in the first WAL.
  • The write operation is performed by locking the row in the local cluster (the first cluster), writing in the remote cluster (the second cluster) and then writing later in the local cluster. Further, committing to the WAL and in memory is done in the second phase, first in the remote cluster and then in the local cluster. When there is a collision, it is ensured that the first arrived transaction is given priority and the other one is queued next. This may be done using timestamps. If the transaction timestamps also conflict, a hashing algorithm may be chosen to decide on conflict resolution (smallest peer ID). According to another implementation of the present disclosure, once the remote lock is also successful, the writes (WAL and MemStore writes) in both clusters can be done in parallel.
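  • The conflict-resolution rule described above may be sketched as follows; the comparator is illustrative and assumes that the earlier put timestamp wins and that a timestamp tie is broken by the smaller peer ID:

    import java.util.Comparator;

    public class ConflictResolutionSketch {
        /** Illustrative description of a replicated transaction competing for a row lock. */
        static final class Tx {
            final long timestamp; // put timestamp assigned when the transaction arrived
            final int peerId;     // cluster/peer identifier, used only to break timestamp ties
            Tx(long timestamp, int peerId) { this.timestamp = timestamp; this.peerId = peerId; }
        }

        /** Earlier put timestamp wins; on equal timestamps the smaller peer ID wins. */
        static final Comparator<Tx> PRIORITY =
                Comparator.comparingLong((Tx t) -> t.timestamp).thenComparingInt(t -> t.peerId);

        static Tx winner(Tx a, Tx b) {
            return PRIORITY.compare(a, b) <= 0 ? a : b;
        }

        public static void main(String[] args) {
            Tx tx1 = new Tx(1000L, 1); // arrived first in cluster 1
            Tx tx2 = new Tx(1005L, 2); // arrived later in cluster 2
            System.out.println("priority goes to peer " + winner(tx1, tx2).peerId);
        }
    }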
  • the transaction management using coordinated lock between both clusters may be at a single row level. In accordance with another embodiment of the present disclosure, the transaction management using coordinated lock between both clusters may be at a batch level which may be atomic or non-atomic.
  • the method comprises determining if another transaction arrived earlier at the second cluster, i.e., prior to the write command received in the first cluster. In such case, the single row lock is retried at cluster 2.
  • the client server may generate timestamps ts1, ts2 for the respective transaction 1 and transaction 2, where transaction 1 corresponds to the write command received at cluster 1 and transaction 2 corresponds to another write command received at cluster 2.
  • a hashing protocol or a pluggable conflict resolution protocol may be chosen.
  • The method may further comprise receiving another write command from another client server in the second region server of the second cluster, the another write command also being associated with the record that is associated with the write command received in the first region server.
  • the transaction 2 may arrive at the second region server of the second cluster from another client server and may have the timestamp ts2.
  • The second region server determines which of the transactions, i.e., the first write command or the second write command, reached the second region server first. For instance, the second region server determines whether the write command from the other client server reached the second region server later than the write command from the client server at the first region server. According to the implementations of the present disclosure, the determining is performed by at least one of the following mechanisms:
  • If it is determined that the other write command (transaction 2) from the other client server reached the second region server later than the write command (transaction 1) from the client server at the first region server, the second region server performs the step of committing the record associated with the write command (transaction 1) in the second WAL and, subsequent to committing the record to the second WAL, performs the step of committing the record (transaction 1) to the second memory of the second region server.
  • If it is determined that the other write command (transaction 2) from the other client server reached the second region server before the write command (transaction 1) from the client server at the first region server, then, prior to committing the record associated with the write command (transaction 1) at the second region server, the second region server performs the step of committing a record associated with the other write command (transaction 2) in the second WAL and, subsequent to committing that record to the second WAL, performs the step of committing the record associated with the other write command (transaction 2) to the second memory of the second region server.
  • Cluster 1 and cluster 2 illustrated in Fig. 7 includes the cluster 1 and cluster 2 respectively illustrated in Fig. 3.
  • the row replication sink RPC handler 7A shown in Fig. 7 includes the row replication sink of Fig. 3.
  • the transaction status RPC handler 7B shown in Fig. 7 is also understood as a component of the second region server of cluster 2, which is not shown in Fig. 3.
  • The role of the row replication sink RPC handler 7A and the transaction status RPC handler 7B shall be understood from the following description.
  • the transaction flow starts in cluster 1.
  • the region server generates a timestamp for the transaction arrived at cluster 1.
  • the replication sink RPC handler 7A of the second region server of cluster 2 attempts the lock for row 100.
  • the second region server of cluster 2 may have another transaction having transaction number TX2 with a timestamp generated t2/ts2 in-progress.
  • The second region server performs a try-lock for row 100. If row 100 is not yet locked, the lock is obtained at this step. However, if row 100 is already locked for TX2, the second region server prefers the transaction with the older timestamp.
  • The lock at the row level is coordinated in both clusters, and where there is a collision, it is ensured that the first arrived transaction (using the put timestamp) is given priority and the other one is queued next.
  • The write operation is performed by locking the row locally, writing in the remote cluster first and later in the local/source cluster. Committing is done in the second phase, first in the remote cluster and then in the local/source cluster. Referring to Fig. 7, once the lock for row 100 is obtained in the remote cluster, the write operation is performed in the remote cluster. If the result of the write is success in the second region server, the lock for row 100 is released in cluster 2 and the write operation is then initiated in the region server of cluster 1. It is pointed out that the status after writing in the remote cluster is ‘awaiting origin confirm’, which implies that the remote cluster awaits the success transaction status confirmation from the origin cluster for TX1.
  • the committing to the WAL and the memory is after the transaction log is updated with the latest transaction status as received from the origin/source cluster.
  • The transaction status RPC handler 7B receives the transaction status of the record from cluster 1 and updates the transaction status in the transaction log of the second region server.
  • the lock for row 100 is released and the failed transaction status is reported to the source cluster, cluster 1.
  • The first region server of cluster 1 releases the lock for row 100 and, on the remote failing, ends the processing of the transaction TX1. The process of obtaining the lock in the remote cluster is reattempted after a considerable period has elapsed.
  • the write operation is initiated in the first region server of cluster 1 for TX1.
  • The committing is done in a second phase in the region server of cluster 1, i.e., the committing is first done in cluster 2. If the result of the write operation performed in the first region server of cluster 1 is success, it is checked whether the transaction status for TX1 in the remote cluster is a success. Subsequently, unless the transaction TX1 times out, the committing to the WAL and the in-memory store is performed in cluster 1.
  • On committing in the region server of cluster 1, the same is reported to the transaction status RPC handler 7B of cluster 2, which subsequently updates the transaction log in the region server of cluster 2. If the transaction TX1 times out in the region server of cluster 1, the process will end.
  • the transaction status RPC handler 7B of cluster 2 will accordingly update the transaction status of transaction TX1 in the transaction log of the region server of cluster 2.
  • the transaction status is updated in the transaction log of the region server of cluster 1, and also the failed status is reported to the transaction status RPC handler 7B of cluster 2.
  • the transaction status RPC handler 7B of cluster 2 will accordingly update the transaction status of transaction TX1 in the transaction log of the region server of cluster 2. Based on the latest transaction status of TX1 in the remote cluster, the TX1 will be either committed or failed in the remote cluster.
  • the single row lock management can be extended to batch row lock management.
  • the batch row lock management can be categorized into two types, atomic and non-atomic.
  • The record corresponding to the write command received in the origin cluster 1 may have a transaction for a batch of rows, for an atomic batch row request or a non-atomic batch row request received from the client server.
  • The single row lock scenario illustrated in Fig. 7 may extend to an atomic batch lock scenario.
  • the method may comprise locking of first rows by the first region server and the locking of second rows by the second region server in response to an atomic batch row request by the client server.
  • An atomic batch row request received from the client should complete all the row insertions or fail. So, a lock is attempted for all the rows given in the client request. If any row lock is not available because the row is locked for writes in progress in the remote cluster, the already-acquired locks are released and the locking is retried. Once all the row locks are achieved in both clusters, the data write continues.
  • An example illustration of the present implementation is shown in Fig.
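  • The all-or-nothing locking of an atomic batch may be sketched as follows; the lock table is an illustrative stand-in for the coordinated row locks, and on any unavailable row the already-acquired locks are released so the batch can be retried:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class AtomicBatchLockSketch {
        private final Set<String> lockedRows = new TreeSet<>();

        /** Tries to lock all rows of an atomic batch; releases everything on partial failure. */
        synchronized boolean tryLockAll(List<String> rows) {
            List<String> acquired = new ArrayList<>();
            for (String row : rows) {
                if (!lockedRows.add(row)) {          // row already locked elsewhere
                    lockedRows.removeAll(acquired);  // release old locks so the caller can retry
                    return false;
                }
                acquired.add(row);
            }
            return true; // all row locks achieved, the data write can continue
        }

        public static void main(String[] args) {
            AtomicBatchLockSketch locks = new AtomicBatchLockSketch();
            System.out.println(locks.tryLockAll(Arrays.asList("row-100", "row-101"))); // true
            System.out.println(locks.tryLockAll(Arrays.asList("row-101", "row-102"))); // false, retried later
        }
    }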
  • the single row lock scenario illustrated in Fig. 7 may extend to a non-atomic batch lock scenario.
  • The method comprises locking a first batch of rows, by the first region server, the first batch of rows belonging to a region hosted by the first region server and corresponding to the write command received at the first region server.
  • In a first step, a first batch of rows is locked by the first region server of cluster 1. Further, on obtaining locks for the first batch of rows, as understood from Fig.
  • the first region server of cluster 1 replicates the received write command in the second region server.
  • the second region server attempts locking the second batch of rows belonging to the region hosted by the second region server and the second batch of rows corresponds to the write command replicated at the second region server.
  • The terms ‘first’ and ‘second’ used herein refer to the batch of rows handled by the first region server and the batch of rows handled by the second region server, respectively; however, the first batch of rows is the same as the second batch of rows attempted for locks in the first region server in cluster 1 and the second region server in cluster 2, respectively.
  • the second region server attempts acquiring lock for the second batch of rows. However, some rows may already be locked for other transactions in the cluster 2.
  • the first subset of second batch of rows includes the rows which are available for locking and the second subset of the second batch of rows includes rows not available for locks.
  • the second region server of cluster 2 reports the same to the first region server of cluster 1.
  • The first region server releases the locks on a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server, i.e., the rows not available for locking. See the example shown in Fig. 8b, where the flow indicates that in cluster 1 the lock on SubBatch 2 is released and the mutation for SubBatch 1 is continued, as described with reference to Fig. 7. Consequently, the method comprises performing, in parallel, the committing, by the second region server, of the record in the second WAL, the step of committing by the second region server being performed on the first subset of the second batch of rows, and the committing, by the first region server, of the record in the first WAL, the step of committing by the first region server being performed on the first subset of the first batch of rows.
  • The first region server then obtains locks for the second subset of the first batch of rows, which previously failed to be acquired in the remote cluster. Further, on obtaining locks for the second subset of the first batch of rows in cluster 1, the second region server re-attempts obtaining locks on the second subset of the second batch of rows. As illustrated in Fig. 8b, on locking SubBatch 2, and possibly a next batch of rows, in the first cluster, the next possible minibatch locks are attempted in cluster 2. The process continues in order to perform the step of parallel committing, by the second region server, of the record in the second WAL on the second subset of the second batch of rows, and the committing, by the first region server, of the record in the first WAL on the second subset of the first batch of rows.
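  • The non-atomic variant may be sketched as follows; the batch is split into a sub-batch of rows that can be locked now (and is committed immediately) and a sub-batch of rows locked elsewhere (re-attempted in the next round), with illustrative names only:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class NonAtomicBatchLockSketch {
        /** Splits a batch into the rows that can be locked now and those to retry later. */
        static List<List<String>> splitByAvailability(List<String> batch, Set<String> remoteLocked) {
            List<String> subBatch1 = new ArrayList<>(); // available for locking, mutation continues
            List<String> subBatch2 = new ArrayList<>(); // locked elsewhere, re-attempted next round
            for (String row : batch) {
                if (remoteLocked.contains(row)) subBatch2.add(row); else subBatch1.add(row);
            }
            return Arrays.asList(subBatch1, subBatch2);
        }

        public static void main(String[] args) {
            List<String> batch = Arrays.asList("row-1", "row-2", "row-3");
            Set<String> remoteLocked = new TreeSet<>(Arrays.asList("row-2"));
            List<List<String>> split = splitByAvailability(batch, remoteLocked);
            System.out.println("commit now: " + split.get(0) + ", retry: " + split.get(1));
        }
    }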
  • a server crash procedure implemented in accordance with the present disclosure is disclosed.
  • the master server implements the server crash procedure in accordance with the teachings of the present disclosure.
  • the description of Fig. 2a is incorporated herein for reference.
  • the files of a failed region server are recovered and flushed to another region server in a cluster.
  • For recovering the files from a failed region server, for example the first region server in the first cluster, the method comprises invoking a server crash recovery procedure in a master server managing the assignment of regions in the first cluster, and accordingly managing the assignment of the region being hosted by the first region server.
  • The method further comprises identifying, by the master server, one or more WALs belonging to the first region server and assigning, by the master server, recovery of records from each of the one or more WALs to other region servers of the first cluster.
  • An assigned region server collects the records from a WAL from the one or more WALs of the first region server.
  • The assigned region server splits the records from that WAL to segregate the records according to the corresponding regions for which records are available in the WAL. Further, the assigned region server creates an edit file for each set of segregated records (i.e., one edit file for each set of records split region-wise). Also, the assigned region server creates a transaction log file for each edit file created by the assigned region server. The creation of the transaction log file by the WALSplitter is not available in the prior art recovery mechanism illustrated in Fig. 2b. The transaction log file created for an edit file is used to load the transaction status for the records of that edit file into a corresponding transaction log of the assigned region server.
  • the assigned region server loads the edit file for the region being assigned to that region server, in-memory and also loads the transaction status of each of the records of the edit file in-memory. This is handled during the processing by the Open Region Handler of that assigned region.
  • The assigned region server writes the edit file corresponding to one region (which is assigned to that region server) locally to a corresponding memory of the assigned other region server. Further, the assigned region server loads the transaction log file for the edit file into a corresponding transaction log of the assigned other region server. The latest transaction status for each record from the edit file is derived from the transaction log file corresponding to that edit file.
  • the assigned region server flushes the edit files in-memory to a file store (HFile) of the assigned region server.
  • The flushing comprises checking the transaction status of each record from the edit file loaded in the memory of the assigned region server.
  • The records which have a failed/timed-out status are not flushed to the file store, the records which have an in-progress status are reloaded in memory, and the records which have a success status are flushed to the file store of the assigned region server (a sketch of this rule follows the description of Fig. 9 below).
  • Fig. 9 depicts an illustration of the above-described implementation.
  • Flow marked 4 depicts that the master server assigns the region to another region server of the cluster, where the open region handler of the assigned region server loads the recovered edits for the assigned region.
  • Flow marked 6 depicts flushing from memory to store files using a Tx-aware scanner (corresponding to the similar scanner illustrated and explained with reference to Fig. 5b). Here, failed transactions are ignored.
  • Flow marked 7 depicts loading back all uncommitted (in-progress) transactions into memory. These will be flushed to the HFile in the next cycle.
  • Flow marked 8 depicts that the latest flushed Tx ID for the region can be stored in the master node to purge completed Tx logs and identify the starting offset during the next recovery.
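  • The flush rule applied to the recovered edits (flows 6 and 7 above) may be sketched as follows; the map of recovered edits and their latest transaction statuses is an illustrative stand-in only:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class RecoveryFlushSketch {
        enum TxStatus { IN_PROGRESS, SUCCESS, FAILED_TIMEOUT }

        static final class FlushResult {
            final List<String> flushedToHFile = new ArrayList<>();
            final List<String> reloadedInMemory = new ArrayList<>();
        }

        /** Applies the recovery flush rules to recovered edits and their latest statuses. */
        static FlushResult flushRecoveredEdits(Map<String, TxStatus> recoveredEdits) {
            FlushResult result = new FlushResult();
            for (Map.Entry<String, TxStatus> edit : recoveredEdits.entrySet()) {
                switch (edit.getValue()) {
                    case SUCCESS:        result.flushedToHFile.add(edit.getKey()); break;
                    case IN_PROGRESS:    result.reloadedInMemory.add(edit.getKey()); break;
                    case FAILED_TIMEOUT: break; // ignored, never flushed
                }
            }
            return result;
        }

        public static void main(String[] args) {
            Map<String, TxStatus> edits = new LinkedHashMap<>();
            edits.put("rec-1", TxStatus.SUCCESS);
            edits.put("rec-2", TxStatus.FAILED_TIMEOUT);
            edits.put("rec-3", TxStatus.IN_PROGRESS);
            FlushResult r = flushRecoveredEdits(edits);
            System.out.println("flushed: " + r.flushedToHFile + ", reloaded: " + r.reloadedInMemory);
        }
    }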
  • The master server groups all the WAL split tasks and assigns them to the original region server which created the WAL file, if it is available. This can be identified from the folder and file names where the WAL file is present in the DFS. This ensures local reads during the WAL split and saves network bandwidth.
  • a method of monitoring health of the first cluster and the second cluster is disclosed. Specifically, according to the disclosed embodiment, health of both the first cluster and the second cluster is monitored so as to fall back to a single cluster write mode on detecting unhealthy status for any of the first cluster and the second cluster.
  • the first cluster and the second cluster include the embodiments disclosed above from Figs 3 to 9.
  • the clusters’ health is monitored locally at the clusters and reported to the respective local zookeeper and the remote zookeeper.
  • the clusters’ health is monitored by the local zookeeper and the remote zookeeper.
  • the different cluster status are, for example, as illustrated in Fig. 10.
  • Each of the cluster 1 and cluster 2 includes a set of first region servers (RS, RS, RS) and a set of second region servers (also denoted RS, RS, RS) , respectively.
  • Each of cluster 1 and cluster 2 is managed by a respective zookeeper (ZK), which may also be referred to as the first zookeeper and the second zookeeper, respectively.
  • Fig. 10 shows zookeeper (ZK) managing cluster 2.
  • the health of the cluster 1 is monitored by its local zookeeper (see 10a) and also by the remote zookeeper (see 10b) .
  • The health of cluster 2 is monitored by its local zookeeper (see 10b) and also by the remote zookeeper (see 10a).
  • the first zookeeper monitors the health of the first cluster and may also share the information of the health of the first cluster with the second zookeeper.
  • the second zookeeper monitors the health of the second cluster and may also share the information of the health of the second cluster with the first zookeeper.
  • the first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster. Further, the different cluster statuses are, for example:
  • Un-Healthy: The unhealthy status implies that a respective cluster/a region server of the cluster cannot support read or write operations till recovered.
  • Recovering: The recovering status implies that a respective cluster/a region server of the cluster can hold the write till the status is changed to Active.
  • both clusters, cluster 1 and cluster 2, coordinate the status mutually in the peer Zookeeper (see 10a, 10b).
  • the first zookeeper of cluster 1 gets the status of the region servers (RS) as well as the master server (Master) of cluster 1 and accordingly sets a health mode status (my status) which identifies the health status of the cluster 1.
  • 10a refers to the information updated by the first zookeeper, the information including the health status of the first cluster as well as the health status of the peer cluster (cluster 2) as notified by the second zookeeper.
  • the 10a information includes a peer status (C2): Unhealthy, which implies that cluster 2 is notified to be unhealthy by the remote cluster.
  • While in the local cluster 1, the master is Up, the region server 1 (RS1) is Up, the region server 2 (RS2) is Down and the region server 3 (RS3) is Down. Since the peer status is unhealthy, the first zookeeper sets the health status mode (my mode) as Local.
  • the information (10b) updated by the second zookeeper includes, the health status of the second cluster as well as the health status of the peer cluster (cluster 1) notified by the first zookeeper.
  • 10b information includes a peer status (C1) : Active, which implies that cluster 1 is notified to be active by the remote cluster.
  • While in the local cluster 2, the master is Up, the region server 1 (RS1) is Up, the region server 2 (RS2) is Up and the region server 3 (RS3) is Down. Since the second cluster 2 may be recovering, the second zookeeper sets the health status mode (my mode) as Unhealthy.
  • the write/read operations in the respective clusters may be performed according to one of the following scenarios:
  • the first region server rejects the write command from the client server, when the first region server has an unhealthy status as determined from the first ZK.
  • the first region server commits the write command in the first WAL and the first memory and halts the replication to the second region server of the second cluster when the first region server has a local health status as determined from the first ZK.
  • the first region server halts the replication to the second region server of the second cluster when the second region server has a recovering health status as determined from the second ZK.
  • the first region server performs the committing of the write command as well as the replication to the second region server of the second cluster when the first region server has an active health status as determined from the first ZK and the second region server has an active health status as determined from the second ZK.
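The four write-handling scenarios above can be condensed into a small decision rule. The following Java sketch is only an illustration of that rule; the HealthMode and WriteAction names are assumptions and do not correspond to actual HBase or ZooKeeper APIs.

    // Hypothetical health modes as coordinated through the local and peer ZooKeepers.
    enum HealthMode { UNHEALTHY, LOCAL, RECOVERING, ACTIVE }

    public class WriteModeSketch {

        // Possible outcomes for a write received at the first region server.
        enum WriteAction { REJECT, COMMIT_LOCAL_ONLY, COMMIT_AND_REPLICATE }

        /**
         * Mirrors the scenarios above:
         *  - first region server UNHEALTHY    -> reject the write,
         *  - first region server in LOCAL mode -> commit to local WAL/memory, halt replication,
         *  - second region server RECOVERING  -> commit locally, halt replication,
         *  - both ACTIVE                      -> commit locally and replicate synchronously.
         */
        static WriteAction decide(HealthMode firstServer, HealthMode secondServer) {
            if (firstServer == HealthMode.UNHEALTHY) {
                return WriteAction.REJECT;
            }
            if (firstServer == HealthMode.LOCAL || secondServer == HealthMode.RECOVERING) {
                return WriteAction.COMMIT_LOCAL_ONLY;
            }
            return WriteAction.COMMIT_AND_REPLICATE; // both clusters active
        }

        public static void main(String[] args) {
            System.out.println(decide(HealthMode.UNHEALTHY, HealthMode.ACTIVE));  // REJECT
            System.out.println(decide(HealthMode.LOCAL, HealthMode.ACTIVE));      // COMMIT_LOCAL_ONLY
            System.out.println(decide(HealthMode.ACTIVE, HealthMode.RECOVERING)); // COMMIT_LOCAL_ONLY
            System.out.println(decide(HealthMode.ACTIVE, HealthMode.ACTIVE));     // COMMIT_AND_REPLICATE
        }
    }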
  • a method for automatic recovery of a cluster after a failover comprises of setting the recovering cluster back to sync replication with the active cluster according to the embodiments of the present disclosure.
  • Fig 11 illustrates an example of setting a recovering cluster back to sync mode with the active cluster, in accordance with the present disclosure.
  • cluster 1 is recovering after a failover while cluster 2 is the current active cluster.
  • cluster 1 and cluster 2 refers to the first cluster and the second cluster described so far with references to Figs 3 to 7.
  • the master of the recovering cluster 1 coordinates and identifies all the tables and the regions that need to be synced up.
  • the master server assigns a region server to start the recovery process, the region recovery handler of the region server triggers memstore flush in the active cluster for all regions part of the sync replication tables. Based on the last synched Tx number, all the HFiles are downloaded from the active cluster and bulk loaded into cluster 1. Subsequently, row sync is enabled in the source cluster, i.e., transaction management starts and all the records from now are synced. All the memory records that are put before sync started are transferred. Once all the regions are synchronized, cluster state of the cluster 1 is enabled as Active.
  • the master server (HMaster) of cluster 1 coordinates and identifies the tables and the regions which have to be synchronized. Specifically, the HMaster may coordinate with the cluster 2, for example with the ZK of cluster 2, in order to identify the tables and regions which received the write command while cluster 1 was down. As shown in Fig. 11, the HMaster of cluster 1 assigns a region server from cluster 1 to recover the data from cluster 2 which are to be synced. In the example, it is assumed that the HMaster assigns the first region server of the first cluster to recover data from the second region server of cluster 2, which is the active cluster.
  • the first region server includes a region recovery handler to carry out the operations of recovering data from the second region server and enabling sync with the second region server according to the foregoing description.
  • the first region server being assigned the recovery operations starts the recovery process implemented by the region recovery handler of the first region server.
  • the region recovery handler recovers the transaction log (see recovered transaction log in Fig. 11) of the first region server, which shows that the last successful transaction committed in-memory of the first region server is the transaction TX4.
  • the record identified with TX4 was the last one flushed into the file store (HFiles) of the first region server.
  • the region recovery handler, having identified that the last successful transaction is TX4, downloads the HFiles created after the last flushed TX4 from the active cluster, i.e., cluster 2.
  • a flush instruction is sent from the first region server to the corresponding in-memory of the second region server of cluster 2.
  • the flush instruction received at the in-memory of the second region server causes the records in-memory to be flushed to the corresponding file store (HFile) of the second region server.
  • the HFiles of the second cluster updated after the flushing from the in-memory to the file store of the second region server are copied to the file store of the first region server in cluster 1.
  • HFiles thus pulled and copied to the file store of the first region server in cluster 1, are then bulk loaded to the in-memory of the first region server.
  • Flows 4, 5 and 6 depict that after bulk loading of the HFiles in-memory of the first region server, the transaction sync is resumed between the first region server in cluster 1 and the second region server in cluster 2.
  • Reference is made to Figs 4 and 5 which already explain in detail the sync replication between active cluster 1 and active cluster 2 relying upon the respective transaction managers of the first region server and the second region server.
  • both the first region server and the second region server resume updating their respective transaction logs, committing in both the first region server and the second region server, downloading/flushing their respective in-memory data and thereby enabling sync replication for a region between cluster 1 and cluster 2, and enabling cluster 1 state as active.
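A minimal sketch of this recovery sequence, under the assumption that each store file carries the highest transaction id it contains, is given below. The classes (StoreFile, ActivePeer) are hypothetical stand-ins used only to show the order of operations: flush the peer memstore, copy the store files created after the last flushed transaction, bulk load them, and then resume sync replication.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical store file carrying the highest transaction id it contains.
    record StoreFile(long maxTxId, String name) { }

    public class FailoverRecoverySketch {

        /** Hypothetical view of the active peer region server (cluster 2) during recovery. */
        static class ActivePeer {
            final List<StoreFile> hFiles = new ArrayList<>(List.of(new StoreFile(4, "hfile-1")));
            final List<StoreFile> memStore = new ArrayList<>(List.of(new StoreFile(6, "memstore-edits")));

            // The recovering server asks the peer to flush so that every committed record
            // becomes available as a store file that can be copied in bulk.
            void flushMemStore() {
                hFiles.addAll(memStore);
                memStore.clear();
            }

            // Only files newer than the recovering server's last flushed transaction are copied.
            List<StoreFile> filesAfter(long lastFlushedTx) {
                return hFiles.stream().filter(f -> f.maxTxId() > lastFlushedTx).toList();
            }
        }

        public static void main(String[] args) {
            long lastFlushedTx = 4;               // recovered from the local transaction log (TX4)
            ActivePeer peer = new ActivePeer();

            peer.flushMemStore();                 // trigger flush in the active cluster
            List<StoreFile> copied = peer.filesAfter(lastFlushedTx); // copy HFiles to the first file store

            // Bulk load the copied files into the first region server's memory, then
            // resume record-by-record sync replication and mark cluster 1 as Active.
            List<StoreFile> firstMemStore = new ArrayList<>(copied);
            System.out.println("Bulk loaded into cluster 1: " + firstMemStore);
            System.out.println("Cluster 1 state: ACTIVE");
        }
    }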
  • bulk loading of HFiles may imply an offload data load scenario which is explained with reference to Fig. 12.
  • sync replication between active cluster 1 and cluster 2 can be performed by writing data to both the clusters, which is shown implemented by a dual cluster writer (flow 1) and which may, for example, be performed on calling a data import job.
  • an Import TSV Map Reduce Job is called upon to perform dual cluster write by the dual cluster writer.
  • the other way for writing data into the cluster is the offline mode which is denoted as Load Incremental Files in Fig. 12. In the offline mode, some files may be bulk loaded to the cluster and the files may then be directly loaded to the in-memory of the region server of the cluster.
  • the replication is record by record, which involves writing to the WAL, writing to the in-memory and flushing to the HFiles.
  • the HFiles are created offline and the HFiles are loaded directly into the file store of the cluster.
  • the corresponding region for the records in the HFiles will then initiate loading the files from the HFiles to the in-memory of the region server of cluster 1.
  • the region server of the first cluster will send a replication request of the files to the corresponding region server of cluster 2. See, for example in Fig. 12, the row replication sink of a region server of cluster 2 receives the replication of the bulk load HFiles from cluster 1.
  • the row replication sink will also bulk load the files into the file store of cluster 2 which will then be loaded in-memory of the region server of cluster 2.
  • the replication WAL i.e., the WAL of cluster 1 marks the event when some files have been bulk loaded in-memory instead of being written record by record.
  • Bulk data load commit is synchronized to the peer cluster (cluster 2) through the row replication sink, similar to a single put record (single row lock transaction management). Row replication can also take care of copying the files from the source cluster and loading them to the regions if the files are not copied by the Dual Cluster Writer.
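A possible shape of this bulk-load replication path is sketched below. The RowReplicationSink class and the BULK_LOAD_EVENT marker are hypothetical names used for illustration only; they are not actual HBase classes or WAL entry types.

    import java.util.ArrayList;
    import java.util.List;

    public class BulkLoadReplicationSketch {

        /** Hypothetical peer-side sink that applies replicated bulk loads. */
        static class RowReplicationSink {
            final List<String> peerFileStore = new ArrayList<>();

            void replicateBulkLoad(List<String> hFiles) {
                // The sink bulk loads the same files into the peer cluster's file store;
                // from there they are loaded into the peer region server's memory.
                peerFileStore.addAll(hFiles);
            }
        }

        public static void main(String[] args) {
            List<String> wal = new ArrayList<>();         // replication WAL of cluster 1
            List<String> localFileStore = new ArrayList<>();
            RowReplicationSink peerSink = new RowReplicationSink();

            // Offline mode: HFiles are prepared outside the cluster and loaded directly
            // into the file store instead of being written record by record.
            List<String> preparedHFiles = List.of("hfile-a", "hfile-b");
            localFileStore.addAll(preparedHFiles);

            // The WAL records a bulk-load event marking that these files were not
            // written record by record, so the peer can apply them as a unit.
            wal.add("BULK_LOAD_EVENT:" + preparedHFiles);

            // The bulk-load commit is synchronized to the peer cluster through the
            // row replication sink, similar to a single put record.
            peerSink.replicateBulkLoad(preparedHFiles);

            System.out.println("Cluster 1 store: " + localFileStore);
            System.out.println("Cluster 2 store: " + peerSink.peerFileStore);
        }
    }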
  • the method of setting back the first region server of the first cluster in sync replication with the second region server of the second cluster is disclosed, wherein the first region server and the first cluster has a recovering health status and the second region server and the second cluster has an active health status.
  • the method comprises of identifying by a master server in the first cluster, regions to be synced from the second region server to the first region server.
  • the master server assigns the first region server the recovery process for syncing the identified records from the second region server to the first region server.
  • the first region server recovers the first transaction log of the first region server to identify a last successful transaction and downloads files from the second file store of the second region server to the first file store of the first region server, the files belonging to transactions succeeding the last successful transaction. Thereafter, the first region server resumes syncing after the downloaded files in the first file store are loaded to the first memory of the first region server. Before resuming, the first region server transfers records in the first memory to the second memory of the second region server. After syncing the first region server replicates write commands received from the client server at the first region server to the second region server.
  • the step of downloading files from the second file store to the first file store comprises of triggering, by the first region server, flushing records from the second memory to the second file store in the second region server, the records pertaining to the identified regions by the master server in the first cluster.
  • the second region server upon receiving the triggering instructions, flushes the records to the second file store.
  • the first region server copies the files from the second file store to the first file store upon the flushing in the second region server.
  • the copying, by the first region server includes bulk loading, by the first region server, the files from the first file store to the first memory.
  • a system for synchronous data replication comprises of a first cluster and a second cluster.
  • the first cluster and the second cluster includes the first cluster, cluster 1 and the second cluster, cluster 2, described with reference to Fig. 3. Accordingly, the system disclosed performs the present disclosure as disclosed with reference to Figs 4 to 12. Henceforth, the system shall be explained with reference to cluster 1 and cluster 2 incorporating all the elements and the features explained in detail above.
  • Fig 13 illustrates a system 1300 according to the embodiments of the present disclosure.
  • the system 1300 comprises of a first cluster, cluster 1 and a second cluster, cluster 2.
  • the first cluster comprises of a first region server 1300A. It is understood that although the system depicts a single region server 1300A included in the first cluster, the first cluster may include a plurality of first region servers, as shown in Fig 1.
  • the first region server 1300A comprises of a first memory 1302A (also referred to as ‘Mem Store’), a first write ahead log (WAL) 1304A and a first file store 1306A (also referred to as HFiles/Persistent Store File).
  • the second cluster comprises of a second region server 1300B, the second region server 1300B comprising a second memory 1302B, a second WAL 1304B and a second file store 1306B.
  • the first cluster and the second cluster work in active-active cooperation with each other.
  • the first region server 1300A is configured to replicate write commands to the second region server 1300B in accordance with the embodiments of the present disclosure.
  • the first region server 1300A is configured to receive a write command from a client server.
  • the client server is not shown in Fig. 13, but can be understood to be hosting an application similar to the illustration of Fig. 3 which sends read and write commands to the cluster 1 and cluster 2.
  • the first region server 1300A replicates the write command in the second region server 1300B.
  • the second region server 1300B is configured to perform replication of the received write command in the second region server.
  • the step of replicating includes committing, by the second region server 1300B, a record associated with the write command in a second WAL 1304B and subsequent to the committing the record to the second WAL 1304B, committing, by the second region server 1300B, the record to the second memory 1302B.
  • the first region server 1300A commits the record associated with the write command to the first WAL 1304A and subsequent to the committing the record to the first WAL 1304A, commits the record to a first memory 1302A of the first region server 1300A.
  • the first region server 1300A may commit the record in a second phase, i.e., first committing is done in the remote cluster and then only in the local cluster.
  • the record corresponding to the write command received at the first region server 1300A is concurrently committed to the first memory 1302A of the first region server 1300A and the second memory 1302B of the second region server 1300B, as explained in detail with reference to Fig. 7.
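The remote-first commit order can be illustrated with a minimal sketch. The RegionServerStub class below is a hypothetical stand-in and not HBase code; it only shows that the record reaches the second WAL and second memory before the first WAL and first memory.

    import java.util.ArrayList;
    import java.util.List;

    public class RemoteFirstCommitSketch {

        /** Hypothetical stand-in for a region server's WAL and in-memory store. */
        static class RegionServerStub {
            final String name;
            final List<String> wal = new ArrayList<>();
            final List<String> memStore = new ArrayList<>();

            RegionServerStub(String name) { this.name = name; }

            void commit(String record) {
                wal.add(record);       // always WAL first for durability
                memStore.add(record);  // then in-memory for quick availability
            }
        }

        public static void main(String[] args) {
            RegionServerStub first = new RegionServerStub("first region server 1300A");
            RegionServerStub second = new RegionServerStub("second region server 1300B");

            String record = "rowKey=1, value=x";

            // Phase 1: the write received at the first region server is replicated and
            // committed in the remote cluster (second WAL, then second memory).
            second.commit(record);

            // Phase 2: only after the remote commit does the local cluster commit the
            // record (first WAL, then first memory).
            first.commit(record);

            System.out.println(second.name + " WAL=" + second.wal + " mem=" + second.memStore);
            System.out.println(first.name + " WAL=" + first.wal + " mem=" + first.memStore);
        }
    }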
  • Fig 14 illustrates a system 1300 according to another embodiment of the present disclosure.
  • the system 1300 includes the components of the system 1300 illustrated in Fig. 13.
  • the first region server 1300A comprises of a first transaction log 1308A.
  • the second region server 1300B comprises of a second transaction log 1308B.
  • the first transaction log 1308A and the second transaction log 1308B includes the first transaction log and the second transaction log respectively which is explained with reference to Fig 3 above.
  • the second region server is configured to update a second transaction status of the record in the second transaction log 1308B.
  • the second transaction status is indicative of at least one of the following current status of the record in the second region server: in-progress status; success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the second WAL 1304B, the success status indicates a record being successfully committed to the second WAL 1304B, and the timeout/failed status indicates that the record has failed to be committed to the second WAL 1304B; each record having at least two entries of the second transaction status in the second transaction log 1308B, the last updated entry being considered the latest second transaction status of the record.
  • the second region server 1300B is configured to send the latest second transaction status of the record from the second transaction log to the first region server 1300A.
  • the first region server 1300A is configured to receive the latest second transaction status of the record from the second region server 1300B.
  • the first region server 1300A is configured to commit the record in the first memory 1302A when the latest second transaction status received is the success status.
  • the first region server 1300A is configured to update a first transaction status of the record in the first transaction log, the first transaction status indicative of at least one of following current status of the record at the first region server: in-progress status; success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the first WAL 1304A, the success status indicates a record being successfully committed to the first WAL 1304A, and timeout/failed status indicates that the record has failed to be committed to the first WAL 1304A.
  • Each record has at least two transaction status entries in the first transaction log 1308A, the last updated entry being considered the latest first transaction status of the record.
  • the first region server 1300A is configured to send the latest first transaction status of the record to the second region server 1300B.
  • the second region server 1300B is configured to receive the latest first transaction status of the record and detect that the first transaction status of the record is a timeout/failed status. On detecting, the second region server 1300B is configured to re-update the corresponding second transaction status of the record in the second transaction log to indicate the timeout/failed status in the second region server; and the second region server 1300B is configured to update the record in the second memory 1302B to reflect the timeout/failed status for the record.
  • the second region server is configured to rewrite files in the second file store 1306B to remove transactions of the record corresponding to the write command.
  • the record corresponding to the write command received at the first region server 1300A is concurrently committed to the first memory 1302A of the first region server 1300A and the second memory 1302B of the second region server 1300B. If the record corresponding to the write command fails to be committed in any of the first region server 1300A and the second region server 1300B, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
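A minimal sketch of this transaction-log bookkeeping is given below, assuming an append-only per-transaction status history where the last entry wins. The TxLog class and Status values are hypothetical illustrations, not HBase classes; the example shows the failed-commit path in which the second region server re-updates its log and removes the record from its memory so it is not visible to later reads.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TransactionLogSketch {

        enum Status { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

        /** Hypothetical append-only transaction log; the last entry per transaction wins. */
        static class TxLog {
            private final Map<Long, List<Status>> entries = new HashMap<>();

            void update(long txId, Status status) {
                entries.computeIfAbsent(txId, k -> new ArrayList<>()).add(status);
            }

            Status latest(long txId) {
                List<Status> history = entries.get(txId);
                return history.get(history.size() - 1); // last updated entry is the latest status
            }
        }

        public static void main(String[] args) {
            TxLog firstLog = new TxLog();   // first region server 1300A
            TxLog secondLog = new TxLog();  // second region server 1300B
            List<String> secondMemStore = new ArrayList<>();
            long txId = 42;

            // Remote commit attempt: IN_PROGRESS, then (in this example) a success.
            secondLog.update(txId, Status.IN_PROGRESS);
            secondLog.update(txId, Status.SUCCESS);
            secondMemStore.add("tx42:row1");

            // Local commit attempt fails (e.g. WAL write timeout at the first server).
            firstLog.update(txId, Status.IN_PROGRESS);
            firstLog.update(txId, Status.TIMEOUT_FAILED);

            // The latest first status is sent to the second region server; on seeing a
            // failure it re-updates its own log and removes the record from memory so
            // the record is not considered for subsequent reads in either cluster.
            if (firstLog.latest(txId) == Status.TIMEOUT_FAILED) {
                secondLog.update(txId, Status.TIMEOUT_FAILED);
                secondMemStore.removeIf(r -> r.startsWith("tx" + txId));
            }

            System.out.println("Second log latest: " + secondLog.latest(txId));
            System.out.println("Second memstore:   " + secondMemStore);
        }
    }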
  • the first region server 1300A is configured to lock a first row, the first row belongs to a region hosted by the first region server 1300A and the first row corresponds to the write command received at the first region server 1300A.
  • the first region server 1300A is configured to replicate the received write command in the second region server 1300B.
  • the second region server 1300B is configured to lock a second row, the second row belongs to the region hosted by the second region server 1300B and the second row corresponds to the write command replicated at the second region server 1300B.
  • the second region server 1300B is configured to commit the record in the second WAL 1304B and the first region server 1300A is configured to commit the record in the first WAL 1304A, the committing by the second region server 1300B being in parallel to the committing by the first region server 1300A.
  • the locking of the first row by the first region server 1300A and the locking of the second row by the second region server 1300B is in response to an atomic batch row request by the client server.
  • the first region server 1300A, in response to a non-atomic batch row request by the client server, is configured to lock a first batch of rows, the first batch of rows belonging to a region hosted by the first region server 1300A and the first batch of rows corresponding to the write command received at the first region server 1300A.
  • the description of Fig. 8 is incorporated for reference.
  • the first region server 1300A is configured to replicate the received write command in the second region server 1300B.
  • the second region server 1300B is configured to attempt locking of a second batch of rows, the second batch of rows belonging to the region hosted by the second region server 1300B and the second batch of rows corresponds to the write command replicated at the second region server 1300B.
  • the second region server is configured to acquire locks on at least a first subset of the second batch of rows while failing to acquire lock on a second subset of the second batch of rows.
  • the first region server 1300A is configured to release locks on a first subset of the first batch of rows, the first subset of the first batch of rows corresponding to the first subset of the second batch of rows in the second region server.
  • the second region server 1300B is configured to commit the record in the second WAL 1304B, the step of committing by the second region server 1300B being performed on the first subset of the second batch of rows, and the first region server is configured to commit the record in the first WAL, the step of committing by the first region server, being performed on the first subset of the first batch of rows, wherein the committing by the second region server is in parallel to the committing by the first region server.
  • the first region server 1300A is configured to relock a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server.
  • the second region server 1300B is configured to re-attempt locking of the second subset of the second batch of rows in order to perform parallel committing, by the second region server 1300B, the record in the second WAL 1304B on the second subset of the second batch of rows, and committing, by the first region server 1300A, the record in the first WAL 1304A on the second subset of the first batch of rows.
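One possible reading of this non-atomic batch flow is sketched below: rows whose locks were acquired in both clusters are committed in parallel, while rows the peer failed to lock are re-locked and re-attempted. The example is a single-pass simplification with hypothetical names and is not the actual locking implementation.

    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;

    public class NonAtomicBatchSketch {

        public static void main(String[] args) {
            List<String> firstBatch = List.of("row1", "row2", "row3", "row4");

            // The peer (second region server) attempts to lock the corresponding rows
            // and, in this example, only manages to lock a first subset of them.
            Set<String> lockedAtPeer = new TreeSet<>(Set.of("row1", "row2", "row4"));

            List<String> committedSubset = firstBatch.stream()
                    .filter(lockedAtPeer::contains)           // first subset: commit in parallel
                    .collect(Collectors.toList());
            List<String> retrySubset = firstBatch.stream()
                    .filter(r -> !lockedAtPeer.contains(r))   // second subset: re-lock and retry
                    .collect(Collectors.toList());

            System.out.println("Committed to both WALs in parallel: " + committedSubset);
            System.out.println("Re-locked and re-attempted later:   " + retrySubset);
        }
    }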
  • the second region server 1300B is configured to receive another write command from another client server, the another write command also being associated with the record that is associated with the write command received at the first region server 1300A.
  • the second region server 1300B is configured to determine if the another write command from the another client server reached the second region server later than the write command from the client server at the first region server.
  • Fig. 7 describes an embodiment of the present disclosure to retry locks based on the first arrived transaction at the second region server.
  • the write command received from the client server can be understood as TX1 in Fig. 7 and the another write command received from the another client server can be understood as TX2.
  • the determining by the second region server 1300B is performed by at least one of the following mechanisms:
  • the second region server 1300B is configured to check a transaction timestamp associated with the write command and a transaction timestamp associated with the another write command;
  • the second region server 1300B is configured to apply a hashing protocol on the write command and the another write command;
  • the second region server is configured to apply a pluggable conflict resolution protocol on the write command and the another write command.
  • the second region server 1300B is configured to perform the step of committing the record associated with the write command in the second WAL 1304B and subsequent to the committing the record to the second WAL 1304B, perform the step of committing the record to the second memory 1302B of the second region server 1300B on determining that the another write command from the another client server reached the second region server 1300B later than the write command from the client server at the first region server 1300A.
  • the second region server 1300B is configured to commit the record associated with the another write command in the second WAL 1304B and subsequent to the committing the record to the second WAL 1304B, commit the record to the second memory 1302B, prior to the performing the step of committing the record associated with the write command in the second WAL 1304B.
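A deterministic ordering of the two conflicting writes can be sketched as below, using the transaction timestamp with a hash of the command as a tiebreaker; this stands in for the pluggable conflict resolution protocol mentioned above. The WriteCommand record and the comparator are hypothetical illustrations. Because both clusters evaluate the same comparator, they agree on the commit order without an extra round trip.

    import java.util.Comparator;
    import java.util.List;

    public class ConflictOrderingSketch {

        // Hypothetical representation of a write command and where it originated.
        record WriteCommand(String origin, long txTimestamp, String payload) { }

        public static void main(String[] args) {
            WriteCommand tx1 = new WriteCommand("client at cluster 1", 1_000L, "row1=x");
            WriteCommand tx2 = new WriteCommand("another client at cluster 2", 1_003L, "row1=y");

            // Earlier timestamp commits first; ties are broken deterministically by a
            // hash so both clusters choose the same order without extra coordination.
            Comparator<WriteCommand> order = Comparator
                    .comparingLong(WriteCommand::txTimestamp)
                    .thenComparingInt(w -> w.payload().hashCode());

            List<WriteCommand> commitOrder = List.of(tx1, tx2).stream().sorted(order).toList();
            System.out.println("Commit order at the second region server: " + commitOrder);
        }
    }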
  • Fig. 15 illustrates yet another embodiment of the present disclosure.
  • Fig. 15 illustrates a system 1300, which includes the system depicted in Figs 13 and 14.
  • the system 1300 of Fig. 15 comprises of the first cluster, cluster 1, and the second cluster, cluster 2.
  • the first cluster comprises of the first region server 1300-1A and at least one other region server 1300-2A.
  • the first cluster comprises of a master server 1310A configured to manage assignment of regions in the first cluster and accordingly manage the assignment of the region being hosted by the first region server 1300-1A and the other region server 1300-2A of the first cluster.
  • the second cluster comprises of the second region server 1300-1B and at least one other region server 1300-2B.
  • the second cluster comprises of a master server 1310B configured to manage assignment of regions in the second cluster and accordingly manage the assignment of the region being hosted by the second region server 1300-1B and the other region server 1300-2B of the second cluster.
  • the master server, in this instance the master server 1310A, is configured to invoke a server crash recovery procedure for recovering files from the first region server 1300-1A in the first cluster, wherein the first cluster comprises other region servers 1300-2A, the other region servers 1300-2A being assigned recovery of the files from the first region server 1300-1A, each of the other region servers 1300-2A comprising a corresponding memory, a corresponding WAL, a corresponding transaction log and a corresponding file store.
  • On invoking the server crash recovery procedure, the master server 1310A is configured to identify one or more WALs belonging to the first region server 1300-1A being recovered and assign recovery of records from the one or more WALs to the other region servers 1300-2A of the first cluster. It is understood that Fig. 15 illustrates one other region server 1300-2A while there may be further region servers not shown. The master server 1310A, for example, assigns the recovery of records from the one or more WALs to the other region server 1300-2A, which is also referred to as the assigned other region server 1300-2A.
  • the assigned other region server 1300-2A from the first cluster is configured to collect records from a WAL from the one or more WALs of the first region server 1300-1A. Further, the assigned other region server 1300-2A is configured to split the records from the WAL to segregate the records according to the corresponding regions for which records are available in the WAL, create an edit file for each set of segregated records corresponding to one region and create a transaction log file for each edit file created. Further, the assigned other region server 1300-2A is configured to write the edit file locally to the corresponding memory of the assigned other region server 1300-2A.
  • the edit file corresponds to the one region assigned to the assigned other region server 1300-2A by the master server 1310A.
  • the assigned other region server 1300-2A is configured to load the corresponding transaction log file for the edit file to the corresponding transaction log of the assigned other region server 1300-2A, wherein a latest transaction status of each record from the edit file is recovered from the loaded transaction log file.
  • the assigned other region server 1300-2A checks the transaction status of each record from the edit file and flushes the edit file to the corresponding file store of the assigned other region server 1300-2A based on the transaction status of each record from the edit file, wherein a record which has a transaction status indicating a failed/timeout status is not flushed, a record which has a transaction status indicating an in-progress status is reloaded, and a record which has a transaction status indicating a success status is flushed to the corresponding file store.
  • the system disclosed in Figs. 13-15 may further comprise a first zookeeper (ZK) corresponding to the first cluster and a second zookeeper (ZK) corresponding to the second cluster.
  • Reference is made to Fig. 10 where the embodiments related to the first ZK and the second ZK are discussed.
  • the health of the first cluster including at least the first region server 1300A is monitored by a first ZK and health of the second cluster including the second region server 1300B is monitored by the second ZK, and further the first ZK also maintains information of the health of the second cluster and the second ZK also maintains information of the health of the first cluster.
  • the first region server 1300A is configured to reject the write command from the client server, when the first region server has an unhealthy status as determined from the first ZK.
  • the first region server 1300A is configured to commit the write command in the first WAL 1304A and the first memory 1302A and is configured to halt the replication when the first region server 1300A has a local health status as determined from the first ZK.
  • the first region server 1300A is configured to halt the replication when the second region server 1300B has a recovering health status as determined from the second ZK.
  • first region server 1300A is configured to perform the replication and the committing when the first region server 1300A has an active health status as determined from the first ZK and the second region server 1300B has an active health status as determined from the second ZK.
  • first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster.
  • the master server 1310A corresponding to the first cluster is configured to set back the first region server 1300A of the first cluster in sync replication with the second region server 1300B of the second cluster, when the first region server 1300A and the first cluster have a recovering health status and the second region server 1300B and the second cluster have an active health status.
  • the description of Fig. 11 is incorporated herein by reference.
  • the first region server 1300A and the second region server 1300B include the components of the respective region servers illustrated in Figs. 13 and 14.
  • the master server 1310A is configured to identify in the first cluster regions to be synced from the second region server 1300B to the first region server 1300A.
  • the master server 1310A is configured to assign, to the first region server 1300A, recovery process for syncing the identified records from the second region server 1300B to the first region server 1300A. Further, the first region server 1300A is configured to recover the first transaction log 1308A of the first region server 1300A to identify a last successful transaction. Accordingly, the first region server 1300A is configured to download files from the second file store 1306B of the second region server 1300B to the first file store 1306A of the first region server 1300A, the files belonging to transactions succeeding the last successful transaction.
  • the first region server 1300A is configured to resume syncing after the downloaded files in the first file store 1306A are loaded to the first memory 1302A of the first region server 1300A, wherein before resuming, the first region server 1300A is configured to transfer records in the first memory 1302A to the second memory 1302B of the second region server 1300B. Further, the first region server 1300A is configured to enable replication of write commands received from the client server at the first region server 1300A to the second region server 1300B.
  • the first region server 1300A for downloading the files from the second file store 1306B to the first file store 1306A, is configured to trigger flushing records from the second memory 1302B to the second file store 1306B, the records pertaining to the identified regions by the master server 1310A in the first cluster. Consequently, the second region server 1300B is configured to flush the records to the second file store 1306B and the first region server 1300A is configured to copy the files from the second file store 1306B to the first file store 1306A upon the flushing of the records in the second region server 1300B. In a further implementation, the first region server 1300A is configured to bulk load the copied files from the first file store 1306A to the first memory 1302A.
  • Fig. 16 illustrates a schematic diagram of a data node 1600 of an HBase cluster.
  • the data node 1600 may represent a first region server of a first cluster, a second region server of a second cluster or a master server, and accordingly may be a computing and/or a storage node of an HBase cluster in accordance with the embodiments of the present disclosure.
  • data node 1600 is included for purposes of clarity of discussion, but is in no way meant to limit the application of the present disclosure to a particular node.
  • At least some of the features/methods described in the disclosure are implemented in a region server. For instance, the features/methods in the disclosure are implemented using hardware, firmware, and/or software installed to run on hardware.
  • As shown in Fig. 16, the data node 1600 comprises transceivers (Tx/Rx) 1610, which are transmitters, receivers, or combinations thereof.
  • a processor 1620 is coupled to the Tx/Rx 1610 to process the write commands received from a client node or a write command being replicated from another data node.
  • the processor 1620 may comprise one or more multi-core processors and/or memory modules 1630, which function as data stores, buffers, etc.
  • Processor 1620 is implemented as a general processor or is part of one or more application specific integrated circuits (ASICs) and/or digital signal processors (DSPs).
  • the memory module 1630 comprises a cache for temporarily storing content, e.g., a Random Access Memory (RAM) .
  • the memory module 1630 comprises a long-term storage for storing content relatively longer, e.g., a Read Only Memory (ROM) .
  • the cache and the long-term storage includes dynamic random access memories (DRAMs) , solid-state drives (SSDs) , hard disks, or combinations thereof.
  • a design that is still subject to frequent change is preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design.
  • a design that is stable that will be produced in large volume is preferred to be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation is less expensive than the software implementation.
  • a design is developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an ASIC that hardwires the instructions of the software.
  • a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions is viewed as a particular machine or apparatus.
  • Client can read or write to any cluster.
  • the embodiments of the present disclosure disclose a method, system and apparatus for transaction management between two clusters.
  • the transaction management include single row, batch with atomic and non-atomic batch transaction management between two clusters by coordinated row lock mechanism.
  • the embodiments also disclose in-memory transaction status being used for scanner and flush to HFiles.
  • a background file repair mechanism, Repair Chore, is used for correcting out-of-order records.
  • the embodiments disclose transaction management for bulk data loading scenario as well as a process of automatic recovery using bulk store files instead of record by record to recover fast after failover.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present disclosure provides a synchronous data replication method from a first cluster to a second cluster. The first cluster and the second cluster work in active-active cooperation with each other. The method comprises of receiving a write command from a client server in a first region server of the first cluster, replicating, by the first region server, the received write command in a second region server of the second cluster; wherein the step of replicating including committing, by the second region server, a record associated with the write command in a second write ahead log (WAL) of the second region server and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server. Further, the method comprises of committing, by the first region server, the record associated with the write command, to a first write ahead log (WAL) of the first region server and subsequent to the committing the record to the first WAL, committing, by the first region server, the record to a first memory of the first region server.

Description

METHOD AND SYSTEM FOR SYNCHRONOUS DATA REPLICATION
The present application claims priority to India Patent Application No. 202131061002, titled "METHOD AND SYSTEM FOR SYNCHRONOUS DATA REPLICATION", filed on December 27, 2021 with the India Patent Office, which is incorporated herein by reference in its entirety.
FIELD
The present disclosure described herein, in general, relates to mechanism of managing data in distributed database management systems, and in particular, to data recovery and synchronization methods employed in distributed databases like Apache HBase TM.
BACKGROUND
A Hadoop database (HBase) is a distributed database and features high reliability, high performance, Key-Value based storage, and the like. Therefore, the HBase is used by a growing quantity of enterprises and users to create a data table. HBase data model stores semi-structured data having different data types, varying column size and field size. Data files are stored in the backend using a Distributed File Systems like HDFS. HBase data model consists of several logical components- row key, column family, table name, timestamp, etc. Row Key is used to uniquely identify the rows in HBase tables. Column families in HBase are static whereas the columns, by themselves, are dynamic.
● HBase Tables – Logical collection of rows stored in individual partitions known as Regions.
● HBase Row – Instance of data in a table.
● RowKey -Every entry in an HBase table is identified and indexed by a RowKey.
● Columns - For every RowKey an unlimited number of attributes can be stored.
● Column Family – Data in rows is grouped together as column families and all columns are stored together in a low-level storage file known as HFile.
Fig. 1 illustrates an overview of Hbase architecture and its components. The simplest and foundational unit of horizontal scalability in HBase is a Region. A continuous, sorted set of rows that are stored together is referred to as a region (subset of table data) . HBase architecture may have many HBase masters and one may be active at any point in time. For an HBase master node/Master Server (HMaster) there will be several slaves i.e. region servers. Each region server (slave) serves a set of regions, and a region can be served only by a single region server. Whenever a client sends a write request, HMaster receives the request and forwards it to the corresponding region server. High Availability of Master service is managed through Apache Zookeeper TM  (not shown in Fig. 1) coordinator framework.
HMaster/Master Server:
- Manages Hbase Tables creation, deletion;
- Assign table regions (partitions) to region servers in a cluster ;
- Balances the cluster by distributing regions equally to region servers; and
- Detects failures of region servers and performs recovery of regions in the crashed region servers through a process called Server Crash Procedure (SCP) (explained further) .
Region Servers:
- Hosts the regions (partitions) of tables and handles read, write requests incoming from the client;
- Write records to a region server level Write Ahead Log (WAL) files for durability and in-memory for quick availability;
- Manage region store files for operations like compaction and deletion of old records; and
- Upon task assignment by HMaster, recovers table partition (region) data  from the WAL files.
Region server process, runs on every node/data node in the cluster managed by the HMaster, see for example, DataNode 1, DataNode N in Fig. 1. Region server broadly comprises of the following components:
In Memory Store: This is the write cache and stores new data that is not yet written to the disk. Every column family in a region has an in memory store/MemStore.
Write Ahead Log (WAL) : WAL is a file that stores new data that is not persisted to permanent storage. Databases like HBase ensures strong consistency using WAL which records all changes to data in HBase, to file-based storage. Under normal operations, the WAL is not needed because data changes move from the MemStore to StoreFiles. However, if a region server crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed. If writing to the WAL fails, the entire operation to modify the data fails.
Hfile: HFile is the actual storage file that stores the rows as sorted key values on a disk.
Apart from the master server and the region servers, the Hbase uses ZooKeeper as a distributed coordination service for region assignments and to recover any region server crashes by loading them onto other region servers that are functioning. ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization. HMaster and Region servers are registered with ZooKeeper service, and a client needs to access ZooKeeper quorum in order to connect with region servers and HMaster.
Fig. 1 also depicts a data write process in a region server:
1. Client sends insert data request to region server;
2. Region server first commits record to write-a-head log (WAL) file;
3. Insert data record to a region in memory store;
4. Till the records reach the store file, the data records are in temporary memory & WAL files only. So, these databases are also called LSM Tree (Log Structured Merge Tree) based databases. Once many records get accumulated in the memory, these records are flushed to a file system in structured & compacted store files (HFile) ; and
5. Add the flushed file to Region Store Files. HFiles are merged to big HFiles using a compaction process.
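A minimal sketch of this write path, assuming a toy flush threshold of three records, is given below. The RegionStub class is a hypothetical stand-in, not an HBase internal; it only mirrors the order listed in steps 1 to 5 above: WAL first, then the in-memory store, then a flush to a store file once enough records accumulate.

    import java.util.ArrayList;
    import java.util.List;

    public class WritePathSketch {

        static final int FLUSH_THRESHOLD = 3;

        /** Hypothetical stand-in for one region's WAL, memstore and store files. */
        static class RegionStub {
            final List<String> wal = new ArrayList<>();
            final List<String> memStore = new ArrayList<>();
            final List<List<String>> storeFiles = new ArrayList<>(); // flushed HFiles

            void put(String record) {
                wal.add(record);       // step 2: commit to the write-ahead log first
                memStore.add(record);  // step 3: insert into the in-memory store
                if (memStore.size() >= FLUSH_THRESHOLD) {
                    storeFiles.add(new ArrayList<>(memStore)); // steps 4/5: flush to an HFile
                    memStore.clear();
                }
            }
        }

        public static void main(String[] args) {
            RegionStub region = new RegionStub();
            for (int i = 1; i <= 4; i++) {
                region.put("row" + i);
            }
            System.out.println("WAL entries:  " + region.wal);
            System.out.println("MemStore:     " + region.memStore);
            System.out.println("Store files:  " + region.storeFiles);
        }
    }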
Fig 2 illustrates an overview of the region server fault recovery mechanism and the components of the master server (HMaster) and the region server involved in handling the region server fault. As depicted in Fig 2a, when a fault in any of the region servers of a cluster is detected by the Master (Hmaster) , a Server Crash Procedure (SCP) is triggered in the Master which involves the following major steps:
○ Identify all the write-a-head log (WAL) files which belong to the failed region server of the cluster. It is understood that the region server may be hosting several WAL files, each of the WAL files may include records for several regions.
○ Assign recovery of records from WAL file task (Split Wal) to other region servers. Fig 2b depicts the task of WALSplitter implemented by the assigned region server. The WALSplitter retrieves the records from the recovered WAL file and splits the records per region. See, for example in Fig 2b, records segregated for Region 1, Region 2 and Region N. In addition, for each split record per region, a corresponding edit file is also created.
○ Assigned region server collects records of each Region from the WAL file and filters out the records that are already flushed, belong to a deleted region, or have edit details not related to put data.
○ Write the recovered region records (edits) to edit file which is termed as “recovered edits” file further.
While the assigned region server employs the WALSplitter as described above, as depicted in Fig 2a, the Master (HMaster) re-assigns each region from the crashed region server, each region being assigned to one region server of the cluster. Fig 2c further depicts the implementation of the Open Region Handler procedure in the assigned region server which takes up the task from the Master upon being assigned a region.
○ Upon assign request, the respective assigned region server reads the recovered edits files (see Fig. 2b) for the region.
○ Applies to In-Memory and flushes to the store file of the assigned region to ensure the records are durable.
○ Initialize all the store files to be ready for queries.
○ Confirm the region open status to Master.
○ Make the region available for user operations.
Data replication:
For the purpose of data recovery, disaster recovery or improved availability, a multi-cluster deployment is used where the data records from one cluster are transferred to another cluster using a mechanism called replication, which works as follows:
○ Records from the WAL file is replicated to peer cluster using background replication service in Asynchronous mode [Eventual Consistency] 
○ at-least-once delivery of client edits to peer cluster
○ Every new WAL is added to a replication queue and status is synced in ZK (ZooKeeper)
○ Source region server replicates batch edits to a sink region server.
Multi clusters works in different modes of operations:
- Active-Active mode:
- Both clusters are independently accessible to different client applications for the purpose of load balancing.
- Applications can use the other cluster as a fallback when the first cluster is not accessible, which makes the cluster highly available.
- Active-Standby mode:
- Only the Active cluster is accessible to the client application and the standby synchronizes data from the Active. The Standby is made accessible to the client in case the Active cluster is crashed.
One of the technical problems associated with the data transfer from the Active to the Standby cluster is the asynchronous mode, which means that client applications connecting to the Standby cluster cannot see the latest data, so the clients can see inconsistent results. Further, when the active cluster is down in case of a disaster or a power failure scenario, the data that is not transferred to the Standby cluster is lost (High RPO - Recovery Point Objective) . Moreover, to recover all the data & achieve data consistency, it might take a long time to recover the cluster and replay pending data or compare original data vs. available records in a standby cluster [High RTO - Recovery Time Objective] . However, with the compromised data consistency, the cluster can be made available to client applications immediately. The drawbacks of the asynchronous replication methods employed in Active-Standby mode can be summarized as:
- Eventual consistency when both clusters healthy.
-Read from standby gives stale data.
-Unreplicated data is lost when the Active cluster crashes.
- RPO = ~ Seconds to Minute;
- RTO = ~ Minutes to Hours
In the currently known Apache based HBase technology, synchronous replication methods are disclosed. A document titled, ‘Synchronous Replication Solution using Remote WAL’ , referred to as prior art 1, which is accessible from the link: https://issues.apache.org/jira/browse/HBASE-19064, describes a setup of two clusters, active (A) and standby (S) , and connecting them with asynchronous replication. All read/write are performed at A, S only receives replication data. Besides the normal WAL logging, A will also write a copy of the WAL (remote WAL) to the HDFS cluster of S. When the asynchronous replication goes on, S also deletes remote WALs on S which have already been replicated. If A crashes and we want S to be the next active cluster, the remote WALs are replayed on S before offering service. The disadvantages associated with the methods disclosed in the prior art 1 are as follows:
○ Slave, i.e., S, doesn’ t accept read or write request, and only receives replication data.
○ Using remote log, only RPO is reduced by this design. But actual data records transfer is still using asynchronous replication.
○ Recovering pending data from the remote log files during master failure still takes a long time (RTO = ~Minutes) as the remote log files contain data records for all the tables. Filtration of the required records for each region/table partition takes time at the minute level (RTO = ~Minutes) .
○ Cluster switch takes few minutes till Regions recovered/refresh stores.
○ Read from Slave is not supported. Even if enabled, data is stale (as data copy is still through async replication) .
US20170161350A1, “Synchronous replication in a distributed storage environment” , referred to as prior art 2 discloses: To achieve synchronous replication both an eventual consistency approach and a strong consistency approach are contemplated. Received data may be written to a log of a primary data store for eventual committal. The data may then be annotated with a record, such as a unique identifier, which facilitates the replay of the data at a secondary data store. Upon receiving an acknowledgment that the secondary data store has written the data to a log, the primary data store may commit the data and communicate an acknowledgment of success back to the client. In a strong consistency approach, the primary data store may wait to send an acknowledgement of success to the client until it receives an acknowledgment that  the secondary has not only written, but also committed, the data. The disadvantages associated with the methods disclosed in the prior art 2 are as follows:
○ Solution is based on primary data store to secondary data store. Write is not supported in Secondary data store.
○ There is no solution for Recovery of the cluster in case of Crash;
○ If there is a communication failure between Primary & Secondary cluster, after the commit in secondary cluster, the record is visible to Secondary cluster but not available for read in primary cluster.
US8301593. “Mixed mode synchronous and asynchronous replication system” , referred to as prior art 3 describes: A replication system that includes an asynchronous replication mode and a synchronous replication mode replicates data associated with a plurality of transactions. Each transaction has one or more transaction steps or operations. A first set of transaction steps or operations are allocated to the plurality of appliers on an object-by-object basis when the replication system operates in asynchronous replication mode. A second set of transaction steps or operations are allocated to the plurality of appliers on a transaction-by-transaction basis when the replication system operates in synchronous replication mode. The replication system further includes one or more originating nodes, and the requests for the first and second sets of transaction steps or operations to execute on an originating node can be initiated during the same time period. The disadvantages associated with the methods disclosed in the prior art 3 are as follows:
○ The transaction is originated and committed by the Client, and is not automatically managed by the distributed database. For the failure scenarios when one cluster is down, the client needs to take care of waiting till synchronization to the secondary/Standby database is completed.
○ Transaction is global and is not at partition level which needs separate transaction partitioning design.
○ Solution for recovery in case of crash and Out of Order records during communication failures are not covered in this design.
WO2017124938A1, “A method, device, and system for data synchronization” , referred to as prior art 4, describes: The method for data synchronization comprises: when a data modification of a source database is determined by a data synchronization source end, the data synchronization source end generates a real-time notification with respect to the present instance of data modification and transmits the real-time notification to a data synchronization target end; when the real-time notification is received by the data synchronization target end, the data synchronization target end parses information related to the data modification from the real-time notification and updates a target end database buffer on the basis of the parsing result. When the real-time notification is received by the target end, the target end can update directly the local database buffer on the basis of the information carried in the real-time notification or start monitoring of the local database so as to update buffered data as soon as synchronization of local data is completed, thus achieving the effect of reducing buffer synchronization and update delay. The disadvantages associated with the methods disclosed in the prior art 4 are as follows:
○ There is no row lock coordination between both the clusters. If same row of a table is being modified in both clusters, atomicity is not guaranteed
In view of the above, there is scope for improvement in the existing replication mechanism for distributed database storage systems, especially in the context of transaction management during the replication process at the partition level of distributed databases like HBase. Existing solutions in the industry focus on asynchronous data transfer between the clusters, which can support only eventual consistency and cannot meet customer requirements that need strong consistency. A traditional system that uses an active-standby based architecture takes a long time to become available to the user due to the long recovery process of in-flight records that are yet to be synchronized.
SUMMARY
This summary is provided to introduce concepts related to synchronous replication methods employed in active-active cluster mode deployment where both clusters can support read and write at the same time. In addition, recovery mechanism and automatic recovery after failure and setting back to sync replication for such active-active cluster mode deployment, are also discussed.
A main objective of the present disclosure is to provide a method of synchronizing data from one cluster to another in an active-active mode deployment of a distributed database system, such as HBase, and also recovery of data from any of the failed cluster during disaster. The main objectives of the present disclosure may be summarized as:
- Support multi active cluster yet consistent data results.
- Both clusters are made accessible to users to achieve load balance.
- Optimizes the time taken for recovery of a crashed cluster using a bulk loading of prepared data files rather than record by record.
- Provide instant fail back to healthy cluster in case one cluster is unhealthy or crashed.
- Track the communication failure records and ensure the transactions are in the same state in both the clusters.
- Achieve RPO = 0.
- Achieve RTO ~ 0 (If one cluster is down, client can switch to other cluster) .
In a first implementation, the present disclosure provides a synchronous data replication method from a first cluster to a second cluster. The first cluster and the second cluster work in active-active cooperation with each other. The method comprises of receiving a write command from a client server in a first region  server of the first cluster, replicating, by the first region server, the received write command, in a second region server of the second cluster; wherein the step of replicating including committing, by the second region server, a record associated with the write command in a second write ahead log (WAL) of the second region server and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server. Further, the method comprises of committing, by the first region server, the record associated with the write command, to a first write ahead log (WAL) of the first region server and subsequent to the committing the record to the first WAL, committing, by the first region server, the record to a first memory of the first region server.
In one implementation, the first region server commits in the second phase, i.e., committing is done first in the remote cluster and only then in the local cluster.
In another implementation, the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
In yet another implementation, a system for synchronous data replication is disclosed. The system comprises a first cluster and a second cluster. The first cluster comprises a first region server, the first region server comprising a first memory, a first write ahead log (WAL) and a first file store, and the second cluster comprises a second region server, the second region server comprising a second memory, a second WAL and a second file store. The first cluster and the second cluster work in active-active cooperation with each other. Further, the first region server is configured to receive a write command from a client server and replicate the received write command in the second region server, wherein the step of replicating includes committing, by the second region server, a record associated with the write command in the second WAL and, subsequent to committing the record to the second WAL, committing, by the second region server, the record to the second memory. Further, the first region server is configured to commit the record associated with the write command to the first WAL and, subsequent to committing the record to the first WAL, commit the record to the first memory of the first region server. And, the second region server is configured to perform replication of the received write command in the second region server.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit (s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
Fig. 1 illustrates an overview of Hbase architecture and its components in the prior art solutions.
Fig. 2 illustrates a server crash procedure employed in HBase in the prior art solutions.
Fig. 3 illustrates an overview of an active-active cluster architecture and its components in accordance with the teachings of the present disclosure.
Fig. 4 illustrates a synchronous data replication method from a first cluster to a second cluster, the first cluster and the second cluster working in active-active cooperation with each other, in accordance with the embodiments of the present disclosure.
Fig. 5 illustrates a method of transaction logging and managing out of records in accordance with an embodiment of the present disclosure.
Fig. 6 illustrates a method of committing the record in the first region server in accordance with an embodiment of the present disclosure.
Fig. 7 is a schematic illustration of a transaction management mechanism using coordinated lock between clusters, in accordance with an embodiment of the present disclosure.
Fig. 8 is a schematic illustration of a transaction management mechanism using coordinated lock between clusters, in accordance with an embodiment of the present disclosure.
Fig. 9 is a schematic illustration of a region recovery mechanism in accordance with an embodiment of the present disclosure.
Fig. 10 is a schematic illustration of a health monitoring mechanism and a single cluster write mode, in accordance with an embodiment of the present disclosure.
Fig. 11 is a schematic illustration of an automatic recovery and set back to sync replication mechanism, in accordance with an embodiment of the present disclosure.
Fig. 12 is a schematic illustration of an offline bulk load data scenario, in accordance with an embodiment of the present disclosure.
Figs. 13-15 illustrate a system for synchronous data replication comprising a first cluster and a second cluster in active-active cooperation with each other, in accordance with the embodiments of the present disclosure.
Fig. 16 illustrates a schematic representation of data node of an HBase cluster as discussed in the present application.
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the present disclosure and should not be construed as a limitation to the present disclosure.
DETAILED DESCRIPTION OF THE PRESENT DISCLOSURE
The following clearly describes the technical solutions in the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present disclosure.
The present disclosure can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the present disclosure may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the present disclosure.
A detailed description of one or more embodiments of the present disclosure is provided below along with accompanying figures that illustrate the principles of the present disclosure. The present disclosure is described in connection with such embodiments, but the present disclosure is not limited to any embodiment. The scope of the present disclosure is limited only by the claims and the present disclosure encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the present disclosure. These details are provided for the purpose of example and the present disclosure may be practiced according to the claims without some or all of these specific details. For the  purpose of clarity, technical material that is known in the technical fields related to the present disclosure has not been described in detail so that the present disclosure is not unnecessarily obscured.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood by those skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the present disclosure.
Although embodiments of the present disclosure are not limited in this regard, discussions utilizing terms such as, for example, “processing, ” “computing, ” “calculating, ” “determining, ” “establishing” , “analyzing” , “checking” , or the like, may refer to operation (s) and/or process (es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.
Although embodiments of the present disclosure are not limited in this regard, the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more” . The terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
The present disclosure relates to solutions that can support two active clusters and provide strong consistency to achieve high read and write availability by using synchronous replication of data records. In accordance with the present disclosure, the synchronous replication is achieved through partition level coordinated locks.
Fig. 3 illustrates an overview of the active-active cluster architecture and its components in accordance with the teachings of the present disclosure. A cluster generally refers to a group of computer systems, also referred to as nodes/servers, that have been linked or interconnected to operate closely together, such that in many respects they form a single computer. Distributed database systems, such as HBase, can support multi-active clusters. A multi-cluster architecture is a common solution for applications that need guaranteed read speeds and must avoid intermittent spikes in read latency in the event of network or disk issues. It also helps with load balancing and provides backup clusters for disaster recovery. When deployed in active-active mode, applications can use the other cluster as a fallback when the first cluster is not accessible, which makes the deployment highly available. Fig. 3 illustrates a cluster 1 and a cluster 2, both in active cooperation with each other, i.e., both clusters can support read and write at the same time. The cluster 1 and cluster 2 may be part of a distributed database system. In one such implementation, the distributed database system is HBase. In another implementation, the technical solutions of the present application may be applied to any distributed database system that can partition tables and employ WAL files to synchronize and recover the table data in the target cluster. Hereinafter, cluster 1 is also referred to as ‘a first cluster’, ‘a first active cluster’, or ‘a source cluster’, and cluster 2 is also referred to as ‘a second cluster’, ‘a second active cluster’, ‘a sink cluster’, or ‘a destination cluster’.
As explained with reference to Fig. 1, a typical HBase cluster may include one or more region servers; each of the region servers serves a set of regions and a region can be served only by one region server. The cluster 1 in Fig. 3 is shown with respect to one region server, also referred to as the first region server, which hosts a Region of a table managed by the HBase cluster, in the illustrated example. The first region server handles read and write requests from an application hosted by a client. The first region server includes a corresponding in-memory store and a WAL, which may also be referred to as the first memory and the first WAL, respectively. Additionally, the first region server also comprises a transaction status log, hereinafter referred to as ‘a first transaction status log’ in reference to the first region server of the first cluster. Additionally, the first region server also comprises a health sensor. The replication of a data I/O operation, such as a write command, from the first region server of cluster 1 to a corresponding region server of cluster 2, which is referred to as ‘a second region server’, is managed by a replication manager of the first region server. The replication manager also manages the health sensor of the first region server.
Similar to cluster 1, i.e., the first cluster, cluster 2, the second cluster, is shown in Fig. 3, and it works in active-active operation with the first cluster. Like cluster 1, cluster 2 in Fig. 3 is shown with respect to one region server, also referred to as the second region server, which hosts a Region of a table also hosted by the first region server of cluster 1, in the illustrated example. The second region server can support read and write at the same time as the first region server. The second region server owns a corresponding in-memory store, a WAL, a transaction status log and a corresponding health sensor. For ease of reference, corresponding to the second region server, these components are referred to as the second memory, the second WAL, the second transaction status log and the second health sensor, respectively. In addition, the second region server includes a row replication sink and an auto recovery handler, which shall be explained in more detail in the following description.
It is to be noted that cluster 2 uses the same mechanism as cluster 1 for writes. For simplicity, only one operation is described. Similarly, cluster 1 uses the same mechanism as cluster 2 for reads; a read command received for a respective region at cluster 2 is read from the persistent store files of the second region server hosting the corresponding region.
In accordance with the embodiments of the present disclosure, both clusters, cluster 1 and cluster 2, can write data concurrently and a read can be performed from any cluster; the results are strongly consistent with each other. When one cluster is down, the other cluster supports read and write.
In accordance with the embodiments of the present disclosure, collisions of writes between the two clusters are managed by using a coordinated partition level locking mechanism.
In accordance with an embodiment of the present disclosure, a write that failed in either cluster is not considered in query results.
In accordance with an embodiment of the present disclosure, health of the clusters is monitored and may be switched to local write or synchronous write modes automatically. Also, read and writes may be disabled on an unhealthy cluster.
In accordance with an embodiment of the present disclosure, a crashed cluster is recovered based on the last sync point.
The present disclosure provides a synchronous data replication method from a first cluster to a second cluster. The first cluster and the second cluster work in active-active cooperation with each other. Fig. 4 illustrates a synchronous data replication method in accordance with the present embodiment. The first cluster includes the cluster 1 and the second cluster includes the cluster 2, represented in Fig. 3.
At step 402, the first region server of the first cluster, i.e., cluster 1, receives a write command from a client server. The client server may be hosting an application, similar to the depiction in Fig. 3. Referring to Fig. 3, the write command is replicated from the first region server to the second region server, where a row replication sink at the second region server handles the replication mechanism. Specifically, at step 402, the first region server replicates the received write command in a second region server of the second cluster. Herein, the replication step includes committing, by the second region server, a record associated with the write command in a second write ahead log (WAL) of the second region server and, subsequent to committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server. The step of replication, involving committing the write command to the second WAL and the second memory, is handled by the row replication sink.
Upon committing the write command in the second region server, at step 404, the first region server commits the record associated with the write command to the first WAL of the first region server and, subsequent to committing the record to the first WAL, commits the record to a first memory of the first region server. In accordance with the teachings of the present disclosure, the write operation is performed first in the remote cluster, i.e., the second cluster, and then in the local cluster, i.e., in the first cluster. That is to say that the committing in the local cluster is done in a second phase, after committing in the remote cluster in the first phase.
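The remote-first ordering described above may be summarized in the following minimal sketch, written in Java for illustration only; the RegionServer interface, the Record type and the method names (commitToWal, commitToMemStore) are hypothetical placeholders used for this example and are not part of the actual HBase API.

```java
// Minimal sketch of the two-phase synchronous write path, assuming
// hypothetical helper types; not the actual HBase API.
public final class SyncReplicationWritePath {

    interface RegionServer {
        boolean commitToWal(Record record);      // append to write-ahead log
        boolean commitToMemStore(Record record); // apply to in-memory store
    }

    record Record(long sequenceId, byte[] rowKey, byte[] value) {}

    private final RegionServer localFirstRegionServer;
    private final RegionServer remoteSecondRegionServer;

    SyncReplicationWritePath(RegionServer local, RegionServer remote) {
        this.localFirstRegionServer = local;
        this.remoteSecondRegionServer = remote;
    }

    /** Returns true only if the record is durable in both clusters. */
    boolean write(Record record) {
        // Phase 1: replicate to the remote (second) cluster first.
        if (!remoteSecondRegionServer.commitToWal(record)) {
            return false;                        // remote WAL commit failed
        }
        if (!remoteSecondRegionServer.commitToMemStore(record)) {
            return false;                        // remote memstore commit failed
        }
        // Phase 2: only then commit in the local (first) cluster.
        if (!localFirstRegionServer.commitToWal(record)) {
            return false;                        // local failure is reported back
        }
        return localFirstRegionServer.commitToMemStore(record);
    }
}
```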
In accordance with a specific embodiment of the present disclosure, a transaction status of the data records associated with the received write command is maintained at both the first region server and the second region server. The transaction log of the first region server, also referred to as the first transaction log, maintains the transaction status of the data records being committed in the first WAL and the first memory, and the transaction log of the second region server, also referred to as the second transaction log, maintains the transaction status of the data records being committed in the second WAL and the second memory. By way of an example, Fig. 5a illustrates WAL sequence identification numbers (IDs) which correspond to the sequential data records written to a local WAL at a region server that receives a write command for a region being hosted by that region server. The WAL in the example may include the first WAL of the first region server referred to in Fig. 4 and Fig. 3, and the WAL herein may also include the second WAL of the second region server referred to in Fig. 4 and Fig. 3. Each entry in the WAL, i.e., identified by the corresponding sequence ID (1, 2, 3, 4, 5 in the example of Fig. 5a), has one or more corresponding entries in the local transaction log of the same region server. That is to say, an entry in the first WAL of the first region server will have one or more corresponding entries in the first transaction log of the first region server, and an entry in the second WAL of the second region server will have one or more corresponding entries in the second transaction log of the second region server. The entries in the transaction log are hereinafter referred to as transaction statuses since they indicate the status/progress of a data record being committed to a WAL. For ease of reference, the entries of the first transaction log may be referred to as the first transaction status and the entries of the second transaction log may be referred to as the second transaction status. A transaction status is one of three types: in-progress, success, or failed/timeout, wherein the in-progress status indicates that the record is not yet committed to the WAL, the success status indicates a record being successfully committed to the WAL, and the timeout/failed status indicates that the record has failed to be committed to the WAL. Each record entry in the WAL may have at least two transaction statuses in the transaction log, the last updated entry being considered the latest transaction status of the record. Referring to the example of Fig. 5a, a record with sequence ID 1 initially has an in-progress status in the transaction log, which is later updated to success once the committing to the WAL is successful. The last updated entry of the record with sequence ID 1, i.e., success, will be considered the latest transaction status of that record. For another record in the WAL with sequence ID 3, there are three transaction statuses: in-progress, success, and timeout/failed. This is a scenario where, although the committing of the record (with sequence ID 3) was successful in the WAL of that region server, the committing failed in the local WAL of the second region server of the second cluster. Thus, the transaction status in the transaction log is re-updated to failed/timeout, reflecting that the commit failed in the second region server. The last entry in the transaction log is always considered the latest transaction status.
All locally timed-out records are reported to the remote cluster to correct the status in the remote memory as well. To this end, according to the embodiments of the present disclosure, the committing of a record in the local memory of the remote cluster is successful only when the committing is successful in the local cluster, thus maintaining strong consistency during the synchronous replication mode between the remote cluster and the local cluster.
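The transaction log semantics described above (append-only status entries per WAL sequence ID, with the last entry taken as the latest status) may be sketched as follows. This is an illustrative Java sketch only; the class name TransactionStatusLog and its methods are assumptions of this example and not an existing HBase structure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-region transaction status log keyed by WAL sequence ID.
// The types and names here are illustrative only.
public final class TransactionStatusLog {

    public enum Status { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    // Each WAL sequence ID maps to an append-only list of status entries;
    // the last appended entry is the latest transaction status.
    private final Map<Long, Deque<Status>> entries = new ConcurrentHashMap<>();

    public void append(long walSequenceId, Status status) {
        entries.computeIfAbsent(walSequenceId, id -> new ArrayDeque<>())
               .addLast(status);
    }

    /** Latest status wins, e.g. SUCCESS later re-updated to TIMEOUT_FAILED. */
    public Status latest(long walSequenceId) {
        Deque<Status> history = entries.get(walSequenceId);
        return history == null ? null : history.peekLast();
    }

    public static void main(String[] args) {
        TransactionStatusLog log = new TransactionStatusLog();
        log.append(3, Status.IN_PROGRESS);
        log.append(3, Status.SUCCESS);
        // Remote commit failed later, so the status is re-updated.
        log.append(3, Status.TIMEOUT_FAILED);
        System.out.println(log.latest(3)); // TIMEOUT_FAILED
    }
}
```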
Fig. 6 illustrates a further embodiment of the method illustrated in Fig. 4. Specifically, Fig. 6 discloses the method comprising committing the record in the second region server during replication of the received write command at the second region server, illustrated and explained with reference to Fig. 4. The step of committing the record to the second memory of the second region server comprises, at step 602 of Fig. 6, updating, by the second region server, a second transaction status of the record in a second transaction log maintained in the second region server. The term ‘second’ used here signifies that the transaction log belongs to the second region server and the transaction statuses belong to the entries of such second transaction log. As explained in relation to Fig. 5a above, the second transaction status is indicative of at least one of the following current statuses of the record in the second region server:
- in-progress status;
- success status
- timeout/failed status;
The in-progress status indicates that the record is not yet committed to the second WAL, the success status indicates a record being successfully committed to the second WAL, and the timeout/failed status indicates that the record has failed to be committed to the second WAL. Each record has at least two second transaction status entries in the second transaction log. The last updated entry for a record is the last entry in the second transaction log, indicative of the latest transaction status of the record.
Further, at step 604, the second region server sends the latest transaction status of the record to the first region server.
At step 606, the first region server receives the latest second transaction status of the record from the second region server.
At step 608, the first region server may commit the record in the first memory of the first region server when the latest second transaction status received for the record from the second region server, is success.
In one embodiment, the first region server may commit the record in the first memory only when the latest second transaction status received for the record from the second region server is success. In another embodiment, the first region server may have committed the record in the first memory and subsequently receive the latest second transaction status for the record from the second region server as failed/timeout. This is the scenario illustrated in the example of Fig. 5a. Thus, the transaction status in the first transaction log is re-updated to failed/timeout, reflecting that the commit for the record failed.
In accordance with a further embodiment of the method illustrated in Fig. 4, the method comprising committing the record in the first region server comprises updating, by the first region server, a first transaction status of the record in a first transaction log maintained in the first region server. The term ‘first’ used herein signifies that the transaction log belongs to the first region server and the transaction status belong to the entries of such first transaction log. As explained  in relation to Fig. 5a above, the first transaction status is indicative of at least one of the current status of the record in the first region server.
In the present embodiments, both the first region server and the second region server maintain local transaction logs, the first transaction log and the second transaction log, respectively. All local timeouts are reported to the remote cluster to correct the transaction status in the remote cluster as well, thereby also correcting the transaction status in the remote in-memory store.
In another example, consider a scenario in which the second region server has committed the record to the second memory and thereby updated the second transaction status to ‘success’. However, the latest first transaction status for the record in the first transaction log of the first region server is ‘timeout/failed’. The first region server sends the latest first transaction status of the record to the second region server. Upon receiving the latest first transaction status of the record at the second region server, the second region server detects that the first transaction status of the record is a timeout/failed status. On detecting this, the second region server re-updates the second transaction status of the record in the second region server to reflect the timeout/failed status in the second region server, the same as in the first region server. Consequently, the record in the second memory also has to be updated to reflect the timeout/failed status corresponding to the transaction status of the transaction log. In the scenario taken in the present example, the second region server had initially committed the record to the second memory. However, with the re-updated second transaction status of the record, the second region server may have to fail or time out the record in its second memory.
In accordance with the further embodiments, a method of rewriting files in the file store of the second region server is disclosed. By way of an example, taking the same scenario discussed above where the second region server had initially committed the record to the second memory and then re-updated the second transaction status of the record to a fail/timeout status, it is possible that the record has already been flushed from the second memory to the file store (HFiles) of the second region server, referred to as the second file store. According to an implementation of the present disclosure, if the record is already flushed from the second memory to the second file store of the second region server and the timeout/failed status of the record is detected as the latest first transaction status at the second region server, the second region server rewrites files in the second file store to remove all transactions of the record from the second region server. Herein, transactions of the record involve all the transactions corresponding to the write command received in the first region server and replicated at the second region server. Thus, according to the embodiments of the present disclosure, the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster. To this end, reference is made to the illustration of Fig. 5b. The first region server comprises the in-memory store ‘Mem Store’, also referred to as the first memory, and the file store (HFiles), also referred to as the first file store. In the first memory, the transaction statuses of all the records are updated according to the entries in the corresponding transaction log. While querying a record from the in-memory store, or while flushing records from the in-memory store to the HFiles, only successful transactions are considered, i.e., the records which have a successful transaction status in memory. In the examples of Figs. 5a and 5b, color coding may be used to refer to the different types of transaction status: in Fig. 5b, the color green may correspond to a success status, the color yellow may correspond to an in-progress status, and the color red may correspond to a fail/timeout status. Taking the color coding as an example, the record with transaction ID 3 has the latest transaction status as failed/timeout. As seen in the example of Fig. 5b, a Scanner scans the records of the mem store of the first region server, and reads or flushes only the successful transactions, i.e., the green color coded records with transaction IDs 1 and 4, from the mem store. According to one implementation of the present disclosure, the Scanner (query scanner) is a component of the first region server. However, in another embodiment the Scanner may be outside the first region server, for example, in the HMaster. The transaction status of the failed/timeout record with transaction ID 3 is reported to the remote cluster to correct the status in the remote memory as well.
Considering that the corresponding record with transaction ID 3 has already been flushed to the second file store from the mem store of the remote region server, according to the embodiments of the present disclosure, the remote region server rewrites files in its file store to remove the transactions of the record with transaction ID 3. By way of an implementation of the present disclosure, a repairing mechanism is implemented by the remote region server. As shown in Fig. 5b, an HFile Repair Chore is implemented in the Remote Region that will rewrite the files containing invalid transactions, which in this example case is the transaction with ID 3. As illustrated in Fig. 5b, initially the HFile in the Remote Region had the successful transactions in the sequential order 1, 3 and 4, which were initially the successful in-memory transactions in the remote region. Upon receiving the report of the invalid transaction ID 3, the HFile Repair Chore repairs the files in the background and removes the transactions of transaction ID 3 from the files, after which the HFile in the Remote Region contains the successful transactions in the sequential order 1 and 4, with 3 removed. To this end, the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server. And if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
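A minimal sketch of the transaction-aware filtering described above is given below, assuming hypothetical types (StoredRecord, StatusLookup); it only illustrates the visibility rule that queries, flushes and the background file rewrite consider solely records whose latest transaction status is success.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of a transaction-aware scan/flush filter: only records
// whose latest transaction status is SUCCESS are visible to queries, flushes,
// or the file-rewrite (repair) step. Names are hypothetical.
public final class TxAwareScanner {

    public enum Status { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    public record StoredRecord(long transactionId, byte[] rowKey, byte[] value) {}

    public interface StatusLookup {
        Status latestStatus(long transactionId);
    }

    private final StatusLookup statusLookup;

    public TxAwareScanner(StatusLookup statusLookup) {
        this.statusLookup = statusLookup;
    }

    /** Used when answering queries or flushing the memstore to store files. */
    public List<StoredRecord> visibleRecords(List<StoredRecord> candidates) {
        return candidates.stream()
                .filter(r -> statusLookup.latestStatus(r.transactionId()) == Status.SUCCESS)
                .collect(Collectors.toList());
    }

    /** Used by a background repair task to rewrite a file without invalid transactions. */
    public List<StoredRecord> rewriteWithoutInvalid(List<StoredRecord> fileContents) {
        return visibleRecords(fileContents);
    }
}
```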
In view of a further embodiment of the present disclosure, the method of concurrently committing the record to the first memory of the first region server and the second memory of the second region server comprises coordinated row lock management in the first cluster and the second cluster. In accordance with an additional embodiment of the present disclosure, when there is a collision of incoming transactions/write commands in the first cluster and the second cluster, the transaction that arrived first is given priority and the other one is queued next.
In accordance with a specific embodiment of the present disclosure, the method of concurrently committing the record to the first memory of the first region server and the second memory of the second region server comprises locking a first row, by the first region server, where the first row belongs to a region hosted by the first region server and the first row corresponds to the write command received at the first region server. The first region server replicates the received write command in the second region server. By way of an example, reference is made to Fig. 7 (described later). On obtaining a write command corresponding to row=100 in the first region server of cluster 1, the row 100 is locked in the region server of cluster 1 and the transaction is sent to the row replication sink handler 7A of the second region server of cluster 2 for replication. Further, the method comprises locking a second row, by the second region server, where the second row belongs to the region hosted by the second region server and the second row corresponds to the write command replicated at the second region server. Herein, the term ‘second’ used before the term ‘row’ implies that the row belongs to a region hosted by the second region server of cluster 2. It is to be understood that the first row is the same as the second row. Referring to the example of Fig. 7, on locking row 100 for TX1 (write command) in the first region server of cluster 1, the row 100 is locked for TX1 (replication) in the second region server of cluster 2. Further, when the locking of the second row is successful, the method further comprises performing, in parallel, the committing, by the second region server, of the record in the second WAL and the committing, by the first region server, of the record in the first WAL.
According to an implementation of the present disclosure, the write operation is performed by locking the row on the local cluster (the first cluster), writing in the remote cluster (the second cluster) and then writing later in the local cluster. Further, committing to the WAL and in memory is done in the second phase, first in the remote cluster and then in the local cluster. When there is a collision, it is ensured that the transaction that arrived first is given priority and the other one is queued next. This may be done using timestamps. If the transaction timestamps also conflict, a hashing algorithm may be chosen to decide on conflict resolution (smallest Peer ID). According to another implementation of the present disclosure, once the remote lock is also successful, the writes (WAL and Memstore writes) in both clusters can be done in parallel. For the failed/timeout records, if a timeout happens during remote locking, the commit will be synchronized with one more transaction status by checking the RPC (remote procedure call). In accordance with the present implementation, lock coordination is local to regions and so transaction management is distributed. This has the advantage that there are no bottlenecks such as a global transaction manager or contention for locks.
In accordance with an embodiment of the present disclosure, the transaction management using coordinated lock between both clusters may be at a single row level. In accordance with another embodiment of the present disclosure, the transaction management using coordinated lock between both clusters may be at a batch level which may be atomic or non-atomic.
In one such implementation of single row lock management, the method comprises determining if another transaction arrived earlier at the second cluster, i.e., prior to the write command received in the first cluster. In such a case, the single row lock is retried at cluster 2. By way of an example, the client server may generate timestamps ts1, ts2 for the respective transaction 1 and transaction 2, where transaction 1 corresponds to the write command received at cluster 1 and transaction 2 corresponds to another write command received at cluster 2. On conflict, the region server may choose the Tx with t = min {ts1, ts2} . In other examples, a hashing protocol or a pluggable conflict resolution protocol may be chosen. In one specific embodiment of a single row lock scenario, the method of the implementation illustrated in Fig. 4 may further comprise receiving another write command from another client server in the second region server of the second cluster, the another write command also being associated with the record that is associated with the write command received in the first region server. In the implementation, the transaction 2 may arrive at the second region server of the second cluster from another client server and may have the timestamp ts2. Further, the second region server determines which of the transactions, i.e., the first write command or the second write command, reached the second region server first. For instance, the second region server determines if the write command from the another client server reached the second region server later than the write command from the client server at the first region server. According to the implementations of the present disclosure, the determining is performed by at least one of the following mechanisms:
- checking, by the second region server, a transaction timestamp associated with the write command and a transaction timestamp associated with the another write command;
- applying, by the second region server, a hashing protocol on the write command received from the first client server and the another write command received from the second client server;
- applying, by the second region server, a pluggable conflict resolution protocol on the write command received from the first client server and the another write command received from the second client server.
Any of the above mechanisms may be applied in determining which of the transactions arrived earlier at the second region server, and the lock may be obtained accordingly in the second region server.
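The timestamp-first rule with a deterministic tie-breaker can be sketched as below. This is a hedged illustration in Java; the Tx record and the choice of the smallest peer ID as the tie-breaker follow the example given above, and the names are not taken from any existing library.

```java
// Sketch of the collision-resolution rule described above: the transaction
// with the earlier timestamp wins; on a timestamp tie a deterministic
// tie-breaker (here the smaller peer/cluster ID) is applied. All names are
// illustrative.
public final class ConflictResolver {

    public record Tx(long timestamp, int peerId, long txId) {}

    /** Returns the transaction that should acquire the row lock first. */
    public static Tx firstToLock(Tx a, Tx b) {
        if (a.timestamp() != b.timestamp()) {
            return a.timestamp() < b.timestamp() ? a : b;   // earlier arrival wins
        }
        // Timestamp conflict: fall back to a deterministic rule such as
        // the smallest peer ID, so both clusters reach the same decision.
        return a.peerId() <= b.peerId() ? a : b;
    }

    public static void main(String[] args) {
        Tx tx1 = new Tx(1_000L, 1, 100);   // arrived at cluster 1
        Tx tx2 = new Tx(1_005L, 2, 200);   // arrived at cluster 2
        System.out.println(firstToLock(tx1, tx2).txId()); // 100
    }
}
```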
In accordance with a further embodiment, if it is determined that the another write command (transaction 2) from the another client server reached the second region server later than the write command (transaction 1) from the client server at the first region server, the second region server performs the step of committing the record associated with the write command (transaction 1) in the second WAL and, subsequent to committing the record to the second WAL, performs the step of committing, by the second region server, the record (transaction 1) to the second memory of the second region server.
In accordance with a further embodiment, if it is determined that the another write command (transaction 2) from the another client server reached the second region server before the write command (transaction 1) from the client server at the first region server, then prior to committing the record associated with the write command (transaction 1) at the second region server, the second region server performs the step of committing a record associated with the another write command (transaction 2) in the second WAL and, subsequent to committing the record to the second WAL, performs the step of committing, by the second region server, the record associated with the another write command (transaction 2) to the second memory of the second region server.
For the single row lock management, reference shall now be made to the flowchart illustration shown in Fig. 7 as well as Fig. 3, to illustrate an application scenario of coordinated row lock management. Cluster 1 and cluster 2 illustrated in Fig. 7 include the cluster 1 and cluster 2 respectively illustrated in Fig. 3. Further, the row replication sink RPC handler 7A shown in Fig. 7 includes the row replication sink of Fig. 3. Moreover, the transaction status RPC handler 7B shown in Fig. 7 is also understood as a component of the second region server of cluster 2, which is not shown in Fig. 3. The roles of the row replication sink RPC handler 7A and the transaction status RPC handler 7B shall be understood from the following description. In the example of Fig. 7, when a transaction, a write command from a client server, arrives in cluster 1, the transaction flow starts in cluster 1. In the example, the region server generates a timestamp for the transaction that arrived at cluster 1. The example illustrates that a transaction for row=100 having value 10 arrives in the first region server of cluster 1, for which timestamp=t1/ts1 is put. In the next step, a transaction ID/transaction number=TX1 is generated for this transaction having timestamp=t1/ts1. The first region server locks the row=100 and then attempts a lock for the row=100 in the second cluster. The replication sink RPC handler 7A receives the information from cluster 1, the information comprising the transaction number TX1, the locked row=100 originating from cluster 1 and the timestamp ts1 of the transaction TX1. The replication sink RPC handler 7A of the second region server of cluster 2 attempts the lock for row 100. The second region server of cluster 2 may have another transaction having transaction number TX2 with a generated timestamp t2/ts2 in progress. The second region server performs a try lock for row 100. If the row 100 is not yet locked, the lock is obtained at this step. However, if the row 100 is already locked for TX2, the second region server prefers the transaction with the older timestamp. Accordingly, in accordance with the embodiments of the present disclosure, it is verified whether TX1 having timestamp ts1 < TX2 having timestamp ts2. If it is confirmed that TX1 is older than TX2, the lock for row 100 is retried and subsequently obtained. However, if TX2 having timestamp ts2 < TX1 having timestamp ts1, the row 100 is to be locked and processed first for transaction TX2. The row-already-locked exception handler reports to the first region server of cluster 1 that the row 100 is locked for transaction TX2. Accordingly, the first region server will also release the lock for row 100 and retry the lock for row 100 after transaction TX2 has been committed in the remote cluster. Thus, the lock at row level is coordinated in both the clusters and, where there is a collision, it is ensured that the first arrived transaction (using the put timestamp) is given priority and the other one is queued next. Once the remote lock is also successful, the writes (WAL, Mem store writes) in both clusters can be done in parallel.
As stated in the embodiments described above, the write operation is performed by locking the row locally, writing in the remote cluster first and later in the local/source cluster. Committing is done in a second phase, first in the remote cluster and then in the local/source cluster. Referring to Fig. 7, once the lock for row 100 is obtained in the remote cluster, the write operation is performed in the remote cluster. If the result of the write in the second region server is success, the lock for row 100 is released in cluster 2 and the write operation is now initiated in the region server of cluster 1. It is pointed out that the status after writing in the remote cluster is ‘awaiting origin confirm’, which implies that the remote cluster awaits the success transaction status confirmation from the origin cluster for TX1. According to the described embodiments of the present disclosure, the committing to the WAL and the memory happens after the transaction log is updated with the latest transaction status as received from the origin/source cluster. The transaction status RPC handler 7B receives the transaction status of the record from cluster 1 and updates the transaction status in the transaction log of the second region server. Alternatively, if the result of the write in the second region server is failed, the lock for row 100 is released and the failed transaction status is reported to the source cluster, cluster 1. The first region server of cluster 1 releases the lock for row 100 and, on the remote failing, ends the processing of the transaction TX1. The process of obtaining the lock in the remote cluster is reattempted after a considerable period has elapsed.
When the remote write is a success and reported to the first region server of cluster 1, the write operation is initiated in the first region server of cluster 1 for TX1. After the write, the committing is done in a second phase in the region server of cluster 1, i.e., the committing is first done in cluster 2. If the result of the write operation performed in the first region server of cluster 1 is success, it is checked whether the transaction status for TX1 in the remote is a success. Subsequently, unless the transaction TX1 times out, the committing to the WAL and the in-memory store is performed in cluster 1. On committing in the region server of cluster 1, the same is reported to the transaction status RPC handler 7B of cluster 2, which subsequently updates the transaction log in the region server of cluster 2. If the transaction TX1 times out in the region server of cluster 1, the process will end. The transaction status RPC handler 7B of cluster 2 will accordingly update the transaction status of transaction TX1 in the transaction log of the region server of cluster 2.
If the result of the write operation performed in the first region server of cluster 1 is failed, the transaction status is updated in the transaction log of the region server of cluster 1, and also the failed status is reported to the transaction status RPC handler 7B of cluster 2. The transaction status RPC handler 7B of cluster 2 will accordingly update the transaction status of transaction TX1 in the transaction log of the region server of cluster 2. Based on the latest transaction status of TX1 in the remote cluster, the TX1 will be either committed or failed in the remote cluster.
In accordance with the embodiments of the present disclosure, the single row lock management can be extended to batch row lock management. The batch row lock management can be categorized into two types, atomic and non-atomic. The record corresponding to the write command received in the origin cluster 1 may have a transaction for a batch of rows, for either an atomic batch row request or a non-atomic batch row request received from the client server.
In accordance with one implementation of the present disclosure, the single row lock scenario illustrated in Fig. 7 may extend to an atomic batch lock scenario. The method may comprise the locking of first rows by the first region server and the locking of second rows by the second region server in response to an atomic batch row request by the client server. An atomic batch row request received from the client should complete all the row insertions or fail. So, a lock will be attempted on all the rows given in the client request. If any row lock is not available because the row is locked by an in-progress write in the remote cluster, the old locks are released and the attempt is retried. Once all the row locks are acquired in both clusters, the data write continues. An example illustration of the present implementation is shown in Fig. 8a, which shows that all the rows locked in cluster 1 corresponding to transaction TX1 (the received write command in the first region server) are attempted for lock in cluster 2. If any of the row locks is not available, for example, busy rows Ri, Rj (occupied, for example, for transactions TXa, TXb respectively) are not acquired for locks in cluster 2, the first region server of cluster 1 will have to release all the locks initially obtained for transaction TX1 in cluster 1. The first region server of cluster 1 will wait for TXa, TXb to eventually finish and then retry obtaining locks for all the rows corresponding to transaction TX1 in cluster 1, which is then re-attempted in cluster 2.
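A minimal sketch of this all-or-nothing lock acquisition is given below, using plain java.util.concurrent locks as stand-ins for the row locks; the class and method names are illustrative only and do not represent the HBase locking API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the all-or-nothing locking used for an atomic batch request:
// either every row lock in the batch is acquired or all already-acquired
// locks are released and the attempt is retried. Illustrative only.
public final class AtomicBatchLock {

    /** Tries to lock every row; on any failure releases all and returns false. */
    public static boolean tryLockAll(List<ReentrantLock> rowLocks) {
        List<ReentrantLock> acquired = new ArrayList<>();
        for (ReentrantLock lock : rowLocks) {
            if (lock.tryLock()) {
                acquired.add(lock);
            } else {
                acquired.forEach(ReentrantLock::unlock); // release the old locks
                return false;                            // caller retries later
            }
        }
        return true;                                     // continue the data write
    }
}
```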
In accordance with one implementation of the present disclosure, the single row lock scenario illustrated in Fig. 7 may extend to a non-atomic batch lock scenario. In response to a non-atomic batch row request by the client server, the method comprises locking a first batch of rows, by the first region server, the first batch of rows belonging to a region hosted by the first region server and corresponding to the write command received at the first region server. Reference is made to an example flowchart illustrated in Fig. 8b in relation to non-atomic batch row lock management. As a first step, a first batch of rows is locked by the first region server of cluster 1. Further, on obtaining locks for the first batch of rows, as understood from Fig. 7, the first region server of cluster 1 replicates the received write command in the second region server. The second region server attempts locking the second batch of rows belonging to the region hosted by the second region server, where the second batch of rows corresponds to the write command replicated at the second region server. It is understood that the terms ‘first’ and ‘second’ used herein refer to the batch of rows handled by the first region server and the batch of rows handled by the second region server, respectively; however, the first batch of rows is the same as the second batch of rows attempted for locks in the first region server in cluster 1 and the second region server in cluster 2, respectively. In the non-atomic batch row management, the second region server attempts acquiring locks for the second batch of rows. However, some rows may already be locked for other transactions in cluster 2. In that case, the second region server acquires locks on at least a first subset (SubBatch 1=OK) of the second batch of rows while failing to acquire locks on a second subset (SubBatch 2=NOK) of the second batch of rows. Herein, the first subset of the second batch of rows includes the rows which are available for locking and the second subset of the second batch of rows includes the rows not available for locks. On acquiring locks on the first subset of the second batch of rows and failing to acquire locks on the second subset of the second batch of rows, the second region server of cluster 2 reports the same to the first region server of cluster 1. The first region server releases the locks on a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server. See the example shown in Fig. 8b, where the flow indicates that in cluster 1 the lock on SubBatch 2 is released and the mutation for SubBatch 1 is continued, as described with reference to Fig. 7. Consequently, the method comprises performing, in parallel, the committing, by the second region server, of the record in the second WAL, the step of committing by the second region server being performed on the first subset of the second batch of rows, and the committing, by the first region server, of the record in the first WAL, the step of committing by the first region server being performed on the first subset of the first batch of rows. Again, the first region server obtains locks for the second subset of the first batch of rows, which previously failed to be acquired in the remote cluster.
Further, on obtaining locks for the second subset of the first batch of rows in cluster 1, the second region server re-attempts obtaining locks for the second subset of the second batch of rows. As illustrated in Fig. 8b, on locking SubBatch 2, and possibly a next batch of rows, in the first cluster, the next possible mini-batch locks are attempted in cluster 2. The process continues in order to perform the step of parallel committing, by the second region server, of the record in the second WAL on the second subset of the second batch of rows, and the committing, by the first region server, of the record in the first WAL on the second subset of the first batch of rows.
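The sub-batch split described above can be illustrated with the following hedged Java sketch; SubBatch 1 corresponds to the rows whose locks were acquired and SubBatch 2 to the rows that will be retried as the next mini-batch. The types are placeholders, not HBase classes, and the example assumes one lock per row in the same order as the row list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of non-atomic batch locking: rows that could be locked (SubBatch 1)
// proceed to the parallel WAL commit, rows that could not (SubBatch 2) are
// retried as the next mini-batch. Illustrative names only.
public final class NonAtomicBatchLock {

    public record Split(List<Integer> lockedRows, List<Integer> pendingRows) {}

    /** Assumes rows and locks have the same size and order. */
    public static Split partition(List<Integer> rows, List<ReentrantLock> locks) {
        List<Integer> locked = new ArrayList<>();   // SubBatch 1 = OK
        List<Integer> pending = new ArrayList<>();  // SubBatch 2 = NOK, retried later
        for (int i = 0; i < rows.size(); i++) {
            if (locks.get(i).tryLock()) {
                locked.add(rows.get(i));
            } else {
                pending.add(rows.get(i));
            }
        }
        return new Split(locked, pending);
    }
}
```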
In accordance with further embodiments of the present disclosure, a server crash procedure implemented in accordance with the present disclosure is disclosed. During a disaster, if one of the region servers, for example the first region server of cluster 1, fails, the master server implements the server crash procedure in accordance with the teachings of the present disclosure. In describing the embodiments of the present disclosure, the description of Fig. 2a is incorporated herein for reference. According to one implementation of the present disclosure, the files of a failed region server are recovered and flushed to another region server in the cluster. For recovering the files from a failed region server, for example the first region server in the first cluster, the method comprises invoking a server crash recovery procedure in a master server managing the assignment of regions in the first cluster and accordingly managing the assignment of the region being hosted by the first region server. On invoking the server crash recovery procedure in the master server, the method further comprises identifying, by the master server, one or more WALs belonging to the first region server and assigning, by the master server, recovery of records from each of the one or more WALs to other region servers of the first cluster. The description of Fig. 2a describing the steps of the WAL splitter and assign region is incorporated herein by reference. An assigned region server collects the records from a WAL from the one or more WALs of the first region server. The assigned region server splits the records from that WAL to segregate the records according to the corresponding regions for which records are available in the WAL. Further, the assigned region server creates an edit file for each set of segregated records (i.e., one edit file for each set of records split region-wise). Also, the assigned region server creates a transaction log file for each edit file created by the assigned region server. The creation of the transaction log file by the WAL splitter is not available in the prior art recovery mechanism illustrated in Fig. 2b. The transaction log file created for an edit file is used to load the transaction status for the records of that edit file into a corresponding transaction log of the assigned region server. Therefore, the assigned region server loads the edit file for the region being assigned to that region server in memory and also loads the transaction status of each of the records of the edit file in memory. This is handled during the processing by the Open Region Handler of that assigned region. The assigned region server writes the edit file corresponding to one region (which is assigned to the region server) locally to a corresponding memory of the assigned region server. Further, the assigned region server loads the transaction log file for the edit file into a corresponding transaction log of the assigned region server. A latest transaction status for each record from the edit file is derived from the transaction log file corresponding to that edit file. Eventually, the assigned region server flushes the edit files in memory to a file store (HFile) of the assigned region server. In a specific implementation of the present disclosure, the flushing comprises checking the transaction status of each record from the edit file loaded in the memory of the assigned region server.
The records which have a failed/time out status are not flushed to the file store, the records which have an in-progress status are reloaded in-memory, and the records which have a success status are flushed to the file store of the assigned region server.
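A compact sketch of this flush rule is shown below. It is an illustrative Java example only; the RecoveredEdit type and the partition into flushed and reloaded records are assumptions made to mirror the three statuses described above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the recovery flush rule: failed/timed-out records are dropped,
// in-progress records are reloaded into memory for the next cycle, and
// successful records are flushed to the store file. Illustrative only.
public final class RecoveryFlush {

    public enum Status { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    public record RecoveredEdit(long sequenceId, Status latestStatus) {}

    public record FlushResult(List<RecoveredEdit> flushedToStore,
                              List<RecoveredEdit> reloadedInMemory) {}

    public static FlushResult flush(List<RecoveredEdit> edits) {
        List<RecoveredEdit> flushed = new ArrayList<>();
        List<RecoveredEdit> reloaded = new ArrayList<>();
        for (RecoveredEdit edit : edits) {
            switch (edit.latestStatus()) {
                case SUCCESS -> flushed.add(edit);        // goes to the HFile
                case IN_PROGRESS -> reloaded.add(edit);   // kept for the next flush cycle
                case TIMEOUT_FAILED -> { /* dropped, not flushed */ }
            }
        }
        return new FlushResult(flushed, reloaded);
    }
}
```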
By way of an example, reference is now made to Fig. 9 which depicts an illustration of the above-described implementation.
- Flows marked 1, 2 and 3 depict splitting of the transaction log per region along with the region edits while processing recovery [1, 2, 3].
- Flow marked 4 depicts that the master server assigns the region to another region server of the cluster, where the open region handler of the assigned region loads recovered edits for the assigned region.
- Flow marked 5 depicts loading recovered edits during region open and applying the transaction status from the recovered transaction log to get back the original status before the crash.
- Flow marked 6 depicts flushing from memory to store files using a Tx-aware scanner (corresponding to the similar scanner illustrated and explained with reference to Fig. 5b). Here, failed transactions are ignored.
- Flow marked 7 depicts loading back all uncommitted (in-progress) transactions to memory. These will be flushed to an HFile in the next cycle.
- Flow marked 8 depicts that the latest flushed Tx ID for the region can be stored in master node to purge complex Tx logs and identify starting offset during next recovery.
The master server groups all the WAL split tasks and assigns them to the original region server which created the WAL file, if it is available. This can be identified from the folder and file names where the WAL file is present in DFS. This ensures local reads during the WAL split and saves network bandwidth.
According to the next embodiment of the present disclosure, a method of monitoring the health of the first cluster and the second cluster is disclosed. Specifically, according to the disclosed embodiment, the health of both the first cluster and the second cluster is monitored so as to fall back to a single cluster write mode on detecting an unhealthy status for either of the first cluster and the second cluster. Herein, the first cluster and the second cluster include the embodiments disclosed above in Figs. 3 to 9. According to one implementation, the clusters’ health is monitored locally at the clusters and reported to the respective local zookeeper and the remote zookeeper. According to another implementation, the clusters’ health is monitored by the local zookeeper and the remote zookeeper. The different cluster statuses are, for example, as illustrated in Fig. 10. Each of cluster 1 and cluster 2 includes a set of first region servers (RS, RS, RS) and a set of second region servers (also denoted RS, RS, RS), respectively. Each of cluster 1 and cluster 2 is managed by a respective zookeeper (ZK), which may also be referred to as the first zookeeper and the second zookeeper, respectively. Fig. 10 shows the zookeeper (ZK) managing cluster 2. In accordance with the embodiment of the present disclosure, the health of cluster 1 is monitored by its local zookeeper (see 10a) and also by the remote zookeeper (see 10b). Similarly, the health of cluster 2 is monitored by its local zookeeper (see 10b) and also by the remote zookeeper (see 10a). To this end, the first zookeeper monitors the health of the first cluster and may also share the information on the health of the first cluster with the second zookeeper. Likewise, the second zookeeper monitors the health of the second cluster and may also share the information on the health of the second cluster with the first zookeeper. The first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster. Further, the different cluster statuses are, for example:
1) Local write: The local write status implies that a respective cluster/a region server of the cluster is allowed to write only locally.
2) Active: The active status implies that a respective cluster/a region server of the cluster can write and synchronize data to the remote cluster too.
3) Un-healthy: The unhealthy status implies that a respective cluster/a region server of the cluster cannot support read or write operations till recovered.
4) Recovering: The recovering status implies that a respective cluster/a region server of the cluster holds the write till the status is changed to Active.
Referring to Fig. 10, both clusters, cluster 1 and cluster 2, coordinate their statuses mutually in the peer zookeeper (see 10a, 10b). The first zookeeper of cluster 1 gets the status of the region servers (RS) as well as the master server (Master) of cluster 1 and accordingly sets a health mode status (my status) which identifies the health status of cluster 1. See 10a, which is the information updated by the first zookeeper, the information including the health status of the first cluster as well as the health status of the peer cluster (cluster 2) notified by the second zookeeper. In the example of Fig 10, the 10a information includes a peer status (C2): Unhealthy, which implies that cluster 2 is notified to be unhealthy by the remote cluster. In the local cluster 1, the master is Up, the region server 1 (RS1) is Up, the region server 2 (RS2) is Down and the region server 3 (RS3) is Down. Since the peer status is unhealthy, the first zookeeper sets the health status mode (my mode) as Local. In cluster 2, the information (10b) updated by the second zookeeper includes the health status of the second cluster as well as the health status of the peer cluster (cluster 1) notified by the first zookeeper. In the example of Fig 10, the 10b information includes a peer status (C1): Active, which implies that cluster 1 is notified to be active by the remote cluster. In the local cluster 2, the master is Up, the region server 1 (RS1) is Up, the region server 2 (RS2) is Up and the region server 3 (RS3) is Down. Since cluster 2 may be recovering, the second zookeeper sets the health status mode (my mode) as Unhealthy.
In the above example, during a write, if communication to the remote region server fails, the local status is validated in the cluster. Depending upon the health status of the cluster, the write/read operations in the respective clusters may be performed according to one of the following scenarios (a simplified sketch of this decision flow is given after the list below):
If Unhealthy – Reject reads/writes
If Local Mode – Ignore and continue with the local write
If Sync Mode (replication between the local and remote cluster) – Check the peer status in ZK
-- If Peer is Unhealthy -- fall back to local mode [Update status in ZK]
-- If Peer is Active and the RS is down -- Retry and reject based on timeout
-- If Peer is Recovering -- Wait till the state changes to Active
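By way of an illustrative, non-limiting sketch, the following Java snippet summarizes the above decision flow taken by a region server when a write to the remote region server fails. The status names mirror the statuses of Fig. 10, while the StatusStore interface is a hypothetical stand-in for reading and updating the statuses kept in the local and remote zookeepers; it is not a real ZooKeeper client API.

// Simplified sketch of the write-path health check described above.
public class HealthAwareWritePath {

    enum ClusterStatus { ACTIVE, LOCAL_WRITE, UNHEALTHY, RECOVERING }

    interface StatusStore {
        ClusterStatus myStatus();    // status of the local cluster, as published in the local ZK
        ClusterStatus peerStatus();  // status of the peer cluster, as mirrored from the remote ZK
        void fallBackToLocal();      // update the local ZK so future writes skip sync replication
    }

    enum Decision { REJECT, LOCAL_ONLY, SYNC_REPLICATE, RETRY_LATER }

    static Decision onRemoteWriteFailure(StatusStore zk) {
        ClusterStatus mine = zk.myStatus();
        if (mine == ClusterStatus.UNHEALTHY)   return Decision.REJECT;      // reject reads/writes until recovered
        if (mine == ClusterStatus.LOCAL_WRITE) return Decision.LOCAL_ONLY;  // ignore remote failure, keep writing locally

        ClusterStatus peer = zk.peerStatus();                               // sync mode: consult the peer status in ZK
        if (peer == ClusterStatus.UNHEALTHY) { zk.fallBackToLocal(); return Decision.LOCAL_ONLY; }
        if (peer == ClusterStatus.RECOVERING)  return Decision.RETRY_LATER; // hold writes until the peer is Active again
        return Decision.SYNC_REPLICATE;                                     // peer Active: retry, reject on timeout
    }
}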
In view of the above-described embodiment, the first region server rejects the write command from the client server when the first region server has an unhealthy status as determined from the first ZK.
In view of the above-described embodiment, the first region server commits the write command in the first WAL and the first memory and halts the replication to the second region server of the second cluster when the first region server has a local health status as determined from the first ZK.
In view of the above-described embodiment, the first region server halts the replication to the second region server of the second cluster when the second region server has a recovering health status as determined from the second ZK.
In view of the above-described embodiment, the first region server performs the committing of the write command as well as the replication to the second region server of the second cluster when the first region server has an active health status as determined from the first ZK and the second region server has an active health status as determined from the second ZK.
According to the next embodiment of the present disclosure, a method for automatic recovery of a cluster after a failover is disclosed. Specifically, the method comprises setting the recovering cluster back into sync replication with the active cluster according to the embodiments of the present disclosure. Reference shall now be made to Fig 11, which illustrates an example of setting a recovering cluster back into sync mode with the active cluster, in accordance with the present disclosure. In the example of Fig. 11, it is assumed that cluster 1 is recovering after a failover while cluster 2 is the current active cluster. It is understood that cluster 1 and cluster 2 refer to the first cluster and the second cluster described so far with reference to Figs 3 to 7. In Fig. 11, the master of the recovering cluster 1 coordinates and identifies all the tables and the regions that need to be synced up. The master server assigns a region server to start the recovery process, and the region recovery handler of that region server triggers a memstore flush in the active cluster for all regions that are part of the sync replication tables. Based on the last synced Tx number, all the HFiles are downloaded from the active cluster and bulk loaded into cluster 1. Subsequently, row sync is enabled in the source cluster, i.e., transaction management starts and all records from this point are synced. All the memory records that were put before the sync started are transferred. Once all the regions are synchronized, the cluster state of cluster 1 is set to Active.
In the current embodiment, after recovery, when cluster 1 comes up, the data that was written to the active cluster 2 while cluster 1 was down has to be recovered at cluster 1. The master server (HMaster) of cluster 1 coordinates and identifies the tables and the regions which have to be synchronized. Specifically, the HMaster may coordinate with cluster 2, for example with the ZK of cluster 2, in order to identify the tables and regions which received write commands while cluster 1 was down. As shown in Fig. 11, the HMaster of cluster 1 assigns a region server from cluster 1 to recover the data from cluster 2 that is to be synced. In the example, it is assumed that the HMaster assigns the first region server of the first cluster to recover data from the second region server of cluster 2, which is the active cluster. The first region server includes a region recovery handler to carry out the operations of recovering data from the second region server and enabling sync with the second region server according to the foregoing description. At flow 1, the first region server, being assigned the recovery operations, starts the recovery process implemented by the region recovery handler of the first region server. At flow 2, the region recovery handler recovers the transaction log (see recovered transaction log in Fig. 11) of the first region server, which shows that the last successful transaction committed in-memory of the first region server is the transaction TX4. Thus, the record identified with TX4 is the last record flushed into the file store (HFiles) of the first region server. At flow 3, the region recovery handler, having identified that the last successful transaction is TX4, downloads the HFiles created after the last flushed TX4 from the active cluster, i.e., cluster 2. In order to download or retrieve the HFiles from cluster 2, a flush instruction is sent from the first region server to the corresponding in-memory of the second region server of cluster 2. The flush instruction received at the in-memory of the second region server causes the records in-memory to be flushed to the corresponding file store (HFile) of the second region server. The HFiles of the second cluster, updated after the flushing from the in-memory to the file store of the second region server, are copied to the file store of the first region server in cluster 1. The HFiles thus pulled and copied to the file store of the first region server in cluster 1 are then bulk loaded to the in-memory of the first region server. Flows 4, 5 and 6 depict that, after bulk loading of the HFiles in-memory of the first region server, the transaction sync is resumed between the first region server in cluster 1 and the second region server in cluster 2. Reference is made to Figs 4 and 5, which already explain in detail the sync replication between active cluster 1 and active cluster 2 relying upon the respective transaction managers of the first region server and the second region server. That is, after bulk loading, both the first region server and the second region server resume updating their respective transaction logs, committing in both the first region server and the second region server, and downloading/flushing their respective in-memory data, thereby enabling sync replication for the region between cluster 1 and cluster 2, and the cluster 1 state is enabled as Active.
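By way of an illustrative, non-limiting sketch, the following Java snippet summarizes the sync-back recovery flow of Fig. 11 from the point of view of the region recovery handler. The LocalRegion and PeerRegion interfaces are hypothetical abstractions introduced only for this illustration; they do not correspond to actual HBase classes.

import java.util.List;

// Simplified sketch of the sync-back recovery flow of Fig. 11.
public class RegionRecoveryHandlerSketch {

    interface LocalRegion {
        long lastFlushedTx();                // last Tx recovered from the local transaction log (e.g. TX4)
        void bulkLoad(List<String> hfiles);  // load copied HFiles into the local file store / in-memory
        void enableSyncReplication();        // resume per-record sync replication with the peer
        void markActive();                   // report Active once the region is caught up
    }

    interface PeerRegion {
        void flushMemstore();                        // flush the peer in-memory records to HFiles
        List<String> hfilesAfterTx(long txId);       // HFiles containing data newer than txId
    }

    static void recoverRegion(LocalRegion local, PeerRegion peer) {
        long lastTx = local.lastFlushedTx();   // flows 1-2: recover the transaction log, find the last successful Tx
        peer.flushMemstore();                  // flow 3: force the peer to persist its in-memory records
        List<String> files = peer.hfilesAfterTx(lastTx);
        local.bulkLoad(files);                 // copy and bulk load instead of record-by-record replay
        local.enableSyncReplication();         // flows 4-6: resume Tx sync between the two clusters
        local.markActive();
    }
}

The design choice illustrated here is that recovery transfers whole store files newer than the last flushed transaction rather than replaying records one by one, which is what makes the cluster recovery fast.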
In the above-described embodiment, bulk loading of HFiles may imply an offline data load scenario, which is explained with reference to Fig. 12. Basically, sync replication between active cluster 1 and cluster 2 can be performed by writing data to both clusters, which is shown implemented by a dual cluster writer (flow 1) and which may, for example, be performed on calling a data import job. In the example shown in Fig 12, an Import TSV Map Reduce Job is called upon to perform the dual cluster write by the dual cluster writer. The other way of writing data into a cluster is the offline mode, which is denoted as Load Incremental Files in Fig. 12. In the offline mode, some files may be bulk loaded to the cluster and the files may then be directly loaded to the in-memory of the region server of the cluster. In the dual cluster write mode, the replication is record by record, which involves writing to the WAL, writing to the in-memory and flushing to the HFiles. In the bulk load manner, the HFiles are created offline and loaded directly into the file store of the cluster. The corresponding region for the records in the HFiles will then initiate loading the files from the HFiles to the in-memory of the region server of cluster 1. Also, the region server of the first cluster will send a replication request for the files to the corresponding region server of cluster 2. See, for example in Fig. 12, the row replication sink of a region server of cluster 2 receives the replication of the bulk loaded HFiles from cluster 1. The row replication sink will also bulk load the files into the file store of cluster 2, which will then be loaded in-memory of the region server of cluster 2. In the offline mode, although no WAL writing takes place for individual records, the replication WAL, i.e., the WAL of cluster 1, marks the event when some files have been bulk loaded in-memory instead of being written record by record. The bulk data load commit is synchronized to the peer cluster (cluster 2) through the row replication sink, similar to a single put record (single row lock transaction management), and the row replication can also take care of copying the files from the source cluster and loading them to the regions if the files are not copied by the Dual Cluster Writer.
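By way of an illustrative, non-limiting sketch, the following Java snippet shows how a bulk data load event could be marked in the replication WAL and forwarded to the row replication sink of the peer cluster, mirroring the Load Incremental Files path of Fig. 12. All type and method names are assumptions for illustration and are not HBase APIs.

import java.util.List;

// Simplified sketch of replicating a bulk data load instead of individual records.
public class BulkLoadReplicationSketch {

    record BulkLoadEvent(String table, String region, List<String> hfilePaths) {}

    interface ReplicationWal { void appendBulkLoadMarker(BulkLoadEvent event); }

    interface RowReplicationSink {
        // Peer side: bulk load the referenced files, copying them from the source cluster if needed.
        void applyBulkLoad(BulkLoadEvent event);
    }

    static void commitBulkLoad(BulkLoadEvent event, ReplicationWal wal, RowReplicationSink peerSink) {
        wal.appendBulkLoadMarker(event); // no per-record WAL entries; only the load event is recorded
        peerSink.applyBulkLoad(event);   // synchronized like a single put, under the same Tx handling
    }
}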
In view of the above-described embodiment, referring to Figs 3, 4 and 12, the method of setting the first region server of the first cluster back into sync replication with the second region server of the second cluster is disclosed, wherein the first region server and the first cluster have a recovering health status and the second region server and the second cluster have an active health status. The method comprises identifying, by a master server in the first cluster, regions to be synced from the second region server to the first region server. The master server assigns to the first region server the recovery process for syncing the identified records from the second region server to the first region server. The first region server recovers the first transaction log of the first region server to identify a last successful transaction and downloads files from the second file store of the second region server to the first file store of the first region server, the files belonging to transactions succeeding the last successful transaction. Thereafter, the first region server resumes syncing after the downloaded files in the first file store are loaded to the first memory of the first region server. Before resuming, the first region server transfers records in the first memory to the second memory of the second region server. After syncing, the first region server replicates write commands received from the client server at the first region server to the second region server.
According to one implementation, the step of downloading files from the second file store to the first file store comprises triggering, by the first region server, flushing of records from the second memory to the second file store in the second region server, the records pertaining to the regions identified by the master server in the first cluster. The second region server, upon receiving the triggering instruction, flushes the records to the second file store. Upon said flushing in the second region server, the first region server copies the files from the second file store to the first file store.
According to one implementation, the copying, by the first region server, includes bulk loading, by the first region server, the files from the first file store to the first memory.
In accordance with a further embodiment of the present disclosure, a system for synchronous data replication is disclosed. The system comprises a first cluster and a second cluster. In a presently disclosed implementation, the first cluster and the second cluster correspond to cluster 1 and cluster 2 described with reference to Fig. 3. Accordingly, the disclosed system performs the present disclosure as described with reference to Figs 4 to 12. Henceforth, the system shall be explained with reference to cluster 1 and cluster 2, incorporating all the elements and features explained in detail above.
By way of an example, Fig 13 illustrates a system 1300 according to the embodiments of the present disclosure. The system 1300 comprises a first cluster, cluster 1, and a second cluster, cluster 2. The first cluster comprises a first region server 1300A. It is understood that although the figure depicts a single region server 1300A included in the first cluster, the first cluster may include a plurality of first region servers, as shown in Fig 1. The first region server 1300A comprises a first memory 1302A (also referred to as 'Mem Store'), a first write ahead log (WAL) 1304A and a first file store 1306A (also referred to as HFiles/Persistent Store File). Likewise, the second region server 1300B comprises a second memory 1302B, a second WAL 1304B and a second file store 1306B. The first cluster and the second cluster work in active-active cooperation with each other. The first region server 1300A is configured to replicate write commands to the second region server 1300B in accordance with the embodiments of the present disclosure.
According to one implementation, the first region server 1300A is configured to receive a write command from a client server. The client server is not shown in Fig. 13, but can be understood to be hosting an application, similar to the illustration of Fig. 3, which sends read and write commands to cluster 1 and cluster 2. Further, the first region server 1300A replicates the write command in the second region server 1300B. Further, the second region server 1300B is configured to perform replication of the received write command in the second region server. The step of replicating includes committing, by the second region server 1300B, a record associated with the write command in a second WAL 1304B and, subsequent to committing the record to the second WAL 1304B, committing, by the second region server 1300B, the record to the second memory 1302B. Further, the first region server 1300A commits the record associated with the write command to the first WAL 1304A and, subsequent to committing the record to the first WAL 1304A, commits the record to a first memory 1302A of the first region server 1300A. The first region server 1300A may commit the record in a second phase, i.e., the committing is first done in the remote cluster and only then in the local cluster. In another implementation, the record corresponding to the write command received at the first region server 1300A is concurrently committed to the first memory 1302A of the first region server and the second memory 1302B of the second region server 1300B, as explained in detail with reference to Fig. 7.
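By way of an illustrative, non-limiting sketch, the following Java snippet shows the two-phase ordering described above, in which the record is committed to the WAL and memory of the remote (peer) region server before it is committed locally. The RegionServerView interface is a hypothetical abstraction used only for this illustration.

// Simplified sketch of the replication order: peer first, local only after the peer has committed.
public class SyncReplicationWriteSketch {

    interface RegionServerView {
        void appendWal(byte[] row, byte[] value);    // commit the record to the WAL
        void putMemstore(byte[] row, byte[] value);  // commit the record to the in-memory store
    }

    static void write(byte[] row, byte[] value, RegionServerView local, RegionServerView remote) {
        // Phase 1: replicate to the peer cluster.
        remote.appendWal(row, value);
        remote.putMemstore(row, value);
        // Phase 2: commit locally only after the peer has acknowledged.
        local.appendWal(row, value);
        local.putMemstore(row, value);
    }
}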
By way of an example, Fig 14 illustrates a system 1300 according to another embodiment of the present disclosure. The system 1300 includes the components of the system 1300 illustrated in Fig. 13. Further, the first region server 1300A comprises a first transaction log 1308A and the second region server 1300B comprises a second transaction log 1308B. The first transaction log 1308A and the second transaction log 1308B correspond to the first transaction log and the second transaction log, respectively, explained with reference to Fig 3 above. The second region server is configured to update a second transaction status of the record in the second transaction log 1308B. The second transaction status is indicative of at least one of the following current statuses of the record in the second region server: in-progress status, success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the second WAL 1304B, the success status indicates a record being successfully committed to the second WAL 1304B, and the timeout/failed status indicates that the record has failed to be committed to the second WAL 1304B; each record having at least two of the second transaction statuses in the second transaction log 1308B, the last updated entry being considered the latest second transaction status of the record. Further, the second region server 1300B is configured to send the latest second transaction status of the record from the second transaction log to the first region server 1300A. The first region server 1300A is configured to receive the latest second transaction status of the record from the second region server 1300B. The first region server 1300A is configured to commit the record in the first memory 1302A when the latest second transaction status received is the success status.
In a further example, referring to Fig. 14, the first region server 1300A is configured to update a first transaction status of the record in the first transaction log, the first transaction status indicative of at least one of the following current statuses of the record at the first region server: in-progress status, success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the first WAL 1304A, the success status indicates a record being successfully committed to the first WAL 1304A, and the timeout/failed status indicates that the record has failed to be committed to the first WAL 1304A. Each record has at least two transaction statuses in the first transaction log 1308A, the last updated entry being considered the latest first transaction status of the record.
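By way of an illustrative, non-limiting sketch, the following Java snippet models a per-region transaction log in which each record accumulates status entries and the last entry is treated as the latest transaction status, as described above for the first and second transaction logs. The class is a simplified illustration and not an HBase API.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a per-region transaction log with ordered status entries per transaction.
public class TransactionLogSketch {

    enum TxStatus { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    private final Map<Long, Deque<TxStatus>> log = new ConcurrentHashMap<>();

    void update(long txId, TxStatus status) {
        log.computeIfAbsent(txId, k -> new ArrayDeque<>()).addLast(status);
    }

    TxStatus latest(long txId) {
        Deque<TxStatus> entries = log.get(txId);
        return entries == null ? null : entries.peekLast();
    }

    public static void main(String[] args) {
        TransactionLogSketch txLog = new TransactionLogSketch();
        txLog.update(4L, TxStatus.IN_PROGRESS);  // record not yet committed to the WAL
        txLog.update(4L, TxStatus.SUCCESS);      // committed to the WAL; the memory commit may proceed
        System.out.println(txLog.latest(4L));    // prints SUCCESS
    }
}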
In a further example, the first region server 1300A is configured to send the latest first transaction status of the record to the second region server 1300B. The second region server 1300B is configured to receive the latest first transaction status of the record and detect that the first transaction status of the record is a timeout/failed status. On detecting, the second region server 1300B is configured to re-update the corresponding second transaction status of the record in the second transaction log to indicate the timeout/failed status in the second region  server; and the second region server 1300B is configured to update the record in the second memory 1302B to reflect the timeout/failed status for the record.
In a further example, if the record has already been flushed from the second memory 1302B to the second file store 1306B of the second region server 1300B and the timeout/failed status of the record is received as the latest first transaction status at the second region server 1300B, the second region server is configured to rewrite files in the second file store 1306B to remove transactions of the record corresponding to the write command.
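By way of an illustrative, non-limiting sketch, the following Java snippet shows the store-file repair described above: when a timeout/failed status is received for a record that has already been flushed, the affected file is rewritten without the cells of that transaction. The Cell record and the in-memory representation of the file are assumptions made only for this illustration.

import java.util.List;
import java.util.stream.Collectors;

// Simplified sketch of rewriting a store file to drop the cells of a failed transaction.
public class StoreFileRepairSketch {

    record Cell(byte[] row, long txId, byte[] value) {}

    // Rewrite the file contents, dropping every cell that belongs to the failed transaction.
    static List<Cell> rewriteWithoutTx(List<Cell> fileCells, long failedTxId) {
        return fileCells.stream()
                .filter(cell -> cell.txId() != failedTxId)
                .collect(Collectors.toList());
    }
}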
In a further example, the record corresponding to the write command received at the first region server 1300A is concurrently committed to the first memory 1302A of the first region server 1300A and the second memory 1302B of the second region server 1300B. If the record corresponding to the write command fails to be committed in any of the first region server 1300A and the second region server 1300B, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
In yet another example, in order to concurrently commit the record to the first memory 1302A of the first region server 1300A and the second memory 1302B of the second region server 1300B, the first region server 1300A is configured to lock a first row, where the first row belongs to a region hosted by the first region server 1300A and the first row corresponds to the write command received at the first region server 1300A. Reference for the embodiment disclosed herein can be found in the description of Figs 7, 8a and 8b above. The first region server 1300A is configured to replicate the received write command in the second region server 1300B. The second region server 1300B is configured to lock a second row, where the second row belongs to the region hosted by the second region server 1300B and the second row corresponds to the write command replicated at the second region server 1300B. When the locking of the second row is successful, the second region server 1300B is configured to commit the record in the second WAL 1304B and the first region server 1300A is configured to commit the record in the first WAL 1304A, the committing by the second region server 1300B being in parallel with the committing by the first region server 1300A.
In one of the implementations, the locking of the first row by the first region server 1300A and the locking of the second row by the second region server 1300B is in response to an atomic batch row request by the client server.
In another implementation, in response to a non-atomic batch row request by the client server, the first region server 1300A is configured to lock a first batch of rows, the first batch of rows belonging to a region hosted by the first region server 1300A and corresponding to the write command received at the first region server 1300A. Herein, the description of Fig. 8 is incorporated for reference. Further, the first region server 1300A is configured to replicate the received write command in the second region server 1300B. The second region server 1300B is configured to attempt locking of a second batch of rows, the second batch of rows belonging to the region hosted by the second region server 1300B and corresponding to the write command replicated at the second region server 1300B. Further, the second region server is configured to acquire locks on at least a first subset of the second batch of rows while failing to acquire locks on a second subset of the second batch of rows. Upon this, the first region server 1300A is configured to release locks on a first subset of the first batch of rows, the first subset of the first batch of rows corresponding to the first subset of the second batch of rows in the second region server. Consequently, the second region server 1300B is configured to commit the record in the second WAL 1304B, the committing by the second region server 1300B being performed on the first subset of the second batch of rows, and the first region server is configured to commit the record in the first WAL, the committing by the first region server being performed on the first subset of the first batch of rows, wherein the committing by the second region server is in parallel with the committing by the first region server. Subsequently, the first region server 1300A is configured to relock a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server. Also, the second region server 1300B is configured to re-attempt locking of the second subset of the second batch of rows in order to perform parallel committing, by the second region server 1300B, of the record in the second WAL 1304B on the second subset of the second batch of rows, and committing, by the first region server 1300A, of the record in the first WAL 1304A on the second subset of the first batch of rows.
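By way of an illustrative, non-limiting sketch, the following Java snippet compresses the non-atomic batch flow described above into retry handling: the subset of rows locked on both sides is committed in parallel, and the remaining subset is re-attempted. The detailed release/relock bookkeeping of Fig. 8 is simplified here, and the LocalRegion and PeerRegion interfaces are hypothetical abstractions, not HBase classes.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified sketch of a non-atomic batch commit coordinated across two region servers.
public class NonAtomicBatchCommitSketch {

    interface PeerRegion { Set<String> tryLock(List<String> rows); }
    interface LocalRegion {
        void lock(List<String> rows);
        void unlock(List<String> rows);
        void commitInParallelWithPeer(List<String> rows); // WAL commit on both sides in parallel
    }

    static void commitBatch(List<String> rows, LocalRegion local, PeerRegion peer) {
        List<String> pending = new ArrayList<>(rows);
        while (!pending.isEmpty()) {
            local.lock(pending);
            Set<String> lockedOnPeer = peer.tryLock(pending);
            List<String> ready = new ArrayList<>();
            List<String> retry = new ArrayList<>();
            for (String row : pending) {
                (lockedOnPeer.contains(row) ? ready : retry).add(row);
            }
            local.commitInParallelWithPeer(ready); // commit the subset locked on both sides
            local.unlock(pending);                 // locks are released/relocked per the Fig. 8 flow
            pending = retry;                       // re-attempt rows the peer could not lock
        }
    }
}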
In another embodiment, the second region server 1300B is configured to receive another write command from another client server, the another write command also being associated with the record that is associated with the write command received in the first region server 1300A. The second region server 1300B is configured to determine whether the another write command from the another client server reached the second region server later than the write command from the client server reached the first region server. Herein, reference is made to the description of Fig. 7, which describes an embodiment of the present disclosure to retry locks based on the first arrived transaction at the second region server. The write command received from the client server can be understood as TX1 in Fig. 7 and the another write command received from the another client server can be understood as TX2. In the present embodiment, the determining by the second region server 1300B is performed by at least one of the following mechanisms (a simplified sketch is given after this list):
the second region server 1300B is configured to check a transaction timestamp associated with the write command and a transaction timestamp associated with the another write command;
the second region server 1300B is configured to apply a hashing protocol on the write command and the another write command;
the second region server is configured to apply a pluggable conflict resolution protocol on the write command and the another write command.
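By way of an illustrative, non-limiting sketch, the following Java snippet shows one way the second region server could decide which of two conflicting write commands on the same record arrived first, using the transaction timestamp with a deterministic hash of the origin as a tie breaker. The WriteCommand record is an assumption for illustration; the hashing and pluggable conflict resolution protocols referred to above may of course be implemented differently.

// Simplified sketch of ordering two conflicting write commands on the same record.
public class ConflictResolutionSketch {

    record WriteCommand(long txTimestamp, String originCluster, byte[] row) {}

    // Returns true when "other" reached this region server later than "local".
    static boolean arrivedLater(WriteCommand local, WriteCommand other) {
        if (other.txTimestamp() != local.txTimestamp()) {
            return other.txTimestamp() > local.txTimestamp();  // later timestamp loses the conflict
        }
        // Equal timestamps: fall back to a deterministic hash so both clusters pick the same winner.
        return other.originCluster().hashCode() > local.originCluster().hashCode();
    }
}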
Further, the second region server 1300B is configured to perform the step of committing the record associated with the write command in the second WAL 1304B and, subsequent to committing the record to the second WAL 1304B, to perform the step of committing the record to the second memory 1302B of the second region server 1300B, on determining that the another write command from the another client server reached the second region server 1300B later than the write command from the client server at the first region server 1300A.
Alternatively, on determining that the another write command from the another client server reached the second region server before the write command from the client server at the first region server, the second region server 1300B is configured to commit the record associated with the another write command in the second WAL 1304B and, subsequent to committing that record to the second WAL 1304B, commit that record to the second memory 1302B, prior to performing the step of committing the record associated with the write command in the second WAL 1304B.
By way of another example, Fig. 15 illustrates yet another embodiment of the present disclosure. Fig. 15 illustrates a system 1300 which includes the system depicted in Figs 13 and 14. The system 1300 of Fig. 15 comprises the first cluster, cluster 1, and the second cluster, cluster 2. The first cluster comprises the first region server 1300-1A and at least one other region server 1300-2A. Further, the first cluster comprises a master server 1310A configured to manage the assignment of regions in the first cluster and accordingly manage the assignment of the region being hosted by the first region server 1300-1A and the other region server 1300-2A of the first cluster. Similarly, the second cluster comprises the second region server 1300-1B and at least one other region server 1300-2B. Further, the second cluster comprises a master server 1310B configured to manage the assignment of regions in the second cluster and accordingly manage the assignment of the region being hosted by the second region server 1300-1B and the other region server 1300-2B of the second cluster.
As explained earlier, the master server, in this instance the master server 1310A, is configured to invoke a server crash recovery procedure for recovering files from the first region server 1300-1A in the first cluster, wherein the first cluster comprises other region servers 1300-2A, the other region servers 1300-2A being assigned recovery of the files from the first region server 1300-1A, each of the other region servers 1300-2A comprising a corresponding memory, a corresponding WAL, a corresponding transaction log and a corresponding file store. On invoking the server crash recovery procedure, the master server 1310A is configured to identify one or more WALs belonging to the first region server 1300-1A being recovered and to assign recovery of records from the one or more WALs to the other region servers 1300-2A of the first cluster. It is understood that Fig. 15 illustrates one other region server 1300-2A while there may be further region servers not shown. The master server 1310A, for example, assigns the recovery of records from the one or more WALs to the other region server 1300-2A, which is also referred to as the assigned other region server 1300-2A. The assigned other region server 1300-2A from the first cluster is configured to collect records from a WAL from the one or more WALs of the first region server 1300-1A. Further, the assigned other region server 1300-2A is configured to split the records from the WAL to segregate the records according to the corresponding regions for which records are available in the WAL, create an edit file for each set of segregated records corresponding to one region and create a transaction log file for each edit file created. Further, the assigned other region server 1300-2A is configured to write the edit file locally to the corresponding memory of the assigned other region server 1300-2A. Herein, the edit file corresponds to the one region assigned to the assigned other region server 1300-2A by the master server 1310A. Further, the assigned other region server 1300-2A is configured to load the corresponding transaction log file for the edit file to the corresponding transaction log of the assigned other region server 1300-2A, wherein a latest transaction status of each record from the edit file is recovered from the loaded transaction log file. As explained in detail with reference to Fig. 9 above, the assigned other region server 1300-2A checks the transaction status of each record from the edit file and flushes the edit file to the corresponding file store of the assigned other region server 1300-2A based on the transaction status of each record from the edit file, wherein a record which has a transaction status indicating a failed/timeout status is not flushed, a record which has a transaction status indicating an in-progress status is reloaded, and a record which has a transaction status indicating a success status is flushed to the corresponding file store.
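By way of an illustrative, non-limiting sketch, the following Java snippet summarizes the WAL split and replay performed by the assigned other region server: records are segregated per region into edit files, and the recovered transaction log decides, per record, whether it is flushed, reloaded or dropped. All type and method names are illustrative assumptions, not HBase APIs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of per-region WAL split and transaction-status-aware replay.
public class WalSplitRecoverySketch {

    enum TxStatus { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    record WalRecord(String region, long txId, byte[] value) {}

    // Split: segregate WAL records by the region they belong to (one edit file per region).
    static Map<String, List<WalRecord>> splitByRegion(List<WalRecord> walRecords) {
        Map<String, List<WalRecord>> edits = new HashMap<>();
        for (WalRecord r : walRecords) {
            edits.computeIfAbsent(r.region(), k -> new ArrayList<>()).add(r);
        }
        return edits;
    }

    // Replay: flush only successful records, reload in-progress ones, skip failed ones.
    static void replay(List<WalRecord> editFile, Map<Long, TxStatus> recoveredTxLog,
                       List<WalRecord> fileStore, List<WalRecord> memstore) {
        for (WalRecord r : editFile) {
            TxStatus status = recoveredTxLog.getOrDefault(r.txId(), TxStatus.IN_PROGRESS);
            switch (status) {
                case SUCCESS        -> fileStore.add(r);  // flushed to the corresponding store file
                case IN_PROGRESS    -> memstore.add(r);   // reloaded; flushed in a later cycle
                case TIMEOUT_FAILED -> { /* not flushed */ }
            }
        }
    }
}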
By way of another example, the system disclosed in Figs. 13-15 may further comprise a first zookeeper (ZK) corresponding to the first cluster and a second zookeeper (ZK) corresponding to the second cluster. Herein, reference is made to Fig. 10, where the embodiments related to the first ZK and the second ZK are discussed. The health of the first cluster, including at least the first region server 1300A, is monitored by the first ZK and the health of the second cluster, including the second region server 1300B, is monitored by the second ZK; further, the first ZK also maintains information of the health of the second cluster and the second ZK also maintains information of the health of the first cluster. In one implementation, the first region server 1300A is configured to reject the write command from the client server when the first region server has an unhealthy status as determined from the first ZK. In another implementation, the first region server 1300A is configured to commit the write command in the first WAL 1304A and the first memory 1302A and is configured to halt the replication when the first region server 1300A has a local health status as determined from the first ZK. The first region server 1300A is configured to halt the replication when the second region server 1300B has a recovering health status as determined from the second ZK. The first region server 1300A is configured to perform the replication and the committing when the first region server 1300A has an active health status as determined from the first ZK and the second region server 1300B has an active health status as determined from the second ZK. In the presently discussed embodiments, the first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster.
In a further implementation of the present disclosure, the master server 1310A corresponding to the first cluster, see Fig. 15, is configured to set the first region server 1300A of the first cluster back into sync replication with the second region server 1300B of the second cluster, wherein the first region server 1300A and the first cluster have a recovering health status and the second region server 1300B and the second cluster have an active health status. The description of Fig. 11 is incorporated herein by reference. Also, the first region server 1300A and the second region server 1300B include the components of the respective region servers illustrated in Figs. 13 and 14. The master server 1310A is configured to identify, in the first cluster, regions to be synced from the second region server 1300B to the first region server 1300A. Further, the master server 1310A is configured to assign, to the first region server 1300A, the recovery process for syncing the identified records from the second region server 1300B to the first region server 1300A. Further, the first region server 1300A is configured to recover the first transaction log 1308A of the first region server 1300A to identify a last successful transaction. Accordingly, the first region server 1300A is configured to download files from the second file store 1306B of the second region server 1300B to the first file store 1306A of the first region server 1300A, the files belonging to transactions succeeding the last successful transaction. Further, the first region server 1300A is configured to resume syncing after the downloaded files in the first file store 1306A are loaded to the first memory 1302A of the first region server 1300A, wherein, before resuming, the first region server 1300A is configured to transfer records in the first memory 1302A to the second memory 1302B of the second region server 1300B. Further, the first region server 1300A is configured to enable replication of write commands received from the client server at the first region server 1300A to the second region server 1300B.
In one implementation, for downloading the files from the second file store 1306B to the first file store 1306A, the first region server 1300A is configured to trigger flushing of records from the second memory 1302B to the second file store 1306B, the records pertaining to the regions identified by the master server 1310A in the first cluster. Consequently, the second region server 1300B is configured to flush the records to the second file store 1306B and the first region server 1300A is configured to copy the files from the second file store 1306B to the first file store 1306A upon the flushing of the records in the second region server 1300B. In a further implementation, the first region server 1300A is configured to bulk load the copied files from the first file store 1306A to the first memory 1302A.
Fig. 16 illustrates a schematic diagram of a data node 1600 of an HBase cluster. The data node 1600 may represent a first region server of a first cluster, a second region server of a second cluster or a master server, and accordingly may be a computing and/or a storage node of an HBase cluster in accordance with the embodiments of the present disclosure. One skilled in the art will recognize that the term data node 1600 is included for purposes of clarity of discussion, but is in no way meant to limit the application of the present disclosure to a particular node. At least some of the features/methods described in the disclosure are implemented in a region server. For instance, the features/methods in the disclosure are implemented using hardware, firmware, and/or software installed to run on hardware. As shown in FIG. 16, the data node 1600 comprises transceivers (Tx/Rx) 1610, which are transmitters, receivers, or combinations thereof. A processor 1620 is coupled to the Tx/Rx 1610 to process the write commands received from a client node or a write command being replicated from another data node. The processor 1620 may comprise one or more multi-core processors and/or memory modules 1630, which function as data stores, buffers, etc. The processor 1620 is implemented as a general processor or is part of one or more application specific integrated circuits (ASICs) and/or digital signal processors (DSPs). The memory module 1630 comprises a cache for temporarily storing content, e.g., a Random Access Memory (RAM). Additionally, the memory module 1630 comprises a long-term storage for storing content relatively longer, e.g., a Read Only Memory (ROM). For instance, the cache and the long-term storage include dynamic random access memories (DRAMs), solid-state drives (SSDs), hard disks, or combinations thereof.
It is understood that by programming and/or loading executable instructions onto the data node 1600, at least one of the processor 1620, the cache, and the long-term storage is changed, transforming the data node 1600 in part into a particular machine or apparatus, e.g., one performing the synchronous data replication from a first cluster to a second cluster which work in active-active cooperation with each other, as taught in the embodiments of the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and the number of units to be produced rather than any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change is preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design. Generally, a design that is stable and will be produced in large volume is preferred to be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation is less expensive than the software implementation. Often a design is developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an ASIC that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions is viewed as a particular machine or apparatus.
The present disclosure, as explained with reference to Fig. 3 onwards, provides at least the following beneficial technical effects over the prior art replication methods between two clusters.
● High availability with Strong consistency for query results from any cluster
● Instant Recovery for single cluster down scenario (RTO ~0)
● There is no data loss for single cluster down (RPO=0)
● Transaction management is independent of client applications
● Client can read or write to any cluster
● No central transaction manager. Distributed at Region Level.
● Configurable consistency levels per Table. Choose between Async & Sync replication
● Automatic recovery of cluster which came back from crash to avoid complex manual recovery process for administrator;
● Cluster recovery is fast through differential store files data transfer instead of record by record
● Improved availability, with consistent read results, even in case of network glitches, slow disks and short-term JVM pauses.
Overall, the embodiments of the present disclosure disclose a method, system and apparatus for transaction management between two clusters. The transaction management includes single row, atomic batch and non-atomic batch transaction management between two clusters by means of a coordinated row lock mechanism. The embodiments also disclose the in-memory transaction status being used for the scanner and for the flush to HFiles. A background file repair mechanism, the Repair Chore, is used for correcting records of timed-out or failed transactions. The embodiments disclose transaction management for the bulk data loading scenario as well as a process of automatic recovery using bulk store files instead of record-by-record replay to recover quickly after a failover.
It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the scope disclosed herein.
Accordingly, the scope of protection is not limited by the description set out above but is defined by the claims that follow, that scope including all equivalents of the subject matter of the claims. Each and every claim is incorporated as further disclosure into the specification and the claims are embodiment (s) of the present disclosure. The discussion of a reference in the disclosure is not an admission that it is prior art, especially any reference that has a publication date after the priority date of this application.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the present disclosure be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the disclosure of the embodiments of the present disclosure is intended to be illustrative, but not limiting, of the scope of the present disclosure, which is set forth in the following claims.

Claims (34)

  1. A synchronous data replication method from a first cluster to a second cluster, the first cluster and the second cluster working in active-active cooperation with each other, the method comprising:
    receiving a write command from a client server in a first region server of the first cluster;
    replicating, by the first region server, the received write command in a second region server of the second cluster; wherein the step of replicating including committing, by the second region server, a record associated with the write command in a second write ahead log (WAL) of the second region server and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to a second memory of the second region server; and
    committing, by the first region server, the record associated with the write command to a first write ahead log (WAL) of the first region server and subsequent to the committing the record to the first WAL, committing, by the first region server, the record to a first memory of the first region server.
  2. The method as claimed in claim 1, wherein the committing, by the second region server, the record to the second memory of the second region server, comprising:
    updating, by the second region server, a second transaction status of the record in a second transaction log maintained in the second region server, the second transaction status indicative of at least one of following current status of the record in the second region server: in-progress status; success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the second WAL, the success status indicates a record being successfully committed to the second WAL, and the timeout/failed status indicates that the record has failed to be committed to the second WAL; each record having at least two of the second transaction status in the second  transaction log, last updated entry being considered latest second transaction status of the record;
    sending, by the second region server, the latest second transaction status of the record from the second transaction log, to the first region server;
    receiving, by the first region server, the latest second transaction status of the record from the second region server;
    committing, by the first region server, the record in the first memory when the latest second transaction status received is the success status.
  3. The method as claimed in claim 2, wherein the committing, by the first region server, the record to the first memory of the first region server, comprising:
    updating, by the first region server, a first transaction status of the record in a first transaction log maintained in the first region server, the first transaction status indicative of at least one of following current status of the record at the first region server: in-progress status; success status and timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the first WAL, the success status indicates a record being successfully committed to the first WAL, and timeout/failed status indicates that the record has failed to be committed to the first WAL; each record having at least two transaction status in the first transaction log, last updating entry being considered latest first transaction status of the record.
  4. The method as claimed in claim 3, further comprising:
    sending, by the first region server, the latest first transaction status of the record to the second region server;
    receiving, the latest first transaction status of the record, at the second region server and detecting, by the second region server that the first transaction status of the record is a timeout/failed status;
    on detecting, re-updating, by the second region server, the corresponding second transaction status of the record in the second transaction log to indicate the timeout/failed status in the second region server; and
    updating, by the second region server, the record in the second memory to reflect the timeout/failed status for the record.
  5. The method as claimed in claim 4, wherein if the record is already flushed from the second memory to a second file store of the second region server and the timeout/failed status of the record is received as the latest first transaction status at the second region server, the method comprising:
    rewriting, by the second region server, files in the second file store to remove transactions of the record corresponding to the write command.
  6. The method as claimed in any of claims 1 to 5, wherein the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
  7. The method as claimed in claim 6, wherein the concurrently committing the record to the first memory of the first region server and the second memory of the second region server, comprising:
    locking a first row, by the first region server, the first row belongs to a region hosted by the first region server and the first row corresponds to the write command received at the first region server;
    replicating, by the first region server, the received write command, in the second region server;
    locking a second row, by the second region server, the second row belongs to the region hosted by the second region server and the second row corresponds to the write command replicated at the second region server; and
    when the locking of the second row is success, performing parallel committing, by the second region server, the record in the second WAL and the committing, by the first region server, the record in the first WAL.
  8. The method as claimed in claim 7, wherein the locking of the first row by the first region server and the locking of the second row by the second region server is in response to an atomic batch row request by the client server.
  9. The method as claimed in claim 6, wherein in response to a non-atomic batch row request by the client server, the method comprising:
    locking a first batch of rows, by the first region server, the first batch of rows belonging to a region hosted by the first region server and the first batch of rows corresponds to the write command received at the first region server;
    replicating, by the first region server, the received write command, in the second region server;
    attempting, by the second region server, locking of a second batch of rows, by the second region server, the second batch of rows belonging to the region hosted by the second region server and the second batch of rows corresponds to the write command replicated at the second region server;
    acquiring, by the second region server, locks on at least a first subset of the second batch of rows while failing to acquire lock on a second subset of the second batch of rows;
    releasing, by the first region server, locks on a first subset of the first batch of rows, the first subset of the first batch of rows corresponding to the first subset of the second batch of rows in the second region server;
    performing parallel committing, by the second region server, the record in the second WAL, the step of committing by the second region server being performed on the first subset of the second batch of rows, and the committing, by the first region server, the record in the first WAL, the step of committing by the first region server, being performed on the first subset of the first batch of rows; and
    relocking, by the first region server a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server; and
    re-attempting, by the second region server, locking of the second subset of the second batch of rows, by the second region server, in order to perform the step of parallel committing, by the second region server the record in the second WAL on the second subset of the second batch of rows, and the committing, by the first region server, the record in the first WAL on the second subset of the first batch of rows.
  10. The method as claimed in claim 1, further comprising:
    receiving another write command from another client server in the second region server of the second cluster, the another write command also associated with the record that is associated with the write command received in the first region server;
    determining, by the second region server, if the another write command from the another client server reached the second region server later than the write command from the client server at the first region server;
    wherein the determining is performed by at least one of the following mechanisms:
    checking, by the second region server a transaction timestamp associated with the write command and a transaction timestamp associated with the another write command;
    applying, by the second region server, a hashing protocol on the write command received from the client server and the another write command received from the another client server; and
    applying, by the second region server, a pluggable conflict resolution protocol on the write command received from the client server and the another write command received from the another client server.
  11. The method as claimed in claim 10, further comprising:
    performing the step of committing, by the second region server, the record associated with the write command in the second write ahead log (WAL) and subsequent to the committing the record to the second WAL, performing the step of committing, by the second region server, the record to the second memory of the second region server on determining that the another write command from the another client server reached the second region server later than the write command from the client server at the first region server.
  12. The method as claimed in claim 10, further comprising:
    prior to the performing the step of committing, by the second region server, the record associated with the write command in the second WAL, committing, by the second region server, the record associated with the another write command in the second WAL and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to the second memory on determining that the another write command from the another client server reached the second region server before the write command from the client server at the first region server.
  13. The method as claimed in claim 1, further comprising recovering files from the first region server in the first cluster, wherein the recovering the files comprising:
    invoking of a server crash recovery procedure in a master server managing assignment of regions in the first cluster and accordingly managing the assignment of the region being hosted by the first region server;
    on invoking the server crash recovery procedure in the master server, the method further comprising:
    identifying, by the master server, one or more WALs belonging to the first region server being recovered;
    assigning, by the master server, recovery of records from the one or more WALs to other region servers of the first cluster;
    collecting, by an assigned other region server, records from a WAL from the one or more WALs;
    splitting, by the assigned other region server, the records from the WAL to segregate the records according to the corresponding regions for which records are available in the WAL;
    creating, by the assigned other region server, an edit file for each set of segregated records corresponding to one region and creating a transaction log file for the each edit file created;
    writing, by the assigned other region server, the edit file corresponding to one region assigned to the assigned other region server, locally to a corresponding memory of the assigned other region server and loading the corresponding transaction log file for the edit file to a transaction log of the assigned other region server, wherein a latest transaction status of each record from the edit file is recovered from the loaded transaction log file;
    flushing, by the assigned other region server, the edit file to a file store of the assigned other region server, wherein the flushing comprises:
    checking the transaction status of each record from the edit file;
    not flushing a record which has a transaction status indicating a failed/timeout status;
    reloading a record which has a transaction status indicating an in-progress status; and
    flushing a record which has a transaction status indicating a success status to the file store of the assigned other region server.
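As a non-limiting illustration of the recovery flow of claim 13, the following Java sketch groups recovered WAL records into per-region edit files and flushes only records whose latest recovered transaction status is a success, re-queuing in-progress records and dropping failed ones. The types and method names are hypothetical and simplified; lists and maps stand in for the edit files, the transaction log and the file store.

```java
import java.util.*;

/** Illustrative sketch of the crash-recovery flow in claim 13. All names are hypothetical. */
public class WalRecovery {

    enum TxStatus { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    record WalRecord(String region, String rowKey, byte[] value) {}

    /** Split a recovered WAL into one "edit file" (here simply a list) per region. */
    static Map<String, List<WalRecord>> splitByRegion(List<WalRecord> wal) {
        Map<String, List<WalRecord>> edits = new HashMap<>();
        for (WalRecord r : wal) {
            edits.computeIfAbsent(r.region(), k -> new ArrayList<>()).add(r);
        }
        return edits;
    }

    /** Flush an edit file using the latest status recovered from the transaction log:
     *  failed records are dropped, in-progress records are re-queued, successes are flushed. */
    static void flushEditFile(List<WalRecord> editFile, Map<String, TxStatus> txLog,
                              List<WalRecord> fileStore, Deque<WalRecord> reloadQueue) {
        for (WalRecord r : editFile) {
            TxStatus status = txLog.getOrDefault(r.rowKey(), TxStatus.IN_PROGRESS);
            switch (status) {
                case TIMEOUT_FAILED -> { /* not flushed */ }
                case IN_PROGRESS    -> reloadQueue.addLast(r);  // reloaded for a retry
                case SUCCESS        -> fileStore.add(r);        // persisted to the file store
            }
        }
    }
}
```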
  14. The method as claimed in claim 1, wherein health of the first cluster including at least the first region server is monitored by a first zookeeper (ZK) corresponding to the first cluster and health of the second cluster including at least the second region server is monitored by a second zookeeper (ZK) corresponding to the second cluster, and further the first ZK also maintains information of the health of the second cluster and the second ZK also maintains information of the health of the first cluster, the method further comprising:
    rejecting, by the first region server, the write command from the client server, when the first region server has an unhealthy status as determined from the first ZK;
    committing, by the first region server, the write command in the first WAL and the first memory and halting the replication, by the first region server, when the first region server has a local health status as determined from the first ZK;
    halting the replication, by the first region server, when the second region server has a recovering health status as determined from the second ZK; and
    performing the replication, by the first region server, and performing the committing, by the first region server when the first region server has an active health status as determined from the first ZK and the second region server has an active health status as determined from the second ZK, and
    wherein the first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster.
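A minimal illustration of the health-based write gating of claim 14 is given below, assuming four health states reported through the ZooKeepers; the state names and the decision helper are invented for the sketch and are not ZooKeeper or HBase APIs.

```java
/** Illustrative decision table for claim 14: the write path consults the health of the
 *  local (first) and peer (second) cluster as reported by their ZooKeepers. */
public class HealthGate {

    enum Health { ACTIVE, LOCAL, RECOVERING, UNHEALTHY }
    enum Action { REJECT_WRITE, COMMIT_LOCALLY_HALT_REPLICATION, HALT_REPLICATION, REPLICATE_AND_COMMIT }

    static Action decide(Health localCluster, Health peerCluster) {
        if (localCluster == Health.UNHEALTHY) return Action.REJECT_WRITE;                    // reject the client write
        if (localCluster == Health.LOCAL)     return Action.COMMIT_LOCALLY_HALT_REPLICATION; // commit to WAL and memory only
        if (peerCluster == Health.RECOVERING) return Action.HALT_REPLICATION;                // peer is catching up
        if (localCluster == Health.ACTIVE && peerCluster == Health.ACTIVE)
            return Action.REPLICATE_AND_COMMIT;                                              // normal sync replication
        return Action.HALT_REPLICATION;                                                      // conservative default
    }
}
```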
  15. The method as claimed in claim 14, further comprising setting back the first region server of the first cluster in sync replication with the second region server of the second cluster, wherein the first region server and the first cluster have a recovering health status and the second region server and the second cluster have an active health status, the method comprising:
    identifying, by a master server in the first cluster, regions to be synced from the second region server to the first region server;
    assigning, by the master server, to the first region server, a recovery process for syncing the identified records from the second region server to the first region server;
    recovering, by the first region server, the first transaction log of the first region server to identify a last successful transaction;
    downloading, by the first region server, files from the second file store of the second region server to the first file store of the first region server, the files belonging to transactions succeeding the last successful transaction;
    resuming, by the first region server, syncing after the downloaded files in the first file store are loaded to the first memory of the first region server, wherein before resuming, transferring, by the first region server, records in the first memory to the second memory of the second region server; and
    enabling, by the first region server, replicating write commands received from the client server at the first region server to the second region server.
  16. The method as claimed in claim 15, wherein the step of downloading files from the second file store to the first file store comprises:
    triggering, by the first region server, flushing records from the second memory to the second file store in the second region server, the records pertaining to the regions identified by the master server in the first cluster;
    flushing, by the second region server, the records to the second file store; and
    copying, by the first region server, the files from the second file store to the first file store upon the flushing in the second region server.
  17. The method as claimed in claim 16, wherein the copying, by the first region server, includes bulk loading, by the first region server, the files from the first file store to the first memory.
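The re-synchronization of claims 15 to 17 can be pictured with the following non-limiting Java outline; the interfaces stand in for the recovering (first) region server, its transaction log and the healthy (second) peer, and every name is an assumption made for the illustration.

```java
/** Illustrative outline of the re-sync flow of claims 15-17. All interfaces are hypothetical. */
public class ResyncProcedure {

    interface TransactionLog { long lastSuccessfulTransactionId(); }

    interface PeerRegionServer {
        void flushMemoryToFileStore(java.util.List<String> regions);   // flush triggered remotely
        java.util.List<java.nio.file.Path> filesAfter(long txId);      // files newer than txId
    }

    interface LocalRegionServer {
        void copyToFileStore(java.util.List<java.nio.file.Path> files);
        void bulkLoadFileStoreIntoMemory();
        void transferPendingMemoryRecordsTo(PeerRegionServer peer);
        void enableReplicationTo(PeerRegionServer peer);
    }

    static void resync(LocalRegionServer local, PeerRegionServer peer,
                       TransactionLog localTxLog, java.util.List<String> regionsToSync) {
        long lastOk = localTxLog.lastSuccessfulTransactionId();   // 1. recover last successful transaction
        peer.flushMemoryToFileStore(regionsToSync);               // 2. force the peer memory flush
        local.copyToFileStore(peer.filesAfter(lastOk));           // 3. download the newer files
        local.bulkLoadFileStoreIntoMemory();                      // 4. bulk load them into local memory
        local.transferPendingMemoryRecordsTo(peer);               // 5. push local pending records to the peer
        local.enableReplicationTo(peer);                          // 6. resume sync replication
    }
}
```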
  18. A system for synchronous data replication, wherein the system comprises:
    a first cluster comprising a first region server, the first region server comprising a first memory, a first write ahead log (WAL) and a first file store; and
    a second cluster comprising a second region server, the second region server comprises a second memory, a second WAL and a second file store, the first cluster and the second cluster working in active-active cooperation with each other, wherein
    the first region server is configured to:
    receive a write command from a client server;
    replicate the received write command in the second region server, wherein the step of replicating includes committing, by the second region server, a record associated with the write command in the second WAL and subsequent to the committing the record to the second WAL, committing, by the second region server, the record to the second memory; and
    commit the record associated with the write command to the first WAL and subsequent to the committing the record to the first WAL, commit the record to the first memory of the first region server,
    and
    the second region server is configured to perform replication of the received write command in the second region server.
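By way of a non-limiting sketch of the write path of claim 18: each region server commits a record to its WAL before its memory, and the server that received the client write drives the peer commit through replication, acknowledging the client only when both sides have committed. The interfaces below are hypothetical stand-ins, not HBase classes.

```java
import java.util.concurrent.CompletableFuture;

/** Minimal sketch of the active-active write path of claim 18. Names are hypothetical. */
public class SyncWritePath {

    interface RegionServer {
        void appendToWal(byte[] record);   // durability first
        void putInMemory(byte[] record);   // then the in-memory store
    }

    /** Commit order inside one region server: WAL before memory. */
    static void commit(RegionServer rs, byte[] record) {
        rs.appendToWal(record);
        rs.putInMemory(record);
    }

    /** The first region server replicates to the second and commits locally in parallel;
     *  the write is acknowledged to the client only after both commits succeed. */
    static void handleClientWrite(RegionServer first, RegionServer second, byte[] record) {
        CompletableFuture<Void> remote = CompletableFuture.runAsync(() -> commit(second, record));
        commit(first, record);
        remote.join();   // surfaces a replication failure before acknowledging the client
    }
}
```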
  19. The system as claimed in claim 18, wherein
    the first region server comprises a first transaction log;
    the second region server comprises a second transaction log;
    the second region server is configured to update a second transaction status of the record in the second transaction log, the second transaction status indicative of at least one of the following current statuses of the record in the second region server: an in-progress status; a success status; and a timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the second WAL, the success status indicates that the record has been successfully committed to the second WAL, and the timeout/failed status indicates that the record has failed to be committed to the second WAL; each record having at least two second transaction status entries in the second transaction log, the last updated entry being considered the latest second transaction status of the record;
    the second region server is configured to send the latest second transaction status of the record from the second transaction log to the first region server; and
    the first region server is configured to receive the latest second transaction status of the record from the second region server, and
    wherein the first region server is configured to commit the record in the first memory when the latest second transaction status received is the success status.
  20. The system as claimed in claim 19, wherein
    the first region server is configured to update a first transaction status of the record in the first transaction log, the first transaction status indicative of at least one of the following current statuses of the record at the first region server: an in-progress status; a success status; and a timeout/failed status; wherein the in-progress status indicates that the record is not yet committed to the first WAL, the success status indicates that the record has been successfully committed to the first WAL, and the timeout/failed status indicates that the record has failed to be committed to the first WAL; each record having at least two first transaction status entries in the first transaction log, the last updated entry being considered the latest first transaction status of the record.
  21. The system as claimed in claim 20, wherein
    the first region server is configured to send the latest first transaction status of the record to the second region server;
    the second region server is configured to receive the latest first transaction status of the record and detect that the first transaction status of the record is a timeout/failed status;
    on detecting, the second region server is configured to re-update the corresponding second transaction status of the record in the second transaction log to indicate the timeout/failed status in the second region server; and
    the second region server is configured to update the record in the second memory to reflect the timeout/failed status for the record.
  22. The system as claimed in claim 21, wherein if the record is already flushed from the second memory to the second file store of the second region server and the timeout/failed status of the record is received as the latest first transaction status at the second region server, the second region server is configured to:
    rewrite files in the second file store to remove transactions of the record corresponding to the write command.
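A simplified, non-limiting sketch of the per-record transaction log of claims 19 to 22 follows: every record accumulates an in-progress entry followed by a success or timeout/failed entry, the last entry being its latest status, and a failure reported by the peer rolls the record back from memory and, if already flushed, from the file store. Maps stand in for the memory, the transaction log and the store files; all names are assumptions.

```java
import java.util.*;

/** Illustrative per-record transaction log and reconciliation (claims 19-22). */
public class TransactionLogReconciliation {

    enum TxStatus { IN_PROGRESS, SUCCESS, TIMEOUT_FAILED }

    private final Map<String, List<TxStatus>> txLog = new HashMap<>();   // record id -> status history
    private final Map<String, byte[]> memory = new HashMap<>();          // in-memory store
    private final Map<String, byte[]> fileStore = new HashMap<>();       // flushed files (simplified)

    /** Append a status entry; each record ends up with at least two entries. */
    void updateStatus(String recordId, TxStatus status) {
        txLog.computeIfAbsent(recordId, k -> new ArrayList<>()).add(status);
    }

    /** The last entry in the history is the latest transaction status of the record. */
    TxStatus latestStatus(String recordId) {
        List<TxStatus> history = txLog.getOrDefault(recordId, List.of(TxStatus.IN_PROGRESS));
        return history.get(history.size() - 1);
    }

    /** The peer reported TIMEOUT_FAILED for a record: re-update the transaction log and
     *  remove the record from memory and, if already flushed, from the file store
     *  (standing in for rewriting the store files without that record). */
    void onPeerReportedFailure(String recordId) {
        updateStatus(recordId, TxStatus.TIMEOUT_FAILED);
        memory.remove(recordId);
        fileStore.remove(recordId);
    }
}
```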
  23. The system as claimed in any of claims 18 to 22, wherein the record corresponding to the write command received at the first region server is concurrently committed to the first memory of the first region server and the second memory of the second region server; and if the record corresponding to the write command fails to be committed in any of the first region server and the second region server, the record is not considered for a subsequent read operation or a search query received in either of the first cluster and the second cluster.
  24. The system as claimed in claim 23, wherein in order to concurrently commit the record to the first memory of the first region server and the second memory of the second region server,
    the first region server is configured to lock a first row, the first row belongs to a region hosted by the first region server and the first row corresponds to the write command received at the first region server, and the first region server is configured to replicate the received write command in the second region server;
    the second region server is configured to lock a second row, the second row belongs to the region hosted by the second region server and the second row corresponds to the write command replicated at the second region server; and
    when the locking of the second row is success, the second region server is configured to commit the record in the second WAL and the first region server is configured to commit the record in the first WAL, the committing by the second region server is in parallel to the committing by the first region server.
  25. The system as claimed in claim 24, wherein the locking of the first row by the first region server and the locking of the second row by the second region server is in response to an atomic batch row request by the client server.
  26. The system as claimed in claim 23, wherein in response to a non-atomic batch row request by the client server,
    the first region server is configured to lock a first batch of rows, the first batch of rows belonging to a region hosted by the first region server and the first batch of rows corresponds to the write command received at the first region server, and the first region server is configured to replicate the received write command in the second region server;
    the second region server is configured to attempt locking of a second batch of rows, the second batch of rows belonging to the region hosted by the second region server and the second batch of rows corresponds to the write command replicated at the second region server;
    the second region server is configured to acquire locks on at least a first subset of the second batch of rows while failing to acquire lock on a second subset of the second batch of rows;
    the first region server is configured to release locks on a first subset of the first batch of rows, the first subset of the first batch of rows corresponding to the first subset of the second batch of rows in the second region server;
    the second region server is configured to commit the record in the second WAL, the step of committing by the second region server being performed on the first subset of the second batch of rows, and the first region server is configured to commit the record in the first WAL, the step of committing by the first region server, being performed on the first subset of the first batch of rows, wherein the committing by the second region server is in parallel to the committing by the first region server;
    the first region server is configured to relock a second subset of the first batch of rows, the second subset of the first batch of rows corresponding to the second subset of the second batch of rows in the second region server; and
    the second region server is configured to re-attempt locking of the second subset of the second batch of rows in order to perform parallel committing, by the second region server, the record in the second WAL on the second subset of the second batch of rows, and committing, by the first region server, the record in the first WAL on the second subset of the first batch of rows.
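The batch locking of claims 23 to 26 may be illustrated by the following non-limiting Java sketch: the replicating server try-locks the rows of the batch, the rows it could not lock are released on the originating server and deferred, the commonly locked subset is committed to both WALs in parallel, and the deferred subset is relocked and retried. Lock bookkeeping is deliberately collapsed into one JVM and all names are hypothetical.

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

/** Non-limiting sketch of non-atomic batch locking across the two clusters (claim 26). */
public class BatchRowLocking {

    /** Per-row locks for the region hosted by one region server. */
    static class RegionRows {
        private final Map<String, ReentrantLock> rowLocks = new HashMap<>();

        /** Lock every row of the batch (as the first region server does). */
        void lockAll(Collection<String> rows) {
            for (String row : rows) {
                rowLocks.computeIfAbsent(row, r -> new ReentrantLock()).lock();
            }
        }

        /** Try to lock each row; return only the subset actually acquired
         *  (as the second region server does when replicating). */
        Set<String> tryLock(Collection<String> rows) {
            Set<String> acquired = new HashSet<>();
            for (String row : rows) {
                ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
                if (lock.tryLock()) {
                    acquired.add(row);
                }
            }
            return acquired;
        }

        void unlock(Collection<String> rows) {
            for (String row : rows) {
                rowLocks.get(row).unlock();
            }
        }
    }

    /** One round: commit the rows both servers could lock, defer the rest for a retry. */
    static Set<String> commitRound(RegionRows first, RegionRows second, Set<String> batch) {
        first.lockAll(batch);                                // first server locks its whole batch
        Set<String> lockedOnSecond = second.tryLock(batch);  // second server acquires only a subset
        Set<String> deferred = new HashSet<>(batch);
        deferred.removeAll(lockedOnSecond);
        first.unlock(deferred);                              // release the unmatched first-server locks
        // ... both servers commit the lockedOnSecond rows to their WALs in parallel here ...
        first.unlock(lockedOnSecond);
        second.unlock(lockedOnSecond);
        return deferred;                                     // relocked and retried in the next round
    }
}
```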
  27. The system as claimed in claim 18, wherein
    the second region server is configured to receive another write command from another client server, the another write command also associated with the record that is associated with the write command received in the first region server; and
    the second region server is configured to determine if the another write command from the another client server reached the second region server later than the write command from the client server at the first region server,
    wherein the determining is performed by at least one of the following mechanisms:
    the second region server is configured to check a transaction timestamp associated with the write command and a transaction timestamp associated with the another write command;
    the second region server is configured to apply a hashing protocol on the write command received and the another write command; and
    the second region server is configured to apply a pluggable conflict resolution protocol on the write command and the another write command.
  28. The system as claimed in claim 27, wherein
    the second region server is configured to perform the step of committing the record associated with the write command in the second WAL and subsequent to the committing the record to the second WAL, perform the step of committing the record to the second memory of the second region server on determining that the another write command from the another client server reached the second region server later than the write command from the client server at the first region server.
  29. The system as claimed in claim 27, wherein
    prior to the performing the step of committing the record associated with the write command in the second WAL, the second region server is configured to commit the record associated with the another write command in the second WAL and subsequent to the committing the record to the second WAL, commit the record to the second memory, on determining that the another write command from the another client server reached the second region server before the write command from the client server at the first region server.
  30. The system as claimed in claim 18, wherein the system further comprises a master server configured to manage assignment of regions in the first cluster and accordingly manage the assignment of the region being hosted by the first region server; wherein
    the master server is configured to invoke a server crash recovery procedure for recovering files from the first region server in the first cluster, wherein the first cluster comprises other region servers, the other region servers being assigned recovery of the files from the first region server, each of the other region servers comprising a corresponding memory, a corresponding WAL, a corresponding transaction log and a corresponding file store;
    on invoking the server crash recovery procedure in the master server, the master server is configured to identify, one or more WALs belonging to the first region server being recovered and, assign recovery of records from the one or more WALs to the other region servers of the first cluster;
    an assigned other region server from the first cluster is configured to:
    collect records from a WAL from the one or more WALs;
    split the records from the WAL to segregate the records according to the corresponding regions for which records are available in the WAL;
    create an edit file for each set of segregated records corresponding to one region and create a transaction log file for each edit file created;
    write the edit file, the edit file corresponding to one region assigned to the assigned other region server, locally to the corresponding memory of the assigned other region server and load the corresponding transaction log file for the edit file to the corresponding transaction log of the assigned other region server, wherein a latest transaction status of each record from the edit file is recovered from the loaded transaction log file;
    check the transaction status of each record from the edit file; and
    flush the edit file to the corresponding file store of the assigned other region server based on the transaction status of each record from the edit file, and
    wherein a record which has a transaction status indicating a failed/timeout status is not flushed; a record which has a transaction status indicating an in-progress status is reloaded; and a record which has a transaction status indicating a success status is flushed to the corresponding file store.
  31. The system as claimed in claim 18, wherein the system further comprises a first zookeeper (ZK) corresponding to the first cluster and a second zookeeper (ZK) corresponding to the second cluster, wherein health of the first cluster including at least the first region server is monitored by the first ZK and health of the second cluster including at least the second region server is monitored by the second ZK, and further the first ZK also maintains information of the health of the second cluster and the second ZK also maintains information of the health of the first cluster, wherein
    the first region server is configured to reject the write command from the client server, when the first region server has an unhealthy status as determined from the first ZK;
    the first region server is configured to commit the write command in the first WAL and the first memory and is configured to halt the replication when the first region server has a local health status as determined from the first ZK;
    the first region server is configured to halt the replication when the second region server has a recovering health status as determined from the second ZK; and
    the first region server is configured to perform the replication and the committing when the first region server has an active health status as determined from the first ZK and the second region server has an active health status as determined from the second ZK, and
    wherein the first ZK and the second ZK are in coordination with each other regarding the respective monitored health of the first cluster and the second cluster.
  32. The system as claimed in claim 31, further comprising a master server corresponding to the first cluster, wherein when the first region server and the first cluster have a recovering health status and the second region server and the second cluster have an active health status, the master server is configured to set back the first region server of the first cluster in sync replication with the second region server of the second cluster, wherein:
    the master server is configured to:
    identify in the first cluster regions to be synced from the second region server to the first region server;
    assign, to the first region server, a recovery process for syncing the identified records from the second region server to the first region server;
    the first region server is configured to:
    recover the first transaction log of the first region server to identify a last successful transaction;
    download files from the second file store of the second region server to the first file store of the first region server, the files belonging to transactions succeeding the last successful transaction;
    resume syncing after the downloaded files in the first file store are loaded to the first memory of the first region server, wherein before resuming, transfer records in the first memory to the second memory of the second region server; and
    enable replication of write commands received from the client server at the first region server to the second region server.
  33. The system as claimed in claim 32, wherein for the downloading files from the second file store to the first file store,
    the first region server is configured to trigger flushing records from the second memory to the second file store, the records pertaining to the regions identified by the master server in the first cluster;
    the second region server is configured to flush the records to the second file store; and
    the first region server is configured to copy the files from the second file store to the first file store upon the flushing in the second region server.
  34. The system as claimed in claim 32, wherein the first region server is configured to bulk load the copied files from the first file store to the first memory.
PCT/CN2022/141930 2021-12-27 2022-12-26 Method and system for synchronous data replication WO2023125412A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202131061002 2021-12-27
IN202131061002 2021-12-27

Publications (2)

Publication Number Publication Date
WO2023125412A1 true WO2023125412A1 (en) 2023-07-06
WO2023125412A9 WO2023125412A9 (en) 2024-06-13

Family

ID=86997797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141930 WO2023125412A1 (en) 2021-12-27 2022-12-26 Method and system for synchronous data replication

Country Status (1)

Country Link
WO (1) WO2023125412A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870570A (en) * 2014-03-14 2014-06-18 广州携智信息科技有限公司 HBase (Hadoop database) data usability and durability method based on remote log backup
US20160077936A1 (en) * 2014-09-12 2016-03-17 Facebook, Inc. Failover mechanism in a distributed computing system
US20180210795A1 (en) * 2015-09-25 2018-07-26 Huawei Technologies Co.,Ltd. Data backup method and data processing system
CN108418859A (en) * 2018-01-24 2018-08-17 华为技术有限公司 The method and apparatus for writing data
CN109254870A (en) * 2018-08-01 2019-01-22 华为技术有限公司 The method and apparatus of data backup


Also Published As

Publication number Publication date
WO2023125412A9 (en) 2024-06-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914666

Country of ref document: EP

Kind code of ref document: A1