KR20130043823A

KR20130043823A - Distributed storage system for maintaining data consistency based on log, and method for the same

Info

Publication number: KR20130043823A
Application number: KR1020110107949A
Authority: KR
Inventors: 김태웅; 이규재; 박기은; 전성원; 문상철; 이혜정; 김효
Original assignee: 엔에이치엔비즈니스플랫폼 주식회사
Priority date: 2011-10-21
Filing date: 2011-10-21
Publication date: 2013-05-02
Also published as: KR101553712B1

Abstract

PURPOSE: A distributed storage system that maintains data consistency based on a log, and a method thereof, are provided. A log is generated for an operation that could not be performed, and the operation is later performed based on the log, thereby maintaining data consistency. CONSTITUTION: A first node receives a request for a first operation, generates a log indicating the failure of the first operation, and repeatedly attempts the first operation based on the log (270). When the first operation has been performed, the first node deletes the log (280). The failure includes the case in which an object used in the first operation does not exist in the first node. The first node performs the first operation after performing a second operation that includes creation of the object. [Reference numerals] (210) Receive a first command; (220, 260) Can the first operation be performed?; (230) Generate a log indicating the failure of the first operation; (240, 295) Output a success response for the first command; (250) Wait (perform a second operation); (270, 290) Perform the first operation; (280) Delete the log; (AA) Start; (BB) End

Description

DISTRIBUTED STORAGE SYSTEM FOR MAINTAINING DATA CONSISTENCY BASED ON LOG, AND METHOD FOR THE SAME

The following embodiments relate to a distributed storage system and method.

Disclosed are a distributed storage system that maintains data consistency based on logs, and a method of maintaining the data consistency of a distributed storage system based on logs.

A system into which data is frequently bulk inserted can improve throughput by using a large number of inexpensive devices and distributing the data across those devices for storage.

As the number of devices grows, failures occurring in the devices themselves may increase, and it may become difficult to guarantee the response time of the system. To address this problem, a distributed storage system can maintain and manage replicas. Here, replicas are a plurality of devices that store the same data.

In a distributed storage system that includes a plurality of nodes, consistency among replicas storing the same data can become a problem. That is, the replicas must be prevented from processing operations in different orders and thereby ending up with different data.

One way to maintain consistency is to construct the operations provided to the replicas so that they satisfy commutativity and idempotence. Commutativity means that the result of performing certain operations is always the same even if the operations are performed in different orders. Idempotence means that the result of performing a certain operation repeatedly is the same as the result of performing it once. For example, operations that satisfy idempotence can be constructed by 1) restricting the execution of non-idempotent operations, or 2) restricting updates that modify only part of a row of a table rather than the entire row. One example of a non-idempotent operation is self-reference. The expression "A = A + 1" is an example of self-reference.
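The difference between an idempotent whole-row write and a self-referencing update can be illustrated with a minimal Python sketch. The function names `set_row` and `increment` and the dictionary representation of a row are assumptions made for illustration only; they do not appear in the disclosure.

```python
def set_row(row, new_row):
    """Overwrite the whole row. Repeating the call gives the same result (idempotent)."""
    return dict(new_row)


def increment(row, column):
    """Self-referencing update (A = A + 1). Repeating the call changes the result."""
    updated = dict(row)
    updated[column] = updated[column] + 1
    return updated


row = {"A": 1}

# Whole-row overwrite applied once and twice.
set_once = set_row(row, {"A": 5})
set_twice = set_row(set_once, {"A": 5})

# Self-referencing update applied once and twice.
inc_once = increment(row, "A")
inc_twice = increment(inc_once, "A")

print(set_once == set_twice)   # the overwrite is idempotent
print(inc_once == inc_twice)   # the increment is not
```

If a replica receives the same overwrite twice, its state is unaffected, whereas a duplicated increment would make that replica diverge from the others.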

Another way to maintain consistency is total ordering. Total ordering is a method in which the replicas agree with one another on the order of operations when performing them. To agree on the order, the replicas can use a global queue or lock, and can designate a master and slaves among themselves. For example, the master among the replicas determines the order of the operations while performing them, and the slaves perform the operations in the determined order. However, when total ordering is used, a failure in one of the replica nodes may also affect the other nodes.

Korean Patent Publication No. 10-2008-0102622 (published November 26, 2008) discloses a method in which a replication log is generated from the transaction log of a master database and transmitted to a distribution system, the slave host to which the replication log is to be distributed is determined, and the replication log is distributed to that slave host so that it is reflected in the slave database.

An embodiment of the present invention may provide a distributed storage system and a distributed storage method that maintain data consistency by generating a log for an operation that cannot be performed and performing the operation later based on the log.

An embodiment of the present invention may provide a distributed storage system and a distributed storage method that maintain data consistency by setting an end time for a data copy, copying the objects of the data to a target node, and re-copying to the target node any object updated before the end time.

According to one aspect of the present invention, there is provided a distributed storage system comprising a plurality of nodes, wherein a first node among the plurality of nodes receives a request for a first operation; when a failure occurs that prevents the first operation from being performed, the first node generates a log indicating the failure of the first operation; the first node repeatedly attempts the first operation based on the log; and the first node deletes the log when the first operation has been performed.

The failure may mean that an object used in the first operation does not exist in the first node.

The first node may perform the first operation after performing a second operation that includes creation of the object.

The timestamp of the first operation may indicate a later time than the timestamp of the second operation.

The first operation may be an update operation on the object. The second operation may be an insert operation on the object.

The failure may be a network failure or a process failure of the first node.

According to one aspect of the present invention, there is provided a distributed storage system comprising a plurality of nodes, wherein a first node among the plurality of nodes receives a request for a first operation; when a failure occurs that prevents the first operation from being performed, the first node generates a log indicating the failure of the first operation and transmits the log to a second node among the plurality of nodes; the second node stores the log and, based on the log, periodically requests the first node to attempt the first operation; and the first node requests the second node to delete the log when the first operation has been performed.

According to yet another aspect of the present invention, there is provided a distributed storage system comprising a plurality of nodes, wherein a failed node is a node among the plurality of nodes in which a failure has occurred; replicas are one or more nodes among the plurality of nodes that provide the same data as the failed node; a source node is one of the replicas; a target node is a node among the plurality of nodes newly selected as one of the replicas to replace the failed node; the source node recognizes the failure of the failed node; and the source node sets an end time for copying data, copies the objects of the data to the target node, and copies again to the target node a first object that was updated before the end time after having been copied to the target node.

The end time may be a time by which all of the plurality of nodes have recognized the copy to the target node and can send access requests for the data to the target node as well.

The source node may copy the objects to the target node in ascending order of the timestamps of the objects.

The source node may perform the copy for those objects whose timestamp indicates the end time or a time earlier than the end time.

A coordinating node among the plurality of nodes may receive an update request for the first object and may transmit the update request for the first object to each of the source node and the target node. The target node may generate a log indicating the failure of the update request when the first object does not exist, and may update the first object based on the log after the first object has been copied from the source node.

The source node may copy a second object to the target node by performing an operation that inserts the second object into the target node when the second object does not exist in the target node, and by performing an operation that updates the second object in the target node when the second object exists in the target node.

According to yet another aspect of the present invention, there is provided a method by which a distributed storage system using a plurality of nodes manages data, the method comprising: receiving, by a first node among the plurality of nodes, a request for a first operation; generating, by the first node, a log indicating the failure of the first operation when a failure occurs in the first node that prevents the first operation from being performed; repeatedly attempting, by the first node, the first operation based on the log; and deleting, by the first node, the log when the first node has performed the first operation.

The step of repeatedly attempting the first operation may include performing, by the first node, a second operation that includes creation of the first object, and performing the first operation after the second operation has been performed.

According to yet another aspect of the present invention, there is provided a method by which a distributed storage system using a plurality of nodes manages data, the method comprising: receiving, by a first node among the plurality of nodes, a request for a first operation; generating, by the first node, a log indicating the failure of the first operation when a failure occurs in the first node that prevents the first operation from being performed; transmitting, by the first node, the log to a second node; storing, by the second node, the log; periodically requesting, by the second node based on the log, that the first node attempt the first operation; performing, by the first node, the first operation when the first operation can be performed; requesting, by the first node, that the second node delete the log when the first operation has been performed; and deleting, by the second node, the log.
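The exchange between the first node and the second node in the method above can be sketched in Python as follows. This is only an illustrative sketch: the class names `FirstNode` and `LogNode`, the `tick` method standing in for a periodic timer, and the tuple encoding of operations are assumptions and are not taken from the disclosure.

```python
class LogNode:
    """Second node: stores failure logs and periodically asks the owner to retry."""

    def __init__(self):
        self.logs = {}       # log id -> (owner node, operation)
        self._next_id = 0

    def store_log(self, owner, operation):
        log_id = self._next_id
        self._next_id += 1
        self.logs[log_id] = (owner, operation)
        return log_id

    def delete_log(self, log_id):
        self.logs.pop(log_id, None)

    def tick(self):
        """One periodic pass over the stored logs (a timer would call this)."""
        for log_id, (owner, operation) in list(self.logs.items()):
            owner.retry(log_id, operation, self)


class FirstNode:
    """First node: performs operations, or hands a failure log to the log node."""

    def __init__(self):
        self.rows = {}

    def _can_perform(self, operation):
        kind, key, _ = operation
        return kind == "insert" or key in self.rows

    def receive(self, operation, log_node):
        if self._can_perform(operation):
            self.rows[operation[1]] = operation[2]
        else:
            log_node.store_log(self, operation)

    def retry(self, log_id, operation, log_node):
        if self._can_perform(operation):
            self.rows[operation[1]] = operation[2]
            log_node.delete_log(log_id)   # ask the log node to delete the log


first, log_node = FirstNode(), LogNode()
first.receive(("update", "A", "b"), log_node)   # object A is missing, so a log is stored
log_node.tick()                                 # retry fails, log is kept
pending_before_insert = len(log_node.logs)
first.receive(("insert", "A", "a"), log_node)   # the second operation creates A
log_node.tick()                                 # retry succeeds, log is deleted
print(first.rows, log_node.logs)
```

Keeping the log outside the first node means the record of the missed operation survives even if the first node's own process fails.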

According to yet another aspect of the present invention, there is provided a method by which a distributed storage system using a plurality of nodes copies data between the nodes, the method comprising: requesting, by a target node, a copy of data from a source node, wherein the source node is one of the replicas and the target node is a node among the plurality of nodes newly selected as one of the replicas to replace a failed node among the replicas; setting, by the source node, an end time for the copy of the data; copying, by the source node, the objects of the data to the target node; and copying again, by the source node, to the target node a first object that was updated before the end time after having been copied to the target node.

The copying step may include copying, by the source node, the objects to the target node in ascending order of the timestamps of the objects.

The data copy method may further include: receiving, by a coordinating node among the plurality of nodes, an update request for the first object; transmitting, by the coordinating node, the update request for the first object to each of the source node and the target node; generating, by the target node, a log indicating the failure of the update request when the first object does not exist; and updating, by the target node, the first object based on the log after the first object has been copied from the source node.

A distributed storage system and a distributed storage method are provided that maintain data consistency by generating a log for an operation that cannot be performed and performing the operation later based on the log.

A distributed storage system and a distributed storage method are provided that maintain data consistency by setting an end time for a data copy, copying the objects of the data to a target node, and re-copying to the target node any object updated before the end time.

The distributed storage system and distributed storage method also send new data access requests to the target node to which the data is being copied, and the source node copies to the target node only those objects whose timestamp indicates a time before the end time. This prevents the data migration from being prolonged indefinitely even when many objects are newly inserted or updated.
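The bounded copy described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `StoredObject`, `copy_pass`, and `copied_at`, the integer timestamps, and the use of repeated passes to detect objects that changed after being copied are all hypothetical choices, not details from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    value: str
    timestamp: int   # time of the last insert or update


def copy_pass(source, target, end_time, copied_at):
    """Copy objects with timestamp <= end_time, oldest first.

    copied_at remembers the timestamp each key had when it was last copied,
    so an object updated after its copy is sent again on the next pass.
    """
    sent = []
    for key, obj in sorted(source.items(), key=lambda item: item[1].timestamp):
        if obj.timestamp > end_time:
            continue   # later writes reach the target through normal requests
        if copied_at.get(key) == obj.timestamp:
            continue   # unchanged since it was last copied
        if key not in target:
            target[key] = StoredObject(obj.value, obj.timestamp)   # insert
        else:
            target[key].value = obj.value                          # update
            target[key].timestamp = obj.timestamp
        copied_at[key] = obj.timestamp
        sent.append(key)
    return sent


END_TIME = 10
source = {
    "A": StoredObject("a", 1),
    "B": StoredObject("b", 2),
    "C": StoredObject("c", 12),   # written after the end time
}
target, copied_at = {}, {}

first_pass = copy_pass(source, target, END_TIME, copied_at)
source["A"] = StoredObject("a2", 5)   # A is updated after its copy, before the end time
second_pass = copy_pass(source, target, END_TIME, copied_at)
print(first_pass, second_pass, sorted(target))
```

Because objects stamped after the end time are never part of a pass, the amount of work left for the source node shrinks toward zero instead of growing with every new write.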

FIG. 1 illustrates a case in which data inconsistency occurs between nodes of a distributed storage system.
FIG. 2 is a flowchart illustrating a method by which a distributed storage system using a plurality of nodes manages data, according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method by which a distributed storage system manages data by writing a log to a neighbor node, according to an embodiment of the present invention.
FIG. 4 illustrates a distributed storage system that copies data, according to an embodiment of the present invention.
FIG. 5 is a flowchart of a data copy method of a distributed storage system, according to an embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the embodiments. Like reference numerals in the drawings denote like elements.

In the following, "update" is used with the same meaning as "change", and the two terms may be used interchangeably.

Each node of a distributed storage system manages data. A node may mean an execution unit that handles a particular task. For example, a node may be a process or a server. The data may include one or more tables. A node may perform operations in units of table rows. In the following embodiments, an object may mean a particular row of a particular table.

FIG. 1 illustrates a case in which data inconsistency occurs between nodes of a distributed storage system.

The distributed storage system includes a first node 120 and a second node 130. The first node 120 and the second node 130 may be replicas that replicate the same data.

The client 110 sends two commands 150 and 160 to the distributed storage system.

The insert command 150 requests an operation that inserts a into object (or data) A. The timestamp of the insert command 150 is T1. Here, the timestamp of the insert command 150 may be regarded as the timestamp of the insert operation that it requests. The update command 160 requests an operation that updates the value of object A to b. The timestamp of the update command 160 (or of the update operation that it requests) is T2. T1 and T2 indicate the times at which the client 110 actually issued the two commands 150 and 160, respectively. T1 < T2.

The intention of the client 110 is to insert an object A having the value 'a' into the distributed storage system and then update the value of object A to 'b'. The two commands 150 and 160 are delivered to the first node 120 in the correct order. The first node 120 can therefore process the two commands 150 and 160 normally, and object A in the first node 120 ends with the value 'b'. At the second node 130, by contrast, the update command 160 arrives first and the insert command 150 arrives later. When the update command 160 arrives at the second node 130, object A does not exist in the second node 130, so the update operation on the nonexistent object A fails. When the insert command 150 subsequently arrives at the second node 130, it is performed, and object A stored in the second node 130 has the value 'a'.

As described above, depending on the order in which a plurality of commands are delivered to each node, a data inconsistency problem can arise in which multiple replicas hold different values for a particular object.
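The scenario of FIG. 1 can be reproduced with a short Python sketch. The tuple encoding of commands and the helper `apply_command` are assumptions made for illustration; the variable names merely echo the reference numerals of the figure.

```python
def apply_command(store, command):
    """Apply one command to a node's store; return False if it cannot be performed."""
    kind, key, value = command
    if kind == "insert":
        store[key] = value
        return True
    if kind == "update":
        if key not in store:
            return False          # update on a nonexistent object fails
        store[key] = value
        return True
    raise ValueError(f"unknown command kind: {kind}")


insert_150 = ("insert", "A", "a")   # timestamp T1
update_160 = ("update", "A", "b")   # timestamp T2, with T1 < T2

node_120, node_130 = {}, {}

# The first node receives the commands in the order they were issued.
results_120 = [apply_command(node_120, c) for c in (insert_150, update_160)]

# The second node receives the update before the insert.
results_130 = [apply_command(node_130, c) for c in (update_160, insert_150)]

print(node_120, node_130)
```

After both nodes have seen both commands, the first node holds 'b' for object A while the second node holds 'a', which is the inconsistency the embodiments below aim to remove.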

To prevent such data inconsistency, a distributed storage system may execute commands sequentially according to the times at which they were requested. Each command includes a timestamp indicating when it was requested, and each replica checks the timestamps and performs the received commands in order. However, if a replica keeps waiting for another command (for example, the insert command 150) in order to perform a command that arrived ahead of it (for example, the update command 160), the response time cannot be guaranteed.

The embodiment of the present invention described below presents a method of solving the data inconsistency problem by using a log that indicates the failure of an operation.

FIG. 2 is a flowchart illustrating a method by which a distributed storage system using a plurality of nodes manages data, according to an embodiment of the present invention.

The distributed storage system may include a plurality of nodes. All or some of the plurality of nodes may be replicas that replicate the same data.

In the following, the failed node that performs steps 210 to 295 may be one of the plurality of nodes included in the distributed storage system, and may correspond to the second node 130 described above with reference to FIG. 1.

In step 210, the failed node receives a first command. The first command may be the update command 160 described above with reference to FIG. 1. The first command includes a first operation that the failed node is to perform. The first command may mean a request for the first operation.

In step 220, the failed node determines whether the first operation can be performed. If the first operation can be performed, the failed node performs the first operation (step 290). Next, the failed node outputs a success response for the first command (step 295), and the procedure ends. Here, outputting the success response may mean that the failed node transmits, to the client that requested the first command, a success response indicating that the first command was performed successfully.

In step 230, when a failure occurs in the failed node that prevents the first operation of the first command from being performed, the failed node generates a log indicating the failure of the first operation.

The failure may mean that an object used in the first operation (for example, object A in the operation of the update command 160) does not exist in the failed node. In this case, the generated log may be information indicating that the first operation failed because a particular object does not exist in the failed node. The failed node may then perform the first operation after performing a second operation of a second command that includes creation of the object (for example, the operation of the insert command 150). For example, the first operation may be an update operation on a particular object, and the second operation may be an insert operation on that object. Here, an insert operation is regarded as including, as part of the operation, the creation of the object to be inserted, whereas an update operation is regarded as not including the creation of the object to be updated.

The first command and the second command may each include a timestamp. The timestamp of each command may be the timestamp of the operation that the command includes. The timestamp of the first command (that is, of the first operation) may indicate a later time than the timestamp of the second command (that is, of the second operation), and the failed node can determine the order of the commands (or operations) by comparing the times indicated by the timestamps.

The log may be a log that the failed node keeps locally (that is, within the failed node itself) so that it can perform the failed operation again later. The log may be a log that allows the distributed storage system to provide eventual consistency, and may be called an Eventual Query Log (EQ Log).

In step 240, the failed node may output a success response for the first command.

Through steps 230 and 240, the failed node does not wait until the failed operation can be performed; it only generates the log and immediately outputs a success response. The failed node (or the distributed storage system) can therefore guarantee an immediate response time.

In steps 250 and 280, the failed node may repeatedly attempt the first operation based on the generated log.

In step 250, the failed node may wait for a certain period of time. While waiting, the failed node may perform the second operation described above.

In step 260, the failed node determines whether the first operation can be performed. If the first operation still cannot be performed, step 250 is repeated. If the first operation can be performed, the failed node performs the first operation in step 270. Next, in step 280, the failed node deletes the log, after which the procedure ends. For example, the first operation can be performed once the object used in the first operation has been created by the failed node performing the second operation.
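The flow of steps 210 to 295 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class name `Node`, the list used as the EQ Log, and the `retry_pending` method standing in for the wait-and-retry loop are illustrative choices, and timestamps are omitted for brevity.

```python
class Node:
    """A node that logs operations it cannot perform yet and retries them later."""

    def __init__(self):
        self.store = {}
        self.eq_log = []   # local log of operations that failed on arrival

    def _can_perform(self, operation):
        kind, key, _ = operation
        return kind == "insert" or key in self.store   # steps 220 and 260

    def _perform(self, operation):
        _, key, value = operation
        self.store[key] = value                        # steps 270 and 290

    def receive(self, operation):
        """Steps 210 to 240 and 290 to 295: respond with success right away."""
        if self._can_perform(operation):
            self._perform(operation)
        else:
            self.eq_log.append(operation)              # step 230
        return "success"                               # steps 240 and 295

    def retry_pending(self):
        """One pass of steps 260 to 280; returns the number of entries still logged."""
        remaining = []
        for operation in self.eq_log:
            if self._can_perform(operation):
                self._perform(operation)               # step 270
            else:
                remaining.append(operation)            # go back to waiting (step 250)
        self.eq_log = remaining                        # performed entries are deleted (step 280)
        return len(remaining)


node = Node()
response = node.receive(("update", "A", "b"))   # arrives before the insert
pending_after_first_retry = node.retry_pending()
node.receive(("insert", "A", "a"))              # the second operation creates A
pending_after_second_retry = node.retry_pending()
print(response, node.store, node.eq_log)
```

The client receives a success response for the update even though object A did not exist yet, and the node converges to the value 'b' once the insert has arrived and the logged update has been replayed.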

Steps 210 to 295 described above can be performed individually by each node in the distributed storage system. With this method, each node maintains data consistency by performing operations in the order of the commands, while still returning an immediate response to each command. Thus, even if the distributed storage system contains a failed node, a client that sends a command to the system does not have to wait a long time for a response.

FIG. 3 is a flowchart illustrating a method by which a distributed storage system manages data by recording a log in a neighbor node, according to an embodiment of the present invention.

In this embodiment, the failed node described with reference to FIG. 2 stores the log indicating the failure of an operation in a neighbor node within the distributed storage system rather than in the failed node itself, and performs the operation when the neighbor node requests a retry of the operation. The failed node 310 of this embodiment may be the failed node described with reference to FIG. 2.

Step 330 corresponds to steps 210 to 230 described above. The failure described in step 230 may be a network failure or a process failure of the failed node 310. Because of such a failure, the operation of a particular command may be lost without being performed.

In step 340, the failed node 310 transmits the generated log to the log node 320. The log node 320 is a neighbor node of the failed node 310. Because this log records a command (or operation) that would otherwise be lost, it may be called a missed query log (MQ log).

The failed node 310 may select one of the plurality of nodes of the distributed storage system as the log node 320. The failed node 310 may preferentially select, as the log node 320, a node other than the replicas that replicate the same data as the failed node itself. Through this selection, the failed node 310 can keep constant the sum of the number of nodes that have performed a particular operation and the number of nodes that hold a log of the failure of that operation.

There may be a plurality of log nodes 320. When the data of the failed node 310 is copied by n replicas, the log nodes 320 may be n - 1 nodes of the distributed storage system. Using a plurality of log nodes 320 prevents the log from being lost even when another node among the replicas additionally fails. When there are a plurality of log nodes 320, steps 345, 350, and 380 described below may be performed by each of the log nodes 320.
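One way to express the selection preference is sketched below. The function name `select_log_nodes`, the string node identifiers, and the fallback to replica nodes when too few non-replica nodes exist are assumptions for illustration; the description only states that non-replica nodes are preferred and that n - 1 log nodes may be used.

```python
from typing import List, Set


def select_log_nodes(all_nodes: List[str], failed: str,
                     replicas: Set[str], n: int) -> List[str]:
    """Pick n - 1 log nodes, preferring nodes that are not replicas."""
    candidates = [node for node in all_nodes if node != failed]
    # sorted() is stable and False sorts before True, so non-replica
    # nodes come first while the original order is otherwise preserved.
    ordered = sorted(candidates, key=lambda node: node in replicas)
    return ordered[: max(n - 1, 0)]


nodes = ["a", "b", "c", "d", "e"]
chosen = select_log_nodes(nodes, failed="a", replicas={"a", "b", "c"}, n=3)
fallback = select_log_nodes(["a", "b", "c"], failed="a",
                            replicas={"a", "b", "c"}, n=3)
```

With three replicas, two log nodes are chosen; in the first call both are non-replica nodes, and in the second call the sketch falls back to replica nodes because no others exist.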

In step 345, the log node 320 stores the transmitted log.

In steps 350 to 355, the log node 320 periodically requests the failed node 310 to attempt the first operation based on the stored log.

In step 350, the log node 320 checks the stored log and requests the failed node 310 to attempt the first operation.

In step 355, the failed node 310 that received the request determines whether the first operation can be performed. If the first operation can be performed, step 360 is performed. If the first operation cannot be performed, step 350 is repeated periodically.

In step 360, the failed node 310, having recovered from the failure, performs the first operation.

In step 370, the failed node 310 requests the log node 320 to delete the log.

In step 380, the log node 320 deletes the log.
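The exchange between the failed node 310 and the log node 320 can be sketched as follows. The class names, the `healthy` flag, and the modeling of the deletion request of step 370 as a boolean return value are simplifying assumptions; a real system would exchange these messages over the network and run the retry on a timer.

```python
from typing import Dict, List, Tuple


class FailedNode:
    """Hypothetical stand-in for the failed node 310."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.healthy = False

    def try_operation(self, key: str, value: str) -> bool:
        # Step 355: check whether the operation can be performed now.
        if not self.healthy:
            return False
        self.store[key] = value   # step 360
        return True               # stands in for the request of step 370


class LogNode:
    """Hypothetical stand-in for the log node 320, which keeps the MQ log."""

    def __init__(self) -> None:
        self.mq_log: List[Tuple[str, str]] = []

    def store_log(self, key: str, value: str) -> None:
        self.mq_log.append((key, value))   # step 345

    def retry_round(self, node: FailedNode) -> None:
        # Step 350: request a retry for every stored entry.
        remaining = []
        for key, value in self.mq_log:
            if not node.try_operation(key, value):
                remaining.append((key, value))
        self.mq_log = remaining            # step 380: drop completed entries


failed, log_node = FailedNode(), LogNode()
log_node.store_log("obj", "v1")   # steps 340 and 345
log_node.retry_round(failed)      # node still down: entry is kept
kept = len(log_node.mq_log)
failed.healthy = True             # node recovers from the failure
log_node.retry_round(failed)      # entry is applied and then deleted
```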

FIG. 4 illustrates a distributed storage system that copies data according to an embodiment of the present invention.

This embodiment discloses a method of copying data to another node when a failure occurs in one node among the replicas 410 of the distributed storage system 400.

A failure may occur in one of the replicas 410 that copy and store the same data. The failed node 420 is the node among the replicas 410 in which the failure occurred, and the replicas 410 are one or more nodes, among the plurality of nodes of the distributed storage system 400, that provide the same data as the failed node 420. Because of the failure, the data of the failed node 420 may become inaccessible. When a failure occurs, one of the remaining replicas 410 other than the failed node 420 must copy its data to another node that is operating normally, and the node to which the data was copied must be added as a new replica. Once the copy is complete and the location information of the replicas 410 is updated, the distributed storage system 400 can keep the number of replicas 410 constant. Replica migration refers to copying of this kind.

One of the replicas 410 recognizes the failure of the failed node 420 and copies its data to a node of the distributed storage system 400 that is not included in the replicas 410. The source node 430 is the node that recognized the failure of the failed node 420. The target node 450 is the node to which the data of the source node 430 is copied. The target node 450 is a node, among the plurality of nodes of the distributed storage system 400, that is newly selected as one of the replicas 410 in place of the failed node 420. The replica node 440 is a remaining node among the replicas 410 other than the failed node 420 and the source node 430.

The client 490 is a node that requests data access from the distributed storage system 400. The coordination node 460 is a node that receives the request from the client 490, duplicates the data access request, and transmits it to each of the replicas 410 that replicate the data. The coordination node 460 may be any node, among the plurality of nodes of the distributed storage system 400, that receives a data access request from the client 490. For example, one of the replicas 410 may also serve as the coordination node 460.

The data access request is transmitted to the replica node 440 and the source node 430, and may be transmitted to the failed node 420 or the target node 450. For example, when the coordination node 460 has not recognized that a failure occurred in the failed node 420, the data access request is transmitted to the failed node 420. When the coordination node 460 has recognized that the target node 450 replaces the failed node 420, the data access request may be transmitted to the target node 450.
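The routing rule can be written as a small function. The name `recipients` and the convention that `None` means "not yet known to the coordination node" are assumptions for illustration.

```python
from typing import List, Optional


def recipients(replicas: List[str], failed: Optional[str] = None,
               target: Optional[str] = None) -> List[str]:
    """Return the nodes that should receive a data access request.

    `failed` and `target` reflect what the coordination node currently
    knows; None means it has not learned of the failure or replacement.
    """
    if failed is None or target is None:
        # Failure unknown: the failed node is still addressed.
        return list(replicas)
    # Replacement known: the target node takes the failed node's place.
    return [target if node == failed else node for node in replicas]


before = recipients(["r1", "r2", "r3"])
after = recipients(["r1", "r2", "r3"], failed="r2", target="t")
```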

When replica migration is executed, the entire data of the source node 430 (for example, all tables in the local database of the source node 430) is read, and the data that was read is inserted into the target node 450. The distributed storage system 400 may receive data access requests from clients even while the source node 430 is executing the replica migration. In this case, unless the distributed storage service of the source node 430 is stopped, it can become difficult to copy the additionally incoming data accurately to the target node 450.

To copy data without interrupting the distributed storage service, the distributed storage system 400 maintains a timestamp for every piece of data (for example, every object or table row).

A specific method for replica migration is described in detail below with reference to FIG. 5.

The technical details according to an embodiment of the present invention described above with reference to FIGS. 1 to 3 may be applied to this embodiment as they are. A more detailed description is therefore omitted.

FIG. 5 is a flowchart of a data copy method of a distributed storage system according to an embodiment of the present invention.

The distributed storage system 400 may further include a management node 500 that manages data copying.

In step 510, the management node 500 recognizes the failure of the failed node 420.

In step 512, the management node 500 may select, from among the plurality of nodes of the distributed storage system 400 other than the replicas 410, one node as the target node 450 to which the data will be replicated. The management node 500 may notify all nodes in the distributed storage system 400 (or all nodes except a particular node) that 1) a failure has occurred in the failed node 420 and 2) the target node 450 has been selected to replace the failed node 420.

In step 514, the coordination node 460 instructs the target node 450 to copy (that is, migrate) the data from the source node 430.

In step 516, the target node 450 registers the copying of the data with the source node 430.

In step 520, the source node 430 sets an end time for the copying of the data. The end time may be the time at which all of the plurality of nodes in the distributed storage system 400 (or all nodes except the failed node 420) have recognized the copying of the data to the target node 450 and can therefore send data access requests to the target node 450 as well, or a time later than that time.

In step 530, the source node 430 may sort the objects of the data in ascending order of their timestamps.

In step 540, the data of the source node 430 (that is, the objects that make up the data) is copied to the target node 450. This copying may mean that the target node 450 inserts the data read from the database of the source node 430 into its own database.

In step 540, the source node 430 copies those objects of the data whose timestamps indicate the end time or a time before the end time. Taking the timestamp of the most recently copied object as a reference, the source node 430 may proceed to copy the objects whose timestamps indicate a later time than that timestamp.

Step 540 may be performed again. When the copying of the data is complete, the data newly added after the point at which the copying was registered in step 516 may again be copied from the source node 430 to the target node 450.

When an object is updated after the end time, the data access request for that update is also transmitted to the target node 450. Because the target node 450 performs the update itself, the object does not need to be copied again from the source node 430.

Step 550 represents the copying of each object of the source node 430 to the target node 450. By executing step 550 repeatedly, the source node 430 can copy the objects to the target node 450 in ascending order of their timestamps.
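The timestamp-ordered copy bounded by the end time can be sketched as a repeatable pass. The function `copy_pass`, the integer timestamps, and the dictionary representation of each node's database are assumptions for illustration, not the storage format of the described system.

```python
from typing import Dict, Tuple

Obj = Tuple[str, int]  # (value, timestamp)


def copy_pass(source: Dict[str, Obj], target: Dict[str, Obj],
              last_copied_ts: int, end_time: int) -> int:
    """Copy objects with last_copied_ts < timestamp <= end_time, oldest first.

    Returns the timestamp of the last object copied, or last_copied_ts
    if nothing was eligible.
    """
    eligible = [(ts, key, value) for key, (value, ts) in source.items()
                if last_copied_ts < ts <= end_time]
    for ts, key, value in sorted(eligible):   # step 530: ascending timestamps
        target[key] = (value, ts)             # step 550: copy one object
        last_copied_ts = ts
    return last_copied_ts


source = {"a": ("a1", 1), "b": ("b1", 2), "c": ("c1", 9)}
target: Dict[str, Obj] = {}
mark = copy_pass(source, target, last_copied_ts=0, end_time=5)
source["a"] = ("a2", 4)   # updated before the end time: copied again
mark = copy_pass(source, target, last_copied_ts=mark, end_time=5)
```

Object `c` carries a timestamp after the end time, so the sketch never copies it; per the description, requests issued after the end time reach the target node directly.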

While step 540 is being performed, steps 560 to 570 may be performed.

In step 560, the coordination node 460 receives an update request for a first object.

In step 565, the coordination node 460 transmits the update request for the first object to each of the source node 430 and the target node 450. The source node 430 may update the value of the first object and change the value of the first object's timestamp to the latest time.

If the update request for the first object is transmitted to the target node 450 before the first object is created in the target node 450 (that is, before the source node 430 copies the first object to the target node 450), the target node 450 cannot update the first object, which does not exist. The update request therefore fails, and in step 570 the target node 450 may generate a log indicating the failure of the update request. The first object is subsequently copied to the target node 450 by the data copying of step 540. After the first object is copied, the target node 450 may update the first object based on the log in step 590. After the first object is updated, the target node 450 deletes the log in step 595.
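The behavior of the target node during migration can be sketched as follows. The `TargetNode` class and its method names are hypothetical, and the sketch replays logged updates immediately when the object arrives, which is one possible reading of steps 590 and 595.

```python
from typing import Dict, List, Tuple


class TargetNode:
    """Hypothetical stand-in for the target node 450."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.failed_updates: List[Tuple[str, str]] = []

    def update(self, key: str, value: str) -> None:
        if key in self.store:
            self.store[key] = value
        else:
            # Step 570: the object has not been copied yet, so log the failure.
            self.failed_updates.append((key, value))

    def receive_copy(self, key: str, value: str) -> None:
        self.store[key] = value   # the object arrives through step 540
        # Step 590: replay logged updates for this object in arrival order.
        for logged_key, logged_value in self.failed_updates:
            if logged_key == key:
                self.store[key] = logged_value
        # Step 595: delete the log entries that have been applied.
        self.failed_updates = [entry for entry in self.failed_updates
                               if entry[0] != key]


target = TargetNode()
target.update("obj", "new")         # arrives before the object exists
target.receive_copy("obj", "old")   # migration delivers the object
```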

If the update request for the first object is transmitted to the target node 450 after the source node 430 has copied the first object to the target node 450, and the time at which the first object was changed is before the end time, the source node 430 may copy the first object to the target node 450 again in step 580. This recopy may be performed through an update operation. That is, when the first object to be copied does not exist in the target node 450, copying the object may mean that the source node 430 performs an operation that inserts the first object into the target node 450. When the first object to be copied exists in the target node 450, copying the object may mean that the source node 430 performs an operation that updates the first object of the target node 450. For example, when the first object to be copied already exists in the target node 450, the insert operation that performs the copy of the first object may be changed to an update operation.
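The choice between an insert and an update can be sketched against a table that, like many databases, rejects an insert for an existing key. The `Table` class and `copy_object` function are illustrative assumptions rather than an interface defined in this document.

```python
from typing import Dict


class Table:
    """Hypothetical table with strict insert and update operations."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def insert(self, key: str, value: str) -> None:
        if key in self.rows:
            raise KeyError(f"duplicate key: {key}")
        self.rows[key] = value

    def update(self, key: str, value: str) -> None:
        if key not in self.rows:
            raise KeyError(f"missing key: {key}")
        self.rows[key] = value


def copy_object(table: Table, key: str, value: str) -> str:
    """Copy one object, switching from insert to update if it already exists."""
    if key in table.rows:
        table.update(key, value)
        return "update"
    table.insert(key, value)
    return "insert"


table = Table()
first = copy_object(table, "obj", "v1")    # first copy of the object
second = copy_object(table, "obj", "v2")   # recopy after an update (step 580)
```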

Unlike conventional data migration, the objects of the source node 430 are not frozen during the data copy. The original data (that is, the data in the source node 430) can therefore continue to be updated, and an object that has already been copied to the target node 450 can be copied again after an update because its timestamp changes to the latest value. In addition, 1) new data access requests are also transmitted to the target node 450, and 2) the source node 430 copies to the target node 450 only the objects whose timestamps indicate a time before the end time. Together, these prevent the copy time from being extended indefinitely when many objects are newly updated (or inserted).

In step 585, the target node 450, for which the data copy has completed, may notify the management node 500 that the data copy is complete. The management node 500 may notify all nodes of the distributed storage system 400 that the target node 450 has become one of the replicas 410.

The technical details according to an embodiment of the present invention described above with reference to FIGS. 1 to 4 may be applied to this embodiment as they are. A more detailed description is therefore omitted.

A method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the scope of the present invention should not be limited to the described embodiments, and should be determined by the claims set out below and by their equivalents.

400: distributed storage system
410: replicas
420: failed node
430: source node
440: replica node
450: target node
460: coordination node
500: management node

Claims

A distributed storage system comprising a plurality of nodes,
wherein a first node of the plurality of nodes receives a request for a first operation, generates a log indicating a failure of the first operation when a failure occurs in which the first operation cannot be performed, repeatedly attempts the first operation based on the log, and deletes the log when the first operation is performed.

The distributed storage system of claim 1,
wherein the failure includes a case in which an object used in the first operation does not exist in the first node, and the first node performs the first operation after performing a second operation that includes generation of the object.

The distributed storage system of claim 2,
wherein a timestamp of the first operation indicates a later time than a timestamp of the second operation.

A distributed storage system comprising a plurality of nodes,
wherein a first node of the plurality of nodes receives a request for a first operation, generates a log indicating a failure of the first operation when a failure occurs in which the first operation cannot be performed, and transmits the log to a second node of the plurality of nodes, and
wherein the second node stores the log and periodically requests the first node to attempt the first operation based on the log, and the first node requests the second node to delete the log when the first operation is performed.

A distributed storage system comprising a plurality of nodes,
wherein a failed node is a node of the plurality of nodes in which a failure has occurred, a source node is one of replicas, the replicas are one or more nodes of the plurality of nodes that provide the same data as the failed node, and a target node is a node of the plurality of nodes newly selected as one of the replicas in place of the failed node, and
wherein the source node recognizes the failure of the failed node, sets an end time for copying of data, copies objects of the data to the target node, and copies again to the target node a first object that was updated before the end time after having been copied to the target node.

The distributed storage system of claim 5,
wherein the end time is a time at which all of the plurality of nodes have recognized the copying to the target node and can also send access requests for the data to the target node.

The distributed storage system of claim 5,
wherein the source node copies the objects to the target node in ascending order of the timestamps of the objects.

The distributed storage system of claim 7,
wherein the source node performs the copying on those objects whose timestamps indicate the end time or a time before the end time.

The distributed storage system of claim 5,
wherein a coordination node of the plurality of nodes receives an update request for the first object and transmits the update request for the first object to each of the source node and the target node, and the target node generates a log indicating a failure of the update request when the first object does not exist, and updates the first object based on the log after the first object is copied from the source node.

The distributed storage system of claim 5,
wherein the source node copies a second object to the target node by performing an operation of inserting the second object into the target node when the second object does not exist in the target node, and by performing an operation of updating the second object of the target node when the second object exists in the target node.

A method of managing data in a distributed storage system using a plurality of nodes, the method comprising:
receiving, by a first node of the plurality of nodes, a request for a first operation;
generating, by the first node, a log indicating a failure of the first operation when a failure occurs in the first node in which the first operation cannot be performed;
repeatedly attempting, by the first node, the first operation based on the log; and
deleting, by the first node, the log when the first operation is performed.

The method of claim 11,
wherein the failure includes a case in which an object used in the first operation does not exist in the first node,
wherein the repeatedly attempting of the first operation comprises:
performing, by the first node, a second operation that includes creation of the object; and
performing the first operation after the second operation is performed,
and wherein a timestamp of the first operation indicates a later time than a timestamp of the second operation.

A method of managing data in a distributed storage system using a plurality of nodes, the method comprising:
receiving, by a first node of the plurality of nodes, a request for a first operation;
generating, by the first node, a log indicating a failure of the first operation when a failure occurs in the first node in which the first operation cannot be performed;
transmitting, by the first node, the log to a second node;
storing, by the second node, the log;
periodically requesting, by the second node, the first node to attempt the first operation based on the log;
performing, by the first node, the first operation if the first operation can be performed;
requesting, by the first node, the second node to delete the log when the first operation is performed; and
deleting, by the second node, the log.

A method of copying data between nodes in a distributed storage system using a plurality of nodes, the method comprising:
requesting, by a target node, copying of data from a source node, wherein the source node is one of replicas and the target node is a node of the plurality of nodes newly selected as one of the replicas in place of a failed node among the replicas;
setting, by the source node, an end time for the copying of the data;
copying, by the source node, objects of the data to the target node; and
copying again to the target node, by the source node, a first object that was updated before the end time after having been copied to the target node.

The method of claim 14,
wherein the end time is a time at which all of the plurality of nodes have recognized the copying to the target node and can also send access requests for the data to the target node.

The method of claim 14,
wherein the copying comprises copying, by the source node, the objects to the target node in ascending order of the timestamps of the objects.

The method of claim 16,
wherein the source node performs the copying on those objects whose timestamps indicate the end time or a time before the end time.

The method of claim 14, further comprising:
receiving, by a coordination node of the plurality of nodes, an update request for the first object;
transmitting, by the coordination node, the update request for the first object to each of the source node and the target node;
generating, by the target node, a log indicating a failure of the update request when the first object does not exist; and
updating, by the target node, the first object based on the log after the first object is copied from the source node.

The method of claim 14,
wherein, when a second object that is a target of the copying does not exist in the target node, the copying is performed by the source node performing an operation of inserting the second object into the target node, and when the second object exists in the target node, the copying is performed by the source node performing an operation of updating the second object of the target node.

A computer-readable recording medium containing a program for performing the method of any one of claims 11 to 19.