JP2012168781A

JP2012168781A - Distributed data-store system, and record management method in distributed data-store system

Info

Publication number: JP2012168781A
Application number: JP2011029782A
Authority: JP
Inventors: Kazuki Oikawa; 一樹及川; Takahiro Ida; 恭弘飯田; Hiroyuki Uchiyama; 寛之内山; Miyoshi Hanaki; 三良花木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-15
Filing date: 2011-02-15
Publication date: 2012-09-06

Abstract

PROBLEM TO BE SOLVED: To avoid wasting computing resources and improve response performance in a distributed data-store system which performs distributed management of information using a plurality of computers connected by a network. SOLUTION: A distributed data-store system performs distributed management of records identified by Keys in a plurality of nodes 2-1 to 2-3 connected by a network. Each of the plurality of nodes 2-1 to 2-3 has: a storage unit 50 for storing a plurality of records managed by the node as aggregations grouped by arbitrary key ranges; an index assignment unit 10 for assigning, to each aggregation, an index which uses the keys included in the range of that aggregation; and a record acquisition unit 40 for, in response to a record acquisition request, acquiring the requested record from the storage unit 50 by referring to the indexes of the aggregations.

Description

The present invention relates to a distributed data store system that achieves scale-out by distributing the management of information across a plurality of computers connected via a network, and in particular to a technique for avoiding wasted computing resources and improving response performance.

In a typical database, storage space is allocated in fixed sizes defined by the table schema. The offset of stored data is therefore easy to determine, and any record or column can be accessed efficiently.

In a record-appending distributed data store system, by contrast, multiple variable-length records are written to a file together, and an update does not overwrite an existing file but appends to the end of the file or to a separate file. To obtain a desired record, these files must therefore be read into memory one by one and checked for the presence of that record. As a result, even when only a single record or only specific columns are wanted, files that did not need to be read are accessed, which wastes computing resources and degrades response time.

Some such distributed data store systems use a combination of a Key and a Value as a record. In these systems, an arbitrary Key is assigned to the Value, which is the data to be stored. The data is divided into pieces of arbitrary size, with a set of consecutive Keys and their corresponding Values as the minimum unit, and saved to a disk area serving as the file system. The Key space is thereby divided automatically into arbitrary ranges, and scale-out is achieved by assigning the divided Key ranges to different nodes. To read the record corresponding to a given Key, an index node is first queried for the location of the node in charge of the Key range that includes that Key. Once location information such as that node's IP address has been obtained, a request to read the record corresponding to the Key is sent to the responsible node.
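The lookup described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the class name `IndexNode`, the representation of a range by its start Key, and the example addresses are all assumptions made for the sketch.

```python
import bisect


class IndexNode:
    """Hypothetical index node: maps Key ranges to node locations."""

    def __init__(self, ranges):
        # ranges: list of (start_key, location). Each range is assumed to run
        # from its start_key up to (but not including) the next start_key.
        self._ranges = sorted(ranges)
        self._starts = [start for start, _ in self._ranges]

    def locate(self, key):
        """Return the location of the node in charge of the range containing key."""
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(f"no range covers key {key!r}")
        return self._ranges[i][1]


# Example addresses are placeholders from the documentation address block.
index_node = IndexNode([("a", "192.0.2.1"), ("h", "192.0.2.2"), ("q", "192.0.2.3")])
location = index_node.locate("apple")  # the client would then contact this node
```

Keeping the range start Keys in a sorted list lets the lookup use binary search, which is one common way to map a Key to its range.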

Within a node, recently accessed Key-Value pairs are kept in a memory area, and other data is read from a disk area. Updates, insertions, and similar operations are also performed only in the memory area. When the size of the Keys and Values in the memory area, or the size of the commit log, exceeds a threshold, a process called minor compaction occurs: the group of records in the memory area, consisting of multiple Key-Value pairs, is written out as a file to the disk area and stored persistently (see, for example, Non-Patent Document 1).

OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.

As described above, in such a distributed data store system, unless a major compaction is executed, all files in the disk area corresponding to a given Key range must be read to determine which file contains the target data. With a distributed file system, this places a heavy disk I/O load on the disk area and a heavy network I/O load between the nodes that make up the distributed file system.

The present invention has been made in view of the above problems of the conventional technology. Its object is to provide a distributed data store system, and a record management method in a distributed data store system, that can avoid wasting computing resources and improve response performance in a distributed data store system in which information is managed in a distributed manner using a plurality of computers connected by a network.

To achieve the above object, the present invention provides
a distributed data store system in which records specified by identifiers are managed in a distributed manner by a plurality of nodes connected by a network,
wherein each of the plurality of nodes comprises:
record storage means for storing a plurality of records managed by the node as aggregates, one for each arbitrary range of the identifiers;
index assigning means for assigning to each aggregate an index that uses the identifiers included in the range of that aggregate; and
record acquisition means for, in response to a record acquisition request, acquiring the requested record from the record storage means by referring to the index.

The present invention also provides a record management method in a distributed data store system in which records specified by identifiers are managed in a distributed manner by a plurality of nodes connected by a network,
wherein each of the plurality of nodes, when storing a plurality of records managed by the node as aggregates for each arbitrary range of the identifiers, assigns to each aggregate an index that uses the identifiers included in the range of that aggregate, and, in response to a record acquisition request, acquires the requested record by referring to the index.

As described above, in the present invention, when records specified by identifiers are managed in a distributed manner by a plurality of nodes connected by a network, each node stores the records it manages as aggregates for each arbitrary range of identifiers, assigns to each aggregate an index that uses the identifiers included in the range of that aggregate, and acquires the record requested by a record acquisition request by referring to the index. Even when multiple files correspond to a given range of identifiers, the file containing the target identifier can therefore be accessed directly. There is no need to access multiple files, so the response time for reads is shortened and the load on the file system and the network is reduced.

FIG. 1 is a diagram showing an embodiment of the distributed data store system of the present invention.
FIG. 2 is a diagram for explaining the processing for writing a record to the storage unit in the distributed data store system shown in FIG. 1.
FIG. 3 is a diagram showing the structure of the index created by the index assigning unit shown in FIG. 1.
FIG. 4 is a diagram for explaining the processing for reading a record stored in the storage unit in the distributed data store system shown in FIG. 1.
FIG. 5 is a diagram for explaining a method of managing files by generation in the distributed data store system shown in FIG. 1.
FIG. 6 is a diagram showing the structure of the files created for each generation: (a) shows the overall structure, (b) shows the structure of a Key, and (c) shows a specific example.
FIG. 7 is a flowchart for explaining the record read processing in the distributed data store system shown in FIG. 1.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram showing an embodiment of the distributed data store system of the present invention.

As shown in FIG. 1, in this embodiment an index node 1 and three nodes 2-1 to 2-3 are connected by a network. The records managed in a distributed manner by the nodes 2-1 to 2-3 have a Key-Value Store (KVS) data structure, in which each record consists of a pair of a Key, which is an identifier that uniquely identifies the record, and a Value. A Key consists of a RowKey, a column name, and a timestamp, so the model can represent a table structure that has the concept of timestamps. The index node 1, which may be a single machine or a hierarchy of machines, determines which node is in charge of each Key range, and all processing of read and write requests for the record corresponding to a given Key is performed by the node in charge of that Key. In this embodiment, as shown in FIG. 1, node 2-1 is in charge of two Key ranges, Range1 and Range2; node 2-2 is in charge of one Key range, Range3; and node 2-3 is in charge of one Key range, Range4.

Each of the nodes 2-1 to 2-3 has a storage unit 50, which is record storage means for storing the records of the Key ranges the node is in charge of, as well as an index assigning unit 10, a storage control unit 20, a file dividing unit 30, and a record acquisition unit 40.

The storage unit 50 has a memory area 51 with a cache function and a disk area 52 in which files created by executing minor compaction on the records accumulated in the memory area 51 are stored persistently. It stores the plurality of records managed by the nodes 2-1 to 2-3 as aggregates for each arbitrary Key range.

The index assigning unit 10 creates and assigns, for each arbitrary Key range managed by the nodes 2-1 to 2-3, an index for accessing the records corresponding to that Key range.

The storage control unit 20 accumulates records of an arbitrary Key range managed by the nodes 2-1 to 2-3 in the memory area 51 in ascending or descending Key order. When the size of the records accumulated in the memory area 51 exceeds a predetermined threshold, it executes minor compaction and stores those records in the disk area 52 as a single file.

The file dividing unit 30 extracts, from the plurality of records stored in the disk area 52, a plurality of records that satisfy a predetermined condition, sorts them in ascending or descending Key order, and generates a new file.

In response to a record acquisition request, the record acquisition unit 40 refers to the index to acquire the requested record from the storage unit 50 and returns it to the requester.

The record management method in the distributed data store system configured as described above is explained below.

First, the processing for writing a record to the storage unit 50 is described. The following description uses the processing for node 2-1 as an example; the processing for nodes 2-2 and 2-3 is the same.

FIG. 2 is a diagram for explaining the processing for writing a record to the storage unit 50 in the distributed data store system shown in FIG. 1.

As shown in FIG. 2, a request from a client to write a record for an arbitrary Key is sent to node 2-1, which manages the records of the Key range containing that Key.

In node 2-1, which has received the write request, the record is first accumulated in the memory area 51 under the control of the storage control unit 20 (ア in FIG. 2). Records are accumulated in the memory area 51 in ascending or descending Key order.

When the total size of the records accumulated in the memory area 51 exceeds a predetermined threshold, minor compaction is executed (イ in FIG. 2), and the records are written out as a single file and stored persistently in the disk area 52 (ウ in FIG. 2). The threshold is generally specified relative to the size of the installed physical memory.

At the same time, the index assigning unit 10 creates an index covering all of the records and assigns it to the records stored in the disk area 52. The index information for the Key range (index1 in FIG. 2) is thereby updated.

FIG. 3 is a diagram showing the structure of the index created by the index assigning unit 10 shown in FIG. 1.

As shown in FIG. 3, the index created by the index assigning unit 10 stores, for each Key, the name of the file containing the Key, the offset within the file, and the record size. An index is assigned to each of the aggregates into which the Keys handled by node 2-1 are grouped by arbitrary range.

In each of the nodes 2-1 to 2-3, the records belonging to each Key range (Range) the node is in charge of are thus managed as a logical aggregate, and each aggregate is given an index for accessing the Value corresponding to a Key.
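The write path and index layout described above can be sketched roughly as follows. This is a simplified illustration for a single Key range. The class name `Node`, the in-memory dictionary standing in for the disk area, the tab-and-newline record encoding, and the 20-byte threshold are all assumptions made for the sketch; only the index fields (file name, offset, record size) come from the description.

```python
class Node:
    """Hypothetical node handling one Key range; 'disk' is a dict of file name -> bytes."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes  # flush threshold (assumed value)
        self.memory = {}                  # memory area: key -> value
        self.disk = {}                    # disk area: file name -> file contents
        self.index = {}                   # key -> (file name, offset, record size)

    def write(self, key, value):
        """Accumulate the record in memory; flush when the total size exceeds the threshold."""
        self.memory[key] = value
        if sum(len(k) + len(v) for k, v in self.memory.items()) > self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        """Write the memory area out as one Key-sorted file and index every record."""
        name = f"file{len(self.disk) + 1}"
        data = bytearray()
        for key in sorted(self.memory):
            record = f"{key}\t{self.memory[key]}\n".encode()  # assumed record encoding
            self.index[key] = (name, len(data), len(record))
            data += record
        self.disk[name] = bytes(data)
        self.memory.clear()


node = Node(threshold_bytes=20)
node.write("row1:name:1", "alice")  # stays in memory
node.write("row0:name:1", "bob")    # total size exceeds the threshold, so a file is written
```

After the second write, `node.index` maps each Key to the file that holds it together with the byte offset and size of its record, which is the information a later read needs to avoid scanning whole files.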

Next, the processing for reading a record stored in the storage unit 50 is described.

FIG. 4 is a diagram for explaining the processing for reading a record stored in the storage unit 50 in the distributed data store system shown in FIG. 1.

As shown in FIG. 4, files are stored in the disk area 52 as described above (ア in FIG. 4), and these files can be accessed through the index created as described above (イ in FIG. 4).

When a client sends a record acquisition request to read the record corresponding to an arbitrary Key, node 2-1 first refers to the index for the Key range containing the target Key and determines which file contains the target Key (ウ in FIG. 4). Because the index stores, as described above, the name of the file containing each Key, the offset within the file, and the record size, specifying a Key is enough to find the file that contains it.

Next, the record acquisition unit 40 accesses the file containing the target Key and reads the record (エ in FIG. 4), and the acquired record is returned to the client.
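The index-based read described above can be sketched as follows. The function name `read_record`, the temporary directory, and the choice to store only the Value bytes in the file are assumptions made to keep the example small; the point it illustrates is that the index entry (file name, offset, record size) lets the node seek straight to the record instead of scanning files.

```python
import os
import tempfile


def read_record(index, directory, key):
    """Fetch one record using an index of key -> (file name, offset, record size)."""
    entry = index.get(key)
    if entry is None:
        return None  # Key is not indexed
    file_name, offset, size = entry
    with open(os.path.join(directory, file_name), "rb") as f:
        f.seek(offset)                  # jump directly to the record
        return f.read(size).decode()    # read exactly one record


# Build a tiny example file and its index (hypothetical data).
directory = tempfile.mkdtemp()
records = [("row0:name:1", "bob"), ("row1:name:1", "alice")]
index = {}
with open(os.path.join(directory, "file1"), "wb") as f:
    for key, value in records:
        payload = value.encode()
        index[key] = ("file1", f.tell(), len(payload))
        f.write(payload)

value = read_record(index, directory, "row1:name:1")
```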

Consider a query that retrieves only the latest generation, that is, the records with the newest timestamps. Because Keys in the distributed data store system are sorted by RowKey, then column name, then timestamp, the records with the latest timestamps are scattered sparsely through a file. Even with sequential access, the entire file contents must therefore be read from the file system, including records with old timestamps that did not need to be read, and the number of records read from the file system becomes large.

One way to address this is to treat records that differ only in timestamp as different generations of the same record, and to generate one file from the records that belong to the same generation.

FIG. 5 is a diagram for explaining a method of managing files by generation in the distributed data store system shown in FIG. 1.

To manage files by generation in the distributed data store system shown in FIG. 1, the file dividing unit 30 first executes major compaction and creates a file for each generation (G3: third generation, G2: second generation, G1: first generation) (カ in FIG. 5).

FIG. 6 shows the structure of the files created for each generation: (a) shows the overall structure, (b) shows the structure of a Key, and (c) shows a specific example.

As shown in FIG. 6, a file created for a generation consists of Keys and Values, and in each Key the RowKey, column name, and timestamp are joined by colons (:).
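The Key layout described above can be sketched as a small data type. The class name `Key`, the integer timestamp, and the example values are assumptions; the sketch also assumes that column names and timestamps contain no colons, so that a Key can be split from the right.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical Key made of RowKey, column name, and timestamp."""
    row_key: str
    column: str
    timestamp: int

    def encode(self) -> str:
        """Join the three components with colons."""
        return f"{self.row_key}:{self.column}:{self.timestamp}"

    @classmethod
    def decode(cls, text: str) -> "Key":
        """Split from the right so that only the last two colons are separators."""
        row_key, column, timestamp = text.rsplit(":", 2)
        return cls(row_key, column, int(timestamp))


keys = [Key("user2", "name", 1), Key("user1", "name", 2),
        Key("user1", "age", 5), Key("user1", "name", 1)]
# Tuples compare field by field: RowKey first, then column name, then timestamp.
ordered = sorted(keys)
```

Sorting the tuples rather than the encoded strings keeps the RowKey, column name, timestamp priority explicit and compares timestamps numerically.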

The file dividing unit 30 treats records that differ only in the timestamp component of the Key as different generations of the same record, and generates one file from the records of the same generation. Specifically, when major compaction is executed, a Key-element-dependent file division function is used. When "generation" is specified as the predetermined condition (extraction condition) for this function, all records with the same RowKey and column name are collected, their timestamps are examined to identify the generation of each record, and a file is created for each generation. Like the files created by ordinary minor or major compaction, the per-generation files are sorted by Key.
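The generation split described above can be sketched as follows. The description does not spell out how generation numbers are assigned when records have different numbers of versions, so this sketch makes an assumption: within each (RowKey, column name) group, versions are ranked from newest to oldest, and rank 0 is treated as the latest generation. The function name `split_by_generation` and the tuple representation of records are also hypothetical.

```python
from collections import defaultdict


def split_by_generation(records):
    """Group records into per-generation lists.

    records: iterable of ((row_key, column, timestamp), value).
    Returns {rank: Key-sorted list of records}; rank 0 holds the newest
    version of every (row_key, column) pair (assumed numbering).
    """
    versions = defaultdict(list)
    for key, value in records:
        row_key, column, _timestamp = key
        versions[(row_key, column)].append((key, value))

    files = defaultdict(list)
    for group in versions.values():
        # Newest timestamp first, so the position in the list is the generation rank.
        group.sort(key=lambda rec: rec[0][2], reverse=True)
        for rank, rec in enumerate(group):
            files[rank].append(rec)

    # Each per-generation file is kept sorted by Key, like other compaction output.
    return {rank: sorted(recs) for rank, recs in files.items()}


records = [
    (("r1", "name", 1), "a"),
    (("r1", "name", 2), "b"),
    (("r2", "name", 1), "c"),
    (("r1", "name", 3), "d"),
]
generation_files = split_by_generation(records)
```

With this layout, a query for the latest version of every record only needs the rank 0 list, which corresponds to reading a single per-generation file.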

As a result, access to the third generation, which is the latest generation, is completed by reading a single file (FILE(G3)) (キ in FIG. 5). If only certain columns of the latest generation are needed, this can be combined with first querying the index (ク in FIG. 5) to find where in the third-generation file the record for the target Key is located, which reduces file system disk I/O and, in the case of a distributed file system, network I/O. If minor compaction occurs after the per-generation files have been created by major compaction, so that other files exist in addition to the per-generation files, a read of only the latest-generation records proceeds as follows: while the latest-generation file is read, the index is searched to check whether a file created by minor compaction contains a newer record, and the newer record is returned as the result.
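The last case above, where a minor compaction has produced newer files after the per-generation files were written, can be sketched as a comparison of two candidates. The function name `read_latest` and the dictionary shapes are assumptions; in particular, `minor_index` stands in for whatever the index reports about files created by later minor compactions.

```python
def read_latest(row_key, column, latest_file, minor_index):
    """Return the newest (timestamp, value) for a record, or None if absent.

    latest_file: {(row_key, column): (timestamp, value)} read from the
        latest-generation file.
    minor_index: {(row_key, column): (timestamp, value)} found through the
        index for files created by later minor compactions (hypothetical shape).
    """
    target = (row_key, column)
    candidates = [source[target] for source in (latest_file, minor_index)
                  if target in source]
    if not candidates:
        return None
    # The record with the larger timestamp is the newer one and is returned.
    return max(candidates, key=lambda tv: tv[0])


latest_file = {("r1", "name"): (3, "d"), ("r2", "name"): (1, "c")}
minor_index = {("r1", "name"): (4, "e")}
newest = read_latest("r1", "name", latest_file, minor_index)
```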

By treating records that differ only in timestamp as different generations of the same record and generating one file per generation in this way, the records of a given generation are stored contiguously in a single file. Accessing the file of a specific generation turns an access pattern that was previously random into sequential access, which improves response time and throughput. Because unnecessary information no longer has to be read, the load on the file system and the network is also reduced.

In this embodiment, the file dividing unit 30 treats records that differ only in the timestamp component of the Key as different generations of the same record and generates one file per generation. However, the invention is not limited to generation management as described above, as long as a plurality of records satisfying a predetermined condition are extracted from the records stored in the disk area 52 and sorted in ascending or descending Key order to generate a new file.

The record read processing in the distributed data store system described above is explained below.

FIG. 7 is a flowchart for explaining the record read processing in the distributed data store system shown in FIG. 1.

First, the record acquisition unit 40 determines, from the record acquisition condition contained in the query that constitutes the client's record acquisition request, whether the request is for access to a specific generation and whether it is for access to specific columns (more than one may be selected). Depending on the result, it chooses between reading the entire file of the specific generation and reading by means of the index (step 1).

If no generation is specified, so that the request is not for access to a specific generation, the index is referenced and the records are read from the disk area 52 according to the record acquisition request (step 2).

If a generation is specified, so that the request is for access to a specific generation, it is determined whether the record acquisition request specifies access to specific columns (step 3). If it does, the records are acquired from the file of the specific generation according to the record acquisition request, using the index-based record acquisition function described above (step 4).

If access to specific columns is not specified, the requested records are acquired directly from the file of the specific generation according to the record acquisition request (step 5).

Next, if a new file created by the minor compaction function exists (step 6), the index is referenced for that file and the records are read from the disk area 52 according to the record acquisition request (step 7).

The records acquired from the disk area 52 in this way are returned to the source of the record acquisition request.
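The branching in steps 1 to 7 can be summarized with a small planning function. This is only a sketch of the control flow: the function name `plan_read`, its parameters, and the step strings are hypothetical, and it assumes that the check for new minor-compaction files (step 6) is made on every path, which the description does not state explicitly.

```python
def plan_read(generation=None, columns=None, minor_files_exist=False):
    """Return the record-acquiring steps of the read flow for one request."""
    steps = []
    if generation is None:
        # Step 1 found no generation in the condition: use the index (step 2).
        steps.append("step 2: read via index")
    elif columns:
        # Generation and specific columns given (step 3 is true): step 4.
        steps.append("step 4: read generation file via index")
    else:
        # Generation given without specific columns: read the file directly (step 5).
        steps.append("step 5: read generation file directly")
    if minor_files_exist:
        # Step 6 is true: also consult the index for the newer files (step 7).
        steps.append("step 7: read minor-compaction files via index")
    return steps


plan = plan_read(generation=3, columns=["name"], minor_files_exist=True)
```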

The index assigning unit 10 updates and reassigns the index either when minor compaction or major compaction is executed or when a file for a new specific generation is generated. That is, when a Key within the Key range managed by one of the nodes 2-1 to 2-3 is read, if the Key is not in the index and there are one or more files for which no index has been built, the target record is searched for by the conventional method according to the client's record acquisition request, and the index is updated at the same time. Specifically, the index is updated either when a new file is generated by the minor compaction function or when a new per-generation file is generated.

DESCRIPTION OF SYMBOLS
1 Index node
2-1 to 2-3 Nodes
10 Index assigning unit
20 Storage control unit
30 File dividing unit
40 Record acquisition unit
50 Storage unit
51 Memory area
52 Disk area

Claims

1. A distributed data store system in which records specified by identifiers are managed in a distributed manner by a plurality of nodes connected by a network,
wherein each of the plurality of nodes comprises:
record storage means for storing a plurality of records managed by the node as aggregates, one for each arbitrary range of the identifiers;
index assigning means for assigning to each aggregate an index that uses the identifiers included in the range of that aggregate; and
record acquisition means for, in response to a record acquisition request, acquiring the requested record from the record storage means by referring to the index.

2. The distributed data store system according to claim 1,
wherein the record storage means includes:
a memory area having a cache function; and
a disk area in which records are stored persistently,
the system further comprising storage control means for accumulating records in the memory area in ascending or descending order of the identifiers and, when the size of the records accumulated in the memory area exceeds a predetermined threshold, storing those records in the disk area as a single file.

3. The distributed data store system according to claim 2,
further comprising file dividing means for extracting, from the plurality of records stored in the disk area, a plurality of records that satisfy a predetermined condition, sorting the extracted records in ascending or descending order of the identifiers, and generating a new file.

4. The distributed data store system according to claim 3,
wherein the identifier includes a timestamp as a component, and
the file dividing means recognizes records that differ only in the timestamp among the components of the identifier as the same record of different generations, and generates one file from the records of the same generation.

5. The distributed data store system according to claim 4,
wherein, in response to the record acquisition request, the record acquisition means acquires records from the record storage means according to the record acquisition request by referring to the index when no generation is specified in the record acquisition condition, and acquires records from the record storage means according to the record acquisition request from the file of the specified generation when a generation is specified in the record acquisition condition.

6. The distributed data store system according to claim 5,
wherein the index assigning means reassigns the index to the aggregate when the records accumulated in the memory area are stored in the disk area or when the new file is generated.

7. A record management method in a distributed data store system in which records specified by identifiers are managed in a distributed manner by a plurality of nodes connected by a network,
wherein each of the plurality of nodes, when storing a plurality of records managed by the node as aggregates for each arbitrary range of the identifiers, assigns to each aggregate an index that uses the identifiers included in the range of that aggregate, and, in response to a record acquisition request, acquires the requested record by referring to the index.