JP6480495B2

JP6480495B2 - Data management apparatus, data management method, and program

Info

Publication number: JP6480495B2
Application number: JP2017050997A
Authority: JP
Inventors: 洸二山田; 周一鈴木
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2017-03-16
Filing date: 2017-03-16
Publication date: 2019-03-13
Anticipated expiration: 2037-03-16
Also published as: JP2018156233A

Description

The present invention relates to a data management device, a data management method, and a program.

Conventionally, a method for detecting names in the Chinese, Japanese, and Korean languages is known (see Patent Document 1). This method handles structured data.

Data structures in databases include row-oriented data structures and column-oriented data structures. A row-oriented data structure holds one record as a single logical structure. A column-oriented data structure, in contrast, holds the data corresponding to the same index (for user attribute data, items such as name, age, and sex) as a single logical structure. Here, a logical structure refers to a key, an LBA (Logical Block Addressing), a label on a logical-to-physical conversion table, or other logical information used when searching for data. A row-oriented data structure makes it easy to add or delete data, whereas a column-oriented data structure is suited to statistical processing for each index.
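The contrast between the two layouts can be illustrated with a minimal Python sketch. The user names and values are hypothetical sample data, and the two helper functions are introduced here only for illustration; they are not part of the described device.

```python
# Three user records held in a row-oriented layout: one dict per record.
rows = [
    {"name": "alice", "age": 24, "sex": "F"},
    {"name": "bob", "age": 31, "sex": "M"},
    {"name": "carol", "age": 28, "sex": "F"},
]

# The same data in a column-oriented layout: one list per index.
columns = {
    "name": ["alice", "bob", "carol"],
    "age": [24, 31, 28],
    "sex": ["F", "M", "F"],
}


def mean_age_from_rows(records):
    """Every record is visited even though only 'age' is needed."""
    return sum(r["age"] for r in records) / len(records)


def mean_age_from_columns(cols):
    """Only the 'age' column is touched."""
    ages = cols["age"]
    return sum(ages) / len(ages)


# Adding a record is a single append in the row layout ...
new_record = {"name": "dave", "age": 40, "sex": "M"}
rows.append(new_record)
# ... but touches every column in the column layout.
for key, value in new_record.items():
    columns[key].append(value)
```

Both helpers return the same mean; the difference is how much data each has to read, which is the trade-off the paragraph above describes.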

Patent Document 1: JP 2013-109364 A

Functions such as JSON that handle row-oriented data structures can automatically generate a tree structure of the data, but they are costly in terms of network, storage, and software processing. In particular, the processing time for reading data for statistical processing from a database having a column-oriented data structure becomes long.

On the other hand, when data is stored in a column-oriented data structure, it is difficult to manage all the indexes that may be used and to add or delete data. In particular, when data is input in a stream format, the data is expected to be processed record by record, but per-record processing cannot write directly into a column-oriented structure. Furthermore, no effective method has been developed for column-oriented structures to manage write failures or to perform deduplication.

The present invention has been made in view of these circumstances. One of its objects is to provide a data management device, a data management method, and a program that can store unstructured input data in a form that can be read at high speed for each data item, and that allow input data to be added or deleted easily.

One aspect of the present invention is a data management device including: an interpretation unit that interprets an input data storage request and converts it into an abstract representation; and a conversion unit that stores the data converted into the abstract representation by the interpretation unit in a storage unit while managing the data in a column-oriented data structure for which no array is allocated in advance.

According to one aspect of the present invention, unstructured input data can be stored in a form that can be read at high speed for each data item, and input data can be added or deleted easily.

FIG. 1 is a diagram showing an example of a log in json format.
FIG. 2 is a diagram representing the log of FIG. 1 as a tree structure.
FIG. 3 is a diagram showing another example of a log in json format.
FIG. 4 is a diagram representing the log of FIG. 3 as a tree structure.
FIG. 5 is a diagram representing the data of a columnar file as a tree structure.
FIG. 6 is a diagram showing an example of the usage environment and configuration of the database server 100, which is an example of a data management device.
FIG. 7 is a diagram for explaining the function of the interpretation unit 112.
FIG. 8 is a diagram (part 1) for explaining the function of the conversion unit 114.
FIG. 9 is a diagram (part 1) showing an example of the schema information 154B.
FIG. 10 is a diagram (part 2) for explaining the function of the conversion unit 114.
FIG. 11 is a diagram (part 2) showing an example of the schema information 154B.
FIG. 12 is a diagram (part 3) for explaining the function of the conversion unit 114.
FIG. 13 is a diagram (part 3) showing an example of the schema information 154B.
FIG. 14 is a flowchart showing an example of the flow of processing executed by the conversion unit 114.
FIG. 15 is a diagram showing an image of the output data produced by the data user interface 120.
FIG. 16 is a flowchart showing an example of the flow of processing executed by the data user interface 120.
FIG. 17 is a diagram for explaining another function of the conversion unit 114.
FIG. 18 is a diagram showing an example of the schema information 154B corresponding to the example of FIG. 17.

Hereinafter, embodiments of a data management device, a data management method, and a program according to the present invention will be described with reference to the drawings. The data management device stores data received from a client in a storage device, and reads data from the storage device and provides it in response to a request from the client that sent the data or from another client. The data management device may also be referred to as a DBMS (database management system). Clients include an application server (hereinafter referred to as a front-end server) that operates in cooperation with an application program running on a terminal device used by an end user, and a data user server that uses the accumulated data as statistical data or the like.

First, the conceptual aspects of the present invention will be described. In recent years, the mainstream way of using Hadoop has been to access hdfs in an RDB-like manner through "SQL on Hadoop" engines such as hive and presto, and cases of directly handling files of "large amounts of unstructured data", as Hadoop was once described, have become rare. On the other hand, stored data is in most cases an unstructured "log" at the time of acquisition. In many cases, therefore, data is acquired and processed as "unstructured data with regularity". Typical examples of such data are json and xml, which can be expressed in a "key value format including nesting" and can be viewed as a tree structure. FIG. 1 is a diagram showing an example of a log in json format, and FIG. 2 is a diagram representing the log of FIG. 1 as a tree structure. A tree representation is well suited to abstracting the "key value format including nesting". FIG. 3 is a diagram showing another example of a log in json format, and FIG. 4 is a diagram representing the log of FIG. 3 as a tree structure.

As shown in FIG. 4, a "key value including nesting" lies in the (x, z) plane, and for arrays the dimension can be extended in the y direction. A tree structure in a multidimensional space is therefore well suited to expressing the "key value format including nesting", that is, a schema. The "schemaobject" is an object obtained by separating this "tree structure in a multidimensional space" from the data format (json, xml, avro, message pack, and so on) and abstracting it.
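As a rough illustration of viewing nested key-value data as a tree, the following Python sketch enumerates the leaves of a parsed json log together with their paths. The function name `walk` and the sample log are assumptions made for this example; the sample does not reproduce the logs of FIG. 1 or FIG. 3, and the sketch is not the schemaobject itself.

```python
import json


def walk(node, path=()):
    """Yield (path, value) for every leaf of a nested key-value structure.

    Dict keys extend the path by a key name. List positions extend it by an
    integer index, which plays the role of the extra array dimension.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node


# Hypothetical log with one nested object and one array.
log = json.loads('{"user": {"name": "aaaaa", "age": 24}, "tags": ["x", "y"]}')
leaves = list(walk(log))
```

Each path is independent of whether the input was json or some other nested format, which is the kind of separation from the data format that the paragraph above attributes to the schemaobject.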

Distributed storage, of which Hadoop is representative, was originally designed and developed mainly to access large amounts of unstructured data with high throughput and high latency. In recent years, however, data is increasingly structured when it is placed, in order to achieve both high throughput and low latency. When data is structured on hdfs, it is common to use a file format called columnar, which persists RDB-like data; typical examples are hive ORC file and apache parquet. When the data of a columnar file is expressed as a tree structure, it can be drawn as a "two-dimensional tree having only the level directly under the root", as shown in FIG. 5. FIG. 5 is a diagram representing the data of a columnar file as a tree structure.

The advantage of the columnar file format is that accessing data column by column reduces cost; in terms of memory, CPU, and I/O alike, it surpasses the other file formats for unstructured data that are familiar in Hadoop. On the other hand, a columnar file has the weakness that it forces the data to be structured. As described above, data is an unstructured "log" at the time of acquisition, and even if one tries to structure it, a representation as sophisticated as a "multidimensional tree structure" is not possible.

The method adopted in the present invention solves this problem. It is a file format that can persist data having a multidimensional extent (in tree terms, extent in the depth direction). Because it takes the form of describing the schemaobject described above as it is, data can be held while preserving the expressive power of an "array of key values including nesting".

In general, the weak point of a columnar format is the "data structuring" step: logic and processing that compress multidimensional data down to two dimensions must be implemented somewhere, and this is what is commonly called a "schema". Managing and changing a schema involves significant cost. Because the method of the present invention requires no such dimension compression, data storage is freed from this "schema problem". Another major advantage is that a "specific value" inside an array or Struct type, which cannot be accessed in a columnar file because of its structure, can be reached as a tree search without fully expanding the column.

Hereinafter, a specific configuration and its functions will be described. FIG. 6 is a diagram showing an example of the usage environment and configuration of the database server 100, which is an example of a data management device. One or more terminal devices 10 used by end users communicate with a front-end server 20. An application program runs on each terminal device 10 and exchanges the data needed to execute the application program with the front-end server 20. Of the data acquired from the terminal devices 10, the front-end server 20 sends the data that needs to be saved to the database server 100 via a proxy server 30 for storage. The front-end server 20 also reads data needed to execute the application program from the database server 100 and sends it to the terminal devices 10. There are a plurality of such combinations of one or more terminal devices 10 and a front-end server 20. Each front-end server 20 issues data write requests or read requests to the database server 100 in an arbitrary format such as JSON (JavaScript (registered trademark) Object Notation) or MySQL.

The data user server 50, on the other hand, acquires from the database server 100 those items of the data collected from the front-end servers 20 whose use for statistical processing or the like is permitted by the terms of use. The distinction between the front-end server 20 and the data user server 50 need not be strict, and some of the front-end servers 20 may operate as the data user server 50. The data user server 50 may also communicate with the database server 100 via the proxy server 30. The devices shown in FIG. 6 are connected so that they can communicate with one another via a network such as the Internet, a WAN (Wide Area Network), or a LAN (Local Area Network).

The database server 100 includes, for example, a front-end interface 110, a data user interface 120, and a storage unit 150, in addition to a communication interface such as an NIC (Network Interface Card), not shown. The front-end interface 110 and the data user interface 120 are each realized by a processor such as a CPU (Central Processing Unit) executing a program (software). One or both of these functional units may instead be realized by hardware such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by software and hardware working together.

The front-end interface 110 includes, for example, an interpretation unit 112 and a conversion unit 114. The interpretation unit 112 abstracts the data acquired from the front-end server 20. When providing data to the front-end server 20, the interpretation unit 112 converts the abstracted data into a format supported by that front-end server 20. The conversion unit 114 converts row-oriented data into column-oriented data and stores it in the storage unit 150. The data user interface 120 reads the data corresponding to a request acquired from the data user server 50 from the storage unit 150 and sends it to the data user server 50. Details of these functions will be described later.

The storage unit 150 includes, for example, a cache memory 152 and a nonvolatile memory 154. The cache memory 152 is realized by a RAM (Random Access Memory), a register, a flash memory, or the like. The nonvolatile memory 154 is realized by an HDD (Hard Disk Drive), a flash memory, or the like. The nonvolatile memory 154 stores client data 154A and schema information 154B. The schema information 154B is an example of reference information. The storage unit 150 may be a NAS (Network Attached Storage) that the database server 100 can access via a network.

[Front-end interface]
Hereinafter, the functions of the front-end interface 110 will be described. The interpretation unit 112 of the front-end interface 110 converts data whose definition differs for each front-end server 20 into one common format. FIG. 7 is a diagram for explaining the function of the interpretation unit 112. The example shows data indicating that a user whose user name (name) is aaaaa has an age (age) of 24. Hereinafter, items of information such as name and age are referred to as indexes. The lower part of FIG. 7, in contrast, schematically shows the abstracted data that the database server 100 can handle. As illustrated in FIG. 7, the interpretation unit 112 interprets the data storage request acquired from the front-end server 20, abstracts it, and passes the data to the conversion unit 114.
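The idea of reducing differently formatted requests to one common form might be sketched in Python as follows. The choice of json and xml as input formats, the `interpret` dispatcher, and the use of a plain dict as the abstract representation are all assumptions made for this illustration; the description does not specify how the interpretation unit 112 is implemented.

```python
import json
import xml.etree.ElementTree as ET


def interpret_json(payload: str) -> dict:
    """Parse a json request body into a plain dict."""
    return json.loads(payload)


def interpret_xml(payload: str) -> dict:
    """Parse a flat xml request body; child tags become keys."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}


# Hypothetical registry: one interpreter per request format.
INTERPRETERS = {"json": interpret_json, "xml": interpret_xml}


def interpret(fmt: str, payload: str) -> dict:
    """Convert a storage request in a given format to the common dict form."""
    return INTERPRETERS[fmt](payload)


a = interpret("json", '{"name": "aaaaa", "age": 24}')
b = interpret("xml", "<user><name>aaaaa</name><age>24</age></user>")
```

Both calls yield a dict with the keys name and age. Note that in this sketch the xml path keeps values as strings, so a real abstraction layer would also need to reconcile value types.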

Unless further processing is applied, the data abstracted by the front-end interface 110 normally has a row-oriented data structure. The conversion unit 114 further converts the abstracted data into a column-oriented data structure and stores it in the nonvolatile memory 154 of the storage unit 150 as client data 154A.

FIG. 8 is a diagram (part 1) for explaining the function of the conversion unit 114. Here, it is assumed that three records, record 1 to record 3, have been acquired from the front-end server 20 and abstracted by the interpretation unit 112. Record 1 contains id (identification information), name (user name), and sex (gender) as data items. In the figure, the id of record 1 is written as id(1), and the other data items are written in the same way. Record 2 contains id, name, and age as data items, and record 3 contains id and name as data items. These abstracted records are stored in the cache memory 152.

When a certain amount of data has been stored in the cache memory 152, the conversion unit 114 stores it in the nonvolatile memory 154 while managing it in a column-oriented data structure for which no array is allocated in advance. In the example of FIG. 8, [id(1), id(2), id(3)], [name(1), name(2), name(3)], [sex(1)], and [age(2)] are each managed so as to form a single logical structure.

Here, the conversion unit 114 stores the schema information 154B in the nonvolatile memory 154 so that a search in the row direction is possible later (the schema information may be held temporarily in the cache memory 152). FIG. 9 is a diagram (part 1) showing an example of the schema information 154B. As illustrated, the schema information 154B is information that represents the log as a tree structure. Based on this information, when, for example, a request to read the information of record 2 is acquired from the front-end server 20, the data can be accessed as a tree search. The format of the schema information 154B is not limited to that shown in FIG. 9, and any format may be adopted.
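One way to picture how auxiliary information can make a row-direction read possible over column-oriented storage is the Python sketch below. The layout of `schema` (record number mapped to the position of each item in its column) is purely an assumption for illustration and is not the tree format of FIG. 9; the values such as "id1" are placeholders for the data of records 1 to 3.

```python
# Columns after records 1 to 3 of the FIG. 8 example. A column exists only
# for data items that have actually appeared, and its length is not fixed.
columns = {
    "id": ["id1", "id2", "id3"],
    "name": ["name1", "name2", "name3"],
    "sex": ["sex1"],
    "age": ["age2"],
}

# Hypothetical stand-in for schema information: for each record number,
# the data items it contains and the position of each value in its column.
schema = {
    1: {"id": 0, "name": 0, "sex": 0},
    2: {"id": 1, "name": 1, "age": 0},
    3: {"id": 2, "name": 2},
}


def read_record(record_no):
    """Rebuild one record by following the schema entry into each column."""
    return {item: columns[item][pos] for item, pos in schema[record_no].items()}


record2 = read_record(2)
```

Only the columns that record 2 actually uses are touched; the sex column is never read.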

When a request to store further records is acquired, the conversion unit 114 manages the data by either of the following methods. The conversion unit 114 may (Method 1) manage the data by adding it to the data structure already being managed, or (Method 2) partition the managed data each time data is moved from the cache memory 152 to the nonvolatile memory 154.

(Method 1)
FIG. 10 is a diagram (part 2) for explaining the function of the conversion unit 114. Here, it is assumed that three further records, record 4 to record 6, have been acquired from the front-end server 20 and abstracted by the interpretation unit 112. Record 4 contains name, age, and job as data items. Record 5 contains id, name, and sex as data items, and record 6 contains id, name, and job as data items. These abstracted records are stored in the cache memory 152.

When a certain amount of data has been stored in the cache memory 152, the conversion unit 114 stores it in the nonvolatile memory 154 while managing it in a column-oriented data structure. Records 4 to 6 contain a data item, job (occupation), that was not contained in records 1 to 3. In this case, the conversion unit 114 sets up a new column and manages the data there. In the example of FIG. 10, [id(1), id(2), id(3), id(5), id(6)], [name(1), name(2), name(3), name(4), name(5), name(6)], [sex(1), sex(5)], [age(2), age(4)], and [job(4), job(6)] are each managed so as to form a single logical structure. In this case, the schema information 154B has the content shown in FIG. 11. FIG. 11 is a diagram (part 2) showing an example of the schema information 154B.

(Method 2)
FIG. 12 is a diagram (part 3) for explaining the function of the conversion unit 114. Here, the processing of records 1 to 3 illustrated in FIG. 12 is referred to as session 1, and the processing of the subsequently acquired records 4 to 6 is referred to as session 2. In the example of FIG. 12, as session 1, [id(1), id(2), id(3)], [name(1), name(2), name(3)], [sex(1)], and [age(2)] are each managed so as to form a single logical structure, and as session 2, [id(5), id(6)], [name(4), name(5), name(6)], [sex(5)], [age(4)], and [job(4), job(6)] are each managed so as to form a single logical structure. In this case, the schema information 154B has the content shown in FIG. 13. FIG. 13 is a diagram (part 3) showing an example of the schema information 154B.
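Method 2 can be pictured with the following Python sketch, in which each transfer from the cache produces its own independent group of columns. The function names `flush_session` and `read_item`, the pairing of each value with its record number, and the placeholder values are assumptions made for this illustration.

```python
from collections import defaultdict


def flush_session(cached_records):
    """Build one independent column group from the records of one transfer."""
    session = defaultdict(list)
    for record_no, record in cached_records:
        for item, value in record.items():
            session[item].append((record_no, value))
    return dict(session)


# Session 1: records 1 to 3. Session 2: records 4 to 6.
session1 = flush_session([
    (1, {"id": "id1", "name": "name1", "sex": "sex1"}),
    (2, {"id": "id2", "name": "name2", "age": "age2"}),
    (3, {"id": "id3", "name": "name3"}),
])
session2 = flush_session([
    (4, {"name": "name4", "age": "age4", "job": "job4"}),
    (5, {"id": "id5", "name": "name5", "sex": "sex5"}),
    (6, {"id": "id6", "name": "name6", "job": "job6"}),
])
sessions = [session1, session2]


def read_item(item):
    """Collect one data item across sessions, skipping sessions without it."""
    return [entry for s in sessions for entry in s.get(item, [])]


jobs = read_item("job")
```

Because session 1 has no job column at all, reading job only touches session 2. Earlier sessions are never rewritten when a new data item appears, which is one plausible reason to prefer this partitioning over Method 1.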

By managing data in this way, when, for example, a request such as "acquire the job of every user" is acquired from the data user server 50, the database server 100 can read the data of the data item "job" without referring to the other data items (id, name, age, sex, and so on). As a result, the time required for reading is shortened, and the needs of data users can be met quickly. When the nonvolatile memory 154 is an HDD, it is preferable, although not required, to hold each single logical structure within, for example, the same track so that the seek time is short.

FIG. 14 is a flowchart showing an example of the flow of processing executed by the conversion unit 114. First, the conversion unit 114 waits until the timing for writing to the nonvolatile memory 154 arrives (S100). The timing for writing to the nonvolatile memory 154 can be defined arbitrarily; examples include the timing at which a certain amount of data has been stored in the cache memory 152 as described above, the timing at which the database server 100 is shut down, and the timing at which aggregation of the most recent data is requested.

When the timing for writing to the nonvolatile memory 154 arrives, the conversion unit 114 selects one record stored in the cache memory 152 (S102) and selects one data item contained in that record (S104). The conversion unit 114 then determines whether the selected data item is a data item that is already being managed (S106).

If the selected data item is already being managed, the conversion unit 114 appends the data to the end of that data item (S108). If the selected data item is not yet being managed, the conversion unit 114 newly sets up (defines) a column and writes the data into that column (S110). The conversion unit 114 also adds information about the written data item to the schema information 154B (S112).

Next, the conversion unit 114 determines whether all the data items of the selected record have been selected (S114). If not, the processing returns to S104. If all the data items of the selected record have been selected, the conversion unit 114 determines whether all the records stored in the cache memory 152 have been selected (S116). If not, the processing returns to S102. If all the records stored in the cache memory 152 have been selected, one routine of this flowchart ends.
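The loop of S102 to S116 can be sketched in Python as follows. This is an illustrative reading of the flowchart, not the implementation of the conversion unit 114: the function name `flush`, the representation of the cache as a list of (record number, record) pairs, and the shape of `schema` (item name mapped to the record numbers that contain it) are assumptions. The sketch also assumes that S112 is performed after both S108 and S110.

```python
def flush(cache, columns, schema):
    """Run one routine of the write flow once the write timing has arrived."""
    for record_no, record in cache:                        # S102, repeated via S116
        for item, value in record.items():                 # S104, repeated via S114
            if item in columns:                            # S106: already managed?
                columns[item].append((record_no, value))   # S108: append to the end
            else:
                columns[item] = [(record_no, value)]       # S110: define a new column
            schema.setdefault(item, []).append(record_no)  # S112: update schema info


columns, schema = {}, {}

# First write timing: records 1 to 3.
flush([
    (1, {"id": "id1", "name": "name1", "sex": "sex1"}),
    (2, {"id": "id2", "name": "name2", "age": "age2"}),
    (3, {"id": "id3", "name": "name3"}),
], columns, schema)

# Second write timing: records 4 to 6 introduce the new item "job".
flush([
    (4, {"name": "name4", "age": "age4", "job": "job4"}),
    (5, {"id": "id5", "name": "name5", "sex": "sex5"}),
    (6, {"id": "id6", "name": "name6", "job": "job6"}),
], columns, schema)
```

No column is declared ahead of time: the job column comes into existence only when record 4 is processed, which corresponds to managing data in a structure for which no array is allocated in advance.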

[Data user interface]
Hereinafter, the functions of the data user interface 120 will be described. The data user interface 120 provides, for example, tabular data (array data) in response to a request from the data user server 50. A request from the data user server 50 is made by specifying arbitrary data items. For a record that does not contain a specified data item, the data user interface 120 generates tabular data in which the data corresponding to that data item is set to "null" (or to a blank or any other form indicating "no corresponding data") and provides it to the data user server 50. When a specified data item is not among the data items already being managed, the data user interface 120 does not return an error; instead, it generates tabular data in which all the data for that data item is set to "null" (or to a blank or any other form indicating "no corresponding data") and provides it to the data user server 50. A request from the data user server 50 may be made, for example, by specifying a predetermined extension.

For example, suppose that a request for data specifying the data items [sex, age, job, hobby] is made while data such as that shown in FIG. 10 is stored in the nonvolatile memory 154 as the client data 154A. In this case, the output data produced by the data user interface 120 looks like FIG. 15. FIG. 15 is a diagram showing an image of the output data produced by the data user interface 120. As illustrated, the output data is an array that holds an entry for every record and every data item, regardless of whether data exists. In this way, the database server 100 can provide data in a format that suits the needs of the data user server 50.
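The shape of such output can be sketched in Python as follows, using None to stand for "null". The function name `build_table`, the storage of each value with its record number, the placeholder values, and the orientation of the result (one row per record, one cell per requested item) are assumptions made for this illustration.

```python
def build_table(columns, requested_items, n):
    """Return n rows with one cell per requested item; None marks missing data."""
    table = [[None] * len(requested_items) for _ in range(n)]
    for col_index, item in enumerate(requested_items):
        # An item that is not managed at all yields no entries, so its
        # cells simply stay None instead of raising an error.
        for record_no, value in columns.get(item, []):
            table[record_no - 1][col_index] = value
    return table


# Client data corresponding to the FIG. 10 example (records 1 to 6).
columns = {
    "id": [(1, "id1"), (2, "id2"), (3, "id3"), (5, "id5"), (6, "id6")],
    "name": [(i, f"name{i}") for i in range(1, 7)],
    "sex": [(1, "sex1"), (5, "sex5")],
    "age": [(2, "age2"), (4, "age4")],
    "job": [(4, "job4"), (6, "job6")],
}

table = build_table(columns, ["sex", "age", "job", "hobby"], n=6)
```

Every row has four cells even though no record contains hobby, so the consumer receives a regular array whatever the stored data looks like.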

FIG. 16 is a flowchart showing an example of the flow of processing executed by the data user interface 120. First, the data user interface 120 waits until it acquires a data request (S200). On acquiring a data request, the data user interface 120 acquires the current maximum number of records from the schema information 154B (S202). Let this maximum number be n. Next, the data user interface 120 defines an array of size (number of data items included in the data request) × n (S204). This array serves as the framework for the output data.

Next, the data user interface 120 selects one data item from the data request (S206) and determines whether the selected data item has already been set in the client data 154A (S208). If the selected data item has not yet been set in the client data 154A, the data user interface 120 sets all data for that data item to null (S210).

On the other hand, if the selected data item has already been set in the client data 154A, the data user interface 120 reads one piece of data for the currently selected data item from the client data 154A (S212). Next, the data user interface 120 determines whether no readable data was found in S212 (S214). If readable data was found in S212, the data user interface 120 determines whether any record numbers were skipped before that read (S216). If record numbers were skipped, the data user interface 120 sets the data for the skipped record numbers to null (S218). The data user interface 120 then places the data read from the client data 154A into the array defined in S204 (S220).

After performing the process of S210, or after an affirmative determination in S214 (that is, when no readable data was found), the data user interface 120 determines whether all data items have been selected over the repeated executions of S206 (S222). If not all data items have been selected, the process returns to S206. If all data items have been selected, the data is output (S224). At this stage, every element of the array should hold either data read from the client data 154A or null.
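The flow of FIG. 16 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the embodiment's implementation: `CLIENT_DATA`, `MAX_RECORDS`, and `build_output` are hypothetical names, the sample values are invented, and each column is modeled as a sparse mapping from record number to value. The sketch also pads any records remaining after the last readable value with null, a detail the flowchart description does not spell out.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for client data 154A: each managed data item maps to a
# sparse column of {record number: value}. Record numbers start at 1.
CLIENT_DATA: Dict[str, Dict[int, Any]] = {
    "sex": {1: "male", 2: "female", 3: "male"},
    "age": {1: 34, 3: 52},
    "job": {2: "engineer"},
}
# Hypothetical stand-in for the record count held in schema information 154B.
MAX_RECORDS = 3


def build_output(requested_items: List[str]) -> List[List[Optional[Any]]]:
    n = MAX_RECORDS                                   # S202
    table: List[List[Optional[Any]]] = []             # S204: items x n framework
    for item in requested_items:                      # S206 / S222 loop
        column = CLIENT_DATA.get(item)                # S208: is the item managed?
        if column is None:
            table.append([None] * n)                  # S210: all null, no error
            continue
        values: List[Optional[Any]] = []
        expected = 1
        for record_no in sorted(column):              # S212: read next value
            while expected < record_no:               # S216: numbers skipped?
                values.append(None)                   # S218: null for skipped
                expected += 1
            values.append(column[record_no])          # S220: store read value
            expected = record_no + 1
        # Assumption: records after the last stored value are also null.
        values.extend([None] * (n - len(values)))
        table.append(values)
    return table                                      # S224: output


result = build_output(["sex", "age", "job", "hobby"])
```

Here `hobby` is not a managed data item, so its entries are all `None` rather than causing an error, which mirrors the behavior described for the data user interface 120.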

The client data 154A managed by the database server 100 may contain further hierarchical data within a single data item. For example, some data items, such as hobby, can have multiple pieces of data for one record. In this case, data as shown in FIG. 17 may be passed to the conversion unit 114. FIG. 17 is a diagram for explaining another function of the conversion unit 114. In the figure, hobby (1-h) denotes the h-th piece of data for the data item hobby in record 1. The conversion unit 114 manages these pieces of data in the column-oriented data structure in the same way as other data, and records in the schema information 154B that they belong to the same record. FIG. 18 is a diagram showing an example of the schema information 154B corresponding to the example of FIG. 17.
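One way to picture this handling of multi-valued data items is the Python sketch below. It is an assumption-laden illustration: the layout of the schema information in FIG. 18 is not reproduced here, so the sketch simply keeps a (record number, h) pair alongside each stored value. The names `to_columns` and `read_item` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Columns = Dict[str, List[Any]]
Schema = Dict[str, List[Tuple[int, int]]]


def to_columns(records: List[Dict[str, Any]]) -> Tuple[Columns, Schema]:
    """Flatten records into per-item columns plus record-membership info."""
    columns: Columns = defaultdict(list)
    schema: Schema = defaultdict(list)
    for record_no, record in enumerate(records, start=1):
        for item, value in record.items():
            # A multi-valued item such as hobby is stored value by value.
            values = value if isinstance(value, list) else [value]
            for h, v in enumerate(values, start=1):
                columns[item].append(v)
                # Assumed schema entry: value belongs to record_no, position h.
                schema[item].append((record_no, h))
    return dict(columns), dict(schema)


def read_item(columns: Columns, schema: Schema, item: str, record_no: int) -> List[Any]:
    """Collect every value of one data item that belongs to one record."""
    return [v for v, (r, _) in zip(columns[item], schema[item]) if r == record_no]


records = [
    {"sex": "male", "hobby": ["golf", "fishing"]},
    {"hobby": ["reading"]},
]
columns, schema = to_columns(records)
```

With this layout, `read_item(columns, schema, "hobby", 1)` gathers both hobby values of record 1, because the membership information ties them to the same record even though they sit in a flat column.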

According to the data management device, data management method, and program of the present invention described above, the device includes the interpretation unit 112, which interprets an input write request and converts it into an abstract representation, and the conversion unit 114, which converts the data converted into the abstract representation by the interpretation unit 112 into a column-oriented data structure for which no array is secured in advance and stores it in the storage unit 150. As a result, unstructured input data can be stored in a form that can be read at high speed for each data item, and input data can be added or deleted easily.
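The write path summarized above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the claimed implementation: the class `ColumnStore` and its methods are hypothetical, only JSON input is handled (the other formats named in the claims are omitted), and a plain dictionary stands in for the abstract representation. Columns are created on first use rather than allocated in advance, so a previously unseen data item simply adds a new column.

```python
import json
from typing import Any, Dict


class ColumnStore:
    """Hypothetical column-oriented store with no pre-allocated arrays."""

    def __init__(self) -> None:
        # data item -> {record number: value}; empty until data arrives
        self.columns: Dict[str, Dict[int, Any]] = {}
        self.record_count = 0

    def interpret(self, request: str) -> Dict[str, Any]:
        # Stand-in for the interpretation unit: JSON text -> abstract object.
        return json.loads(request)

    def store(self, request: str) -> int:
        # Stand-in for the conversion unit: object -> per-item columns.
        obj = self.interpret(request)
        self.record_count += 1
        record_no = self.record_count
        for item, value in obj.items():
            # setdefault adds a new column when the data item is unseen.
            self.columns.setdefault(item, {})[record_no] = value
        return record_no

    def delete(self, record_no: int) -> None:
        # Removing a record only touches the columns that hold its values.
        for column in self.columns.values():
            column.pop(record_no, None)


store = ColumnStore()
first = store.store('{"sex": "male", "age": 34}')
second = store.store('{"sex": "female", "job": "engineer"}')
```

The second request introduces `job`, which did not exist when the first record was stored. No schema change or array resizing is needed in this sketch, because each column only holds entries for the records that actually contain that data item.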

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS
10 Terminal device
20 Front-end server
30 Proxy server
50 Data user server
100 Database server
110 Front-end interface
112 Interpretation unit
114 Conversion unit
120 Data user interface
150 Storage unit
152 Cache memory
154 Nonvolatile memory
154A Client data
154B Schema information

Claims

A data management device comprising:
an interpretation unit that interprets an input data storage request and converts it, from a data format including json, xml, avro, or message pack, into an object that abstracts a tree structure in a multidimensional space and whose dimensions can be expanded into arrays; and
a conversion unit that converts the data converted into the object into client data managed in a column-oriented data structure, which differs in data structure from the converted data and for which no array is secured in advance, stores the client data in a storage unit, and stores in the storage unit reference information, included in the object, that enables the data managed in the column-oriented data structure to be read in the row direction.

The data management device according to claim 1, wherein
the column-oriented data structure is a data structure in which data included in the records contained in the data storage request is held as a single logical structure for each data item, and
the conversion unit adds a new column when a data item included in the data storage request is a data item not yet set in the column-oriented data structure.

The data management device according to claim 1 or 2, further comprising
a data user interface that reads, from the storage unit, and outputs the data managed in the column-oriented data structure for each data item included in an input data request.

The data management device according to claim 3, wherein,
for a record that does not include a designated data item, the data user interface fills the data corresponding to that data item with data in any form indicating that no corresponding data exists.

The data management device according to claim 3 or 4, wherein,
when a designated data item is a data item not set in the column-oriented data structure, the data user interface fills all data for that data item with data in any form indicating that no corresponding data exists.

A data management method in which a computer:
interprets an input data storage request and converts it, from a data format including json, xml, avro, or message pack, into an object that abstracts a tree structure in a multidimensional space and whose dimensions can be expanded into arrays;
converts the data converted into the object into client data managed in a column-oriented data structure, which differs in data structure from the converted data and for which no array is secured in advance, and stores the client data in a storage unit; and
stores in the storage unit reference information, included in the object, that enables the data managed in the column-oriented data structure to be read in the row direction.

A program that causes a computer to:
interpret an input data storage request and convert it, from a data format including json, xml, avro, or message pack, into an object that abstracts a tree structure in a multidimensional space and whose dimensions can be expanded into arrays;
convert the data converted into the object into client data managed in a column-oriented data structure, which differs in data structure from the converted data and for which no array is secured in advance, and store the client data in a storage unit; and
store in the storage unit reference information, included in the object, that enables the data managed in the column-oriented data structure to be read in the row direction.