JP7211255B2

JP7211255B2 - Search processing program, search processing method and information processing device

Info

Publication number: JP7211255B2
Application number: JP2019090011A
Authority: JP
Inventors: 雄一松田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-08-31
Filing date: 2019-05-10
Publication date: 2023-01-24
Anticipated expiration: 2039-05-10
Also published as: JP2020038610A

Description

The present invention relates to a search processing program, a search processing method, and an information processing apparatus.

In recent years, interest has grown in collecting, accumulating, and searching data about many kinds of knowledge. Such data can be represented by a graph data model, and RDF (Resource Description Framework) is one known model of this kind. A large amount of open data described in RDF currently exists. In the RDF data model, a relationship is expressed by a set of three elements called a triple, consisting of a subject, a predicate, and an object, with the elements treated as nodes. Data described in RDF is hereinafter referred to as RDF data.

A query language called SPARQL has been standardized for searching and analyzing RDF data. SPARQL is similar to SQL; by writing a query in it, data that satisfies given conditions can be retrieved from an RDF store that holds RDF data.

Parallelization is one technique for improving the efficiency of SPARQL query processing. One approach to parallelization uses a distributed framework, of which MapReduce is a representative example. Put simply, this approach processes data in parallel across computers. By increasing the number of computers, a distributed framework secures scalability against growth in data volume and can process large-scale data efficiently.

One example of middleware that uses such a distributed framework is Hadoop (registered trademark). Hadoop is middleware that distributes data across multiple servers and processes it in parallel. Because it can analyze terabyte-class and even petabyte-class data at high speed, it is used as a key technology in big data applications. In Hadoop, one master server cooperates with many slave servers connected under it to process data at high speed. The master server controls the overall flow of data processing, and the actual computation is shared among the subordinate slave servers. Consequently, the more slave servers Hadoop has, the greater its processing capacity, and the faster it can process growing volumes of data.

Hadoop relies on two main technologies. The first is HDFS (Hadoop Distributed File System). HDFS is a virtual file system that pools the hard disks of many slave servers so that the large volumes of data to be processed, and the aggregated results, can be written to it.

The second is MapReduce processing. MapReduce performs computation in two steps: Map processing, which extracts and decomposes the desired data from the given data, and Reduce processing, which aggregates the extracted data. MapReduce processing is efficient because it can run in parallel on multiple slave servers. The data it operates on is the data distributed on HDFS.
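The two-step flow described above can be sketched in a single process as follows. This is only an illustration of the Map, group-by-key, and Reduce stages, not Hadoop's API; the function names, the line format "subject predicate object", and the sample records are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group the emitted values by key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups, reducer):
    """Aggregate the grouped values of each key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def predicate_mapper(line):
    # Hypothetical input format: "subject predicate object" on one line.
    _subject, predicate, _obj = line.split()
    return [(predicate, 1)]

def count_reducer(_key, values):
    return sum(values)

lines = ["A likes D", "B likes C", "C loves A"]  # illustrative records
counts = reduce_phase(shuffle(map_phase(lines, predicate_mapper)), count_reducer)
print(counts)
```

In a real cluster each Map task would run on the slave server holding its block, and the grouping step would move data between servers; the sketch keeps only the data flow.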

Several conventional techniques are known. In one, variables corresponding to values that are frequently compared in SPARQL search queries are extracted, and new nodes created by combining the values of the extracted variables are added to the RDF data before the search is performed. In another, each triple is stored in a data item within a set of data items ordered according to the triple data, and the distributed computer that stores a data item is determined by the position of that data item within the set. In another, when an index is created and the length of a character string exceeds a set threshold, a pair consisting of a hash value determined from the character string and a key value is registered. In yet another, documents are managed using tags and character strings.

International Publication No. WO 2014/207827
JP 2013-175181 A
JP 2000-90115 A
JP 2008-52662 A

However, RDF data is built from the relationships among three elements: subject, predicate, and object. MapReduce processing, by contrast, treats input data in key=value form, that is, as two elements. Processing RDF data with MapReduce therefore requires the additional work of decomposing the RDF data into the two-element key=value form and creating all combinations in advance. For example, when the three elements of RDF data are written (s, p, o), the data is decomposed so that one of the combinations (s, p), (p, o), or (o, s) is treated as a single element, giving two elements in total. This conversion takes an enormous amount of time.
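The decomposition described above can be illustrated with a short sketch. It assumes that the combined pair serves as the key and the remaining element as the value; the text does not state which side is the key, and the function name is hypothetical.

```python
def to_key_value_pairs(s, p, o):
    """Decompose one (s, p, o) triple into three two-element key=value records.

    Assumption for illustration: the combined pair is the key and the
    remaining element is the value.
    """
    return [
        ((s, p), o),
        ((p, o), s),
        ((o, s), p),
    ]

triples = [("A", "likes", "D"), ("B", "likes", "C")]  # illustrative data
records = [kv for triple in triples for kv in to_key_value_pairs(*triple)]
print(len(records))  # three records per triple
```

Every triple produces three records, so the preprocessing cost and the stored volume both grow with the full size of the data set, which is the overhead the passage points to.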

Furthermore, when a search over RDF data fixes two of the three elements, each of the two elements must be compared during the search. Each element of RDF data can hold a long character string, in other words a value that occupies a large data area. A search in which two such long character strings are the fixed elements may take an enormous amount of time.

None of the conventional techniques removes the difference between the data format of RDF data and the format handled by MapReduce processing, so it remains difficult to perform search processing at high speed. This holds for the technique that adds to the RDF data new nodes created by combining the values of frequently appearing variables, for the technique that determines the computer on which a data item is placed from its position within the set, and for the technique that manages documents using tags and character strings. The technique that registers pairs of a character-string-based hash value and a key value as an index speeds up searches when the elements do not differ, but it likewise leaves the difference in data format unresolved when the number of elements differs.

The disclosed technology has been made in view of the above, and its object is to provide a search processing program, a search processing method, and an information processing apparatus that execute search processing at high speed.

In one aspect of the search processing program, search processing method, and information processing apparatus disclosed in the present application, a computer is caused to execute the following processing. Two elements are extracted from data having three elements, and a first table is generated in which an identifier smaller in size than the extracted two elements is associated with those two elements. A second table is generated by adding the identifier to a table of the three elements. The second table is divided and placed in a plurality of processing units. At search time, the identifier is retrieved using the first table, and each processing unit uses the retrieved identifier to search the portion of the second table placed in it. The rows of the second table extracted by the search are output.

In one aspect, the present invention can execute search processing at high speed.

FIG. 1 is a system configuration diagram of an information processing system.
FIG. 2 is a block diagram showing the details of the master server and the slave servers.
FIG. 3 is a diagram showing an example of RDF data represented in a tree format.
FIG. 4 is a diagram showing an example of RDF data represented in tabular form.
FIG. 5 is a diagram showing an example of an identifier correspondence table.
FIG. 6 is a diagram showing an example of an identifier-attached RDF data table.
FIG. 7 is a diagram showing an overview of MapReduce processing.
FIG. 8 is a diagram showing an overview of MapReduce processing when parameter identifiers according to the first embodiment are used.
FIG. 9 is a flowchart of the processing for generating the identifier table and the identifier-attached RDF data table according to the first embodiment.
FIG. 10 is a flowchart of MapReduce processing according to the first embodiment.
FIG. 11 is a diagram showing an example of a divided data table.
FIG. 12 is a diagram showing an overview of MapReduce processing when parameter identifiers according to the second embodiment are used.
FIG. 13 is a diagram showing an example of the hardware configuration of a computer.
FIG. 14 is a block diagram showing the details of the master server and the slave servers according to the third embodiment.
FIG. 15 is a diagram showing an example of the identifier correspondence table before division.
FIG. 16 is a diagram showing an example of a divided identifier correspondence table.
FIG. 17 is a diagram showing an overview of MapReduce processing when parameter identifiers according to the third embodiment are used.
FIG. 18 is a flowchart of the processing for generating the identifier table and the identifier-attached RDF data table according to the third embodiment.
FIG. 19 is a flowchart of MapReduce processing according to the third embodiment.

Embodiments of the search processing program, search processing method, and information processing apparatus disclosed in the present application are described in detail below with reference to the drawings. The following embodiments do not limit the search processing program, search processing method, and information processing apparatus disclosed in the present application.

FIG. 1 is a system configuration diagram of an information processing system. As shown in FIG. 1, the information processing system 1 has a Hadoop cluster 10, an HDFS (Hadoop Distributed File System) client 20, and a job client 30.

The HDFS client 20 is an information processing terminal that issues data management instructions to the Hadoop cluster 10. The HDFS client 20 is connected to the master server 11 of the Hadoop cluster 10 via a network. It receives data management instructions entered by a user through an input device (not shown), and transmits the corresponding data management processing command to the master server 11 via the HDFS API (Application Programming Interface).

The job client 30 is an information processing terminal that issues job management instructions to the Hadoop cluster 10. The job client 30 is connected to the master server 11 of the Hadoop cluster 10 via a network and holds a MapReduce program. It receives job management instructions entered by a user through an input device (not shown), and transmits the corresponding job management processing command to the master server 11.

The HDFS client 20 and the job client 30 may be placed on the same information processing apparatus or on different ones. Their functions may also be placed inside the Hadoop cluster 10.

The Hadoop cluster 10 has a master server 11 and slave servers 12. FIG. 1 shows three slave servers 12, but the number of slave servers 12 is not particularly limited. The master server 11 is connected to each slave server 12 via a network, and also to the HDFS client 20 and the job client 30. The slave servers 12 are likewise connected to one another via a network.

FIG. 2 is a block diagram showing the details of the master server and the slave servers. The master server 11 and the slave servers 12 are described below with reference to FIG. 2. FIG. 1 shows only the main components in order to outline the configuration and omits several others; in more detail, the master server 11 and the slave servers 12 have the configuration shown in FIG. 2.

The master server 11 has an RDF store 110, an HDFS 111, a name node 112, a metadata DB (database) 113, and a job tracker 114. It further has a first generation unit 115, a second generation unit 116, an RDF controller 117, a SPARQL processing unit 118, and a MapReduce processing unit 119.

The HDFS 111 is a virtual file system that cooperates with multiple servers to present what appears to be a single file system. The HDFS 111 manages files by dividing them into units of a fixed block size, which is 64 MB by default. The HDFS 111 has no exclusive control function. Files can be newly created and appended to in the HDFS 111, but modification is not permitted. One Map task is created for each block of the HDFS 111.
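As a rough worked example of the block arithmetic above, the sketch below computes how many blocks, and therefore how many Map tasks, a file of a given size would produce. It assumes the 64 MB default stated in the text and plain ceiling division; the function name is hypothetical, and real clusters may be configured with a different block size.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated in the text

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file; the last block may be partly filled."""
    if file_size_bytes <= 0:
        return 0
    return (file_size_bytes + block_size - 1) // block_size  # ceiling division

# One Map task per block: a 200 MB file would be split into 4 blocks.
print(block_count(200 * 1024 * 1024))
```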

RDF data can be represented in a tree format, for example as shown in FIG. 3. FIG. 3 is a diagram showing an example of RDF data represented in a tree format. In FIG. 3, the data enclosed in an ellipse at the start point of an arrow is the subject, the data enclosed in an ellipse at the end point of an arrow is the object, and the data written on the arrow is the predicate.

RDF data can also be represented in tabular form, as shown in FIG. 4. FIG. 4 is a diagram showing an example of RDF data represented in tabular form; it shows the tree-format RDF data of FIG. 3 as a table. In FIG. 4, id (identifier) is the identifier given to each triple, Subject is the subject of the triple, predicate is its predicate, and object is its object. The subject, predicate, and object of each triple are registered in one row of the table, in association with the identifier assigned to that triple.
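The two representations can be sketched from the same list of triples as follows. The sample triples are illustrative and are not the actual contents of FIG. 3 or FIG. 4; the column names follow the text, while the function names and the sequential numbering of ids from 1 are assumptions.

```python
triples = [
    ("A", "likes", "D"),
    ("B", "likes", "C"),
    ("C", "loves", "A"),
]  # illustrative data only

def to_table(triples):
    """Tabular form: one row per triple, keyed by a sequential id (assumed to start at 1)."""
    return [
        {"id": i, "Subject": s, "predicate": p, "object": o}
        for i, (s, p, o) in enumerate(triples, start=1)
    ]

def to_graph(triples):
    """Graph form: each subject node points to its objects along predicate-labelled arrows."""
    edges = {}
    for s, p, o in triples:
        edges.setdefault(s, []).append((p, o))
    return edges

print(to_table(triples)[0])
print(to_graph(triples)["A"])
```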

The RDF store 110 can hold RDF data in tabular form, for example as shown in FIG. 4. Saving the RDF data stored in the RDF store 110 to the HDFS 111 as input makes it possible to operate on the data in the HDFS 111 with the MapReduce processing described later.

Returning to FIG. 2, the description continues. The RDF controller 117 manages the RDF data stored in the RDF store 110. For example, on receiving a read request or a storage request, the RDF controller 117 reads the designated RDF data from, or stores it in, the RDF store 110. On receiving an instruction to save the RDF data stored in the RDF store 110 to the HDFS 111, the RDF controller 117 causes that RDF data to be saved to the HDFS 111 as input.

The first generation unit 115 receives an instruction to generate the identifier correspondence table from the HDFS client 20. It then instructs the RDF controller 117 to acquire the subjects, predicates, and objects of all RDF data registered in the RDF store 110, and subsequently receives them from the RDF controller 117.

Next, the first generation unit 115 tabulates the acquired subjects, predicates, and objects, removing duplicates from each. Using the tabulated subjects, predicates, and objects, it generates every possible combination of "subject, predicate", "predicate, object", and "subject, object". These pairs are hereinafter referred to as "value patterns". The subject, predicate, and object of the RDF data are an example of the "three elements", and the two values contained in a value pattern are an example of the "two elements".
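The de-duplication and combination step can be sketched as below. The sample triples are hypothetical; they are chosen only so that the distinct predicates and objects match the ones named in the text ("likes", "loves" and "A", "C", "D", "F"). The function name, the dictionary keys, and the sorted ordering are assumptions.

```python
from itertools import product

triples = [
    ("A", "likes", "D"),
    ("B", "likes", "C"),
    ("B", "likes", "F"),
    ("C", "loves", "A"),
    ("D", "loves", "C"),
    ("E", "loves", "F"),
]  # hypothetical data, not the contents of FIG. 4

def value_patterns(triples):
    """Build every candidate value pattern from the de-duplicated elements."""
    subjects = sorted({s for s, _, _ in triples})
    predicates = sorted({p for _, p, _ in triples})
    objects = sorted({o for _, _, o in triples})
    return {
        "subject-predicate": list(product(subjects, predicates)),
        "predicate-object": list(product(predicates, objects)),
        "subject-object": list(product(subjects, objects)),
    }

patterns = value_patterns(triples)
print(len(patterns["predicate-object"]))  # 2 predicates x 4 objects
```

Because the combinations are formed from the distinct values rather than from the rows themselves, some candidates, such as ("likes", "A") here, never occur in the data.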

Next, the first generation unit 115 determines whether the generated value patterns include any value pattern that does not occur in the subject, predicate, and object sets of the actual RDF data. If such value patterns exist, the first generation unit 115 extracts the remaining value patterns, that is, those that do occur in the RDF data.

The first generation unit 115 assigns a pattern identifier to each extracted value pattern. The identifier has a smaller data size than the value pattern, where data size means the amount of memory occupied. The first generation unit 115 then creates an identifier correspondence table that represents the correspondence between each value pattern and its assigned pattern identifier. Value patterns that do not occur in the subject, predicate, and object sets of the actual RDF data are registered in the identifier correspondence table with information indicating non-existence. FIG. 5 is a diagram showing an example of an identifier correspondence table.

The following describes the case where the identifier correspondence table is created from the RDF data shown in FIG. 4, taking as an example the generation of value patterns that combine a predicate and an object.

The first generation unit 115 tabulates the predicates 201 in the RDF data shown in FIG. 4, removing duplicates, and obtains the two words "likes" and "loves" as predicates. It likewise tabulates the objects 202 in the RDF data shown in FIG. 4, removing duplicates, and obtains the four words "A", "C", "D", and "F" as objects. The first generation unit 115 then generates all combinations of the obtained predicates and objects, which yields "likes A", "likes C", "likes D", "likes F", "loves A", "loves C", "loves D", and "loves F" as value patterns. It determines that "likes A" and "loves D" are not included in the RDF data shown in FIG. 4. The first generation unit 115 then assigns pattern identifiers to the value patterns that actually exist, associates information indicating non-existence with those that do not, and generates the identifier correspondence tables 211 and 212 shown in FIG. 5. In FIG. 5, the identifier correspondence table 211, which contains the value patterns with "likes" as the predicate, and the identifier correspondence table 212, which contains the value patterns with "loves" as the predicate, are shown separately. The non-existent value patterns "likes A" and "loves D" are registered in the identifier correspondence tables 211 and 212, respectively, with NA (Not Applicable) indicating non-existence.
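A minimal sketch of building such a predicate-object identifier correspondence table follows. The triples are hypothetical and chosen so that "likes A" and "loves D" are absent, as in the text. The four-digit sequential numbering is an assumption for illustration; the text does not say how the first generation unit 115 chooses identifier values.

```python
from itertools import product

NA = "NA"  # marker for value patterns that do not occur in the data

triples = [
    ("A", "likes", "D"),
    ("B", "likes", "C"),
    ("B", "likes", "F"),
    ("C", "loves", "A"),
    ("D", "loves", "C"),
    ("E", "loves", "F"),
]  # hypothetical data, not the contents of FIG. 4

def build_po_table(triples):
    """Map every candidate (predicate, object) pattern to a short identifier or NA."""
    predicates = sorted({p for _, p, _ in triples})
    objects = sorted({o for _, _, o in triples})
    existing = {(p, o) for _, p, o in triples}
    table = {}
    next_id = 1
    for pattern in product(predicates, objects):
        if pattern in existing:
            table[pattern] = f"{next_id:04d}"  # assumed numbering scheme
            next_id += 1
        else:
            table[pattern] = NA
    return table

po_table = build_po_table(triples)
for pattern, identifier in po_table.items():
    print(pattern, identifier)
```

Keeping the absent patterns in the table with NA, rather than omitting them, lets a later lookup distinguish "known not to exist" from "never considered".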

The first generation unit 115 then transmits the generated identifier correspondence table to the RDF controller 117 and instructs it to store the table in the RDF store 110. It also notifies the second generation unit 116 that generation of the identifier correspondence table is complete. This identifier correspondence table is an example of the "first table".

The second generation unit 116 receives the notification that generation of the identifier correspondence table is complete from the first generation unit 115. It then transmits a request to acquire the identifier correspondence table to the RDF controller 117, and acquires from the RDF controller 117 all identifier correspondence tables created by the first generation unit 115.

Next, the second generation unit 116 transmits to the RDF controller 117 a request to acquire each piece of RDF data registered in the RDF store 110. It checks the subject, predicate, and object of each piece of RDF data acquired from the RDF controller 117, and obtains from the identifier correspondence table the pattern identifier corresponding to the value pattern of each combination. It then generates an identifier-attached RDF data table by adding the pattern identifiers obtained for each piece of RDF data to the table of triples.

FIG. 6 is a diagram showing an example of the identifier-attached RDF data table. In FIG. 6, "vp-id" denotes a pattern identifier. The pattern identifier 221 corresponds to the value pattern combining subject and predicate, the pattern identifier 222 to the value pattern combining predicate and object, and the pattern identifier 223 to the value pattern combining subject and object.

For example, the second generation unit 116 acquires "A", "likes", and "D" as the subject, predicate, and object of the RDF data in the first row of FIG. 4. From these values it obtains the value patterns "A likes", "likes D", and "A D". It then obtains the pattern identifier corresponding to each value pattern. For example, it obtains "0002", the pattern identifier of "likes D", from the identifier correspondence table 211 of FIG. 5. Similarly, it obtains "2001" and "4001" as the pattern identifiers of "A likes" and "A D". The second generation unit 116 then registers each pattern identifier in association with the RDF data of the first row.
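The lookup for the first row can be sketched as follows. The identifier values "2001", "0002", and "4001" are the ones given in the text; the three lookup dictionaries contain only those entries, and the column names vp_id_sp, vp_id_po, and vp_id_so as well as the function name are hypothetical.

```python
# Identifier correspondence tables, reduced to the entries named in the text.
sp_ids = {("A", "likes"): "2001"}  # subject-predicate patterns
po_ids = {("likes", "D"): "0002"}  # predicate-object patterns
so_ids = {("A", "D"): "4001"}      # subject-object patterns

def attach_identifiers(row, sp_ids, po_ids, so_ids):
    """Return a copy of the row extended with its three pattern identifiers."""
    s, p, o = row["Subject"], row["predicate"], row["object"]
    return {
        **row,
        "vp_id_sp": sp_ids[(s, p)],
        "vp_id_po": po_ids[(p, o)],
        "vp_id_so": so_ids[(s, o)],
    }

first_row = {"id": 1, "Subject": "A", "predicate": "likes", "object": "D"}
print(attach_identifiers(first_row, sp_ids, po_ids, so_ids))
```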

The second generation unit 116 then transmits the generated identifier-attached RDF data table to the RDF controller 117 and has it stored in the RDF store 110. This identifier-attached RDF data table is an example of the "second table".

The name node 112 sends the RDF controller 117 a request to acquire the identifier-attached RDF data table stored in the RDF store 110, and acquires the table from the RDF controller 117. When distributing the data of the identifier-attached RDF data table, the name node 112 handles the data stored in the RDF store 110 in cooperation with the RDF controller 117; when handling data in other formats, it may obtain the data directly from the RDF store 110.

The name node 112 determines the data node 121 in which each block containing part of the row data of the identifier-attached RDF data table is to be stored. Although FIG. 2 shows a single slave server 12 for clarity, multiple slave servers 12 are actually arranged as shown in FIG. 1, and the name node 112 selects the placement destination of each block from among the data nodes 121 of the slave servers 12.

そして、ネームノード１１２は、識別子付ＲＤＦデータ表の一部の行データを含むブロックを選択したデータノード１２１へ送信し配置する。ここで、ネームノード１１２は、複数のブロックを１つのデータノード１２１へ送信してもよい。この各ネームノード１１２へのブロックの配置が、「複数の処理装置に分割して配置」することの一例にあたる。さらに、ネームノード１１２は、各ブロックの保存先のデータノード１２１の情報をメタデータＤＢ１１３に登録する。 The name node 112 then transmits each block containing part of the row data of the identifier-attached RDF data table to the selected data node 121 and places it there. The name node 112 may transmit a plurality of blocks to one data node 121. This placement of blocks on the data nodes 121 corresponds to an example of "dividing and placing in a plurality of processing units". The name node 112 further registers, in the metadata DB 113, information on the data node 121 in which each block is stored.

ここで、分散配置において、ネームノード１１２は、１つのデータブロックを複製して複数のデータノード１２１に配置する。例えば、ネームノード１１２は、１つのデータブロックを複製して３つにする。これにより、あるデータノード１２１に障害が発生した場合に、他のデータノード１２１に格納された同一のブロックを用いることができるようになり、Ｈａｄｏｏｐクラスタ１０の耐障害性が確保される。このネームノード１１２が、「配置部」の一例にあたる。 In the distributed placement, the name node 112 replicates each data block and places the copies on a plurality of data nodes 121. For example, the name node 112 replicates one data block into three copies. As a result, when a failure occurs in one data node 121, the same block stored in another data node 121 can be used, which ensures the fault tolerance of the Hadoop cluster 10. The name node 112 corresponds to an example of the "placement section".
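The replicated placement and the registration in the metadata DB can be illustrated with a small Python sketch. The round-robin policy, the function names `place_blocks` and `nodes_for_block`, and the block and node labels are assumptions made for illustration; the text only states that one block is replicated (for example, into three copies) and that the storage destinations are registered.

```python
REPLICATION = 3  # the text's example: one block is replicated into three copies


def place_blocks(block_ids, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes.

    Round-robin selection is an assumed policy, not one stated in the text.
    The returned dict plays the role of the metadata DB (block -> data nodes).
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    metadata = {}
    for start, block_id in enumerate(block_ids):
        metadata[block_id] = [
            data_nodes[(start + i) % len(data_nodes)] for i in range(replication)
        ]
    return metadata


def nodes_for_block(metadata, block_id, failed=()):
    """Return the data nodes that can still serve a block when some nodes have failed."""
    return [node for node in metadata[block_id] if node not in failed]


metadata_db = place_blocks(["b0", "b1", "b2", "b3"], ["dn1", "dn2", "dn3", "dn4"])
print(metadata_db["b0"])
print(nodes_for_block(metadata_db, "b0", failed={"dn1"}))
```

Because every block has copies on other nodes, losing one data node still leaves a usable replica, which is the fault-tolerance property the paragraph describes.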

ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリの入力をジョブクライアント３０から受ける。そして、ＳＰＡＲＱＬ処理部１１８は、取得したＳＰＡＲＱＬクエリを解析してＭａｐＲｅｄｕｃｅ処理に変換する。さらに、ＳＰＡＲＱＬ処理部１１８は、識別子対応表の取得要求をＲＤＦコントローラ１１７に通知する。その後、ＳＰＡＲＱＬ処理部１１８は、ＲＤＦストア１１０に格納された識別子対応表をＲＤＦコントローラ１１７から取得する。次に、ＳＰＡＲＱＬ処理部１１８は、識別子対応表を参照して、取得したクエリの要素に対応するｖａｌｕｅパターンが存在するか否かを判定する。取得したクエリの要素に対応するｖａｌｕｅパターンが存在しなければ、ＳＰＡＲＱＬ処理部１１８は、そのようなｖａｌｕｅパターンのマッチング結果は０件としてジョブクライアント３０に検索結果を返す。 The SPARQL processing unit 118 receives SPARQL query input from the job client 30 . Then, the SPARQL processing unit 118 analyzes the obtained SPARQL query and converts it into MapReduce processing. Furthermore, the SPARQL processing unit 118 notifies the RDF controller 117 of an identifier correspondence table acquisition request. After that, the SPARQL processing unit 118 acquires the identifier correspondence table stored in the RDF store 110 from the RDF controller 117 . Next, the SPARQL processing unit 118 refers to the identifier correspondence table and determines whether or not there is a value pattern corresponding to the acquired query element. If there is no value pattern corresponding to the acquired query element, the SPARQL processing unit 118 returns the search result to the job client 30 as zero matching results for such a value pattern.

一方、取得したクエリの要素に対応するｖａｌｕｅパターンが存在する場合、ＳＰＡＲＱＬ処理部１１８は、取得したクエリの要素のｖａｌｕｅパターンに割り当てられたパターン識別子を取得する。次に、ＳＰＡＲＱＬ処理部１１８は、ＭＡＰＲｅｄｕｃｅ処理において、文字列を取得したパターン識別子に置き換える。その後、ＳＰＡＲＱＬ処理部１１８は、パターン識別子を含むＭａｐＲｅｄｕｃｅ処理をＭａｐＲｅｄｕｃｅ処理部１１９に出力する。 On the other hand, if a value pattern corresponding to the element of the acquired query exists, the SPARQL processing unit 118 acquires the pattern identifier assigned to that value pattern. Next, the SPARQL processing unit 118 replaces the character string in the MapReduce processing with the acquired pattern identifier. After that, the SPARQL processing unit 118 outputs the MapReduce processing containing the pattern identifier to the MapReduce processing unit 119.
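The two branches described above (no matching value pattern versus substitution of the pattern identifier) can be sketched as follows. The function name `plan_search` and the shape of the returned dictionary are hypothetical; the table entry reuses the "likesD" / "0002" example from the text.

```python
def plan_search(query_pattern, identifier_table):
    """Decide how to handle one value pattern taken from a parsed query.

    Returns a small plan dict (an assumed structure for illustration).
    """
    pattern_id = identifier_table.get(query_pattern)
    if pattern_id is None:
        # The value pattern does not occur in the RDF data:
        # report zero matches without running any MapReduce processing.
        return {"run_mapreduce": False, "matches": 0}
    # The value pattern exists: the search string is replaced by its identifier.
    return {"run_mapreduce": True, "pattern_id": pattern_id}


table = {"likesD": "0002"}
print(plan_search("likesD", table))
print(plan_search("hatesZ", table))
```

The early return is what allows a query on a non-existent pattern to be answered without dispatching work to the slave servers.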

その後、ＳＰＡＲＱＬ処理部１１８は、ＭａｐＲｅｄｕｃｅ処理の実行結果の入力をＭａｐＲｅｄｕｃｅ処理部１１９から受ける。そして、ＳＰＡＲＱＬ処理部１１８は、取得したＭａｐＲｅｄｕｃｅ処理の実行結果をＳＰＡＲＱＬクエリの実行結果としてジョブクライアント３０へ送信する。このＳＰＡＲＱＬ処理部１１８が、「出力部」の一例にあたる。 After that, the SPARQL processing unit 118 receives the input of the execution result of the MapReduce processing from the MapReduce processing unit 119 . Then, the SPARQL processing unit 118 transmits the acquired execution result of the MapReduce process to the job client 30 as the execution result of the SPARQL query. The SPARQL processing unit 118 corresponds to an example of the "output unit".

ＭａｐＲｅｄｕｃｅ処理部１１９は、パターン識別子を含むＭａｐＲｅｄｕｃｅ処理の入力をＳＰＡＲＱＬ処理部１１８から受ける。このＭａｐＲｅｄｕｃｅ処理には、元のＳＰＡＲＱＬ処理に含まれる個々の検索処理に対応する複数のＭａｐＲｅｄｕｃｅ処理が含まれる。そこで、ＭａｐＲｅｄｕｃｅ処理部１１９は、受信したＭａｐＲｅｄｕｃｅ処理に含まれる個々のＭａｐＲｅｄｕｃｅ処理を取得する。そして、ＭａｐＲｅｄｕｃｅ処理部１１９は、取得した各ＭａｐＲｅｄｕｃｅ処理の実行をジョブトラッカー１１４に指示する。この場合、検索に用いる文字列がパターン識別子に置き換えられているので、ＭａｐＲｅｄｕｃｅ処理部１１９は、パターン識別子を用いたＭａｐＲｅｄｕｃｅ処理の実行をジョブトラッカー１１４に指示する。 The MapReduce processing unit 119 receives a MapReduce processing input including a pattern identifier from the SPARQL processing unit 118 . This MapReduce process includes multiple MapReduce processes corresponding to individual search processes included in the original SPARQL process. Therefore, the MapReduce processing unit 119 acquires each MapReduce process included in the received MapReduce process. The MapReduce processing unit 119 then instructs the job tracker 114 to execute each acquired MapReduce process. In this case, since the character string used for the search is replaced with the pattern identifier, the MapReduce processing unit 119 instructs the job tracker 114 to execute the MapReduce process using the pattern identifier.

その後、ＭａｐＲｅｄｕｃｅ処理部１１９は、ＭａｐＲｅｄｕｃｅ処理の実行結果の入力をジョブトラッカー１１４から受ける。そして、ＭａｐＲｅｄｕｃｅ処理部１１９は、ＭａｐＲｅｄｕｃｅ処理の実行結果をＳＰＡＲＱＬ処理部１１８へ出力する。 After that, the MapReduce processing unit 119 receives an input of the execution result of the MapReduce processing from the job tracker 114 . The MapReduce processing unit 119 then outputs the execution result of the MapReduce processing to the SPARQL processing unit 118 .

ジョブトラッカー１１４は、各ＭａｐＲｅｄｕｃｅ処理の実行の指示をＭａｐＲｅｄｕｃｅ処理部１１９から受ける。次に、ジョブトラッカー１１４は、メタデータＤＢ１１３に格納された各ブロックが配置されたデータノード１２１を確認し、各ＭａｐＲｅｄｕｃｅ処理を実行させるデータノード１２１を決定する。そして、ジョブトラッカー１１４は、１つのブロックに対して１つのＭａｐタスクを生成して割り当てる。その後、ジョブトラッカー１１４は、各Ｍａｐタスクを対応するブロックを保持するデータノード１２１を有するスレーブサーバ１２のタスクトラッカー１２３へ送信する。このように、各Ｍａｐタスクが対象とするブロックを有するスレーブサーバ１２に対して、それぞれのＭａｐタスクが割り振られることにより、通信コストを最小化することができる。 The job tracker 114 receives an instruction to execute each MapReduce process from the MapReduce processing unit 119 . Next, the job tracker 114 confirms the data node 121 where each block stored in the metadata DB 113 is arranged, and determines the data node 121 to execute each MapReduce process. The job tracker 114 then generates and assigns one Map task to one block. Job tracker 114 then sends each Map task to task tracker 123 of slave server 12 having a data node 121 holding the corresponding block. In this way, the communication cost can be minimized by allocating each Map task to the slave server 12 having a block targeted by each Map task.
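A minimal sketch of this locality-aware assignment is shown below. The function name `assign_map_tasks`, the task dictionary, and the rule of picking the first server that holds a block are assumptions; the text only states that one Map task is generated per block and sent to a slave server holding that block.

```python
def assign_map_tasks(block_locations, job_id):
    """Create one Map task per block and address it to a server holding that block.

    `block_locations` stands in for the metadata DB (block -> servers holding it).
    Choosing the first listed holder is an assumed tie-breaking rule.
    """
    tasks = []
    for block_id, holders in sorted(block_locations.items()):
        tasks.append({"job": job_id, "block": block_id, "server": holders[0]})
    return tasks


locations = {
    "b0": ["slave1", "slave2", "slave3"],
    "b1": ["slave2", "slave3", "slave1"],
}
for task in assign_map_tasks(locations, job_id="job-1"):
    print(task)
```

Since each task runs where its block already resides, the block itself does not have to be transferred over the network before Map processing starts.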

その後、ジョブトラッカー１１４は、各スレーブサーバ１２のタスクトラッカー１２３からジョブの実行結果を受信する。そして、ジョブトラッカー１１４は、ジョブの実行結果をまとめたＭａｐＲｅｄｕｃｅ処理の実行結果をＭａｐＲｅｄｕｃｅ処理部１１９へ出力する。 After that, the job tracker 114 receives the job execution result from the task tracker 123 of each slave server 12 . Then, the job tracker 114 outputs to the MapReduce processing unit 119 an execution result of the MapReduce process, which summarizes the job execution results.

次に、スレーブサーバ１２について説明する。スレーブサーバ１２は、図２に示すように、データノード１２１、ＨＤＦＳ１２２、タスクトラッカー１２３及びＭａｐＲｅｄｕｃｅ処理部１２４を有する。 Next, the slave server 12 will be explained. The slave server 12 has a data node 121, an HDFS 122, a task tracker 123 and a MapReduce processing unit 124, as shown in FIG.

ＨＤＦＳ１２２は、ＨＤＦＳ１１１と同様にデフォルト６４ＭＢのサイズのブロック単位でデータを管理する。各スレーブサーバ１２のそれぞれのＨＤＦＳ１２２は、全体で１つの仮想ファイルシステムを形成する。 The HDFS 122 manages data in block units with a default size of 64 MB, like the HDFS 111 . Each HDFS 122 of each slave server 12 forms one virtual file system as a whole.

データノード１２１は、識別子付ＲＤＦデータ表の一部の行データを含むブロックをネームノード１１２から受信する。ここで、データノード１２１は、複数のブロックを受信してもよい。そして、データノード１２１は、取得したブロックを自装置のＨＤＦＳ１２２へ格納する。すなわち、ＨＤＦＳ１２２には、識別子付ＲＤＦデータ表の全行のうちの一部の行のデータが格納される。 The data node 121 receives from the name node 112 a block containing partial row data of the RDF data table with identifier. Here, data node 121 may receive multiple blocks. Then, the data node 121 stores the obtained block in the HDFS 122 of its own device. That is, the HDFS 122 stores the data of some of the rows of the RDF data table with identifiers.

タスクトラッカー１２３は、自装置が有するブロックに対応するＭａｐタスクをジョブトラッカー１１４から受信する。そして、タスクトラッカー１２３は、Ｍａｐタスクで指示されたＭａｐ処理の実行をＭａｐＲｅｄｕｃｅ処理部１２４に指示する。その後、タスクトラッカー１２３は、Ｍａｐタスクにしたがって実行されたＭａｐＲｅｄｕｃｅ処理の実行結果の入力をＭａｐＲｅｄｕｃｅ処理部１２４から受ける。そして、タスクトラッカー１２３は、Ｍａｐタスク毎の実行結果をジョブトラッカー１１４へ送信する。 The task tracker 123 receives from the job tracker 114 the Map task corresponding to the blocks it owns. The task tracker 123 then instructs the MapReduce processing unit 124 to execute the Map processing instructed by the Map task. After that, the task tracker 123 receives the input of the execution result of the MapReduce process executed according to the MapTask from the MapReduce processing unit 124 . The task tracker 123 then transmits the execution result of each Map task to the job tracker 114 .

ＭａｐＲｅｄｕｃｅ処理部１２４は、タスクトラッカー１２３から取得したＭａｐタスクにしたがってＭａｐＲｅｄｕｃｅ処理を実行する。ここで、図７を参照して、ＭａｐＲｅｄｕｃｅ処理について説明する。図７は、ＭａｐＲｅｄｕｃｅ処理の概要を表す図である。ここでは、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃが動作する場合で説明する。さらに、ここでは、ＭａｐＲｅｄｕｃｅ処理部１２４Ａがブロック３０１～３０３に対する処理を行い、ＭａｐＲｅｄｕｃｅ処理部１２４Ｂ及び１２４Ｃは他のブロックを処理する。各ブロックのデータは、ｋｅｙ＝ｖａｌｕｅの形式を有するデータを含む。図７において括弧でくくられた２つの文字は、先頭の文字がｋｅｙを表し、２番目の文字がｖａｌｕｅを表す。さらに、ここでは、ＭａｐＲｅｄｕｃｅ処理としてｖａｌｕｅがＸのデータをカウントする処理を実行する場合で説明する。 The MapReduce processing unit 124 executes MapReduce processing according to the Map task acquired from the task tracker 123. The MapReduce processing is described here with reference to FIG. 7, which is a diagram showing an overview of MapReduce processing. The description assumes that the MapReduce processing units 124A to 124C operate, that the MapReduce processing unit 124A processes blocks 301 to 303, and that the MapReduce processing units 124B and 124C process other blocks. The data in each block is in key=value format. Of the two characters enclosed in parentheses in FIG. 7, the first represents the key and the second represents the value. The description further assumes that the MapReduce processing counts the data whose value is X.

ＭａｐＲｅｄｕｃｅ処理部１２４Ａは、ブロック３０１～３０３の各データを入力として、入力をｍａｐ関数に与えて内部で処理した結果を新たなｋｅｙ＝ｖａｌｕｅの形式のデータとして出力する。ここでは、ＭａｐＲｅｄｕｃｅ処理部１２４Ａは、ＶａｌｕｅがＸであるデータを出力する。この場合、ＭａｐＲｅｄｕｃｅ処理部１２４Ａは、ブロック３０１から（Ｋ１，Ｘ）及び（Ｋ４，Ｘ）を抽出し、ブロック３０２から（Ｋ２，Ｘ）及び（Ｋ３，Ｘ）を抽出し、ブロック３０３から（Ｋ２，Ｘ）及び（Ｋ５，Ｘ）を抽出する。この処理がＭａｐ処理にあたる。Ｍａｐ処理は、ブロック３０１～３０３毎に行われる。 The MapReduce processing unit 124A takes the data of blocks 301 to 303 as input, passes the input to the map function, and outputs the result of the internal processing as new data in key=value format. Here, the MapReduce processing unit 124A outputs the data whose value is X. In this case, the MapReduce processing unit 124A extracts (K1, X) and (K4, X) from block 301, (K2, X) and (K3, X) from block 302, and (K2, X) and (K5, X) from block 303. This processing corresponds to the Map processing. The Map processing is performed separately for each of blocks 301 to 303.

同様に、ＭａｐＲｅｄｕｃｅ処理部１２４Ｂは、処理対象とするブロックからｖａｌｕｅがＸであるものを抽出する。この場合、ＭａｐＲｅｄｕｃｅ処理部１２４Ｂは、（Ｋ１，Ｘ）、（Ｋ４，Ｘ）、（Ｋ５，Ｘ）、（Ｋ１，Ｘ）及び（Ｋ６，Ｘ）を抽出する。また、ＭａｐＲｅｄｕｃｅ処理部１２４Ｃも同様に処理対象とするブロックからｖａｌｕｅがＸであるものを抽出する。 Similarly, the MapReduce processing unit 124B extracts the data whose value is X from the blocks it processes. In this case, the MapReduce processing unit 124B extracts (K1, X), (K4, X), (K5, X), (K1, X), and (K6, X). The MapReduce processing unit 124C likewise extracts the data whose value is X from the blocks it processes.

次に、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、抽出した各データを分類してそれぞれを、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃのうちの決められた送信先へ送信する。例えば、図７では、ｋｅｙがＫ１及びＫ２のデータがＭａｐＲｅｄｕｃｅ処理部１２４Ａへまとめられる。また、ｋｅｙがＫ３及びＫ４のデータがＭａｐＲｅｄｕｃｅ処理部１２４Ｂへまとめられる。また、ｋｅｙがＫ５及びＫ６のデータがＭａｐＲｅｄｕｃｅ処理部１２４Ｃへまとめられる。次に、各ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、自己に集められたデータを並び替える。ここでは、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、ｋｅｙ毎にまとまるようにデータを並び替える。すなわち、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、おなじｋｅｙを有するｋｅｙ＝ｖａｌｕｅ形式のデータ同士を集約する。これらの処理をシャッフル及びソート処理と言う。 Next, the MapReduce processing units 124A to 124C classify each extracted data and transmit each of them to a predetermined transmission destination among the MapReduce processing units 124A to 124C. For example, in FIG. 7, the data with keys K1 and K2 are combined into the MapReduce processing unit 124A. Also, data with keys K3 and K4 are put together in the MapReduce processing unit 124B. Also, data with keys K5 and K6 are put together in the MapReduce processing unit 124C. Next, each of the MapReduce processing units 124A to 124C rearranges the collected data. Here, the MapReduce processing units 124A to 124C rearrange the data so that they are grouped for each key. In other words, the MapReduce processing units 124A to 124C aggregate the key=value format data having the same key. These processes are called shuffle and sort processes.

次に、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、シャッフル及びソート処理が完了したデータを取得し、取得したデータをＲｅｄｕｃｅ関数の内部で処理した結果をｋｅｙ＝ｖａｌｕｅ形式のデータとして出力する。ここでは、ＭａｐＲｅｄｕｃｅ処理部１２４Ａ～１２４Ｃは、Ｒｅｄｕｃｅ関数として同じｋｅｙを有するデータ毎に集計を行う。図７では、ＭａｐＲｅｄｕｃｅ処理部１２４Ａは、（Ｋ１，Ｘ）が３つあることを表すデータとして（Ｋ１，３）を出力する。また、ＭａｐＲｅｄｕｃｅ処理部１２４Ａは、（Ｋ２，Ｘ）が２つあることを表すデータとして（Ｋ２，２）を出力する。ＭａｐＲｅｄｕｃｅ処理部１２４Ｂは、ｋｅｙがＫ３又はＫ４であるデータの集計結果を出力する。ＭａｐＲｅｄｕｃｅ処理部１２４Ｃは、ｋｅｙがＫ５又はＫ６であるデータの集計結果を出力する。この処理をＲｅｄｕｃｅ処理と言う。Ｒｅｄｕｃｅ処理は、利用者が編集可能である。ここでは、Ｒｅｄｕｃｅ処理として、同じｋｅｙを有するデータの集計を行う処理を行ったが、他の処理に変更することも可能である。例えば、ＳＰＡＲＱＬクエリに対応する結果を返す場合、Ｒｅｄｕｃｅ処理を、ｖａｌｕｅがＸであり、そのＸに対応する値をｋｅｙとするデータをそのまま出力する処理にしてもよい。 Next, the MapReduce processing units 124A to 124C obtain the data for which the shuffle and sort processing have been completed, process the obtained data inside the Reduce function, and output the result as data in the key=value format. Here, the MapReduce processing units 124A to 124C aggregate each data having the same key as the Reduce function. In FIG. 7, the MapReduce processing unit 124A outputs (K1, 3) as data representing that there are three (K1, X). Also, the MapReduce processing unit 124A outputs (K2, 2) as data indicating that there are two (K2, X). The MapReduce processing unit 124B outputs the totalization result of the data whose key is K3 or K4. The MapReduce processing unit 124C outputs the totalization result of the data whose key is K5 or K6. This processing is called Reduce processing. The Reduce process can be edited by the user. Here, as the Reduce process, a process of aggregating data having the same key is performed, but it is also possible to change to other processes. For example, when returning a result corresponding to a SPARQL query, the Reduce process may be a process of outputting data whose value is X and whose key is the value corresponding to X as it is.
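The Map, shuffle-and-sort, and Reduce stages described for FIG. 7 can be reproduced in a short single-process Python sketch. The records whose value is X are those listed in the text for units 124A and 124B; the records with other values, the grouping of 124B's records into two blocks, and the omission of unit 124C's input (which the text does not list) are assumptions for illustration. All function and variable names are hypothetical.

```python
from collections import defaultdict

# Blocks handled by unit 124A (blocks 301-303). Non-X records are made up.
BLOCKS_A = [
    [("K1", "X"), ("K4", "X"), ("K2", "Y")],  # block 301
    [("K2", "X"), ("K3", "X"), ("K1", "Z")],  # block 302
    [("K2", "X"), ("K5", "X")],               # block 303
]
# Blocks handled by unit 124B; the split into two blocks is assumed.
BLOCKS_B = [
    [("K1", "X"), ("K4", "X"), ("K5", "X")],
    [("K1", "X"), ("K6", "X"), ("K3", "Y")],
]

# Shuffle destinations as described for FIG. 7: K1/K2 -> A, K3/K4 -> B, K5/K6 -> C.
PARTITION = {"K1": "A", "K2": "A", "K3": "B", "K4": "B", "K5": "C", "K6": "C"}


def map_phase(block, wanted="X"):
    """Map: keep the key=value records whose value equals `wanted`."""
    return [(key, value) for key, value in block if value == wanted]


def shuffle_and_sort(mapped):
    """Shuffle records to their destination unit, then sort them so equal keys are adjacent."""
    per_unit = defaultdict(list)
    for key, value in mapped:
        per_unit[PARTITION[key]].append((key, value))
    return {unit: sorted(pairs) for unit, pairs in per_unit.items()}


def reduce_phase(pairs):
    """Reduce: count the records for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return sorted(counts.items())


mapped = [pair for block in BLOCKS_A + BLOCKS_B for pair in map_phase(block)]
results = {unit: reduce_phase(pairs) for unit, pairs in shuffle_and_sort(mapped).items()}
print(results["A"])
```

With the records given in the text, unit A's output is (K1, 3) and (K2, 2), matching the description. The totals for units B and C in this sketch are incomplete because the input of unit 124C is not given.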

ＭａｐＲｅｄｕｃｅ処理部１２４は、Ｍａｐタスク実行部２４１及びＲｅｄｕｃｅタスク実行部２４２を有する。Ｍａｐタスク実行部２４１は、Ｍａｐ処理及びシャッフル及びソート処理を行う。 The MapReduce processing unit 124 has a Map task execution unit 241 and a Reduce task execution unit 242 . The Map task execution unit 241 performs Map processing and shuffle and sort processing.

Ｍａｐタスク実行部２４１は、タスクトラッカー１２３から実行の指示を受けたＭａｐタスクを取得する。そして、Ｍａｐタスク実行部２４１は、Ｍａｐ処理を実行する。この場合、Ｍａｐタスク実行部２４１は、パラメータ識別子を用いたＭａｐタスクを受信する。そこで、Ｍａｐタスク実行部２４１は、例えば、図８に示すようにパラメータ識別子を用いてＭａｐ処理を実行する。図８は、実施例１に係るパラメータ識別子を用いた場合のＭａｐＲｅｄｕｃｅ処理の概要を表す図である。図８に記載された識別子付ＲＤＦデータ表４１１及び４２１のそれぞれが異なるＭａｐＲｅｄｕｃｅ処理部１２４で処理される場合で説明する。 The Map task execution unit 241 acquires the Map task whose execution has been instructed by the task tracker 123, and executes the Map processing. In this case, the Map task that the Map task execution unit 241 receives uses a pattern identifier. The Map task execution unit 241 therefore executes the Map processing using the pattern identifier, for example as shown in FIG. 8. FIG. 8 is a diagram showing an overview of MapReduce processing using pattern identifiers according to the first embodiment. The description assumes that the identifier-attached RDF data tables 411 and 421 shown in FIG. 8 are each processed by a different MapReduce processing unit 124.

例えば、図８では、Ｍａｐタスク実行部２４１は、太枠で囲われた１００２というパラメータ識別子をｖａｌｕｅとするデータを抽出するＭａｐタスクを取得する。ここで、図８では、分かり易いように１００２に対応するｖａｌｕｅパターンを記載したが、実際のＭａｐタスクにはｖａｌｕｅパターンは含まれなくてもよい。 For example, in FIG. 8, the Map task execution unit 241 acquires a Map task that extracts the data whose value is the pattern identifier 1002, shown enclosed in a thick frame. Although FIG. 8 shows the value pattern corresponding to 1002 for ease of understanding, the actual Map task need not contain the value pattern.

Ｍａｐタスク実行部２４１は、識別子付ＲＤＦデータ表４１１又は４２１からＭａｐ処理を行う対象とするデータを取得してｋｅｙ＝ｖａｌｕｅ形式のデータに変換しそのデータを入力とする。ここでは、Ｍａｐタスク実行部２４１は、主語とｋｅｙとし述語及び目的語の組み合わせのｖａｌｕｅパターンをｖａｌｕｅとするデータを入力とする。 The Map task execution unit 241 acquires the data to be subjected to Map processing from the identifier-attached RDF data table 411 or 421, converts it into key=value format, and uses the converted data as input. Here, the input to the Map task execution unit 241 is data in which the subject is the key and the value pattern combining the predicate and object is the value.

そして、各Ｍａｐタスク実行部２４１は、入力のデータからｖａｌｕｅを表すパターン識別子が１００２であるデータ４１２又は４２２を抽出する。そして、各Ｍａｐタスク実行部２４１は、抽出したデータ４１２又は４２２に対してシャッフル及びソート処理を実行する。ここでは、各Ｍａｐタスク実行部２４１は、ｋｅｙがＢであるデータを一方に集め、それ以外のデータを他方に集める。このｋｅｙ毎に各スレーブサーバ１２にデータを集める処理が、「３要素のいずれか１つの要素を基準に集約」する処理の一例にあたる。 Then, each Map task execution unit 241 extracts data 412 or 422 whose pattern identifier representing value is 1002 from the input data. Each Map task execution unit 241 then shuffles and sorts the extracted data 412 or 422 . Here, each Map task execution unit 241 collects data whose key is B to one side, and collects other data to the other side. The process of collecting data in each slave server 12 for each key corresponds to an example of the process of "summarizing based on any one of the three elements".

さらに、各Ｍａｐタスク実行部２４１は、ｋｅｙを基準に収集したデータをソートしてデータ４１３又は４２３を生成する。そして、各Ｍａｐタスク実行部２４１は、データ４１３又は４２３をＲｅｄｕｃｅタスク実行部２４２へ出力する。このＭａｐタスク実行部２４１が、「検索部」の一例にあたる。 Furthermore, each Map task execution unit 241 sorts the collected data based on the key to generate data 413 or 423 . Each Map task execution unit 241 then outputs the data 413 or 423 to the Reduce task execution unit 242 . This Map task execution unit 241 corresponds to an example of a "search unit".

Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理を行う。例えば図８に示すように、各Ｒｅｄｕｃｅタスク実行部２４２は、データ４１３又は４２３の入力をそれぞれ対応するＭａｐタスク実行部２４１から受ける。次に、各Ｒｅｄｕｃｅタスク実行部２４２は、取得したデータ４１３又は４２３から同じｋｅｙを有するデータの数を集計する。そして、各Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理の結果４１４又は４２４をタスクトラッカー１２３へ出力する。ここで、図８の結果４１４及び４２４におけるｃはカウント値を表す。このＲｅｄｕｃｅタスク実行部２４２による同じｋｅｙを有するデータの数を集計が、「予め決められた処理の実行」の一例にあたる。 The Reduce task execution unit 242 performs Reduce processing. For example, as shown in FIG. 8, each Reduce task execution unit 242 receives input of data 413 or 423 from the corresponding Map task execution unit 241 respectively. Next, each Reduce task execution unit 242 tallies the number of data having the same key from the acquired data 413 or 423 . Each Reduce task execution unit 242 then outputs a result 414 or 424 of the Reduce process to the task tracker 123 . Here, c in results 414 and 424 in FIG. 8 represents the count value. Aggregation of the number of data having the same key by the Reduce task execution unit 242 corresponds to an example of “execution of predetermined processing”.
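The FIG. 8 flow (filter by pattern identifier, shuffle by key, sort, count) can be sketched as below. The identifier 1002 and the rule of sending key B to one side and the remaining keys to the other come from the text; the table rows, the other identifiers, and all function names are hypothetical, since the contents of tables 411 and 421 are not reproduced here.

```python
from collections import Counter

TARGET_ID = "1002"  # pattern identifier named in the text

# Hypothetical key=value records derived from tables 411 and 421:
# (subject as key, predicate-object pattern identifier as value).
TABLE_411 = [("A", "1002"), ("B", "1002"), ("C", "1001"), ("B", "1003")]
TABLE_421 = [("B", "1002"), ("D", "1002"), ("A", "1001")]


def map_rows(rows, target_id=TARGET_ID):
    """Map: keep the records whose value equals the target pattern identifier."""
    return [(subject, pid) for subject, pid in rows if pid == target_id]


def shuffle(records):
    """Shuffle and sort: records with key B go to one side, all others to the other."""
    b_side = sorted(r for r in records if r[0] == "B")
    other_side = sorted(r for r in records if r[0] != "B")
    return b_side, other_side


def reduce_count(records):
    """Reduce: count the records that share the same key."""
    return dict(Counter(key for key, _ in records))


b_side, other_side = shuffle(map_rows(TABLE_411) + map_rows(TABLE_421))
result_b = reduce_count(b_side)
result_other = reduce_count(other_side)
print(result_b, result_other)
```

The comparison in `map_rows` is a match on a short fixed-length identifier rather than on the concatenated predicate and object strings, which is the source of the speed-up the embodiment aims for.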

次に、図９を参照して、実施例１に係る識別子表及び識別子付ＲＤＦデータ表の生成処理の流れについて説明する。図９は、実施例１に係る識別子表及び識別子付ＲＤＦデータ表の生成処理のフローチャートである。以下では、ＨＤＦＳ１１１との間のデータの送受信におけるＲＤＦコントローラ１１７の仲介動作を省略する。 Next, with reference to FIG. 9, a flow of processing for generating an identifier table and an identifier-added RDF data table according to the first embodiment will be described. FIG. 9 is a flowchart of processing for generating an identifier table and an identifier-attached RDF data table according to the first embodiment. In the following, the intermediate operation of the RDF controller 117 in data transmission/reception with the HDFS 111 will be omitted.

第１生成部１１５は、ＲＤＦストア１１０に格納されたＲＤＦデータから全ての主語、述語及び目的語の重複を除いて取得する。そして、第１生成部１１５は、取得した主語、述語及び目的語を２つずつ組み合わせて、ｖａｌｕｅパターンを抽出する（ステップＳ１）。 The first generation unit 115 acquires all subjects, predicates, and objects, with duplicates removed, from the RDF data stored in the RDF store 110. It then combines the acquired subjects, predicates, and objects two at a time to extract value patterns (step S1).

次に、第１生成部１１５は、抽出したｖａｌｕｅパターンの中にＲＤＦストア１１０に格納された実際のＲＤＦデータの中に存在しないｖａｌｕｅパターンがあるか否かを判定する（ステップＳ２）。実際には存在しないｖａｌｕｅパターンが無い場合（ステップＳ２：否定）、第１生成部１１５は、ステップＳ４へ進む。 Next, the first generator 115 determines whether or not the extracted value patterns include a value pattern that does not exist in the actual RDF data stored in the RDF store 110 (step S2). If there is no value pattern that does not actually exist (step S2: No), the first generator 115 proceeds to step S4.

実際には存在しないｖａｌｕｅパターンがある場合（ステップＳ２：肯定）、第１生成部１１５は、抽出したｖａｌｕｅパターンの中から実際には存在しないｖａｌｕｅパターンを除いて、実際に存在するｖａｌｕｅパターンを抽出する（ステップＳ３）。 If there is a value pattern that does not actually exist (step S2: Yes), the first generation unit 115 extracts the value pattern that actually exists by removing the value pattern that does not actually exist from the extracted value patterns. (step S3).

次に、第１生成部１１５は、実際に存在するｖａｌｕｅパターンに識別子を割り当て、各ｖａｌｕｅパターンに対応するパターン識別子を表す識別子対応表を生成する（ステップＳ４）。その後、第１生成部１１５は、生成した識別子対応表のＲＤＦストア１１０への格納をＲＤＦコントローラ１１７に行わせ、識別子対表の生成完了を第２生成部１１６に通知する。 Next, the first generating unit 115 assigns identifiers to actually existing value patterns, and generates an identifier correspondence table representing pattern identifiers corresponding to each value pattern (step S4). After that, the first generation unit 115 causes the RDF controller 117 to store the generated identifier correspondence table in the RDF store 110, and notifies the second generation unit 116 of the completion of the generation of the identifier correspondence table.
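Steps S1 to S4 can be sketched in Python as follows. The function name `build_identifier_table`, the `(kind, pattern)` keys, and the sequential four-digit numbering are assumptions; the text does not specify how identifier values are chosen. The sample triples are hypothetical apart from ("A", "likes", "D").

```python
from itertools import product


def build_identifier_table(triples):
    """Build a value pattern -> pattern identifier table from RDF triples."""
    subjects = sorted({s for s, _, _ in triples})
    predicates = sorted({p for _, p, _ in triples})
    objects = sorted({o for _, _, o in triples})

    # Step S1: combine the distinct subjects, predicates, and objects two at a time.
    candidates = (
        [("sp", s + p) for s, p in product(subjects, predicates)]
        + [("po", p + o) for p, o in product(predicates, objects)]
        + [("so", s + o) for s, o in product(subjects, objects)]
    )

    # Steps S2-S3: keep only the value patterns that occur in an actual triple.
    existing = set()
    for s, p, o in triples:
        existing.update({("sp", s + p), ("po", p + o), ("so", s + o)})
    kept = [c for c in candidates if c in existing]

    # Step S4: assign an identifier to each remaining value pattern
    # (sequential numbering is an assumed scheme).
    return {pattern: f"{i:04d}" for i, pattern in enumerate(kept, start=1)}


triples = [("A", "likes", "D"), ("B", "likes", "E")]
id_table = build_identifier_table(triples)
for pattern, pattern_id in id_table.items():
    print(pattern, pattern_id)
```

In this example the candidates "AE" and "BD" are generated in step S1 but dropped in steps S2 and S3 because no stored triple contains those subject-object pairs.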

識別子対表の生成完了の通知を受けた第２生成部１１６は、ＲＤＦストア１１０に含まれる全てのＲＤＦデータ及び識別子対応表をＲＤＦストア１１０から取得する。次に、第２生成部１１６は、各ＲＤＦデータの主語と述語とを組み合わせたｖａｌｕｅパターン、述語と目的語とを組わせたｖａｌｕｅパターン及び主語と目的語とを組わせたｖａｌｕｅパターンを取得する。そして、第２生成部１１６は、取得したｖａｌｕｅパターンに対応するパターン識別子を識別子対応表から取得する。次に、第２生成部１１６は、トリプルの対応を表す対応表における各ＲＤＦデータに取得したパターン識別子を付加して識別子付ＲＤＦデータ表を生成する（ステップＳ５）。その後、第２生成部１１６は、生成した識別子付ＲＤＦデータ表をＲＤＦストア１１０に格納する。ここで、本実施例では、第１生成部１１５からの通知を受けた第２生成部１１６が、自動的に識別子付ＲＤＦデータの生成を行うように説明したが、これは他の手順でもよい。例えば、第２生成部１１６は、ジョブクライアント３０を用いた利用者からの指示を受けて、その指示の入力をトリガとして識別子付ＲＤＦデータの生成を行ってもよい。 Upon receiving notification that generation of the identifier correspondence table is complete, the second generation unit 116 acquires all the RDF data contained in the RDF store 110 and the identifier correspondence table from the RDF store 110. Next, for each piece of RDF data, the second generation unit 116 acquires the value pattern combining the subject and predicate, the value pattern combining the predicate and object, and the value pattern combining the subject and object. The second generation unit 116 then acquires the pattern identifiers corresponding to the acquired value patterns from the identifier correspondence table. Next, the second generation unit 116 generates the identifier-attached RDF data table by adding the acquired pattern identifiers to each piece of RDF data in the correspondence table representing the triples (step S5). After that, the second generation unit 116 stores the generated identifier-attached RDF data table in the RDF store 110. In the present embodiment, the second generation unit 116 generates the identifier-attached RDF data automatically upon notification from the first generation unit 115, but other procedures may be used. For example, the second generation unit 116 may generate the identifier-attached RDF data when triggered by an instruction entered by a user through the job client 30.

ネームノード１１２は、識別子付ＲＤＦデータ表をＲＤＦストア１１０から取得する。次に、ネームノード１１２は、識別子付ＲＤＦデータ表に登録されたデータを含む各ブロックを配置するデータノード１２１を決定する。そして、ネームノード１１２は、識別子付ＲＤＦデータ表の一部の行データを含む各ブロックを、配置先として決定したそれぞれのデータノード１２１へ送信し、データの分散配置を実行する（ステップＳ６）。 The NameNode 112 acquires the RDF data table with identifier from the RDF store 110 . Next, the namenode 112 determines the datanode 121 in which each block containing the data registered in the identifier-attached RDF data table is placed. Then, the namenode 112 transmits each block including a part of row data of the identifier-attached RDF data table to each of the datanodes 121 determined as allocation destinations, and executes data distribution allocation (step S6).

次に、図１０を参照して、実施例１に係るＭａｐＲｅｄｕｃｅ処理の流れについて説明する。図１０は、実施例１に係るＭａｐＲｅｄｕｃｅ処理のフローチャートである。 Next, the flow of MapReduce processing according to the first embodiment will be described with reference to FIG. 10. FIG. 10 is a flowchart of MapReduce processing according to the first embodiment.

ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリの実行命令の入力をジョブクライアント３０から受ける。そして、ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリを実行する（ステップＳ１１）。 The SPARQL processing unit 118 receives an input of a SPARQL query execution command from the job client 30 . The SPARQL processing unit 118 then executes the SPARQL query (step S11).

次に、ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリをＭａｐＲｅｄｕｃｅ処理のジョブへ変換する（ステップＳ１２）。 Next, the SPARQL processing unit 118 converts the SPARQL query into a job for MapReduce processing (step S12).

次に、ＳＰＡＲＱＬ処理部１１８は、ＨＤＦＳ１１１から識別子対応表を取得する。そして、ＳＰＡＲＱＬ処理部１１８は、投入されたクエリを構文解析（パース）して識別子対応表に登録されたｖａｌｕｅパターンに該当するｖａｌｕｅパターンがあるか否かを判定する（ステップＳ１３）。該当するｖａｌｕｅパターンが無い場合（ステップＳ１３：否定）、ＳＰＡＲＱＬ処理部１１８は、そのようなｖａｌｕｅパターンのマッチング結果は０件であるという検索結果をジョブクライアント３０に返してＳＰＡＲＱＬクエリの実行処理を終了する。実際には、ＳＰＡＲＱＬ処理部１１８は、パースした段階で識別子対応表に登録されたｖａｌｕｅパターンに該当するｖａｌｕｅパターンがあるか否かが分かる。 Next, the SPARQL processing unit 118 acquires the identifier correspondence table from the HDFS 111. The SPARQL processing unit 118 then parses the submitted query and determines whether any value pattern in the query corresponds to a value pattern registered in the identifier correspondence table (step S13). If there is no corresponding value pattern (step S13: No), the SPARQL processing unit 118 returns to the job client 30 a search result indicating zero matches for that value pattern, and ends execution of the SPARQL query. In practice, the SPARQL processing unit 118 can tell at the parsing stage whether there is a value pattern corresponding to one registered in the identifier correspondence table.

これに対して、該当するｖａｌｕｅパターンがある場合（ステップＳ１３：肯定）、ＳＰＡＲＱＬ処理部１１８は、パターン識別子を参照してＭａｐＲｅｄｕｃｅ処理を実行する。 On the other hand, if there is a corresponding value pattern (step S13: Yes), the SPARQL processing unit 118 refers to the pattern identifier and executes the MapReduce process.

ＭａｐＲｅｄｕｃｅ処理部１１９は、ＳＰＡＲＱＬ処理部１１８からの指示を受けて、パターン識別子を参照してＭａｐＲｅｄｕｃｅ処理の実行をジョブトラッカー１１４に指示する。ジョブトラッカー１１４は、メタデータＤＢ１１３を確認し、ＭａｐＲｅｄｕｃｅ処理を行わせるスレーブサーバ１２を選択する。そして、ジョブトラッカー１１４は、ＭａｐＲｅｄｕｃｅ処理をブロック単位のＭａｐタスクに分割し、選択したスレーブサーバ１２へ送信する。タスクトラッカー１２３は、Ｍａｐタスクをジョブトラッカー１１４から受信する。そして、タスクトラッカー１２３は、取得したＭａｐタスクの実行をＭａｐＲｅｄｕｃｅ処理部１２４に指示する。ＭａｐＲｅｄｕｃｅ処理部１２４は、Ｍａｐタスクの実行の指示をタスクトラッカー１２３から受ける。そして、Ｍａｐタスク実行部２４１は、ＨＤＦＳ１２２に格納された識別符号付ＲＤＦデータを用いて、Ｍａｐタスクで指定されたＭａｐ処理を実行する（ステップＳ１４）。 The MapReduce processing unit 119 receives an instruction from the SPARQL processing unit 118, refers to the pattern identifier, and instructs the job tracker 114 to execute the MapReduce process. The job tracker 114 confirms the metadata DB 113 and selects a slave server 12 to perform the MapReduce process. The job tracker 114 then divides the MapReduce process into block-based Map tasks and transmits them to the selected slave server 12 . The task tracker 123 receives Map tasks from the job tracker 114 . The task tracker 123 then instructs the MapReduce processing unit 124 to execute the acquired Map task. The MapReduce processing unit 124 receives an instruction to execute a Map task from the task tracker 123 . The Map task execution unit 241 then uses the RDF data with identification code stored in the HDFS 122 to execute the Map process designated by the Map task (step S14).

次に、Ｍａｐタスク実行部２４１は、Ｍａｐ処理の処理結果をｋｅｙ毎にまとまるようシャッフルして各スレーブサーバ１２のＭａｐタスク実行部２４１に振り分ける。さらに、Ｍａｐタスク実行部２４１は、シャッフルにより自装置に振り分けられたデータをｋｅｙ毎にまとまるようにソートする（ステップＳ１５）。そして、Ｍａｐタスク実行部２４１は、ソートしたデータをＲｅｄｕｃｅタスク実行部２４２へ出力する。 Next, the Map task execution unit 241 shuffles the processing results of the Map processing so that they are collected for each key and distributes them to the Map task execution unit 241 of each slave server 12 . Further, the Map task execution unit 241 sorts the data distributed to its own device by shuffling so that the data are grouped for each key (step S15). The Map task execution unit 241 then outputs the sorted data to the Reduce task execution unit 242 .

Ｒｅｄｕｃｅタスク実行部２４２は、Ｍａｐタスク実行部２４１から取得したデータに対して予め指定されたＲｅｄｕｃｅ処理を実行する（ステップＳ１６）。例えば、Ｒｅｄｕｃｅタスク実行部２４２は、データをｋｅｙ毎に集計する。 The Reduce task execution unit 242 executes a predesignated Reduce process on the data acquired from the Map task execution unit 241 (step S16). For example, the Reduce task execution unit 242 aggregates data for each key.

その後、Ｒｅｄｕｃｅタスク実行部２４２は、ＭａｐＲｅｄｕｃｅ処理の結果をタスクトラッカー１２３へ出力する。タスクトラッカー１２３は、入力されたＭａｐＲｅｄｕｃｅ処理の結果をマスタサーバ１１のジョブトラッカー１１４へ送信する。ジョブトラッカー１１４は、各スレーブサーバ１２から送信されたＭａｐＲｅｄｕｃｅ処理の結果を収集する。そして、ジョブトラッカー１１４は、ＭａｐＲｅｄｕｃｅ処理の結果を結合する。そして、ジョブトラッカー１１４は、結合したＭａｐＲｅｄｕｃｅ処理の結果をＭａｐＲｅｄｕｃｅ処理部１１９を介してＳＰＡＲＱＬ処理部１１８へ送信する。ＳＰＡＲＱＬ処理部１１８は、結合されたＭａｐＲｅｄｕｃｅ処理の結果を受信し、受信したデータをＲＤＦ形式に変換する（ステップＳ１７）。その後、ＳＰＡＲＱＬ処理部１１８は、ＲＤＦ形式に変換したＭａｐＲｅｄｕｃｅ処理の結果をＳＰＡＲＱＬクエリの実行結果としてジョブクライアント３０へ送信する。 After that, the Reduce task execution unit 242 outputs the result of the MapReduce process to the task tracker 123 . The task tracker 123 transmits the input result of the MapReduce process to the job tracker 114 of the master server 11 . The job tracker 114 collects the MapReduce processing results sent from each slave server 12 . The job tracker 114 then combines the results of the MapReduce processing. The job tracker 114 then transmits the combined result of the MapReduce processing to the SPARQL processing unit 118 via the MapReduce processing unit 119 . The SPARQL processing unit 118 receives the results of the combined MapReduce processing, and converts the received data into RDF format (step S17). After that, the SPARQL processing unit 118 transmits the result of the MapReduce process converted into the RDF format to the job client 30 as the execution result of the SPARQL query.

以上に説明したように、本実施例に係るＨａｄｏｏｐクラスタは、グラフデータに含まれる３要素のうちの２要素の組み合わせであるｖａｌｕｅパターンに識別子を割り当てし、その識別子を用いてＭａｐＲｅｄｕｃｅ処理を実行する。これにより、ＭａｐＲｅｄｕｃｅ処理においてグラフデータの検索を行う場合に、データ領域が小さい識別子を用いて検索を行うことができ、検索時のマッチングを高速に行うことができる。 As described above, the Hadoop cluster according to the present embodiment assigns identifiers to value patterns that are combinations of two elements out of three elements included in graph data, and uses the identifiers to execute MapReduce processing. . As a result, when searching for graph data in the MapReduce process, the search can be performed using an identifier with a small data area, and matching at the time of search can be performed at high speed.

さらに、本実施例に係るＨａｄｏｏｐクラスタは、実際のＲＤＦデータの中には存在しないｖａｌｕｅパターンを除いて識別子対応表を作成する。これにより、存在しないＲＤＦデータを用いた処理を省くことができ、検索速度がさらに向上する。例えば、ＲＤＦデータに存在しないｖａｌｕｅパターンを用いた検索操作の指示を受けた場合、本実施例に係るＨａｄｏｏｐクラスタは、ＭａｐＲｅｄｕｃｅ処理を行わずに結果を返すことができる。 Furthermore, the Hadoop cluster according to this embodiment creates an identifier correspondence table by excluding value patterns that do not exist in actual RDF data. As a result, processing using non-existent RDF data can be omitted, further improving search speed. For example, when receiving an instruction for a search operation using a value pattern that does not exist in RDF data, the Hadoop cluster according to this embodiment can return a result without performing MapReduce processing.

次に実施例２について説明する。本実施例に係るＨａｄｏｏｐクラスタは、検索の対象とするデータとしてｖａｌｕｅパターンと対応するｋｅｙとが登録された分割データ表を用いることが実施例１と異なる。本実施例に係るＨａｄｏｏｐクラスタ１０も図１及び２で表される。以下の説明では、実施例１と同様の各部の機能については説明を省略する。 Next, the second embodiment will be described. The Hadoop cluster according to the present embodiment differs from that of the first embodiment in that it uses, as the data to be searched, divided data tables in which value patterns and the corresponding keys are registered. The Hadoop cluster 10 according to the present embodiment is also represented by FIGS. 1 and 2. In the following description, the functions of units that are the same as in the first embodiment are not described again.

第２生成部１１６は、ＲＤＦコントローラ１１７を介してＲ全てのＲＤＦデータ及び識別対応表をＲＤＦストア１１０から取得する。次に、第２生成部１１６は、各ＲＤＦデータの主語、述語及び目的語のうち２つの組み合わせた値を取得し、識別子対応表からその組み合わせの値と一致するｖａｌｕｅパターンに対応する識別子を取得する。そして、第２生成部１１６は、主語、述語及び目的語のうちの２つを組み合わせ毎に、ｖａｌｕｅパターンに対応するパターン識別子と、主語、述語及び目的語のうちｖａｌｕｅパターンに含まれる２要素以外の残りの１要素とを対応させて分割データ表を生成する。 The second generation unit 116 acquires all the RDF data and the identifier correspondence table from the RDF store 110 via the RDF controller 117. Next, the second generation unit 116 takes the value formed by combining two of the subject, predicate, and object of each piece of RDF data, and acquires from the identifier correspondence table the identifier corresponding to the value pattern that matches that combined value. Then, for each combination of two of the subject, predicate, and object, the second generation unit 116 generates a divided data table that associates the pattern identifier corresponding to the value pattern with the remaining one element, that is, the element of the subject, predicate, and object that is not included in the value pattern.

FIG. 11 is a diagram showing an example of the divided data tables. As shown in FIG. 11, the second generation unit 116 according to this embodiment generates divided data tables 501 to 503, one for each group of data sharing the same type of key, each of which associates pattern identifiers with keys.

Specifically, the second generation unit 116 generates the divided data table 501, which represents the correspondence between subjects and the pattern identifiers corresponding to value patterns that combine a predicate and an object. The second generation unit 116 also generates the divided data table 502, which represents the correspondence between objects and the pattern identifiers corresponding to value patterns that combine a subject and a predicate. In addition, the second generation unit 116 generates the divided data table 503, which represents the correspondence between predicates and the pattern identifiers corresponding to value patterns that combine a subject and an object. The second generation unit 116 then stores the generated divided data tables 501 to 503 in the RDF store 110 via the RDF controller 117.
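The construction of the three divided data tables can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the function name `build_tables`, the kind labels `"PO"`, `"SP"`, and `"SO"`, and the sequential identifier numbering starting at `first_id` are all hypothetical. For brevity, the sketch also assigns identifiers while scanning the triples rather than reading a previously stored identifier correspondence table.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]        # (subject, predicate, object)
PatternKey = Tuple[str, str, str]    # (kind, first element, second element)
Row = Tuple[str, int]                # (key, pattern identifier)

# (kind label, positions of the two pattern elements, position of the key).
# "PO" corresponds to table 501, "SP" to table 502, "SO" to table 503.
LAYOUTS = [("PO", (1, 2), 0), ("SP", (0, 1), 2), ("SO", (0, 2), 1)]


def build_tables(triples: List[Triple], first_id: int = 1001
                 ) -> Tuple[Dict[PatternKey, int], Dict[str, List[Row]]]:
    """Build an identifier table and one divided data table per pattern kind."""
    id_table: Dict[PatternKey, int] = {}
    divided: Dict[str, List[Row]] = {kind: [] for kind, _, _ in LAYOUTS}
    next_id = first_id
    for kind, (i, j), k in LAYOUTS:
        for triple in triples:
            pattern = (kind, triple[i], triple[j])
            if pattern not in id_table:
                # Only patterns that occur in the data receive an identifier.
                id_table[pattern] = next_id
                next_id += 1
            # Each row is already in key=value form: the remaining element
            # is the key and the pattern identifier is the value.
            divided[kind].append((triple[k], id_table[pattern]))
    return id_table, divided
```

Each divided table holds two columns instead of the full triple plus identifiers, which is why a single table is smaller than an identifier-attached table covering all three pattern kinds.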

ネームノード１１２は、分割データ表の一部の行データを含むブロックを各データノード１２１へ送信する。データノード１２１は、分割データ表の一部の行データを含むブロックをＨＤＦＳ１２２に格納する。 The namenode 112 transmits a block containing partial row data of the partitioned data table to each datanode 121 . The data node 121 stores in the HDFS 122 blocks containing partial row data of the partitioned data table.

The Map task execution unit 241 receives a Map task execution instruction from the task tracker 123 and then selects the table to be used in the Map task. For example, when the Map task is Map processing that uses a value pattern combining a predicate and an object, the Map task execution unit 241 selects the divided data table in which value patterns combining a predicate and an object are registered. In the example of FIG. 11, for Map processing that uses a value pattern combining a predicate and an object, the Map task execution unit 241 selects the divided data table 501.

そして、Ｍａｐタスク実行部２４１は、ＨＤＦＳ１２２に格納された各ブロックに対して、タスクトラッカー１２３から実行の指示を受けたＭａｐタスクを実行する。その後、Ｍａｐタスク実行部２４１は、Ｍａｐ処理、並びに、シャッフル及びソート処理を実行した結果をＲｅｄｕｃｅタスク実行部２４２へ出力する。 Then, the Map task executing unit 241 executes the Map task instructed to execute by the task tracker 123 for each block stored in the HDFS 122 . After that, the Map task execution unit 241 outputs the result of executing the Map processing and the shuffle and sort processing to the Reduce task execution unit 242 .

Next, the flow of Map processing by the Map task execution unit 241 according to the second embodiment will be described with reference to FIG. 12. FIG. 12 is a diagram showing an overview of MapReduce processing using pattern identifiers according to the second embodiment. The description here assumes that the divided data tables 511 and 521 shown in FIG. 12 are each processed by a different MapReduce processing unit 124.

For example, in FIG. 12, the Map task execution unit 241 acquires a Map task that extracts data whose value is 1002, the pattern identifier enclosed in a thick frame. Next, the Map task execution unit 241 acquires the data to be subjected to Map processing from the divided data table 511 or 521. Because the data in the divided data tables 511 and 521 is already in key=value format, each Map task execution unit 241 can use the data of the divided data table 511 or 521 as input without conversion.

Then, each Map task execution unit 241 extracts, from the input data, the data 512 or 522 whose value, that is, whose pattern identifier, is 1002. Next, each Map task execution unit 241 performs shuffle and sort processing on the extracted data 512 or 522 to obtain the data 513 or 523.

Ｒｅｄｕｃｅタスク実行部２４２は、Ｍａｐ処理、並びに、シャッフル及びソート処理の結果をＭａｐタスク実行部２４１から取得する。そして、Ｒｅｄｕｃｅタスク実行部２４２は、取得したデータに対してＲｅｄｕｃｅ処理を行う。 The Reduce task execution unit 242 acquires the results of Map processing and shuffle and sort processing from the Map task execution unit 241 . Then, the Reduce task execution unit 242 performs Reduce processing on the acquired data.

ここで、図１２を参照して、実施例２に係るＲｅｄｕｃｅタスク実行部２４２によるＲｅｄｕｃｅ処理の流れについて説明する。各Ｒｅｄｕｃｅタスク実行部２４２は、データ５１３又は５２３の入力をそれぞれ対応するＭａｐタスク実行部２４１から受ける。次に、各Ｒｅｄｕｃｅタスク実行部２４２は、取得したデータ５１３又は５２３から同じｋｅｙを有するデータの数を集計する。そして、各Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理の結果５１４又は５２４をタスクトラッカー１２３へ出力する。 Here, a flow of Reduce processing by the Reduce task execution unit 242 according to the second embodiment will be described with reference to FIG. 12 . Each Reduce task execution unit 242 receives input of data 513 or 523 from the corresponding Map task execution unit 241 . Next, each Reduce task execution unit 242 tallies the number of data having the same key from the acquired data 513 or 523 . Then, each Reduce task execution unit 242 outputs the result 514 or 524 of the Reduce process to the task tracker 123 .
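The Map, shuffle and sort, and Reduce steps described for FIG. 12 can be sketched in plain Python as follows. The function names and the sample rows are hypothetical, and the sketch runs in a single process rather than on a Hadoop cluster; it only illustrates the data flow on one divided data table.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, pattern identifier used as the value)


def map_phase(rows: List[Row], target_id: int) -> List[Row]:
    """Keep only the rows whose value equals the target pattern identifier."""
    return [(key, pid) for key, pid in rows if pid == target_id]


def shuffle_and_sort(rows: List[Row]) -> List[Row]:
    """Bring rows with the same key together by sorting on the key."""
    return sorted(rows, key=lambda row: row[0])


def reduce_phase(rows: List[Row]) -> Dict[str, int]:
    """Count how many rows share each key."""
    return dict(Counter(key for key, _ in rows))


def run(rows: List[Row], target_id: int) -> Dict[str, int]:
    """Chain the three phases for one divided data table."""
    return reduce_phase(shuffle_and_sort(map_phase(rows, target_id)))
```

Because the rows of a divided data table are already key=value pairs, `map_phase` needs no conversion step before filtering, which mirrors the point made in the description of FIG. 12.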

As described above, the Hadoop cluster according to this embodiment executes MapReduce processing using divided data tables, each of which represents the correspondence between identifiers corresponding to value patterns and the keys corresponding to those identifiers. The Hadoop cluster according to this embodiment selects a divided data table according to the purpose of the Map processing. Because each divided data table is smaller than the identifier-attached RDF data table used in the first embodiment, memory consumption can be reduced and the table can be scanned more quickly than in the first embodiment.

In each of the above embodiments, a Hadoop cluster was used for the description, but the system configuration is not limited to this; any other system configuration may be used as long as the system uses data having three elements in processing that operates on two of those elements. Likewise, although the above embodiments were described using RDF data, the same processing can be performed, and the same effects obtained, with other data as long as it is graph data.

(Hardware configuration)
The master server 11 and the slave server 12 according to each of the embodiments described above can be realized by a computer having a hardware configuration such as that shown in FIG. 13. FIG. 13 is a diagram showing an example of the hardware configuration of a computer. The computer 90 has a CPU (Central Processing Unit) 91, a RAM (Random Access Memory) 92, a ROM (Read Only Memory) 93, and an HDD (Hard Disk Drive) 94. The computer 90 further has a communication interface (I/F) 95, an input/output interface (I/F) 96, and a media interface (I/F) 97.

ＣＰＵ９１は、ＲＯＭ９３またはＨＤＤ９４に格納されたプログラムに基づいて動作し、各部の制御を行う。ＲＯＭ９３は、コンピュータ９０の起動時にＣＰＵ９１によって実行されるブートプログラムや、コンピュータ９０のハードウェアに依存するプログラム等を格納する。 CPU91 operates based on the program stored in ROM93 or HDD94, and controls each part. The ROM 93 stores a boot program executed by the CPU 91 when the computer 90 is started, a program depending on the hardware of the computer 90, and the like.

ＨＤＤ９４は、ＣＰＵ９１によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を格納する。通信インターフェイス９５は、ネットワークを介して他の機器からデータを受信してＣＰＵ９１へ送り、ＣＰＵ９１が生成したデータをネットワークを介して他の機器へ送信する。 The HDD 94 stores programs executed by the CPU 91 and data used by these programs. The communication interface 95 receives data from other devices via the network, sends the data to the CPU 91, and transmits data generated by the CPU 91 to the other devices via the network.

ＣＰＵ９１は、入出力インターフェイス９６を介して、ディスプレイやプリンタ等の出力装置、及び、キーボードやマウス等の入力装置を制御する。ＣＰＵ９１は、入出力インターフェイス９６を介して、入力装置からデータを取得する。また、ＣＰＵ９１は、生成したデータを入出力インターフェイス９６を介して出力装置へ出力する。 The CPU 91 controls output devices such as a display and a printer and input devices such as a keyboard and a mouse through an input/output interface 96 . The CPU 91 acquires data from the input device via the input/output interface 96 . The CPU 91 also outputs the generated data to the output device via the input/output interface 96 .

The media interface 97 reads a program or data stored in the recording medium 98 and provides it to the CPU 91 via the RAM 92. The CPU 91 loads the program from the recording medium 98 onto the RAM 92 via the media interface 97 and executes the loaded program. The recording medium 98 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

For example, the RAM 92 and the HDD 94 of the computer 90 implement the functions of the HDFS 111 and 122 and the metadata DB 113. The CPU 91 of the computer 90 implements the functions of the name node 112, the job tracker 114, the first generation unit 115, and the second generation unit 116 by executing programs loaded onto the RAM 92. The CPU 91 of the computer 90 also implements the functions of the RDF controller 117, the SPARQL processing unit 118, and the MapReduce processing unit 119. Furthermore, the CPU 91 of the computer 90 implements the functions of the data node 121, the task tracker 123, and the MapReduce processing unit 124.

The CPU 91 of the computer 90 reads these programs from the HDD 94 and executes them. As another example, the programs may be read from the recording medium 98, or they may be obtained from another device via a network.

In the above description, the SPARQL processing unit 118 acquires from the identifier correspondence table the pattern identifier corresponding to the value pattern to be searched that is specified in the SPARQL query, but this processing can also be executed on the slave server 12 side. For example, the MapReduce processing unit 124 of the slave server 12 may acquire the pattern identifier corresponding to the value pattern to be searched from the identifier correspondence table and execute the MapReduce processing using the acquired pattern identifier.

次に、実施例３について説明する。ＨａｄｏｏｐによるＭａｐＲｅｄｕｃｅ処理では、入力データと最終の出力データは共にＨＤＦＳに格納される。さらに、ＨａｄｏｏｐによるＭａｐＲｅｄｕｃｅ処理では、Ｍａｐ処理において、生成される中間ファイルも、一時的にＨＤＦＳに格納される。そのため、Ｍａｐ処理において、ＨＤＦＳに対する中間ファイルの入出力が行われる。ＨＤＦＳは、ＨＤＤやＳＳＤ（Solid State Drive）に配置されるファイルシステムであり、演算処理に比べて読み書きにかかる時間が大きい。そのため、ＨａｄｏｏｐによるＭａｐＲｅｄｕｃｅ処理を行う場合、遅延が発生するおそれがある。 Next, Example 3 will be described. In MapReduce processing by Hadoop, both input data and final output data are stored in HDFS. Furthermore, in MapReduce processing by Hadoop, an intermediate file generated in Map processing is also temporarily stored in HDFS. Therefore, intermediate files are input/output to/from HDFS in Map processing. HDFS is a file system arranged in HDDs and SSDs (Solid State Drives), and the time required for reading and writing is longer than for arithmetic processing. Therefore, when performing MapReduce processing by Hadoop, there is a possibility that a delay may occur.

そこで、ＭａｐＲｅｄｕｃｅ処理を行う際に、メモリ上のデータを用いて処理を行うインメモリ処理を用いることで、ＨＤＦＳへのアクセスを減らして、処理速度を向上させる方法が考えられる。例えば、分散型のインメモリ処理として、Ｓｐａｒｋ（登録商標）を用いた処理が存在する。Ｓｐａｒｋを用いることで、インメモリでＭａｐＲｅｄｕｃｅを行うことができる。 Therefore, a method of reducing access to HDFS and improving processing speed by using in-memory processing that performs processing using data in memory when performing MapReduce processing is conceivable. For example, as distributed in-memory processing, there is processing using Spark (registered trademark). MapReduce can be performed in-memory by using Spark.

Ｓｐａｒｋでは、ストレージとして、ＨａｄｏｏｐのＨＤＦＳが利用される。そのため、Ｓｐａｒｋを用いた場合にも、入力データ及び最終の出力データは、ＨＤＦＳに格納される。一方、Ｍａｐ処理における中間データはＲＤＤ（Resilient Distributed Dataset）形式でメモリ上に保持され、ＨＤＦＳに格納されることなく連続的に処理される。そのため、深層学習などにおいて処理結果を用いてＭａｐ処理を繰り返す場合などでは、ＨａｄｏｏｐによるＭａｐＲｅｄｕｃｅ処理よりも処理速度をより向上させることが可能である。 Spark uses Hadoop HDFS as storage. Therefore, even when Spark is used, input data and final output data are stored in HDFS. On the other hand, intermediate data in map processing is held in memory in RDD (Resilient Distributed Dataset) format, and is processed continuously without being stored in HDFS. Therefore, in the case of repeating Map processing using processing results in deep learning or the like, it is possible to further improve the processing speed compared to MapReduce processing by Hadoop.

However, when data processing is completed in main memory using distributed in-memory processing such as Spark, a configuration that loads the identifier correspondence table into memory runs into difficulty: if the identifier correspondence table is large, it is hard to load into memory. In that case, the memory capacity that can be allocated to Map processing becomes insufficient, and the processing speed may decrease.

そこで、本実施例に係る情報処理システムでは、識別子対応表を分割することでメモリ上に展開する識別子対応表を小さくする。以下では、Ｓｐａｒｋを用いたＭａｐＲｅｄｕｃｅ処理における分割した識別子対応表の使用について主に説明する。図１４は、実施例３に係るマスタサーバ及びスレーブサーバの詳細を表すブロック図である。以下の説明では、実施例１と同様の各部の動作は説明を省略する。 Therefore, in the information processing system according to the present embodiment, the identifier correspondence table to be expanded on the memory is reduced by dividing the identifier correspondence table. The following mainly describes the use of the divided identifier correspondence table in the MapReduce process using Spark. FIG. 14 is a block diagram showing details of a master server and slave servers according to the third embodiment. In the following description, the description of the operation of each unit similar to that of the first embodiment will be omitted.

図１４に示すように、マスタサーバ１１は、実施例１の各部に加えてＳｐａｒｋ処理部１３１を有する。また、スレーブサーバ１２は、実施例１の各部に加えてＳＳＤ１２５及びメモリ１２６を有する。さらに、本実施例に係るスレーブサーバ１２のＭａｐＲｅｄｕｃｅ処理部１２４は、Ｍａｐタスク実行部２４１及びＲｅｄｕｃｅタスク実行部２４２に加えて、メモリ管理部２４３を有する。 As shown in FIG. 14, the master server 11 has a Spark processing unit 131 in addition to each unit of the first embodiment. Also, the slave server 12 has an SSD 125 and a memory 126 in addition to the components of the first embodiment. Furthermore, the MapReduce processing unit 124 of the slave server 12 according to this embodiment has a memory management unit 243 in addition to the Map task execution unit 241 and the Reduce task execution unit 242 .

Next, the first generation unit 115 tallies the acquired subjects, predicates, and objects with duplicates removed. The first generation unit 115 then uses the tallied subjects, predicates, and objects to generate value patterns for all possible combinations. Next, the first generation unit 115 extracts the value patterns that are included in the subject, predicate, and object combinations of the actual RDF data, and assigns an identifier to each of them. For value patterns that are not included in the subject, predicate, and object sets of the actual RDF data, the first generation unit 115 adds information indicating that they do not exist, and creates an identifier correspondence table representing the correspondence between each value pattern and its assigned pattern identifier.

At this stage, the first generation unit 115 has generated the identifier correspondence table 213 shown in FIG. 15. FIG. 15 is a diagram showing an example of the identifier correspondence table before division. The identifier correspondence table 213 includes an area 214 representing value patterns that combine a predicate and an object, an area 215 representing value patterns that combine a subject and a predicate, and an area 216 representing value patterns that combine a subject and an object.

ここで、例えば、「ｓｅｌｅｃｔ？ｓｗｈｅｒｅ｛？ｓｌｉｋｅｓＣ｝」といったＳＰＡＲＱＬクエリでは、述語と目的語とを組み合わせたｖａｌｕｅパターンが検索される。すなわち、このＳＰＡＲＱＬクエリでは、識別子対応表２１３の中の領域２１５及び２１６は、検索対象としなくてもよい。このように、検索が、対応する主語を検出する主語基準の検索なのか、対応する目的語を検出する目的語基準の検索なのか、又は、対応する述語を検出する述語基準の検索なのかにより、識別子対応表２１３において実際に必要となる領域が異なる。 Here, for example, in a SPARQL query such as "select ?s where {?s likes C}", a value pattern combining a predicate and an object is retrieved. That is, in this SPARQL query, the areas 215 and 216 in the identifier correspondence table 213 do not have to be searched. Thus, depending on whether the search is a subject-based search that finds the corresponding subject, an object-based search that finds the corresponding object, or a predicate-based search that finds the corresponding predicate, , the areas actually required in the identifier correspondence table 213 are different.

そして、第１生成部１１５は、識別子対応表２１３を分割して、図１６に示す主語基準の検索用の分割識別子対応表２３１、目的語基準の検索用の分割識別子対応表２３２及び述語基準の検索用の分割識別子対応表２３３を生成する。図１６は、分割識別子対応表の一例を表す図である。 Then, the first generating unit 115 divides the identifier correspondence table 213 into a divided identifier correspondence table 231 for subject-based retrieval, a divided identifier correspondence table 232 for object-based retrieval, and a predicate-based A split identifier correspondence table 233 for retrieval is generated. FIG. 16 is a diagram showing an example of a division identifier correspondence table.
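The division by search basis, and the way a query determines which divided table is needed, can be sketched in Python as follows. The kind labels, the dictionary layout, and the function names `split_identifier_table` and `search_basis` are hypothetical, and the sketch assumes a triple pattern containing exactly one variable, written with a leading `?`.

```python
from typing import Dict, Tuple

PatternKey = Tuple[str, str, str]  # (kind, first element, second element)

# The element left out of each kind of value pattern is the one being
# searched for. "subject" corresponds to table 231, "object" to table 232,
# and "predicate" to table 233.
BASIS_FOR_KIND = {"PO": "subject", "SP": "object", "SO": "predicate"}


def split_identifier_table(id_table: Dict[PatternKey, int]
                           ) -> Dict[str, Dict[Tuple[str, str], int]]:
    """Divide one identifier correspondence table into three, by search basis."""
    split: Dict[str, Dict[Tuple[str, str], int]] = {
        basis: {} for basis in BASIS_FOR_KIND.values()
    }
    for (kind, first, second), pattern_id in id_table.items():
        split[BASIS_FOR_KIND[kind]][(first, second)] = pattern_id
    return split


def search_basis(subject: str, predicate: str, obj: str) -> str:
    """Return which element of a triple pattern is the variable."""
    terms = {"subject": subject, "predicate": predicate, "object": obj}
    variables = [name for name, term in terms.items() if term.startswith("?")]
    if len(variables) != 1:
        raise ValueError("expected exactly one variable in the triple pattern")
    return variables[0]
```

For a pattern such as `?s likes C`, `search_basis` returns `"subject"`, so only the subject-based divided table would need to be consulted.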

その後、第１生成部１１５は、生成した分割識別子対応表２３１～２３３をＲＤＦコントローラ１１７へ送信し、ＲＤＦストア１１０への格納を指示する。さらに、第１生成部１１５は、識別子対応表の生成完了を第２生成部１１６に通知する。これにより、ＲＤＦコントローラ１１７によって、ＲＤＦストア１１０へ、分割識別子対応表２３１～２３３が格納される。 After that, the first generation unit 115 transmits the generated split identifier correspondence tables 231 to 233 to the RDF controller 117 and instructs the RDF store 110 to store them. Furthermore, the first generation unit 115 notifies the second generation unit 116 of completion of generation of the identifier correspondence table. As a result, the division identifier correspondence tables 231 to 233 are stored in the RDF store 110 by the RDF controller 117 .

The SPARQL processing unit 118 receives a SPARQL query from the job client 30. The SPARQL processing unit 118 then analyzes the received SPARQL query and converts it into MapReduce processing. After that, the SPARQL processing unit 118 outputs the MapReduce processing, including the pattern identifier, to the Spark processing unit 131. The SPARQL processing unit 118 also notifies the name node 112 of a transmission request for the divided identifier correspondence tables 231 to 233.

ネームノード１１２は、分割識別子対応表２３１～２３３の送信要求の通知をＳＰＡＲＱＬ処理部１１８から受信する。そして、ネームノード１１２は、ＲＤＦストア１１０から分割識別子対応表２３１～２３３を取得し、データノード１２１へ送信する。また、ネームノード１１２は、識別子付ＲＤＦデータ表の一部の行データを含むブロックを選択したデータノード１２１へ送信し配置する。 The name node 112 receives from the SPARQL processing unit 118 notification of the transmission request for the split identifier correspondence tables 231 to 233 . The name node 112 then acquires the split identifier correspondence tables 231 to 233 from the RDF store 110 and transmits them to the data node 121 . Also, the name node 112 transmits and arranges a block including a part of row data of the RDF data table with identifier to the selected data node 121 .

The Spark processing unit 131 receives from the SPARQL processing unit 118 the MapReduce processing to be executed using Spark. Next, the Spark processing unit 131 obtains the individual MapReduce processes contained in the received MapReduce processing. The Spark processing unit 131 then instructs the job tracker 114 to execute the obtained MapReduce processes. Furthermore, when MapReduce processing is performed repeatedly using earlier execution results, as in deep learning, the Spark processing unit 131 manages the repetition procedure and instructs the job tracker 114 to execute the MapReduce processing repeatedly on the memory 126.

その後、Ｓｐａｒｋ処理部１３１は、ＭａｐＲｅｄｕｃｅ処理の実行結果の入力をジョブトラッカー１１４から受ける。そして、Ｓｐａｒｋ処理部１３１は、ＭａｐＲｅｄｕｃｅ処理の実行結果をＳＰＡＲＱＬ処理部１１８へ出力する。この場合のＳｐａｒｋ処理部１３１は、Ｓｐａｒｋにおける「Ｄｒｉｖｅｒ」にあたる。 After that, the Spark processing unit 131 receives an input of the execution result of the MapReduce process from the job tracker 114 . The Spark processing unit 131 then outputs the execution result of the MapReduce processing to the SPARQL processing unit 118 . The Spark processing unit 131 in this case corresponds to "Driver" in Spark.

データノード１２１は、識別子付ＲＤＦデータ表の一部の行データを含むブロックをネームノード１１２から受信する。ここで、データノード１２１は、複数のブロックを受信してもよい。そして、データノード１２１は、取得したブロックを自装置のＨＤＦＳ１２２へ格納する。 The data node 121 receives from the name node 112 a block containing partial row data of the RDF data table with identifier. Here, data node 121 may receive multiple blocks. Then, the data node 121 stores the obtained block in the HDFS 122 of its own device.

The data node 121 also receives the divided identifier correspondence tables 231 to 233 from the name node 112. The data node 121 then stores the received divided identifier correspondence tables 231 to 233 in the HDFS 122 of its own device.

The MapReduce processing unit 124 according to this embodiment has a Map task execution unit 241, a Reduce task execution unit 242, and a memory management unit 243. The MapReduce processing unit 124 acquires a Map task from the task tracker 123 and executes MapReduce processing using Spark. In this case, the MapReduce processing unit 124 corresponds to the "Executor" in Spark. The details of MapReduce processing using Spark are described below.

メモリ管理部２４３は、タスクトラッカー１２３から取得したＭａｐタスクで指定された検索するｖａｌｕｅパターンを取得する。そして、メモリ管理部２４３は、そのｖａｌｕｅパターンによる検索が、主語基準の検索、目的語基準の検索、又は、述語基準の検索のいずれにあたるかを特定する。そして、メモリ管理部２４３は、分割識別子対応表２３１～２３３のうち特定した種類の検索に対応する表をＳＳＤ１２５から取得する。ここでは、主語基準の検索を行う場合で説明する。すなわち、メモリ管理部２４３は、主語基準の検索用の分割識別子対応表２３１をＳＳＤ１２５から取得する。そして、メモリ管理部２４３は、取得した分割識別子対応表２３１をＲＤＤに変換する。その後、メモリ管理部２４３は、ＲＤＤに変換した分割識別子対応表２３１をメモリ１２６上に展開する。 The memory management unit 243 acquires the value pattern to be searched specified in the Map task acquired from the task tracker 123 . Then, the memory management unit 243 specifies whether the search by the value pattern corresponds to a subject-based search, an object-based search, or a predicate-based search. Then, the memory management unit 243 acquires from the SSD 125 a table corresponding to the identified type of search among the division identifier correspondence tables 231 to 233 . Here, a case where subject-based retrieval is performed will be described. That is, the memory management unit 243 acquires the split identifier correspondence table 231 for subject-based retrieval from the SSD 125 . Then, the memory management unit 243 converts the acquired division identifier correspondence table 231 into an RDD. After that, the memory management unit 243 develops the split identifier correspondence table 231 converted into the RDD on the memory 126 .
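The selective loading performed by the memory management unit can be illustrated with a minimal Python sketch. It stands in for the SSD with a directory of JSON files and for the in-memory RDD with an ordinary dictionary; the file naming scheme, the string form of the value pattern, and the function names are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict

# Which divided table serves each kind of value pattern (hypothetical labels).
BASIS_FOR_KIND = {
    "predicate-object": "subject",
    "subject-predicate": "object",
    "subject-object": "predicate",
}


def store_split_tables(directory: Path,
                       split_tables: Dict[str, Dict[str, int]]) -> None:
    """Write each divided identifier table to its own file on storage."""
    for basis, table in split_tables.items():
        path = directory / f"split_{basis}.json"
        path.write_text(json.dumps(table), encoding="utf-8")


def load_split_table(directory: Path, pattern_kind: str) -> Dict[str, int]:
    """Read into memory only the divided table that the search needs."""
    basis = BASIS_FOR_KIND[pattern_kind]
    path = directory / f"split_{basis}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

Only one of the three files is read for a given search, so the memory held by the lookup table is bounded by the size of a single divided table rather than by the whole identifier correspondence table.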

また、メモリ管理部２４３は、ＨＤＦＳ１２２に格納された識別子付ＲＤＦデータ表の一部の行データを含むブロックを取得する。そして、メモリ管理部２４３は、取得したブロックをＲＤＤに変換する。その後、メモリ管理部２４３は、ＲＤＤに変換したブロックをメモリ１２６上に展開する。ＲＤＤは、不変で並列実行可能な分割されたコレクションである。ＲＤＤは、メモリ上に保持することが可能で、耐障害性、データ局所性などの特徴を有する。 Also, the memory management unit 243 acquires a block containing part of row data of the identifier-attached RDF data table stored in the HDFS 122 . The memory management unit 243 then converts the acquired block into an RDD. After that, the memory management unit 243 develops the blocks converted into RDDs on the memory 126 . An RDD is a partitioned collection that is immutable and can be executed in parallel. An RDD can be held in memory and has features such as fault tolerance and data locality.

その後、メモリ管理部２４３は、Ｒｅｄｕｃｅタスク実行部２４２からＲｅｄｕｃｅ処理の完了の通知を受けると、ＭａｐＲｅｄｕｃｅ処理の実行結果をメモリ１２６から取得する。そして、メモリ管理部２４３は、取得したＭａｐＲｅｄｕｃｅ処理の実行結果をＲＤＤの形式からＨＤＦＳ１１１への格納用のデータ形式に直してＨＤＦＳ１２２へ格納する。すなわち、ＨＤＦＳ１２２には、ＭａｐＲｅｄｕｃｅ処理に使用するデータが格納された識別子付ＲＤＦデータ表及びＭａｐＲｅｄｕｃｅ処理の実行結果が格納される。 After that, when the memory management unit 243 receives notification of the completion of the Reduce process from the Reduce task execution unit 242 , it acquires the execution result of the MapReduce process from the memory 126 . Then, the memory management unit 243 converts the acquired execution result of the MapReduce process from the RDD format to the data format for storage in the HDFS 111 and stores it in the HDFS 122 . That is, the HDFS 122 stores an identifier-attached RDF data table in which data used in the MapReduce process is stored, and the execution result of the MapReduce process.

The Map task execution unit 241 executes the Map processing of the Spark-based Map task that the task tracker 123 has instructed it to execute. Specifically, the Map task execution unit 241 obtains the value pattern to be searched that is specified in the Map task. The Map task execution unit 241 then searches the divided identifier correspondence table 231 in the memory 126 using the obtained value pattern and acquires the pattern identifier corresponding to that value pattern.

Next, the Map task execution unit 241 acquires the data to be subjected to Map processing from the identifier-attached RDF data table, converts it into key=value format, and uses the converted data as input. The Map task execution unit 241 then extracts, from the input data converted into an RDD, the data whose value matches the acquired pattern identifier. Next, the Map task execution unit 241 performs shuffle and sort processing on the extracted data. Each Map task execution unit 241 then outputs the shuffled and sorted data to the Reduce task execution unit 242.

ここで、Ｍａｐタスク実行部２４１は、以上の処理の際に生成される中間データはＲＤＤ形式でメモリ１２６上に保持しつつ、以上の処理を連続的に実行する。特に、深層学習などにおいてＭａｐＲｅｄｕｃｅ処理の実行結果を繰り返し用いて処理を行う場合、Ｍａｐタスク実行部２４１は、メモリ１２６に対するデータの読み出し及び書き込みにより連続的に繰り返し処理を実行することができる。 Here, the Map task execution unit 241 continuously executes the above processes while holding intermediate data generated during the above processes in the memory 126 in RDD format. In particular, when performing processing by repeatedly using the execution results of the MapReduce processing in deep learning or the like, the Map task execution unit 241 can continuously and repeatedly perform processing by reading data from and writing data to the memory 126 .

Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理を行う。Ｒｅｄｕｃｅ処理は、Ｒｅｄｕｃｅの設計者が予め決めた処理を実行することができる。例えば、Ｒｅｄｕｃｅタスク実行部２４２は、値の合計や集約などの処理を行う。その後、Ｒｅｄｕｃｅタスク実行部２４２は、ＭａｐＲｅｄｕｃｅ処理の実行結果をメモリ１２６に格納する。さらに、Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理の完了をメモリ管理部２４３及びタスクトラッカー１２３へ通知する。 The Reduce task execution unit 242 performs Reduce processing. The Reduce process can execute a process predetermined by the Reduce designer. For example, the Reduce task execution unit 242 performs processing such as totaling and aggregating values. After that, the Reduce task execution unit 242 stores the execution result of the MapReduce process in the memory 126 . Furthermore, the Reduce task execution unit 242 notifies the memory management unit 243 and the task tracker 123 of the completion of the Reduce process.

タスクトラッカー１２３は、Ｒｅｄｕｃｅ処理の完了の通知をＲｅｄｕｃｅタスク実行部２４２から受ける。そして、タスクトラッカー１２３は、ＭａｐＲｅｄｕｃｅ処理の実行結果をＨＤＦＳ１２２から取得し、ジョブトラッカー１１４へ送信する。 The task tracker 123 receives notification of the completion of the Reduce process from the Reduce task execution unit 242 . The task tracker 123 acquires the execution result of the MapReduce process from the HDFS 122 and transmits it to the job tracker 114 .

ここで、以上の説明では、スレーブサーバ１２が保持するＳＳＤ１２５に分割識別子対応表２３１～２３３を格納する構成で説明したが、分割識別子対応表２３１～２３３の配置場所に特に制限は無い。例えば、マスタサーバ１１に分割識別子対応表２３１～２３３を配置して、スレーブサーバ１２のメモリ管理部２４３が、マスタサーバ１１から分割識別子対応表２３１～２３３を取得する構成であってもよい。 Here, in the above description, the partition identifier correspondence tables 231 to 233 are stored in the SSD 125 held by the slave server 12, but there is no particular restriction on the location of the partition identifier correspondence tables 231 to 233. For example, the split identifier correspondence tables 231 to 233 may be arranged in the master server 11 and the memory management unit 243 of the slave server 12 may obtain the split identifier correspondence tables 231 to 233 from the master server 11 .

Next, an overview of MapReduce processing using pattern identifiers according to the third embodiment will be described with reference to FIG. 17. FIG. 17 is a diagram showing an overview of MapReduce processing according to the third embodiment. The description here assumes that the identifier-attached RDF data tables 611 and 621 shown in FIG. 17 are each processed by a different MapReduce processing unit 124.

例えば、図１７では、Ｍａｐタスク実行部２４１は、ＳＰＡＲＱＬクエリが「ｓｅｌｅｃｔ？ｓｗｈｅｒｅ｛？ｓｌｏｖｅｓＣ．｝」という構文で表されるデータ抽出をＳｐａｒｋを用いて行うＭａｐタスクを取得する。メモリ管理部２４３は、識別子付ＲＤＦデータ表６１１及び６２１、並びに、分割識別子対応表２３１をＲＤＤに変換してメモリ１２６上に格納する。 For example, in FIG. 17, the Map task execution unit 241 acquires a Map task that uses Spark to extract the data described by the SPARQL query "select ?s where {?s loves C.}". The memory management unit 243 converts the identifier-attached RDF data tables 611 and 621 and the divided identifier correspondence table 231 into RDDs and stores them in the memory 126.

Ｍａｐタスク実行部２４１は、ＲＤＤに変換されメモリ上に格納された分割識別子対応表２３１から、「ｌｏｖｅｓＣ」に対応するパターン識別子として１００２を取得する。そして、Ｍａｐタスク実行部２４１は、ＲＤＤに変換された識別子付ＲＤＦデータ表６１１又は６２１からＭａｐ処理を行う対象とするデータを取得してｋｅｙ＝ｖａｌｕｅ形式のデータに変換しそのデータを入力とする。ここでは、Ｍａｐタスク実行部２４１は、主語をｋｅｙとし述語及び目的語の組み合わせのｖａｌｕｅパターンをｖａｌｕｅとするデータを入力とする。 The Map task execution unit 241 acquires 1002 as the pattern identifier corresponding to "loves C" from the divided identifier correspondence table 231 that has been converted into an RDD and stored in the memory. The Map task execution unit 241 then acquires the data to be subjected to Map processing from the identifier-attached RDF data table 611 or 621 converted into an RDD, converts it into data in key=value format, and uses that data as input. Here, the Map task execution unit 241 takes as input data in which the subject is the key and the value pattern formed by the combination of the predicate and the object is the value.

そして、各Ｍａｐタスク実行部２４１は、入力のデータからｖａｌｕｅを表すパターン識別子が１００２であるデータ６１２又は６２２を抽出してメモリ１２６上に格納する。そして、各Ｍａｐタスク実行部２４１は、抽出したデータ６１２又は６２２に対してシャッフル及びソート処理を実行し処理結果をメモリ１２６上に格納する。ここでは、各Ｍａｐタスク実行部２４１は、ｋｅｙがＢであるデータを一方に集め、それ以外のデータを他方に集める。さらに、各Ｍａｐタスク実行部２４１は、ｋｅｙを基準に収集したデータをソートしてデータ６１３又は６２３を生成しメモリ１２６上に格納する。 Then, each Map task execution unit 241 extracts data 612 or 622 whose pattern identifier representing value is 1002 from the input data and stores it in the memory 126 . Each Map task executing unit 241 executes shuffle and sort processing on the extracted data 612 or 622 and stores the processing result in the memory 126 . Here, each Map task execution unit 241 collects data whose key is B to one side, and collects other data to the other side. Furthermore, each Map task execution unit 241 sorts the collected data based on the key to generate data 613 or 623 and stores the data 613 or 623 in the memory 126 .

Ｒｅｄｕｃｅタスク実行部２４２は、データ６１３又は６２３の入力をメモリ１２６から取得する。次に、Ｒｅｄｕｃｅタスク実行部２４２は、取得したデータ６１３又は６２３から同じｋｅｙを有するデータの数を集計する。そして、Ｒｅｄｕｃｅタスク実行部２４２は、Ｒｅｄｕｃｅ処理の結果６１４又は６２４をメモリ１２６上に格納する。ここで、図１７の結果６１４及び６２４におけるｃはカウント値を表す。 The Reduce task execution unit 242 acquires input of data 613 or 623 from the memory 126 . Next, the Reduce task execution unit 242 tallies the number of data having the same key from the acquired data 613 or 623 . Then, the Reduce task execution unit 242 stores the result 614 or 624 of the Reduce process on the memory 126 . Here, c in results 614 and 624 in FIG. 17 represents the count value.
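
The flow described for FIG. 17 (filter by pattern identifier, shuffle and sort by key, then count per key) can be sketched in plain Python without Spark. This is only an illustration under assumptions: the sample rows and the helper names `map_filter`, `shuffle_sort`, and `reduce_count` are hypothetical, and only the pattern identifier 1002 for "loves C" is taken from the description.

```python
from collections import defaultdict

# Hypothetical key=value rows held by two processing units:
# key = subject, value = pattern identifier of the predicate+object pair.
rows_node_a = [("A", 1002), ("B", 1002), ("B", 1005)]
rows_node_b = [("B", 1002), ("D", 1002), ("A", 1007)]

LOVES_C = 1002  # pattern identifier obtained for "loves C"


def map_filter(rows, pattern_id):
    """Map step: keep only the pairs whose value is the wanted identifier."""
    return [(key, value) for key, value in rows if value == pattern_id]


def shuffle_sort(mapped_parts):
    """Shuffle and sort step: gather pairs by key, then order by key."""
    groups = defaultdict(list)
    for part in mapped_parts:
        for key, value in part:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_count(grouped):
    """Reduce step: count how many entries share each key."""
    return {key: len(values) for key, values in grouped}


mapped = [map_filter(rows_node_a, LOVES_C), map_filter(rows_node_b, LOVES_C)]
counts = reduce_count(shuffle_sort(mapped))
print(counts)
```

Because the comparison in the Map step is on a small integer identifier rather than on the predicate and object strings, each node only has to scan a compact column.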

次に、図１８を参照して、実施例３に係る識別子表及び識別子付ＲＤＦデータ表の生成処理の流れについて説明する。図１８は、実施例３に係る識別子表及び識別子付ＲＤＦデータ表の生成処理のフローチャートである。以下では、ＨＤＦＳ１１１との間のデータの送受信におけるＲＤＦコントローラ１１７の仲介動作を省略する。 Next, with reference to FIG. 18, the flow of processing for generating an identifier table and an identifier-attached RDF data table according to the third embodiment will be described. FIG. 18 is a flowchart of processing for generating an identifier table and an identifier-attached RDF data table according to the third embodiment. In the following, the intermediate operation of the RDF controller 117 in data transmission/reception with the HDFS 111 will be omitted.

第１生成部１１５は、ＲＤＦストア１１０に格納されたＲＤＦデータから全ての主語、述語及び目的語の重複を除いて取得する。そして、第１生成部１１５は、取得した主語、述語及び目的語を２つずつ組み合わせて、ｖａｌｕｅパターンを抽出する（ステップＳ１０１）。 The first generation unit 115 acquires all subjects, predicates, and objects, with duplicates removed, from the RDF data stored in the RDF store 110. The first generation unit 115 then combines the acquired subjects, predicates, and objects two at a time to extract value patterns (step S101).

次に、第１生成部１１５は、抽出したｖａｌｕｅパターンの中にＲＤＦストア１１０に格納された実際のＲＤＦデータの中に存在しないｖａｌｕｅパターンがあるか否かを判定する（ステップＳ１０２）。実際には存在しないｖａｌｕｅパターンが無い場合（ステップＳ１０２：否定）、第１生成部１１５は、ステップＳ１０４へ進む。 Next, the first generator 115 determines whether or not the extracted value patterns include a value pattern that does not exist in the actual RDF data stored in the RDF store 110 (step S102). If there is no value pattern that does not actually exist (step S102: No), the first generator 115 proceeds to step S104.

実際には存在しないｖａｌｕｅパターンがある場合（ステップＳ１０２：肯定）、第１生成部１１５は、抽出したｖａｌｕｅパターンの中から実際には存在しないｖａｌｕｅパターンを除いて、実際に存在するｖａｌｕｅパターンを抽出する（ステップＳ１０３）。 If there is a value pattern that does not actually exist (step S102: YES), the first generation unit 115 extracts the value pattern that actually exists by removing the value pattern that does not actually exist from the extracted value patterns. (step S103).

次に、第１生成部１１５は、実際に存在するｖａｌｕｅパターンに識別子を割り当て、各ｖａｌｕｅパターンに対応するパターン識別子を表す識別子対応表を生成する（ステップＳ１０４）。 Next, the first generating unit 115 assigns identifiers to actually existing value patterns, and generates an identifier correspondence table representing pattern identifiers corresponding to each value pattern (step S104).
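
Steps S101 to S104 can be sketched as follows. This is a minimal sketch under assumptions: the sample triples are hypothetical, only predicate+object patterns are built for brevity, and numbering identifiers from 1001 is an arbitrary choice that the description does not specify.

```python
from itertools import product

# Hypothetical RDF triples: (subject, predicate, object).
triples = [("A", "loves", "C"), ("B", "loves", "C"), ("B", "knows", "A")]

# S101: collect distinct elements and combine them two at a time.
predicates = sorted({p for _, p, _ in triples})
objects = sorted({o for _, _, o in triples})
candidates = set(product(predicates, objects))

# S102/S103: drop combinations that never occur in the actual data.
existing = {(p, o) for _, p, o in triples}
value_patterns = sorted(candidates & existing)

# S104: assign a compact integer identifier to each surviving pattern.
FIRST_ID = 1001  # assumption: the starting number is arbitrary
identifier_table = {
    pattern: FIRST_ID + offset for offset, pattern in enumerate(value_patterns)
}
print(identifier_table)
```

Removing the combinations that do not exist keeps the table no larger than the number of distinct pairs in the data, instead of the full cross product.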

次に、第１生成部１１５は、生成した識別子対応表を主語基準の検索用、述語基準の検索用、目的語基準の検索用に分割して分割識別子対応表２３１～２３３を作成する。次に、第１生成部１１５は、分割識別子対応表２３１～２３３をＲＤＦストア１１０に格納する（ステップＳ１０５）。さらに、第１生成部１１５は、分割識別子対応表２３１～２３３の生成完了を第２生成部１１６に通知する。 Next, the first generation unit 115 divides the generated identifier correspondence table into tables for subject-based search, predicate-based search, and object-based search, creating the divided identifier correspondence tables 231 to 233. The first generation unit 115 then stores the divided identifier correspondence tables 231 to 233 in the RDF store 110 (step S105). Further, the first generation unit 115 notifies the second generation unit 116 that generation of the divided identifier correspondence tables 231 to 233 is complete.

分割識別子対応表２３１～２３３の生成完了の通知を受けた第２生成部１１６は、ＲＤＦストア１１０に含まれる全てのＲＤＦデータ及び識別子対応表をＲＤＦストア１１０から取得する。次に、第２生成部１１６は、各ＲＤＦデータの主語と述語とを組み合わせたｖａｌｕｅパターン、述語と目的語とを組み合わせたｖａｌｕｅパターン及び主語と目的語とを組み合わせたｖａｌｕｅパターンを取得する。そして、第２生成部１１６は、取得したｖａｌｕｅパターンに対応するパターン識別子を分割識別子対応表２３１～２３３から取得する。次に、第２生成部１１６は、トリプルの対応を表す対応表における各ＲＤＦデータに取得したパターン識別子を付加して識別子付ＲＤＦデータ表を生成する（ステップＳ１０６）。 On receiving the notification that generation of the divided identifier correspondence tables 231 to 233 is complete, the second generation unit 116 acquires all the RDF data and the identifier correspondence table from the RDF store 110. Next, for each piece of RDF data, the second generation unit 116 acquires the value pattern combining the subject and the predicate, the value pattern combining the predicate and the object, and the value pattern combining the subject and the object. The second generation unit 116 then acquires the pattern identifiers corresponding to the acquired value patterns from the divided identifier correspondence tables 231 to 233. Next, the second generation unit 116 generates the identifier-attached RDF data table by adding the acquired pattern identifiers to each piece of RDF data in the correspondence table representing the triples (step S106).
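
Steps S105 and S106 can be sketched as follows. This is an illustration under stated assumptions: the pairing of each search basis with the two remaining triple positions, the names `BASIS_POSITIONS`, `build_split_tables`, and `attach_identifiers`, the sample triples, and the starting identifier 1001 are all hypothetical.

```python
# Hypothetical RDF triples: (subject, predicate, object).
triples = [("A", "loves", "C"), ("B", "loves", "C"), ("B", "knows", "A")]

# Assumption: for each search basis (the key), the value pattern is formed
# from the two remaining positions of the triple.
BASIS_POSITIONS = {"subject": (1, 2), "predicate": (0, 2), "object": (0, 1)}


def build_split_tables(triples, first_id=1001):
    """S105: one identifier table per search basis."""
    tables = {basis: {} for basis in BASIS_POSITIONS}
    next_id = first_id
    for basis, (i, j) in BASIS_POSITIONS.items():
        for pattern in sorted({(t[i], t[j]) for t in triples}):
            tables[basis][pattern] = next_id
            next_id += 1
    return tables


def attach_identifiers(triples, tables):
    """S106: append the three pattern identifiers to every triple."""
    rows = []
    for t in triples:
        ids = tuple(
            tables[basis][(t[i], t[j])]
            for basis, (i, j) in BASIS_POSITIONS.items()
        )
        rows.append(t + ids)
    return rows


tables = build_split_tables(triples)
rows = attach_identifiers(triples, tables)
for row in rows:
    print(row)
```

Each row of the resulting table carries the original triple plus one identifier per search basis, so a later search can match on a single integer column.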

ネームノード１１２は、識別子付ＲＤＦデータ表をＲＤＦストア１１０から取得する。次に、ネームノード１１２は、識別子付ＲＤＦデータ表に登録されたデータを含む各ブロックを配置するデータノード１２１を決定する。そして、ネームノード１１２は、識別子付ＲＤＦデータ表の一部の行データを含む各ブロックを、配置先として決定したそれぞれのデータノード１２１へ送信し、データの分散配置を実行する（ステップＳ１０７）。 The NameNode 112 acquires the RDF data table with identifier from the RDF store 110 . Next, the namenode 112 determines the datanode 121 in which each block containing the data registered in the identifier-attached RDF data table is placed. Then, the namenode 112 transmits each block including a part of row data of the identifier-attached RDF data table to each of the datanodes 121 determined as allocation destinations, and executes data distribution allocation (step S107).
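
The distributed placement in step S107 can be sketched as follows. The description does not state how the name node chooses a data node for each block, so the fixed block size and the round-robin policy here are assumptions, as are the node names and the helper names `split_into_blocks` and `place_blocks`.

```python
def split_into_blocks(rows, block_size):
    """Cut the identifier-attached table into blocks of consecutive rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def place_blocks(blocks, data_nodes):
    """Assign blocks to data nodes in round-robin order (assumed policy)."""
    placement = {node: [] for node in data_nodes}
    for index, block in enumerate(blocks):
        placement[data_nodes[index % len(data_nodes)]].append(block)
    return placement


# Hypothetical rows: triple plus one pattern identifier.
rows = [
    ("A", "loves", "C", 1002),
    ("B", "loves", "C", 1002),
    ("B", "knows", "A", 1001),
    ("D", "loves", "C", 1002),
    ("A", "knows", "D", 1003),
]

blocks = split_into_blocks(rows, block_size=2)
placement = place_blocks(blocks, ["datanode-1", "datanode-2"])
for node, held in placement.items():
    print(node, held)
```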

次に、図１９を参照して、実施例３に係るＭａｐＲｅｄｕｃｅ処理の流れについて説明する。図１９は、実施例３に係るＭａｐＲｅｄｕｃｅ処理のフローチャートである。 Next, a flow of MapReduce processing according to the third embodiment will be described with reference to FIG. 19 . FIG. 19 is a flowchart of MapReduce processing according to the third embodiment.

ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリの実行命令の入力をジョブクライアント３０から受ける。そして、ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリを実行する（ステップＳ２０１）。 The SPARQL processing unit 118 receives an input of a SPARQL query execution command from the job client 30 . The SPARQL processing unit 118 then executes the SPARQL query (step S201).

次に、ＳＰＡＲＱＬ処理部１１８は、ＳＰＡＲＱＬクエリをＭａｐＲｅｄｕｃｅ処理のジョブへ変換する（ステップＳ２０２）。 Next, the SPARQL processing unit 118 converts the SPARQL query into a job for MapReduce processing (step S202).

次に、ＳＰＡＲＱＬ処理部１１８は、ＨＤＦＳ１１１から識別子対応表を取得する。そして、ＳＰＡＲＱＬ処理部１１８は、投入されたクエリを構文解析（パース）して識別子対応表に登録されたｖａｌｕｅパターンに該当するｖａｌｕｅパターンがあるか否かを判定する（ステップＳ２０３）。該当するｖａｌｕｅパターンが無い場合（ステップＳ２０３：否定）、ＳＰＡＲＱＬ処理部１１８は、そのようなｖａｌｕｅパターンのマッチング結果は０件であるという検索結果をジョブクライアント３０に返してＳＰＡＲＱＬクエリの実行処理を終了する。 Next, the SPARQL processing unit 118 acquires the identifier correspondence table from the HDFS 111. The SPARQL processing unit 118 then parses the submitted query and determines whether it contains a value pattern that matches one registered in the identifier correspondence table (step S203). If there is no matching value pattern (step S203: No), the SPARQL processing unit 118 returns to the job client 30 a search result indicating that there are zero matches for that value pattern, and ends the SPARQL query execution processing.
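
The check in step S203 can be sketched as follows. This is a simplified illustration, not a SPARQL parser: it assumes a single triple pattern with a variable subject, and the helper names `parse_value_pattern` and `lookup_identifier` and the table contents are hypothetical.

```python
# Hypothetical identifier correspondence table: (predicate, object) -> id.
identifier_table = {("knows", "A"): 1001, ("loves", "C"): 1002}


def parse_value_pattern(query):
    """Extract (predicate, object) from a query such as
    'select ?s where {?s loves C.}' (single triple pattern only)."""
    body = query[query.index("{") + 1:query.index("}")]
    subject, predicate, obj = body.strip().rstrip(".").split()
    if not subject.startswith("?"):
        raise ValueError("this sketch only handles a variable subject")
    return (predicate, obj)


def lookup_identifier(query, table):
    """Return the pattern identifier, or None when the pattern is not
    registered, in which case the job can be skipped with zero matches."""
    return table.get(parse_value_pattern(query))


print(lookup_identifier("select ?s where {?s loves C.}", identifier_table))
print(lookup_identifier("select ?s where {?s hates C.}", identifier_table))
```

Because the table holds only patterns that exist in the data, a failed lookup is enough to answer the query without starting any distributed processing.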

これに対して、該当するｖａｌｕｅパターンがある場合（ステップＳ２０３：肯定）、ＳＰＡＲＱＬ処理部１１８は、ＭａｐＲｅｄｕｃｅ処理を実行する。Ｓｐａｒｋ処理部１３１は、ＳＰＡＲＱＬ処理部１１８からの指示を受けて、ＭａｐＲｅｄｕｃｅ処理の実行をジョブトラッカー１１４に指示する。ジョブトラッカー１１４は、メタデータＤＢ１１３を確認し、ＭａｐＲｅｄｕｃｅ処理を行わせるスレーブサーバ１２を選択する。そして、ジョブトラッカー１１４は、ＭａｐＲｅｄｕｃｅ処理をブロック単位のＭａｐタスクに分割し、選択したスレーブサーバ１２へ送信する。タスクトラッカー１２３は、Ｍａｐタスクをジョブトラッカー１１４から受信する。そして、タスクトラッカー１２３は、取得したＭａｐタスクの実行をＭａｐＲｅｄｕｃｅ処理部１２４に指示する。ＭａｐＲｅｄｕｃｅ処理部１２４は、Ｍａｐタスクの実行の指示をタスクトラッカー１２３から受ける。そして、メモリ管理部２４３は、分割識別子対応表２３１～２３３の中からＭａｐタスクで実行する検索基準に応じた表を取得する（ステップＳ２０４）。ここでは、分割識別子対応表２３１を選択した場合で説明する。 On the other hand, if there is a corresponding value pattern (step S203: Yes), the SPARQL processing unit 118 executes MapReduce processing. The Spark processing unit 131 receives an instruction from the SPARQL processing unit 118 and instructs the job tracker 114 to execute MapReduce processing. The job tracker 114 confirms the metadata DB 113 and selects a slave server 12 to perform the MapReduce process. The job tracker 114 then divides the MapReduce process into block-based Map tasks and transmits them to the selected slave server 12 . The task tracker 123 receives Map tasks from the job tracker 114 . The task tracker 123 then instructs the MapReduce processing unit 124 to execute the acquired Map task. The MapReduce processing unit 124 receives an instruction to execute a Map task from the task tracker 123 . Then, the memory management unit 243 acquires a table corresponding to the search criteria to be executed by the Map task from among the division identifier correspondence tables 231 to 233 (step S204). Here, a case where the division identifier correspondence table 231 is selected will be described.
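
The table selection in step S204 can be sketched as follows. The mapping of reference numerals 231, 232, and 233 to the subject, predicate, and object bases is an assumption inferred from the order in which the tables are listed, and `select_split_table` is a hypothetical helper.

```python
# Assumed mapping from search basis to divided table reference numeral.
SPLIT_TABLES = {"subject": 231, "predicate": 232, "object": 233}


def select_split_table(triple_pattern):
    """Pick the divided table whose basis is the single variable position."""
    positions = ("subject", "predicate", "object")
    variables = [
        name
        for name, term in zip(positions, triple_pattern)
        if term.startswith("?")
    ]
    if len(variables) != 1:
        raise ValueError("this sketch expects exactly one variable")
    return SPLIT_TABLES[variables[0]]


# "?s loves C": the subject is unknown, so the subject-based table is used.
print(select_split_table(("?s", "loves", "C")))
```

Loading only the selected table, rather than all three, is what keeps the amount of data read into memory small.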

次に、メモリ管理部２４３は、選択した分割識別子対応表２３１及びＨＤＦＳ１２２ｂに格納された識別子付ＲＤＦデータ表をＲＤＤに変換してメモリ１２６上に展開する（ステップＳ２０５）。 Next, the memory management unit 243 converts the selected divided identifier correspondence table 231 and the identifier-attached RDF data table stored in the HDFS 122b into an RDD and develops it on the memory 126 (step S205).

Ｍａｐタスク実行部２４１は、メモリ１２６上に展開された分割識別子対応表２３１及び識別子付ＲＤＦデータ表を用いて、Ｍａｐタスクで指定されたＭａｐ処理を実行する（ステップＳ２０６）。 The Map task execution unit 241 uses the divided identifier correspondence table 231 and the identifier-attached RDF data table developed on the memory 126 to execute the Map process specified by the Map task (step S206).

次に、Ｍａｐタスク実行部２４１は、Ｍａｐ処理の処理結果をｋｅｙ毎にまとまるようシャッフルして各スレーブサーバ１２のＭａｐタスク実行部２４１に振り分ける。さらに、Ｍａｐタスク実行部２４１は、シャッフルにより自装置に振り分けられたデータをｋｅｙ毎にまとまるようにソートする（ステップＳ２０７）。そして、Ｍａｐタスク実行部２４１は、ソートしたデータをメモリ１２６に格納する。 Next, the Map task execution unit 241 shuffles the processing results of the Map processing so that they are collected for each key and distributes them to the Map task execution unit 241 of each slave server 12 . Furthermore, the Map task execution unit 241 sorts the data distributed to the own device by shuffling so that the data are collected for each key (step S207). The Map task execution unit 241 then stores the sorted data in the memory 126 .

Ｒｅｄｕｃｅタスク実行部２４２は、Ｍａｐタスク実行部２４１によりメモリ１２６に格納されたデータに対して予め指定されたＲｅｄｕｃｅ処理を実行する（ステップＳ２０８）。 The Reduce task execution unit 242 executes a designated Reduce process on the data stored in the memory 126 by the Map task execution unit 241 (step S208).

その後、Ｒｅｄｕｃｅタスク実行部２４２は、ＭａｐＲｅｄｕｃｅ処理の結果をメモリ１２６に格納する。メモリ管理部２４３は、メモリ１２６に格納されたＭａｐＲｅｄｕｃｅ処理の実行結果を取得してＨＤＦＳ１１１への格納用のデータ形式に変換してＨＤＦＳ１２２に格納する。タスクトラッカー１２３は、ＨＤＦＳ１２２に格納されたＭａｐＲｅｄｕｃｅ処理の実行結果をマスタサーバ１１のジョブトラッカー１１４へ送信する。ジョブトラッカー１１４は、各スレーブサーバ１２から送信されたＭａｐＲｅｄｕｃｅ処理の実行結果を収集する。そして、ジョブトラッカー１１４は、ＭａｐＲｅｄｕｃｅ処理の実行結果を結合する。そして、ジョブトラッカー１１４は、結合したＭａｐＲｅｄｕｃｅ処理の実行結果をＭａｐＲｅｄｕｃｅ処理部１１９を介してＳＰＡＲＱＬ処理部１１８へ送信する。ＳＰＡＲＱＬ処理部１１８は、結合されたＭａｐＲｅｄｕｃｅ処理の実行結果を受信し、受信したデータをＲＤＦ形式に変換する（ステップＳ２０９）。その後、ＳＰＡＲＱＬ処理部１１８は、ＲＤＦ形式に変換したＭａｐＲｅｄｕｃｅ処理の結果をＳＰＡＲＱＬクエリの実行結果としてジョブクライアント３０へ送信する。 After that, the Reduce task execution unit 242 stores the result of the MapReduce processing in the memory 126 . The memory management unit 243 acquires the execution result of the MapReduce process stored in the memory 126 , converts it into a data format for storage in the HDFS 111 , and stores it in the HDFS 122 . The task tracker 123 transmits the execution result of the MapReduce process stored in the HDFS 122 to the job tracker 114 of the master server 11 . The job tracker 114 collects the execution results of the MapReduce processing sent from each slave server 12 . The job tracker 114 then combines the execution results of the MapReduce process. The job tracker 114 then transmits the execution result of the combined MapReduce processing to the SPARQL processing unit 118 via the MapReduce processing unit 119 . The SPARQL processing unit 118 receives the execution result of the combined MapReduce process, and converts the received data into RDF format (step S209). After that, the SPARQL processing unit 118 transmits the result of the MapReduce process converted into the RDF format to the job client 30 as the execution result of the SPARQL query.

ここで、本実施例では、分散型のインメモリ処理としてＳｐａｒｋを用いる場合で説明したが、他の分散型のインメモリ処理を用いてもよい。また、情報処理システム１は、分散型のインメモリ処理を用いるＭａｐＲｅｄｕｃｅ処理と実施例１で説明した分散型のインメモリ処理を用いないＭａｐＲｅｄｕｃｅ処理とを選択的に実行できる構成にしてもよい。さらに、本実施例では、実施例１で説明したＭａｐＲｅｄｕｃｅ処理に対してＳｐａｒｋを用いる構成で説明したが、実施例２の構成に適用することもできる。 In the present embodiment, Spark is used as the distributed in-memory processing, but other distributed in-memory processing may be used. The information processing system 1 may also be configured to selectively execute MapReduce processing that uses distributed in-memory processing and the MapReduce processing described in the first embodiment, which does not. Furthermore, although the present embodiment applies Spark to the MapReduce processing described in the first embodiment, it can also be applied to the configuration of the second embodiment.

以上に説明したように、本実施例に係るＨａｄｏｏｐクラスタは、Ｓｐａｒｋを用いたＭａｐＲｅｄｕｃｅ処理を実行する際に、検索対象に応じて作成された識別子対応表のいずれかを用いる。これにより、メモリへの読み込み量を削減して処理に割り当てるメモリ容量を十分に確保することで、処理速度が低下を回避することができる。また、識別子対応表に含まれるエントリ数が少なくなるため、グラフデータの検索効率を向上させることができる。さらに、分散型のインメモリ処理によりＭａｐＲｅｄｕｃｅ処理を実行することにより、ＭａｐＲｅｄｕｃｅ処理の効率を向上させることができる。 As described above, the Hadoop cluster according to this embodiment uses one of the identifier correspondence tables created according to the search target when executing the MapReduce process using Spark. As a result, a reduction in processing speed can be avoided by reducing the amount of data read into the memory and ensuring a sufficient memory capacity to be allocated to processing. In addition, since the number of entries included in the identifier correspondence table is reduced, it is possible to improve the efficiency of searching graph data. Furthermore, the efficiency of the MapReduce processing can be improved by executing the MapReduce processing by distributed in-memory processing.

上述してきた各実施例に係るマスタサーバ１１及びスレーブサーバ１２は、例えば図１３に示すようなハードウェア構成を有するコンピュータで実現できる。Ｓｐａｒｋ処理部１３１は、マスタサーバ１１がコンピュータ９０で実現される場合、ＣＰＵ９１及びメモリ９２によりその機能が実現される。また、ＭａｐＲｅｄｕｃｅ処理部１２４は、スレーブサーバ１２がコンピュータ９０で実現される場合、ＣＰＵ９１及びメモリ９２によりその機能が実現される。 The master server 11 and slave server 12 according to each of the embodiments described above can be realized by a computer having a hardware configuration as shown in FIG. 13, for example. The function of the Spark processing unit 131 is realized by the CPU 91 and the memory 92 when the master server 11 is realized by the computer 90 . Further, when the slave server 12 is implemented by the computer 90 , the MapReduce processing unit 124 is implemented by the CPU 91 and the memory 92 .

１情報処理システム
１０Ｈａｄｏｏｐクラスタ
１１マスタサーバ
１２スレーブサーバ
２０ＨＤＦＳクライアント
３０ジョブクライアント
１１１ＨＤＦＳ
１１２ネームノード
１１３メタデータＤＢ
１１４ジョブトラッカー
１１５第１生成部
１１６第２生成部
１１７ＲＤＦコントローラ
１１８ＳＰＡＲＱＬ処理部
１１９ＭａｐＲｅｄｕｃｅ処理部
１２１データノード
１２２ＨＤＦＳ
１２３タスクトラッカー
１２４ＭａｐＲｅｄｕｃｅ処理部
１２５ＳＳＤ
１２６メモリ
１３１Ｓｐａｒｋ処理部
２４１Ｍａｐタスク実行部
２４２Ｒｅｄｕｃｅタスク実行部
２４３メモリ管理部
1 Information processing system
10 Hadoop cluster
11 Master server
12 Slave server
20 HDFS client
30 Job client
111 HDFS
112 Name node
113 Metadata DB
114 Job tracker
115 First generation unit
116 Second generation unit
117 RDF controller
118 SPARQL processing unit
119 MapReduce processing unit
121 Data node
122 HDFS
123 Task tracker
124 MapReduce processing unit
125 SSD
126 Memory
131 Spark processing unit
241 Map task execution unit
242 Reduce task execution unit
243 Memory management unit

Claims

1. A search processing program for causing a computer to execute a process comprising:
extracting two elements from data having three elements, and generating a first table in which an identifier having a data size smaller than the extracted two elements is associated with the extracted two elements;
generating a second table in which the identifier is added to a table of the three elements;
dividing the second table and arranging it in a plurality of processing devices;
at the time of a search, retrieving the identifier using the first table, and causing each of the processing devices to search, using the retrieved identifier, the portion of the second table arranged in that processing device; and
outputting rows extracted from the second table by the search.

2. The search processing program according to claim 1, wherein the second table is a table obtained by adding the identifier to a correspondence table of the three elements.

3. The search processing program according to claim 1, wherein a correspondence table is generated in which each of the three elements is associated with the identifier.

4. The search processing program according to any one of claims 1 to 3, wherein the data having the three elements is graph data, and the search is a search using data in key=value format.

5. The search processing program according to any one of the above claims, further causing the computer to execute a process of aggregating the output rows with any one of the three elements as a reference and arranging them in different processing devices according to the reference element.

6. The search processing program according to claim 5, wherein each of the processing devices executes a predetermined process on the rows aggregated with the reference element as the reference.

7. The search processing program according to any one of claims 1 to 6, wherein the first table is divided for each combination of the two elements, one of the divided first tables is selected according to a search target, and the search is executed by distributed in-memory processing using the selected divided first table.

8. A search processing method comprising:
extracting two elements from data having three elements, and generating a first table in which an identifier having a data size smaller than the extracted two elements is associated with the extracted two elements;
generating a second table in which the identifier is added to a table of the three elements;
dividing the second table and arranging it in a plurality of processing devices;
at the time of a search, retrieving the identifier using the first table, and causing each of the processing devices to search, using the retrieved identifier, the portion of the second table arranged in that processing device; and
outputting rows extracted from the second table by the search.

9. An information processing device comprising:
a first generation unit that extracts two elements from data having three elements and generates a first table in which an identifier having a data size smaller than the extracted two elements is associated with the extracted two elements;
a second generation unit that generates a second table in which the identifier is added to a table of the three elements;
an arrangement unit that divides the second table and arranges it in a plurality of processing devices;
a search unit that, at the time of a search, retrieves the identifier using the first table and causes each of the processing devices to search, using the retrieved identifier, the portion of the second table arranged in that processing device; and
an output unit that outputs rows extracted from the second table by the search performed by the search unit.