JP2014041501A - Fast reading method for batch processing target data and batch management system

Info

Publication number: JP2014041501A
Application number: JP2012183731A
Authority: JP
Inventors: Yasuhiro Kirihata; 康裕桐畑; Hideaki Saijo; 秀明才所
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2012-08-23
Filing date: 2012-08-23
Publication date: 2014-03-06

Abstract

PROBLEM TO BE SOLVED: In a conventional batch processing system that targets a large amount of data, transferring the batch processing target data stored in a DB to a batch processing cluster all at once takes time, so the overall batch processing time becomes long.

SOLUTION: A batch management system extracts update information on data stored in a database server a plurality of times during a batch execution cycle, stores and manages, on the basis of the update information, the data updated between the previous update processing and the current update processing, and sequentially transfers the update data to a distributed file system.

Description

The present invention relates to a technique for speeding up the reading of batch processing target data into a batch processing system.

In recent years, demand for technologies for storing, batch processing, and analyzing big data has been increasing. This is because clusters can now be built from multiple general-purpose servers, which makes data archiving and parallel processing available at low cost. Many companies currently use systems of this type.

Among the processing tasks performed by enterprise systems is one in which data generated and handled by a business system, such as a core system, is moved from the database on the business system side to a database on the back-end side and batch processed there.

In this specification, a system that is located on the back-end side and executes batch processing is referred to as a “batch processing system”. After reading data from the database on the business system side, the batch processing system executes the configured batch process and saves the batch processing result in a database within its own system. Efforts are currently being made to shorten the time required for this batch processing and thereby improve operational efficiency.

For example, Patent Document 1 proposes a technique that reduces the number of data I/O operations on an external storage medium in order to reduce the load of data input and output. When multiple full-scan search requests against a database are accepted, this technique reuses the data already loaded into memory, thereby reducing the number of I/O operations on the external storage medium and improving both database access performance and batch processing performance.

Patent Document 1: Japanese Patent Laid-Open No. 08-6829

However, the invention described in Patent Document 1 requires a large amount of data to be cached in memory at once during batch processing. A batch method that consumes this much memory raises the cost of the system. Moreover, in practice, it may not be possible to write all of the batch target data into the memory of the batch processing system.

The inventor therefore focused on parallel batch processing using a general-purpose PC cluster as a low-cost approach that can still execute batch processing at high speed. At present, however, simply applying open source software such as Hadoop/HDFS to build a batch processing system does not by itself speed up batch processing, nor does it ensure the reliability of system operation.

The inventor thus came to recognize that a parallel batch processing system using a general-purpose PC cluster needs a mechanism that can read data into it at high speed.

The inventor proposes a method in which data is transferred from a database server serving as the data source to a distributed file system by the following procedure.

Specifically, the proposed method executes, during a batch execution cycle, (1) processing in which a batch management system extracts update information on the data stored in the database server a plurality of times, (2) processing in which the batch management system stores and manages, on the basis of the update information, the data updated between the previous update processing and the current update processing, and (3) processing in which the batch management system sequentially transfers the update data to the distributed file system.

According to the present invention, the update status (update data) of the database server serving as the data source can be reflected in the distributed file system periodically during the batch execution cycle. As a result, the amount of data transferred to the distributed file system at the start of batch processing can be reduced, and the time required for batch processing can be greatly shortened. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiment.

FIG. 1 is a diagram showing the overall configuration of a system according to the embodiment.
FIG. 2 is a diagram showing the overall procedure of batch processing according to the embodiment.
FIG. 3 is a diagram outlining the hash value management of a table.
FIG. 4 is a diagram showing a table example of the table difference management DB.
FIG. 5 is a diagram showing a table example of the hash value management DB.
FIG. 6 is a flowchart explaining the difference data extraction processing procedure.
FIG. 7 is a flowchart explaining the batch processing procedure.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. The present invention is not limited to the embodiment described below, and various modifications are possible within the scope of its technical idea.

[Embodiment]
(System configuration)
FIG. 1 shows the overall configuration of the system assumed in this embodiment. The user terminal 101 is connected to the business server 103 via the LAN 102. The user accesses the business service 106 from the user terminal 101. The business server 103 is basically a computer and consists of an arithmetic device, a large-scale storage device such as a hard disk, a communication interface, and peripheral devices. The business service 106 is realized through program processing on the computer.

A batch management server 104, which manages parallel batch processing executed in the background, and a batch processing cluster 115, which actually executes the parallel batch processing, are connected to the business server 103 via the LAN 102. The batch processing cluster 115 consists of a plurality of batch processing servers 105. In this specification, the batch management server 104 alone, or a system composed of the batch management server 104 and the batch processing cluster 115, is referred to as a “batch management system”.

The storage device of the business server 103 stores the business service DB 107 and its transaction log 108. The business service DB 107 stores the data used by the business service, and this stored data is the target of batch processing. The business service DB 107 runs in a mode in which it outputs and saves the transaction log 108. The transaction log 108 is saved in a public directory of the business server 103 and can be downloaded by other servers that have been granted access.

The batch management server 104 is also basically a computer and consists of an arithmetic device, a storage device such as a hard disk, a communication interface, and peripheral devices. The batch management service 109 runs on the batch management server 104 and is realized through program processing on the computer.

The batch management service 109 manages and controls batch processing across the entire system. For example, it manages the batch processing target data in the business service DB 107 (the data in specific columns of the tables subject to batch processing) and the updates made to that data since the previous access. The storage device of the batch management server 104 stores a table difference management DB 110 and a hash value management DB 111.

Each batch processing server 105 in the batch processing cluster 115 is also basically a computer. A batch application 112 and a distributed processing framework 113 are installed on each batch processing server 105. The distributed processing framework 113 is a program that provides coordinated data storage and parallel processing across the cluster nodes, and it also controls the execution of the batch application 112. Each batch processing server 105 further includes a secondary storage device 114, which is a storage device such as a hard disk device.

The distributed processing framework 113 serves as the basis of the distributed file system used to share data among the cluster nodes. One example of a distributed processing framework is Apache Hadoop. Hadoop stores data on HDFS, a distributed file system, and performs parallel distributed processing with the MapReduce framework. This embodiment assumes Hadoop as the distributed processing framework, but another distributed processing framework with a similar mechanism may be used.

(Batch processing flow)
FIG. 2 illustrates the batch processing flow executed in the system shown in FIG. 1. It shows the flow of processing operations among the business server 103, the batch management server 104, and the batch processing cluster 115.

On the business server 103, the transaction log 108 is output whenever the business service DB 107 is accessed.

On the batch management server 104, the batch management service 109 periodically accesses the business server 103 and downloads the transaction log 108. In this embodiment, the download is executed periodically, independent of when the transaction log 108 is generated. Alternatively, the batch management server 104 may download the transaction log 108 in response to its generation.

The batch management service 109 is configured in advance with which tables and columns, among the tables that make up the business service DB 107, are to be fetched for batch processing. On the basis of the transaction log 108, the batch management service 109 detects what has been updated since the previous download and updates the table difference management DB 110, which manages only the column data of the specified tables.
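As a rough illustration of this filtering, the following Python sketch keeps only the changes that touch the predefined tables and columns. The record layout (`table`, `key`, `changes`), the `TARGETS` mapping, the table and column names, and the function name are all hypothetical; the specification does not describe the format of the transaction log.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical configuration: which columns of which tables feed the batch.
TARGETS: Dict[str, Set[str]] = {
    "members": {"member_no", "name", "monthly_purchase"},
}


def extract_target_updates(log_records: Iterable[dict]) -> List[dict]:
    """Keep only the changes made to the configured tables and columns."""
    updates = []
    for rec in log_records:
        columns = TARGETS.get(rec["table"])
        if columns is None:
            continue  # this table is not a batch processing target
        changed = {c: v for c, v in rec["changes"].items() if c in columns}
        if changed:  # drop records that touch only non-target columns
            updates.append(
                {"table": rec["table"], "key": rec["key"], "changes": changed}
            )
    return updates
```

Restricting the difference table to the columns the batch actually needs is what keeps both the update time and the stored data volume small.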

In this embodiment, the batch management service 109 calculates a hash value over the data of each entry group that contains an updated part and registers it in the hash value management DB 111. The hash value is calculated every time a difference entry is updated. The hash values of these entry groups are registered and managed so that, when batch processing is actually executed on the distributed processing framework, it can be checked efficiently whether the batch processing input data matches the data actually present in the business service DB 107.

This hash value calculation is effective when the reliability of the distributed file system provided by the distributed processing framework is not assured. For example, open source software such as Hadoop/HDFS is, at present, not reliable enough to be used as a foundation for batch processing. However, adopting the consistency check described later can raise the reliability of a distributed file system built with open source software such as Hadoop/HDFS to a level equivalent to that of a conventional system.

The batch management service 109 sequentially transfers only the updated entries to the batch processing cluster 115. Because only updated entries are transferred, the amount of data sent to the batch processing cluster 115 is very small compared with the amount of data in the business service DB 107. The data of updated entries is also transferred sequentially during the batch execution cycle of the batch processing cluster 115. The batch processing cluster 115 reflects the data it receives in the batch processing input data on the distributed file system provided by the distributed processing framework.

In this way, this embodiment adopts a scheme in which only the difference data is updated sequentially. With this scheme, a large amount of data no longer has to be transferred at once when the batch processing cluster 115 actually executes batch processing. The time needed to read the data at the start of the batch is therefore shorter than in a conventional system, and as a result the time from the start to the end of batch processing can be shortened considerably.

Before starting batch processing, the batch processing cluster 115 checks the consistency of the batch processing input data with the batch management server 104. Specifically, it compares the hash values calculated from the batch processing input data with the hash values registered in the hash value management DB 111, and executes the batch processing once consistency has been confirmed.

After the batch processing finishes, the batch processing cluster 115 calculates the aggregate values and stores them in the storage device. The batch processing cluster 115 transfers the calculated aggregate values as batch processing output data 202 to the business service DB 107 of the business server 103, and reflects the batch processing result in the business service DB 107 on the basis of this batch processing output data 202.

(Hash value management)
This section describes, with reference to FIG. 3, how the hash values of a table are managed. The table shown in FIG. 3 is data stored in the table difference management DB 110. Given table data, the batch management service 109 extracts groups of entries of a specified number of rows starting from the top, calculates a hash value over the concatenation of the individual values in each group, and registers that hash value in the hash value management DB 111.

In the example of FIG. 3, the batch management service 109 extracts data 10 rows at a time. For the first 10 rows, it simply concatenates “001-00001”, “Ichiro Sato”, “34,000”, “001-00002”, “Hanako Yamada”, “8,200”, ..., “001-00010”, “Taro Suzuki”, “102,100” as binary values into a single binary value, calculates the hash value h1 over that binary value, and registers it. Similarly, the batch management service 109 calculates and registers the hash value h2 for rows 11 through 20, and continues in the same way for the remaining rows, registering each hash value in the hash value management DB 111.
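The grouping and hashing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the specification names neither the hash function nor the byte encoding, so SHA-256 and UTF-8 are choices made here, and the function names are hypothetical.

```python
import hashlib
from typing import List, Sequence, Tuple

Row = Tuple[str, str, str]  # (member number, name, monthly purchase amount)


def group_hash(rows: Sequence[Row]) -> str:
    """Hash the plain concatenation of every value in a group of rows."""
    h = hashlib.sha256()  # assumption: the source does not name a hash function
    for row in rows:
        for value in row:
            h.update(value.encode("utf-8"))  # assumption: UTF-8 byte encoding
    return h.hexdigest()


def hash_groups(rows: Sequence[Row],
                rows_per_group: int = 10) -> List[Tuple[str, str, str]]:
    """Return (start key, end key, hash) for each consecutive group of rows."""
    result = []
    for i in range(0, len(rows), rows_per_group):
        group = rows[i:i + rows_per_group]
        result.append((group[0][0], group[-1][0], group_hash(group)))
    return result
```

Note that plain concatenation, as described in the text, cannot distinguish different field boundaries (for example "ab" followed by "c" versus "a" followed by "bc"); an implementation that cares about this could add a delimiter or length prefix, which would be a departure from the simple scheme described here.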

When a difference entry is updated, the batch management service 109 only needs to recalculate the hash value of the data portion that contains the updated entry (10 rows of data in this embodiment); it does not need to recalculate hash values over all the data in the table each time. This speeds up the hash value calculation.
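A minimal sketch of this partial recalculation, assuming rows are addressed by position and grouped 10 at a time, might look like the following. SHA-256, the in-memory dictionary standing in for the hash value management DB, and all names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Set


def group_hash(rows: List[List[str]]) -> str:
    """Hash the concatenation of all values in a group (SHA-256 is assumed)."""
    data = "".join(v for row in rows for v in row).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def refresh_hashes(rows: List[List[str]], hashes: Dict[int, str],
                   updated_rows: Set[int], rows_per_group: int = 10) -> Set[int]:
    """Recompute hashes only for the groups that contain an updated row.

    `updated_rows` holds row positions; the return value is the set of
    group indexes whose hash was recomputed.
    """
    dirty = {i // rows_per_group for i in updated_rows}
    for g in dirty:
        start = g * rows_per_group
        hashes[g] = group_hash(rows[start:start + rows_per_group])
    return dirty
```

With this layout, the cost of an update is proportional to the group size rather than to the size of the whole table.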

When dividing the table data, it is desirable to choose the specified number of rows with the block size managed by HDFS (64 MB by default) in mind. For example, the number of rows is set so that a constant multiple of the data size of one group of that many rows equals the HDFS block size. Setting the number of rows in this way prevents any group of rows from straddling block files, which enables efficient reading and writing of the batch processing input data on HDFS.
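The alignment condition can be made concrete with a small calculation. The sketch below assumes that every row is serialized to the same fixed number of bytes, which the specification does not state; the function name and parameters are hypothetical.

```python
def aligned_rows_per_group(block_size: int, row_bytes: int, max_rows: int) -> int:
    """Largest row count <= max_rows whose group size divides block_size evenly.

    Assumes fixed-width rows of `row_bytes` bytes each. Returns 0 if no
    such row count exists (for example, when no multiple of row_bytes
    divides block_size).
    """
    for n in range(max_rows, 0, -1):
        if block_size % (n * row_bytes) == 0:
            return n
    return 0


# Example: a 64 MiB block and 64-byte rows, with groups of at most 1500 rows.
BLOCK = 64 * 1024 * 1024
rows = aligned_rows_per_group(BLOCK, 64, 1500)
groups_per_block = BLOCK // (rows * 64) if rows else 0
```

When the row size does not divide the block size, no exact alignment exists under this assumption, and padding rows to a suitable width would be one way to restore it.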

(Table structure example)
This section describes example tables of the table difference management DB 110 and the hash value management DB 111 managed by the batch management server 104.

FIG. 4 shows a table example of the table difference management DB 110. It assumes a DB of member information for an EC (electronic commerce) site. The master table of member information is stored in the business service DB 107 on the business server 103 side.

The member number 401, name 402, and monthly purchase amount 403 are the attributes of the master table stored in the business service DB 107 that are required for batch processing. The table difference management DB 110 holds a partial table consisting of these attributes. This structure reduces both the data update time and the amount of data in the table difference management DB 110. The table difference management DB 110 also holds an update flag 404. The update flag 404 is an attribute used to identify entries that were updated between the previous data extraction from the business service DB 107 and the current one. The value “1” is set for an entry that has been updated, and the value “0” for an entry that has not. The batch management service 109 decides whether to recalculate a hash value on the basis of the value of the update flag 404.

FIG. 5 shows a table example of the hash value management DB 111. In this embodiment, the attributes of the table are a table ID 501, a start entry ID 502, an end entry ID 503, and a hash value 504. The table ID 501 is the ID of a table stored in the table difference management DB 110. In this embodiment, the table shown in FIG. 4 has the table ID “T1”. The start entry ID 502 and the end entry ID 503 are the key values of the first and last entries of the entry group over which the hash value is calculated. The primary key of table T1 is the member number, and the start entry ID and end entry ID are defined in terms of this member number. The hash value 504 holds the hash value calculated for the specified entry group by the method described with reference to FIG. 3.
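To make the two table layouts concrete, the following Python sketch models one row of each table and the flag-based decision on whether a hash record needs recalculation. The class, field, and function names are hypothetical, and the key-range test assumes zero-padded member numbers so that string comparison matches key order.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DiffEntry:
    """One row of the table difference management DB (FIG. 4)."""
    member_no: str         # 401, primary key
    name: str              # 402
    monthly_purchase: int  # 403
    updated: int = 0       # 404: 1 = updated since the previous extraction


@dataclass(frozen=True)
class HashRecord:
    """One row of the hash value management DB (FIG. 5)."""
    table_id: str        # 501
    start_entry_id: str  # 502
    end_entry_id: str    # 503
    hash_value: str      # 504


def needs_rehash(record: HashRecord, entries: Iterable[DiffEntry]) -> bool:
    """True if any flagged entry falls inside the record's key range.

    Assumes zero-padded keys, so lexicographic order equals key order.
    """
    return any(
        e.updated == 1
        and record.start_entry_id <= e.member_no <= record.end_entry_id
        for e in entries
    )
```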

(Difference data extraction processing procedure)
FIG. 6 shows a flowchart of the difference data extraction processing procedure executed by the batch management service 109. The batch management service 109 acquires the transaction log 108 output by the business service DB 107 on the business server 103 (step 601).

Next, on the basis of the acquired transaction log, the batch management service 109 extracts the entry data that has been updated in the business service DB 107 (step 602).

If the extracted data includes values that correspond to attributes of the batch processing table stored in the table difference management DB 110, the batch management service 109 reflects the values of those updated entries in the table difference management DB 110 (step 603). When reflecting this data, the batch management service 109 sets the value of the update flag 404 to “1”.

Once the values of all updated entries have been reflected in the table difference management DB 110, the batch management service 109 searches for updated entries by referring to the value of the update flag 404 and calculates the hash value of each specified entry group that contains an updated entry (step 604). The batch management service 109 then registers the calculated hash values in the hash value management DB 111 (step 605) and reflects the update data in the batch processing input data on the distributed file system (step 606).

This series of steps is executed periodically by the batch management service 109. As a result, the update data of the business service DB 107 is reflected in the batch processing input data sequentially, even during the batch execution cycle. No bulk data migration therefore occurs when batch processing is executed, and the overall batch processing time can be shortened.
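Steps 601 to 606 can be sketched end to end with in-memory dictionaries standing in for the three stores. Everything below is illustrative: the record layout, the column names, SHA-256, the handling of unknown keys, and the clearing of the update flag after transfer are assumptions that the specification does not spell out.

```python
import hashlib
from typing import Dict, List

ROWS_PER_GROUP = 10  # assumed group size, matching the FIG. 3 example
COLUMNS = ("member_no", "name", "monthly_purchase")  # hypothetical column names


def run_extraction_cycle(log: List[dict], diff_table: Dict[str, dict],
                         hash_db: Dict[int, str],
                         dfs_input: Dict[str, dict]) -> None:
    # Steps 601-602: `log` stands in for the updated entries extracted
    # from the downloaded transaction log.
    # Step 603: reflect each update and set the update flag to 1.
    for rec in log:
        entry = diff_table.get(rec["key"])
        if entry is None:
            continue  # assumption: unknown keys are ignored in this sketch
        entry.update({c: v for c, v in rec["changes"].items() if c in COLUMNS})
        entry["updated"] = 1

    keys = sorted(diff_table)
    # Step 604: find the groups that contain a flagged entry and rehash them.
    dirty = {i // ROWS_PER_GROUP for i, k in enumerate(keys)
             if diff_table[k]["updated"] == 1}
    for g in dirty:
        group = keys[g * ROWS_PER_GROUP:(g + 1) * ROWS_PER_GROUP]
        data = "".join(str(diff_table[k][c]) for k in group for c in COLUMNS)
        # Step 605: register the hash value.
        hash_db[g] = hashlib.sha256(data.encode("utf-8")).hexdigest()

    # Step 606: push only the flagged entries to the batch input data.
    for k in keys:
        if diff_table[k]["updated"] == 1:
            dfs_input[k] = {c: diff_table[k][c] for c in COLUMNS}
            diff_table[k]["updated"] = 0  # assumption: flag cleared after transfer
```

Calling this function on a timer between batch runs corresponds to the periodic execution described above.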

(Batch processing procedure)
FIG. 7 shows a flowchart of the batch processing procedure executed by the batch processing cluster 115. When batch processing starts, the batch processing cluster 115 first checks the consistency of the batch processing input data (step 701). Specifically, the batch processing cluster 115 calculates a hash value for each block data group and compares it with the corresponding value stored in the hash value management DB 111. Here, block data is a part of the batch processing input data that corresponds to the entry data of the specified number of rows.

Next, the batch processing cluster 115 checks whether there is any block data group whose hash value does not match (step 702). If such a block data group exists (affirmative result in step 702), the batch processing cluster 115 extracts the mismatched portion from the table difference management DB 110, corrects the batch processing input data (step 703), and then executes the batch processing (step 704). If no such block data exists (negative result in step 702), the batch processing cluster 115 determines that consistency has been confirmed and executes the actual batch processing (step 704).
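The check-and-repair logic of steps 701 to 703 can be sketched as follows. Dictionaries keyed by a block id stand in for the batch processing input data, the hash value management DB, and the table difference management DB; this layout, SHA-256, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List

Block = List[List[str]]  # rows of string values for one block data group


def block_hash(rows: Block) -> str:
    """Hash the concatenation of all values in a block (SHA-256 is assumed)."""
    data = "".join(v for r in rows for v in r).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(dfs_blocks: Dict[int, Block],
                      hash_db: Dict[int, str],
                      diff_blocks: Dict[int, Block]) -> List[int]:
    """Steps 701-703: compare hashes and re-fetch any mismatched block.

    Returns the ids of the blocks that were repaired; an empty list means
    the input data was already consistent and step 704 can start directly.
    """
    repaired = []
    for block_id, expected in hash_db.items():
        actual = dfs_blocks.get(block_id)
        if actual is None or block_hash(actual) != expected:
            # Step 703: overwrite with the copy from the difference DB.
            dfs_blocks[block_id] = [list(r) for r in diff_blocks[block_id]]
            repaired.append(block_id)
    return repaired
```

Because only mismatched blocks are re-fetched, the repair cost stays proportional to the amount of damaged data rather than to the full input.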

Finally, the batch processing cluster 115 updates the business service DB 107 on the basis of the batch processing output data saved on the distributed file system (step 705).

(Summary)
As described above, during the batch execution cycle the batch management server 104 periodically extracts the update data of the business service DB 107, which serves as the data source, reflects it in the table difference management DB 110, and each time reflects the update data in the distributed file system of the batch processing cluster 115. As a result, the amount of data transferred to the distributed file system at the start of batch processing can be reduced, and the time required for batch processing can be greatly shortened.

In addition, in this embodiment the batch processing cluster 115 calculates hash values for the batch processing input data before starting batch processing, compares them with the hash values managed by the batch management server 104, and executes correction processing in advance if a problem is found. The reliability of the batch processing input data can thus be guaranteed even when the reliability of the distributed file system of the batch processing cluster 115 is in doubt.

In this way, an inexpensive batch processing system can be realized that batch processes the database data of a business system accurately and at high speed.

[Other embodiments]
The invention proposed in this specification is not limited to the embodiment described above, and various modifications are conceivable. For example, the embodiment above was described in detail to explain the invention clearly, and an implementation does not necessarily have to include every configuration and function described. Known configurations may also be added to the configuration of the embodiment, and parts of the configuration of the embodiment may be deleted or replaced.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented as, for example, an integrated circuit or other hardware. Although the above embodiments describe the case where each function is realized through the execution of programs by the computers constituting each server, each function may instead be realized through hardware resources such as a module board or an ASIC.

The control lines and information lines shown are those considered necessary for the description, and do not represent all of the control lines and information lines required in a product. In practice, almost all of the components can be considered to be connected to one another.

DESCRIPTION OF SYMBOLS 101 ... User terminal, 102 ... LAN, 103 ... Business server, 104 ... Batch management server, 105 ... Batch processing server, 106 ... Business service, 107 ... Business service DB, 108 ... Transaction log, 109 ... Batch management service, 110 ... Table difference management DB, 111 ... Hash value management DB, 112 ... Batch application, 113 ... Distributed processing framework, 114 ... Secondary storage device, 115 ... Batch processing cluster, 201 ... Batch processing input data, 202 ... Batch processing output data, 401 ... Member number, 402 ... Name, 403 ... Monthly purchase amount, 404 ... Update flag, 501 ... Table ID, 502 ... Start entry ID, 503 ... End entry ID, 504 ... Hash value.

Claims

A method for high-speed reading of batch processing target data by a batch management system connected to a database server, the method comprising:
a process in which the batch management system extracts update information on data stored in the database server a plurality of times during a batch execution cycle;
a process in which the batch management system, based on the update information, stores and manages update data for the period from the previous update process to the current update process; and
a process in which the batch management system sequentially transfers the update data to a distributed file system.

The method for high-speed reading of batch processing target data according to claim 1,
further comprising a process in which the batch management system stores and manages a hash value of the update data.

The method for high-speed reading of batch processing target data according to claim 2, further comprising:
a process in which the distributed file system updates batch processing input data based on the update data;
a process in which the distributed file system, before starting batch processing, checks consistency by collating the batch processing input data against the hash value;
a process in which the distributed file system executes batch processing on the batch processing input data whose consistency has been confirmed; and
a process in which the distributed file system reflects the data after batch processing in the database server as batch processing output data.

The method for high-speed reading of batch processing target data according to claim 1,
wherein the batch management system acquires, as the update information, a transaction log output by a database constituting the database server.

The method for high-speed reading of batch processing target data according to claim 1,
wherein the batch management system extracts the update data in groups of a specified number of entries, combines the values of the extracted entries, calculates a first hash value, and stores and updates the first hash value, and
wherein the distributed file system extracts, from the batch processing input data, a data portion corresponding to the specified number of entries, calculates a second hash value, and collates the second hash value with the first hash value.

The method for high-speed reading of batch processing target data according to claim 5,
wherein the data size of the specified number of entries is set so as to match the file size used in the batch processing.

The method for high-speed reading of batch processing target data according to claim 5,
wherein, if the first hash value and the second hash value do not match, the distributed file system acquires the entry group corresponding to the collation result from the batch management system, reflects it in the batch processing input data, and executes batch processing after consistency has been ensured.

A batch management system connected to a database server, comprising:
a database that, every time update information on the data stored in the database server is extracted during a batch execution cycle, stores and manages update data for the period from the previous update process to the current update process; and
a processing unit that sequentially transfers the update data to a distributed file system that executes batch processing.