JP4502063B2

JP4502063B2 - Data warehouse system, query processing method used therein, data collection method and apparatus therefor, and billing system

Info

Publication number: JP4502063B2
Application number: JP2008317833A
Authority: JP
Inventors: 格西澤; 真二藤原; 一智牛嶋; 茂和猪原
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2010-07-14
Anticipated expiration: 2018-11-11
Also published as: JP2009146425A

Description

The present invention relates to a data warehouse system in a distributed network computing environment (hereinafter, a distributed environment), a query processing method used therein, a data collection method and apparatus therefor, and a billing system and method.

As falling prices have made computer systems widespread, and with the emergence of highly reliable software and the demand for more efficient social systems, a wide variety of information is being put online and used. In business, for example, operational data such as store sales, product management information, and customer information is now processed by computer. Data warehouse systems have come into wide use out of the desire to exploit this data, originally used only for core business operations, for other purposes as well, such as surveying product sales trends and analyzing customer preferences. Non-Patent Document 1, for example, describes how a data warehouse is configured and used. As the name suggests, a data warehouse system is a warehouse for data, and it is spreading as a means of storing and managing the enormous volume of data used in core business operations.

In recent years it has become clear that examining the data stored and managed in a data warehouse from various angles can yield new information that was previously overlooked. In one example, analysis of a supermarket's sales data revealed an association between two seemingly unrelated products: "men on their way home from work on weekends buy beer and diapers together." The store was reportedly able to increase sales by placing beer and diapers close to each other. Techniques for finding such previously overlooked, useful information in data are called data mining.

In parallel with the spread of computers, network technology, typified by the Internet, has also advanced remarkably. By using this network technology together with distributed network infrastructure technology, typified by CORBA (Common Object Request Broker Architecture) and described for example in Section 1 of Non-Patent Document 2, it is becoming possible to use various kinds of information over a network.

Against this background, it is natural that attempts should arise to obtain even more useful information by integrating data held in multiple databases and data warehouses on a network and applying techniques such as data mining. Regarding the integrated use of databases, research on heterogeneous databases, federated databases, multidatabases, and the like has long been active, mainly in academia, as described for example in Non-Patent Document 3, and many schemes for combining multiple databases have been discussed. Most of that research, however, has focused on data heterogeneity, that is, on how to integrate and use dissimilar data.

However, building a data warehouse system in a distributed environment raises many performance problems, which stem from the large scale of the data and from the fact that queries against it are more complex than conventional database searches. As to scale, data warehouses of several TB (terabytes; 10^12 bytes) had been built as of March 1998. As to query complexity, a good example is the set of queries in the "TPC BENCHMARK D (Decision Support) Standard Specification", Revision 1.2.2, Transaction Processing Performance Council, a standard benchmark widely accepted in the industry that models decision support workloads such as data warehousing and data mining. For example, running the series of TPC-D queries against 1 TB of data took several tens of minutes to several hours even on the world's fastest computers as of May 1998.

As shown in FIG. 11, a data warehouse system is typically used in a client-server arrangement: a client 1101 sends a request to a server 1102, which stores and manages the data, manages a storage device 1105, and executes query processing, and the client then receives the result.

However, consider a client-server arrangement in a distributed environment, as shown in FIG. 14, in which an unspecified number of users (clients) 1401 to 1402 send query requests (1406) over a network 1405 to an unspecified number of servers 1403 to 1404, such as data warehouses and databases, and obtain the results (1407). Given the heavy load of the query processing described above, it is easy to see that servers accepting requests from an unspecified number of clients will become heavily loaded and that the processing of clients' analysis requests will be delayed.

When data on multiple servers is to be analyzed, one conceivable extension of the client-server data warehouse system is shown in FIG. 12: a module 1202 that manages server location information uses server location information 1203 to forward a query 1207 from a client 1201 over a network 1204 to servers 1205 to 1206, and a query result 1208 is returned to the client. The Virtual Data Warehouse System (hereinafter, VDW) from INTERSOLV is a good example of such an implementation. Because VDW manages the server locations, the client can work with data from multiple servers without being aware of where it resides. Like the client-server data warehouse system in the distributed environment described above, however, VDW places a heavy load on the servers during query processing, so it can hardly be called a suitable implementation of a data warehouse system in a distributed environment.

Patent Document 1 discloses a method for processing queries against multiple databases or data warehouses in a distributed environment. To reduce the processing load on the client, this method forwards queries to a cluster server. The cluster server forwards each query to the appropriate databases, integrates the results obtained from them, and returns the integrated result to the client. Because the query is ultimately forwarded to the servers, this method cannot reduce the server load.

To address server load and processing time, one approach, shown for example in FIG. 13, is to copy data 1307 to 1308 from servers 1305 to 1306 to a module 1309 on the client side (processes 1311 and 1312), issue a query 1313 against the copy 1310, and obtain a result 1314. Hereinafter this copy 1310 of the server data is called a replica. Executing the query against the replica avoids query processing on the servers 1305 to 1306, which reduces the server load, and avoids accessing the servers over the network, which shortens the query processing time.

However, creating replicas of multiple servers in a distributed environment in this naive way requires a large storage device 1315 on the client side to hold them. For example, a client that wants to use ten servers, each holding about 300 GB (gigabytes; 10^9 bytes) of data, would by simple calculation need 300 GB × 10 = 3 TB of storage, and with current technology it is nearly impossible to provide storage of that size on the client side. In addition, creating the replica transfers a large amount of data from the servers to the client over the network, placing a heavy load on the network. Furthermore, when server data is updated after a replica has been created, the corresponding replica must be updated too. Because the cost of this update is proportional to the size of the replica, it also becomes too large to ignore. This approach, too, can hardly be called a suitable implementation of a data warehouse system in a distributed environment.

In contrast, as described in Non-Patent Document 4, a method has been proposed that, instead of creating replicas, caches queries together with their results and answers new queries by reusing the cached results, thereby reducing server load and query processing time. This method is very effective when the reuse rate of query results is high. In a data warehouse in a distributed environment, however, the gap between the scale of the target data and the storage a client can provide is so large that the reuse rate of cached data becomes very low, making the method inefficient.

Patent Document 2 discloses an information processing apparatus and system, and a control method therefor, for acquiring files from a server over a network and providing them to users. In this method a replica is created when the system receives a file reference request from a user. Applied to a data warehouse system, this means that when a user issues a query, the first query must still be sent to the server, so the response time of the first query cannot be shortened. Moreover, because the unit of replication in this method is a file, it is difficult to create replicas at the level of the records or objects that match a database query condition.

There are two ways to propagate data updates on a server to a client (which corresponds to the data collection means of the present invention, described later): a server-driven PUSH method, in which the server sends data to clients at fixed intervals, for example every hour, or whenever server data is updated; and a client-driven PULL method, in which the client fetches data from the server at fixed intervals or when needed. Among PUSH methods, having the server deliver data to each client individually increases the server load, while having the server broadcast or multicast the data so that each client picks up only what it needs makes it difficult for a client to obtain data at the appropriate time. Efficient data delivery in a distributed environment is therefore difficult with PUSH alone. With the PULL method, on the other hand, applications with strict requirements, where client data must be updated as soon as server data changes, force clients to check the server frequently; when many clients issue such requests frequently, the load of processing them on the server becomes high. Efficient data delivery in a distributed environment is therefore difficult with PULL alone as well. A method that combines PUSH and PULL is described, for example, in Non-Patent Document 5 (hereinafter, CQ). In CQ, a query containing a trigger condition is registered by the client on a CQ server; data is transferred by client-driven PULL the first time and by server-driven PUSH thereafter, according to the trigger condition in the query. However, CQ does not allow PUSH or PULL to be chosen for each query, so data transfer ends up being server-driven PUSH, and the PUSH problem of high server load reappears.

Non-Patent Document 1: W. H. Inmon, "Building the Data Warehouse, Second Edition", John Wiley & Sons, Inc., ISBN 0-471-14161-5
Non-Patent Document 2: Robert Orfali, Dan Harkey, "Client/Server Programming with JAVA(R) and CORBA, Second Edition", John Wiley & Sons, Inc., ISBN 0-471-24578-X
Non-Patent Document 3: A. Sheth, J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases", ACM Computer Surveys, Vol. 22, No. 3, pp. 183-236; and A. Sheth, G. Karabatis, "Multidatabase Interdependencies in Industry", Proc. of 1993 ACM SIGMOD, Vol. 22, pp. 483-486
Non-Patent Document 4: A. Keller, J. Basu, "A Predicate-based Caching Scheme for Client-Server Database Architectures", The VLDB Journal, Vol. 5, No. 1, pp. 35-47
Non-Patent Document 5: C. Pu, L. Liu, "Update Monitoring: The CQ Project", Lecture Notes in Computer Science, Vol. 1368, ISSN 0302-9743, pp. 396-411
Non-Patent Document 6: Mary Campione, Kathy Walrath, "The Java(R) Tutorial", Addison-Wesley, ISBN 0-201-63454-6
Non-Patent Document 7: Jeffrey D. Ullman, "PRINCIPLES OF DATABASE AND KNOWLEDGE-BASE SYSTEMS", VOLUME II, COMPUTER SCIENCE PRESS, ISBN 0-7167-8162-X, Chapter 14, "Optimization for Conjunctive Queries"
Patent Document 1: JP-A-8-286960
Patent Document 2: JP-A-9-297702

When there are many servers, such as databases and data warehouses, and an unspecified number of clients access them over a network to obtain useful information by using the server data in an integrated way, each existing approach has problems. Forwarding queries to the servers increases server load, creates a heavy dependence on the network, and lengthens query response time. Creating replicas on the client side increases network load because of the large data transfers, increases the storage capacity needed on the client side, and increases the cost of updating replicas. With caching, the hit rate is low, so cached data is rarely reused. As a result, it has been difficult to build an efficient data warehouse system in a distributed environment.

An object of the present invention is to provide an efficient data warehouse system for a distributed environment, a query processing method used therein, and a data collection method and apparatus therefor.

More specifically, the first object is to reduce server load, the second is to reduce dependence on the network, the third is to shorten query response time, the fourth is to reduce network load, the fifth is to reduce the storage capacity required of clients, the sixth is to reduce the cost of updating replicas, and the seventh is to improve the replica hit rate.

To achieve the first to third objects, the present invention provides data collection means that creates and manages replicas of the server data used in client query processing, and executes client queries using the replicas in the data collection means wherever possible. Executing queries in the data collection means reduces the number of queries forwarded to the servers, which reduces server load (the first object). Because replicas of the server data are created in the data collection means, queries can still be processed using those replicas even when the network to a server is degraded, which reduces dependence on the network (the second object). Furthermore, a server is heavily loaded because many clients access it, and its query processing cost is high (in a billing system, its fee is high) because it manages a large amount of data. Instead of processing queries on such a server, the data collection means processes queries from a limited number of clients using replicas that contain only the necessary data, which shortens query response time (the third object).

Describing the query processing more specifically: the data collection means determines whether a query from a client can be processed by the data collection means itself, can be processed by another, cooperating data collection means, or must be forwarded to a server. If the query can be processed by the data collection means itself or by a cooperating data collection means, that data collection means processes it. Using cooperating data collection means increases the number of queries that data collection means can process, which contributes to the first to third objects. Even when it is determined that the query must be forwarded to the server, only the part of the query that cannot be handled with the replicas available to the data collection means itself and to cooperating data collection means is forwarded. This reduces the amount of data the server processes, which reduces server load, and it also reduces the result data returned from the server to the client, which reduces network load (the fourth object).
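The routing decision described in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the patent: it assumes that a query and each replica can be described as a half-open range `[lo, hi)` over a single attribute, and the names `subtract`, `intersect`, and `plan_query` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open range [lo, hi) over one attribute (assumption)


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlap of two ranges, or None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def subtract(needed: List[Interval], covered: Interval) -> List[Interval]:
    """Return the parts of `needed` that `covered` does not cover."""
    lo_c, hi_c = covered
    out: List[Interval] = []
    for lo, hi in needed:
        if hi <= lo_c or hi_c <= lo:  # no overlap: keep the range unchanged
            out.append((lo, hi))
            continue
        if lo < lo_c:                 # uncovered piece on the left
            out.append((lo, lo_c))
        if hi_c < hi:                 # uncovered piece on the right
            out.append((hi_c, hi))
    return out


def plan_query(query: Interval,
               own: List[Interval],
               peers: List[Interval]) -> Dict[str, List[Interval]]:
    """Split a query into parts served by own replicas, by cooperating
    data collectors, and the remainder that must go to the server."""
    remaining = [query]
    plan: Dict[str, List[Interval]] = {"own": [], "peer": [], "server": []}
    for source, replicas in (("own", own), ("peer", peers)):
        for rep in replicas:
            hits = [p for p in (intersect(r, rep) for r in remaining) if p]
            plan[source].extend(hits)
            remaining = subtract(remaining, rep)
    plan["server"] = remaining  # only this part is forwarded to the server
    return plan
```

In this sketch own replicas are consulted first and cooperating collectors second, so the server only sees what neither can cover; an empty `"server"` list means the query never leaves the data collection layer.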

Furthermore, to solve the problems that arise when replicas are created naively and to achieve the fourth to sixth objects, clients that can share data are grouped, a replica is created for each group of clients and shared among them, and, where necessary, a data collection means works together with other, cooperating data collection means. Sharing replicas reduces the amount of data transferred from the servers to the data collection means, which reduces network load. Because replicas are created in the data collection means in the present invention, the storage capacity required of clients is naturally reduced (the fifth object), and sharing replicas among data collection means also reduces the storage capacity needed in the data collection means. In addition, sharing replicas reduces the total amount of replica data in the system, which reduces the cost of updating replicas (the sixth object).
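The storage saving from sharing a replica within a client group can be illustrated with a rough count. The sketch below assumes, purely for illustration, that server data is divided into named units and that each client needs a known set of them; the function name `replica_units`, the client labels, and the unit names are hypothetical.

```python
from typing import Dict, Set, Tuple


def replica_units(client_needs: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (units stored with one replica per client,
    units stored with one replica shared by the whole group)."""
    per_client = sum(len(units) for units in client_needs.values())
    shared = len(set().union(*client_needs.values()))
    return per_client, shared


needs = {
    "client103": {"sales_1997", "sales_1998", "customers"},
    "client104": {"sales_1998", "customers", "products"},
}
print(replica_units(needs))  # prints (6, 4)
```

The more the clients' needs overlap, the larger the gap between the two numbers. The same count bounds the data transferred from the servers and the amount of replica data that must be refreshed on each update.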

When creating a replica, the data collection means accepts and holds a replica creation request, given by a user via a client, that includes conditions on data quality, such as accuracy, freshness, and priority, and conditions on the range of data to be collected. Taking into account the resources available to it, such as storage capacity and CPU performance, the data collection means negotiates with the server that supplies the data and creates a replica that satisfies part or all of the replica creation request. Having the user supply the replica creation request makes it possible to collect the data the user intends, which improves the replica hit rate (the seventh object). Adjusting data quality when the replica is created also makes it possible to create a replica of a size appropriate to the computing resources of the data collection means, which reduces network load (the fourth object), reduces the storage capacity required of clients and data collection means (the fifth object), and reduces the cost of updating replicas (the sixth object).
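As a rough sketch of how a quality condition might be traded against available storage, consider the following. The text does not define how accuracy is measured or how the negotiation proceeds, so everything here is an assumption: accuracy is treated as the fraction of matching rows kept in the replica, and the names `ReplicaRequest`, `negotiate_accuracy`, and `min_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaRequest:
    rows_in_range: int  # rows matching the range condition (assumed known from the server)
    row_bytes: int      # assumed average row size in bytes
    accuracy: float     # assumed meaning: fraction of matching rows to keep, 1.0 = all
    freshness_s: int    # maximum tolerated staleness in seconds (not used for sizing)
    priority: int       # larger means more important (not used for sizing)


def negotiate_accuracy(req: ReplicaRequest,
                       free_bytes: int,
                       min_accuracy: float = 0.1) -> Optional[float]:
    """Return the accuracy at which the replica fits in `free_bytes`,
    or None if even `min_accuracy` cannot be met."""
    full_bytes = req.rows_in_range * req.row_bytes
    if full_bytes * req.accuracy <= free_bytes:
        return req.accuracy              # the request is satisfied in full
    feasible = free_bytes / full_bytes   # largest fraction that still fits
    return feasible if feasible >= min_accuracy else None
```

For a request covering 1,000,000 rows of 100 bytes at full accuracy, a collector with 50 MB free would settle on an accuracy of 0.5 in this model, which corresponds to satisfying only part of the request.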

In replica update processing, combining the client-driven PULL method with the server-driven PUSH method makes it possible to reduce the server load during replica updates while still taking the clients' data requirements into account. This achieves the first object.
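One way to picture combining PULL and PUSH is to pick, for each replica, whichever method costs the server fewer messages. The rule below is only an illustrative assumption, since the text does not state how the choice is made: PULL is modeled as one poll per freshness interval, PUSH as one message per server update, and the function names are hypothetical.

```python
def polls_per_hour(freshness_s: float) -> float:
    """Polls a client must issue per hour to stay within `freshness_s` seconds."""
    return 3600.0 / freshness_s


def choose_mode(freshness_s: float, updates_per_hour: float) -> str:
    """Pick the update method that costs the server fewer messages per hour."""
    if updates_per_hour < polls_per_hour(freshness_s):
        return "PUSH"  # data changes rarely relative to the freshness demand
    return "PULL"      # data changes often; polling at the tolerated interval is cheaper
```

Under this model a replica that must be at most 60 seconds stale but whose source changes twice an hour is better served by PUSH, while a replica that tolerates an hour of staleness over frequently changing data is better served by PULL.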

With the data warehouse configuration method of the present invention, grouping clients, for example, allows data to be shared among clients and reduces client-side storage capacity, update processing cost, and network load. In addition, by negotiating with the data-supplying server on the basis of the data requirements given by the user when a replica is created, a replica that is heavily used by the user's queries can be created while taking into account the computing resources of the data collection means that creates it. Processing client queries with such replicas reduces server load, making it possible to build a practical data warehouse system and to realize a query processing method that uses it.

FIG. 1 shows a preferred implementation of a data warehouse system according to the present invention. Clients 103 and 104 are connected to data collection means 1 (101) via an internal network 128.

The internal network 128 may be a local area network connected by Ethernet(R), optical fiber, or FDDI, and a client may be any computer system, such as a Hitachi FLORA personal computer or a Hitachi 3050 creative workstation. Data collection means 1 includes the following units. A client management unit 105 groups and manages multiple clients, accepts replica creation requests and queries issued by clients, and forwards replica creation requests to a replica creation request analysis unit 106 and queries to a query analysis unit 109. The replica creation request analysis unit 106 decides whether to actually create a replica in response to a client's replica creation request and, if so, forwards a replica description, which is information about the replica to be created, to a replica creation management unit 107. The replica creation management unit 107 stores and manages a replica 123 in a storage device 112 while referring to a replica management table 108. The query analysis unit 109 analyzes queries from clients. A query processing unit 127 processes a client's query when it can be processed in data collection means 1. A communication control unit 110 manages communication with servers and other data collection means via a network 113. For more advanced processing, the data collection means may also include a data collection means negotiation processing unit 111, which is described later. Like a client, the data collection means may be any computer system, and the storage device 112 may be a magnetic storage device, an optical disk device, magnetic tape, or the like. The network 113 may be a local area network as described above or a wide area network connecting multiple geographically distributed sites.

Server 1 (114) includes a communication control unit 115 that accepts replica creation requests and queries from data collection means, a query processing unit 117 that processes queries, a delivery data management unit 118 that manages the data to be delivered to data collection means, and a delivery data management table 120 that the delivery data management unit refers to. For more advanced processing, a server negotiation processing unit 116 and a load management unit 119 may also be provided; these are described later. Like the data collection means, the server may be any computer system, and the storage device 121 that stores server data 124 may be any storage device, such as a magnetic storage device, an optical disk device, magnetic tape, or CD-ROM, or a combination of such devices. The server data and the replica data in the data collection means may be managed by a file system or by a general-purpose database management system such as HITACHI HiRDB.

The client management unit, replica creation request analysis unit, replica creation management unit, query processing unit, communication control unit, and data collection means negotiation processing unit in the data collection means, as well as the communication control unit, query processing unit, server negotiation processing unit, load management unit, and delivery data management unit on the server, may be implemented not only as dedicated hardware but also as programs. Each may be a program stored locally in the storage device 112 of the data collection means or the storage device 121 of the server, or a program downloaded from a server on the network that stores programs. To execute programs downloaded from a server safely in a heterogeneous distributed environment, a programming language such as the one described in Section 4 of Non-Patent Document 6 (hereinafter, an Internet programming language) can be used.

In particular, a preferred embodiment of the client runs a web browser such as Microsoft Internet Explorer or Netscape Navigator on any computer system, such as a Hitachi FLORA personal computer or a Hitachi 3050 creative workstation. In this case, the client program can be changed dynamically by downloading program modules written in the Internet programming language into the browser.

This embodiment shows the clients connected directly to data collection means 1 to n via the internal network, but the present invention is of course equally effective when the clients are connected to the data collection means via a LAN or the Internet, as shown in FIG. 18.

More specific features of the present invention can be summarized in the following five points. (1) A partial replica that can be shared by a group of clients is created in the data collection means. (2) To create a replica, a replica creation request from a user is accepted, the data to be actually replicated is determined through negotiation with the server, and a replica description recording the result is created and managed. (3) A replica creation request contains not only conditions on the range of data (data area conditions) but also conditions on the quality of data (data quality conditions). (4) Replicas are updated with a data transfer scheme that combines a server-driven PUSH method and a client-driven PULL method, using the delivery data management table held on the server. (5) A query that can be processed with a replica in the local data collection means is processed there; a query that cannot is forwarded to another data collection means that can process it, or to the server. Each point is described below with examples.
(1) As stated in the discussion of the prior art, it is difficult to realize a data warehouse system in a distributed environment with a simple client-server configuration such as that shown in FIG. 14. The present invention therefore provides a data collection means that collectively manages clients able to share data. The data collection means suppresses the creation of overlapping, redundant replicas on the client side and prevents unnecessary accesses to the server, which reduces the server load. For example, in FIG. 15, client 1 (1501) needs the data of the product sales table with sales of 10,000 yen or more (1503), and client 2 (1502) needs the data of the same table with sales of 5,000 yen or more (1504). If the product sales data 1509 is stored on the server 1505, clients 1 and 2 must each obtain data from that server, so the server bears the load of queries from both. Suppose instead that the data collection means 1506 of the present invention is installed, that it holds the replica creation requests of clients 1 and 2, and that it adopts the union of their conditions, "data with product sales of 5,000 yen or more" (1507), as the replica description, obtains that data from the server, and creates the replica 1508. The queries of clients 1 and 2 can then be processed with the data in the data collection means, so their accesses to the server are eliminated and both the server load and the network load caused by data transfer are reduced. If the union of the replica creation requests of clients 1 and 2 cannot be held because of the capacity of the storage device 1610 of the data collection means, the CPU processing capacity of the data collection means 1606, or limits on the amount of data transferred over the network, then, as shown in FIG. 16, the intersection of the conditions of the replica creation requests (1603, 1604) of client 1 (1601) and client 2 (1602), "data with product sales of 10,000 yen or more" (1607), is used as the replica description, and the corresponding data is obtained from the server to create the replica. Every processing request from client 1 that falls within its replica creation request can then be processed with the replica. For client 2, only the processing requests that cannot be processed with the replica, that is, queries referring to "data with product sales of 5,000 yen or more and less than 10,000 yen", need to be forwarded to the server, which again reduces the server load and the network load. The load reductions described with FIGS. 15 and 16 clearly grow as the number of clients increases.
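
The choice between the union of the clients' conditions (FIG. 15) and their intersection (FIG. 16) can be sketched as follows. This is only a minimal illustration, assuming every request is a simple lower bound on sales in yen; the names `choose_replica_threshold`, `fits_capacity`, and `must_forward_to_server` are hypothetical and do not appear in the specification.

```python
from typing import Callable, List


def choose_replica_threshold(requested_minimums: List[int],
                             fits_capacity: Callable[[int], bool]) -> int:
    """Pick the lower bound used in the replica description.

    Each request asks for rows with sales >= its minimum. The union of such
    conditions is the smallest minimum; the intersection is the largest.
    """
    union_minimum = min(requested_minimums)
    if fits_capacity(union_minimum):
        return union_minimum          # FIG. 15 case: replica covers every client
    return max(requested_minimums)    # FIG. 16 case: fall back to the intersection


def must_forward_to_server(query_minimum: int, replica_minimum: int) -> bool:
    """A query needs the server when it asks for rows below the replica's bound."""
    return query_minimum < replica_minimum


# Clients 1 and 2 request sales >= 10000 yen and >= 5000 yen respectively.
requests = [10000, 5000]
roomy = choose_replica_threshold(requests, lambda minimum: True)
# Assumed capacity limit: only the data at or above 10000 yen fits.
tight = choose_replica_threshold(requests, lambda minimum: minimum >= 10000)
```

With the intersection as the replica description, a query from client 2 that reaches below 10,000 yen is the only kind that still has to go to the server.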
(2) The acceptance of a replica creation request and the replica creation method are described with reference to FIGS. 1, 2, and 8. When data collection means 101 receives a replica creation request from clients 103 to 104 (process 202; replica creation requests are described in (3)), the replica creation request analysis unit 106 analyzes the request (process 203) and, by communicating with the replica creation management unit 107, refers to the replica management table 108, whose contents are shown in FIG. 8 (process 204). The replica management table has entries for a data area condition 801, which concerns the region of the data to be created, and a data quality condition 802, which concerns the quality of the data. Together these two entries are called the replica description 803. Each replica description entry is accompanied by replica location information 804, which names the data collection means storing the replica; server location information 805, which names the server that is the data source of the replica; and a data delivery condition 806, which is the maintenance condition of the replica. For example, the first entry 807 in FIG. 8 indicates that a replica consisting of three columns (order number, price, and customer number) taken from the records of the order table whose price is 10,000 yen or more is held in data collection means 1, that the source data of the replica is on server 1, and that server 1 delivers data at 13:00 by the PUSH method to maintain the replica. In the data delivery condition of the second entry in FIG. 8, the notation {1:00, 13:00}, PUSH means that data is delivered by the PUSH method at both 1:00 and 13:00.
When the requested replica can be created from an existing replica in the local data collection means (Yes at decision 205), the replica creation process ends without creating a new replica. For example, if the replica management table is as shown in FIG. 8 and the new replica creation request is the first entry 1901 in FIG. 19, the request can be served from the existing replica in data collection means 1 shown at 807 in FIG. 8, so decision 205 yields Yes.

When the requested replica cannot be created from the existing replica 123 in the local data collection means (No at decision 205), it is determined whether the requested replica can be created from the existing replica 126 of the cooperating data collection means 102 (decision 208). If it can (Yes at decision 208), it is then determined whether to create a further replica in the local data collection means (decision 216). If no duplicate replica is to be created locally (No at decision 216), the replica creation process ends without creating a replica. If a duplicate replica is to be created locally (Yes at decision 216), the replica is created from the replica of the cooperating data collection means 102 under the conditions requested by the client (process 215), and the replica creation process ends (219).

For example, if the replica management table is as shown in FIG. 8 and the new replica creation request is the second entry 1902 in FIG. 19, the existing replica in data collection means 1 cannot satisfy the request, but the replica in data collection means 2 shown at 808 in FIG. 8 can. In such a case, if the storage device of data collection means 1 has room for the replica, a new replica is created in data collection means 1 from the replica in data collection means 2. In addition, if the client's replica creation request to data collection means 1 has a high priority, a lower-priority replica in data collection means 1 is deleted and a replica conforming to the request is created.

Whether a replica based on a new replica creation request can be created from an existing replica is determined by comparing the replica management table, which describes the contents of the existing replicas, with the newly given replica creation request. That is, the data collection means holds the set of replica creation requests and the replica descriptions generated from combinations of those requests; when a new replica creation request arrives, the data collection means holds it, compares it with the entries of the replica management table, and determines equivalence and containment relations. In the earlier example, where the new replica creation requests of FIG. 19 are given against the entries of the replica management table of FIG. 8, the new request represented by entry 1902 is evidently contained in the existing replica represented by entry 808. This embodiment assumes very simple replica creation requests, but for more general cases, Non-Patent Document 7, for example, discloses methods called "Query Equivalence" and "Query Containment" for examining equivalence and containment relations between conditions. Using these determination methods, replica management based on the replica management table of the present invention can be realized in more general cases as well.
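
For the simple conditions assumed in this embodiment, the containment and equivalence tests might look like the sketch below. It assumes a data area condition consists of a table name, a set of columns, and a lower bound on price; the class `AreaCondition`, its field names, and the request values are hypothetical, and the general method of Non-Patent Document 7 is not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AreaCondition:
    table: str
    columns: FrozenSet[str]
    min_price: int  # selects rows with price >= min_price


def is_contained(request: AreaCondition, replica: AreaCondition) -> bool:
    """True when every row and column the request needs is in the replica."""
    return (request.table == replica.table
            and request.columns <= replica.columns
            and request.min_price >= replica.min_price)


def is_equivalent(a: AreaCondition, b: AreaCondition) -> bool:
    """Two conditions are equivalent when each contains the other."""
    return is_contained(a, b) and is_contained(b, a)


# Modeled on entry 807 of FIG. 8; the two requests below are made-up examples.
replica_807 = AreaCondition(
    "order", frozenset({"order_no", "price", "customer_no"}), 10000)
narrow_request = AreaCondition("order", frozenset({"order_no", "price"}), 20000)
wide_request = AreaCondition("order", frozenset({"order_no", "price"}), 5000)
```

Here `narrow_request` can be served from the replica, while `wide_request` reaches below the replica's price bound and cannot.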

When the replica requested by the client can be created neither from an existing replica in the local data collection means nor from an existing replica in a cooperating data collection means (No at decision 208), negotiation with the server is performed (process 211). If the negotiation finds no replica creation condition acceptable to both the data collection means and the server (No at decision 212), the replica creation process ends without creating a replica (219). If it finds a condition acceptable to both (Yes at decision 212), a replica is created according to that condition (process 215), and the replica creation process ends (219).
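
The replica creation flow of FIG. 2 described above can be condensed into a single decision function. The sketch below is illustrative only: the four boolean inputs stand in for the outcomes of decisions 205, 208, and 216 and of the negotiation, and the `Outcome` names are hypothetical labels rather than terms from the specification.

```python
from enum import Enum


class Outcome(Enum):
    REUSE_LOCAL = "reuse existing local replica"
    REUSE_COOPERATING = "rely on cooperating replica, create nothing"
    COPY_FROM_COOPERATING = "create local replica from cooperating replica"
    CREATE_FROM_SERVER = "create replica under negotiated condition"
    NOT_CREATED = "negotiation failed, create nothing"


def decide_replica_creation(local_can_serve: bool,
                            cooperating_can_serve: bool,
                            duplicate_locally: bool,
                            negotiation_succeeds: bool) -> Outcome:
    if local_can_serve:                            # decision 205
        return Outcome.REUSE_LOCAL
    if cooperating_can_serve:                      # decision 208
        if duplicate_locally:                      # decision 216
            return Outcome.COPY_FROM_COOPERATING   # process 215
        return Outcome.REUSE_COOPERATING
    if negotiation_succeeds:                       # process 211, decision 212
        return Outcome.CREATE_FROM_SERVER          # process 215
    return Outcome.NOT_CREATED
```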

The negotiation of step 211 is described with reference to FIGS. 1 and 3. First, the replica creation request is transferred from data collection means 101 to server 114 (process 302). If the server accepts the request (Yes at decision 303), the replica creation condition (in the billing system, the fee) is set to the condition requested by the data collection means (process 310); based on that condition, the replica creation management unit 107 of the data collection means updates the replica management table 108 and the delivery data management unit 118 of the server updates the delivery data management table 120, and the negotiation ends (311). If the server cannot accept the request (No at decision 303) and has no new condition to offer the data collection means (No at decision 312), the negotiation ends without any information about the replica creation request being set (311). If the server cannot accept the request (No at decision 303) but does have a new condition to offer (Yes at decision 312), the server transfers the new condition to the data collection means (process 306). If the data collection means accepts the condition offered by the server (Yes at decision 307), the new condition becomes the replica creation condition (process 310), the replica management table on the data collection means side and the delivery data management table on the server side are updated, and the negotiation ends (311). If the data collection means cannot accept the condition offered by the server (No at decision 307), the negotiation ends without any information about the replica creation request being set.
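
The single round of offer and counter-offer in FIG. 3 can be sketched as below. This is a minimal illustration under stated assumptions: conditions are represented as plain strings, and the callables `server_accepts`, `server_counter_offer`, and `collector_accepts` are hypothetical stand-ins for decisions 303, 312, and 307.

```python
from typing import Callable, Optional


def negotiate(requested: str,
              server_accepts: Callable[[str], bool],
              server_counter_offer: Callable[[str], Optional[str]],
              collector_accepts: Callable[[str], bool]) -> Optional[str]:
    """Return the agreed replica creation condition, or None if none is agreed."""
    if server_accepts(requested):                # decision 303
        return requested                         # process 310
    offer = server_counter_offer(requested)      # decision 312
    if offer is None:
        return None                              # nothing is set
    if collector_accepts(offer):                 # process 306, decision 307
        return offer                             # process 310
    return None


# Hypothetical conditions: the server refuses hourly delivery but offers 13:00.
agreed = negotiate("PUSH hourly",
                   server_accepts=lambda condition: False,
                   server_counter_offer=lambda condition: "PUSH at 13:00",
                   collector_accepts=lambda condition: True)
```

When the result is not `None`, the caller would go on to update the replica management table and the delivery data management table, as the text describes.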

A specific example of the negotiation is described with reference to FIGS. 1 and 10. Suppose the replica creation requests from clients 103 to 104 have the conditions shown in FIG. 10(A). First, data collection means 1 (101) transfers replica creation request 1001 to server 1 (114) via the network 113. From the current load of its own system, which is managed by the load management unit 119, server 1 predicts the load it would bear if it accepted the request. If the predicted value is at or below a threshold, server 1 accepts the request and, as shown in FIG. 10(B), returns an acceptance response (1003) to the data collection means; the replica creation request becomes the replica creation condition and the negotiation ends. In contrast, suppose replica creation request 1002 is transferred to server m (122) and the load prediction shows that server m cannot accept a request under the conditions of request 1002 but could accept one under a new condition such as that shown at 1004. Server m then returns a conditional acceptance response to data collection means 1 together with the new condition it generated. Data collection means 1 returns the new condition to the client 103 or 104 that issued the replica creation request. If the client can accept the new condition, it becomes the replica creation condition, the replica management table on the data collection means side and the delivery data management table on the server side are updated, and the negotiation ends.

The following embodiments are conceivable for measuring the load of the server's own system and predicting its load if the replica creation request is accepted. A multitasking operating system generally has a queue of runnable processes called the "run queue", and the time average of the length of this queue is called the load average. A server may, for example, regard its load average L as its load, estimate its load after accepting the new replica creation request as L + 1, and accept the request if the predicted load is at or below a threshold L_MAX (that is, L + 1 ≤ L_MAX). A server that cannot measure the load average may instead regard the average number of jobs N per fixed period as its load, estimate its load after accepting the request as N + 1, and accept the request if the predicted value is at or below a threshold N_MAX (that is, N + 1 ≤ N_MAX).
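
The admission rule L + 1 ≤ L_MAX is simple enough to state directly in code. The sketch below is illustrative: the function names are hypothetical, and using the one-minute figure returned by Python's `os.getloadavg()` (available on Unix-like systems) as L is an assumption, since the text does not say which averaging window is meant.

```python
import os


def accepts_request(current_load: float, max_load: float) -> bool:
    """Accept when the predicted load (current load + 1) stays within the limit."""
    predicted_load = current_load + 1.0
    return predicted_load <= max_load


def current_load_average() -> float:
    """One-minute load average of this machine (assumed choice of window)."""
    one_minute, _five_minutes, _fifteen_minutes = os.getloadavg()
    return one_minute
```

The same `accepts_request` function covers the fallback case if the average number of jobs N and the threshold N_MAX are passed in place of L and L_MAX.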

(3) First, the replica creation request is described. In the embodiment shown in FIG. 15, for example, clients 1 and 2 collect product sales data and analyze it. In data mining, the data to be analyzed is often subject to some constraint; for instance, a given client may analyze only the sales data for Tokyo. Such a constraint reflects the intention of the person performing the analysis with the client, and at present it is difficult, and often wasteful, for a computer to infer that intention automatically. The present invention therefore provides the client with an interface through which the user can give the system a replica creation request for the data to be accessed. As with the data collection means and the server, this interface may be implemented as a program stored locally on the clients 1501 and 1502 in FIG. 15 or as a program downloaded from a server on the network that stores programs. In accordance with the replica creation request issued by the user through this interface, a replica 1508 that the clients can share is created in the data collection means 1506.

As shown in FIG. 6, the replica creation request includes a data area condition (601) indicating the range of the data, a data quality condition (603) indicating the quality of the data, and a data delivery condition (604) indicating how the data is delivered. The data quality conditions include, for example, a freshness condition 605, under which the pre-update data in the replica is regarded as current for up to one hour even after the data has been updated on the server; an accuracy condition 606, under which a 10% sample over the order numbers of the order table is replicated; and a priority condition 607, under which only the ten highest-priced records of the order table are replicated. For compatibility with conventional query languages, a replica creation request may also be written without any data quality conditions.

If a replica creation request contains no freshness condition, the server's data must always be checked whenever a client requests the latest data. If it contains no accuracy condition, the accuracy is taken to be 100%; if it contains no priority condition, data of all ranks is retrieved. As shown in FIG. 7, the accuracy condition can be applied in various ways depending on the type of data: sampling and field extraction for relational database records; summarization and keyword extraction for documents; lossless or lossy compression, contour extraction, color reduction, resolution reduction, and size reduction for still images; reduction of the number of frames and intra-frame image compression for moving images; and sound quality adjustment and conversion to text for audio.
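
One way to picture the request of FIG. 6 together with the defaults just described is a record whose quality fields are optional. The sketch below is only an assumed representation: the class `ReplicaCreationRequest`, its field names, and the string form of the conditions are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ReplicaCreationRequest:
    area_condition: str                     # data area condition (601)
    delivery_condition: str                 # data delivery condition (604)
    freshness: Optional[timedelta] = None   # None: no freshness condition given
    accuracy: float = 1.0                   # default 100 %, i.e. no sampling
    top_n: Optional[int] = None             # None: data of all ranks

    def must_check_server_for_latest(self) -> bool:
        """Without a freshness condition, latest-data requests go to the server."""
        return self.freshness is None


# A request without quality conditions, and one using the examples 605 to 607.
plain = ReplicaCreationRequest("price >= 10000", "13:00 PUSH")
tuned = ReplicaCreationRequest("price >= 10000", "13:00 PUSH",
                               freshness=timedelta(hours=1),
                               accuracy=0.1, top_n=10)
```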

Including data quality conditions in the replica creation request makes it possible to reduce the size of the replica. As for adjusting accuracy, the method disclosed in Japanese Patent Application Laid-Open No. 09-025863, "Aggregation Result Estimation Method in Database Processing System", allows results to be estimated with high accuracy from a small sample, which is very effective in building a data warehouse system. The freshness condition, moreover, makes it possible to handle flexibly the update problem that is much debated in distributed systems.

For example, as shown in FIG. 17, suppose the freshness constraint is specified as one day, the replica was last updated at 6:00 on October 23, 1997, and the current time is 22:00 on October 24 of the same year. The data on the server may have been updated in the meantime, but because a freshness constraint of one day has been given, the data at the current time, D_A(T_C), is regarded as the data at the previous update time, D_A(T_P). This avoids a data transfer over the network every time the server data is updated and so reduces the network load. Of course, for uses in which updates must be reflected as soon as the server data changes, the freshness condition can simply be left unspecified.

Furthermore, applying a priority constraint such as 607 in FIG. 6 reduces the amount of data transferred. Constraint 607 specifies that only the ten highest-priced records of the order table are to be replicated, so far less data is transferred than when the entire order table is replicated.
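
The effect of a priority condition such as 607 can be illustrated with a top-N selection. The sketch below uses Python's standard `heapq.nlargest`; the order records and the field names `order_no` and `price` are made up for the example.

```python
import heapq
from typing import Dict, List


def top_n_by_price(orders: List[Dict[str, int]], n: int) -> List[Dict[str, int]]:
    """Keep only the n highest-priced orders, highest first."""
    return heapq.nlargest(n, orders, key=lambda order: order["price"])


# A made-up order table of 100 rows; only 10 of them need to be transferred.
orders = [{"order_no": i, "price": 1000 * i} for i in range(1, 101)]
replica_rows = top_n_by_price(orders, 10)
```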

(4) The data delivery conditions for replica updates are described with reference to FIGS. 1 and 5. When the data 124 managed by server 1 (114) is updated (process 502), the delivery data management unit 118 of the server refers to the delivery data management table 120 (process 503). If the updated data is not registered in the delivery data management table (No at decision 504), the update process ends without any data being delivered (506). If the updated data is registered (Yes at decision 504), the data is delivered to the delivery destination according to the data delivery condition in the delivery data management table (process 505), and the update process ends.

A specific example of the replica update process is described with reference to FIGS. 1 and 9. Server 1 (114) stores the order table data 124 in the storage device 121. Suppose new order data 907 is inserted into the order table. The delivery data management unit 118 checks the updated data against the delivery data management table. Because the updated data satisfies entries 905 and 906 of the table, it is delivered to data collection means 1 at 13:00 by the server-driven PUSH method in accordance with entry 905, and to data collection means 2 at 1:00 and 13:00 by the server-driven PUSH method in accordance with entry 906. For the 13:00 delivery, server 1 can multicast the data and thereby deliver to data collection means 1 and 2 in a single operation, which reduces the server load and the network load. Going a step further, if the data delivery condition of entry 906 were one that encompasses the condition of entry 905, such as "once between 11:00 and 15:00" or "once a day", the number of data transfers could be reduced by aligning the transfer for entry 906 with the data transfer condition of entry 905.
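
The grouping of PUSH deliveries by time, which is what makes the 13:00 multicast possible, can be sketched as follows. The delivery times follow entries 905 and 906 as described above, but the price thresholds used as data area conditions are assumptions, since the contents of FIG. 9 are not reproduced in the text; `DeliveryEntry` and `plan_push_deliveries` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class DeliveryEntry(NamedTuple):
    min_price: int                # assumed data area condition: price >= min_price
    destination: str
    push_times: Tuple[str, ...]   # server-driven PUSH times


def plan_push_deliveries(updated_price: int,
                         table: List[DeliveryEntry]) -> Dict[str, List[str]]:
    """Map each delivery time to the destinations that can share one multicast."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for entry in table:
        if updated_price >= entry.min_price:      # decision 504
            for push_time in entry.push_times:
                plan[push_time].append(entry.destination)
    return dict(plan)


delivery_table = [
    DeliveryEntry(10000, "data collection means 1", ("13:00",)),         # like 905
    DeliveryEntry(5000, "data collection means 2", ("1:00", "13:00")),   # like 906
]
plan = plan_push_deliveries(12000, delivery_table)
```

Any time slot that maps to more than one destination is a candidate for a single multicast transfer.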

(5) Query processing using replicas is described with reference to FIG. 1 and FIG. 4. When the client 103 issues a query to the data collection means 101, the query is forwarded to the query analysis unit 109 via the client management unit 105. The query analysis unit first communicates with the replica creation management unit 107 and refers to the replica management table (process 403). From the replica management table, the query analysis unit determines whether the query can be processed within its own data collection means. If it can (Yes at decision 404), the query is processed within that data collection means (process 407), the answer is sent to the client (process 415), and query processing ends (416). If it cannot (No at decision 404), the query analysis unit determines whether a cooperating data collection means can process the query. If so (Yes at decision 408), the query is forwarded to the cooperating data collection means (process 411), the answer is received (process 412) and sent to the client (415), and query processing ends. If the query can be processed neither within the own data collection means nor by a cooperating data collection means (No at decision 409), the query is forwarded to the server (process 413), the answer is received (process 414) and sent to the client (process 415), and query processing ends (416).
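The three-tier fallback described above (own replica, then a cooperating data collection means, then the server) can be sketched as a small routing function. This is a minimal sketch under assumed interfaces: `route_query` and the `(can_answer, run)` callable pairs are hypothetical stand-ins for the replica management table lookup and the actual query execution.

```python
def route_query(query, local, peers, server):
    """Return (where, answer) following the local -> peer -> server order.

    local:  (can_answer, run) pair for the own data collection means
    peers:  dict mapping a peer name to its (can_answer, run) pair
    server: callable that always answers the query
    """
    can_local, run_local = local
    if can_local(query):                      # decision 404: Yes
        return "local", run_local(query)      # process 407
    for name, (can_peer, run_peer) in peers.items():
        if can_peer(query):                   # a cooperating means can answer
            return name, run_peer(query)      # processes 411 and 412
    return "server", server(query)            # processes 413 and 414
```

In every branch the caller would then send the returned answer to the client (process 415); returning the place where the query was answered makes it easy to observe how often the server is actually contacted.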

Whether a query from a client can be processed using a replica is determined by comparing the query with the replica management table, which describes the contents of the existing replicas. That is, an entry in the replica management table (hereinafter, an entry) is compared with the query; if the query is equivalent to the entry, or is contained in the entry, the query can be processed using the corresponding replica. This determination is equivalent to determining whether the new replica creation request can be processed using an existing replica. As explained for the processing of the new replica creation request, even in the general case the determination can be made by using the methods called "Query Equivalence" and "Query Containment" disclosed, for example, in Non-Patent Document 7. Query processing that uses the replicas of the own data collection system and of cooperating data collection systems according to the present invention can therefore be realized.
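For intuition, the containment test can be illustrated on a deliberately restricted query class: a selection over one table with inclusive range conditions. The sketch below is not the general Query Containment method of Non-Patent Document 7; the `Selection` class, the `contained_in` function, and the sample entries are hypothetical and only loosely modeled on the order-table example in this description.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Selection:
    """SELECT columns FROM table WHERE lo <= attr <= hi for each listed range."""
    table: str
    columns: set
    ranges: dict = field(default_factory=dict)  # attr -> (lo, hi), inclusive


def contained_in(query: Selection, replica: Selection) -> bool:
    """Conservatively decide whether the replica alone can answer the query."""
    if query.table != replica.table:
        return False
    # The replica must store every column the query outputs or filters on.
    needed = set(query.columns) | set(query.ranges)
    if not needed <= set(replica.columns):
        return False
    # Every range that restricts the replica must cover the query's range.
    for attr, (r_lo, r_hi) in replica.ranges.items():
        q_lo, q_hi = query.ranges.get(attr, (-math.inf, math.inf))
        if q_lo < r_lo or q_hi > r_hi:
            return False
    return True


# Hypothetical replica entry: order number and price of orders priced >= 20000.
replica = Selection("orders", {"order_no", "price"}, {"price": (20000, math.inf)})

same = Selection("orders", {"order_no", "price"}, {"price": (20000, math.inf)})
cheap = Selection("orders", {"order_no"}, {"price": (-math.inf, 3000)})
by_customer = Selection("orders", {"customer_no"}, {"price": (7500, 7500)})
```

Here `contained_in(same, replica)` holds because the query is equivalent to the entry, while `cheap` fails the range check and `by_customer` fails because the replica does not store the customer number. The check errs on the side of answering "no", which only costs an extra forward to another data collection means or to the server.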

A specific example of query processing using replicas is described with reference to FIG. 1 and FIG. 8. Suppose that the client 103 issues the query "Find the order number and the price of every order priced at 20000 yen or more." The client management unit 105 of data collection means 1, which receives the query from the client, forwards it to the query analysis unit 101. The query analysis unit 101 refers to the replica management table 108, determines that the query can be processed with the replica in its own data collection means 1 that corresponds to entry 807, and processes the query using that replica. As a result, the query is not forwarded to the server. Next, consider the case where the client 103 issues the query "Find the order number of every order priced at 3000 yen or less." In this case, by referring to the replica management table, the query analysis unit determines that the query can be processed with the replica in data collection means 2 that corresponds to entry 808, and forwards the query to data collection means 2. Finally, consider the case where the query "Find the customer number of the customer who placed an order priced at 7500 yen" is issued. In this case, the replica management table of FIG. 8 contains no entry that can process the query, so the query is judged impossible to process using a replica and is forwarded to the server.

In the above description, it is assumed that the data collection means can obtain the name of a server from which the necessary data can be acquired by using a distributed network infrastructure technology such as CORBA. Needless to say, even where this is not possible, the present invention can be implemented by having the data collection means itself manage information about the servers.

FIG. 1 is a block diagram of a first embodiment of the data warehouse system according to the present invention.
FIG. 2 is a flowchart showing the replica creation procedure.
FIG. 3 is a flowchart showing the negotiation procedure.
FIG. 4 is a flowchart showing the query processing procedure.
FIG. 5 is a flowchart showing the update procedure.
FIG. 6 is a diagram showing a replica creation request issued by a client in an embodiment of the present invention.
FIG. 7 is a diagram showing the precision adjustment method for each data type in an embodiment of the present invention.
FIG. 8 is a diagram showing the replica management table in an embodiment of the present invention.
FIG. 9 is a diagram showing delivery data information and update data in an embodiment of the present invention.
FIG. 10 is a diagram showing the information used during negotiation in an embodiment of the present invention.
FIG. 11 is a diagram showing how a data warehouse is used in the prior art.
FIG. 12 is a diagram showing a data warehouse configuration in a distributed environment in the prior art.
FIG. 13 is a diagram showing a data warehouse configuration in a distributed environment in the prior art.
FIG. 14 is a diagram showing a many-to-many configuration of clients and servers in a distributed environment in the prior art.
FIG. 15 is a diagram showing one example of a method of grouping clients in the data collection means according to the present invention.
FIG. 16 is a diagram showing another example of a method of grouping clients in the data collection means according to the present invention.
FIG. 17 is a diagram showing how freshness conditions are used in an embodiment of the present invention.
FIG. 18 is a block diagram of a second embodiment of the data warehouse system according to the present invention.
FIG. 19 is a diagram showing one example of a replica creation request in the present invention.

Explanation of symbols

103, 104, 1101, 1201, 1301, 1401, 1402, 1501, 1502, 1601, 1602, 1806, 1807: client terminals
112, 121, 1105, 1209, 1210, 1315, 1316, 1317, 1408, 1409, 1510, 1511, 1610: storage devices
113, 1204, 1304, 1405, 1802: networks
114, 122, 1102, 1205, 1206, 1305, 1306, 1403, 1404, 1505, 1605, 1803, 1805: servers
101, 102, 1506, 1606, 1801, 1804: data collection means
1202: server location information management module
1203: server location information
123, 126, 1310, 1508, 1608: replicas
124, 125, 1307, 1308, 1509, 1609: data managed by the servers
108: replica management table
120: delivery data management table
1507, 1607: replica descriptions

Claims

A data warehouse system comprising:
a client computer system that accepts processing requests from users;
a server computer system that comprises a database and searches the database according to an access request from the client computer system; and
a data collection means that is provided between the client computer system and the server computer system and has a storage device,
wherein the data collection means includes:
a client management unit that manages the client computer system;
a query analysis unit that analyzes the access request from the client computer system;
a communication control unit that determines an access procedure for the server computer system based on the analysis result of the query analysis unit;
a replica creation request analysis unit that receives, from the client computer system, a replica creation request containing requirements on the quality of the data, and determines creation conditions for a replica by analyzing the replica creation request created based on those requirements; and
a replica creation management unit that creates a partial replica of the database in the storage device in accordance with the creation conditions, and manages the replica using a replica management table that associates information about the creation conditions and the storage location of the created replica,
and wherein the query analysis unit includes:
a determination processing unit that, in response to an access request from the client computer system, determines whether query processing using the replica is possible;
a query processing unit that, when the determination processing unit determines that query processing using the replica is possible, processes the query using the replica without forwarding the query to the server computer system; and
a transfer processing unit that forwards the access request to the server computer system when the determination processing unit determines that query processing using the replica is not possible.

The data warehouse system according to claim 1, wherein the server computer system comprises:
a communication control unit that receives, through a network, the access request forwarded from the data collection means; and
a query processing unit that searches the database included in the server computer system and creates a response to the received access request.

The data warehouse system according to claim 2, wherein the replica creation request analysis unit further comprises a processing unit that, when the requirements specify data freshness and the difference between the time at which the data was updated in the server computer system and the current time is smaller than a predetermined value, treats the data in the data collection means as the data at the current time.