JP4285770B2 - Index specification for relational databases

Info

Publication number: JP4285770B2
Application number: JP52257597A
Authority: JP
Inventors: レンジー、ロバート・エス
Original assignee: British Telecommunications PLC
Current assignee: British Telecommunications PLC
Priority date: 1995-12-20
Filing date: 1996-12-16
Publication date: 2009-06-24
Anticipated expiration: 2016-12-16
Also published as: JP2000502201A; EP0868699B1; AU712636B2; DE69610509T2; US6182079B1; AU1384097A; CA2240155A1; DE69610509D1; EP0868699A1; ES2152578T3; GB9526096D0; CA2240155C; WO1997022939A1

Description

発明の背景
発明の分野
本発明は、リレーショナルデータベースのためのインデックスの指定に関する。本発明はまた、インデックスを指定するためのプロセスを含むリレーショナルデータベースに関する。
従来技術の説明
データの照会（enquiry）に応答してデータベース内に含まれたデータから１セットのデータを得るように実行可能な命令が構成されるデータ処理環境が知られている。データはデータ表から直接にアクセスされてもよく、そこにおいてそれは要求された情報を得るように表内の全てのエントリをサーチするために必要である。その代りに、サーチ過程において、サーチプロセスの速度を実質的に増加させるためにインデックスを利用することもある。
大きく、多用されるデータベース用のインデックス構造の設計は、現在では実行に際して非常に困難であり、エラーが導入され易い。この問題は、人間のデータベース管理者が、日常のベースでデータベースに対してランする一般に何百あるいは何千もの異なる構造化照会言語（ＳＱＬ）ステートメントに索引（インデックス）付けする要求を検知し、それからこれらの要求をデータベース全体を通じて定められた好ましいインデックスのセットに変換することが不可能であるという技術的な要求ならびに制約があるために生じる。しかしながら、貧弱な特定がされたインデックスの設計によって、結果的にＳＱＬステートメントはプロセッサの設備を大幅に消費し、あるべきすがたよりも長時間運行する結果となり、また、結果的に装置がひどくオーバーロードにされてしまう。
長い間、ターゲットワークロードと呼ばれる典型的なＳＱＬワークロードに対して、所定のデータベースの設計に対して定義されたインデックス構造を全体的に指定する方法が必要とされてきた。しかしながら、人間のオペレータの器官での直観的なメンタルプロセスを要求せずに使用できる処理能力を利用することによって技術的な解決方法を実現することが困難である場合、この技術的な問題は解決されない。
発明の概要
本発明の第１の特徴によれば、読取り可能な形態でマシン中に記憶されたデータベースに対するインデックスのセットを指定するために、前記データベースに供給された複数のステートメントを解析し、前記データベースの表から得られたインデックスを識別し、前記インデックスが使用可能であるときに達成可能な改良された動作のレベルを評価し、前記データベースに対してインデックスのセット（組、群）を指定するために前記評価されたレベルを処理するステップを含んでいる方法が提供されている。
好ましい実施形態において、改良された動作のレベルは、前記表の性質に関連した情報から得られたデータベース表の縮小されたモデルを生成することによって評価される。一般に、縮小されたモデルは、表毎に領域中に５０００個のデータエントリを含んでいてもよい。モデルデータベースはモデル化されたライブデータベースから得られた代表的なデータエントリで充たされていることが好ましく、前記モデルは、ライブデータベースの現在のインデックスのカーディナリティを考慮することによって充たされる。さらにデータベースモデルはライブデータベースの現在のインデックス内のエントリの分布を考慮することによって充たすことができる。
好ましい実施形態において、データベースの統計は、ライブデータベースからデータベースモデルにコピーされる。ベースレベルのコストは付加的なインデックスがない状態でステートメントを実行するために計算されることが好ましい。コストレベルは、実行時間を評価することによって得られることが好ましい。さらに、コストレベルは、インデックスのメンテナンスオーバーヘッドを査定することによって評価されてもよい。
本発明の第２の特徴によれば、リレーショナルデータベースに対してインデックスのセットを指定するように構成され、前記データベースに供給された複数のステートメントを解析する解析手段と、前記データベースの表から得られたインデックスを識別するための識別手段と、前記インデックスが使用可能である場合に達成可能な改良された動作のレベルを評価する評価手段と、前記データベースに対してインデックスのセットを指定するために前記評価されたレベルを処理する処理手段とを具備しているインデックスのセットの指定手段が提供される。
好ましい実施形態において、可能性のあるインデックスは、前記ステートメントによって定められた述語のセットから識別される。
本発明の第３の特徴によれば、データベースのためのインデックスのセットを指定するように構成され、データ記憶手段と、データ処理手段と、前記データ記憶手段から読取り可能なプログラム命令とを具備しているデータ処理装置が提供され、そこにおいて前記処理手段は、前記命令に応答して、前記データベースに供給された複数のステートメントを解析し、前記データベースの表から得られたインデックスを識別し、前記インデックスが使用可能である場合に達成可能な改良された動作のレベルを評価し、前記データベースに対して好ましいインデックスのセットを指定するために前記評価されたレベルを処理するための手段を提供するように構成されている。
好ましい実施形態において、コストの節減は、古いＳＱＬステートメントのコストのコスト値と新しいＳＱＬステートメントのコストのコスト値を可能なインデックスによって処理することによって計算される。コストの節減は、古いコストから新しいコストを減算することによって計算されてもよい。コストの節減は、新しいそれぞれの可能なインデックスをそのそれぞれの表に関して考慮することによって表に対して計算されることが好ましい。
好ましい実施形態において、可能なインデックスは、好ましいインデックスとして指定される可能性の点から順序付けされる。インデックスの組合わせは、現在可能性のあるインデックスをランダムに組合わせ、好ましいインデックスのセットを指定するために前記評価されたレベルを処理することによって識別される。
本発明の第４の特徴によれば、マシン読取り可能な形態で記憶された複数のデータ表と、ステートメントに応答して前記データ表を処理し、前記データ表の処理を容易にするためにインデックスを生成する処理手段とを具備し、さらに、好ましいインデックスのセットを指定するために前記処理手段によって実行可能な命令を含んでいるリレーショナルデータベースが提供され、そこにおいて前記命令は、データベースに供給された複数のステートメントを解析し、前記データベースの表から得られたインデックスを識別し、前記インデックスが使用可能である場合に達成可能な改良された動作のレベルを評価し、前記データベースに対して好ましいインデックスのセットを指定するために前記評価されたレベルを処理するように構成されている。
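Brief Description of the Drawings
Figure 1 shows a telecommunications environment including a data analysis system.
Figure 2 details the data analysis system shown in Figure 1, which has a plurality of mass disk storage devices arranged to store data tables.
Figures 3A, 3B, 4A and 4B show examples of data tables of the type stored on the data storage devices shown in Figure 2.
Figures 5 and 6 show examples of indexes derived from the data contained in the table shown in Figure 3A.
Figure 7 shows the database structure established within the environment shown in Figure 2, including a process for specifying indexes.
Figure 8 shows the processing performed within the environment shown in Figure 7, including a process for specifying a set of indexes.
Figure 9 shows the process for specifying a set of indexes identified in Figure 8, including processes for modelling the live database, analysing typical SQL statements, calculating a base cost, identifying potential indexes and specifying preferred indexes.
Figure 10 details the process for modelling the live database shown in Figure 9.
Figure 11 details the process for analysing SQL statements shown in Figure 9.
Figure 12 shows a table of data generated using the process shown in Figure 11.
Figure 13 shows the process for calculating the base cost shown in Figure 9.
Figure 14 details the process for identifying and ordering potential indexes shown in Figure 9.
Figure 15 details the process for specifying a preferred set of indexes shown in Figure 9.
Figure 16 details an example of the data generated during the process shown in Figure 15.
Figure 17 shows the generation of new index combinations using the process detailed in Figure 15.
Figure 18 shows a procedure for establishing indexes within the live database.
Embodiments
The invention will now be described, by way of example only, with reference to the accompanying drawings identified above.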
遠隔通信環境が図１に示されており、そこにおいて、電話の送受話器、ファックスマシンおよびモデム等の複数の通信ユーザ装置101がそれぞれのローカルアナログライン103を介してローカル交換機102に接続されている。ローカル交換機102においてアナログ信号がデジタル化され、それに続いてデジタル領域内でスイッチングおよび再経路設定が行われる。これによって結果的に多数の呼がデジタル時分割多重化チャンネル104を通ってトランク通信ネットワーク105に導かれる。
ローカル交換機102およびトランクネットワーク105は、通常の通信のスイッチングを行い、呼が通常の方法で接続されることを許容する。さらに、一層進んだ（アドバンスト）サービスがアドバンストサービスノード106によって提供され、それによって顧客が記憶および順方向の転送ならびに個人番号の識別等のアドバンストサービスにアクセスできるようにする。顧客は、トランクネットワーク105に接続されたデジタルマルチプレクサ107を介してアドバンストサービスノード106にアクセスする。従って、呼はトランクネットワーク105を介してアドバンストサービスノード106に導かれ、その後、呼出ししている顧客に情報が送り返され、呼がトランクネットワーク105を介して端末装置101等に再度経路設定される。その代りに、アドバンスト・サービスノード106の機能がトランク遠隔通信ネットワーク105を通して分配されてもよい。
呼が生成されたとき、呼に関する料金の明細は、関連したローカル交換機102において記憶される。続いて、接続された顧客が費用を負担する使用を表すこの呼情報は、通信チャンネル109を介して中央管理装置108に供給される。この方法において、全ての料金情報は、顧客の勘定の発生および分配を管理する中央管理装置108に導かれる。
動作において、端末装置101、ローカル交換機102、トランクネットワーク105およびアドバンストサービスノード106で構成されたシステムは非常に多数の呼を接続し、その結果、動作データの集合体が生成される。第１に、このデータは、発生している呼の性質、目的地の呼の性質および呼のタイプに関する詳細を識別する。呼のタイプに関する情報によって呼は直接のローカルな呼であると識別されることもあり、あるいは呼が長距離あるいは国際間の呼であり、アドバンスト・サービスノード106によって提供されたサービスの使用を含むと識別されることもある。このデータの解析によって、少なくとも２つの顕著な利点が与えられる。第１に、システムの動作特性を表す集められたデータに応答して、システムが動作する方法を変更することができる。従って、特定の領域がその他の領域よりも実質的にアドバンストサービスを使用することが発見された場合、これらのサービスの使用可能性を最適にするためにこれらのサービスの割当てを再度経路指定することが可能である。同様に、ネットワーク使用の評価は、特にオフピーク期間中に使用可能な能力を良好に使用するための努力が行われたときに、マーケティング戦略の設計者が使用できるようにされてもよい。
実際に、ほぼ類似した照会がデータベース上で規則的なインターバルで実行される。動作の分割は特有の関心を有しており、それらの関心が規則的なベースで更新されることを要求する。非常に多数のユーザがデータベースにアクセスでき、一定期間を通じて何百あるいは何千ものＳＱＬ照会がデータベース内に含まれたデータに関して実行され、これらの照会の大部分は新しいデータが含まれた際に更新された結果を生成するために多数回実行される。
図１に示されたシステムは、トランクネットワーク105からデータを受取るように構成されたデータ解析装置110と、アドバンストサービスノード106と、中央管理装置108とを含んでいる。順に、データ解析装置は、データをトランクネットワーク105、アドバンストサービスノード106、および中央管理装置108に戻す。トランクネットワーク105およびアドバンストサービスノード106内で、データ解析装置から戻されて受信されたデータに応答してこれらのシステムの技術的動作に修正が行われる。同様に、中央管理装置108に戻されたデータによって結果的に顧客がインボイスされる（請求書を送られる）方法が変更され、すなわち、一般に顧客がネットワークを使用する方法が修正される。
データは、データ解析装置110において集められ、本来のシステムの設計者によって指定された形態で記憶される。これらの設計者は、後に要求される照会のタイプを予測するように努めるが、重要となる全ての照会を予期することは不可能である。それ故、データはリレーショナルデータベースのタームにおいてデータ解析装置に記憶される傾向があり、それによって特定の照会に応答する後続する操作が容易になる。これらの照会によって結果的にトランクネットワーク105、アドバンストサービスノード106、あるいは中央管理装置108の手順が修正される。さらに、特定の照会に応答して生成されたデータは、参照あるいは次の使用のためにデータ照合装置111において照合される。
データ解析装置110が図２において詳細に示されており、それは実質的にＩＢＭＥＳ９０００等のメインフレームコンピュータ201の周囲に配置され、５０ＭＩＰＳ（million instructions per second）で動作可能な１０個のプロセッサを有して構成されている。ユーザは、複数のネットワークで結合されたユーザ端末202を介してデータベースシステムにアクセスすることができ、データは、データ表の形態でディスク駆動装置203に記憶され、テラバイトの単位の容量のデータを記憶することができる。
トランクネットワーク105等の動作データソースと、アドバンストサービスノード106と、中央管理装置108が動作データソース204として示されている。さらに、データは外部データソース205によって示されているような別の外部ソースから受取られてもよい。トランクネットワーク105、アドバンスト・サービスノード106、あるいは中央管理装置に戻る動作制御信号流は、動作制御装置206に供給されるデータによって示され、データの照合および印刷は207において示されている。データはディスク記憶装置203上のシステム中に記憶され、出力データは照会に応答して得られ、ユーザがネットワークユーザ端末202を使用することによって始められる。ディスク記憶装置203上に保持されたデータは、図１に示されているようなトランクネットワークあるいは別の装置の動作に応答して継続して集められる。さらに、管理データもまた記憶装置上に保持されてもよく、作動データを管理データに関連させるように照会が始められてもよい。
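Examples of database tables of the type stored on the disk storage devices 203 are shown in Figures 3A, 3B, 4A and 4B. Information representing each communication event is received from the central administration unit 108. Each event is given a unique sequence number, which creates a new record in the database as shown in Figure 3A. The record is completed by identifying the date of the event, its start time, its end time, the telephone number of the customer initiating the event and the type of call. Thus, at 00:01 on 1 December 1995, the customer with telephone number 404 7241 made a type A call, which ended at 00:25. In this example arbitrary designations are given to the call types: a type A call represents a local call, a type B call represents a long-distance call, a type C call is a long-distance call placed through an operator, and a type G call represents use of an advanced service. The event identified above is recorded under event number 12345, and the next event is recorded under event number 12346. That event was initiated at 00:01 on 1 December 1995 by the customer with telephone number 386 4851, and it is again identified as a type A call.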
データ解析システム110内のデータベースはまた、図３Ｂにおいて示されたような管理データを含んでいる。図３Ｂにおいて示された表は、顧客の識別子を顧客の電話番号上にマップする。特定の顧客は異なる電話番号の複数の電話線を有していてもよく、それによって、関連した顧客を識別するために特定の電話番号が必要とされることが理解されるであろう。従って、示された例において、電話番号が404 7241である顧客は、記録303内に顧客識別番号0074895が割当てられている。同様に、記録304は、顧客識別番号057896で識別された顧客に電話番号404 7242が割当てられていることを示している。
図４Ａに示された表は、顧客の識別番号を顧客のアドレスに関連させる。従って、記録305から、識別番号0074895を有する顧客がロンドンのハイホルボーン５２に住んでいることがわかる。
一般的に、特定の都市が常に特定の地理領域内に位置しているとされると、広範囲の地理データを提供する必要はない。しかしながら、地理領域は、商業環境における変化を反映するように特定のオペレータによって調整されてもよい。町および市を特定の地域にマッピングしている別の表が与えられている。従って、記録306によって識別されているように、ロンドンは南東地域にマッピングされ、ラフバラは記録307において東部内陸地域にマッピングされている。地域のオフィスの位置を反映するために地域境界線を調整することができ、それによって、例えば、東部内陸地域は西部内陸地域と結合され、内陸地域とされる。そのような環境の下では、図４Ａに示された表を変更する必要がなく、図４Ｂに示された表を変更することが必要とされるだけであり、図４Ｂに示された表が有している記録は図４Ａに示された表よりも実質的に少ない。
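It will be appreciated that the data table shown in Figure 3A allows a data record to be read very quickly when a query is made by reference to the event number. The table shown in Figure 3A is ordered by event number, and each event number is unique to a particular record. Thus, given event number 12345, the database can quickly respond that the customer with telephone number 404 7241 made a type A call. Similarly, by referring to the table shown in Figure 3B, this telephone number can be related to a customer identifier, so that the address and region can be identified by querying the tables shown in Figures 4A and 4B respectively.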
しかしながら、図３Ａの表に示された記録内の別のフィールドに関して照会が行われた場合に問題が生じる。例えば、照会は、特定の電話番号によって開始された全てのイベントについてのリストが生成されることを要求してもよい。その代りに、照会は、特定の呼のタイプの全てのイベントに関して行われてもよい。より複雑な照会はこのタイプのエントリを使用して行われてもよく、例えば、照会は南東地域から開始されたすべての呼の詳細を要求して行われ、１０分以上続いてもよい。
先の例によれば、簡単な照会は、特定の電話番号によって開始された全てのイベントの詳細を要求して行われる。図３Ａに示された表は、電話番号のエントリの下で索引付けされておらず、それ故、プロセッサは表内の全ての記録の電話番号データフィールドをサーチする必要がある。明らかに、これにはイベント番号に関する記録の読取りよりも実質的に多くのプロセッサリソースが要求される。図３Ａに示された表はイベント番号のもとに順序付けられており、それによって、イベント番号が与えられると、特定の記録が非常に迅速に識別される。しかしながら、任意の別のフィールドの下で索引付けされないと、関心のある特定のフィールドを識別するためにサーチを行う必要がある。先に識別された一層複雑な例によれば、表内に含まれた全ての記録を通して異なるフィールドを数回サーチすることによって特定の照会を満足させる必要がある。
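In order to improve the speed at which searches are performed, an index can be generated for a particular table so that rapid searches can also be made on other data fields. As shown in Figure 5, the data contained in the table of Figure 3A has been used to generate an index based on telephone number. Each telephone number is considered in turn, and the index is built to identify every event initiated by that telephone number. Thus record 301 is identified against telephone number 404 7241, and event 12345 is recorded against this number. Subsequently, further events 14876, 15739, 15928 and 16047 were initiated by the same telephone number. The index continues until all events for the telephone number under consideration have been recorded. The index then proceeds to the next telephone number, 404 7242 in this example, against which events 13728, 14937, 15821, 14723 and so on have been recorded.
Telephone number 404 7243 is then considered together with the events recorded against it, and so on until every telephone number entry in the data table shown in Figure 3A has been considered. The index shown in Figure 5 contains the same number of records as the original table shown in Figure 3A. However, the table of Figure 5 is merely an index: for a particular telephone number, it points back to specific event numbers in the main table shown in Figure 3A. The index therefore allows a search based on telephone number to be performed quickly, after which the remaining data contained in the records of the table shown in Figure 3A can be obtained.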
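The index of Figure 5 can be pictured as a mapping from each telephone number to the event numbers recorded against it. The following is a minimal sketch of that idea, not the DB2 mechanism itself. The row layout, the function name `build_index` and the call type given to event 14876 are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows modelled on Figure 3A: (event number, telephone number, call type).
events = [
    (12345, "404 7241", "A"),
    (12346, "386 4851", "A"),
    (13728, "404 7242", "B"),
    (14876, "404 7241", "B"),  # call type assumed for illustration
]

def build_index(rows, column):
    """Map each value found in `column` to the event numbers that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row[0])
    return dict(index)

by_phone = build_index(events, 1)  # analogous to the index of Figure 5
by_type = build_index(events, 2)   # analogous to the index of Figure 6
```

Looking up `by_phone["404 7241"]` returns the event numbers directly, so the main table only needs to be read for those events rather than scanned in full.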
図５に示されたものに類似したインデックスが図６に示されており、そこにおいて、図３Ａに示された表内に含まれたデータは、イベントのタイプに関して索引付けされている。記録301は、図６に示されたインデックス中の記録601として再生される。イベント番号12345として示されたイベントは、その後生じるタイプＡの呼により生成されたものであり、このイベント番号は、イベントタイプＡのためのエントリに対してリストされたものである。その後、イベント番号13856,14024,15752および14831は、タイプＡの呼を生じるイベントであった。
その後、図６に示されているように、タイプＢの呼に対するイベントが、イベント13728を含んで記録されて記録602に配置され、順次イベントがタイプＢのイベントに対して記録される。一度、タイプＢのイベントに対する全てのイベントが記録されると、表は、イベント番号12350によって開始されたタイプＣのイベントに連続する。
一度、図５に示された電話番号に対するインデックスおよび図６に示されたタイプのイベントのインデックスが生成されると、イベント番号、電話番号およびイベントのタイプに基づいて図３Ａに示された表内の記録に迅速にアクセスするためにこれらのインデックスを使用することができる。照会に応答したとき、これらのインデックスを使用可能にすることが望ましいことは明らかである。しかしながら、インデックスが使用可能であるという利点は、記憶スペースに対する付加的な要求と併せて、インデックスの生成およびさらに重要であるインデックスの更新について含まれたコストと比較されなければならない。従って、多くの実際に具体化した場合において、使用可能なディスク記憶スペースが不十分である場合には可能性のある全てのインデックスを生成することは不可能である。時には、ディスク記憶スペースの増加が可能であるが、実際に具現化した大半の場合においては、全記憶スペースおよびインデックスの生成のために割当てられる記憶スペースの全体量に関して上限がある。
図２に示されたシステムは、非常に多量のデータを記憶するように構成されている。さらに、このデータは、新しいデータのセットを生成するために照会を行っているユーザからかなり強く要求される。これらの照会の幾つかは一回限りの性質を有しているが、ソースデータに対して行われた追加および変更に応答して要求されたデータのセットに対する変更を査定するために、照会は高い割合で比較的規則的なインターバルで反復的に照合される。これらの状況の下で、データベースに対して行われた数百あるいは数千もの異なるＳＱＬステートメントに応答して最適な動作を達成するためにデータベース管理者が索引付けの要求を正確に受取ることは実質的に不可能となる。しかしながら、最適なインデックスが提供されなかった場合、照会によって中央処理装置に過剰な要求が課せられてしまう。さらに、そのような照会は必要以上に長く実行する傾向があり、それによって遅延が生じる。これらの問題が残っているため、マシンには重い過負荷がかかり、単位時間中に実行できる照会の数が制限される。この結果、マシンの使用が少くなり、実際に照会当りのシステムのコストが増加してしまう。
本発明のシステムは、上述の技術的な問題を克服するための試みにおいて索引付け構造を指定するように構成されている。このシステムは、可能性のあるインデックスの好ましいセットを指定するためにデータベースに与えられるＳＱＬステートメントを解析するように構成され、データベースを参照するＳＱＬステートメントの実行を改良するために使用できる。可能性のあるインデックスがデータベースシステム内で実際に使用可能である場合に達成できる改良された動作のレベルに関して推定評価が行われる。これらの評価から、インデックスの好ましいリストが指定される。従って、データベースの管理者における主観的な制限がマシンによって取除かれて、数字による表示が計算され、インデックスを実際に生成するようにリソースを割当てる前に、システム内に特定のインデックスを含むことが望ましい程度が示される。これらの数字による指示は、インデックスが配置されたときに改良された動作の評価から実際に得られる。
図２に示されたデータベースハードウェアは、前記ハードウェアによって実行される関連したプロセスと共に図７に概略的に示されている。データベースシステムは、ＩＢＭ社の商標“ＤＢ２”の下で許可されたデータベース命令に従って構成されてもよい。結果として生じたＤＢ２環境は図７において710として示され、それはデータ記憶装置702およびＳＱＬ実行プロセス703を含んでいる。
図示された実施例において、３つの表がデータ記憶装置内で定められており、それらは第１の表704、第２の表705および第３の表706として示されている。図示されたデータベースシステム内では、６個のインデックス全てが各表に対して生成され（これは構成に依存して可変である）、それらは破線の領域707により示されている。従って、各表は関連したインデックスのセットを有していてもよく、インデックスのセットの中に含まれた実際のインデックスは、本発明の好ましい特徴によって定義されているように、インデックスのセットの指定装置により実行される方法に従って決定される。データベースはまた、データベースの定義を記憶するように構成されたカタログ709と、カタログ統計とを含んでいる。
ＤＢ２内では、データベース内の表、カラムおよびインデックスに関する統計を得るように構成された“ＲＵＮＳＴＡＴＳ”がユーティリティとして設けられている。カタログは、表の寸法についての詳細な情報を提供し、ライブデータ記憶装置から類似したデータ記憶装置のコピーにデータが転送される前に空の表が生成されることを可能にする。さらに、ＳＱＬ実行プロセスは、“オプティマイザ”として知られるプロセスを含み、それは特定のＳＱＬステートメントの実行を最適化するようにカタログの統計を解析するように構成されている。
表の寸法を定めるのに加えて、カタログの統計はカラムカーディナリティ、二番目に高い値および二番目に低い値、クラスター比およびデータの分布も記録する。
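Column cardinality defines the number of distinct values in a particular column, while full-key cardinality identifies the total number of distinct values in an index defined on a table. A customer table may have a cardinality of over one million rows, each representing a different customer, yet a column identifying the sex of these customers has a cardinality of only two. Similarly, the cardinality of an index identifying geographical regions may be in the low hundreds.
The second-highest value, denoted HIGH2KEY, represents the second-highest value in a particular column. Similarly, the value identified as LOW2KEY represents the second-lowest value in that column. This information, particularly when considered together with the column cardinality, gives an indication of the range of values likely to occur within the column.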
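To make these three statistics concrete, the sketch below computes them for a small in-memory column. It only illustrates the definitions and does not reproduce the RUNSTATS utility. The function name `column_stats` and the fallback used when a column has a single distinct value are assumptions.

```python
def column_stats(values):
    """Return cardinality, second-lowest and second-highest value of a non-empty column."""
    distinct = sorted(set(values))
    single = len(distinct) == 1  # fallback for one distinct value is an assumption
    return {
        "COLCARD": len(distinct),
        "LOW2KEY": distinct[0] if single else distinct[1],
        "HIGH2KEY": distinct[0] if single else distinct[-2],
    }

# Call types as used in the Figure 3A example (A, B, C and G).
stats = column_stats(["A", "A", "B", "C", "G", "A"])
```

Here `stats` reports a cardinality of 4, with `"B"` as the second-lowest and `"C"` as the second-highest value.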
クラスタ比によって、インデックス中のデータの順序付けがどの程度良好に元の表におけるデータの順序付けに従うのかが示される。クラスタ用のインデックスが表について定められ、その表中のデータが再構成されたとき、表中の全ての行はクラスタとされたシーケンスにされ、クラスタ比は１００パーセントであると考えられる。異なるカラムと、クラスタ用のインデックスに対するカラムの順序とを有する別のインデックスはそれらのデータがクラスタ用のインデックスに関してうまくクラスタとされていないこともあり、値の低いクラスタ比を有する傾向がある。行が表に挿入されたり、表から削除されたりすると、インデックスのクラスタ比は徐々に減少し、より多くの行がクラスタ用のシーケンスから外れるようになる。従って、クラスタ比は、この実施形態においてデータがどの程度うまく特定のデータインデックス内で順序付けられ、また、それぞれの表内でライブデータのサンプルから発生されるかの表示を与える。データの分布の統計によって、最も頻繁に発生する１０個の値がそれらの発生の百分率と共にカラム中で定められる。
入力された情報は実質的に２つの形態のライブデータベースシステムに供給される。第１に、入力ライン715によって示されているように、新しいデータがシステムに供給され、第２に、入力ライン716によって示されているように、ＳＱＬステートメントの照会がデータベースに供給される。ライン716上の入力された照会は、ＳＱＬ実行プロセス703によって実行され、出力ライン717によって示されているように出力を生成する。入力ライン715上で受取られた新しいデータによって結果的にデータ記憶装置702内の表が更新され、この更新プロセスは、ＳＱＬ指令に従って実行される。従って、入力されたデータおよびＳＱＬの両者の変更、削除および照会は、ＳＱＬ実行プロセス703に供給され、両方とも前記プロセスの制御の下で実行される。
一般に、入力ライン716上に供給された問合わせすなわち照会の形態のＳＱＬステートメントは、ＳＱＬプロセッサ703がそれぞれの表からの主データに加えて多数のインデックスにアクセスした場合に一層迅速に実行される。結果的に、このタイプの照会に主として応答するためにデータベースシステムが要求される場合、多数のインデックスをシステム内に含むことが望ましい。しかしながら、入力ライン715上で受取られた新しいデータに応答して、あるいは変更または削除されたデータに応答して表が更新されることが要求されたとき、多数のインデックスが存在する場合にはＳＱＬ実行プロセス703の上に大きい処理オーバーヘッドが置かれる。従って、データベースシステムが主としてデータの保管所として要求され、システムへの照会が最小である場合、ハウスキーピング（準備）オーバーヘッドを減少するようにシステム内のインデックスの数を最小にすることが望ましい。大半の実際のシステムにおいて、両方のタイプの入力が受取られ、それ故、ハウスキーピングオーバーヘッドを減少するためにシステム内にあるインデックスの数が最小にされるが、記憶のスペースがあるならば、最適な動作性能を提供するために現在のインデックスの選択が最適にされるべきである。
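The extent to which the specification of table indexes approaches an ideal solution depends on the nature of the SQL queries supplied on input line 716. A system may be supplied with many indexes, but these are of little benefit if they do not relate to the predicates specified in typical SQL queries. Similarly, the greatest benefit is obtained from the available indexes when they are configured to satisfy regularly occurring SQL queries in preference to infrequent ones. However, it must be recognised that some SQL queries, although not particularly common, may be very important to the operation concerned (perhaps justifying the existence of the system), so that a high priority must be given to queries of this type. It can therefore be seen that many competing constraints are placed on the index set specifier 708, which attempts to provide an optimal set of table indexes. The SQL execution process 703 includes an optimiser process arranged to assess the best way of obtaining and manipulating the data held in the data store 702 in order to satisfy an SQL statement. The optimiser examines each SQL statement executed against the database and evaluates each of the many access paths by which the statement could be satisfied. Each possible access path is assigned a cost representing the amount of processing and the amount of disk access required for that path. The optimiser is then arranged to select the access path with the lowest cost as the actual path for performing the function requested in the SQL statement.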
ステートメントの最適化は動的ＳＱＬとして知られている実行の際に行われてもよく、そこにおいて、各ＳＱＬステートメントの内容は通常、ある回の実行と次の回の実行とでは変化している。その代りに、最適化はステートメントが実際に実行される前に一度実行されてもよく、その結果はＤＢ２プラン内に記憶される。このタイプの最適化は静的ＳＱＬとして知られており、ＳＱＬステートメントが実行の前に知れたときに使用される。静的ＳＱＬは、最適化処理がＳＱＬステートメントのためにたった１度だけ行われ、その結果が後の反復実行のために記憶された場合には、システムのもつ特徴よりも一層効果的となる。
好ましいインデックスセットの指定装置（スペシファイヤ）708は最適化の準備をし、別の段階、すなわち、使用可能なインデックスのどのインデックスを使用するかについてＳＱＬ実行プロセス703内でオプティマイザがオンラインで決定する前に、どのインデックスが実際に生成されるべきかを指定するように構成されている段階を処理する。しかしながら、最適化の実行に加えて、指定装置708は、インデックスのハウスキーピング、実行頻度およびステートメントの優先度を考慮に入れなければならない。
指定装置708は、システムによって実行される典型的なＳＱＬステートメントのサンプルに応答してインデックスの好ましいセットを指定する。この情報を得るために、ＳＱＬ追跡（トレース）プロセス718は、入力ライン716上に供給された照会を追跡し、それによって、適切な期間の後、ＳＱＬステートメントの代表的なサンプルがライン719を通って指定装置708に供給される。指定装置708は、カタログの統計ならびにデータ記憶装置からの表データのサンプルを読取り、結果的に表データがライン720を通って供給される。インデックスセットの指定プロセスが実行された後、出力ライン721を通って出力信号が供給され、プリンタ722により目で読取ることができる形式で指定データを生成することができるようになる。さらに、ＳＱＬの指示がライン723を通ってＳＱＬ実行プロセッサ703に供給され、結果的にインデックスの好ましいセットがデータ記憶装置内で生成される。さらに、プリンタ722によって生成された情報は、インデックスの好ましいセットが構成されるようにするためにデータ記憶装置内に付加的な記憶装置が必要であることをオペレータに知らせることができる。
最適なインデックスのセットの指定装置708によるプロセスの大要が図８に示されている。最初に、ステップ801においてＳＱＬ追跡装置718が活性化され、それによって、一定期間にわたって、データベースにアクセスするために使用されたＳＱＬステートメントがＳＱＬ追跡装置によって記録される。最終的に、ＳＱＬステートメントのサンプルが集められ、表のインデックスが再度指定されることが決定される。
この段階において、データベースは事実上ラインから外されることができ、それによって、新しい表のインデックスが指定されるまでさらに照会を行うことはできない。このような環境の下では、インデックスのセットの指定プロセスはデータベースそれ自体に共通のハードウェアプラットフォーム上で実行されることが好ましい。その代りに、別の実施形態において、インデックスのセットの指定プロセスの手順は、データベースのプラットフォームとの間で送受されるデータを使用して別個のプラットフォーム上で独立して実行されてもよい。
インデックスのセットの指定命令は、磁気ディスク、光学ディスク、あるいは光磁気ディスク等の適当なデータ搬送媒体を使用して外部プラットフォームに供給される。その代りに、命令はネットワーク能力を介して付加的なプラットフォームに供給されてもよい。インデックスのセットの指定装置708に対する命令の負荷は、取外し可能なディスク724によって示されている。
ステップ802において、ライブデータベースの各表のスペースのカタログ統計はＲＵＮＳＴＡＴＳを使用して更新され、それによって、情報がインデックスのセットの指定装置708に供給されたときにカタログからの更新されたデータが使用可能となることが確実にされる。
ステップ803において、インデックスのセットの指定プロセスが実行され、インデックスのセットが指定される。ステップ804において、ディスクのスペースがより多く設けられるか否かに関して質問が行われ、その答えが肯定であった場合、インデックスを生成するためのディスクの記憶の割当てが増加する。その代りに、ステップ804において行われた質問が否定であった場合、結果的に制御はステップ806に導かれる。
ステップ806において、指定されたセットの詳細が印刷されるかどうかに関して質問され、答えが肯定である場合には、恐らくは通常の並列のインターフェイス接続の形態で印刷信号がプリンタ接続721を通じてプリンタ722に供給される。その代りに、ステップ806において行われた質問が否定である場合、結果的に制御はステップ808に導かれる。
ステップ808において、ステップ803において行われた指定がライブシステムについて実行されるか否かに関して質問が行われ、答えが肯定である場合、ディスクのスペースに制限があることを条件としてステップ809において実行される。その代りに、制御がステップ810に導かれると、結果的にデータベースはライン上に戻って置かれる。
好ましいインデックスのセットの指定方法803が図９に示されている。ステップ901において、ライブデータベースは図１０に詳細に示される手順に従って指定装置708のプロセス内でモデル化される。その後、ＳＱＬ追跡プロセス718によって追跡されたＳＱＬステートメントは、図１１に詳細に示された手順に従って指定装置708のプロセスによって解析される。
ライブデータベースのモデル化によって、結果的に図７に示された表に類似して表が生成されるが、実質的にオンラインのライブシステム中にある表よりも小さい。指定プロセス内でカタログ統計手順をランすることが可能であり、それによって結果的にモデル化された表内のエントリのサイズを反映するカタログ統計が生成される。しかしながら、指定装置708によるプロセスはライブシステムの効果的な動作に関連しており、それ故、それはライブシステム内に含まれ、ライブカタログ702に記憶されたようなカタログ統計にさらに一層関連している。従って、ステップ903において、前記カタログからのライブ統計はインデックスのセットの指定装置にコピーされ、それはライブインデックスの不履行（デフォルト）のセットと共に一次キーインデックスを構成し、各表に対してインデックスをクラスタとする。
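At step 904, a base-level cost estimate is calculated as detailed in Figure 13, so that the cost improvement obtained when potentially optimal indexes are added can be estimated. This cost difference provides the objective function for the subsequent processing involved in specifying the index set.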
インデックスのセットを識別するために実行された計算のほとんどは基本的に表毎に行われ、その後、表は、インデックスの生成のためにディスクのスペースが使用可能である場合に、どの特定のインデックスがライブシステム上で生成されるかについての査定が行われるときに再び組合わせで考慮される。結果的に、ステップ905において表が選択され、ステップ906において候補のインデックスが識別され、ステップ907においてそれらの目的関数に従って資格のあるインデックスが順序付けられ、ステップ908において最適なインデックスのセットが生成される。その後、ステップ909において別の表が使用可能であるかどうかについての質問が行われ、答えが肯定であった場合、制御はステップ905に戻される。ステップ909における質問の答えが否定であった場合、結果的に制御はステップ910に導かれ、そこでライブシステムに対して適用できそうなインデックスの最適なセットが指定される。
インデックスのセットの指定プロセス内でデータベースの縮小されたモデルを生成するために、図１０においてライブデータベースをモデル化する手順901が示されている。ステップ1001において、カタログ統計がカタログ702から読取られ、その後、空の表がモデル中に生成され、それぞれのカタログの定義によって説明されたように、ライブ表の704,705,706の性質をコピーする。各表のサイズは５０００行に制限され、これは実質的にライブデータベース表中にある行数よりも少ない。
インデックスのセットの指定装置708内で表の縮小されたモデルがデータ記憶装置702中のライブ表を正確に反映することを確実にするために、ステップ1002において生成された空の表にそれらのそれぞれのライブ表から読取られたデータエントリの代表的なサンプルで充たされる必要がある。ステップ1003において、ライブ表はそのそれぞれのカタログと共に選択される。
ステップ1004において、一次キーのカード値が最高であるＨＦ1として示された特定のインデックスを識別するために、既に選択された表に関係付けられ、ライブデータベース内で動作可能なインデックスが考慮される。ステップ1005において、ＨＦ1の第１のカラムのためのＨＩＧＨ２ＫＥＹ、ＬＯＷ２ＫＥＹおよびＣＯＬＣＡＲＤが識別され、ステップ1006において前記第１のカラムに対するデータの分布が決定される。
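Thereafter, at step 1007, a set of random values is generated for the first column of HF1 within the range defined by LOW2KEY and HIGH2KEY. The likelihood of each value being chosen is weighted according to the frequency distribution determined at step 1006. As a result, when the model table is populated with up to 5000 entries taken from the live table, the distribution of values in the model is sufficient for processing performed on the model to place demands on it that substantially reflect the demands made when similar processing is executed on the respective live table. The randomly selected values identified at step 1007 are used to read entries from the live table at step 1008, and the entries read at step 1008 are written sequentially to the model table at step 1009.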
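The frequency-weighted population of a model table can be sketched as follows. This is a simplified illustration only. It weights values by their actual counts in the live rows rather than by catalog distribution statistics, and it samples with replacement. The function name `build_model_table` and the dictionary row format are assumptions.

```python
import random
from collections import defaultdict

def build_model_table(live_rows, key, size=5000, seed=0):
    """Draw up to `size` rows so that values of `key` keep their live frequencies."""
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for row in live_rows:
        by_value[row[key]].append(row)
    values = list(by_value)
    weights = [len(by_value[v]) for v in values]  # frequency distribution of the key
    count = min(size, len(live_rows))
    chosen = rng.choices(values, weights=weights, k=count)
    # For each chosen key value, read one matching entry from the live rows.
    return [rng.choice(by_value[v]) for v in chosen]

live = [{"call_type": "A"}] * 90 + [{"call_type": "B"}] * 10
model = build_model_table(live, "call_type", size=50)
```

With the rows above, the 50-row model contains mostly type A entries, mirroring the 90:10 split of the live data on average.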
ステップ1010において、第１に、データ表においてエントリがクラスタキーに従って再編成されるようにモデルデータが処理される。各表に対して可能性のあるインデックスが生成され、これらのインデックスの性質に関して統計が集められる。得られたインデックスの統計は製造寸法まで拡大され、それによって拡大された値が蓄えられ、元の表データが削除される。
ステップ1011において行われた質問の答えは、モデル化された表の全てが、データ記憶装置702内に保持されたライブ表からランダムに選択され、頻度に従って加重されたエントリで形成されるまで肯定である。
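The procedure 902 for analysing the captured SQL statements is detailed in Figure 11. At step 1101, each SQL statement is processed to determine whether it is the first occurrence of that particular statement or whether the statement has been seen before. Each unique SQL statement is given a unique label, and when the same SQL statement is identified again, the number of occurrences is recorded in a frequency column.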
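Step 1101 amounts to de-duplicating the trace while counting occurrences. A minimal sketch is given below. The label format (`S1`, `S2`, ...), the whitespace normalisation and the function name `label_statements` are assumptions and are not taken from the source.

```python
def label_statements(trace):
    """Give each distinct statement a label and count how often it occurs."""
    labels = {}     # statement text -> label
    frequency = {}  # label -> number of occurrences in the trace
    for sql in trace:
        text = " ".join(sql.split())  # collapse whitespace (assumed normalisation)
        if text not in labels:
            labels[text] = f"S{len(labels) + 1}"
        label = labels[text]
        frequency[label] = frequency.get(label, 0) + 1
    return labels, frequency

trace = [
    "SELECT * FROM events WHERE phone = ?",
    "SELECT *  FROM events WHERE phone = ?",
    "SELECT * FROM customers WHERE id = ?",
]
labels, frequency = label_statements(trace)
```

The first two statements differ only in spacing, so they share one label with a frequency of two.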
ステップ1102において表が選択され、次に、ステップ1101においてラベルをつけられたＳＱＬステートメントがステップ1103において識別される。従って、手順1103乃至1106は、捕捉されたステートメントのそれぞれの特有の発生に対して実行されるだけである。
ステップ1103において選択されたステートメントがステップ1102において選択された表を使用するかどうかについて、ステップ1104において質問が行われる。質問の答えが肯定である場合、ステップ1105においてステートメントのラベルが適切な表のリストに加えられる。その代りに、ステップ1104における質問の答えが否定である場合、ステップ1105はバイパスされ、制御はステップ1106に導かれる。
ステップ1106において、別のステートメントが考慮されるかどうかに関して質問が行われ、その答えが肯定である場合、制御はステップ1103に戻され、次のラベルをつけられたステートメントが選択されることを可能にする。その代りに、答えが否定である場合、別のステートメントは使用可能でなく、ポインタを識別するステートメントはリセットされ、制御はステップ1107に導かれる。ステップ1107において、別の表が存在するかどうかについて質問され、その答えが肯定である場合、制御はステップ1102に戻され、次の表が選択される。
最終的に全ての表が考慮され、結果的に、ステップ1107において行われた質問の答えは否定である。
ステップ1105において生成された表のリストが図１２において詳細に示されている。そのリストは、元の表を識別する第１のカラム1201と、ステップ1101において指定されたようなそれらのラベルに関してＳＱＬステートメントを識別する第２のカラム1202と、ステートメントの頻度、すなわち、追跡されたセットの内で特定のＳＱＬステートメントが生じる回数を識別する第３のカラム1203とで構成されている。
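As shown in Figure 12, Table 1 is first selected at step 1102 and, in response to repeated operation of step 1105, SQL statements A, B, C through to Z are added to the table list. Table 2 is then selected and the SQL labels that use this table are identified. Finally, Table 3 is selected and statement labels are associated with it. Three tables are present in this embodiment, but it should be appreciated that any number of tables may be present when the method is used within a large relational database.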
第３のカラム1203において、発生の頻度が記録されており、それは一般には数千回測定されたものである。従って、ステートメントＡに対してｘ回の発生が記録され、ステートメントＢに対してｙ回の発生が記録され、ステートメントＣに対してｚ回の発生が記録される。
ベースレベルコストを評価する手順904が図１３に示されている。表がステップ1301において選択され、ステップ1302においてＳＱＬステートメントが選択される。ステップ1303において、ステップ1301において選択された表に適用されたときに、ステップ1302で選択されたＳＱＬステートメントを実行するためのコストが評価される。コスト計算は、ＳＱＬ実行プロセッサ703内にあるオプティマイザから得られたタイマーオン値を使用して達成される。しかしながら、タイマーオン値はインデックスの使用だけを考慮し、インデックスのメンテナンスは考慮に入れない。結果的に、好ましい実施形態では、米国フロリダ州のInnovation Management Solutions社により“ＱＣＦ”（商標）の名の下で開発された命令が実行され、それによって、ＣＰＵの使用およびＳＱＬステートメントが実行されるための経過時間に関して、インデックスのメンテナンスの評価と組合わせてインデックスのコスト値が与えられる。これらのコスト値は絶対的な測定結果を表してはいないが、付加的なインデックスが存在するときに類似した方法を実行することによって関連したコスト値を得ることができ、それは付加的なディスクスペースに対する要求と比較されたときに、より高価なインデックスに優先して特定のインデックスを選択することができる目的関数を提供する。
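At step 1304, the cost value calculated at step 1303 for each statement is multiplied by an execution-frequency factor, and at step 1305 the resulting product is multiplied by a priority factor. At step 1306 the cost is added to the running base-cost total for the particular table, and at step 1307 a question is asked as to whether another statement is present. If so, control returns to step 1302, the next SQL statement is selected and the costing procedure is repeated.
Finally, when the question at step 1307 is answered in the negative, a question is asked at step 1308 as to whether another table is present. If so, a new table total is started at step 1309 and control returns to step 1301, allowing the next table to be selected. Eventually the question asked at step 1308 is answered in the negative, and control is directed to step 905.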
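The per-table base cost described in steps 1303 to 1306 is a weighted sum over the statements that use the table. The sketch below assumes that the per-statement cost has already been obtained from the optimiser. The `Statement` class, its field names and the example figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    label: str
    cost: float            # estimated cost of one execution (step 1303)
    frequency: int         # occurrences in the traced workload (step 1304)
    priority: float = 1.0  # priority factor (step 1305)

def base_cost(statements):
    """Sum cost x frequency x priority over the statements of one table (step 1306)."""
    return sum(s.cost * s.frequency * s.priority for s in statements)

table1 = [Statement("A", 2.0, 10), Statement("B", 5.0, 3, priority=2.0)]
total = base_cost(table1)  # 2.0*10*1.0 + 5.0*3*2.0
```

Repeating the same calculation with a candidate index in place gives the new cost against which this total is compared.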
資格のあるインデックスを識別するために候補のインデックスのコストを計算する手順が図１４に詳細に示されている。ステップ1401において、考慮中の表に関連するＳＱＬステートメントから得られた述語のセットから候補のインデックスが識別され、それは図１２において示されたリストに従ってステップ905で選択されることによって定められる。従って、考慮中の表に関連するＳＱＬステートメントを解析することによって、索引付け可能な述語が識別できる。識別された索引付け可能な述語は特定のカラムを参照し、これらのカラムは表およびＳＱＬステートメントによってグループにされて述語のセットを形成する。これらの述語のセットは、関連したＳＱＬステートメントを満足させるときに利益をもたらす可能性のある候補のインデックスを識別する（すなわち、候補のインデックスが構成される）ための開始点を与える。さらに、識別された各インデックスに対してカタログ統計が生成される。
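A candidate index is selected at step 1402, and at step 1403 the index selected at step 1402 is created as part of the table model held within the index set specifier 708. After the new index has been created at step 1403, the associated catalog entries in the model are updated at step 1404 so that the index appears at full size.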
候補のインデックスがモデルデータベースに対して生成される。インデックスのユーティリティを回復するＤＢ２は、インデックスを送り込むために表に対してランされ、ＤＢ２Ｒｕｎｓｔａｔｓのユーティリティは、統計を集めるためにデータベースに対してランされる。統計は各インデックスに対して集められ、それらはデータベース中に記憶され、プロセス全体を通して使用するためにライブ容量まで拡大される。
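The potential indexes identified at step 1401 include every combination of index that could use the entries of the particular columns. The columns may therefore be arranged in different orders, and every possible order is included. Similarly, the entries within each column may be sorted in ascending or descending order, and again all of these possible combinations are present. Not all of these combinations are actually required as candidate indexes, so a selection process is performed at step 1402 to identify the candidates. The position of the columns is referred to as the column sequence, and the choice between ascending and descending is referred to as the ordering. Indexes that share the same column sequence are grouped together, so that the members of a group differ only in their ordering. All of the indexes defined within a group, that is, all indexes with the same column sequence, are created in the model. Catalog statistics are also generated for each of these indexes, and all of the trapped SQL that references the table is then targeted at the indexes. The explain facility within the database is then exercised to identify the particular indexes within the group that are actually used. These indexes become the selected candidate indexes, and the remaining indexes from the potential set are rejected. The created indexes are then deleted and the next group is considered, until all groups have been considered. The last indexes created are again deleted before control passes to the loop starting at step 1402.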
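The grouping by column sequence can be illustrated by enumerating the possibilities for a small predicate set. The sketch below assumes that every potential index uses all columns of the predicate set, which the source does not state explicitly. The function name `candidate_groups` and the column names are hypothetical.

```python
from itertools import permutations, product

def candidate_groups(columns):
    """Group potential indexes by column sequence; members differ only in ordering."""
    groups = {}
    for sequence in permutations(columns):
        groups[sequence] = [
            tuple(zip(sequence, ordering))
            for ordering in product(("ASC", "DESC"), repeat=len(sequence))
        ]
    return groups

groups = candidate_groups(("phone", "call_type"))
total_indexes = sum(len(members) for members in groups.values())
```

For n columns this yields n! groups of 2^n indexes each, which is why the text later limits the initial enumeration to four columns.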
従って、前述されたように、ステップ1402において候補のインデックスが選択され、ステップ1403において候補のインデックスが生成され、新しく生成されたインデックスに応答してステップ1404においてカタログが更新される。
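Now that a new index has been created in the model, the SQL statements that reference the respective table are costed in a manner substantially similar to the base cost calculation detailed in Figure 13. After the SQL has been costed with the new index at step 1405, the new cost value is stored at step 1406, and the index created at step 1403 is then deleted at step 1407.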
ステップ1408において、別のインデックスが存在するか否かに関して質問が行われ、その答えが肯定である場合、制御はステップ1402に戻され、結果的に次に可能性のあるインデックスが選択される。最終的に、可能性のある全てのインデックスに対して全てのＳＱＬステートメントのコストが計算され、結果的にステップ1408において行われた質問の答えは否定となる。
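The cost values calculated and stored in the table at step 1406 define a new cost for each potential index identified at step 1401. For each of these indexes, a cost saving is calculated by subtracting the new cost from the base cost calculated at step 904. This cost saving represents the objective function: an index with a high cost saving is considered more suitable than an index with a low cost saving. The cost saving is stored for each index at step 1410, and at step 1411 a question is asked as to whether another index is present. If the answer is yes, the cost saving for the next index is calculated at step 1409. Eventually cost savings have been calculated for all potential indexes, and the question asked at step 1411 is answered in the negative.
Storing the cost savings at step 1410 produces a list of indexes, each with its assigned cost saving. This represents the objective function, and at step 907 the eligible indexes are therefore ordered according to it, so that indexes with high cost savings are placed near the top of the eligibility list.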
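The saving and ranking calculation can be written in a few lines. In this sketch the base cost of 100 and the new costs are invented so that the savings match three of the values quoted for Figure 16. The function name `rank_by_saving` is an assumption.

```python
def rank_by_saving(base, new_costs):
    """Return (index id, saving) pairs, highest saving first."""
    savings = {index_id: base - cost for index_id, cost in new_costs.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical new costs measured with each candidate index in place.
ranked = rank_by_saving(100, {1617: 96, 1625: 27, 1616: 28})
```

The result places index 1625 (saving 73) first and index 1617 (saving 4) last.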
順序付けされた可能性のあるインデックスを処理するための図９に示されたプロセス908が図１５において詳細に示されており、図１５の手順に従って実行された動作の一例が図１６および図１７に示されている。
図１６において、１１個の可能なインデックスが示されており、それらは特有の識別番号1625,1616,1604,1673,1612,1646,1635,1691,1622,1683および1617で表されている。これらのインデックスに対してコスト節減を先に識別し、定めた手順は、最適なインデックスの指定されたセットに含むための優先度を表している。従って、図１６に示された適当な資格のあるインデックスは、各インデックスに対して記録された関連したコスト節減に関して指定される優先度に基づいて順序付けされたものである。これらのコスト節減値に絶対的な意味はなく、前述された手順に従って計算されたコスト節減に関連した指示を与える。すなわち、インデックス1625は７３のコスト節減を行うものとして識別され、インデックス1616は７２のコスト節減を行うものとして、インデックス1604は６８のコスト節減を行うものとして、そして最後にインデックス1617は４のコスト節減を行うものとして識別されている。従って、インデックス1625乃至1617をそれらのコスト節減の適格性の点から順序付けするのに要求された情報は、ステップ803において詳細に説明された手順に従って計算される。
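At step 1501 of Figure 15, the cost savings are considered and normalised to facilitate subsequent processing. Normalisation allows the cost savings to be considered within a predetermined range, chosen in this example as 0 to 9999. The total cost saving is calculated, as shown at 1681 in Figure 16. In this example its value is 500. The full range is divided by this total cost saving 1681 to give a unit range value, which is then multiplied by each cost saving to give a distribution value. The distribution values are calculated at step 1502, resulting in normalised cost savings. Thus, in accordance with the procedure performed at step 1502, the cost saving of 73 for index 1625 is normalised to a value of 1460, and index 1616 is normalised (from its cost saving of 72) to a value of 1440. Normalised values are calculated in the same way for all indexes under consideration, so that, when summed, they equal the full range of the distribution, which is 9999 in this example. At step 1503 a question is asked as to whether another genetic iteration is required, and on the first iteration the answer is yes. The number of genetic iterations required is selected in advance, and a counting operation is performed at step 1503.
If the question is answered in the affirmative, a random number in the range 0 to 9999 is generated at step 1504. This random number is used at step 1505 to select a particular index. The random selection at step 1505 is thus weighted by the cost saving provided by each index. A number in the range 0 to 80 results in the selection of index 1617, a number in the range 81 to 400 selects index 1683, a number in the range 401 to 820 selects index 1622, and finally a number in the range 8540 to 9999 selects index 1625. The quantity of distribution numbers allocated to a particular index is proportional to its cost saving, so that over many iterations indexes with high cost savings are, on average, selected more often than indexes with low cost savings. Nevertheless, indexes with low cost savings remain in the pool and can still be selected in accordance with the genetic procedure.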
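The normalisation and weighted random selection resemble what is commonly called roulette-wheel selection. The sketch below is one possible reading of steps 1501 to 1505, not the patented implementation. Only four of the eleven savings are stated in the text, so the remaining 283 units of the total of 500 are lumped into a single placeholder entry with id 0. That entry and the function names are assumptions.

```python
import random
from bisect import bisect_left
from itertools import accumulate

FULL_RANGE = 9999

def normalise(savings):
    """Scale each saving so that the values sum to FULL_RANGE (steps 1501-1502)."""
    unit = FULL_RANGE / sum(savings.values())
    return {index_id: saving * unit for index_id, saving in savings.items()}

def select_index(savings, rng):
    """Pick one index with probability proportional to its saving (steps 1504-1505)."""
    ordered = sorted(savings, key=savings.get)  # lowest saving takes the lowest numbers
    norm = normalise(savings)
    upper_bounds = list(accumulate(norm[i] for i in ordered))
    number = rng.randint(0, FULL_RANGE)
    position = bisect_left(upper_bounds, number)
    return ordered[min(position, len(ordered) - 1)]

# Savings quoted in the text, plus placeholder id 0 for the unstated remainder.
savings = {1625: 73, 1616: 72, 1604: 68, 1617: 4, 0: 283}
norm = normalise(savings)
picks = [select_index(savings, random.Random(n)) for n in range(2000)]
```

Rounded to whole numbers, `norm[1625]` is 1460, `norm[1616]` is 1440 and `norm[1617]` is 80, which agrees with the ranges given above.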
それぞれの反復の際にインデックスの新しい組合わせが生成され、これらのインデックスの組合わせによって特定のコスト節減が与えられる。これによってインデックスの組合わせがそれ自体のインデックスの識別を与えられるようにし、複合のインデックスが図１６に示された表内に含まれるように、目的関数に基づいて順序付けされる。
従って、ステップ1505において特定のインデックスが選択され、ステップ1506において選択されたインデックスがインデックスバッファに書込まれる。インデックスバッファ1791が図１７に示されており、それは考慮中の特定の表に対して許容されたインデックスの最大の数を表す６個のバッファ位置を含んでいる。図１７を参照すると、各表は特定の実施形態内に最大の６個のインデックスを有して示されているが、この形態は特定のローカルな動作条件を満足させるために調節されてもよい。従って、バッファ1791は６個の位置を有しており、インデックス1625等の選択されたインデックスの識別はこれらの任意の位置に配置され、ランダムなベースで選択されることができる。従って、図１７に示されているように、インデックス1625の識別は、インデックスバッファ1791の第２のバッファ位置に位置されている。
ステップ1507において、第２のランダムな数が０乃至９９９９の範囲内で生成され、結果的に第２の親インデックスがステップ1508において選択される。選択されたインデックスの指示は、図１７の1792に示されているように、第２のインデックスバッファに書込まれる。この例において、ステップ1507において生成されたランダムな数によって結果的にインデックス1604が選択され、バッファ1792内の第４の位置にランダムに位置付けられる。
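At step 1510, a buffer cut position is selected at random from the interfaces between positions within the buffer. In this example the cut lies between the second and third positions, as indicated by arrow 1793 in Figure 17. The cut position allows buffer 1791, the first parent, to be combined with buffer 1792, which may be regarded as the second parent. The result of this exchange is shown in buffers 1794 and 1795. In buffer 1794, index 1625 is derived from the first parent 1791 and occupies the second position, while index identifier 1604 is derived from parent 1792 and occupies the fourth position. As shown in buffer 1795, the other offspring of this combination contains no index identifiers. It is therefore regarded as a void child and is not considered further. Thus, at step 1511 of Figure 15, the two parents are "bred", producing child 1794 containing indexes 1625 and 1604.
To further increase the availability of potentially optimal indexes, a further genetic step is provided in which the child defined by the contents of buffer 1794 is mutated. Another random number is generated, resulting in the selection of a further index. As a result of this mutation process at step 1512, buffer 1796 is loaded with the index indications derived from child 1794, and index 1635 is added at random in the sixth position. At step 1513, the cost savings of the child produced at step 1511 and of the mutant produced at step 1512 are evaluated.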
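The breeding and mutation of six-slot buffers can be reproduced with list slicing. This sketch follows the Figure 17 example and is not a definitive implementation. Empty slots are represented by `None`, and restricting the mutation to empty slots is an assumption, since the source does not say whether an occupied slot may be overwritten.

```python
import random

def crossover(parent1, parent2, cut):
    """Exchange buffer tails at `cut`, the number of leading slots kept (step 1511)."""
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def is_void(child):
    """A child with no index identifiers is discarded."""
    return all(slot is None for slot in child)

def mutate(child, new_index, rng):
    """Copy `child` and place `new_index` in a random empty slot (step 1512)."""
    empty = [i for i, slot in enumerate(child) if slot is None]
    mutant = list(child)
    if empty:
        mutant[rng.choice(empty)] = new_index
    return mutant

parent1 = [None, 1625, None, None, None, None]  # buffer 1791
parent2 = [None, None, None, 1604, None, None]  # buffer 1792
child_a, child_b = crossover(parent1, parent2, cut=2)  # buffers 1794 and 1795
mutant = mutate(child_a, 1635, random.Random(0))       # analogous to buffer 1796
```

With the cut after the second slot, `child_a` holds 1625 and 1604 in the second and fourth positions, and `child_b` is void.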
ステップ1514において、新しいインデックス1794および1796が適格性の順に潜在的なインデックスのプールに加えられる。コスト節減は、考慮中の表を参照して各インデックスに対して計算され、それによってインデックスは、結果的なコスト節減に従ってランク付けされた図１６のリストに加えられることができる。全体のコスト節減が再度計算され、その後、新しく標準化されたコスト節減が新しく加えられたインデックスを含む全てのインデックスに対して再計算される。この標準化されたコスト節減の分布から、ステップ1502において各インデックスに対して値が計算され、さらに遺伝の繰返しの必要があるか否かについて再び質問される。
ステップ1503において行われた質問の答えが再び肯定である場合、０乃至９９９９の範囲内でランダムな数が生成され、ステップ1505において新しいインデックスが選択され、その後、ステップ1506においてこのインデックスの指示が第１の親バッファ1791に書込まれる。また、第２のランダムな数がステップ1507において生成され、それによってステップ1508において第２の親が選択され、その後、ステップ1509においてこの親の指示がバッファ1792に書込まれる。
ステップ1510において切断位置が再びランダムに選択され、ステップ1511において親が育てられ、それらの子孫がバッファ1794および1795に書込まれ、その後、有効な子供から突然変異体が発生する。従って、それぞれ反復することによって、インデックス結合プールに４つまでの新しいインデックスが加えられる。
最後に、ステップ1503において行われた質問の答えが否定である場合、結果的に制御は図８のステップ805に導かれる。図１５に示されたプロセスが繰り返されている間、新しいインデックスの計算が実質的にランダムな方法で識別される。しかしながら、新しい各インデックスは、その結果として生じたコスト節減を判断し決定するために試験され、コスト節減値が高いインデックスが図１６に示されたリストの最上部の近くに配置される。さらに、比較的高いコスト節減値を有することによって、選択する番号の分布の範囲もまた大きくなり、それによってこれらのインデックスが結合のために選択される可能性を大きくする。しかしながら、比較的可能性の低いインデックスはプール中に残るため、そのようなインデックスが選択されることも可能である。十分に反復すると、コスト節減値が非常に高いインデックスの組合わせが識別され、これらのインデックスは図１６に示されたリストの最上部の近くに配置される。最終的に、図１５に示されたプロセスが終了したとき、新しく育てられたインデックスを含むインデックスは、指定の際の生成のためのリストにコスト節減の順に次第に下位になるように記載される。
指定されたインデックスを含むようにデータベースを構成するための図９のステップ910において識別された方法の詳細が図１８に示されている。関連したコスト節減を指定する目的関数は、それらの関連した表に関してのみインデックスに対して考慮される。しかしながら、動作中のデータベースシステムにおいて、複数の表が一緒に機能しなければならない。それ故、所定の記憶スペースが使用可能である場合、記憶スペースは、存在している全ての表に関連したインデックスに対して割当てられなければならない。
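At step 1801, the total amount of disk space used by each table present in the live database is calculated, and at step 1802 the values calculated at step 1801 are summed to give the total disk space required by all tables. At step 1803, the disk space required by each table is divided by the total disk space to give a percentage allocation of disk space on a table-by-table basis. This percentage is used to allocate space to indexes, as shown at step 1804, so that the relative amount of storage allocated for index creation is substantially equal to the relative allocation of storage space to the tables. Thus, if a table occupies a large amount of disk space, a correspondingly large allocation of disk space is made for the indexes operating over that table.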
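Steps 1801 to 1804 describe a proportional split of the index storage budget. A minimal sketch follows. The table names, sizes and the function name `allocate_index_space` are hypothetical.

```python
def allocate_index_space(table_sizes, index_budget):
    """Share `index_budget` between tables in proportion to their disk usage."""
    total = sum(table_sizes.values())  # step 1802
    return {
        table: index_budget * size / total  # steps 1803-1804
        for table, size in table_sizes.items()
    }

# Hypothetical table sizes and index budget, in the same arbitrary unit.
allocation = allocate_index_space(
    {"events": 600, "customers": 300, "regions": 100}, index_budget=200
)
```

The largest table, which takes 60 percent of the table storage, receives 60 percent of the index storage.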
ステップ1805において表が選択され、その表に対して得られた最も適切なインデックスがステップ1806において選択される。ステップ1807において、ライブシステム上で生成されるようにステップ1806で選択された好ましいインデックスに対して十分なディスクスペースが割当てられるか否かに関して質問が行われる。この答えが肯定である場合、ステップ1806において選択されたインデックスがステップ1808において生成のために指定される。その代りに、十分なディスクスペースが使用できない場合、ステップ1807で行われた質問の答えは結果的に否定となり、ステップ1808はバイパスされ、制御はステップ1809に導かれる。
ステップ1809において、別の表が存在するかどうかに関して質問が行われ、答えが肯定である場合、制御はステップ1805に戻され、結果的に次の表が選択され、この表に適切なインデックスがステップ1806および1807において考慮される。最終的に、選択された表に対する全てのインデックスが考慮されると、ステップ1809において質問された答えは結果的に否定となる。
ステップ808での質問の答えが肯定である場合、図１８に詳細に示された手順が実行される。従って、新しいインデックス構造がライブデータベース内で生成され、その後、データベースは、その新しいインデックスが配置された状態でステップ810においてオンラインで配置される。
図１４を参照すると、ステップ1401において実行されたプロセスは、述語のセットによって結果的に４つ以上のカラムを要求するインデックスが識別された場合に、非常に時間を浪費するようになる。これらの環境の下で、第１の４つの好ましいカラムが選択され、残りのカラムが暫定的に排除される。選択はフィルタ係数のベースで行われ、フィルター係数が低い４つのカラムが選択される。全ての可能な順序付けの可能性を有してこれらの４つのカラムが処理され、それによって前述されたような候補のインデックスが選択される。
全ての適切なインデックスが決定された後、暫定的に排除された多量のインデックスは、第５の好ましいカラム、第６の好ましいカラムおよび第７の好ましいカラム等を加えることによって組み立てられ、それによって新しい第５のカラムのインデックス、新しい第６のカラムのインデックスおよび新しい第７のカラムのインデックス等が生成される。これらのインデックスは、必要な４つのカラムを含んでいる最も費用効果的な候補のインデックスに関して作られる。これらのカラムは、４つのカラムの候補のインデックスに増加順序で加えられるだけであり、カタログ統計はこれらの新しいインデックスに対して計算され、その後、これらの新しいインデックスはコスト計算されて、適格性で順序付けされたリストに加えられ、先にコストを計算されたインデックスの適格性（適切性）の順に配置される。
図１５を参照すると、ステップ1505において開始された遺伝プロセスのために考慮された適切なインデックスの数は重要であり、結果的に処理時間は比較的長くなる。これらの環境の下で、遺伝プロセスが実行される前に“結合プール”のサイズに上限を置くことが好ましい。一般に、結合プールのサイズは、遺伝プロセスを実行する前に最大３０の適切なインデックスに制限されてもよい。
基本的に、遺伝プロセスの目的は、一般的なＳＱＬの照会のセットを構成するときに、処理オーバーヘッドを著しく減少する複合インデックスが見つかるようにすることである。処理時間を減少するために、特に利点があると考えられるインデックスのセットを加えることによって“プールの準備”を行うことが好ましい。
プールの準備の第１の段階は、存在しているライブデータベースシステムを調査し、それによってどのインデックスが実際にライブシステムにおいて使用されるかについて決定することを含んでいる。その後、これらのインデックスのセットは、前述のように適切なインデックスの集合に加えられる。
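The second stage of preparing the pool is to reconsider the eligibility ranking on the assumption that the most eligible index is already present in the live system. The cost factors are therefore reconsidered from a starting position in which the most eligible index, as previously calculated, is treated as belonging to the live system. Once this index has been added to the system, the cost savings of some of the remaining eligible indexes change significantly, effectively re-ordering the indexes within the eligibility list. The most cost-effective remaining eligible index is then added to the system, and the cost savings are recalculated on the basis of this new index being present. The process is repeated iteratively, up to a set of six new indexes. Each of these index sets is added to the mating pool at the appropriate stage.
Background of the Invention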
Field of Invention
The present invention relates to specifying indexes for relational databases. The invention also relates to a relational database that includes a process for specifying an index.
Description of prior art
Data processing environments are known in which executable instructions are configured to obtain a set of data from the data contained in a database in response to a data enquiry. Data may be accessed directly from a data table, in which case it is necessary to search all entries in the table to obtain the requested information. Alternatively, an index may be used during the search in order to increase the speed of the search process substantially.
The design of index structures for large, heavily used databases is currently very difficult to carry out and is prone to error. The problem arises from the technical demands and constraints involved: it is impossible for a human database administrator to perceive the indexing requirements of the hundreds or thousands of different structured query language (SQL) statements that typically run against a database on a daily basis, and then to translate these requirements into a preferred set of indexes defined across the database as a whole. However, a poorly specified index design results in SQL statements consuming significantly more processor resources and running for longer than they should, and as a result the equipment becomes severely overloaded.
For a long time, there has been a need for a method of globally specifying the index structures defined for a given database design with respect to a typical SQL workload, referred to as a target workload. However, this technical problem has remained unsolved because of the difficulty of realizing a technical solution that uses the available processing power without requiring the intuitive mental processes of a human operator.
Summary of the Invention
According to a first aspect of the invention, there is provided a method of specifying a set of indexes for a database stored in a machine in readable form, comprising the steps of analyzing a plurality of statements supplied to the database; identifying indexes derivable from the tables of the database; evaluating the level of improved performance achievable when the indexes are available; and processing the evaluated levels to specify a set of indexes for the database.
In a preferred embodiment, the level of improved performance is evaluated by generating a reduced model of the database tables, derived from information relating to the nature of those tables. Typically, the reduced model may include in the region of 5000 data entries per table. The model database is preferably populated with representative data entries obtained from the live database being modeled, and the model is populated by considering the cardinality of the current indexes of the live database. Furthermore, the database model may be populated by considering the distribution of entries within the current indexes of the live database.
In a preferred embodiment, database statistics are copied from the live database to the database model. A base level cost is preferably calculated for executing the statements in the absence of additional indexes. Cost levels are preferably obtained by evaluating execution time. Furthermore, cost levels may be evaluated by assessing index maintenance overhead.
According to a second aspect of the invention, there is provided apparatus configured to specify a set of indexes for a relational database, comprising analyzing means for analyzing a plurality of statements supplied to the database; identifying means for identifying indexes derivable from the tables of the database; evaluating means for evaluating the level of improved performance achievable when the indexes are available; and processing means for processing the evaluated levels so as to specify a set of indexes for the database.
In a preferred embodiment, possible indexes are identified from the sets of predicates defined by the statements.
According to a third aspect of the invention, there is provided data processing apparatus comprising data storage means, data processing means and program instructions readable from the data storage means, configured to specify a set of indexes for a database. In response to the instructions, the processing means is configured to provide means for analyzing a plurality of statements supplied to the database, means for identifying indexes derivable from the tables of the database, means for evaluating the level of improved performance achievable when the indexes are available, and means for processing the evaluated levels so as to specify a preferred set of indexes for the database.
In a preferred embodiment, cost savings are calculated by processing a cost value for each SQL statement without a possible index (the old cost) together with a cost value for the same statement with that index in place (the new cost). A cost saving may be calculated by subtracting the new cost from the old cost. Cost savings are preferably calculated table by table, each new possible index being considered with respect to its own table.
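The subtraction of new cost from old cost, grouped by table, can be sketched as follows. The statement names, index names and cost values are hypothetical, and summing the savings over all sampled statements is an assumption made for the purposes of illustration.

```python
def index_savings(base_costs, costs_with_index, index_table):
    """Compute the saving (old cost minus new cost) of each possible index.

    base_costs:       {statement: cost with no additional index}
    costs_with_index: {index: {statement: cost with that index in place}}
    index_table:      {index: table on which the index is defined}
    Statements not affected by an index keep their base cost.
    """
    per_table = {}
    for index, new_costs in costs_with_index.items():
        saving = sum(
            old - new_costs.get(stmt, old) for stmt, old in base_costs.items()
        )
        per_table.setdefault(index_table[index], {})[index] = saving
    return per_table


# Hypothetical costs for two statements and two possible indexes.
base = {"Q1": 100.0, "Q2": 40.0}
with_index = {
    "IX_PHONE": {"Q1": 10.0},
    "IX_TYPE": {"Q1": 60.0, "Q2": 35.0},
}
tables = {"IX_PHONE": "EVENTS", "IX_TYPE": "EVENTS"}
savings = index_savings(base, with_index, tables)
print(savings)
```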
In a preferred embodiment, the possible indexes are ordered in terms of their eligibility for being specified as preferred indexes. Combinations of indexes are identified by randomly combining the currently eligible indexes, and the evaluated levels are processed to specify a preferred set of indexes.
According to a fourth aspect of the invention, there is provided a relational database comprising a plurality of data tables stored in machine-readable form, processing means for processing the data tables in response to statements, indexes for facilitating the processing of the data tables, and instructions executable by the processing means to specify a preferred set of indexes. The instructions are configured to analyze a plurality of statements supplied to the database, identify indexes derivable from the tables of the database, evaluate the level of improved performance achievable when the indexes are available, and process the evaluated levels so as to specify a preferred set of indexes for the database.
Brief description of the drawings
FIG. 1 illustrates a telecommunications environment that includes a data analysis system.
FIG. 2 shows details of the data analysis system shown in FIG. 1 and having a plurality of large capacity disk storage devices configured to store data tables.
FIGS. 3A, 3B, 4A and 4B show examples of data tables of the type stored on the data storage devices shown in FIG. 2.
FIGS. 5 and 6 show examples of indexes derived from the data included in the table shown in FIG. 3A.
FIG. 7 shows a database structure installed in the environment shown in FIG. 2, including a process for specifying indexes.
FIG. 8 shows the processes executed in the environment shown in FIG. 7, including a process for specifying a set of indexes.
FIG. 9 details the process for specifying a set of indexes shown in FIG. 8, including a process for modeling the live database, a process for analyzing typical SQL statements, a process for calculating base costs, a process for identifying possible indexes and a process for specifying preferred indexes.
FIG. 10 shows in detail the process of modeling the live database shown in FIG.
FIG. 11 shows in detail the process of parsing the SQL statement shown in FIG.
FIG. 12 shows a table of data generated using the process shown in FIG.
FIG. 13 shows a process for calculating the base cost shown in FIG.
FIG. 14 details the process of identifying and ordering the possible indexes shown in FIG.
FIG. 15 details the process of specifying a preferred set of indexes shown in FIG.
FIG. 16 shows in detail an example of the data generated during the process shown in FIG.
FIG. 17 illustrates the generation of a new index combination using the process detailed in FIG.
FIG. 18 shows a procedure for setting an index in the live database.
Example
The present invention will now be described by way of example only with reference to the accompanying drawings shown above.
A telecommunications environment is shown in FIG. 1, in which a plurality of items of user terminal equipment 101, such as telephone handsets, fax machines and modems, are connected to a local exchange 102 via respective local analog lines 103. Analog signals are digitized at the local exchange 102, after which switching and re-routing take place within the digital domain. As a result, a large number of calls are routed over digital time division multiplexed channels 104 to a trunk communication network 105.
The local exchange 102 and the trunk network 105 perform normal communication switching and allow calls to be connected in the normal manner. In addition, an advanced service node 106 gives customers access to advanced services such as store-and-forward facilities and personal number identification. Customers access the advanced service node 106 via a digital multiplexer 107 connected to the trunk network 105. A call is therefore routed to the advanced service node 106 via the trunk network 105, after which information is returned to the calling customer or the call is routed on again, via the trunk network 105, to terminal equipment 101. Alternatively, the functionality of the advanced service node 106 may be distributed throughout the trunk network 105.
When a call is made, the billing details for the call are stored at the associated local exchange 102. This call information, representing the usage for which the connected customer is to be charged, is subsequently supplied to a central management device 108 via a communication channel 109. In this way, all charging information is routed to the central management device 108, which manages the generation and distribution of customer accounts.
In operation, the system made up of the terminal equipment 101, the local exchange 102, the trunk network 105 and the advanced service node 106 connects a very large number of calls and, as a result, generates a body of operational data. Primarily, this data identifies details regarding the nature of the call being made, the destination of the call and the type of call. The call type information may identify the call as a direct local call, or as a long distance or international call, possibly including use of the services provided by the advanced service node 106. Analysis of this data provides at least two significant advantages. First, the manner in which the system operates can be changed in response to collected data representing the operating characteristics of the system. Thus, if certain regions are found to make substantially more use of advanced services than other regions, the allocation of these services can be re-routed so as to optimize their availability. Second, assessments of network usage may be made available to those designing marketing strategies, particularly where efforts are being made to make good use of the capacity available during off-peak periods.
In practice, broadly similar queries are executed against the database at regular intervals. Individual operating divisions have particular interests and require the information relating to those interests to be updated on a regular basis. A very large number of users may access the database and, over a period of time, hundreds or thousands of SQL queries are executed against the data it contains, most of these queries being executed many times so as to produce updated results as new data is included.
The system shown in FIG. 1 includes a data analysis device 110 configured to receive data from the trunk network 105, the advanced service node 106 and the central management device 108. In turn, the data analysis device 110 returns data to the trunk network 105, the advanced service node 106 and the central management device 108. Within the trunk network 105 and the advanced service node 106, modifications are made to the technical operation of these systems in response to the data received back from the data analysis device. Similarly, the data returned to the central management device 108 results in changes to the way in which customers are invoiced, which in turn generally modifies the way in which customers use the network.
Data is collected in the data analysis device 110 and stored in a form specified by the original system designers. These designers endeavor to predict the types of query that will be required later, but it is impossible to anticipate every query of interest. Data therefore tends to be stored in the data analysis device in the form of a relational database, which facilitates its subsequent manipulation in response to specific queries. These queries result in modifications to the procedures of the trunk network 105, the advanced service node 106 or the central management device 108. In addition, the data generated in response to a particular query is verified in a data verification unit 111 for reference or subsequent use.
The data analysis device 110 is shown in detail in FIG. 2. It is based substantially around a mainframe computer 201, such as an IBM ES9000, which has ten processors and is capable of operating at 50 MIPS (million instructions per second). Users access the database system via a plurality of networked user terminals 202, and data is stored in the form of data tables on disk drives 203, which can store terabytes of data.
Operational data sources such as the trunk network 105, the advanced service node 106 and the central management device 108 are shown collectively as operational data source 204. Data may also be received from other external sources, as indicated by external data source 205. The stream of operational control signals returned to the trunk network 105, the advanced service node 106 or the central management device 108 is represented by data supplied to an operation control unit 206, and data verification and printing is indicated at 207. Data is stored within the system on disk storage 203, and output data is obtained in response to queries initiated by users at the networked user terminals 202. The data held on disk storage 203 is collected continuously in response to the operation of the trunk network or of the other devices shown in FIG. 1. In addition, management data may also be held on the storage devices, and queries may be initiated that relate the operational data to the management data.
Examples of the types of database table stored on the disk storage device 203 are shown in FIGS. 3A, 3B, 4A and 4B. Information representing each communication event is received from the central management device 108. Each event is given a unique sequence number, thereby creating a new record in the database as shown in FIG. 3A. The record is completed by identifying the date, start time and end time of the event, the telephone number of the customer who initiated the event, and the type of call. Thus, at 0:01 am on December 01, 1995, a customer with telephone number 404 7241 initiated a type A call, which ended at 0:25 am. In this example, arbitrary designations are given to the types of call: a type A call represents a local call, a type B call represents a long distance call, a type C call represents a long distance call made via an operator, and a type G call represents the use of advanced services. The event identified above is recorded under event number 12345, and the next event is recorded under event number 12346. This event was initiated at 0:01 am on December 01, 1995 by a customer with telephone number 386 4851 and is again identified as a type A call.
The database in the data analysis device 110 also includes management data, as shown in FIG. 3B. The table shown in FIG. 3B maps customer telephone numbers onto customer identifiers. It will be appreciated that a particular customer may have several telephone lines with different telephone numbers, so that a given telephone number must be mapped to its associated customer. Thus, in the example shown, the customer with telephone number 404 7241 is assigned customer identification number 0074895 in record 303. Similarly, record 304 indicates that telephone number 404 7242 is assigned to the customer identified by customer identification number 057896.
The table shown in FIG. 4A associates customer identification numbers with customer addresses. Thus, it can be seen from record 305 that the customer with identification number 0074895 lives at 52 High Holborn, London.
In general, if a particular city were always located within a particular geographic area, it would not be necessary to provide further geographic data. However, geographic areas may be adjusted by a particular operator to reflect changes in the commercial environment. A separate table is therefore provided that maps towns and cities to specific regions. Thus, as identified by record 306, London is mapped to the southeast region, and Loughborough is mapped to the eastern inland region in record 307. The regional boundaries can be adjusted to reflect the location of regional offices so that, for example, the eastern inland region is combined with the western inland region to become a single inland region. Under such circumstances, it is not necessary to change the table shown in FIG. 4A; only the table shown in FIG. 4B needs to be changed, and this table has substantially fewer records than the table shown in FIG. 4A.
It will be appreciated that data records are read very quickly from the data table shown in FIG. 3A when a query is made with reference to event numbers. The data table shown in FIG. 3A is ordered by event number, each event number being unique to a particular record. Thus, given event number 12345, the database can quickly respond that the customer with telephone number 404 7241 made a type A call. Similarly, by referring to the table shown in FIG. 3B, this telephone number can be associated with a customer identifier, thereby allowing the address and the region to be identified by querying the tables shown in FIGS. 4A and 4B respectively.
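The chain of key lookups described here can be sketched as follows. The dictionaries stand in for the four tables and hold only the values quoted in the description; reducing the address table to a town name, and the function name itself, are assumptions made for brevity.

```python
# Each dictionary stands in for one table, keyed on the column it is ordered by.
event_to_phone = {12345: "404 7241", 12346: "386 4851"}   # FIG. 3A (extract)
phone_to_customer = {"404 7241": "0074895"}               # FIG. 3B (extract)
customer_to_town = {"0074895": "London"}                  # FIG. 4A (town only)
town_to_region = {                                        # FIG. 4B
    "London": "southeast",
    "Loughborough": "eastern inland",
}


def region_for_event(event_no):
    """Follow the keys from an event number through to a region."""
    phone = event_to_phone[event_no]
    customer = phone_to_customer[phone]
    town = customer_to_town[customer]
    return town_to_region[town]


print(region_for_event(12345))
```

Because regions are held in their own small table, redrawing a regional boundary only requires `town_to_region` to be changed.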
However, problems arise when a query is made with reference to another field in the records of the table shown in FIG. 3A. For example, a query may require a list to be generated of all events initiated from a particular telephone number. Alternatively, a query may be made for all events of a particular call type. More complex queries may be made using entries of this type; for example, a query may request details of all calls initiated from the southeast region and lasting for more than 10 minutes.
Taking the first of these examples, a simple query is made requesting details of all events initiated from a particular telephone number. The table shown in FIG. 3A is not indexed on the telephone number entry, so the processor needs to search the telephone number field of every record in the table. Clearly, this requires substantially more processor resources than reading the record for a given event number. The table shown in FIG. 3A is ordered by event number so that, given an event number, a particular record is identified very quickly. However, if the table is not indexed on any other field, a search must be performed to identify the particular field of interest. In the more complex example identified above, satisfying the query requires all of the records contained in the table to be searched several times on different fields.
In order to improve the speed at which searches are performed, it is possible to generate indexes for a particular table so that quick searches can also be performed on other data fields. As shown in FIG. 5, the data contained in the table shown in FIG. 3A has been used to generate an index based on telephone numbers. Each telephone number is considered in turn, and the index identifies each event initiated from that telephone number. Thus, record 301 is identified for telephone number 404 7241, and event 12345 is recorded for this telephone number. Further events 14876, 15739, 15928 and 16047 were subsequently initiated from this telephone number. The index continues until all events for the telephone number under consideration have been recorded. The index then continues with the next telephone number, in this example 404 7242, for which events 13728, 14937, 15821, 14723 and so on were recorded.
Thereafter, telephone number 404 7243 is considered along with the events recorded for it, and so on, until all telephone number entries in the data table shown in FIG. 3A have been considered. The number of records in the index shown in FIG. 5 is equal to the number of records in the original table shown in FIG. 3A. However, the table of FIG. 5 is merely an index, so that, for a particular telephone number, the index points back to specific event numbers in the main table shown in FIG. 3A. Thus, the index allows a search based on telephone number to be performed quickly, after which the remaining data contained in the records of the table shown in FIG. 3A can be obtained.
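The difference between searching every record and using an index of the kind shown in FIG. 5 can be sketched as follows. The table extract uses event numbers and telephone numbers quoted in the description; the call type given for event 14876 and all function names are illustrative assumptions.

```python
from collections import defaultdict

# Extract of the FIG. 3A table, keyed by event number.
events = {
    12345: {"phone": "404 7241", "type": "A"},
    12346: {"phone": "386 4851", "type": "A"},
    13728: {"phone": "404 7242", "type": "B"},
    14876: {"phone": "404 7241", "type": "A"},  # call type assumed for illustration
}


def full_scan(table, column, value):
    """Without an index, every record must be examined."""
    return [key for key in sorted(table) if table[key][column] == value]


def build_index(table, column):
    """Map each value of `column` to the event numbers that carry it."""
    index = defaultdict(list)
    for key in sorted(table):
        index[table[key][column]].append(key)
    return dict(index)


def index_lookup(index, value):
    """With an index, the matching event numbers are found directly."""
    return index.get(value, [])


phone_index = build_index(events, "phone")
print(index_lookup(phone_index, "404 7241"))
```

The index holds one entry per record of the table, as noted for FIG. 5, and each entry points back to an event number in the main table.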
An index similar to that shown in FIG. 5 is shown in FIG. 6, in which the data contained in the table shown in FIG. 3A is indexed by type of event. Record 301 is reproduced as record 601 in the index shown in FIG. 6. The event identified by event number 12345 was a type A call and is therefore listed under the entry for event type A. Subsequent events 13856, 14024, 15752 and 14831 were also type A calls.
Thereafter, as shown in FIG. 6, events for type B calls are recorded, beginning with event 13728 in record 602, and further type B events are recorded in sequence. Once all type B events have been recorded, the index continues with type C events, beginning with event number 12350.
Once the index on telephone number shown in FIG. 5 and the index on event type shown in FIG. 6 have been generated, records in the table shown in FIG. 3A can be accessed quickly on the basis of event number, telephone number and event type. Clearly, it is desirable to have these indexes available when responding to queries. However, the benefits of having an index available must be weighed against the costs of generating the index and, more importantly, of updating it, along with the additional demands it places on storage space. Thus, in many practical implementations, it is impossible to generate all possible indexes because insufficient disk storage space is available. An increase in disk storage space is sometimes possible but, in most practical cases, there is an upper bound on the total storage space and on the amount of storage space allocated to indexes.
The system shown in FIG. 2 is configured to store a very large amount of data. In addition, this data is in considerable demand from users issuing queries to generate new sets of data. Some of these queries are of a one-off nature, but a high proportion of the queries are executed repeatedly at relatively regular intervals, in order to assess changes to the requested sets of data in response to additions and changes made to the source data. Under these circumstances, it is virtually impossible for a database administrator to perceive the indexing requirements correctly so as to achieve optimal performance in response to the hundreds or thousands of different SQL statements made against the database. However, if optimal indexes are not provided, queries place excessive demands on the central processing unit. In addition, such queries tend to run for longer than necessary, thereby causing delays. While these problems persist, the machine is heavily overloaded, which limits the number of queries that can be run per unit time. This results in less efficient use of the machine and effectively increases the cost of the system per query.
The system of the present invention is configured to specify an indexing structure in an attempt to overcome the technical problems described above. The system is configured to analyze the SQL statements supplied to the database in order to identify possible indexes that could be used to improve the execution of SQL statements referencing the database. An estimate is made of the level of improved performance that could be achieved if each possible index were actually available in the database system. From these evaluations, a preferred list of indexes is specified. Thus, the subjective constraints placed upon a database administrator are removed, in that the machine calculates numerical indications showing the degree to which it is desirable to include a specific index in the system, before resources are allocated to actually generating that index. These numerical indications are derived from evaluations of the improved performance achievable with the index in place.
The database hardware shown in FIG. 2 is shown schematically in FIG. 7, together with the related processes performed by that hardware. The database system may be configured in accordance with database instructions licensed by IBM Corporation under the trademark “DB2”. The resulting DB2 environment is shown in FIG. 7 as 710 and includes data storage 702 and an SQL execution process 703.
In the illustrated embodiment, three tables are defined in the data storage device, shown as a first table 704, a second table 705 and a third table 706. Within the illustrated database system, a total of six indexes has been generated for each table (the number varies with the configuration), as indicated by the dashed area 707. Thus, each table may have an associated set of indexes, and the actual indexes contained within each set are determined in accordance with the method performed by the index set specifier, as defined by the preferred features of the present invention. The database also includes a catalog 709 configured to store database definitions and catalog statistics.
In DB2, a utility known as “RUNSTATS” is provided, which is configured to obtain statistics on the tables, columns and indexes in the database. The catalog provides detailed information about the dimensions of the tables, allowing empty tables to be created before data is transferred from the live data store to a similar copy of that data store. In addition, the SQL execution process includes a process known as an “optimizer”, which is configured to analyze the catalog statistics in order to optimize the execution of a particular SQL statement.
In addition to defining the table dimensions, the catalog statistics also record column cardinality, second highest and second lowest values, cluster ratios and data distribution.
Column cardinality defines the number of different values in a particular column, while full key cardinality identifies the total number of different values in an index defined for a table. A customer table may have a cardinality of more than 1 million rows, each row representing a different customer, whereas a column identifying the gender of these customers has a cardinality of only two. Similarly, the cardinality of an index identifying geographical regions may be in the low hundreds.
The second highest value, shown as HIGH2KEY, represents the second highest value in a particular column. Similarly, the value identified as LOW2KEY represents the second lowest value in a particular column. This information can be used to indicate the range of potential values within the column, particularly when considered in combination with the cardinality of the column.
The cluster ratio shows how well the ordering of the data in an index follows the ordering of the data in the original table. When a clustering index is defined for a table and the data in that table is reorganized, all of the rows in the table are placed in clustering sequence and the cluster ratio is considered to be 100 percent. Other indexes, having different columns or a different column order from the clustering index, tend to have a low cluster ratio, because their data may not be well clustered with respect to the clustering index. As rows are inserted into or deleted from the table, the cluster ratio of an index gradually decreases as more rows fall out of clustering sequence. Thus, the cluster ratio provides an indication of how well the data is ordered within a particular index and, in this embodiment, it is generated from a sample of the live data within each table. Data distribution statistics define the 10 most frequently occurring values in a column, along with the percentage of occurrences of each.
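The column statistics described above (cardinality, second highest and second lowest values, and the most frequent values with their percentages) can be sketched as follows. This is an illustration of the definitions only and is not how the RUNSTATS utility is implemented; the function name and the sample column are hypothetical, and the cluster ratio is deliberately omitted.

```python
from collections import Counter


def column_statistics(values, top_n=10):
    """Compute simple catalog-style statistics for one column of values."""
    distinct = sorted(set(values))
    counts = Counter(values)
    total = len(values)
    return {
        "cardinality": len(distinct),
        # Second highest and second lowest distinct values, where they exist.
        "high2key": distinct[-2] if len(distinct) >= 2 else None,
        "low2key": distinct[1] if len(distinct) >= 2 else None,
        # Most frequent values with their percentage of occurrences.
        "distribution": [
            (value, 100.0 * n / total) for value, n in counts.most_common(top_n)
        ],
    }


# Hypothetical sample of the call type column.
call_types = ["A", "A", "B", "A", "C", "G", "A", "B"]
stats = column_statistics(call_types)
print(stats)
```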
Input information is supplied to the live database system in substantially two forms. First, as indicated by input line 715, new data is supplied to the system and, second, as indicated by input line 716, queries in the form of SQL statements are supplied to the database. Queries entered on line 716 are executed by the SQL execution process 703 and produce output as indicated by output line 717. New data received on input line 715 results in the tables in the data store 702 being updated, and this update process is itself performed in accordance with SQL commands. Accordingly, both incoming data (including changes and deletions) and SQL queries are supplied to the SQL execution process 703, and both are executed under the control of that process.
In general, SQL statements in the form of queries supplied on input line 716 are executed more quickly when the SQL execution process 703 has access to a number of indexes in addition to the main data of each table. Consequently, if a database system is required primarily to respond to queries of this type, it is desirable to include many indexes in the system. However, when tables have to be updated in response to new data received on input line 715, or in response to changed or deleted data, the presence of many indexes places a large processing overhead on the SQL execution process 703. Therefore, if the database system is required primarily as a data repository and queries to the system are minimal, it is desirable to minimize the number of indexes in the system so as to reduce the housekeeping overhead. In most practical systems both types of input are received. The number of indexes in the system should therefore be kept low enough to limit the housekeeping overhead, while the selection of indexes actually present should be optimized, within the storage space available, to provide the best performance.
The degree to which the specification of table indexes approaches an ideal solution depends on the nature of the SQL queries supplied on input line 716. A system may be provided with a large number of indexes, but these are of little benefit if they do not relate to the predicates specified in typical SQL queries. Similarly, the greatest benefit is derived from the available indexes if they are configured to satisfy regularly occurring SQL queries in preference to infrequent ones. However, it must be recognized that some SQL queries, although not particularly common, may be so important to the relevant operation (in terms of justifying the existence of the system) that high priority must be given to them. It can therefore be seen that many competing constraints are placed upon the index set specifier 708, which attempts to provide an optimal set of table indexes.
The SQL execution process 703 includes an optimizer process configured to evaluate the best method of obtaining and manipulating the data contained in the data storage device 702 in order to satisfy an SQL statement. The optimizer examines each SQL statement executed against the database and evaluates each of a number of access paths capable of satisfying the statement. Each potential access path is assigned a cost representing the amount of processing, as well as the amount of disk access, required if that particular path is executed. The optimizer is then configured to select the access path with the lowest cost as the actual path for performing the function requested in the SQL statement.
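The selection of the lowest-cost access path can be sketched as follows. The path names and cost figures are hypothetical, and combining the processing and disk access components by simple addition is an assumption made for illustration; it is not the cost formula used by the DB2 optimizer.

```python
from dataclasses import dataclass


@dataclass
class AccessPath:
    name: str
    cpu_cost: float   # amount of processing required
    io_cost: float    # amount of disk access required

    @property
    def total_cost(self):
        # Assumption: the two components are simply added together.
        return self.cpu_cost + self.io_cost


def choose_access_path(paths):
    """Select the access path with the lowest total cost."""
    return min(paths, key=lambda path: path.total_cost)


# Hypothetical access paths for one SQL statement.
paths = [
    AccessPath("table scan", 500.0, 2000.0),
    AccessPath("index on phone number", 20.0, 35.0),
    AccessPath("index on call type", 200.0, 900.0),
]
chosen = choose_access_path(paths)
print(chosen.name, chosen.total_cost)
```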
Statement optimization may take place at execution time, which is known as dynamic SQL; in this case the content of each SQL statement typically varies from one execution to the next. Alternatively, the optimization may be performed once, before the statement is actually executed, with the result being stored in a DB2 plan. This type of optimization is known as static SQL and is used when the SQL statements are known prior to execution. Static SQL makes more efficient use of system resources, because the optimization process is performed only once for each SQL statement and the result is stored for later repeated execution.
The preferred index set designator 708 performs process stages that precede optimization: before the optimizer makes an online decision, within the SQL execution processor 703, as to which of the available indexes to use, the designator is configured to specify which indexes are actually to be generated. However, in addition to optimization considerations, the designator 708 must take into account index housekeeping, execution frequency, and statement priority.
Designator 708 designates a preferred set of indexes in response to a sample of typical SQL statements executed by the system. To obtain this information, the SQL trace process 718 tracks the queries supplied on input line 716 so that, after an appropriate period of time, a representative sample of SQL statements is supplied to the designator 708 through line 719. Designator 708 also reads the catalog statistics and a sample of the table data from the data storage device, the table data being supplied through line 720. After the index set designation process has been executed, an output signal is supplied through output line 721 so that the designation data can be produced in visually readable form by printer 722. In addition, SQL instructions are supplied through line 723 to the SQL execution processor 703, resulting in the preferred set of indexes being generated in the data storage device. Further, the information produced by printer 722 can inform the operator that additional storage is required in the data storage device before the preferred set of indexes can be constructed.
An overview of the process performed by the optimal index set designator 708 is shown in FIG. Initially, the SQL tracker 718 is activated at step 801 so that the SQL statements used to access the database over a period of time are recorded. Eventually, a sample of SQL statements has been collected and a decision is made to specify the table indexes afresh.
At this stage, the database can effectively be taken off line, so that no further queries can be made until a new table index is specified. Under such circumstances, the index set designation process is preferably performed on a hardware platform common to the database itself. Instead, in another embodiment, the index set designation process steps may be performed independently on a separate platform using data sent to and from the database platform.
The instructions for specifying the set of indexes are supplied to the external platform on a suitable data carrying medium, such as a magnetic disk, an optical disk, or a magneto-optical disk. Alternatively, the instructions may be supplied to the additional platform over a network. The loading of instructions into the index set designator 708 is represented by removable disk 724.
In step 802, the catalog statistics for each table space in the live database are updated using RUNSTATS, ensuring that up-to-date data from the catalog is available when information is supplied to the index set designator 708.
In step 803, the index set designation process is performed to designate a set of indexes. In step 804, a question is asked as to whether more disk space is to be provided; if the answer is affirmative, the disk storage allocation for generating the indexes is increased. If the answer to the question asked in step 804 is negative, control is directed to step 806.
In step 806, a question is asked as to whether details of the specified set are to be printed; if the answer is affirmative, a print signal is supplied to printer 722, possibly over a conventional parallel interface connection. If the answer to the question asked in step 806 is negative, control is directed to step 808.
In step 808, a question is asked as to whether the designation made in step 803 is to be implemented on the live system; if the answer is affirmative, it is implemented in step 809, subject to the limits on disk space. In either case, control is then directed to step 810, where the database is put back on line.
The preferred index set designation method 803 is shown in FIG. In step 901, the live database is modeled by the designator 708 according to the procedure detailed in FIG. Thereafter, in step 902, the SQL statements captured by the SQL tracking process 718 are analyzed by the designator 708 according to the procedure detailed in FIG.
Modeling the live database produces tables that are similar to the tables shown in FIG. 7 but substantially smaller than the tables in the online live system. It would be possible to run the catalog statistics procedure within the designation process, generating catalog statistics that reflect the size of the modeled tables. However, the process performed by the designator 708 is concerned with the efficient operation of the live system, and it is therefore the catalog statistics of the live system, as stored in the live catalog 702, that are of greater relevance. Thus, in step 903, the live statistics are copied from the catalog to the index set designator, which forms a default set of live indexes consisting of the primary key index and the clustering index for each table.
In step 904, base level cost estimates are calculated, as shown in detail in FIG. 13, so that the cost improvement obtained when a potentially optimal index is added can be estimated. This cost difference provides the objective function for the subsequent processing involved in specifying the set of indexes.
Most of the calculations performed to specify the set of indexes are done on a table-by-table basis; the tables are then considered again in combination when an assessment is made, given the disk space available for index generation, as to which specific indexes are to be generated on the live system. Consequently, a table is selected in step 905, candidate indexes are identified in step 906, eligible indexes are ordered according to their objective function in step 907, and an optimal set of indexes is generated in step 908. Thereafter, a question is asked in step 909 as to whether another table is available; if the answer is affirmative, control is returned to step 905. If the answer to the question asked in step 909 is negative, control is directed to step 910, where the optimal set of indexes that can be applied to the live system is specified.
A procedure 901 for modeling the live database, which generates a reduced model of the database within the index set designation process, is shown in FIG. 10. In step 1001, catalog statistics are read from catalog 702, after which empty tables are generated in the model, copying the properties of live tables 704, 705, 706 as described by their respective catalog definitions. The size of each model table is limited to 5000 rows, which is substantially fewer than the number of rows in the live database tables.
To ensure that the reduced model of the tables within the index set designator 708 accurately reflects the live tables in the data storage device 702, each of the empty tables generated in step 1002 needs to be filled with a representative sample of data entries read from the live table. In step 1003, a live table is selected along with its respective catalog.
In step 1004, the indexes associated with the selected table and operating in the live database are examined to identify the particular index, denoted HF1, that has the highest first-key cardinality value. In step 1005, HIGH2KEY, LOW2KEY and COLCARD for the first column of HF1 are identified, and in step 1006, the distribution of data for the first column is determined.
Thereafter, in step 1007, a set of random values is generated for the first column of HF1 within the range defined by LOW2KEY and HIGH2KEY. The selection of values is weighted according to the frequency distribution determined in step 1006, so that when the model table is filled with up to 5000 entries obtained from the live table, the distribution of values is such that processing performed on the model places demands on the processing facilities that substantially reflect the demands made when the same processing is executed on the respective live table. The live table entries matching the randomly selected values identified in step 1007 are read in step 1008, and the entries read in step 1008 are then written to the model table in step 1009.
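The value selection of steps 1005 to 1007 can be sketched roughly as follows. This is a minimal Python illustration rather than the implementation described in the specification: the function name `sample_key_values`, the representation of the frequency distribution as (value, frequency) pairs, and the uniform fallback are all assumptions introduced here for illustration.

```python
import random

MODEL_ROWS = 5000  # model table size limit stated in the description


def sample_key_values(low2key, high2key, distribution, n=MODEL_ROWS, seed=None):
    """Pick n key values in [low2key, high2key], weighted by frequency.

    distribution is a list of (value, frequency) pairs, a hypothetical
    stand-in for the column distribution determined in step 1006.
    """
    rng = random.Random(seed)
    in_range = [(v, f) for v, f in distribution if low2key <= v <= high2key]
    if not in_range:
        # Assumption: with no distribution data, draw uniform integers.
        return [rng.randint(low2key, high2key) for _ in range(n)]
    values = [v for v, _ in in_range]
    weights = [f for _, f in in_range]
    # Frequent values are drawn more often, mirroring the live distribution.
    return rng.choices(values, weights=weights, k=n)
```

The values returned would then be used to read matching rows from the live table (step 1008) and write them to the model table (step 1009).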
In step 1010, the model data is first processed so that the entries in the data table are reorganized according to the clustering key. Potential indexes are generated for each table, and statistics are gathered on the nature of these indexes. The resulting index statistics are scaled up to production dimensions, the scaled values are stored, and the original table data is deleted.
The question asked in step 1011 is answered in the affirmative until all of the model tables have been populated with entries selected at random from the live tables held in the data storage device 702, weighted according to frequency.
A procedure 902 for parsing the captured SQL statements is shown in detail in FIG. In step 1101, each SQL statement is processed to determine whether this is its first occurrence or whether the statement has been seen before. Each unique SQL statement is given a unique label, and each time the same SQL statement is identified again, its number of occurrences is updated in a frequency column.
A table is selected in step 1102, and a statement labeled in step 1101 is then selected in step 1103. Thus, steps 1103 through 1106 are performed only once for each unique captured statement.
A question is asked in step 1104 as to whether the statement selected in step 1103 uses the table selected in step 1102. If the answer to the question is affirmative, then in step 1105 the statement label is added to the appropriate list of tables. Instead, if the answer to the question at step 1104 is negative, step 1105 is bypassed and control is directed to step 1106.
In step 1106, a question is asked as to whether another statement is to be considered; if the answer is affirmative, control returns to step 1103 so that the next labeled statement can be selected. If the answer is negative, no further statement is available, the statement pointer is reset, and control is directed to step 1107. In step 1107, a question is asked as to whether another table exists, and if the answer is affirmative, control returns to step 1102 and the next table is selected.
Eventually all tables are considered, and as a result the answer to the question made in step 1107 is negative.
The table lists generated in step 1105 are shown in detail in FIG. The list consists of a first column 1201 identifying the table, a second column 1202 identifying the SQL statements by the labels assigned in step 1101, and a third column 1203 giving the frequency of each statement, that is, the number of times the particular SQL statement occurs in the traced set.
As shown in FIG. 12, Table 1 is selected first in step 1102, and through repeated operation of step 1105, SQL statements A, B, C, and so on through Z are added to the list. Table 2 is then selected and the SQL labels relevant to that table are identified, and finally Table 3 is selected, yielding the statement labels associated with that table. Although three tables are shown in this embodiment, it should be appreciated that any number of tables may exist in a large relational database.
The frequency of occurrence is recorded in the third column 1203 and is typically measured in thousands. Thus, x occurrences are recorded for statement A, y occurrences are recorded for statement B, and z occurrences are recorded for statement C.
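The labeling and per-table listing of steps 1101 to 1107 might be sketched as below. This is only an illustrative Python sketch: the function names, the `S1`, `S2` label scheme, and the naive whitespace-token test for whether a statement references a table are assumptions, not details taken from the specification.

```python
from collections import Counter


def label_statements(trace):
    """Give each distinct statement a label and count its occurrences (step 1101)."""
    freq = Counter(trace)  # keys keep first-seen order
    labels = {sql: f"S{i}" for i, sql in enumerate(freq, start=1)}
    return labels, freq


def statements_per_table(tables, labels, freq):
    """Build, for each table, the list of (label, frequency) pairs (step 1105)."""
    lists = {table: [] for table in tables}
    for table in tables:                       # steps 1102 and 1107
        for sql, label in labels.items():      # steps 1103 and 1106
            # Step 1104, simplified: a real system would parse the SQL.
            if table.lower() in sql.lower().split():
                lists[table].append((label, freq[sql]))
    return lists
```

Because each unique statement appears once in `labels`, the inner loop runs once per unique statement, as the description requires.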
A procedure 904 for assessing base level costs is shown in FIG. A table is selected in step 1301, and an SQL statement is selected in step 1302. In step 1303, the cost of executing the SQL statement selected in step 1302 against the table selected in step 1301 is evaluated. The cost calculation may be accomplished using the timeron value obtained from the optimizer in the SQL execution processor 703. However, the timeron value considers only the use of indexes and takes no account of index maintenance. Consequently, in the preferred embodiment, instructions developed under the name "QCF" (TM) by Innovation Management Solutions of Florida, USA are executed; these give a cost value that combines the CPU usage and elapsed time for executing the SQL statement with an assessment of index maintenance. These cost values do not represent absolute measurements, but relative cost values can be obtained by applying the same method when additional indexes are present. This provides an objective function by which, when weighed against the demand for additional disk space, a particular index can be selected in preference to a more costly one.
In step 1304, for each statement, the cost value calculated in step 1303 is multiplied by an execution frequency factor, and in step 1305, the resulting product is multiplied by a priority factor. Then, in step 1306, the cost is added to the base cost sum for the particular table, and in step 1307 a question is asked as to whether there is another statement. If the answer to the question is affirmative, control returns to step 1302 so that the next SQL statement is selected and the cost calculation procedure is repeated.
Finally, if the answer to the question in step 1307 is negative, a question is asked in step 1308 as to whether there is another table. If the answer is affirmative, another table sum is generated in step 1309 and control is returned to step 1301, thereby allowing the next table to be selected. Eventually, the answer to the question made at step 1308 is negative, which leads control to step 905.
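The per-table accumulation of steps 1302 to 1307 amounts to a weighted sum, which can be sketched as follows. The function name `base_cost`, the dictionary layout of a statement, and the `estimate_cost` callback (standing in for the optimizer or costing tool) are hypothetical names introduced for this sketch.

```python
def base_cost(statements, estimate_cost):
    """Sum cost x frequency x priority over the statements of one table.

    statements: list of dicts with keys "sql", "frequency", "priority".
    estimate_cost: hypothetical callback returning a relative cost value
    for one statement (step 1303).
    """
    total = 0.0
    for stmt in statements:
        cost = estimate_cost(stmt["sql"])   # step 1303
        cost *= stmt["frequency"]           # step 1304
        cost *= stmt["priority"]            # step 1305
        total += cost                       # step 1306
    return total
```

Running the same function again with a candidate index in place would give the new cost against which the base cost is later compared.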
The procedure for costing candidate indexes in order to identify eligible indexes is shown in detail in FIG. In step 1401, candidate indexes are identified from the sets of predicates obtained from the SQL statements associated with the table under consideration, that is, the table selected in step 905, according to the list shown in FIG. Indexable predicates can thus be identified by analyzing the SQL statements associated with the table under consideration. The identified indexable predicates refer to specific columns, and they are grouped by table and by SQL statement to form sets of predicates. These sets of predicates provide the starting point for constructing candidate indexes that may be of benefit in satisfying the associated SQL statements. In addition, catalog statistics are generated for each identified index.
A candidate index is selected in step 1402, and in step 1403 the index selected in step 1402 is generated as part of the table model held in the index set designator 708. After the new index has been created in step 1403, the associated catalog entries in the model are updated in step 1404 so that the index appears at its full, live size.
Candidate indexes are generated against the model database. The DB2 Recover Index utility is run against the table to populate the index, and the DB2 RUNSTATS utility is run against the database to gather statistics. Statistics are gathered for each index, stored in a database, and scaled up to live dimensions for use throughout the process.
The possible indexes identified in step 1401 include every index combination that could make use of a particular set of columns. Thus, the columns may be arranged in different orders, and all possible orders are included. Similarly, the entries in each column may be ordered in ascending or descending sequence, and again all of these possible combinations are present at first. Not all of these combinations are actually required as candidate indexes, so a selection process is performed in step 1402 to identify the candidate indexes. The position of the columns is referred to as the column sequence, and the choice between ascending and descending is referred to as the ordering. Indexes that share the same column sequence are grouped together, so that the members of a group differ only in their ordering. All indexes defined within a group, that is, all indexes with the same column sequence, are generated in the model. Catalog statistics for each of these indexes are also generated, after which all of the captured SQL that references the table is run against the indexes. The explain function of the database is then used to identify the specific indexes within the group that would actually be used. These indexes become the selected candidate indexes, and the remaining indexes from the potential set are eliminated. The generated indexes are then deleted and the next group is considered, until all of the groups have been considered; finally, before control is directed to the loop starting at step 1402, the generated indexes are deleted once more.
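The grouping of possible indexes by column sequence, with the members of each group differing only in ordering, can be illustrated with a short Python sketch. It assumes, for simplicity, that every possible index uses all of the given columns; the function name `potential_indexes` and the `ASC`/`DESC` markers are illustrative choices rather than terms from the specification.

```python
from itertools import permutations, product


def potential_indexes(columns):
    """Group all possible indexes over the columns by column sequence.

    Returns {sequence: [index, ...]} where each index is a tuple of
    (column, direction) pairs and every index in a group shares the same
    sequence but has a different ascending/descending ordering.
    """
    groups = {}
    for sequence in permutations(columns):
        groups[sequence] = [
            tuple(zip(sequence, directions))
            for directions in product(("ASC", "DESC"), repeat=len(sequence))
        ]
    return groups
```

For n columns this yields n! groups of 2^n indexes each, which shows why the description limits the enumeration to four columns and prunes each group to the indexes the database would actually use.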
Accordingly, as described above, a candidate index is selected in step 1402, a candidate index is generated in step 1403, and the catalog is updated in step 1404 in response to the newly generated index.
With the new index generated in the model, the SQL statements that reference the respective table are costed by a method substantially similar to the method for calculating the base cost level shown in detail in FIG. After the SQL cost has been calculated with the new index in place in step 1405, the new cost value is stored in step 1406, and the index generated in step 1403 is then deleted in step 1407.
In step 1408, a question is asked as to whether another index exists and if the answer is affirmative, control is returned to step 1402, resulting in the next possible index being selected. Eventually, the cost of all SQL statements is calculated for all possible indexes, resulting in a negative answer to the question made in step 1408.
The cost values stored in step 1406 define a new cost for each of the possible indexes identified in step 1401. For each of these indexes, a cost saving is calculated in step 1409 by subtracting the new cost from the base cost calculated in step 904. This cost saving value represents the objective function: an index with a high cost saving value is considered more suitable than an index with a low cost saving value. The cost saving value is stored for each index in step 1410, and a question is asked in step 1411 as to whether another index exists. If the answer is affirmative, a cost saving value is calculated for the next index in step 1409. Eventually, cost saving values have been calculated for all possible indexes, and the question asked in step 1411 is answered in the negative.
As a result of storing the cost savings in step 1410, a list of indexes is produced with a cost saving assigned to each. Since this represents the objective function, the eligible indexes are ordered according to it in step 907, placing the index with the highest cost saving at the top of the eligibility list.
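The calculation of cost savings and the ordering of step 907 reduce to a subtraction and a descending sort. The sketch below uses a hypothetical function name, `rank_by_saving`, and made-up index identifiers and cost values purely for illustration.

```python
def rank_by_saving(base_cost, new_costs):
    """Order indexes by cost saving, highest first.

    new_costs maps an index identifier to the workload cost measured with
    that index in place; saving = base cost - new cost (step 1409).
    """
    savings = {ix: base_cost - cost for ix, cost in new_costs.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)
```

With a base cost of 100 and new costs of 90, 40 and 75, the index with new cost 40 saves the most and is placed first.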
The process 908 shown in FIG. 9 for processing the ordered eligible indexes is shown in detail in FIG. 15, and examples of operations performed according to the procedure of FIG. 15 are illustrated in the accompanying figures.
In FIG. 16, eleven possible indexes are shown, represented by the unique identification numbers 1625, 1616, 1604, 1673, 1612, 1646, 1635, 1691, 1622, 1683 and 1617. The cost savings identified for these indexes by the procedures described above define their priority for inclusion in the specified set of optimal indexes. Accordingly, the eligible indexes shown in FIG. 16 are ordered by priority, as given by the cost saving recorded for each index. These cost savings have no absolute meaning; they give a relative indication of the saving calculated according to the procedure described above. Thus index 1625 is identified as providing a cost saving of 73, index 1616 a cost saving of 72, index 1604 a cost saving of 68, and so on down to index 1617, which provides a cost saving of 4. In this way, the information required to order the indexes 1625 to 1617 in terms of their cost saving eligibility is calculated according to the procedure detailed for step 803.
In step 1501 of FIG. 15, the cost savings are considered and normalized to facilitate subsequent processing. Normalization allows the cost savings to be considered within a predetermined range, which in this example is chosen as 0 to 9999. The total cost saving is calculated, as shown at 1681 in FIG. 16, and is 500 in this example. The full range is divided by this total cost saving 1681 to give a unit range value, which is then multiplied by each cost saving value to give a distribution value. The distribution values are calculated in step 1502, giving the normalized cost savings. Thus, according to the procedure performed in step 1502, the cost saving of 73 for index 1625 is normalized to a value of 1460, and the cost saving of 72 for index 1616 is normalized to a value of 1440. Normalized values are similarly calculated for all of the indexes under consideration, so that, when summed, the normalized values equal the full range of values in the distribution, which in this example is 9999. In step 1503, a question is asked as to whether another genetic iteration is required; on the first iteration the answer is affirmative. The number of genetic iterations required is selected in advance, and a counting operation is performed in step 1503.
If the answer to the question is affirmative, a random number in the range 0 to 9999 is generated in step 1504. This random number is used in step 1505 to select a particular index. The selection made in step 1505 is therefore random, but weighted in terms of the cost saving provided by each index. Thus, index 1617 is selected by a number in the range 0 to 80, index 1683 by a number in the range 81 to 400, index 1622 by a number in the range 401 to 820, and so on, until finally index 1625 is selected by a number in the range 8540 to 9999. The span of numbers assigned to a particular index is proportional to its cost saving, so that, over many iterations, an index with a higher cost saving is on average selected more often than an index with a lower cost saving. However, indexes with low cost savings still remain in the pool and can be selected by the genetic procedure.
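The normalization of step 1502 and the weighted selection of steps 1504 and 1505 correspond to what is commonly called roulette-wheel selection. The Python sketch below is an illustration under stated assumptions: the function names are invented, the entry `"rest"` is a hypothetical placeholder lumping together the indexes whose individual savings are not needed here, and the exact boundary convention (half-open ranges over 10000 numbers) is an assumption that may differ by one from the ranges quoted in the text.

```python
from bisect import bisect_right
from itertools import accumulate

RANGE = 10000  # random numbers are drawn from 0..9999


def normalise(savings):
    """Scale each cost saving so that the scaled values sum to RANGE."""
    total = sum(savings.values())
    return {ix: s * RANGE / total for ix, s in savings.items()}


def pick(savings, r):
    """Return the index whose span of numbers contains r (0 <= r < RANGE)."""
    norm = normalise(savings)
    ids = list(norm)
    upper = list(accumulate(norm[ix] for ix in ids))  # cumulative upper bounds
    return ids[min(bisect_right(upper, r), len(ids) - 1)]


# Savings listed from the bottom of the eligibility list upward; total 500.
example = {1617: 4, 1683: 16, 1622: 21, "rest": 386, 1625: 73}
```

With a total saving of 500 the unit range value is 20, so a saving of 73 occupies a span of 1460 numbers; in practice `r` would come from a random number generator such as `random.randrange(RANGE)`.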
A new combination of indexes is generated on each iteration, and each combination gives a specific cost saving. Each index combination is therefore given its own index identification, and the combined index is ordered by the objective function so that it can be included in the table shown in FIG.
Accordingly, a specific index is selected in step 1505, and in step 1506 the selected index is written into an index buffer. An index buffer 1791 is shown in FIG. 17; it contains six buffer locations, representing the maximum number of indexes allowed for the particular table under consideration. In this particular embodiment each table is allowed a maximum of six indexes, as shown in FIG. 17, but this figure may be adjusted to suit particular local operating conditions. Thus, buffer 1791 has six positions, and the identification of the selected index, such as index 1625, may be placed at any of these positions, the position being selected at random. As shown in FIG. 17, the identification of index 1625 is placed at the second buffer position of index buffer 1791.
In step 1507, a second random number is generated in the range of 0 to 9999, so that the second parent index is selected in step 1508. The indication of the selected index is written into the second index buffer as shown at 1792 in FIG. In this example, the index 1604 is consequently selected by the random number generated in step 1507 and randomly positioned at the fourth position in the buffer 1792.
In step 1510, a buffer cut position is randomly selected at any boundary between positions within the buffer. In this example, the cut position lies between the second position and the third position, as indicated by arrow 1793 in FIG. About this cut position, buffer 1791, regarded as the first parent, and buffer 1792, regarded as the second parent, are combined. The result of this exchange is shown in buffers 1794 and 1795. In buffer 1794, index 1625 is obtained from first parent 1791 and placed in the second position, and index identifier 1604 is obtained from parent 1792 and placed in the fourth position. The other offspring of this combination, shown in buffer 1795, contains no index identifier; it is therefore regarded as a void child and is not considered further. Thus, in step 1511 of FIG. 15, the two parents are “bred”, resulting in the creation of a child 1794 that includes indexes 1625 and 1604.
To widen further the range of potentially optimal indexes available, a genetic mutation stage is provided in which the child defined by the contents of buffer 1794 is mutated. Another random number is generated and, as a result, another index is selected. As a result of this mutation process in step 1512, buffer 1796 is loaded with the index identifications obtained from child 1794, and index 1635 is added at the sixth position, chosen at random. In step 1513, the cost savings of the child generated in step 1511 and of the mutant generated in step 1512 are evaluated.
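The breeding and mutation of steps 1510 to 1512 can be sketched with six-slot lists standing for the index buffers. This is an illustrative Python sketch only: the function names are invented, empty positions are represented by `None`, and placing the mutating index in a randomly chosen free slot is an assumption, since the text does not say what happens if the chosen position is occupied.

```python
import random

SLOTS = 6  # maximum number of indexes per table in the described embodiment


def crossover(parent_a, parent_b, cut):
    """Swap buffer tails at the cut position; discard void (empty) children."""
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return [c for c in (child_1, child_2) if any(s is not None for s in c)]


def mutate(child, new_index, rng):
    """Copy the child and add new_index at a randomly chosen free position."""
    free = [i for i, slot in enumerate(child) if slot is None]
    mutant = list(child)
    if free:
        mutant[rng.choice(free)] = new_index
    return mutant


# Buffers 1791 and 1792 from the example, cut between positions 2 and 3.
parent_1 = [None, 1625, None, None, None, None]
parent_2 = [None, None, None, 1604, None, None]
children = crossover(parent_1, parent_2, cut=2)
```

With these parents, one offspring holds 1625 in the second position and 1604 in the fourth, while the other is empty and is dropped, matching buffers 1794 and 1795 in the example.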
In step 1514, new indexes 1794 and 1796 are added to the pool of potential indexes in order of eligibility. Cost savings are calculated for each index with reference to the table under consideration so that the index can be added to the list of FIG. 16 ranked according to the resulting cost savings. The overall cost savings are recalculated and then the new standardized cost savings are recalculated for all indexes including the newly added index. From this standardized distribution of cost savings, a value is calculated for each index in step 1502 and the question is again asked whether there is a need for genetic repeats.
If the answer to the question asked in step 1503 is again affirmative, a random number in the range 0 to 9999 is generated, a new index is selected in step 1505, and an indication of this index is written to the first parent buffer 1791 in step 1506. A second random number is generated in step 1507, a second parent is selected in step 1508, and an indication of this parent is written to buffer 1792 in step 1509.
In step 1510, the cut position is again selected at random, the parents are bred in step 1511, and their offspring are written to buffers 1794 and 1795, after which mutants are generated from the valid children. Thus, each iteration adds up to four new indexes to the breeding pool.
Finally, if the answer to the question asked in step 1503 is negative, control is directed to step 805 of FIG. While the process shown in FIG. 15 is repeated, new index combinations are identified in a substantially random manner. However, each new index is tested to determine the resulting cost saving, and the indexes with the highest cost saving values are placed near the top of the list shown in FIG. In addition, a relatively high cost saving value widens the span of numbers that selects the index, thereby increasing the likelihood that such indexes will be selected for breeding. Nevertheless, relatively unpromising indexes remain in the pool, so it remains possible for such an index to be selected. When the process is repeated a sufficient number of times, index combinations with very high cost savings are identified, and these are placed near the top of the list shown in FIG. Finally, when the process shown in FIG. 15 is complete, the indexes, including the newly bred indexes, are listed in order of decreasing cost saving, ready for generation when the specification is made.
Details of the method identified in step 910 of FIG. 9 for configuring the database to include the specified index are shown in FIG. Objective functions that specify associated cost savings are considered for indexes only with respect to their associated tables. However, in an operating database system, multiple tables must function together. Therefore, if a given storage space is available, the storage space must be allocated for indexes associated with all existing tables.
In step 1801, the amount of disk space used by each table present in the live database is calculated, and in step 1802 the values calculated in step 1801 are summed to give the total disk space required for all tables. In step 1803, the disk space required for each table is divided by the total disk space to give each table's percentage share of the base disk space. In step 1804, space is allocated to the indexes according to these percentages, so that the relative amount of storage allocated for index generation is approximately equal to the relative allocation of storage space to the tables. Thus, if a table occupies a large amount of disk space, a correspondingly large amount of disk space is allocated to the indexes operating over that table.
A table is selected at step 1805 and the most appropriate index obtained for that table is selected at step 1806. In step 1807, a question is asked as to whether sufficient disk space is allocated for the preferred index selected in step 1806 to be generated on the live system. If this answer is affirmative, the index selected in step 1806 is designated for generation in step 1808. Instead, if sufficient disk space is not available, the answer to the question made in step 1807 results in a negative result, step 1808 is bypassed, and control is directed to step 1809.
In step 1809, a question is asked as to whether another table exists; if the answer is affirmative, control is returned to step 1805, so that the next table is selected and the appropriate index for that table is considered in steps 1806 and 1807. Finally, when the indexes for all of the tables have been considered, the question asked in step 1809 is answered in the negative.
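The proportional allocation of steps 1801 to 1804 and the space check of steps 1805 to 1809 can be sketched as follows. The function names, table names and sizes are hypothetical, and the sketch considers only the single most eligible index per table, as the steps above describe.

```python
def allocate_index_space(table_sizes, index_space):
    """Share index_space among tables in proportion to table size."""
    total = sum(table_sizes.values())                       # step 1802
    return {t: index_space * size / total                   # steps 1803-1804
            for t, size in table_sizes.items()}


def choose_indexes(best_index_size, allocation):
    """Return the tables whose most eligible index fits its allocation."""
    return [t for t, size in best_index_size.items()        # steps 1805-1809
            if size <= allocation[t]]                       # step 1807
```

For example, with tables of 600, 300 and 100 units and 200 units of index space, the allocations are 120, 60 and 20 units, so an index needing 80 units on the second table would not be designated.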
If the answer to the question asked in step 808 is affirmative, the procedure detailed in FIG. 18 is performed. A new index structure is thereby created in the live database, after which the database is placed back on line in step 810 with the new indexes in place.
Referring to FIG. 14, the process performed in step 1401 becomes very time consuming if a set of predicates results in the identification of an index requiring more than four columns. Under these circumstances, the four most preferred columns are selected and the remaining columns are provisionally excluded. The selection is made on the basis of filter factors, the four columns with the lowest filter factors being selected. These four columns are processed in all possible sequences and orderings, and candidate indexes are selected as described above.
After all eligible indexes have been determined, the larger indexes that were provisionally excluded are built up by adding the fifth preferred column, the sixth preferred column, the seventh preferred column, and so on, thereby generating a new five-column index, a new six-column index, a new seven-column index, and so on. These indexes are built on the most cost-effective candidate index containing the required four columns. The additional columns are added to the four-column candidate index in ascending order only. Catalog statistics are calculated for these new indexes, after which the new indexes are costed and added to the ordered eligibility list alongside the indexes costed previously.
Referring to FIG. 15, the number of appropriate indexes considered for the genetic process initiated in step 1505 is important, resulting in a relatively long processing time. Under these circumstances, it is preferable to place an upper limit on the size of the “binding pool” before the genetic process is performed. In general, the size of the binding pool may be limited to a maximum of 30 suitable indexes before performing the genetic process.
Basically, the purpose of the genetic process is to find a combination of indexes that significantly reduces the processing overhead of executing a typical set of SQL queries. To reduce processing time, it is preferable to “prepare the pool” by adding sets of indexes that are likely to be particularly advantageous.
The first stage of pool preparation involves examining the existing live database system to determine which indexes are actually used in the live system. This set of indexes is then added to the set of eligible indexes determined as described above.
The second stage of pool preparation re-evaluates the eligibility ranking on the assumption that the most eligible index is already present in the live system. The cost savings are therefore reconsidered from a starting position in which the most eligible index, as calculated above, is treated as belonging to the live system. With this index added, the cost savings of some of the remaining eligible indexes change significantly, which effectively re-orders the list of eligible indexes. The most cost-effective remaining eligible index is then added to the system, and the cost savings are recalculated with this new index present. The process is repeated iteratively until a set of up to six new indexes has been built. Each of these sets of indexes is added to the binding pool an appropriate number of times.
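The second stage amounts to a greedy selection in which the savings are re-costed after each addition. The sketch below illustrates the idea under stated assumptions: `prepare_pool`, `workload_cost`, `toy_cost` and the toy figures are hypothetical, index identifiers are assumed to be sortable strings, and stopping early when no remaining index saves anything is an added assumption not stated in the text.

```python
def prepare_pool(candidates, workload_cost, max_new=6):
    """Return the growing sets of 1, 2, ... up to max_new indexes.

    workload_cost: hypothetical callable that takes a frozenset of the
    indexes assumed present and returns the cost of the workload.
    """
    chosen = []
    remaining = set(candidates)
    sets_for_pool = []
    current_cost = workload_cost(frozenset())
    while remaining and len(chosen) < max_new:
        # Re-cost every remaining index given the indexes already chosen.
        savings = {
            idx: current_cost - workload_cost(frozenset(chosen + [idx]))
            for idx in remaining
        }
        best = max(sorted(savings), key=savings.get)
        if savings[best] <= 0:
            break  # assumption: stop when nothing more is saved
        chosen.append(best)
        remaining.remove(best)
        current_cost -= savings[best]
        sets_for_pool.append(tuple(chosen))
    return sets_for_pool


# Toy workload: cost of each query with no index (None) or a given index.
QUERY_COSTS = {
    "q1": {None: 100, "ix_a": 10, "ix_b": 20},
    "q2": {None: 80, "ix_b": 15},
    "q3": {None: 50, "ix_c": 20},
}


def toy_cost(present):
    """Sum, over all queries, the cheapest cost the present indexes allow."""
    total = 0
    for options in QUERY_COSTS.values():
        total += min(cost for ix, cost in options.items()
                     if ix is None or ix in present)
    return total


pool_sets = prepare_pool(["ix_a", "ix_b", "ix_c"], toy_cost)
print(pool_sets)
```

In this toy workload, `ix_a` saves 90 units on its own but only 10 once `ix_b` is present, so `ix_c` overtakes it in the ranking. This is the re-ordering effect the paragraph describes.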

Claims

1. A method of determining, in a computerized database system and using the hardware resources of that system under the control of computer software, a set of indexes for use in combination with database tables stored in machine-readable form, wherein
the computerized database system is configured to present to a user, in response to a data query, a data set identified from data contained in the data tables, the method comprising the steps of:
analyzing a plurality of data queries submitted to the computerized database system to identify indexable terms;
identifying, from the indexable terms identified in the analyzing step, a plurality of candidate indexes capable of reducing the amount of processing required of the computerized database system in responding to each data query;
evaluating, for each of the plurality of candidate indexes identified in the identifying step, a cost saving indicating the reduction in the amount of processing required of the computerized database system in responding to a representative sample of the actual data queries received by the computerized database system, for a table of actual data stored in the computerized database system and formed using the identified candidate index; and
determining a preferred set of indexes to be used in the computerized database system on the basis of the cost saving evaluated for each of the plurality of candidate indexes;
wherein the cost saving is the amount of processing representing the difference between the amount of processing performed by the computerized database system in responding, without using the candidate index, to the representative sample of actual data queries received by the computerized database system, and the amount of processing performed by the computerized database system in responding to that representative sample using the candidate index; and
wherein, in the step of evaluating the cost saving, a base cost is calculated, being the amount of processing performed by the computerized database system in responding, without using candidate indexes, to the representative sample of actual data queries received by the computerized database system; the amount of processing performed by the computerized database system in responding to that representative sample using each candidate index is measured; and the evaluation is performed by subtracting the measured amount of processing from the base cost.

2. The method of claim 1, wherein, in evaluating the cost saving, for an actual data table containing more entries than a permitted maximum, a reduced data table is generated that contains only entries selected as a representative sample from the entries read from that actual data table, and the amount of processing performed by the computerized database system in responding to the representative sample of actual data queries received by the computerized database system is measured for the reduced data table.

3. The method of claim 2, wherein database statistics relating to the actual data table are also used for the reduced data table.

4. The method of claim 1, wherein the evaluation of the cost saving includes an evaluation of the overhead of storing an index.

5. Apparatus for determining, in a computerized database system, a set of indexes for use in combination with database tables stored in machine-readable form, the computerized database system being configured to present to a user, in response to a data query, a data set derived from data contained in the data tables, the apparatus comprising:
analysis means for analyzing a plurality of data queries submitted to the computerized database system to identify indexable terms;
identification means for identifying, from the indexable terms identified in the analysis, a plurality of candidate indexes capable of reducing the amount of processing required of the computerized database system in responding to each data query;
evaluation means for evaluating, for each of the plurality of candidate indexes identified in the identification, a cost saving indicating the reduction in the amount of processing required of the computerized database system in responding to a representative sample of the actual data queries received by the computerized database system, for a table of actual data stored in the computerized database system and formed using the identified candidate index; and
determining means for determining a preferred set of indexes to be used in the computerized database system on the basis of the cost saving evaluated for each of the plurality of candidate indexes;
wherein the cost saving is the amount of processing representing the difference between the amount of processing performed by the computerized database system in responding, without using the candidate index, to the representative sample of actual data queries received by the computerized database system, and the amount of processing performed by the computerized database system in responding to that representative sample using the candidate index; and
wherein the evaluation means is configured to calculate a base cost, being the amount of processing performed by the computerized database system in responding, without using candidate indexes, to the representative sample of actual data queries received by the computerized database system; to measure the amount of processing performed by the computerized database system in responding to that representative sample using each candidate index; and to perform the evaluation by subtracting the measured amount of processing from the base cost.

6. A computerized database system comprising the apparatus of claim 5 and configured to present to a user, in response to a data query, a data set derived from data contained in actual data tables stored in machine-readable form.