JP6411232B2

JP6411232B2 - Sampling apparatus and sampling program

Info

Publication number: JP6411232B2
Application number: JP2015015094A
Authority: JP
Inventors: 繁雄廣瀬; 誠嶋村; 基孝金松
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2015-01-29
Filing date: 2015-01-29
Publication date: 2018-10-24
Anticipated expiration: 2035-01-29
Also published as: JP2016139361A

Description

Embodiments of the present invention relate to a sampling apparatus and a sampling program.

In general, when records in a database are randomly sampled, many outliers may be mixed in, so that the distribution of the sample differs from that of the original data and correct analysis results may not be obtained.

To meet the demand for fast data sampling that preserves the distribution of the original data, sampling has been performed at high speed using a B-Tree index, which holds index information.

When a B-Tree index is used to detect outliers in one-dimensional data, a point (the Key value of a record) is an outlier if its distance from neighboring points is large compared with the distances between other points (the Key values of other records).

Detecting outliers strictly requires a prior calculation (clustering) over all records. Because of its size, index information is generally held on a hard disk device, so clustering requires reads from the hard disk device and a computation cost of O(n log n), which is a problem.

Donald E. Knuth, "The Art of Computer Programming", Volume 3: Sorting and Searching, 1973, 723p

The problem to be solved by the present invention is to provide a sampling apparatus and a sampling program that can perform random sampling with outliers excluded at high speed, while preserving the data distribution and without classifying the one-dimensional data by clustering or the like.

The sampling apparatus of the embodiment comprises: a record storage unit that stores a plurality of records constituting a database; a B-Tree index storage unit that stores, in addition to a B-Tree index that has a hierarchical structure and holds index information as an index, the following additional information derived from the records stored in the record storage unit: a sparse density for each leaf block, expressed as the average distance between Keys, the average of the sparse densities of all leaf blocks, the standard deviation of the sparse densities of all leaf blocks, and the number of records of each child node; an application that issues a sampling request in response to a user query to perform random sampling from the database; a request conversion unit that receives the sampling request from the application and converts it into a sampling command; and a data operation unit that interprets the sampling command received from the request conversion unit to determine sampling records and, when acquiring the sampling records from the record storage unit, regards any leaf block whose sparse density is at or above a predetermined value larger than the average sparse density of all leaf blocks as containing outliers, excludes it, and acquires records from the leaf blocks that were not excluded.

A block diagram showing the schematic configuration of a sampling apparatus according to an embodiment of the present invention.
A diagram explaining outliers that accompany random sampling.
A diagram explaining leaf blocks to be excluded.
A diagram explaining the normal distribution and the proportion to exclude.
A diagram showing the structure of a B-Tree index to which sparse density has been added.
A diagram explaining the uneven distribution of records among specific leaf blocks.
A diagram explaining random sampling in a B-Tree that takes selection probability into account.
A diagram explaining the B-Tree update procedure and the scope of recalculation.
A diagram explaining the B-Tree update procedure and the scope of recalculation.
A diagram explaining the B-Tree update procedure and the scope of recalculation.
A diagram explaining the B-Tree update procedure and the scope of recalculation.
A diagram explaining the B-Tree update procedure and the scope of recalculation.
A flowchart explaining the flow of B-Tree construction processing in the sampling apparatus.
A flowchart explaining the flow of random sampling processing in the sampling apparatus.
A diagram explaining the processing of a sampling request in the worked example.
A diagram explaining the calculation of the standard deviation of sparse density in the worked example.
A diagram explaining the determination of sampling records in the worked example.
A diagram explaining the determination of sampling records and the exclusion of leaf blocks in the worked example.

An embodiment of the present invention is described below with reference to the drawings. In the drawings, identical parts are given identical reference numerals, and redundant description is omitted.

First, the main terms used in the present embodiment are explained.

A "record" is one of the units that make up a database and corresponds to one item of data.

The "root node" is the node at the top of the B-Tree.

A "leaf node" is a node at the bottom of the B-Tree that is connected to leaf blocks.

A "branch node" is a node located between the root node and a leaf node.

A "leaf block", in a B-Tree index, is a set of pairs, each consisting of a Key value used for searching and the address of the record corresponding to that Key, and is connected to a leaf node. Some leaf blocks hold multiple pairs.

A "child node" is a node reached by a branch extending from the current node. The node above the current node is its parent node.

"Sampling" means extracting a subset of records for data analysis, for example because the data is very large. Sampling performed at random is called random sampling. In the present embodiment, records are randomly sampled from leaf blocks using a B-Tree index.

An "outlier" is a data item that differs greatly from the other values and that, depending on the calculation, may make the trend of the data impossible to grasp. The present embodiment detects outliers in one-dimensional data using a B-Tree index: a point (the Key value of a record) is an outlier if its distance from neighboring points is large compared with the distances between other points.

The "cursor" is a pointer to the node currently being operated on in the B-Tree. In a B-Tree, the root node and branch nodes hold pointers to their child nodes, and the cursor moves to a child node based on these values.

A "result set" is a group (set) of records retrieved from a database. It is returned when the database is given a command that returns results, such as SELECT.

In the present embodiment, the average distance between Keys in each leaf block is calculated from the B-Tree as the sparse density, and leaf blocks with a relatively large sparse density value are regarded as containing outliers and are excluded. This speeds up random sampling without computing outliers strictly.

FIG. 1 is a block diagram showing the schematic configuration of a sampling apparatus according to an embodiment of the present invention. The apparatus is realized using a general-purpose computer (for example, a personal computer (PC)) and software running on that computer. The computer may also be an engineering workstation (EWS) suited to CAD (Computer Aided Design) or CAE (Computer Aided Engineering).

The present embodiment can also be implemented as a sampling program. The program is intended for a sampling apparatus, realized on such a computer, that comprises a record storage unit storing a plurality of records constituting a database, and a B-Tree index storage unit storing, in addition to a B-Tree index that has a hierarchical structure and holds index information as an index, the following additional information derived from the records stored in the record storage unit: the sparse density of each leaf block, expressed as the average distance between Keys, the average of the sparse densities of all leaf blocks, the standard deviation of the sparse densities of all leaf blocks, and the number of records of each child node. The program causes the apparatus to realize: a function of issuing a sampling request in response to a user query to perform random sampling from the database; a function of converting the sampling request into a sampling command; and a function of interpreting the sampling command to determine sampling records and, when acquiring them, regarding any leaf block whose sparse density is at or above a predetermined value larger than the average sparse density of all leaf blocks as containing outliers, excluding it, and acquiring records from the leaf blocks that were not excluded.

As shown in FIG. 1, the sampling apparatus 1 according to the present embodiment mainly comprises an application 11, a request conversion unit 12, a data operation unit 13, a record storage unit 14, and a B-Tree index storage unit 15.

The record storage unit 14 consists of record storage area 1, record storage area 2, and so on, and stores the plurality of records constituting the database. The database is, for example, a relational database, but is not limited to this. Records can be expressed, for example, in table form. A table is a collection of one-dimensional data expressed in rows and columns: a "column" carries the heading of a data attribute, and each item arranged vertically beneath the headings is a "row". One row is one data item, and its columns hold the values of the respective attributes. One such data item is a record.

When the data operation unit 13 designates a record storage area for random sampling, records are acquired from the corresponding record storage area of the record storage unit 14. Details of record acquisition are described later.

The application 11 is software that directly provides the functions the user of the sampling apparatus 1 wants to carry out on the computer. In the present embodiment it executes, for example, sample acquisition and record updates in the sampling apparatus 1.

In response to a user query to perform random sampling from the database, the application 11 issues a sampling request to the request conversion unit 12 when sampling records are to be acquired. The sampling request specifies (1) the name of the target record storage area, (2) the number of records to acquire as a percentage (%) of the table, and (3) the percentage (%) of leaf blocks to exclude. The target record storage area name identifies, among the record groups stored in the record storage unit 14, the set of records on which sampling is to be performed. It is given a name such as "TEST".

The percentage (%) of records to acquire relative to the table specifies the number of records to sample as a proportion of the designated record storage area.

The percentage (%) of leaf blocks to exclude specifies, on the assumption that the sparse densities of the leaf blocks (detailed later) follow a normal distribution, what upper percentage of that distribution is to be excluded. For example, the user requests 0.1% to exclude only leaf blocks that deviate greatly, or 5% to exclude outliers reliably. The number of records to acquire may be specified directly instead of as a percentage of the table. Likewise, the number of leaf blocks to exclude may be specified directly instead of as a percentage.

To acquire sampling records, the user writes the sampling request as follows, for example, using a SELECT statement. SELECT is a command of SQL, a database language for manipulating and defining data in a database management system.

"SELECT * FROM TEST, RANDSAMPLE(10,5);"

This SELECT statement specifies that the target of random sampling is the record storage area named "TEST", that the percentage of records to sample is 10%, and that the percentage to exclude as outliers is 5%.
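As an illustration of how the request conversion unit might turn such a statement into a sampling command, the following minimal Python sketch extracts the three parameters with a regular expression. The names `SamplingCommand` and `convert_request` are hypothetical, and the pattern only covers the exact request form shown above; the source does not describe how the conversion is actually implemented.

```python
import re
from typing import NamedTuple


class SamplingCommand(NamedTuple):
    """Hypothetical internal form of a sampling request."""
    area_name: str          # target record storage area, e.g. "TEST"
    sample_percent: float   # share of records to sample (%)
    exclude_percent: float  # upper share of sparse leaf blocks to exclude (%)


# Matches only the request form shown in the text; a real SQL front end
# would use a proper parser.
_REQUEST = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(\w+)\s*,\s*"
    r"RANDSAMPLE\s*\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)\s*;?\s*$",
    re.IGNORECASE,
)


def convert_request(query: str) -> SamplingCommand:
    """Convert a sampling request string into a SamplingCommand."""
    m = _REQUEST.match(query)
    if m is None:
        raise ValueError(f"not a sampling request: {query!r}")
    return SamplingCommand(m.group(1), float(m.group(2)), float(m.group(3)))


command = convert_request("SELECT * FROM TEST, RANDSAMPLE(10,5);")
```

Here `command` carries the area name "TEST", a sampling percentage of 10, and an exclusion percentage of 5, matching the reading of the statement given above.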

In response to a user query to update records in the database, the application 11 issues a record update request to the request conversion unit 12. A record update is, for example, a request to insert a new record with Key value = 6 into the target record storage area.

When sampling records are to be acquired, the request conversion unit 12 receives the sampling request specified as described above from the application 11 and converts it into a sampling command. The converted sampling command is sent to the data operation unit 13.

When records are to be updated, the request conversion unit 12 receives the record update request from the application 11 and converts it into a record update command.

When sampling records are to be acquired, the data operation unit 13 interprets the sampling command received from the request conversion unit 12 and determines the sampling records. The determined sampling records are acquired from the designated record storage area of the record storage unit 14. The data operation unit 13 builds the B-Tree read from the B-Tree index storage unit 15 in its own memory (not shown) and determines which records to read while traversing the tree at random.

As described later, the sampling records, which belong to leaf blocks other than those judged for exclusion on the basis of sparse density, are sent to the application 11 as a result set, that is, a group of records.

When records are to be updated, the data operation unit 13 interprets the record update command received from the request conversion unit 12 and determines the record storage area in which the update is to be executed. The updated record is written to the designated record storage area of the record storage unit 14.

The data operation unit 13 also generates data for updating the B-Tree in accordance with the updated record. This data is sent to the B-Tree index storage unit 15.

The B-Tree index storage unit 15 is where the B-Tree is stored permanently and is preferably implemented, for example, as a hard disk storage device. It stores the entire B-Tree: the root node, branch nodes, leaf nodes, and leaf blocks.

In the present embodiment, when records are updated, the B-Tree index storage unit 15 stores, in accordance with the B-Tree update data received from the data operation unit 13, a B-Tree index in which the following information is added to the B-Tree: (1) the sparse density (the average distance between Keys in a leaf block), (2) the average of the sparse densities of all leaf blocks, (3) the standard deviation of the sparse densities of all leaf blocks, and (4) the number of records of each child node. In the initial state, the same four items of additional information are stored, computed from the records held in record storage area 1, record storage area 2, and so on of the record storage unit 14.

When the B-Tree is used, data is temporarily read from the B-Tree index storage unit 15 into memory (not shown), and the B-Tree is built and processed there. However, memory size is limited, so if the entire B-Tree were built in memory it might not be possible to load all leaf blocks. Therefore only the root node, branch nodes, and leaf nodes are built in memory, and leaf blocks are read from the B-Tree index storage unit 15 as needed. Conventional methods must compute information for every record and therefore read all leaf blocks from the hard disk storage device. According to the present embodiment, only the leaf blocks reached by random traversal need to be read, so the number of reads from the hard disk storage device can be reduced.

<Outliers>
When random sampling is performed, many outliers may be selected, so that the distribution of the sample differs from that of the original data and the analysis results may be distorted.

FIG. 2 is a diagram explaining outliers that accompany random sampling. As shown in FIG. 2, if many outliers are sampled, the distribution of the sample can differ greatly from that of the actual data. It is therefore important to preserve the structure of the one-dimensional data to some extent and to acquire the portion that remains after outliers are excluded.

In the present embodiment, a B-Tree index is used, and one-dimensional data (records) whose distance from neighboring points (records) is large compared with the distances between other points are treated as outliers. Leaf blocks with a large sparse density value (described later) are excluded from sampling.

FIG. 3 is a diagram explaining leaf blocks to be excluded. The root node manages Key values and pointers to the branch nodes in the layer below. Given a Key value, it indicates which branch node to proceed to next. A branch node divides the Key values more finely and manages pointers to the leaf nodes in the layer below. A leaf node is a node in the lowest layer and manages Key values together with the physical locations, in the record storage area, of the records corresponding to those Key values.

In the example shown in FIG. 3, the root node has Key value = 61 and manages pointers to the left branch node, whose Key value is 29, and the right branch node, whose Key value is 81. If the search Key value is smaller than 61, the search branches to the left branch node; if it is larger than 61, it branches to the right branch node. The left branch node with Key value = 29 manages pointers to the left leaf node, whose Key values are 3 to 8, and the right leaf node, whose Key values are 52 to 57.

If the search Key value is smaller than 29, the search branches to the left leaf node; if it is larger than 29, it branches to the right leaf node. The leftmost leaf node in FIG. 3 manages the number of records for search Key values of 3 or less, for search Key values between 3 and 8, and for search Key values of 8 or more. The same applies to the other leaf nodes. Each leaf node is connected to leaf blocks, which consist of pairs of a Key value and the address of the record corresponding to that Key.

In the present embodiment, a leaf block whose average distance between points is larger than that of other leaf blocks is likely to contain outliers and is therefore excluded. In the example shown in FIG. 3, the leaf block for Key values of 8 or more covers a wider range than the other leaf blocks, so it may contain outliers and is marked for exclusion. Consequently, no records are acquired from the leaf block for Key values of 8 or more.

<Sparse density>
In the present embodiment, the "average distance between Keys in each leaf block" is defined as the sparse density. Specifically, the sparse density is obtained by dividing (maximum Key value in the leaf block - minimum Key value in the leaf block) by (number of records in the leaf block - 1). If a leaf block contains only one record, no sparse density judgment is made for it.
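The definition above can be sketched in a few lines of Python. The function name `sparse_density` and the choice to return `None` for a block with fewer than two records are illustrative assumptions; the source only says that no judgment is made for a single-record block.

```python
from typing import Optional, Sequence


def sparse_density(keys: Sequence[float]) -> Optional[float]:
    """Average distance between Keys in one leaf block.

    Computed as (max Key - min Key) / (number of records - 1).
    Returns None when the block has fewer than two records, because the
    average distance is undefined and no judgment is made in that case.
    """
    if len(keys) < 2:
        return None
    return (max(keys) - min(keys)) / (len(keys) - 1)


# Hypothetical leaf blocks (the Key values are made up for illustration).
dense_block = sparse_density([10, 12, 16])   # (16 - 10) / 2
sparse_block = sparse_density([8, 28])       # (28 - 8) / 1
single_block = sparse_density([5])           # no judgment possible
```

Because the Keys in a B-Tree leaf block are sorted, the sum of the gaps between adjacent Keys equals the maximum minus the minimum, which is why only the two extreme Keys and the record count are needed.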

A small sparse density value (a short average distance between Keys) means the leaf block is dense, and a large sparse density value (a long average distance between Keys) means the leaf block is sparse. Because a B-Tree is sorted, the larger the sparse density value, the more likely the leaf block is to contain points that are far apart from one another.

<Determining leaf block exclusion>
In the present embodiment, leaf blocks that are sparse relative to the other leaf blocks are excluded. To exclude points whose distance from neighboring points is large compared with the distances between other points, leaf blocks with a large value of (maximum Key value - minimum Key value) are excluded from among the leaf blocks.

In other words, a leaf block with a large sparse density value is sparse and is excluded from the leaf blocks to be sampled, while a leaf block with a small sparse density value is dense and is not excluded.

The user can specify, as a percentage, what upper share of sparse leaf blocks on the normal distribution is to be excluded. Specifically, leaf blocks whose sparse density exceeds the overall average sparse density by at least a certain amount are excluded.

For example, the threshold for exclusion determination is expressed by the following equation.
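The equation itself is not reproduced in this text. A form consistent with the surrounding description, which stores the mean and standard deviation of the sparse densities and derives a coefficient Q from a normal percent point, would be the following. The symbols $T$, $\mu_d$ and $\sigma_d$ are introduced here for illustration, and the form should be read as a reconstruction under those assumptions rather than as the equation in the original publication.

```latex
T = \mu_d + Q \, \sigma_d
% T        : threshold for exclusion determination
% \mu_d    : average sparse density over all leaf blocks
% \sigma_d : standard deviation of sparse density over all leaf blocks
% Q        : coefficient derived from the user-specified exclusion percentage
```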

Leaf blocks whose sparse density is at or above the exclusion threshold described above are excluded from the leaf blocks to be sampled. The coefficient Q is derived from the percentage specified by the user and is calculated by an existing method on the assumption that the sparse densities of the leaf blocks follow a normal distribution. For example, the statistical "percent point" is suitable: in the standard normal distribution, the percent point at which the upper-tail probability is 2.5% is 1.96.
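The percent point can be computed with Python's standard library, as in the following sketch. It assumes that the threshold has the form (mean sparse density) + Q × (standard deviation), which is consistent with the description but not stated explicitly in this text; the function names and the sample values are hypothetical.

```python
from statistics import NormalDist
from typing import Dict, List


def coefficient_q(exclude_percent: float) -> float:
    """Upper percent point of the standard normal distribution.

    exclude_percent is the upper-tail probability in percent, e.g. 2.5.
    """
    if not 0.0 < exclude_percent < 100.0:
        raise ValueError("exclude_percent must be strictly between 0 and 100")
    return NormalDist().inv_cdf(1.0 - exclude_percent / 100.0)


def excluded_blocks(density: Dict[str, float], mean: float, std: float,
                    exclude_percent: float) -> List[str]:
    """Names of leaf blocks whose sparse density is at or above the threshold.

    Assumption: threshold = mean + Q * std (not spelled out in the source).
    """
    threshold = mean + coefficient_q(exclude_percent) * std
    return [name for name, d in density.items() if d >= threshold]


q = coefficient_q(2.5)  # approximately 1.96, as stated in the text
# Made-up densities and statistics, for illustration only.
dropped = excluded_blocks({"a": 2.0, "b": 3.0, "c": 20.0},
                          mean=5.0, std=4.0, exclude_percent=2.5)
```

Since the mean and standard deviation are stored with the index, this check needs only the per-block sparse density held in the leaf node and does not require reading the leaf block itself.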

FIG. 4 is a diagram explaining the normal distribution and the proportion to exclude. As shown in FIG. 4, the area under the normal distribution curve over a given range corresponds to the percentage of leaf blocks whose sparse density lies in that range. In FIG. 4, the farther to the right of the center, which is the overall average sparse density, the larger the sparse density value (the longer the average distance between Keys). Where the sparse density value is small (the average distance between Keys is short), a leaf block is unlikely to contain outliers and need not be excluded from sampling, so everything to the left of the center of the distribution is included.

<Structure of the B-Tree index with sparse density added>
FIG. 5 is a diagram showing the structure of a B-Tree index to which sparse density has been added. As shown in FIG. 5, the root node manages Key values, the number of records of each child node, and pointers to the branch nodes in the layer below. Given a Key value, it indicates which branch node to proceed to next. A branch node divides the Key values more finely and manages the number of records of each child node and pointers to the leaf nodes in the layer below. A leaf node is a node in the lowest layer and manages Key values, the number of records of each child (leaf block), the sparse density of each child (leaf block), and the physical locations, in the record storage area, of the records corresponding to the Key values.

図５に示す例では、ルートノードのKey値＝６１であり、左側のブランチノードのレコード数は１０個、右側のブランチノードのレコード数は７個で、子ノードのレコード数は合計で１７個となっている。検索Key値が６１よりも小さい場合には、左側のブランチノードに分岐し、検索Key値が６１よりも大きい場合には、右側のブランチノードに分岐する。 In the example shown in FIG. 5, the key value of the root node is 61, the number of records in the left branch node is 10, the number of records in the right branch node is 7, and the number of records in the child nodes is 17 in total. It has become. If the search key value is smaller than 61, the process branches to the left branch node. If the search key value is greater than 61, the process branches to the right branch node.

Similarly, the left branch node has Key value = 29, its left leaf node has 6 records, and its right leaf node has 4 records, for a total of 10 records in the child nodes. If the search Key value is smaller than 29, the search branches to the left leaf node; if it is larger than 29, the search branches to the right leaf node. The leftmost leaf node shown in FIG. 5 manages three record counts: 2 records with a Key value of 3 or less, 2 records with a Key value between 3 and 8, and 2 records with a Key value of 8 or more. The same applies to the other leaf nodes. In this embodiment, the sparse density, which represents the average distance between keys, is managed in each leaf node. In the leftmost leaf node, the sparse density is 2 for Key values of 3 or less, 3 for Key values between 3 and 8, and 20 for Key values of 8 or more. Further, each leaf node is connected to leaf blocks, each consisting of pairs of a Key value and the address of the record corresponding to that Key.
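As a rough illustration of the leaf-level bookkeeping described above, the following Python sketch models a leaf node and its leaf blocks. The class and method names are hypothetical, the keys and addresses are invented only so that the densities come out as 2, 3 and 20, and the density formula is the one given in the claims: (maximum key − minimum key) / (number of records − 1). Unlike the index in FIG. 5, which stores counts and densities, this sketch recomputes them on demand for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class LeafBlock:
    """Sorted (key, record_address) pairs hanging from one range of a leaf node."""
    entries: list = field(default_factory=list)

    def sparse_density(self) -> float:
        """Average distance between neighbouring keys in this block."""
        if len(self.entries) < 2:
            return 0.0  # assumption: a block with fewer than two keys has density 0
        keys = [key for key, _ in self.entries]
        return (max(keys) - min(keys)) / (len(keys) - 1)


@dataclass
class LeafNode:
    """Lowest-layer node exposing per-block record counts and sparse densities."""
    blocks: list

    def record_counts(self):
        return [len(block.entries) for block in self.blocks]

    def densities(self):
        return [block.sparse_density() for block in self.blocks]


# Hypothetical keys and addresses, chosen only to reproduce densities 2, 3 and 20.
leftmost = LeafNode(blocks=[
    LeafBlock([(1, "addr-a"), (3, "addr-b")]),
    LeafBlock([(4, "addr-c"), (7, "addr-d")]),
    LeafBlock([(8, "addr-e"), (28, "addr-f")]),
])
print(leftmost.record_counts())  # [2, 2, 2]
print(leftmost.densities())      # [2.0, 3.0, 20.0]
```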

<Random sampling with B-Tree>
In a B-Tree, the number of branches at each branch node and the number of records in each leaf block vary across the tree. Therefore, if the direction of movement (branching) is chosen purely at random during sampling, the acquired records are biased.

FIG. 6 is a diagram for explaining the uneven distribution of records among leaf blocks. The number of records of child nodes and the sparse density are omitted from FIG. 6. In the B-Tree shown in FIG. 6, 6 leaves hang on the left of the branch node with Key value = 29 and 4 leaves hang on its right, so the number of records varies between leaf nodes. Therefore, if records are acquired by moving at random from the root node, through the left branch node with Key value = 29, to the right leaf node with Key values 52 to 57, the acquired records are biased.

FIG. 7 is a diagram for explaining random sampling in a B-Tree that takes selection probability into account. Sparse density is omitted from FIG. 7. As shown in FIG. 7, the root node and the branch nodes hold the number of records of their child nodes. That is, a branch node holds the number of records of the child node for each branching range, and a leaf node holds its current number of records. When performing a random walk down to a child node, it is preferable to calculate the probability of moving to each child node and to choose the destination accordingly. This makes it possible to equalize the selection probability of each record.
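One way to read the count-weighted random walk is sketched below in Python. The `Node` class and `sample_one` function are hypothetical names, not taken from the patent. The record counts follow FIG. 5 (10 = 6 + 4 on the left, 7 on the right), while the 3 + 4 split on the right and the record labels are made up for illustration. Each step picks a child with probability proportional to its stored record count, so every record ends up with probability 1/17.

```python
import random


class Node:
    """Tree node that stores the record count of its subtree, as in FIG. 7."""

    def __init__(self, children=None, records=None):
        self.children = children or []   # empty for a leaf block
        self.records = records or []     # only used by leaf blocks
        if self.children:
            self.count = sum(child.count for child in self.children)
        else:
            self.count = len(self.records)


def sample_one(root, rng):
    """Walk from the root to a leaf block, weighting each step by record count."""
    node = root
    while node.children:
        weights = [child.count for child in node.children]
        node = rng.choices(node.children, weights=weights, k=1)[0]
    return rng.choice(node.records)


# Hypothetical tree: counts on the left follow FIG. 5, the rest is invented.
left = Node(children=[Node(records=[f"L{i}" for i in range(6)]),
                      Node(records=[f"M{i}" for i in range(4)])])
right = Node(children=[Node(records=[f"R{i}" for i in range(3)]),
                       Node(records=[f"S{i}" for i in range(4)])])
root = Node(children=[left, right])

rng = random.Random(0)
print(root.count)  # 17
print(sample_one(root, rng))
```

For example, a record in the 6-record block is reached with probability (10/17) × (6/10) × (1/6) = 1/17, the same as a record in any other block.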

<Rebuilding B-Tree>
FIGS. 8 to 12 are diagrams for explaining the B-Tree update procedure and the scope of recalculation. As shown in FIG. 8, a record with Key value = 6 is inserted into the B-Tree index structure to which sparse density is added. As shown in FIG. 9, the insertion position is searched for while incrementing the record count of each child node that the cursor moves to. First, the cursor moves from the root node to the left branch node with Key value = 29. As a result, the number of child node records held by the root node becomes 11 (up from 10). Similarly, at the branch node the cursor moves to the left leaf node, so the number of child node records held by the branch node becomes 7 (up from 6). Since the inserted record has Key value = 6, it falls between Key value = 3 and Key value = 8 of the leaf node and is therefore inserted into the central block. As a result, as shown in FIG. 9, the number of leaf blocks connected to the central leaf node becomes three. The appropriate number of leaf blocks connected to a leaf node, used when rebuilding the B-tree, can be set freely by the user. In the example shown in FIG. 9, to keep the drawing easy to read, at most two leaf blocks are connected to a leaf node.

As shown in FIG. 10, when the leaf nodes are subdivided so that every leaf node is connected to two leaf blocks, the number of child nodes (leaf blocks) connected to the leftmost leaf node becomes four. The sparse density also changes as the state of the leaf node changes. The appropriate number of child nodes (leaf blocks), used when rebuilding the B-tree, can be set freely by the user. In the example shown in FIG. 10, to keep the drawing easy to read, at most three child nodes (leaf blocks) are connected to a leaf node.

Therefore, as shown in FIG. 11, the leaf node is split into a leaf node with Key value = 3 and a leaf node with Key value = 8. Accordingly, the left branch node is also split so that it branches at Key value = 6 and Key value = 29. In this case, because the state of the branch node has changed, the number of records must be recalculated.

FIG. 12 is a diagram for explaining the locations where calculation is required when a record is updated. In this embodiment, the locations where the sparse density must be recalculated are the same as the portion updated in an ordinary B-tree. Therefore, the computational cost of an update is O(log n).

<B-Tree construction process flow>
The flow of B-Tree construction processing in the sampling device 1 configured as described above will now be described. FIG. 13 is a flowchart for explaining the flow of B-Tree construction processing in the sampling device 1.

First, the cursor is moved to the root node (step S131).

Next, the cursor is moved to the node whose range contains the Key to be inserted. Before the move, the number of records of the destination node is incremented by 1 (step S132).

Subsequently, it is determined whether or not the cursor is positioned at a leaf node (step S133). If the cursor is not at a leaf node (No in step S133), the process returns to step S132. If the cursor is at a leaf node (Yes in step S133), the Key and the address of the record are added to the leaf block (step S134).

Next, it is determined whether the leaf block contains a specified number of records or more (step S135). Here, the specified number is the threshold for deciding whether to rebuild the B-Tree. It is preferable to manage the number of records of child nodes so that many records do not concentrate in a particular leaf block and so that the selection probability of each record is equalized.

If the leaf block does not contain the specified number of records or more (No in step S135), the sparse density of the leaf block to which the record was added is recalculated and stored in the leaf node (step S136), and the B-Tree construction process ends.

On the other hand, if the leaf block contains the specified number of records or more (Yes in step S135), the B-Tree is rebuilt (step S137).

Subsequently, the sparse density of each leaf block affected by the rebuild and the number of records of the child nodes are recalculated and stored in the leaf nodes (step S138), and the B-Tree construction process ends.
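Steps S131 to S136 can be sketched roughly as follows in Python. This is a simplified model under several assumptions, not the patented implementation: `Node`, `LeafBlock`, `insert` and `limit` are hypothetical names, the sparse density is stored on the block object rather than in the leaf node, ties on a separator key go to the right-hand range, and the rebuild of steps S137 and S138 is only signalled to the caller. The keys, addresses and empty placeholder subtrees are invented; the counts follow FIG. 5.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass, field


@dataclass
class LeafBlock:
    entries: list = field(default_factory=list)  # sorted (key, address) pairs
    density: float = 0.0                         # stored sparse density


@dataclass
class Node:
    separators: list   # Key values that split this node's range
    children: list     # Node or LeafBlock objects, len(separators) + 1 of them
    counts: list       # stored record count per child


def sparse_density(block):
    keys = [key for key, _ in block.entries]
    if len(keys) < 2:
        return 0.0  # assumption: use 0 when fewer than two keys are present
    return (keys[-1] - keys[0]) / (len(keys) - 1)


def insert(root, key, address, limit):
    """Return True when the touched leaf block needs a rebuild (S137/S138)."""
    cursor = root                                   # S131
    while isinstance(cursor, Node):                 # S133: not yet at a block
        i = bisect_right(cursor.separators, key)    # range containing the key
        cursor.counts[i] += 1                       # S132: add 1 before moving
        cursor = cursor.children[i]
    insort(cursor.entries, (key, address))          # S134
    if len(cursor.entries) >= limit:                # S135
        return True                                 # rebuild is left to the caller
    cursor.density = sparse_density(cursor)         # S136
    return False


# Hypothetical data; LeafBlock() placeholders stand in for unexplored subtrees.
blocks = [LeafBlock([(0, "a"), (2, "b")], 2.0),
          LeafBlock([(4, "c"), (7, "d")], 3.0),
          LeafBlock([(8, "e"), (28, "f")], 20.0)]
leaf = Node([3, 8], blocks, [2, 2, 2])
branch = Node([29], [leaf, LeafBlock()], [6, 4])
root = Node([61], [branch, LeafBlock()], [10, 7])

needs_rebuild = insert(root, 6, "g", limit=4)
print(root.counts, branch.counts, leaf.counts)  # [11, 7] [7, 4] [2, 3, 2]
print(blocks[1].density, needs_rebuild)         # 1.5 False
```

The counts 11 and 7 match the values given for FIG. 9; the new density 1.5 is simply (7 − 4) / 2 for the invented keys.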

<Random sampling process flow>
The flow of random sampling processing in the sampling device 1 configured as described above will now be described. FIG. 14 is a flowchart for explaining the flow of random sampling processing in the sampling device 1.

First, the request conversion unit 12 determines, from the input specified by the user, the number of records to sample and the coefficient Q required for the leaf block exclusion determination (step S1401).

Next, the data operation unit 13 visits all leaf nodes and calculates the standard deviation σ and the average value of the sparse density over all leaf blocks (step S1402).

Next, the cursor is moved to the root node (step S1403).

Next, the cursor is moved to a child node (step S1404).

Subsequently, it is determined whether or not the cursor is positioned at a leaf node (step S1405).

If the cursor is not at a leaf node (No in step S1405), the process returns to step S1404. If the cursor is at a leaf node (Yes in step S1405), one of the leaf blocks connected to the leaf node is selected at random (step S1406).

Next, it is determined whether the sparse density of the selected leaf block is smaller than the threshold for excluding leaf blocks (step S1407).

If the sparse density of the selected leaf block is not smaller than the threshold (No in step S1407), the leaf block is excluded because it is not suitable as a source of sampling records, and the process proceeds to step S1403.

On the other hand, if the sparse density of the selected leaf block is smaller than the threshold (Yes in step S1407), one record to be acquired from the selected leaf block is chosen at random and inserted into the result set (step S1408).

Next, it is determined whether the total number of records in the result set has reached the specified sampling number (step S1409).

If the specified sampling number has not been reached (No in step S1409), the process proceeds to step S1403 to continue random sampling.

On the other hand, if the specified sampling number has been reached (Yes in step S1409), the result set is returned to the application (step S1410), and the random sampling process ends.
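The loop of steps S1402 to S1410 might look roughly like the Python sketch below. It is a sketch under stated assumptions rather than the method of the embodiment: `Block`, `Node` and `random_sample` are hypothetical names, the population standard deviation (`statistics.pstdev`) is assumed, every step of the descent (including the choice of leaf block) is weighted by record count, records are drawn with replacement, and a guard is added for the case where no block is eligible. Only the densities 2, 3 and 20 come from FIG. 5; the other densities and the record labels are invented.

```python
import random
from statistics import mean, pstdev


class Block:
    def __init__(self, density, records):
        self.density = density   # stored sparse density of the leaf block
        self.records = records
        self.count = len(records)


class Node:
    def __init__(self, children):
        self.children = children
        self.count = sum(child.count for child in children)


def leaf_blocks(node):
    if isinstance(node, Block):
        return [node]
    return [b for child in node.children for b in leaf_blocks(child)]


def random_sample(root, n, q, rng):
    densities = [b.density for b in leaf_blocks(root)]        # S1402
    threshold = mean(densities) + q * pstdev(densities)
    if all(d >= threshold for d in densities):
        raise ValueError("every leaf block would be excluded")
    result = []
    while len(result) < n:                                    # S1409
        cursor = root                                         # S1403
        while isinstance(cursor, Node):                       # S1404 to S1406
            weights = [child.count for child in cursor.children]
            cursor = rng.choices(cursor.children, weights=weights, k=1)[0]
        if cursor.density >= threshold:                       # S1407: No
            continue                                          # block is excluded
        result.append(rng.choice(cursor.records))             # S1408
    return result                                             # S1410


# Hypothetical 17-record tree; the density-20 block should never be sampled.
left_branch = Node([
    Node([Block(2, ["r1", "r2"]), Block(3, ["r3", "r4"]), Block(20, ["r5", "r6"])]),
    Node([Block(4, ["r7", "r8"]), Block(5, ["r9", "r10"])]),
])
right_branch = Node([
    Node([Block(3, ["r11", "r12", "r13"]), Block(5, ["r14", "r15", "r16", "r17"])]),
])
root = Node([left_branch, right_branch])

print(random_sample(root, 2, 1.6, random.Random(0)))
```

Because excluded blocks are skipped after the walk rather than removed from the stored counts, the remaining records are still selected with equal probability relative to one another.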

[Example]
Next, an example of this embodiment will be described. Here, an example in which random sampling is executed on the B-tree index structure shown in FIG. 5 is described. FIG. 15 is a diagram illustrating processing for a sampling request in the example.

In response to a query by the user to perform random sampling from the database, the application 11 issues a sampling request such as that shown in FIG. 15 to the request conversion unit 12. Here, for the target record storage area named “TEST”, the number of records to be sampled is specified as 10% of the specified record storage area, and the ratio of leaf blocks to be excluded is specified as 5%.

Since the number of records of the child nodes of the root node (Key value = 61) is 10 + 7 = 17, the number of sampling records is determined to be 17 × 0.1 ≈ 2 records.

FIG. 16 is a diagram for explaining calculation of the standard deviation of the sparse density in the example. As shown in FIG. 16, the sparse density is acquired by visiting all leaf nodes, and the standard deviation and the average value of the sparse density are calculated from the acquired values. In the example, the threshold for the leaf block exclusion determination, given by (average value) + (coefficient Q) × (standard deviation σ), is 6.0 + 1.6 × 6.0 = 15.6.
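The threshold arithmetic can be checked with a few lines of Python. The function names are hypothetical. The second helper assumes that the coefficient Q corresponds to the one-sided upper-tail quantile of the standard normal distribution for the requested exclusion ratio; that mapping is an assumption made here for illustration, and the example in the text simply uses Q = 1.6 for a 5% ratio.

```python
from statistics import NormalDist


def exclusion_threshold(mean_density, sigma, q):
    """Threshold = average sparse density + Q x standard deviation."""
    return mean_density + q * sigma


def coefficient_for_ratio(excluded_ratio):
    """Assumed mapping: upper-tail quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - excluded_ratio)


print(round(exclusion_threshold(6.0, 6.0, 1.6), 1))  # 15.6
print(round(coefficient_for_ratio(0.05), 3))         # 1.645
```

A leaf block with sparse density 20 lies above 15.6 and is excluded, while blocks with densities 2 or 3 remain eligible.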

FIG. 17 is a diagram illustrating determination of a sampling record in the example. As shown in FIG. 17, the sparse density of the selected leaf block is compared with the exclusion threshold. In the example, the threshold is 15.6 and the sparse density of the selected leaf block is below it, so one record is selected at random from that block. If the total number of acquired records has not reached the specified sampling number, the process returns to the root node.

FIG. 18 is a diagram for explaining determination of a sampling record and exclusion of a leaf block in the example. As shown in FIG. 18, the leaf block exclusion threshold is 15.6 whereas the sparse density of the selected leaf block is 20, so the leaf block is judged to be one to exclude. Accordingly, no record connected to that leaf block is taken as a sampling record, and the process returns to the root node.

The above is repeated until the total number of acquired records reaches the number specified as the sampling number.

According to this embodiment, when random sampling is executed on n records, the amount of calculation can be reduced to 1/n of that required to compute Key outliers exactly.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
1 ... Sampling device
11 ... Application
12 ... Request conversion unit
13 ... Data operation unit
14 ... Record storage unit
15 ... B-tree index storage unit

Claims

A sampling device comprising:
a record storage unit that stores a plurality of records constituting a database;
a B-Tree index storage unit that stores, in addition to a B-Tree index that has a hierarchical structure and holds index information as an index, additional information comprising the sparse density of each leaf block, expressed as the average distance between keys in accordance with the records stored in the record storage unit, the average value of the sparse density of all leaf blocks, the standard deviation of the sparse density of all leaf blocks, and the number of records of child nodes;
an application that makes a sampling request in response to a query by a user to perform random sampling from the database;
a request conversion unit that receives the sampling request from the application and converts it into a sampling instruction; and
a data operation unit that interprets the sampling instruction received from the request conversion unit to determine sampling records and, when acquiring the sampling records from the record storage unit, excludes leaf blocks whose sparse density is equal to or greater than a predetermined value larger than the average value of the sparse density of all leaf blocks, and acquires records from the leaf blocks that are not excluded.

The sampling apparatus according to claim 1, wherein the database is a relational database.

The sampling apparatus according to claim 1, wherein the record relates to one-dimensional data and is represented in a table format.

The sampling apparatus according to any one of claims 1 to 3, wherein the sampling request specifies, among the plurality of record groups stored in the record storage unit, the name of the target record storage area designated as the set of records to be sampled, the ratio of records to be acquired, which expresses the number of records to be sampled as a ratio to the table of the target record storage area, and the ratio of leaf blocks to be excluded.

In the sampling request, among the plurality of record groups stored in the record storage unit, the target record storage area name specified as a set of records to be sampled, the number of records to be sampled, and the leaf block to be excluded The sampling device according to any one of claims 1 to 3, wherein a number is designated.

The sampling device according to claim 1, wherein the leaf block is a set of a key value to be searched and an address on the record storage unit to a record corresponding to the key.

7. The sparse density is calculated by dividing (maximum value of key in leaf block−minimum value of key in leaf block) by (number of records in leaf block−1). The sampling apparatus according to item 1.

The predetermined value at the time of exclusion determination of the leaf block is obtained by adding a value obtained by (coefficient × standard deviation of the sparse density of all leaf blocks) to the average value of the sparse density of all leaf blocks. The sampling apparatus according to any one of claims 1 to 7, wherein the sampling apparatus is used to represent a region to be excluded on the normal distribution when the sparse density distribution is assumed to be a normal distribution.

The sampling apparatus according to claim 1, wherein the root node holds the number of records of child nodes, the branch node holds the number of records of the child node for each branching range, the leaf node holds the current number of records, and, when a random walk to a child node is performed, the probability of moving to each child node is calculated so as to equalize the selection probability of each record.

When updating a record by inserting a new record into the database, the insertion position is searched while adding the number of records at the movement destination, the number of leaf blocks is suppressed to 2 or less, and the number of child nodes is 3 or less. The sampling apparatus according to any one of claims 1 to 9, wherein the insertion position is determined while suppressing the insertion position.

The sampling device according to any one of claims 1 to 10, wherein the data operation unit determines records to be randomly acquired from leaf blocks that are not excluded, and collects the results into a result set and returns them to the application.

A sampling program for causing a sampling device, which comprises a record storage unit that stores a plurality of records constituting a database and a B-Tree index storage unit that stores, in addition to a B-Tree index that has a hierarchical structure and holds index information as an index, additional information comprising the sparse density of each leaf block, expressed as the average distance between keys in accordance with the records stored in the record storage unit, the average value of the sparse density of all leaf blocks, the standard deviation of the sparse density of all leaf blocks, and the number of records of child nodes, to realize:
a function of making a sampling request in response to a query by a user to perform random sampling from the database;
a function of converting the sampling request into a sampling instruction; and
a function of interpreting the sampling instruction to determine sampling records and, when acquiring the sampling records, excluding leaf blocks whose sparse density is equal to or greater than a predetermined value larger than the average value of the sparse density of all leaf blocks as containing outliers, and acquiring records from the leaf blocks that were not excluded.