JP5655764B2

JP5655764B2 - Sampling apparatus, sampling program, and method thereof

Info

Publication number: JP5655764B2
Application number: JP2011245492A
Authority: JP
Inventors: 亮根山
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2011-11-09
Filing date: 2011-11-09
Publication date: 2015-01-21
Anticipated expiration: 2031-11-09
Also published as: JP2013101539A

Description

The present invention relates to a sampling device, a sampling program, and a sampling method.

Techniques are known in which data to be processed is acquired from a large-scale data source, such as a database or a distributed file, and processed in a distributed manner by a plurality of processing nodes. How to acquire data for distributed processing from a data source and allocate it to processing nodes efficiently has long been a subject of research.

In a distributed processing system, if the generated tasks vary in size, the load concentrates on the nodes that are assigned the large tasks, and the overall throughput (processing performance) drops. Tasks must therefore be assigned so that the amount of work per unit of node performance is equal, and an algorithm called Skewed Join is used for this purpose.

Non-Patent Document 1 describes a technique for realizing Skewed Join in which values are sampled from a data source, a histogram representing the frequency of occurrence of the data values in the data source is created, and the allocation of distributed processing is determined according to that histogram. Skewed Join has the advantage that the sizes of the acquired data can be equalized, provided that the created histogram represents the trend of the entire data source.

When a distributed processing task is determined by sampling data from a data source and creating a histogram, the sampling result must be as fair as possible; that is, it must represent the trend of the entire data source. If the distribution of records appearing in the histogram deviates from the actual record counts, more records than a processing node can handle are allocated to it, which lowers the overall throughput.

As shown in FIG. 1, sampling can be performed either by accessing the entire data source or by reading only part of it. FIG. 1(a) shows all files in a storage device serving as the data source being accessed and all records being read. Non-Patent Document 2 discloses a method of sampling by reading the entire file in this way.

FIG. 1(b) shows sampling performed by referring to reference data recorded in some of the files in the storage device. Reference data generally means data prepared for searching, such as a database index. It comes in forms such as hashes and trees, and consulting it reveals where in the data source a record with a specific key is located. In other words, the sampling means can grasp the data distribution of the entire data source by referring to only part of it.

Patent Document 1 describes a technique that prepares an approximate query engine as reference data and provides approximate answers to database queries. The invention described in Patent Document 1 is characterized by preparing an approximate query engine, which is reference data corresponding to a database, and updating a plurality of data samples in the approximate query engine with a predetermined probability whenever the database is updated. Because the data in the approximate query engine is equivalent to a sample of the database, the database engine can obtain the data distribution by accessing the approximate query engine, without accessing the entire database.

Patent Document 1: Japanese Patent Laid-Open No. 11-353331

Non-Patent Document 1: "Skewed Join", [online], Apache Pig Wiki, [retrieved September 16, 2011], Internet <URL: http://wiki.apache.org/pig/PigSkewedJoinSpec>
Non-Patent Document 2: Jeffrey Scott Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, Vol. 11, pp. 37-57, Mar. 1985

The technique described in Non-Patent Document 2, which reads the entire file when sampling from a data source, reliably obtains the data distribution in the data source, but the read time grows with the size of the data source. For example, for a data source consisting of ten storage devices in parallel, each with a read performance of 100 MB per second and a capacity of 100 GB, each device takes 100 GB ÷ 100 MB/s = 1,000 seconds to read, so reading all the data takes about 17 minutes. With this method, it becomes difficult to achieve a practical speed as the data source grows.
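The read-time figure in the example above can be checked with a short calculation. The following is only a sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and perfectly parallel reads, and the function name is hypothetical.

```python
# Back-of-the-envelope full-scan time for the example in the text.
# Assumption: decimal units and ideal parallelism across devices.
MB = 10**6
GB = 10**9


def full_scan_seconds(capacity_bytes: int, read_bytes_per_s: int) -> float:
    """Time to read one device end to end.

    Devices that are read in parallel all finish after this same time,
    so the device count does not appear in the formula.
    """
    return capacity_bytes / read_bytes_per_s


seconds = full_scan_seconds(100 * GB, 100 * MB)
minutes = seconds / 60
print(f"{seconds:.0f} s, about {minutes:.1f} min")
```

Because the devices are read in parallel, adding devices of the same size leaves the scan time unchanged, whereas increasing the capacity per device lengthens it proportionally.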

On the other hand, when a histogram is created by reading only part of the data source, a data structure with a low search cost, such as a hash or a tree, must be prepared in the data source in advance so that the processing can be performed efficiently.

However, a data source may not have been structured for searching from the time it was first written. For example, data that demands high write throughput, such as vehicle travel logs or web access logs, cannot be assumed to be stored in a structured form. Structuring such data requires reading the entire data source and storing it again in structured form, so the approach ends up reading the entire data source after all.

Thus, the prior art has a performance problem when sampling from a data source that is not optimized for searching.

Several sampling methods other than the prior art could be considered to solve this problem. One is to sample a specific number of records from the beginning of the data source. With this method, however, if the recorded data has locality in its distribution, as time-series data does, the histogram differs from the trend of the entire data source. For example, if data with a particular tendency is concentrated toward the end of the data source, sampling from the beginning yields a result that differs from the actual data, and fair sampling cannot be achieved.

Another method, called random sampling, specifies a position in the data source at random and takes the record located at that position as a sample. With this method, however, larger records are more likely to be selected, so the selection is unfair when record sizes vary. When the selection is unfair, the resulting histogram again differs from the trend of the entire data source.
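The size bias described above can be made concrete with exact arithmetic. The sketch below assumes records stored back to back and a uniformly random position; the record lengths are hypothetical illustration values, and `selection_probabilities` is not a name from the source.

```python
from fractions import Fraction


def selection_probabilities(lengths):
    """Probability that a uniformly random position falls inside each record.

    Assumes the records are stored contiguously, so a record of length s
    covers s out of sum(lengths) possible positions.
    """
    total = sum(lengths)
    return [Fraction(s, total) for s in lengths]


# Hypothetical record lengths in units of the shortest record.
lengths = [1, 1, 2, 1, 3, 2]
probs = selection_probabilities(lengths)
fair = Fraction(1, len(lengths))  # what an unbiased sampler would give

for number, (s, p) in enumerate(zip(lengths, probs), start=1):
    print(f"record {number}: length {s}, selected with p = {p} (fair: {fair})")
```

In this example the record of length 3 is chosen three times as often as a record of length 1, even though each record should count once in the histogram.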

Thus, when the data stored in a data source is not optimized for searching and has local tendencies, fair sampling for obtaining a histogram of the data distribution in the data source cannot be performed; that is, the reliability of the sampling result is low.

The present invention has been made in view of the above problems, and its object is to provide a sampling device that can appropriately capture the overall trend of a data source that is not optimized for searching.

To achieve the above object, the sampling device according to the present invention samples a data source by the following means.

A sampling device according to the present invention is a sampling device that randomly samples records from a plurality of records recorded in a storage device, comprising: random number generation means for generating a random number corresponding to a recording position; record selection means for selecting, from the plurality of records, the record that has data at the recording position corresponding to the generated random number; record length acquisition means for acquiring the record length of the selected record; and sample determination means for adopting the selected record as a sample with a probability calculated based on the acquired record length.

That is, the length of the record located at the position specified by the random number is acquired from the data source, and the selected record is adopted as a sample with a probability that depends on that length. In random sampling that reads a randomly specified recording position in the data source, the probability that a record is selected varies with its length. Multiplying by an adoption probability after selection adjusts the probability with which each record is chosen. This technique makes highly reliable sampling possible.

Preferably, the sample determination means calculates the probability so that it is inversely proportional to the record length of the selected record.

That is, the longer the selected record, the lower the probability of adopting it as a sample. As a result, all records are sampled with the same probability.

Preferably, the sampling device according to the present invention further comprises minimum record length acquisition means for acquiring the record length S_min of the smallest record among the plurality of records, and the sample determination means calculates the probability as S_min / S_k, where S_k is the record length of the selected record.

That is, S_min is taken as the base record length, and the probability of adopting a record is determined by dividing S_min by the length of the selected record. This configuration maximizes the efficiency of random sampling.
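One way to read the efficiency claim is that S_min is the largest base value for which base / S_k never exceeds 1, so it wastes the fewest random probes. The sketch below illustrates this under assumptions not stated in the source: records are stored contiguously, the record lengths are hypothetical, and `acceptance_rate` is an illustrative helper.

```python
from fractions import Fraction


def acceptance_rate(lengths, base):
    """Probability that a single random probe ends with an adopted sample.

    A record of length s is hit with probability s / total and then
    adopted with probability base / s.
    """
    if base > min(lengths):
        raise ValueError("base / s would exceed 1 for the shortest record")
    total = sum(lengths)
    return sum(Fraction(s, total) * Fraction(base, s) for s in lengths)


lengths = [1, 1, 2, 1, 3, 2]  # hypothetical lengths, S_min = 1
best = acceptance_rate(lengths, min(lengths))   # base = S_min
half = acceptance_rate(lengths, Fraction(1, 2))  # a smaller base

print("acceptance rate with S_min  :", best)
print("acceptance rate with S_min/2:", half)
print("expected probes per sample with S_min:", 1 / best)
```

Any smaller base still gives every record the same chance, but it rejects more probes; a larger base is not usable because the adoption probability of the shortest record would exceed 1.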

The sampling device according to the present invention may further comprise histogram generation means for generating, from the records adopted as samples, a histogram indicating the frequency distribution of the data recorded in the storage device.

Generating a histogram makes it possible to tabulate the frequency distribution of the recorded data, which makes the present invention easier to apply to a distributed processing system.

The storage device may be a distributed file system, and the histogram may be used to assign the processing to be executed to external distributed processing nodes.

With this configuration, the sizes of the tasks assigned in a distributed processing system can be evened out, so the processing throughput of the entire system can be improved.

According to the present invention, it is possible to provide a sampling device that can appropriately capture the overall trend of a data source that is not optimized for searching.

FIG. 1 is a diagram explaining existing data sampling methods.
FIG. 2 is a diagram showing the configuration of the system according to the first embodiment.
FIG. 3 is a diagram showing an example of a histogram according to the first embodiment.
FIG. 4 is a diagram showing an example of the contents of the data source according to the first embodiment.
FIG. 5 is a processing flowchart of the system according to the first embodiment.
FIG. 6 is a configuration diagram of the system according to the second embodiment.
FIG. 7 is a schematic diagram of the system according to the second embodiment.

(First embodiment)
The sampling device according to the first embodiment is described in detail with reference to FIG. 2, which shows the system configuration of the sampling device according to the first embodiment.

The sampling device 100 according to this embodiment comprises a sampling unit 111 that accesses the data source and randomly selects a predetermined number of records, a histogram generation unit 112 that generates a histogram representing the data distribution in the data source based on the sampling result, and a minimum record length storage unit 121 that records the size of the smallest record among the records recorded in the data source.

The above configuration is implemented by a central processing unit (CPU), a main storage device (RAM), and an auxiliary storage device (storage medium), none of which are shown. Program code for executing the processing described in this embodiment is stored in the auxiliary storage device, and the functions of the embodiment are realized when the CPU reads and executes that program code.

The data source 120 is the means in which the data to be sampled is stored, and corresponds to a relational database, a data file, a disk device, or the like. This embodiment is described assuming a single data file, but any means capable of recording a plurality of records may be used. The data storage means is not limited to a single device; a virtual storage means such as a disk array or a distributed file system may serve as one data source.

The data source stores a plurality of records, each delimited by a separator (delimiter) such as a line feed code or a comma. The records differ in length, and the recording position of a record can be specified by a memory address. A memory address is information that identifies the physical or logical location where a record is recorded: for example, the number of bytes from the beginning if the data source is a file, or a virtual address indicating the recording position of the file if it is a distributed file system.

The minimum record length storage unit 121 is means that records the smallest record length among the records held by the data source 120. For example, when the smallest record is 512 bytes long, the minimum record length storage unit 121 holds the value 512 bytes. The minimum record length storage unit 121 does not have to be an independent device; for example, the value may be recorded in the data source 120 together with the data to be sampled. When a record is added and its length is less than the stored minimum record length, the minimum record length is updated to the new value.
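The update rule for the minimum record length can be sketched as follows. The class name `MinRecordLength` and its in-memory storage are assumptions made for illustration; the source only states that the stored value is replaced when a shorter record is appended.

```python
class MinRecordLength:
    """Keeps the shortest record length seen so far (hypothetical helper).

    Stands in for the minimum record length storage unit 121; a real
    system might persist this value alongside the data source.
    """

    def __init__(self):
        self.value = None  # no record observed yet

    def observe(self, record: bytes) -> int:
        """Call once per appended record; returns the current minimum."""
        length = len(record)
        if self.value is None or length < self.value:
            self.value = length
        return self.value


tracker = MinRecordLength()
for record in (b"a" * 700, b"b" * 512, b"c" * 900):
    tracker.observe(record)

print(tracker.value)  # 512
```

Maintaining the value at write time costs one comparison per append, which avoids scanning the data source later just to find the shortest record.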

The sampling unit 111 is means that acquires a predetermined number of records at random from the data source 120. In the description of the embodiments, the operation of randomly acquiring a predetermined number of records from a data source is called sampling. The detailed operation is described later.

The histogram generation unit 112 is means that generates a histogram representing the frequency distribution of the records in the data source, based on the data sampled by the sampling unit 111. The frequency distribution is built from the keys of the sampled data. FIG. 3 is an example of a generated histogram. In this example, vehicle travel logs are classified by event ID, and the histogram shows the number of records for each ID.

Next, the method by which the sampling unit 111 performs sampling is described. FIG. 4 shows all the records stored in the data source. For convenience, the minimum record length is taken to be S_min and the total length of the data source to be S_min × 10. The sampling unit 111 generates a uniform random number corresponding to a memory address in the data source and thereby determines the memory address to be sampled. For the data source shown in FIG. 4, a random number N in the range 0 ≤ N ≤ 10 × S_min is determined, and the record located at the resulting memory address is acquired. This processing identifies a single record in the target data source. For example, when S_min = 1 and N = 3, record 4 is selected in the case of FIG. 4(a) and record 3 in the case of FIG. 4(b).
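The address-to-record mapping in this example can be reproduced with a small lookup. The sketch below assumes, for illustration only, that the start offset of every record is known and that the records of FIG. 4(b) are stored in numerical order with lengths 1, 1, 2, 1, 3, 2; with a delimiter-separated file, an implementation would more likely scan for the neighbouring delimiters instead. The helper names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def record_starts(lengths):
    """Start offset of each record when records are stored back to back."""
    return [0] + list(accumulate(lengths))[:-1]


def record_at(starts, address):
    """1-based number of the record whose span contains `address`.

    bisect_right counts the records that start at or before the address;
    the last of those is the record that contains it.
    """
    return bisect_right(starts, address)


starts_a = record_starts([1] * 10)            # FIG. 4(a): ten equal records
starts_b = record_starts([1, 1, 2, 1, 3, 2])  # FIG. 4(b): assumed layout

print(record_at(starts_a, 3))  # 4
print(record_at(starts_b, 3))  # 3
```

Both layouts have a total length of 10 with S_min = 1, so the same address N = 3 lands in record 4 for equal-length records and in record 3 for the variable-length layout.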

FIG. 4(a) shows a case in which records of identical length are recorded uniformly. In this case, random sampling acquires each of the ten records, record A to record J, with equal probability. In theory every record is sampled the same number of times, so the histogram is not biased and the problem described above does not occur.

FIG. 4(b) shows a case in which the record lengths differ. For example, records 3 and 6 are twice as long as record 1, and record 5 is three times as long. In such a case, if a memory address (a recording position on the data source) is determined at random and the record at that location is acquired, records 3 and 6 are twice as likely, and record 5 three times as likely, to be selected as records 1, 2, and 4, which have the smallest size.

In other words, the records are not selected evenly, so the distribution appearing in the generated histogram is biased even though the actual record counts are equal.

Therefore, in this embodiment, when the sampling unit selects a record, whether to adopt that record as a sample is decided using Equation (1) below. S_min is the minimum record length among the records recorded in the data source, and S_k is the length of the selected record.
Adoption probability p = S_min / S_k … Equation (1)
The sampling unit 111 decides whether to adopt the selected record as a sample according to the adoption probability p calculated by Equation (1). For example, when p is 0.1, the selected record is adopted with a probability of 10% and discarded with a probability of 90%. The decision can be made, for example, by generating a new random real number in the range 0 to 1 and adopting the selected record only when that number is less than or equal to the adoption probability.
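The adoption decision of Equation (1) can be sketched as follows. The function names and the injectable `rng` parameter are illustrative assumptions; the only behaviour taken from the text is p = S_min / S_k and the comparison of a fresh random number in the range 0 to 1 against p.

```python
import random


def adoption_probability(s_min: int, s_k: int) -> float:
    """Equation (1): p = S_min / S_k."""
    if not 0 < s_min <= s_k:
        raise ValueError("expected 0 < S_min <= S_k")
    return s_min / s_k


def adopt(s_min: int, s_k: int, rng=random) -> bool:
    """Draw a number in [0, 1) and adopt the record if it does not exceed p.

    `rng` is any object with a random() method, which makes the decision
    easy to test with a fixed value.
    """
    return rng.random() <= adoption_probability(s_min, s_k)


print(adoption_probability(512, 5120))  # 0.1
print(adopt(512, 512))                  # True: the shortest record is always adopted
```

Passing the random source explicitly is a design choice for testability; the text leaves open how the decision is made and notes that other methods may be used.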

With this method, the probability of adopting a randomly selected record is made inversely proportional to the length of the record. In other words, multiplying the probability that a record is selected by the adoption probability p of Equation (1), which is inversely proportional to the record length, makes the probability of being sampled the same for all records.

If the length of the entire data source is L, the probability that a record of length S_k is selected is S_k / L. Multiplying this by Equation (1) gives
S_k / L × S_min / S_k = S_min / L … Equation (2)
which confirms that every record is sampled with the same probability.
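The derivation of Equation (2) can be carried one step further to show that, among the adopted samples, each record is equally likely. The block below is a worked sketch; the symbol n (the number of records in the data source) is introduced here for illustration and does not appear in the source.

```latex
\begin{align}
P(\text{record } k \text{ is selected})
  &= \frac{S_k}{L} \\
P(\text{record } k \text{ is selected and adopted})
  &= \frac{S_k}{L}\cdot\frac{S_{\min}}{S_k} = \frac{S_{\min}}{L} \\
P(\text{one random position yields a sample})
  &= \sum_{k=1}^{n}\frac{S_{\min}}{L} = \frac{n\,S_{\min}}{L} \\
P(\text{record } k \mid \text{a sample was adopted})
  &= \frac{S_{\min}/L}{n\,S_{\min}/L} = \frac{1}{n}
\end{align}
```

The second line is Equation (2); the last line states the same result conditionally, namely that every adopted sample is drawn uniformly from the n records regardless of their lengths.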

Next, the sampling operation is described with reference to FIG. 5, which is an operation flowchart of the sampling unit 111. When the sampling device according to this embodiment starts operating, the sampling unit 111 acquires the minimum record length from the minimum record length storage unit 121 (S10). As described above, the minimum record length is the smallest of the variable record lengths.

Next, a random number for determining a specific position in the data source is generated (S11). The random number is uniform and may be generated in any way, as long as it can specify the recording position of a record recorded in the data source. For a disk device, for example, the range of the random number could run from the minimum to the maximum sector number in which data is stored.

Next, the record corresponding to the determined random number is selected (S12). That is, the record occupying the recording position specified by the random number is selected from the data source. Because this embodiment assumes an append-only data source, no processing is performed for the case in which no record exists at the specified position, but the case in which data is absent because of fragmentation or the like may also be taken into account. In that case, the processing may return to step S11 and be performed again, or the nearest record before or after the position may be selected.

Next, the length of the selected record is acquired (S13), Equation (1) is evaluated using the acquired record length and the minimum record length, and whether to adopt the record is decided according to the calculated adoption probability (S14). As described above, the decision may be made by generating a new random number, or another method may be used. If the selected record is not adopted, the processing returns to step S11 and a random number is generated again.

Next, it is determined whether the number of samplings has reached the specified number (S15). The number of samplings may be fixed for the device or may be determined automatically according to the scale of the data source. When the specified number of samplings has been completed, the processing ends; otherwise, record selection is performed again.

Through the above processing, the sampling unit 111 can acquire the predetermined number of sample records. The acquired records are selected at random regardless of their record lengths.
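Steps S10 to S15 can be sketched end to end as below. This is an illustration under several assumptions that are not in the source: the data source is an in-memory list of byte strings stored back to back, S_min is computed on the spot instead of being read from a storage unit, and a cumulative offset table replaces whatever mechanism locates the record at an address. The function name `sample_records` is hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_records(records, sample_count, rng):
    """Length-corrected random sampling following steps S10 to S15."""
    lengths = [len(r) for r in records]
    s_min = min(lengths)                # S10: minimum record length
    ends = list(accumulate(lengths))    # end offset (exclusive) of each record
    total = ends[-1]                    # overall length L of the data source

    samples = []
    while len(samples) < sample_count:  # S15: stop at the required count
        address = rng.randrange(total)  # S11: uniform random position
        k = bisect_right(ends, address)  # S12: index of record at that position
        s_k = lengths[k]                # S13: length of the selected record
        if rng.random() <= s_min / s_k:  # S14: adopt with p = S_min / S_k
            samples.append(records[k])
        # otherwise discard and return to S11
    return samples


rng = random.Random(0)
records = [b"x", b"yyy"]  # one short record, one three times as long
samples = sample_records(records, 4000, rng)
share_short = samples.count(b"x") / len(samples)
print(f"share of the short record: {share_short:.3f}")
```

Without the adoption step the short record would make up roughly a quarter of the samples; with it, the share should come out close to one half.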

When the sampling unit 111 has finished collecting samples, the histogram generation unit 112 generates a histogram from the collected samples. The histogram is generated by counting the number of records for each key found in the acquired sample data.
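Counting records per key can be sketched with a standard counter. The record format ("event ID, then a comma, then a payload"), the sample values, and the `key_of` extractor are hypothetical; the source only says that records are counted per key.

```python
from collections import Counter


def build_histogram(samples, key_of):
    """Count the sampled records per key; `key_of` extracts a record's key."""
    return Counter(key_of(record) for record in samples)


# Hypothetical vehicle-log samples keyed by event ID.
samples = [
    "E01,speed=40",
    "E02,brake",
    "E01,speed=55",
    "E03,door",
    "E01,speed=60",
]
histogram = build_histogram(samples, key_of=lambda rec: rec.split(",", 1)[0])

for event_id, count in sorted(histogram.items()):
    print(event_id, count)
```

Taking the key extractor as a parameter keeps the counting independent of the record format, which varies from one data source to another.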

According to this embodiment, a sampling device that samples from a data source whose records differ in length sets the adoption probability to be inversely proportional to the record length, and can therefore sample records evenly regardless of record length. The generated histogram thus represents the trend of the entire data source, so the tendency of the data distribution can be obtained in a short time even for a huge data source.

In the first embodiment, the histogram is generated by the histogram generation unit, but the histogram generation unit is not an essential component if the histogram can be generated by other means or if data aggregation means other than a histogram is available.

(Second embodiment)
In the second embodiment, the sampling device 100 of the first embodiment is incorporated into a distributed processing system that uses Hadoop, a distributed processing framework. FIG. 6 is a system configuration diagram according to the second embodiment, and FIG. 7 is a conceptual diagram of the system according to the second embodiment. The data source and record structure handled by the sampling device according to the second embodiment are the same as those described in the first embodiment.

In the second embodiment, the distributed files held by the distributed processing nodes are used as the data source. The distributed files 202a, 202b, and 202c are placed on the computers 201a, 201b, and 201c, which are the distributed processing nodes, and can be accessed seamlessly from the sampling unit 111. Because the sampling unit 111 sees what appears to be a single data source, data can be sampled by the same method as in the first embodiment.

The histogram generated by the histogram generation unit 112 is sent to the task allocation unit 130. Based on the generated histogram, the task allocation unit 130 acquires the tasks to be processed from the distributed files 202a, b, and c and allocates the processing to the distributed processing nodes 201a, b, and c.

In the present embodiment, the present invention is applied to Hive, which runs on Hadoop, a distributed processing framework. As shown in FIG. 7, Hive has three phases: Compiler, Optimizer, and Executor. Sampling is performed in the Optimizer phase, and, based on the resulting histogram, tasks are allocated (that is, scheduled) in the Executor phase so that the task size per unit performance of the distributed processing nodes is equalized.
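One way to picture equalizing the task size per unit performance is the greedy sketch below. It is only an illustration under stated assumptions: the histogram is taken to be a mapping from key value to sampled count, each node has a hypothetical relative performance figure, and the function `assign_tasks` is not Hive's scheduler or any Hadoop API.

```python
def assign_tasks(histogram, node_performance):
    """Greedily assign histogram keys to nodes, balancing load per unit performance.

    histogram: dict mapping key value -> sampled frequency (assumed format).
    node_performance: dict mapping node name -> relative performance (assumed).
    """
    load = {node: 0.0 for node in node_performance}
    assignment = {}
    # Place the largest keys first; each goes to the node whose
    # performance-normalized load would be lowest after receiving it.
    for key, count in sorted(histogram.items(), key=lambda kv: kv[1], reverse=True):
        node = min(load, key=lambda n: (load[n] + count) / node_performance[n])
        assignment[key] = node
        load[node] += count
    return assignment, load
```

For instance, with two nodes where the first is assumed to be twice as fast as the second, the first node ends up with the larger share of the sampled keys.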

This configuration reduces the load imbalance among nodes in distributed processing and improves the throughput of the entire system. Although this example applies the invention to Hive, the invention may also be applied to other middleware that runs on Hadoop, such as Pig.

(Modifications)
The above embodiments are merely examples, and the present invention can be implemented with appropriate modifications within a range that does not depart from its gist. For example, the description of the embodiments gave an example in which the program that performs the processing is recorded in an auxiliary storage device and executed by a CPU, but the processing may instead be performed by an FPGA, or may be designed and executed as hardware.

In the description of the embodiments, the probability p of adopting the selected record as a sample is calculated as S_min / S_k. However, the probability that each record is sampled is also uniform when the adoption probability p is a × S_min / S_k (where 0 < a < 1). Multiplying by the coefficient a increases the probability that no record is sampled in a single execution, so setting p to S_min / S_k is the most efficient choice.
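The reasoning can be written out as a short derivation. It assumes, as in the first embodiment, that a recording position is drawn uniformly over the total size of the data source, so that record k is selected with probability S_k / T; the symbols T (total size) and N (number of records) are introduced here only for illustration.

```latex
\begin{align*}
P(\text{record } k \text{ sampled in one trial})
  &= \underbrace{\frac{S_k}{T}}_{\text{selected}}
     \cdot
     \underbrace{\frac{a\,S_{\min}}{S_k}}_{\text{adopted}}
   = \frac{a\,S_{\min}}{T}, \\
P(\text{some record sampled in one trial})
  &= \sum_{k=1}^{N} \frac{a\,S_{\min}}{T}
   = \frac{a\,N\,S_{\min}}{T}
   = \frac{a\,S_{\min}}{S_{\mathrm{avg}}},
  \qquad S_{\mathrm{avg}} = \frac{T}{N}.
\end{align*}
```

The first line does not depend on k, so every record is equally likely for any coefficient a. The second line grows with a, which is why a = 1 wastes the fewest trials.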

To obtain the effect of the invention, the probability of adopting a selected record must be adjusted according to the size of that record; that is, the longer the selected record, the lower the adoption probability must be. The adoption probability need not, however, be inversely proportional to the record length.

For example, the adoption probability p may be set to n / S_k (where n is an arbitrary constant), with p treated as 1 whenever it exceeds 1. In this case, the probability that each record is sampled is not uniform, but it can be brought close to uniform. Thus, to bring the probability that each record is sampled close to a uniform value, it is sufficient that the adoption probability decreases as the selected record becomes longer.
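The effect of the clamped probability can be checked numerically with the sketch below. It assumes that a record is selected with probability proportional to its length (a uniformly random recording position); the function names are hypothetical.

```python
def adoption_probability(record_length, n):
    """Adoption probability p = n / S_k, treated as 1 when it exceeds 1."""
    return min(1.0, n / record_length)


def per_trial_sampling_probabilities(lengths, n):
    """Probability that each record is sampled in one trial.

    Assumption: record k is selected with probability S_k / (total size).
    """
    total = sum(lengths)
    return [(s / total) * adoption_probability(s, n) for s in lengths]
```

Under these assumptions each value works out to min(S_k, n) divided by the total size: records at least n long are sampled equally often, while shorter records are sampled less often, so the distribution is close to uniform but not exactly uniform.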

When record sizes vary extremely, the required number of records may not be sampled unless selection is repeated many times. For example, if the average record size is S_avg, the probability that a record is adopted in a single execution is S_min / S_avg, so the number of loops required to obtain n samples is (n × S_avg) / S_min. If this value exceeds the total number of records, reading the entire data source is more efficient.
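The comparison described above can be expressed as a small check. This is a sketch that assumes the total size, the record count, and S_min are already known (for example from metadata); the function names are hypothetical.

```python
def expected_loops(num_samples, total_size, record_count, s_min):
    """Expected number of loops to obtain num_samples samples: (n x S_avg) / S_min."""
    s_avg = total_size / record_count  # average record size S_avg
    return num_samples * s_avg / s_min


def full_scan_is_cheaper(num_samples, total_size, record_count, s_min):
    """True when the expected loop count exceeds the total number of records."""
    return expected_loops(num_samples, total_size, record_count, s_min) > record_count
```

A caller could use `full_scan_is_cheaper` to decide between sampling and simply reading every record.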

For example, in the example of FIG. 4(b), S_avg = 1.66... × S_min, so when n ≥ 4 the number of loops is equal to or greater than the total number of records. Therefore, when the record sizes vary enough to meet this condition, the probabilistic element of the algorithm may be reduced.

For example, by selecting the m-th, (m × 2)-th, ... records counted from a randomly selected record, at least one record can be reliably sampled in each loop. This has the side effect that local tendencies of the records are reflected in the histogram, but the effect may be mitigated by selecting records that are distant from the randomly selected record.
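A minimal sketch of this variation follows. It assumes the records are addressable by index, that the starting record is chosen uniformly by index rather than by recording position, and that the selection wraps around at the end of the data source; none of these details is specified in the text, and `stride_sample` is a hypothetical name.

```python
import random


def stride_sample(records, num_samples, m, rng=None):
    """Pick the m-th, 2m-th, ... records counted from a randomly chosen record.

    Assumptions: index-addressable records, uniform random start, wrap-around.
    """
    rng = rng or random.Random()
    start = rng.randrange(len(records))
    return [records[(start + i * m) % len(records)] for i in range(1, num_samples + 1)]
```

A larger m spreads the chosen records further from the starting point, which is the mitigation the text suggests for local tendencies.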

Description of symbols
100 Sampling device
111 Sampling unit
112 Histogram generation unit
120 Data source
121 Minimum record length storage unit
130 Task allocation unit
201a, b, c Distributed processing node
202a, b, c Distributed file node

Claims

1. A sampling device that randomly samples a record from a plurality of records recorded in a storage device, the sampling device comprising:
random number generating means for generating a random number corresponding to a recording position;
record selecting means for selecting, from the plurality of records, a record having data at the recording position corresponding to the generated random number;
record length acquisition means for acquiring a record length of the selected record; and
sample determination means for adopting the selected record as a sample with a probability calculated based on the acquired record length.

2. The sampling device according to claim 1, wherein the sample determination means calculates the probability so as to be inversely proportional to the acquired record length.

3. The sampling device according to claim 2, further comprising minimum record length acquisition means for acquiring a record length S_min of the record having the minimum size among the plurality of records,
wherein the sample determination means calculates the probability using the formula S_min / S_k, where S_k is the record length of the selected record.

4. The sampling device according to claim 3, further comprising histogram generating means for generating, from the records adopted as samples, a histogram indicating a frequency distribution of the data recorded in the storage device.

5. The sampling device according to claim 4, wherein the storage device is a distributed file system, and processing to be executed is assigned to external distributed processing nodes using the histogram.

6. A sampling method performed by a sampling device that randomly samples a record from a plurality of records recorded in a storage device, the sampling method comprising:
generating a random number corresponding to a recording position;
selecting, from the plurality of records, a record having data at the recording position corresponding to the generated random number;
acquiring a record length of the selected record; and
adopting the selected record as a sample with a probability calculated based on the acquired record length.

7. The sampling method according to claim 6, wherein the probability is calculated so as to be inversely proportional to the acquired record length.

8. The sampling method according to claim 7, further comprising acquiring a record length S_min of the record having the minimum size among the plurality of records,
wherein the probability is calculated using the formula S_min / S_k, where S_k is the record length of the selected record.

9. A program for randomly sampling a record from a plurality of records recorded in a storage device, the program causing a computer to execute:
a process of generating a random number corresponding to a recording position;
a process of selecting, from the plurality of records, a record recorded at the recording position corresponding to the generated random number;
a process of acquiring a record length of the selected record; and
a process of adopting the selected record as a sample with a probability calculated based on the acquired record length.

10. The program according to claim 9, wherein the program causes the computer to calculate the probability so as to be inversely proportional to the acquired record length.

11. The program according to claim 10, further causing the computer to execute a process of acquiring a record length S_min of the record having the minimum size among the plurality of records,
wherein the probability is calculated using the formula S_min / S_k, where S_k is the record length of the selected record.