JP5925044B2

JP5925044B2 - Different number counting method, apparatus, and program

Info

Publication number: JP5925044B2
Application number: JP2012104399A
Authority: JP
Inventors: 浩司藤本; 加藤　剛; 剛加藤
Original assignee: TENSOR CONSULTING CO. LTD.; Sophia School Corp
Current assignee: TENSOR CONSULTING CO. LTD.; Sophia School Corp
Priority date: 2012-05-01
Filing date: 2012-05-01
Publication date: 2016-05-25
Anticipated expiration: 2032-05-01
Also published as: JP2013232140A

Description

The present invention relates to a technique for estimating the number of distinct values (the "different number", hereinafter called the distinct count) in a data group having an enormous number of elements.

In information processing, data groups having an enormous number of elements must be processed efficiently and in a short time. In many cases, a data group subjected to information processing contains multiple elements of the same kind, and some techniques exploit this fact to process the data group more efficiently. That is, the number of kinds of elements contained in the data group, namely the distinct count, can be put to effective use in information processing. For data, or for a data item of a finite data set, the number of distinct values that it takes is called the distinct count of the data or of the data item.

For example, when sorting data in a database in which an enormous amount of data is accumulated, it is known that the sorting process can be sped up by extracting the distinct count of the elements in the database and using it. Also, when accumulating data in a database, the total data volume can be compressed by recording the kinds of elements and their frequencies instead of storing every data item in its own area. In that case, too, the number of kinds of elements must be counted in advance.

Also, if the number of users who accessed a site can be extracted from an access log that accumulates a huge history of accesses to the site, it can be used for access analysis. In that case, the distinct count of the access source addresses contained in the access log gives the number of users. A large-scale Internet site must constantly monitor accesses from outside in order to protect its servers from various external attacks and from access loads beyond the permissible range. Measuring the number of unique accessing hosts in a large volume of access logs is part of this monitoring; in other words, it means counting the distinct count of host identification information, such as IP addresses, in the access log data. At a large site, the number of accesses per day exceeds one billion, and keeping track of statistics over all records at all times requires enormous computing resources.

In creating translation software and dictionary databases, the number of words is extracted from a huge corpus and used to construct the database. In that case, the distinct count of the huge corpus gives the number of words.

In address management, the number of surnames can be extracted from the accumulated resident information and used to manage the resident data. In that case, the distinct count of the surnames in the resident information gives the number of surnames.

As described above, counting the distinct count of data or of a data item is very useful in data processing. However, if the data group is enormous, counting the distinct count also requires an enormous amount of processing and storage capacity. A technique for efficiently counting the distinct count of an enormous data group is therefore required.

In response to this demand, Patent Document 1 discloses a technique for estimating the distinct counts of various data groups with high accuracy. The technique of Patent Document 1 attempts to estimate the distinct count with high accuracy by combining a counting method suited to small distinct counts with a counting method suited to large distinct counts.

Patent Document 1: JP 2008-59293 A

The technique of Patent Document 1 presupposes that a single counting method cannot count the distinct counts of various data groups with high accuracy, and therefore estimates the distinct count of the same data group by two counting methods. As a result, it performs redundant processing, and its processing efficiency is poor.

An object of the present invention is to provide a technique for efficiently estimating the distinct count of a data group.

A different number counting method according to one aspect of the present invention is a method for counting the distinct count of a population consisting of a data group, and comprises the following two steps.

In the first step, matrix generation means operates on the basis of the size N of the population and the extraction rate r at which a sample is extracted from the population. It uses the elements of a binomial probability matrix P, which is obtained by equation (C-1) and defines the relationship between the frequency distinct-count vector giving the distinct count for each frequency in the population and the frequency distinct-count vector giving the distinct count for each frequency in the sample extracted from the population at the extraction rate. By an approximation formula in which the operation of summing discrete values in equation (C-2) is approximated by integration, it obtains a log-compressed binomial probability matrix LP, which is a compressed matrix obtained by compressing the binomial probability matrix P in the row direction and the column direction, and generates the inverse matrix LP^{-1} of the log-compressed binomial probability matrix LP.

In the second step, different number calculation means calculates an estimated log-compressed frequency distinct-count vector of the population as the product of the inverse matrix LP^{-1} and the log-compressed frequency distinct-count vector obtained by compressing the frequency distinct-count vector of the sample extracted from the population, and calculates the sum of the elements of the estimated log-compressed frequency distinct-count vector of the population as the estimate of the distinct count of the population.

According to the present invention, the distinct count of a population can be estimated using a matrix calculated in advance and a sample extracted from the population, so the distinct count can be estimated efficiently without analyzing every element.

FIG. 1 is a block diagram of the different number counting device according to this embodiment.
FIG. 2 is a flowchart showing the operation of the different number counting device of this embodiment.
FIG. 3 is a block diagram showing an example of a data analysis device 20 incorporating the different number counting device 10 of this embodiment.
FIG. 4 is a diagram schematically showing an example of an access log.
FIG. 5 is a diagram schematically showing an example of access analysis result data.
FIG. 6 is a diagram showing how a sample is extracted from a population.
FIG. 7 is a diagram for explaining the f-frequency distinct count U_f.
FIG. 8 is a diagram showing the principle of approximation by a binomial probability matrix using a toy model.
FIG. 9 is a diagram showing how a binomial probability matrix is compressed.
FIG. 10 is a diagram showing the distribution of elements in a binomial probability matrix.
FIG. 11 is a diagram showing how a frequency distinct-count vector is compressed.
FIG. 12 is a diagram showing the principle of approximation by a log-compressed binomial probability matrix.
FIG. 13 is a diagram showing the results of comparing estimation accuracy for different extraction rates in estimation experiment 1.
FIG. 14 is a diagram showing the results of comparing estimation accuracy for different variables and extraction rates in estimation experiment 1.
FIG. 15 is a graph of the estimates relative to the true values in the results of FIG. 14.
FIG. 16 is a diagram showing the results of estimation experiment 2.
FIG. 17 is a diagram showing a step function and its approximating curve.

A basic embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram of the different number counting device according to this embodiment. The different number counting device 10 of this embodiment counts the distinct count of a population consisting of an enormous data group.

The population is, for example, the enormous body of data accumulated in a database or, when each record consists of multiple items, the set whose elements are the values of a given item. The number of data records accumulated in the database is the size of the population. Because the population may contain data with the same value, the distinct count of the population is less than or equal to the size of the population.

The different number counting device 10 of this embodiment counts the distinct count by approximation. The approximation estimates the distinct count of the population from a sample group extracted from the population at a predetermined extraction rate. The extraction rate is assumed to be given in advance.

Here, for a data group such as the population or a sample group extracted from the population, a frequency distinct-count vector is defined. The frequency distinct-count vector gives the distinct count for each frequency, where the frequency of a value is the number of data items having that value. In other words, the f-th element of the vector tells how many kinds of values occur exactly f times.
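As an illustration only, the frequency distinct-count vector of a small data group can be computed as follows. The function name `frequency_distinct_count_vector` and the dictionary representation (frequency mapped to distinct count) are assumptions made for this sketch and do not appear in the specification.

```python
from collections import Counter


def frequency_distinct_count_vector(values):
    """Return {f: U_f}, where U_f is the number of distinct values occurring exactly f times."""
    occurrences = Counter(values)               # value -> frequency
    return dict(Counter(occurrences.values()))  # frequency -> number of distinct values


# Toy data group: "a" occurs 3 times, "b" twice, "c" and "d" once each.
data = ["a", "b", "a", "c", "a", "b", "d"]
vec = frequency_distinct_count_vector(data)

distinct_count = sum(vec.values())               # sum of the vector elements
size = sum(f * u for f, u in vec.items())        # total number of data items
print(vec, distinct_count, size)
```

The sum of the elements of the vector is the distinct count, and the frequency-weighted sum is the size of the data group.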

Here, the frequency distinct-count vector of the population is unknown, whereas the frequency distinct-count vector of the sample is known.

Referring to FIG. 1, the different number counting device 10 includes a matrix generation unit 11 and a different number calculation unit 12.

Based on the size of the population and the extraction rate at which the sample is extracted from the population, the matrix generation unit 11 generates a matrix that defines, by binomial probabilities, the relationship between the frequency distinct-count vector of the population, which gives the distinct count for each frequency in the population, and the frequency distinct-count vector of the sample extracted from the population at that extraction rate. This matrix is described in detail, step by step, later.

The different number calculation unit 12 calculates an estimate of the distinct count of the population based on the matrix generated by the matrix generation unit 11 and the sample extracted from the population at the extraction rate.

According to this embodiment, the distinct count of the population can be estimated using a matrix calculated in advance and a sample extracted from the population, so the distinct count can be estimated efficiently without analyzing every element.

More specific operations of each unit are described below.

The above matrix is called the log-compressed binomial probability matrix, in the sense that it is a binomial probability matrix compressed logarithmically. The binomial probability matrix is a matrix whose elements are probabilities arranged so that its product with the frequency distinct-count vector of the population gives the frequency distinct-count vector of the sample. The log-compressed binomial probability matrix is the compressed matrix obtained by compressing this binomial probability matrix in the row direction and the column direction.

The different number calculation unit 12 compresses the frequency distinct-count vector of the sample extracted from the population to obtain the log-compressed frequency distinct-count vector of the sample, and calculates the estimated log-compressed frequency distinct-count vector of the population as the product of the sample's log-compressed vector and the inverse of the log-compressed binomial probability matrix. The different number calculation unit 12 then takes the sum of the elements of the estimated log-compressed frequency distinct-count vector of the population as the estimate of the distinct count of the population. A log-compressed frequency distinct-count vector is a frequency distinct-count vector compressed so as to match the log-compressed binomial probability matrix.

The distinct count of the population is thus calculated using the log-compressed binomial probability matrix, which is obtained by compressing, in the row and column directions, the binomial probability matrix expressing the relationship between the frequency distinct-count vector of the population and that of the sample group. The distinct count can therefore be calculated efficiently with a small amount of processing.

In doing so, the matrix generation unit 11 performs the compression so that the number of rows of the binomial probability matrix merged into one element of the log-compressed binomial probability matrix increases as the row number increases, and likewise the number of columns merged into one element increases as the column number increases. Because the binomial probability matrix varies more gently as the row and column numbers grow, merging it into single elements in larger units as the row and column numbers grow greatly reduces the amount of processing while limiting the loss of accuracy.

More preferably, the matrix generation unit 11 performs the compression so that the number of rows of the binomial probability matrix merged into one element of the log-compressed binomial probability matrix increases exponentially as the row number increases, and the number of columns merged into one element increases exponentially as the column number increases. For example, the elements may be divided at every power-of-2 position and compressed. Because the binomial probability matrix varies logarithmically more gently as the row and column numbers grow, merging it into single elements in exponentially larger units as the row and column numbers grow greatly reduces the amount of processing while limiting the loss of accuracy.

The basic operation flow of the different number counting device of this embodiment is described with reference to a flowchart.

FIG. 2 is a flowchart showing the operation of the different number counting device of this embodiment. Referring to FIG. 2, the different number counting device 10 first generates, by the matrix generation unit 11, the matrix based on binomial probabilities described above (step 101). The different number counting device 10 then calculates the distinct count of the population based on that matrix and the sample.

Next, the specific calculations performed by the different number counting device of this embodiment are described.

First, how to obtain the log-compressed binomial probability matrix is described.

The frequency distinct-count vector U of a population of size N is expressed by equation (1), and the frequency distinct-count vector V of a group of M samples is expressed by equation (2).

The binomial probability matrix is the N-th order matrix P whose elements are the probabilities that, when elements are extracted from the population at a predetermined extraction rate r, a value that has n identical copies in the population is extracted k times into the sample. For each frequency in the population, this matrix gives the probability of occurrence of each frequency in the sample. The binomial probability matrix can be expressed by equation (3). If i < j, then P_ij = 0.

U, V, and P satisfy the relationship of equation (4), so that when V is given, U can be estimated by equation (5).
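The relationship between U, V, and P can be sketched on a toy example. Equations (3) to (5) are not reproduced in this text, so the sketch assumes the element form P_nk = C(n, k) r^k (1 - r)^(n - k), which is zero when n < k, and the row-vector convention V = U P. The function names are hypothetical, and the sketch uses the expected value of V rather than an actual random sample.

```python
from math import comb


def binomial_matrix(N, r):
    """P[n-1][k-1]: probability that a value with population frequency n
    appears exactly k times in the sample (0 when k > n). Assumed form."""
    return [[comb(n, k) * r**k * (1 - r)**(n - k) if k <= n else 0.0
             for k in range(1, N + 1)]
            for n in range(1, N + 1)]


def expected_sample_vector(U, P):
    """V = U P (row vector times matrix)."""
    N = len(U)
    return [sum(U[n] * P[n][k] for n in range(N)) for k in range(N)]


def solve_population_vector(V, P):
    """Solve V = U P for U by back substitution; P is lower triangular."""
    N = len(V)
    U = [0.0] * N
    for k in range(N - 1, -1, -1):
        known = sum(U[n] * P[n][k] for n in range(k + 1, N))
        U[k] = (V[k] - known) / P[k][k]
    return U


r = 0.5
U_true = [5.0, 3.0, 0.0, 2.0]          # U_1..U_4 of a toy population
P = binomial_matrix(len(U_true), r)
V = expected_sample_vector(U_true, P)  # expected sample vector
U_hat = solve_population_vector(V, P)
print(U_hat, sum(U_hat))               # sum is the estimated distinct count
```

With a real sample, V is a noisy observation rather than an expectation, so the recovered U is only an estimate.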

As it stands, when the population size N is enormous, the binomial probability matrix P becomes huge and its inverse P^{-1} is difficult to obtain. Therefore, a log-compressed binomial probability matrix LP, obtained by compressing P logarithmically, is introduced. LP is obtained from P by summing the probability values in the column direction over groups delimited at every (power of 2)-th column, and averaging the probability values in the row direction over groups delimited at every (power of 2 ÷ extraction rate)-th row. The compression of the binomial probability matrix into the log-compressed binomial probability matrix may instead take a weighted average in the row direction and a total in the column direction. The weight of each element may be set as appropriate; alternatively, a uniform weight may be given to all elements.

The elements of the binomial probability matrix that are grouped and merged into one are collected in groups delimited at every power-of-2 position, so the number of elements per group grows exponentially along the row direction and the column direction.
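The grouping can be sketched as follows, using the boundaries n_1, n_2, k_1, k_2 that the description gives later for equation (A-14). Reading the bracket [x] as the floor function and requiring 1/r to be an integer are assumptions of this sketch, as are the function names.

```python
def column_groups(num_groups):
    """Sample-frequency (k) groups: k_1 = [2^(j-2)] + 1 to k_2 = 2^(j-1)."""
    groups = []
    for j in range(1, num_groups + 1):
        floor_term = 2 ** (j - 2) if j >= 2 else 0   # [2^(j-2)], floor assumed
        groups.append((floor_term + 1, 2 ** (j - 1)))
    return groups


def row_groups(num_groups, inv_r):
    """Population-frequency (n) groups: n_1 = [2^(i-2)]/r + 1 to n_2 = 2^(i-1)/r.
    inv_r is 1/r, assumed here to be an integer."""
    groups = []
    for i in range(1, num_groups + 1):
        floor_term = 2 ** (i - 2) if i >= 2 else 0
        groups.append((floor_term * inv_r + 1, 2 ** (i - 1) * inv_r))
    return groups


print(column_groups(4))
print(row_groups(4, 2))   # extraction rate r = 1/2
```

Each group is roughly twice as large as the previous one, so a matrix of order N is reduced to an order of about log2(rN) + 1.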

The log-compressed binomial probability matrix LP can be expressed by equation (6).

Next, the approximate calculation of the log-compressed binomial probability matrix is described.

LP_ij in equation (6) contains a weighted sum of binomial probabilities.

Let n ∈ {1, 2, ...} and r ∈ (0, 1), and let X_{r,n} be a random variable following the binomial distribution Bin(n, r). The weighted sum of binomial probabilities can then be expressed as equation (A-1).

Here, the weight f(n | β, n_1, n_2) is the probability function of a zeta distribution with parameter β ≥ 1 truncated to the set of natural numbers {n_1, n_1 + 1, ..., n_2}. That is, the weight f(n | β, n_1, n_2) can be expressed as equation (A-2).

The normalization constant is defined here as in equation (A-3).
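Equations (A-2) and (A-3) are not reproduced in this text. As a sketch, assuming the usual form of a truncated zeta distribution, in which f(n | β, n_1, n_2) is n^{-β} divided by the sum of m^{-β} over m from n_1 to n_2, the weights could be computed as follows. The function name is hypothetical.

```python
def truncated_zeta_weights(beta, n1, n2):
    """Weights f(n | beta, n1, n2) on {n1, ..., n2}, proportional to n^(-beta).
    Assumed form; the normalization constant is the plain sum of n^(-beta)."""
    norm = sum(m ** -beta for m in range(n1, n2 + 1))
    return {n: n ** -beta / norm for n in range(n1, n2 + 1)}


weights = truncated_zeta_weights(1.0, 3, 4)
print(weights)
```

Larger β puts more weight on the small frequencies at the start of each group.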

The approximate calculation that computes the log-compressed binomial probability matrix at high speed in this embodiment approximates the sum of weighted binomial probabilities in equation (A-1) by integrals with high accuracy. Unlike a sum of discrete values, an integral is determined by the values at its two endpoints, so the amount of computation is reduced.

The summation includes sums along the vertical direction of the matrix (the n direction) and along the horizontal direction (the k direction), and an integral approximation suited to each direction is applied.

The matrix generation unit 11 applies a suitable approximation method to each of the horizontal direction (the k direction) and the vertical direction (the n direction). In the horizontal direction, it applies an approximation based on the well-known central limit theorem and the continuity correction. In the vertical direction, it applies an approximation formula in which the step function contained in equation (A-1) is approximated by a continuous curve and converted into an integral. In that approximation formula, the continuous curve approximating the step function is expressed in parametric form, and the formula includes a line integral along that curve.

(Approximation in the horizontal direction (the k direction))

Equation (A-1) above can be transformed into equations (A-4) and (A-5) below.

In approximating equation (A-4) by equation (A-5), use is made of the fact that the probability distribution of the binomial random variable X_{r,n}, which has mean nr and variance nr(1 - r), can be approximated by a normal distribution with mean nr and variance nr(1 - r). This is an application of the central limit theorem.

In addition, in the integration interval of equation (A-5), the lower limit is shifted by -0.5 and the upper limit is shifted by +0.5. This is known as the continuity correction. Using the continuity correction improves the accuracy with which a binomial distribution (a distribution of binomial probabilities) is approximated by a normal distribution.
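The effect of the continuity correction can be checked numerically. The following sketch compares the exact binomial probability P(k_1 ≤ X_{r,n} ≤ k_2) with its normal approximation, with and without the ±0.5 shift. It illustrates the general technique, not the exact form of equation (A-5), and the function names are hypothetical.

```python
from math import comb, erf, sqrt


def binomial_interval_exact(n, r, k1, k2):
    """Exact P(k1 <= X <= k2) for X ~ Bin(n, r)."""
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(k1, k2 + 1))


def normal_cdf(x, mean, var):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))


def binomial_interval_normal(n, r, k1, k2, correction=True):
    """Normal approximation with mean nr and variance nr(1-r).
    With correction, the limits are widened to k1 - 0.5 and k2 + 0.5."""
    shift = 0.5 if correction else 0.0
    mean, var = n * r, n * r * (1 - r)
    return normal_cdf(k2 + shift, mean, var) - normal_cdf(k1 - shift, mean, var)


n, r, k1, k2 = 100, 0.5, 45, 55
exact = binomial_interval_exact(n, r, k1, k2)
corrected = binomial_interval_normal(n, r, k1, k2, correction=True)
uncorrected = binomial_interval_normal(n, r, k1, k2, correction=False)
print(exact, corrected, uncorrected)
```

The approximation replaces a sum over k by two evaluations of the normal distribution function, which is where the saving in computation comes from.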

(Approximation in the vertical direction (the n direction))

Equation (A-5) above can be further transformed into equation (A-6).

Here, χ_{(c_1, c_2]}(x) is the indicator function of the interval (c_1, c_2], as given by equation (A-7).

In addition, a parameter t is introduced to define equations (A-8) and (A-9).

The parts of equation (A-6) given by expressions (A-10) and (A-11) are sums of indicator functions, that is, step functions.

Such a step function is a typical discontinuous function. If it were computed as it is, equation (A-6) would require summing an enormous number of terms. In this embodiment, the parts given by expressions (A-10) and (A-11) are therefore approximated by curves.

For example, examining the behavior of expression (A-10) shows that it is a step function whose jump points are b(n_2) < b(n_2 + 1) < ... < b(n_1) and which has a jump of n^{-β} at each jump point b(n). FIG. 17 is a diagram showing the step function and its approximating curve. In FIG. 17, y denotes the whole of expression (A-10) and x denotes b(n) in expression (A-10). The step function of expression (A-11) can be represented in the same way as that of expression (A-10).

Therefore, a parametrically expressed curve that follows the behavior of the step function gives an appropriate approximation to it. Equation (A-12) shows one example of such an approximation by a parametric curve.

The approximating curve given by equation (A-12) is shown in FIG. 17, which shows that the approximation is good. An equally good approximation is possible for expression (A-11).

The expression (A-3) that forms the weight of the truncated zeta distribution is likewise approximated by an integral, as shown in equation (A-13).
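Equation (A-13) is not reproduced in this text. As an illustration of the idea, the sketch below approximates the sum of n^{-β} over n_1 ≤ n ≤ n_2 by the integral of x^{-β} from n_1 - 0.5 to n_2 + 0.5. The choice of integration limits is an assumption of this sketch and may differ from equation (A-13); the function names are hypothetical.

```python
from math import log


def zeta_sum_exact(beta, n1, n2):
    """Direct sum of n^(-beta) for n = n1..n2 (cost grows with n2 - n1)."""
    return sum(n ** -beta for n in range(n1, n2 + 1))


def zeta_sum_integral(beta, n1, n2):
    """Integral of x^(-beta) over [n1 - 0.5, n2 + 0.5] (constant cost).
    The half-unit widening of the limits is an assumption of this sketch."""
    a, b = n1 - 0.5, n2 + 0.5
    if beta == 1:
        return log(b / a)
    return (b ** (1 - beta) - a ** (1 - beta)) / (1 - beta)


beta, n1, n2 = 1.5, 10, 1000
exact_sum = zeta_sum_exact(beta, n1, n2)
approx_sum = zeta_sum_integral(beta, n1, n2)
print(exact_sum, approx_sum)
```

The integral depends only on the two endpoints, so its cost does not grow with the number of terms in the group.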

Expression (A-12) for expression (A-10), the corresponding approximation for expression (A-11), and the approximation (A-13) of the truncated zeta weight are substituted into equation (A-6). Because the approximating curves are in parametric form, the result is computed as a line integral. This yields the basic equation of the approximate calculation (partially excerpted in equation (A-14)).

Substituting n_1 = [2^{i-2}] r^{-1} + 1, n_2 = 2^{i-1} r^{-1}, k_1 = [2^{j-2}] + 1, and k_2 = 2^{j-1} into equation (A-14) gives an approximate value of the weighted sum of binomial probabilities in equation (6). Dividing that value by (2^{i-1} - [2^{i-2}]) r^{-1} gives the approximate value of LP_ij in equation (6).

The log-compressed binomial probability matrix can be calculated approximately as described above.

The frequency distinct-count vector V is also compressed logarithmically, in the same way as the log-compressed binomial probability matrix LP. The compressed frequency distinct-count vector (log-compressed frequency distinct-count vector) LV can be expressed by equation (7). The log-compressed frequency distinct-count vector of the sample is obtained by summing the values of the sample's frequency distinct-count vector over groups delimited at every (power of 2)-th element. The log-compressed frequency distinct-count vector of the population, for a population sampled at extraction rate r, is obtained by summing the values of the population's frequency distinct-count vector over groups delimited at every (power of 2 ÷ extraction rate)-th element.

The estimate of U can then be obtained by equation (8), using the inverse matrix LP^{-1} of LP.

以上のような演算処理により、行列生成部１１が式（６）に示される対数圧縮二項確率行列ＬＰの逆行列ＬＰ^−１を生成し、異なり数算出部１２が式（８）により、母集団の度数異なり数ベクトルＵの推定値を算出する。 Through the arithmetic processing described above, the matrix generation unit 11 generates the inverse matrix LP⁻¹ of the logarithmically compressed binomial probability matrix LP shown in Expression (6), and the different number calculation unit 12 calculates the estimated value of the population's frequency difference number vector U by Expression (8).

なお、ここでは、演算およびその考え方の理解を容易にするために、二項確率行列Ｐを求め、それを圧縮することにより対数圧縮二項確率行列ＬＰを生成するような説明をしている。しかし実際には、行列生成部１１は、一旦、二項確率行列Ｐを求めるということをせず、直接、対数圧縮二項確率行列ＬＰを生成する。 Here, in order to facilitate understanding of the calculation and its concept, a description is given of obtaining the binomial probability matrix P and compressing it to generate the logarithmically compressed binomial probability matrix LP. However, actually, the matrix generation unit 11 does not obtain the binomial probability matrix P once, but directly generates the logarithm compression binomial probability matrix LP.

図３は、本実施形態の異なり数計数装置１０を組み込んだデータ分析装置２０の一例を示すブロック図である。図３を参照すると、データ分析装置２０は、図１と同じ行列生成部１１および異なり数算出部１２に加えて、記憶部２１、サイズ算出部２２、標本抽出部２３、およびデータ分析部２４を有している。 FIG. 3 is a block diagram illustrating an example of a data analysis apparatus 20 that incorporates the different number counting apparatus 10 of the present embodiment. Referring to FIG. 3, the data analysis apparatus 20 includes a storage unit 21, a size calculation unit 22, a sample extraction unit 23, and a data analysis unit 24, in addition to the same matrix generation unit 11 and different number calculation unit 12 as in FIG. 1.

記憶部２１は、母集団のデータ群を蓄積している。一例として、あるサーバへのアクセスデータを蓄積したアクセスログのデータ群を母集団とする。図４は、アクセスログの一例を模式的に示した図である。このアクセスログは、あるサーバへのアクセスデータを時系列に蓄積したものである。アクセスログには、個々のアクセスについて、アクセス番号（Ｎｏ．）、発ＩＰアドレス、およびユーザＩＤを含む各種項目が蓄積されている。アクセス番号は、サーバへのアクセスが発生した順に各アクセスに付与されるシリアル番号である。発ＩＰアドレスは、アクセス元のＩＰアドレスであり、アクセス元のユーザの識別に用いられる。利用時間は、そのアクセスによるサイト閲覧等の利用が継続した時間である。 The storage unit 21 stores a population data group. As an example, a data group of access logs in which access data for a certain server is accumulated is defined as a population. FIG. 4 is a diagram schematically illustrating an example of an access log. This access log stores access data to a certain server in time series. In the access log, various items including an access number (No.), an originating IP address, and a user ID are stored for each access. The access number is a serial number assigned to each access in the order in which access to the server occurs. The source IP address is the IP address of the access source, and is used for identifying the user of the access source. The usage time is the time during which the use of browsing the site by the access has been continued.

サイズ算出部２２は、記憶部２１に蓄積されている母集団のデータ群のサイズすなわち元の個数を算出し、行列生成部１１および標本抽出部２３に通知する。例えば、サイズ算出部２２は、ログカウンタを有し、記憶部２１に１件のアクセスの情報が格納されるたびに、ログカウンタをカウントアップすることにより、データ群のサイズを計数することにしてもよい。あるいは、サイズ算出部２２は、母集団のデータ群のサイズの情報が必要になったときに、記憶部２１におけるアクセスログの有効領域の大きさからアクセスログのサイズを算出することにしてもよい。 The size calculation unit 22 calculates the size of the population data group accumulated in the storage unit 21, that is, the number of its elements, and notifies the matrix generation unit 11 and the sample extraction unit 23. For example, the size calculation unit 22 may have a log counter and count the size of the data group by incrementing the counter each time the information for one access is stored in the storage unit 21. Alternatively, when the size of the population data group is needed, the size calculation unit 22 may calculate the size of the access log from the size of the valid area of the access log in the storage unit 21.

標本抽出部２３は、サイズ算出部２２から通知された母集団のサイズと、与えられた抽出率とに基づいて、母集団から抽出すべき標本の個数を算出し、その個数の標本を記憶部２１の母集団のデータ群から抽出する。例えば、母集団のサイズに抽出率を乗算すれば、母集団から抽出すべき標本の個数が得られる。抽出された標本は、異なり数算出部１２に通知される。
Based on the population size notified by the size calculation unit 22 and the given extraction rate, the sample extraction unit 23 calculates the number of samples to be extracted from the population and extracts that number of samples from the population data group in the storage unit 21. For example, multiplying the population size by the extraction rate gives the number of samples to extract. The extracted samples are passed to the different number calculation unit 12.

行列生成部１１は、与えられた抽出率と、サイズ算出部２２から通知されたサイズとに基づいて、対数圧縮二項確率行列ＬＰの逆行列ＬＰ^−１を生成し、異なり数算出部１２に通知する。具体的には、通知されたサイズによって二項確率行列Ｐの行数および列数が定まる。例えば、通知されたサイズ以上であり、対数圧縮における圧縮の単位が完成する最小限の行数および列数を用いることにしてもよい。式（３）によって、抽出率から二項確率行列Ｐの各要素が定まる。更に、式（６）によって、二項確率行列Ｐの各要素から対数圧縮二項確率行列ＬＰが定まる。 Based on the given extraction rate and the size notified by the size calculation unit 22, the matrix generation unit 11 generates the inverse matrix LP⁻¹ of the logarithmically compressed binomial probability matrix LP and passes it to the different number calculation unit 12. Specifically, the notified size determines the number of rows and columns of the binomial probability matrix P. For example, the smallest number of rows and columns that is not less than the notified size and that completes a unit of the logarithmic compression may be used. Expression (3) determines each element of the binomial probability matrix P from the extraction rate, and Expression (6) then determines the logarithmically compressed binomial probability matrix LP from the elements of P.

異なり数算出部１２は、行列生成部１１から通知された逆行列ＬＰ^−１と、標本抽出部２３で抽出された標本群とに基づいて、母集団の異なり数Ｕの推定値を算出する。標本群を分析することにより、その度数異なり数ベクトルＶを算出することができる。式（７）によって、その度数異なり数ベクトルＶを対数圧縮し、対数圧縮度数異なり数ベクトルＬＶを得ることができる。そして、式（８）によって、この対数圧縮度数異なり数ベクトルＬＶと、対数圧縮二項確率行列ＬＰの逆行列ＬＰ^−１とから、母集団の異なり数Ｕの推定値を算出することができる。算出された母集団の異なり数Ｕの推定値はデータ分析部２４に通知される。例えば、異なり数算出部１２は、算出した母集団の異なり数Ｕの推定値を記憶部２１に記録し、データ分析部２４は、記憶部２１からその異なり数Ｕの推定値を読み出す。 The different number calculation unit 12 calculates the estimated value of the different number U of the population based on the inverse matrix LP⁻¹ notified by the matrix generation unit 11 and the sample group extracted by the sample extraction unit 23. Analyzing the sample group yields its frequency difference number vector V. Expression (7) compresses V logarithmically to give the logarithmically compressed frequency difference number vector LV. Expression (8) then gives the estimated value of the different number U of the population from LV and the inverse matrix LP⁻¹ of the logarithmically compressed binomial probability matrix LP. The calculated estimate is passed to the data analysis unit 24. For example, the different number calculation unit 12 records the estimate in the storage unit 21, and the data analysis unit 24 reads it from the storage unit 21.

データ分析部２４は、異なり数算出部１２で算出された異なり数の推定値を利用して、記憶部２１に蓄積されてるデータ群の分析を行う。例えば、データ群のソーティングにおいて異なり数を利用することでソーティング処理を高速化することができる。例えば、データ分析部２４は、例えば、図４に示したアクセスログから、アクセス解析結果データを生成してもよい。図５は、アクセス解析結果データの一例を模式的に示した図である。アクセス解析結果データには、各ユーザについてのユーザＩＤ、アクセス回数、および平均利用時間等の項目のデータが含まれている。ユーザの異なり数がユーザ数を示し、そのユーザ数分の記憶領域を確保することが必要となる。データ分析部２４は、例えば、ユーザ数分の記憶領域を予め確保し、各ユーザについての各項目のデータを算出し、記録することにしてもよい。 The data analysis unit 24 analyzes the data group accumulated in the storage unit 21 using the estimated value of the different number calculated by the different number calculation unit 12. For example, the sorting process can be speeded up by using different numbers in sorting of data groups. For example, the data analysis unit 24 may generate access analysis result data from, for example, the access log illustrated in FIG. FIG. 5 is a diagram schematically illustrating an example of access analysis result data. The access analysis result data includes data of items such as the user ID, the number of accesses, and the average usage time for each user. The number of different users indicates the number of users, and it is necessary to secure storage areas for the number of users. For example, the data analysis unit 24 may secure storage areas for the number of users in advance and calculate and record data for each item for each user.

続いて、上述した実施形態における更に具体的な実施例について説明する。 Next, more specific examples in the above-described embodiment will be described.

本実施例の異なり数計数装置の基本的な構成は図１に示したものと同じであり、基本的な動作は図２に示したものと同じである。以下、本実施例における各詳細処理について説明する。 The basic configuration of the different number counting apparatus of this example is the same as that shown in FIG. 1, and its basic operation is the same as that shown in FIG. 2. The detailed processing in this example is described below.

図６は、母集団から標本を抽出する様子を示す図である。ここでは標本の抽出率を５０％としている。図中で大きな文字で書かれた要素が標本として抽出される要素である。 FIG. 6 is a diagram illustrating how a sample is extracted from a population. Here, the sample extraction rate is 50%. Elements written in large letters in the figure are extracted as samples.

度数が大きければ、母集団における異なり数と標本における異なり数はほとんど変わらない。図６（ａ）では、母集団の異なり数と標本の異なり数がともに４であり、変わっていない。一方、度数が小さいと、標本における異なり数は母集団における異なり数から減少する。図６（ｂ）では、母集団の異なり数が４であり、標本の異なり数が２である。
When the frequencies are large, the different number of the sample hardly differs from that of the population. In FIG. 6(a), the different number of the population and that of the sample are both 4, so nothing has changed. When the frequencies are small, on the other hand, the different number of the sample falls below that of the population. In FIG. 6(b), the different number of the population is 4 and that of the sample is 2.

度数ｎの要素からｋ個の要素が抽出される確率は、式（９）の二項確率で示すことができる。 The probability that k elements are extracted from the element of frequency n can be represented by the binomial probability of Expression (9).

ここで、要素の度数がｆである異なり数をｆ−度数異なり数Ｕ_ｆと呼ぶことにする。 Here, the number of distinct elements whose frequency is f is called the f-frequency different number U_f.

図７は、ｆ−度数異なり数Ｕ_ｆについて説明するための図である。図７を参照すると、Ａの度数だけが１０なので、１０−度数異なり数Ｕ_１０＝１である。また、ＢとＣの度数が３なので、３−度数異なり数Ｕ_３＝２である。ＤとＥとＦの度数が２なので、２−度数異なり数Ｕ_２＝３である。ＧとＨとＩとＪとＫの度数が１なので、１−度数異なり数Ｕ_１＝５である。また、全体の異なり数は全てのｆ−度数異なり数の合計なので１１である。 FIG. 7 is a diagram for explaining the f-frequency different number U_f. Referring to FIG. 7, only A has frequency 10, so the 10-frequency different number is U_10 = 1. B and C have frequency 3, so the 3-frequency different number is U_3 = 2. D, E, and F have frequency 2, so the 2-frequency different number is U_2 = 3. G, H, I, J, and K have frequency 1, so the 1-frequency different number is U_1 = 5. The overall different number is the sum of all f-frequency different numbers, which is 11.
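The counting described for FIG. 7 can be reproduced with a short Python sketch. The data list below is reconstructed from the frequencies stated in the text (the figure itself is not available here), and the function name is hypothetical.

```python
from collections import Counter


def frequency_different_numbers(data) -> Counter:
    """Map each frequency f to U_f, the number of distinct elements occurring f times."""
    element_freq = Counter(data)           # element -> frequency
    return Counter(element_freq.values())  # frequency f -> U_f


# A occurs 10 times, B and C 3 times, D, E, F twice, G..K once.
data = list("A" * 10 + "BBB" + "CCC" + "DDEEFF" + "GHIJK")
u = frequency_different_numbers(data)
print(dict(u))          # U_10 = 1, U_3 = 2, U_2 = 3, U_1 = 5
print(sum(u.values()))  # overall different number: 11
```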

式（１）や式（２）などで上述した度数異なり数ベクトルは、ｆ−度数異なり数を度数毎の要素とするベクトルである。 The frequency difference number vector described above in Expression (1), Expression (2), and the like is a vector having f-frequency difference numbers as elements for each frequency.

標本のｋ−度数異なり数の期待値Ｖ_ｋ＾は、母集団のｎ−度数異なり数Ｕ_ｎと、式（９）の二項確率とを用いて、式（１０）のように示すことができる。 The expected value V _k ^ of the k-frequency difference number of the sample can be expressed as in Expression (10) using the n-frequency difference number U _n of the population and the binomial probability of Expression (9). it can.
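Expressions (9) and (10) are not reproduced in this text. Assuming that each occurrence of an element is extracted independently with probability equal to the extraction rate r, they would take the standard binomial form below; this is a reconstruction from the surrounding description, not a copy of the original formulas.

```latex
\Pr(k \mid n) = \binom{n}{k}\, r^{k} (1-r)^{n-k},
\qquad
\hat{V}_k = \sum_{n=k}^{N} \binom{n}{k}\, r^{k} (1-r)^{n-k}\, U_n .
```

The sum starts at n = k because an element of frequency n cannot appear more than n times in the sample.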

サイズがＮの母集団の度数異なり数ベクトルＵ＝（Ｕ_１，・・・，Ｕ_Ｎ）と、標本の度数異なり数ベクトルＶ＝（Ｖ_１，・・・，Ｖ_Ｎ）は、上述のように、それぞれ式（１）と式（２）に示される。 The frequency difference number vector U = (U ₁ ,..., U _N ) of the population of size N and the sample frequency difference number vector V = (V ₁ ,..., V _N ) are as described above. Are shown in equations (1) and (2), respectively.

ここで、式（３）として上述した、二項確率を要素とするＮ次行列である二項確率行列Ｐを利用すると、式（４）が成り立つ。よって、母集団の度数異なり数ベクトルＵの推定値Ｕ＾は、式（５）によって得られる。 Here, using the binomial probability matrix P described above as Expression (3), which is an N-th order matrix whose elements are binomial probabilities, Expression (4) holds. The estimated value Û of the population's frequency difference number vector U is therefore obtained by Expression (5).

図８は、二項確率行列による近似の原理をトイモデルで示した図である。 FIG. 8 is a diagram showing the principle of approximation by a binomial probability matrix using a toy model.

上述のように、本発明の実施形態および実施例では、二項確率行列を用いて、標本の度数異なり数ベクトルから、母集団の度数異なり数ベクトルを近似的に算出するものである。この二項確率行列は圧縮を行っていないものである。
As described above, the embodiments and examples of the present invention use a binomial probability matrix to approximately calculate the frequency difference number vector of the population from the frequency difference number vector of the sample. The binomial probability matrix in this figure is uncompressed.

図８参照すると、サイズ２２２の母集団から抽出率５０％で標本を抽出している。 Referring to FIG. 8, a sample is extracted from a population of size 222 at an extraction rate of 50%.

母集団の度数異なり数ベクトルを見ると、１−度数異なり数＝１２１であり、２−度数異なり数＝２７であり、３−度数異なり数＝１０であり、４−度数異なり数＝３であり、５−度数異なり数＝１である。母集団全体の異なり数は１６２である。ただし、実際には母集団の各異なり数は未知である。 Looking at the frequency vector of the population, 1-frequency difference number = 121, 2-frequency difference number = 27, 3-frequency difference number = 10, 4-frequency difference number = 3. , 5-difference number = 1. The total number of differences in the entire population is 162. In practice, however, the number of different populations is unknown.

実際に抽出された標本の度数異なり数ベクトルを見ると、１−度数異なり数＝８７であり、２−度数異なり数＝１０であり、３−度数異なり数＝３であり、４−度数異なり数＝０であり、５−度数異なり数＝０である。標本全体の異なり数は１００である。この標本の各異なり数は既知である。
In the frequency difference number vector of the sample actually extracted, the 1-frequency different number is 87, the 2-frequency different number is 10, the 3-frequency different number is 3, and the 4-frequency and 5-frequency different numbers are both 0. The different number of the whole sample is 100. These per-frequency different numbers of the sample are known.

この標本の度数異なり数ベクトルと、二項確率行列（圧縮していないもの）の逆行列との積として算出される、母集団の度数異なり数ベクトルの推定値は、１−度数異なり数＝１５２であり、２−度数異なり数＝４であり、３−度数異なり数＝２４であり、４−度数異なり数＝０であり、５−度数異なり数＝０である。母集団全体の異なり数の推定値は１８０となり、ある程度の誤差内で異なり数の推定ができていることが分かる。 The estimate of the population's frequency difference number vector, calculated as the product of the sample's frequency difference number vector and the inverse of the (uncompressed) binomial probability matrix, is: 1-frequency different number = 152, 2-frequency different number = 4, 3-frequency different number = 24, 4-frequency different number = 0, and 5-frequency different number = 0. The estimated different number of the whole population is 180, against a true value of 162, which shows that the different number can be estimated within a certain margin of error.
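The toy-model figures can be checked with a small Python sketch. It assumes the binomial probability matrix has elements C(n, k) r^k (1-r)^(n-k) with rows indexed by the sample frequency k and columns by the population frequency n; this layout and the function names are assumptions for illustration. Because the matrix is upper triangular, the product with its inverse is computed by back substitution, with exact fractions to avoid rounding.

```python
from fractions import Fraction
from math import comb


def binomial_matrix(size: int, r: Fraction) -> list[list[Fraction]]:
    """P[k-1][n-1] = C(n, k) r^k (1-r)^(n-k) for k <= n, and 0 otherwise."""
    return [[comb(n, k) * r**k * (1 - r) ** (n - k) if k <= n else Fraction(0)
             for n in range(1, size + 1)]
            for k in range(1, size + 1)]


def solve_upper_triangular(p, v):
    """Solve P u = v by back substitution (equivalent to u = P^-1 v)."""
    size = len(v)
    u = [Fraction(0)] * size
    for k in range(size - 1, -1, -1):
        tail = sum(p[k][n] * u[n] for n in range(k + 1, size))
        u[k] = (v[k] - tail) / p[k][k]
    return u


r = Fraction(1, 2)        # extraction rate of 50%
v = [87, 10, 3, 0, 0]     # sample's frequency difference number vector
u_hat = solve_upper_triangular(binomial_matrix(len(v), r), v)
print([int(x) for x in u_hat], int(sum(u_hat)))  # [152, 4, 24, 0, 0] 180
```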

図９は、二項確率行列を圧縮する様子を示す図である。 FIG. 9 is a diagram illustrating how the binomial probability matrix is compressed.

図９の右側には、圧縮されていない二項確率行列の一部が示され、左側には、その二項確率行列を圧縮した対数圧縮二項確率行列が示されている。 The right side of FIG. 9 shows a part of the uncompressed binomial probability matrix, and the left side shows a logarithmically compressed binary probability matrix obtained by compressing the binomial probability matrix.

右側の二項確率行列は、母集団の度数異なり数ベクトルとの積が標本の度数異なり数ベクトルとなるように要素として二項確率を配置した行列である。この二項確率行列を行方向および列方向に圧縮したものが、左側の対数圧縮二項確率行列である。 The binomial probability matrix on the right is a matrix whose elements are binomial probabilities arranged so that its product with the population's frequency difference number vector gives the sample's frequency difference number vector. Compressing this matrix in the row and column directions gives the logarithmically compressed binomial probability matrix on the left.

二項確率行列の複数の要素が対数圧縮二項確率行列では１つの要素にまとめられている。二項確率行列の複数の要素が対数圧縮二項確率行列の１つの要素にまとめられる様子が図９では破線矢印で示されている。 Several elements of the binomial probability matrix are combined into a single element of the logarithmically compressed binomial probability matrix. The broken-line arrows in FIG. 9 show how elements of the binomial probability matrix are combined into one element of the compressed matrix.

対数圧縮二項確率行列の１つの要素にまとめる二項確率行列における行数は、行番号が大きくなると共に指数関数的に増加し、対数圧縮二項確率行列の１つの要素にまとめる二項確率行列における列数は、列番号が大きくなると共に指数関数的に増加するように圧縮を行っている。まとめる要素の個数が増えれば処理量削減の効果が増大する。 The compression is performed so that the number of rows of the binomial probability matrix combined into one element of the compressed matrix grows exponentially with the row number, and the number of columns combined into one element likewise grows exponentially with the column number. The more elements are combined, the greater the reduction in processing.
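To make the block structure concrete, the following Python sketch builds a compressed matrix by brute force. It rests on several assumptions that are not taken from the original expressions: uniform weights inside each population-side block (the description instead refers to weights from a truncated zeta distribution), an integer 1/r, rows indexed by sample blocks and columns by population blocks, and the hypothetical names `block`, `binom`, and `compressed_matrix`. The apparatus described here generates LP directly by integral approximation rather than by summing elements like this.

```python
from math import comb


def block(idx: int, unit: int = 1) -> tuple[int, int]:
    """Block idx covers floor(2**(idx-2)) * unit + 1 .. 2**(idx-1) * unit."""
    lower = (1 << (idx - 2)) if idx >= 2 else 0
    return lower * unit + 1, (1 << (idx - 1)) * unit


def binom(n: int, k: int, r: float) -> float:
    """Binomial probability C(n, k) r^k (1-r)^(n-k), zero when k > n."""
    return comb(n, k) * r**k * (1 - r) ** (n - k) if k <= n else 0.0


def compressed_matrix(dim: int, r_inv: int) -> list[list[float]]:
    """Sum P over each k block and average over each n block (uniform weights)."""
    r = 1.0 / r_inv
    lp = [[0.0] * dim for _ in range(dim)]
    for j in range(1, dim + 1):        # row: sample-side block of k values
        k1, k2 = block(j)
        for i in range(1, dim + 1):    # column: population-side block of n values
            n1, n2 = block(i, r_inv)
            total = sum(binom(n, k, r)
                        for n in range(n1, n2 + 1)
                        for k in range(k1, k2 + 1))
            lp[j - 1][i - 1] = total / (n2 - n1 + 1)
    return lp


lp = compressed_matrix(4, 2)  # 4 x 4 compressed matrix for r = 1/2
for row in lp:
    print([round(x, 4) for x in row])
```

With these assumptions a 4 x 4 compressed matrix stands in for a 16-column, 8-row slice of the uncompressed matrix, which illustrates why the saving grows with the block index.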

図１０は、二項確率行列における要素の分布の様子を示す図である。図１０において、二項確率で示される要素の値の変化が山型の曲線によって示されている。これを見ると、二項確率行列では、行番号や列番号が大きくなるにつれて対数的に変化が緩やかになっていることが分かる。このように行番号や列番号が大きくなるにつれて対数的に変化が緩やかになる二項確率行列を、本実施例では、行番号や列番号が大きくなるにつれて指数関数的に大きな単位で１つの要素にまとめるので、精度の劣化を抑えつつ処理量を大きく低減することが可能となっている。 FIG. 10 is a diagram illustrating the distribution of elements in the binomial probability matrix. In FIG. 10, the change in the value of the element indicated by the binomial probability is indicated by a mountain-shaped curve. From this, it can be seen that, in the binomial probability matrix, the logarithmic change gradually decreases as the row number or column number increases. As described above, a binomial probability matrix whose logarithmically changes gradually as the row number or the column number increases, in this embodiment, one element in an exponentially large unit as the row number or the column number increases. Therefore, it is possible to greatly reduce the processing amount while suppressing deterioration in accuracy.

図１１は、度数異なり数ベクトルを圧縮する様子を示す図である。 FIG. 11 is a diagram showing how a frequency difference number vector is compressed.

図１１を参照すると、母集団の度数異なり数ベクトルにおける度数３〜４の２つの要素が、母集団の対数圧縮度数異なり数ベクトルにおける１つの要素に圧縮されている。ここでの圧縮は、３−度数異なり数と４−度数異なり数を合計して１つの要素にする演算である。 Referring to FIG. 11, the two elements for frequencies 3 and 4 in the population's frequency difference number vector are compressed into one element of the population's logarithmically compressed frequency difference number vector. The compression here sums the 3-frequency different number and the 4-frequency different number into a single element.

同じように、標本の度数異なり数ベクトルにおける度数５〜８の４つの要素が、標本の対数圧縮度数異なり数ベクトルにおける１つの要素に圧縮されている。圧縮は、５−度数異なり数〜８−度数異なり数を合計して１つの要素にする演算である。
Similarly, the four elements for frequencies 5 to 8 in the sample's frequency difference number vector are compressed into one element of the sample's logarithmically compressed frequency difference number vector. The compression sums the 5-frequency through 8-frequency different numbers into a single element.

図１２は、対数圧縮二項確率行列による近似の原理を示した図である。 FIG. 12 is a diagram showing the principle of approximation using a logarithmic compression binomial probability matrix.

本実施例では、圧縮二項確率行列を用いて、標本の度数異なり数ベクトルから、母集団の度数異なり数ベクトルを近似的に算出するものである。この例では、抽出率が１／２５６である。また、図８とは異なり、この例では行列および度数異なり数ベクトルの圧縮を行ってる。 In this example, the compressed binomial probability matrix is used to approximately calculate the frequency difference number vector of the population from that of the sample. The extraction rate in this example is 1/256. Unlike FIG. 8, both the matrix and the frequency difference number vectors are compressed in this example.

図１２を参照すると、サイズ＝７５２４６８２の母集団から抽出率１／２５６で標本を抽出している。 Referring to FIG. 12, a sample is extracted from a population of size = 7524682 at an extraction rate of 1/256.

母集団の対数圧縮度数異なり数ベクトルを見ると、２５６−度数異なり数＝１７０８であり、５１２−度数異なり数＝４５４であり、１０２４−度数異なり数＝６４であり、２０４８−度数異なり数＝２１である。母集団全体の異なり数は２２４８である。ただし、実際には母集団の各異なり数は未知である。 In the logarithmically compressed frequency difference number vector of the population, the 256-frequency different number is 1708, the 512-frequency different number is 454, the 1024-frequency different number is 64, and the 2048-frequency different number is 21. The different number of the whole population is 2248. In practice, however, these different numbers of the population are unknown.

実際に抽出された標本の対数圧縮度数異なり数ベクトルを見ると、１−度数異なり数＝４８０であり、２−度数異なり数＝２１３であり、４−度数異なり数＝１３４であり、８−度数異なり数＝３４である。標本全体の異なり数は８６３である。この標本の各異なり数は既知である。
In the logarithmically compressed frequency difference number vector of the sample actually extracted, the 1-frequency different number is 480, the 2-frequency different number is 213, the 4-frequency different number is 134, and the 8-frequency different number is 34. The different number of the whole sample is 863. These different numbers of the sample are known.

この標本の度数異なり数ベクトルと、二項確率行列（圧縮を行ったもの）の逆行列との積として算出される、母集団の度数異なり数ベクトルの推定値は、１−度数異なり数＝１３４５であり、２−度数異なり数＝３０９であり、３−度数異なり数＝１３２であり、４−度数異なり数＝３ある。母集団全体の異なり数の推定値は１７９０となり、ある程度の誤差内で異なり数の推定ができていることが分かる。
The estimate of the population's frequency difference number vector, calculated as the product of the sample's frequency difference number vector and the inverse of the (compressed) binomial probability matrix, is: 1-frequency different number = 1345, 2-frequency different number = 309, 3-frequency different number = 132, and 4-frequency different number = 3. The estimated different number of the whole population is 1790, which shows that the different number can be estimated within a certain margin of error.

以下、データを用いて異なり数計数において異なり数の推定精度がどの程度かを実験結果によって示す。 The following experimental results, obtained using data, show how accurately the different number is estimated by the different number counting.
（推定実験１） (Estimation experiment 1)

商品購入の取引で記録される購買取引データにおける異なり数の推定精度の実験結果について説明する。 An experimental result of the estimation accuracy of the number of differences in the purchase transaction data recorded in the product purchase transaction will be described.

母集団として、１０種類のデータ項目をもつ１０００万件の購買履歴データを用いた。すなわち、この購買取引データには、取引日、取引店、取引額など様々な変数があるが、ここではその中から１０個の変数（データ項目）を選んだ。また、抽出率として、１／４、１／１６、１／６４、１／２５６、１／１０２４について計算を行った。 As a population, 10 million purchase history data having 10 kinds of data items were used. That is, the purchase transaction data includes various variables such as a transaction date, a transaction store, and a transaction amount. Here, ten variables (data items) are selected. Moreover, calculation was performed about 1/4, 1/16, 1/64, 1/256, and 1/1024 as extraction rates.

図１３は、推定実験１において抽出率を変えて推定精度を比較した結果を示す図である。変数は、ある一つの変数ｖ５を固定的に用いている。対真値は、真の値に対する推定値を示している。 FIG. 13 is a diagram showing the results of comparing the estimation accuracy at different extraction rates in estimation experiment 1. A single variable, v5, is used throughout. The "vs. true value" figure indicates the estimated value relative to the true value.

図１３を見て分かるように、抽出率が下がると推定精度の低下するが、１／６４程度の抽出率であれば推定精度６５％が得られている。 As can be seen from FIG. 13, when the extraction rate decreases, the estimation accuracy decreases. However, if the extraction rate is about 1/64, an estimation accuracy of 65% is obtained.

図１４は、推定実験１において変数と抽出率を変えて推定精度を比較した結果を示す図である。図１５は、図１４の結果における対真値をグラフ化した図である。図１５において、推定誤差が３５％に収まってるプロットが破線で囲まれている。これを見て分かるように、実験に用いた１０個の変数のうち８個の変数において、抽出率１／６４以上では、推定誤差が３５％以内に収まっていることが分かる。 FIG. 14 is a diagram showing the results of comparing the estimation accuracy for different variables and extraction rates in estimation experiment 1. FIG. 15 is a graph of the estimated values relative to the true values in the results of FIG. 14. In FIG. 15, the plots whose estimation error is within 35% are enclosed by a broken line. As the figure shows, for 8 of the 10 variables used in the experiment the estimation error stays within 35% at extraction rates of 1/64 or higher.
（推定実験２） (Estimation experiment 2)

全国の世帯数を母集団とし、抽出率１／１６の世帯の苗字データから、全国世帯の苗字数を推定した実験結果について説明する。 The experimental results of estimating the number of surnames of households nationwide from the surname data of households with a sampling rate of 1/16 will be described.

全国の世帯数は、国勢調査により、約５０００万世帯であることが分かっている。抽出率１／１６の標本として、３１２３９７９世帯の苗字データを用いた。ただし、無作為抽出は行っていない。 According to the national census, the number of households nationwide is about 50 million. Last name data of 3123979 households was used as a sample with an extraction rate of 1/16. However, random sampling is not performed.

図１６は、推定実験２における推定実験結果を示す図である。異なり数が７０５０６の標本から算出した母集団の異なり数は１３１２６５である。 FIG. 16 is a diagram showing the results of estimation experiment 2. The different number of the population calculated from a sample whose different number is 70506 is 131265.

苗字の個数には、十数万から３０万と諸説がある。この幅の広さには新字と旧字の扱いによる差異が大きく影響している。したがって、旧字を用いた苗字と新字を用いた苗字を合算すると、その幅の中の小さい方の値になることが考えられる。 Estimates of the number of surnames vary from a little over one hundred thousand to 300,000. This wide range is largely due to differences in how new and old character forms are treated. If surnames written with old character forms are merged with those written with new character forms, the count can therefore be expected to fall at the lower end of that range.

本推定実験では、旧字を用いた苗字を新字を用いた苗字に寄せるように合算している。したがって、１３１２６５という推定値は、十数万という苗字数とよく合っていると言える。 In this estimation experiment, surnames written with old character forms are merged into the corresponding surnames written with new character forms. The estimate of 131265 can therefore be said to agree well with a surname count of a little over one hundred thousand.

以上、上述した本発明の実施形態および実施例は、本発明の説明のための例示であり、本発明の範囲をそれらの実施形態あるいは実施例のみに限定する趣旨ではない。当業者は、本発明の要旨を逸脱することなしに、他の様々な態様で本発明を実施することができる。 The embodiments and examples of the present invention described above are examples for explaining the present invention, and are not intended to limit the scope of the present invention to only those embodiments or examples. Those skilled in the art can implement the present invention in various other modes without departing from the gist of the present invention.

また、上述した実施形態および実施例の装置は、装置を構成する各部の処理手順を規定したソフトウェアプログラムをコンピュータに実行させることにより実現することができるものである。その場合、各部の間におけるデータの通知はメモリやレジスタ等のデータ格納領域を介して行われることにしてもよい。送り側の処理において、通知しようとするデータをデータ格納領域に格納し、受け側の処理において、そのデータ格納領域からデータを読み出せばよい。 In addition, the devices of the above-described embodiments and examples can be realized by causing a computer to execute a software program that defines the processing procedure of each unit constituting the device. In that case, notification of data between the respective units may be performed via a data storage area such as a memory or a register. The data to be notified may be stored in the data storage area in the sending process, and the data may be read from the data storage area in the receiving process.

１０…異なり数計数装置、１１…行列生成部、１２…異なり数算出部、２０…データ分析装置、２１…記憶部、２２…サイズ算出部、２３…標本抽出部、２４…データ分析部
DESCRIPTION OF SYMBOLS 10 ... Different number counting apparatus, 11 ... Matrix generation unit, 12 ... Different number calculation unit, 20 ... Data analysis apparatus, 21 ... Storage unit, 22 ... Size calculation unit, 23 ... Sample extraction unit, 24 ... Data analysis unit

Claims

A different number counting method for counting the different number of a population consisting of a data group, the method comprising:
a step of obtaining, based on the size N of the population and the extraction rate r at which a sample is extracted from the population, a logarithmically compressed binomial probability matrix LP, and generating an inverse matrix LP⁻¹ of the logarithmically compressed binomial probability matrix LP, wherein the matrix LP is a compressed matrix obtained by compressing a binomial probability matrix P in the row direction and the column direction, the matrix LP is obtained from the elements of the binomial probability matrix P given by Expression (C-1) by an approximate expression in which the operation of summing discrete values in Expression (C-2) is approximated by an integral, and the binomial probability matrix P defines the relationship between a frequency difference number vector indicating the different number for each frequency of the population and a frequency difference number vector indicating the different number for each frequency of the sample extracted from the population at the extraction rate r; and
a step in which different number calculation means calculates an estimated logarithmically compressed frequency difference number vector of the population as the product of the inverse matrix LP⁻¹ and a logarithmically compressed frequency difference number vector obtained by compressing the frequency difference number vector of the sample extracted from the population, and calculates the sum of the elements of the estimated logarithmically compressed frequency difference number vector of the population as an estimated value of the different number of the population.

The different number counting method according to claim 1, wherein the approximate expression is an expression in which the step function in the n direction, included in Expression (C-3) representing the weighted sum of binomial probabilities included in Expression (C-2), is approximated by a continuous curve and converted into an integral.

The different number counting method according to claim 2, wherein the approximate expression is an expression in which the continuous curve approximating the step function is given in parametric form.

The method according to claim 3, wherein the approximate expression is an expression including a line integral of the continuous curve.

The different number counting method according to any one of claims 1 to 4, wherein the approximate expression is an expression in which the sum of discrete values in the k direction, included in Expression (C-3) representing the weighted sum of binomial probabilities included in Expression (C-2), is approximated by means of the central limit theorem and a correction and converted into an integral.

A different number counting device for counting the different number of a population consisting of a data group, comprising:
Based on the size N of the population and the extraction rate r for extracting a sample from the population, the frequency difference number vector indicating the number of differences for each frequency of the population, which is obtained by Expression (C-1), Using each element of the binomial probability matrix P that defines the relationship with the frequency difference number vector indicating the number of differences for each frequency of the samples extracted from the population at the extraction rate r, in Equation (C-2) A logarithm compression binomial probability matrix LP, which is a compression matrix obtained by compressing the binomial probability matrix P in the row direction and the column direction, is obtained by an approximate expression obtained by integral approximation of an operation for obtaining a sum of discrete values, and the logarithm compression binomial probability Matrix generating means for generating an inverse matrix LP ⁻¹ of the matrix LP;
A logarithm compression frequency difference number vector obtained by compressing the frequency difference number vector of the sample extracted from the population and a product of the inverse matrix LP ⁻¹ to calculate an estimated logarithmic compression frequency difference number vector of the population And a different number calculating means for calculating a sum of the elements of the estimated logarithmic compression frequency different number vector of the population as an estimated value of the different number of the population.

A different number counting program for counting the different number of a population consisting of a data group, the program causing a computer to function as:
Based on the size N of the population and the extraction rate r for extracting a sample from the population, the frequency difference number vector indicating the number of differences for each frequency of the population, which is obtained by Expression (C-1), Using each element of the binomial probability matrix P that defines the relationship with the frequency difference number vector indicating the number of differences for each frequency of the samples extracted from the population at the extraction rate r, in Equation (C-2) A logarithm compression binomial probability matrix LP, which is a compression matrix obtained by compressing the binomial probability matrix P in the row direction and the column direction, is obtained by an approximate expression obtained by integral approximation of an operation for obtaining a sum of discrete values, and the logarithm compression binomial probability Matrix generating means for generating an inverse matrix LP ⁻¹ of the matrix LP;
different number calculation means for calculating an estimated logarithmically compressed frequency difference number vector of the population as the product of the inverse matrix LP⁻¹ and a logarithmically compressed frequency difference number vector obtained by compressing the frequency difference number vector of the sample extracted from the population, and for calculating the sum of the elements of the estimated logarithmically compressed frequency difference number vector of the population as an estimated value of the different number of the population.