JP2014211761A

JP2014211761A - Analyzer, analysis method, and analysis program

Info

Publication number: JP2014211761A
Application number: JP2013087754A
Authority: JP
Inventors: Akihiro Yamanaka (山中 章裕); Ryo Kikuchi (菊池 亮); Dai Igarashi (五十嵐 大)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-04-18
Filing date: 2013-04-18
Publication date: 2014-11-13

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of statistical data analysis in which security is enhanced, and achieve improvement of the accuracy of hypothesis testing on concealed data. SOLUTION: An analyzer 10 extracts sample data from among data items to be analyzed. Then, the analyzer 10 asymptotically expands an inverse function of a cumulative distribution function by using the reliability of the standard normal distribution of the extracted sample data, and performs a one-sample average match test for testing whether or not the average of the sample data matches a predetermined value by using the expanded inverse function.

Description

The present invention relates to an analysis apparatus, an analysis method, and an analysis program.

In recent years, large volumes of data, often called big data, have attracted attention. Efforts are under way to apply statistical data analysis such as hypothesis testing to such data and to make use of the results. At the same time, stronger security, such as protection of personal information contained in the analysis target data, is required, so hypothesis testing is also needed on disturbance data, that is, analysis target data that has been concealed. In a hypothesis test, a statistic is computed from a sample drawn from a population, and a property of the sample (for example, whether its average is greater than 100) is evaluated statistically.

Among the many hypothesis testing methods, the one-sample average match test based on the central limit theorem is known as a basic and widely applicable one. In the one-sample average match test, a sample is drawn from a single population assumed to have a mean and a variance, and, with the variance assumed to be known, it is judged whether the mean matches a given value.

When the data to be analyzed contains personal information, it is known to apply an irreversible probabilistic operation to the data to prevent leakage of the personal information. For example, a privacy measure called "Pk-anonymity" has been proposed. It extends k-anonymity to a probabilistic measure under which no attacker can link secret data to disturbance data with a probability of 1/k or more.

As another technique for preventing leakage of personal information, it is known to add, to the numerical attributes of the data to be analyzed, a value (hereinafter called noise) of a random variable that follows a Laplace distribution with a specific parameter.

JP 2011-100116 A

P. Samarati and L. Sweeney. Generalizing Data to Provide Anonymity When Disclosing Information (Extended Abstract). Proc. of the 17th ACM-SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, p. 188, Seattle, WA, 1998.
Dai Igarashi, Koji Chida, Katsumi Takahashi. Extension of k-anonymity to a probabilistic measure and its application examples. CSS2009, 2009.
Dai Igarashi, Koji Chida, Katsumi Takahashi. A randomization method for satisfying k-anonymity in numerical attributes. CSS2011, 2011.
Ryo Kikuchi, Akihiro Yamanaka, Dai Igarashi. A t-test method for privacy-protected data. LOIS2012, 2012.
Kei Takeuchi et al. Dictionary of Statistics (統計学辞典). Toyo Keizai Shinposha, 1989.
E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Revue de l'Institut International de Statistique, 5: 307-320, 1937.

However, in the conventional one-sample average match test based on the central limit theorem, the number of samples analyzed can affect the value of the resulting risk factor, so the accuracy of the hypothesis test can be a problem. In addition, because no highly accurate one-sample average match test for disturbance data has been established, it has sometimes been impossible to strengthen security.

Accordingly, an object of the present invention is to provide a statistical data analysis technique that improves the accuracy of security-enhanced statistical data analysis and improves the accuracy of hypothesis testing on concealed data.

To solve the problems described above and achieve the object, an analysis apparatus includes: an extraction unit that extracts sample data from the data to be analyzed; and an analysis unit that asymptotically expands the inverse function of a cumulative distribution function by using the reliability of the standard normal distribution of the sample data extracted by the extraction unit, and that uses the expanded inverse function to perform a one-sample average match test, which tests whether the average of the sample data matches a predetermined value.

An analysis method includes: an extraction step of extracting sample data from the data to be analyzed; and an analysis step of asymptotically expanding the inverse function of a cumulative distribution function by using the reliability of the standard normal distribution of the sample data extracted in the extraction step, and of using the expanded inverse function to perform a one-sample average match test, which tests whether the average of the sample data matches a predetermined value.

An analysis program causes a computer to execute: an extraction step of extracting sample data from the data to be analyzed; and an analysis step of asymptotically expanding the inverse function of a cumulative distribution function by using the reliability of the standard normal distribution of the sample data extracted in the extraction step, and of using the expanded inverse function to perform a one-sample average match test, which tests whether the average of the sample data matches a predetermined value.

The analysis apparatus, analysis method, and analysis program disclosed in the present application can provide a statistical data analysis technique that improves the accuracy of security-enhanced statistical data analysis and improves the accuracy of hypothesis testing on concealed data.

FIG. 1 is a diagram for explaining the configuration of the analysis apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating an example of the analysis target data stored in the analysis target data storage unit.
FIG. 3 is a diagram for explaining a disturbance process that satisfies Pk-anonymity.
FIG. 4 is a diagram for explaining a disturbance process that adds a random variable to secret data.
FIG. 5 is a diagram for explaining a processing example of the one-sample average match test.
FIG. 6 is a diagram for explaining the two kinds of error in a hypothesis test.
FIG. 7 is a diagram illustrating an example of experimental results for a one-sided test.
FIG. 8 is a flowchart for explaining the flow of analysis processing in the analysis apparatus according to the first embodiment.
FIG. 9 is a diagram illustrating a computer that executes the analysis program.

Embodiments of the analysis apparatus, analysis method, and analysis program according to the present invention are described below in detail with reference to the accompanying drawings. The present invention is not limited by these embodiments.

[First embodiment]
In the following, the configuration of the analysis apparatus according to the first embodiment and the flow of processing by the analysis apparatus are described in order, followed by the effects of the first embodiment.

[Configuration of the analysis apparatus]
First, the configuration of the analysis apparatus 10 is described with reference to FIG. 1. FIG. 1 is a diagram for explaining the configuration of the analysis apparatus 10 according to the first embodiment. As illustrated in FIG. 1, the analysis apparatus 10 includes a communication processing unit 11, a control unit 12, and a storage unit 13.

The communication processing unit 11 controls communication of the various kinds of information exchanged with a connected terminal device 20. For example, the communication processing unit 11 receives from the terminal device 20 a request for analysis processing of the analysis target data, and transmits the result of the analysis processing to the terminal device 20.

As shown in FIG. 1, the storage unit 13 includes an analysis target data storage unit 13a and a disturbance data storage unit 13b. The storage unit 13 is, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk.

The analysis target data storage unit 13a stores the analysis target data to be analyzed (hereinafter referred to as secret data where appropriate). For example, as shown in FIG. 2, the analysis target data storage unit 13a stores, in association with one another, an "employee ID" that uniquely identifies an employee, an "address" indicating the employee's address, an "age" indicating the employee's age, and a "systolic blood pressure" indicating the employee's systolic blood pressure value.

The secret data stored in the analysis target data storage unit 13a is divided into, for example, three kinds: "identifiers", "quasi-identifiers", and "sensitive attributes". An identifier uniquely links one record to its owner; the "employee ID" in FIG. 2 is an example. Identifiers are usually deleted before the disturbance process.

Quasi-identifiers are attributes that can identify the owner of a record when several of them are combined, for example "address" and "age" in FIG. 2. Quasi-identifiers are subject to the disturbance process. Because information about quasi-identifiers is easy to obtain, it generally cannot be kept secret, and combining quasi-identifiers may identify the owner of a record even when the identifier has been deleted. The quasi-identifiers are therefore disturbed, in the manner illustrated in the figure, so that the owner of a record cannot be identified. Quasi-identifiers are considered indispensable for data analysis after the disturbance data is published, so deleting them entirely may defeat the purpose of publishing the data.

A sensitive attribute is data that should be kept secret, for example the blood pressure in FIG. 2. A sensitive attribute can also be a target of the disturbance operation, but it is sometimes left undisturbed. Whether an attribute is a quasi-identifier or a sensitive attribute is decided according to the problem at hand, and in some cases an attribute may be both. Because the disturbance operation basically reduces the usefulness of the data for analysis, an attribute that is unlikely to act as a quasi-identifier is better left undisturbed. In this embodiment, both quasi-identifiers and sensitive attributes are targets of the disturbance operation. However, only sensitive attributes are used in the hypothesis test, and the sensitive attributes are assumed to be numerical attributes. If a numerical quasi-identifier (such as age) exists, a hypothesis test can also be performed on it, but this is not a common goal of data analysis.

The disturbance data storage unit 13b stores disturbance data, that is, data to which noise has been added through the disturbance process by the disturbance unit 12a described later. The disturbance process is described in detail in the explanation of the disturbance unit 12a below.

Returning to FIG. 1, the control unit 12 includes a disturbance unit 12a, an extraction unit 12b, and an analysis unit 12c. The control unit 12 is an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The disturbance unit 12a adds, to the numerical data among the secret data, a random variable that follows a distribution with a specific parameter. For example, as shown in FIG. 3, the disturbance unit 12a performs a disturbance process that adds to the numerical-attribute data of the secret data a value (noise) of a random variable following a Laplace distribution with a specific parameter, thereby generating disturbance data that satisfies Pk-anonymity. The disturbance unit 12a then stores the generated disturbance data in the disturbance data storage unit 13b.

An example of the disturbance process that adds a random variable to secret data is described with reference to FIG. 4. As shown in FIG. 4, suppose the secret data has the values "10", "15", "13", "9", and "21". With X denoting the random variable, X is added to each secret data value. As a result, the values "8", "17", "13", "9", and "18" are generated as disturbance data. If the probability distribution that X follows is denoted μ, the probability that the secret data value "10" changes to the disturbance data value "8" is P(10 + X = 8) = P(X = −2) = μ(−2).
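The disturbance step of FIG. 4 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names `laplace_noise` and `disturb`, the scale `b = 2.0`, and the seed are all assumptions made for the example, and the noise here is continuous, whereas FIG. 4 shows integer values.

```python
import random

def laplace_noise(b: float, rng: random.Random) -> float:
    """Draw one zero-centered Laplace variate with scale b (hypothetical helper)."""
    # If E1 and E2 are independent exponentials with mean b, E1 - E2 is Laplace(0, b).
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def disturb(secret, b, seed=0):
    """Return a disturbed copy of `secret` with independent Laplace noise added."""
    rng = random.Random(seed)
    return [x + laplace_noise(b, rng) for x in secret]

secret = [10, 15, 13, 9, 21]         # secret values from the FIG. 4 example
disturbed = disturb(secret, b=2.0)   # b = 2.0 is an arbitrary illustrative scale
print(disturbed)
```

In the embodiment the scale would be chosen so that Pk-anonymity holds; the sketch leaves that choice to the caller.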

The extraction unit 12b extracts sample data from the data to be analyzed. Specifically, the extraction unit 12b extracts n sample data from the disturbance data stored in the disturbance data storage unit 13b and passes the sample data to the analysis unit 12c.

The analysis unit 12c asymptotically expands the inverse function of the cumulative distribution function (for example, F_n^{-1}(δ), described later) by using the reliability of the standard normal distribution (for example, z(δ), described later) of the sample data extracted by the extraction unit 12b, and uses the expanded inverse function to perform a one-sample average match test, which tests whether the average of the sample data matches a predetermined value.

For example, the analysis unit 12c expands the inverse function F_n^{-1}(δ) of the cumulative distribution function by the Cornish-Fisher expansion, in a form that uses the reliability z(δ) of the standard normal distribution of the sample data, and performs the one-sample average match test using F_n^{-1}(δ). As the reliability, the analysis unit 12c uses, for example, the 100 × (1 − α)% point, where 1 − α is 1 minus the significance level α.

Before the processing executed by the analysis unit 12c is described, the one-sample average match test is explained. In the one-sample average match test, a sample is drawn from a single population assumed to have a mean and a variance, and, with the variance assumed to be known, it is judged whether the mean matches a given value.

Conventionally, the one-sample average match test has generally relied on the central limit theorem. The central limit theorem uses the fact that, as the number of samples n tends to infinity, the error between the sample mean and the true mean follows the standard normal distribution. It gives no guarantee about the rate of convergence, however, so when n is small it is unclear how far the sample mean deviates from the true mean.

In particular, when the disturbance process described above is performed, even if the secret data can be assumed to follow a normal distribution, the noise added by the disturbance operation generally moves the disturbance data away from a normal distribution. For this reason, using the central limit theorem in the one-sample average match test may not be appropriate when the number of sample data is small.

A processing example of a general one-sample average match test is described using the example of FIG. 5. In this example, the standard deviation σ of the distribution that the secret data follows is assumed to be known. First, n sample data Z_1, Z_2, ..., Z_n are extracted from the secret data. Then the following equations (1) and (2) are calculated. Here, μ_0 is the value assumed to be the mean of the secret data, and it is given according to the problem. Because μ_0 is often not actually available, some other value is substituted for it. For example, if the data to be analyzed concerns annual income, the average annual income for the whole of Japan may be used.
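Equations (1) and (2) do not survive in this text. A reconstruction consistent with the surrounding description (known σ, hypothesized mean μ_0, statistic s_n compared against a standard normal quantile) would be the usual z statistic; the symbol \bar{Z}_n for the sample mean is an assumption introduced here.

```latex
\bar{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i \tag{1}
\qquad
s_n = \frac{\sqrt{n}\,\bigl(\bar{Z}_n - \mu_0\bigr)}{\sigma} \tag{2}
```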

Next, the value of the significance level α is chosen, and the sample statistic s_n is compared with z(1 − α). Here z(1 − α) is the 100 × (1 − α)% point of the standard normal distribution, referred to as its reliability; for example, when α = 0.05, z(0.95) ≈ 1.64.

From the result of the comparison, it is judged which of the null hypothesis μ = μ_0 and the alternative hypothesis μ > μ_0 is correct. As described above, μ_0 is the value assumed to be the mean of the secret data, and μ is the mean associated with the n extracted data. If the comparison gives s_n > z(1 − α), the reasoning is: "if the mean of the secret data were μ_0, the probability that s_n > z(1 − α) would be at most α; therefore the mean of the secret data is in fact not μ_0." That is, the null hypothesis is judged to be incorrect. Usually, the alternative hypothesis is the proposition one wishes to assert.
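The classical one-sided procedure just described can be written as a short sketch. The function name `one_sample_z_test` and the numbers in the usage lines are illustrative assumptions; `statistics.NormalDist().inv_cdf` supplies z(1 − α).

```python
import math
from statistics import NormalDist, fmean

def one_sample_z_test(samples, mu0, sigma, alpha=0.05):
    """One-sided test of H0: mu = mu0 against H1: mu > mu0, with sigma known.

    Returns (s_n, threshold, reject), where threshold is z(1 - alpha).
    """
    n = len(samples)
    s_n = math.sqrt(n) * (fmean(samples) - mu0) / sigma
    threshold = NormalDist().inv_cdf(1.0 - alpha)  # standard normal 100(1-alpha)% point
    return s_n, threshold, s_n > threshold

# Illustrative values only: five samples, hypothesized mean 10, known sigma 4.
s_n, threshold, reject = one_sample_z_test([10, 15, 13, 9, 21], mu0=10, sigma=4)
print(s_n, threshold, reject)
```

This is the central-limit-theorem baseline that the embodiment sets out to improve on, not the proposed method itself.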

The two kinds of error in a hypothesis test are described with reference to FIG. 6. As shown in FIG. 6, they are the type I error and the type II error. A type I error is the error of judgment made when the null hypothesis is in fact correct but the test concludes that the alternative hypothesis is correct. The probability that a type I error occurs is called the risk factor, and the upper bound on the risk factor is called the significance level.

A type II error is the error of judgment made when the alternative hypothesis is in fact correct but the test concludes that the null hypothesis is correct. The probability of a type I error and the probability of a type II error cannot both be lowered at the same time; they are in a so-called trade-off relationship. It is therefore usually desirable to fix the upper bound on the risk factor (the significance level) and then use a test that maximizes the power.

As described above, using the central limit theorem in the one-sample average match test may not be appropriate when the number of sample data is small. When the number of sample data is sufficiently large, s_n can be said to follow the standard normal distribution by the central limit theorem, so for a given significance level α it suffices to compare the 100 × (1 − α)% point of the standard normal distribution with s_n calculated from the sample data. When the number of sample data is small, however, s_n cannot be said to follow the standard normal distribution. In that case it is unclear whether a test based on the central limit theorem achieves the given significance level, and the reliability of the test is low. In other words, because no assumption can be made about the distribution of s_n, the 100 × (1 − α)% point of that distribution cannot be determined exactly. Here, the power is the probability that the test concludes the alternative hypothesis is correct when it is in fact correct (in the example of FIG. 6, the "○" cell at the lower right).

For example, with α = 0.05, if the 95% point of the standard normal distribution is used in accordance with the central limit theorem, that point may be the 90% point of s_n, or it may be the 99% point. The risk factor therefore cannot be determined, and the reliability of the test is impaired.
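A small exact calculation illustrates the point. As an assumption made purely for illustration (the source does not use this distribution), suppose the data are Exponential(1), which has mean 1 and standard deviation 1; the sum of n such values is Erlang distributed, so the true probability that s_n exceeds the normal 95% point has a closed form. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def erlang_survival(n: int, x: float) -> float:
    """P[G > x] for G ~ Gamma(shape=n, scale=1), n a positive integer."""
    return math.exp(-x) * sum(x**k / math.factorial(k) for k in range(n))

def actual_risk(n: int, alpha: float = 0.05) -> float:
    """Exact P[s_n > z(1 - alpha)] under H0 when the data are Exponential(1)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    # s_n = (sum - n) / sqrt(n) > z  is equivalent to  sum > n + z * sqrt(n)
    return erlang_survival(n, n + z * math.sqrt(n))

for n in (5, 20, 100):
    print(n, actual_risk(n))
```

For this skewed example the realized risk factor stays above the nominal 5% and approaches it only as n grows, which is the kind of gap the expansion below is meant to close. The closed form is practical only for moderate n, since the power terms overflow for very large n.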

The analysis unit 12c therefore obtains the 100 × (1 − α)% point for the hypothesis test by expanding the inverse function of the cumulative distribution function of the disturbance data with a technique called the Cornish-Fisher expansion.

First, the analysis unit 12c receives the n sample data for the hypothesis test extracted by the extraction unit 12b. Let X_i (i = 1, ..., n) be random variables representing the data (general data, whether secret data, disturbance data, or otherwise), and let μ and σ be their mean and standard deviation (both population values, that is, parameters of the distribution the population follows rather than quantities obtained from sampling). The analysis unit 12c then calculates the following equations (3) and (4).

The analysis unit 12c lets F_n denote the cumulative distribution function of the distribution that S_n follows. That is, with f_n denoting the probability density function of S_n, F_n is given by the following equation (5). The leftmost expression in equation (5), P[S_n ≤ τ], denotes the probability that S_n ≤ τ.
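Equations (3) to (5) are missing from this text. Equation (5) follows directly from the description; equations (3) and (4) are reconstructed here on the assumption that they define the sample mean and the standardized statistic S_n, with \bar{X}_n a symbol introduced for this sketch.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{3}
\qquad
S_n = \frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{\sigma} \tag{4}

F_n(\tau) = P\left[S_n \le \tau\right] = \int_{-\infty}^{\tau} f_n(x)\,dx \tag{5}
```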

Applying the Cornish-Fisher expansion to F_n, the analysis unit 12c can express the 100 × (1 − α)% point of S_n, with δ = 1 − α, by the following equation (6).
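Equation (6) itself is not reproduced in this text. The standard Cornish-Fisher expansion of the quantile of a standardized mean, taken to order 1/n so that it involves cumulants up to fourth order as the description states, has the following form; whether the patent's equation (6) is written in exactly this way is an assumption. Here z = z(δ), and κ_2, κ_3, κ_4 are the cumulants of a single observation.

```latex
F_n^{-1}(\delta) \approx z
  + \frac{\kappa_3}{6\,\kappa_2^{3/2}\sqrt{n}}\,\bigl(z^2 - 1\bigr)
  + \frac{1}{n}\left[
      \frac{\kappa_4}{24\,\kappa_2^{2}}\,\bigl(z^3 - 3z\bigr)
      - \frac{\kappa_3^{2}}{36\,\kappa_2^{3}}\,\bigl(2z^3 - 5z\bigr)
    \right] \tag{6}
```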

The cumulants in equation (6) are defined as follows. First, moments are defined. For a random variable X, the n-th moment is the expected value of X^n; with f denoting the probability density function of X, it is defined by the following equation (7). Below, it is written μ_n = E[X^n]. The cumulants κ_i up to fourth order (i = 1, 2, 3, 4) are then defined by the following equations (8) to (11).
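Equations (7) to (11) are not reproduced here, but the relation between raw moments μ_n = E[X^n] and the first four cumulants is standard and can be sketched directly. The function name is hypothetical; the formulas in the comments are the textbook moment-to-cumulant identities, presumed to correspond to equations (8) to (11).

```python
def cumulants_from_raw_moments(m1, m2, m3, m4):
    """Convert raw moments m_n = E[X^n] (n = 1..4) into cumulants k1..k4."""
    k1 = m1                                   # mean
    k2 = m2 - m1**2                           # variance
    k3 = m3 - 3*m2*m1 + 2*m1**3               # third central moment
    k4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4
    return k1, k2, k3, k4

# Standard normal: raw moments 0, 1, 0, 3.
print(cumulants_from_raw_moments(0, 1, 0, 3))
# Exponential with rate 1: raw moments 1, 2, 6, 24.
print(cumulants_from_raw_moments(1, 2, 6, 24))
```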

Equation (6) can be regarded as an extension of the central limit theorem approximation to higher-order terms. As n → ∞ it converges, as in the central limit theorem, to the 100 × δ% point z(δ) of the standard normal distribution. Using equation (6) as it is, however, does not necessarily lower the risk factor. To lower the risk factor relative to the central limit theorem, F_n^{-1}(δ) > z(δ) is needed, but the second and subsequent terms on the right-hand side of equation (6) can be positive or negative depending on the magnitude of z(δ).

For this reason, in a one-sided test problem, the analysis unit 12c uses the following equation (12) instead of equation (6). As n → ∞, equation (12) also converges to z(δ) at the same rate as equation (6). In a two-sided test problem, the analysis unit 12c uses the following equation (13) instead of equation (6).
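Equations (12) and (13) are not reproduced in this text. The description later characterizes them as equation (6) with the second and subsequent terms replaced by absolute values. One reading of the one-sided case, assuming the standard form of the expansion and assuming the absolute value is taken order by order, is sketched below; the exact grouping used in the patent, and the form of the two-sided equation (13), cannot be recovered from this text.

```latex
F_n^{-1}(\delta) \approx z
  + \left|\frac{\kappa_3}{6\,\kappa_2^{3/2}\sqrt{n}}\,\bigl(z^2 - 1\bigr)\right|
  + \left|\frac{1}{n}\left[
      \frac{\kappa_4}{24\,\kappa_2^{2}}\,\bigl(z^3 - 3z\bigr)
      - \frac{\kappa_3^{2}}{36\,\kappa_2^{3}}\,\bigl(2z^3 - 5z\bigr)
    \right]\right| \tag{12}
```

Under this reading the right-hand side is never smaller than z(δ), which is the property the text asks for, and the correction terms still vanish as n → ∞.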

Let X be the random variable representing the secret data and Y the random variable representing the noise added by the disturbance operation, and let ι_i and λ_i be their respective i-th cumulants. The i-th cumulant κ_i of the disturbance data Z = X + Y can then be expressed as κ_i = ι_i + λ_i (this additivity holds when X and Y are independent).
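As a worked instance of this additivity, the sketch below assumes, as the processing example later in the description does, normally distributed secret data and zero-centered Laplace noise. The function names and the numeric values are illustrative assumptions; the cumulants used are the well-known ones for these two distributions.

```python
def normal_cumulants(mu, sigma):
    """First four cumulants of N(mu, sigma^2)."""
    return (mu, sigma**2, 0.0, 0.0)

def laplace_cumulants(b):
    """First four cumulants of a zero-centered Laplace distribution with scale b."""
    return (0.0, 2.0 * b**2, 0.0, 12.0 * b**4)

def disturbed_cumulants(mu, sigma, b):
    """Cumulants of Z = X + Y for independent X ~ normal and Y ~ Laplace."""
    return tuple(i + l for i, l in zip(normal_cumulants(mu, sigma), laplace_cumulants(b)))

# Illustrative values only: mean 130, standard deviation 10, noise scale 5.
k1, k2, k3, k4 = disturbed_cumulants(mu=130.0, sigma=10.0, b=5.0)
skewness = k3 / k2**1.5          # zero, since both distributions are symmetric
excess_kurtosis = k4 / k2**2     # positive: Z has heavier tails than a normal
print(k1, k2, k3, k4, skewness, excess_kurtosis)
```

Because the third cumulant is zero in this setting, the 1/sqrt(n) term of the expansion drops out and the departure from normality enters through the fourth cumulant alone.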

Using this, the 100 × (1 − α)% point for S_n and the mean of the disturbance data Z, which are expressed by the following equations (14) and (15), is approximated by equation (12) or equation (13) above, which realizes a one-sample average match test on the disturbance data.
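Putting the pieces together, a one-sided test on disturbance data might look like the sketch below. Everything here is an assumption layered on the description: equations (12), (14), and (15) are not available in this text, so the threshold uses the standard Cornish-Fisher terms with absolute values taken order by order, the secret data are taken to be normal with known σ, and the noise is zero-centered Laplace with known scale b. The function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def corrected_quantile(delta, n, gamma1, gamma2):
    """Assumed one-sided threshold: z(delta) plus absolute correction terms."""
    z = NormalDist().inv_cdf(delta)
    t1 = gamma1 * (z**2 - 1) / (6 * math.sqrt(n))
    t2 = (gamma2 * (z**3 - 3*z) / 24 - gamma1**2 * (2*z**3 - 5*z) / 36) / n
    return z + abs(t1) + abs(t2)

def mean_test_on_disturbed(z_samples, mu0, sigma, b, alpha=0.05):
    """Test H0: mean = mu0 against H1: mean > mu0 using disturbed samples.

    The noise has mean zero, so the mean of Z equals the mean of the secret data.
    """
    n = len(z_samples)
    k2 = sigma**2 + 2 * b**2        # variance of Z = X + Y
    k4 = 12 * b**4                  # fourth cumulant of Z (the normal part adds 0)
    gamma1 = 0.0                    # both distributions are symmetric
    gamma2 = k4 / k2**2
    s_n = math.sqrt(n) * (fmean(z_samples) - mu0) / math.sqrt(k2)
    threshold = corrected_quantile(1.0 - alpha, n, gamma1, gamma2)
    return s_n, threshold, s_n > threshold

# Disturbed values from the FIG. 4 example; mu0, sigma and b are illustrative.
print(mean_test_on_disturbed([8, 17, 13, 9, 18], mu0=10, sigma=4, b=2))
```

The threshold is never below the plain normal quantile, so the test is at least as conservative as the central-limit-theorem version.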

In this way, the analysis apparatus 10 according to the first embodiment can improve the reliability of the test on the sample mean. The central limit theorem approximates using only the first term on the right-hand side of equation (6), whereas the analysis apparatus 10 uses equations (12) and (13), in which the second and subsequent terms are replaced by their absolute values. By making use of the second and subsequent terms, it lowers the risk factor and improves the reliability of the test on the sample mean.

In the analysis apparatus 10 according to the first embodiment, the cumulative distribution function is expanded by a technique called the Cornish-Fisher expansion. The Cornish-Fisher expansion is effective when, as here, it is difficult to obtain the cumulative distribution function explicitly. Used as is, however, it may fail to lower the risk rate in hypothesis testing, so the correction described above is applied so that the risk rate decreases.

The hypothesis testing process is described below, with separate processing examples for the one-sided test problem and the two-sided test problem. As a premise, the secret data is assumed to follow a normal distribution, and the noise added by the disturbance processing is assumed to be Laplace noise.

First, a processing example for the one-sided test problem is described. Let X be the random variable representing the secret data. X follows the normal distribution N(μ, σ^2), that is, it has the probability density function f_X of equation (16) below.

Likewise, let Y be the random variable representing the noise added by the disturbance operation. The Laplace noise that Y follows is represented by the probability density function f_Y of equation (17) below.
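Equations (16) and (17) are not reproduced in this text. For reference, the standard forms of the two densities are given below. The zero-mean form of the Laplace density is an assumption, chosen because the description later states that Z = X + Y has the same mean μ as X, and b is the scale parameter referred to in the description.

```latex
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
         \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
f_Y(y) = \frac{1}{2b}\exp\!\left(-\frac{|y|}{b}\right)
```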

If the parameter b in equation (17) above is determined by equation (18) below, Pk-anonymity is satisfied. That is, no attacker can link secret data to disturbance data with a probability of 1/k or more. In equation (18), |R| is the number of rows in the table and ν is the range of the (numerical) secret data. The sum Z = X + Y of the random variables X and Y is the disturbance data.

Suppose n samples are taken from the disturbance data. Here, s_n is defined by equations (19) and (20) below. In these equations μ′ = μ is the mean of Z, and σ′ is the standard deviation of Z, given by equation (21) below.

In this case, equation (12) above takes the form of equation (22) below. Here, γ_4 in equation (22) is obtained by equation (23) below.
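Equations (21) and (23) are not reproduced in this text. As a hedged worked example, if γ_4 denotes the excess kurtosis of Z (an assumption), then cumulant additivity for an independent normal X ~ N(μ, σ^2) and zero-mean Laplace Y with scale b gives the following values, using the standard cumulants of the two distributions:

```latex
\kappa_2 = \sigma^{2} + 2b^{2}, \qquad
\kappa_4 = 0 + 12b^{4},
\\[4pt]
\sigma' = \sqrt{\kappa_2} = \sqrt{\sigma^{2} + 2b^{2}}, \qquad
\gamma_4 = \frac{\kappa_4}{\kappa_2^{2}}
         = \frac{12\,b^{4}}{\left(\sigma^{2} + 2b^{2}\right)^{2}}
```

Under these assumptions γ_4 is positive for any b > 0 and tends to 0 as the noise scale b becomes small relative to σ.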

Using equation (22) above, a hypothesis test at significance level α can be constructed as a one-sided test problem consisting of two hypotheses: the null hypothesis H_0: μ′ = μ_0 and the alternative hypothesis H_1: μ′ > μ_0. Let the actually sampled data be z_i (i = 1, ..., n), and define s_n by equation (24) below. Then, if s_0 > F_n^{-1}(δ) (where δ = 1 − α), the null hypothesis is rejected and the alternative hypothesis is accepted. Otherwise, the null hypothesis is accepted.
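The decision rule above can be sketched as follows. Equation (22) is not reproduced in this text, so the threshold here is an assumed stand-in: the standard first-order Cornish-Fisher kurtosis term for a symmetric distribution, γ_4 (z^3 − 3z) / (24 n), taken in absolute value as the description indicates for the second and subsequent terms. The function names and the standardized form of the statistic are likewise assumptions for illustration.

```python
from statistics import NormalDist


def corrected_quantile(delta: float, gamma4: float, n: int) -> float:
    """Approximate F_n^{-1}(delta): normal quantile plus |kurtosis correction|.

    ASSUMPTION: stand-in for the patent's equation (22), not a copy of it.
    """
    z = NormalDist().inv_cdf(delta)
    correction = abs(gamma4 * (z ** 3 - 3 * z)) / (24 * n)
    return z + correction


def one_sided_test(samples, mu0, sigma_prime, gamma4, alpha=0.05) -> bool:
    """Return True when H_0: mu' = mu0 is rejected in favour of H_1: mu' > mu0."""
    n = len(samples)
    mean = sum(samples) / n
    # Standardized sample mean (assumed form of the test statistic).
    s0 = (mean - mu0) / (sigma_prime / n ** 0.5)
    return s0 > corrected_quantile(1 - alpha, gamma4, n)
```

At δ = 0.95 the uncorrected term z^3 − 3z is negative, so applying it with its sign would lower the threshold below the normal quantile; taking the absolute value raises it instead, which is in line with the description's remark that the expansion is corrected so that the risk rate decreases.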

For example, consider a test of whether systolic blood pressure rose significantly after a specific drug was administered. Let μ_0 be the mean systolic blood pressure (for example, the mean before administration or a general mean value) and let μ′ be the mean systolic blood pressure of the sample data. If s_0 > F_n^{-1}(δ), the null hypothesis is rejected; otherwise, the null hypothesis is accepted. Rejecting the null hypothesis here means that, if the means before and after administration were the same, an event has occurred that would occur only with a probability of 100 × α%, and it is therefore judged that systolic blood pressure after administration rose significantly compared with before administration. Accepting the null hypothesis means judging that systolic blood pressure after administration has not changed compared with before administration.

FIG. 7 shows an example of experimental results obtained when the above test processing is performed for the one-sided test. In FIG. 7, the 95% point of disturbance data generated from random numbers is plotted as diamonds, the 95% point given by the central limit theorem (the 95% point of the standard normal distribution) as squares, and the 95% point obtained by the method of the analysis apparatus 10 according to the present embodiment as triangles. The 95% point of the randomly generated disturbance data is obtained by computing S_n 10,000 times and taking the 9,500th value. The vertical axis of FIG. 7 shows the value of the 95% point, and the horizontal axis shows the number of samples.
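The empirical 95% point described for FIG. 7 can be approximated with a small simulation. This is a minimal sketch under stated assumptions: S_n is taken to be the standardized sample mean of Z = X + Y, the 9,500th value is read as the 9,500th smallest of the sorted results, and the function names and parameter values are hypothetical. Only the Python standard library is used.

```python
import random


def laplace_noise(rng: random.Random, b: float) -> float:
    """Zero-mean Laplace noise with scale b (difference of two exponentials)."""
    return rng.expovariate(1 / b) - rng.expovariate(1 / b)


def empirical_point(n, mu, sigma, b, trials=10000, level=0.95, seed=0) -> float:
    """Empirical quantile of the standardized mean of n disturbed samples."""
    rng = random.Random(seed)
    sigma_z = (sigma ** 2 + 2 * b ** 2) ** 0.5  # standard deviation of Z
    stats = []
    for _ in range(trials):
        total = sum(rng.gauss(mu, sigma) + laplace_noise(rng, b) for _ in range(n))
        stats.append((total / n - mu) / (sigma_z / n ** 0.5))
    stats.sort()
    # With trials=10000 and level=0.95 this picks the 9,500th smallest value.
    return stats[round(level * trials) - 1]


if __name__ == "__main__":
    print(empirical_point(n=20, mu=0.0, sigma=1.0, b=1.0))
```

The result can be compared against the normal 95% point (about 1.645) and against a corrected threshold for several values of n.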

As FIG. 7 shows, regardless of the number of samples, the 95% point obtained by the method of the analysis apparatus 10 according to the present embodiment is larger than the 95% point given by the central limit theorem. The larger this value, the lower the risk rate, so the analysis apparatus 10 according to the present embodiment can always keep the risk rate lower than when the central limit theorem is used.

Next, a processing example for the two-sided test problem is described. As in the one-sided case, the parameter b in equation (17) above is determined by equation (18) above, and S_n is defined by equations (19) and (20) above. In this case, equation (13) above takes the form of equation (25) below. Here, γ_4 in equation (22) above is obtained by equation (26) below.

Using equation (25) above, a hypothesis test at significance level α can be constructed as a two-sided test problem consisting of two hypotheses: the null hypothesis H_0: μ′ = μ_0 and the alternative hypothesis H_1: μ′ ≠ μ_0. S_0 is then calculated, and it is determined whether equation (27) or equation (28) below holds. In equations (27) and (28), δ = 1 − α/2. If equation (27) or equation (28) holds, the null hypothesis is rejected; if neither holds, the null hypothesis is accepted. This can be used, for example, for the test problem of whether blood pressure after administration of a specific drug differs significantly from blood pressure before administration.
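A two-sided decision of this kind can be sketched as below. Equations (25), (27) and (28) are not reproduced in this text, so the sketch assumes that they amount to comparing S_0 with an upper threshold and its negative, and it uses the standard first-order Cornish-Fisher kurtosis term in absolute value as an assumed threshold. The function name is hypothetical.

```python
from statistics import NormalDist


def two_sided_reject(s0: float, gamma4: float, n: int, alpha: float = 0.05) -> bool:
    """Return True when H_0: mu' = mu0 is rejected in favour of H_1: mu' != mu0."""
    delta = 1 - alpha / 2  # as stated for equations (27) and (28)
    z = NormalDist().inv_cdf(delta)
    # ASSUMPTION: stand-in threshold, not the patent's equation (25).
    q = z + abs(gamma4 * (z ** 3 - 3 * z)) / (24 * n)
    return s0 > q or s0 < -q
```

Using δ = 1 − α/2 splits the significance level evenly between the two tails, so the threshold is larger than in the one-sided test at the same α.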

As described above, the analysis apparatus 10 according to the first embodiment can perform statistical data analysis (hypothesis testing) on concealed (disturbed) data with high accuracy. For example, in the medical field, the plausibility of a hypothesis based on sample data can be tested accurately within the range in which Pk-anonymity is guaranteed.

[Processing by analyzer]
Next, the processing of the analysis apparatus 10 according to the first embodiment is described with reference to FIG. 8. FIG. 8 is a flowchart for explaining the flow of analysis processing in the analysis apparatus 10 according to the first embodiment.

As shown in FIG. 8, when the disturbance unit 12a of the analysis apparatus 10 receives an analysis request from the terminal device 20 (step S101), it reads the secret data to be analyzed from the analysis target data storage unit 13a (step S102).

The disturbance unit 12a then adds Laplace noise to the secret data it has read and generates disturbance data (step S103). The analysis unit 12c extracts n pieces of sample data from the disturbance data (step S104). The analysis unit 12c then calculates the test statistic s_n using equations (1) and (2) above (step S105).

After that, the analysis unit 12c compares s_n with z(1−α) (step S106). If the analysis unit 12c determines that s_n > z(1−α) (Yes in step S107), it determines that the alternative hypothesis is correct (step S108). If it determines that s_n > z(1−α) does not hold (No in step S107), it determines that the null hypothesis is correct (step S109). The analysis unit 12c then outputs the analysis result to the terminal device 20 (step S110).
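The flow of steps S102 to S110 can be sketched end to end as follows. This is an illustration under stated assumptions, not the apparatus itself: equations (1) and (2) are not reproduced here, so the statistic is assumed to be the standardized sample mean; the noise scale b is passed in rather than derived from equation (18); sampling is done without replacement; and the function name and return values are hypothetical.

```python
import random
from statistics import NormalDist


def analyze(secret_data, mu0, sigma_prime, b, n, alpha=0.05, seed=None) -> str:
    """Disturb the data, sample it, and run the one-sided comparison of FIG. 8."""
    rng = random.Random(seed)
    # S103: add zero-mean Laplace noise with scale b to every secret value.
    disturbed = [x + rng.expovariate(1 / b) - rng.expovariate(1 / b)
                 for x in secret_data]
    # S104: extract n pieces of sample data (without replacement, by assumption).
    sample = rng.sample(disturbed, n)
    # S105: test statistic s_n (assumed to be the standardized sample mean).
    s_n = (sum(sample) / n - mu0) / (sigma_prime / n ** 0.5)
    # S106 to S109: compare s_n with z(1 - alpha) and pick a hypothesis.
    z = NormalDist().inv_cdf(1 - alpha)
    return "alternative" if s_n > z else "null"


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(130.0, 10.0) for _ in range(500)]
    sigma_z = (10.0 ** 2 + 2 * 1.0 ** 2) ** 0.5
    # S110: output the result.
    print(analyze(data, mu0=100.0, sigma_prime=sigma_z, b=1.0, n=50, seed=1))
```

Replacing the plain normal quantile z with a corrected threshold would change only the comparison step.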

For example, as the analysis result, the analysis unit 12c outputs whether the blood pressure of people suffering from a specific disease differs significantly from the mean blood pressure, or whether the blood pressure of subjects after administration of a specific drug differs significantly from the mean blood pressure.

[Effect of the first embodiment]
As described above, the analysis apparatus 10 according to the first embodiment extracts sample data from the data to be analyzed, asymptotically expands the inverse function of the cumulative distribution function using the reliability of the standard normal distribution of the extracted sample data, and uses the expanded inverse function to perform a one-sample mean match test that tests whether the mean of the sample data matches a predetermined value. This improves the accuracy of statistical data analysis with enhanced security and makes hypothesis testing on concealed data more accurate.

In addition, the analysis apparatus 10 according to the first embodiment adds a random variable that follows a distribution with specific parameters to the numerical data among the data to be analyzed, and extracts sample data from the analysis target data to which the random variable has been added. This makes it possible to conceal the data and strengthen security.

In addition, the analysis apparatus 10 according to the first embodiment expands the inverse function of the cumulative distribution function by the Cornish-Fisher expansion and performs the one-sample mean match test. For this reason, the cumulative distribution function can be obtained explicitly.

In addition, the analysis apparatus 10 according to the first embodiment uses, as the reliability, a reliability obtained from an arbitrarily set significance level α to asymptotically expand the inverse function of the cumulative distribution function, and performs the one-sample mean match test. For example, the analysis apparatus 10 obtains the 100 × (1−α)% point of the standard normal distribution and uses it to carry out the one-sample mean match test. This improves the accuracy of statistical data analysis with enhanced security and makes hypothesis testing on concealed data more accurate.

[System configuration, etc.]
Each component of each illustrated apparatus is functionally conceptual and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each apparatus is not limited to the one shown in the figures, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. For example, the disturbance unit 12a and the analysis unit 12b may be integrated. Further, all or any part of each processing function performed in each apparatus may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.

Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings can be changed arbitrarily unless otherwise specified.

[Program]
A program in which the processing executed by the analysis apparatus 10 described in the above embodiment is written in a computer-executable language can also be created. For example, an analysis program in which the processing executed by the analysis apparatus 10 according to the first embodiment is written in a computer-executable language can be created. In this case, the same effects as in the above embodiment can be obtained by having a computer execute the analysis program. Furthermore, the analysis program may be recorded on a computer-readable recording medium, and processing similar to that of the first embodiment may be realized by having a computer read and execute the analysis program recorded on that medium. An example of a computer that executes an analysis program realizing the same functions as the analysis apparatus 10 illustrated in FIG. 2 is described below.

FIG. 9 is a diagram illustrating a computer 1000 that executes the analysis program. As illustrated in FIG. 9, the computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

As illustrated in FIG. 9, the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

As illustrated in FIG. 9, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the above analysis program is stored, for example, in the hard disk drive 1031 as a program module in which the instructions executed by the computer 1000 are written.

The various data described in the above embodiment are stored as program data in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes the various processing procedures.

The program module 1093 and the program data 1094 related to the analysis program are not limited to being stored in the hard disk drive 1031. They may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive or the like. Alternatively, they may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 via the network interface 1070.

DESCRIPTION OF SYMBOLS
10 Analysis apparatus
11 Communication processing unit
12 Control unit
12a Disturbance unit
12b Extraction unit
12c Analysis unit
13 Storage unit
13a Analysis target data storage unit
13b Disturbance data storage unit
20 Terminal device

Claims

An analysis apparatus comprising:
an extraction unit that extracts sample data from data to be analyzed; and
an analysis unit that asymptotically expands an inverse function of a cumulative distribution function using the reliability of the standard normal distribution of the sample data extracted by the extraction unit, and that uses the expanded inverse function to perform a one-sample mean match test that tests whether the mean of the sample data matches a predetermined value.

The analysis apparatus according to claim 1, further comprising a disturbance unit that adds a random variable following a distribution having a specific parameter to numerical data among the data to be analyzed,
wherein the extraction unit extracts the sample data from the data to which the random variable has been added by the disturbance unit.

The analysis apparatus according to claim 1, wherein the analysis unit expands the inverse function of the cumulative distribution function by Cornish-Fisher expansion as the asymptotic expansion, and performs the one-sample mean match test.

The analysis apparatus according to claim 1, wherein the analysis unit asymptotically expands the inverse function of the cumulative distribution function using, as the reliability, a reliability obtained from an arbitrarily set significance level, and performs the one-sample mean match test.

An analysis method executed by an analysis apparatus, the method comprising:
an extraction step of extracting sample data from data to be analyzed; and
an analysis step of asymptotically expanding an inverse function of a cumulative distribution function using the reliability of the standard normal distribution of the sample data extracted in the extraction step, and of using the expanded inverse function to perform a one-sample mean match test that tests whether the mean of the sample data matches a predetermined value.

An analysis program that causes a computer to execute:
an extraction step of extracting sample data from data to be analyzed; and
an analysis step of asymptotically expanding an inverse function of a cumulative distribution function using the reliability of the standard normal distribution of the sample data extracted in the extraction step, and of using the expanded inverse function to perform a one-sample mean match test that tests whether the mean of the sample data matches a predetermined value.