JP2017207878A

JP2017207878A - Missing data estimation method, missing data estimation device, and missing data estimation program

Info

Publication number: JP2017207878A
Application number: JP2016099183A
Authority: JP
Inventors: Shigeharu Suzuki (鈴木 重治)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-05-18
Filing date: 2016-05-18
Publication date: 2017-11-24

Abstract

PROBLEM TO BE SOLVED: To improve the reliability of an estimated value for missing data. SOLUTION: An arithmetic unit 12 detects that a first value 23, corresponding to a set of a first attribute value 21a and a second attribute value 22a, is missing from multidimensional data 20. The arithmetic unit 12 calculates a first reliability index 17 for a first estimation method 13, which calculates a first estimated value 15 using a second value 24 corresponding to a set of a first attribute value 21b and the second attribute value 22a. The arithmetic unit 12 calculates a second reliability index 18 for a second estimation method 14, which calculates a second estimated value 16 using a third value 25 corresponding to a set of the first attribute value 21a and a second attribute value 22b. The arithmetic unit 12 fills in an estimated value corresponding to the first value 23 using the estimation method selected on the basis of a comparison between the first reliability index 17 and the second reliability index 18. SELECTED DRAWING: Figure 1

Description

The present invention relates to a missing data estimation method, a missing data estimation device, and a missing data estimation program.

Information processing systems are sometimes used to collect and analyze large-scale data. Such data may be multidimensional data in which a set of values is classified along a plurality of classification axes. For example, statistical values for a plurality of regions (for example, local governments such as municipalities and prefectures) may be collected and analyzed. In that case, the collected statistical values are classified along axes such as year, region, and statistic type. A value is identified by selecting one attribute value from each classification axis. For example, the statistical value of 3.72 million people is identified by the attribute-value set year = 2015, region = Yokohama City, statistic type = population.
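
The way a value is identified by one attribute value per axis can be illustrated with a small sketch. The dictionary layout and the `lookup` helper below are assumptions made for illustration only; the description does not prescribe a storage format.

```python
# Hypothetical layout: one value per (year, region, statistic_type) tuple.
# Only the Yokohama City figure comes from the text above.
data = {
    (2015, "Yokohama City", "population"): 3_720_000,
}


def lookup(data, year, region, statistic_type):
    """Return the value identified by one attribute value per axis.

    Returns None when no value is stored for that combination.
    """
    return data.get((year, region, statistic_type))


value = lookup(data, 2015, "Yokohama City", "population")
absent = lookup(data, 2014, "Yokohama City", "population")
```

Keying on the full tuple makes every combination of attribute values addressable, including combinations for which no value was collected.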

The collected large-scale data does not necessarily contain a value for every combination of attribute values; some values may be missing. A value goes missing when it is difficult to obtain from the party that knows it, or when no one knows it because of, for example, a survey omission or a disaster. If the missing values hinder analysis, they may be estimated from other values.

For example, a missing value estimation method has been proposed that estimates a missing value in a relational database from other numerical values. The proposed method selects, from the columns of a relational table, another column whose numerical values are of the same kind as those of the column containing the missing value. Whether the kinds match is judged from the similarity of the trailing words of the column names. Next, regression analysis generates an estimation formula expressing the relationship between the column containing the missing value and the selected column, using the numerical values of rows (records) other than the row containing the missing value. The missing value is then estimated by applying the generated formula to the row containing it.

In addition, a missing data estimation method has been proposed that estimates the cost of a software development project from the costs of similar software development projects carried out in the past. A missing value prediction method has also been proposed that learns the relationship between rows and columns for each of a plurality of matrix data sets, generates a unified model containing parameters common to the matrix data sets and error terms that differ among them, and predicts missing values using the unified model.

JP-A-7-85082; JP 2005-141508 A; JP 2012-194741 A

When a missing value in multidimensional data is estimated, more than one estimation method may be available, and the question is which one to use. If the estimation method is fixed to a single one in advance, the reliability of the estimated value may drop sharply depending on the pattern of missing values in the multidimensional data. For example, the reliability of an estimate produced by a given method differs between the case where the population data of one region is missing over several years and the case where the population data of several regions is missing in one year.

In one aspect, an object of the present invention is to provide a missing data estimation method, a missing data estimation device, and a missing data estimation program that can improve the reliability of estimated values.

In one aspect, a missing data estimation method executed by a computer is provided. The method detects that a first value, corresponding to a set of one first attribute value and one second attribute value, is missing from multidimensional data in which a set of values is classified using a plurality of classification axes, including a first classification axis with a plurality of first attribute values and a second classification axis with a plurality of second attribute values. For a first estimation method, which calculates a first estimated value corresponding to the first value using a second value corresponding to a set of another first attribute value and the one second attribute value, a first reliability index indicating the reliability of the first estimation method is calculated from the multidimensional data. For a second estimation method, which calculates a second estimated value corresponding to the first value using a third value corresponding to a set of the one first attribute value and another second attribute value, a second reliability index indicating the reliability of the second estimation method is calculated from the multidimensional data. The missing first value is then filled in with an estimated value obtained by the estimation method selected on the basis of a comparison between the first reliability index and the second reliability index.

In one aspect, a missing data estimation device having a storage unit and an arithmetic unit is provided. In one aspect, a missing data estimation program is provided.

In one aspect, the reliability of estimated values for missing data is improved.

A diagram showing an example of the missing data estimation device of the first embodiment.
A diagram showing an example of the hardware of the information processing apparatus of the second embodiment.
A diagram showing an example of the real number data of the second embodiment.
A diagram showing an example of the functions of the information processing apparatus of the second embodiment.
A diagram showing an example of the extracted data of the second embodiment (time series analysis).
A diagram showing an example of the correlation formula of the second embodiment (time series analysis).
A diagram showing an example of the extracted data of the second embodiment (regional correlation analysis).
A diagram showing an example of the correlation formula of the second embodiment (regional correlation analysis).
A diagram showing an example of the extracted data of the second embodiment (item correlation analysis).
A diagram showing an example of the correlation formula of the second embodiment (item correlation analysis).
A diagram showing an example of the processed data of the second embodiment.
A flowchart showing an example of the processing procedure of the second embodiment.

Embodiments are described below with reference to the drawings.
[First Embodiment]
A first embodiment will be described.

FIG. 1 is a diagram illustrating an example of a missing data estimation device according to the first embodiment.
The missing data estimation device 10 includes a storage unit 11 and an arithmetic unit 12.
The storage unit 11 stores multidimensional data 20. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or a non-volatile storage device such as an HDD (Hard Disk Drive) or a flash memory.

The arithmetic unit 12 processes the multidimensional data 20 stored in the storage unit 11. The arithmetic unit 12 may be a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). The arithmetic unit 12 may also include special-purpose electronic circuits such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The processor executes programs stored in a memory such as a RAM (which may be the storage unit 11). The programs executed by the processor include a missing data estimation program describing the processing explained below. A set of multiple processors may be referred to as a "multiprocessor" or simply a "processor".

The multidimensional data 20 is data in which a set of values is classified using a plurality of classification axes, including a first classification axis 21 and a second classification axis 22. The first classification axis 21 uses a plurality of first attribute values, including first attribute values 21a and 21b. Each value in the set is associated with one of the first attribute values. The second classification axis 22 uses a plurality of second attribute values, including second attribute values 22a and 22b. Each value in the set is likewise associated with one of the second attribute values.

The multidimensional data 20 may be regional statistical data that collects statistical values for different points in time, different regions, and different statistic types. The first classification axis 21 may be any one of a time axis, a region axis, and a statistic type axis, in which case the first attribute values are the corresponding points in time, regions, or statistic types. The second classification axis 22 may be another one of the time axis, the region axis, and the statistic type axis, with the second attribute values defined in the same way.

The arithmetic unit 12 detects that a first value 23 is missing from the multidimensional data 20. The first value 23 is associated with the first attribute value 21a on the first classification axis 21 and with the second attribute value 22a on the second classification axis 22. The multidimensional data 20 may simply contain no value for the set of the first attribute value 21a and the second attribute value 22a. Alternatively, a NULL value indicating that no valid value exists may be associated with that set.
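
Both forms of absence described above, a combination with no entry at all and a combination holding a NULL value, can be detected with one scan over the attribute-value pairs. The following is a minimal sketch under assumed names (`find_missing`, a dictionary keyed by `(year, region)`, and `None` standing in for NULL); the sample figures are invented.

```python
from itertools import product


def find_missing(data, first_axis, second_axis):
    """List (first, second) attribute pairs that have no usable value.

    A pair counts as missing if it has no entry at all or if its entry
    is None (standing in for a NULL value).
    """
    return [
        (a1, a2)
        for a1, a2 in product(first_axis, second_axis)
        if data.get((a1, a2)) is None
    ]


years = [2009, 2010, 2011]
regions = ["M1", "M2"]
population = {
    (2009, "M1"): 5000, (2009, "M2"): 9800,
    (2010, "M1"): 5100,                       # (2010, "M2") has no entry
    (2011, "M1"): None, (2011, "M2"): 10100,  # explicit NULL
}
missing = find_missing(population, years, regions)
```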

The arithmetic unit 12 selects one of a plurality of estimation methods and uses it to fill in the multidimensional data 20 with an estimated value corresponding to the missing first value 23. To make the selection, the arithmetic unit 12 uses the multidimensional data 20 to calculate a reliability index for each estimation method and evaluates the methods. For example, the arithmetic unit 12 compares the reliability indices and selects the estimation method with the highest one. The estimation methods include a first estimation method 13 and a second estimation method 14.

The first estimation method 13 calculates a first estimated value 15 using a second value 24 included in the multidimensional data 20. The second value 24 is associated with a first attribute value 21b, which differs from that of the first value 23, on the first classification axis 21, and with the second attribute value 22a, the same as that of the first value 23, on the second classification axis 22. The arithmetic unit 12 uses the multidimensional data 20 to calculate a first reliability index 17 indicating the reliability of the first estimation method 13.

The first estimation method 13 may generate a first estimation formula by simple regression analysis using the first attribute value 21b and the second value 24, and then calculate the first estimated value 15 by applying the first attribute value 21a to that formula. The first reliability index 17 may be an index of the reliability of the regression analysis behind the first estimation formula, such as a correlation coefficient or a coefficient of determination.
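
As a rough illustration of this kind of simple regression, the sketch below fits a least-squares line to (first attribute value, second value) pairs, evaluates it at the attribute value of the missing cell, and reports the coefficient of determination as the reliability index. The function name, the choice of years as the first axis, and the sample figures are assumptions, not part of the description.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Needs at least two distinct x
    values; otherwise the division by sxx fails.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple regression, R^2 equals the squared correlation coefficient.
    # A flat series (syy == 0) is fitted exactly, so it is reported as 1.0.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical data: population of one region; the 2011 value is missing.
known_years = [2009, 2010, 2012]        # first attribute values "21b"
known_values = [5000, 5100, 5300]       # second values "24"

slope, intercept, reliability = fit_simple_regression(known_years, known_values)
first_estimate = slope * 2011 + intercept   # apply attribute value "21a"
```

With these invented figures the series is exactly linear, so the estimate is about 5200 and the index is about 1.0; real data would give a lower index.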

The second estimation method 14 calculates a second estimated value 16 using a third value 25 included in the multidimensional data 20. The third value 25 is associated with the first attribute value 21a, the same as that of the first value 23, on the first classification axis 21, and with a second attribute value 22b, which differs from that of the first value 23, on the second classification axis 22. The arithmetic unit 12 uses the multidimensional data 20 to calculate a second reliability index 18 indicating the reliability of the second estimation method 14.

The second estimation method 14 may generate a second estimation formula by multiple regression analysis using the second value 24 and a fourth value (not shown), and then calculate the second estimated value 16 by applying the third value 25 to that formula. The fourth value is associated with the first attribute value 21b, the same as that of the second value 24, on the first classification axis 21, and with the second attribute value 22b, which differs from that of the second value 24, on the second classification axis 22. The second reliability index 18 may be an index of the reliability of the regression analysis behind the second estimation formula, such as a correlation coefficient or a coefficient of determination.
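
The roles of the second, third, and fourth values can be made concrete with a deliberately simplified sketch. It uses a single explanatory attribute value (one other region), so the multiple regression named above reduces to a simple regression; that reduction, the helper names, and all figures are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit y = a * x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return a, b, r2


# Hypothetical population table keyed by (year, region).
# The cell (2011, "M1") is the missing first value.
table = {
    (2009, "M1"): 5000, (2009, "M2"): 10000,
    (2010, "M1"): 5100, (2010, "M2"): 10200,
    (2011, "M2"): 10400,
    (2012, "M1"): 5300, (2012, "M2"): 10600,
}


def estimate_from_other_region(table, year, target, other):
    """Estimate table[(year, target)] from the relationship to `other`."""
    # Years in which both regions have a value.
    years = sorted(y for (y, r) in table if r == target and (y, other) in table)
    xs = [table[(y, other)] for y in years]    # fourth values (21b, 22b)
    ys = [table[(y, target)] for y in years]   # second values (21b, 22a)
    a, b, r2 = fit_line(xs, ys)
    # Apply the third value (21a, 22b) to the fitted formula.
    return a * table[(year, other)] + b, r2


second_estimate, second_reliability = estimate_from_other_region(
    table, 2011, "M1", "M2"
)
```

Extending this to several explanatory regions would require a genuine multiple regression (for example, solving the normal equations), which is omitted here for brevity.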

When the multidimensional data 20 has three or more classification axes, the first value 23, the second value 24, and the third value 25 may be associated with the same third attribute value on a third classification axis. In that case, the arithmetic unit 12 may additionally calculate a third reliability index for a third estimation method and compare it with the first reliability index 17 and the second reliability index 18. The third estimation method may calculate a third estimated value using a fifth value (not shown) included in the multidimensional data 20. The fifth value is associated with the first attribute value 21a, the same as that of the first value 23, on the first classification axis 21, and with the second attribute value 22a, the same as that of the first value 23, on the second classification axis 22. On the third classification axis, however, the fifth value is associated with a third attribute value that differs from that of the first value 23.

According to the missing data estimation device 10 of the first embodiment, the absence of the first value 23 from the multidimensional data 20 is detected. The first reliability index 17 is then calculated by applying the multidimensional data 20 to the first estimation method 13, which calculates the first estimated value 15 corresponding to the missing first value 23. Likewise, the second reliability index 18 is calculated by applying the multidimensional data 20 to the second estimation method 14, which calculates the second estimated value 16 corresponding to the missing first value 23. Finally, the missing first value 23 is filled in with an estimated value obtained by the estimation method selected on the basis of a comparison between the first reliability index 17 and the second reliability index 18.

This improves the reliability of the estimated value that is filled in. If the estimation method were fixed to a single one in advance, the reliability of the calculated estimate could drop sharply depending on the pattern of missing values in the multidimensional data 20. For example, when the population of one region is missing over several years, an estimation formula describing the time-series change of that region's population may have low reliability. Similarly, when the populations of several regions are missing in one year, an estimation formula describing the correlation of population between regions may have low reliability. By contrast, evaluating the reliability of several estimation methods before choosing one makes it possible to select a method suited to, for example, the missing-value pattern of the multidimensional data 20.
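
The selection step itself is small once each method has produced an estimate and a reliability index. The sketch below assumes each candidate is a `(method_name, estimate, reliability_index)` tuple and that a higher index means a more reliable method; the names and numbers are hypothetical.

```python
def complement(data, key, candidates):
    """Fill data[key] with the estimate whose reliability index is highest.

    candidates: list of (method_name, estimate, reliability_index) tuples.
    Returns a new dictionary and the name of the selected method; the
    input dictionary is left unchanged.
    """
    name, estimate, _ = max(candidates, key=lambda c: c[2])
    filled = dict(data)
    filled[key] = estimate
    return filled, name


# Hypothetical inputs: the (2011, "M1") cell is missing.
population = {(2010, "M1"): 5100, (2011, "M1"): None}
candidates = [
    ("first_estimation_method", 5200.0, 0.98),
    ("second_estimation_method", 5150.0, 0.74),
]
filled, selected = complement(population, (2011, "M1"), candidates)
```

Returning a copy rather than mutating the input keeps the original data available, which is a design choice of this sketch rather than something the description requires.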

[Second Embodiment]
Next, a second embodiment will be described.
The information processing apparatus 100 of the second embodiment collects statistical data, such as population and number of births, from a plurality of regions, and analyzes and processes the data so that regions can be compared easily. The regions are, for example, basic local governments (the smallest unit of local government) such as municipalities. The collected statistical data is organized by attributes such as year, region, and item. However, the data is not always complete, and some of it may be missing. In that case, the information processing apparatus 100 estimates the missing data from the collected data. This estimation method can be realized by the hardware and functions of the information processing apparatus 100 described below.

[2-1. Hardware]
First, the hardware of the information processing apparatus 100 will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of the hardware of the information processing apparatus according to the second embodiment.

The information processing apparatus 100 includes a CPU 101, a RAM 102, an HDD 103, an image signal processing unit 104, an input signal processing unit 105, a medium reader 106, and a communication interface 107. The CPU 101 is an example of the arithmetic unit 12 of the first embodiment. The RAM 102 or the HDD 103 is an example of the storage unit 11 of the first embodiment.

The CPU 101 is a processor including an arithmetic circuit that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs.

The CPU 101 may include a plurality of processor cores, and the information processing apparatus 100 may include a plurality of processors; the processing described below may be executed in parallel using multiple processors or processor cores. A set of multiple processors (a multiprocessor) may also be called a "processor".

The RAM 102 is a volatile memory that temporarily stores programs executed by the CPU 101 and data the CPU 101 uses for computation. The information processing apparatus 100 may include memory of a type other than RAM, and may include a plurality of memories.

The HDD 103 is a non-volatile storage device that stores data and software programs such as an OS (Operating System) and application software. The information processing apparatus 100 may include other types of storage devices, such as a flash memory or an SSD (Solid State Drive), and may include a plurality of non-volatile storage devices.

The image signal processing unit 104 outputs images to a display 71 connected to the information processing apparatus 100 in accordance with instructions from the CPU 101. The display 71 may be, for example, a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD), a plasma display panel (PDP), or an organic electro-luminescence (OEL) display.

The input signal processing unit 105 acquires input signals from an input device 72 connected to the information processing apparatus 100 and outputs them to the CPU 101.
The input device 72 may be, for example, a pointing device such as a mouse, touch panel, touch pad, or trackball, or a keyboard, remote controller, or button switch. A plurality of types of input devices may be connected to the information processing apparatus 100. At least one of the display 71 and the input device 72 may be formed integrally with the housing of the information processing apparatus 100.

The medium reader 106 is a reading device that reads programs and data recorded on a recording medium 73. The recording medium 73 may be, for example, a magnetic disk such as a flexible disk (FD) or an HDD, an optical disc such as a CD (Compact Disc) or DVD (Digital Versatile Disc), a magneto-optical disk (MO), or a semiconductor memory. The medium reader 106 stores, for example, the programs and data read from the recording medium 73 in the RAM 102 or the HDD 103.

The communication interface 107 is connected to a network 74 and communicates with other information processing apparatuses via the network 74. The communication interface 107 may be a wired communication interface connected by cable to a communication device such as a switch, or a wireless communication interface connected to a base station by a wireless link.

The hardware of the information processing apparatus 100 has been described above. Although a single piece of hardware is used in this example for convenience of explanation, a plurality of pieces of hardware connected via communication cables or a network may also be used.

[2-2. Functions]
Next, the functions of the information processing apparatus 100 will be described. FIG. 3 is a diagram illustrating an example of real number data according to the second embodiment. The information processing apparatus 100 stores real number data 111a. The real number data 111a is a set of statistical data collected for a plurality of regions. The real number data 111a includes, for example, publicly available open data such as LOD (Linked Open Data). However, the real number data 111a may also include unpublished data, such as the results of a questionnaire conducted independently in a particular region. Although the second embodiment uses statistical data such as population and birth rate as the real number data 111a, the missing data estimation method described below can also be applied to other kinds of data.

The real number data 111a is three-dimensional data in which real values are organized along three classification axes: year, region, and item. The attribute values of the year dimension are years such as 2009, 2010, 2011, and 2012. The attribute values of the region dimension are local government names such as municipality M1 and municipality M2. The attribute values of the item dimension are statistic type names such as population, number of births, and number of deaths. One real value can exist for each combination of a year, a region, and an item. For example, the real number data 111a includes the information that the population of municipality M1 in 2009 was 5000, the number of births in municipality M2 in 2010 was 2100, and the number of deaths in municipality M1 in 2011 was 80.

However, some real values may be missing from the real number data 111a because, for example, they were not surveyed or could not be surveyed. For example, the population of municipality M1 in 2011 is missing from the real number data 111a. A missing real value (missing value) is expressed as, for example, a NULL value. Alternatively, a predetermined symbol or a predetermined numerical value indicating a missing value may be used. In addition, when the real number data 111a contains an abnormal value that is clearly erroneous, the information processing apparatus 100 may treat that abnormal value as a missing value.
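As a minimal sketch of the layout described above, the three-dimensional data could be held in a dictionary keyed by (year, region, item), with `None` standing in for the NULL value. The names `real_data` and `find_missing`, the sentinel value -1, and the 2012 entry are illustrative assumptions and do not appear in the embodiment.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str, str]  # (year, region, item)

# Hypothetical sample of real number data; None plays the role of NULL.
real_data: Dict[Key, Optional[float]] = {
    (2009, "M1", "population"): 5000.0,
    (2010, "M2", "births"): 2100.0,
    (2011, "M1", "deaths"): 80.0,
    (2011, "M1", "population"): None,  # missing value
    (2012, "M1", "population"): -1.0,  # assumed sentinel marking a missing value
}


def find_missing(data: Dict[Key, Optional[float]],
                 sentinels: Tuple[float, ...] = (-1.0,)) -> List[Key]:
    """Return the keys whose value is NULL or equals a predetermined sentinel."""
    return sorted(k for k, v in data.items() if v is None or v in sentinels)


missing_keys = find_missing(real_data)
print(missing_keys)
```

Treating a sentinel the same way as `None` mirrors the statement that a predetermined numerical value may mark a missing entry; a check for clearly abnormal values could be added to the same predicate.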

When the real number data 111a contains a missing value, the information processing apparatus 100 estimates the real value corresponding to the missing value from the surrounding real values. Three estimation methods are available. The first estimation method analyzes, as a time series, the data of the same region and the same item as the missing value (time series analysis). The second estimation method compares data of the same item as the missing value across different regions (regional correlation analysis). The third estimation method compares data of the same region as the missing value across different items (item correlation analysis). For each missing value, the information processing apparatus 100 adopts whichever of the three estimation methods has the highest estimation accuracy.

FIG. 4 is a diagram illustrating an example of functions of the information processing apparatus according to the second embodiment. As illustrated in FIG. 4, the information processing apparatus 100 includes a storage unit 111, a data acquisition unit 112, a time series analysis unit 113, a regional correlation analysis unit 114, an item correlation analysis unit 115, a missing value complementing unit 116, and a data processing unit 117.

The functions of the storage unit 111 can be realized using the RAM 102, the HDD 103, the medium reader 106, the recording medium 73, and the like described above. The functions of the data acquisition unit 112 can be realized using the CPU 101, the communication interface 107, and the like described above. The functions of the time series analysis unit 113, the regional correlation analysis unit 114, the item correlation analysis unit 115, the missing value complementing unit 116, and the data processing unit 117 can be realized using the CPU 101 and the like described above.

The storage unit 111 stores the real number data 111a, correlation formula data 111b, and processed data 111c.
The correlation formula data 111b includes the correlation formulas generated by the time series analysis, the regional correlation analysis, and the item correlation analysis. The correlation formula generated by the time series analysis is a regression formula obtained by simple regression analysis, and calculates an estimate of the missing value from the year to which the missing value belongs. The correlation formula generated by the regional correlation analysis is a regression formula obtained by multiple regression analysis, and calculates an estimate of the missing value from the data of other regions in the year to which the missing value belongs. The correlation formula generated by the item correlation analysis is a regression formula obtained by multiple regression analysis, and calculates an estimate of the missing value from the data of other items in the year to which the missing value belongs. The correlation formula data 111b also includes the coefficient of determination of each generated correlation formula.

The processed data 111c is secondary data generated based on the real number data 111a and the correlation formula data 111b. For example, a birth ratio is calculated from the population and the number of births of a certain region in a certain year as number of births / population = birth ratio. In generating the processed data 111c, a real value is used when it exists in the real number data 111a, whereas an estimated value is used when the required real value is missing.
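The fallback rule described above can be sketched as follows. The function name `birth_ratio`, its parameters, and the numbers in the usage line are hypothetical; only the formula number of births / population comes from the description.

```python
from typing import Optional


def birth_ratio(population: Optional[float], births: Optional[float],
                est_population: Optional[float] = None,
                est_births: Optional[float] = None) -> float:
    """Return births / population, using an estimate wherever the real value is missing."""
    p = population if population is not None else est_population
    b = births if births is not None else est_births
    if p is None or b is None:
        raise ValueError("neither a real value nor an estimate is available")
    return b / p


# Population is missing (None), so the estimated population is used instead.
ratio = birth_ratio(population=None, births=40.0, est_population=5000.0)
print(ratio)
```

A real value always takes precedence over an estimate here, which matches the order of preference stated in the description.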

The data acquisition unit 112 acquires the real number data 111a and stores the acquired real number data 111a in the storage unit 111. For example, the data acquisition unit 112 acquires open data published on a network as the real number data 111a. As another example, the data acquisition unit 112 receives proprietary data other than open data from a computer used by a local government. If the acquired real number data 111a is not organized by year, region, and item, the data acquisition unit 112 may analyze the structure of the acquired real number data 111a, organize it by year, region, and item, and then store it in the storage unit 111.

When data is missing from the real number data 111a, the time series analysis unit 113 generates a correlation formula by time series analysis. For example, the time series analysis unit 113 performs simple regression analysis using the data that exists, and calculates the coefficients that define the regression line and the coefficient of determination. The time series analysis unit 113 then stores the coefficients that define the regression line and the coefficient of determination in the storage unit 111 as the correlation formula data 111b. An example of time series analysis will be described later.

When data is missing from the real number data 111a, the regional correlation analysis unit 114 generates a correlation formula by regional correlation analysis. For example, the regional correlation analysis unit 114 performs multiple regression analysis using the data that exists, and calculates the coefficients that define the regression line and the coefficient of determination. The regional correlation analysis unit 114 then stores the coefficients that define the regression line and the coefficient of determination in the storage unit 111 as the correlation formula data 111b. An example of regional correlation analysis will be described later.

When data is missing from the real number data 111a, the item correlation analysis unit 115 generates a correlation formula by item correlation analysis. For example, the item correlation analysis unit 115 performs multiple regression analysis using the data that exists, and calculates the coefficients that define the regression line and the coefficient of determination. The item correlation analysis unit 115 then stores the coefficients that define the regression line and the coefficient of determination in the storage unit 111 as the correlation formula data 111b. An example of item correlation analysis will be described later.

When executing a process that uses the real number data 111a, the missing value complementing unit 116 detects missing data in the real number data 111a. When data is missing, the missing value complementing unit 116 selects, from among the three correlation formulas, the correlation formula to be used for estimating the missing value, based on the coefficients of determination included in the correlation formula data 111b.

At this time, the missing value complementing unit 116 notifies the time series analysis unit 113, the regional correlation analysis unit 114, and the item correlation analysis unit 115 of information on the region and the item to which the missing value belongs. In response to this notification, the time series analysis unit 113, the regional correlation analysis unit 114, and the item correlation analysis unit 115 each calculate a correlation formula that can be used to estimate the missing value, together with its coefficient of determination.

For example, when the population of municipality M1 in 2011 is missing, the missing value complementing unit 116 refers to the coefficient of determination for the population of municipality M1 calculated by the time series analysis. The missing value complementing unit 116 also refers to the coefficient of determination for the population of municipality M1 calculated by the regional correlation analysis, and to the coefficient of determination for the population of municipality M1 calculated by the item correlation analysis. The missing value complementing unit 116 then selects the correlation formula corresponding to the largest of the three coefficients of determination.
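The selection step can be illustrated with a short sketch. The class `CorrelationFormula` and the function `select_formula` are hypothetical names; the three coefficients of determination are the example values that the description gives for the population of municipality M1 under each analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrelationFormula:
    analysis_type: str  # "time series", "regional correlation", or "item correlation"
    region: str
    item: str
    r_squared: float    # coefficient of determination


def select_formula(candidates: List[CorrelationFormula]) -> CorrelationFormula:
    """Pick the candidate with the largest coefficient of determination."""
    return max(candidates, key=lambda f: f.r_squared)


candidates = [
    CorrelationFormula("time series", "M1", "population", 0.964),
    CorrelationFormula("regional correlation", "M1", "population", 0.465),
    CorrelationFormula("item correlation", "M1", "population", 0.712),
]
best = select_formula(candidates)
print(best.analysis_type)
```

With these values the time series formula is selected. If another evaluation index were used instead, only the `key` function would need to change.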

The missing value complementing unit 116 calculates an estimate of the missing value using the selected correlation formula and the real number data 111a, and stores the calculated estimate in the storage unit 111 as substitute data for the missing value. The missing value complementing unit 116 may embed the calculated estimate in the processed data 111c. The time series analysis unit 113, the regional correlation analysis unit 114, the item correlation analysis unit 115, and the missing value complementing unit 116 execute the above processing for each missing value included in the real number data 111a.

The data processing unit 117 processes the real values included in the real number data 111a to generate the processed data 111c. For example, the data processing unit 117 calculates a processed value that is useful for comparison between regions from two or more real values included in the real number data 111a and a predetermined calculation formula.

At this time, when a real value to be used is a missing value, the data processing unit 117 calculates the processed value using the estimate calculated by the missing value complementing unit 116. The data processing unit 117 may insert only the calculated processed value into the processed data 111c, or may also insert the real values and the like used to calculate the processed value. In the latter case, the data processing unit 117 may insert the estimate calculated by the missing value complementing unit 116 into the processed data 111c, or may insert the correlation formula used to calculate the estimate into the processed data 111c.

The data processing unit 117 may output the generated processed data 111c. For example, the data processing unit 117 displays the processed data 111c on the display 71. As another example, the data processing unit 117 transmits the processed data 111c to another information processing apparatus.

Next, time series analysis, regional correlation analysis, and item correlation analysis will be described.
(Time series analysis)
When calculating a correlation formula for a missing value of a certain year, region, and item, the time series analysis unit 113 extracts, from the real number data 111a, the data of a plurality of years for the same region and the same item as the missing value. FIG. 5 is a diagram illustrating an example of extracted data (time series analysis) according to the second embodiment. The extracted data 111d illustrated in FIG. 5 indicates the transition of the population of municipality M1. Data may be missing, for example, when data is collected only every other year, as in a national census, or when there is a period during which data cannot be collected because of a large-scale disaster or a similar factor. In the example of FIG. 5, the population of municipality M1 in 2011 is missing. Therefore, the population data of municipality M1 for the other years is extracted as the extracted data 111d.

The correlation formula of the time series analysis is generated using the extracted data 111d. For example, a correlation formula such as the one shown in FIG. 6 can be used as the correlation formula of the time series analysis. FIG. 6 is a diagram illustrating an example of a correlation formula (time series analysis) according to the second embodiment.

For example, when the population is denoted by P and the year by Y, simple regression analysis on the extracted data 111d with P as the explained variable and Y as the explanatory variable determines the intercept (c11) of the regression line and the coefficient q1 for the variable Y. That is, the correlation formula P = q1 × Y + c11 is generated. Similarly, when the number of births is missing and the number of births is denoted by B, taking B as the explained variable determines the intercept (c12) of the regression line and the coefficient q2 for the variable Y. That is, the correlation formula B = q2 × Y + c12 is generated. Although a linear formula is given here as an example, the correlation formula may be nonlinear.
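A minimal sketch of the simple regression P = q1 × Y + c11, using the closed-form least squares solution. The function name `fit_line` and the population figures are made up for illustration and are not taken from FIG. 5.

```python
from typing import List, Tuple


def fit_line(years: List[float], values: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for values = q * year + c; returns (q, c)."""
    n = len(years)
    mean_y = sum(years) / n
    mean_v = sum(values) / n
    sxx = sum((y - mean_y) ** 2 for y in years)
    sxy = sum((y - mean_y) * (v - mean_v) for y, v in zip(years, values))
    q = sxy / sxx
    c = mean_v - q * mean_y
    return q, c


# Hypothetical population of one municipality; 2011 is the missing year.
years = [2009, 2010, 2012, 2013]
population = [5000.0, 5100.0, 5300.0, 5400.0]

q1, c11 = fit_line(years, population)
estimate_2011 = q1 * 2011 + c11
print(q1, c11, estimate_2011)
```

Because the sample data lies exactly on a line, the fitted slope is 100 people per year and the estimate for 2011 is 5200. On Python 3.10 and later, `statistics.linear_regression` from the standard library provides the same fit.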

The time series analysis unit 113 may adjust the correlation formula through interaction with the user. For example, the time series analysis unit 113 generates a correlation formula of the order designated by the user and displays it on the display 71. When the user, having seen the generated correlation formula, instructs a change of order, the time series analysis unit 113 regenerates the correlation formula with the changed order. In addition, when the user designates the range of years to be used for generating the correlation formula, the time series analysis unit 113 may generate the correlation formula using only the data in the designated range of years, for example by excluding data that is too old.

When regression analysis is performed as described above, a coefficient of determination that evaluates the residuals between the regression formula and the extracted data 111d is obtained. The coefficient of determination corresponds to the square of the correlation coefficient R and is sometimes written as R². For example, let y_i be each real value included in the extracted data 111d, y^* be the mean of the real values, and f_i be the estimate corresponding to each real value calculated with the correlation formula. The coefficient of determination can then be calculated as $R^2 = 1 - \sum_i (y_i - f_i)^2 / \sum_i (y_i - y^*)^2$. That is, the residual sum of squares is divided by the sum of squared differences from the mean of the real values, and the quotient is subtracted from 1. However, a correlation coefficient, a coefficient of determination adjusted for degrees of freedom, a correlation coefficient adjusted for degrees of freedom, or the like may be used as the evaluation index instead.
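The formula above translates directly into code. The following sketch uses the hypothetical function name `r_squared` and made-up sample values; it assumes the real values are not all identical, since the denominator would otherwise be zero.

```python
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((y - mean) ** 2 for y in actual)                  # squared deviations from the mean
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
r2 = r_squared(actual, predicted)
print(r2)
```

Here the residual sum of squares is 1 and the total sum of squares is 5, so the result is 0.8. A perfect fit gives 1.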

The correlation formula data 111b stores the correlation formula and the coefficient of determination generated by the time series analysis in association with analysis type = "time series", the region, and the item. For example, the correlation formula data 111b stores correlation formula = "q1 × Y + c11" and coefficient of determination (R²) = "0.964" in association with analysis type = "time series", region = "M1", and item = "population".

(Regional data)
When calculating a correlation formula for a missing value of a certain year, region, and item, the regional correlation analysis unit 114 extracts, from the real number data 111a, the data of a plurality of regions and a plurality of years for the same item as the missing value. FIG. 7 is a diagram illustrating an example of extracted data (regional correlation analysis) according to the second embodiment. The extracted data 111e illustrated in FIG. 7 indicates the transition of the population of each of municipalities M1, M2, ..., M9. In the example of FIG. 7, the population of municipality M1 in 2011 is missing. Therefore, the population data of a plurality of regions and a plurality of years is extracted as the extracted data 111e. In the second embodiment, the basic municipality is used as the regional unit, but a prefecture, an electoral district, a school district, or the like may be used as the regional unit. Hereinafter, the population of municipality Mi is written as mi.

The correlation formula of the regional correlation analysis is generated using the extracted data 111e. For example, a correlation formula such as the one shown in FIG. 8 can be used as the correlation formula of the regional correlation analysis. FIG. 8 is a diagram illustrating an example of a correlation formula (regional correlation analysis) according to the second embodiment. The correlation formula for population generated by the regional correlation analysis is obtained by multiple regression analysis in which, for the same year, the population of one region is the explained variable and the populations of the other regions are the explanatory variables.

For example, multiple regression analysis with the population m1 of municipality M1 as the explained variable and the populations m2, ..., m9 of the other municipalities M2, ..., M9 as the explanatory variables determines the intercept (c21) of the regression line and the coefficients r11, ..., r18 for the variables m2, ..., m9. In other words, a correlation formula for estimating the population m1 of municipality M1 is determined. The example correlation formula shown in FIG. 8 additionally includes the variable Y indicating the year as an explanatory variable, together with the coefficient r19 for the variable Y. That is, the correlation formula m1 = r11 × m2 + ... + r18 × m9 + r19 × Y + c21 is generated. Correlation formulas can be determined in the same way for the other municipalities. The variable Y need not be included among the explanatory variables. Although a linear formula is given here as an example, the correlation formula may be nonlinear.
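The multiple regression can be sketched in pure Python by solving the normal equations. This is an illustration under simplifying assumptions, not the method of the embodiment: only two explanatory municipalities are used instead of eight, the year variable Y is omitted, the function names `solve` and `fit_multiple` are hypothetical, and the data is made up (in arbitrary units).

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Least squares via the normal equations; returns [coefficients..., intercept]."""
    x = [list(r) + [1.0] for r in rows]  # append a constant column for the intercept
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical (m2, m3) pairs for the years in which m1 is known.
others = [(10.0, 4.0), (12.0, 5.0), (11.0, 7.0), (15.0, 6.0), (13.0, 9.0)]
m1 = [122.0, 126.5, 125.5, 133.0, 130.5]  # generated as 2*m2 + 0.5*m3 + 100

r11, r12, c21 = fit_multiple(others, m1)
estimate = r11 * 14.0 + r12 * 8.0 + c21  # year in which m1 is missing
print(r11, r12, c21, estimate)
```

The normal equations keep the example self-contained, but they can be numerically fragile when the explanatory variables are strongly correlated, as neighbouring populations often are. A library least squares routine such as `numpy.linalg.lstsq` would be a more robust choice in practice.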

The regional correlation analysis unit 114 may adjust the correlation formula through interaction with the user. For example, the regional correlation analysis unit 114 generates a correlation formula of the order designated by the user and displays it on the display 71. When the user, having seen the generated correlation formula, instructs a change of order, the regional correlation analysis unit 114 regenerates the correlation formula with the changed order. In addition, when the user designates the range of years to be used for generating the correlation formula, the regional correlation analysis unit 114 may generate the correlation formula using only the data in the designated range of years, for example by excluding data that is too old.

In the above description, the correlation among municipalities M1 to M9 is analyzed. However, when the user designates the range of regions whose correlation is to be analyzed, the regional correlation analysis unit 114 may generate the correlation formula using only the data of the designated range of regions. For example, the regional correlation analysis unit 114 may limit the analysis to other regions adjacent to the region to which the missing value belongs, to other regions within a predetermined distance of that region, or to other regions that share the same economic characteristics as that region. Even without an instruction from the user, the regional correlation analysis unit 114 may limit the range of regions whose correlation is analyzed based on the above criteria. The regional correlation analysis unit 114 may also regenerate the correlation formula while changing, through interaction with the user, the range of regions whose correlation is analyzed.

When regression analysis is performed as described above, a coefficient of determination that evaluates the residuals between the regression formula and the extracted data 111e is obtained. The correlation formula data 111b stores the correlation formula and the coefficient of determination generated by the regional correlation analysis in association with analysis type = "regional correlation", the region, and the item. For example, the correlation formula data 111b stores correlation formula = "r11 × m2 + ... + r18 × m9 + r19 × Y + c21" and coefficient of determination (R²) = "0.465" in association with analysis type = "regional correlation", region = "M1", and item = "population". However, a correlation coefficient, a coefficient of determination adjusted for degrees of freedom, a correlation coefficient adjusted for degrees of freedom, or the like may be used as the evaluation index instead.

(Data by item)
When calculating a correlation formula for a missing value of a certain year, region, and item, the item correlation analysis unit 115 extracts, from the real number data 111a, the data of a plurality of items and a plurality of years for the same region as the missing value. FIG. 9 is a diagram illustrating an example of extracted data (item correlation analysis) according to the second embodiment. The extracted data 111f illustrated in FIG. 9 indicates the transition over time of each item, such as population, number of births, and number of deaths. In the example of FIG. 9, the population of municipality M1 in 2011 is missing. Therefore, the data of municipality M1 for a plurality of years and a plurality of items is extracted as the extracted data 111f.

The correlation formula of the item correlation analysis is generated using the extracted data 111f. For example, a correlation formula such as the one shown in FIG. 10 can be used as the correlation formula of the item correlation analysis. FIG. 10 is a diagram illustrating an example of a correlation formula (item correlation analysis) according to the second embodiment. The correlation formula for municipality M1 generated by the item correlation analysis is obtained by multiple regression analysis in which the value of one item is the explained variable and the values of the other items are the explanatory variables.

For example, multiple regression analysis with the population P of municipality M1 as the explained variable and other items of municipality M1, such as its number of births B and its number of deaths D, as the explanatory variables determines the intercept (c31) of the regression line and the coefficients s11, s12, ... for the variables B, D, .... In other words, a correlation formula for estimating the population P of municipality M1 is determined. The example correlation formula shown in FIG. 10 additionally includes the variable Y indicating the year as an explanatory variable, together with the coefficient s19 for the variable Y. That is, the correlation formula P = s11 × B + s12 × D + ... + s19 × Y + c31 is generated. Correlation formulas can be determined in the same way for the other items. The variable Y need not be included among the explanatory variables. Although a linear formula is given here as an example, the correlation formula may be nonlinear.

The item correlation analysis unit 115 may adjust the correlation formula through interaction with the user. For example, the item correlation analysis unit 115 generates a correlation formula of the order designated by the user and displays it on the display 71. When the user, having seen the generated correlation formula, instructs a change of order, the item correlation analysis unit 115 regenerates the correlation formula with the changed order. In addition, when the user designates the range of years to be used for generating the correlation formula, the item correlation analysis unit 115 may generate the correlation formula using only the data in the designated range of years, for example by excluding data that is too old.

In the above description, the correlation among the population P, the number of births B, and the number of deaths D is analyzed. However, when the user designates the range of items whose correlation is to be analyzed, the item correlation analysis unit 115 may generate the correlation formula using only the data of the designated range of items. For example, the item correlation analysis unit 115 may limit the analysis to items that are strongly related to the item to which the missing value belongs. Even without an instruction from the user, the item correlation analysis unit 115 may limit the range of items whose correlation is analyzed based on a predetermined criterion. The item correlation analysis unit 115 may also regenerate the correlation formula while changing, through interaction with the user, the range of items whose correlation is analyzed.

When the regression analysis is performed as described above, a coefficient of determination that evaluates the residual between the regression equation and the extracted data 111f is obtained. The correlation formula data 111b stores the correlation formula generated by the item correlation analysis and its coefficient of determination (R²) in association with analysis type = “item correlation”, the region, and the item. For example, the correlation formula data 111b stores correlation formula = “s11×B + s12×D + … + s19×Y + c31” and coefficient of determination (R²) = “0.712” in association with analysis type = “item correlation”, region = “M1”, and item = “population”. Note that a correlation coefficient, a degrees-of-freedom-adjusted coefficient of determination, a degrees-of-freedom-adjusted correlation coefficient, or the like may be used as the evaluation index instead.
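For reference, the coefficient of determination and its degrees-of-freedom-adjusted variant are commonly defined as below. The notation is an assumption made here and does not appear in the original: n is the number of observations, k the number of explanatory variables, y_i a real value, \hat{y}_i the value given by the correlation formula, and \bar{y} the mean of the real values.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

Because the adjusted form penalizes additional explanatory variables, it may be the fairer index when a single-variable time series formula is compared with multi-variable regional or item formulas.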

When three correlation formulas capable of calculating an estimated value for a given missing value have been generated, the missing value complementing unit 116 selects the correlation formula with the largest coefficient of determination among the three and calculates the estimated value using the selected correlation formula. The data processing unit 117 generates the processed data 111c using the estimated value calculated by the missing value complementing unit 116 or the estimation formula selected by the missing value complementing unit 116. FIG. 11 is a diagram showing an example of the processed data of the second embodiment.

As an example, the processed data 111c includes the birth ratios of a plurality of regions for 2011. The 2011 birth ratio of a region can be calculated by dividing the region's number of births B in 2011 by its population P in 2011. The real number data 111a includes, for municipality M2 in 2011, number of births B = “1900” and population P = “220000”. The data processing unit 117 therefore acquires the real values of the number of births B and the population P from the real number data 111a and, for example, inserts the acquired real values into the processed data 111c. The data processing unit 117 calculates birth ratio = “0.86%” from the acquired real values and inserts the calculated birth ratio into the processed data 111c.

The real number data 111a also includes number of births B = “8” for municipality M1 in 2011, so the data processing unit 117 acquires the real value of the number of births B from the real number data 111a. However, the population P of municipality M1 in 2011 is missing from the real number data 111a. The data processing unit 117 therefore acquires, for example, the correlation formula “q1×Y + c11” for estimating the population P from the missing value complementing unit 116 and inserts the correlation formula into the processed data 111c in place of a real value. The data processing unit 117 then derives “8/(q1×Y + c11)” as the expression for the birth ratio and inserts the derived expression into the processed data 111c. For a processed value expressed with a correlation formula, the data processing unit 117 may calculate a concrete value at any time, for example when the processed data 111c is displayed.
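A minimal Python sketch of storing a formula in place of a value and evaluating it later might look as follows. The function names and the coefficient values for q1 and c11 are hypothetical; the text does not give the coefficients, and these were chosen only so that the formula yields a plausible population for 2011.

```python
from typing import Callable, Union

# A cell of the processed data holds either a concrete number or a
# formula of the year Y that is evaluated later (for example at display time).
Cell = Union[float, Callable[[int], float]]


def population_formula(q1: float, c11: float) -> Callable[[int], float]:
    """Return the time series correlation formula P(Y) = q1 * Y + c11."""
    return lambda year: q1 * year + c11


def birth_ratio_formula(births: float, population: Callable[[int], float]) -> Callable[[int], float]:
    """Return the deferred expression births / P(Y)."""
    return lambda year: births / population(year)


def evaluate(cell: Cell, year: int) -> float:
    """Resolve a cell to a number, evaluating it if it is a formula."""
    return cell(year) if callable(cell) else cell


# Hypothetical coefficients (not from the source).
Q1, C11 = -50.0, 105400.0
ratio_cell: Cell = birth_ratio_formula(8, population_formula(Q1, C11))
print(f"{evaluate(ratio_cell, 2011):.2%}")
```

Keeping the formula rather than a number means the cell can be re-evaluated if the correlation formula is regenerated later, at the cost of computing on every display.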

Alternatively, the data processing unit 117 acquires the estimated value population P = “4850” from the missing value complementing unit 116 and inserts the estimated value into the processed data 111c in place of a real value. The data processing unit 117 then calculates birth ratio = “0.16%” from the real value of the number of births B and the estimated value of the population P, and inserts the calculated birth ratio into the processed data 111c. In doing so, the data processing unit 117 may add to the processed data 111c information indicating that the birth ratio was calculated using an estimated value.
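The two birth ratios quoted above can be checked with a short Python sketch. The `ProcessedValue` record and its `uses_estimate` flag are hypothetical names that illustrate the optional note that a ratio was derived from an estimate; the input figures are those given in the text.

```python
from dataclasses import dataclass


@dataclass
class ProcessedValue:
    value: float          # derived value, here the birth ratio as a fraction
    uses_estimate: bool   # True if any input was an estimated (substitute) value


def birth_ratio(births: float, population: float, population_is_estimate: bool) -> ProcessedValue:
    """Divide births by population and remember whether the population was estimated."""
    return ProcessedValue(births / population, population_is_estimate)


# Municipality M2, 2011: both inputs are real values.
m2 = birth_ratio(1900, 220000, population_is_estimate=False)
# Municipality M1, 2011: the population 4850 is an estimated value.
m1 = birth_ratio(8, 4850, population_is_estimate=True)

print(f"M2: {m2.value:.2%} (estimated: {m2.uses_estimate})")
print(f"M1: {m1.value:.2%} (estimated: {m1.uses_estimate})")
```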

[2-3. Process flow]
Next, the flow of processing executed by the information processing apparatus 100 when using the real number data 111a will be described with reference to FIG. 12. FIG. 12 is a flowchart showing an example of the processing procedure of the second embodiment.

(S101) The missing value complementing unit 116 selects one combination of attributes (a {year, region, item} set) that identifies one value (a real value or a NULL value) in the real number data 111a. Here, the missing value complementing unit 116 selects one set that has not yet been selected from among the possible {year, region, item} sets.

(S102) The missing value complementing unit 116 determines whether a real value exists for the set selected in S101. If the value corresponding to the selected set is a real value, the process proceeds to S108. If the value corresponding to the selected set is a NULL value, the process proceeds to S103. That is, when the real value corresponding to the selected set is missing, the missing value complementing unit 116 notifies the time series analysis unit 113, the regional correlation analysis unit 114, and the item correlation analysis unit 115 of the information on the missing part (the selected set) and advances the process to S103.

(S103) Based on the information on the missing part notified by the missing value complementing unit 116, the time series analysis unit 113 generates, by time series analysis, a correlation formula that can estimate the missing value, and calculates the coefficient of determination of the generated correlation formula.

For example, when the population value of municipality M1 for 2011 is missing, the time series analysis unit 113 extracts from the real number data 111a the extracted data 111d, in which the region is limited to “M1”, the item is limited to “population”, and the year is not limited. The time series analysis unit 113 calculates a correlation formula from the extracted data 111d by simple regression analysis and stores the coefficients of the variables of the correlation formula in the correlation formula data 111b. The time series analysis unit 113 also stores the coefficient of determination obtained in the regression analysis in the correlation formula data 111b.

As an example, the time series analysis unit 113 obtains the coefficients q1 and c11 of a correlation formula such as that in FIG. 6 by linear regression analysis with the “population” of municipality M1 as the explained variable and the “year” as the explanatory variable. Through this linear regression analysis, the time series analysis unit 113 also obtains a coefficient of determination (R²) such as that in FIG. 6. When the correlation formula is to be nonlinear, performing nonlinear regression analysis yields a set of coefficients describing the regression curve and a coefficient of determination that evaluates the residuals between the real values and the estimated values.
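The simple regression of S103 can be sketched in plain Python as an ordinary least-squares fit. This is an illustration under assumptions, not the patented implementation: the function name and the population series are hypothetical, and only the form P = q1×Y + c11 and the use of R² come from the text.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = q * x + c; returns (q, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    q = sxy / sxx
    c = mean_y - q * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (q * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return q, c, 1.0 - ss_res / ss_tot


# Hypothetical population series for one municipality; 2011 is missing.
years = [2008, 2009, 2010, 2012, 2013]
population = [5010, 4940, 4910, 4800, 4760]

q1, c11, r2 = fit_simple_regression(years, population)
estimate_2011 = q1 * 2011 + c11
print(f"P = {q1:.2f} * Y + {c11:.2f}, R^2 = {r2:.3f}, P(2011) = {estimate_2011:.0f}")
```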

However, if a correlation formula capable of calculating an estimated value for the selected missing value has already been generated by time series analysis, S103 is skipped.
(S104) Based on the information on the missing part notified by the missing value complementing unit 116, the regional correlation analysis unit 114 generates, by regional correlation analysis, a correlation formula that can estimate the missing value, and calculates the coefficient of determination of the generated correlation formula.

For example, when the population value of municipality M1 for 2011 is missing, the regional correlation analysis unit 114 extracts from the real number data 111a the extracted data 111e, in which the region is not limited, the item is limited to “population”, and the year is not limited. The regional correlation analysis unit 114 calculates a correlation formula from the extracted data 111e by multiple regression analysis and stores the coefficients of the variables of the correlation formula in the correlation formula data 111b. The regional correlation analysis unit 114 also stores the coefficient of determination obtained in the regression analysis in the correlation formula data 111b.

As an example, the regional correlation analysis unit 114 obtains the coefficients r11 to r19 and c21 of a correlation formula such as that in FIG. 8 by linear regression analysis with the “population” of municipality M1 as the explained variable and the “population” of municipalities M2 to M9 and the “year” as explanatory variables. Through this linear regression analysis, the regional correlation analysis unit 114 obtains a coefficient of determination (R²) such as that in FIG. 8. When the correlation formula is to be nonlinear, performing nonlinear regression analysis yields a set of coefficients describing the regression curve and a coefficient of determination that evaluates the residuals between the real values and the estimated values.
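A multiple regression like the one in S104 can be sketched without external libraries by solving the normal equations. This is an assumption-laden illustration: the function names are hypothetical, only two neighboring municipalities are used instead of M2 to M9, and the data are invented (populations in thousands, year expressed as an offset from 2010 to keep the system well scaled).

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("explanatory variables are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Least-squares fit of y = sum(coef_j * x_j) + intercept.

    Returns (coefficients, intercept, r_squared)."""
    design = [list(row) + [1.0] for row in rows]  # append the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return beta[:-1], beta[-1], 1.0 - ss_res / ss_tot


# Hypothetical data: each row is (population of M2, population of M3, year - 2010).
rows = [
    (224.0, 51.0, -3), (223.1, 51.4, -2), (222.5, 51.1, -1),
    (221.2, 50.6, 0), (219.4, 50.9, 2), (218.9, 50.2, 3),
]
ys = [5.05, 5.01, 4.95, 4.91, 4.80, 4.76]  # population of M1, 2011 missing

coefs, intercept, r2 = fit_multiple_regression(rows, ys)
x_2011 = (220.0, 50.8, 1)  # hypothetical 2011 values of M2 and M3
estimate = sum(c * v for c, v in zip(coefs, x_2011)) + intercept
print(f"R^2 = {r2:.3f}, estimated M1 population (thousands) = {estimate:.2f}")
```

The same routine would serve for the item correlation analysis if the explanatory columns were other items of the same municipality rather than the same item of other municipalities.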

However, if a correlation formula capable of calculating an estimated value for the selected missing value has already been generated by regional correlation analysis, S104 is skipped.
(S105) Based on the information on the missing part notified by the missing value complementing unit 116, the item correlation analysis unit 115 generates, by item correlation analysis, a correlation formula that can estimate the missing value, and calculates the coefficient of determination of the generated correlation formula.

For example, when the population value of municipality M1 for 2011 is missing, the item correlation analysis unit 115 extracts from the real number data 111a the extracted data 111f, in which the region is limited to “M1”, the item is not limited, and the year is not limited. The item correlation analysis unit 115 calculates a correlation formula from the extracted data 111f by multiple regression analysis and stores the coefficients of the variables of the correlation formula in the correlation formula data 111b. The item correlation analysis unit 115 also stores the coefficient of determination obtained in the regression analysis in the correlation formula data 111b.
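The three extractions described for S103 to S105 differ only in which classification axes are fixed. A minimal Python sketch, assuming the real number data is held as a dictionary keyed by (year, region, item); the `extract` helper and the 2010 figures are hypothetical, while the 2011 figures are those quoted in the text:

```python
# Excerpt of the real number data: (year, region, item) -> value.
# None stands for a NULL (missing) value. The 2010 rows are hypothetical.
real_data = {
    (2010, "M1", "population"): 4910,
    (2011, "M1", "population"): None,
    (2011, "M1", "births"): 8,
    (2010, "M2", "population"): 221000,
    (2011, "M2", "population"): 220000,
    (2011, "M2", "births"): 1900,
}


def extract(data, region=None, item=None):
    """Keep all years; restrict region and/or item only when given."""
    return {
        key: value
        for key, value in data.items()
        if (region is None or key[1] == region) and (item is None or key[2] == item)
    }


data_111d = extract(real_data, region="M1", item="population")  # time series analysis
data_111e = extract(real_data, item="population")               # regional correlation analysis
data_111f = extract(real_data, region="M1")                     # item correlation analysis
```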

As an example, the item correlation analysis unit 115 obtains the coefficients s11 to s19 and c31 of a correlation formula such as that in FIG. 10 by linear regression analysis with the “population” of municipality M1 as the explained variable and items of municipality M1 such as “number of births” and “number of deaths”, together with the “year”, as explanatory variables. Through this linear regression analysis, the item correlation analysis unit 115 obtains a coefficient of determination (R²) such as that in FIG. 10. When the correlation formula is to be nonlinear, performing nonlinear regression analysis yields a set of coefficients describing the regression curve and a coefficient of determination that evaluates the residuals between the real values and the estimated values.

However, if a correlation formula capable of calculating an estimated value for the selected missing value has already been generated by item correlation analysis, S105 is skipped.
Note that S103, S104, and S105 may be executed in a different order.

(S106) The missing value complementing unit 116 compares the coefficients of determination calculated in S103 to S105 and identifies the largest one. That is, from among the correlation formulas calculated by the time series analysis unit 113, the regional correlation analysis unit 114, and the item correlation analysis unit 115, the missing value complementing unit 116 selects the correlation formula that best explains the real values of the real number data 111a (the correlation formula corresponding to the largest coefficient of determination).
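The selection in S106 amounts to taking the maximum over the stored coefficients of determination. A minimal Python sketch follows; the function name is hypothetical, and treating an analysis that produced no formula as `None` is an assumption added here, not something the text specifies.

```python
def select_formula(candidates):
    """Return the (analysis_type, r_squared) pair with the largest R^2.

    candidates maps an analysis type to its coefficient of determination;
    entries given as None (no formula available) are ignored."""
    usable = {name: r2 for name, r2 in candidates.items() if r2 is not None}
    if not usable:
        raise ValueError("no correlation formula is available")
    best = max(usable, key=usable.get)
    return best, usable[best]


# 0.712 is the item correlation value quoted earlier; the other two are hypothetical.
candidates = {
    "time series": 0.655,
    "regional correlation": 0.698,
    "item correlation": 0.712,
}
print(select_formula(candidates))
```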

(S107) The missing value complementing unit 116 calculates an estimated value (substitute value) for the missing value using the correlation formula corresponding to the coefficient of determination identified in S106, and stores it in the storage unit 111 as substitute data for the missing value. The missing value complementing unit 116 may insert the calculated estimated value (substitute value) into the processed data 111c.

(S108) The missing value complementing unit 116 determines whether all {year, region, item} sets have been selected. If all sets have been selected, the series of processes shown in FIG. 12 ends. If an unselected set remains, the process returns to S101. That is, when substitute data has been obtained for all missing values in the real number data 111a, or when the real number data 111a contains no missing value, the series of processes shown in FIG. 12 ends.
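Steps S101 to S108 can be summarized as one loop over the cells of the data. The sketch below is a simplified reading of the flow, not the actual implementation: the three analysis units are replaced by hypothetical stub functions that return a fixed estimate and R², and the reuse of already generated formulas (the skipping of S103 to S105) is not modeled.

```python
def complement_missing_values(data, analyses):
    """Return {key: (estimate, analysis_name)} for every NULL cell in data.

    data maps (year, region, item) to a value or None.
    analyses maps a name to a function (data, key) -> (formula, r_squared),
    where formula(key) yields the estimate for the missing cell."""
    substitutes = {}
    for key, value in data.items():                # S101: next unselected set
        if value is not None:                      # S102: a real value exists
            continue                               #       go to S108
        fitted = {name: run(data, key) for name, run in analyses.items()}  # S103-S105
        best = max(fitted, key=lambda name: fitted[name][1])               # S106
        formula, _ = fitted[best]
        substitutes[key] = (formula(key), best)                            # S107
    return substitutes                             # S108: all sets processed


def stub(estimate, r2):
    """Hypothetical stand-in for an analysis unit: fixed estimate and R^2."""
    def run(data, key):
        return (lambda _key: estimate), r2
    return run


data = {(2011, "M1", "population"): None, (2011, "M1", "births"): 8}
analyses = {
    "time series": stub(4850.0, 0.90),           # R^2 values are hypothetical
    "regional correlation": stub(4700.0, 0.80),
    "item correlation": stub(5000.0, 0.712),
}
result = complement_missing_values(data, analyses)
print(result)
```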

The flow of processing executed by the information processing apparatus 100 has been described above.
As described above, the information processing apparatus 100 has a mechanism for obtaining correlation formulas that estimate a missing value by three different estimation methods, namely time series analysis, regional correlation analysis, and item correlation analysis, and selects and uses the correlation formula evaluated as having high estimation accuracy. Therefore, even when much of the data for a particular year, region, or item is missing and it is difficult to obtain an accurate correlation formula from one particular estimation method alone, the missing value can be estimated using a correlation formula obtained from another estimation method.

For example, when the value of an item for a municipality is missing for several consecutive years, it may be difficult to generate an accurate correlation formula by time series analysis. In that case, an accurate correlation formula can sometimes be generated by regional correlation analysis or item correlation analysis. Likewise, when the values of an item for a given year are missing for several municipalities at once, it may be difficult to generate an accurate correlation formula by regional correlation analysis, but one can sometimes be generated by time series analysis or item correlation analysis. And when the values of several items for a municipality are missing at once for a given year, it may be difficult to generate an accurate correlation formula by item correlation analysis, but one can sometimes be generated by time series analysis or regional correlation analysis.

DESCRIPTION OF SYMBOLS
10 Missing data estimation device
11 Storage unit
12 Arithmetic unit
13 First estimation method
14 Second estimation method
15 First estimated value
16 Second estimated value
17 First reliability index
18 Second reliability index
20 Multidimensional data
21 First classification axis
21a, 21b First attribute value
22 Second classification axis
22a, 22b Second attribute value
23 First value
24 Second value
25 Third value

Claims

A missing data estimation method executed by a computer, the method comprising:
detecting, from multidimensional data in which a set of values is classified using a plurality of classification axes including a first classification axis based on a plurality of first attribute values and a second classification axis based on a plurality of second attribute values, that a first value corresponding to a set of one first attribute value and one second attribute value is missing;
calculating, from the multidimensional data, a first reliability index indicating the reliability of a first estimation method that calculates a first estimated value corresponding to the first value by using a second value corresponding to a set of another first attribute value and the one second attribute value;
calculating, from the multidimensional data, a second reliability index indicating the reliability of a second estimation method that calculates a second estimated value corresponding to the first value by using a third value corresponding to a set of the one first attribute value and another second attribute value; and
complementing the missing first value with an estimated value obtained by an estimation method selected based on a comparison between the first reliability index and the second reliability index.

The missing data estimation method according to claim 1, wherein
the first estimation method generates a first estimation formula by simple regression analysis using the other first attribute value and the second value, and calculates the first estimated value by applying the one first attribute value to the first estimation formula, and
the second estimation method generates a second estimation formula by multiple regression analysis using the second value and a fourth value corresponding to a set of the other first attribute value and the other second attribute value, and calculates the second estimated value by applying the third value to the second estimation formula.

The missing data estimation method according to claim 1, wherein
the plurality of first attribute values are one of a plurality of time points, a plurality of regions, and a plurality of statistical types, and the plurality of second attribute values are another one of the plurality of time points, the plurality of regions, and the plurality of statistical types.

A missing data estimation device comprising:
a storage unit that stores multidimensional data in which a set of values is classified using a plurality of classification axes including a first classification axis based on a plurality of first attribute values and a second classification axis based on a plurality of second attribute values; and
an arithmetic unit that
detects, from the multidimensional data, that a first value corresponding to a set of one first attribute value and one second attribute value is missing,
calculates, from the multidimensional data, a first reliability index indicating the reliability of a first estimation method that calculates a first estimated value corresponding to the first value by using a second value corresponding to a set of another first attribute value and the one second attribute value,
calculates, from the multidimensional data, a second reliability index indicating the reliability of a second estimation method that calculates a second estimated value corresponding to the first value by using a third value corresponding to a set of the one first attribute value and another second attribute value, and
complements the missing first value with an estimated value obtained by an estimation method selected based on a comparison between the first reliability index and the second reliability index.

A missing data estimation program that causes a computer to execute processing comprising:
detecting, from multidimensional data in which a set of values is classified using a plurality of classification axes including a first classification axis based on a plurality of first attribute values and a second classification axis based on a plurality of second attribute values, that a first value corresponding to a set of one first attribute value and one second attribute value is missing;
calculating, from the multidimensional data, a first reliability index indicating the reliability of a first estimation method that calculates a first estimated value corresponding to the first value by using a second value corresponding to a set of another first attribute value and the one second attribute value;
calculating, from the multidimensional data, a second reliability index indicating the reliability of a second estimation method that calculates a second estimated value corresponding to the first value by using a third value corresponding to a set of the one first attribute value and another second attribute value; and
complementing the missing first value with an estimated value obtained by an estimation method selected based on a comparison between the first reliability index and the second reliability index.