WO2020250930A1 - Detection method for correction location in number set and system for same - Google Patents

Publication number
WO2020250930A1
Authority
WO
WIPO (PCT)
Prior art keywords
numerical value
numerical
conversion
value
cumulative
Prior art date
Application number
PCT/JP2020/022846
Other languages
French (fr)
Japanese (ja)
Inventor
廣川 佐千男
祐輔 戸▲崎▼
鈴木 孝彦
Original Assignee
国立大学法人九州大学
Priority date
Filing date
Publication date
Application filed by 国立大学法人九州大学
Publication of WO2020250930A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures

Definitions

  • The present invention relates to a method for detecting corrected locations in a numerical set using Benford's law, and to a system therefor.
  • Benford's law is known as a law that holds for sets of naturally occurring numerical data. It states that the frequency with which each digit appears as the first digit of the numbers in an arbitrary set follows a regular pattern, and that a set which does not follow Benford's law is considered to contain some inconsistency. Benford's law has long been used to detect fraud in statistical data.
  • In Non-Patent Document 1, Nigrini et al. present a method of applying Benford's law to accounting data to verify the credibility of the numerical data.
  • In Non-Patent Document 2, Rauch et al. use Benford's law to investigate the credibility of the economic data of EU member states. Analyzing Eurostat data from 1999 to 2009, they show that the data reported by Greece deviates the most from Benford's law.
  • The present invention has been made in view of the above problem in the prior art, and its object is to provide a method for estimating, with higher accuracy, which part of a given set contains errors.
  • The computer performs the following steps: a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value.
  • The deviation of the value of a specific digit of the converted numerical values from the Benford distribution is identified, and the deviation is accumulated over the number of conversion methods.
  • A step of calculating the cumulative deviation degree, and a step of determining the possibility that a specific numerical value has been corrected or tampered with by comparing the cumulative deviation degree among the numerical values; a data correction/tampering determination method characterized by having these steps.
  • In the step of determining the possibility of correction/tampering, the cumulative deviation degree is compared among the numerical values, and one or more numerical values with a high cumulative deviation degree are identified as numerical values likely to have been corrected or tampered with.
  • In the step of determining the possibility of correction/tampering, the cumulative deviation degree is compared among the numerical values, and one or more numerical values whose cumulative deviation degree is equal to or higher than a predetermined threshold value are identified as numerical values likely to have been corrected or tampered with.
  • The step of determining the possibility of correction/tampering further includes a step of determining the predetermined threshold value based on the cumulative deviation degrees of all the numerical values.
  • In the step of determining the possibility of correction/tampering, when the cumulative deviation degrees are compared among the numerical values and sorted in descending order, one or more numerical values ranked at or above a predetermined rank are identified as numerical values likely to have been corrected or tampered with.
  • The step of determining the possibility of correction/tampering further includes a step of determining the predetermined rank based on the cumulative deviation degrees of all the numerical values.
  • The plurality of conversion methods are radix conversions using two or more different radixes.
  • The cumulative deviation degree is the number of converted numerical values that deviate from the Benford distribution among the plurality of converted numerical values obtained by converting a specific numerical value with the conversion methods.
  • The cumulative deviation degree is the ratio between the plurality of converted numerical values obtained by converting a specific numerical value with the conversion methods and those among them that deviate from the Benford distribution.
  • A system executed by a computer, having means by which the computer acquires a numerical set to be analyzed and stores each numerical value, together with an ID, in memory as a pre-conversion numerical value.
  • Means by which the computer converts each pre-conversion numerical value in the set by two or more different conversion methods and tallies, for each conversion method, the distribution of the value of a specific digit of the converted numerical values.
  • The computer identifies, for each numerical value in the numerical set referred to by the same ID, the deviation of the value of a specific digit of the converted numerical values from the Benford distribution, and accumulates that deviation over the number of conversion methods.
  • Means by which the computer determines the possibility of correction/tampering by comparing the cumulative deviation degree among the numerical values; a data correction/tampering determination system having these means.
  • The means for determining the possibility of correction/tampering compares the cumulative deviation degree among the numerical values and identifies one or more numerical values with a high cumulative deviation degree as numerical values likely to have been corrected or tampered with.
  • In the system, the specific digit is the nth highest digit (n is an integer not exceeding m, the number of digits of the smallest converted numerical value).
  • The plurality of conversion methods are radix conversions using two or more different radixes.
  • The cumulative deviation degree is calculated for the additionally received numerical value and compared with the cumulative deviation degrees of the other numerical values, whereby the possibility that the additionally received numerical value has been corrected or tampered with is determined.
  • A computer software program product for data correction/tampering determination executed by a computer, stored on a storage medium and causing the computer to execute the following: a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value.
  • A computer software program product characterized by having means for executing the step of calculating the cumulative deviation degree and a step of determining the possibility of correction/tampering by comparing the cumulative deviation degree among the numerical values.
  • FIG. 1 is a schematic diagram showing the distribution of population by administrative division in Japan.
  • FIG. 2 is an explanatory diagram showing changes in statistical data before and after correction.
  • FIG. 3 is also an explanatory diagram showing changes in the statistical data before and after the correction.
  • FIG. 4 is a schematic diagram showing the relationship between the statistical data before correction and the Benford distribution.
  • FIG. 5 is a schematic diagram showing the relationship between the corrected statistical data and the Benford distribution.
  • FIG. 6 is a system configuration diagram showing an embodiment of the present invention.
  • FIG. 7 is a flowchart showing the operation.
  • FIG. 8 is a schematic diagram showing the relationship between the pentadecimal numbers and the Benford distribution.
  • FIG. 9 is a schematic diagram showing the relationship between the hexadecimal numbers and the Benford distribution.
  • FIG. 10 is an explanatory diagram for explaining the degree of deviation for a specific numerical value.
  • FIG. 11 is a schematic diagram showing an output list for evaluation.
  • FIG. 12 is a graph showing each index related to the evaluation performance of the present invention.
  • FIG. 13 is a graph showing each index related to the evaluation performance of the maximum likelihood method, which is the comparison target.
  • FIG. 14 is a table showing the method according to the present invention and the methods it is compared with.
  • FIG. 15 is a diagram showing the degree of agreement when only the decimal system is used.
  • FIG. 16 is a diagram showing an analysis of the case where the performance of the maximum likelihood method is exceeded.
  • FIG. 17 is a graph showing the evaluation performance based on the corrected data.
  • FIG. 18 is also a graph showing the evaluation performance based on the corrected data.
  • The Ministry of Health, Labour and Welfare summarizes the employment status of persons with disabilities in public institutions as of June 1 every year and publishes the "aggregation result of employment status of persons with disabilities". The aggregation results that became a problem are those announced for the employment status of persons with disabilities up to 2017.
  • The present inventors carried out verification using the inappropriate aggregation results and the correct aggregation results obtained after re-inspection.
  • The number of persons with disabilities in employment is the number of persons with disabilities employed by one institution.
  • The analyzed numerical set data contains 422 cells (numerical values), excluding blank cells and cells with a value of 0. Of these, about 40% (0.395; 167 of 422) had different values before the correction (the data before the falsification was discovered) and after the correction (the data after the falsification was discovered and corrected).
  • The test statistic χ² was calculated by the following formula, where O_d is the observed frequency and P_d is the expected frequency.
  • The present invention has been made based on this knowledge and proposes a method of estimating the erroneous parts of numerical data, basically by applying Benford's law.
  • FIG. 6 shows a system according to an embodiment of the present invention.
  • The data storage unit 6 and the program storage unit 7 are connected to a bus 5, to which a CPU 2, a RAM 3, and an input/output unit 4 are also connected.
  • The data storage unit 6 stores the numerical set data 9, the converted numerical set data 10 for each radix, the numerical distribution data 11 for each radix, and the numerical correctability evaluation result 12.
  • The program storage unit 7 has: a numerical set acquisition unit 14 that acquires the numerical set to be analyzed and stores each numerical value, together with an ID, as a pre-conversion numerical value; a radix conversion processing unit 15 that converts each pre-conversion numerical value in the set by two or more different conversion methods to generate a plurality of converted numerical sets, one per conversion method; a unit that generates the numerical distribution corresponding to each pre-conversion or post-conversion numerical set; a deviation degree accumulation processing unit 17 that calculates, for the n types of conversion, the degree of deviation of the specific digit value of each numerical value from the Benford distribution and accumulates it; and a correctability determination processing unit 18 that determines the possibility of correction/tampering by sorting the numerical values based on the cumulative deviation degree.
  • The data storage unit 6 and the program storage unit 7 are in practice storage devices such as a hard disk. Each of the above components is called by the CPU 2, loaded into the RAM 3, and executed in cooperation with other necessary programs such as an OS, thereby functioning as a component of the present invention.
  • FIG. 7 is a flowchart showing the operation processing by this system. Hereinafter, the system operation will be described with reference to this flowchart (steps S1 to S5).
  • The numerical set to be analyzed may be either a static numerical set that already exists or a dynamic numerical set that accumulates from moment to moment.
  • An example of using a static set is finding falsified numerical values in past statistical information, as described above.
  • An example of using a dynamic set is discovering falsified information in real time from payment information such as credit card transactions. In this case, it is possible to determine in real time whether a newly input numerical value has been corrected or falsified, in light of the numerical set already accumulated.
  • the numerical value set acquisition unit 14 assigns an ID to each numerical value included in the set and stores it in the data storage unit 6 (numerical value set 9) (step S1).
  • the case where the above-mentioned employee number statistical information is used as the numerical set 9 will be described as an example.
  • The radix conversion processing unit 15 converts each numerical value in the set 9 into each radix k from 3 to 16, generating a radix-converted numerical set for each radix (step S2).
  • The numerical distribution generation processing unit 16 generates, for each radix k, the distribution of the number of occurrences of the first significant digit (FSD) over all the numerical values (422 values) in the numerical set (step S3), and the deviation degree accumulation processing unit 17 accumulates the degree of deviation from the Benford distribution according to the generated distribution (step S4).
  • A radix for which the value of the first digit deviates from the Benford distribution is indicated by a black circle.
  • the cumulative value of the degree of deviation of the numerical value 17 in the above set is 0.6552.
  • the determination unit 18 outputs all the institutions V (c) included in the set by ranking as shown in FIG. 11 based on the degree of deviation accumulated above.
  • the rankings are sorted in descending order of cumulative divergence.
  • a threshold value for the ranking for this ranking for example, 10th place or higher
  • a specific cumulative value threshold value for example, 0
  • This threshold can be specified by the user, but this system may automatically determine it.
  • (Example 2) The degree of deviation may be expressed not only by the actual cumulative value but also by the number of black circles shown in FIG. 10 (the number of radixes for which the observed frequency is higher than the Benford distribution). An example in which this is applied to the above employment statistics is described below as Example 2.
  • the determination by the determination unit 18 is specifically performed using the following function.
  • C_kd is the set of institutions c for which d, the FSD of v(c) expressed in radix k, is the same.
  • C_kd = {c | the FSD of v(c) in radix k is d}
  • C_10,1 = {Consumer Affairs Agency, Ministry of Internal Affairs and Communications, Ministry of Foreign Affairs, ...}
  • C_10,2 = {Imperial Household Agency, Ministry of Finance, Japan Tourism Agency, ...}
  • This value is the number of radixes that did not comply with Benford above.
  • each institution is assigned a value corresponding to the radix k, and its ranking can be created.
  • FIG. 12 shows the evaluation result of the estimated performance of the present invention.
  • FIG. 13 shows a comparison target using the maximum likelihood method.
  • The above-mentioned known correction ratio of 0.395 (about 40%) is used, and whether C_kd contains many institutions c such that v(c) ≠ r(c) was evaluated not from Benford's law but from the number of c in C_kd with v(c) ≠ r(c).
  • FIG. 14 is a comparison result of the estimation ability between the present invention and the maximum likelihood method and the random determination method.
  • Pb indicates the Precision of the proposed method
  • Rb indicates the Recall of the proposed method
  • Pml indicates the Precision of the maximum likelihood method
  • Rml indicates the recall of the maximum likelihood method.
  • According to the present invention, it has become possible to estimate the parts where errors are concentrated.
  • The present invention can also be applied to numerical set data other than the number of employed persons with disabilities. It is also possible to capture numerical data flowing in time series and determine in a timely manner which time period contains incorrect numerical values.
  • The present invention is not limited to the above-described embodiment and can be variously modified without changing the gist of the invention.
  • In the above embodiment, the plurality of conversion methods applied to each numerical value of the numerical set are radix conversions, but the conversion methods are not limited to radix conversion; any conversion method that follows a specific rule may be used.
  • In the above embodiment, the digit analyzed for deviation from the Benford distribution is the highest digit, but the present invention is not limited to this, and the nth highest digit (n is an integer not exceeding m, the number of digits of the smallest converted numerical value) may be used.
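The Precision and Recall indices mentioned above (Pb, Rb, Pml, Rml) can be computed as in the following minimal Python sketch. It assumes that the set of numerical values flagged by a method and the set of values actually corrected are both known as sets of IDs; the function name `precision_recall` and the sample IDs are hypothetical and are not taken from the original.

```python
def precision_recall(flagged, truly_corrected):
    """Return (precision, recall) for a set of flagged IDs.

    flagged: IDs that a method judged likely to be corrected/tampered with.
    truly_corrected: IDs whose values actually differ before and after correction.
    """
    flagged = set(flagged)
    truly_corrected = set(truly_corrected)
    hits = len(flagged & truly_corrected)  # correctly flagged IDs
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_corrected) if truly_corrected else 0.0
    return precision, recall


# Hypothetical example: 4 IDs flagged, 3 IDs actually corrected, 2 in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```

Precision is the share of flagged values that were really corrected, and recall is the share of really corrected values that were flagged.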

Abstract

[Problem] The purpose of the present invention is to provide a procedure for estimating, with a higher degree of accuracy, which section of a specified set has a mistake. [Solution] Provided is a data correction/falsification determination method which is executed by a computer, wherein the computer executes the following steps included in said data correction/falsification determination method: a step in which an analysis target number set is acquired, and each number is stored together with an ID in memory as a pre-conversion number; a step in which each of the pre-conversion numbers contained in the set are converted by two or more different conversion methods into post-conversion numbers, and the distribution of the value of a specified digit in the post-conversion numbers is summed up for each conversion method; a step in which, with respect to each of the numbers contained in the number set and referenced by the same ID, deviation in the value of the specified digit in the post-conversion numbers from the Benford distribution thereof is specified, and a cumulative degree of deviation is calculated by means of accumulating said deviation by the number of conversion methods; and a step in which the correctability/falsifiability of a specified number is determined by means of comparing the aforementioned cumulative degree of deviation to each of the numbers.

Description

Method for detecting corrected locations in a numerical set, and system therefor
The present invention relates to a method for detecting corrected locations in a numerical set using Benford's law, and to a system therefor.
(Benford's law)
In modern society, many decisions are made on the basis of supporting data, so the credibility of that data is important.
Benford's law is known as a law that holds for sets of naturally occurring numerical data. It states that the frequency with which each digit appears as the first digit of the numbers in an arbitrary set follows a regular pattern, and that a set which does not follow Benford's law is considered to contain some inconsistency.
Benford's law has long been used to detect fraud in statistical data.
(What is Benford's law)
For a set of naturally occurring numerical data, the probability that the first digit is d (d = 1 in the case of 123) is log_k(1 + 1/d), where k is the radix in which the numbers are written (k = 10 for decimal).
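As a rough illustration of the formula above, the following Python sketch computes the first digit of an integer written in radix k and the corresponding Benford probability. The function names `leading_digit` and `benford_prob` are assumptions made for this example and do not appear in the original.

```python
import math


def leading_digit(n: int, k: int = 10) -> int:
    """Return the most significant digit of a positive integer n written in radix k."""
    if n <= 0 or k < 2:
        raise ValueError("n must be positive and k must be at least 2")
    while n >= k:
        n //= k  # drop the lowest digit until one digit remains
    return n


def benford_prob(d: int, k: int = 10) -> float:
    """Benford probability log_k(1 + 1/d) for a first digit d (1 <= d <= k - 1)."""
    if not 1 <= d <= k - 1:
        raise ValueError("d must satisfy 1 <= d <= k - 1")
    return math.log(1 + 1 / d, k)


print(leading_digit(123))         # 1
print(round(benford_prob(1), 3))  # 0.301
```

For any radix k, the probabilities for d = 1 to k - 1 sum to 1, because the terms log_k((d + 1)/d) telescope to log_k(k).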
For example, the first digits d of the populations of Japan's administrative divisions follow the Benford distribution, as shown in FIG. 1.
(Research related to Benford's law)
The methods disclosed in the following Non-Patent Documents 1 and 2 are known as conventional methods for evaluating the credibility of numerical data using Benford's law.
First, in Non-Patent Document 1, Nigrini et al. present a method of applying Benford's law to accounting data to verify the credibility of the numerical data.
In Non-Patent Document 2, Rauch et al. use Benford's law to investigate the credibility of the economic data of EU member states. Analyzing Eurostat data from 1999 to 2009, they show that the data reported by Greece deviates the most from Benford's law.
However, the credibility evaluations based on Benford's law described in the above non-patent documents merely determine whether the set as a whole contains errors; they do not determine which part of the set (which numerical values) is erroneous.
The present invention has been made in view of the above problem in the prior art, and its object is to provide a method for estimating, with higher accuracy, which part of a given set contains errors.
To achieve the above object, the present invention provides the following means.
(1) A data correction/tampering determination method executed by a computer, wherein the computer executes the following steps:
a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a step of converting each pre-conversion numerical value in the set by two or more different conversion methods, and tallying, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a step of identifying, for each numerical value in the numerical set referred to by the same ID, the deviation of the value of the specific digit of its converted numerical values from the Benford distribution, and calculating a cumulative deviation degree by accumulating that deviation over the number of conversion methods; and
a step of determining the possibility that a specific numerical value has been corrected or tampered with by comparing the cumulative deviation degree among the numerical values.
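The steps of (1) can be sketched in Python as follows, taking radix conversion as the conversion method and the first digit as the specific digit. This is a minimal sketch under stated assumptions, not the implementation of the original: the text here does not fix how the per-radix deviation is measured, so the sketch assumes it is the excess of the observed first-digit frequency over the Benford probability. The function names and sample values are hypothetical.

```python
import math
from collections import Counter


def leading_digit(n: int, k: int) -> int:
    """Most significant digit of a positive integer n written in radix k."""
    while n >= k:
        n //= k
    return n


def cumulative_deviation(values: dict, radixes) -> dict:
    """Map each ID to the deviation accumulated over all radixes.

    values: dict of ID -> positive integer (the pre-conversion values).
    """
    scores = {vid: 0.0 for vid in values}
    total = len(values)
    for k in radixes:
        # First digit of every value in radix k, and its distribution.
        digits = {vid: leading_digit(v, k) for vid, v in values.items()}
        counts = Counter(digits.values())
        for vid, d in digits.items():
            observed = counts[d] / total
            expected = math.log(1 + 1 / d, k)  # Benford probability in radix k
            # Assumption: only an excess over the Benford probability is accumulated.
            scores[vid] += max(0.0, observed - expected)
    return scores


# Hypothetical data; radixes 3 to 16 are used here only as an example range.
values = {"id1": 112, "id2": 135, "id3": 170, "id4": 2450, "id5": 90}
scores = cumulative_deviation(values, range(3, 17))
for vid in sorted(scores, key=scores.get, reverse=True):
    print(vid, round(scores[vid], 4))
```

Values whose first digits are over-represented in many radixes end up with a high score and are listed first.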
(2) The method described in (1) above, wherein
the step of determining the possibility of correction/tampering compares the cumulative deviation degree among the numerical values and identifies one or more numerical values with a high cumulative deviation degree as numerical values likely to have been corrected or tampered with.
(3) The method described in (2) above, wherein
the step of determining the possibility of correction/tampering compares the cumulative deviation degree among the numerical values and identifies one or more numerical values whose cumulative deviation degree is equal to or higher than a predetermined threshold value as numerical values likely to have been corrected or tampered with.
(4) The method described in (3) above, wherein
the step of determining the possibility of correction/tampering further includes a step of determining the predetermined threshold value based on the cumulative deviation degrees of all the numerical values.
(5) The method described in (2) above, wherein
the step of determining the possibility of correction/tampering, when the cumulative deviation degrees are compared among the numerical values and sorted in descending order, identifies one or more numerical values ranked at or above a predetermined rank as numerical values likely to have been corrected or tampered with.
(6) The method described in (5) above, wherein
the step of determining the possibility of correction/tampering further includes a step of determining the predetermined rank based on the cumulative deviation degrees of all the numerical values.
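The selection rules of (3) to (6) can be illustrated with the short Python sketch below. It assumes the cumulative deviation degrees are already available as a dictionary of ID to score. The rule in `auto_threshold` (mean plus one standard deviation of all scores) is only an assumed example of deriving a threshold from all cumulative deviation degrees; the original does not specify the rule. All names and sample scores are hypothetical.

```python
import statistics


def rank_by_deviation(scores: dict) -> list:
    """IDs sorted in descending order of cumulative deviation degree."""
    return sorted(scores, key=scores.get, reverse=True)


def flag_by_threshold(scores: dict, threshold: float) -> list:
    """IDs whose cumulative deviation degree is at or above the threshold."""
    return [vid for vid in rank_by_deviation(scores) if scores[vid] >= threshold]


def flag_by_rank(scores: dict, top_n: int) -> list:
    """IDs ranked within the top_n positions."""
    return rank_by_deviation(scores)[:top_n]


def auto_threshold(scores: dict) -> float:
    """Assumed rule: mean plus one population standard deviation of all scores."""
    vals = list(scores.values())
    return statistics.mean(vals) + statistics.pstdev(vals)


scores = {"a": 0.1, "b": 0.9, "c": 0.5}  # hypothetical scores
print(flag_by_threshold(scores, 0.5))                     # ['b', 'c']
print(flag_by_rank(scores, 1))                            # ['b']
print(flag_by_threshold(scores, auto_threshold(scores)))  # ['b']
```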
(7) The method described in (1) above, wherein
the specific digit is the nth highest digit (n is an integer not exceeding m, the number of digits of the smallest converted numerical value).
(8) The method described in (7) above, wherein
the plurality of conversion methods are radix conversions using two or more different radixes.
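Radix conversion as used in (8) can be sketched in Python as follows. The function name `to_radix` and the digit alphabet are assumptions made for illustration; the sketch covers radixes 2 to 16 only.

```python
DIGITS = "0123456789ABCDEF"


def to_radix(n: int, k: int) -> str:
    """Write a non-negative integer n as a digit string in radix k (2 <= k <= 16)."""
    if n < 0 or not 2 <= k <= 16:
        raise ValueError("n must be non-negative and k must be between 2 and 16")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, k)  # r is the next lowest digit
        out.append(DIGITS[r])
    return "".join(reversed(out))


print(to_radix(255, 16))  # FF
print(to_radix(10, 3))    # 101
# The same value has a different first digit depending on the radix.
for k in (3, 10, 16):
    print(k, to_radix(422, k))
```

The first character of each returned string is the highest digit, which is the digit compared with the Benford distribution when n = 1 in (7).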
(9) The method described in (8) above, wherein
the two or more different radixes k are selected from those for which the numbers of digits of the maximum and minimum converted numerical values after radix conversion in the numerical set differ by 2.6 or more.
(10) The method described in (9) above, wherein
the radixes are selected from values of 3 or more.
(11) The method described in (1) above, wherein
the cumulative deviation degree is the number of converted numerical values that deviate from the Benford distribution among the plurality of converted numerical values obtained by converting a specific numerical value with the conversion methods.
(12) The method described in (1) above, wherein
the cumulative deviation degree is the ratio between the plurality of converted numerical values obtained by converting a specific numerical value with the conversion methods and those among them that deviate from the Benford distribution.
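The count-based and ratio-based variants of (11) and (12) might look like the following Python sketch. It assumes that a converted value "deviates" when the observed frequency of its first digit in that radix is higher than the Benford probability; this criterion and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def leading_digit(n: int, k: int) -> int:
    """Most significant digit of a positive integer n written in radix k."""
    while n >= k:
        n //= k
    return n


def deviation_count_and_ratio(value: int, all_values: list, radixes) -> tuple:
    """Return (number of deviating radixes, that number divided by all radixes)."""
    radixes = list(radixes)
    count = 0
    for k in radixes:
        counts = Counter(leading_digit(v, k) for v in all_values)
        d = leading_digit(value, k)
        observed = counts[d] / len(all_values)
        # Assumed criterion: observed frequency above the Benford probability.
        if observed > math.log(1 + 1 / d, k):
            count += 1
    return count, count / len(radixes)


data = list(range(1, 10))  # hypothetical set in which every first digit is equally frequent
print(deviation_count_and_ratio(9, data, [10]))  # (1, 1.0)
print(deviation_count_and_ratio(1, data, [10]))  # (0, 0.0)
```

In the uniform sample, the digit 9 appears more often than Benford's law predicts and the digit 1 less often, so only the value 9 is counted as deviating.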
(13) The method described in (1) above,
further comprising a step of additionally receiving a specific numerical value, wherein
the determining step calculates the cumulative deviation degree for the additionally received numerical value and compares it with the cumulative deviation degrees of the other numerical values, thereby determining in real time the possibility that the additionally received numerical value has been corrected or tampered with.
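The real-time variant of (13) could be approached as in the Python sketch below, which keeps the first-digit counts for each radix and scores each additionally received value against the values accumulated so far. The class name `StreamingScorer`, its interface, and the deviation measure (excess of the observed frequency over the Benford probability) are all assumptions for illustration, not part of the original.

```python
import math
from collections import Counter


class StreamingScorer:
    """Keeps per-radix first-digit counts and scores each newly received value."""

    def __init__(self, radixes):
        self.radixes = list(radixes)
        self.counts = {k: Counter() for k in self.radixes}
        self.total = 0

    @staticmethod
    def _leading_digit(n: int, k: int) -> int:
        while n >= k:
            n //= k
        return n

    def add(self, value: int) -> float:
        """Register a positive integer and return its cumulative deviation degree."""
        if value <= 0:
            raise ValueError("value must be a positive integer")
        self.total += 1
        score = 0.0
        for k in self.radixes:
            d = self._leading_digit(value, k)
            self.counts[k][d] += 1
            observed = self.counts[k][d] / self.total
            # Assumption: only an excess over the Benford probability is accumulated.
            score += max(0.0, observed - math.log(1 + 1 / d, k))
        return score


scorer = StreamingScorer(range(3, 17))
for v in (112, 135, 170, 2450, 90):  # hypothetical already-accumulated values
    scorer.add(v)
print(round(scorer.add(1999), 4))    # score of the additionally received value
```

The returned score can then be compared with the scores of the other values, for example against a threshold or a rank as in (3) and (5).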
(14) The method described in (1) above,
further comprising a step of excluding numerical values whose degree of deviation is high for reasons other than correction or tampering.
(15) A data correction/tampering determination system executed by a computer, having:
means by which the computer acquires a numerical set to be analyzed and stores each numerical value, together with an ID, in memory as a pre-conversion numerical value;
means by which the computer converts each pre-conversion numerical value in the set by two or more different conversion methods and tallies, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
means by which the computer identifies, for each numerical value in the numerical set referred to by the same ID, the deviation of the value of the specific digit of its converted numerical values from the Benford distribution, and calculates a cumulative deviation degree by accumulating that deviation over the number of conversion methods; and
means by which the computer determines the possibility of correction/tampering by comparing the cumulative deviation degree among the numerical values.
(16) The system described in (15) above, wherein
the means for determining the possibility of correction/tampering compares the cumulative deviation degree among the numerical values and identifies one or more numerical values with a high cumulative deviation degree as numerical values likely to have been corrected or tampered with.
(17) In the system according to claim 15,
the specific digit is the n-th digit from the top, where n is an integer no greater than m, the number of digits of the smallest of the converted numerical values.
(18) In the system according to claim 7,
the plurality of specific conversion methods are radix conversions using two or more different radixes.
(19) In the system described in (15) above,
the system further has means for additionally receiving a specific numerical value, and
the determination calculates the cumulative degree of deviation for the additionally received numerical value and compares it with the cumulative degrees of deviation of the other numerical values, thereby determining in real time the possibility that the additionally received numerical value has been corrected/tampered with.
(20) A computer software program product for data correction/tampering determination executed by a computer, the product being stored on a storage medium and having means for causing the computer to execute:
a step of acquiring a numerical set to be analyzed and storing each numerical value in memory, together with an ID, as a pre-conversion numerical value;
a step of converting each pre-conversion numerical value in the set by two or more different conversion methods and tallying, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a step of identifying, for each numerical value referenced by the same ID in the numerical set, the deviation of the value of the specific digit of the converted numerical value from the Benford distribution, and calculating a cumulative degree of deviation by accumulating that deviation over the number of conversion methods; and
a step of determining the possibility of correction/tampering by comparing the cumulative degrees of deviation among the numerical values.
Features of the present invention other than those described above will become apparent to those skilled in the art from the following description of the embodiments and from the drawings.
FIG. 1 is a schematic diagram showing the distribution of population by administrative division in Japan.
FIG. 2 is an explanatory diagram showing the changes in the statistical data before and after correction.
FIG. 3 is likewise an explanatory diagram showing the changes in the statistical data before and after correction.
FIG. 4 is a schematic diagram showing the relationship between the statistical data before correction and the Benford distribution.
FIG. 5 is a schematic diagram showing the relationship between the statistical data after correction and the Benford distribution.
FIG. 6 is a system configuration diagram showing an embodiment of the present invention.
FIG. 7 is a flowchart showing the operation of the same embodiment.
FIG. 8 is a schematic diagram showing the relationship with the Benford distribution for base-5 numbers.
FIG. 9 is a schematic diagram showing the relationship with the Benford distribution for base-6 numbers.
FIG. 10 is an explanatory diagram explaining the degree of deviation for a specific numerical value.
FIG. 11 is a schematic diagram showing an output list for evaluation.
FIG. 12 is a graph showing the indices of the evaluation performance of the present invention.
FIG. 13 is a graph showing the indices of the evaluation performance of the maximum likelihood method used for comparison.
FIG. 14 is a table showing the method according to the present invention and the methods compared with it.
FIG. 15 is a diagram showing the degree of agreement when only base 10 is used.
FIG. 16 is a diagram showing an analysis of the case where the maximum likelihood method performs better.
FIG. 17 is a graph showing the evaluation performance on the modified data.
FIG. 18 is likewise a graph showing the evaluation performance on the modified data.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. Before that, the hypotheses and findings of the inventors that led to the completion of the present invention are described in detail.
(The problem of inflated counts of employees with disabilities)
To obtain the findings underlying the present invention, the inventors first examined whether the numbers of employed persons with disabilities at Japanese public institutions follow Benford's law. The subject was an incident in which the numbers of employed persons with disabilities published by the Ministry of Health, Labour and Welfare had been inflated.
The Ministry of Health, Labour and Welfare compiles the employment status of persons with disabilities at public institutions as of June 1 every year and publishes it as the "Tabulated results of the employment status of persons with disabilities". The tabulated results at issue reported the employment status of persons with disabilities up to 2017.
However, suspicion arose that the numbers of employed persons with disabilities had been improperly counted, and the reported contents were therefore re-inspected.
As a result, the Ministry of Health, Labour and Welfare published the tabulated results after re-inspection on October 22, 2018.
The present inventors carried out their verification using the improper tabulated results and the correct tabulated results obtained after re-inspection.
This is explained in detail below.
The articles "Results of the re-inspection of the appointment and dismissal status of persons with disabilities at national administrative agencies as of June 1, 2017", published on August 28, 2018, and "Results of the re-inspection of the appointment and dismissal status of persons with disabilities at legislative and judicial bodies as of June 1, 2017", published on September 7 of the same year, give the numbers of employed persons with disabilities at public institutions in tabular form.
The number of employed persons with disabilities is the number of persons with disabilities employed by a single institution.
The above material lists 433 institutions in total (c = 1 to 433), covering the following kinds of institution.
・Administrative agencies
・Legislative bodies
・Judicial bodies
・Prefectural governors' departments
・Other prefectural bodies
・Prefectural boards of education
・Incorporated administrative agencies, etc.
The re-investigation revealed that, in the initial publication, the numbers of employed persons with disabilities had been altered (falsified) so that the total was inflated by more than 1,000 persons, as shown in FIG. 2. As a result, it came to light that the employment of persons with disabilities by government offices, which the initial announcement had reported as exceeding the statutory employment rate of 2.3% (H29), was in fact far below that rate.
(Analysis by Benford's law)
The inventors therefore examined whether the numerical set after falsification and the numerical set before falsification each follow Benford's law.
The analyzed numerical set contains 422 cells (numerical values), excluding blank cells and cells whose value is 0. Of these, 167 cells (167/422 = 0.395, about 40%) differ between the pre-correction data (as published before the falsification was discovered) and the post-correction data (as corrected after the falsification was discovered).
First, it was analyzed whether the set of cells holding the number of persons with disabilities before correction and the set of cells holding the number after correction each follow Benford's law (base 10).
For the cells holding the number of persons with disabilities, the first (most significant) digit was counted and compared with the theoretical distribution. A chi-square test (significance level 5%) was used to judge whether the set of cells follows Benford's law.
The test statistic χ² was then obtained by the following formula, where O_d is the observed frequency and P_d is the expected frequency.
Equation 1
Figure JPOXMLDOC01-appb-I000001
The above examination revealed the following.
・The distribution before correction does not follow Benford's law, as shown in FIG. 4 (χ² test, P <= 0.01).
・The distribution of the number of employed persons with disabilities after correction cannot be said to deviate from Benford's law, as shown in FIG. 5.
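The test described above can be sketched in Python as follows. Equation 1 survives only as an image reference in this text, so the sketch assumes the usual Pearson form, the sum over digits d of (O_d - P_d)² / P_d, with P_d taken as the expected count n·log10(1 + 1/d). The function names and the two toy data sets are illustrative assumptions and are not the employment data.

```python
import math
from collections import Counter


def first_digit(x: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(x)[0])


def benford_chi_square(values) -> float:
    """Pearson chi-square statistic of leading decimal digits against Benford's law."""
    values = [v for v in values if v > 0]  # blank and zero cells are excluded
    n = len(values)
    observed = Counter(first_digit(v) for v in values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count of digit d
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2


# Critical value of the chi-square distribution, 8 degrees of freedom, 5% level.
CRITICAL_5PCT_DF8 = 15.507

powers_of_two = [2 ** i for i in range(1, 101)]  # toy data, close to Benford
uniform_digits = list(range(1, 10)) * 100        # toy data, every digit equally often

for name, data in [("powers of two", powers_of_two), ("uniform digits", uniform_digits)]:
    stat = benford_chi_square(data)
    verdict = "rejected" if stat > CRITICAL_5PCT_DF8 else "not rejected"
    print(f"{name}: chi2 = {stat:.2f}, Benford {verdict} at the 5% level")
```

Nine leading digits give 8 degrees of freedom, which is why the 5% critical value 15.507 is used. As the text notes, a test of this kind only says whether the set as a whole follows Benford's law, not which values are suspect.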
(Result of processing with base 10 only, and the inventors' hypotheses and findings)
From the above analysis, the inventors obtained the following findings.
(1) A Benford analysis that uses only base 10 can show whether a given set as a whole follows Benford's law, but it cannot determine which numerical values may have been falsified.
(2) On the other hand, for a set of natural numerical data, Benford's law may also hold in base-k notation with radixes k other than 10.
The present invention has been made on the basis of these findings and proposes a method that applies Benford's law to estimate the erroneous locations in numerical data.
(Configuration of the present invention)
The present invention was completed through intensive verification based on the hypothesis and finding that, for a set of natural numerical data, Benford's law also holds in base-k notation with radixes k other than 10.
FIG. 6 shows a system according to an embodiment of the present invention.
In this system 1, a data storage unit 6 and a program storage unit 7 are connected to a bus 5, to which a CPU 2, a RAM 3, and an input/output unit 4 are also connected.
The data storage unit 6 stores numerical set data 9, converted numerical set data 10 for each radix, numerical distribution data 11 for each radix, and numerical correctability evaluation results 12.
The program storage unit 7 holds the following units.
・A numerical set acquisition unit 14, which acquires the numerical set to be analyzed and stores each numerical value, together with an ID, as a pre-conversion numerical value.
・A radix conversion processing unit 15, which converts each pre-conversion numerical value in the set by two or more different conversion methods, thereby generating a plurality of converted numerical sets, one per conversion method.
・A numerical distribution generation processing unit 16, which obtains, for each of the pre-conversion and converted numerical sets, the distribution of the values of a specific digit of the numerical values in that set.
・A deviation accumulation processing unit 17, which, for each numerical value referenced by the same ID in the numerical set, calculates the degree of deviation from the Benford distribution of the value of the specific digit in each corresponding numerical distribution and accumulates that degree of deviation over the n distributions.
・A correctability determination processing unit 18, which determines the possibility of correction/tampering by sorting the numerical values according to the accumulated degree of deviation.
The data storage unit 6 and the program storage unit 7 are in practice storage devices such as a hard disk. Each of the above units is called by the CPU 2, loaded into the RAM 3, and executed in cooperation with other necessary programs such as the OS, and thereby functions as a component of the present invention.
Only the configuration relevant to the present invention is described above; basic programs such as the OS and other programs (including drivers) are omitted.
Each of the above units is described in detail below through its operation.
(Example 1)
FIG. 7 is a flowchart showing the processing performed by this system. The operation of the system is described below with reference to this flowchart (steps S1 to S5).
(Acquisition of the numerical set)
In this example, the numerical set to be analyzed is either a static set that already exists or a dynamic set that accumulates over time. A static set is used, for example, to find falsified values in past statistics such as those described above. A dynamic set is used, for example, to detect falsified information in real time among payment records such as credit card transactions. In the latter case, a newly entered value can be judged in real time, against the values already accumulated, as to whether it has been corrected or falsified.
The numerical set acquisition unit 14 assigns an ID to each numerical value in the set and stores it in the data storage unit 6 (numerical set 9) (step S1).
In the following, the case where the employment statistics described above are used as the numerical set 9 is taken as an example. In these statistics, the number of employed persons with disabilities r(c) of each of the 422 employing institutions V(c) (c = 1 to 422) is stored in association with the ID of the institution.
(Radix conversion)
Next, the radix conversion processing unit 15 converts each numerical value in the set 9 to each radix k from 3 to 16 and generates the radix-converted numerical sets (step S2).
For example, if in the above numerical set the number of employees r(c) of the institution V(c) = V(314) with ID 314 is "17", radix conversion of this value 17 gives the following.
  Base 3    122
  Base 4    101
  Base 5    32
  Base 6    25
  Base 7    23
  Base 8    21
  Base 9    18
  Base 10   17
  Base 11   16
  Base 12   15
  Base 13   14
The radix conversion processing unit 15 performs this conversion for all the numerical values r(c) in the set (the 422 values for c = 1 to 422) and stores the results in the data storage unit 6 for each radix k (converted numerical set data 10 for each radix).
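The conversion table for the value 17 can be reproduced with a short Python sketch. The helper name `to_base` and the use of letters for digit values of 10 and above are assumptions made here for illustration; the text does not say how such digits are written.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # digit symbols, an assumption


def to_base(x: int, k: int) -> str:
    """Write the non-negative integer x in base k (2 <= k <= 36)."""
    if not 2 <= k <= 36:
        raise ValueError("radix must be between 2 and 36")
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return "0"
    out = []
    while x > 0:
        x, remainder = divmod(x, k)  # peel off the least significant digit
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


# Reproduce the table for the value 17.
for k in range(3, 14):
    print(f"Base {k:<2}  {to_base(17, k)}")
```

Running the same helper over every value and every radix from 3 to 16 would give one converted set per radix, as in step S2.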
(Generation and verification of the numerical distribution)
Next, the numerical distribution generation processing unit 16 calculates the first (most significant) digit of each radix-converted value in the numerical set, written fsd(x, k), where x is the numerical value and k is the radix.
For the value x = 17, the fsd for each radix k is as follows.
  Base 3    1
  Base 4    1
  Base 5    3
  Base 6    2
  Base 7    2
  Base 8    2
  Base 9    1
  Base 10   1
  Base 11   1
  Base 12   1
  Base 13   1
Next, the numerical distribution generation processing unit 16 generates, for each radix k, the distribution of the number of occurrences of fsd over all the numerical values in the set (422 values) (step S3), and the deviation accumulation processing unit 17 accumulates the degree of deviation from Benford according to the generated distributions (step S4).
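A minimal Python sketch of fsd(x, k) and of the per-radix distribution of step S3 follows. The name `fsd` comes from the text; `fsd_distribution` and the choice to return relative frequencies rather than raw counts are assumptions made for illustration.

```python
from collections import Counter


def fsd(x: int, k: int) -> int:
    """First significant digit of the positive integer x written in base k."""
    if x <= 0 or k < 2:
        raise ValueError("x must be positive and k at least 2")
    while x >= k:
        x //= k  # drop the least significant base-k digit
    return x


def fsd_distribution(values, k: int) -> dict:
    """Relative frequency of each leading digit 1..k-1 among values, in base k."""
    counts = Counter(fsd(v, k) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, k)}


# Leading digits of 17 for the radixes listed in the text.
print([fsd(17, k) for k in range(3, 14)])

# Distribution for a tiny made-up set in base 5 (17 = 32, 18 = 33, 40 = 130).
print(fsd_distribution([17, 18, 40], 5))
```

Repeated integer division avoids building the full digit string, which is enough here because only the leading digit is needed.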
FIG. 8 shows, for all of the above numerical values, the tallied distribution of the occurrences of fsd for radix k = 5 (base 5). That is, the 422 values of fsd(x, 5) are tallied and plotted against the Benford distribution.
According to this, when one of the 422 values, x = 17, is converted to base 5, the first digit of the result "32" is 3 (fsd(17, 5) = 3), and the number of occurrences of this "3" is below the Benford distribution. When the count is below the Benford distribution, the possibility of correction is estimated to be low.
On the other hand, as shown in FIG. 9, when the value 17 is converted to base 6, the first digit of the result "25" is 2 (fsd(17, 6) = 2), and the number of occurrences of this "2" exceeds the Benford distribution. When the count exceeds the Benford distribution in this way, the possibility of correction is estimated to be high, and the degree of deviation from Benford of the first digit in this radix is calculated for the value "17".
Then, for each of the 422 values r(c), the degree of deviation from Benford is accumulated over all radixes k (3 to 16).
This method is explained with reference to FIG. 10.
In this figure, for the value 17, the radixes in which the first digit deviates from Benford are marked with black circles. Summing the degrees of deviation from the Benford distribution at these black circles over all radixes k = 3 to 16 gives the cumulative degree of deviation from Benford (step S4).
In this example, the cumulative degree of deviation of the value 17 in the above set is 0.6552.
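The accumulation of step S4 might be sketched as below. The text does not give the exact formula for the degree of deviation, so this sketch assumes it is the excess of the observed relative frequency of a value's leading digit over the Benford probability log_k(1 + 1/d), counted only when positive. The function names and the toy data are illustrative, and the sketch is not expected to reproduce the figure 0.6552 quoted above.

```python
import math
from collections import Counter


def fsd(x: int, k: int) -> int:
    """First significant digit of the positive integer x written in base k."""
    while x >= k:
        x //= k
    return x


def pben(k: int, d: int) -> float:
    """Benford probability of leading digit d in base k."""
    return math.log(1 + 1 / d, k)


def cumulative_deviation(values, radixes=range(3, 17)):
    """Per-value score: summed positive excess over Benford across all radixes.

    Assumption: deviation = observed frequency - Benford probability, if positive.
    """
    n = len(values)
    scores = [0.0] * n
    for k in radixes:
        digits = [fsd(v, k) for v in values]
        counts = Counter(digits)
        for i, d in enumerate(digits):
            excess = counts[d] / n - pben(k, d)
            if excess > 0:  # only over-represented digits add to the score
                scores[i] += excess
    return scores


# Toy data: the value 9 is deliberately over-represented.
values = [9] * 10 + [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
scores = cumulative_deviation(values)
print(f"score of 9: {scores[0]:.4f}, score of 1: {scores[10]:.4f}")
```

Equal values always receive equal scores under this definition, because the score depends only on the value's leading digits and on the set-wide digit counts.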
(Output)
Next, the determination unit 18 outputs all the institutions V(c) in the set as a ranking, as shown in FIG. 11, based on the degree of deviation accumulated above.
In this embodiment, the ranking is sorted in descending order of cumulative degree of deviation.
Since a higher position in this ranking means a higher possibility of correction (tampering), the ranking is compared with a threshold on rank (for example, 10th place or higher) or with a threshold on the cumulative value itself (for example, 0.5), and the IDs above the threshold are output.
This threshold may be specified by the user or determined automatically by the system.
For example, the values up to third place in the ranking (the institutions with ID = 31, 157, 46, 37) or the institutions with a cumulative value of 0.77 or more (the institutions with ID = 31, 157, 46, 37) are output as numerical values with a high possibility of tampering.
Re-examining the values of these institutions on the basis of this output makes it possible to find corrections/tampering more quickly and efficiently.
With this configuration, a ranked list of the values most likely to have been corrected or tampered with is obtained and the values with a high possibility of tampering are specifically identified, so it is no longer necessary to re-examine every value.
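The ranking and thresholding step can be illustrated with a small Python sketch. The function name `rank_suspects`, its parameters, and the scores in the usage example are all hypothetical; the scores are made up and are not the values shown in FIG. 11.

```python
def rank_suspects(scores_by_id, top_n=None, min_score=None):
    """Sort IDs by cumulative deviation, highest first, then apply a threshold.

    top_n keeps the first top_n entries; min_score keeps entries whose score
    is at least min_score. Either, both, or neither may be given.
    """
    ranked = sorted(scores_by_id.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        ranked = [(key, score) for key, score in ranked if score >= min_score]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Made-up scores for illustration only.
example_scores = {"id-a": 0.20, "id-b": 0.90, "id-c": 0.50, "id-d": 0.81}

print(rank_suspects(example_scores, top_n=2))         # rank threshold
print(rank_suspects(example_scores, min_score=0.77))  # score threshold
```

This simple version cuts the list at exactly top_n entries; a rank threshold that keeps tied scores together would return more entries than top_n when ties occur.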
(Example 2)
The degree of deviation described above may be expressed not only by the actual cumulative value but also by the number of black circles shown in FIG. 10 (the number of radixes in which the count was higher than the Benford distribution). An example of applying this to the above employment statistics is described below as Example 2.
In this example, the determination by the determination unit 18 is performed using the following functions.
・Pben(k, d)
  The probability, under Benford's law, that the first significant digit (FSD) in radix k is d.
・Pben(k, d) = log_k(1 + 1/d)
  For example, Pben(10, 1) = 0.301 (in base 10, the probability that the first digit is 1 is 0.301).
・fsd(x, k)
  A function giving d, the FSD of the numerical value x written in radix k.
  For example, fsd(123, 10) = 1 and fsd(18, 3) = 2.
・v(c)
  The number of employed persons with disabilities at institution c before correction.
  For example, v(2 (Ministry of Internal Affairs and Communications)) = 110.
・r(c)
  The number of employed persons with disabilities at institution c after correction.
  For example, r(2 (Ministry of Internal Affairs and Communications)) = 40.
・C_all
  The set of institutions c for which the number before correction v(c) is neither missing nor 0.
  C_all = {c1, c2, c3, ...} = {Cabinet Secretariat, Cabinet Legislation Bureau, Cabinet Office, ...}
  The number of elements of C_all (the number of institutions analyzed in this paper) is |C_all| = 422, where |S| denotes the number of elements of a set S.
・C_kd
  The set of institutions c that share the same FSD d when v(c) is expressed in radix k.
  C_kd = {c | fsd(v(c), k) = d}
  For example, C_{10,1} = {Consumer Affairs Agency, Ministry of Internal Affairs and Communications, Ministry of Foreign Affairs, ...}
  and C_{10,2} = {Imperial Household Agency, Ministry of Finance, Japan Tourism Agency, ...}.
・OverS(c)
  The sum of over(c, k) over a range of radixes k. In this paper the range (3, ..., 16) is used.
  This value is the number of radixes in which Benford's law was not followed.
Equation 2
Figure JPOXMLDOC01-appb-I000002
  For example, OverS(Ministry of Foreign Affairs) = 14 and OverS(Tokushima University) = 0.
With this calculation, each institution is assigned a value computed over the radixes k, and a ranking can be made from these values.
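The definitions above can be put together in a short Python sketch. Equation 2 is available only as an image reference, so the indicator over(c, k) is an assumption here: it is taken to be 1 when the share |C_kd| / |C_all| of the institution's leading digit d = fsd(v(c), k) exceeds Pben(k, d), and 0 otherwise, consistent with the description of the black circles in FIG. 10. The function `over_s` and the toy values are illustrative.

```python
import math
from collections import Counter


def fsd(x: int, k: int) -> int:
    """First significant digit of the positive integer x written in base k."""
    while x >= k:
        x //= k
    return x


def pben(k: int, d: int) -> float:
    """Pben(k, d) = log_k(1 + 1/d)."""
    return math.log(1 + 1 / d, k)


def over_s(values_by_id, radixes=range(3, 17)):
    """OverS(c): number of radixes in which c's leading digit is over-represented.

    Assumption: over(c, k) = 1 if |C_kd| / |C_all| > Pben(k, d), else 0.
    """
    ids = list(values_by_id)
    n = len(ids)  # |C_all|
    result = dict.fromkeys(ids, 0)
    for k in radixes:
        digit_of = {c: fsd(values_by_id[c], k) for c in ids}
        counts = Counter(digit_of.values())  # |C_kd| for each digit d
        for c in ids:
            d = digit_of[c]
            if counts[d] / n > pben(k, d):
                result[c] += 1
    return result


# Toy data: ten institutions report 9, ten others report varied values.
toy = {i: 9 for i in range(10)}
toy.update({10 + i: v for i, v in enumerate([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])})
scores = over_s(toy)
print(scores[0], scores[10])
```

With radixes 3 to 16 the score ranges from 0 to 14, which matches the two example values quoted in the text.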
(Evaluation and verification)
The following shows the verification of the results obtained with Example 2.
FIG. 12 shows the evaluation of the estimation performance of the present invention.
In this example, v(c) is judged to be an error when OverS(c) > i. The graph shows the evaluation values obtained as the threshold i is varied, taking the institutions c with v(c) ≠ r(c) as positive examples.
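The threshold evaluation described here amounts to computing Precision and Recall for the rule OverS(c) > i. A minimal sketch follows; the function name, the dictionary layout, and the four-institution toy data are assumptions made for illustration and do not come from the figures.

```python
def precision_recall(over_s, is_modified, threshold):
    """Precision and Recall of the rule 'flag c when OverS(c) > threshold'.

    is_modified[c] is True when v(c) != r(c), i.e. c is a positive example.
    """
    flagged = {c for c, score in over_s.items() if score > threshold}
    positives = {c for c, modified in is_modified.items() if modified}
    true_positives = len(flagged & positives)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall


# Made-up scores and labels for four institutions.
toy_scores = {"a": 14, "b": 12, "c": 3, "d": 0}
toy_labels = {"a": True, "b": False, "c": True, "d": False}

for i in (13, 11, -1):
    p, r = precision_recall(toy_scores, toy_labels, i)
    print(f"i = {i:>2}: Precision = {p:.2f}, Recall = {r:.2f}")
```

Lowering the threshold flags more institutions, which typically raises Recall at the cost of Precision.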
(Estimation performance of the maximum likelihood method)
To verify the estimation accuracy of the present invention, FIG. 13 shows the results of the maximum likelihood method used for comparison. In applying the maximum likelihood method, the known correction ratio of 0.395 (about 40%) described above was used, and whether C_kd contains many institutions c with v(c) ≠ r(c) was judged not from Benford's law but from the number of institutions with v(c) ≠ r(c) in C_kd.
(Estimation performance of the random determination method)
As a baseline, consider performance estimation by a random determination method, that is, a method that judges at random whether cell v(c) is an error. In this case the probability of a correct judgment is 0.395, which equals the Precision, and Recall = 1 when every cell is judged to be an error.
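As a worked check of these baseline figures, using only the counts stated earlier (167 cells that differ out of 422), the case in which every cell is flagged gives the following. The subscript "all" is notation introduced here for that case.

```latex
\[
\mathrm{Precision}_{\mathrm{all}} = \frac{167}{422} \approx 0.3957,
\qquad
\mathrm{Recall}_{\mathrm{all}} = \frac{167}{167} = 1
\]
```

The text quotes the first value as 0.395.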
(Comparison of the present invention with the maximum likelihood method and the random determination method)
FIG. 14 compares the estimation capability of the present invention with that of the maximum likelihood method and the random determination method.
According to this, the superiority of the present invention is clear.
(Performance evaluation using base 10 only)
For comparison, the performance evaluation was also run using base 10 only; the result is shown in FIG. 15.
This shows that base 10 alone is not effective.
(Modification for the case where the maximum likelihood method performs better)
On the other hand, as shown in FIG. 16, looking at the number of occurrences of v(c) = x in C_all, the value "9" occurs far more often than the neighboring values. Examining the institutions with v(c) = 9 showed that 10 of the 17 institutions were prefectural police headquarters. To deal with this, the present invention may modify the data so that values with a very large number of occurrences are judged to be "errors".
That is, all the police headquarters, which account for the frequent value, were removed, and for the data of the remaining 376 institutions the estimation performance of the method according to the present invention (the proposed method, OverS(c)) was compared with that of the maximum likelihood method (WrongS(c)) (FIG. 17).
Here, Pb is the Precision of the proposed method, Rb is the Recall of the proposed method, Pml is the Precision of the maximum likelihood method, and Rml is the Recall of the maximum likelihood method. As a result, the estimation performance of the proposed method at thresholds i = 12 and 11 improved, while that of the maximum likelihood method changed little. This shows that, in the present invention, removing values that deviate from Benford's law for reasons other than correction yields higher estimation performance.
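One way to spot a value that stands out from its neighbors, as "9" does in FIG. 16, is sketched below. The text gives no numerical criterion, so the rule used here (a count more than `ratio` times the mean count of the values within `window` on either side) is purely a hypothetical illustration, as are the function name and the toy data. Note also that the inventors removed a category of institution (police headquarters) rather than every institution reporting the value.

```python
from collections import Counter


def prominent_values(values, window=2, ratio=3.0):
    """Values whose occurrence count stands out from neighboring values.

    Hypothetical rule: count > ratio * mean count of the values within
    `window` on either side, with the mean floored at 1.0.
    """
    counts = Counter(values)
    flagged = []
    for x, n in counts.items():
        neighbours = [counts.get(x + offset, 0)
                      for offset in range(-window, window + 1) if offset != 0]
        baseline = sum(neighbours) / len(neighbours)
        if n > ratio * max(baseline, 1.0):
            flagged.append(x)
    return sorted(flagged)


# Toy data in which 9 occurs far more often than 7, 8, 10 and 11.
toy_values = [7] * 2 + [8] * 3 + [9] * 17 + [10] * 2 + [11] * 3
flagged = prominent_values(toy_values)
remaining = [v for v in toy_values if v not in flagged]
print(flagged, len(remaining))
```

A flagged value would still need manual review to confirm that the excess has a cause other than correction before any data is excluded.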
(About radix range)
In this embodiment, radices k = 3, ..., 16 were selected and the deviation from Benford's law was estimated for each; however, the invention is not limited to this range and can in principle handle any radix of 3 or more.
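As an illustration of what an evaluation in radix k involves, the sketch below extracts the most significant digit of a value written in base k and gives the generalized Benford probability log_k(1 + 1/d) for that digit. The function names are hypothetical. In base 2 every positive integer has leading digit 1, so the leading digit carries no information, which is consistent with the lower limit of 3 stated above.

```python
import math

def leading_digit(n, k):
    """Most significant digit of the positive integer n written in base k."""
    if n <= 0 or k < 2:
        raise ValueError("n must be positive and k at least 2")
    while n >= k:
        n //= k
    return n

def benford_probability(d, k):
    """Benford probability of leading digit d (1 <= d < k) in base k."""
    return math.log(1 + 1 / d, k)
```

For any radix k the probabilities over d = 1, ..., k - 1 sum to 1, because the product of the terms (d + 1) / d telescopes to k.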
(Selection of maximum radix)
The maximum value of the radix k was set to 16 in this embodiment because Benford's law in radix k was applied only when the maximum and minimum values of the numerical data differ by 2.6 digits or more.
Equation 3

Figure JPOXMLDOC01-appb-I000003
On the other hand, the inventors varied the maximum radix from 17 to 36, but the estimation performance did not change. A maximum radix of 16 is therefore considered sufficient in this embodiment.
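The following sketch shows one way a digit-span condition of this kind could be checked. It is not a reproduction of Equation 3, which is not recoverable from this text. It assumes, purely for illustration, that the "difference in digits" in radix k is measured as log_k(max) - log_k(min); the function names and the candidate range are likewise hypothetical.

```python
import math

def digit_span(values, k):
    """Assumed measure: log_k(max) - log_k(min) over the positive values."""
    positives = [v for v in values if v > 0]
    return math.log(max(positives) / min(positives), k)

def admissible_radices(values, candidates=range(3, 37), min_span=2.6):
    """Radices whose digit span reaches min_span under the assumption above."""
    return [k for k in candidates if digit_span(values, k) >= min_span]
```

Under this reading, a hypothetical data set ranging from 1 to 1500 would admit radices 3 to 16 and exclude 17 and above, since the span shrinks as the radix grows.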
(Selective use of small radix)
For a numerical set that does not follow Benford's law in base 9 (= 3²), it is intuitively questionable whether an evaluation in base 3 should be performed in addition. The inventors therefore tested the estimation performance of the method of the present invention while varying the lower limit of the radix (for the case where the precision is maximal). The test results, however, show no particular effect from restricting the lower limit of the radix (FIG. 18).
(Effect of the present invention)
In the embodiment of the present invention, it was first confirmed, for the errors in the aggregated results on the employment of persons with disabilities that came to light in 2018, that the data before correction do not follow Benford's law, whereas the data after correction cannot be said to deviate from it. Next, the method of the present invention was proposed, which estimates the locations of errors in numerical data using Benford's law in radix k (k = 3, ..., 16). Finally, the estimation performance of the proposed method was evaluated using the data before and after correction.
According to the present invention, it has become possible to estimate some of the locations that contain many errors.
The present invention is also applicable to numerical set data other than the number of employed persons with disabilities.
It is also possible to capture numerical data arriving as a time series and to determine in a timely manner which time period contains incorrect values.
In this case, by verifying at fixed intervals (for example, every minute) the numerical values input during the preceding minute using the method of the present invention, it is possible to determine whether that time period contains tampered values and to identify the values most likely to have been tampered with.
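The windowing step for such time-series use can be sketched as follows. The sketch only groups timestamped values into fixed one-minute windows; each window's values would then be passed to the detection method. The function name, the event format (timestamp in seconds, value) and the parameter `window_seconds` are assumptions made for illustration.

```python
from collections import defaultdict

def group_by_window(events, window_seconds=60):
    """Bucket (timestamp_seconds, value) pairs into fixed-length windows.

    Returns a dict mapping each window's start time to the list of values
    received in the interval [start, start + window_seconds).
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        start = int(timestamp // window_seconds) * window_seconds
        windows[start].append(value)
    return dict(windows)
```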
The present invention is not limited to the embodiment described above and can be modified in various ways without departing from the gist of the invention.
For example, in the above embodiment the plurality of conversion methods applied to each numerical value of the numerical set were radix conversions; however, the conversion is not limited to radix conversion, and any conversion method that follows a specific rule may be used.
Further, in the above embodiment the predetermined digit analyzed for deviation from the Benford distribution was the most significant digit; however, it is not limited to this and may be the n-th digit from the top (n being an integer not greater than m, the number of digits of the smallest converted numerical value).
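Selecting the n-th digit from the top can be sketched as below; the function names are hypothetical. The condition that n does not exceed m, the digit count of the smallest converted value, guarantees that every value in the set actually has an n-th digit.

```python
def digits_in_base(n, k):
    """Digits of the positive integer n in base k, most significant first."""
    if n <= 0 or k < 2:
        raise ValueError("n must be positive and k at least 2")
    digits = []
    while n > 0:
        digits.append(n % k)
        n //= k
    return digits[::-1]

def nth_digit(n, k, position):
    """Digit at the 1-based position from the top, or None if n is too short."""
    digits = digits_in_base(n, k)
    return digits[position - 1] if 1 <= position <= len(digits) else None
```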
1 ... System
2 ... CPU
3 ... RAM
4 ... Input/output unit
5 ... Bus
6 ... Data storage unit
7 ... Program storage unit
9 ... Numerical set data
10 ... Converted numerical set data
11 ... Numerical distribution data
12 ... Numerical correctability evaluation result
14 ... Numerical set acquisition unit
15 ... Radix conversion processing unit
16 ... Numerical distribution generation processing unit
17 ... Deviation degree accumulation processing unit
18 ... Correctability determination processing unit

Claims (20)

  1. A data correction / tampering determination method executed by a computer, wherein the computer executes the following steps:
     a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
     a step of converting each pre-conversion numerical value included in the set by two or more different conversion methods and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
     a step of identifying, for each numerical value referred to by the same ID in the numerical set, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating that deviation over the number of conversion methods; and
     a step of determining the possibility that a specific numerical value has been corrected / tampered with by comparing the cumulative deviation degrees among the numerical values.
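The steps of claim 1 can be sketched as follows for the case where the conversion methods are radix conversions and the specific digit is the most significant digit. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the `tolerance` parameter, and in particular the deviation rule (a digit counts as deviating in radix k when its observed share among all values exceeds the Benford share log_k(1 + 1/d) by more than `tolerance`) are introduced here for illustration.

```python
import math
from collections import Counter

def leading_digit(n, k):
    """Most significant digit of the positive integer n written in base k."""
    while n >= k:
        n //= k
    return n

def cumulative_deviation(values_by_id, radices=range(3, 17), tolerance=0.0):
    """For each ID, count the radices in which its leading digit deviates.

    values_by_id: dict mapping an ID to a positive integer value.
    Assumed rule: a digit deviates in radix k when its observed share exceeds
    the Benford share log_k(1 + 1/d) by more than tolerance.
    """
    scores = {item_id: 0 for item_id in values_by_id}
    total = len(values_by_id)
    for k in radices:
        # Convert every value and tally the leading digits for this radix.
        digits = {i: leading_digit(v, k) for i, v in values_by_id.items()}
        observed = Counter(digits.values())
        over = {d for d, count in observed.items()
                if count / total > math.log(1 + 1 / d, k) + tolerance}
        # Accumulate one point for every ID whose digit is over-represented.
        for i, d in digits.items():
            if d in over:
                scores[i] += 1
    return scores
```

The resulting scores are then compared among the IDs; the values with the highest scores are the candidates for correction / tampering.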
  2. The method according to claim 1, wherein
     the step of determining the possibility of correction / tampering compares the cumulative deviation degrees among the numerical values and identifies one or more numerical values having a high cumulative deviation degree as numerical values likely to have been corrected / tampered with.
  3. The method according to claim 2, wherein
     the step of determining the possibility of correction / tampering compares the cumulative deviation degrees among the numerical values and identifies one or more numerical values having a cumulative deviation degree equal to or greater than a predetermined threshold as numerical values likely to have been corrected / tampered with.
  4. The method according to claim 3, wherein
     the step of determining the possibility of correction / tampering further comprises a step of determining the predetermined threshold based on the cumulative deviation degrees of all the numerical values.
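Threshold-based identification as in claims 3 and 4 can be sketched as below. A fixed threshold corresponds to values such as i = 12 or i = 11 used in the embodiment. The rule used here when no threshold is given (mean plus one population standard deviation of all scores) is only an assumed example of deriving the threshold from the cumulative deviation degrees of all values; the function name is hypothetical.

```python
from statistics import mean, pstdev

def flag_by_threshold(scores, threshold=None):
    """Return the IDs whose cumulative deviation degree reaches the threshold.

    scores:    dict mapping an ID to its cumulative deviation degree.
    threshold: fixed cutoff; if None, an assumed rule derives it from all
               scores as mean + one population standard deviation.
    """
    if threshold is None:
        values = list(scores.values())
        threshold = mean(values) + pstdev(values)
    return {i for i, s in scores.items() if s >= threshold}
```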
  5. The method according to claim 2, wherein
     the step of determining the possibility of correction / tampering, when the cumulative deviation degrees are compared among the numerical values and sorted in descending order, identifies one or more numerical values whose position in the sort order is at or above a predetermined rank as numerical values likely to have been corrected / tampered with.
  6. The method according to claim 5, wherein
     the step of determining the possibility of correction / tampering further comprises a step of determining the predetermined rank based on the cumulative deviation degrees of all the numerical values.
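Rank-based identification as in claim 5 can be sketched as follows; the function name and the parameter `top_n` are hypothetical. Because Python's sort is stable, IDs with equal scores keep their original relative order, which is one possible tie-breaking choice among several.

```python
def flag_top_ranked(scores, top_n):
    """Return the top_n IDs after sorting by cumulative deviation, highest first.

    scores: dict mapping an ID to its cumulative deviation degree.
    """
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]
```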
  7. The method according to claim 1, wherein
     the specific digit is the n-th digit from the top (n being an integer not greater than m, the number of digits of the smallest of the converted numerical values).
  8. The method according to claim 7, wherein
     the plurality of specific conversion methods are radix conversions using two or more different radices.
  9. The method according to claim 8, wherein
     the two or more different radices k are selected from those for which the numbers of digits of the maximum and minimum converted numerical values after radix conversion in the numerical set differ by 2.6 or more.
  10. The method according to claim 9, wherein
     the radices are selected from values of 3 or more.
  11. The method according to claim 1, wherein
     the cumulative deviation degree is, for a specific numerical value, the number of converted numerical values that deviate from the Benford distribution among the plurality of converted numerical values obtained by the specific conversion methods.
  12. The method according to claim 1, wherein
     the cumulative deviation degree is, for a specific numerical value, the ratio between the plurality of converted numerical values obtained by the specific conversion methods and those converted numerical values among them that deviate from the Benford distribution.
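The two forms of cumulative deviation degree in claims 11 and 12 can be compared in the sketch below. The input format (one boolean per conversion method, true when the converted value deviates from the Benford distribution) and the function name are assumptions; the ratio is taken here as deviating conversions divided by all conversions, which is one reading of claim 12.

```python
def deviation_count_and_ratio(deviates_by_radix):
    """Return (count, ratio) of deviating conversions for a single value.

    deviates_by_radix: dict mapping a radix to True when the value's converted
                       digit deviates from the Benford distribution there.
    """
    count = sum(1 for flag in deviates_by_radix.values() if flag)
    ratio = count / len(deviates_by_radix) if deviates_by_radix else 0.0
    return count, ratio
```

The ratio form stays comparable between values when the number of applicable conversion methods differs from one value to another, whereas the count form does not.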
  13. The method according to claim 1,
     further comprising a step of additionally receiving a specific numerical value, wherein
     the determining step calculates the cumulative deviation degree for the additionally received numerical value and compares it with the cumulative deviation degrees of the other numerical values, thereby determining in real time the possibility that the additionally received numerical value has been corrected / tampered with.
  14. The method according to claim 1,
     further comprising a step of excluding numerical values whose degree of deviation is high for reasons other than correction / tampering.
  15. A system executed by a computer, being a data correction / tampering determination system comprising:
     means by which the computer acquires a numerical set to be analyzed and stores each numerical value, together with an ID, in memory as a pre-conversion numerical value;
     means by which the computer converts each pre-conversion numerical value included in the set by two or more different conversion methods and aggregates, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
     means by which the computer identifies, for each numerical value referred to by the same ID in the numerical set, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculates a cumulative deviation degree by accumulating that deviation over the number of conversion methods; and
     means by which the computer determines the possibility of correction / tampering by comparing the cumulative deviation degrees among the numerical values.
  16. The system according to claim 15, wherein
     the means for determining the possibility of correction / tampering compares the cumulative deviation degrees among the numerical values and identifies one or more numerical values having a high cumulative deviation degree as numerical values likely to have been corrected / tampered with.
  17. The system according to claim 15, wherein
     the specific digit is the n-th digit from the top (n being an integer not greater than m, the number of digits of the smallest of the converted numerical values).
  18. In the system according to claim 7,
     a method characterized in that the plurality of specific conversion methods are radix conversions using two or more different radices.
  19. The system according to claim 15,
     further comprising means for additionally receiving a specific numerical value, wherein
     the determining step calculates the cumulative deviation degree for the additionally received numerical value and compares it with the cumulative deviation degrees of the other numerical values, thereby determining in real time the possibility that the additionally received numerical value has been corrected / tampered with.
  20. A computer software program product for data correction / tampering determination executed by a computer,
     the product being stored on a storage medium and comprising means for causing the computer to execute:
     a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
     a step of converting each pre-conversion numerical value included in the set by two or more different conversion methods and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
     a step of identifying, for each numerical value referred to by the same ID in the numerical set, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating that deviation over the number of conversion methods; and
     a step of determining the possibility of correction / tampering by comparing the cumulative deviation degrees among the numerical values.
PCT/JP2020/022846 2019-06-13 2020-06-10 Detection method for correction location in number set and system for same WO2020250930A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962861111P 2019-06-13 2019-06-13
US62/861,111 2019-06-13

Publications (1)

Publication Number Publication Date
WO2020250930A1 (en) 2020-12-17

Family

ID=73781403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022846 WO2020250930A1 (en) 2019-06-13 2020-06-10 Detection method for correction location in number set and system for same

Country Status (1)

Country Link
WO (1) WO2020250930A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332184A1 (en) * 2009-06-30 2010-12-30 Sap Ag Determining an encoding type of data
JP2014160344A (en) * 2013-02-19 2014-09-04 Nippon Telegr & Teleph Corp <Ntt> Bot determination device and method and program and numerical value aggregate distribution determination device
JP6463532B1 (en) * 2018-03-30 2019-02-06 株式会社Tkc Internal audit support device, internal audit support method, and internal audit support program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAKAMURA TETSUZA, CHUOKEIYAI-SAH: "CAAT Basic Course Learned from 108 cases", BENFORD ANALYSIS, no. 15, 1 October 2014 (2014-10-01), pages 385 - 399 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20821756

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20821756

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP