WO2020250930A1

WO2020250930A1 - Detection method for correction location in number set and system for same

Info

Publication number: WO2020250930A1
Application number: PCT/JP2020/022846
Authority: WO
Inventors: 廣川　佐千男; 祐輔戸▲崎▼; 鈴木　孝彦
Original assignee: 国立大学法人九州大学
Priority date: 2019-06-13
Filing date: 2020-06-10
Publication date: 2020-12-17

Abstract

[Problem] The purpose of the present invention is to provide a procedure for estimating, with a higher degree of accuracy, which section of a specified set has a mistake. [Solution] Provided is a data correction/falsification determination method which is executed by a computer, wherein the computer executes the following steps included in said data correction/falsification determination method: a step in which an analysis target number set is acquired, and each number is stored together with an ID in memory as a pre-conversion number; a step in which each of the pre-conversion numbers contained in the set are converted by two or more different conversion methods into post-conversion numbers, and the distribution of the value of a specified digit in the post-conversion numbers is summed up for each conversion method; a step in which, with respect to each of the numbers contained in the number set and referenced by the same ID, deviation in the value of the specified digit in the post-conversion numbers from the Benford distribution thereof is specified, and a cumulative degree of deviation is calculated by means of accumulating said deviation by the number of conversion methods; and a step in which the correctability/falsifiability of a specified number is determined by means of comparing the aforementioned cumulative degree of deviation to each of the numbers.

Description

Correction point detection method of numerical set and its system

The present invention relates to a method for detecting a corrected part of a numerical set using Benford's law and a system thereof.

(Benford's law)
In modern society, many decisions are made on the basis of data, so the credibility of that data is important.

Benford's law is known as a law that holds for sets of naturally occurring numerical data. It states that the first digits of the numbers in such a set occur with regular frequencies, so that if a set does not follow Benford's law, the set may contain some inconsistency.
This Benford's law has traditionally been used to detect fraudulent statistical data.

(What is Benford's law)
For a set of natural numerical data, the probability that the leading (most significant) digit is d (for example, d = 1 in the case of 123) is log_k (1 + 1/d), where k is the radix (k = 10 for decimal notation).
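The formula above can be evaluated directly. The following is a minimal sketch, not part of the original disclosure; the function name `pben` follows the notation used later in this description, and the range check on d is an added assumption.

```python
import math

def pben(k, d):
    """Benford probability that the leading digit is d when numbers are written in radix k."""
    if not 1 <= d <= k - 1:
        raise ValueError("d must satisfy 1 <= d <= k - 1")
    # math.log(x, base) returns the logarithm of x to the given base.
    return math.log(1 + 1 / d, k)

# Leading-digit probabilities in decimal notation (k = 10), rounded for display.
decimal_probs = {d: round(pben(10, d), 3) for d in range(1, 10)}
print(decimal_probs)
```

For any radix k, the probabilities for d = 1, ..., k - 1 sum to 1, because the sum of log_k ((d + 1)/d) telescopes to log_k (k).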

For example, the leading digits d of the populations of the administrative divisions in Japan follow the Benford distribution, as shown in FIG. 1.

(Study related to Benford)
Conventionally, the methods disclosed in Non-Patent Documents 1 and 2 below are known as methods for evaluating the credibility of numerical data using Benford's law.

First, Non-Patent Document 1 discloses a method in which Nigrini et al. apply Benford's law to accounting data to verify the credibility of the numerical data.

Non-Patent Document 2 discloses that Rauch et al. investigated the credibility of the economic data of EU member states using Benford's law. Their analysis of Eurostat data from 1999 to 2009 shows that the data reported by Greece deviate most from Benford's law.

However, the credibility evaluations using Benford's law described in the above non-patent documents merely determine whether the set as a whole is erroneous; they do not determine which part of the set (which numerical value) is erroneous.

The present invention has been made in view of the above problems in the prior art, and its object is to provide a method for estimating, with higher accuracy, which part of a specific set contains an error.

According to the present invention, the following means are provided in order to achieve the above object.

(1) A data correction / tampering determination method executed by a computer, wherein
the computer performs: a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a step of converting each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a step of specifying, for each numerical value in the numerical set referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating the deviation over the number of conversion methods; and
a step of determining the possibility of correction / falsification of a specific numerical value by comparing the cumulative deviation degree among the numerical values.

(2) In the method described in (1) above
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative deviation degree is compared between each numerical value, and one or a plurality of numerical values having a high cumulative deviation degree are specified as numerical values having a high possibility of being corrected / tampered with.

(3) In the method described in (2) above
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative divergence degree is compared between each numerical value, and one or a plurality of numerical values having a cumulative divergence degree equal to or higher than a predetermined threshold value are specified as numerical values having a high possibility of being corrected / tampered with.

(4) In the method described in (3) above
The step of determining the possibility of modification / tampering is
A method further comprising a step of determining the predetermined threshold value based on the cumulative deviation degree of all the numerical values.

(5) In the method described in (2) above
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative divergence degree is compared between the numerical values and sorted in descending order, and one or a plurality of numerical values whose sort order is higher than a predetermined order are specified as numerical values having a high possibility of being corrected / tampered with.

(6) In the method described in (5) above
The step of determining the possibility of modification / tampering is
A method further comprising a step of determining the predetermined order based on the cumulative deviation degree of all the numerical values.

(7) In the method described in (1) above
A method characterized in that the specific digit is the nth digit from the top (n is an integer no greater than m, the minimum number of digits among the converted numerical values).

(8) In the method described in (7) above
The plurality of specific conversion methods are characterized in that they are radix conversions using two or more different radixes.

(9) In the method described in (8) above
The method characterized in that the two or more different radix k is selected from those in which the maximum value and the minimum value of the converted numerical value after the radix conversion in the numerical set differ by 2.6 or more.

(10) In the method described in (9) above
A method characterized in that the radix is selected from a value of 3 or more.

(11) In the method described in (1) above
A method characterized in that the cumulative deviation degree is the number of converted numerical values that deviate from the Benford distribution among the plurality of converted numerical values obtained by converting a specific numerical value by the specific conversion methods.

(12) In the method described in (1) above
A method characterized in that the cumulative deviation degree is the ratio of the converted numerical values that deviate from the Benford distribution to the plurality of converted numerical values obtained by converting a specific numerical value by the specific conversion methods.

(13) In the method described in (1) above
further comprising
a step of receiving an additional specific numerical value, wherein,
in the determination step, the cumulative divergence degree is calculated for the additionally received numerical value and compared with the cumulative divergence degrees of the other numerical values, whereby the possibility that the additionally received numerical value has been corrected / tampered with is determined in real time.

(14) In the method described in (1) above
further,
A method characterized by having a process of excluding numerical values having a high degree of divergence for reasons other than correction / tampering.

(15) A data correction / tampering determination system executed by a computer, comprising:
means by which the computer acquires a numerical set to be analyzed and stores each numerical value, together with an ID, in memory as a pre-conversion numerical value;
means by which the computer converts each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregates, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
means by which the computer specifies, for each numerical value in the numerical set referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculates a cumulative divergence degree by accumulating the deviation over the number of conversion methods; and
means by which the computer determines the possibility of correction / tampering by comparing the cumulative divergence degree among the numerical values.

(16) In the system described in (15) above
The means for determining the possibility of modification / tampering is
A system characterized in that the cumulative divergence degree is compared between each numerical value, and one or a plurality of numerical values having a high cumulative divergence degree are specified as numerical values having a high possibility of being corrected / tampered with.

(17) In the system described in (15) above
A system characterized in that the specific digit is the nth digit from the top (n is an integer no greater than m, the minimum number of digits among the converted numerical values).

(18) In the system described in (17) above
The plurality of specific conversion methods are characterized in that they are radix conversions using two or more different radixes.

(19) In the system described in (15) above
further comprising
means for receiving an additional specific numerical value, wherein
the determination means calculates the cumulative divergence degree for the additionally received numerical value and compares it with the cumulative divergence degrees of the other numerical values, whereby the possibility that the additionally received numerical value has been corrected / tampered with is determined in real time.

(20) A computer software program product for determining data correction / tampering, executed by a computer,
the program product being stored on a storage medium and causing the computer to execute the following:
a process of acquiring the numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a process of converting each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a process of specifying, for each numerical value in the numerical set referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating the deviation over the number of conversion methods; and
a process of determining the possibility of correction / tampering by comparing the cumulative deviation degree among the numerical values.

Features of the present invention other than those described above will be clarified to those skilled in the art from the sections and drawings of the following embodiments.

FIG. 1 is a schematic diagram showing the distribution of population by administrative division in Japan.

FIG. 2 is an explanatory diagram showing changes in statistical data before and after correction.

FIG. 3 is also an explanatory diagram showing changes in the statistical data before and after the correction.

FIG. 4 is a schematic diagram showing the relationship between the statistical data before correction and the Benford distribution.

FIG. 5 is a schematic diagram showing the relationship between the corrected statistical data and the Benford distribution.

FIG. 6 is a system configuration diagram showing an embodiment of the present invention.

FIG. 7 is also a flowchart showing the operation.

FIG. 8 is also a schematic diagram showing the relationship between the quinary (base-5) numbers and the Benford distribution.

FIG. 9 is also a schematic diagram showing the relationship between the senary (base-6) numbers and the Benford distribution.

FIG. 10 is also an explanatory diagram for explaining the degree of deviation for a specific numerical value.

FIG. 11 is also a schematic diagram showing an output list for evaluation.

FIG. 12 is a graph showing each index related to the evaluation performance of the present invention.

FIG. 13 is a graph showing each index related to the evaluation performance by the maximum likelihood method, which is also a comparison target.

FIG. 14 is also a table showing the method according to the present invention and the method to be compared.

FIG. 15 is also a diagram showing the degree of agreement when only the decimal system is used.

FIG. 16 is also a diagram showing an analysis when the performance of the maximum likelihood method is exceeded.

FIG. 17 is also a graph showing the evaluation performance based on the corrected data.

FIG. 18 is also a graph showing the evaluation performance based on the corrected data.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. Before that, the hypothesis and findings of the inventors that led to the completion of the present invention will be described in detail.

(Problem of inflating disabled employees)
In order to obtain the findings of the present invention, we first examined whether the numbers of persons with disabilities employed by Japanese public institutions obey Benford's law. The subject was the incident in which the numbers of employed persons with disabilities announced by the Ministry of Health, Labor and Welfare had been inflated.

The Ministry of Health, Labor and Welfare summarizes the employment status of persons with disabilities in public institutions as of June 1 every year and publishes the "aggregation result of employment status of persons with disabilities". As for the aggregated results that became a problem, the employment status of persons with disabilities up to 2017 was announced.

However, there was a suspicion that the number of persons with employment disabilities was improperly included, so it was decided to re-examine the contents of the report.

As a result, the Ministry of Health, Labor and Welfare announced the aggregated results after the re-inspection on October 22, 2018.

The present inventors, for their part, carried out a verification using both the inappropriate tabulation results and the correct tabulation results after re-inspection.

The details are explained below.

The documents "Regarding the results of re-inspection of the status of appointment and dismissal of persons with disabilities as of June 1, 2017 by national government agencies", announced on August 28, 2018, and "Re-inspection results of the appointment and dismissal status of persons with disabilities as of June 1, 2017" for the legislature and judiciary, announced on September 7, 2018, list the number of persons with disabilities employed by public institutions in table format.

The number of persons with disabilities in employment is the number of persons with disabilities employed by one institution.

The total number of institutions listed in the above material was 433 (c = 1-433), which included the following institutions.

・ Administrative agencies ・ Legislative agencies ・ Judiciary agencies ・ Prefectural governor departments ・ Other prefectural agencies ・ Prefectural boards of education ・ Incorporated administrative agencies, etc.

As a result of the re-investigation, it was found that, in the initial announcement, the number of employed persons with disabilities had been corrected (tampered with) so as to be inflated by more than 1000 people, as shown in FIG. 2. It was thus discovered that the employment rate of persons with disabilities at government agencies, which the initial announcement had put above the statutory employment rate of 2.3% (H29), was actually significantly lower than that.

(Analysis by Benford's law)
Therefore, the inventors verified whether the numerical set data after falsification and the numerical set data before falsification each follow Benford's law.

The number of cells (numerical values) included in the analyzed numerical set data is 422 (excluding blank cells and cells with a numerical value of 0). Among them, about 40% (0.395) (167/422) of the cells had different numerical values before the correction (data before the falsification was discovered) and after the correction (data after the falsification was discovered and corrected).

First, it was analyzed whether the set of cells holding the numbers of persons with disabilities before correction and the set of cells holding the numbers after correction each follow Benford's law in decimal notation.

For each cell holding a number of persons with disabilities, the first digit was counted, and the resulting counts were compared with the theoretical distribution. A chi-square test (significance level 5%) was used to determine whether the set of cells obeyed Benford's law.
The test statistic χ² was calculated by the following formula, where O_d is the observed frequency and P_d is the expected frequency of the first digit d.

Equation 1

As a result of the above examination, the following was found:
- The distribution before correction does not follow Benford's law, as shown in FIG. 4 (χ² test, p <= 0.01).
- As shown in FIG. 5, it cannot be said that the corrected distribution of the number of employed persons with disabilities does not follow Benford's law.
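The first-digit test described above can be sketched as follows. Equation 1 is not reproduced in this text, so the sketch assumes the standard Pearson form, i.e. the sum over d = 1, ..., 9 of (O_d - P_d)² / P_d with P_d taken as the expected count under Benford's law. The function names and the sample data are illustrative assumptions and are not the employment statistics analyzed here.

```python
import math
from collections import Counter

# Critical value of the chi-square distribution with 8 degrees of freedom
# (nine digit classes minus one) at the 5% significance level.
CHI2_CRIT_DF8_5PCT = 15.507

def first_digit(x):
    """Leading decimal digit of a positive integer."""
    return int(str(x)[0])

def benford_chi_square(values):
    """Pearson chi-square statistic of decimal leading digits against Benford's law."""
    values = [v for v in values if v > 0]  # blank and zero cells are excluded
    n = len(values)
    observed = Counter(first_digit(v) for v in values)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count of digit d
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat

# Illustrative data only: powers of 2 are a classic Benford-like sequence.
sample = [2 ** i for i in range(1, 101)]
stat = benford_chi_square(sample)
print(stat, "rejects Benford" if stat > CHI2_CRIT_DF8_5PCT else "does not reject Benford")
```

A statistic above the critical value corresponds to the "does not follow Benford's law" outcome reported for the data before correction. A statistic below it only means that conformity cannot be rejected, which matches the wording used for the corrected data.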

(Results of processing only in decimal numbers and hypotheses / findings of the present inventors)
From the above analysis, the inventors obtained the following findings.
(1) Benford analysis using only decimal numbers can only determine whether or not a specific set follows Benford's law; it cannot determine which numerical value is likely to have been tampered with.
(2) On the other hand, for a set of natural numerical data, Benford's law may also hold in k-ary notation with a radix k other than 10.

The present invention has been made based on such knowledge, and basically applies Benford's law to propose a method of estimating an error part of numerical data.

(Structure of the present invention)
The present invention has been completed through diligent verification based on the hypothesis and knowledge that, for a set of natural numerical data, Benford's law also holds in k-ary notation with a radix other than 10.

FIG. 6 shows a system according to an embodiment of the present invention.

In this system 1, the data storage unit 6 and the program storage unit 7 are connected to the bus 5 to which the CPU 2, RAM 3, and input / output unit 4 are connected.

The data storage unit 6 stores the numerical set data 9, the converted numerical set data 10 for each radix, the numerical distribution data 11 for each radix, and the numerical correctability evaluation result 12.

Further, the program storage unit 7 stores the following: a numerical value set acquisition unit 14 that acquires the numerical set to be analyzed and stores each numerical value, together with an ID, as a pre-conversion numerical value; a radix conversion processing unit 15 that converts each pre-conversion numerical value included in the set by two or more different conversion methods, thereby generating a plurality of converted numerical sets, each consisting of the converted numerical values for one conversion method; a numerical distribution generation processing unit 16 that obtains, for each of the pre-conversion or post-conversion numerical sets, the distribution of the value of a specific digit of the numerical values included in that set; a deviation degree accumulation processing unit 17 that, for each numerical value in the numerical set referred to by the same ID, calculates the degree of deviation from Benford of the value of the specific digit based on the corresponding numerical distribution, and accumulates the n degrees of deviation thus obtained; and a correctability determination processing unit 18 that determines the possibility of correction / tampering by rearranging the numerical values based on the cumulative degree of deviation.

The data storage unit 6 and the program storage unit 7 are in practice storage devices such as a hard disk. Each of the above components is called by the CPU 2, loaded into the RAM 3, and executed in cooperation with other necessary programs such as an OS, whereby it functions as a component of the present invention.

Note that the above configuration is described only for the configuration related to the present invention, and the description of the basic program such as the OS and other programs (including the driver etc.) is omitted.

Hereinafter, each of the above configurations will be described in detail through its operation.

(Example 1)
FIG. 7 is a flowchart showing the operation processing by this system. Hereinafter, the system operation will be described with reference to this flowchart (steps S1 to S5).

(Acquisition of numerical set)
In this embodiment, the numerical set to be analyzed is either a static numerical set that already exists or a dynamic numerical set that accumulates from moment to moment. An example of using a static set is finding falsified numerical values in past statistical information, as described above. An example of using a dynamic set is detecting falsified information in real time from payment information such as credit card transactions. In the latter case, it is possible to determine in real time whether a newly input numerical value is corrected or falsified information in light of the numerical set already accumulated.
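The dynamic case can be illustrated with a small sketch that keeps running leading-digit counts for each radix and scores each newly received value against them. This is only one possible arrangement under stated assumptions: the class `StreamingScorer` and its methods are hypothetical names, the radix range 3 to 16 and the score (the number of radixes in which the leading digit of the new value is over-represented relative to Benford's law) are taken from the examples later in this description, and the input amounts are made up.

```python
import math
from collections import Counter

def fsd(x, k):
    """Leading digit of positive integer x written in radix k."""
    while x >= k:
        x //= k
    return x

def pben(k, d):
    """Benford probability of leading digit d in radix k."""
    return math.log(1 + 1 / d, k)

class StreamingScorer:
    """Hypothetical helper: per-radix leading-digit counts for a growing numerical set."""

    def __init__(self, radixes=range(3, 17)):
        self.radixes = list(radixes)
        self.counts = {k: Counter() for k in self.radixes}
        self.n = 0

    def add(self, x):
        """Accumulate one value into the counts."""
        for k in self.radixes:
            self.counts[k][fsd(x, k)] += 1
        self.n += 1

    def add_and_score(self, x):
        """Add x, then count the radixes in which its leading digit exceeds Benford."""
        self.add(x)
        score = 0
        for k in self.radixes:
            d = fsd(x, k)
            if self.counts[k][d] / self.n > pben(k, d):
                score += 1
        return score

scorer = StreamingScorer()
for amount in [9, 9, 9, 9]:      # made-up values accumulated so far
    scorer.add(amount)
print(scorer.add_and_score(1))   # score of a newly received value
```

Because only the counts are stored, the cost of scoring a new value depends on the number of radixes and not on the size of the accumulated set.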

The numerical value set acquisition unit 14 assigns an ID to each numerical value included in the set and stores it in the data storage unit 6 (numerical value set 9) (step S1).

In the following, the case where the above-mentioned employee number statistical information is used as the numerical set 9 will be described as an example. In this statistical information, the number of employees with disabilities r (c) of 422 employment institutions V (c) (c = 1 to 422) is stored in association with the ID of this institution.

(Radix conversion)
Then, the radix conversion processing unit 15 converts each numerical value included in the set 9 into each radix k from 3 to 16 to generate the radix-converted numerical sets (step S2).

For example, in the above numerical set, when the number of employment r (c) of the institution V (c) = V (314) of ID 314 is "17", the radix conversion processing of this numerical value 17 results in the following.

k = 3 (ternary): 122
k = 4 (quaternary): 101
k = 5 (quinary): 32
k = 6 (senary): 25
k = 7: 23
k = 8 (octal): 21
k = 9: 18
k = 10 (decimal): 17
k = 11: 16
k = 12 (duodecimal): 15
k = 13: 14
The radix conversion processing unit 15 performs the above conversion for all 422 numerical values r (c) (c = 1 to 422) included in the numerical set, and stores the results for each radix k in the data storage unit 6 (converted numerical set data 10 for each radix).
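The radix conversion of step S2 can be sketched with repeated division. The function names below are illustrative assumptions, and the letters A to F for digit values 10 to 15 are a display convention chosen here, not something specified in this description.

```python
DIGIT_CHARS = "0123456789ABCDEF"  # display symbols for digit values 0..15 (assumed convention)

def to_radix_digits(x, k):
    """Digits of positive integer x in radix k, most significant digit first."""
    if x <= 0 or k < 2:
        raise ValueError("x must be positive and k must be at least 2")
    digits = []
    while x > 0:
        x, remainder = divmod(x, k)
        digits.append(remainder)
    return digits[::-1]

def to_radix_string(x, k):
    """String form of x in radix k, for radixes up to 16."""
    return "".join(DIGIT_CHARS[d] for d in to_radix_digits(x, k))

# The value 17 from the example, converted with every radix from 3 to 16.
converted = {k: to_radix_string(17, k) for k in range(3, 17)}
print(converted)
```

For 17 this reproduces the conversions listed above for k = 3 to 13 and continues the same pattern through k = 16.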

(Generation and verification of numerical distribution)
Then, the numerical distribution generation processing unit 16 calculates the first digit value (fsd (x, k): x = numerical value, k = radix) of the numerical value after radix conversion in the numerical value set.

When the numerical value x = 17, the fsd corresponding to each radix k is as follows.

k = 3 (ternary): 1
k = 4 (quaternary): 1
k = 5 (quinary): 3
k = 6 (senary): 2
k = 7: 2
k = 8 (octal): 2
k = 9: 1
k = 10 (decimal): 1
k = 11: 1
k = 12 (duodecimal): 1
k = 13: 1
Next, the numerical distribution generation processing unit 16 generates, for each radix k, the distribution of the number of occurrences of fsd over all the numerical values (422 numerical values) included in the numerical set (step S3), and the deviation degree accumulation processing unit 17 accumulates the degree of deviation from Benford according to the generated distributions (step S4).
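Step S3 can be sketched as follows: a leading-digit function and a per-radix tally of relative frequencies. The name `fsd` follows the notation of this description; `fsd_distribution` and the small input list are illustrative assumptions.

```python
from collections import Counter

def fsd(x, k):
    """Leading (most significant) digit of positive integer x written in radix k."""
    if x <= 0 or k < 2:
        raise ValueError("x must be positive and k must be at least 2")
    while x >= k:
        x //= k  # drop the least significant digit
    return x

def fsd_distribution(values, k):
    """Relative frequency of each leading digit 1..k-1 over the set, for radix k."""
    counts = Counter(fsd(v, k) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, k)}

print(fsd(17, 5), fsd(17, 6))                 # leading digits of 17 in radix 5 and radix 6
print(fsd_distribution([17, 5, 6, 30], 5))    # made-up four-element set
```

The resulting frequencies are what would be compared with log_k (1 + 1/d) for each digit d in the following step.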

FIG. 8 shows the distribution of the number of occurrences of fsd for the radix k = 5 (quinary) over all the above numerical values. That is, the 422 values fsd (x, 5) are tallied and plotted together with the Benford distribution.

According to this, for example, when x = 17, one of the 422 numerical values, is converted into quinary notation, the first digit of "32" is 3 (fsd (17,5) = 3), and it can be seen that the number of appearances of this "3" is lower than the Benford distribution. When it is lower, the possibility of correction is estimated to be low.

On the other hand, as shown in FIG. 9, the first digit of "25", obtained by converting the numerical value 17 into senary (base-6) notation, is 2 (fsd (17,6) = 2), and the number of appearances of this "2" exceeds the Benford distribution. When it exceeds the Benford distribution, the possibility of correction is presumed to be high, and the degree of deviation from Benford of the first digit in this radix is calculated for the numerical value "17".

Then, for each of the 422 numerical values r (c), the degree of deviation from Benford is accumulated for all the radix k (3 to 16).

This method will be described with reference to FIG. 10.

In this figure, for the numerical value 17, each radix in which the value of the first digit deviates from Benford is indicated by a black circle. The cumulative degree of deviation can be calculated by adding up the deviations from the Benford distribution at the black circles over all the radixes k = 3 to 16 (step S4).

In this example, the cumulative value of the degree of deviation of the numerical value 17 in the above set is 0.6552.
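The accumulation of step S4 can be sketched as below. The exact deviation measure is not spelled out in this text, so the sketch assumes that the deviation of a value in radix k is the amount by which the observed relative frequency of its leading digit exceeds the Benford probability, and that radixes in which the frequency does not exceed Benford contribute nothing. The function name `cumulative_deviation` is illustrative. The figure 0.6552 above depends on the full 422-value data set and is not reproduced here.

```python
import math
from collections import Counter

def fsd(x, k):
    """Leading digit of positive integer x written in radix k."""
    while x >= k:
        x //= k
    return x

def pben(k, d):
    """Benford probability of leading digit d in radix k."""
    return math.log(1 + 1 / d, k)

def cumulative_deviation(values, radixes=range(3, 17)):
    """Cumulative excess over Benford for each value, in the order of the input list."""
    n = len(values)
    scores = [0.0] * n
    for k in radixes:
        digits = [fsd(v, k) for v in values]
        counts = Counter(digits)
        for i, d in enumerate(digits):
            excess = counts[d] / n - pben(k, d)  # assumed deviation measure
            if excess > 0:                        # corresponds to a "black circle"
                scores[i] += excess
    return scores

# Made-up set: three 9s and a single 1, scored over radixes 3 to 16.
print(cumulative_deviation([9, 9, 9, 1]))
```

Values that share the same number always receive the same score under this measure, since the score depends only on the value and on the distribution of the whole set.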

(output)
Next, the determination unit 18 outputs all the institutions V (c) included in the set by ranking as shown in FIG. 11 based on the degree of deviation accumulated above.

In this embodiment, the rankings are sorted in descending order of cumulative divergence.

Here, since the possibility of correction (tampering) can be said to be higher toward the top of this ranking, the ranking is compared with a rank threshold (for example, 10th place or higher) or with a specific cumulative value threshold (for example, 0.5), and the IDs at or above that threshold are output.

This threshold can be specified by the user, but this system may automatically determine it.

For example, the values up to the third place in the ranking (ID = 31, 157, 46, 37), or the values with a cumulative value of 0.77 or more (ID = 31, 157, 46, 37), are output as values with a high possibility of tampering.
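The output step can be sketched as a sort followed by one of the two cut-offs described above. The function names are illustrative, and the IDs and scores in the example are made up; they are not the values shown in FIG. 11.

```python
def rank_by_score(scores):
    """(ID, score) pairs sorted in descending order of cumulative deviation."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def flag_by_threshold(scores, threshold):
    """IDs whose cumulative deviation is at or above the threshold, in ranking order."""
    return [id_ for id_, score in rank_by_score(scores) if score >= threshold]

def flag_top_n(scores, n):
    """IDs in the top n places of the ranking."""
    return [id_ for id_, _ in rank_by_score(scores)[:n]]

# Hypothetical cumulative deviation degrees keyed by hypothetical IDs.
example_scores = {"A": 0.81, "B": 0.12, "C": 0.78, "D": 0.40}
print(rank_by_score(example_scores))
print(flag_by_threshold(example_scores, 0.77))
print(flag_top_n(example_scores, 3))
```

Note that a plain top-n cut does not keep tied values together, whereas a threshold on the cumulative value does; which behavior is wanted depends on how ties in the ranking are meant to be handled.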

Therefore, based on this output, it is possible to find corrections / tampering more quickly and efficiently by re-examining the numerical values for the relevant institution.

With such a configuration, a ranked list of the values most likely to have been modified or tampered with can be obtained, and the numerical values likely to have been tampered with can be identified specifically, so it is no longer necessary to re-examine all the numerical values.

(Example 2)
The degree of divergence may be expressed not only by the actual cumulative value but also by the number of black circles shown in FIG. 10 (the number of radixes higher than the Benford distribution). An example in which this example is applied to the above employment statistics will be described below as Example 2.

In this embodiment, the determination by the determination unit 18 is specifically performed using the following function.

・ Pben (k, d)
It is the probability of appearance of d, which is the first digit value (FSD) when the radix is k, according to Benford's law.

・ Pben (k, d) = log _k (1 + 1 / d)
For example, Pben (10, 1) = 0.301. (The probability of appearance of 1 in decimal number is 0.301).

・ fsd (x, k)
It is a function representing d, which is the FSD when the numerical value x is described by the radix k.
For example, fsd (123,10) = 1
fsd (18, 3) = 2.

・ v (c)
The number of employed persons with disabilities at institution c before correction.
For example, v (2 (Ministry of Internal Affairs and Communications)) = 110.

・ r (c)
The number of employed persons with disabilities at institution c after correction.
For example, r (2 (Ministry of Internal Affairs and Communications)) = 40.

・ C_all
The set of institutions c for which the number of employed persons with disabilities before correction, v (c), is neither blank nor 0.
C_all = {c1, c2, c3, ...} = {Cabinet Secretariat, Cabinet Legislation Bureau, Cabinet Office, ...}
The number of elements of C_all (the number of institutions analyzed here) is
|C_all| = 422, where |S| denotes the number of elements in the set S.

・ C_kd
The set of institutions c for which the FSD of v (c), expressed in radix k, is d.
C_kd = {c | fsd (v (c), k) = d}
For example, C_10,1 = {Consumer Affairs Agency, Ministry of Internal Affairs and Communications, Ministry of Foreign Affairs, ...}
C_10,2 = {Imperial Household Agency, Ministry of Finance, Japan Tourism Agency, ...}.

・ OverS (c)
The sum of over (c, k) over the radixes k in a given range. Here, the range k = 3, ..., 16 is used.

This value is the number of radixes in which v (c) did not comply with Benford as described above.

Equation 2

For example, OverS (Ministry of Foreign Affairs) = 14
OverS (Tokushima University) = 0.
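The definitions above can be combined into a short sketch. Equation 2 and the definition of over (c, k) are not reproduced in this text, so the sketch assumes that over (c, k) is 1 when |C_kd| / |C_all| exceeds Pben (k, d) for d = fsd (v (c), k), and 0 otherwise, which is consistent with OverS being the number of radixes that exceed the Benford distribution. The function name `over_s` and the input values are illustrative assumptions, not the employment statistics.

```python
import math
from collections import Counter

def fsd(x, k):
    """Leading digit of positive integer x written in radix k."""
    while x >= k:
        x //= k
    return x

def pben(k, d):
    """Benford probability of leading digit d in radix k."""
    return math.log(1 + 1 / d, k)

def over_s(v, radixes=range(3, 17)):
    """Map each institution ID in C_all to OverS(c) under the assumed over(c, k)."""
    c_all = [c for c, x in v.items() if x is not None and x > 0]  # drop blank and 0 cells
    n = len(c_all)
    result = {c: 0 for c in c_all}
    for k in radixes:
        digit = {c: fsd(v[c], k) for c in c_all}
        size = Counter(digit.values())            # size[d] is |C_kd|
        for c in c_all:
            d = digit[c]
            if size[d] / n > pben(k, d):          # assumed definition of over(c, k) = 1
                result[c] += 1
    return result

# Made-up values keyed by hypothetical institution IDs; "z" is excluded because it is 0.
example = {"a": 9, "b": 9, "c": 9, "d": 9, "e": 1, "z": 0}
print(over_s(example))
```

With 14 radixes in the range 3 to 16, OverS (c) lies between 0 and 14, which matches the two example values given above.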

According to such a calculation method, each institution is assigned a value corresponding to the radix k, and its ranking can be created.

(Evaluation / verification)
The following shows the verification result of the result obtained by the second embodiment.

FIG. 12 shows the evaluation result of the estimated performance of the present invention.

In this embodiment, when the above OverS (c) is OverS (c)> i, v (c) is determined to be an error. The graph below shows the evaluation value when the threshold value i is changed, taking the institution c where v (c) ≠ r (c) as a positive example.
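The evaluation described above can be sketched as a Precision / Recall computation for a given threshold i. The function name and all data in the example are made up for illustration; positives are the institutions with v (c) ≠ r (c), and an institution is flagged when OverS (c) > i.

```python
def precision_recall(over_s_values, v, r, i):
    """Precision and Recall of the rule OverS(c) > i against the positives v(c) != r(c)."""
    flagged = {c for c, score in over_s_values.items() if score > i}
    positives = {c for c in over_s_values if v[c] != r[c]}
    true_positives = len(flagged & positives)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall

# Hypothetical OverS values, pre-correction values v and corrected values r.
scores = {"a": 14, "b": 12, "c": 3, "d": 0}
v = {"a": 110, "b": 25, "c": 9, "d": 4}
r = {"a": 40, "b": 25, "c": 5, "d": 4}

for i in (10, 2):
    print(i, precision_recall(scores, v, r, i))
```

Lowering i flags more institutions, which tends to raise Recall and lower Precision; plotting both against i gives curves of the kind shown in the evaluation figures.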

(Estimation performance of maximum likelihood method)
In order to verify the estimation accuracy of the present invention, FIG. 13 shows the performance of a comparison method based on the maximum likelihood method. In this maximum likelihood method, the known correction ratio of 0.395 (about 40%) mentioned above is used, and whether C_kd contains many institutions c with v (c) ≠ r (c) is evaluated not from Benford's law but from the actual number of c with v (c) ≠ r (c) in C_kd.

(Estimation performance of random judgment method)
Further, a random determination method, that is, a method that randomly judges a cell v(c) to be an error, is considered as a baseline. With this method, the probability of making a correct judgment is 0.395, and this value equals the Precision; when all cells are judged to be errors, Recall = 1.
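The baseline figures can be checked with a short derivation. The symbols N (total number of cells) and P (number of cells with v(c) ≠ r(c)) are introduced here only for illustration, with P/N = 0.395. When every cell is judged to be an error, TP = P, FP = N - P and FN = 0, so:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{P}{N} = 0.395,
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN} = \frac{P}{P + 0} = 1
```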

(Results of comparison between the present invention and the maximum likelihood method and the random determination method)
FIG. 14 is a comparison result of the estimation ability between the present invention and the maximum likelihood method and the random determination method.

According to this, the superiority of the present invention is clear.

(Performance evaluation when using only decimal numbers)
For comparison, the performance evaluation was also executed using decimal numbers only, with the result shown in the corresponding figure.

From this, it can be seen that using decimal numbers alone is not effective.

(Correction when the maximum likelihood method is exceeded)
On the other hand, as shown in FIG. 16, looking at the number of occurrences of v(c) = x in C_all, the value "9" occurs far more often than the neighboring values. On examining the institutions with v(c) = 9, it was found that 10 out of 17 institutions were "~ prefectural police headquarters". To deal with the problem that a value with a very large number of occurrences is determined to be an "error", the data may be modified in the present invention.

That is, the police headquarters responsible for the frequently occurring values were removed, and the estimation performance of the method according to the present invention (proposed method, OverS(c)) and of the maximum likelihood method (WrongS(c)) was compared on the data of the remaining 376 institutions (FIG. 17).
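A sketch of this preprocessing is given below. The text does not state how a protruding value is detected, so the criterion used here (a count larger than `factor` times the mean count of the two neighboring values) is purely an assumption, as are the function name and the sample data. The decision about which institutions to remove is left to the analyst.

```python
from collections import Counter


def protruding_values(v, factor=2.0):
    """Values x whose count exceeds factor times the mean count of x-1 and x+1.

    The criterion and the default factor are assumptions for illustration.
    """
    counts = Counter(v.values())
    result = []
    for x, n in sorted(counts.items()):
        neighbours = (counts.get(x - 1, 0) + counts.get(x + 1, 0)) / 2
        if neighbours > 0 and n > factor * neighbours:
            result.append(x)
    return result


# Hypothetical data: five institutions of one kind all report the value 9.
v = {"p1": 9, "p2": 9, "p3": 9, "p4": 9, "p5": 9, "a": 8, "b": 10, "c": 10}
print(protruding_values(v))

# Remove the institutions judged to cause the spike for reasons other than correction.
remaining = {c: x for c, x in v.items() if not c.startswith("p")}
print(remaining)
```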

Here, Pb indicates the Precision of the proposed method, Rb the Recall of the proposed method, Pml the Precision of the maximum likelihood method, and Rml the Recall of the maximum likelihood method. As a result, the estimation performance of the method according to the present invention (proposed method) improved at the threshold values i = 12 and 11. On the other hand, the performance of the maximum likelihood method did not change much. This shows that the present invention exhibits higher estimation performance when values that deviate from Benford's law for reasons other than correction are removed.

(About radix range)
In this embodiment, the radixes k = 3, ..., 16 are selected to estimate the deviation from Benford's law, but the invention is not limited to this range; in principle, any radix of 3 or more can be used.

(Selection of maximum radix)
The reason why the maximum value of the radix k is set to 16 in this embodiment is that Benford's law in radix k applies only when the maximum value and the minimum value of the numerical data differ by 2.6 or more digits.

Equation 3

On the other hand, the inventors tried changing the maximum value of the radix within the range from 17 to 36, but the estimation performance did not change. From this, 16 is considered sufficient as the maximum radix in this embodiment.
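The effect of the radix on the digit span of a data set can be sketched as follows. Equation 3 is not reproduced here, so this block rests on an assumed reading of the condition: the span is taken to be log_k(max / min), and a radix is treated as usable when this span is at least 2.6. The function names and sample values are hypothetical.

```python
import math


def digit_span(values, k):
    """Assumed measure: how many radix-k digits separate max and min."""
    return math.log(max(values) / min(values), k)


def usable_radixes(values, radixes, min_span=2.6):
    """Radixes whose digit span reaches min_span (assumed reading of the condition)."""
    return [k for k in radixes if digit_span(values, k) >= min_span]


# Hypothetical data spanning 1 to 1000.
values = [1, 7, 42, 350, 1000]
print(usable_radixes(values, range(3, 17)))
```

Under this reading, larger radixes shrink the span, so a data set with a limited range supports only a bounded set of radixes.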

(Selective use of small radix)
For a value set that does not follow Benford's law in decimal, a question intuitively remains as to whether it should be evaluated repeatedly in closely related radixes such as ternary and radix 9 (= 3²). Therefore, the inventors tested the estimation performance of the method of the present invention while changing the lower limit of the radix (at the point where the Precision is maximum). According to the test result, limiting the lower limit of the radix has no particular effect (FIG. 18).

(Effect of the present invention)
In the embodiment of the present invention, first, regarding the errors in the aggregated results on the employment status of persons with disabilities found in 2018, it was confirmed that the data before correction does not follow Benford's law and that the data after correction does not deviate from Benford's law. Next, the method of the present invention for estimating the error locations of numerical data by using Benford's law in radix k (k = 3, ..., 16) was proposed. Furthermore, the estimation performance of the proposed method was evaluated using the data before and after the correction.

According to the present invention, it has become possible to estimate some parts where there are many errors.
The present invention can also be applied to numerical set data other than the number of persons with employment disabilities.
It is also possible to capture numerical data flowing in time series and to determine in a timely manner which time period contains an incorrect numerical value.

In this case, at regular time intervals (for example, every minute), the numerical values input during the preceding minute are verified by the method of the present invention. This makes it possible to determine whether that time period contains tampered numerical values and to identify the numerical values that are likely to have been tampered with.
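The time-series use can be sketched as a windowing step placed in front of the scoring step. The record layout (timestamp in seconds, ID, value), the function names and the stand-in scoring function are all hypothetical; in practice `score` would be a function that returns a cumulative deviation such as OverS(c) for each ID in the window.

```python
from collections import defaultdict


def windows(records, width=60):
    """Group (timestamp_seconds, id, value) records into fixed-width windows."""
    buckets = defaultdict(dict)
    for t, c, value in records:
        buckets[int(t // width)][c] = value
    return dict(buckets)


def flag_windows(records, score, threshold, width=60):
    """Return {window index: IDs whose score exceeds the threshold}."""
    flagged = {}
    for w, v in sorted(windows(records, width).items()):
        suspects = [c for c, s in score(v).items() if s > threshold]
        if suspects:
            flagged[w] = suspects
    return flagged


# Stand-in scoring function for demonstration only: the value itself is the score.
records = [(0, "a", 1), (30, "b", 9), (61, "c", 2), (119, "d", 7)]
print(flag_windows(records, score=lambda v: dict(v), threshold=5))
```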

The present invention is not limited to that of the above-described embodiment, and can be variously modified without changing the gist of the invention.

For example, in the above embodiment, the plurality of conversion methods applied to each numerical value of the numerical set are radix conversions, but the conversion method is not limited to radix conversion; any conversion method that follows a specific rule may be used.

Further, in the above embodiment, the predetermined digit for analyzing the deviation from the Benford distribution is the highest digit, but the present invention is not limited to this, and the upper nth digit (n is an integer not exceeding m, the minimum number of digits among the converted numerical values) may be used.
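Extraction of the upper nth digit can be sketched as follows. The function names are hypothetical, and the block covers only the digit extraction and the bound m; the reference distribution to be used for digits other than the first is not specified here.

```python
def digits(value, k):
    """Digits of a positive integer in radix k, most significant first."""
    out = []
    while value > 0:
        value, remainder = divmod(value, k)
        out.append(remainder)
    return out[::-1]


def nth_digit(value, k, n):
    """Upper n-th digit in radix k (n = 1 is the most significant digit)."""
    ds = digits(value, k)
    if not 1 <= n <= len(ds):
        raise ValueError("n exceeds the number of digits")
    return ds[n - 1]


def min_digits(values, k):
    """m: the minimum number of radix-k digits among the converted values."""
    return min(len(digits(x, k)) for x in values)


# Hypothetical values.
values = [130, 27, 45]
m = min_digits(values, 10)
print(m, [nth_digit(x, 10, m) for x in values])
```

Bounding n by m guarantees that every value in the set actually has an nth digit.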

1 ... System
2 ... CPU
3 ... RAM
4 ... Input / output unit
5 ... Bus
6 ... Data storage unit
7 ... Program storage unit
9 ... Numerical set data
10 ... Converted numerical set data
11 ... Numerical distribution data
12 ... Numerical correctability evaluation result
14 ... Numerical set acquisition unit
15 ... Cardinal conversion processing unit
16 ... Numerical distribution generation processing unit
17 ... Deviation degree cumulative processing unit
18 ... Correctability judgment processing unit

Claims

A data correction / tampering determination method executed by a computer, wherein the computer executes:
a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a step of converting each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a step of specifying, for each numerical value included in the numerical set and referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating the deviation over the number of conversion methods; and
a step of determining the possibility of correction / tampering of a specific numerical value by comparing the cumulative deviation degree among the numerical values.
In the method according to claim 1,
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative deviation degree is compared between each numerical value, and one or a plurality of numerical values having a high cumulative deviation degree are specified as numerical values having a high possibility of being corrected / tampered with.
In the method according to claim 2,
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative divergence degree is compared between each numerical value, and one or a plurality of numerical values having a cumulative divergence degree equal to or higher than a predetermined threshold value are specified as numerical values having a high possibility of being corrected / tampered with.
In the method according to claim 3,
The step of determining the possibility of modification / tampering is
A method further comprising a step of determining the predetermined threshold value based on the cumulative deviation degree of all the numerical values.
In the method according to claim 2,
The step of determining the possibility of modification / tampering is
A method characterized in that the cumulative deviation degree is compared among the numerical values and sorted in descending order, and one or a plurality of numerical values whose sort order is higher than a predetermined order are specified as numerical values having a high possibility of being corrected / tampered with.
In the method according to claim 5,
The step of determining the possibility of modification / tampering is
A method further comprising a step of determining the predetermined order based on the cumulative deviation degree of all the numerical values.
In the method according to claim 1,
A method characterized in that the specific digit is an upper nth digit (n is an integer having the minimum number of digits m or less among the converted numerical values).
In the method according to claim 7,
A method characterized in that the two or more different conversion methods are radix conversions using two or more different radixes.
In the method according to claim 8,
A method characterized in that the two or more different radixes k are selected from those for which the maximum value and the minimum value of the converted numerical values in the numerical set differ by 2.6 or more digits.
In the method according to claim 9,
A method characterized in that the radix is selected from values of 3 or more.
In the method according to claim 1,
A method characterized in that the cumulative deviation degree is, among the plurality of converted numerical values obtained by converting a specific numerical value by the conversion methods, the number of converted numerical values that deviate from the Benford distribution.
In the method according to claim 1,
A method characterized in that the cumulative deviation degree is the ratio of the converted numerical values that deviate from the Benford distribution to the plurality of converted numerical values obtained by converting a specific numerical value by the conversion methods.
In the method according to claim 1,
A method characterized by further having a step of receiving an additional specific numerical value,
wherein, in the determination step, the cumulative deviation degree is calculated for the additionally received numerical value and compared with the cumulative deviation degree of the other numerical values, whereby the possibility of correction / tampering of the additionally received numerical value is determined in real time.
In the method according to claim 1,
A method characterized by further having a step of excluding numerical values having a high degree of deviation for reasons other than correction / tampering.
A data correction / tampering determination system executed by a computer, having:
a means by which the computer acquires a numerical set to be analyzed and stores each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a means by which the computer converts each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregates, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a means by which the computer specifies, for each numerical value included in the numerical set and referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculates a cumulative deviation degree by accumulating the deviation over the number of conversion methods; and
a means by which the computer determines the possibility of correction / tampering by comparing the cumulative deviation degree among the numerical values.
In the system of claim 15,
The means for determining the possibility of modification / tampering is
A system characterized in that the cumulative divergence degree is compared between each numerical value, and one or a plurality of numerical values having a high cumulative divergence degree are specified as numerical values having a high possibility of being corrected / tampered with.
In the system of claim 15,
A system characterized in that the specific digit is an upper nth digit (n is an integer not exceeding m, the minimum number of digits among the converted numerical values).
In the system according to claim 7,
A system characterized in that the two or more different conversion methods are radix conversions using two or more different radixes.
In the system of claim 15,
A system characterized by further having a means for receiving an additional specific numerical value,
wherein the determination means calculates the cumulative deviation degree for the additionally received numerical value and compares it with the cumulative deviation degree of the other numerical values, whereby the possibility of correction / tampering of the additionally received numerical value is determined in real time.
A computer software program product for data correction / tampering determination executed by a computer, the product being stored on a storage medium and causing the computer to execute:
a step of acquiring a numerical set to be analyzed and storing each numerical value, together with an ID, in memory as a pre-conversion numerical value;
a step of converting each pre-conversion numerical value included in the set by two or more different conversion methods, and aggregating, for each conversion method, the distribution of the value of a specific digit of the converted numerical values;
a step of specifying, for each numerical value included in the numerical set and referred to by the same ID, the deviation from the Benford distribution of the value of the specific digit of the converted numerical value, and calculating a cumulative deviation degree by accumulating the deviation over the number of conversion methods; and
a step of determining the possibility of correction / tampering by comparing the cumulative deviation degree among the numerical values.