WO2021220403A1

WO2021220403A1 - Attribute estimation device, attribute estimation method, and program

Info

Publication number: WO2021220403A1
Application number: PCT/JP2020/018126
Authority: WO
Inventors: 聡長谷川
Original assignee: 日本電信電話株式会社
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2021-11-04
Also published as: JPWO2021220403A1; JP7355232B2

Abstract

Provided is a technique for estimating an attribute of a database without processing all of the data of the attribute to be estimated. Let X be the set of data of the attribute of a database T that is the target of attribute estimation (the estimation target attribute), Y be the set of data used for estimating the estimation target attribute (auxiliary information), and p be the proportion matching the auxiliary information included in the set Y. The present invention includes: a sequence generation unit that generates a sequence {k_1, …, k_s} by repeating, s times, a process of generating a set of m items of data extracted from the set X as an i-th sample set X^(i) and finding the number k_i (i = 1, …, s) of data, among the m items of data of the i-th sample set X^(i), that match any of the auxiliary information included in the set Y; and an estimation result generation unit that determines, using the sequence {k_1, …, k_s}, whether a prescribed inequality holds and, if it holds, generates the estimation result that the estimation target attribute is the attribute represented by the set Y.

Description

Attribute estimation device, attribute estimation method, program

The present invention relates to a technique for estimating database attributes.

Database attributes are user-defined. Therefore, it is usually not necessary to perform the attribute estimation process. However, it may be necessary to estimate the attributes of the data from a given set of database data, for example, when automating the database anonymization process.

Conventionally, methods for estimating the attribute of data from a set of data include the method used for image matching search in Non-Patent Document 1 and for malware detection in Non-Patent Document 2. In this method, a match rate is calculated from the set of data that is the target of attribute estimation and a separate set of data prepared in advance; when the match rate exceeds a certain value, the attribute of the target data is presumed to be the attribute of the data prepared in advance.

The above method can be applied to database attribute estimation. An example follows. Suppose that the set of data subject to attribute estimation is {Midorimachi, Musashino City, Tokyo; Hikarinooka, Yokosuka City, Kanagawa Prefecture; Morinosatowakamiya, Atsugi City, Kanagawa Prefecture; Hikaridai, Seikacho, Sagara County, Kyoto Prefecture; Hanabata, Tsukuba City, Ibaraki Prefecture}, that the set of data prepared in advance is {Tokyo, Kanagawa, Saitama, Chiba, Tochigi, Gunma, Ibaraki}, and that the attribute of the data prepared in advance is "address". Suppose also that the attribute of the target data is estimated to be "address" when the match rate calculated from these two sets is 70% or more. If matches are judged by partial match, four of the five target entries match (the Kyoto Prefecture entry does not), so the match rate is 4/5, that is, 80%, and the attribute of the target data can be estimated to be "address".
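The match-rate calculation in this example can be sketched as follows. This is a minimal illustration, assuming that "partial match" means substring containment; the function name `match_rate` and the variable names are hypothetical and do not appear in the specification.

```python
def match_rate(target_data, reference_data):
    """Return the fraction of target strings that contain at least one
    reference string (partial match is assumed to mean substring match)."""
    if not target_data:
        raise ValueError("target_data must not be empty")
    hits = sum(1 for x in target_data if any(y in x for y in reference_data))
    return hits / len(target_data)


# Data taken from the example in the text.
target = [
    "Midorimachi, Musashino City, Tokyo",
    "Hikarinooka, Yokosuka City, Kanagawa Prefecture",
    "Morinosatowakamiya, Atsugi City, Kanagawa Prefecture",
    "Hikaridai, Seikacho, Sagara County, Kyoto Prefecture",
    "Hanabata, Tsukuba City, Ibaraki Prefecture",
]
reference = ["Tokyo", "Kanagawa", "Saitama", "Chiba", "Tochigi", "Gunma", "Ibaraki"]

rate = match_rate(target, reference)   # 4 of the 5 entries match
is_address = rate >= 0.70              # 70% threshold from the example
```

Note that this conventional calculation visits every element of the target set, which is the cost the invention aims to avoid.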

However, when the above method is used to estimate database attributes, calculating the match rate requires extracting the data of the target attribute from every record in the database and processing it. If the number of records is large, attribute estimation therefore takes an enormous amount of time.

Therefore, an object of the present invention is to provide a technique for estimating the attributes of a database without executing processing for all the data of the attributes to be estimated.

In one aspect of the present invention, let X be the set of data of the attribute of the database T to be estimated (hereinafter referred to as the estimation target attribute), Y be the set of data used for estimating the estimation target attribute (hereinafter referred to as auxiliary information), and p be the proportion that matches the auxiliary information contained in the set Y. The aspect includes: a sequence generation unit that generates a sequence {k_1, …, k_s} by repeating s times a process of generating a set of m data extracted from the set X as the i-th sample set X^(i) (i = 1, …, s) and finding the number k_i (i = 1, …, s) of data, among the m data of the i-th sample set X^(i), that match any of the auxiliary information contained in the set Y; and an estimation result generation unit that uses the sequence {k_1, …, k_s} to determine whether the following inequality holds and, if the inequality holds, generates the estimation result that the estimation target attribute is the attribute represented by the set Y.

(Here, n is the number of data contained in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)

According to the present invention, it is possible to estimate the attributes of the database without executing the processing for all the data of the attributes to be the target of the attribute estimation.

FIG. 1 is a block diagram showing the configuration of the attribute estimation device 100.
FIG. 2 is a flowchart showing the operation of the attribute estimation device 100.
FIG. 3 is a block diagram showing the configuration of the estimation result generation unit 120.
FIG. 4 is a flowchart showing the operation of the estimation result generation unit 120.
FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each device in the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail. The components having the same function are given the same number, and duplicate explanations will be omitted.

Prior to the description of each embodiment, the notation method in this specification will be described.

^ (caret) represents a superscript. For example, x^{y^z} means that y^z is a superscript for x, and x_{y^z} means that y^z is a subscript for x. In addition, _ (underscore) represents a subscript. For example, x^{y_z} means that y_z is a superscript for x, and x_{y_z} means that y_z is a subscript for x.

Also, marks such as "^" and "~" in ^x and ~x for a certain character x should properly be written directly above "x", but due to the notational restrictions of the specification they are written as ^x or ~x.

<First Embodiment>
Let X be the set of data of the attribute of the database T to be estimated (hereinafter referred to as the estimation target attribute), Y be the set of data used to estimate the estimation target attribute (hereinafter referred to as auxiliary information), and p be the proportion that matches the auxiliary information contained in the set Y. Specific examples of the set Y and the ratio p will be described later.

The attribute estimation device 100 takes the set X, the set Y, and the ratio p as inputs. If a predetermined condition is satisfied, it generates and outputs the estimation result that the estimation target attribute is the attribute represented by the set Y; otherwise, it generates and outputs the estimation result that the estimation target attribute cannot be said to be the attribute represented by the set Y.

Hereinafter, the attribute estimation device 100 will be described with reference to FIGS. 1 to 2. FIG. 1 is a block diagram showing the configuration of the attribute estimation device 100. FIG. 2 is a flowchart showing the operation of the attribute estimation device 100. As shown in FIG. 1, the attribute estimation device 100 includes a sequence generation unit 110, an estimation result generation unit 120, and a recording unit 190. The recording unit 190 is a component unit that appropriately records information necessary for processing of the attribute estimation device 100. For example, the set X, the set Y, and the ratio p are recorded in the recording unit 190.

The operation of the attribute estimation device 100 will be described with reference to FIG. 2.

In S110, the sequence generation unit 110 takes the set X and the set Y as inputs and generates and outputs a sequence {k_1, …, k_s} by repeating the following process s times: generate a set of m data extracted from the set X as the i-th sample set X^(i) (i = 1, …, s), and find the number k_i (i = 1, …, s) of data, among the m data of the i-th sample set X^(i), that match any of the auxiliary information contained in the set Y. Here, m is called the number of samples (the size of each sample set) and s is called the number of sampling repetitions. Both m and s are integers of 1 or more.

Specifically, the sequence generation unit 110 generates the sequence {k_1, …, k_s} by repeatedly executing the following processes (1) and (2) s times.

(1) The sequence generation unit 110 randomly extracts m pieces of data from the set X and generates the set having the extracted m pieces of data as elements as the i-th sample set X^(i) (i = 1, …, s). Therefore, X^(i) ⊂ X.

(2) The sequence generation unit 110 determines whether or not each data item x ∈ X^(i) matches any of the auxiliary information y ∈ Y. The sequence generation unit 110 then obtains the number k_i (i = 1, …, s) of data in the i-th sample set X^(i) that match some auxiliary information.
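Processes (1) and (2) can be sketched as below. This is only an illustration under stated assumptions: the extraction is taken to be uniform sampling without replacement, matching is taken to be substring containment, and the names `generate_sequence` and `matches_any` are hypothetical.

```python
import random


def matches_any(x, auxiliary):
    """Assumed matching rule: x matches if it contains any auxiliary string."""
    return any(y in x for y in auxiliary)


def generate_sequence(X, Y, m, s, rng=None):
    """Repeat s times: draw m items from X and count how many match Y.

    Returns the list [k_1, ..., k_s]. Sampling without replacement is an
    assumption; the specification only says the data are extracted randomly.
    """
    if rng is None:
        rng = random.Random()
    X = list(X)
    ks = []
    for _ in range(s):
        sample = rng.sample(X, m)  # i-th sample set X^(i), a subset of X
        ks.append(sum(1 for x in sample if matches_any(x, Y)))  # k_i
    return ks


# Usage with toy data: 60 matching and 40 non-matching entries.
X = ["Tokyo, ward %d" % i for i in range(60)] + ["Kyoto, ward %d" % i for i in range(40)]
Y = ["Tokyo", "Kanagawa"]
ks = generate_sequence(X, Y, m=10, s=5, rng=random.Random(0))
```

Only m × s items are examined, regardless of how many items X contains.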

In S120, the estimation result generation unit 120 takes the sequence {k_1, …, k_s} output in S110 and the ratio p as inputs and uses the sequence {k_1, …, k_s} to determine whether or not the following inequality (1) holds. If inequality (1) holds, it generates and outputs the estimation result that the estimation target attribute is the attribute represented by the set Y; otherwise, it generates and outputs the estimation result that the estimation target attribute cannot be said to be the attribute represented by the set Y.

(Here, n is the number of data contained in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)
Here, ε is called accuracy.

Evaluating inequality (1) requires calculating the term _nC_j p^j (1-p)^{n-j} many times and performing s additions. Because such combination calculations are costly, the calculation is arranged so that only the minimum necessary computation is performed, which improves efficiency.

Hereinafter, the estimation result generation unit 120 for improving the efficiency of the calculation of the inequality (1) will be described with reference to FIGS. 3 to 4. FIG. 3 is a block diagram showing the configuration of the estimation result generation unit 120. FIG. 4 is a flowchart showing the operation of the estimation result generation unit 120. As shown in FIG. 3, the estimation result generation unit 120 includes a sort unit 121, a function calculation unit 122, and an estimation unit 123.

The operation of the estimation result generation unit 120 will be described with reference to FIG. 4.

In S121, the sort unit 121 takes the sequence {k_1, …, k_s} as input, sorts it in ascending order, and generates and outputs the sequence {~k_1, …, ~k_s} (where ~k_1, …, ~k_s satisfy ~k_1 ≤ … ≤ ~k_s).

In S122, the function calculation unit 122 takes the sequence {~k_1, …, ~k_s} output in S121 as input and, for each element of the sequence, calculates and outputs the value f(~k_i) of the function f(k) = Σ_{j=k}^m _nC_j p^j (1-p)^{n-j}. The function calculation unit 122 calculates f(~k_i) (i = 1, …, s) by executing the following processes (1) and (2).

(1) The function calculation unit 122 calculates f(~k_1).

(2) The function calculation unit 122 inductively calculates f(~k_i) using f(~k_{i-1}) for i = 2, …, s.
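One way to read the inductive step is sketched below. The specification does not spell out the recurrence; this sketch assumes that, because the sequence is sorted in ascending order, f(~k_i) is obtained from f(~k_{i-1}) by subtracting the terms for j = ~k_{i-1}, …, ~k_i − 1, so that each term is computed at most once after the first sum. The function names are hypothetical, and plain floating-point arithmetic is used, which is only adequate for small n.

```python
from math import comb


def term(n, p, j):
    """Single summand: nCj * p^j * (1-p)^(n-j)."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)


def f_direct(k, n, m, p):
    """f(k) = sum over j = k..m of nCj p^j (1-p)^(n-j), computed from scratch."""
    return sum(term(n, p, j) for j in range(k, m + 1))


def f_inductive(sorted_ks, n, m, p):
    """Compute f(~k_1), ..., f(~k_s) for an ascending sequence.

    Assumed recurrence: f(~k_i) = f(~k_{i-1}) - sum_{j=~k_{i-1}}^{~k_i - 1} term(j).
    """
    values = []
    prev_k = prev_f = None
    for k in sorted_ks:
        if prev_k is None:
            cur = f_direct(k, n, m, p)  # process (1): f(~k_1)
        else:
            # process (2): remove the terms no longer covered by the sum
            cur = prev_f - sum(term(n, p, j) for j in range(prev_k, k))
        values.append(cur)
        prev_k, prev_f = k, cur
    return values


# Usage with small illustrative parameters (not values from the specification).
sorted_ks = sorted([4, 2, 7, 4])
values = f_inductive(sorted_ks, n=20, m=10, p=0.5)
```

For large n, the factor (1-p)^{n-j} underflows in double precision, so a practical implementation would likely work with logarithms of the terms.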

In S123, the estimation unit 123 takes f(~k_i) (i = 1, …, s) output in S122 as input and uses f(~k_i) (i = 1, …, s) to calculate Σ_{i=1}^s log f(~k_i). If inequality (1) holds, it generates and outputs the estimation result that the estimation target attribute is the attribute represented by the set Y; otherwise, it generates and outputs the estimation result that the estimation target attribute cannot be said to be the attribute represented by the set Y.
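The quantity Σ_{i=1}^s log f(~k_i) computed in S123 can be sketched as follows. Only the sum of logarithms is shown; the comparison against the bound in inequality (1) is deliberately left out because the inequality itself is not reproduced in this text. The function name `log_sum` is hypothetical.

```python
import math


def log_sum(f_values):
    """Return the sum of log f(~k_i) over the given values.

    Each value must be strictly positive, since log is undefined otherwise.
    """
    if any(v <= 0.0 for v in f_values):
        raise ValueError("every f value must be positive to take its logarithm")
    return math.fsum(math.log(v) for v in f_values)


# Usage with illustrative values (not values from the specification).
total = log_sum([0.5, 0.5])
```

Summing logarithms rather than multiplying the f values directly avoids underflow when s is large and the individual values are small.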

(Specific example of set Y and ratio p)
Example 1: When estimating whether the estimation target attribute is "address"
Set Y = {Hokkaido, Aomori prefecture, Iwate prefecture, Miyagi prefecture, Akita prefecture, Yamagata prefecture, Fukushima prefecture, Ibaraki prefecture, Tochigi prefecture, Gunma prefecture, Saitama prefecture, Chiba prefecture, Tokyo, Niigata prefecture, Toyama prefecture, Ishikawa prefecture, Fukui prefecture, Yamanashi prefecture, Nagano prefecture, Gifu prefecture, Shizuoka prefecture, Aichi prefecture, Mie prefecture, Shiga prefecture, Kyoto prefecture, Osaka prefecture, Hyogo prefecture, Nara prefecture, Tottori prefecture, Shimane prefecture, Okayama prefecture, Hiroshima prefecture, Yamaguchi prefecture, Tokushima prefecture, Kagawa prefecture, Ehime prefecture, Kochi prefecture, Fukuoka prefecture, Saga prefecture, Nagasaki prefecture, Kumamoto prefecture, Oita prefecture, Miyazaki prefecture, Okinawa prefecture}, and ratio p = 0.9076439. That is, the set Y consists of the 44 prefectures that remain after excluding the three prefectures of Kanagawa, Wakayama, and Kagoshima, and the ratio p is the ratio of the total population of the 44 prefectures in the set Y to the total population of the whole country at the time of the 2015 census.

For the number of samples m, the number of sampling repetitions s, and the accuracy ε, values appropriate to the number n of data contained in the set X can be used. For example, when n = 10000000, one may set m = 1000, s = 10, and ε = 0.95.

Example 2: When estimating whether the estimation target attribute is "name"
Set Y = {Sato, Suzuki, Takahashi, Tanaka, Ito, Watanabe, Yamamoto, Nakamura, Kobayashi, Kato, Yoshida, Yamada, Yamaguchi, Matsumoto, Inoue, Kimura, Saito, Shimizu, Yamazaki, Abe, Ikeda, Hashimoto, Yamashita, Ishikawa, Nakajima, Maeda, Fujita}, and ratio p = 0.1778762. That is, the set Y is obtained by extracting the two-character surnames from the 30 most common surnames, and the ratio p is the ratio of the estimated number of people who have a surname included in the set Y to the total population of the whole country at the time of the 2015 census.

According to the embodiment of the present invention, the attributes of a database can be estimated without processing all of the data of the attribute targeted for estimation. In other words, even when only part of the data, obtained by sampling, is used instead of all the data of the attribute targeted for estimation, the match rate can be evaluated probabilistically and accurately, which makes high-speed attribute estimation possible.

<Supplement>
FIG. 5 is a diagram showing an example of a functional configuration of a computer that realizes each of the above-mentioned devices (that is, each node). The processing in each of the above-mentioned devices can be carried out by causing the recording unit 2020 to read a program for causing the computer to function as each of the above-mentioned devices, and operating the control unit 2010, the input unit 2030, the output unit 2040, and the like.

The device of the present invention has, for example, as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, etc.); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. Further, if necessary, a device (drive) or the like capable of reading and writing a recording medium such as a CD-ROM may be provided in the hardware entity. A physical entity equipped with such hardware resources includes a general-purpose computer and the like.

The external storage device of the hardware entity stores the program required to realize the above-mentioned functions and the data required for processing this program (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Further, the data obtained by the processing of these programs is appropriately stored in a RAM, an external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are read into the memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes a predetermined function (each constituent unit represented above as a ... unit, ... means, etc.).

The present invention is not limited to the above-described embodiment and can be appropriately modified without departing from the spirit of the present invention. Further, the processes described in the above-described embodiment need not be executed only in chronological order according to the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes the processes or as required.

As described above, when the processing function in the hardware entity (device of the present invention) described in the above embodiment is realized by a computer, the processing content of the function that the hardware entity should have is described by a program. Then, by executing this program on the computer, the processing function in the above hardware entity is realized on the computer.

The program that describes this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, or the like. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable) / RW (ReWritable), or the like as the optical disk; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The distribution of this program is carried out, for example, by selling, transferring, renting, etc., a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Further, the program may be stored in the storage device of the server computer, and the program may be distributed by transferring the program from the server computer to another computer via the network.

A computer that executes such a program first stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its own storage device. Then, when executing the process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another execution form of this program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or, each time the program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. In addition, the above processing may be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, etc.).

Further, in this form, the hardware entity is configured by executing a predetermined program on the computer, but at least a part of these processing contents may be realized in terms of hardware.

The above description of the embodiment of the present invention is presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the invention to the exact form disclosed. Modifications and variations are possible in light of the above teachings. The embodiment was chosen and described to provide the best illustration of the principles of the invention and to enable those skilled in the art to use the invention in various embodiments and with various modifications suited to the practical use contemplated. All such modifications and variations are within the scope of the invention as defined by the appended claims, interpreted according to the breadth to which they are fairly, legally, and equitably entitled.

Claims

An attribute estimation device, where X is a set of data of the attribute of the database T to be estimated (hereinafter referred to as the estimation target attribute), Y is a set of data used for estimating the estimation target attribute (hereinafter referred to as auxiliary information), and p is the proportion that matches the auxiliary information contained in the set Y, the attribute estimation device including:
a sequence generation unit that generates a sequence {k_1, …, k_s} by repeating s times a process of generating a set consisting of m data extracted from the set X as the i-th sample set X^(i) (i = 1, …, s) and finding the number k_i (i = 1, …, s) of data, among the m data of the i-th sample set X^(i), that match any of the auxiliary information contained in the set Y; and
an estimation result generation unit that uses the sequence {k_1, …, k_s} to determine whether or not the following inequality holds and, if the inequality holds, generates the estimation result that the estimation target attribute is the attribute represented by the set Y.

(Here, n is the number of data contained in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)
The attribute estimation device according to claim 1, wherein
the estimation result generation unit includes:
a sort unit that generates, from the sequence {k_1, …, k_s} by ascending sort, a sequence {~k_1, …, ~k_s} (where ~k_1, …, ~k_s satisfy ~k_1 ≤ … ≤ ~k_s);
a function calculation unit that, with f(k) = Σ_{j=k}^m _nC_j p^j (1-p)^{n-j}, calculates f(~k_1) for the sequence {~k_1, …, ~k_s} and, for i = 2, …, s, inductively calculates f(~k_i) using f(~k_{i-1}), thereby calculating f(~k_i) (i = 1, …, s); and
an estimation unit that calculates Σ_{i=1}^s log f(~k_i) using f(~k_i) (i = 1, …, s) and generates the estimation result when the inequality holds.
An attribute estimation method, where X is a set of data of the attribute of the database T to be estimated (hereinafter referred to as the estimation target attribute), Y is a set of data used for estimating the estimation target attribute (hereinafter referred to as auxiliary information), and p is the proportion that matches the auxiliary information contained in the set Y, the attribute estimation method including:
a sequence generation step in which an attribute estimation device generates a sequence {k_1, …, k_s} by repeating s times a process of generating a set consisting of m data extracted from the set X as the i-th sample set X^(i) (i = 1, …, s) and finding the number k_i (i = 1, …, s) of data, among the m data of the i-th sample set X^(i), that match any of the auxiliary information contained in the set Y; and
an estimation result generation step in which the attribute estimation device uses the sequence {k_1, …, k_s} to determine whether or not the following inequality holds and, if the inequality holds, generates the estimation result that the estimation target attribute is the attribute represented by the set Y.

(Here, n is the number of data contained in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)
A program for operating a computer as the attribute estimation device according to claim 1 or 2.