WO2021220403A1 - Attribute estimation device, attribute estimation method, and program - Google Patents

Info

Publication number
WO2021220403A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
data
estimation
sequence
auxiliary information
Prior art date
Application number
PCT/JP2020/018126
Other languages
French (fr)
Japanese (ja)
Inventor
聡 長谷川
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/018126
Priority to JP2022518489A (patent JP7355232B2)
Publication of WO2021220403A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases

Definitions

  • The present invention relates to a technique for estimating database attributes.
  • Database attributes are defined by the user, so attribute estimation is not normally needed. However, it is sometimes necessary to estimate the attribute of data from a given set of database data, for example when automating database anonymization.
  • The above method can be applied to database attribute estimation.
  • A description is given below using an example.
  • Let the set of data subject to attribute estimation be {Midoricho, Musashino City, Tokyo; Hikarinooka, Yokosuka City, Kanagawa Prefecture; Morinosato-Wakamiya, Atsugi City, Kanagawa Prefecture; Hikaridai, Seika Town, Soraku County, Kyoto Prefecture; Hanabatake, Tsukuba City, Ibaraki Prefecture}.
  • Let the separately prepared set of data be {Tokyo, Kanagawa, Saitama, Chiba, Tochigi, Gunma, Ibaraki}, and let the attribute of the separately prepared data be "address".
  • When the match rate calculated from these two sets is 70% or more, the attribute of the data subject to attribute estimation is estimated to be "address".
  • With matches judged by partial match, the match rate is 4/5, that is, 80%, so the attribute of the data subject to attribute estimation can be estimated to be "address".
  • An object of the present invention is to provide a technique for estimating database attributes without processing all the data of the attribute subject to estimation.
  • X is the set of data of an attribute of database T subject to attribute estimation (hereinafter, the estimation target attribute).
  • Y is the set of data used to estimate the estimation target attribute (hereinafter, auxiliary information).
  • The estimation result generation unit generates an estimation result that the estimation target attribute is the attribute represented by set Y (here, n is the number of data items in set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1).
  • According to the present invention, database attributes can be estimated without processing all the data of the attribute subject to estimation.
  • ^ (caret) represents a superscript.
  • x^{y^z} means that y^z is a superscript of x.
  • x_{y^z} means that y^z is a subscript of x.
  • _ (underscore) represents a subscript.
  • x^{y_z} means that y_z is a superscript of x.
  • x_{y_z} means that y_z is a subscript of x.
  • X is the set of data of an attribute of database T subject to attribute estimation (hereinafter, the estimation target attribute).
  • Y is the set of data used to estimate the estimation target attribute (hereinafter, auxiliary information).
  • p is the ratio of data matching the auxiliary information included in set Y. Specific examples of set Y and ratio p are described later.
  • The attribute estimation device 100 takes set X, set Y, and ratio p as input. If a predetermined condition is satisfied, it generates and outputs an estimation result that the estimation target attribute is the attribute represented by set Y; otherwise, it generates and outputs an estimation result that the estimation target attribute cannot be said to be the attribute represented by set Y.
  • FIG. 1 is a block diagram showing the configuration of the attribute estimation device 100.
  • FIG. 2 is a flowchart showing the operation of the attribute estimation device 100.
  • the attribute estimation device 100 includes a sequence generation unit 110, an estimation result generation unit 120, and a recording unit 190.
  • the recording unit 190 is a component unit that appropriately records information necessary for processing of the attribute estimation device 100. For example, the set X, the set Y, and the ratio p are recorded in the recording unit 190.
  • m is called the sample size and s is called the number of sampling rounds. Both m and s are integers of 1 or more.
  • The sequence generation unit 110 generates a sequence {k_1, …, k_s} by executing the following processes (1) and (2) s times.
  • The estimation result generation unit 120 takes the sequence {k_1, …, k_s} output in S110 and ratio p as input, and uses the sequence to determine whether the following inequality (1) holds. If inequality (1) holds, it generates and outputs an estimation result that the estimation target attribute is the attribute represented by set Y; otherwise, it generates and outputs an estimation result that the estimation target attribute cannot be said to be the attribute represented by set Y. (Here, n is the number of data items in set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.) Here, ε is called the accuracy.
  • Evaluating inequality (1) involves calculating the term {}_nC_j p^j (1-p)^{n-j} multiple times and performing s additions. Because such combination calculations are costly, the calculation is made more efficient by performing only the minimum necessary computation.
  • FIG. 3 is a block diagram showing the configuration of the estimation result generation unit 120.
  • FIG. 4 is a flowchart showing the operation of the estimation result generation unit 120.
  • the estimation result generation unit 120 includes a sort unit 121, a function calculation unit 122, and an estimation unit 123.
  • The sort unit 121 takes the sequence {k_1, …, k_s} as input, sorts it in ascending order, and generates and outputs the sequence {~k_1, …, ~k_s} (where ~k_1 ≤ … ≤ ~k_s).
  • The function calculation unit 122 calculates f(~k_1).
  • Example 1: When estimating whether the estimation target attribute is "address"
  • Set Y = {Hokkaido, Aomori prefecture, Iwate prefecture, Miyagi prefecture, Akita prefecture, Yamagata prefecture, Fukushima prefecture, Ibaraki prefecture, Tochigi prefecture, Gunma prefecture, Saitama prefecture, Chiba prefecture, Tokyo, Niigata prefecture, Toyama prefecture, Ishikawa prefecture, Fukui prefecture, Yamanashi prefecture, Nagano prefecture, Gifu prefecture, Shizuoka prefecture, Aichi prefecture, Mie prefecture, Shiga prefecture, Kyoto prefecture, Osaka prefecture, Hyogo prefecture, Nara prefecture, Tottori prefecture, Shimane prefecture, Okayama prefecture, Hiroshima prefecture, Yamagu
  • Set Y consists of the 44 prefectures other than Kanagawa, Wakayama, and Kagoshima, and ratio p is the ratio of the total population of the 44 prefectures in set Y to the total national population at the time of the 2015 census.
  • n = 10000000
  • m = 1000
  • s = 10
  • ε = 0.95
  • Set Y is the set obtained by extracting the two-character surnames from the 30 most common surnames by population.
  • Ratio p is the ratio of the estimated number of people with a surname included in set Y to the total national population at the time of the 2015 census.
  • n = 10000000
  • m = 1000
  • s = 10
  • ε = 0.95
  • According to the embodiment of the present invention, database attributes can be estimated without processing all the data of the attribute subject to estimation.
  • By evaluating the match rate in a probabilistically accurate way, high-speed attribute estimation becomes possible.
  • FIG. 5 is a diagram showing an example of a functional configuration of a computer that realizes each of the above-mentioned devices (that is, each node).
  • The processing in each of the above devices can be carried out by having the recording unit 2020 read a program that causes the computer to function as each of the above devices, and by operating the control unit 2010, the input unit 2030, the output unit 2040, and so on.
  • The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them.
  • A device (drive) capable of reading from and writing to a recording medium such as a CD-ROM may also be provided in the hardware entity.
  • Physical entities equipped with such hardware resources include general-purpose computers.
  • The external storage device of the hardware entity stores the program required to realize the above functions and the data required to process this program (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). The data obtained by processing these programs is stored as appropriate in RAM, the external storage device, or the like.
  • Each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate.
  • As a result, the CPU realizes predetermined functions (the components described above as ... unit, ... means, and so on).
  • The present invention is not limited to the above embodiment and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiment need not be executed in chronological order as described; they may also be executed in parallel or individually according to the processing capacity of the device that executes them or as required.
  • When the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program.
  • By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
  • The program that describes this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disk; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electrically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
  • The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that define the computer's processing).
  • In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

Landscapes

  • Engineering & Computer Science
  • Databases & Information Systems
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

Provided is a technique for estimating database attributes without processing all of the data of the attribute subject to attribute estimation. The invention includes: a sequence generation unit that, where X is the set of data of an attribute of a database T subject to attribute estimation (the estimation target attribute), Y is a set of data (auxiliary information) used to estimate the estimation target attribute, and p is the ratio of data matching the auxiliary information included in set Y, generates a sequence {k_1, …, k_s} by repeating s times a process of generating a set of m data items extracted from set X as the i-th sample set X^(i) and finding the number k_i (i = 1, …, s) of data items, among the m data items of X^(i), that match any of the auxiliary information included in set Y; and an estimation result generation unit that uses the sequence {k_1, …, k_s} to determine whether a prescribed inequality holds and, if it holds, generates an estimation result that the estimation target attribute is the attribute represented by set Y.

Description

Attribute estimation device, attribute estimation method, and program
The present invention relates to a technique for estimating database attributes.
Database attributes are defined by the user, so attribute estimation is not normally needed. However, it is sometimes necessary to estimate the attribute of data from a given set of database data, for example when automating database anonymization.
Conventional methods for estimating the attribute of data from a set of that data include the methods used for image matching search in Non-Patent Document 1 and for malware detection in Non-Patent Document 2. In these methods, a match rate is calculated between the set of data subject to attribute estimation and a separately prepared set of data. If the match rate is at or above a certain value, the attribute of the data subject to attribute estimation is estimated to be the attribute of the separately prepared data.
This method can be applied to database attribute estimation, as the following example shows. Let the set of data subject to attribute estimation be {Midoricho, Musashino City, Tokyo; Hikarinooka, Yokosuka City, Kanagawa Prefecture; Morinosato-Wakamiya, Atsugi City, Kanagawa Prefecture; Hikaridai, Seika Town, Soraku County, Kyoto Prefecture; Hanabatake, Tsukuba City, Ibaraki Prefecture}, let the separately prepared set of data be {Tokyo, Kanagawa, Saitama, Chiba, Tochigi, Gunma, Ibaraki}, and let the attribute of the separately prepared data be "address". Suppose that the attribute of the data subject to attribute estimation is estimated to be "address" when the match rate calculated from these two sets is 70% or more. If matches are judged by partial match, the match rate is 4/5, that is, 80%, so the attribute of the data subject to attribute estimation can be estimated to be "address".
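The match-rate calculation in this example can be sketched in Python as follows. This is a minimal illustration, assuming that "partial match" means an auxiliary string occurs as a substring of a data item; the function name `match_rate`, the constant `THRESHOLD`, and the romanized strings are illustrative choices, not part of the source.

```python
def match_rate(data, auxiliary):
    """Return the fraction of items in `data` that contain at least one
    string from `auxiliary` as a substring (assumed meaning of partial match)."""
    matched = sum(1 for x in data if any(y in x for y in auxiliary))
    return matched / len(data)


# Data subject to attribute estimation (romanized from the example).
X = [
    "Midoricho, Musashino City, Tokyo",
    "Hikarinooka, Yokosuka City, Kanagawa Prefecture",
    "Morinosato-Wakamiya, Atsugi City, Kanagawa Prefecture",
    "Hikaridai, Seika Town, Soraku County, Kyoto Prefecture",
    "Hanabatake, Tsukuba City, Ibaraki Prefecture",
]
# Separately prepared data whose attribute is known to be "address".
Y = ["Tokyo", "Kanagawa", "Saitama", "Chiba", "Tochigi", "Gunma", "Ibaraki"]

THRESHOLD = 0.7  # the 70% threshold used in the example

rate = match_rate(X, Y)
is_address = rate >= THRESHOLD
print(rate, is_address)
```

Only the Kyoto entry fails to match, which gives the 4/5 rate stated in the example. Note that this baseline scans every item in the data set.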
However, to calculate the match rate, this method must extract the data of the attribute subject to estimation from every record in the database and process it. When the number of records is large, attribute estimation therefore takes an enormous amount of time.
An object of the present invention is therefore to provide a technique for estimating database attributes without processing all the data of the attribute subject to estimation.
One aspect of the present invention includes: a sequence generation unit that, where X is the set of data of an attribute of a database T subject to attribute estimation (hereinafter, the estimation target attribute), Y is a set of data used to estimate the estimation target attribute (hereinafter, auxiliary information), and p is the ratio of data matching the auxiliary information included in set Y, generates a sequence {k_1, …, k_s} by repeating s times a process of generating a set of m data items extracted from set X as the i-th sample set X^(i) (i = 1, …, s) and finding the number k_i (i = 1, …, s) of data items, among the m data items of X^(i), that match any of the auxiliary information included in set Y; and an estimation result generation unit that uses the sequence {k_1, …, k_s} to determine whether the following inequality holds and, if it holds, generates an estimation result that the estimation target attribute is the attribute represented by set Y.
[Inequality: formula image JPOXMLDOC01-appb-M000003, not reproduced in this text]
(Here, n is the number of data items in set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)
According to the present invention, database attributes can be estimated without processing all the data of the attribute subject to estimation.
FIG. 1 is a block diagram showing the configuration of the attribute estimation device 100.
FIG. 2 is a flowchart showing the operation of the attribute estimation device 100.
FIG. 3 is a block diagram showing the configuration of the estimation result generation unit 120.
FIG. 4 is a flowchart showing the operation of the estimation result generation unit 120.
FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each device in the embodiments of the present invention.
Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.
Before the embodiments are described, the notation used in this specification is explained.
^ (caret) denotes a superscript. For example, x^{y^z} means that y^z is a superscript of x, and x_{y^z} means that y^z is a subscript of x. Likewise, _ (underscore) denotes a subscript. For example, x^{y_z} means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.
The marks "^" and "~" in ^x and ~x for a character x should properly be written directly above "x", but are written as ^x and ~x owing to the notational constraints of the specification.
<First Embodiment>
Let X be the set of data of an attribute of database T subject to attribute estimation (hereinafter, the estimation target attribute), Y be the set of data used to estimate the estimation target attribute (hereinafter, auxiliary information), and p be the ratio of data matching the auxiliary information included in set Y. Specific examples of set Y and ratio p are given later.
The attribute estimation device 100 takes set X, set Y, and ratio p as input. If a predetermined condition is satisfied, it generates and outputs an estimation result that the estimation target attribute is the attribute represented by set Y; otherwise, it generates and outputs an estimation result that the estimation target attribute cannot be said to be the attribute represented by set Y.
The attribute estimation device 100 is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the attribute estimation device 100. FIG. 2 is a flowchart showing the operation of the attribute estimation device 100. As shown in FIG. 1, the attribute estimation device 100 includes a sequence generation unit 110, an estimation result generation unit 120, and a recording unit 190. The recording unit 190 records, as appropriate, the information necessary for the processing of the attribute estimation device 100. For example, set X, set Y, and ratio p are recorded in the recording unit 190.
The operation of the attribute estimation device 100 is described with reference to FIG. 2.
In S110, the sequence generation unit 110 takes set X and set Y as input, and generates and outputs a sequence {k_1, …, k_s} by repeating s times a process of generating a set of m data items (m is an integer of 1 or more) extracted from set X as the i-th sample set X^(i) (i = 1, …, s) and finding the number k_i (i = 1, …, s) of data items, among the m data items of X^(i), that match any of the auxiliary information included in set Y. Here, m is called the sample size and s is called the number of sampling rounds. Both m and s are integers of 1 or more.
Specifically, the sequence generation unit 110 generates the sequence {k_1, …, k_s} by executing the following processes (1) and (2) s times.
(1) The sequence generation unit 110 randomly extracts m data items from set X and generates the set whose elements are the extracted m data items as the i-th sample set X^(i) (i = 1, …, s). Therefore, X^(i) ⊂ X.
(2) For each data item x ∈ X^(i), the sequence generation unit 110 determines whether x matches any auxiliary information y ∈ Y. It then obtains the number k_i (i = 1, …, s) of data items in the i-th sample set X^(i) that matched some auxiliary information.
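Processes (1) and (2) can be sketched in Python as follows. This is a minimal illustration under two assumptions that the source does not state: sampling is done without replacement within each round, and a "match" is exact equality. The function name `generate_sequence`, the `rng` parameter, and the toy data are hypothetical.

```python
import random


def generate_sequence(X, Y, m, s, rng=None):
    """Repeat s times: draw m items from X at random, count those found in Y.

    Returns the list [k_1, ..., k_s]. Exact equality is assumed as the
    matching rule; the source only says a data item "matches" auxiliary
    information.
    """
    rng = rng or random.Random()
    aux = set(Y)  # set lookup keeps each membership test O(1) on average
    ks = []
    for _ in range(s):
        sample = rng.sample(X, m)  # process (1): sample without replacement
        ks.append(sum(1 for x in sample if x in aux))  # process (2): count matches
    return ks


# Toy data: 900 values present in Y and 100 values that are not.
X = ["Tokyo"] * 600 + ["Osaka"] * 300 + ["unknown"] * 100
Y = ["Tokyo", "Osaka"]
ks = generate_sequence(X, Y, m=50, s=10, rng=random.Random(0))
print(ks)
```

Each round inspects only m items, so the total work is on the order of m·s comparisons rather than one pass over all of X.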
In S120, the estimation result generation unit 120 takes the sequence {k_1, …, k_s} output in S110 and ratio p as input, and uses the sequence to determine whether the following inequality (1) holds. If inequality (1) holds, it generates and outputs an estimation result that the estimation target attribute is the attribute represented by set Y; otherwise, it generates and outputs an estimation result that the estimation target attribute cannot be said to be the attribute represented by set Y.
[Inequality (1): formula image JPOXMLDOC01-appb-M000004, not reproduced in this text]
(Here, n is the number of data items in set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1.)
Here, ε is called the accuracy.
Evaluating inequality (1) involves calculating the term {}_nC_j p^j (1-p)^{n-j} multiple times and performing s additions. Because such combination calculations are costly, the calculation is made more efficient by performing only the minimum necessary computation.
The estimation result generation unit 120, designed to make the calculation of inequality (1) more efficient, is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the estimation result generation unit 120. FIG. 4 is a flowchart showing the operation of the estimation result generation unit 120. As shown in FIG. 3, the estimation result generation unit 120 includes a sort unit 121, a function calculation unit 122, and an estimation unit 123.
The operation of the estimation result generation unit 120 is described with reference to FIG. 4.
In S121, the sort unit 121 takes the sequence {k_1, …, k_s} as input, sorts it in ascending order, and generates and outputs the sequence {~k_1, …, ~k_s} (where ~k_1 ≤ … ≤ ~k_s).
In S122, the function calculation unit 122 takes the sequence {~k_1, …, ~k_s} output in S121 as input and, for this sequence, calculates and outputs the values f(~k_i) of the function f(k) = Σ_{j=k}^{m} {}_nC_j p^j (1-p)^{n-j}. The function calculation unit 122 calculates f(~k_i) (i = 1, …, s) by executing the following processes (1) and (2).
(1) The function calculation unit 122 calculates f(~k_1).
(2) For i = 2, …, s, the function calculation unit 122 calculates f(~k_i) recursively using f(~k_{i-1}).
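One way to read processes (1) and (2) is sketched below in Python. The source does not spell out the recurrence; this sketch assumes f(~k_i) = f(~k_{i-1}) − Σ_{j=~k_{i-1}}^{~k_i − 1} {}_nC_j p^j (1-p)^{n-j}, which follows from the definition of f because the sequence is sorted in ascending order. The function names `term` and `tail_values` and the small parameter values are hypothetical.

```python
import math


def term(j, n, p):
    """One summand nCj * p^j * (1-p)^(n-j), as written in the source."""
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)


def tail_values(sorted_ks, m, n, p):
    """Return [f(~k_1), ..., f(~k_s)] for an ascending sequence.

    f(~k_1) is summed directly; each later value reuses the previous one:
    f(~k_i) = f(~k_{i-1}) - sum of term(j) for j in [~k_{i-1}, ~k_i).
    """
    values = [sum(term(j, n, p) for j in range(sorted_ks[0], m + 1))]
    for prev, cur in zip(sorted_ks, sorted_ks[1:]):
        removed = sum(term(j, n, p) for j in range(prev, cur))
        values.append(values[-1] - removed)
    return values


# Small illustrative parameters (not the values from the source).
ks = sorted([7, 3, 5, 5])
fs = tail_values(ks, m=10, n=12, p=0.4)
print(fs)
```

With this reading, each term is computed at most once across the whole sequence, and repeated values of ~k_i cost nothing extra. For parameters on the scale suggested in the source, plain floating-point arithmetic would likely underflow, so a practical implementation would probably work in log space; this sketch does not address that.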
In S123, the estimation unit 123 takes f(~k_i) (i = 1, …, s) output in S122 as input and uses these values to calculate Σ_{i=1}^{s} log f(~k_i). If inequality (1) holds, it generates and outputs an estimation result that the estimation target attribute is the attribute represented by set Y; otherwise, it generates and outputs an estimation result that the estimation target attribute cannot be said to be the attribute represented by set Y.
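The log-sum computed in S123 can be sketched in Python as follows. This is only the summation step: the form of inequality (1) is given as a formula image that is not available in this text, so the comparison involving ε is deliberately left out. The function name `log_tail_sum` is hypothetical.

```python
import math


def log_tail_sum(f_values):
    """Return the sum over i of log f(~k_i).

    math.log raises ValueError for values that are zero or negative,
    so every f(~k_i) must be strictly positive.
    """
    return sum(math.log(f) for f in f_values)


print(log_tail_sum([0.5, 0.25]))
```

Summing logarithms rather than multiplying the f(~k_i) directly avoids the product shrinking toward zero when s is large.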
(Specific examples of set Y and ratio p)
Example 1: When estimating whether the estimation target attribute is "address"
Let set Y = {Hokkaido, Aomori Prefecture, Iwate Prefecture, Miyagi Prefecture, Akita Prefecture, Yamagata Prefecture, Fukushima Prefecture, Ibaraki Prefecture, Tochigi Prefecture, Gunma Prefecture, Saitama Prefecture, Chiba Prefecture, Tokyo, Niigata Prefecture, Toyama Prefecture, Ishikawa Prefecture, Fukui Prefecture, Yamanashi Prefecture, Nagano Prefecture, Gifu Prefecture, Shizuoka Prefecture, Aichi Prefecture, Mie Prefecture, Shiga Prefecture, Kyoto Prefecture, Osaka Prefecture, Hyogo Prefecture, Nara Prefecture, Tottori Prefecture, Shimane Prefecture, Okayama Prefecture, Hiroshima Prefecture, Yamaguchi Prefecture, Tokushima Prefecture, Kagawa Prefecture, Ehime Prefecture, Kochi Prefecture, Fukuoka Prefecture, Saga Prefecture, Nagasaki Prefecture, Kumamoto Prefecture, Oita Prefecture, Miyazaki Prefecture, Okinawa Prefecture} and ratio p = 0.9076439. That is, set Y consists of the 44 prefectures other than Kanagawa, Wakayama, and Kagoshima, and ratio p is the ratio of the total population of the 44 prefectures in set Y to the total national population at the time of the 2015 census.
 For the sample size m, the number of sampling rounds s, and the accuracy ε, values appropriate to the number n of data items in the set X can be used. For example, when n = 10000000, the values m = 1000, s = 10, and ε = 0.95 are suitable.
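The sampling that produces the counts $k_1, \ldots, k_s$ can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `match_counts`, the fixed seed, the toy column `X`, the three-element stand-in for the set Y, and the choice to sample without replacement within each round are all assumptions.

```python
import random

def match_counts(X, Y, m, s, seed=0):
    """Draw s samples of m values from column X and count matches with Y in each."""
    rng = random.Random(seed)      # fixed seed only so the sketch is repeatable
    aux = set(Y)                   # auxiliary information, as a set for fast lookup
    counts = []
    for _ in range(s):
        sample = rng.sample(X, m)  # assumption: no replacement within a round
        counts.append(sum(1 for value in sample if value in aux))
    return counts

# Toy data: a small stand-in for the set Y and for the attribute column X.
Y = {"Hokkaido", "Tokyo", "Osaka Prefecture"}
X = ["Tokyo"] * 50 + ["Hokkaido"] * 30 + ["unknown"] * 20
ks = match_counts(X, Y, m=10, s=5)
print(ks)
```

With the values suggested above (n = 10000000, m = 1000, s = 10), the same loop would inspect at most 10000 values of the column.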
Example 2: Estimating whether the estimation target attribute is "name"
 Let the set Y = {Sato, Suzuki, Takahashi, Tanaka, Ito, Watanabe, Yamamoto, Nakamura, Kobayashi, Kato, Yoshida, Yamada, Yamaguchi, Matsumoto, Inoue, Kimura, Saito, Shimizu, Yamazaki, Abe, Ikeda, Hashimoto, Yamashita, Ishikawa, Nakajima, Maeda, Fujita} and the ratio p = 0.1778762. That is, the set Y is obtained by extracting the two-character surnames from the 30 most common surnames, and the ratio p is the estimated number of people with a surname in the set Y as a proportion of the total national population at the time of the 2015 census.
 For the sample size m, the number of sampling rounds s, and the accuracy ε, values appropriate to the number n of data items in the set X can be used. For example, when n = 10000000, the values m = 1000, s = 10, and ε = 0.95 are suitable.
 According to the embodiment of the present invention, the attributes of a database can be estimated without processing all of the data of the attribute subject to estimation. That is, even when only part of the data, obtained by sampling, is used instead of all of the data of that attribute, the match rate is evaluated probabilistically and accurately, which enables fast attribute estimation.
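As a worked illustration of how little data is touched, take the example values given for both examples (n = 10000000, m = 1000, s = 10). The fraction of the attribute's data that is sampled is then at most the following (an upper bound, since the same value may be drawn in more than one round):

```latex
\[
  \frac{m \cdot s}{n} = \frac{1000 \times 10}{10000000} = 0.001
\]
```

That is, no more than 0.1% of the data is examined under these example settings.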
<Supplement>
 FIG. 5 is a diagram showing an example of the functional configuration of a computer that implements each of the devices described above (that is, each node). The processing in each device can be carried out by having the recording unit 2020 read a program that causes the computer to function as that device, and by having the control unit 2010, the input unit 2030, the output unit 2040, and so on operate accordingly.
 The device of the present invention has, for example as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device, which is a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.
 The external storage device of the hardware entity stores the program needed to implement the functions described above and the data needed for the processing of this program (the program need not be stored in the external storage device; for example, it may be stored in a ROM, which is a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.
 In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data needed for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU implements the predetermined functions (the components referred to above as "... unit", "... means", and so on).
 The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the present invention. In addition, the processes described in the above embodiment need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.
 As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer implements the processing functions of the hardware entity on the computer.
 The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
 This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.
 A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing in accordance with the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or it may execute processing in accordance with the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and the retrieval of results. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
 In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.
 The above description of the embodiment of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings. The embodiment was chosen and described to provide the best illustration of the principles of the invention and to enable those skilled in the art to use the invention in various embodiments and with various modifications suited to the particular use contemplated. All such modifications and variations are within the scope of the invention as determined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

Claims (4)

  1.  An attribute estimation device, where X is a set of data of an attribute of a database T subject to attribute estimation (hereinafter referred to as the estimation target attribute), Y is a set of data used for estimating the estimation target attribute (hereinafter referred to as auxiliary information), and p is the ratio of matching the auxiliary information included in the set Y, the attribute estimation device comprising:
    a sequence generation unit that generates a sequence $\{k_1, \ldots, k_s\}$ by repeating s times a process of generating a set of m data items extracted from the set X as an i-th sample set X(i) ($i = 1, \ldots, s$) and obtaining the number $k_i$ ($i = 1, \ldots, s$) of data items, among the m data items of the i-th sample set X(i), that match any of the auxiliary information included in the set Y; and
    an estimation result generation unit that determines, using the sequence $\{k_1, \ldots, k_s\}$, whether the following inequality holds and, if the inequality holds, generates an estimation result that the estimation target attribute is the attribute represented by the set Y:
    Figure JPOXMLDOC01-appb-M000001

    (where n is the number of data items included in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1).
  2.  The attribute estimation device according to claim 1, wherein
    the estimation result generation unit includes:
    a sort unit that generates, from the sequence $\{k_1, \ldots, k_s\}$, a sequence $\{\tilde{k}_1, \ldots, \tilde{k}_s\}$ by sorting in ascending order (where $\tilde{k}_1, \ldots, \tilde{k}_s$ satisfy $\tilde{k}_1 \le \cdots \le \tilde{k}_s$);
    a function calculation unit that, with $f(k) = \sum_{j=k}^{m} {}_{n}C_{j}\, p^{j} (1-p)^{n-j}$, calculates $f(\tilde{k}_i)$ ($i = 1, \ldots, s$) for the sequence $\{\tilde{k}_1, \ldots, \tilde{k}_s\}$ by calculating $f(\tilde{k}_1)$ and, for $i = 2, \ldots, s$, calculating $f(\tilde{k}_i)$ inductively using $f(\tilde{k}_{i-1})$; and
    an estimation unit that calculates $\sum_{i=1}^{s} \log f(\tilde{k}_i)$ using $f(\tilde{k}_i)$ ($i = 1, \ldots, s$) and generates the estimation result when the inequality holds.
  3.  An attribute estimation method, where X is a set of data of an attribute of a database T subject to attribute estimation (hereinafter referred to as the estimation target attribute), Y is a set of data used for estimating the estimation target attribute (hereinafter referred to as auxiliary information), and p is the ratio of matching the auxiliary information included in the set Y, the attribute estimation method comprising:
    a sequence generation step in which an attribute estimation device generates a sequence $\{k_1, \ldots, k_s\}$ by repeating s times a process of generating a set of m data items extracted from the set X as an i-th sample set X(i) ($i = 1, \ldots, s$) and obtaining the number $k_i$ ($i = 1, \ldots, s$) of data items, among the m data items of the i-th sample set X(i), that match any of the auxiliary information included in the set Y; and
    an estimation result generation step in which the attribute estimation device determines, using the sequence $\{k_1, \ldots, k_s\}$, whether the following inequality holds and, if the inequality holds, generates an estimation result that the estimation target attribute is the attribute represented by the set Y:
    Figure JPOXMLDOC01-appb-M000002

    (where n is the number of data items included in the set X, and ε is a predetermined number satisfying 0 ≤ ε ≤ 1).
  4.  A program for causing a computer to function as the attribute estimation device according to claim 1 or 2.
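The inductive calculation of f recited in claim 2 can be read, under one plausible interpretation, as follows: because the counts are sorted in ascending order, each f value is obtained from the previous one by subtracting only the terms that drop out of the sum. The Python sketch below illustrates that reading. The function names are hypothetical, the input must be non-empty, and plain floating-point arithmetic is used, so it is suitable only for small illustrative parameters.

```python
import math

def term(n, j, p):
    """One summand of f: C(n, j) * p**j * (1 - p)**(n - j)."""
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)

def f_values_sorted(ks, m, n, p):
    """Sort the counts ascending and evaluate f on them, reusing the previous value."""
    ks_sorted = sorted(ks)
    prev_k = ks_sorted[0]
    prev_f = sum(term(n, j, p) for j in range(prev_k, m + 1))  # f at the smallest count
    values = [prev_f]
    for k in ks_sorted[1:]:
        # f(k) = f(prev_k) minus the terms j = prev_k, ..., k - 1
        prev_f -= sum(term(n, j, p) for j in range(prev_k, k))
        values.append(prev_f)
        prev_k = k
    return ks_sorted, values

# Toy parameters chosen only so the values are easy to verify by hand.
ks_sorted, values = f_values_sorted([3, 1, 3], m=4, n=4, p=0.5)
print(ks_sorted, values)
```

Sorting first means every summand is evaluated at most once across the whole sequence, rather than once per count.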
PCT/JP2020/018126 2020-04-28 2020-04-28 Attribute estimation device, attribute estimation method, and program WO2021220403A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/018126 WO2021220403A1 (en) 2020-04-28 2020-04-28 Attribute estimation device, attribute estimation method, and program
JP2022518489A JP7355232B2 (en) 2020-04-28 2020-04-28 Attribute estimation device, attribute estimation method, program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/018126 WO2021220403A1 (en) 2020-04-28 2020-04-28 Attribute estimation device, attribute estimation method, and program

Publications (1)

Publication Number Publication Date
WO2021220403A1 true WO2021220403A1 (en) 2021-11-04

Family

ID=78332324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/018126 WO2021220403A1 (en) 2020-04-28 2020-04-28 Attribute estimation device, attribute estimation method, and program

Country Status (2)

Country Link
JP (1) JP7355232B2 (en)
WO (1) WO2021220403A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018025706A1 (en) * 2016-08-05 2018-02-08 日本電気株式会社 Table meaning estimating system, method, and program
JP2019159778A (en) * 2018-03-13 2019-09-19 富士通株式会社 Information processing device, control program, and information processing method

Also Published As

Publication number Publication date
JPWO2021220403A1 (en) 2021-11-04
JP7355232B2 (en) 2023-10-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20934013

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022518489

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20934013

Country of ref document: EP

Kind code of ref document: A1