TWI824927B - Data synthesis system with differential privacy protection, method and computer readable medium thereof - Google Patents
Publication number: TWI824927B (TW112102060A)
Authority: TW (Taiwan)
Description
The present invention relates to data privacy protection technology, and in particular to a data synthesis system and method with differential privacy protection, and a computer-readable medium thereof.
Traditional data privacy protection methods typically apply techniques such as k-anonymity during de-identification, which cause irreversible damage to the data; most emerging approaches in recent years are therefore designed around differential privacy. For example, when Company A must hand a large amount of data to Company B for analysis and the data contains personal data or information, data protection is necessary. De-identification is usually applied so that sensitive data is not delivered to others, and differential privacy techniques are accordingly used to synthesize data for protection.
Differential privacy is a means of sharing data. The idea is that when the effect of randomly modifying a single record is small enough, statistics computed on the modified data (for example, feature statistics) cannot easily be used to infer the content of any single record, which protects privacy. In other words, differential privacy computes feature statistics only over the describable (that is, non-private) part of the data and does not disclose specific personal information, so it is widely used in de-identification procedures. However, existing data privacy solutions either suffer from severe privacy budget splitting or cannot achieve fully automated synthesis, and thus leave room for improvement.
Therefore, how to provide a data privacy protection technique, and in particular how, on the basis of differential privacy, to keep the impact on the synthesized data small while improving the quality of the synthetic data, has become a goal eagerly pursued by those skilled in the art.
To solve the above problems of the prior art, the present invention discloses a data synthesis system with differential privacy protection, comprising: a data preprocessing module for bucketizing the numerical data in the data fields of original data to generate a dataset; a view list synthesis module for randomly composing the dataset into a plurality of base views according to a predetermined number of view fields, sorting and numbering the base views according to the domain size of each base view, generating a cross view for each two consecutively ordered base views according to their domain sizes, and then comparing the domain size of the base view with the domain size of the cross view, such that no merging is performed when the domain size of the cross view is larger than that of the base view, and the currently stored value is replaced when the domain size of the cross view is not larger than that of the base view, whereby final base views and final cross views are generated after multiple iterations; a marginal distribution module for evenly allocating a privacy budget to the marginal tables generated from each final base view and each final cross view; a post-processing module for applying consistency, non-negativity, and integerization processing to the counts in the marginal tables; and a data synthesis module for combining two adjacent views to generate synthetic data, which, after data format conversion, takes the format of the original data.
In one embodiment, the cross view bridges the two consecutively ordered base views, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the dataset is not greater than half the number of view fields, no cross view is created between the last two base views.
In one embodiment, the view list synthesis module further comprises: a base view synthesis unit, which first defines an empty synthesis set and creates a synthesis set identical to the dataset, selects several distinct attributes from the synthesis set and adds them to the empty synthesis set, removes those attributes from the synthesis set, and continues the iteration loop while the number of elements in the empty synthesis set is greater than the number of distinct attributes, until that number is no longer greater, thereby producing the plurality of base views, which are sorted in descending order; a cross view synthesis unit, which first defines an empty set and, when the number of view fields is even, combines the attributes taken from the former of the two consecutively ordered base views whose domain value is greater than half the number of view fields with the attributes taken from the latter whose domain value is less than half the number of view fields, to produce the cross view of the two base views, and, when the number of view fields is odd, further evaluates the relation between the last attribute of the dataset and half the number of view fields, to obtain the cross view of the two base views iteratively; and a view list synthesis unit, which, when the domain size of the cross view is not larger than the domain size of the base view, computes the sum of the base view and the cross view and, when the sum is smaller than the currently stored value, replaces the stored value with the new sum, base view, and cross view, such that the final base views and the final cross views are obtained after multiple iterations.
In one embodiment, the marginal distribution module uses the Laplace mechanism to evenly allocate the privacy budget to the marginal tables generated from each final base view and each final cross view.
In one embodiment, the post-processing module further comprises: a consistency unit, which performs a weighted calculation on the counts of each final base view and each final cross view and updates those counts; a non-negativity unit, which iteratively sums and subtracts to eliminate negative values in the counts of each final base view and each final cross view; and an integerization unit, which removes or fills in values in the marginal tables so that each marginal table has the same number of records as the original data.
In one embodiment, the data synthesis module further comprises: a data synthesis unit, which sorts the common fields of the two adjacent views and then gives higher priority to fields with larger counts, so that the synthetic data is generated once the fields of the two adjacent views have been paired; and a data format conversion unit, which converts the synthetic data into the format of the original data.
The present invention further discloses a data synthesis method with differential privacy protection, executed by computer equipment and comprising the following steps: bucketizing the numerical data in the data fields of original data to generate a dataset; randomly composing the dataset into a plurality of base views according to a predetermined number of view fields, sorting and numbering the base views according to the domain size of each base view, generating a cross view for each two consecutively ordered base views according to their domain sizes, and then comparing the domain size of the base view with the domain size of the cross view, such that no merging is performed when the domain size of the cross view is larger than that of the base view, and the currently stored value is replaced when the domain size of the cross view is not larger than that of the base view, whereby final base views and final cross views are obtained after multiple iterations; evenly allocating a privacy budget to the marginal tables generated from each final base view and each final cross view; applying consistency, non-negativity, and integerization processing to the counts in the marginal tables; and combining two adjacent views to generate synthetic data, which, after data format conversion, takes the format of the original data.
In the above method, the cross view bridges the two consecutively ordered base views, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the dataset is not greater than half the number of view fields, no cross view is created between the last two base views.
In the above method, the step of generating the final base views and the final cross views further comprises: first defining an empty synthesis set and creating a synthesis set identical to the dataset, selecting several distinct attributes from the synthesis set and adding them to the empty synthesis set, removing those attributes from the synthesis set, and continuing the iteration loop while the number of elements in the empty synthesis set is greater than the number of distinct attributes, until that number is no longer greater, thereby producing the plurality of base views, which are sorted in descending order; first defining an empty set and, when the number of view fields is even, combining the attributes taken from the former of the two consecutively ordered base views whose domain value is greater than half the number of view fields with the attributes taken from the latter whose domain value is less than half the number of view fields, to produce the cross view of the two base views, and, when the number of view fields is odd, further evaluating the relation between the last attribute of the dataset and half the number of view fields, to obtain the cross view of the two base views iteratively; and, when the domain size of the cross view is not larger than the domain size of the base view, computing the sum of the base view and the cross view and, when the sum is smaller than the currently stored value, replacing the stored value with the new sum, base view, and cross view, such that the final base views and the final cross views are obtained after multiple iterations.
In the above method, the step of evenly allocating the privacy budget to the marginal tables generated from each final base view and each final cross view is based on the Laplace mechanism.
In the above method, the step of applying consistency, non-negativity, and integerization processing to the counts in the marginal tables further comprises: performing a weighted calculation on the counts of each final base view and each final cross view and updating those counts; iteratively summing and subtracting to eliminate negative values in the counts of each final base view and each final cross view; and removing or filling in values in the marginal tables so that each marginal table has the same number of records as the original data.
In the above method, the step of combining two adjacent views to generate synthetic data that takes the format of the original data after data format conversion further comprises: sorting the common fields of the two adjacent views and then giving higher priority to fields with larger counts, so that the synthetic data is generated once the fields of the two adjacent views have been paired; and converting the synthetic data into the format of the original data.
The present invention further discloses a computer-readable medium, for use in a computing device or computer, that stores instructions for executing the aforementioned data synthesis method with differential privacy protection.
In summary, the data synthesis system, method, and computer-readable medium with differential privacy protection of the present invention constitute an automated tabular data synthesis technique with differential privacy protection. First, preprocessing and view list generation are performed using the domain size information of the data fields; because this information is public, no privacy budget splitting is involved. Next, a marginal table (contingency table) is built for each view and injected with random noise drawn from a Laplace distribution to achieve differential privacy. To improve the quality of the synthetic data, the present invention also applies a series of post-processing procedures to these noisy marginal distributions, including non-negativity and consistency-aware normalization. Finally, the marginal distributions are stitched together iteratively to produce each complete synthetic record.
1: Data synthesis system with differential privacy protection
11: Data preprocessing module
12: View list synthesis module
121: Base view synthesis unit
122: Cross view synthesis unit
123: View list synthesis unit
13: Marginal distribution module
14: Post-processing module
141: Consistency unit
142: Non-negativity unit
143: Integerization unit
15: Data synthesis module
151: Data synthesis unit
152: Data format conversion unit
701-7052: Process flow
S601-S605: Steps
FIG. 1 is a system architecture diagram of the data synthesis system with differential privacy protection of the present invention.
FIG. 2 is a system architecture diagram of another embodiment of the data synthesis system with differential privacy protection of the present invention.
FIG. 3 is a schematic diagram of an example of list synthesis of base views and cross views according to the present invention.
FIG. 4 is a schematic diagram of four scenarios for synthesizing cross views according to the present invention.
FIG. 5 is a schematic diagram illustrating the non-negativity processing of the present invention.
FIG. 6 is a step diagram of the data synthesis method with differential privacy protection of the present invention.
FIG. 7 is a flowchart of a specific implementation of the data synthesis method with differential privacy protection of the present invention.
The technical content of the present invention is described below by way of specific embodiments. Those skilled in the art can readily understand the advantages and effects of the present invention from this disclosure. The present invention may also be implemented or applied through other specific embodiments.
FIG. 1 is a system architecture diagram of the data synthesis system with differential privacy protection of the present invention. As shown, the data synthesis system 1 with differential privacy protection comprises a data preprocessing module 11, a view list synthesis module 12, a marginal distribution module 13, a post-processing module 14, and a data synthesis module 15.
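The data preprocessing module 11 bucketizes the numerical data in the data fields of the original data to generate a dataset. In short, because the data synthesis system 1 can only process discrete numerical data, the purpose of the data preprocessing module 11 is to perform bucketization, which discretizes the data in numerical fields. Specifically, each value in a field is assigned to the interval that matches its magnitude and takes the index of that interval as its new value, with indices starting from 0. For example, given the intervals (-inf,10], (10,20], (20,inf), the value 13.7 falls in the second interval and receives the new value 1. The data preprocessing module 11 thereby converts continuous numerical fields into discrete numerical fields. Because the original data may contain several continuous numerical fields, all of them use the same number of buckets to avoid increasing the complexity of the task, so the number of buckets is a parameter that must be set in advance.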
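The bucketization step can be sketched as follows. This is a minimal Python illustration rather than the patented implementation; the function name `bucketize` and the choice to describe the intervals by their interior cut points are assumptions made for the example. Intervals are right-closed, matching (-inf,10], (10,20], (20,inf).

```python
from bisect import bisect_left


def bucketize(value, edges):
    """Return the 0-based index of the bucket that contains `value`.

    `edges` holds the sorted interior cut points (hypothetical representation),
    so the buckets are (-inf, e0], (e0, e1], ..., (e_last, +inf).
    """
    # bisect_left counts the cut points strictly below `value`,
    # which makes every bucket closed on the right.
    return bisect_left(edges, value)


if __name__ == "__main__":
    edges = [10, 20]  # three buckets, as in the example in the text
    for v in (3.0, 10, 13.7, 20, 25.5):
        print(v, "->", bucketize(v, edges))
```

With the intervals from the text, 13.7 maps to index 1, and a boundary value such as 10 stays in the first bucket because the intervals are closed on the right.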
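The view list synthesis module 12 randomly composes the dataset into a plurality of base views according to a predetermined number of view fields, sorts and numbers the base views according to the domain size of each base view, and generates a cross view for each two consecutively ordered base views according to their domain sizes. It then compares the domain size of the base view with the domain size of the cross view: no merging is performed when the domain size of the cross view is larger than that of the base view, and the currently stored value is replaced when the domain size of the cross view is not larger than that of the base view, so that the final base views and final cross views are generated after multiple iterations. In short, the purpose of the view list synthesis module 12 is to present the dataset generated by the data preprocessing module 11 as views: base views and cross views are generated first, and the subsequent marginal distribution, post-processing, and data synthesis procedures then complete the synthetic data.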
In one embodiment, the cross view bridges the two consecutively ordered base views, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the dataset is not greater than half the number of view fields, no cross view is created between the last two base views.
Existing methods for differentially private synthetic datasets (DPSD) build views by searching for highly correlated fields. By contrast, the data synthesis system 1 with differential privacy protection of the present invention uses the public information of field domain sizes to construct views that are less affected by noise. In the present invention, the data synthesis system 1 uses two types of views, base views and cross views, which are explained and defined below.
Base views are intended to cover all fields in the data, and the variable d denotes the number of fields contained in each of these views. The fewer fields a view contains, the more base views are needed to cover all fields in the data. This reduces the memory overhead and makes the sparsity of the marginals less pronounced, but the more base views there are, the less correlation information between fields is retained, and the more severe privacy budget splitting becomes. The present invention therefore defines the construction of base views as follows. Given the number of fields m in the original dataset O and a fixed view size d, there are two cases. If d | m (that is, m is divisible by d), m/d base views can be synthesized, each of size d. Otherwise, the last base view has a size smaller than d, which affects how the cross views are determined, as detailed later.
Cross views bridge the base views to compensate for the correlation information missing between them. Given the number of fields m, the view size d, and the base views B, three types of cross view may arise.
Type I (d | m): for each pair of base views b_i and b_{i+1}, a cross view can be synthesized directly. For example, as shown by b_1 to b_3 in the middle of FIG. 3 and c_1 to c_2 below them, given the base views b_1 = {a_1, a_2, a_3}, b_2 = {a_4, a_5, a_6}, and b_3 = {a_7, a_8, a_9}, then c_1 = {a_2, a_3, a_4} and c_2 = {a_5, a_7, a_8} is one possible set of cross views.
Type II (d does not divide m and the size of the last base view is greater than d/2): cross views can likewise be synthesized directly. For example, as shown by b_1 to b_3 in the middle of FIG. 3 and c_1 to c_2 below them, given the base views b_1 = {a_1, a_2, a_3}, b_2 = {a_4, a_5, a_6}, and b_3 = {a_7, a_8}, then c_1 = {a_2, a_3, a_4} and c_2 = {a_5, a_6, a_7} is one possible set of cross views, with overlapping fields.
Type III (d does not divide m and the size of the last base view is not greater than d/2): in this case, enough random fields to bring the last base view up to size d are selected from the preceding base view and placed in the last base view. No cross view is generated between these last two base views, because the compensation applied to the last base view is already sufficient to link them; synthesizing an additional cross view would be redundant and would aggravate privacy budget splitting.
The specific procedures for base views and cross views are detailed later.
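The three cases can be summarized in a short Python sketch. It is illustrative only: the function name `cross_view_plan` is hypothetical, and it assumes that the quantity compared with d/2 is the size of the last base view (m mod d), which is how the m = 8, d = 3 example reads.

```python
from math import ceil


def cross_view_plan(m, d):
    """Classify the cross-view case for m fields and view size d.

    Returns (case, number_of_base_views, number_of_cross_views).
    Assumption: the last base view holds m % d fields when d does not divide m.
    """
    n_base = ceil(m / d)
    last_size = m % d
    if last_size == 0:
        case = "I"    # d | m: one cross view per consecutive pair
    elif last_size > d / 2:
        case = "II"   # short last view, still bridged by a cross view
    else:
        case = "III"  # last view is padded instead, so one bridge is skipped
    n_cross = n_base - 1 if case != "III" else n_base - 2
    return case, n_base, max(n_cross, 0)


if __name__ == "__main__":
    for m in (9, 8, 7):
        print(m, cross_view_plan(m, 3))
```

Counting one cross view per bridged pair keeps the number of marginal tables, and therefore the number of ways the privacy budget is split, explicit.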
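The marginal distribution module 13 evenly allocates the privacy budget to the marginal tables generated from each final base view and each final cross view. In short, the purpose of the marginal distribution module 13 is to construct noisy marginal distributions. After the view list synthesis module 12 has built the base views and cross views from the public information of field domain sizes, the marginal distribution module 13 applies the Laplace mechanism to the corresponding marginals to guarantee ε-DP, where ε denotes the privacy budget. This is the only step in which the data synthesis system 1 accesses the original dataset O, which means that the statistical information of the data must be protected by noise perturbation.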
Specifically, the marginal table (contingency table) corresponding to a given view is easy to obtain. The Laplace mechanism is then applied to all marginal tables, with each marginal table receiving an equal share of ε. This is a direct way to guarantee differential privacy (DP), and most existing differentially private synthetic dataset (DPSD) algorithms follow this strategy, that is, ε is divided evenly among the marginal tables. In addition, the data synthesis system 1 uses the expected squared error (ESE) to obtain an optimal budget allocation. Although earlier research proposed using the concept of ESE to analyze the noise scale, its main purpose there was to simplify view updates on common fields during post-processing (consistency in particular), which differs from the present invention's use of ESE to obtain an optimal budget allocation.
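The even-split strategy can be sketched in Python as below. This is a simplified illustration using only the standard library, not the patented implementation: the function names, the dictionary representation of a marginal table, and the `sensitivity` default of 1 (which assumes neighboring datasets differ by adding or removing one record) are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginals(tables, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise to every cell, splitting epsilon evenly over the tables.

    `tables` is a list of dicts mapping a cell key to its count
    (a hypothetical representation of the marginal tables).
    """
    rng = random.Random(seed)
    eps_each = epsilon / len(tables)   # equal share per marginal table
    scale = sensitivity / eps_each     # Laplace scale used for each cell
    return [
        {cell: count + laplace_noise(scale, rng) for cell, count in table.items()}
        for table in tables
    ]


if __name__ == "__main__":
    tables = [{("0", "1"): 12, ("1", "1"): 30}, {("0",): 42}]
    print(noisy_marginals(tables, epsilon=1.0, seed=7))
```

Because the share of ε per table shrinks as more views are used, the noise scale grows with the number of marginal tables, which is the budget-splitting cost discussed in the text.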
The marginal distribution module 13 computes the proportion P_i of the privacy budget allocated to view v_i. Suppose there are k views {v_1, v_2, ..., v_k} with corresponding domain sizes {s_1, s_2, ..., s_k}. The privacy budget allocation can then be formulated as an optimization problem that minimizes a sum of k terms, one per view (the individual terms are not legible in the source text).
Applying the Karush-Kuhn-Tucker (KKT) conditions for i = 1, 2, ..., k, then substituting and rearranging, yields a closed-form expression for P_i (the intermediate expressions are likewise not legible in the source text). The budget allocated to each view v_i is therefore P_i ε. This optimized allocation avoids spending too much privacy budget on marginal tables with small domain sizes, where the gain in utility would be only slight, and lets marginal tables with larger domain sizes receive more budget to reduce the negative effect of noise.
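The optimization can be worked through under one explicit assumption, which is not confirmed by the text: that the objective is the total expected squared error of the Laplace-noised marginals, where view v_i has s_i cells and each cell receives Laplace noise of scale 1/(P_i ε), hence variance 2/(P_i ε)^2. The Lagrange multiplier λ is likewise introduced here for illustration. The patent's own formulas may differ.

```latex
\begin{aligned}
&\min_{P_1,\dots,P_k}\ \sum_{i=1}^{k}\frac{2\,s_i}{(P_i\,\varepsilon)^{2}}
\quad\text{s.t.}\quad \sum_{i=1}^{k}P_i = 1,\quad P_i > 0,\\[4pt]
&\mathcal{L} = \sum_{i=1}^{k}\frac{2\,s_i}{P_i^{2}\,\varepsilon^{2}}
 + \lambda\Bigl(\sum_{i=1}^{k}P_i - 1\Bigr),\\[4pt]
&\frac{\partial\mathcal{L}}{\partial P_i}
 = -\frac{4\,s_i}{P_i^{3}\,\varepsilon^{2}} + \lambda = 0
 \ \Longrightarrow\
 P_i = \Bigl(\frac{4\,s_i}{\lambda\,\varepsilon^{2}}\Bigr)^{1/3},\\[4pt]
&\frac{\partial\mathcal{L}}{\partial\lambda} = \sum_{i=1}^{k}P_i - 1 = 0
 \ \Longrightarrow\
 P_i = \frac{s_i^{1/3}}{\sum_{j=1}^{k}s_j^{1/3}}.
\end{aligned}
```

Under this assumption P_i grows with s_i, which agrees with the statement that marginal tables with larger domain sizes receive more of the budget.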
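The post-processing module 14 applies consistency, non-negativity, and integerization processing to the counts in the marginal tables. In short, the post-processing module 14 refines the differentially private synthetic dataset (DPSD) to improve its utility, through non-negativity, consistency, and normalization (integerization). Consistency makes the aggregated statistics agree with public background knowledge. Non-negativity handles the fact that counts in a view's table may become negative after noise is added. Integerization restores the condition that the counts of a marginal table were integers before noise injection. These procedures are detailed later.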
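One possible reading of the non-negativity step is sketched below in Python. The procedure shown (clip negative cells to zero, then subtract the resulting excess evenly from the remaining positive cells, and repeat) is a commonly used technique swapped in here for illustration; the text does not confirm that it is the method used, and the function name `make_non_negative` is hypothetical.

```python
def make_non_negative(counts, tol=1e-9):
    """Remove negative cells while keeping the table total unchanged.

    Assumes the noisy total is positive; otherwise every cell ends at zero.
    """
    total = sum(counts)
    est = list(counts)
    while True:
        # Clip negatives, which can only raise the sum above `total`.
        est = [max(c, 0.0) for c in est]
        positive = [i for i, c in enumerate(est) if c > 0]
        excess = sum(est) - total
        if excess <= tol or not positive:
            return est
        # Spread the excess evenly over the cells that are still positive.
        share = excess / len(positive)
        for i in positive:
            est[i] -= share


if __name__ == "__main__":
    print(make_non_negative([5.0, -1.0, 3.0, -2.0]))
```

Each pass either finishes or shrinks the set of positive cells, so the loop terminates after at most as many passes as there are cells.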
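The data synthesis module 15 combines two adjacent views to generate synthetic data, which, after data format conversion, takes the format of the original data. After the processing of the post-processing module 14, the data synthesis module 15 synthesizes the data into a differentially private synthetic dataset (DPSD). In other words, given the base views and cross views, the goal of data synthesis is to link suitable partial records from different views into n complete m-dimensional records, where n is the number of records in the dataset and m is the number of fields. Whereas most existing DPSD algorithms rely on sampling-based data synthesis, the data synthesis system 1 of the present invention uses a different, linking-based method. This ensures that exactly n records are synthesized, because the integerization in the preceding step synchronizes the counts across views. Finally, the data is converted so that the synthetic data conforms to the format of the original data.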
Compared with the prior art, the present invention uses the domain sizes of the data fields to drive view list synthesis for high-dimensional data, whereas prior view-based methods spend part of the privacy budget on selecting suitable marginal tables. Because the field domain size information used by the present invention is independent of the data, privacy budget splitting is significantly reduced. In addition, allocating the privacy budget to each view through mathematical analysis achieves the best balance between data utility and privacy at the noise injection stage, and post-processing procedures such as non-negativity, consistency, and integerization not only improve the quality of the synthetic data but also improve the running efficiency of the system.
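FIG. 2 is a system architecture diagram of another embodiment of the data synthesis system with differential privacy protection of the present invention. As shown, the data preprocessing module 11, view list synthesis module 12, marginal distribution module 13, post-processing module 14, and data synthesis module 15 are the same as those described for FIG. 1. This embodiment further describes the internal architecture of the view list synthesis module 12, the post-processing module 14, and the data synthesis module 15.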
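The view list synthesis module 12 generates the basic views and cross views and then combines them. It therefore further includes a basic view synthesis unit 121, a cross view synthesis unit 122, and a view list synthesis unit 123.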
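The basic view synthesis unit 121 first defines an empty set and creates a working set identical to the attribute set of the data set. It selects several distinct attributes from the working set, adds them to the empty set as one view, and removes them from the working set. The loop continues while the number of remaining elements is greater than the number of attributes selected per view and ends once it is not greater, which yields the plurality of basic views. The basic views are then sorted in descending order. In short, defining basic views is not enough; an algorithm for constructing them is also needed. The basic view synthesis unit 121 contains such an algorithm, referred to here as base view generation (BVG).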
BVG can be understood as a simple algorithm for synthesizing random basic views; its distinguishing feature is how it orders them. The basic views as first synthesized have no defined order, so once they exist, BVG sorts and numbers them in descending order according to each view's top ⌊…⌋ largest domain sizes. For a basic view b_i, these largest domain sizes are denoted s_{j1}, ..., and a_{j1}, ... denote the fields of b_i with the largest domain sizes, where j_1, ... ∈ [m]. For a sorted list [β_1, ..., β_{2k}], partitioning it into the pairs [[β_1, β_{2k}], ..., [β_k, β_{k+1}]] minimizes the sum of the products of the paired elements. Actual basic view construction is more complicated, for example because the view size is not limited to two, so BVG sorts and numbers the basic views in descending order by their top ⌊…⌋ largest domain sizes in an attempt to achieve a similar minimization. For example, the upper part of FIG. 3 contains four basic views. After sorting and numbering, the basic view whose value over its top ⌊…⌋ largest domain sizes is the maximum, 42, receives view number b_1.
The BVG algorithm in detail is as follows. The inputs are a preset view size d and the attribute set of the data set. BVG first creates an empty set B and a set S whose content equals the attribute set. It then iterates: d distinct attributes are chosen at random from S to form a set b, b is added to B, and S is updated to S \ b. When the number of elements in S is not greater than d, S itself is added to B and the loop ends. Having produced the preliminary set of basic views B, BVG sorts it. For each basic view it takes the attributes with the top ⌊…⌋ largest domain sizes and multiplies those domain sizes together. The resulting |B| products are sorted in descending order, and the basic views in B are reordered to match. Finally, BVG outputs the sorted B. Note that ⌊·⌋ denotes the integer result after discarding the remainder.
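The BVG steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation. The function name, the `domain_sizes` mapping, and the `rng` parameter are hypothetical. The number of top domain sizes used for scoring is not legible in the text, so it is exposed as `top_k` and defaults to `d // 2` purely as an assumption.

```python
import math
import random


def base_view_generation(domain_sizes, d, top_k=None, rng=random):
    """Randomly split attributes into base views of at most d attributes, then sort them.

    domain_sizes: dict mapping attribute name -> domain size (hypothetical input format).
    top_k: how many of the largest domain sizes enter each view's score.
           Assumption: defaults to d // 2, since the original expression is lost.
    """
    if top_k is None:
        top_k = max(1, d // 2)

    remaining = set(domain_sizes)
    views = []
    # Draw d distinct attributes at a time while more than d remain.
    while len(remaining) > d:
        picked = rng.sample(sorted(remaining), d)
        views.append(picked)
        remaining -= set(picked)
    # The leftover attributes (at most d of them) form the last base view.
    views.append(sorted(remaining))

    def score(view):
        largest = sorted((domain_sizes[a] for a in view), reverse=True)[:top_k]
        return math.prod(largest)

    # Descending order of the product of each view's largest domain sizes.
    return sorted(views, key=score, reverse=True)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing candidate view lists.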
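The cross view synthesis unit 122 first defines an empty set. When the number of view fields is even, it takes from the former of two consecutively ordered basic views the attributes ranking in the larger half by domain size and from the latter the attributes ranking in the smaller half, and combines them into the cross view of those two basic views. When the number of view fields is odd, it additionally evaluates the relationship between the last attribute of the data set and half the number of view fields, and obtains the cross view of the two consecutively ordered basic views iteratively. In short, once BVG has computed and numbered the basic views, the cross views are constructed. The cross view synthesis unit 122 contains the algorithm for doing so, referred to here as cross view generation (CVG).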
As described above, a cross view can be synthesized between each pair of basic views b_i and b_{i+1}. Depending on the domain sizes of the fields in b_i and b_{i+1}, there are four scenarios for synthesizing a cross view: (a) few-to-few, (b) many-to-many, (c) many-to-few, and (d) low-to-high, as shown in FIG. 4. Because the final step stitches views together, the diversity of the synthesized records strongly affects whether they reflect the actual state of the original data, and the robustness of a view to noise must also be considered. The many-to-few configuration is therefore the preferred choice for cross view synthesis.
The CVG algorithm in detail is as follows. The inputs are a preset view size d, the attribute set of the data set, and the set of basic views B. CVG first creates an empty set C to hold the cross views, and then follows one of two procedures depending on whether d is even. If d is even, CVG creates an empty set c_0 and selects b_0 and b_1 from B. It adds to c_0 the attributes of b_0 with the top […] largest domain sizes and the attributes of b_1 with the top […] smallest domain sizes, and then adds c_0 to C, which completes one pass. Because B also yields the further pairs of adjacent basic views b_1, b_2; b_2, b_3; ...; […], CVG repeats the same operation on each of these pairs to obtain the final cross view set, of size […]. If d is odd, CVG further checks whether […] > […]; if so, the variable x is set to […] − 1, and otherwise to […] − 2.
CVG then iterates for i = 1 to x as follows. It selects from b_i the attributes with the top […] largest domain sizes and stores them in a temporary array set_1, selects from b_{i+1} the attributes with the top […] smallest domain sizes and stores them in a temporary array set_2, and stores set_1 ∪ set_2 in a temporary array c_i. Next, it selects from b_{i+1} \ set_2 the attribute with the smallest domain size and records it in the variable p. If the domain size of set_1 is smaller than or equal to the product of the domain sizes of set_2 and p, that is, dom(set_1) ≤ dom(set_2)·dom(p), then p is replaced by the attribute with the smallest domain size in b_i \ set_1. Finally, p is added to c_i and c_i is stored in C, completing one iteration. After x iterations, the cross view set of size […] is obtained. Note that ⌈·⌉ denotes the integer result after discarding the remainder and adding 1, B \ s denotes the set B with s removed, and set_1 ∪ set_2 denotes the union of set_1 and set_2.
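The even-d branch of CVG can be sketched as below. It is only an illustration under stated assumptions: the function name and the `domain_sizes` mapping are hypothetical, and the per-view count is taken to be d/2, following the description of unit 122, which speaks of half the number of view fields. The odd-d branch is left out because its conditions are not fully legible in the text.

```python
def cross_views_even(base_views, domain_sizes, d):
    """Build one cross view per adjacent pair of base views (even d only).

    base_views: list of attribute lists, already sorted by BVG.
    domain_sizes: dict mapping attribute name -> domain size (hypothetical format).
    """
    if d % 2 != 0:
        raise ValueError("this sketch covers the even-d case only")
    half = d // 2  # assumption: half of the view size comes from each neighbour

    cross = []
    for former, latter in zip(base_views, base_views[1:]):
        # "Many-to-few": largest domains of the former, smallest of the latter.
        largest = sorted(former, key=domain_sizes.get, reverse=True)[:half]
        smallest = sorted(latter, key=domain_sizes.get)[:half]
        cross.append(largest + smallest)
    return cross
```

For k base views the sketch returns k − 1 cross views, one per adjacent pair.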
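When the domain size of the cross views is not greater than that of the basic views, the view list synthesis unit 123 computes the total of the basic views and the cross views. When this total is smaller than the existing temporary value, the new total, basic views, and cross views replace it. After multiple iterations, the final basic views and final cross views are obtained. In short, the view list synthesis unit 123 contains an iterative algorithm, called view list generation (VLG), which calls BVG and CVG as subroutines to synthesize the best basic views and cross views.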
VLG iterates a given number of times and returns the best basic views and cross views. Alternatively, VLG may use an early-stop strategy and iterate a non-predetermined number of times to find the best solution. Brute-force search is not used because the number of possible view combinations is too large. For example, for a data set with m = 25 fields, finding the best views for d = 5 by brute force requires (m − 1)!/(d − 1)! ≈ 2.585e22 checks.
The VLG algorithm in detail is as follows. The inputs are a preset view size d, the attribute set of the data set, and a preset number of iterations X. VLG first sets the variable D_best to inf and sets B_best and C_best to empty sets. It then performs the following X times. It calls BVG with the attribute set and d to produce the basic views B, and calls CVG with the attribute set, d, and B to produce the cross views C. For each cross view in C, it checks whether the cross view's domain size is larger than that of the basic view in B with the largest domain size. As soon as one cross view is found to be larger, the current round is abandoned and the next round begins. If none is larger, the total domain size of the round is computed: the domain sizes of all basic views in B are summed, the domain sizes of all cross views in C are summed, and the two sums are added and stored in the variable D. If D is smaller than the current D_best, D becomes the new D_best, B replaces B_best, and C replaces C_best. After X iterations, B_best and C_best are output as the final basic views and cross views.
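The VLG loop described above can be sketched as follows. The function names and the callable signatures for `bvg` and `cvg` are hypothetical, chosen only so that any base view and cross view generators can be plugged in. The sketch also assumes that the domain size of a view is the product of the domain sizes of its attributes.

```python
import math


def view_domain_size(view, domain_sizes):
    """Assumed definition: product of the domain sizes of the view's attributes."""
    return math.prod(domain_sizes[a] for a in view)


def view_list_generation(domain_sizes, d, rounds, bvg, cvg):
    """Keep the base/cross view pair with the smallest total domain size.

    bvg(domain_sizes, d) -> list of base views        (hypothetical signature)
    cvg(base, domain_sizes, d) -> list of cross views (hypothetical signature)
    """
    best_total, best_base, best_cross = math.inf, [], []
    for _ in range(rounds):
        base = bvg(domain_sizes, d)
        cross = cvg(base, domain_sizes, d)

        largest_base = max(view_domain_size(v, domain_sizes) for v in base)
        # Abandon the round as soon as a cross view outgrows every base view.
        if any(view_domain_size(c, domain_sizes) > largest_base for c in cross):
            continue

        total = sum(view_domain_size(v, domain_sizes) for v in base + cross)
        if total < best_total:
            best_total, best_base, best_cross = total, base, cross
    return best_base, best_cross
```

If every round is abandoned, the sketch returns two empty lists, mirroring the initial values of B_best and C_best.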
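The post-processing module 14 refines the differentially private synthetic dataset (DPSD). It includes a consistency unit 141, a non-negativity unit 142, and an integerization unit 143.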
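The consistency unit 141 updates the counts of each final basic view and each final cross view by a weighted calculation over those counts. The mechanism of weighted consistency is outlined first. Let the x given views be v_1, ..., v_x, and let v = v_1 ∩ ... ∩ v_x = {a_{j1}, ..., a_{jy}} be their y common fields, with j_1, ..., j_y ∈ [m]. With T_{v_i}(c) defined as the count of cell c retrieved from marginal table v_i, T_v((a_{j1}, ..., a_{jy})) is computed as a weighted combination using weights w_1, ..., w_x, that is, T_v((a_{j1}, ..., a_{jy})) = Σ_i w_i · T_{v_i}((a_{j1}, ..., a_{jy})). When w_1, ..., w_x = 1, weighted consistency degenerates into the ordinary normalization procedure, so a good combination of weights must be found to obtain the best weighted value.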
To this end, the consistency unit 141 uses the ESE to compute the noise scale of v: [...]. The objective function and the derivation of its optimal solution are as follows: min [...].
Applying the Karush-Kuhn-Tucker (KKT) conditions, let [...], and for i = 1, ..., k let [...]. The derivation gives [...] and also [...]. Substituting therefore yields [...], and for any element C = (a_{j1}, ..., a_{jy}) of view v, [...]. All view tables can then update their own values using T_v(C) as the target. Without loss of generality, suppose the view table v_i, with y ≤ d and j_1, ..., j_d ∈ [m], has an element C' = (a_{j1}, ..., a_{jy}, ..., a_{jd}). Its update is T_{v_i}(C') ← T_{v_i}(C') + (1/ξ_i)·(T_v(C) − T_{v_i}(C)), where C is the element (a_{j1}, ..., a_{jy}) over the common fields v = v_1 ∩ ... ∩ v_x and ξ_i denotes dom(v_i \ v).
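The non-negativity unit 142 eliminates negative values from the counts of each final basic view and each final cross view by iteratively summing and subtracting. Counts in a view's table may become negative after noise is added; processing by the consistency unit 141 may improve this, but only to a limited extent. The present invention therefore proposes a novel approach, shown in FIG. 5: the non-negativity unit 142 removes the negative values through an iterative sum-and-subtract process while preserving the shape of the data distribution. Because the algorithm involves only simple addition and subtraction, it is also considerably faster to compute.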
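The integerization unit 143 removes or adds counts in the marginal tables so that each marginal table has the same number of records as the original data. Before noise injection, every count in a marginal table is an integer. For the output of the two procedures above applied to the noisy marginal tables, some known DPSD studies treat the table as a data distribution and generate records by random sampling. Because sampling is a random process, however, the distribution of the generated records may deviate and may even fail to reflect the record diversity of the original data. The present invention instead integerizes each marginal table, removing or adding counts so that each table yields exactly as many records as the original data, which resolves this problem.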
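Specifically, the integerization technique of the integerization unit 143 splits each count into an integer part and a fractional part. The extra counts are added to the integer parts of the corresponding elements in descending order of fractional part. This process ensures that the corrected marginal table contains only integer counts and that they sum exactly to the number of records in the original data set. For example, take a data set with n = 5 records and counts [0.4, 3.3, 1.3]. The integer parts are [0, 3, 1], which sum to only 4, and the fractional parts are [0.4, 0.3, 0.3]. Because the first element has the largest fractional part, 1 is added to it, giving [1, 3, 1]. The sum is now 5, equal to the true number of records, so this table replaces the earlier non-integer count table.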
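The worked example above can be reproduced with a short sketch. The function name is hypothetical, and the sketch covers only the case described in the example, where counts are non-negative and the integer parts fall short of n by at most one unit per cell; removing surplus counts is not shown.

```python
import math


def integerize(counts, n):
    """Round counts to integers that sum to n, favouring large fractional parts."""
    whole = [math.floor(c) for c in counts]
    frac = [c - w for c, w in zip(counts, whole)]
    missing = n - sum(whole)  # how many units still have to be handed out

    # Hand out the missing units in descending order of fractional part.
    order = sorted(range(len(counts)), key=lambda i: frac[i], reverse=True)
    for i in order[:max(missing, 0)]:
        whole[i] += 1
    return whole


print(integerize([0.4, 3.3, 1.3], 5))  # → [1, 3, 1]
```

Ties between equal fractional parts are broken by position here, a choice the text does not specify.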
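The purpose of the data synthesis module 15 is to synthesize the data into a differentially private synthetic dataset (DPSD). It includes a data synthesis unit 151 and a data format conversion unit 152.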
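The data synthesis unit 151 sorts the partial records of two adjacent views over their common fields, gives higher priority to those with higher counts, and produces the synthetic data once the records of the two adjacent views have been paired. The following first explains in general terms how complete records are formed by linking data from different views.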
Maximum cardinality matching problem for view combination. View combination is in fact an optimization problem. Consider a simple example with two views v_1 = {a_1, a_2, a_3} and v_2 = {a_2, a_3, a_4}. Suppose v_1 has a single partial record [a_2 = 0, a_3 = 0] that can be linked to a partial record of v_2, and the newly linked a_4 field can only be [a_4 = 3] or [a_4 = 4]. Suppose v_1 also has two partial records, [a_2 = 0, a_3 = 1] and [a_2 = 0, a_3 = 1], which can be linked to the partial record [a_4 = 3]. In this situation, if [a_2 = 0, a_3 = 0] is linked to [a_4 = 3], then [a_2 = 0, a_3 = 1] has nothing left to link to. The combination of two views can therefore be modelled as finding a maximum cardinality matching in a bipartite graph, in which partial records of v_1 on the left and partial records of v_2 on the right are linked to form complete records. An edge exists between two partial records when their values on the common fields are the same. Because the maximum cardinality matching problem seeks a matching with as many edges as possible, a maximum cardinality matching of this bipartite graph gives the best combination of the two views.
Heuristic view combination. For large data sets, even the state-of-the-art Hopcroft-Karp algorithm for the maximum cardinality matching problem remains inefficient. The present invention proposes an efficient heuristic alternative. The data synthesis unit 151 considers two adjacent views v_1 and v_2 whose common fields are the set A = {a_1, ..., a_k}. Note that v_1 may be a new view formed by combining other views, that is, |v_1| ≥ d. The partial records of v_1 are sorted in descending order of their counts, where partial records are those sharing the same combination of values on the common fields. Partial records with higher counts are given higher priority. Finally, starting from the highest priority, each partial record of v_1 is paired with a partial record of v_2; that is, partial records of v_1 look for matches in v_2 in order of their counts.
In the example above, [a_2 = 0, a_3 = 1] is assigned a higher priority than the partial record [a_2 = 0, a_3 = 0], which occurs only once, because [a_2 = 0, a_3 = 1] occurs twice. One [a_2 = 0, a_3 = 1] is therefore paired with the record [a_4 = 3], the other [a_2 = 0, a_3 = 1] has no record to pair with, and [a_2 = 0, a_3 = 0] is paired with [a_4 = 4]. In the end, this procedure synthesizes only two three-dimensional records. When a partial record of v_1 has several candidate partial records in v_2, it is paired with one of them at random.
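The priority-based pairing in this example can be sketched as follows. All names are hypothetical. Compatibility between a left and a right partial record is passed in as a caller-supplied predicate, and the sketch takes the first free compatible record instead of a random one so that the result is deterministic; both choices are simplifications for illustration.

```python
from collections import Counter


def greedy_link(left, right, compatible):
    """Pair left partial records with right ones, most frequent left records first.

    left: list of hashable partial records from v_1.
    right: list of partial records from v_2.
    compatible(l, r): hypothetical predicate saying whether l may link to r.
    """
    freq = Counter(left)
    # Higher count means higher priority; sorted() keeps ties in input order.
    order = sorted(range(len(left)), key=lambda i: freq[left[i]], reverse=True)

    used = set()
    links = []
    for i in order:
        for j, candidate in enumerate(right):
            if j not in used and compatible(left[i], candidate):
                used.add(j)
                links.append((left[i], candidate))
                break  # this left record is linked; move to the next one
    return links


# The example from the text: (a_2, a_3) records on the left, a_4 values on the right.
left = [(0, 0), (0, 1), (0, 1)]
right = [3, 4]
can_link = lambda l, r: r == 3 or l == (0, 0)
print(greedy_link(left, right, can_link))  # → [((0, 1), 3), ((0, 0), 4)]
```

As in the text, only two complete records result, because the second (0, 1) record finds no free partner.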
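The data format conversion unit 152 converts the synthetic data into the format of the original data. Data preprocessing is performed at the start and converts the content of the data into discrete data, building a dictionary that records the mapping for text content. After the synthetic data is produced, the data format conversion unit 152 therefore converts it so that its content has the same format as the original data set; that is, attributes that held text are expressed as text again. Attributes that were originally numerical are assigned a random value within the corresponding range. For example, suppose that in preprocessing the value 13.7, given the intervals (-inf, 10], (10, 20], (20, inf), falls in the second interval and is assigned the new value 1. In the data format conversion step, a value is then picked at random from (10, 20], the range corresponding to the new value 1, as the final attribute value.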
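The interval example can be sketched as below. The function names are hypothetical. Because the text does not say how the unbounded outer intervals are sampled, the sketch assumes finite `data_min` and `data_max` limits supplied by the caller.

```python
import bisect
import random


def encode_value(x, bounds):
    """Index of the right-closed interval containing x.

    bounds=[10, 20] describes (-inf, 10], (10, 20], (20, inf).
    """
    return bisect.bisect_left(bounds, x)


def decode_index(index, bounds, data_min, data_max, rng=random):
    """Draw a random value from the interval with the given index.

    data_min and data_max are assumed finite stand-ins for -inf and inf.
    """
    edges = [data_min] + list(bounds) + [data_max]
    low, high = edges[index], edges[index + 1]
    # rng.random() lies in [0, 1), so the result lies in (low, high].
    return high - rng.random() * (high - low)


print(encode_value(13.7, [10, 20]))  # → 1
```

Text attributes would be restored separately by inverting the dictionary built during preprocessing.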
FIG. 6 is a step diagram of the data synthesis method with differential privacy protection of the present invention. The method uses the domain sizes of the data fields to guide view list synthesis for high-dimensional data and includes the following steps.
In step S601, the numerical data in the data fields of the original data is converted into intervals to generate a data set. This step is the data preprocessing, which discretizes the data of numerical fields into intervals, and can be performed by the data preprocessing module 11 of FIG. 1.
In step S602, the data set is randomly synthesized into a plurality of basic views according to a predetermined number of view fields, and the basic views are sorted and numbered according to their domain sizes. A cross view is then generated for each pair of consecutively ordered basic views according to their domain sizes. Next, the domain size of the basic views is compared with that of the cross views: when a cross view's domain size is greater than the basic view's, no merging is done, and when it is not greater, the result may replace the existing temporary value. After multiple iterations, the final basic views and final cross views are obtained. In this step, the base view generation algorithm (BVG) turns the data set from the previous step into basic views, the cross view generation algorithm (CVG) derives the cross view between each pair of basic views, and the view list generation algorithm (VLG) combines the basic views and cross views. These operations can be performed by the view list synthesis module 12 of FIG. 1.
In step S603, the privacy budget is distributed evenly among the marginal tables generated from the final basic views and final cross views. In this step, the Laplace mechanism is applied to the corresponding marginals to ensure ε-DP, so that each marginal table receives an equal part of the privacy budget ε. These operations can be performed by the marginal distribution module 13 of FIG. 1.
In step S604, the counts in the marginal tables are made consistent, non-negative, and integer-valued. The purpose of this step is to refine the differentially private synthetic data through the consistency, non-negativity, and integerization procedures. Typically, post-processing first iterates the consistency and non-negativity procedures several times and performs integerization last. These operations can be performed by the post-processing module 14 of FIG. 1.
In step S605, each pair of adjacent views is combined to produce synthetic data, which after data format conversion takes the format of the original data. This step outputs the perturbed views and then converts their format to produce data in the same format as the original data set. These operations can be performed by the data synthesis module 15 of FIG. 1.
FIG. 7 is a flow chart of a specific implementation of the data synthesis method with differential privacy protection of the present invention. As shown, the operation of the system and its flow are explained through a specific embodiment.
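In flow 701, data preprocessing is performed. Specifically, the input data set D has n records and m attributes, denoted [...]. The system decides from the data type of each attribute in D whether it is numerical or categorical: text is categorical, and decimal or integer values are numerical. For a categorical attribute a_i, the system first collects all distinct values that occur, [a_{i0}, a_{i1}, ..., [...]], and then re-encodes them as numbers that are applied to the data of a_i in D, that is, {a_{i0}: 0, a_{i1}: 1, ..., [...]: |a_i| − 1}. For a numerical attribute a_i, the system first obtains the maximum a_{iMax} and minimum a_{iMin} of its data and, according to a preset number of intervals k, divides the range into k intervals [a_{iMin}, a_{iMin} + [...]), [a_{iMin} + [...], a_{iMin} + [...]), ..., [a_{iMin} + [...], a_{iMax}]. Every value of a_i is then re-encoded as the index of its interval. Once all attributes have been re-encoded, the content of D is entirely numerical.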
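The two encodings of flow 701 can be sketched as follows. The function names are hypothetical. The interval widths are not legible in the text, so the numerical encoder assumes k equal-width intervals between the minimum and the maximum, with the last interval closed on the right.

```python
def encode_categorical(values):
    """Map each distinct value to 0, 1, 2, ... in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def encode_numeric(values, k):
    """Replace each value by the index of its interval.

    Assumption: k equal-width intervals over [min, max].
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single interval
    # min(..., k - 1) keeps the maximum inside the last, right-closed interval.
    return [min(int((v - lo) / width), k - 1) for v in values]


print(encode_categorical(["red", "blue", "red"])[0])  # → [0, 1, 0]
print(encode_numeric([0, 5, 10], 2))                  # → [0, 1, 1]
```

The dictionary returned by `encode_categorical` is what a later format conversion step would invert to restore the text values.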
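In flow 7021, the views are synthesized with the view list generation algorithm (VLG). Given a preset view size d and a preset number of iterations X, VLG first sets the variable D_best to inf and sets B_best and C_best to empty sets. It then performs the following X times: it uses the BVG of flow 7022 to produce the basic views B and then the CVG of flow 7023 to produce the cross views C. Flows 7022 and 7023 can thus be regarded as subroutines of flow 7021.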
Next, for each cross view in C, VLG checks whether the cross view's domain size is larger than that of the basic view in B with the largest domain size. As soon as one cross view is found to be larger, the current round is abandoned and the next round begins. If none is larger, the total domain size of the round is computed: the domain sizes of all basic views in B are summed, the domain sizes of all cross views in C are summed, and the two sums are added and stored in the variable D. If D is smaller than the current D_best, D becomes the new D_best, B replaces B_best, and C replaces C_best. After X iterations, B_best and C_best are output as the final basic views and cross views.
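In flow 7022, the base view generation subroutine (BVG) runs. Given the preset view size d, BVG first creates an empty set B and copies the attribute set into S. It then iterates: d distinct attributes are chosen at random from S to form a set b, b is added to B, and S is updated to S \ b. It then checks whether the number of elements in S is not greater than d; if so, S itself is added to B and the loop ends. BVG then sorts B. For each basic view b_i it takes the attributes with the top ⌊…⌋ largest domain sizes and multiplies their domain sizes together, that is, the product of dom(b_{ij}) over those attributes b_{ij}, where the function dom(b_{ij}) returns the domain size of attribute b_{ij}. The resulting |B| products are sorted in descending order and the basic views in B are reordered to match. Finally, BVG outputs the sorted B.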
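In flow 7023, the cross view generation subroutine (CVG) runs. Given the preset view size d and the set of basic views obtained from BVG, CVG first creates an empty set C to hold the cross views and then follows one of two procedures depending on whether d is even. If d is even, CVG creates an empty set c_0 and selects b_0 and b_1 from B. It adds to c_0 the attributes of b_0 with the top […] largest domain sizes and the attributes of b_1 with the top […] smallest domain sizes, and then adds c_0 to C, which completes one pass. Because B also yields the further pairs of adjacent basic views b_1, b_2; b_2, b_3; ...; […], CVG repeats the same operation on each of these pairs to obtain the final cross view set, of size […]. If d is odd, CVG further checks whether […] > […]; if so, the variable x is set to […] − 1, and otherwise to […] − 2. CVG then iterates for i = 1 to x as follows. It selects from b_i the attributes with the top […] largest domain sizes into a temporary array set_1, selects from b_{i+1} the attributes with the top […] smallest domain sizes into a temporary array set_2, and stores set_1 ∪ set_2 in a temporary array c_i. It then selects from b_{i+1} \ set_2 the attribute with the smallest domain size and records it in the variable p. If the domain size of set_1 is smaller than or equal to the product of the domain sizes of set_2 and p, that is, dom(set_1) ≤ dom(set_2)·dom(p), then p is replaced by the attribute with the smallest domain size in b_i \ set_1. Finally, p is added to c_i and c_i is stored in C, completing one iteration. After x iterations, the cross view set of size […] is obtained.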
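In flow 703, the noisy marginal distributions are constructed. The system builds a marginal table for each view produced by flows 7021, 7022, and 7023. It then applies the Laplace mechanism to every marginal table, where the privacy budget of each marginal table is P_i·ε with P_i = [...]. The budget allocated to each view v_i is therefore P_i·ε, and the noise distribution for v_i is Laplace([...]).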
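Noise injection with a per-view budget share can be sketched as below. The function names and the table format are hypothetical. The expression for P_i and the Laplace parameter are not legible in the text, so the shares are taken as inputs and the scale is assumed to be sensitivity/(P_i·ε), the usual form of the Laplace mechanism, with a sensitivity of 1 assumed for count queries.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_marginals(marginals, shares, epsilon, sensitivity=1.0, rng=random):
    """Add Laplace noise to every count of every marginal table.

    marginals: list of dicts mapping cell -> count (hypothetical format).
    shares: budget fractions P_i, one per table, expected to sum to 1.
    """
    noisy = []
    for table, share in zip(marginals, shares):
        # Assumed scale: a smaller share of epsilon means more noise.
        scale = sensitivity / (share * epsilon)
        noisy.append({cell: count + laplace_noise(scale, rng)
                      for cell, count in table.items()})
    return noisy
```

The noisy counts are real-valued and may be negative, which is what the subsequent post-processing procedures address.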
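In flow 7041, the consistency procedure of post-processing is performed. First, T_{v_i}(C) is defined as the count of cell C retrieved from marginal table v_i. The consistency update for each cell of v_i is T_{v_i}(C') ← T_{v_i}(C') + (1/ξ_i)·(T_v(C) − T_{v_i}(C)), where C is the element (a_{j1}, ..., a_{jy}) over the common fields v = v_1 ∩ ... ∩ v_x and ξ_i denotes dom(v_i \ v).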
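In flow 7042, the non-negativity procedure of post-processing is performed. For each marginal table v_i, the system first collects its counts T_{v_i}(C) and stores them in a temporary array X. While sorting X in ascending order, it records the mapping between positions before and after sorting so that precise adjustments can be made later. The system then iterates a preset number of rounds. In each round, the negative counts are summed to give N and the number of positive counts is tallied to give P. The negative counts are then all set to 0, and [...] is added to each positive count to offset them. After the preset number of rounds, any positions that are still negative are simply set to 0.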
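The sum-and-subtract rounds of flow 7042 can be sketched as below. The function name and the default number of rounds are hypothetical. The amount added to each positive count is not legible in the text, so the sketch assumes it is N/P, which spreads the removed negative mass evenly and keeps the total unchanged as long as positive counts remain. The sorting and position bookkeeping are omitted because the update does not depend on order.

```python
def remove_negatives(counts, rounds=10):
    """Zero out negative counts and spread their mass over the positive ones."""
    counts = list(counts)
    for _ in range(rounds):
        negative_sum = sum(c for c in counts if c < 0)    # N (zero or negative)
        positives = sum(1 for c in counts if c > 0)       # P
        if negative_sum == 0 or positives == 0:
            break
        share = negative_sum / positives  # assumed offset added to each positive

        def adjust(c):
            if c < 0:
                return 0.0
            return c + share if c > 0 else c

        counts = [adjust(c) for c in counts]
    # Anything still negative after the last round is clamped to zero.
    return [max(c, 0.0) for c in counts]


print(remove_negatives([-6, 1, 5, 4]))  # → [0.0, 0.0, 2.5, 1.5]
```

In this run the first round pushes the count 1 below zero, and the second round absorbs it, so the total stays at 4.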
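In flow 7043, the integerization procedure of post-processing is performed. For each marginal table v_i, the system creates a table TN recording the integer part of every count and another table TF recording the fractional part. The number of records in D minus the sum of TN is then used as the number of fill-in iterations, each of which finds the position k of the largest value in TF, adds 1 to the count at position k of TN, and sets position k of TF to 0. After the fill-in iterations, the content of TN replaces the original counts of marginal table v_i.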
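In flow 7051, the data synthesis procedure is performed. For two adjacent views v and v_i whose common fields are the set A = {a_1, ..., a_k}, the partial records of v are sorted in descending order of their counts, where partial records are those sharing the same combination of values on the common fields. Partial records with higher counts are given higher priority. Starting from the highest priority, each partial record of v is paired with a partial record of v_i. Once all pairings of v and v_i are complete, a new v is obtained and the next round of combination proceeds with v and v_i, where i = i + 1.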
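In flow 7052, the data format conversion procedure is performed. Flow 701 converts the content of the data into discrete data and builds a dictionary recording the mapping for text content. The generated synthetic data is therefore converted into the format of the original data set so that its content matches the original format: attributes that held text are expressed as text, and attributes that were originally numerical are assigned a random value within the corresponding range. As in the earlier example, suppose that in preprocessing the value 13.7, given the intervals (-inf, 10], (10, 20], (20, inf), falls in the second interval and is assigned the new value 1. In the data format conversion step, a value is then picked at random from (10, 20], the range corresponding to the new value 1, as the final attribute value.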
In one embodiment, each of the modules and units described above may be software, hardware, or firmware. If hardware, it may be a processing unit, processor, computer, or server with data processing and computing capability. If software or firmware, it may comprise instructions executable by a processing unit, processor, computer, or server, and may be installed on a single hardware device or distributed across several hardware devices.
The present invention further discloses a computer-readable medium for use in a computing device or computer having a processor (for example, a CPU or GPU) and/or memory. The medium stores instructions, and the computing device or computer executes the medium through the processor and/or memory so as to perform the method and steps described above.
In summary, the data synthesis system with differential privacy protection, the method, and the computer-readable medium of the present invention first perform preprocessing and view list synthesis using the domain size information of the data fields. Because this information is public, no splitting of the privacy budget is involved. Marginal tables are then built for the views, and random noise drawn from a Laplace distribution is injected into them to provide differential privacy. To improve the quality of the synthetic data, the noisy marginal distributions undergo a series of post-processing procedures, including consistency, non-negativity, and integerization. Finally, the marginal distributions are stitched together iteratively to produce each complete synthetic record. The present invention accordingly has the following features and effects.
First, the present invention adopts differential privacy protection, which not only safeguards data privacy but also overcomes the drawback of traditional data privacy protection methods, which alter the original data irreversibly.
Second, compared with the prior art, the present invention uses the domain sizes of the data fields to guide view list synthesis for high-dimensional data. Earlier view-based methods spend part of the privacy budget on selecting appropriate marginal tables; because the field domain size information used here is independent of the data, the present invention largely avoids splitting the privacy budget.
Third, the present invention allocates the privacy budget to each view through mathematical analysis and can thereby achieve the best balance between data utility and privacy at the noise injection stage.
Fourth, the non-negativity, consistency, and integerization post-processing procedures of the present invention improve both the quality of the synthetic data and the running efficiency of the system program.
The above detailed description is a specific description of one feasible embodiment of the present invention. The embodiment is not intended to limit the patent scope of the present invention; any equivalent implementation or modification that does not depart from the technical spirit of the present invention shall fall within the patent scope of the present invention.
1: Data synthesis system with differential privacy protection
11: Data preprocessing module
12: View list synthesis module
13: Marginal distribution module
14: Post-processing module
15: Data synthesis module
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW112102060A TWI824927B (en) | 2023-01-17 | 2023-01-17 | Data synthesis system with differential privacy protection, method and computer readable medium thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
TWI824927B true TWI824927B (en) | 2023-12-01 |
Family
ID=90052985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW112102060A TWI824927B (en) | 2023-01-17 | 2023-01-17 | Data synthesis system with differential privacy protection, method and computer readable medium thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI824927B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522746A (en) * | 2018-11-07 | 2019-03-26 | 平安医疗健康管理股份有限公司 | A kind of data processing method, electronic equipment and computer storage medium |
TW202046138A (en) * | 2018-03-01 | 2020-12-16 | 鈺創科技股份有限公司 | Data collection and analysis device |
TWI720622B (en) * | 2019-03-12 | 2021-03-01 | 開曼群島商創新先進技術有限公司 | Security model prediction method and device based on secret sharing |
WO2022101515A1 (en) * | 2020-11-16 | 2022-05-19 | UMNAI Limited | Method for an explainable autoencoder and an explainable generative adversarial network |
-
2023
- 2023-01-17 TW TW112102060A patent/TWI824927B/en active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW202046138A (en) * | 2018-03-01 | 2020-12-16 | 鈺創科技股份有限公司 | Data collection and analysis device |
CN109522746A (en) * | 2018-11-07 | 2019-03-26 | 平安医疗健康管理股份有限公司 | A kind of data processing method, electronic equipment and computer storage medium |
TWI720622B (en) * | 2019-03-12 | 2021-03-01 | 開曼群島商創新先進技術有限公司 | Security model prediction method and device based on secret sharing |
WO2022101515A1 (en) * | 2020-11-16 | 2022-05-19 | UMNAI Limited | Method for an explainable autoencoder and an explainable generative adversarial network |
Similar Documents
Publication | Title |
---|---|
Li et al. | Consensus graph learning for multi-view clustering |
Necoara et al. | Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds |
Liu et al. | Joint binary classifier learning for ECOC-based multi-class classification |
Kobel et al. | On the complexity of computing with planar algebraic curves |
Aaron et al. | Dynamic incremental k-means clustering |
EP3485419B1 (en) | Big data k-anonymizing by parallel semantic micro-aggregation |
CN107729423A (en) | A kind of big data processing method and processing device |
Zhang et al. | Scalable proximal Jacobian iteration method with global convergence analysis for nonconvex unconstrained composite optimizations |
Gong et al. | Fault-tolerant enhanced bijective soft set with applications |
CN108280366A (en) | A kind of batch linear query method based on difference privacy |
Huang et al. | Cdd: Multi-view subspace clustering via cross-view diversity detection |
Fang et al. | Improved Bounded Matrix Completion for Large-Scale Recommender Systems. |
Nguyen et al. | Fast tensor decompositions for big data processing |
TWI824927B (en) | Data synthesis system with differential privacy protection, method and computer readable medium thereof |
Koskela et al. | Practical differentially private hyperparameter tuning with subsampling |
Silva et al. | Cuda-based parallelization of power iteration clustering for large datasets |
Schaffner et al. | An approximate computing technique for reducing the complexity of a direct-solver for sparse linear systems in real-time video processing |
Luo et al. | Hyper-Laplacian regularized multi-view clustering with exclusive L21 regularization and tensor log-determinant minimization approach |
CN112650869B (en) | Image retrieval reordering method and device, electronic equipment and storage medium |
Easterling et al. | Probability-one homotopy methods for constrained clustering |
RU2755568C1 (en) | Method for parallel execution of the join operation while processing large structured highly active data |
Hu et al. | Low-rank matrix learning using biconvex surrogate minimization |
Ren et al. | Recommender systems based on autoencoder and differential privacy |
Cao et al. | Implementing a high-performance recommendation system using Phoenix++ |
Ghosh et al. | Highly efficient parallel algorithms for solving the Bates PIDE for pricing options on a GPU |