TWI824927B

TWI824927B - Data synthesis system with differential privacy protection, method and computer readable medium thereof

Info

Publication number: TWI824927B
Application number: TW112102060A
Authority: TW
Inventors: 林志訓; 游家牧; 王紹睿; 張文軒; 左瑞麟
Original assignee: 中華電信股份有限公司
Priority date: 2023-01-17
Filing date: 2023-01-17
Publication date: 2023-12-01

Abstract

The present invention provides a data synthesis system with differential privacy protection and a method thereof. Information on the domain sizes of the data fields is used for data pre-processing and for synthesis of the view list. Because field domain-size information is public, this step involves no privacy budget splitting. Corresponding contingency tables are then built for these views, and random noise based on differential privacy with a Laplace distribution is injected into them. The noisy marginal distributions are then post-processed to improve the quality of the synthetic data. Finally, the marginal distributions are concatenated iteratively to generate each complete synthetic record. The present invention also provides a computer-readable medium for executing the method of the present invention.

Description

Data synthesis system, method and computer-readable medium with differential privacy protection

The present invention relates to data privacy protection technology, and in particular to a data synthesis system and method with differential privacy protection and a computer-readable medium thereof.

Traditional data privacy protection methods usually apply techniques such as k-anonymity in the de-identification procedure, which cause irreversible damage to the data. Most emerging approaches in recent years have therefore been designed around differential privacy. For example, when company A has a large amount of data to hand over to company B for analysis, data protection is necessary because the data contains personal information. De-identification is therefore commonly used to avoid handing sensitive data to others, and differential privacy techniques are used to synthesize data for this protective purpose.

Differential privacy is a means of data sharing. The idea is that when the effect of randomly modifying a single record is small enough, statistical results computed from the modified data (for example, feature statistics) cannot easily be used to infer the content of any single record, which protects privacy. In other words, differential privacy computes feature statistics only over the describable (that is, non-private) part of the data, so no specific personal information is disclosed, and it is therefore widely used in de-identification procedures. However, current data privacy protection solutions either suffer from severe privacy budget splitting or cannot achieve fully automated synthesis, so there is still room for improvement.

Therefore, how to provide a data privacy protection technique, and in particular how to keep the impact on the synthesized data as small as possible while using differential privacy and how to improve the quality of the synthetic data, has become a goal eagerly pursued by those skilled in the art.

To solve the above problems of the prior art, the present invention discloses a data synthesis system with differential privacy protection, comprising: a data preprocessing module for bucketizing the numerical data in the data fields of the original data to generate a data set; a view list synthesis module for randomly composing the data set into a plurality of base views according to a predetermined number of view fields, sorting and numbering the base views according to the domain size of each base view, generating a cross view for each two adjacent base views in the ordering according to their domain sizes, and then comparing the domain size of the base view with the domain size of the cross view, such that no merging is performed when the domain size of the cross view is larger than that of the base view and the existing temporary value is replaced when the domain size of the cross view is not larger than that of the base view, whereby the final base views and final cross views are generated after multiple iterations; a marginal distribution module for evenly allocating the privacy budget to the marginal tables generated from each final base view and each final cross view; a post-processing module for performing consistency, non-negativity and integerization processing on the counts in the marginal tables; and a data synthesis module for combining two adjacent views to produce synthetic data, which, after data format conversion, takes the format of the original data.

In one embodiment, the cross view is used to bridge two adjacent base views in the ordering, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the data set is not greater than half the number of view fields, no cross view is created between the last two base views.

In one embodiment, the view list synthesis module further includes: a base view synthesis unit, which first defines an empty synthesis set and creates a synthesis set identical to the data set, selects several distinct attributes from the synthesis set and adds them to the empty synthesis set, and removes those attributes from the synthesis set, the loop continuing while the number of elements in the empty synthesis set is greater than the number of selected distinct attributes and ending once it is not, whereby the plurality of base views are produced, the base views being sorted in descending order; a cross view synthesis unit, which first defines an empty set and, when the number of view fields is even, combines the attributes taken from the former of two adjacent base views whose domain value is greater than half the number of view fields with the attributes taken from the latter whose domain value is less than half the number of view fields, to produce the cross view of the two adjacent base views, and, when the number of view fields is odd, further determines the relationship between the last attribute of the data set and half the number of view fields, so as to obtain the cross view of the two adjacent base views iteratively; and a view list synthesis unit, which, when the domain size of the cross view is not larger than the domain size of the base view, calculates the sum of the base view and the cross view and, when the sum is smaller than the existing temporary value, replaces the existing temporary value with the new sum, the base view and the cross view, so that the final base views and final cross views are obtained after multiple iterations.

In one embodiment, the marginal distribution module uses the Laplace mechanism and evenly allocates the privacy budget to the marginal tables generated from each final base view and each final cross view.

In one embodiment, the post-processing module further includes: a consistency unit, which performs a weighted calculation on the counts of each final base view and each final cross view and updates those counts; a non-negativity unit, which eliminates negative values in the counts of each final base view and each final cross view by iteratively summing and subtracting; and an integerization unit, which removes or fills in values in the marginal tables so that each marginal table has the same number of records as the original data.

In one embodiment, the data synthesis module further includes: a data synthesis unit, which sorts the common fields of the two adjacent views, gives higher priority to fields with larger counts, and produces the synthetic data once the fields of the two adjacent views have been matched; and a data format conversion unit, which converts the synthetic data into the format of the original data.

The present invention further discloses a data synthesis method with differential privacy protection, executed by computer equipment and comprising the following steps: bucketizing the numerical data in the data fields of the original data to generate a data set; randomly composing the data set into a plurality of base views according to a predetermined number of view fields, sorting and numbering the base views according to the domain size of each base view, generating a cross view for each two adjacent base views in the ordering according to their domain sizes, and then comparing the domain size of the base view with the domain size of the cross view, such that no merging is performed when the domain size of the cross view is larger than that of the base view and the existing temporary value is replaced when the domain size of the cross view is not larger than that of the base view, whereby the final base views and final cross views are obtained after multiple iterations; evenly allocating the privacy budget to the marginal tables generated from each final base view and each final cross view; performing consistency, non-negativity and integerization processing on the counts in the marginal tables; and combining two adjacent views to produce synthetic data, which, after data format conversion, takes the format of the original data.

In the above method, the cross view is used to bridge two adjacent base views in the ordering, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the data set is not greater than half the number of view fields, no cross view is created between the last two base views.

In the above method, the step of generating the final base views and final cross views further includes: first defining an empty synthesis set and creating a synthesis set identical to the data set, selecting several distinct attributes from the synthesis set and adding them to the empty synthesis set, and removing those attributes from the synthesis set, the loop continuing while the number of elements in the empty synthesis set is greater than the number of selected distinct attributes and ending once it is not, whereby the plurality of base views are produced, the base views being sorted in descending order; first defining an empty set and, when the number of view fields is even, combining the attributes taken from the former of two adjacent base views whose domain value is greater than half the number of view fields with the attributes taken from the latter whose domain value is less than half the number of view fields, to produce the cross view of the two adjacent base views, and, when the number of view fields is odd, further determining the relationship between the last attribute of the data set and half the number of view fields, so as to obtain the cross view of the two adjacent base views iteratively; and, when the domain size of the cross view is not larger than the domain size of the base view, calculating the sum of the base view and the cross view and, when the sum is smaller than the existing temporary value, replacing the existing temporary value with the new sum, the base view and the cross view, so that the final base views and final cross views are obtained after multiple iterations.

In the above method, the step of evenly allocating the privacy budget to the marginal tables generated from each final base view and each final cross view uses the Laplace mechanism to perform the allocation.

In the above method, the step of performing consistency, non-negativity and integerization processing on the counts in the marginal tables further includes: performing a weighted calculation on the counts of each final base view and each final cross view and updating those counts; eliminating negative values in the counts of each final base view and each final cross view by iteratively summing and subtracting; and removing or filling in values in the marginal tables so that each marginal table has the same number of records as the original data.

In the above method, the step of combining two adjacent views to produce synthetic data that takes the format of the original data after data format conversion further includes: sorting the common fields of the two adjacent views, giving higher priority to fields with larger counts, and producing the synthetic data once the fields of the two adjacent views have been matched; and converting the synthetic data into the format of the original data.

The present invention further discloses a computer-readable medium, for use in a computing device or computer, which stores instructions for executing the aforementioned data synthesis method with differential privacy protection.

In summary, the data synthesis system, method and computer-readable medium with differential privacy protection of the present invention provide an automated tabular data synthesis technique with differential privacy protection. First, pre-processing and view list generation are performed using the domain-size information of the data fields; because this information is public, no privacy budget splitting is involved. Next, a corresponding marginal table (contingency table) is built for each view, and random noise based on differential privacy with a Laplace distribution is injected into it. To improve the quality of the synthetic data, the present invention also applies a series of post-processing procedures to these noisy marginal distributions, including non-negativity and consistency-aware normalization. Finally, the marginal distributions are concatenated iteratively to generate each complete synthetic record.

1: Data synthesis system with differential privacy protection

11: Data preprocessing module

12: View list synthesis module

121: Base view synthesis unit

122: Cross view synthesis unit

123: View list synthesis unit

13: Marginal distribution module

14: Post-processing module

141: Consistency unit

142: Non-negativity unit

143: Integerization unit

15: Data synthesis module

151: Data synthesis unit

152: Data format conversion unit

701-7052: Process flow

S601-S605: Steps

FIG. 1 is a system architecture diagram of the data synthesis system with differential privacy protection of the present invention.

FIG. 2 is a system architecture diagram of another embodiment of the data synthesis system with differential privacy protection of the present invention.

FIG. 3 is a schematic diagram of an example of list synthesis of base views and cross views according to the present invention.

FIG. 4 is a schematic diagram of the four scenarios for synthesizing cross views according to the present invention.

FIG. 5 is a schematic diagram illustrating the non-negativity processing of the present invention.

FIG. 6 is a step diagram of the data synthesis method with differential privacy protection of the present invention.

FIG. 7 is a flow chart of a specific implementation of the data synthesis method with differential privacy protection of the present invention.

The technical content of the present invention is described below by way of specific embodiments. Those skilled in the art can readily understand the advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different embodiments.

FIG. 1 is a system architecture diagram of the data synthesis system with differential privacy protection of the present invention. As shown in the figure, the data synthesis system 1 with differential privacy protection includes a data preprocessing module 11, a view list synthesis module 12, a marginal distribution module 13, a post-processing module 14 and a data synthesis module 15.

The data preprocessing module 11 is used to bucketize the numerical data in the data fields of the original data to generate a data set. In short, because the data synthesis system 1 with differential privacy protection can only process discrete numerical data, the purpose of the data preprocessing module 11 is to perform bucketization, which discretizes the data in numerical fields. Specifically, each value in a field is assigned to the interval that contains it, and the index of that interval, counted from 0, becomes its new value. For example, given the intervals (-inf,10], (10,20], (20,inf), the value 13.7 falls in the second interval and is replaced by the new value 1. In this way, the data preprocessing module 11 converts continuous numerical fields into discrete numerical fields. Because the original data may contain several continuous numerical fields, every such field uses the same number of buckets to avoid increasing the complexity of the task; the number of buckets is therefore a parameter that must be set in advance.
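The bucketization step can be illustrated with a minimal Python sketch. The function name `bucketize` and the representation of the intervals by their finite upper bounds are assumptions made for illustration; the source only specifies right-closed intervals and 0-based indices.

```python
from bisect import bisect_left

def bucketize(value, upper_edges):
    """Return the 0-based index of the right-closed interval containing value.

    upper_edges lists the finite upper bounds in ascending order, so
    [10, 20] describes (-inf, 10], (10, 20], (20, inf).
    """
    # bisect_left counts the edges strictly below value, which is exactly
    # the index of the right-closed interval that value falls into.
    return bisect_left(upper_edges, value)

edges = [10, 20]  # hypothetical boundaries, taken from the example in the text
print(bucketize(13.7, edges))  # second interval, index 1
```

Every continuous field would use the same number of edges, since the text requires a common bucket count across fields.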

The view list synthesis module 12 is used to randomly compose the data set into a plurality of base views according to a predetermined number of view fields, to sort and number the base views according to the domain size of each base view, and to generate a cross view for each two adjacent base views in the ordering according to their domain sizes. It then compares the domain size of the base view with the domain size of the cross view: no merging is performed when the domain size of the cross view is larger than that of the base view, and the existing temporary value is replaced when the domain size of the cross view is not larger than that of the base view, so that the final base views and final cross views are generated after multiple iterations. In short, the purpose of the view list synthesis module 12 is to represent the data set generated by the data preprocessing module 11 as views: base views and cross views are generated first, and the subsequent marginal distribution, post-processing and data synthesis procedures then complete the synthetic data.

In one embodiment, the cross view is used to bridge two adjacent base views in the ordering, and when the number of fields of the original data is not divisible by the number of view fields and the absolute value of the last attribute of the data set is not greater than half the number of view fields, no cross view is created between the last two base views.

Existing differentially private synthetic dataset (DPSD) methods build views by searching for highly correlated fields. By contrast, the data synthesis system 1 with differential privacy protection of the present invention uses the public information of field domain sizes to build views that are less affected by noise. In the present invention, the data synthesis system 1 uses two types of views, base views and cross views, which are explained and defined below.

Base views are intended to cover all fields in the data, and the variable d denotes the number of fields that each of these views contains. The fewer fields each view contains, the more base views are needed to cover all fields in the data, which means lower memory overhead and less pronounced sparsity of the marginals. However, the more base views there are, the less correlation information between fields is retained, and the more severe the privacy budget splitting becomes. The present invention therefore defines base view construction as follows. Given the number of fields m in the original data set O and a fixed view size d, there are two cases. When d | m (that is, m is divisible by d), m/d base views can be synthesized, each of size d. When d | m does not hold, the last base view has a size smaller than d, which affects how the cross views are determined; the specific procedure is detailed later.
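As a rough illustration of base view construction, the following Python sketch randomly partitions m fields into views of size d. The names `build_base_views` and `domain_size`, the dictionary of per-field domain sizes, and the choice to keep a short remainder view in last position while sorting the full-size views by descending domain size are all assumptions for illustration, not details confirmed by the source.

```python
import math
import random

def domain_size(view, field_domains):
    """Number of cells in a view's marginal table (product of field domain sizes)."""
    return math.prod(field_domains[f] for f in view)

def build_base_views(field_domains, d, rng=random):
    """Randomly partition all fields into views of size d; the last may be smaller."""
    fields = list(field_domains)
    rng.shuffle(fields)
    views = [fields[i:i + d] for i in range(0, len(fields), d)]
    # Assumption: full-size views are ordered by descending domain size,
    # and a short remainder view (when d does not divide m) stays last.
    views.sort(key=lambda v: (len(v) < d, -domain_size(v, field_domains)))
    return views

# Hypothetical data set with m = 8 fields and their public domain sizes.
field_domains = {f"a{i}": s for i, s in enumerate([2, 3, 4, 5, 2, 3, 4, 5], start=1)}
for view in build_base_views(field_domains, d=3, rng=random.Random(0)):
    print(view, domain_size(view, field_domains))
```

With m = 8 and d = 3 this yields three base views of sizes 3, 3 and 2, matching the case where d does not divide m.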

Cross views are used to bridge base views and thereby compensate for the correlation information missing between them. Given the number of fields m, the view size d and the base views B, three types of cross view may arise.

Type I (d | m): for each pair of base views b_i and b_{i+1}, a cross view can be synthesized directly. For example, as shown by b₁-b₃ in the middle of FIG. 3 and c₁-c₂ below them, given the base views b₁ = {a₁, a₂, a₃}, b₂ = {a₄, a₅, a₆} and b₃ = {a₇, a₈, a₉}, then c₁ = {a₂, a₃, a₄} and c₂ = {a₅, a₇, a₈} can be one possible set of cross views.

Type II (not d | m and |b_⌈m/d⌉| > d/2, where b_⌈m/d⌉ is the last base view): cross views can again be synthesized directly. For example, as shown by b₁-b₃ in the middle of FIG. 3 and c₁-c₂ below them, given the base views b₁ = {a₁, a₂, a₃}, b₂ = {a₄, a₅, a₆} and b₃ = {a₇, a₈}, one possible set of cross views, with overlapping fields, is c₁ = {a₂, a₃, a₄} and c₂ = {a₅, a₆, a₇}.

Type III (not d | m and |b_⌈m/d⌉| ≤ d/2): in this case, d - |b_⌈m/d⌉| random fields are selected from the (⌈m/d⌉-1)-th base view and placed in the ⌈m/d⌉-th base view. However, no cross view is generated between the (⌈m/d⌉-1)-th and ⌈m/d⌉-th base views, because the above compensation of the ⌈m/d⌉-th base view is already sufficient to link the (⌈m/d⌉-1)-th and ⌈m/d⌉-th base views. Synthesizing an additional cross view would therefore be redundant and would aggravate the privacy budget splitting problem.
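The three cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_cross_views` is hypothetical, and the rule for how many fields each side contributes (the last d - ⌊d/2⌋ fields of the former view and the first ⌊d/2⌋ fields of the latter) is chosen because it reproduces the c₁ and c₂ of the Type II example, not because the source states it. The sketch also assumes d ≥ 2.

```python
import random

def build_cross_views(base_views, d, rng=random):
    """Return (base_views, cross_views) for the three cases described in the text."""
    base_views = [list(v) for v in base_views]
    pairs = len(base_views) - 1          # candidate adjacent pairs to bridge
    last = base_views[-1]
    if pairs >= 1 and len(last) <= d / 2:
        # Type III: pad the short last view with random fields of the previous
        # view and skip the cross view between the last two base views.
        last.extend(rng.sample(base_views[-2], d - len(last)))
        pairs -= 1
    from_prev, from_next = d - d // 2, d // 2   # assumed split between the two views
    cross_views = [
        base_views[i][-from_prev:] + base_views[i + 1][:from_next]
        for i in range(pairs)
    ]
    return base_views, cross_views

# Type II example from the text: m = 8, d = 3, last view has 2 > d/2 fields.
b = [["a1", "a2", "a3"], ["a4", "a5", "a6"], ["a7", "a8"]]
print(build_cross_views(b, d=3)[1])
```

For Type I and Type II inputs the base views are returned unchanged; only Type III modifies the last base view.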

The specific procedures for base views and cross views are detailed later.

The marginal distribution module 13 is used to evenly allocate the privacy budget to the marginal tables generated from each final base view and each final cross view. In short, its purpose is to construct noisy marginal distributions. After the view list synthesis module 12 has built the base views and cross views from the public domain-size information of the data fields, the marginal distribution module 13 applies the Laplace mechanism to the corresponding marginals to guarantee ε-DP, where ε denotes the privacy budget. This is the only step in which the data synthesis system 1 with differential privacy protection accesses the original data set O, which means that the statistical information of the data must be protected by noise perturbation.

Specifically, given a view, the corresponding marginal table (contingency table) is easily obtained. The Laplace mechanism is then applied to all marginal tables, with each table receiving an equal share of ε. This is the straightforward way to guarantee differential privacy (DP), and most existing differentially private synthetic dataset (DPSD) algorithms follow this strategy, in which ε is distributed evenly among the marginal tables. In addition, the data synthesis system 1 with differential privacy protection uses the expected squared error (ESE) to obtain the optimal budget allocation. Although earlier studies proposed using ESE to analyze the noise scale, their main purpose was to simplify view updates on common fields during post-processing (in particular for consistency), which differs from its use in the present invention to obtain the optimal budget allocation.
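The equal-split strategy can be sketched with the Python standard library alone. The function names, the record layout (one dictionary of bucket indices per record) and the sensitivity of 1 per table (add-or-remove-one-record neighbouring data sets) are assumptions made for illustration; the source does not state the sensitivity it uses.

```python
import itertools
import random
from collections import Counter

def contingency_table(records, view, field_domains):
    """Count the records falling in every cell of the view, including empty cells."""
    counts = Counter(tuple(r[f] for f in view) for r in records)
    cells = itertools.product(*(range(field_domains[f]) for f in view))
    return {cell: counts.get(cell, 0) for cell in cells}

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_marginals(records, views, field_domains, epsilon, rng=random):
    """Split epsilon evenly over the views and perturb every cell count."""
    eps_per_view = epsilon / len(views)
    scale = 1.0 / eps_per_view  # assumption: each table has sensitivity 1
    return [
        {cell: n + laplace_noise(scale, rng)
         for cell, n in contingency_table(records, v, field_domains).items()}
        for v in views
    ]

# Hypothetical bucketized records with two fields.
field_domains = {"age": 3, "income": 2}
records = [{"age": 1, "income": 0}, {"age": 1, "income": 0}, {"age": 2, "income": 1}]
print(noisy_marginals(records, [["age", "income"], ["age"]], field_domains, epsilon=1.0))
```

Because every cell of every table is perturbed, empty cells also receive noise, which is why negative counts can appear and must be handled in post-processing.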

The marginal distribution module 13 computes the proportion $P_i$ of the privacy budget allocated to each view $v_i$. Given $k$ views $\{v_1, v_2, \ldots, v_k\}$ with corresponding domain sizes $\{s_1, s_2, \ldots, s_k\}$, privacy budget allocation can be formulated as an optimization problem that minimizes a sum of $k$ terms, one per view (the individual terms appeared as formula images and are not reproduced here).

Applying the Karush-Kuhn-Tucker (KKT) conditions, one condition is imposed for each $i = 1, 2, \ldots, k$ and a further expression is set to 0. Substituting and rearranging yields intermediate relations from which $P_i$ is finally derived in closed form. The budget allocated to each view $v_i$ is therefore $P_i \varepsilon$. This optimized allocation avoids spending too much privacy budget on marginal tables with small domain sizes, where the extra budget buys only a slight gain in utility, and lets marginal tables with larger domain sizes receive more budget to reduce the negative impact of noise.
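The closed form for $P_i$ is not legible in this text. One derivation consistent with the description can be sketched as follows. It assumes that every cell of the marginal table of $v_i$ receives independent $\mathrm{Lap}(1/(P_i\varepsilon))$ noise with variance $2/(P_i\varepsilon)^2$, that the ESE of the table is $s_i$ times that variance, and that the constraint is $\sum_i P_i = 1$. The objective and the multiplier $\lambda$ are assumptions, not the patent's own formulas.

```latex
\begin{aligned}
&\min_{P_1,\dots,P_k}\ \sum_{i=1}^{k} \frac{2\,s_i}{(P_i\varepsilon)^2}
\quad\text{s.t.}\quad \sum_{i=1}^{k} P_i = 1,\ P_i > 0,\\
&\mathcal{L} = \sum_{i=1}^{k}\frac{2\,s_i}{(P_i\varepsilon)^2}
  + \lambda\Big(\sum_{i=1}^{k}P_i - 1\Big),\\
&\frac{\partial\mathcal{L}}{\partial P_i}
  = -\frac{4\,s_i}{\varepsilon^2 P_i^{3}} + \lambda = 0
  \ \Rightarrow\ P_i = \Big(\frac{4\,s_i}{\lambda\varepsilon^2}\Big)^{1/3},\\
&\sum_{i=1}^{k}P_i = 1
  \ \Rightarrow\ \Big(\frac{4}{\lambda\varepsilon^2}\Big)^{1/3}
  = \frac{1}{\sum_{j=1}^{k}s_j^{1/3}}
  \ \Rightarrow\ P_i = \frac{s_i^{1/3}}{\sum_{j=1}^{k}s_j^{1/3}}.
\end{aligned}
```

Under these assumptions $P_i$ grows with $s_i^{1/3}$, which agrees with the statement that tables with larger domain sizes receive more budget while small tables are not over-funded.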

The post-processing module 14 makes the counts in the marginal tables consistent, non-negative, and integer-valued. In short, it refines the differentially private synthetic dataset (DPSD) to improve its utility through three operations: non-negativity, consistency, and integerization (termed "Normalization"). Consistency keeps the aggregated statistics in agreement with public background knowledge. Non-negativity handles view-table counts that may become negative once noise is added. Integerization restores the property that the counts of a marginal table were integers before noise injection. These three operations are described in detail later.

The data synthesis module 15 combines pairs of adjacent views to produce synthetic data, which is then converted back into the format of the original data. After the processing by the post-processing module 14, the data synthesis module 15 synthesizes the data into a differentially private synthetic dataset (DPSD). In other words, given the base views and cross views, the goal of data synthesis is to link suitable partial records from different views into $n$ complete $m$-dimensional records, where $n$ is the number of records in the data set and $m$ is the number of fields. Whereas most existing DPSD algorithms rely on sampling-based synthesis, the data synthesis system 1 with differential privacy protection of the present invention adopts a link-based method. This ensures that $n$ records can be synthesized, because the integerization step described above synchronizes the counts across views. Finally, the data format is converted so that the synthetic data matches the format of the original data.

Compared with the prior art, the present invention uses the domain sizes of the data fields to assist view list synthesis for high-dimensional data, whereas existing view-based methods spend part of the privacy budget on selecting appropriate marginal tables. Because the field domain-size information used by the present invention is independent of the data, the problem of splitting the privacy budget is largely avoided. In addition, allocating the privacy budget to each view through mathematical analysis achieves the best balance between data utility and privacy at the noise injection stage. The post-processing procedures of non-negativity, consistency, and integerization not only improve the quality of the synthetic data but also improve the runtime efficiency of the system.

FIG. 2 is a system architecture diagram of another embodiment of the data synthesis system with differential privacy protection of the present invention. As shown, the data preprocessing module 11, view list synthesis module 12, marginal distribution module 13, post-processing module 14, and data synthesis module 15 are the same as described for FIG. 1. This embodiment further describes the internal architecture of the view list synthesis module 12, the post-processing module 14, and the data synthesis module 15.

The view list synthesis module 12 generates the base views and cross views and then combines them. It therefore further includes a base view synthesis unit 121, a cross view synthesis unit 122, and a view list synthesis unit 123.

The base view synthesis unit 121 first defines an empty synthesis set and creates a synthesis set identical to the data set. It selects a number of distinct attributes from the synthesis set, adds them to the empty synthesis set, and removes them from the synthesis set. While the number of elements in the empty synthesis set is greater than the number of selected attributes, the loop continues; it stops once the number of elements is no longer greater than the number of selected attributes. This produces the plurality of base views, which are sorted in descending order. In short, although base views have been defined, an algorithm is still needed to construct them. The base view synthesis unit 121 contains this algorithm, referred to here as the base view generation (BVG) algorithm.

BVG can be understood as a simple algorithm for synthesizing random base views; its distinguishing feature is how it orders them. The base views as initially synthesized have no definite order, so once they are generated, BVG sorts and numbers them in descending order according to the top $\lfloor d/2 \rfloor$ largest domain sizes in each view. For a base view $b_i$, these largest domain sizes are denoted $s_{j_1}, \ldots, s_{j_{\lfloor d/2 \rfloor}}$, and $a_{j_1}, \ldots, a_{j_{\lfloor d/2 \rfloor}}$ denote the fields in $b_i$ with the largest domain sizes, where $j_1, \ldots, j_{\lfloor d/2 \rfloor} \in [m]$. For a sorted list $[\beta_1, \ldots, \beta_{2k}]$, the 2-partition $[[\beta_1, \beta_{2k}], \ldots, [\beta_k, \beta_{k+1}]]$ minimizes the sum of the products of the paired elements, $\sum_{i=1}^{k} \beta_i \beta_{2k+1-i}$. Base view construction in practice is more complicated, for example because the view size is not limited to two, so BVG sorts and numbers the base views in descending order of their respective top $\lfloor d/2 \rfloor$ largest domain sizes in an attempt to perform a similar minimization. For example, the upper part of FIG. 3 contains four base views. After sorting and numbering, the view whose top $\lfloor d/2 \rfloor$ largest domain sizes attain the maximum value, 42, is numbered $b_1$.

The BVG algorithm proceeds as follows. The inputs are a preset view size $d$ and the attribute set of the data set. BVG first creates an empty set $B$ and a set $S$ whose contents equal the attribute set, and then iterates: it randomly selects $d$ distinct attributes from $S$ to form a set $b$, adds $b$ to $B$, and updates $S$ to $S \setminus b$. When the number of elements in $S$ is not greater than $d$, $S$ itself is added to $B$ and the loop ends. Having produced the preliminary base view set $B$, BVG sorts it: for each base view it takes the attributes with the top $\lfloor d/2 \rfloor$ largest domain sizes and multiplies those domain sizes together, then sorts the resulting $|B|$ products in descending order and arranges the base views in $B$ in the same order. Finally, BVG outputs the sorted $B$. Here $\lfloor d/2 \rfloor$ denotes the integer obtained by discarding the remainder of $d/2$.
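The BVG steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: attributes are hashable identifiers, `dom` is a hypothetical mapping from attribute to domain size, and the sort key uses the top `d // 2` domain sizes, which is one reading of the rounding notation in the text.

```python
import math
import random

def bvg(attrs, dom, d, rng):
    """Randomly partition attrs into base views of at most d attributes, then sort them."""
    remaining = list(attrs)
    views = []
    while len(remaining) > d:
        b = rng.sample(remaining, d)                      # d distinct attributes
        views.append(b)
        remaining = [a for a in remaining if a not in b]  # S <- S \ b
    if remaining:
        views.append(remaining)                           # the leftover (size <= d) is the last view

    def key(view):
        # Product of the top d // 2 domain sizes of the view (assumed reading).
        top = sorted((dom[a] for a in view), reverse=True)[: d // 2]
        return math.prod(top)

    return sorted(views, key=key, reverse=True)           # descending order

dom = {0: 2, 1: 42, 2: 5, 3: 7, 4: 3, 5: 9, 6: 4}
views = bvg(list(dom), dom, d=3, rng=random.Random(1))
```

With seven attributes and `d = 3` the result always has two views of three attributes and one view of a single attribute, and the view containing the attribute with domain size 42 comes first.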

The cross view synthesis unit 122 first defines an empty set. When the number of view fields is even, it takes from the former of two consecutively ordered base views the attributes whose domain value is greater than half the number of view fields, takes from the latter the attributes whose domain value is less than half the number of view fields, and combines them to produce the cross view of the two base views. When the number of view fields is odd, it additionally evaluates the relationship between the last attribute of the data set and half the number of view fields, and obtains the cross view of the two consecutively ordered base views iteratively. In short, once BVG has computed and numbered the base views, the cross views are constructed. The cross view synthesis unit 122 contains the algorithm for this, referred to here as the cross view generation (CVG) algorithm.

As noted above, a cross view can be synthesized between each pair of base views $b_i$ and $b_{i+1}$. Depending on the domain sizes of the fields in the two base views, there are four configurations for synthesizing a cross view: (a) few-to-few, (b) many-to-many, (c) many-to-few, and (d) low-to-high, as shown in FIG. 4. Because the final step stitches views together, two factors matter: the diversity of the synthesized records, which determines how well they reflect the actual state of the original data, and the resistance of each view to noise. The many-to-few configuration is therefore the preferred choice for cross view synthesis.

The CVG algorithm proceeds as follows. The inputs are a preset view size $d$, the attribute set of the data set, and the base view set $B$. CVG first creates an empty set $C$ to hold the cross views it will generate, and then follows one of two procedures depending on whether $d$ is even. If $d$ is even, CVG creates an empty set $c_0$ and selects $b_0$ and $b_1$ from $B$. It adds to $c_0$ the $d/2$ attributes of $b_0$ with the largest domain sizes and the $d/2$ attributes of $b_1$ with the smallest domain sizes, and then adds $c_0$ to $C$, completing one round. Since $B$ also yields further pairs of adjacent base views, such as $(b_1, b_2)$, $(b_2, b_3)$, and so on up to the last adjacent pair, CVG repeats the same operation on each of these pairs to obtain the final cross view set, which contains one cross view per adjacent pair. If instead $d$ is odd, CVG first evaluates a comparison of the form […] > […]; if it holds, the variable $x$ is set to […] − 1, and otherwise to […] − 2. Each […] stands for a formula image that is not reproduced here.

CVG then performs the following for $i = 1$ to $x$. First, it selects the $\lfloor d/2 \rfloor$ attributes of $b_i$ with the largest domain sizes and stores them in a temporary array $set_1$, selects the $\lfloor d/2 \rfloor$ attributes of $b_{i+1}$ with the smallest domain sizes and stores them in a temporary array $set_2$, and stores $set_1 \cup set_2$ in a temporary array $c_i$. Next, it selects the attribute with the smallest domain size from $b_{i+1} \setminus set_2$ and records it in a variable $p$. If the domain size of $set_1$ is less than or equal to the product of the domain sizes of $set_2$ and $p$, that is, $\mathrm{dom}(set_1) \le \mathrm{dom}(set_2) \cdot \mathrm{dom}(p)$, the attribute with the smallest domain size in $b_i \setminus set_1$ replaces the attribute stored in $p$. Finally, $p$ is added to $c_i$ and $c_i$ is stored in $C$, completing one iteration. After $x$ iterations the cross view set is obtained. Note that the bracket notation used for the size of the cross view set denotes the integer obtained by discarding the remainder and then adding 1, $B \setminus s$ denotes the set $B$ with $s$ removed, and $set_1 \cup set_2$ denotes the union of $set_1$ and $set_2$.
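The per-pair step of CVG can be sketched as below. The sketch assumes that every base view passed in holds exactly `d` attributes and that `d // 2` attributes are taken from each side. The rule that fixes $x$ for odd $d$ is not legible in this text, so the hypothetical `cvg` helper simply processes every adjacent pair; `cross_view`, `cvg`, `dom_size`, and the `dom` mapping are illustrative names.

```python
import math

def dom_size(attrs, dom):
    # Domain size of a set of attributes = product of the individual domain sizes.
    return math.prod(dom[a] for a in attrs)

def cross_view(b_i, b_next, dom, d):
    """Cross view of two adjacent base views in the many-to-few configuration."""
    h = d // 2
    set1 = sorted(b_i, key=lambda a: dom[a], reverse=True)[:h]  # largest domains in b_i
    set2 = sorted(b_next, key=lambda a: dom[a])[:h]             # smallest domains in b_next
    c = set1 + set2
    if d % 2 == 1:
        # Odd d: one more attribute p is needed to reach d attributes.
        p = min((a for a in b_next if a not in set2), key=lambda a: dom[a])
        if dom_size(set1, dom) <= dom_size(set2, dom) * dom[p]:
            p = min((a for a in b_i if a not in set1), key=lambda a: dom[a])
        c.append(p)
    return c

def cvg(views, dom, d):
    # Assumption: every adjacent pair is processed (the source's bound x is not recoverable).
    return [cross_view(views[i], views[i + 1], dom, d) for i in range(len(views) - 1)]

dom = {0: 10, 1: 8, 2: 6, 3: 5, 4: 3, 5: 2}
even_case = cross_view([0, 1], [2, 3], dom, d=2)
odd_case = cvg([[0, 1, 2], [3, 4, 5]], dom, d=3)
```

In the odd case shown, `set1` has domain size 10 while `set2` together with `p` has domain size 2 × 3 = 6, so the comparison fails and `p` stays with the attribute taken from the second view.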

When the domain size of the cross view is not greater than that of the base view, the view list synthesis unit 123 computes the sum of the base view and cross view domain sizes. When this sum is smaller than the current stored value, the new sum, base views, and cross views replace the stored values. After multiple iterations, the final base views and final cross views are obtained. In short, the view list synthesis unit 123 contains an iterative algorithm, referred to as the view list generation (VLG) algorithm, which calls BVG and CVG as subroutines to synthesize the best base views and cross views.

VLG iterates a given number of times and returns the best base views and cross views found. Alternatively, VLG may use an early-stop strategy that searches for the best solution over a non-predetermined number of iterations. Brute-force search is avoided because there are too many possible view combinations. For example, for a data set with 25 fields ($m = 25$), finding the best views for $d = 5$ by brute force would require $(m-1)!/(d-1)! \approx 2.585 \times 10^{22}$ checks.
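As a quick arithmetic check, the brute-force count can be read as $(m-1)!/(d-1)!$ with $m = 25$ and $d = 5$. This reading is an interpretation, chosen because it reproduces the stated figure.

```python
import math

m, d = 25, 5
# Assumed reading of the count: (m - 1)! / (d - 1)!
checks = math.factorial(m - 1) // math.factorial(d - 1)
print(f"{checks:.3e}")  # 2.585e+22
```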

The VLG algorithm proceeds as follows. The inputs are a preset view size $d$, the attribute set of the data set, and a preset number of iterations $X$. VLG first sets the variable $D_{best}$ to infinity and $B_{best}$ and $C_{best}$ to empty sets. It then repeats the following $X$ times. First, BVG is called with the attribute set and $d$ to produce the base views $B$, and CVG is called with the attribute set, $d$, and $B$ to produce the cross views $C$. Each cross view in $C$ is then checked to see whether its domain size exceeds that of the base view with the largest domain size in $B$; as soon as one such cross view is found, the current round is abandoned and the next round begins. If no cross view is larger, the total domain size for the round is computed as $D = \sum_{b \in B} \mathrm{dom}(b) + \sum_{c \in C} \mathrm{dom}(c)$, that is, the domain sizes of the base views in $B$ are summed, the domain sizes of the cross views in $C$ are summed, and the two sums are added and stored in $D$. If $D$ is smaller than the current $D_{best}$, $D$ becomes the new $D_{best}$, $B$ replaces $B_{best}$, and $C$ replaces $C_{best}$. After $X$ iterations, $B_{best}$ and $C_{best}$ are output as the final base views and cross views.
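The VLG loop can be sketched as follows. To keep the example self-contained, the BVG and CVG subroutines are passed in as callables and replaced by toy stand-ins in the usage lines; `vlg`, `dom_size`, `toy_bvg`, and `toy_cvg` are hypothetical names, and the domain size of a view is assumed to be the product of its attributes' domain sizes.

```python
import math

def dom_size(view, dom):
    return math.prod(dom[a] for a in view)

def vlg(attrs, dom, d, rounds, bvg, cvg):
    """Keep the (B, C) pair with the smallest total domain size over `rounds` trials."""
    d_best, b_best, c_best = math.inf, [], []
    for _ in range(rounds):
        B = bvg(attrs, d)
        C = cvg(attrs, d, B)
        largest_base = max(dom_size(b, dom) for b in B)
        if any(dom_size(c, dom) > largest_base for c in C):
            continue                      # a cross view is too large: abandon this round
        D = sum(dom_size(b, dom) for b in B) + sum(dom_size(c, dom) for c in C)
        if D < d_best:
            d_best, b_best, c_best = D, B, C
    return d_best, b_best, c_best

# Toy stand-ins so the loop can be exercised deterministically.
dom = {0: 2, 1: 3, 2: 4, 3: 5}
candidates = iter([[[2, 3], [0, 1]], [[1, 3], [0, 2]]])
toy_bvg = lambda attrs, d: next(candidates)
toy_cvg = lambda attrs, d, B: [[max(B[0], key=dom.get), min(B[1], key=dom.get)]]
result = vlg(list(dom), dom, 2, 2, toy_bvg, toy_cvg)
```

The first candidate has total domain size 20 + 6 + 10 = 36 and the second 15 + 8 + 10 = 33, so the second one is kept.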

The post-processing module 14 refines the differentially private synthetic dataset (DPSD). It includes a consistency unit 141, a non-negativity unit 142, and an integerization unit 143.

The consistency unit 141 updates the counts of each final base view and each final cross view through a weighted computation. The mechanism of weighted consistency is outlined first. Let the $x$ given views be $v_1, \ldots, v_x$, and let $v = v_1 \cap \ldots \cap v_x = \{a_{j_1}, \ldots, a_{j_y}\}$ be their $y$ common fields, with $j_1, \ldots, j_y \in [m]$. Let $T_{v_i}(c)$ denote the count of cell $c$ retrieved from the marginal table of $v_i$. The weights $w_1, \ldots, w_x$ are used to form the weighted count $T_v((a_{j_1}, \ldots, a_{j_y})) = \sum_i w_i \cdot T_{v_i}((a_{j_1}, \ldots, a_{j_y}))$. When $w_1 = \ldots = w_x = 1$, weighted consistency degenerates into the ordinary normalization procedure, so a good combination of weights must be found to obtain the maximum weighted value.

To this end, the consistency unit 141 uses the ESE to compute the noise scale of $v$ (the expression appeared as a formula image and is not reproduced here). In that expression, $\xi_i$ denotes $\mathrm{dom}(v_i \setminus v)$, $P_i$ is the privacy budget proportion allocated to each view as computed earlier, $\mathrm{Var}_\varepsilon$ refers to the Laplace noise $\mathrm{Lap}(\cdot)$, and a further term denotes the variance computed for view $v_i$.

The objective function and the derivation of its optimal solution are as follows. The objective is a minimization problem whose expression appeared as formula images that are not reproduced here. Applying the Karush-Kuhn-Tucker conditions, one condition is imposed on the whole and one for each $i = 1, \ldots, k$. The derivation yields intermediate relations and, after substitution, the optimal weights, which give the target count $T_v(C)$ for any cell $C = (a_{j_1}, \ldots, a_{j_y})$ of view $v$. All view tables can then update their values using $T_v(C)$ as the target. Without loss of generality, suppose the table of view $v_i$ has a cell $C' = (a_{j_1}, \ldots, a_{j_y}, \ldots, a_{j_d})$ with $y \le d$ and $j_1, \ldots, j_d \in [m]$. It is updated as $T_{v_i}(C') \leftarrow T_{v_i}(C') + [\ldots]\,(T_v(C) - T_{v_i}(C))$, where $[\ldots]$ is a coefficient whose formula image is not reproduced here and $C$ denotes the cell $(a_{j_1}, \ldots, a_{j_y})$ on the common fields $v = v_1 \cap \ldots \cap v_x$.
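Because the weights and the update coefficient are not legible in this text, the following is only a sketch of one standard way to realize weighted consistency. It assumes inverse-variance weights normalized to sum to 1, with the variance of view $v_i$'s projected count taken as proportional to $\xi_i / P_i^2$, and an update that spreads the correction evenly over the $\xi_i$ cells, which amounts to a coefficient of $1/\xi_i$. All function and parameter names are illustrative.

```python
def weighted_consistency(tables, project, xi, P):
    """tables[i]: {cell: count}; project[i](cell) gives the cell on the common fields;
    xi[i]: number of cells of view i per common cell; P[i]: budget share of view i."""
    inv_var = [P[i] ** 2 / xi[i] for i in range(len(tables))]   # assumed 1 / variance
    weights = [w / sum(inv_var) for w in inv_var]
    # Project every table onto the common fields.
    proj = []
    for i, table in enumerate(tables):
        marginal = {}
        for cell, count in table.items():
            key = project[i](cell)
            marginal[key] = marginal.get(key, 0.0) + count
        proj.append(marginal)
    # Weighted target count for each common cell.
    target = {c: sum(w * m[c] for w, m in zip(weights, proj)) for c in proj[0]}
    # Spread the correction evenly over the xi[i] cells that project to each common cell.
    updated = [
        {cell: count + (target[project[i](cell)] - proj[i][project[i](cell)]) / xi[i]
         for cell, count in table.items()}
        for i, table in enumerate(tables)
    ]
    return target, updated

# View 0 is over fields (a, b), view 1 over (b, c); the common field is b.
t0 = {(0, 0): 3.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}
t1 = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 5.0, (1, 1): 3.0}
target, (u0, u1) = weighted_consistency(
    [t0, t1], [lambda c: c[1], lambda c: c[0]], xi=[2, 2], P=[0.5, 0.5])
```

After the update, both tables give the same total for every value of the common field, which is the property the consistency step is meant to establish.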

The non-negativity unit 142 eliminates negative values in the counts of each final base view and each final cross view by iteratively summing and subtracting. Counts in a view table may become negative after noise is added, and although processing by the consistency unit 141 may improve matters, the effect is limited. The present invention therefore proposes a novel approach: as shown in FIG. 5, the non-negativity unit 142 removes the negative values through an iterative sum-and-subtract process while preserving the shape of the data distribution. Because the algorithm involves only simple additions and subtractions, it is also considerably faster to compute.

The integerization unit 143 removes or adds counts in the marginal tables so that each marginal table has the same number of records as the original data. Before noise injection, the counts of a marginal table are integers. Some prior DPSD work treats the noisy marginal table output by the two procedures above as a data distribution and generates records by random sampling. Because sampling is a random process, the distribution of the generated records may be biased and may not fully reflect the record diversity of the original data. The present invention instead applies integerization to each marginal table, removing or adding counts so that each table yields exactly as many records as the original data, which effectively resolves these problems.

Specifically, the integerization unit 143 splits each count into an integer part and a fractional part. The extra counts needed are added to the integer parts one at a time, in descending order of the fractional parts. This ensures that the corrected marginal table contains only integer counts and that their sum equals the number of records in the original data set. For example, suppose the data set has $n = 5$ records and the counts are [0.4, 3.3, 1.3]. The integer parts are [0, 3, 1], which sum to only 4, and the fractional parts are [0.4, 0.3, 0.3]. Since the first element has the largest fractional part, 1 is added to it, giving [1, 3, 1]. The sum is now 5, equal to the true number of records, so this table replaces the earlier non-integer count table.
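The worked example corresponds to a largest-remainder rounding, sketched below. The function name `integerize` is illustrative, and the sketch covers only the case where counts must be added (the floors sum to at most $n$); the removal case mentioned in the text is not handled.

```python
import math

def integerize(counts, n):
    """Round non-negative counts to integers that sum to n (largest remainder first)."""
    floors = [math.floor(c) for c in counts]
    fracs = [c - f for c, f in zip(counts, floors)]
    shortfall = n - sum(floors)
    if not 0 <= shortfall <= len(counts):
        raise ValueError("sketch only handles adding at most one unit per cell")
    # Give one extra unit to the cells with the largest fractional parts.
    order = sorted(range(len(counts)), key=lambda i: fracs[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors

rounded = integerize([0.4, 3.3, 1.3], n=5)
print(rounded)  # [1, 3, 1]
```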

The data synthesis module 15 synthesizes the data into a differentially private synthetic dataset (DPSD). It includes a data synthesis unit 151 and a data format conversion unit 152.

The data synthesis unit 151 sorts the common fields of the two adjacent views, assigns higher priority to fields with larger counts, and, once the fields of the two views have been paired, produces the synthetic data. The general idea of forming complete records by linking data from different views is introduced first.

Maximum cardinality matching problem for view combination. View combination is in fact an optimization problem. Consider a simple example with two views $v_1 = \{a_1, a_2, a_3\}$ and $v_2 = \{a_2, a_3, a_4\}$. Suppose $v_1$ has exactly one partial record $[a_2 = 0, a_3 = 0]$, which can be linked to partial records of $v_2$ whose $a_4$ field is either $[a_4 = 3]$ or $[a_4 = 4]$. Suppose also that $v_1$ has two partial records $[a_2 = 0, a_3 = 1]$ and $[a_2 = 0, a_3 = 1]$, which can be linked only to the partial record $[a_4 = 3]$. In this situation, if $[a_2 = 0, a_3 = 0]$ is linked to $[a_4 = 3]$, then neither copy of $[a_2 = 0, a_3 = 1]$ can be linked to anything. The combination of two views can therefore be modeled as finding a maximum cardinality matching in a bipartite graph, in which partial records of $v_1$ on the left are linked to partial records of $v_2$ on the right to form complete records. An edge exists between two partial records when they have the same values on the common fields. Because the maximum cardinality matching problem seeks a matching with as many edges as possible, a maximum cardinality matching of this bipartite graph gives the best combination of the two views.

Heuristic view combination. For large data sets, even the state-of-the-art Hopcroft-Karp algorithm for maximum cardinality matching remains inefficient. The present invention proposes an efficient heuristic alternative. The data synthesis unit 151 considers two adjacent views $v_1$ and $v_2$ whose common fields form the set $A = \{a_1, \ldots, a_k\}$. Note that $v_1$ may itself be a new view formed by combining other views, that is, $|v_1| \ge d$. The partial records of $v_1$, meaning records that share the same combination of values on the common fields, are sorted in descending order of their counts, and partial records with larger counts are given higher priority. Starting from the highest priority, each partial record in $v_1$ is then paired with a partial record in $v_2$; in other words, partial records in $v_1$ look for matches in $v_2$ in order of their counts.

In the example above, $[a_2 = 0, a_3 = 1]$ is assigned a higher priority than $[a_2 = 0, a_3 = 0]$, because it occurs twice whereas $[a_2 = 0, a_3 = 0]$ occurs only once. One copy of $[a_2 = 0, a_3 = 1]$ is therefore paired with the $[a_4 = 3]$ record, the other copy has no record left to pair with, and $[a_2 = 0, a_3 = 0]$ is paired with $[a_4 = 4]$. This procedure thus synthesizes only two three-dimensional records. When a partial record in $v_1$ has several candidate partial records in $v_2$, it is paired with one of them at random.
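The count-priority heuristic on this example can be sketched as follows. The data layout is an assumption made for illustration: `left_counts` and `right_counts` hold how many copies of each partial record exist, and `compatible` lists which right-hand records each left-hand record may link to. The name `greedy_link` is hypothetical.

```python
import random

def greedy_link(left_counts, right_counts, compatible, rng):
    """Pair left partial records with right ones, most frequent left records first."""
    remaining = dict(right_counts)
    linked = []
    # Higher count means higher priority.
    for left, count in sorted(left_counts.items(), key=lambda kv: kv[1], reverse=True):
        for _ in range(count):
            options = [r for r in compatible[left] if remaining.get(r, 0) > 0]
            if not options:
                break                      # this copy has nothing left to pair with
            right = rng.choice(options)    # several candidates: pick one at random
            remaining[right] -= 1
            linked.append((left, right))
    return linked

left_counts = {(0, 0): 1, (0, 1): 2}       # (a2, a3) partial records of v1
right_counts = {3: 1, 4: 1}                # a4 values available in v2
compatible = {(0, 0): [3, 4], (0, 1): [3]}
linked = greedy_link(left_counts, right_counts, compatible, random.Random(0))
print(linked)  # [((0, 1), 3), ((0, 0), 4)]
```

Two records are produced, matching the outcome described for the example; on this input no choice is actually random because each step has a single remaining candidate.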

The data format conversion unit 152 converts the synthetic data into the format of the original data. Data preprocessing at the start of the pipeline converts the contents of the data into discrete values, so for text content a dictionary is built to record the mapping. After the synthetic data is generated, the data format conversion unit 152 converts it so that the data set matches the format of the original data set; attributes that were originally text are expressed as text again. Attributes that were originally numeric are assigned a random value within the corresponding range. For example, if in preprocessing the value 13.7 falls into the second of the intervals (-inf, 10], (10, 20], (20, inf) and is assigned the new value 1, then the format conversion step randomly picks a value from the range (10, 20] corresponding to 1 as the final attribute value.
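The numeric round trip in this example can be sketched as below. The interval edges come from the example; the clipping bounds `lo` and `hi` used for the two unbounded intervals are assumptions, since the text does not say how those intervals are sampled, and the names `encode` and `decode` are illustrative.

```python
import bisect
import random

EDGES = [10, 20]  # intervals: (-inf, 10], (10, 20], (20, inf)

def encode(x, edges=EDGES):
    # bisect_left places a value equal to an edge in the interval closed on that edge.
    return bisect.bisect_left(edges, x)

def decode(code, rng, edges=EDGES, lo=0.0, hi=30.0):
    # lo and hi are assumed bounds standing in for -inf and inf.
    bounds = [lo] + list(edges) + [hi]
    left, right = bounds[code], bounds[code + 1]
    # uniform() may return an endpoint, so the open end is only approximately respected.
    return rng.uniform(left, right)

code = encode(13.7)                      # second interval, coded as 1
value = decode(code, random.Random(0))   # a random value between 10 and 20
```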

FIG. 6 is a flowchart of the steps of the data synthesis method with differential privacy protection of the present invention. The purpose of the present invention is to use the domain sizes of data fields to assist view list synthesis for high-dimensional data. The method includes the following steps.

In step S601, the numerical data in the data fields of the original data are intervalized to generate a data set. This step is the data preprocessing, which discretizes the numeric fields into intervals, and can be performed by the data preprocessing module 11 of Figure 1.

In step S602, the data set is randomly synthesized into multiple basic views according to a predetermined number of view fields, and the basic views are sorted and numbered according to their domain sizes. A cross view is then generated for each pair of consecutively ordered basic views according to their domain sizes. Next, the domain size of the basic view is compared with the domain size of the cross view: if the cross view's domain size is larger, no merging is performed; if it is not larger, the existing temporary value is replaced. After multiple iterations, the final basic views and final cross views are obtained. In other words, this step uses the basic view synthesis algorithm (BVG) to turn the data set from the previous step into basic views, uses the cross view synthesis algorithm (CVG) to obtain the cross views between consecutive basic views, and finally combines the basic views and cross views through the view list synthesis algorithm (VLG). These operations can be performed by the view list synthesis module 12 of Figure 1.

In step S603, the privacy budget is evenly allocated to the marginal tables generated from the final basic views and final cross views. In this step, the Laplace mechanism is applied to the corresponding marginals to ensure ε-DP; that is, each marginal table receives an equal share of the privacy budget (ε). These operations can be performed by the marginal distribution module 13 of Figure 1.

In step S604, the counts in the marginal tables are made consistent, non-negative, and integer. The purpose of this step is to refine the differentially private synthetic data through consistency, non-negativity, and integerization procedures. The post-processing usually iterates the consistency and non-negativity procedures several times first and performs the integerization step last. These operations can be performed by the post-processing module 14 of Figure 1.

In step S605, adjacent views are combined pairwise to generate synthetic data, which after data format conversion has the format of the original data. This step produces the perturbed views and then converts their format to match that of the original data set. These operations can be performed by the data synthesis module 15 of Figure 1.

Figure 7 is a flow chart of a specific implementation of the data synthesis method with differential privacy protection of the present invention. The operation of the system and its processes are explained below through a specific embodiment.

In process 701, data preprocessing is performed. Specifically, the input data set D has n records and m attributes, denoted by […]. The system determines for each attribute of D whether it is of numeric or categorical type: text is treated as categorical, while decimal and integer values are treated as numeric. For a categorical attribute a_i, all distinct values that occur, [a_{i0}, a_{i1}, ..., a_{i(|a_i|-1)}], are first collected and then re-encoded as numbers, which are applied to the data of attribute a_i in D, that is, {a_{i0}: 0, a_{i1}: 1, ..., a_{i(|a_i|-1)}: |a_i|-1}. For a numeric attribute, the maximum a_iMax and the minimum a_iMin of the data of attribute a_i are first obtained, and the range is divided into k intervals according to a preset number of intervals k: [a_iMin, a_iMin + […]), [a_iMin + […], a_iMin + […]), ..., [a_iMin + […], a_iMax]. All values of attribute a_i are then re-encoded with the index of the interval they fall into. Once every attribute has been re-encoded, the contents of D are entirely numeric.
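
A minimal sketch of this preprocessing follows. The function names are hypothetical, and the equal-width split between a_iMin and a_iMax is an assumption: the interval boundary expressions are incomplete in the source text, so the actual boundaries may differ.

```python
def encode_categorical(values):
    """Map each distinct text value to 0 .. |a_i|-1 in order of first appearance."""
    dictionary = {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
    return [dictionary[v] for v in values], dictionary

def encode_numeric(values, k):
    """Replace each number by the index (0 .. k-1) of its interval.

    Assumption: the k intervals have equal width (max - min) / k.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The last interval is closed on the right, so the maximum maps to k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

The dictionary returned by `encode_categorical` is what the later format conversion step needs in order to turn codes back into text.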

In process 7021, the view list synthesis algorithm (VLG) synthesizes the views. Given a preset view size d and a preset number of iterations X, VLG first sets the variable D_best to inf and sets B_best and C_best to empty sets. VLG then repeats the following X times: it first calls BVG([…], d) of process 7022 to generate the basic views B, and then calls CVG([…], d, B) of process 7023 to generate the cross views C. Processes 7022 and 7023 can thus be regarded as subroutines of process 7021.

Next, each cross view in C is checked to see whether its domain size is larger than that of the basic view with the largest domain size in B. As soon as one cross view is found to be larger, the current round is abandoned and the next round begins. If none is larger, the total domain size of the round is computed: the domain sizes of the basic views in B are summed, the domain sizes of the cross views in C are summed, and the two sums are added and stored in the variable D. If D is smaller than the current D_best, D becomes the new D_best, B replaces the current B_best, and C replaces the current C_best. After X iterations, B_best and C_best are output as the final basic views and cross views.
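
The selection loop can be sketched as below. The names `attributes` and `dom` (a mapping from attribute to domain size) are assumptions, as is passing the two subroutines in as the callables `bvg` and `cvg`; the first argument of BVG and CVG is not legible in the source.

```python
import math

def domain_size(view, dom):
    """Domain size of a view: product of the domain sizes of its attributes."""
    return math.prod(dom[a] for a in view)

def vlg(attributes, dom, d, rounds, bvg, cvg):
    """Keep the (B, C) pair with the smallest total domain size over `rounds` tries."""
    d_best, b_best, c_best = math.inf, [], []
    for _ in range(rounds):
        B = bvg(attributes, dom, d)
        C = cvg(attributes, dom, d, B)
        largest_basic = max(domain_size(b, dom) for b in B)
        if any(domain_size(c, dom) > largest_basic for c in C):
            continue  # a cross view is too large: abandon this round
        total = (sum(domain_size(b, dom) for b in B)
                 + sum(domain_size(c, dom) for c in C))
        if total < d_best:
            d_best, b_best, c_best = total, B, C
    return b_best, c_best
```

Only attribute names and domain sizes are read here, never the records themselves, which is consistent with the description's point that view selection consumes no privacy budget.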

In process 7022, the basic view generation subroutine (BVG) is run. Given the preset view size d, BVG first creates an empty set B and copies the contents of […] to S. It then iterates as follows: d distinct attributes are randomly selected from S to form a set b, which is added to B, and S is updated to S\b. It then checks whether the number of elements in S is not greater than d; if so, S is added to B directly and the loop ends. BVG then sorts B. For each basic view b_i, the […] attributes with the largest domain sizes are taken and their domain sizes are multiplied, giving a product of the terms dom(b_ij), where b_ij ranges over those largest-domain attributes of b_i and the function dom(b_ij) returns the domain size of attribute b_ij. Finally, the resulting |B| products are sorted in descending order, and the basic views in B are arranged in the same order. BVG outputs the sorted B.
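
A sketch of this subroutine under stated assumptions: `attributes`, `dom`, and `rng` are hypothetical names, and `top` (how many of the largest-domain attributes enter the sort key) is left as a parameter because that number is not legible in the source. The loop is written as `while len(S) > d`, which yields the same views as checking after each draw and also covers inputs with at most d attributes.

```python
import math
import random

def bvg(attributes, dom, d, top, rng=random):
    """Randomly partition `attributes` into views of size d and sort them."""
    S = list(attributes)
    B = []
    while len(S) > d:
        b = rng.sample(S, d)                 # d distinct attributes
        B.append(tuple(b))
        S = [a for a in S if a not in b]     # S <- S \ b
    if S:
        B.append(tuple(S))                   # remaining attributes form the last view

    def key(view):
        # Product of the `top` largest attribute domain sizes in the view.
        sizes = sorted((dom[a] for a in view), reverse=True)[:top]
        return math.prod(sizes)

    B.sort(key=key, reverse=True)            # descending order
    return B
```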

In process 7023, the cross view generation subroutine (CVG) is run. Given the preset view size d and the set of basic views B obtained through BVG, CVG first creates an empty set C to hold the cross views to be generated, and then follows one of two procedures depending on whether d is even. If d is even, CVG creates an empty set c₀, selects b₀ and b₁ from B, and adds to c₀ the […] attributes of b₀ with the smallest domain sizes. Adding c₀ to C completes one pass, but B can also yield further pairs such as (b₁, b₂), (b₂, b₃), ..., ([…], […]) […] > […]. If this condition holds, the variable x is set to […] - 1; otherwise x is set to […] - 2. CVG then iterates for i = 1 to x as follows: the […] attributes of b_i with the smallest domain sizes are stored in the temporary array set₂, and the result of set₁ ∪ set₂ is stored in the temporary array c_i. Next, the attribute with the smallest domain size in b_{i+1}\set₂ is selected and recorded in the variable p. If the domain size of set₁ is less than or equal to the product of the domain size of set₂ and the domain size of p, that is, dom(set₁) […] the set of cross views.

In process 703, the noisy marginal distributions are constructed. The system generates a marginal table for each view produced by processes 7021, 7022, and 7023. It then applies the Laplace mechanism to all marginal tables, with the privacy budget of each marginal table being P_i ε, where P_i = […]. The budget allocated to each view v_i is therefore P_i ε, and the noise distribution for v_i is Laplace([…]).
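
The noise injection can be illustrated with the sketch below, which uses only the standard library. The record layout (a dict from attribute to integer code), the function names, and the sensitivity of 1 are assumptions; the exact Laplace parameter used by the method is not legible in the source, so the standard scale of sensitivity divided by the per-table budget is used here.

```python
import itertools
import random
from collections import Counter

def marginal_table(records, view):
    """Count how often each value combination of `view` occurs in `records`."""
    return Counter(tuple(r[a] for a in view) for r in records)

def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_marginal(records, view, dom, epsilon_i, sensitivity=1.0, rng=random):
    """Add Laplace noise to every cell of the view, including empty cells."""
    counts = marginal_table(records, view)
    scale = sensitivity / epsilon_i          # assumed scale: sensitivity / budget
    cells = itertools.product(*(range(dom[a]) for a in view))
    return {cell: counts[cell] + laplace_noise(scale, rng) for cell in cells}
```

With the even split of step S603, `epsilon_i` would be the total ε divided by the number of marginal tables.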

In process 7041, the consistency step of the post-processing is performed. First, T_{v_i}(C) is defined as the count retrieved for cell C from marginal table v_i. The consistency update for each cell of v_i is then T_{v_i}(C') ← T_{v_i}(C') + […](T_v(C) - T_{v_i}(C)), where C is an element (a_{j1}, ..., a_{jy}) of the shared fields v = v₁ ∩ ... ∩ v_x and ξ_i denotes dom(v_i\v).
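
The sketch below shows one way such an update can work. Two points are assumptions rather than statements of the method: the target T_v(C) is taken as the plain average of the projected counts, and the correction is spread evenly over the ξ_i cells of v_i that project onto C (a coefficient of 1/ξ_i). The source does not preserve either expression. Each table is assumed to list every cell of its view.

```python
from collections import defaultdict

def project(table, view, shared):
    """Sum the counts of `table` (over `view`) down to the attributes in `shared`."""
    idx = [view.index(a) for a in shared]
    out = defaultdict(float)
    for cell, count in table.items():
        out[tuple(cell[i] for i in idx)] += count
    return out

def make_consistent(tables, views, shared):
    """Adjust the tables so that they agree on the marginal over `shared`."""
    projections = [project(t, v, shared) for t, v in zip(tables, views)]
    keys = set().union(*projections)
    # Assumption: T_v(C) is the unweighted mean of the projected counts.
    target = {c: sum(p.get(c, 0.0) for p in projections) / len(projections)
              for c in keys}
    result = []
    for table, view, proj in zip(tables, views, projections):
        idx = [view.index(a) for a in shared]
        members = defaultdict(list)          # shared cell C -> cells C' of v_i
        for cell in table:
            members[tuple(cell[i] for i in idx)].append(cell)
        updated = dict(table)
        for c, cells in members.items():
            delta = (target[c] - proj[c]) / len(cells)   # len(cells) plays xi_i
            for cell in cells:
                updated[cell] += delta
        result.append(updated)
    return result
```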

In process 7042, the non-negativity step of the post-processing is performed. For each marginal table v_i, the system first collects its counts T_{v_i}(C) and stores them in a temporary array X. While X is sorted in ascending order, the correspondence between positions before and after sorting is recorded so that precise adjustments can be made afterwards. The system then iterates for a preset number of rounds as follows: the negative counts are summed to obtain N, and the number of positive counts is tallied to obtain P. All negative counts are then set to 0, and […] is added to every positive count to offset them. Finally, once the preset number of rounds is complete, any positions that are still negative are simply set to 0.
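
A sketch of the rounds described above. The amount added to each positive count is assumed to be N/P (the expression is not legible in the source), which keeps the total of the table unchanged within a round. The sorting and position bookkeeping are omitted because this version adjusts the counts in place.

```python
def make_non_negative(counts, rounds):
    """Push negative counts to 0 by charging them to the positive counts."""
    x = [float(v) for v in counts]
    for _ in range(rounds):
        n = sum(v for v in x if v < 0)       # N: total negative mass (<= 0)
        p = sum(1 for v in x if v > 0)       # P: number of positive counts
        if n == 0 or p == 0:
            break
        share = n / p                        # assumed offset per positive count
        x = [0.0 if v < 0 else (v + share if v > 0 else v) for v in x]
    # Anything still negative after the last round is set to 0.
    return [max(v, 0.0) for v in x]
```

Because subtracting a share can push a small positive count below zero, several rounds may be needed, which matches the iterative wording of the description.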

In process 7043, the integerization step of the post-processing is performed. For each marginal table v_i, the system creates a table TN recording the integer parts of all counts and another table TF recording their fractional parts. The number of records in D minus the sum of TN is then used as the number of top-up rounds, each of which does the following: find the position k of the maximum value in TF, add 1 to the count at position k of TN, and set position k of TF to 0. After the top-up rounds are complete, the contents of TN replace the original counts of marginal table v_i.
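
This step is a largest-remainder rounding, which can be sketched as follows. The function name is hypothetical, and the counts are assumed to be non-negative already, as they would be after the preceding non-negativity step.

```python
import math

def integerize(counts, n_records):
    """Round counts to integers whose sum is topped up towards n_records."""
    tn = [math.floor(v) for v in counts]          # TN: integer parts
    tf = [v - math.floor(v) for v in counts]      # TF: fractional parts
    for _ in range(n_records - sum(tn)):          # number of top-up rounds
        k = max(range(len(tf)), key=tf.__getitem__)
        tn[k] += 1
        tf[k] = 0.0
    return tn
```

For example, `integerize([1.6, 2.3, 0.1], 4)` gives `[2, 2, 0]`: the integer parts sum to 3, so one top-up goes to the cell with the largest fractional part.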

In process 7051, the data synthesis procedure is performed. For two adjacent views v and v_i whose common fields are the set A = {a₁, ..., a_k}, the partial records in v are sorted in descending order of their counts; partial records here are records that share the same combination of values in the common fields. Partial records with larger counts are given higher priority. Starting from the highest priority, each partial record in v is paired with a partial record in v_i. Once all pairings of v and v_i are done, a new v is obtained, and the next round of combination proceeds with v and v_i for i = i + 1.
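
The pairing can be sketched as below. The record layout (one dict per partial record), the names, and the rule that each record of v_i is used at most once are assumptions made for illustration; records of v that find no partner are dropped.

```python
import random
from collections import Counter

def join_views(v_records, vi_records, shared, rng=random):
    """Pair partial records of two adjacent views on their shared attributes."""
    def key(record):
        return tuple(record[a] for a in shared)

    priority = Counter(key(r) for r in v_records)
    # Stable sort: records whose shared-value combination is more frequent go first.
    ordered = sorted(v_records, key=lambda r: -priority[key(r)])
    pool = {}
    for r in vi_records:
        pool.setdefault(key(r), []).append(r)
    joined = []
    for r in ordered:
        candidates = pool.get(key(r))
        if not candidates:
            continue                         # no partner left for this record
        # Several candidates: pick one at random and remove it from the pool.
        partner = candidates.pop(rng.randrange(len(candidates)))
        joined.append({**r, **partner})
    return joined
```

With three partial records on (a₂, a₃) and only two on (a₃, a₄), at most two three-dimensional records come out, which is the behaviour the description notes for its example.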

In process 7052, the data format conversion procedure is performed. Process 701 converts the data content into discrete values and builds a dictionary recording the mapping for text content. The generated synthetic data is therefore converted back to the format of the original data set so that both have the same format: attributes that were originally text are expressed as text, and attributes that were originally numeric are assigned a random value within the corresponding range. As in the earlier example, if in preprocessing the value 13.7 fell into the second of the intervals (-inf,10], (10,20], (20,inf) and was assigned the new value 1, the format conversion step randomly picks a value from (10,20], the range corresponding to the new value 1, as the final attribute value.

In one embodiment, each of the above modules and units may be software, hardware, or firmware. If hardware, it may be a processing unit, processor, computer, or server with data processing and computing capabilities. If software or firmware, it may include instructions executable by a processing unit, processor, computer, or server, and may be installed on a single hardware device or distributed across multiple hardware devices.

In addition, the present invention discloses a computer-readable medium for use in a computing device or computer having a processor (e.g., CPU, GPU) and/or memory. The medium stores instructions, and the computing device or computer can execute the medium through the processor and/or memory so as to perform the above method and its steps.

In summary, the data synthesis system, method, and computer-readable medium with differential privacy protection of the present invention first use the domain size information of the data fields for preprocessing and view list synthesis. Because this information is public, no privacy budget needs to be split off for it. Marginal tables are then built for these views, and differentially private random noise based on the Laplace distribution is injected into them. To improve the quality of the synthetic data, the noisy marginal distributions undergo a series of post-processing procedures, including consistency, non-negativity, and integerization. Finally, the marginal distributions are iteratively joined to generate each complete synthetic record. Accordingly, the present invention has the following features and effects.

First, the present invention adopts differential privacy, which not only protects data privacy but also avoids the irreversible damage that traditional data privacy protection methods cause to the original data.

Second, the present invention uses the domain sizes of the data fields to assist view list synthesis for high-dimensional data. Earlier view-based methods split off part of the privacy budget to select appropriate marginal tables; because the field domain size information used here is independent of the data, the present invention largely avoids this budget-splitting problem.

Third, the present invention allocates privacy operations to each view through mathematical analysis, and can thereby achieve the best balance between data utility and privacy in the noise injection stage.

Fourth, the present invention's post-processing procedures of non-negativity, consistency, and integerization not only improve the quality of the synthetic data but also improve the running efficiency of the system program.

The above detailed description is a specific description of one feasible embodiment of the present invention. This embodiment is not intended to limit the patent scope of the present invention; any equivalent implementation or modification that does not depart from the technical spirit of the present invention shall be included in the patent scope of the present invention.

11: Data preprocessing module

12: View list synthesis module

13: Marginal distribution module

14: Post-processing module

15: Data synthesis module

Claims

A data synthesis system with differential privacy protection, including:

The data preprocessing module is used to intervalize the numerical data in the data fields of the original data to generate a data set;

The view list synthesis module is used to randomly synthesize the data set into multiple basic views based on a predetermined number of view fields, sort and number the multiple basic views according to the domain size of each basic view, and then generate a cross view of two consecutively ordered basic views according to their domain sizes; then, the domain size of the basic view and the domain size of the cross view are compared, so that no merging is performed when the domain size of the cross view is larger than the domain size of the basic view, or the existing temporary value is replaced when the domain size of the cross view is not larger than the domain size of the basic view, whereby the final basic view and the final cross view are generated after multiple iterative calculations;

The marginal distribution module is used to evenly distribute the privacy budget to the marginal tables generated by each final base view and each final cross view;

A post-processing module used to make the counts in the marginal tables consistent, non-negative, and integer; and

The data synthesis module is used to synthesize two adjacent views to generate synthetic data, so that after data format conversion, the synthetic data becomes the format of the original data.

The data synthesis system with differential privacy protection as described in claim 1, wherein the cross view is used to bridge the two consecutively ordered basic views, and when the number of fields of the original data cannot be evenly divided by the number of view fields and the absolute value of the last attribute of the data set is not greater than half of the number of view fields, the cross view between the last two basic views is not created.

The data synthesis system with differential privacy protection as described in claim 1, wherein the view list synthesis module further includes:

The basic view synthesis unit first defines a synthetic empty set and generates a synthetic set that is the same as the data set, selects several non-repeating attributes from the synthetic set and adds them to the synthetic empty set, and subtracts the synthetic set from the synthetic set. Remove the several non-repeating attributes, so that when the number of elements in the synthetic empty set is greater than the number of the non-repeating properties, continue to iterate until the number of elements in the synthetic empty set is not greater than the number of non-repeating properties. The number of repeated attributes generates the multiple basic views, and the ordering of the multiple basic views is in descending order;

The cross view synthesis unit first defines an empty set, so that the number of fields in the view is an even number, and then combines the attributes whose domain values are greater than half of the number of fields in the view from the former of the two basic views sorted together with the From the two basic views sorted before and after, take out the attribute whose domain value is less than half of the number of columns in the view, and generate a cross view of the two basic views sorted before and after, and if the number of columns in the view is an odd number, then judge The relationship between the last attribute of the data set and half of the number of fields in the view is used to iteratively obtain the cross view of the two basic views sorted one after another; and

The view list synthesis unit calculates the sum of the basic view and the cross view when the domain size of the cross view is not greater than the domain size of the base view, and when the sum is less than the existing temporary value, it uses a new value The sum, the base view and the cross view replace the existing temporary values, and after multiple iterative calculations, the final base view and the final cross view are obtained.

The data synthesis system with differential privacy protection as described in claim 1, wherein the marginal distribution module is based on the Laplacian mechanism and evenly distributes the privacy budget to each final basic view and each final cross view. Generated margin table.

The data synthesis system with differential privacy protection as described in claim 1, wherein the post-processing module further includes:

The consistency unit performs weighted calculations through weights on the counts of each final basic view and each final cross view, and updates the counts of each final basic view and each final cross view;

a non-negative unit that iteratively adds and subtracts to eliminate negative values in the counts of each final base view and each final cross view; and

The integer unit removes or adds values to these marginal tables so that each marginal table has the same number of records as the original data.

The data synthesis system with differential privacy protection as described in claim 1, wherein the data synthesis module further includes:

The data synthesis unit sorts the common fields in the two adjacent views and gives higher priority to the fields with more counts, so that the synthetic data is generated after the fields of the two adjacent views are matched; and

The data format conversion unit converts the format of the synthetic data into the format of the original data.

A data synthesis method with differential privacy protection, which includes the following steps:

Intervalize the numerical data in the data fields of the original data to generate a data set;

According to a predetermined number of view fields, the data set is randomly synthesized into multiple basic views, the multiple basic views are sorted and numbered according to the domain size of each basic view, and a cross view of two consecutively ordered basic views is generated according to their domain sizes; then, the domain size of the basic view is compared with the domain size of the cross view, so that no merging is performed when the domain size of the cross view is larger than the domain size of the basic view, or the existing temporary value is replaced when the domain size of the cross view is not larger than the domain size of the basic view, whereby the final basic view and the final cross view are obtained after multiple iterative calculations;

Evenly distribute the privacy budget to the marginal tables generated by each final base view and each final cross view;

Make the counts in the marginal tables consistent, non-negative, and integer; and

Two adjacent views are combined to generate synthetic data, so that after data format conversion, the synthetic data becomes the format of the original data.

The data synthesis method with differential privacy protection as described in claim 7, wherein the cross view is used to bridge the two consecutively ordered basic views, and when the number of fields of the original data cannot be evenly divided by the number of view fields and the absolute value of the last attribute of the data set is not greater than half of the number of view fields, the cross view between the last two basic views is not created.

The data synthesis method with differential privacy protection as described in claim 7, wherein the step of generating the final basic view and the final cross view further includes:

First define a synthetic empty set and generate a synthetic set that is the same as the data set, select several unique attributes from the synthetic set and add them to the synthetic empty set, and subtract the unique numbers from the synthetic set. attributes, so that when the number of elements in the synthesized empty set is greater than the number of the unique attributes, the iteration loop continues until the number of elements in the synthesized empty set is not greater than the number of the unique attributes. number, generate the multiple basic views, and the multiple basic views are sorted in descending power order;

First define an empty set, so that the number of columns in the view is an even number, and combine the former of the two basic views sorted before and after to take out the attributes whose domain value is greater than half of the number of columns in the view and the two basic views sorted before and after. From the latter, take out the attributes whose domain value is less than half of the number of fields in the view, generate a cross view of the two basic views sorted before and after, and the number of fields in the view is an odd number, and then determine the value of the data set The relationship between the last attribute and half of the number of fields in the view is used to obtain the cross view of the two basic views sorted before and after through iteration; and

When the domain size of the cross view is not greater than the domain size of the base view, calculate the sum of the base view and the cross view, and when the sum is less than the existing temporary value, use the new sum and the base view and the cross view replace the existing temporary value, and after multiple iterative calculations, the final basic view and the final cross view are obtained.

The data synthesis method with differential privacy protection as described in claim 7, wherein the step of evenly allocating the privacy budget to the margin tables generated by each of the final basic views and each of the final cross views is based on Laplacian mechanism to evenly allocate the privacy budget to the marginal tables generated by each final base view and each final cross view.

The data synthesis method with differential privacy protection as described in claim 7, wherein the steps of uniformizing, non-negative and integerizing the counts in the marginal tables include:

Perform weighted calculations on the counts of each final basic view and each final cross view through weights, and update the counts of each final basic view and each final cross view;

Eliminate negative values in the counts of each final base view and each final cross view by iteratively adding and subtracting; and

Remove or add values to these marginal tables so that each marginal table has the same number of records as the original data.

The data synthesis method with differential privacy protection as described in claim 7, wherein the step of synthesizing two adjacent views to generate synthetic data, so that after data format conversion the synthetic data has the format of the original data, includes:

Sorting the common fields in the two adjacent views, and then giving higher priority to the fields with more counts, so as to generate the composite data after matching the fields of the two adjacent views; and

The format of the synthetic data is converted into the format of the original data.

A computer-readable medium, used in a computing device or computer, storing instructions to execute the data synthesis method with differential privacy protection as described in any one of claims 7 to 12.