CN117235800A - Data query protection method based on a secret-specification-based personalized privacy protection mechanism
- Publication number
- CN117235800A, CN202311416556.4A
- Authority
- CN
- China
- Prior art keywords
- query
- attribute
- mean
- data set
- sensitive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Storage Device Security (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data query protection method based on a secret-specification-based personalized privacy protection mechanism. An access device sends a query instruction to a data storage device, and the data storage device: parses the query instruction to obtain a query function and a query attribute; extracts the query attribute data set; divides the query attribute data set into a sensitive subset and a non-sensitive subset based on the user secret specification set; obtains a mean query result for the query attribute data set using a pre-built secret-specification-based Laplace mechanism; obtains a median query result for the query attribute data set using a pre-built secret-specification-based exponential mechanism; and publishes the mean query result and/or the median query result. Secret specifications help define the scope of privacy protection and the protected entities precisely, avoid the strict constraint of treating every attribute record as sensitive, and yield less data distortion and more accurate query results, balancing privacy protection and data utility.
Description
Technical Field
The present invention relates to the field of data security technology, and in particular to a data query protection method based on a secret-specification-based personalized privacy protection mechanism.
Background Art
With the arrival of the big data era, large volumes of sensitive data, including personal identity information, medical records, and financial transactions, are stored in databases. Demand for statistical queries over these data is growing rapidly in support of research, business decisions, and government policy making. Statistical queries such as the mean and the median are important tools for understanding data distributions and trends, and they give many industries key insight into their data. For example, researchers may need the average age of patients in a medical study, a financial institution may need the median credit score of its customers, or an analyst may need the average age or median annual income from a census data set. These queries usually involve sensitive information, so effective privacy protection is required.
Differential privacy (DP) is becoming the gold standard for privacy protection because it is theoretically provable and robust against adversaries with prior knowledge. It protects individual privacy while still meeting data analysis needs, and it can be applied to common analysis operations such as mean and median queries so that sensitive data is properly protected. Legal frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, the Family Educational Rights and Privacy Act (FERPA), and the EU General Data Protection Regulation (GDPR) aim to ensure that organizations and individuals follow principles of transparency, fairness, and security when collecting, processing, and sharing personal information. In addition, recent privacy regulations in California, including the California Consumer Privacy Act and the California Privacy Rights Act, strengthen individuals' control over their personal information and require transparency from organizations that collect and use data. The common goal of these laws and standards is to protect individual privacy, to give individuals the right to know how their data is used, and to let them regulate and limit the sharing and processing of their data. From a legislative and policy perspective, users have the right to control their own privacy, and they show personalized privacy needs. This notion of personalization is rooted in each person's cultural background, personal privacy preferences, and social factors, and it reflects the differences in privacy expectations among users.
However, when it comes to individuals' control over the sensitivity of their own data, traditional differential privacy is often too strict. It usually treats all data related to an individual in a data set as inherently sensitive, whereas in practice, because privacy preferences and attitudes differ, not all such information is regarded as sensitive or needs the same level of protection. Consider a smart building management system that processes large amounts of sensor data and personal information, including users' location details and health indicators. Individuals may hold different views on the sensitivity of these attributes. Some users may consider their location information sensitive and their health data non-sensitive; others may consider health data sensitive and location information non-sensitive; still others may consider both attributes sensitive, or neither. Under traditional differential privacy data protection methods, users cannot independently define their own secret specifications, the scope of data protection cannot be defined precisely, and query results are heavily distorted and of low utility.
Census data sets usually contain information such as age, gender, annual income, telephone number, and health status. Attributes such as annual income, health status, and age involve personal privacy, and different users have different privacy setting requirements. For example, some individuals may not want to disclose their exact annual income, and others may not want to disclose their age or health status, so that leaked information cannot be exploited by criminals or for unwanted marketing. With traditional differential privacy data protection methods, users cannot independently define their own secret specifications, the scope of data protection cannot be defined precisely, and the query results are heavily distorted, which lowers their utility.
Summary of the Invention
The present invention aims to solve the technical problems in the prior art and provides a data query protection method based on a secret-specification-based personalized privacy protection mechanism, as well as a census data set query protection method.
To achieve the above purpose, according to a first aspect, the present invention provides a data query protection method based on a secret-specification-based personalized privacy protection mechanism, comprising: an access device sends a query instruction to a data storage device; the data storage device: receives and parses the query instruction to obtain a query function and a query attribute; extracts a query attribute data set from a set data set; obtains a user secret specification set and divides the query attribute data set into a sensitive subset and a non-sensitive subset based on it; when the query function is a mean query function, obtains a mean query result for the query attribute data set using a pre-built secret-specification-based Laplace mechanism; when the query function is a median query function, obtains a median query result for the query attribute data set using a pre-built secret-specification-based exponential mechanism; and publishes the mean query result and/or the median query result of the query attribute data set to the access device.
This technical solution allows each user to set, through a secret specification, which of their attribute records are sensitive and which are not. This helps define the scope of privacy protection and the protected entities precisely, and it avoids the strict constraint of traditional differential privacy methods, which treat all attribute records as sensitive. Privacy protection becomes more flexible and personalized, with less data distortion and more accurate query results. The solution proposes a secret-specification-based Laplace mechanism (SSLM) applied to mean queries and a secret-specification-based exponential mechanism (SSEM) applied to median queries. Both improve the accuracy of data analysis while minimizing data distortion, especially when a large part of the data is non-sensitive, and so balance privacy protection and data utility. Compared with state-of-the-art differential privacy mechanisms, SSLM improves utility by about 14 times for mean queries by exploiting non-sensitive data, and SSEM improves utility by about 6 times for median queries by exploiting non-sensitive data.
To achieve the above purpose, according to a second aspect, the present invention provides a census data set query protection method, comprising: an access device sends a query instruction to a data storage device that stores a census data set; the data storage device: receives and parses the query instruction to obtain a query function and a query attribute, the query attribute including age and annual income; extracts a query attribute data set from the census data set; obtains a user secret specification set and divides the query attribute data set into a sensitive subset and a non-sensitive subset based on it; when the query function is a mean query function, obtains a mean query result for the query attribute data set using a pre-built secret-specification-based Laplace mechanism; when the query function is a median query function, obtains a median query result for the query attribute data set using a pre-built secret-specification-based exponential mechanism; and publishes the mean query result and/or the median query result of the query attribute data set to the access device.
This technical solution allows each user to set, through a secret specification, which of their attribute records in the census data set are sensitive and which are not. This helps define the scope of privacy protection and the protected entities precisely, and it avoids the strict constraint of traditional differential privacy methods, which treat all attribute records of the census data set as sensitive. Privacy protection becomes more flexible and personalized, with less data distortion and more accurate query results. The solution proposes a secret-specification-based Laplace mechanism (SSLM) applied to mean queries over the census data set and a secret-specification-based exponential mechanism (SSEM) applied to median queries over the census data set. Both improve query accuracy while minimizing data distortion, especially when a large part of the data is non-sensitive, and so balance privacy protection and data utility. Compared with state-of-the-art differential privacy mechanisms, SSLM improves utility by about 14 times for mean queries by exploiting non-sensitive data, and SSEM improves utility by about 6 times for median queries by exploiting non-sensitive data.
Brief Description of the Drawings
FIG. 1 is a flow chart of the data query protection method based on a secret-specification-based personalized privacy protection mechanism in a preferred embodiment of the present invention;
FIG. 2 is a first example of score function values for computing the median according to the present invention;
FIG. 3 is a second example of score function values for computing the median according to the present invention;
FIG. 4 is a flow chart of the census data set query protection method in another preferred embodiment of the present invention;
FIG. 5 shows how the RMSE of SSLM mean query results changes with the proportion of non-sensitive attributes in another preferred embodiment of the present invention;
FIG. 6 shows how the RMSE of SSLM mean query results changes with the degree of privacy protection in another preferred embodiment of the present invention;
FIG. 7 shows how the RMSE of SSEM median query results changes with the proportion of non-sensitive attributes in another preferred embodiment of the present invention;
FIG. 8 shows how the RMSE of SSEM median query results changes with the degree of privacy protection in another preferred embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, terms indicating orientation or position, such as "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", and "outside", are based on the orientations or positions shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, so they are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, the terms "installed", "connected", and "coupled" are to be understood broadly. A connection may be mechanical or electrical, may be an internal connection between two elements, and may be direct or indirect through an intermediate medium. A person of ordinary skill in the art can understand the specific meaning of these terms according to the circumstances.
From a legislative and policy perspective, users have the right to control their own privacy, and they show personalized privacy needs. This notion of personalization is rooted in each person's cultural background, personal privacy preferences, and social factors, and it reflects the differences in privacy expectations among users. Existing research on personalized differential privacy focuses mainly on personalizing the privacy budget within the differential privacy (DP) framework, that is, letting individuals specify how strongly their own data is protected. Individuals still cannot declare which data is sensitive and needs protection and which is not. Where individuals' control over the sensitivity of their own data is concerned, traditional differential privacy is often too strict: it usually treats all data related to an individual in the data set as inherently sensitive, which leads to heavy data distortion on access, higher computational overhead, and low accuracy when the queried data is used for later analysis. For this reason, this application integrates secret specifications into the differential privacy framework and introduces secret-specification-based differential privacy (SSDP).
The definitions and mechanisms of the secret-specification-based differential privacy (SSDP) proposed in this application are explained below.
Let the set data set be the data set to be protected, written $D_0=(r_1,\dots,r_i,\dots,r_n)$, where $n$ is the number of users, $i\in[1,n]$ is the user index, and $r_i$ is the record of user $u_i$. $U=\{u_1,\dots,u_n\}$ denotes the user set. Each $r_i$ is a $k$-dimensional variable drawn from the domain $\mathcal{D}=A_1\times\cdots\times A_k$, where $k$ is the number of attributes; $r_i[A_j]$ denotes the attribute $A_j$ record of user $u_i$, with attribute index $j\in[1,k]$, and $r$ denotes a single attribute record. According to each user's own secret specification, an attribute record can be set as a sensitive record or an insensitive record. Sensitive records, insensitive records, and attribute records are all variables and can take different record values. The set data set can be, for example, a census data set.
Definition 1 (secret specification): The secret specification of user $u_i$ is formally defined as a binary function $S_i: r_i\to\{0,1\}^k$, where $r_i$ is the single $k$-dimensional attribute record associated with user $u_i$. The function $S_i$ determines the sensitivity classification of each attribute $A_j$ ($1\le j\le k$) in record $r_i$. If $S_i(A_j)=1$, the attribute $A_j$ record is considered sensitive; if $S_i(A_j)=0$, it is considered insensitive.
According to Definition 1, the secret specification chosen by each user divides that user's record (containing $k$ attributes) into two sub-records. Specifically, the function $S_i$ splits the attribute values in record $r_i$ into a sensitive sub-record $r_i^{s}$ and a non-sensitive sub-record $r_i^{ns}$. In particular, when $k=1$, that is, when each record is associated with only one attribute, the users' secret specification set $\mathcal{S}$ divides the set data set $D_0$ into two subsets: the sensitive data subset $D_s$ and the non-sensitive data subset $D_{ns}$.
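The single-attribute case ($k=1$) described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `partition_by_secret_spec`, the dictionary layout, and the choice to treat users with no declared specification as sensitive are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def partition_by_secret_spec(
    values: Dict[str, float],
    is_sensitive: Dict[str, int],
) -> Tuple[List[float], List[float]]:
    """Split one attribute column into (sensitive, non_sensitive) values.

    values maps a user id to that user's value for the queried attribute.
    is_sensitive maps a user id to S_i(A_j) for that attribute (1 or 0).
    Assumption: a user with no declared specification is treated as sensitive.
    """
    sensitive: List[float] = []
    non_sensitive: List[float] = []
    for user, value in values.items():
        if is_sensitive.get(user, 1) == 1:
            sensitive.append(value)
        else:
            non_sensitive.append(value)
    return sensitive, non_sensitive


# Hypothetical data: three users, one of whom marks age as non-sensitive.
ages = {"u1": 34, "u2": 51, "u3": 27}
spec = {"u1": 1, "u2": 0}  # u3 declares nothing and defaults to sensitive
d_s, d_ns = partition_by_secret_spec(ages, spec)
```

Defaulting undeclared users to sensitive is the conservative choice, since it falls back to the traditional treatment in which every record is protected.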
Definition 2 ($\mathcal{S}$-adjacent records): Let $\mathcal{S}=\{S_1,\dots,S_n\}$ denote the set of secret specifications associated with the user set $U$. The record $r_i$ of user $u_i$ and a record $r_i'$ are $\mathcal{S}$-adjacent if, for every $j$, $r_i[A_j]=r_i'[A_j]$ whenever $S_i(A_j)=0$, where $r_i[A_j]$ is the attribute $A_j$ value of $r_i$.
Definition 3 ($\mathcal{S}$-adjacent data sets): Data sets $D$ and $D'$ are $\mathcal{S}$-adjacent if and only if exactly one record in $D$ differs from $D'$ and the differing records are $\mathcal{S}$-adjacent.
Definition 4 ($\mathcal{S}$-sensitivity): For any pair of $\mathcal{S}$-adjacent data sets $D$ and $D'$, the global sensitivity of the query function $f$ is denoted $\Delta f_{\mathcal{S}}$ and measured in the $L_1$ norm: $\Delta f_{\mathcal{S}}=\max_{D,D'}\lVert f(D)-f(D')\rVert_1$.
Starting from the user's right to privacy, this application lets users independently specify the sensitivity of their own records, which personalizes individual privacy needs, and on that basis proposes Secret Specification-based Differential Privacy (SSDP).
Definition 5 (secret-specification-based differential privacy, SSDP): Given the secret specification set $\mathcal{S}$, a randomized algorithm $\mathcal{M}$ satisfies $(\mathcal{S},\epsilon)$-SSDP if, for any $\mathcal{S}$-adjacent data sets $D$ and $D'$ and any subset $o\subseteq\mathrm{Range}(\mathcal{M})$, the algorithm $\mathcal{M}$ satisfies $\Pr[\mathcal{M}(D)\in o]\le e^{\epsilon}\,\Pr[\mathcal{M}(D')\in o]$.
$\mathrm{Range}(\mathcal{M})$ denotes the output space of the randomized algorithm $\mathcal{M}$, $\Pr[\mathcal{M}(D)\in o]$ is the probability that $\mathcal{M}$ applied to data set $D$ produces a result in $o$, and $\Pr[\mathcal{M}(D')\in o]$ is the probability that $\mathcal{M}$ applied to data set $D'$ produces a result in $o$. $\epsilon$ denotes the preset degree of privacy protection, with $\epsilon>0$.
The main goal of the SSDP proposed in this application is to protect the sensitive records in the set data set. Its distinguishing feature is that the sensitivity of a record attribute is decided by the user and is independent of the attribute's recorded value. Changing the value of a sensitive record therefore does not change its sensitivity, which yields a symmetric neighborhood relation. As Definition 5 makes clear, for the sensitive attributes of a record, SSDP guarantees the same level of privacy protection as traditional DP and so resists strong attacks.
This application proposes the Secret Specification-based Laplace Mechanism (SSLM) as a basic mechanism for realizing SSDP and applies it to mean queries over a database or the set data set.
The SSLM has the following theorems:
Theorem 1 (SSLM): Given a function $f: D\to R$ and the users' secret specification set $\mathcal{S}$, the SSLM expressed as $\mathcal{M}(D)=f(D)+\mathrm{Lap}(\Delta f_{\mathcal{S}}/\epsilon)$ satisfies $(\mathcal{S},\epsilon)$-SSDP, where $\Delta f_{\mathcal{S}}$ is the $\mathcal{S}$-sensitivity of $f$ and $R$ is the range of $f$; when $f$ is a median or mean query function, $R$ is the set of real numbers. $\mathrm{Lap}(\cdot)$ denotes noise drawn from the Laplace distribution with the given scale.
Theorem 2 (SSLM for the mean): When the query function $f$ is the mean function, the SSLM, that is, $\mathcal{M}(D)=f(D)+\mathrm{Lap}(\Delta f_{\mathcal{S}}/\epsilon)$, satisfies $(\mathcal{S},\epsilon)$-SSDP.
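A Laplace mechanism of this general form can be sketched in Python as below. This is an illustration under stated assumptions rather than the patent's algorithm: the function names are hypothetical, the $\mathcal{S}$-sensitivity of the mean is passed in as a parameter because this section does not give its value, and the sketch simply perturbs the overall mean without modelling how non-sensitive records are exploited.

```python
import random
from typing import Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sslm_mean(
    values: Sequence[float],
    sensitivity: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return mean(values) plus Laplace noise of scale sensitivity / epsilon.

    `sensitivity` stands in for the S-sensitivity of the mean query; its
    value is an input here, not something derived from the patent text.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = rng or random.Random()
    true_mean = sum(values) / len(values)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: four ages, assuming ages lie in [0, 100] so that
# changing one record moves the mean by at most 100 / 4 = 25.
ages = [34, 51, 27, 45]
noisy_mean = sslm_mean(ages, sensitivity=25.0, epsilon=1.0, rng=random.Random(0))
```

A smaller sensitivity or a larger $\epsilon$ shrinks the noise scale, which is the lever through which a mechanism of this kind can return more accurate results.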
This application also proposes the Secret Specification-based Exponential Mechanism (SSEM) as a basic mechanism for realizing SSDP and applies it to median queries over a database or the set data set.
Consider a query function $f: D\to O$, which here may be the median query function. The real-valued score function over the output space $O$ of $f$ can be expressed as $s(D,o)=\max_{D^*:\,f(D^*)=o}-\lvert D\ominus D^*\rvert$, where $D^*$ is a data set obtained by changing the record values of any number of sensitive records in $D$ such that $f(D^*)=o$; $D\ominus D^*$ is the set of sensitive records that differ between $D$ and $D^*$; and $\lvert D\ominus D^*\rvert$ is the cardinality of that set, that is, the number of sensitive records whose values change when $D$ is transformed into $D^*$. The score $s(D,o)$ is never positive: it is the negated count of changed sensitive records for the $D^*$ that changes the fewest of them, which is the choice that maximizes $s(D,o)$. $\Delta s=\max_{o\in O}\max_{D,D'}\lvert s(D,o)-s(D',o)\rvert$ is the global sensitivity of the score function $s$. For basic statistical functions such as the median query function, $\Delta s=1$.
Definition 6 (exponential mechanism, EM): Let $O$ denote the set of all possible outputs of a randomized algorithm $\mathcal{M}$, that is, its output space. For a score function $s: D\times O\to\mathbb{R}$, if $\mathcal{M}$ produces output $o\in O$ with probability proportional to $\exp\!\left(\frac{\epsilon\,s(D,o)}{2\Delta s}\right)$, then $\mathcal{M}$ satisfies $\epsilon$-DP. Here $\Delta s$ is the global sensitivity of $s$ taken over adjacent data sets $D$ and $D'$, and $\mathbb{R}$ denotes the real numbers.
According to the secret specification set $\mathcal{S}$, this application modifies the score function $s(D,o)$ of the original exponential mechanism and restates it in terms of sensitive records only. Here $r$ denotes a sensitive record in $D$ associated with a one-dimensional attribute, and $D^*$ is the data set obtained from $D$ by changing the minimum number of sensitive records needed to reach $f(D^*)=o$; for that $D^*$, the number of changed sensitive records multiplied by $-\epsilon$ takes its maximum value.
Definition 7 (Secret-Specification-based Exponential Mechanism, SSEM): Given a function $f: D \to O$ and the users' secret specification set $\mathcal{S}$, SSEM outputs $o$ with probability $\Pr[o] = \frac{\exp\left(s_{\mathcal{S}}(D, o)/2\right)}{\sum_{z \in O} \exp\left(s_{\mathcal{S}}(D, z)/2\right)}$, with $z, o \in O$, where $s_{\mathcal{S}}$ is the score function modified according to the secret specification. The value $o$ drawn according to this probability is the median query result. $\exp(\cdot)$ denotes the exponential function.
Privacy analysis:
Theorem 2 (Secret-Specification-based Laplace Mechanism, SSLM): When the query function $f$ is the mean function, SSLM satisfies SSDP with respect to the secret specification $\mathcal{S}$ and therefore meets the privacy requirement.
Theorem 3: When the query function $f$ is the median function, SSEM satisfies SSDP with respect to the secret specification $\mathcal{S}$ and therefore meets the privacy requirement.
The present application provides a data query protection method based on a secret-specification personalized privacy protection mechanism. As shown in FIG. 1, a preferred embodiment includes:
Step S101: the access device sends a query instruction to the data storage device. The access device is preferably, but not limited to, a mobile terminal, a PC, or a laptop. The data storage device is preferably, but not limited to, a data server or a cloud server. The access device and the data storage device communicate over an Internet connection.
The data storage device performs:
Step S102: receive and parse the query instruction to obtain the query function and the query attribute, both of which are carried in the query instruction. Because each user's record in the preset data set has multi-dimensional attributes, a query such as the average age or the average annual income over a census data set must be processed on the data set of the age attribute or of the annual income attribute.
Step S103: extract the query attribute data set from the preset data set, including:
The preset data set is denoted $D_0 = (r_1, \ldots, r_i, \ldots, r_n)$, where $n$ is the number of users, $r_i$ is the record of user $u_i$, the user index $i \in [1, n]$, $k$ is the number of attribute dimensions, and $r_i$ comes from the $k$-dimensional attribute domain $A_1 \times \cdots \times A_k$. $r_i^{A_j}$ denotes user $u_i$'s record for attribute $A_j$, with attribute index $j \in [1, k]$. If the query attribute is $A_j$, the query attribute data set is $D = (r_1^{A_j}, \ldots, r_n^{A_j})$.
Step S104: obtain the user secret specification set and, based on it, divide the query attribute data set into a sensitive subset and a non-sensitive subset. The user secret specification set contains each user's binary function over all attributes. After the query attribute has been parsed, the secret specifications for that attribute are extracted from the user secret specification set, and the query attribute data set is divided into a sensitive subset and a non-sensitive subset accordingly. Specifically, step S104 includes:
Step S1041: the user secret specification set $\mathcal{S}$ contains the secret specifications of all users in the preset data set. The secret specification of user $u_i$ is defined as a binary function $S_i: r_i \to \{0, 1\}^k$. If user $u_i$ defines the record of attribute $A_j$ as sensitive, then $S_i(A_j) = 1$; if user $u_i$ defines the record of attribute $A_j$ as not sensitive, then $S_i(A_j) = 0$.
Step S1042: let attribute $A_j$ be the query attribute and, for every user, do the following: if $S_i(A_j) = 1$, place the record value $r_i^{A_j}$ of user $u_i$ in the sensitive subset $D_s$; if $S_i(A_j) = 0$, place the record value $r_i^{A_j}$ in the non-sensitive subset $D_{ns}$.
Step S104 thus divides the query attribute data set into a sensitive subset and a non-sensitive subset. In a mean or median query, only the sensitive subset is protected. This protects the sensitive data while reducing the distortion of the query result and improving its utility.
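Steps S103 and S104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `split_by_secret_spec`, the dict-based record layout, and the per-user dict encoding of $S_i$ are all hypothetical choices made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_secret_spec(
    records: Sequence[Dict[str, float]],
    secret_specs: Sequence[Dict[str, int]],
    query_attr: str,
) -> Tuple[List[float], List[float]]:
    """Split the values of `query_attr` into (sensitive, non_sensitive) lists.

    records[i] holds user u_i's attribute values (hypothetical layout).
    secret_specs[i][attr] plays the role of S_i(attr): 1 means user u_i
    marks that attribute as sensitive, 0 means not sensitive.
    """
    if len(records) != len(secret_specs):
        raise ValueError("one secret specification is required per user")
    sensitive: List[float] = []
    non_sensitive: List[float] = []
    for record, spec in zip(records, secret_specs):
        value = record[query_attr]       # step S103: project onto attribute A_j
        if spec[query_attr] == 1:        # step S1042: S_i(A_j) = 1
            sensitive.append(value)
        else:                            # step S1042: S_i(A_j) = 0
            non_sensitive.append(value)
    return sensitive, non_sensitive
```

For instance, three users who mark `age` as sensitive, not sensitive, and sensitive respectively would have their first and third ages placed in $D_s$ and the second in $D_{ns}$.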
Step S105: when the query function is the mean query function, obtain the mean query result of the query attribute data set using the pre-built secret-specification-based Laplace mechanism;
when the query function is the median query function, obtain the median query result of the query attribute data set using the pre-built secret-specification-based exponential mechanism.
Step S106: release the mean query result and/or the median query result of the query attribute data set to the access device.
In this embodiment, preferably, obtaining the mean query result of the query attribute data set in step S105 with the pre-built secret-specification-based Laplace mechanism, when the query function is the mean query function, includes:
Step A1: compute the mean $f_{mean}(D)$ of the query attribute data set from the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$:
$f_{mean}(D) = \frac{|D_s| \, f_{mean}(D_s) + |D_{ns}| \, f_{mean}(D_{ns})}{|D_s| + |D_{ns}|}$, where $|\cdot|$ denotes the cardinality of a data set, i.e. the number of data items in it; $f_{mean}(\cdot)$ denotes the mean query function; $f_{mean}(D_s)$ is the mean of the sensitive subset $D_s$; and $f_{mean}(D_{ns})$ is the mean of the non-sensitive subset $D_{ns}$.
Step A2: construct an $\mathcal{S}$-neighboring data set $D'$ of the query attribute data set, specifically according to Definition 3 above, and obtain the global sensitivity $\Delta f_{\mathcal{S}}$ of the mean query function:
$\Delta f_{\mathcal{S}} = \max_{D, D'} \lVert f_{mean}(D) - f_{mean}(D') \rVert_1$, where $f_{mean}(D')$ denotes the mean of the $\mathcal{S}$-neighboring data set $D'$ and $\lVert \cdot \rVert_1$ denotes the L1 norm.
Step A3: add noise following the Laplace distribution to the mean $f_{mean}(D)$ of the query attribute data set to obtain the mean query result $\hat{f}_{mean}(D)$:
$\hat{f}_{mean}(D) = f_{mean}(D) + \mathrm{Lap}\left(\frac{\Delta f_{\mathcal{S}}}{\epsilon}\right)$, where $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution with the given scale and ε denotes the preset degree of privacy protection, ε > 0.
It can be seen that using the non-sensitive subset $D_{ns}$ improves the utility of the mean query result and keeps data distortion to a minimum, while the Laplace noise protects the privacy of the mean of the sensitive subset $D_s$. This achieves personalized privacy protection and supports effective data analysis.
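Steps A1 to A3 can be illustrated with the following Python sketch. It rests on assumptions that the text does not state: attribute values are taken to lie in a known range `[lower, upper]`, so that changing one sensitive record shifts the mean by at most `(upper - lower) / n`, and the sensitivity is taken as zero when there are no sensitive records. The names `sample_laplace` and `sslm_mean` are hypothetical.

```python
import math
import random
from typing import Optional, Sequence


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    if scale == 0.0:
        return 0.0
    u = rng.random() - 0.5                       # u in [-0.5, 0.5)
    tail = max(1.0 - 2.0 * abs(u), 1e-300)       # guard log(0) at u == -0.5
    return -scale * math.copysign(1.0, u) * math.log(tail)


def sslm_mean(
    sensitive: Sequence[float],
    non_sensitive: Sequence[float],
    epsilon: float,
    lower: float,
    upper: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Noisy mean of D = D_s united with D_ns (sketch of steps A1 to A3)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    n_s, n_ns = len(sensitive), len(non_sensitive)
    n = n_s + n_ns
    if n == 0:
        raise ValueError("data set is empty")
    rng = rng or random.Random()

    # Step A1: weighted combination of the two subset means.
    mean_s = sum(sensitive) / n_s if n_s else 0.0
    mean_ns = sum(non_sensitive) / n_ns if n_ns else 0.0
    true_mean = (n_s * mean_s + n_ns * mean_ns) / n

    # Step A2 (assumption): neighbors differ in one sensitive record only,
    # and values are bounded, so the mean moves by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n if n_s else 0.0

    # Step A3: add Laplace noise with scale sensitivity / epsilon.
    return true_mean + sample_laplace(sensitivity / epsilon, rng)
```

Under these assumptions a data set with no sensitive records is answered exactly, which is in line with the boundary behavior reported for δ = 1 in the experiments.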
In a simplified application scenario of this embodiment, the mean query proceeds as follows:
Step 101: input the data set $D$, the users' secret specification set $\mathcal{S}$, the privacy budget ε, and the mean query function $f_{mean}$.
Step 102: for the mean query, consider the case where each record is associated with only one attribute, i.e. $k = 1$, so that the sensitivity of a user's record coincides with the sensitivity of the record's attribute. The users' secret specification set $\mathcal{S}$ divides the data set $D = (r_1, \ldots, r_n)$ into two subsets, the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$. Accordingly, we initialize the sensitive subset $D_s = \emptyset$ and the non-sensitive subset $D_{ns} = \emptyset$.
Step 103: based on the users' secret specification set $\mathcal{S}$, we obtain the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$. Without loss of generality, assume $D_s = (r_1, r_2, \ldots, r_m)$ and $D_{ns} = (r_{m+1}, r_{m+2}, \ldots, r_n)$, with each record associated with one numerical attribute. From $D_s$ and $D_{ns}$, the mean query result for the data set $D$ is $f(D) = \frac{|D_s| \, f(D_s) + |D_{ns}| \, f(D_{ns})}{|D|}$, where $|\cdot|$ denotes the cardinality of a data set.
Step 104: assume the data sets $D = D_s \cup D_{ns}$ and $D' = D'_s \cup D'_{ns}$ are $\mathcal{S}$-neighboring; then $|D_s| = |D'_s|$ and $D_{ns} = D'_{ns}$. First compute the $\mathcal{S}$-sensitivity $\Delta f_{\mathcal{S}}$ of the mean query function. Because non-sensitive records in $D$ have no $\mathcal{S}$-neighboring records, $\Delta f_{\mathcal{S}}$ is determined only by changes to sensitive records. It is worth emphasizing that, compared with the global sensitivity $\Delta f$ of the mean query, $\Delta f_{\mathcal{S}} \le \Delta f$. Noise following the Laplace distribution $\mathrm{Lap}(\Delta f_{\mathcal{S}} / \epsilon)$ is then added to the mean query result $f(D_s)$ computed on the sensitive subset $D_s$.
Step 105: return the noisy mean query result.
In this embodiment, preferably, obtaining the median query result of the query attribute data set in step S105 with the pre-built secret-specification-based exponential mechanism, when the query function is the median query function, includes:
Step B1: sort the record values of the query attribute records $r_i^{A_j}$ in the query attribute data set $D$ in ascending order, so that the sorted data set satisfies $r_i \le r_{i+1}$.
Step B2: let the cardinality of the sorted query attribute data set be $|D| = 2m + 1$, which gives the first intermediate parameter $m = (|D| - 1)/2$, and let $O$ denote the median output space of $D$. The median query function returns a record value in $D$ as the median query result; therefore $r_i \in O$, i.e. the record values belong to the median output space $O$.
Construct an $\mathcal{S}$-neighboring data set $D'$ of the query attribute data set, specifically according to Definition 3 above. Since $D = D_s \cup D_{ns}$ and $D' = D'_s \cup D'_{ns}$ are assumed to be $\mathcal{S}$-neighboring, we have $|D_s| = |D'_s|$ and $D_{ns} = D'_{ns}$, where $D'_s$ is the sensitive subset and $D'_{ns}$ the non-sensitive subset of $D'$. It must be emphasized that, under the secret specification $\mathcal{S}$, non-sensitive records in $D$ have no $\mathcal{S}$-neighboring records, so changes to non-sensitive records are not considered.
Step B3: for any candidate output median $o \in O$, compute its score function value $s_{\mathcal{S}}(D, o)$ as follows:
$s_{\mathcal{S}}(D, o) = \max_{D^*: f_{med}(D^*) = o} \sum_{r \in D \oplus D^*} -\epsilon$, where $D^*$ denotes a data set satisfying $f_{med}(D^*) = o$ that is formed by changing the record values of any number of sensitive records in $D$, leaving non-sensitive records unchanged; $D \oplus D^*$ denotes the set of sensitive records on which $D$ and $D^*$ differ; $r$ denotes a sensitive record in $D \oplus D^*$; the maximum is attained by the $D^*$ that changes the fewest sensitive records, and the score is that number of records multiplied by $-\epsilon$; $f_{med}(\cdot)$ denotes the median query function; and ε denotes the preset degree of privacy protection, ε > 0.
Further preferably, to obtain the score function value of a candidate median quickly, the formula above was analyzed and the score can be computed directly as follows. For any record $r_i \in D$ with $r_i \in O$, there are three cases when computing $s_{\mathcal{S}}(D, r_i)$:
(1) If $i < m$, then $m - i$ sensitive records to the right of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(m - i)\epsilon$.
(2) If $i = m$, no sensitive record is changed, so $s_{\mathcal{S}}(D, r_i) = 0$.
(3) If $i > m$, then $i - m$ sensitive records to the left of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(i - m)\epsilon$.
Step B4: compute the output probability of every candidate median in the median output space $O$. The output probability of a candidate median $o$ is $\Pr[o] = \frac{\exp\left(s_{\mathcal{S}}(D, o)/2\right)}{\sum_{z \in O} \exp\left(s_{\mathcal{S}}(D, z)/2\right)}$.
Step B5: sample one candidate median from $O$ according to these output probabilities and take it as the median query result.
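Steps B1 to B5 can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the text: record values are distinct, ranks are counted from 0 so that the true median is $r_m$, and a candidate that would need more sensitive records changed than are available on the relevant side receives a score of negative infinity and therefore probability 0. The function names are hypothetical.

```python
import math
import random
from typing import Dict, Optional, Sequence


def ssem_scores(
    sensitive: Sequence[float],
    non_sensitive: Sequence[float],
    epsilon: float,
) -> Dict[float, float]:
    """Score every record value as a candidate median (steps B1 to B3)."""
    tagged = sorted(
        [(v, True) for v in sensitive] + [(v, False) for v in non_sensitive]
    )
    if len(tagged) % 2 == 0:
        raise ValueError("sketch assumes |D| = 2m + 1")
    m = len(tagged) // 2
    flags = [is_sensitive for _, is_sensitive in tagged]
    scores: Dict[float, float] = {}
    for i, (value, _) in enumerate(tagged):      # ranks counted from 0
        needed = abs(m - i)                      # records that must change
        # They lie to the right of r_i when i < m, to the left when i > m.
        pool = flags[i + 1:] if i < m else flags[:i]
        available = sum(pool)                    # sensitive records on that side
        # Assumption: an unreachable candidate is scored -inf.
        scores[value] = -epsilon * needed if available >= needed else -math.inf
    return scores


def ssem_probabilities(scores: Dict[float, float]) -> Dict[float, float]:
    """Normalize exp(score / 2) over the output space (step B4)."""
    weights = {o: math.exp(s / 2.0) for o, s in scores.items()}  # exp(-inf) = 0
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


def ssem_median(
    sensitive: Sequence[float],
    non_sensitive: Sequence[float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Sample one candidate according to the output probabilities (step B5)."""
    rng = rng or random.Random()
    probs = ssem_probabilities(ssem_scores(sensitive, non_sensitive, epsilon))
    outputs = list(probs)
    return rng.choices(outputs, weights=[probs[o] for o in outputs], k=1)[0]
```

Under these assumptions, the data of Example 2 ($D_s = (1, 3, 5)$, $D_{ns} = (2, 4, 6, 7)$, ε = 1) give the candidate 1 probability 0, because three sensitive records to its right would have to change and only two exist, while every candidate in Example 1 stays reachable.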
In a simplified application scenario of this embodiment, the median query proceeds as follows:
Step 101: input the data set $D$, the users' secret specification set $\mathcal{S}$, the privacy budget ε, and the median query function $f_{med}$.
Step 102: for the median query, consider the case where each record is associated with only one attribute, i.e. $k = 1$, so that the sensitivity of a user's record coincides with the sensitivity of the record's attribute. The users' secret specification set $\mathcal{S}$ divides the data set $D = (r_1, \ldots, r_n)$ into two subsets, the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$. Accordingly, we initialize the sensitive subset $D_s = \emptyset$ and the non-sensitive subset $D_{ns} = \emptyset$.
Step 103: based on the users' secret specification set $\mathcal{S}$, we obtain the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$. Without loss of generality, assume $|D| = 2m + 1$ and $r_i \le r_{i+1}$, where $|\cdot|$ denotes the cardinality of a data set. The median function $f_{med}$ returns the record of rank $m$ in $D$, so the true median of $D$ is $f_{med}(D) = r_m$ (this is the middle element when ranks are counted from 0).
Step 104: since $D = D_s \cup D_{ns}$ and $D' = D'_s \cup D'_{ns}$ are assumed to be $\mathcal{S}$-neighboring, we have $|D_s| = |D'_s|$ and $D_{ns} = D'_{ns}$. It must be emphasized that, under the secret specification $\mathcal{S}$, non-sensitive records in $D$ have no $\mathcal{S}$-neighboring records, so changes to non-sensitive records are not considered.
For any record $r_i \in D$, there are three cases when computing $s_{\mathcal{S}}(D, r_i)$:
(1) If $i < m$, then $m - i$ sensitive records to the right of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(m - i)\epsilon$.
(2) If $i = m$, no record is changed, so $s_{\mathcal{S}}(D, r_i) = 0$.
(3) If $i > m$, then $i - m$ sensitive records to the left of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(i - m)\epsilon$.
Step 105: return the noisy median query result.
To illustrate in detail how the score function value of a candidate median $o$ is computed when obtaining the median query result, two calculation examples are given.
Example 1: $D = (1, 2, 3, 4, 5, 6, 7)$, $D_s = (1, 3, 5, 6)$, $D_{ns} = (2, 4, 7)$, ε = 1, $O = (1, 2, 3, 4, 5, 6, 7)$. Figure 2 shows the score function value of each possible output median and the probability of each.
Example 2: $D = (1, 2, 3, 4, 5, 6, 7)$, $D_s = (1, 3, 5)$, $D_{ns} = (2, 4, 6, 7)$, ε = 1, $O = (1, 2, 3, 4, 5, 6, 7)$. Figure 3 shows the score function value of each possible output median and the probability of each.
The tables in Figures 2 and 3 are concrete examples of this calculation. They differ in the secret specification applied to the query data set $D$, which leads to different proportions of sensitive records (sensitive attribute records) in $D$. The results in the two tables show that when the proportion of sensitive records is relatively low (below 50%), some records in $D$ cannot be output as the median. Figure 3 illustrates this: in Example 2, making 1 the median would require changing three sensitive records to its right, but only two (3 and 5) exist. Excluding such records raises the output probability of the remaining record values (the possible medians) and thereby improves data utility.
The present invention also discloses a query protection method for census data sets. As shown in FIG. 4, a preferred embodiment includes:
Step S201: the access device sends a query instruction to the data storage device that stores the census data set. The access device is preferably, but not limited to, a mobile terminal, a PC, or a laptop. The data storage device is preferably, but not limited to, a data server or a cloud server used in government affairs or public security systems. The access device and the data storage device communicate over an Internet connection.
The data storage device performs:
Step S202: receive and parse the query instruction to obtain the query function and the query attribute. The query attributes include age and annual income, and may also include health status, education level, ID number, and so on.
Step S203: extract the query attribute data set from the census data set, including:
The preset data set is denoted $D_0 = (r_1, \ldots, r_i, \ldots, r_n)$, where $n$ is the number of users, $r_i$ is the record of user $u_i$, the user index $i \in [1, n]$, $k$ is the number of attribute dimensions, and $r_i$ comes from the $k$-dimensional attribute domain $A_1 \times \cdots \times A_k$. $r_i^{A_j}$ denotes user $u_i$'s record for attribute $A_j$, with attribute index $j \in [1, k]$. If the query attribute is $A_j$, the query attribute data set is $D = (r_1^{A_j}, \ldots, r_n^{A_j})$.
Step S204: obtain the user secret specification set and, based on it, divide the query attribute data set into a sensitive subset and a non-sensitive subset. The user secret specification set contains each user's binary function over all attributes. After the query attribute has been parsed, the secret specifications for that attribute are extracted from the user secret specification set, and the query attribute data set is divided into a sensitive subset and a non-sensitive subset accordingly. Specifically, step S204 includes:
Step S2041: the user secret specification set $\mathcal{S}$ contains the secret specifications of all users in the preset data set. The secret specification of user $u_i$ is defined as a binary function $S_i: r_i \to \{0, 1\}^k$. If user $u_i$ defines the record of attribute $A_j$ as sensitive, then $S_i(A_j) = 1$; if user $u_i$ defines the record of attribute $A_j$ as not sensitive, then $S_i(A_j) = 0$.
Step S2042: let attribute $A_j$ be the query attribute and, for every user, do the following: if $S_i(A_j) = 1$, place the record value $r_i^{A_j}$ of user $u_i$ in the sensitive subset $D_s$; if $S_i(A_j) = 0$, place the record value $r_i^{A_j}$ in the non-sensitive subset $D_{ns}$.
Step S204 thus divides the query attribute data set into a sensitive subset and a non-sensitive subset. In a mean or median query, only the sensitive subset is protected. This protects the sensitive data while reducing the distortion of the query result and improving its utility.
Step S205: when the query function is the mean query function, obtain the mean query result of the query attribute data set using the pre-built secret-specification-based Laplace mechanism;
when the query function is the median query function, obtain the median query result of the query attribute data set using the pre-built secret-specification-based exponential mechanism.
Step S206: release the mean query result and/or the median query result of the query attribute data set to the access device.
In this embodiment, preferably, obtaining the mean query result of the query attribute data set in step S205 with the pre-built secret-specification-based Laplace mechanism, when the query function is the mean query function, includes:
Step C1: compute the mean $f_{mean}(D)$ of the query attribute data set from the sensitive subset $D_s$ and the non-sensitive subset $D_{ns}$:
$f_{mean}(D) = \frac{|D_s| \, f_{mean}(D_s) + |D_{ns}| \, f_{mean}(D_{ns})}{|D_s| + |D_{ns}|}$, where $|\cdot|$ denotes the cardinality of a data set; $f_{mean}(\cdot)$ denotes the mean query function; $f_{mean}(D_s)$ is the mean of the sensitive subset $D_s$; and $f_{mean}(D_{ns})$ is the mean of the non-sensitive subset $D_{ns}$.
Step C2: construct an $\mathcal{S}$-neighboring data set $D'$ of the query attribute data set, specifically according to Definition 3 above, and obtain the global sensitivity $\Delta f_{\mathcal{S}}$ of the mean query function:
$\Delta f_{\mathcal{S}} = \max_{D, D'} \lVert f_{mean}(D) - f_{mean}(D') \rVert_1$, where $f_{mean}(D')$ denotes the mean of the $\mathcal{S}$-neighboring data set $D'$ and $\lVert \cdot \rVert_1$ denotes the L1 norm.
Step C3: add noise following the Laplace distribution to the mean $f_{mean}(D)$ of the query attribute data set to obtain the final mean query result $\hat{f}_{mean}(D)$:
$\hat{f}_{mean}(D) = f_{mean}(D) + \mathrm{Lap}\left(\frac{\Delta f_{\mathcal{S}}}{\epsilon}\right)$, where $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution with the given scale and ε denotes the preset degree of privacy protection, ε > 0.
It can be seen that using the non-sensitive subset $D_{ns}$ improves the utility of the mean query result and keeps data distortion to a minimum, while the Laplace noise protects the privacy of the mean of the sensitive subset $D_s$. This achieves personalized privacy protection and supports effective data analysis.
In this embodiment, preferably, obtaining the median query result of the query attribute data set in step S205 with the pre-built secret-specification-based exponential mechanism, when the query function is the median query function, includes:
Step D1: sort the record values of the query attribute records $r_i^{A_j}$ in the query attribute data set $D$ in ascending order, so that the sorted data set satisfies $r_i \le r_{i+1}$.
Step D2: let the cardinality of the sorted query attribute data set be $|D| = 2m + 1$, which gives the first intermediate parameter $m = (|D| - 1)/2$, and let $O$ denote the median output space of $D$. The median query function returns a record value in $D$ as the median query result; therefore $r_i \in O$, i.e. the record values belong to the median output space $O$.
Construct an $\mathcal{S}$-neighboring data set $D'$ of the query attribute data set, specifically according to Definition 3 above. Since $D = D_s \cup D_{ns}$ and $D' = D'_s \cup D'_{ns}$ are assumed to be $\mathcal{S}$-neighboring, we have $|D_s| = |D'_s|$ and $D_{ns} = D'_{ns}$, where $D'_s$ is the sensitive subset and $D'_{ns}$ the non-sensitive subset of $D'$. It must be emphasized that, under the secret specification $\mathcal{S}$, non-sensitive records in $D$ have no $\mathcal{S}$-neighboring records, so changes to non-sensitive records are not considered.
Step D3: for any candidate output median $o \in O$, compute its score function value $s_{\mathcal{S}}(D, o)$ as follows:
$s_{\mathcal{S}}(D, o) = \max_{D^*: f_{med}(D^*) = o} \sum_{r \in D \oplus D^*} -\epsilon$, where $D^*$ denotes a data set satisfying $f_{med}(D^*) = o$ that is formed by changing the record values of any number of sensitive records in $D$, leaving non-sensitive records unchanged; $D \oplus D^*$ denotes the set of sensitive records on which $D$ and $D^*$ differ; $r$ denotes a sensitive record in $D \oplus D^*$; the maximum is attained by the $D^*$ that changes the fewest sensitive records, and the score is that number of records multiplied by $-\epsilon$; $f_{med}(\cdot)$ denotes the median query function; and ε denotes the preset degree of privacy protection, ε > 0.
Further preferably, to obtain the score function value of a candidate median quickly, the formula above was analyzed and the score can be computed directly as follows. For any record $r_i \in D$ with $r_i \in O$, there are three cases when computing $s_{\mathcal{S}}(D, r_i)$:
(1) If $i < m$, then $m - i$ sensitive records to the right of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(m - i)\epsilon$.
(2) If $i = m$, no sensitive record is changed, so $s_{\mathcal{S}}(D, r_i) = 0$.
(3) If $i > m$, then $i - m$ sensitive records to the left of $r_i$ are changed, so $s_{\mathcal{S}}(D, r_i) = -(i - m)\epsilon$.
Step D4: compute the output probability of every candidate median in the median output space $O$. The output probability of a candidate median $o$ is $\Pr[o] = \frac{\exp\left(s_{\mathcal{S}}(D, o)/2\right)}{\sum_{z \in O} \exp\left(s_{\mathcal{S}}(D, z)/2\right)}$.
Step D5: sample one candidate median from $O$ according to these output probabilities and take it as the median query result.
The utility of the census data set query protection method provided in this application is analyzed experimentally below.
Experimental setup: the experiments were run on the 2012 U.S. Census data set. Specifically, 1000 and 10000 records were randomly selected from the data set, and the mean query was evaluated on the record attribute Age. In addition, 1001 and 10001 records were randomly selected from the data set, and the median query was evaluated on the record attribute Annual Income. The parameter δ denotes the proportion of non-sensitive records in the data set (δ = 0.8 by default), and the parameter ε denotes the degree of privacy protection.
The root mean square error (RMSE) is used as the evaluation metric for data utility. Figures 5 and 6 show how the RMSE of the proposed SSLM (mean query scenario) changes with the parameters δ and ε, respectively, on 2012 U.S. Census subsets of different sizes; the data size is 1000 in sub-figures (a) of Figures 5 and 6 and 10000 in sub-figures (b). Figures 7 and 8 show how the RMSE of the proposed SSEM (median query scenario) changes with the parameters δ and ε, respectively, on 2012 U.S. Census subsets of different sizes; the data size is 1001 in sub-figures (a) of Figures 7 and 8 and 10001 in sub-figures (b).
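The RMSE metric used in these experiments can be sketched as below. The helper `evaluate` and its default of 1000 repeated trials are assumptions for illustration; the text does not state how many runs were averaged. `mechanism` stands for any zero-argument callable that returns one noisy answer, for example a wrapper around a mean or median mechanism.

```python
import math
from typing import Callable, Sequence


def rmse(true_value: float, noisy_answers: Sequence[float]) -> float:
    """Root mean square error of repeated noisy answers against the truth."""
    if not noisy_answers:
        raise ValueError("at least one answer is required")
    squared = sum((a - true_value) ** 2 for a in noisy_answers)
    return math.sqrt(squared / len(noisy_answers))


def evaluate(
    mechanism: Callable[[], float],
    true_value: float,
    trials: int = 1000,   # assumption: the number of repetitions is not given
) -> float:
    """Run the mechanism `trials` times and report the RMSE of its answers."""
    return rmse(true_value, [mechanism() for _ in range(trials)])
```

A lower RMSE means the released answers stay closer to the true query result, which is how the figures compare SSLM and SSEM against their baselines.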
Figure 5 shows how the proportion δ of non-sensitive attribute values affects the data utility of the proposed SSLM. As Figures 5(a) and 5(b) show, the RMSE of SSLM falls consistently on data sets of both sizes as δ increases from 0 to 0.9. In other words, the RMSE of SSLM gradually decreases and its advantage over the baseline method (the classical Laplace Mechanism, LM) grows. Specifically, when δ = 0, i.e. all attribute values in the data set are sensitive, SSLM is equivalent to LM and has the same RMSE. Conversely, when δ = 1, i.e. no attribute value is sensitive, the RMSE of SSLM drops to 0. Furthermore, as shown in Figures 5(a) and 5(b), at δ = 0.8 the utility of SSLM is about 6 times and 5 times that of LM on the 2012 U.S. Census subsets of size 1000 and 10000, respectively; at δ = 0.9 it is about 14 times and 10 times that of LM, respectively. This marked improvement arises because SSLM leaves non-sensitive attribute values unchanged when answering a mean query, which improves the accuracy of the mean overall.
The parameter ε is the privacy budget of users associated with sensitive record attributes. A larger ε means weaker privacy protection and therefore higher utility. As Figures 6(a) and 6(b) show, the RMSE of both LM and SSLM (with the default δ = 0.8) decreases on both dataset sizes as ε increases from 0.01 to 0.5. The main point of Figure 6 is that SSLM clearly outperforms the baseline LM for any value of ε. In addition, when ε > 0.2, the utility of SSLM relative to LM improves by about 2 times on the 2012 US Census subset of size 1000 and by about 3 times on the subset of size 10000.
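For the LM baseline, the downward trend in ε follows from the Laplace noise distribution. The worked step below is a hedged illustration, assuming a mean query over n values bounded in [a, b]; the symbols a, b, n, Δ and η are introduced here for illustration and are not taken from the source, and the noise calibration of SSLM itself is not reproduced.

```latex
\Delta = \frac{b - a}{n}, \qquad
\eta \sim \mathrm{Lap}\!\left(\frac{\Delta}{\varepsilon}\right), \qquad
\sqrt{\mathbb{E}\!\left[\eta^{2}\right]} = \sqrt{2}\,\frac{\Delta}{\varepsilon}
```

Under this assumption the expected RMSE of LM is inversely proportional to ε, so raising ε from 0.01 to 0.5 lowers it by a factor of 50.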
As with the effect of δ on mean queries, Figure 7 shows that the data utility of SSEM clearly exceeds that of the baseline method EM as δ increases from 0 to 0.9. In the boundary case δ = 0, SSEM is equivalent to EM. For median queries, Figures 7(a) and 7(b) show that the RMSE of SSEM is comparable to that of EM when δ ∈ (0, 0.5], whereas SSEM shows a clear improvement in data utility over EM when δ ∈ (0.5, 0.9]. This improvement arises because the proportion of non-sensitive record attribute values affects the output space of SSEM. Consistent with the conclusions drawn from Figures 2 and 3, when δ ∈ (0, 0.5] the output space of SSEM is the entire space (the same as the output space of EM), so the two mechanisms have comparable data utility; when δ ∈ (0.5, 0.9], the larger the proportion of non-sensitive record attributes, the smaller the output space of SSEM and the higher the data utility. In addition, as Figure 7(b) shows, at δ = 0.8 and δ = 0.9 the utility of SSEM improves by about 3 times and 6 times, respectively, compared with EM.
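The link between a smaller output space and higher utility can be illustrated with a generic exponential-mechanism median. This sketch assumes that EM denotes the exponential mechanism with a rank-based utility of sensitivity 1; how SSEM derives its restricted output space from the non-sensitive values is not reproduced here, so the candidate sets `full` and `narrow` are hypothetical, as are all function names.

```python
import math


def median_selection_probs(values, candidates, epsilon):
    """Exponential-mechanism probabilities for picking a median from `candidates`."""
    scores = []
    for c in candidates:
        below = sum(1 for v in values if v < c)
        above = sum(1 for v in values if v > c)
        # Rank-based utility: 0 at the true median, lower further away.
        scores.append(-abs(below - above) / 2.0)
    # Weights proportional to exp(epsilon * utility / 2), assuming sensitivity 1.
    # Subtracting the top score keeps the exponentials numerically stable.
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected_abs_error(values, candidates, epsilon):
    """Expected distance between the selected candidate and the true median."""
    true_median = sorted(values)[len(values) // 2]
    probs = median_selection_probs(values, candidates, epsilon)
    return sum(p * abs(c - true_median) for p, c in zip(probs, candidates))


values = list(range(1, 12))   # 11 records with true median 6
full = list(range(0, 101))    # the entire output space
narrow = list(range(4, 9))    # a hypothetical restricted output space
err_full = expected_abs_error(values, full, 0.5)
err_narrow = expected_abs_error(values, narrow, 0.5)
```

In this toy setting `err_narrow` is smaller than `err_full`, because the restricted candidate set removes the many distant outputs that would otherwise absorb probability mass.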
In Figure 8, the effect of the parameter ε on median queries is similar to its effect on mean queries shown in Figure 6. As expected, the RMSE of both SSEM and EM decreases on both dataset sizes as ε increases. Notably, in Figures 8(a) and 8(b), SSEM shows a significant utility improvement over EM. Specifically, when ε > 0.2, as shown in Figure 8(a), SSEM achieves about a 2 times utility improvement over EM on the 2012 US Census data subset of size 1001.
It can be seen that the data query protection method provided by this application has the following technical effects:
1. We introduce a new privacy definition, SSDP, which gives individuals more control over their private information and ensures that only data that users mark as sensitive receives privacy protection.
2. By allowing individuals to independently define secret specifications for their own data, SSDP achieves personalized privacy protection and facilitates effective data analysis.
3. We provide a specific SSDP mechanism for mean queries that improves the accuracy of data analysis while minimizing data distortion, particularly when a large portion of the data is non-sensitive, and that better explores the trade-off between privacy and utility.
4. We evaluate the performance of SSLM and SSEM through comparative experiments on real datasets. The experimental results show that, compared with state-of-the-art DP mechanisms, SSLM improves utility by about 14 times for mean queries by exploiting non-sensitive data, and SSEM improves utility by about 6 times for median queries by exploiting non-sensitive data.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention, and that the scope of the present invention is defined by the claims and their equivalents.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311416556.4A CN117235800B (en) | 2023-10-27 | 2023-10-27 | Data query protection method of personalized privacy protection mechanism based on secret specification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117235800A (en) | 2023-12-15 |
CN117235800B (en) | 2024-05-28 |
Family
ID=89082737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311416556.4A Active CN117235800B (en) | 2023-10-27 | 2023-10-27 | Data query protection method of personalized privacy protection mechanism based on secret specification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117235800B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118072928A (en) * | 2024-04-18 | 2024-05-24 | 中南大学 | A medical data integration system based on data warehouse |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090055657A1 (en) * | 2005-03-25 | 2009-02-26 | Rieko Asai | Program Converting Device, Secure Processing Device, Computer Program, and Recording Medium |
CN110704491A (en) * | 2019-09-30 | 2020-01-17 | 京东城市(北京)数字科技有限公司 | Data query method and device |
CN111177213A (en) * | 2019-12-16 | 2020-05-19 | 北京淇瑀信息科技有限公司 | Privacy cluster self-service query platform and method and electronic equipment |
CN114328640A (en) * | 2021-02-07 | 2022-04-12 | 湖南科技学院 | A method and system for differential privacy protection and data mining based on dynamic sensitive data of mobile users |
US20220277097A1 (en) * | 2019-06-12 | 2022-09-01 | Privitar Limited | Method or system for querying a sensitive dataset |
CN116541883A (en) * | 2023-05-10 | 2023-08-04 | 重庆大学 | Trust-based differential privacy protection method, device, equipment and storage medium |
CN116611101A (en) * | 2023-03-03 | 2023-08-18 | 广州大学 | Differential privacy track data protection method based on interactive query |
Non-Patent Citations (3)
Title |
---|
HU, CQ (HU, CHUNQIANG) 等: "A Federated Recommendation System Based on Local Differential Privacy Clustering", 《IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE AND COMPUTING, ADVANCED AND TRUSTED COMPUTING, SCALABLE COMPUTING AND COMMUNICATIONS, INTERNET OF PEOPLE, AND SMART CITY INNOVATIONS (SMARTWORLD/SCALCOM/UIC/ATC/IOP/SCI)》, 1 January 2021 (2021-01-01), pages 364 - 369 * |
张文静;李晖;: "差分隐私保护下的数据分级发布机制", 网络与信息安全学报, no. 01, 15 December 2015 (2015-12-15), pages 62 - 69 * |
胡春强: "秘密共享理论及相关应用研究", 《中国博士学位论文电子期刊网》, 15 February 2014 (2014-02-15), pages 136 - 16 * |
Also Published As
Publication number | Publication date |
---|---|
CN117235800B (en) | 2024-05-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||