CN112948881A - Method for calculating information leakage probability in open scene - Google Patents

Info

Publication number
CN112948881A
Authority
CN
China
Prior art keywords
column
information leakage
calculating
clustering
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110282667.5A
Other languages
Chinese (zh)
Inventor
李辉
龚政
赵柯纯
史静文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110282667.5A
Publication of CN112948881A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a method for calculating the information leakage probability in an open scene, comprising the following steps: 1) constructing a public data pool; 2) performing column clustering on the data in the public data pool; 3) for each column cluster obtained in step 2), calculating the probability that an attacker successfully obtains it, thereby completing the calculation of the information leakage probability in the open scene.

Description

Method for calculating information leakage probability in open scene
Technical Field
The invention belongs to the technical field of data security, and relates to a method for calculating information leakage probability in an open scene.
Background
In the current big data era, fully mining data offers great value and broad application prospects, for example in decision making, human relationship mining, and intelligent information recommendation. However, in these promising application scenarios, the use of big data inevitably raises many privacy concerns. Many attacks that reveal personal privacy have already been carried out on open data sets originally released for research, with serious consequences. Combined with network analysis, data mining, and other techniques, these attacks can infer the identity behind some records from certain background knowledge; this is the notorious "record linkage attack". More specifically, in a record linkage attack on multi-structured data, an attacker can combine personal identification attributes that directly represent a user's identity (e.g., identification number and bank card number), or quasi-identifiers that do not directly expose it (e.g., birthday, sex, and age), with his or her background knowledge to identify specific users in the database, thereby revealing their private information.
To address this privacy problem, previous research proposed a database attribute sensitivity grading method based on attack probability. That method grades the sensitivity of all attributes in a database according to the attacker's probability of a successful attack, gives data users a sensitivity reference for each attribute, and lays the groundwork for subsequent data desensitization. However, it requires the database manager or a risk assessment expert to estimate in advance the probability that an attacker acquires certain columns, based on how widely certain attributes in the database have been published and on personal experience. Such an empirical estimate cannot measure the acquisition probability of the columns objectively and accurately, so the resulting quantitative sensitivity grading is biased, which in turn affects subsequent data desensitization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for calculating the information leakage probability in an open scene, which accurately calculates the information leakage probability.
In order to achieve the above purpose, the method for calculating the information leakage probability in the open scene includes the following steps:
1) constructing a public data pool;
2) performing column clustering on data in the public data pool;
3) for each column cluster obtained in step 2), calculating the probability that an attacker successfully obtains it, thereby completing the calculation of the information leakage probability in the open scene.
The specific operation of step 1) is as follows:
extracting data from public data sources, collecting the data in a specified data set, and establishing the public data pool.
The specific operation of step 2) is as follows:
2a) dividing all data in the public data pool by attribute to establish a column set;
2b) clustering all column name vectors, using cosine similarity as the distance measure, to obtain the clustering result of each column.
Step 2b) further comprises: converting the Chinese column name of each column into a column name vector using a word embedding tool.
In step 2b), the K-means clustering method is used for clustering.
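As a minimal sketch of step 2a), assume each database in the public data pool is described by a name, its column names, and its row count. The `Database` class, the `build_column_set` helper, and the sample values below are hypothetical and are not part of the patent; they only illustrate how a column set could be assembled while keeping the row count needed later for the probability calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    """Hypothetical description of one database in the public data pool."""
    name: str
    columns: tuple
    row_count: int


def build_column_set(pool):
    """Step 2a: list every (database, column name, row count) triple in the pool."""
    return [
        (db.name, column, db.row_count)
        for db in pool
        for column in db.columns
    ]


# Illustrative pool; the names and sizes are made up for this sketch.
pool = [
    Database("db1", ("student name", "age"), 100),
    Database("db2", ("name", "telephone"), 250),
]
column_set = build_column_set(pool)
for entry in column_set:
    print(entry)
```

Keeping the row count next to each column name is a design choice of this sketch: once the column names are clustered, the rows of each cluster can be summed without going back to the pool.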
Let the public data pool contain N databases, so that the data volume of the public data pool is $R = r_1 + r_2 + \dots + r_N$. After column clustering, category a contains M columns, and the total number of corresponding database rows is $R_a = r_1 + r_2 + \dots + r_M$.
The probability that the attribute column of this category is obtained by the attacker is
$P_a = \dfrac{R_a}{R}$
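The ratio described above can be sketched as a small function. This is an illustration only: the formula image is not available in this text, so the sketch assumes the probability is the cluster's row total divided by the pool's row total, which is consistent with the worked figures in Example one. The function name `leakage_probability` and the sample numbers are hypothetical.

```python
def leakage_probability(pool_row_counts, cluster_row_counts):
    """Return R_a / R: rows covered by one column cluster over all rows in the pool.

    pool_row_counts: row counts r_1..r_N of all databases in the pool.
    cluster_row_counts: row counts of the databases whose columns fall in the cluster.
    """
    total_rows = sum(pool_row_counts)
    if total_rows <= 0:
        raise ValueError("the public data pool must contain at least one row")
    cluster_rows = sum(cluster_row_counts)
    if cluster_rows > total_rows:
        raise ValueError("a cluster cannot cover more rows than the pool holds")
    return cluster_rows / total_rows


# Hypothetical pool of three databases; the cluster appears in the first two.
probability = leakage_probability([100, 250, 50], [100, 250])
print(probability)  # 350 / 400 = 0.875
```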
The invention has the following beneficial effects:
The method for calculating the information leakage probability in an open scene starts from the attacker's perspective: it performs column clustering on the data and calculates the probability that each column cluster is successfully obtained by the attacker, which is the information leakage probability in the open scene. This avoids the inaccuracy of estimates based on manual experience and quantifies the probability that an attacker acquires a given attribute. The result can be used directly in the subsequent quantification and ranking of attribute sensitivity, giving data managers an accurate reference based on objective facts and laying a good foundation for further data desensitization and data publishing.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawing.
Referring to FIG. 1, the method for calculating the information leakage probability in an open scene according to the present invention includes the following steps:
1) constructing a public data pool;
Specifically, data is extracted from public data sources and collected in a specified data set to establish the public data pool.
2) Performing column clustering on data in the public data pool;
The specific operation of step 2) is as follows:
2a) dividing all data in the public data pool by attribute to establish a column set;
2b) converting the Chinese column name of each column into a column name vector using a word embedding tool, and clustering all column name vectors with the K-means clustering method, using cosine similarity as the distance measure, to obtain the clustering result of each column.
For two vectors X and Y, the cosine similarity cos θ is:
$\cos\theta = \dfrac{X \cdot Y}{\|X\|\,\|Y\|} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
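Step 2b) can be illustrated with a short, dependency-free sketch. It is not the patent's implementation: the embedding step is replaced by hypothetical two-dimensional toy vectors, and `cosine_kmeans` is an assumed, simplified K-means variant that normalises vectors to unit length and assigns each one to the centroid with the highest cosine similarity.

```python
import math
import random


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| * |y|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def normalize(x):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [a / norm for a in x]


def cosine_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors into k groups, using cosine similarity for assignment."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)  # initial centroids: k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: most similar centroid = smallest cosine distance.
        labels = [
            max(range(k), key=lambda c: cosine_similarity(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members, scaled back to unit length.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # keep the old centroid if a cluster is empty
            mean = [sum(dim) / len(members) for dim in zip(*members)]
            if any(m != 0.0 for m in mean):
                centroids[c] = normalize(mean)
    return labels


# Hypothetical column name vectors: two pointing one way, two another.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = cosine_kmeans(toy_vectors, k=2)
print(labels)
```

In practice the vectors would come from the word embedding tool, and the number of clusters k has to be chosen for the data at hand; neither choice is specified by this sketch.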
3) for each column cluster obtained in step 2), calculating the probability that an attacker successfully obtains it, thereby completing the calculation of the information leakage probability in the open scene.
Let the public data pool contain N databases, so that the data volume of the public data pool is $R = r_1 + r_2 + \dots + r_N$. After column clustering, category a contains M columns, and the total number of corresponding database rows is $R_a = r_1 + r_2 + \dots + r_M$.
The probability that the attribute column of this category is obtained by the attacker is
$P_a = \dfrac{R_a}{R}$
Example one
In this embodiment, the structures of the different databases are shown in Table 1, Table 2, Table 3, Table 4 and Table 5;
TABLE 1 [table image not reproduced]
TABLE 2 [table image not reproduced]
TABLE 3 [table image not reproduced]
TABLE 4 [table image not reproduced]
TABLE 5 [table image not reproduced]
The specific calculation process using the invention is as follows:
1) establishing a data pool that stores the various published student databases;
2) collecting all columns that appear into one set. In this embodiment, the column set is: { student name, student gender, age, home address, contact telephone, dormitory number, fax, height, weight, zip code, average score, year of birth, ethnicity, political aspect, name, gender, age, address, telephone, history of record, telephone number, postal code, birthday, 1500 m score, number of pull-ups, BMI };
3) in this embodiment, an AINLP module is used to convert each column name in the set into a vector with a BERT model, and the cosine similarity between two column names is calculated as their similarity. For example: d(age, age) = 0.8448, d(name, name) = 0.7194, d(postal code, zip code) = 0.9242, d(address, telephone) = 0.5331 and d(weight, average score) = 0.3648. Clearly, the more closely two column names refer to the same attribute, the higher their similarity, and vice versa;
4) clustering the column names in the set using the K-means method, with the following clustering result: { student name, name }, { student gender, gender }, { age, age }, { home address, address }, { telephone, contact telephone, telephone number }, { dormitory number }, { fax }, { height }, { weight }, { zip code, postal code }, { average score }, { birthday, year of birth }, { ethnicity }, { political aspect }, { history of record }, { 1500 m score }, { number of pull-ups }, and { BMI };
5) in this embodiment, there are no columns with ambiguous names, no unnamed columns, and no incorrect clustering results, so no manual adjustment is needed;
6) calculating the probability that the attacker obtains each column according to the number of rows.
For example, for the { student name, name } class, the probability of being acquired by an attacker is calculated as follows. Because the "name" column appears in three databases, their row counts must be accumulated together with that of the "student name" column:
$R_{\text{name}} = r_{\text{student name}} + r_{\text{name 1}} + r_{\text{name 2}} + r_{\text{name 3}} = 7400 + 4500 + 3800 + 6000 = 21700$
$R = r_1 + r_2 + r_3 + r_4 + r_5 = 7400 + 4500 + 2500 + 3800 + 6000 = 24200$
$P_{\text{name}} = \dfrac{R_{\text{name}}}{R} = \dfrac{21700}{24200} \approx 0.897$
Namely: the probability of the attacker obtaining the name information is 0.897;
Similarly, for the { number of pull-ups } class, the probability of being acquired by an attacker is:
$R_{\text{pull-ups}} = r_{\text{pull-ups}} = 6000$
$P_{\text{pull-ups}} = \dfrac{R_{\text{pull-ups}}}{R} = \dfrac{6000}{24200} \approx 0.248$
I.e., the probability of an attacker acquiring student pull-up information is 0.248.
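The arithmetic of this example can be checked with a few lines of Python. The row counts are those stated in the text above; the variable names and the dictionary layout are illustrative assumptions, and the probability is taken as the cluster's row total divided by the pool's row total.

```python
# Row counts of the five databases, as stated in this embodiment.
database_rows = [7400, 4500, 2500, 3800, 6000]
total_rows = sum(database_rows)  # R = 24200

# Row counts of the databases that contain each clustered column, as stated above.
cluster_rows = {
    "{student name, name}": [7400, 4500, 3800, 6000],
    "{number of pull-ups}": [6000],
}

# Probability of each cluster being acquired: cluster rows / total rows.
probabilities = {
    cluster: sum(rows) / total_rows for cluster, rows in cluster_rows.items()
}

for cluster, probability in probabilities.items():
    print(f"{cluster}: {probability:.3f}")
# {student name, name}: 0.897
# {number of pull-ups}: 0.248
```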

Claims (6)

1. A method for calculating information leakage probability in an open scene is characterized by comprising the following steps:
1) constructing a public data pool;
2) performing column clustering on data in the public data pool;
3) for each column cluster obtained in step 2), calculating the probability that an attacker successfully obtains it, thereby completing the calculation of the information leakage probability in the open scene.
2. The method for calculating the information leakage probability in the open scene according to claim 1, wherein the specific operation of step 1) is as follows:
extracting data from public data sources, collecting the data in a specified data set, and establishing the public data pool.
3. The method for calculating the information leakage probability in the open scene according to claim 1, wherein the specific operation of step 2) is as follows:
2a) dividing all data in the public data pool by attribute to establish a column set;
2b) clustering all column name vectors, using cosine similarity as the distance measure, to obtain the clustering result of each column.
4. The method for calculating the information leakage probability in the open scene according to claim 3, wherein step 2b) further comprises: converting the Chinese column name of each column into a column name vector using a word embedding tool.
5. The method for calculating the information leakage probability in the open scene according to claim 1, wherein in step 2b), the K-means clustering method is used for clustering.
6. The method for calculating the information leakage probability in the open scene according to claim 1, wherein the public data pool contains N databases, so that the data volume of the public data pool is $R = r_1 + r_2 + \dots + r_N$; after column clustering, category a contains M columns, and the total number of corresponding database rows is $R_a = r_1 + r_2 + \dots + r_M$.
The probability that the attribute column of this category is obtained by the attacker is
$P_a = \dfrac{R_a}{R}$
CN202110282667.5A 2021-03-16 2021-03-16 Method for calculating information leakage probability in open scene Pending CN112948881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110282667.5A CN112948881A (en) 2021-03-16 2021-03-16 Method for calculating information leakage probability in open scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110282667.5A CN112948881A (en) 2021-03-16 2021-03-16 Method for calculating information leakage probability in open scene

Publications (1)

Publication Number Publication Date
CN112948881A (en) 2021-06-11

Family

ID=76230212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282667.5A Pending CN112948881A (en) 2021-03-16 2021-03-16 Method for calculating information leakage probability in open scene

Country Status (1)

Country Link
CN (1) CN112948881A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130318615A1 (en) * 2012-05-23 2013-11-28 International Business Machines Corporation Predicting attacks based on probabilistic game-theory
CN104184742A (en) * 2014-09-09 2014-12-03 西安电子科技大学 Personalized dual hiding method based on location-based service privacy protection
CN106940777A (en) * 2017-02-16 2017-07-11 湖南宸瀚信息科技有限责任公司 A kind of identity information method for secret protection measured based on sensitive information
CN111191291A (en) * 2020-01-04 2020-05-22 西安电子科技大学 Database attribute sensitivity quantification method based on attack probability
CN111737750A (en) * 2020-06-30 2020-10-02 绿盟科技集团股份有限公司 Data processing method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GONG Z等: "《Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data》", 16 October 2020 *
张付霞等: "面向分类型敏感属性的分级匿名算法", 《计算机应用研究》 *
桂琼等: "基于聚类的分级匿名方法", 《计算机应用》 *
武文博等: "基于攻击图的信息物理系统信息安全风险评估方法", 《计算机应用》 *
陈炜等: "抵抗背景知识攻击的电子病历隐私保护新算法", 《计算机工程》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination