CN111814187A - Big data desensitization method - Google Patents
- Publication number
- CN111814187A; CN202010675130.0A
- Authority
- CN
- China
- Prior art keywords
- data
- column
- transformation
- matrix
- random
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/58—Random or pseudo-random number generators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Abstract
The invention discloses a big data desensitization method for desensitizing specified data in a multi-dimensional fact table. The method comprises an initialization step, in which the specified data in the multi-dimensional fact table is read and arranged into a data matrix, each column of which corresponds to one dimension; this data matrix is the original data matrix. It further comprises a spatial transformation step, in which the specified data of each dimension is transformed by columns using a stretching transformation, shrinking transformation or warping transformation to obtain a transformed data matrix. After normalization processing, each value in the transformed data matrix differs from the corresponding value in the original data matrix by less than 5%. The method desensitizes sensitive data by spatial transformation, preserves the relative spatial position information of the desensitized data, and keeps the data loss caused by the spatial transformation below 5%. The method can also be applied to a distributed framework to meet the big data processing requirements of distributed systems.
Description
Technical Field
The invention relates to the field of big data, and in particular to data security technology for big data.
Background
Data processing is becoming an important piece of infrastructure, and data security, especially the security of sensitive data, is of particular importance to it. Desensitization of sensitive data is likewise a basic facility. In the financial field, the prior art mainly desensitizes data by random value replacement or special character replacement. The former replaces data with random values (letters become random letters, and digits become random digits); the latter replaces data with special characters (such as an asterisk).
Such desensitization is appropriate for data that is merely indicative and carries no specific meaning, such as names, mobile phone numbers, and card numbers; this indicative information has no material significance for data mining or data analysis.
As informatization and digitization deepen in the financial industry, data mining and data analysis of financial data are becoming increasingly important and play a growing role in risk control, risk early warning, customer identification, and benefit improvement. Data mining and data analysis require asset data, behavior data, customer profiles, and other data. These data also involve customer privacy and must be desensitized before use. Such data carries meaning, and traditional random value replacement or special character replacement changes the data so that its meaning is partially or completely lost, making subsequent data mining and data analysis impossible. In addition, to achieve better results in risk control, risk early warning, customer identification, and benefit improvement, it is desirable to share data among more financial institutions, since analyzing the big data held by multiple institutions yields more accurate results. Data sharing and exchange place higher requirements on data desensitization. On the one hand, the desensitized data must differ markedly from the original data, and it must not be possible to restore or locate the original data from the desensitized data, so that the original data is protected from attack. On the other hand, the desensitized data should retain as much of the meaning and information of the original data as possible, so that subsequent data mining and data analysis can proceed with the expected accuracy.
Disclosure of Invention
The invention aims to provide a big data desensitization method that performs micro-loss desensitization on sensitive data and can be executed on a distributed framework.
According to an embodiment of the invention, a big data desensitization method is provided for desensitizing specified data in a multi-dimensional fact table, and the method comprises the following steps:
an initialization step, namely reading specified data in a multi-dimensional fact table and arranging the specified data into a data matrix, wherein each column in the data matrix corresponds to a dimension, and the data matrix is an original data matrix;
a spatial transformation step of transforming the specified data of each dimension by columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation to obtain a transformed data matrix;
after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
In one embodiment, the stretch transformation comprises: generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column. The contraction transformation comprises: generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
In one embodiment, the warping transformation comprises: generating a Sigmoid function; generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and applying the Sigmoid function to the column of specified data, wherein the result of the operation is combined with the corresponding random additional coefficients to obtain a warp-transformed column.
In one embodiment, a transformation matrix is generated according to the transformation of the specified data of each dimension, each column in the transformation matrix corresponds to one dimension of the specified data, and the transformation matrix is multiplied by the original data matrix to obtain a transformed data matrix.
In one embodiment, the method further comprises: the remaining data in the multi-dimensional fact table, except for the specified data, is encrypted.
According to an embodiment of the invention, a distributed big data desensitization method is provided, which desensitizes specified data in a multi-dimensional fact table under a distributed framework, and comprises the following steps:
a mapping step, wherein a mapper reads data in the multi-dimensional fact table and arranges the data into a data matrix, each column in the data matrix corresponds to one dimension, each dimension uses an independent mapper, and the data matrix is an unscreened data matrix;
a screening step, namely screening the unscreened data matrix, selecting the columns in which the specified data is located to form an original data matrix, and forming an auxiliary data matrix from the remaining data other than the specified data;
a spatial transformation step, namely transforming the specified data of each dimension according to columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation, the transformation of each dimension is processed by using an independent reducer, and a transformed data matrix is obtained after reduction, wherein after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%;
and a data merging step, namely encrypting the data in the auxiliary data matrix, and merging the encrypted auxiliary data matrix and the transformed data matrix to generate a desensitized data matrix.
In one embodiment, the stretch transformation comprises: generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column. The contraction transformation comprises: generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
In one embodiment, the warping transformation comprises: generating a Sigmoid function; generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution; and applying the Sigmoid function to the column of specified data, wherein the result of the operation is combined with the corresponding random additional coefficients to obtain a warp-transformed column.
In one embodiment, the reducer generates one column of the transformation matrix from the transformation of the specified data of each dimension, and multiplies that column by the corresponding column of specified data in the original data matrix to obtain the corresponding column of the transformed data matrix.
The big data desensitization method of the invention desensitizes sensitive data by spatial transformation and preserves the relative spatial position information of the desensitized data. From the viewpoint of data mining and data analysis, the data loss caused by the spatial transformation is less than 5%, so the method does not affect subsequent data processing. The method can also be applied to a distributed framework to meet the big data processing requirements of distributed systems.
Drawings
FIG. 1 discloses a flow diagram of a big data desensitization method according to an embodiment of the invention.
FIG. 2 discloses a flow diagram of a big data desensitization method according to another embodiment of the invention.
Detailed Description
The invention provides a big data desensitization method for desensitizing specified data in a multi-dimensional fact table, and a flow chart of the big data desensitization method according to an embodiment of the invention is disclosed in fig. 1. As shown, the method includes:
S101, an initialization step: reading the specified data in the multi-dimensional fact table and arranging it into a data matrix, wherein each column in the data matrix corresponds to one dimension, and the data matrix is the original data matrix.
In the financial industry, customer data is usually stored in a fact table. Some data in the fact table identifies the customer, such as name, card number, address, and mobile phone number; some data records the customer's asset information, such as total assets, RMB assets, foreign currency assets, financial products, credit cards, guarantee funds, and bonds; and some data records customer behavior, such as transfer records, financing purchase records, and merchandise transaction records. The identification data is generally not used in subsequent data analysis, so it can be encrypted or desensitized in the traditional way; even if the information it carries is lost during encryption or desensitization, the result of data analysis is essentially unaffected. The asset information data and the behavior information data are the core data for data analysis and data mining, and the meaning and information of the original data must be preserved after desensitization to guarantee the validity and accuracy of data analysis. In the invention, data such as asset information data and behavior information data are called specified data, and the specified data in the fact table is desensitized using the big data desensitization method provided by the invention.
TABLE 1
Card number | Total amount of financing | Credit card bill | Total amount of national debt |
161340511092313455 | 195223.24 | 1642.09 | 50000.00 |
161421686113451004 | 20000.00 | 10821.45 | 20000.00 |
161731799840912249 | 85000.24 | 411.16 | 50000.00 |
Table 1 shows an example of a fact table in which customer data is recorded. Table 1 is a two-dimensional table, the most widely used way of recording customer data. In it, each row holds the data of one customer and each column represents one category. In the present invention, each category is a dimension, and a dimension comprises the data of all customers under that category. The dimension "card number" in Table 1 is identification data, and "total amount of financing", "credit card bill", and "total amount of national debt" are asset information data. Behavior information data is not listed in Table 1. The data of the three dimensions "total amount of financing", "credit card bill", and "total amount of national debt" are the specified data.
TABLE 2
195223.24 | 1642.09 | 50000.00 |
20000.00 | 10821.45 | 20000.00 |
85000.24 | 411.16 | 50000.00 |
In step S101, the specified data in the multidimensional fact table (table 1) is read and arranged into a data matrix (table 2), each column in the data matrix corresponds to a dimension, and the data matrix (table 2) is an original data matrix.
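Step S101 can be illustrated with a short sketch. The following Python fragment is a minimal illustration rather than the patent's implementation: it takes the rows of Table 1 and keeps only the three specified columns to form the original data matrix of Table 2. The names `fact_table`, `SPECIFIED_COLUMNS`, and `build_original_matrix` are hypothetical.

```python
# Rows of Table 1: (card number, financing, credit card bill, national debt).
fact_table = [
    ("161340511092313455", 195223.24, 1642.09, 50000.00),
    ("161421686113451004", 20000.00, 10821.45, 20000.00),
    ("161731799840912249", 85000.24, 411.16, 50000.00),
]

# Column indices of the specified data (assumed layout: column 0 is the card number).
SPECIFIED_COLUMNS = (1, 2, 3)


def build_original_matrix(rows, columns=SPECIFIED_COLUMNS):
    """Return the specified columns of each row, one inner list per customer."""
    return [[row[c] for c in columns] for row in rows]


origin = build_original_matrix(fact_table)
```

Each inner list of `origin` is one customer, and each position within it is one dimension, so the columns of the matrix line up with the dimensions as the step requires.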
S102, a spatial transformation step: transforming the specified data of each dimension by columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation, to obtain a transformed data matrix. After normalization processing, the difference between each value in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
In one embodiment of the present invention,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution;
and multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column.
The contraction transformation includes:
generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution;
and multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
The warping transformation includes:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data in the corresponding column and the coefficients follow a normal distribution;
and applying the Sigmoid function to the column of specified data, wherein the result of the operation is combined with the corresponding random additional coefficients to obtain a warp-transformed column.
The primary means of desensitizing specified data in the present invention is spatial transformation. Stretching transformation, shrinking transformation or warping transformation hides the original values and replaces them with new values, while the relative spatial positions among the new values remain unchanged or change only slightly. The results of subsequent data mining and data analysis are therefore essentially unaffected: the results of analyzing the desensitized data and the original data remain essentially consistent, or the degree of difference is kept within 5%.
For example, the original data matrix shown in Table 2 is subjected to stretch transformation. First, a column of random amplification coefficients is generated; the number of coefficients equals the number of specified data in the corresponding column, and the coefficients follow a normal distribution. The column of specified data is then multiplied by the column of random amplification coefficients to obtain a stretch-transformed column.
An example implementation of the stretch transformation works as follows:
The to_random() function generates a column of random amplification coefficients, the number of which equals the number of specified data in the corresponding column. The random numbers generated by to_random() follow a normal distribution, so the column of random amplification coefficients satisfies the normal distribution.
fin refers to the dimension "total amount of financing", credit refers to the dimension "credit card bill", and bond refers to the dimension "total amount of national debt".
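A minimal sketch of a stretch transformation consistent with this description is given below; it is not the patent's listing. It assumes that one amplification factor per column is drawn from a normal distribution (hypothetical parameters: mean 4, standard deviation 1), redrawn until it exceeds 1, and repeated down the column so that the number of coefficients equals the number of values. That assumption matches Table 3, where every entry of a column is scaled by the same factor (about 3.638 for the first column). `stretch_column` is a hypothetical helper.

```python
import random


def to_random(n, mean=4.0, std=1.0, rng=random):
    """Return a column of n amplification coefficients.

    Assumption: a single factor is drawn from a normal distribution
    (redrawn until it is greater than 1) and repeated n times.
    """
    factor = rng.normalvariate(mean, std)
    while factor <= 1.0:
        factor = rng.normalvariate(mean, std)
    return [factor] * n


def stretch_column(values, rng=random):
    """Multiply each value by the matching amplification coefficient."""
    coefficients = to_random(len(values), rng=rng)
    return [v * c for v, c in zip(values, coefficients)]


fin = [195223.24, 20000.00, 85000.24]  # dimension "total amount of financing"
rng = random.Random(0)                 # seeded only to make the sketch repeatable
fin_stretched = stretch_column(fin, rng=rng)
```

A contraction transformation would follow the same pattern with a factor between 0 and 1.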
Multiplying a column of specified data by the random amplification coefficients corresponding to that column yields the stretch-transformed column. After multiplication, the data in Table 3 are obtained:
TABLE 3
710264.3341476232 | 6455.808978154993 | 258916.91251466938 |
72764.32192679757 | 42544.083495213636 | 103566.76500586775 |
309249.2413607528 | 1616.458549445041 | 258916.91251466938 |
The data matrix of table 3 is a transformed data matrix.
The relative spatial position relationships of the data in the original data matrix and in the transformed data matrix are then verified. The verification normalizes each data matrix and then calculates the Euclidean distances between the data points in it. The Euclidean distance is calculated by the function euclidean_distance(coords1, coords2).
The normalization uses the MinMaxScaler() function. The following example code is executed on the original data matrix:
from sklearn import preprocessing
import numpy as np
min_max_scaler = preprocessing.MinMaxScaler()
result_origin = min_max_scaler.fit_transform(np.array(origin))
This yields the normalized verification data matrix shown in Table 4:
TABLE 4
1 | 0.11824166 | 1 |
0 | 1 | 0 |
0.37095673 | 0 | 1 |
The following example code is executed on the transformed data matrix:
min_max_scaler = preprocessing.MinMaxScaler()
result_transformed = min_max_scaler.fit_transform(np.array(transformed))
This yields the normalized verification data matrix shown in Table 5:
TABLE 5
1 | 0.11824166 | 1 |
0 | 1 | 0 |
0.37095673 | 0 | 1 |
Table 4 and Table 5 agree to 8 decimal places, which indicates that the relative spatial position relationships of the data in the original data matrix and in the transformed data matrix are essentially unchanged; that is, the Euclidean distances between the normalized data points are the same, and the two matrices can be regarded as completely consistent.
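The verification can be sketched in plain Python. The fragment below is an illustration under stated assumptions: `min_max_normalize` mimics column-wise min-max scaling (the role MinMaxScaler plays above), `euclidean_distance` follows the signature named in the text, `pairwise_distances` is a hypothetical helper, and the stretch factors are rounded from the ratios between Table 3 and Table 2 (about 3.638, 3.931, and 5.178).

```python
import math


def min_max_normalize(matrix):
    """Scale every column of a row-major matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]


def euclidean_distance(coords1, coords2):
    """Euclidean distance between two data points (rows)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coords1, coords2)))


def pairwise_distances(matrix):
    """Distances between every pair of rows, in a fixed order."""
    n = len(matrix)
    return [euclidean_distance(matrix[i], matrix[j])
            for i in range(n) for j in range(i + 1, n)]


origin = [[195223.24, 1642.09, 50000.00],
          [20000.00, 10821.45, 20000.00],
          [85000.24, 411.16, 50000.00]]
factors = [3.638, 3.931, 5.178]  # approximate per-column factors (assumption)
transformed = [[v * f for v, f in zip(row, factors)] for row in origin]

d_origin = pairwise_distances(min_max_normalize(origin))
d_transformed = pairwise_distances(min_max_normalize(transformed))
```

With one positive factor per column, `d_origin` and `d_transformed` agree up to floating-point rounding, which is the property the comparison of Table 4 and Table 5 demonstrates.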
The contraction transformation is computed in the same way as the stretch transformation and is not described again here. The stretching transformation and the shrinking transformation are linear transformations, so the relative spatial position relationships of the data in the original data matrix and in the transformed data matrix remain essentially unchanged, and the transformations can be regarded as lossless at the precision required for data analysis.
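The invariance can be checked directly. Assuming min-max normalization is applied column by column and the whole column is multiplied by a single coefficient $a > 0$ (an assumption consistent with Table 4 and Table 5 being identical), the normalized value of entry $x_i$ is unchanged:

```latex
\frac{a x_i - \min_j (a x_j)}{\max_j (a x_j) - \min_j (a x_j)}
  = \frac{a \left( x_i - \min_j x_j \right)}{a \left( \max_j x_j - \min_j x_j \right)}
  = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j}.
```

If the coefficients differed from entry to entry, the factor $a$ would no longer cancel, and the normalized values would in general change.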
The warping transformation is again illustrated with the original data matrix shown in Table 2. The warping transformation first generates a Sigmoid function. Then a column of random additional coefficients is generated; the number of coefficients equals the number of specified data in the corresponding column, and the coefficients follow a normal distribution. The Sigmoid function is then applied to the column of specified data, and the result is combined with the corresponding random additional coefficients to obtain a warp-transformed column.
In one embodiment, the Sigmoid function is sigmoid(x), the S-shaped growth curve function. It is defined using the following code:
def sigmoid(x):
    # S-shaped growth curve; maps any real x into (0, 1)
    return 1. / (1 + np.exp(-x))
In one embodiment, the random additional coefficients are generated with random.normalvariate(), using the following code:
def to_random_within(i):
    # normally distributed value with mean i and standard deviation 1
    return random.normalvariate(i, 1)
The random additional coefficients generated by random.normalvariate() follow a normal distribution.
The Sigmoid function is applied to the column of specified data, and the result is combined with the corresponding random additional coefficients to obtain a warp-transformed column. An example implementation that produces the warped data matrix works as follows:
The to_random_within function generates a column of random additional coefficients, the number of which equals the number of specified data in the corresponding column. The random numbers generated by to_random_within follow a normal distribution, so the column of random additional coefficients satisfies the normal distribution.
fin refers to the dimension "total amount of financing", credit refers to the dimension "credit card bill", and bond refers to the dimension "total amount of national debt".
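A minimal sketch of a warping transformation consistent with this description follows; it is not the patent's listing. It assumes each column is first min-max normalized, then passed through sigmoid, and that one random additional coefficient per column, drawn with `random.normalvariate`, is added to every entry. The values in Table 6 are consistent with this reading: in the first and third columns, the entries for the largest and smallest original values differ by sigmoid(1) - sigmoid(0), about 0.2311. The helper `warp_column` and the `rng` parameter are hypothetical.

```python
import math
import random


def sigmoid(x):
    """S-shaped growth curve, as defined in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def to_random_within(i, rng=random):
    """Normally distributed random additional coefficient with mean i."""
    return rng.normalvariate(i, 1)


def warp_column(values, rng=random):
    """Warp one column: normalize, apply sigmoid, add a random offset.

    Assumptions: min-max normalization precedes the sigmoid, and a single
    offset is shared by the whole column.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant column
    offset = to_random_within(0, rng=rng)
    return [sigmoid((v - lo) / span) + offset for v in values]


fin = [195223.24, 20000.00, 85000.24]  # dimension "total amount of financing"
warped = warp_column(fin, rng=random.Random(0))

# Min-max normalize the warped column to compare it with the original.
w_lo, w_hi = min(warped), max(warped)
warped_normalized = [(w - w_lo) / (w_hi - w_lo) for w in warped]
```

Under these assumptions the normalized first column comes out as approximately 1, 0, and 0.3968, whatever offset is drawn, because the shared offset cancels in the normalization.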
The Sigmoid function is applied to a column of specified data, and the result is added to the corresponding random additional coefficient to obtain a warp-transformed column. After each column has been calculated, the warped data matrix is obtained, as shown in Table 6:
TABLE 6
2.9818167618925275 | -0.7316915640653386 | 1.9027560487948123 |
2.7507581832625227 | -0.5301590086516604 | 1.6716974701648073 |
2.8424483209688303 | -0.7612175872816653 | 1.9027560487948123 |
The following example code is executed on the warped data matrix (Table 6) for normalization:
result_transformed = min_max_scaler.fit_transform(np.array(transformed))
print(result_transformed)
This yields the normalized verification data matrix shown in Table 7:
TABLE 7
1 | 0.12778588 | 1 |
0 | 1 | 0 |
0.39682637 | 0 | 1 |
Table 4 and Table 7 differ in value, but the difference is less than 5%, which shows that the relative spatial position relationships of the data change only slightly between the original data matrix and the warped data matrix. In other words, the Euclidean distances between data points in the normalized original data matrix differ somewhat from those in the normalized transformed data matrix, but the difference after normalization is less than 5%. At the precision required for data analysis, this is a micro-loss transformation that does not affect the data analysis results.
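The size of the change can be read off the two normalized matrices. The sketch below uses the values of Table 4 and Table 7 as printed and computes the largest absolute difference between corresponding entries; interpreting the 5% bound as an absolute difference on the normalized [0, 1] scale is an assumption.

```python
table4 = [[1.0, 0.11824166, 1.0],
          [0.0, 1.0, 0.0],
          [0.37095673, 0.0, 1.0]]
table7 = [[1.0, 0.12778588, 1.0],
          [0.0, 1.0, 0.0],
          [0.39682637, 0.0, 1.0]]

# Largest absolute difference between corresponding normalized entries.
max_diff = max(abs(a - b)
               for row4, row7 in zip(table4, table7)
               for a, b in zip(row4, row7))
```

Here `max_diff` is about 0.026, below the 0.05 threshold.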
In one embodiment, when the transformed data matrix is generated from the original data matrix, a transformation matrix is generated according to the transformation of the specified data of each dimension. The same transformation rule may be applied to every dimension (that is, every column) of the original data matrix, or different rules may be applied. Depending on the transformation rule, a column multiplier is generated for each column of specified data; it may be a random amplification coefficient, a random contraction coefficient, or a Sigmoid function with a random additional coefficient superimposed. The multipliers of all columns form the transformation matrix, each column of which corresponds to one dimension of the specified data, and the transformation matrix is multiplied by the original data matrix to obtain the transformed data matrix.
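For the linear transformations, one way to read this is as multiplication by a diagonal matrix whose entries are the column multipliers. The sketch below assumes that reading and uses hypothetical multipliers; the warping transformation is not linear and would not be expressed this way. `diagonal` and `matmul` are hypothetical helpers.

```python
def diagonal(multipliers):
    """Build a square matrix with the multipliers on its diagonal."""
    n = len(multipliers)
    return [[multipliers[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two row-major matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


origin = [[195223.24, 1642.09, 50000.00],
          [20000.00, 10821.45, 20000.00],
          [85000.24, 411.16, 50000.00]]
multipliers = [3.5, 0.4, 2.0]  # hypothetical: stretch, contract, stretch

# Right-multiplying by the diagonal matrix scales column j by multipliers[j].
transformed = matmul(origin, diagonal(multipliers))
```

Keeping the multipliers in a matrix means different rules can be chosen per column while the transformation is still applied in one matrix product.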
In one embodiment, the big data desensitization method of the present invention further comprises encrypting the remaining data in the multi-dimensional fact table, i.e., the data other than the specified data. Continuing with the multidimensional fact table shown in Table 1, the three columns "financing amount", "credit card bill" and "treasury amount" are the specified data, and the "card number" column is the remaining data, so the "card number" column is encrypted. In one embodiment, the card number is encrypted using the MD5 algorithm, as in the following example code:
cust_ids = [to_md5(x[0]) for x in sample_data]
where cust_ids represents the card numbers after encryption.
Encryption causes a loss of data information, but because the card number does not participate in subsequent data analysis, this loss does not affect the analysis result. Besides encryption, data other than the specified data can also be desensitized by replacing it with random values or with special characters.
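The example line above calls `to_md5` without defining it. One possible definition, using Python's standard `hashlib` module, is sketched here; the body of `to_md5` and the layout of `sample_data` (card number in position 0, values taken from Table 8) are assumptions for illustration.

```python
import hashlib


def to_md5(value):
    """Return the hexadecimal MD5 digest of the value's string form."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


# Assumed row layout: (card number, three asset columns).
sample_data = [
    ("161340511092313455", 195223.24, 1642.09, 50000.00),
    ("161421686113451004", 20000.00, 10821.45, 20000.00),
    ("161731799840912249", 85000.24, 411.16, 50000.00),
]

cust_ids = [to_md5(x[0]) for x in sample_data]
```

MD5 is a one-way hash, so the card number cannot be recovered from the digest, which is consistent with the information loss noted above.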
As financial institutions increasingly store their big data on distributed architectures, the big data desensitization method can also be implemented on a distributed architecture to suit that storage mode. Referring to Fig. 2, Fig. 2 is a flow chart of a big data desensitization method according to another embodiment of the invention. This embodiment provides a distributed big data desensitization method, in which specified data in a multi-dimensional fact table is desensitized under a distributed framework. The method as a whole is based on the Map/Reduce distributed framework, scales linearly, and is applicable to various Hadoop components, including but not limited to HBase, Hive and Impala. The distributed big data desensitization method comprises the following steps:
S201, a mapping step. The data in the multi-dimensional fact table is read by mappers (Mapper) and arranged into a data matrix. Each column in the data matrix corresponds to one dimension, and each dimension uses a separate mapper. This data matrix is the unscreened data matrix.
At present, the mainstream storage format for financial big data is Parquet, a columnar storage format suited to Hadoop. If the multidimensional fact table is stored in Parquet format, it is already column-oriented and can be read by the mappers directly by column. If the multi-dimensional fact table uses a row storage format (such as TXT) instead of a column storage format, a format conversion step is performed before the mapping step. The format conversion step is also based on the Map/Reduce distributed framework. In this step, a mapper is applied to each row of the row-stored fact table; the mapper reads one row and cuts it into segments by dimension. For an N-dimensional fact table, for example, the mapper cuts each row it reads into N segments, and if the table has M rows, M mappers are needed. After mapping is complete, reduction is performed by reducers (Reducer), one per dimension, so an N-dimensional fact table requires N reducers. The data output by each reducer is stored by column. Thus, once the map/reduce processing of the format conversion step is complete, the row-stored N-dimensional fact table has been converted into a column-stored N-dimensional fact table, and the mapping step then proceeds.
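The row-to-column conversion can be illustrated with a small in-process simulation of the map, shuffle and reduce phases. This is a sketch only and does not use Hadoop; the function names and the comma delimiter of the row format are assumptions.

```python
from collections import defaultdict


def map_row(row_index, line, sep=","):
    """Mapper: cut one row into N (dimension, (row_index, value)) pairs."""
    fields = [field.strip() for field in line.split(sep)]
    return [(dim, (row_index, value)) for dim, value in enumerate(fields)]


def reduce_dimension(pairs):
    """Reducer: order one dimension's values by row index to form a column."""
    return [value for _, value in sorted(pairs)]


def rows_to_columns(lines):
    """Convert row-stored text lines into a list of columns."""
    grouped = defaultdict(list)  # shuffle: group mapper output by dimension
    for row_index, line in enumerate(lines):  # one mapper call per row
        for dim, pair in map_row(row_index, line):
            grouped[dim].append(pair)
    # One reducer call per dimension, N calls in total.
    return [reduce_dimension(grouped[dim]) for dim in sorted(grouped)]
```

The row index travels with each value so that a reducer can restore the original row order regardless of the order in which mapper output arrives.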
S202, a screening step. The unscreened data matrix is screened: the columns containing the specified data are selected to form the original data matrix, and the remaining data forms the auxiliary data matrix. The fact table shown in Table 1 is taken as an example. After the mapping step, it forms the unscreened data matrix shown in Table 8:
TABLE 8
161340511092313455 | 195223.24 | 1642.09 | 50000.00 |
161421686113451004 | 20000.00 | 10821.45 | 20000.00 |
161731799840912249 | 85000.24 | 411.16 | 50000.00 |
After screening, the unscreened data matrix is divided into the original data matrix (see Table 2 above) and the auxiliary data matrix (see Table 9 below). The original data matrix contains the specified data: the three dimensions of asset information, "total amount of financing", "credit card bill" and "total amount of national debt". The auxiliary data matrix contains the identification data, as shown in Table 9:
TABLE 9
161340511092313455 |
161421686113451004 |
161731799840912249 |
In other embodiments, the screening step applies the following screening principles to distinguish the specified data from the auxiliary data:
Determine the data type: continuous-value data or discrete categorical data.
Continuous-value data is classified as specified data. If it is time data, it is converted to timestamps, the minimum value is taken as 0, every other value is replaced by its difference from the minimum, and the result is stored as type double. If it is numeric data, the original values are kept.
For discrete data, determine whether the number of distinct values in the field is greater than 65536.
Discrete data whose field contains no more than 65536 distinct values is converted to one-hot encoding and classified as specified data.
Discrete data whose field contains more than 65536 distinct values is classified as auxiliary data, and the original field content is retained. Such fields are mostly unique identifiers, such as a card number, a customer number or an internal log number.
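The screening principles above can be sketched as a single function. The function name `screen_column`, the `kind` labels and the assumption that the column type is already known (for example from the table schema) are illustrative choices, not part of the described method.

```python
from datetime import datetime, timezone

MAX_DISTINCT = 65536  # threshold stated in the text


def screen_column(values, kind):
    """Classify one column and return (category, converted values).

    kind is 'numeric', 'time' or 'discrete' (assumed to be known).
    """
    if kind == "numeric":
        # Numeric continuous data is kept as-is.
        return "specified", list(values)
    if kind == "time":
        # Convert to timestamps and store offsets from the minimum.
        stamps = [v.timestamp() for v in values]
        base = min(stamps)
        return "specified", [s - base for s in stamps]
    if kind == "discrete":
        categories = sorted(set(values))
        if len(categories) > MAX_DISTINCT:
            # Likely a unique identifier: keep the original content.
            return "auxiliary", list(values)
        index = {c: i for i, c in enumerate(categories)}
        one_hot = [
            [1.0 if index[v] == j else 0.0 for j in range(len(categories))]
            for v in values
        ]
        return "specified", one_hot
    raise ValueError(f"unknown column kind: {kind}")
```

For example, a column of two gender codes would be one-hot encoded into two numeric columns, while a column of card numbers with more than 65536 distinct values would be routed to the auxiliary data matrix unchanged.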
S203, a spatial transformation step. The specified data of each dimension is transformed column by column, the transformation being a stretch transformation, a contraction transformation or a warping transformation. The transformation of each dimension is processed by a separate reducer, and the transformed data matrix is obtained after reduction. After normalization, the difference between each value in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
In one embodiment,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column.
The contraction transformation includes:
generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
The warping transformation includes:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
applying the Sigmoid function to the column of specified data, wherein the operation result and the corresponding random additional coefficients together yield the warp-transformed column.
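The three transformations can be sketched in Python as follows. Several details are assumptions made for illustration: the means and standard deviation of the normal distributions (1.02, 0.98, 0.005), the choice to add the random additional coefficient to the Sigmoid result, and all function names. The text itself only states that the coefficients follow a normal distribution.

```python
import math
import random


def scale_column(column, rng, mean, sigma):
    """Multiply each value by its own normally distributed coefficient."""
    coefficients = [rng.gauss(mean, sigma) for _ in column]
    return [x * c for x, c in zip(column, coefficients)]


def stretch(column, rng, mean=1.02, sigma=0.005):
    """Stretch: amplification coefficients centred above 1 (assumed)."""
    return scale_column(column, rng, mean, sigma)


def contract(column, rng, mean=0.98, sigma=0.005):
    """Contraction: contraction coefficients centred below 1 (assumed)."""
    return scale_column(column, rng, mean, sigma)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def warp(column, rng, sigma=0.005):
    """Warp: Sigmoid of each value plus a random additional coefficient
    (additive combination is an assumption)."""
    extras = [rng.gauss(0.0, sigma) for _ in column]
    return [sigmoid(x) + e for x, e in zip(column, extras)]
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make a run reproducible. Because the Sigmoid saturates for large inputs, the warp is most informative on columns that have already been scaled to a modest range.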
The stretch, contraction and warping transformations in step S203 follow the same operation rules as the corresponding transformations in step S102 described above, so the rules are not repeated here. Step S203 differs in that, when executed on the distributed framework, the transformation of each dimension is processed by an independent reducer (Reducer), and the transformed data matrix is obtained after reduction (Reduce). Because each dimension has its own reducer, each dimension can have its own transformation mode; the dimensions may all use the same transformation or use different ones. In one embodiment, each reducer generates one column of the transformation matrix according to the transformation applied to the specified data of its dimension, and multiplies that column by the corresponding column of specified data in the original data matrix to obtain the corresponding column of the transformed data matrix.
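The one-reducer-per-dimension arrangement can be imitated on a single machine with a thread pool, as in the sketch below. Threads stand in for reducers purely for illustration; the function names and the dictionary-based column layout are assumptions, and a real deployment would run on the Map/Reduce framework instead.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_column(task):
    """Stand-in for one reducer: transform a single dimension's column."""
    name, column, transform = task
    return name, transform(column)


def spatial_transform(columns, transforms):
    """Apply each dimension's own transformation independently.

    columns:    {dimension name: list of values}
    transforms: {dimension name: callable taking and returning a column}
    """
    tasks = [(name, col, transforms[name]) for name, col in columns.items()]
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(reduce_column, tasks))
```

Because every task carries its own callable, one dimension can be stretched while another is contracted or warped, which mirrors the per-dimension independence described above.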
S204, a data merging step. The data in the auxiliary data matrix is encrypted, and the encrypted auxiliary data matrix is merged with the transformed data matrix to generate the desensitized data matrix. Because the auxiliary data does not participate in subsequent data mining and data analysis, it may be encrypted or desensitized in a lossy manner, for example with the aforementioned MD5 algorithm, or with the random value replacement or special character replacement used in conventional desensitization. The transformed data matrix stores the specified data after spatial transformation; the specified data is desensitized by a lossless or slightly lossy spatial transformation, but the information it carries is retained and remains usable for data analysis and data mining. The desensitized data matrix may be saved as a file as needed, for example as a column storage file in Parquet format.
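A minimal sketch of the merging step is shown below, assuming both matrices are row-major lists with one row per record and that MD5 is the chosen lossy treatment for the auxiliary data. The function name `merge_desensitized` is hypothetical, and writing the result to a column storage file is left out.

```python
import hashlib


def merge_desensitized(auxiliary, transformed):
    """Hash every auxiliary value and join it with the transformed row."""
    if len(auxiliary) != len(transformed):
        raise ValueError("both matrices must have the same number of rows")
    merged = []
    for aux_row, data_row in zip(auxiliary, transformed):
        hashed = [
            hashlib.md5(str(value).encode("utf-8")).hexdigest()
            for value in aux_row
        ]
        merged.append(hashed + list(data_row))
    return merged
```

Each output row then holds the hashed identifier columns followed by the spatially transformed asset columns, so the identifiers stay aligned with the records they belong to.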
The big data desensitization method of the invention desensitizes sensitive data by spatial transformation while retaining the relative spatial position information of the data. From the viewpoint of data mining and data analysis, the data loss caused by the spatial transformation is less than 5%, so subsequent data processing is not affected. The method can also be applied to a distributed framework to meet the big data processing requirements of distributed systems.
It should also be noted that the embodiments described above are only specific embodiments of the present invention. The invention is not limited to these embodiments, and similar changes or modifications that those skilled in the art can readily derive from this disclosure fall within the scope of the present invention. The embodiments are provided to enable persons skilled in the art to make or use the invention; they may be modified or varied without departing from the inventive concept, so the scope of protection is not limited by the embodiments described above but should be accorded the widest scope consistent with the innovative features set forth in the claims.
Claims (9)
1. A big data desensitization method, wherein desensitizing specified data in a multidimensional fact table comprises:
an initialization step, namely reading specified data in a multi-dimensional fact table and arranging the specified data into a data matrix, wherein each column in the data matrix corresponds to a dimension, and the data matrix is an original data matrix;
a spatial transformation step of transforming the specified data of each dimension by columns, wherein the transformation comprises a stretch transformation, a contraction transformation or a warping transformation, to obtain a transformed data matrix;
after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
2. The big data desensitization method of claim 1,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column;
the contraction transformation includes:
generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
3. The big data desensitization method of claim 1, wherein the warping transformation comprises:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
applying the Sigmoid function to the column of specified data, wherein the operation result and the corresponding random additional coefficients together yield the warp-transformed column.
4. A big data desensitization method according to claim 1, wherein a transformation matrix is generated based on the transformation of the specified data for each dimension, each column in the transformation matrix corresponding to a dimension of the specified data, and the transformed data matrix is obtained by multiplying the transformation matrix by the original data matrix.
5. The big data desensitization method of claim 1, further comprising: the remaining data in the multi-dimensional fact table, except for the specified data, is encrypted.
6. A distributed big data desensitization method is characterized in that desensitization is carried out on specified data in a multi-dimensional fact table under a distributed framework, and the method comprises the following steps:
a mapping step, wherein mappers read the data in the multi-dimensional fact table and arrange the data into a data matrix, each column in the data matrix corresponds to one dimension, each dimension uses an independent mapper, and the data matrix is an unscreened data matrix;
a screening step, namely screening the unscreened data matrix, selecting the columns where the specified data is located to form an original data matrix, and forming an auxiliary data matrix from the remaining data other than the specified data;
a spatial transformation step, namely transforming the specified data of each dimension by columns, wherein the transformation comprises a stretch transformation, a contraction transformation or a warping transformation, the transformation of each dimension is processed by an independent reducer, and a transformed data matrix is obtained after reduction, wherein after normalization processing, the difference between the value of each data item in the transformed data matrix and the corresponding value in the original data matrix is less than 5%;
and a data merging step, namely encrypting the data in the auxiliary data matrix, and merging the encrypted auxiliary data matrix and the transformed data matrix to generate a desensitized data matrix.
7. The distributed big data desensitization method of claim 6,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column;
the contraction transformation includes:
generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
8. The distributed big data desensitization method of claim 6, wherein the warping transform comprises:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
applying the Sigmoid function to the column of specified data, wherein the operation result and the corresponding random additional coefficients together yield the warp-transformed column.
9. The distributed big data desensitization method of claim 6, wherein the reducer generates a column of transformation matrices from the transformation of the specified data for each dimension, the reducer multiplying the column of transformation matrices with a corresponding column of specified data in the original data matrix to obtain a corresponding column of data in the transformed data matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010675130.0A CN111814187A (en) | 2020-07-14 | 2020-07-14 | Big data desensitization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111814187A true CN111814187A (en) | 2020-10-23 |
Family
ID=72842436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010675130.0A Pending CN111814187A (en) | 2020-07-14 | 2020-07-14 | Big data desensitization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814187A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||