CN111814187A - Big data desensitization method - Google Patents


Info

Publication number
CN111814187A
Authority
CN
China
Prior art keywords
data
column
transformation
matrix
random
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010675130.0A
Other languages
Chinese (zh)
Inventor
臧其事
赵可欣
吴晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China Shanghai Branch
Original Assignee
Agricultural Bank of China Shanghai Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China Shanghai Branch filed Critical Agricultural Bank of China Shanghai Branch
Priority to CN202010675130.0A priority Critical patent/CN111814187A/en
Publication of CN111814187A publication Critical patent/CN111814187A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/58 Random or pseudo-random number generators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a big data desensitization method for desensitizing specified data in a multi-dimensional fact table. The method comprises the following steps. An initialization step reads the specified data in the multi-dimensional fact table and arranges it into a data matrix, wherein each column in the data matrix corresponds to one dimension and the data matrix is the original data matrix. A spatial transformation step transforms the specified data of each dimension by columns, wherein the transformation comprises a stretching transformation, a shrinking transformation or a warping transformation, to obtain a transformed data matrix. After normalization, the difference between each value in the transformed data matrix and the corresponding value in the original data matrix is less than 5%. The method desensitizes sensitive data by means of spatial transformation, preserves the relative spatial position information of the desensitized data, and keeps the data loss caused by the spatial transformation below 5%. The method can also be applied to a distributed framework to meet the big data processing needs of distributed systems.

Description

Big data desensitization method
Technical Field
The invention relates to the field of big data, in particular to a data security technology of big data.
Background
Data processing is becoming an important piece of infrastructure, and data security, especially the security of sensitive data, is of particular importance to it. Desensitization of sensitive data is likewise a basic capability. In the financial field, the prior art essentially desensitizes data by random value replacement or by special character replacement. The former changes data by substituting random values (letters are replaced by random letters, and digits by random digits); the latter changes data by substituting special characters (such as an asterisk).
Such desensitization is appropriate for data that has no meaning of its own and is merely indicative, such as names, mobile phone numbers and card numbers. Indicative information of this kind has no material significance for data mining and data analysis.
As informatization and digitization deepen in the financial industry, data mining and data analysis of financial data are becoming more and more important, and they play an increasingly important role in risk control, risk early warning, customer identification and profit improvement. Data mining and data analysis require asset data, behavior data, customer profiles and similar data. These data also involve customer privacy and must be desensitized before use. Such data carries meaning, and traditional random value replacement or special character replacement alters it so that its meaning is partially or completely lost and subsequent data mining and data analysis cannot be performed. In addition, to make risk control, risk early warning, customer identification and profit improvement more effective, it is desirable to share data among more financial institutions, since analysis of the big data owned by multiple financial institutions gives more accurate results. Data sharing and exchange place higher requirements on data desensitization. On the one hand, the desensitized data must differ clearly from the original data, and it must not be possible to restore or locate the original data from the desensitized data, so that the original data is protected from attack. On the other hand, the desensitized data should retain as much of the meaning and information of the original data as possible, so that subsequent data mining and data analysis can proceed with due accuracy.
Disclosure of Invention
The invention aims to provide a big data desensitization method which can perform micro-damage desensitization on sensitive data and can be executed on a distributed framework.
According to an embodiment of the invention, a big data desensitization method is provided for desensitizing specified data in a multi-dimensional fact table, and the method comprises the following steps:
an initialization step, namely reading specified data in a multi-dimensional fact table and arranging the specified data into a data matrix, wherein each column in the data matrix corresponds to a dimension, and the data matrix is an original data matrix;
a spatial transformation step of transforming the specified data of each dimension by columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation to obtain a transformed data matrix;
after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
In one embodiment, the stretch transformation comprises: generating a column of random amplification coefficients, wherein the number of random amplification coefficients is the same as the number of specified data in the corresponding column, and the column of random amplification coefficients satisfies a normal distribution; and multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column. The contraction transformation comprises: generating a column of random contraction coefficients, wherein the number of random contraction coefficients is the same as the number of specified data in the corresponding column, and the column of random contraction coefficients satisfies a normal distribution; and multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
In one embodiment, the warping transformation comprises: generating a Sigmoid function; generating a column of random additional coefficients, wherein the number of random additional coefficients is the same as the number of specified data in the corresponding column, and the column of random additional coefficients satisfies a normal distribution; and applying the Sigmoid function to the column of specified data, wherein the result and the corresponding random additional coefficient together give the warp-transformed column.
In one embodiment, a transformation matrix is generated according to the transformation of the specified data of each dimension, each column in the transformation matrix corresponds to one dimension of the specified data, and the transformation matrix is multiplied by the original data matrix to obtain a transformed data matrix.
In one embodiment, the method further comprises: the remaining data in the multi-dimensional fact table, except for the specified data, is encrypted.
According to an embodiment of the invention, a distributed big data desensitization method is provided, which desensitizes specified data in a multi-dimensional fact table under a distributed framework, and comprises the following steps:
a mapping step, wherein a mapper reads data in the multi-dimensional fact table and arranges the data into a data matrix, each column in the data matrix corresponds to one dimension, each dimension uses an independent mapper, and the data matrix is an unscreened data matrix;
a screening step, namely screening the unscreened data matrix, selecting the columns where the specified data is located to form an original data matrix, and forming an auxiliary data matrix from the remaining data other than the specified data;
a spatial transformation step, namely transforming the specified data of each dimension according to columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation, the transformation of each dimension is processed by using an independent reducer, and a transformed data matrix is obtained after reduction, wherein after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%;
and a data merging step, namely encrypting the data in the auxiliary data matrix, and merging the encrypted auxiliary data matrix and the transformed data matrix to generate a desensitized data matrix.
In one embodiment, the stretch transformation comprises: generating a column of random amplification coefficients, wherein the number of random amplification coefficients is the same as the number of specified data in the corresponding column, and the column of random amplification coefficients satisfies a normal distribution; and multiplying the column of specified data by the column of random amplification coefficients to obtain a stretch-transformed column. The contraction transformation comprises: generating a column of random contraction coefficients, wherein the number of random contraction coefficients is the same as the number of specified data in the corresponding column, and the column of random contraction coefficients satisfies a normal distribution; and multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
In one embodiment, the warping transformation comprises: generating a Sigmoid function; generating a column of random additional coefficients, wherein the number of random additional coefficients is the same as the number of specified data in the corresponding column, and the column of random additional coefficients satisfies a normal distribution; and applying the Sigmoid function to the column of specified data, wherein the result and the corresponding random additional coefficient together give the warp-transformed column.
In one embodiment, the reducer generates a column of transformation matrices from the transformation of the specified data for each dimension, and the reducer multiplies the column of transformation matrices by a corresponding column of specified data in the original data matrix to obtain a corresponding column of data in the transformed data matrix.
The big data desensitization method of the invention desensitizes sensitive data by means of spatial transformation and preserves the relative spatial position information of the desensitized data. From the viewpoint of data mining and data analysis, the data loss caused by the spatial transformation is less than 5%, so the method does not affect subsequent data processing. The method can also be applied to a distributed framework to meet the big data processing needs of distributed systems.
Drawings
FIG. 1 discloses a flow diagram of a big data desensitization method according to an embodiment of the invention.
FIG. 2 discloses a flow diagram of a big data desensitization method according to another embodiment of the invention.
Detailed Description
The invention provides a big data desensitization method for desensitizing specified data in a multi-dimensional fact table, and a flow chart of the big data desensitization method according to an embodiment of the invention is disclosed in fig. 1. As shown, the method includes:
S101, an initialization step: the specified data in the multi-dimensional fact table is read and arranged into a data matrix, wherein each column in the data matrix corresponds to one dimension, and the data matrix is the original data matrix.
In the financial industry, customer data is usually stored in a fact table. Some data in the fact table identifies the customer, such as name, card number, address and mobile phone number; some data records the customer's asset information, such as total assets, RMB assets, foreign currency assets, financial products, credit cards, guarantee funds and bonds; some data records customer behavior information, such as transfer records, wealth management purchase records and merchandise transaction records. The identification data is essentially not used in subsequent data analysis, so it can be encrypted or desensitized in the traditional way; even if the information it carries is lost or damaged during encryption or desensitization, the result of data analysis is essentially unaffected. The asset information data and the behavior information data are the core data for data analysis and data mining, and the meaning and information of the original data must be preserved after desensitization to guarantee the validity and accuracy of data analysis. In the invention, data such as asset information data and behavior information data are called specified data, and the specified data in the fact table is desensitized using the big data desensitization method provided by the invention.
TABLE 1
Card number          Gross amount of financing   Credit card bill   Total amount of national debt
161340511092313455   195223.24                   1642.09            50000.00
161421686113451004   20000.00                    10821.45           20000.00
161731799840912249   85000.24                    411.16             50000.00
Table 1 shows an example of a fact table in which customer data is recorded. Table 1 is a two-dimensional table, which is the most widely used way to record customer data. In the two-dimensional table shown in Table 1, each row represents the data of one customer and each column represents one category. In the present invention, each "category" represents a dimension, and a "dimension" denotes the data of all customers under a certain category. The dimension "card number" in Table 1 is identification data, while "gross amount of financing", "credit card bill" and "total amount of national debt" are asset information data. Behavior information data is not listed in Table 1. In Table 1, the data of the three dimensions "gross amount of financing", "credit card bill" and "total amount of national debt" are the specified data.
TABLE 2
195223.24 1642.09 50000.00
20000.00 10821.45 20000.00
85000.24 411.16 50000.00
In step S101, the specified data in the multidimensional fact table (table 1) is read and arranged into a data matrix (table 2), each column in the data matrix corresponds to a dimension, and the data matrix (table 2) is an original data matrix.
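The initialization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `fact_table`, `SPECIFIED_COLUMNS` and `build_original_matrix` are hypothetical and do not appear in the original publication; the values are those of Table 1.

```python
import numpy as np

# Fact table from Table 1: (card number, financing, credit card bill, national debt).
fact_table = [
    ("161340511092313455", 195223.24, 1642.09, 50000.00),
    ("161421686113451004", 20000.00, 10821.45, 20000.00),
    ("161731799840912249", 85000.24, 411.16, 50000.00),
]

# Hypothetical choice: positions of the columns that hold the specified data.
SPECIFIED_COLUMNS = (1, 2, 3)


def build_original_matrix(rows, columns):
    """Return an array with one row per customer and one column per dimension."""
    return np.array([[row[c] for c in columns] for row in rows], dtype=float)


origin = build_original_matrix(fact_table, SPECIFIED_COLUMNS)
print(origin.shape)  # (3, 3)
```

The resulting `origin` array holds the same values as Table 2; the card number column is left out because it is identification data rather than specified data.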
S102, a spatial transformation step: the specified data of each dimension is transformed by columns, wherein the transformation comprises a stretching transformation, a shrinking transformation or a warping transformation, and a transformed data matrix is obtained. After normalization, the difference between each value in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
In one embodiment of the present invention,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of the random amplification coefficients is the same as the number of the designated data in the corresponding column, and the column of random amplification coefficients meet normal distribution;
and multiplying the column designation data by the column random amplification factor to obtain a column subjected to stretch transformation.
The contraction transformation comprises the following steps:
generating a column of random contraction coefficients, wherein the number of the random contraction coefficients is the same as the number of the designated data in the corresponding column, and the column of random contraction coefficients meet normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a contraction-transformed column.
The warping transformation includes:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients is the same as the number of specified data in the corresponding column, and the column of random additional coefficients satisfies a normal distribution;
and applying the Sigmoid function to the column of specified data, wherein the result and the corresponding random additional coefficient together give the warp-transformed column.
The primary way of data desensitization to specified data in the present invention is spatial transformation. The original numerical values are hidden and replaced by new numerical values through stretching transformation, shrinking transformation or warping transformation, but the relative spatial position relation among the new numerical values is kept unchanged or changed a little, so that the results of subsequent data mining and data analysis are basically not influenced, the results of data analysis on desensitized data and the results of data analysis on original data can basically keep consistent, or the difference degree is controlled within 5%.
For example, the original data matrix shown in table 2 is subjected to stretch transformation. Firstly, a column of random amplification factors is generated, the number of the random amplification factors is the same as the number of the designated data in the corresponding column, and the column of random amplification factors meet the normal distribution. The column designation data is then multiplied by the column random magnification factor to obtain a stretch transformed column.
An example implementation code for the stretch transform is as follows:
[Code listing provided as an image in the original publication: Figure BDA0002583781120000061]
The to_random() function generates a column of random amplification factors, the number of which is the same as the number of specified data in the corresponding column. The random numbers generated by to_random() follow a normal distribution, so the column of random amplification factors it generates satisfies the normal distribution.
fin refers to the dimension "financing amount", credit refers to the dimension "credit card bill", and bond refers to the dimension "national debt amount".
And multiplying a column of specified data by the random amplification coefficient corresponding to the column to obtain a column subjected to stretching transformation. After multiplication, the following data in table 3 are obtained:
TABLE 3
710264.3341476232 6455.808978154993 258916.91251466938
72764.32192679757 42544.083495213636 103566.76500586775
309249.2413607528 1616.458549445041 258916.91251466938
The data matrix of table 3 is a transformed data matrix.
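Because the original listing is only available as an image, the following is a minimal sketch of a stretch transformation, not the code of the publication. Dividing each column of Table 3 by the matching column of Table 2 gives a constant ratio (about 3.638, 3.931 and 5.178), so the sketch assumes one normally distributed factor per column. The signature of `to_random`, its `mean` and `sigma` parameters, and the rule of redrawing until the factor exceeds 1 are all assumptions made for illustration.

```python
import random

import numpy as np

# Original data matrix (Table 2).
origin = np.array([
    [195223.24, 1642.09, 50000.00],
    [20000.00, 10821.45, 20000.00],
    [85000.24, 411.16, 50000.00],
])


def to_random(mean=3.0, sigma=1.0, rng=random):
    """Draw one normally distributed factor, redrawing until it exceeds 1.

    Assumption: the redraw keeps the factor an amplification; strictly
    speaking this makes the distribution a truncated normal.
    """
    while True:
        factor = rng.normalvariate(mean, sigma)
        if factor > 1.0:
            return factor


def stretch(matrix, rng=random):
    """Multiply every column by its own random amplification factor."""
    factors = np.array([to_random(rng=rng) for _ in range(matrix.shape[1])])
    # Broadcasting multiplies column j of the matrix by factors[j].
    return matrix * factors, factors


transformed, factors = stretch(origin, random.Random(0))
print(factors)
print(transformed)
```

A seeded `random.Random` instance is passed in only to make the sketch repeatable; in a real desensitization run the factors would have to stay secret and unpredictable.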
The relative spatial position relationship of the data is then verified for the original data matrix and the transformed data matrix respectively. To verify, each data matrix is first normalized, and the Euclidean distances between the data points in the matrix are then calculated. The Euclidean distance is calculated by the function euclidean_distance(coords1, coords2).
The normalization uses the MinMaxScaler() function. The following example code is executed on the original data matrix:
import numpy as np
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
result_origin = min_max_scaler.fit_transform(np.array(origin))
This yields the normalized verification data matrix shown in Table 4:
TABLE 4
1 0.11824166 1
0 1 0
0.37095673 0 1
The following example code is executed on the transformed data matrix:
min_max_scaler=preprocessing.MinMaxScaler()
result_transformed=min_max_scaler.fit_transform(np.array(transformed))
This yields the normalized verification data matrix shown in Table 5:
TABLE 5
1 0.11824166 1
0 1 0
0.37095673 0 1
Table 4 and Table 5 agree to 8 decimal places, which indicates that the relative spatial position relationship of the data is essentially unchanged between the original data matrix and the transformed data matrix. In other words, the Euclidean distances between the normalized data points are the same, and the two matrices can be regarded as completely consistent.
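The verification can be reproduced with the values printed in Tables 2 and 3. The sketch below is an assumption-laden illustration: `min_max_normalize` mirrors what `MinMaxScaler` does with its default range using plain NumPy, the body of `euclidean_distance` is a guess (only its name and parameters appear in the text), and `pairwise_distances` is a hypothetical helper.

```python
import numpy as np


def min_max_normalize(matrix):
    """Scale every column to [0, 1]; assumes no column is constant."""
    matrix = np.asarray(matrix, dtype=float)
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    return (matrix - col_min) / col_range


def euclidean_distance(coords1, coords2):
    """Euclidean distance between two data points (rows)."""
    return float(np.linalg.norm(np.asarray(coords1) - np.asarray(coords2)))


def pairwise_distances(matrix):
    """Distances between every pair of rows, in a fixed order."""
    n = len(matrix)
    return [euclidean_distance(matrix[i], matrix[j])
            for i in range(n) for j in range(i + 1, n)]


origin = np.array([  # Table 2
    [195223.24, 1642.09, 50000.00],
    [20000.00, 10821.45, 20000.00],
    [85000.24, 411.16, 50000.00],
])
transformed = np.array([  # Table 3
    [710264.3341476232, 6455.808978154993, 258916.91251466938],
    [72764.32192679757, 42544.083495213636, 103566.76500586775],
    [309249.2413607528, 1616.458549445041, 258916.91251466938],
])

d_origin = pairwise_distances(min_max_normalize(origin))
d_transformed = pairwise_distances(min_max_normalize(transformed))
print(np.allclose(d_origin, d_transformed))  # True
```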
The contraction transformation is computed in the same way as the stretch transformation and is not described again here. The stretching transformation and the shrinking transformation are linear transformations, so the relative spatial position relationship of the data is essentially unchanged between the original data matrix and the transformed data matrix, and the transformation can be regarded as lossless at the precision required by data analysis.
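Why a column scaling leaves the normalized matrix unchanged can be checked directly. Assuming one positive factor $c$ per column (an assumption consistent with Tables 2 and 3, not something the text states explicitly), min-max normalization of the scaled column $x' = c\,x$ gives:

```latex
\frac{x'_i - \min_k x'_k}{\max_k x'_k - \min_k x'_k}
= \frac{c\,x_i - c\,\min_k x_k}{c\,\max_k x_k - c\,\min_k x_k}
= \frac{x_i - \min_k x_k}{\max_k x_k - \min_k x_k},
\qquad c > 0 .
```

The factor cancels whether $c > 1$ (stretching) or $0 < c < 1$ (contraction). If every entry of a column were multiplied by a different factor, this cancellation would not hold in general.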
The warping transformation is again illustrated with the original data matrix shown in Table 2. The warping transformation first generates a Sigmoid function. A column of random additional coefficients is then generated; the number of random additional coefficients is the same as the number of specified data in the corresponding column, and the column of random additional coefficients satisfies a normal distribution. Finally, the Sigmoid function is applied to the column of specified data, and the result together with the corresponding random additional coefficient gives the warp-transformed column.
In one embodiment, the S-shaped growth curve function sigmoid(x) is used as the Sigmoid function. It is defined by the following code:
import numpy as np
def sigmoid(x):
    return 1. / (1 + np.exp(-x))
In one embodiment, the random additional coefficients are generated by random.normalvariate(), using the following code:
import random
def to_random_within(i):
    return random.normalvariate(i, 1)
The random additional coefficients generated by random.normalvariate(i, 1) follow a normal distribution with mean i and standard deviation 1.
The Sigmoid function is applied to the column of specified data, and the result together with the corresponding random additional coefficient gives the warp-transformed column. The warped data matrix is obtained by the example code below.
[Code listing provided as an image in the original publication: Figure BDA0002583781120000081]
The to_random_within() function generates a column of random additional coefficients, the number of which is the same as the number of specified data in the corresponding column. The random numbers generated by to_random_within() follow a normal distribution, so the column of random additional coefficients it generates satisfies the normal distribution.
fin refers to the dimension "financing amount", credit refers to the dimension "credit card bill", and bond refers to the dimension "national debt amount".
The Sigmoid function is applied to a column of specified data, and the result is added to the corresponding random additional coefficient to obtain a warp-transformed column. After each column has been calculated in this way, the warped data matrix is obtained, as shown in Table 6:
TABLE 6
2.9818167618925275 -0.7316915640653386 1.9027560487948123
2.7507581832625227 -0.5301590086516604 1.6716974701648073
2.8424483209688303 -0.7612175872816653 1.9027560487948123
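The listing that produced Table 6 is only available as an image, so the following is a sketch under assumptions rather than the original code. In every column of Table 6 the spread between the largest and smallest value is about 0.2310586, which equals sigmoid(1) - sigmoid(0); this is consistent with applying the Sigmoid function to the min-max normalized column and then adding one random offset per column. The sketch assumes exactly that reading. `warp`, `min_max_normalize` and the `centers` argument are hypothetical names.

```python
import random

import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def to_random_within(i):
    # Normally distributed offset with mean i and standard deviation 1.
    return random.normalvariate(i, 1)


def min_max_normalize(matrix):
    """Scale every column to [0, 1]; assumes no column is constant."""
    matrix = np.asarray(matrix, dtype=float)
    col_min = matrix.min(axis=0)
    return (matrix - col_min) / (matrix.max(axis=0) - col_min)


def warp(matrix, centers):
    """Sigmoid of the normalized columns plus one random offset per column."""
    offsets = np.array([to_random_within(c) for c in centers])
    return sigmoid(min_max_normalize(matrix)) + offsets, offsets


origin = np.array([  # Table 2
    [195223.24, 1642.09, 50000.00],
    [20000.00, 10821.45, 20000.00],
    [85000.24, 411.16, 50000.00],
])

# Assumption: the offset means below are arbitrary illustrative values.
warped, offsets = warp(origin, centers=[2.0, 0.0, 1.0])
spread = warped.max(axis=0) - warped.min(axis=0)
print(spread)  # each entry is sigmoid(1) - sigmoid(0), about 0.231
```

The offsets differ from run to run, so the sketch will not reproduce the exact numbers of Table 6, only their structure.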
The following example code is executed on the warped data matrix (Table 6) to normalize it:
result_transformed = min_max_scaler.fit_transform(np.array(transformed))
print(result_transformed)
This yields the normalized verification data matrix shown in Table 7:
TABLE 7
1 0.12778588 1
0 1 0
0.39682637 0 1
Tables 4 and 7 differ in value, but the difference is less than 5% (the largest difference between corresponding entries is about 0.026), which shows that the relative spatial position relationship of the data varies only slightly between the original data matrix and the warped data matrix. That is, the Euclidean distances between data points in the normalized original data matrix differ somewhat from those in the normalized transformed data matrix, but the difference after normalization is less than 5%. At the precision required by data analysis, the warping transformation is a micro-loss transformation and does not affect the results of data analysis.
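The size of the deviation can be checked from the printed values of Tables 4 and 7. The sketch below uses two illustrative measures that the text does not define precisely and that are therefore assumptions: the largest absolute difference between corresponding normalized entries, and the largest relative change in pairwise Euclidean distance. `pairwise_distances` is a hypothetical helper.

```python
from itertools import combinations

import numpy as np

table_4 = np.array([  # normalized original data matrix
    [1.0, 0.11824166, 1.0],
    [0.0, 1.0, 0.0],
    [0.37095673, 0.0, 1.0],
])
table_7 = np.array([  # normalized warped data matrix
    [1.0, 0.12778588, 1.0],
    [0.0, 1.0, 0.0],
    [0.39682637, 0.0, 1.0],
])


def pairwise_distances(matrix):
    """Euclidean distance between every pair of rows."""
    return np.array([np.linalg.norm(a - b) for a, b in combinations(matrix, 2)])


# Measure 1: largest entry-wise difference on the normalized scale.
max_abs_diff = np.abs(table_7 - table_4).max()

# Measure 2: largest relative change of any pairwise distance.
d4 = pairwise_distances(table_4)
d7 = pairwise_distances(table_7)
max_rel_change = (np.abs(d7 - d4) / d4).max()

print(max_abs_diff < 0.05, max_rel_change < 0.05)  # True True
```

For these tables the two measures come out at roughly 0.026 and 0.037, so both stay under the 5% bound.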
In one embodiment, when the transformed data matrix is generated from the original data matrix, a transformation matrix is generated according to the transformation of the specified data of each dimension; each column in the transformation matrix corresponds to one dimension of the specified data, and the transformation matrix is multiplied by the original data matrix to obtain the transformed data matrix. The same transformation rule may be applied to every dimension (i.e., every column) of the original data matrix, or different rules may be applied. Depending on the transformation rule, a column multiplier is generated for each column of specified data, which may be a random amplification factor, a random contraction factor, or a Sigmoid function with a random additional coefficient superimposed. The multipliers of all columns form the transformation matrix.
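For the two linear transformations, one way to read this is as a product with a diagonal matrix. The sketch below assumes that layout, which the text does not spell out, and covers stretching and contraction only: the Sigmoid warping is not a linear map and cannot be written as a matrix product of this kind. The multiplier values are arbitrary illustrative numbers.

```python
import numpy as np

origin = np.array([  # Table 2
    [195223.24, 1642.09, 50000.00],
    [20000.00, 10821.45, 20000.00],
    [85000.24, 411.16, 50000.00],
])

# Hypothetical column multipliers: stretch, contraction, stretch.
multipliers = np.array([3.6, 0.4, 1.5])

# Assumed layout: one diagonal entry per dimension of the specified data.
transformation = np.diag(multipliers)

# Right-multiplication scales column j of the data by multipliers[j].
transformed = origin @ transformation

print(np.allclose(transformed, origin * multipliers))  # True
```

Writing the transformation as a single matrix makes it easy to mix rules across columns, since each diagonal entry can come from a different rule.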
In one embodiment, the big data desensitization method of the present invention further comprises encrypting the remaining data in the multi-dimensional fact table other than the specified data. Continuing with the multi-dimensional fact table shown in Table 1, the three columns "gross amount of financing", "credit card bill" and "total amount of national debt" are the specified data, and the column "card number" is the remaining data. The card number column is therefore encrypted. In one embodiment, the card number is processed with the MD5 algorithm (a one-way hash), using the following example code:
cust_ids = [to_md5(x[0]) for x in sample_data]
where cust_ids holds the processed card numbers.
Encryption causes a loss of data information, but the card number does not take part in subsequent data analysis, so this loss does not affect the result of data analysis. Besides encryption, data other than the specified data can also be desensitized by random value replacement or special character replacement.
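The helper `to_md5` is not defined in the text. A minimal sketch using Python's standard `hashlib` module might look as follows; the body of `to_md5` and the layout of `sample_data` (card number in position 0) are assumptions.

```python
import hashlib


def to_md5(value):
    """Return the hexadecimal MD5 digest of a value's string form."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


# Assumed layout: card number first, followed by the specified data.
sample_data = [
    ("161340511092313455", 195223.24, 1642.09, 50000.00),
    ("161421686113451004", 20000.00, 10821.45, 20000.00),
    ("161731799840912249", 85000.24, 411.16, 50000.00),
]

cust_ids = [to_md5(x[0]) for x in sample_data]
print(cust_ids)
```

MD5 is a one-way hash rather than a reversible cipher, so the card number cannot be decrypted from the digest. Because card numbers come from a limited, enumerable space, an unsalted digest can still be matched by exhaustive guessing, which is worth weighing when choosing the algorithm.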
As the big data storage of the financial institution is increasingly applied to the distributed architecture, in order to adapt to the big data storage mode of the distributed architecture, the big data desensitization method can also be realized on the distributed architecture. Referring to fig. 2, fig. 2 discloses a flow chart of a big data desensitization method according to another embodiment of the invention. The embodiment provides a distributed big data desensitization method, and specified data in a multi-dimensional fact table is desensitized under a distributed framework. The method is based on a Map/Reduce (Map/Reduce) distributed framework as a whole. The linear expansion can be carried out, and the method is suitable for various Hadoop components including but not limited to hbase, hive, impala and the like. The distributed big data desensitization method comprises the following steps:
S201, a mapping step. The data in the multi-dimensional fact table is read by mappers and arranged into a data matrix. Each column in the data matrix corresponds to one dimension, and each dimension uses a separate mapper. This data matrix is the unscreened data matrix.
At present, the mainstream storage format for financial big data is Parquet, a column storage format suitable for Hadoop. If the multi-dimensional fact table is stored in Parquet format, it is already in a column storage format and can be read directly by the mappers column by column. If the multi-dimensional fact table uses a row storage format (such as TXT) rather than a column storage format, a format conversion step is included before the mapping step. The format conversion step is also based on the Map/Reduce distributed framework. In the format conversion step, a mapper is applied to each row of the multi-dimensional fact table in the row storage format; the mapper reads one row of data and cuts it into segments by dimension. For example, for an N-dimensional fact table, the mapper cuts each row it reads into N segments. If the N-dimensional fact table has M rows, M mappers are needed. After the mapping is complete, reduction is performed by reducers. Reduction is performed by dimension, with one reducer per dimension, so an N-dimensional fact table requires N reducers. The data reduced by each reducer is stored as a column. Thus, after the map/reduce process of the format conversion step is completed, the row-stored N-dimensional fact table has been converted into a column-stored N-dimensional fact table. The mapping step then proceeds.
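The format conversion described above can be sketched in a single process as follows. This is only an illustration of the map-by-row, reduce-by-dimension idea and not a Hadoop job; the function names map_row, reduce_column and rows_to_columns, and the comma separator, are assumptions.

```python
from collections import defaultdict


def map_row(row_index, line, sep=","):
    """Mapper: cut one row into N segments, keyed by dimension (column index)."""
    for col_index, segment in enumerate(line.split(sep)):
        yield col_index, (row_index, segment)


def reduce_column(pairs):
    """Reducer for one dimension: restore the row order and emit the column."""
    return [segment for _, segment in sorted(pairs)]


def rows_to_columns(lines, sep=","):
    """Convert row-stored text lines into a list of columns."""
    shuffled = defaultdict(list)  # stands in for the shuffle phase
    for row_index, line in enumerate(lines):
        for col_index, pair in map_row(row_index, line, sep):
            shuffled[col_index].append(pair)
    # One reducer call per dimension, so N dimensions need N reducers.
    return [reduce_column(shuffled[c]) for c in sorted(shuffled)]


rows = [
    "161340511092313455,195223.24,1642.09,50000.00",
    "161421686113451004,20000.00,10821.45,20000.00",
    "161731799840912249,85000.24,411.16,50000.00",
]
columns = rows_to_columns(rows)
```

Carrying the row index through the shuffle is what allows each reducer to rebuild its column in the original row order, so that the columns stay aligned with one another.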
S202, a screening step. The unscreened data matrix is screened: the columns containing the specified data are selected to form the original data matrix, and the remaining data forms the auxiliary data matrix. The fact table shown in Table 1 is taken as an example. After the mapping step, it forms the unscreened data matrix shown in Table 8:
TABLE 8
161340511092313455 195223.24 1642.09 50000.00
161421686113451004 20000.00 10821.45 20000.00
161731799840912249 85000.24 411.16 50000.00
After screening, the unscreened data matrix is divided into the original data matrix (see Table 2 above) and the auxiliary data matrix (see Table 9 below). The original data matrix contains the specified data, namely the three dimensions of asset information: "financing amount", "credit card bill" and "treasury amount". The auxiliary data matrix contains the identification data, as shown in Table 9:
TABLE 9
161340511092313455
161421686113451004
161731799840912249
In other embodiments, the screening step applies the following principles to distinguish the specified data from the auxiliary data:
Determine the data type: continuous-valued data or discrete categorical data.
Continuous-valued data is classified as specified data. If it is time data, it is converted into timestamps, the minimum value is taken as 0, every other value is stored as its difference from the minimum, and the result is typed as double. If it is numeric data, the original values are kept.
For discrete data, determine whether the field contains more than 65536 distinct values.
Discrete data whose field contains no more than 65536 distinct values is converted into a one-hot encoding and classified as specified data.
Discrete data whose field contains more than 65536 distinct values is classified as auxiliary data, and the original field content is retained. Such fields are mostly unique identifiers, such as a card number, a customer number or an internal log number.
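The screening principles above can be sketched as one function per column. The function name screen_column and the caller-supplied kind label are assumptions: the text does not say how the data type of a field is detected, so the sketch leaves that decision to the caller.

```python
from datetime import datetime, timezone

MAX_ONE_HOT_VALUES = 65536  # threshold stated in the screening principles


def screen_column(values, kind):
    """Return (role, converted_values) for one column.

    kind is "time", "numeric" or "discrete" (a hypothetical label supplied
    by the caller); role is "specified" or "auxiliary".
    """
    if kind == "time":
        stamps = [v.timestamp() for v in values]
        base = min(stamps)
        # The minimum becomes 0; the others keep their offset as a float (double).
        return "specified", [float(s - base) for s in stamps]
    if kind == "numeric":
        return "specified", list(values)  # original values are kept
    if kind != "discrete":
        raise ValueError(f"unknown kind: {kind!r}")
    categories = sorted(set(values))
    if len(categories) > MAX_ONE_HOT_VALUES:
        return "auxiliary", list(values)  # e.g. card numbers; content retained
    position = {c: i for i, c in enumerate(categories)}
    one_hot = [
        [1.0 if position[v] == i else 0.0 for i in range(len(categories))]
        for v in values
    ]
    return "specified", one_hot


times = [
    datetime(2020, 7, 14, 0, 1, 0, tzinfo=timezone.utc),
    datetime(2020, 7, 14, 0, 0, 0, tzinfo=timezone.utc),
]
time_role, time_values = screen_column(times, "time")
grade_role, grade_values = screen_column(["gold", "basic", "gold"], "discrete")
```

A one-hot encoded field expands into one column per distinct value, so a field with k distinct values contributes k dimensions to the original data matrix.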
S203, a spatial transformation step. The specified data of each dimension is transformed column by column, where the transformation is a stretching transformation, a shrinking transformation or a warping transformation. The transformation of each dimension is processed by an independent reducer, and the transformed data matrix is obtained after reduction. After normalization, the difference between each value in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
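The 5% condition can be checked mechanically once a normalization is fixed. The text does not state which normalization is meant, so the sketch below assumes per-column min-max normalization and compares the normalized values by absolute difference; the function names are hypothetical as well.

```python
def min_max_normalize(column):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def within_tolerance(original, transformed, tolerance=0.05):
    """True if every normalized value moved by less than the tolerance."""
    norm_original = min_max_normalize(original)
    norm_transformed = min_max_normalize(transformed)
    return all(
        abs(a - b) < tolerance
        for a, b in zip(norm_original, norm_transformed)
    )


uniform_scaling_ok = within_tolerance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
distorted_ok = within_tolerance([1.0, 2.0, 3.0], [1.0, 2.9, 3.0])
```

Under this assumed normalization, a uniform scaling of a column leaves the normalized values unchanged, so only the random spread of the coefficients contributes to the measured difference.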
In one embodiment of the present invention:
The stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of random amplification coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain the stretched column.
The shrinking transformation includes:
generating a column of random contraction coefficients, wherein the number of random contraction coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain the shrunk column.
The warping transformation includes:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of random additional coefficients equals the number of specified data items in the corresponding column, and the coefficients follow a normal distribution;
applying the Sigmoid function to the column of specified data, and combining the result with the corresponding random additional coefficients to obtain the warped column.
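The three transformations above might be sketched as follows, using only the Python standard library. Several details are assumptions because the text leaves them open: the means and standard deviations of the normal distributions, the scaling of the column before the Sigmoid function is applied (without it, large amounts all saturate at 1), and the choice to combine the Sigmoid output with the additional coefficient by multiplication.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normal_coefficients(n, mean, std, rng):
    """One normally distributed coefficient per data item in the column."""
    return [rng.gauss(mean, std) for _ in range(n)]


def stretch(column, rng, mean=1.02, std=0.005):
    """Multiply by amplification coefficients (assumed mean above 1)."""
    coeffs = normal_coefficients(len(column), mean, std, rng)
    return [v * c for v, c in zip(column, coeffs)]


def shrink(column, rng, mean=0.98, std=0.005):
    """Multiply by contraction coefficients (assumed mean below 1)."""
    coeffs = normal_coefficients(len(column), mean, std, rng)
    return [v * c for v, c in zip(column, coeffs)]


def warp(column, rng, mean=1.0, std=0.005):
    """Apply the Sigmoid to the scaled column, then the additional coefficients."""
    scale = max(abs(v) for v in column) or 1.0  # assumed pre-scaling
    coeffs = normal_coefficients(len(column), mean, std, rng)
    return [sigmoid(v / scale) * c for v, c in zip(column, coeffs)]


rng = random.Random(0)
financing = [195223.24, 20000.00, 85000.24]
stretched = stretch(financing, rng)
shrunk = shrink(financing, rng)
warped = warp(financing, rng)
```

Because the Sigmoid function is monotonic, the warped column keeps the ordering of the original values up to the small random perturbation, which is the sense in which relative position is preserved.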
The stretch, shrinking and warping transformations in step S203 follow the same operation rules as the corresponding transformations in step S102 described above, which are not repeated here. The difference in step S203 is that, when executing on the distributed framework, the transformation of each dimension is processed by an independent reducer, and the transformed data matrix is obtained after reduction. Because each dimension uses its own reducer, each dimension can have its own transformation mode: the dimensions may all use the same transformation or may use different ones. In one embodiment, each reducer generates one column of the transformation matrix according to the transformation of the specified data of its dimension, and multiplies that column with the corresponding column of specified data in the original data matrix to obtain the corresponding column of the transformed data matrix.
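The per-dimension independence described above can be illustrated with one worker per column. In this sketch a thread pool stands in for the reducers of a real Map/Reduce job, the multiplication is assumed to be element-wise, and the names reduce_dimension and transform_matrix and the per-column (mean, std) settings are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def reduce_dimension(task):
    """One reducer: build this dimension's transformation column and apply it."""
    column, mean, std, seed = task
    rng = random.Random(seed)
    transform_column = [rng.gauss(mean, std) for _ in column]
    # Element-wise product of the transformation column and the data column.
    return [v * t for v, t in zip(column, transform_column)]


def transform_matrix(columns, settings):
    """Run one independent reducer per dimension; settings holds (mean, std) pairs."""
    tasks = [
        (column, mean, std, seed)
        for seed, (column, (mean, std)) in enumerate(zip(columns, settings))
    ]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(reduce_dimension, tasks))


original = [
    [195223.24, 20000.00, 85000.24],  # financing amount
    [1642.09, 10821.45, 411.16],      # credit card bill
    [50000.00, 20000.00, 50000.00],   # treasury amount
]
# A different transformation per dimension: stretch, shrink, stretch.
settings = [(1.02, 0.005), (0.98, 0.005), (1.01, 0.005)]
transformed = transform_matrix(original, settings)
```

Since no reducer reads another dimension's data, the number of reducers can grow with the number of dimensions without any coordination between them.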
S204, a data merging step. The data in the auxiliary data matrix is encrypted, and the encrypted auxiliary data matrix is merged with the transformed data matrix to generate the desensitized data matrix. Because the auxiliary data does not participate in subsequent data mining and data analysis, it may be encrypted or desensitized in a lossy manner, for example with the aforementioned MD5 algorithm, or with the random value replacement or special character replacement of conventional desensitization. The transformed data matrix stores the specified data after spatial transformation: the specified data is desensitized by a lossless or slightly lossy spatial transformation, while the information it carries is retained and remains usable for data analysis and data mining. The desensitized data matrix may be saved as a file as required, for example as a column storage file in Parquet format.
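A minimal sketch of the merging step is shown below, assuming MD5 is the lossy method chosen for the auxiliary columns and that both matrices are held as lists of columns. The function name merge_desensitized is hypothetical, and writing the result to a Parquet file is left out.

```python
import hashlib


def merge_desensitized(auxiliary_columns, transformed_columns):
    """Hash the auxiliary columns, then join all columns back into rows."""
    encrypted = [
        [hashlib.md5(str(v).encode("utf-8")).hexdigest() for v in column]
        for column in auxiliary_columns
    ]
    columns = encrypted + transformed_columns
    if len({len(column) for column in columns}) > 1:
        raise ValueError("all columns must have the same number of rows")
    # zip(*columns) turns the column-wise layout into one tuple per row.
    return [list(row) for row in zip(*columns)]


auxiliary = [["161340511092313455", "161421686113451004"]]
transformed = [[199127.70, 20400.00], [1609.25, 10605.02]]
desensitized = merge_desensitized(auxiliary, transformed)
```

The row-count check matters because the two matrices are produced by separate steps; a mismatch would silently misalign identifiers and amounts if zip were allowed to truncate.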
The big data desensitization method of the invention desensitizes sensitive data by spatial transformation. The relative spatial position information of the desensitized data is retained, and from the viewpoint of data mining and data analysis the data loss caused by the spatial transformation is less than 5%, so subsequent data processing is not affected. The method can also be applied to a distributed framework to meet the big data processing needs of distributed systems.
It should also be noted that the embodiments described above are only specific embodiments of the present invention. The present invention is clearly not limited to these embodiments, and similar changes or modifications that those skilled in the art can readily derive from the disclosure of the present invention shall fall within its scope of protection. The embodiments described above are provided to enable persons skilled in the art to make or use the invention, and persons skilled in the art may modify or vary them without departing from the inventive concept of the present invention. The scope of protection of the present invention is therefore not limited by the embodiments described above, but should be accorded the widest scope consistent with the innovative features set forth in the claims.

Claims (9)

1. A big data desensitization method, wherein desensitizing specified data in a multidimensional fact table comprises:
an initialization step, namely reading specified data in a multi-dimensional fact table and arranging the specified data into a data matrix, wherein each column in the data matrix corresponds to a dimension, and the data matrix is an original data matrix;
a spatial transformation step of transforming the specified data of each dimension by columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation to obtain a transformed data matrix;
after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%.
2. The big data desensitization method of claim 1,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of the random amplification coefficients is the same as the number of the designated data in the corresponding column, and the column of random amplification coefficients meet normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain a column subjected to stretch transformation;
the shrinking transformation includes:
generating a column of random contraction coefficients, wherein the number of the random contraction coefficients is the same as the number of the specified data in the corresponding column, and the column of random contraction coefficients follows a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a column subjected to shrinking transformation.
3. The big data desensitization method of claim 1, wherein the warping transformation comprises:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of the random additional coefficients is the same as the number of the specified data in the corresponding column, and the column of random additional coefficients follows a normal distribution;
and applying the Sigmoid function to the column of specified data, wherein the result and the corresponding random additional coefficients jointly yield a column subjected to warping transformation.
4. A big data desensitization method according to claim 1, wherein a transformation matrix is generated based on the transformation of the specified data for each dimension, each column in the transformation matrix corresponding to a dimension of the specified data, and the transformed data matrix is obtained by multiplying the transformation matrix by the original data matrix.
5. The big data desensitization method of claim 1, further comprising: the remaining data in the multi-dimensional fact table, except for the specified data, is encrypted.
6. A distributed big data desensitization method is characterized in that desensitization is carried out on specified data in a multi-dimensional fact table under a distributed framework, and the method comprises the following steps:
a mapping step, wherein a mapper reads data in the multi-dimensional fact table and arranges the data into a data matrix, each column in the data matrix corresponds to one dimension, each dimension uses an independent mapper, and the data matrix is an unscreened data matrix;
a screening step, namely screening the unscreened data matrix, selecting the columns where the specified data is located to form an original data matrix, and forming an auxiliary data matrix from the remaining data other than the specified data;
a spatial transformation step, namely transforming the specified data of each dimension according to columns, wherein the transformation comprises stretching transformation, shrinking transformation or warping transformation, the transformation of each dimension is processed by using an independent reducer, and a transformed data matrix is obtained after reduction, wherein after normalization processing, the difference between the value of each data in the transformed data matrix and the corresponding value in the original data matrix is less than 5%;
and a data merging step, namely encrypting the data in the auxiliary data matrix, and merging the encrypted auxiliary data matrix and the transformed data matrix to generate a desensitized data matrix.
7. The distributed big data desensitization method of claim 6,
the stretch transformation includes:
generating a column of random amplification coefficients, wherein the number of the random amplification coefficients is the same as the number of the designated data in the corresponding column, and the column of random amplification coefficients meet normal distribution;
multiplying the column of specified data by the column of random amplification coefficients to obtain a column subjected to stretch transformation;
the shrinking transformation includes:
generating a column of random contraction coefficients, wherein the number of the random contraction coefficients is the same as the number of the specified data in the corresponding column, and the column of random contraction coefficients follows a normal distribution;
multiplying the column of specified data by the column of random contraction coefficients to obtain a column subjected to shrinking transformation.
8. The distributed big data desensitization method of claim 6, wherein the warping transform comprises:
generating a Sigmoid function;
generating a column of random additional coefficients, wherein the number of the random additional coefficients is the same as the number of the specified data in the corresponding column, and the column of random additional coefficients follows a normal distribution;
and applying the Sigmoid function to the column of specified data, wherein the result and the corresponding random additional coefficients jointly yield a column subjected to warping transformation.
9. The distributed big data desensitization method of claim 6, wherein the reducer generates a column of a transformation matrix according to the transformation of the specified data of each dimension, and the reducer multiplies the column of the transformation matrix with the corresponding column of specified data in the original data matrix to obtain the corresponding column of data in the transformed data matrix.
CN202010675130.0A 2020-07-14 2020-07-14 Big data desensitization method Pending CN111814187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675130.0A CN111814187A (en) 2020-07-14 2020-07-14 Big data desensitization method

Publications (1)

Publication Number Publication Date
CN111814187A true CN111814187A (en) 2020-10-23

Family

ID=72842436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010675130.0A Pending CN111814187A (en) 2020-07-14 2020-07-14 Big data desensitization method

Country Status (1)

Country Link
CN (1) CN111814187A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination