MXPA06000549A

MXPA06000549A - Method and system for obfuscating data structures by deterministic natural data substitution

Info

Publication number: MXPA06000549A
Application number: MXPA/A/2006/000549A
Authority: MX
Inventors: E Fay Jonathan
Original assignee: Microsoft Corporation*
Priority date: 2005-02-07
Filing date: 2006-01-13
Publication date: 2006-10-17

Abstract

A method and system create an obfuscated data structure from a data structure. First, the system operates on a first data structure whose obfuscation is desired, and creates a data string based on a portion of the first data structure. Next, based on the data string, a second data structure is deterministically generated from a third data structure, and the second data structure replaces the first data structure.

Description

METHOD AND SYSTEM FOR OBFUSCATING DATA STRUCTURES BY DETERMINISTIC NATURAL DATA SUBSTITUTION

TECHNICAL FIELD OF THE INVENTION

Embodiments of the present invention relate to the field of obfuscation of data structures. More particularly, but not by way of limitation, embodiments of the present invention provide a new and useful method and system for replacing data values in a data structure with pseudo-random data values, generated in a deterministic manner, that mirror the distribution of data values in the data structure.

BACKGROUND OF THE INVENTION

Many companies maintain databases that include information about customers or employees. The information may include names, addresses, telephone numbers, social security numbers, company names, wages and purchase histories. For example, an Internet sales company may have a customer database that includes names, phone numbers, payment methods and customer purchase histories. As another example, a payroll department may hold wage information about its employees. Due to the sensitive nature of some of this information, such as payment methods, social security numbers and salaries, access is typically restricted to a relatively small group within the company.

As is common with software applications, problems may arise that require resolution by computer programmers. When problems occur with software applications operating on a database that holds sensitive information, programmers may require access to the sensitive database to resolve the problem. This can lead to sensitive information being seen by people who normally do not have access to it. In the payroll example, the distribution of salary information can cause internal problems in the company with respect to wage discrepancies. In the Internet sales example, the distribution of payment methods and other personal information, such as social security numbers, can lead to identity theft.
However, to resolve a malfunction of the software application efficiently, programmers need access to the real data and, in particular, to the distribution of the real data (geographical distribution, name distributions, etc.). It is known in the art to obfuscate databases by random data substitution so that a test database is generated. However, random data substitution does not produce the distribution of real data found in natural databases. A method and system are needed to obfuscate at least portions of databases to produce test databases with data distributions that mirror the distributions found in real databases.

BRIEF DESCRIPTION OF THE INVENTION

The embodiments of the present invention provide a method for obfuscating data, or rendering it unintelligible, by deterministic natural data substitution. In addition, the embodiments of the present invention may have several practical applications in the art including, but not limited to, deterministically replacing confidential data with data that appears natural. The replacement data mirrors the patterns found in the original data in terms of data distribution, but does not reveal the original confidential data.

In one embodiment, a method for obfuscating data is provided. The method comprises operating on a first data structure whose obfuscation is desired, and creating a data string based on a portion of the first data structure. Based on the data string, a second data structure is generated deterministically from a third data structure, and the second data structure replaces the first data structure.

In another embodiment, a method for constructing a test data structure is provided. The method comprises operating on a source data structure having several types of data fields, where each of the data fields includes several rows of data, and determining an identifier for each row of data.

Next, for each row of data, the method generates a data string based on the identifier, maps a portion of the data string to a value in a reference data structure, and populates a test data structure with the value mapped in the reference data structure.

In yet another embodiment, a computer-readable medium is provided that has computer-usable instructions for performing a method for generating a synthetic data structure. The method comprises, first, providing a reference data structure and a source data structure, where each data structure has several types of data fields and each type of data field includes rows of data values. Next, the method comprises assigning a weighted value to each row of data values in the source data structure according to a predetermined pattern and deriving a respective data string for each row of data values from the source data structure. For each row of data values in the source data structure, each data value in the row is mapped to a data value in the rows of data values in the reference data structure based on the weighted value, the respective data string and the type of data field. Finally, the synthetic data structure is populated with the mapped data value of the reference data structure. Further aspects are described in more detail below.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

The embodiments of the present invention are described in detail below with reference to the Figures of the attached drawing, which are incorporated in their entirety by reference herein and in which: Figure 1A is a system diagram that illustrates an exemplary ordering process. Figure 1B is a flow diagram illustrating an overview of one embodiment of a method for obfuscating a data structure. Figure 2 is a flow diagram illustrating in more detail one embodiment of a process for generating a data string. Figure 3 is an exemplary data string.
Figure 4 is a sample data structure that is to be obfuscated. Figure 5 is a flow diagram illustrating in more detail one embodiment of a process for obfuscating a data structure. Figure 6 is an example obfuscated data structure derived from the data structure of Figure 4. Figure 7 is a flow diagram illustrating in still greater detail another embodiment of a process for obfuscating a data structure. Figure 8 is a diagram illustrating various types of data fields to which weighted values can be assigned.

DETAILED DESCRIPTION OF THE INVENTION

The embodiments of the present invention provide a novel method and system for obfuscating data values in a first data structure by deterministically generating a single data string for each row of data values in the first data structure, using the data string to map each data value in the row of the first data structure to a data value in a reference data structure, and creating a second data structure based on the data values mapped in the reference data structure. The deterministic method and system give reproducible results, so that a row of data values in the first data structure correlates with the same row of data values in the second data structure for each instance of obfuscation of the first data structure. In addition, the novel method and system illustrated in the various embodiments of the present invention can, in various embodiments, assign weighted values to certain types of data values in the first data structure to create a second data structure that substantially approximates the distribution of data values in the first data structure. Thus, the second data structure appears random, which is useful for testing and troubleshooting software applications that operate on the first data structure.

The embodiments of the present invention will be better understood from the detailed description provided below and from the accompanying drawings of various embodiments of the invention. However, the detailed description and drawings should not be construed as limiting the invention to the specific embodiments. Rather, these specifics are provided for example purposes to help the invention be better understood. Specific hardware devices, programming languages, components, processes and numerous details, including operating environments and the like, are set forth to provide a complete understanding of the present invention.
In other instances, structures, devices and processes are shown in block diagram form, rather than in detail, to avoid obscuring the embodiments of the present invention. A person of ordinary skill in the art will nevertheless understand that the embodiments of the present invention can be practiced without these specific details. Computer systems, servers, workstations and other machines can be connected to each other through a communications medium including, for example, a network or a network of networks. In addition, the illustrative data structures used to explain the various embodiments of the present invention may be, but are not limited to, databases, spreadsheets and any other devices capable of serving as a storage medium.

Turning now to Figure 1A, a system diagram is shown of a process 10 of an example ordering system that uses the data obfuscation method illustrated in greater detail in Figures 1B through 8. The process 10 begins in a step 14, where a customer service agent receives an order from a customer. The order can be received through an e-commerce web site, by phone, or in person. In a step 16, the process 10 retrieves the customer information from a data structure 12 comprising a customer data structure 12A and a stock availability data structure 12B. The customer data structure 12A may include customer information such as address, telephone number, company, social security number and past payment methods used by the customer. In a step 18, the process 10 creates an invoice based on the available stock in the stock data structure 12B and the shipping information from the customer data structure 12A. In a step 20, the order is shipped to the customer based on the invoice created in step 18.
In a step 22, the process is completed if the order is properly received by the customer. However, if the order is not received properly, such as where the wrong order is shipped or the correct order is shipped to the wrong customer, a software application used by process 10 must be debugged to determine the root cause of the malfunction. In a step 24, a test data structure 13 is created using data values from the customer data structure 12A and the stock data structure 12B. The software application used by process 10 is debugged in a step 26 using the test data structure 13. It is desirable that the sensitive information included in the customer data structure 12A not be distributed outside the limited group of people who require access to the information in data structure 12A. By using a deterministic method to obfuscate the information in data structure 12A, the test data structure 13 can be generated with data that looks natural and keeps the customer's information confidential. Because a deterministic function is used to generate the data values in test data structure 13, a data entry in test data structure 13 can be traced back to a data value in data structure 12A to locate the source of the problem in the software application used by process 10.

Turning to Figure 1B, one embodiment of a method 100 is illustrated for creating a second, or test, data structure from a first, or source, data structure that is to be obfuscated. Figure 4 illustrates an example source data structure 400 having columns 410 to 420 and rows 422 to 430. Data structure 400 includes columns of data field types. In the example data structure 400, a column is provided to designate an ID number for each row. The types of data fields included in data structure 400 include first names, last names, companies, genders and telephone numbers.
The data structure 400, in some embodiments, may encompass other types of data fields, such as age and ethnicity. Returning to Figure 1B, the obfuscation method 100 includes a step 110 in which a data string is generated for a row of the data structure to be obfuscated. For example, row 422 of data structure 400 includes an ID number "0001" that is operated on to generate the data string. The process of generating a data string in step 110 is discussed below with reference to Figure 2.

Continuing with the obfuscation method 100, in a step 112 the type of data field of the first data value in data structure 400, such as an address field or a name field, is determined. For example, the data value "Chris" in row 422 is of the "first name" data field type designated by column 412. In a step 114, the data value "Chris" at row 422 and column 412 is retrieved. In a step 116, the data value "Chris" is obfuscated based on the data field type and the data string using a third, or reference, data structure or data structures (not shown), and a corresponding test data structure is created. The test data structure includes the obfuscated data value from the reference data structure that corresponds to the data value "Chris". In a step 118, if there are more columns in the data structure to be obfuscated, such as the "last name" column 414, the "company" column 416, the "gender" column 418 and the "telephone number" column 420 of data structure 400, then steps 112 to 116 are repeated. After each column has been obfuscated, the method 100 moves to the next row in a step 120, for example row 424 of data structure 400. If there are more rows, a data string is generated in step 110 and method 100 repeats steps 112 to 118. However, if there are no more rows in the data structure to be obfuscated, method 100 is complete.
A second, or test, data structure, such as the test data structure 13 in Figure 1A, has been created, and the confidential information in the source data structure has been obfuscated.

Turning now to Figure 2, the process of generating a data string in step 110 of Figure 1B is illustrated in more detail. The process of step 110 includes a step 110A of determining an identifier of a row of data values in the source data structure. In data structure 400, the "ID" column 410 can be used as an identifier. An example of a deterministic function is the MD5 hashing algorithm (Message-Digest algorithm 5). MD5 is a widely used cryptographic hash function with a 128-bit hash value. The 128-bit MD5 hashes are typically represented as 32-digit hexadecimal numbers. With the MD5 function, even a small change in the input message results in a completely different output message, or hash. The MD5 algorithm is further described in Internet Engineering Task Force (IETF) Request for Comments (RFC) 1321, which is incorporated herein by reference. The deterministic MD5 algorithm is used herein for illustrative purposes only. Various embodiments of the present invention may use other deterministic functions, such as, but not limited to, SHA-1 and RIPEMD-160.

Continuing with Figure 2, in a step 110C, portions of the data string, that is, of the output of the deterministic function, are assigned to types of data fields in the source data structure. The data structure 400 of Figure 4 comprises several types of data fields, namely "ID" 410, "first name" 412, "last name" 414, "company" 416, "gender" 418 and "telephone number" 420. Referring to Figure 3, an example generic data string 300 is illustrated. Data string 300 comprises five portions that match the five types of data fields in data structure 400.
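As an illustration only, the derivation of a data string from a row identifier might be sketched as follows. The sketch assumes Python's standard `hashlib` module and a hypothetical helper name, `make_data_string`; the description names MD5 only as one example of a deterministic function and does not prescribe any particular implementation.

```python
import hashlib


def make_data_string(row_id: str) -> str:
    """Return the MD5 digest of a row identifier as 32 hexadecimal digits.

    Hypothetical helper: the same identifier always yields the same
    data string, which is what makes the obfuscation reproducible.
    """
    return hashlib.md5(row_id.encode("utf-8")).hexdigest()


# The identifier "0001" is the ID value of row 422 in the example.
first = make_data_string("0001")
again = make_data_string("0001")
other = make_data_string("0002")

print(len(first))      # 32 hexadecimal digits, i.e. 128 bits
print(first == again)  # True: the function is deterministic
print(first == other)  # False: a small input change gives a different string
```

Any other deterministic function with enough output bits could be substituted, for example `hashlib.sha1`.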
In one embodiment of step 110C, a portion 312 of the data string 300 can be assigned to column 412 of data structure 400, which comprises the "first name" data field type, and a portion 314 of the data string 300 can be assigned to column 414, which comprises the "last name" data field type. Similarly, portion 316 can be assigned to column 416, portion 318 can be assigned to column 418, and portion 320 can be assigned to column 420 of data structure 400. Although, in this example, portions of data string 300 are assigned in blocks to the types of data fields in data structure 400, in other embodiments of the present invention portions of data string 300 may be assigned in non-contiguous groups to the several columns of data field types in data structure 400.

Turning now to Figure 5, the process of obfuscating data values from the source data structure described in step 116 of Figure 1B is illustrated in greater detail. In a step 116A, a portion of the data string that was generated in step 110 of Figure 1B, and explained in more detail in Figure 2, is retrieved. For purposes of illustration, data string 300 of Figure 3 is retrieved in step 116A. In a step 116B, a portion of data string 300 corresponding to a type of data field in, for example, data structure 400 is mapped to a data value of a corresponding data field type in a reference data structure (not shown). The reference data structure can be, for example, census information that includes, among other information, first and last names, addresses, gender, age, telephone numbers, social security numbers and ethnicity. In addition, in other embodiments of the present invention, the reference data structure may be a single data structure or a compilation of data structures, each including data values corresponding to a type of data field. In a step 116C, the data value mapped in the reference data structure is retrieved to create a synthetic, or test, data structure.
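The assignment of block portions of the data string to field types, and the mapping of each portion to a reference value, can be sketched as follows. This is a minimal sketch under stated assumptions: the reference lists are invented placeholders rather than census data, the equal-width split of the hexadecimal digits is only one possible block assignment, and reducing each portion modulo the list length is an assumed mapping rule that the description does not specify.

```python
import hashlib

# Field types of the example source data structure, excluding the ID column.
FIELDS = ["first name", "last name", "company", "gender", "telephone number"]

# Hypothetical reference data; a real reference structure might be census data.
REFERENCE = {
    "first name": ["Alex", "Maria", "John", "Wei"],
    "last name": ["Smith", "Garcia", "Lee", "Brown"],
    "company": ["Meridian Ltd", "Maple Corp", "Zenith Inc"],
    "gender": ["F", "M"],
    "telephone number": ["555-0100", "555-0101", "555-0102"],
}


def split_data_string(data_string: str, fields: list) -> dict:
    """Assign one contiguous block of hex digits to each field type."""
    width = len(data_string) // len(fields)  # 32 // 5 = 6; two digits go unused
    return {
        field: data_string[i * width:(i + 1) * width]
        for i, field in enumerate(fields)
    }


def obfuscate_row(row_id: str) -> dict:
    """Map each portion of the row's data string to a reference value."""
    data_string = hashlib.md5(row_id.encode("utf-8")).hexdigest()
    portions = split_data_string(data_string, FIELDS)
    # Assumed mapping rule: portion value modulo the reference list length.
    return {
        field: REFERENCE[field][int(portion, 16) % len(REFERENCE[field])]
        for field, portion in portions.items()
    }


print(obfuscate_row("0001") == obfuscate_row("0001"))  # True: reproducible
```

Because the original data values are never read, only the row identifier, the synthetic row reveals nothing about the confidential values it replaces.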
An exemplary synthetic data structure is illustrated by data structure 600 in Figure 6. Synthetic data structure 600 comprises the same number of columns and data types as the source data structure 400 of Figure 4 that is to be obfuscated, and comprises substantially the same data as in data structure 400 of Figure 4. The deterministic function is used in each instance in which a row of data values from a source data structure is mapped to a reference data structure to generate a row of data values in the synthetic data structure. There is a reproducible relationship between a given row of data values in the source data structure and the corresponding row of obfuscated data values in the synthetic data structure. In other words, with reference to Figures 4 and 6, row 422 of Figure 4 corresponds to row 622 of Figure 6 for each obfuscation of the source data structure 400. As previously mentioned in relation to Figure 1A, this reproducibility allows multiple debugging sessions of the software application used in the ordering and shipping process without losing the relationship between the data values in the customer data structure 12A and the test data structure 13.

As previously discussed, a portion of data string 300 of Figure 3 is used to map a value in data structure 400 to a value in the reference data structure (not shown). For example, an age data type may correspond to bits 22 to 27 of data string 300, and a first name and a last name may be mapped using the least significant 11 bits of data string 300. In an example of selecting first and last names, the 65,000 most popular first and last names in the United States can be downloaded from the Census Bureau. To select a first and last name from the Census Bureau list of 65,000, a certain number of bits of data string 300 is needed.
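Selecting a range of bits from the data string can be sketched as below. The helper names are hypothetical, and the bit numbering (bit 0 as the least significant bit of the 128-bit value) is an assumption, since the description does not state which end of data string 300 the bit positions are counted from.

```python
import hashlib


def data_string_as_int(row_id: str) -> int:
    """Return the 128-bit MD5 digest of a row identifier as an integer."""
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def bit_field(value: int, low: int, high: int) -> int:
    """Return bits low..high of value (inclusive) as an integer.

    Assumed convention: bit 0 is the least significant bit.
    """
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


value = data_string_as_int("0001")
age_bits = bit_field(value, 22, 27)  # six bits, so a value from 0 to 63
name_bits = bit_field(value, 0, 10)  # least significant 11 bits, 0 to 2047

print(0 <= age_bits < 64, 0 <= name_bits < 2048)  # True True
```

An n-bit field can address at most 2^n reference entries, so 16 bits suffice for a list of 65,000 names (2^16 = 65,536).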
For example, 16 individual bits of the data string 300 can be selected and grouped together for the last name, and another 12 bits of the data string 300 can be grouped for the first name. Although 12 and 16 bits are selected in this example, other numbers of bits can be selected. For example, if the binary number formed by the bits for the first name equals two, the second entry in the reference data structure is selected. The first name and last name chosen from the list of 65,000 are then inserted into the structure 500 of the synthetic database. Similarly, when an address is used, a portion of the data string 300 may be used to choose an address. For example, if the portion of the data string 300 selected for an address equals 192, the 192nd entry in an address reference database is selected and inserted into the synthetic data structure 600 of Figure 6.

Referring now to Figure 7, another embodiment for obfuscating data values from the source data structure described in step 116 of Figure 1B is illustrated. In a step 116D, a weighted value can be assigned to certain types of data. For example, company names that begin with the letter "m" may occur more frequently than company names that begin with the letter "z". A weighted algorithm can be applied in conjunction with the deterministic function to simulate the actual distribution of company names in a population. Referring to Figures 4 and 6 in combination, companies that begin with the letter "m" occur more frequently in the "company" data field type of data structure 400 than companies that begin with the letter "z". Similarly, the distribution of companies with names beginning with the letter "m" and companies with names beginning with the letter "z" is the same or similar in the "company" data field type of data structure 600 of Figure 6. Similar weighted values can be given, as illustrated in Figure 8, for other types of data fields.
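One way a weighted algorithm could be combined with the deterministic function is sketched below. The company names, the weights and the cumulative-weight lookup are all assumptions made for illustration; the description does not specify how the weighted values are applied. Here the selected bits are scaled onto the total weight, so entries with larger weights are chosen for proportionally more bit patterns.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate

# Hypothetical reference entries with made-up weights: names starting with
# "m" are given more weight than the name starting with "z".
COMPANIES = ["Meridian Ltd", "Maple Corp", "Zenith Inc"]
WEIGHTS = [6, 3, 1]


def weighted_pick(bits: int, bit_count: int, values: list, weights: list):
    """Deterministically pick a value in proportion to its weight.

    bits is an integer in [0, 2**bit_count) taken from the data string.
    """
    cumulative = list(accumulate(weights))  # [6, 9, 10]
    total = cumulative[-1]
    point = bits * total // (1 << bit_count)  # scaled into [0, total)
    return values[bisect_right(cumulative, point)]


def company_for_row(row_id: str) -> str:
    """Use the first 16 bits of the row's MD5 digest to pick a company."""
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    bits = int.from_bytes(digest[:2], "big")
    return weighted_pick(bits, 16, COMPANIES, WEIGHTS)


print(weighted_pick(0, 16, COMPANIES, WEIGHTS))      # Meridian Ltd
print(weighted_pick(40000, 16, COMPANIES, WEIGHTS))  # Maple Corp
print(weighted_pick(65535, 16, COMPANIES, WEIGHTS))  # Zenith Inc
```

Because the digest bits are close to uniformly distributed, roughly six in ten rows would receive the first entry under these weights, while any given row always receives the same entry.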
Weighted values 814 can be assigned to gender 810, age 812, first and last names 816 and ethnicity 818. Referring again to Figure 7, method 116 continues with a step 116E, where a portion of the data string 300 is retrieved, and is mapped to values in a reference data structure in a step 116F. A synthetic data structure is then generated with the mapped values of the reference data structure in a step 116G. Although, in one embodiment, a reference data structure may comprise all of the data values and data types included in a data structure to be obfuscated, other embodiments may comprise several reference data structures, one for each type of data included in the data structure to be obfuscated. Certain embodiments of the present invention can use weighted algorithms to accurately reproduce the distributions of data types in a population. The use of weighted algorithms depends on the desired accuracy of the obfuscated information or on the accuracy of the distributions in the reference data structure.

The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments that do not depart from its scope will be apparent to those skilled in the art. There are many alternative embodiments, but they are not included due to the nature of this invention. A skilled programmer can develop alternative means of implementing the aforementioned improvements without departing from the scope of the present invention. It will be understood that certain aspects and sub-combinations are useful and may be employed without reference to other aspects and sub-combinations, and are contemplated within the scope of the claims. Not all the steps listed in the various Figures need to be carried out in the specific order described. Not all the steps of the flow diagrams mentioned above are necessary steps.

Claims

CLAIMS

1. A method of information obfuscation, comprising: operating on a first data structure whose obfuscation is desired; creating a data string based on a portion of said first data structure; based on said data string, deterministically generating a second data structure from at least a third data structure; and replacing said first data structure with said second data structure.
2. The method of claim 1, wherein said first data structure comprises one or more rows and one or more columns of data values, and may comprise an identifier for each of said one or more rows of data values.
3. The method of claim 2, further comprising generating said data string based on a unique identifier, said data string being an output of a deterministic function.
4. The method of claim 2, wherein said second and at least a third data structure comprise one or more rows and one or more columns of data values, and each of said one or more columns in said second and at least a third data structure corresponds to types of data values in said one or more columns of said first data structure.
5. The method of claim 2, further comprising: assigning a weighted value to various types of data values in each of said one or more rows of said first data structure; and populating said second data structure with data values of said at least one third data structure based on said assigned weighted values of said first data structure.
6. The method of claim 2, wherein assigning a weighted value further comprises assigning a weighted value according to occurrences in a population of said types of data values in each of said one or more rows of said first data structure so that corresponding data values in said second data structure match patterns naturally found in a real population.
7. A computer-readable medium having computer-executable instructions for performing the method of claim 1.
8. A computer software product, comprising code for performing a method as defined in claim 1.
9. A method for constructing a test data structure, comprising: operating on a source data structure having one or more types of data fields, wherein each of said one or more data fields includes one or more rows of data; determining an identifier for each of said one or more rows of data; for each of said one or more rows of data, performing the following: a) generating a data string based on said identifier; b) based on said data field type, mapping a portion of said data string to a data value in at least one reference data structure; and c) populating said test data structure with said value mapped in said at least one reference data structure.
10. The method according to claim 9, wherein said data string is an output of a deterministic function.
11. The method according to claim 9, further comprising: assigning a weighted value to information in said one or more types of data fields in said source data structure; and populating said test data structure with said data value of said at least one reference data structure based on said weighted values.
12. The method according to claim 11, wherein assigning a weighted value further comprises assigning a weighted value according to the occurrences in a data population in said one or more types of data fields in said source data structure so that the corresponding data in said test data structure substantially approximates a real population.
13. The method of claim 12, further comprising: assigning a portion of said data string to each of said one or more types of data fields; and based on said portion and corresponding weighted values, locating in said at least one reference data structure said mapped value.
14. A computer-readable medium having computer-executable instructions for performing the method of claim 9.
15. A computer software product, comprising code for performing a method as defined in claim 9.
16. One or more computer-readable media having computer-usable instructions incorporated therein to perform a method for generating a synthetic data structure, comprising: operating on at least one reference data structure and a source data structure, each having one or more types of data fields, wherein each of said one or more data fields includes at least one row of data values; assigning a weighted value to each said at least one row of data values in said source data structure according to a predetermined pattern; deriving a respective data string for each said at least one row of data values from said source data structure; for each of said at least one row of data values of said source data structure, performing the following: a) based on said weighted value, said respective data string, and said type of data fields, mapping each data value of said at least one row of data values of said source data structure to a data value of said at least one row of data values of said at least one reference data structure; and b) populating said synthetic data structure with said data value mapped in said at least one reference data structure.
17. The method according to claim 16, wherein said data string is an output of a deterministic function.
18. The method according to claim 16, wherein assigning a weighted value further comprises assigning a weighted value according to the occurrences in a population of data in said one or more types of data fields in said source data structure so that corresponding data in said test data structure substantially matches patterns naturally found in a real population.
19. The method according to claim 18, further comprising: assigning a portion of said data string to each of said one or more types of data fields; and based on said portion and corresponding weighted values, locating in said at least one reference data structure said mapped value.
20. The method according to claim 19, wherein said types of data fields correspond to first and last names, company names, gender, ethnicity, payment methods, salary and age.