CN110741368A - System and method for identifying leaked data and assigning guilt to a suspected leaker - Google Patents


Info

Publication number
CN110741368A
Authority
CN
China
Prior art keywords
recipient
file
suspected
data
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880032328.8A
Other languages
Chinese (zh)
Inventor
Arthur Coleman
C. Bowers
Tsz Ling Christina Leung
Martin Rose
Matt LeBaron
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chain Rui Co Ltd
LiveRamp Inc
Original Assignee
Chain Rui Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chain Rui Co Ltd filed Critical Chain Rui Co Ltd
Publication of CN110741368A


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/12Applying verification of the received information
    • H04L63/126Applying verification of the received information the source of the received data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25808Management of client data
    • H04N21/2585Generation of a revocation list, e.g. of client devices involved in piracy acts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/16Program or content traceability, e.g. by watermarking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/835Generation of protective data, e.g. certificates
    • H04N21/8358Generation of protective data, e.g. certificates involving watermark

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Graphics (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Systems and methods for identifying leaked data files and assigning guilt to one or more suspected leakers proceed through multiple layers of analysis. At the first layer, preliminary watermark detection occurs: salt data previously inserted into a subset of the data is used to determine its correlation with the data in the file suspected of being leaked. The resulting likelihood of guilt is then weighted based on the number of bits matched.

Description

System and method for identifying leaked data and assigning guilt to a suspected leaker
Technical Field
The field of the invention is the verification of ownership of data to determine if the data has been improperly copied or used, and if so, to identify the party that improperly copied or used the data.
Background
References mentioned in this background section are not admitted to be prior art to the present invention.
Data leakage can be defined as the theft of data by someone other than the owner or an authorized user. By 2019, data leakage is estimated to be a problem involving tens of trillions of dollars, and data leakage currently incurs sales losses of about $10 billion per year. For certain types of data, owners have for years used well-known watermarking solutions to protect their Intellectual Property (IP) from theft. These watermarking solutions allow data owners to pursue damages for unauthorized use, because they can use watermarks as proof of ownership and of copyright infringement in court.
Consumer data owners ("data owners") often give, lease, or sell their data to individuals or organizations that are trusted to use the data only in a legitimate manner, in compliance with contractual requirements or data handling regulations, such as Regulation B in financial services or the privacy laws enacted by local, state, or federal governments ("trusted third parties" or "TTPs"). The data is typically transmitted as a series of database tables (e.g., .sql format), as text files (e.g., .csv, .txt, .xls, .doc, or .rtf format), or as a real-time data feed (e.g., XML or JSON).
The inventors of the present invention believe that an ideal guilt assignment model will work by tracking the distribution history of each attribute in a data set, identifying the potentially guilty TTPs, and determining the likelihood that each of them leaked the data.
Disclosure of Invention
In certain implementations, the present invention relates to a guilt assignment model and scoring method that achieves the goals set forth above. A "wild file" may be defined as a file of previously unknown provenance that may contain illegally distributed proprietary data. A "reference database of historical attributes", which is an archive of the attributes, metadata, and values that may be found in files from a variety of sources, is then employed.
The guilt assignment system and method generate a statistical likelihood that a particular TTP is in fact a bad actor that is illegitimately distributing data, or that has enabled a bad actor to illegally distribute data. When there are thousands of TTPs receiving data from a data owner, assigning guilt can be difficult. Watermarking and fingerprinting ideally would give 100% certainty about the identity of the leaker. Done correctly, watermarking or fingerprinting will exclude most of the TTPs and leave only a few potential suspects, each of which has a different statistical likelihood of being the source of the leak.
The layers of the model interact, with each layer helping to narrow the number of potentially guilty parties or recipient IDs. Some attributes in the data are weighted more heavily in the guilt score than other attributes.
These and other features, objects and advantages of the present invention will be better understood by consideration of the following detailed description of the preferred embodiments and appended claims taken in conjunction with the accompanying drawings, as described below.
Drawings
FIG. 1 is a diagram illustrating bit observation counts in an example using an embodiment of the present invention.
FIG. 2 is an illustration of the application of the chi-square goodness of fit test (chi-square goodness of fit test) for matching attributes in data files using an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating comparison of wild file data with reference data in an example using an embodiment of the present invention.
FIG. 4 is a set of tables showing weighted and unweighted attributes during statistical profile evaluation in an example using an embodiment of the present invention.
FIG. 5 is a data flow diagram of an embodiment of the present invention.
Detailed Description
Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, a limited number of exemplary methods and/or materials are described herein.
First, a unique recipient ID is generated and randomly assigned to each client in the database.
Layer 1, watermark detection, proceeds as follows. Salting is a mechanism that inserts unique data (salt) into a subset of data such that, in the event that the data is leaked, the data contained in the subset can be traced back to the owner of the data. Upon receipt of a suspicious wild file, the salt is checked by initiating a search protocol that generates a set of counts ("Bit Count") associated with 0 and 1 ("Bit Value") for each bit position ("Bit Position") in the recipient ID. A predefined heuristic, such as but not limited to an 80-20 heuristic, is applied to determine whether a bit position should be assigned a 0, a 1, or "unknown" based on the counts associated with each bit value. If 80 percent or more of the counts for a given bit position are associated with a particular bit value, that value (1 or 0) is assigned to the bit position; otherwise, the bit position is declared unknown. FIG. 1 provides an example of bit counts in which no unknown bits are declared.
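The 80-20 bit-assignment heuristic described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent; the function name, data layout, and example counts are assumptions.

```python
def recover_bits(bit_counts, threshold=0.80):
    """For each (zeros, ones) observation pair, assign '0' or '1' when that
    bit value accounts for at least `threshold` of the observations at that
    bit position; otherwise declare the position 'unknown'."""
    recovered = []
    for zeros, ones in bit_counts:
        total = zeros + ones
        if total == 0:
            recovered.append("unknown")
        elif ones / total >= threshold:
            recovered.append("1")
        elif zeros / total >= threshold:
            recovered.append("0")
        else:
            recovered.append("unknown")
    return recovered

# Example: three bit positions; the last is ambiguous (60/40 split).
print(recover_bits([(2, 18), (17, 3), (12, 8)]))  # → ['1', '0', 'unknown']
```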
The detected recipient IDs will have a variable number of recovered bits. If fewer than 10 bits of a recipient ID are recovered, the recipient ID is not included in the recipient ID pool, because the probability of a random match of up to 10 bits is approximately 0.1%. Thus, if a recipient ID is considered "recovered" during the watermark detection layer, the data owner's confidence about the customer to whom the relevant data was first distributed is greater than 99.9%. The recipient IDs detected during the watermark detection phase constitute the initial pool of TTPs suspected of guilt.
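The quoted ~0.1% figure follows from simple probability, assuming each of 10 bits matches by chance independently with probability 1/2:

```python
# Probability that all 10 bits of an unrelated recipient ID match by chance.
p_random_match = 0.5 ** 10
print(round(p_random_match * 100, 4))  # → 0.0977 (percent), i.e. roughly 0.1%
```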
After initial watermark detection (layer 1), the likelihood of guilt is 100 divided by the number of detected recipient IDs; this value is then weighted based on the number of matched bits in each detected recipient ID. For example, if 3 recipient IDs are detected in the salt, the initial guilt score assigned to each recipient ID is 33. This value is then weighted by a coefficient associated with the number of bits that matched the recipient ID during detection. As a detection criterion, all recipient IDs match at least 11 bits, but as the number of bits increases, the probability of a random match over 11 bits decreases greatly. A grouping (bin) based weighting metric is applied, in which recipient IDs with 11 to 20 matched bits are weighted by one value (e.g., 1.1), IDs with 21 to 30 matched bits by a second value (e.g., 1.35), and IDs with more than 30 matched bits by a third value (e.g., 1.55). Because guilt is closely related to the match rate, the more bits that match at layer 1, the higher the weighted guilt score assigned to the detected recipient ID.
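As an illustrative sketch of the layer-1 scoring just described (the bin boundaries and the example weights 1.1, 1.35, and 1.55 come from the text; the function names and rounding are assumptions):

```python
def bin_weight(matched_bits):
    """Bin-based weighting metric: 11-20 bits -> 1.1, 21-30 -> 1.35, >30 -> 1.55."""
    if matched_bits > 30:
        return 1.55
    if matched_bits > 20:
        return 1.35
    if matched_bits >= 11:
        return 1.1
    raise ValueError("IDs with fewer than 11 matched bits are not scored")

def layer1_scores(matched_bit_counts):
    """Base score is 100 / (number of detected recipient IDs), then weighted."""
    base = 100 / len(matched_bit_counts)
    return [round(base * bin_weight(b), 2) for b in matched_bit_counts]

# Three recipient IDs detected, with 12, 25, and 30 matched bits.
print(layer1_scores([12, 25, 30]))  # → [36.67, 45.0, 45.0]
```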
Turning to layer 2 (advanced watermark detection), another search process is started to detect additional salt-related patterns that were embedded in the data before distribution to the customer. The method used for this search process is the same as in the initial watermark detection process, but it is applied to other data values, and it produces the same type of bit string as described in FIG. 1.
For example, if 2 more IDs are added to the pool of recipient IDs at the end of layer 2, and they are the same as the two IDs with 25- and 30-bit matches from layer 1, then the base guilt score for these recipient IDs is 40, whereas for a recipient ID appearing only once in the pool the base guilt score is 20. Using the same example weighting metrics (1.1, 1.35, and 1.55), the resulting guilt scores after layer 2 for the recipient IDs with 25- and 30-bit matches are 54 and 62, respectively, while the guilt score for the once-detected recipient ID with a 12-bit match is 22.
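A hedged sketch of this cumulative layer-2 scoring. Note that the text's own arithmetic (40 × 1.55 = 62) places its 30-bit match in the top weighting bin, so the example below uses a 35-bit match to land in that bin under the bin boundaries as stated earlier; the base score of 20 per detection is an assumption consistent with the 40/20 example.

```python
def bin_weight(matched_bits):
    if matched_bits > 30:
        return 1.55
    if matched_bits > 20:
        return 1.35
    return 1.1  # 11-20 matched bits; fewer than 11 never reaches scoring

def cumulative_score(times_detected, matched_bits, base_per_detection=20):
    """IDs detected in both layers get double the base score before weighting."""
    return round(times_detected * base_per_detection * bin_weight(matched_bits))

# Two IDs detected in both layers (25- and 35-bit matches) and one ID
# detected once with a 12-bit match.
print(cumulative_score(2, 25), cumulative_score(2, 35), cumulative_score(1, 12))
# → 54 62 22
```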
After advanced watermark detection, a third layer of analysis is applied, in which the statistical distribution of the data in the wild file is compared with the distribution of the corresponding data in the reference database. This is referred to herein as layer 3 statistical profile detection. The pool of recipient IDs from layer 2 is used as the list of TTPs suspected of bad behavior. Using the information contained in the wild file, the date range in which the data must have been distributed can be identified.
The statistical profile detection method in layer 3 is as follows:
1) Records in the wild file for which personally identifying information is available (e.g., name and address) are matched against the records in each suspected TTP file associated with a suspected recipient ID. Only the matching records are evaluated further (in step 4). If layers 1 and 2 produce no suspected recipient IDs, the system performs layer 3 fingerprint detection using the company's master data file (Data Owner Set).
2) Many matching mechanisms are employed, including but not limited to matching the meta-features of each wild file column (such as value type, number of values, value names, and fill rate) with attributes in the reference database (see FIG. 2).
3) The chi-square (χ²) goodness-of-fit analysis is a statistical test that can be used to determine whether the classes in a data set are distributed in the same way, and thus can be assumed to come from the same "population", which in this case means that they represent the same attribute. The comparison process iterates over each attribute in the wild file and across all potential source files, yielding, for each offender's data distributed to the pool of recipient IDs, a set of attributes held in common with the wild file. A resulting χ² value of .05 indicates that the wild file attribute is 95% likely to be the same as the attribute in the TTP recipient file. In this example, this is considered an attribute match, and the TTP recipient file attribute is added to the subset of data objects to be compared. FIG. 2 provides an example of how goodness-of-fit analysis may be used to match attributes in a wild file with attributes in a TTP recipient file.
4) The subset of matching records and matching attributes in the TTP recipient file is subjected to a further step of guilt assessment analysis. As shown in FIG. 3, the data in each cell of the wild file is compared to the data in each record- and attribute-matched cell of the recipient file.
5) For each potential bad actor, values are obtained representing the number of columns in the wild file that statistically match columns in each source file, the number of rows in the wild file that match rows in each source file via name and address, and the number of cells in the wild file that have the same value as the corresponding cells in each source file.
6) The number of matched cell values is then weighted by an attribute-specific factor that is closely related to historical information about attribute/column distribution frequency, proprietary status, and unique attribute characteristics. This information is stored in an attribute reference database. Attribute weights range from 0 to 1, where 0 is assigned to attributes that are distributed relatively frequently (such as "age" or "gender") and 1 is assigned to attributes that, for example, are rarely distributed or contain headers or value labels explicitly linked to known proprietary data.
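Step 3's goodness-of-fit comparison can be illustrated with a hand-rolled chi-square statistic. The category counts, the reference distribution, and the df = 2 critical value at the 0.05 significance level are example inputs, not data from the patent.

```python
def chi_square_stat(observed, expected):
    """Pearson's chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Category counts for, e.g., the values {"yes", "no", "unknown"} of a
# wild-file column, compared to a candidate reference attribute.
wild_counts = [48, 40, 12]          # observed counts in the wild file column
ref_props = [0.50, 0.38, 0.12]      # reference attribute distribution
expected = [p * sum(wild_counts) for p in ref_props]

stat = chi_square_stat(wild_counts, expected)
CRITICAL_0_05_DF2 = 5.991           # chi-square table value, df = 3 - 1
print(stat < CRITICAL_0_05_DF2)     # → True: treat as an attribute match
```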
For example, in FIG. 3, six different attributes are represented across 4 files (1 wild file and 3 recipient files): "Driver", "Yogi", "Parent", "Sex", "Age", and "Techie". Three of these attributes ("Driver", "Yogi", and "Parent") exist in the wild file and are therefore important factors in evaluating the guilt of the recipient files. The "Driver" and "Parent" attributes are distributed to TTPs more frequently than the "Yogi" attribute. Thus, in this context, data determined (in layer 3) to be from the "Yogi" attribute is weighted more heavily in the guilt score than data determined to be from the "Driver" and "Parent" attributes. FIG. 4 depicts an attribute-weighted guilt score calculation constructed according to the scenario of FIG. 3.
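The attribute weighting of steps 5 and 6, in which rarely distributed attributes dominate the score, can be sketched as follows. The specific weight values and the default weight for unlisted attributes are assumptions for illustration.

```python
# Illustrative attribute weights in [0, 1]: common attributes get low weights,
# rarely distributed ones get high weights (values are made up).
attribute_weights = {"age": 0.0, "gender": 0.0, "yogi": 1.0, "driver": 0.25}

def weighted_cell_score(cell_matches_by_attribute):
    """Scale the matched-cell count of each attribute by its weight and sum.
    Unknown attributes fall back to an assumed middle weight of 0.5."""
    return sum(attribute_weights.get(attr, 0.5) * n
               for attr, n in cell_matches_by_attribute.items())

# 100 matching "age" cells contribute nothing; 10 matching cells of the
# rarely distributed "yogi" attribute dominate the score.
print(weighted_cell_score({"age": 100, "yogi": 10, "driver": 25}))  # → 16.25
```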
International patent application no. PCT/US2017/062612, entitled "Mixed Data Fingerprinting with Principal Components Analysis," discloses methods for performing PCAMIX fingerprinting.
Records in the wild file for which personally identifying information (e.g., name and address) is available are matched against the records in each suspected TTP file associated with a suspected recipient ID; only the matching records are evaluated further. If layers 1 and 2 do not produce any suspected recipient IDs, the system performs layer 4 fingerprint detection using the company's master data file (Data Owner Set). The PCAMIX fingerprinting of layer 4 then proceeds as follows:
1) The eigenvalues will show that a reduced set of eigenvectors accounts for most of the variance in the data set, while those accounting for little variance may be discarded or ignored in subsequent analysis.
2) The next step is score generation. A matrix of eigenvector scores is generated for the data owner set and for the wild file; just as each observation has a value for each original variable, it also has a score on each eigenvector.
3) If the resulting comparison value is 1 or very close to 1, the two data sets should not exhibit statistically significant differences. In this case, when the value is equal to or greater than 0.8, we score according to that value; that is, if the value is 0.85, the score will be 0.85. When the value is less than 0.8, the layer 4 score is 0.
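The layer-4 scoring rule as stated reduces to a simple cutoff function (the function name is invented for illustration):

```python
def layer4_score(value, cutoff=0.8):
    """Use the comparison value directly as the score when it reaches the
    cutoff; otherwise layer 4 contributes a score of 0."""
    return value if value >= cutoff else 0.0

print(layer4_score(0.85), layer4_score(0.79))  # → 0.85 0.0
```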
After final evaluation of all the layers, we compute the average of the guilt scores across all layers (for each layer in which a detection was scored) for each recipient file or data owner set. This value is then given a final weighting based on a predetermined recipient risk profile score. Risk profile scores are a range of integer values (e.g., 1 to 4). A risk profile score is derived from an analysis of several factors related to a company's financial and/or credit records, operating habits, and additional features that bear on the potential liability associated with distributing valuable data to the company. The lowest profile score (i.e., 1) is associated with the highest level of trustworthiness or lowest risk, and the highest numerical score (i.e., 4) indicates that a company has a low level of trustworthiness or the highest risk. A company with a risk score of 1, or a company with no recorded information, receives no additional weighting after the last layer of guilt assignment. A company with a risk score of 4 receives the heaviest weighting after the last layer of guilt assignment.
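The final averaging and risk weighting might be sketched as follows. The multipliers for risk profiles 2 through 4 are assumptions for illustration, since the text fixes only that profile 1 (or no record) adds no weight and profile 4 weighs heaviest.

```python
# Assumed risk-profile multipliers: profile 1 adds no weight per the text,
# and the values for profiles 2-4 are illustrative placeholders.
RISK_MULTIPLIER = {1: 1.0, 2: 1.1, 3: 1.25, 4: 1.5}

def final_guilt_score(layer_scores, risk_profile=1):
    """Average the scores of the layers that produced a detection (None marks
    a layer with no detection), then apply the risk-profile multiplier."""
    scored = [s for s in layer_scores if s is not None]
    avg = sum(scored) / len(scored)
    return round(avg * RISK_MULTIPLIER.get(risk_profile, 1.0), 2)

# Three layers scored a recipient; the recipient has the riskiest profile.
print(final_guilt_score([30.0, 54.0, 21.0], risk_profile=4))  # → 52.5
```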
The output of this guilt assignment process is a list of TTPs suspected of guilt, each with a guilt score representing its relative likelihood of having leaked the related file. FIG. 5 depicts the flow of information through the guilt assessment model and the adjustment of the guilt scoring weights throughout the various layers of the guilt assessment process. If multiple recipient IDs are detected in layers 1 and 2, the cumulative guilt scores are also used to rank the relative guilt potential among the TTPs.
Describing the process generally now with reference to FIG. 5, watermark detection at layer 1 occurs at block 12, taking the file submitted for fingerprint detection 10 as input. The bit match rate weight calculation 14 is as shown in the example of FIG. 1 and is calculated as described above. Processing proceeds to advanced watermark detection at layer 2, which occurs at block 22 using the individual bit match weights 16 and recipient ID frequency weights 18 calculated as described above. It may be noted that the recipient IDs are extracted from the recipient file database 20, which includes all of the individual recipient files 24. Moving to the statistical profile fingerprinting detection layer 3 at block 26, the recipient file database 20 is an input for this process, as is the attribute reference database 30. The attribute reference database 30 is used to establish the attribute frequency weights 28. Moving to the PCAMix fingerprinting layer 4 at block 38, the matched individual records and the matched attributes are input to this process. The PCAMix eigenvalue score 40 is received as input; its function is as described above. The process then moves to the additional weighting factors, which produce an overall guilt score at block 32. The inputs here include the recipient profile score database 36 and the average guilt scores from the previous layers; the recipient profile score database 36 is used to calculate the recipient legitimacy weights 34. The output is the overall guilt score from the overall guilt scoring at block 32.
When a group is used herein, all individual members of the group, as well as all possible combinations and subcombinations of that group, are intended to be individually included. When a range is stated herein, the range is intended to include all subranges and individual points within the range.
The present invention has been described with reference to certain preferred and alternative embodiments, which are intended to be exemplary only and are not intended to limit the full scope of the invention as set forth in the appended claims.

Claims (19)

  1. A method of measuring guilt with respect to a wild file suspected of being leaked, comprising the steps of:
    a. performing a first search for one of a plurality of salts in the wild file, wherein each salt is associated with a recipient ID, each recipient ID in turn being associated with a recipient data file, the search resulting in a set of bit counts, each bit count comprising the bit values for each bit position in the recipient ID;
    b. applying a predetermined heuristic to each bit position to assign a heuristic value to each bit value in order to determine a first calculation of Trusted Third Parties (TTPs) suspected of guilt;
    c. dividing the guilt probability of each suspected guilty TTP by the number of suspected guilty TTPs;
    d. weighting the guilt probability of each suspected guilty TTP by a factor associated with the number of bits matching the recipient ID during detection to produce a first guilt score;
    e. applying a second search to a second of the plurality of salts in the wild file;
    f. calculating, for each recipient file associated with a detected recipient ID, a second guilt score for the wild file suspected of being leaked;
    g. increasing the weight of recipient IDs detected in both the first and second searches;
    h. comparing the statistical distribution of data in the wild file suspected of being leaked to the recipient files corresponding to the detected recipient IDs to generate a third guilt score;
    i. applying mixed data fingerprinting using principal component analysis to the wild file to generate a fourth guilt score; and
    j. calculating a final guilt score by averaging the previously calculated first, second, third, and fourth guilt scores.
  2. The method of claim 1, wherein the predetermined heuristic is an 80-20 heuristic.
  3. The method of claim 2, wherein the heuristic value is selected from the group consisting of 1, 0, and unknown.
  4. The method of claim 3, wherein if the number of recovered bit values is less than a minimum number of bits, the recovered bit values are not included in the pool of recipient IDs associated with suspected leaked recipient files.
  5. The method of claim 4, wherein weighting the guilt probability of each suspected guilty TTP by a factor associated with the number of bits matching the recipient ID during detection comprises applying a bin-based weighting metric.
  6. The method of claim 5, wherein the bin-based weighting metric is a first particular value for recipient IDs totaling 11 to 20 matched bits, a second particular value for recipient IDs totaling 21 to 30 matched bits, and a third particular value for IDs totaling more than 30 matched bits.
  7. The method of claim 5, wherein the bin-based weighting metrics for both the first and second searches are added to create an overall bin-based weighting metric.
  8. The method of claim 1, wherein the step of comparing the statistical distribution of data in the file suspected of being leaked with the files corresponding to the detected recipient IDs further includes the step of identifying a range of dates within which the data distributed to the file suspected of being leaked must have been distributed.
  9. The method of claim 8, wherein the step of comparing the statistical distribution of data in the file suspected of being leaked with the files corresponding to the detected recipient IDs comprises the step of comparing record fields in the files corresponding to the detected recipient IDs with records in the file suspected of being leaked, and eliminating the recipient file corresponding to any detected recipient ID for which no records match.
  10. The method of claim 9, wherein the step of comparing the statistical distribution of data in the file suspected of being leaked with the files corresponding to the detected recipient IDs comprises the step of meta-feature matching between the recipient files corresponding to the detected recipient IDs and the wild file suspected of being leaked.
  11. The method of claim 10, wherein the meta-features comprise at least one of a value type, a number of values, a value name, and a fill rate.
  12. The method of claim 10, wherein comparing the statistical distribution of data in the file suspected of being leaked with the files corresponding to the detected recipient IDs comprises performing a chi-square goodness-of-fit analysis on at least one attribute in the reference file corresponding to each recipient ID having matching meta-features.
  13. The method of claim 12, further comprising the step of comparing values in cells of matching attributes between the wild file suspected of being leaked and the recipient file corresponding to the matching recipient ID to obtain a number of matching columns.
  14. The method of claim 13, further comprising the step of calculating the total number of possible cell matches by multiplying the total number of matched columns by the number of matched rows to produce the number of matched cell values.
  15. The method of claim 14, further comprising the step of weighting each of the matched cell values by an attribute-specific factor that is closely related to historical information.
  16. The method of claim 15, wherein the historical information comprises at least one of attribute/column distribution frequency, proprietary status, and unique attribute features.
  17. The method of claim 1, wherein the final guilt score is weighted based on a predetermined recipient risk profile score.
  18. The method of claim 17, wherein the predetermined recipient risk profile score comprises a range of integer values.
  19. The method of claim 18, wherein the range of integer values for the predetermined recipient risk profile score is derived from a plurality of factors including one or more of a recipient's corporate financial and/or credit records, operating habits, and additional features contributing to potential liability associated with distributing data.
CN201880032328.8A 2017-03-17 2018-03-09 System and method for identifying leaked data and assigning guilt to a suspected leaker Pending CN110741368A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762472853P 2017-03-17 2017-03-17
US62/472,853 2017-03-17
PCT/US2018/021853 WO2018169802A1 (en) 2017-03-17 2018-03-09 System and method for identifying leaked data and assigning guilt to a suspected leaker

Publications (1)

Publication Number Publication Date
CN110741368A true CN110741368A (en) 2020-01-31

Family

ID=63523328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880032328.8A Pending CN110741368A (en) System and method for identifying leaked data and assigning guilt to a suspected leaker

Country Status (6)

Country Link
US (1) US11350147B2 (en)
EP (1) EP3596632A1 (en)
JP (1) JP7046970B2 (en)
CN (1) CN110741368A (en)
CA (1) CA3056601A1 (en)
WO (1) WO2018169802A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017117024A1 (en) * 2015-12-31 2017-07-06 Acxiom Corporation Salting text in database tables, text files, and data feeds
JP7404662B2 (en) * 2019-06-03 2023-12-26 富士フイルムビジネスイノベーション株式会社 Information processing device and program

Citations (3)

Publication number Priority date Publication date Assignee Title
US20150073981A1 (en) * 2014-10-28 2015-03-12 Brighterion, Inc. Data breach detection
CN104850783A (en) * 2015-04-30 2015-08-19 中国人民解放军国防科学技术大学 Method and system for cloud detection of malicious software based on Hash characteristic matrix
CN105745903A (en) * 2013-09-13 2016-07-06 安客诚 Apparatus and method for bringing offline data online while protecting consumer privacy

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US7751596B2 (en) * 1996-11-12 2010-07-06 Digimarc Corporation Methods and arrangements employing digital content items
US7770016B2 (en) * 1999-07-29 2010-08-03 Intertrust Technologies Corporation Systems and methods for watermarking software and other media
JP4042100B2 2002-04-23 2008-02-06 Nippon Telegraph And Telephone Corporation Content search information management system and method, content search method and program
JP2006140944A (en) 2004-11-15 2006-06-01 Hitachi Ltd Information embedding device, method, system and user terminal
JP2008171131A (en) 2007-01-10 2008-07-24 Nippon Hoso Kyokai <Nhk> Fingerprint detector and program therefor
US9319417B2 (en) * 2012-06-28 2016-04-19 Fortinet, Inc. Data leak protection
MY196507A (en) * 2013-03-15 2023-04-18 Socure Inc Risk Assessment Using Social Networking Data
US20170329942A1 (en) * 2016-05-12 2017-11-16 Markany Inc. Method and apparatus of drm systems for protecting enterprise confidentiality

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN105745903A (en) * 2013-09-13 2016-07-06 安客诚 Apparatus and method for bringing offline data online while protecting consumer privacy
US20150073981A1 (en) * 2014-10-28 2015-03-12 Brighterion, Inc. Data breach detection
CN104850783A (en) * 2015-04-30 2015-08-19 中国人民解放军国防科学技术大学 Method and system for cloud detection of malicious software based on Hash characteristic matrix

Also Published As

Publication number Publication date
WO2018169802A1 (en) 2018-09-20
US11350147B2 (en) 2022-05-31
US20200092595A1 (en) 2020-03-19
CA3056601A1 (en) 2018-09-20
EP3596632A1 (en) 2020-01-22
JP2020512630A (en) 2020-04-23
JP7046970B2 (en) 2022-04-04

Similar Documents

Publication Publication Date Title
US8311907B2 (en) System and method for detecting fraudulent transactions
US9385868B2 (en) Methods and systems for testing performance of biometric authentication systems
US20140122294A1 (en) Determining a characteristic group
US20100293090A1 (en) Systems, methods, and apparatus for determining fraud probability scores and identity health scores
US20090044279A1 (en) Systems and methods for fraud detection via interactive link analysis
US11568028B2 (en) Data watermarking and fingerprinting system and method
CN110059981B (en) Trust degree evaluation method and device and terminal equipment
US11080427B2 (en) Method and apparatus for detecting label data leakage channel
US7774320B1 (en) Verifying integrity of file system data structures
Zou et al. A belief propagation approach for detecting shilling attacks in collaborative filtering
CN110741368A (en) System and method for identifying leaked data and assigning guilt to a suspected leaker
US20140317066A1 (en) Method of analysing data
CN111835781B (en) Method and system for discovering hosts of same-source attacks based on compromised hosts
CN106469182A (en) Information recommendation method and device based on mapping relations
US9521164B1 (en) Computerized system and method for detecting fraudulent or malicious enterprises
CN111970272A (en) APT attack operation identification method
DeCann et al. Modelling errors in a biometric re‐identification system
CN110990810B (en) User operation data processing method, device, equipment and storage medium
CN113918435A (en) Application program risk level determination method and device and storage medium
CN110636082B (en) Intrusion detection method and device
CN110457600B (en) Method, device, storage medium and computer equipment for searching target group
KR102003450B1 (en) Method and apparatus for monitoring users of web server
CN113949529B (en) Credible hybrid cloud management platform access method and system
US20210194923A1 (en) Automated social media-related brand protection
CN115484036A (en) Abnormal user detection method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40013578

Country of ref document: HK

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200131
