WO2021115589A1 - Devices and methods for applying and extracting a digital watermark to a database - Google Patents

Devices and methods for applying and extracting a digital watermark to a database

Info

Publication number
WO2021115589A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
bit
values
watermark
attributes
Prior art date
Application number
PCT/EP2019/084666
Other languages
French (fr)
Inventor
Thomas VANNET
Xuebing Zhou
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2019/084666
Publication of WO2021115589A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 Program or content traceability, e.g. by watermarking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/106 Enforcing content protection by specific content processing
    • G06F21/1063 Personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The disclosure relates to a method of and device for applying a digital watermark. At least one cluster of a digital database is obtained, wherein a cluster comprises one or more data records of the database to be associated with a bit of a digital watermark. A synthetic data record is obtained. Values of attributes of the synthetic data record are processed to obtain a parity bit value. The synthetic data record is included in the cluster based on whether the parity bit value matches the bit value of the digital watermark to be associated with the cluster.

Description

DEVICES AND METHODS FOR APPLYING AND EXTRACTING A DIGITAL WATERMARK TO A DATABASE
TECHNICAL FIELD
The present application relates to watermarking data. In particular, but not exclusively, it relates to embedding a digital watermark in data, such as a database, and extracting a digital watermark from data.
BACKGROUND
Sharing or sale of databases is a common business practice. It comes at the risk of piracy or accidental leaks of private information. To prevent abuse, traceability and proof of ownership are required, especially when personal information is present within the database. A common way to achieve this is the insertion of a hidden watermark within the data.
Previous watermarking techniques for databases tend to have a weak security model relying on obscurity of the underlying watermarking algorithm. Furthermore, they are either limited to certain data types or are not resilient to data manipulation (attribute deletion in particular).
SUMMARY
To date, solutions to this problem fall under one of the following three categories: 1) Modification of the real data, 2) Insertion of artificial data, and 3) Modification of the structure of the data.
Solutions which modify real data have one or more of the following drawbacks:
• The method can only be used on numerical data. For instance, the watermark bits will replace the Least Significant Bits (LSB) of the data, which would create easy-to-detect inconsistencies on non-numerical data.
• The watermark can be easily removed if the embedding method is known. If LSBs are used, then deleting those same bits will not significantly affect the data while completely deleting the watermark.
• The watermark may be unrecoverable on an altered database. If the attributes containing the watermark bits are not present in the modified database, the bits cannot be recovered.
Solutions which insert artificial data have one or more of the following drawbacks:
• The storage requirements become prohibitive when the database is shared many times. A large amount of artificial data is needed as a watermark to prevent accidental deletion. Each time the database is shared, all this data must be recovered to identify future leaks.
• It is difficult or impossible to prove ownership of a database to a third party. Since anyone could claim some of the data is artificial and matches their own local copy of the database, it is usually hard to convince the accused that the watermark was added on purpose.
Solutions which rely on modifying the structure of the data have at least the following drawback:
• Knowledge of the embedding method makes watermark removal easy. Typically, shuffling the data (order of attributes, order of rows) is enough to make the watermark unrecoverable. This can happen maliciously or accidentally.
In a first aspect, there is provided a method of applying a digital watermark comprising obtaining at least one cluster of a digital database, wherein a cluster comprises one or more data records of the database to be associated with a bit of a digital watermark, generating a synthetic data record, processing values of attributes of the synthetic data record to obtain a parity bit value, and including the synthetic data record in the cluster based on whether the parity bit value matches the bit value of the digital watermark to be associated with the cluster. In an implementation, the attribute values of the synthetic data may be processed individually and the results used to obtain the parity bit value.
This process embeds watermark bits within the synthetic data, instead of using the synthetic data itself as a form of watermark. This is advantageous because it gives greater control over the choice of the watermark value without revealing its location. This is a requirement to prove ownership to a third party, for instance in the case of legal action.
Previous ways of embedding bits in data typically alter the data itself. In order not to decrease data utility, the less important parts of the data are used, such as the least significant bits of numerical values. This can be exploited by a knowledgeable attacker who would simply delete those less important parts without hurting data utility. According to the first aspect, the watermark bits are spread over whole sets of attributes and over multiple data records. It is, thus, strongly resilient to deletion of attributes or entire data records (rows). Further, the embedding process of the first aspect can be used with any kind of data by treating it as a bit sequence.
In a first implementation of the first aspect, processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record. Using a cryptographic hash allows for uniform distribution of values thereby avoiding a bias in the attribute values which might be used by an attacker to infer information about the watermark bit value.
In a second implementation of the first aspect, processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values. Using such a probabilistic generation mechanism improves watermark extraction success rate even after deletion attacks, for example. This is better than the parity bit being obtained as the LSB of some hash or keyed-hash of an attribute, for example. That kind of approach would potentially introduce a strong bias which could be detected by an attacker and lead to a watermark removal attack. For instance, if an attribute Gender has two possible values ‘Male’ and ‘Female’ and the LSB of the hash is 1 for ‘Male’ and 0 for ‘Female’, then a cluster with an abnormally high number of ‘Male’ could reveal it is embedding a '0' watermark bit.
In a third implementation of the first aspect, the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes. By having values in the bit sequence be based on the relative order between an attribute value’s hash and the next attribute value’s hash, in embodiments, little or no bias is introduced if attributes have a substantially uniform distribution. This will be the case if the values are hashes or keyed hashes. Further, relative order of consecutive values provides a naturally resilient mechanism against random attribute deletions. For instance, in cases where 3 consecutive values are in order (a < b < c or a > b > c), then deletion of b preserves the ordering (a < c or a > c) so the vote of a will be unchanged. Even in cases where the values are not in order (for instance a > b and b < c), since the bias introduced is small, we can actually use prior knowledge about the real data distribution to compute the expected value of the votes for a and b. Experimentally, this has been shown to provide a very high rate of successful watermark bit extraction.
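The relative-order voting described above can be illustrated with a minimal Python sketch. Small integers stand in for the hash values, and mapping "smaller than the next value" to a vote of 1 is an assumption made purely for illustration; the text only states that each bit depends on whether a hash is smaller or larger than the subsequent one.

```python
def comparison_votes(hashes):
    """Return one vote per consecutive pair: 1 if a value is smaller than
    its successor, 0 otherwise (the 1/0 mapping is an assumption)."""
    return [1 if cur < nxt else 0 for cur, nxt in zip(hashes, hashes[1:])]


# Three consecutive values in order (a < b < c), followed by a fourth value.
a, b, c, d = 10, 20, 30, 5
before = comparison_votes([a, b, c, d])  # votes contributed by a, b and c
after = comparison_votes([a, c, d])      # same row with attribute b deleted

print(before)  # [1, 1, 0]
print(after)   # [1, 0]
```

In this example the first vote (that of a) is 1 both before and after the deletion of b, which is the resilience property described in the paragraph above.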
In a fourth implementation of the first aspect, the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto. Improved security is thereby ensured by using a secret key and making the process for determining the parity bit highly resilient to adversarial data manipulation. For example, where a watermark has been secretly embedded, it makes it impossible without knowledge of the secret key to extract the watermark.
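A keyed hash of this kind might be computed as in the following sketch. The choice of SHA-256, the UTF-8 encoding and the function name keyed_attribute_hash are assumptions for illustration; the text only requires a cryptographic hash of the attribute value with a secret key appended.

```python
import hashlib


def keyed_attribute_hash(value: str, secret_key: bytes) -> int:
    """Hash an attribute value with the secret key appended and return the
    digest as an integer (SHA-256 is an assumed choice of hash function)."""
    digest = hashlib.sha256(value.encode("utf-8") + secret_key).digest()
    return int.from_bytes(digest, "big")


h1 = keyed_attribute_hash("Paris", b"key-one")
h2 = keyed_attribute_hash("Paris", b"key-two")
```

The same value hashed under two different keys gives unrelated integers, so the relative order of hashes in a row cannot be reproduced without the key.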
In a fifth implementation of the first aspect, the order of the attributes of the synthetic data record is modified based on statistical properties of the attribute values.
In a sixth implementation of the first aspect, the statistical property is an entropy value and the order is modified such that the attributes alternately have a high and low entropy value.
In practice, distribution of attributes is often far from uniform. This optimal ordering step orders the attributes based on statistical properties of the distribution which allows the bias to be removed. For example, the values may be alternated between attributes close and far from a uniform distribution. One convenient measure indicating such a property is the entropy value. It can be shown that if every other attribute is uniformly random, the system introduces no bias. It has been empirically observed that for real data, any additional bias introduced by the synthetic data is negligible.
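One possible way to obtain such an alternating order is sketched below. The empirical Shannon entropy, the split into a high-entropy half and a low-entropy half, and the function names are all assumptions for illustration; the text does not prescribe a particular interleaving rule.

```python
import math
from collections import Counter
from itertools import zip_longest


def shannon_entropy(values):
    """Empirical Shannon entropy (in bits) of a column of values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def alternate_by_entropy(columns):
    """Order attribute names so that high- and low-entropy attributes alternate."""
    ranked = sorted(columns, key=lambda name: shannon_entropy(columns[name]), reverse=True)
    half = (len(ranked) + 1) // 2
    high, low = ranked[:half], ranked[half:][::-1]  # low: lowest entropy first
    ordered = []
    for hi, lo in zip_longest(high, low):
        ordered.append(hi)
        if lo is not None:
            ordered.append(lo)
    return ordered


columns = {
    "flag": ["y", "y", "y", "y"],    # entropy 0
    "gender": ["M", "M", "M", "F"],  # entropy about 0.81
    "city": ["X", "X", "Y", "Y"],    # entropy 1
    "id": ["a", "b", "c", "d"],      # entropy 2
}
order = alternate_by_entropy(columns)
print(order)  # ['id', 'flag', 'city', 'gender']
```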
In a seventh implementation of the first aspect, the method comprises determining a resilience level based on a difference between the number of ones and zeros in the bit sequence, and including the synthetic data record based on the determined resilience level. This is advantageous because it increases the success rate of extraction after arbitrary deletions. This can be obtained, for example, by sampling rows with a high resiliency level. For instance, embedding a 0 bit over 50 attributes may be done by selecting rows containing 30 votes for 0 and 20 votes for 1, rather than rows with 26 votes for 0 and 24 votes for 1. If the randomly generated row does not reach the desired resiliency level, it is discarded. Experimentally, it can be observed that the successful sampling rate is around 5% even for high resiliency levels. This makes the scheme performant and resilient at the same time.
In an eighth implementation of the first aspect, the method comprises including synthetic data records in the cluster until a predetermined target number of records is reached.
This embeds a single bit of information over many different data records. The embedded watermark bit is determined by a majority vote based on the parity bit of each record in a cluster.
By making sure that the target is reached, the correct value can be determined when it comes to extracting the watermark bit. The target number may be dependent on the parity bit returned by the non-synthetic data records in the cluster such that the cluster reaches the target when, for example, there are more synthetic than non-synthetic records. The parity bit returned by the non-synthetic entries may be determined and the target based on a number of synthetic entries required to ensure a majority of records in the cluster return the correct parity bit value corresponding to the watermark bit.
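The resilience-based sampling of the seventh implementation and the target-driven filling of the eighth implementation can be sketched together as follows. This is a minimal illustration under several assumptions that are not taken from the text: SHA-256 as the hash, uniform sampling of synthetic values from per-attribute domains, ties in the vote resolving to 0, and a target defined as a strict majority of records voting for the watermark bit. All function and parameter names are hypothetical.

```python
import hashlib
import random


def parity_and_margin(row, key):
    """Majority bit of the relative-order votes and the |ones - zeros| margin."""
    hashes = [hashlib.sha256(str(v).encode() + key).digest() for v in row]
    votes = [1 if a < b else 0 for a, b in zip(hashes, hashes[1:])]
    ones = sum(votes)
    zeros = len(votes) - ones
    return (1 if ones > zeros else 0), abs(ones - zeros)


def fill_cluster(real_rows, domains, bit, key, min_margin, rng):
    """Add synthetic rows until rows voting for `bit` form a strict majority."""
    cluster = list(real_rows)
    against = sum(1 for r in real_rows if parity_and_margin(r, key)[0] != bit)
    in_favour = len(real_rows) - against
    while in_favour <= against:
        candidate = [rng.choice(domain) for domain in domains]
        parity, margin = parity_and_margin(candidate, key)
        if parity == bit and margin >= min_margin:  # otherwise discard the row
            cluster.append(candidate)
            in_favour += 1
    return cluster


rng = random.Random(0)
key = b"secret"
domains = [list(range(100))] * 9  # nine attributes, hence eight votes per row
real = [[rng.choice(d) for d in domains] for _ in range(3)]
cluster = fill_cluster(real, domains, bit=1, key=key, min_margin=2, rng=rng)
synthetic = cluster[len(real):]
```

Rejected candidates are simply discarded, so a higher min_margin trades a lower sampling success rate for rows whose parity bit survives more attribute deletions.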
In a ninth implementation of the first aspect, the method comprises partitioning a digital database into the at least one cluster based on predetermined attributes with high utility. By basing the partitioning into clusters on high utility attributes the method can be made more reliable. This is because the success will depend on how many of those attributes are preserved. The partitioning may be performed only once per database and may be independent of the watermark value and database content. The partitioning result may be stored in a memory for use when recovering the watermark. For example, it may be stored together with the watermarked database or in a separate repository.
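A simple way to picture such a partitioning is to group rows by the values of the chosen attributes, as in the sketch below. Grouping on exact value tuples and the name partition_by_attributes are assumptions for illustration; the text does not specify how the clusters are formed from the high utility attributes.

```python
from collections import defaultdict


def partition_by_attributes(rows, key_attributes):
    """Group rows (dicts) into clusters keyed by the values of the chosen
    high utility attributes (exact-match grouping is an assumption)."""
    clusters = defaultdict(list)
    for row in rows:
        cluster_id = tuple(row[name] for name in key_attributes)
        clusters[cluster_id].append(row)
    return dict(clusters)


rows = [
    {"city": "Paris", "age_band": "30-39", "note": "a"},
    {"city": "Paris", "age_band": "30-39", "note": "b"},
    {"city": "Lyon", "age_band": "20-29", "note": "c"},
]
clusters = partition_by_attributes(rows, ["city", "age_band"])
```

Because the cluster identifier depends only on the chosen attributes, the same grouping can be recomputed on a candidate database as long as those attributes are preserved.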
According to a second aspect, there is provided a method of extracting a digital watermark comprising obtaining at least one cluster of data records of a digital database, wherein a cluster comprises one or more data records associated with a bit of a digital watermark, processing attribute data of the one or more data records of the cluster to determine respective parity bit values, and extracting a bit of the watermark based on a most common bit value among the determined parity bit values. The digital watermark bit is spread over sets of attributes and multiple data records; thus, the watermark is resiliently embedded and may be extracted even when data is missing.
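The final step of this extraction, taking the most common parity bit in a cluster, can be sketched as follows. The parity bits are given directly as example data, and the tie-breaking behaviour (whichever value Counter lists first) is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter


def extract_bit(parity_bits):
    """Return the most common parity bit among the records of a cluster."""
    return Counter(parity_bits).most_common(1)[0][0]


full_cluster = [1, 1, 0, 1, 0]  # parity bits of five records
after_deletion = [1, 0, 1]      # two records removed from the cluster

print(extract_bit(full_cluster))    # 1
print(extract_bit(after_deletion))  # 1
```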
In a first implementation of the second aspect, the method comprises adding attribute data to a data record of the cluster where the attribute data for an attribute is missing based on a probability mass function (PMF). Thus, the recovery of the watermark is made less vulnerable to deletion of data whether carried out maliciously or otherwise because it can be effectively substituted with plausible replacement data using the PMF.
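Substituting missing attribute data from a PMF might look like the sketch below. It assumes that missing values are represented as None and that a PMF for each attribute is already available as a mapping from value to probability; both the representation and the name impute_missing are hypothetical.

```python
import random


def impute_missing(row, pmfs, rng):
    """Return a copy of `row` in which None entries are replaced by a value
    sampled from the attribute's probability mass function."""
    filled = dict(row)
    for name, value in row.items():
        if value is None:
            candidates = list(pmfs[name])
            weights = [pmfs[name][c] for c in candidates]
            filled[name] = rng.choices(candidates, weights=weights, k=1)[0]
    return filled


rng = random.Random(1)
pmfs = {"gender": {"F": 0.5, "M": 0.5}}
row = {"city": "Paris", "gender": None}
filled = impute_missing(row, pmfs, rng)
```

The filled row can then be processed like any other record when determining its parity bit.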
In a second implementation of the second aspect, the processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record. Using a cryptographic hash allows for uniform distribution of values thereby avoiding a bias in the attribute values which might be used by an attacker to infer information about the watermark bit value.
In a third implementation of the second aspect, the processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values. Using such a probabilistic generation mechanism improves watermark extraction success rate even after deletion attacks, for example. This is better than the parity bit being obtained as the LSB of some hash or keyed-hash of an attribute, for example. That kind of approach would potentially introduce a strong bias which could be detected by an attacker and lead to a watermark removal attack. For instance, if an attribute Gender has two possible values ‘Male’ and ‘Female’ and the LSB of the hash is 1 for ‘Male’ and 0 for ‘Female’, then a cluster with an abnormally high number of ‘Male’ could reveal it is embedding a '0' watermark bit.
In a fourth implementation of the second aspect, the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes. By having values in the bit sequence be based on the relative order between an attribute value’s hash and the next attribute value’s hash, in embodiments, little or no bias is introduced if attributes have a substantially uniform distribution. This will be the case if the values are hashes or keyed hashes. Further, relative order of consecutive values provides a naturally resilient mechanism against random attribute deletions. For instance, in cases where 3 consecutive values are in order (a < b < c or a > b > c), then deletion of b preserves the ordering (a < c or a > c) so the vote of a will be unchanged. Even in cases where the values are not in order (for instance a > b and b < c), since the bias introduced is small, we can actually use prior knowledge about the real data distribution to compute the expected value of the votes for a and b. Experimentally, this has been shown to provide a very high rate of successful watermark bit extraction.
In a fifth implementation of the second aspect, the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto. Improved security is thereby ensured by using a secret key and making the process for determining the parity bit highly resilient to adversarial data manipulation. For example, where a watermark has been secretly embedded, it makes it impossible without knowledge of the secret key to extract the watermark.
In a sixth implementation of the second aspect, the order of the attributes of the synthetic data record is modified based on statistical properties of the attribute values.
In a seventh implementation of the second aspect, the statistical properties are entropy values and the order is modified such that the attributes alternately have a high and low entropy value.
In practice, distribution of attributes is often far from uniform. This optimal ordering step orders the attributes based on statistical properties of the distribution which allows the bias to be removed. For example, the values may be alternated between attributes close and far from a uniform distribution. One convenient measure indicating such a property is the entropy value. It can be shown that if every other attribute is uniformly random, the system introduces no bias. It has been empirically observed that for real data, any additional bias introduced by the synthetic data is negligible.
In an eighth implementation of the second aspect, the method comprises generating a confidence level based on a difference between the number of ones and zeros in the bit sequence used to determine a respective parity bit value, and determining whether to exclude the parity bit value from consideration when extracting the bit of the watermark based on the confidence level. This is advantageous because it maximizes the success rate of extraction after arbitrary deletions, where rows are extracted with a high confidence level. When the watermark bit is embedded, if we want to embed a 0 bit over 50 attributes, it is safer to select rows containing 30 votes for 0 and 20 votes for 1, rather than rows with 26 votes for 0 and 24 votes for 1. If the randomly generated row does not reach the desired resiliency level, it is discarded. At extraction, the same measure used for resiliency may be used to determine a value representing a confidence level. The confidence level thus indicates a level of confidence that a data record has a watermark bit embedded within it, e.g. by virtue of the data being synthetically generated and thus producing a desired level of confidence (i.e. the difference between the number of '1's and '0's being a high value). Experimentally, it can be observed that the successful sampling rate is around 5% even for high resiliency levels. This makes the scheme performant and resilient at the same time.
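The effect of excluding low-confidence records can be illustrated with the vote counts used in the example above. This is a sketch only: the threshold value, returning None on a tie and the function names are assumptions, and the vote sequences are constructed by hand rather than derived from hashes.

```python
def confidence(votes):
    """Difference between the number of ones and zeros in a vote sequence."""
    ones = sum(votes)
    return abs(ones - (len(votes) - ones))


def extract_bit_with_confidence(records_votes, min_confidence):
    """Majority vote over record parity bits, skipping low-confidence records.
    Returns None when no bit wins (tie handling is an assumption)."""
    tally = {0: 0, 1: 0}
    for votes in records_votes:
        if confidence(votes) < min_confidence:
            continue  # exclude this record's parity bit from consideration
        ones = sum(votes)
        tally[1 if ones > len(votes) - ones else 0] += 1
    if tally[0] == tally[1]:
        return None
    return 0 if tally[0] > tally[1] else 1


strong_zero = [0] * 30 + [1] * 20  # 30 votes for 0, 20 for 1: confidence 10
weak_one = [1] * 26 + [0] * 24     # 26 votes for 1, 24 for 0: confidence 2
records = [strong_zero, strong_zero, weak_one, weak_one, weak_one]

print(extract_bit_with_confidence(records, min_confidence=0))   # 1
print(extract_bit_with_confidence(records, min_confidence=10))  # 0
```

Without a threshold the three weak records outvote the two strong ones; with the threshold only the strong records, which are the ones likely to have been inserted deliberately, decide the extracted bit.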
In a ninth implementation of the second aspect, the method further comprises partitioning a candidate digital database into the at least one cluster based on predetermined attributes with high utility. By basing the partitioning into clusters on high utility attributes the method can be made more reliable. This is because the success will depend on how many of those attributes are preserved. The partitioning may be performed only once per database and may be independent of the watermark value and database content. The partitioning may be according to a partitioning result derived when embedding the watermark and obtained from a memory.
According to a third aspect, there is provided a device for applying a digital watermark comprising means for obtaining at least one cluster of a digital database, wherein a cluster comprises one or more data records of the database to be associated with a bit of a digital watermark, means for generating a synthetic data record, means for processing values of attributes of the synthetic data record to obtain a parity bit value, and means for including the synthetic data record in the cluster based on whether the parity bit value matches the bit value of the digital watermark to be associated with the cluster.
In a first implementation of the third aspect, processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record.
In a second implementation of the third aspect, processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values.
In a third implementation of the third aspect, the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes.
In a fourth implementation of the third aspect, the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto.
In a fifth implementation of the third aspect, the device comprises means for modifying the order of the attributes of the synthetic data record based on statistical properties of the attribute values.
In a sixth implementation of the third aspect, the statistical property is an entropy value and the order is modified such that the attributes alternately have a high and low entropy value.
In a seventh implementation of the third aspect, the device comprises means for determining a resilience level based on a difference between the number of ones and zeros in the bit sequence, and the including of the synthetic data entry is based on the determined resilience level.
In an eighth implementation of the third aspect, the device comprises means for including synthetic data records in the cluster until a predetermined target number of records is reached.
In a ninth implementation of the third aspect, the device comprises means for partitioning a digital database into the at least one cluster based on predetermined attributes with high utility.
The advantages of the third aspect and the first to ninth implementations correspond to those of the first aspect and the first to ninth implementations thereof and for brevity will not be repeated here.
According to a fourth aspect, there is provided a device for extracting a digital watermark comprising means for obtaining at least one cluster of data records of a digital database, wherein a cluster comprises one or more data records associated with a bit of a digital watermark, means for processing attribute data of the one or more data records of the cluster to determine respective parity bit values, and means for extracting a bit of the watermark based on a most common bit value among the determined parity bit values.
In a first implementation of the fourth aspect, the device comprises means for adding attribute data to a data record of the cluster where the attribute data for an attribute is missing based on a probability mass function (PMF).
In a second implementation of the fourth aspect, the processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record.
In a third implementation of the fourth aspect, the processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values.
In a fourth implementation of the fourth aspect, the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes.
In a fifth implementation of the fourth aspect, the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto.
In a sixth implementation of the fourth aspect, the device comprises means for modifying the order of the attributes of the synthetic data record based on statistical properties of the attribute values. In a seventh implementation of the fourth aspect, the statistical properties are entropy values and the order is modified such that the attributes alternately have a high and low entropy value.
In an eighth implementation of the fourth aspect, the device comprises means for generating a confidence level based on a difference between the number of ones and zeros in the bit sequence used to determine a respective parity bit value, and means for determining whether to exclude the parity bit value from consideration when extracting the bit of the watermark based on the confidence level.
The advantages of the fourth aspect and the first to ninth implementations correspond to those of the second aspect and the first to ninth implementations thereof and for brevity will not be repeated here.
According to a fifth aspect, there is provided a computer program comprising instructions which upon execution cause a processor to carry out a method according to any implementation of the first and second aspects mentioned above. The computer program may be stored on a data carrier or other computer-readable medium, for example. The computer-readable carrier medium may be transitory or non-transitory.
According to a sixth aspect, there is provided a device comprising one or more processors and a memory configured to perform the method of any implementation of the first or second aspects mentioned above. For instance, the memory may include instructions which, when executed by the processor(s), perform the method of any implementation of the first or second aspects mentioned above, or the steps performed by the means of the devices of the third and fourth aspects above and their implementations.
In a further implementation of any one of the third, fourth and sixth aspects, the device may be a server or other network element or device, for example a cloud server or a virtual machine operating in a Hadoop cluster. Thus, the advantages of resilient watermarking may be obtained in a networked computing environment where multiple entities on the network may have access to the database, making it vulnerable to manipulation.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows a block diagram of a watermark embedding process according to an embodiment;
Figure 2 shows a block diagram of elements for carrying out a watermark embedding algorithm according to an embodiment;
Figure 3 shows a flowchart outlining a method of embedding a watermark according to an embodiment;
Figure 4 shows a block diagram illustrating a process for clustering data records (rows) of a database according to an embodiment;
Figure 5 shows a block diagram illustrating a process for generating synthetic data according to an embodiment;
Figure 6 shows a block diagram illustrating a process for determining a parity bit from attributes of a data record (row);
Figure 7 shows a block diagram illustrating a watermarking extraction process;
Figure 8 shows a block diagram of elements for carrying out a watermark extraction algorithm according to an embodiment;
Figure 9 shows a flow chart outlining a method of extracting a watermark bit from data records;
Figure 10 shows a block diagram illustrating a process for extracting data records (rows) as clusters of data records;
Figure 11 shows a block diagram illustrating a process for determining a parity and resiliency level from extracted data records (rows); and
Figure 12 shows a device having a processor and a memory for implementing embodiments.
DESCRIPTION
Embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein.
Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.
The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.
Watermark Embedding
Figure 1 shows schematically a process according to an embodiment whereby a raw database 110 and a digital watermark (e.g. a binary sequence) 120 are provided as inputs. The database 110 is modified by applying an embedding algorithm 130 so as to embed the digital watermark 120 and thereby generate a watermarked digital database 140 as an output.
An implementation of the embedding algorithm 130 is shown in Figure 2. A data clustering module 131 implements a clustering algorithm that is responsible for partitioning the database into groups of rows (data records) called clusters 132. The watermark bits are embedded into the clusters by adding rows of synthetic data. In this embodiment there is a one to one correspondence between a cluster and a watermark bit of the digital watermark 120.
A synthetic generation module 133 implements a synthetic generation algorithm to generate the synthetic data which will add rows of synthetic data to the clusters of rows selected by the clustering module 131. Synthetic data is added at node 134 for each data cluster such that each bit of the watermark is embedded in a respective data cluster. The rows of the modified database including the synthetically generated data rows is output as the watermarked database 140. The functional modules and elements of Figure 2 may be implemented as hardware or software or a combination of both.
Figure 3 shows an overview of a method 300 of embedding a watermark that may be implemented with the functional modules of Figure 2. In a first block 301, a cluster of data records may be obtained. A synthetic data record is generated at step 302 which will contain artificial attribute values for the data record. The attributes are then processed, e.g. according to a predetermined algorithm, in order to obtain a parity bit. The attributes may be individually processed, e.g. to generate a hash, and the resulting values used to determine the parity bit value. It is then determined whether the parity bit matches a respective watermark bit associated with the cluster and, if so, the record may be included in the cluster based on the match. In an embodiment, further criteria may need to be satisfied in order for the synthetic data record to be added. In an embodiment described in more detail below, for example, a resilience level may also be determined and the synthetic data only added if the resilience level is above a threshold. If the parity bit does not match the watermark bit, a further synthetic data record may be generated until a match is found. The method 300 may be repeated until a desired number of synthetic data records have been added to the cluster, for example until a predetermined number have been added or until the data cluster contains a majority of data records that return a parity bit value that matches the associated watermark bit value.
Figure 4 is a flow chart that details the elements of the clustering algorithm 131. In an attribute partition step 401 the list of attributes from the database is partitioned in order to define clusters of data records. In an embodiment, the database is split between high utility 402 and other attributes 403. The high utility attributes are used for selecting which data records belong to a cluster and will be used to embed the watermark. Then for each bit position of the watermark, a different set of values for those attributes is selected. The set of attributes used to identify data records in a cluster may be called a quasi-identifier. The watermark recovery success depends heavily on how many of those attributes used for the quasi-identifier(s) are preserved. Accordingly, we select ranges or sets of high-utility attributes and their values to be used as quasi-identifiers for the partitioning.
Then for each watermark bit position (0, 1, 2, ...) 404, a procedure is used, which outputs one of all possible quasi-identifier values. For instance, a keyed-hash-based method may be used, with the bit position as input and the cluster list as hash domain.
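The keyed-hash-based procedure mentioned above could be sketched as follows. This is a minimal illustration only: the use of HMAC-SHA256, the reduction modulo the list length, and the names `select_cluster`, `cluster_list` and `secret_key` are assumptions and are not specified in the text.

```python
import hashlib
import hmac

def select_cluster(bit_position, cluster_list, secret_key):
    """Map a watermark bit position to one quasi-identifier value.

    Assumption: the bit position is hashed under the key with HMAC-SHA256
    and the digest is reduced modulo the number of candidate clusters.
    """
    message = str(bit_position).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    index = int.from_bytes(digest, "big") % len(cluster_list)
    return cluster_list[index]

# Hypothetical list of quasi-identifier values (Name, Post Code).
cluster_list = [("Jack", 80000), ("Jane", 50000), ("John", 10000)]
secret_key = b"example-key"  # hypothetical key
selected = [select_cluster(j, cluster_list, secret_key) for j in range(4)]
```

Because the mapping is deterministic for a given key, it does not need to be stored. Note that two bit positions may map to the same cluster in this sketch, so a practical procedure would need some way of resolving such collisions.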
This step is only performed once per database and is independent of the watermark value and database contents. If the procedure used is not deterministic, the result should be stored for use when recovering the watermark.
According to embodiments, the attribute partition 401 may be done manually by an operator and, in other embodiments, automatically by an algorithm. The end goal is to select a quasi-identifier, i.e. a set of attributes which splits the data into small clusters of a few elements each. The desired cluster size is a parameter which can be tuned to provide higher resilience at the cost of extra storage.
In an example, shown in Figure 4, the data base is structured to have data records (e.g. rows) with the attribute fields ‘Name’, ‘Gender’, ‘Age’, ‘Post Code’, ‘Country’, ‘Marital Status’ and ‘Income’. High utility attributes are identified as ‘Name’ and ‘Post Code’. These high utility attributes are used to select quasi-identifiers which define the clusters of data records.
At block 405, cluster selection is performed. This takes clusters as defined by the attribute partition 404 to embed the watermark bits. Each cluster will embed one watermark bit. This step can be performed in a number of ways.
One simple way is to randomly pick one cluster that exists within the data for each watermark bit position and keep track of each selected cluster. This requires relatively small storage space as only the values of the attributes in the quasi-identifier must be recorded for each bit position. This is independent of the actual watermark value and thus only needs to be done once per database.
Alternatively, other row selection methods used in watermarking schemes can be applied here. For instance, the bit sequence forming the quasi-identifier can be hashed under some secret key and hashes ending with a sequence representative of a bit position are retained to embed the watermark bit at this given position. In this case, only the secret hashing key should be recorded. However, this method will not be resilient to deletion of attributes within the quasi-identifier. In other words, the watermark may not be recoverable if such a hashing scheme is used.
The result of the cluster selection is a selected cluster list 406 having groups of records corresponding to each watermark bit position 404. In the shown example, the high utility attributes identified in the attribute partition 401 are ‘Name’ and ‘Post Code’. The quasi-identifier for bit position ‘0’ is Name=‘Jack’ and Post Code=80000, whereas for bit position ‘1’ the quasi-identifier is Name=‘Jane’ and Post Code=50000. The list of quasi-identifiers and their respective watermark bit positions make up the selected cluster list 406.
Figure 5 presents the components of the synthetic generation algorithm which will add data records to the identified clusters from the cluster list 406 to encode the associated watermark bits. In more detail, for each given cluster corresponding to a watermark bit position, rows are synthetically generated to build the cluster. This process is repeated for each watermark bit position j. Hereafter we use wj to refer to the value of the jth bit of the watermark. A Synthetic Data Generator 501 is used to produce random data records (rows) 502 as candidates for addition to a data cluster of the watermark bit currently under consideration and used as an input. Any generator may be used, for instance generic Auto-Encoders or manually programmed generators specifically for a database.
Ideally, the process is able to generate synthetic data according to the real data distribution to avoid bias. For this purpose, an Auto-Encoder system trained on the real data may be used. Other embodiments may include independently sampling each attribute value according to this particular attribute’s distribution, or writing a custom engine based on knowledge of the underlying data.
In an example, shown in Figure 5, a synthetic data record having the values Name=‘Jack’, Gender=‘M’, Age=50, Post Code=80000, Country=‘UK’, Marital Status=‘Single’ and Income=‘90k’ is generated as a candidate synthetic data record to be added to the database. Note that in Figure 5 we show the ‘Name’ and ‘Post Code’ attributes as forming part of the synthetic data. However, given that the quasi-identifier for the cluster mandates those values, they are not required to be generated as such in the synthetic data generator block 501 and, at least in some embodiments, are not used to embed (or extract) the watermark bit.
Then a keyed hash/ordering/parity algorithm 503 (KOP algorithm hereafter) is used to compute from each candidate synthetic data record a parity bit of this row. The parity bit value is returned at block 504 and determination is made at block 505 if it is equal to the watermark bit currently being embedded. Finally, the row is added to the cluster at block 506 and then the database at 134 if its parity bit matches the current watermark bit being embedded. If the parity bit of the candidate data record for addition does not match the watermark bit, that data is discarded and the process starts over at block 507.
Thus, for each watermark bit, the system will create a cluster of synthetic data records (rows) such that the KOP algorithm returns the watermark bit for each row in the cluster. Depending on the output bit (parity bit) of this algorithm (and optionally, a resilience level), the data record (row) may or may not be added to the cluster. When the cluster is filled, it is added to the database. The synthetic data records generated are shown at block 508 as a synthetic cluster to be added to the other records in the cluster if it is determined at block 509 that the cluster is full. A cluster may be considered full, for example, when a predetermined number of synthetic records have been added, the cluster including the synthetic data records reaches a predetermined size, or it is determined that a majority of records in the cluster including the synthetic entries return a parity bit equal to the watermark bit. The number of records returning the matching parity bit compared with those that do not match may be required to be higher than a certain threshold in order to ensure a certain amount of resilience in the embedding of the watermark bit. However, given the roughly 0.5 probability that an existing data record will return a parity bit that matches the watermark bit, it may be enough to add only one or a small number of records in order for the watermark bit to be successfully embedded such that it can be reliably extracted.
Figure 6 shows a flow chart showing more details of the KOP algorithm 503. The inputs are a candidate synthetic data record 502 and a secret key 601. As mentioned above, once a candidate synthetic data record has been created by the synthetic data generator, the Keyed-Hash/Ordering/Parity (KOP) algorithm 503 outputs a parity bit, which determines whether or not this specific candidate should be added to the database 140. Note that the ‘Name’ and ‘Post Code’ attribute values are not required to be used in the KOP process, because those values form the quasi-identifier for the cluster and therefore do not change. In an example, shown in Figure 6, the synthetic data used in the KOP process is ‘M’, ‘50’, ‘UK’, ‘Single’, ‘90k’.
Firstly, at block 602, an optional but useful step of Optimal Attribute Ordering is performed. The different attributes are ordered such that the watermark resilience is maximized. The optimal attribute ordering step will simply modify the positions of the attributes according to an optimal ordering based on the statistics (e.g. distribution) of values. The ordering impacts the result of the algorithm. This step significantly improves resiliency and genericity by limiting the impact of low-entropy attributes on the system. However, it is possible to omit the attribute ordering while still obtaining a resiliently embedded watermark bit. In the example of Figure 6, the reordering of the attribute values is from: (1) ‘Gender’, (2) ‘Age’, (3) ‘Country’, (4) ‘Marital Status’, (5) ‘Income’ to (1) ‘Age’, (2) ‘Marital Status’, (3) ‘Income’, (4) ‘Country’, (5) Gender.
The optimal attribute ordering may be performed only once per database. In an embodiment, the entropy of each attribute on the real data is computed. The entropy of a discrete random variable $X$ is defined as $E(X) = -\sum_{k \in K} p_k \log p_k$, where $K$ is the domain of $X$ and $p_k = P(X = k)$. All attributes are placed in a set $R$. First, the attribute with the highest entropy in $R$ is placed anywhere in the output vector $S$. At each step, the attribute with the lowest entropy from $R$ is placed next to the attribute with the highest entropy in $S$ which still has an empty neighbor. Then the attribute with the highest entropy in $R$ is placed next to the attribute with the lowest entropy in $S$ which still has an empty neighbor. Note that we consider the first and last elements of $S$ to be neighbors. When the set $R$ is empty, $S$ will contain the desired order, which can be represented as a permutation $p$ on the set of attributes.
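One possible reading of this ordering procedure is sketched below. It is not a definitive implementation: the base-2 logarithm, the tie-breaking between equal entropies, the use of a double-ended queue to model the circular vector S, and the entropy values in the usage example are all assumptions. Because each new attribute is placed next to an existing one, the occupied positions always form a contiguous arc, so the only attributes with an empty neighbor are the two ends of that arc.

```python
import math
from collections import Counter, deque

def entropy(values):
    """Shannon entropy of the observed values (log base 2 is an assumption)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def order_attributes(entropies):
    """Interleave low- and high-entropy attributes on a ring.

    entropies: dict mapping attribute name to its entropy.
    Returns the attribute names in ring order (first and last are neighbors).
    """
    remaining = sorted(entropies, key=entropies.get)  # ascending entropy
    ring = deque([remaining.pop()])                   # start with the highest
    take_lowest = True
    while remaining:
        left, right = entropies[ring[0]], entropies[ring[-1]]
        if take_lowest:
            attr = remaining.pop(0)       # lowest entropy left in R
            attach_right = right >= left  # goes next to the higher-entropy end
        else:
            attr = remaining.pop()        # highest entropy left in R
            attach_right = right <= left  # goes next to the lower-entropy end
        if attach_right:
            ring.append(attr)
        else:
            ring.appendleft(attr)
        take_lowest = not take_lowest
    return list(ring)

# Hypothetical entropy values, for illustration only.
example = {"Age": 5.0, "Income": 4.0, "Country": 2.0,
           "Marital Status": 1.5, "Gender": 1.0}
ordering = order_attributes(example)
```

With these made-up entropies, each low-entropy attribute ends up between two higher-entropy neighbors, which is the effect the ordering step is intended to achieve.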
Next, during step 602, the input vector $V_0 = (v_1^{(0)}, \dots, v_m^{(0)})$ is sorted according to $p$; the output vector is $V_1 = (v_1^{(1)}, \dots, v_m^{(1)})$, where $m$ is the number of attributes considered by the KOP process, in other words the total number of attributes in a data record (row) minus the attributes used for clustering (i.e. the quasi-identifier attributes).
The attributes (whether optimally ordered or in the original order) are then subjected to a keyed hashing process 603. In an embodiment, the keyed hashing step 603 hashes every attribute independently by appending a secret key and the attribute name to the attribute value, e.g. SHA256(‘50’ | ‘Age’ | 0xA1B2C3D4), SHA256(‘Single’ | ‘Marital Status’ | 0xA1B2C3D4), etc. Other keyed hashes based on the attribute value are possible, however. In an embodiment, the secret key may be omitted and the hashes based purely on the attribute data. This sacrifices some security (since anyone with knowledge of the process, even without a key, may be able to extract the watermark) but removes the need for a recipient extracting the watermark to have access to a shared secret key.
According to an embodiment, a cryptographically secure hash function $h$ should be selected, such as SHA256. This function outputs an integer, or more specifically a byte sequence which can be interpreted as an unsigned integer. According to an embodiment, the secret key $sk$ is also selected (once per watermark). We write $A = (a_1, \dots, a_m)$ for the ordered list of attribute names and $V_1 = (v_1^{(1)}, \dots, v_m^{(1)})$ for the input to this step.
The output of this step is $V_2 = (v_1^{(2)}, \dots, v_m^{(2)})$, where $v_i^{(2)} = h(v_i^{(1)} \,|\, a_i \,|\, sk)$.
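The per-attribute keyed hashing can be illustrated with Python's standard `hashlib`. This is a sketch under assumptions: the byte encoding (UTF-8), the literal `|` separator and the helper name `keyed_hash` are not prescribed by the text, which only states that the key and attribute name are appended to the value.

```python
import hashlib

def keyed_hash(value, name, secret_key):
    """Hash one attribute as SHA256(value | name | key), read as an integer.

    Assumption: fields are UTF-8 encoded and joined with a '|' byte.
    """
    data = str(value).encode("utf-8") + b"|" + name.encode("utf-8") + b"|" + secret_key
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

secret_key = bytes.fromhex("A1B2C3D4")  # key value taken from the example above
ordered_record = [("Age", 50), ("Marital Status", "Single"), ("Income", "90k"),
                  ("Country", "UK"), ("Gender", "M")]
hashes = [keyed_hash(value, name, secret_key) for name, value in ordered_record]
```

Including the attribute name in the hash input means that the same value appearing under two different attributes yields two unrelated hash values.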
An ordering process 604 follows which will check for every hash value if it is smaller or larger than the next hash value and return a corresponding bit value, e.g. a ‘1’ if larger or a ‘0’ if smaller. The order of the attributes, thus, creates a sequence of values which provides a sequence of bits. In the example of Figure 6, the hashed data is the sequence of values 0xF589, 0x6584, 0xDE65, 0x45FC, 0x7A7A. This gives a corresponding sequence of Larger, Smaller, Larger, Smaller, Smaller <=> 1, 0, 1, 0, 0.
In an embodiment where the input vector of hashed attributes is:
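$V_2 = (v_1^{(2)}, \dots, v_m^{(2)})$, the output vector is $V_3 = (v_1^{(3)}, \dots, v_m^{(3)}) \in \{0,1\}^m$, where each $v_i^{(3)}$ is the bit obtained by comparing $v_i^{(2)}$ with the next hash value $v_{i+1}^{(2)}$ (taking $v_{m+1}^{(2)} = v_1^{(2)}$).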
In other words, the output vector represents whether or not consecutive hashes are in ascending order. The parity bit 606 of the row is computed at 605 by determining the most common bit in the bit sequence generated based on the hash values. In an embodiment, a resilience level 607 may also be returned by computing the difference between the number of 1s and the number of 0s in the sequence of bits. This can be used as a refinement to improve resiliency at the cost of more synthetic data generation. Accordingly, the KOP algorithm 503 outputs for a candidate data record a parity bit 606 and optionally a resilience level 607 at block 504.
Following the notation used above, according to an embodiment, for the input vector $V_3 = (v_1^{(3)}, \dots, v_m^{(3)}) \in \{0,1\}^m$, the parity step 605 returns the following two values at block 504:
Parity bit $b = \mathrm{majority}_i\, v_i^{(3)} \in \{0,1\}$
Resilience level $r = \left| \sum_{i=1}^{m} v_i^{(3)} - \sum_{i=1}^{m} \left(1 - v_i^{(3)}\right) \right|$
Note that the resilience level represents the absolute difference between the number of 1s in V3 and the number of 0s in V3.
In the example shown in Figure 6, the parity bit that is returned will be '0' which matches the watermark bit at the position corresponding to the cluster. Therefore, the synthetic data will be added because of a positive match with the watermark bit being determined e.g. at block 505 of Figure 5.
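The ordering and parity steps can be checked against the Figure 6 hash values with a short sketch. It follows the convention of that worked example (a bit is 1 when a hash is larger than the next one, with the last hash compared against the first); resolving a tie in the majority vote to 0 is an assumption, as are the function names.

```python
def order_bits(hashes):
    """Bit i is 1 if hash i is larger than the next hash (wrapping around)."""
    m = len(hashes)
    return [1 if hashes[i] > hashes[(i + 1) % m] else 0 for i in range(m)]

def parity_and_resilience(bits):
    """Return (majority bit, |#ones - #zeros|); a tie yields 0 (assumption)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return (1 if ones > zeros else 0), abs(ones - zeros)

figure6_hashes = [0xF589, 0x6584, 0xDE65, 0x45FC, 0x7A7A]
bits = order_bits(figure6_hashes)                    # [1, 0, 1, 0, 0]
parity, resilience = parity_and_resilience(bits)     # 0 and 1
```

For these values the sketch reproduces the bit sequence 1, 0, 1, 0, 0 and the parity bit ‘0’ of the example, with a resilience level of 1 (three 0s against two 1s).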
If the parity bit b for the synthetic row is equal to the current watermark bit wj and if the resilience level r is greater than a user-defined threshold, the row is added to cluster j. Otherwise, step 1.2 is repeated.
The process shown in Figure 5 may be repeated for every bit position j until the corresponding cluster has been filled with synthetic data offering an appropriate resilience level.
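The generate-and-test loop for a single cluster might look like the following sketch. The generator and the parity function are passed in as callables; `toy_generate` and `toy_kop` are hypothetical stand-ins used only to make the example runnable and do not implement the KOP algorithm. The `max_tries` bound is likewise an assumption added to guarantee termination.

```python
import hashlib
import random

def fill_cluster(watermark_bit, generate, kop, size, threshold=0, max_tries=100_000):
    """Keep candidates whose parity equals the watermark bit and whose
    resilience level exceeds the threshold, until the cluster is full."""
    cluster = []
    for _ in range(max_tries):
        if len(cluster) >= size:
            break
        candidate = generate()
        parity, resilience = kop(candidate)
        if parity == watermark_bit and resilience > threshold:
            cluster.append(candidate)   # accepted synthetic row
        # otherwise the candidate is discarded and a new one is generated
    return cluster

rng = random.Random(0)

def toy_generate():
    # Hypothetical generator: samples each attribute independently.
    return {"Gender": rng.choice("MF"), "Age": rng.randint(18, 90),
            "Country": rng.choice(["UK", "USA", "FR"])}

def toy_kop(record):
    # Stand-in for the KOP algorithm: derives a pseudo-random bit and level.
    digest = hashlib.sha256(repr(sorted(record.items())).encode("utf-8")).digest()
    return digest[0] & 1, 1 + digest[1] % 3

synthetic_rows = fill_cluster(1, toy_generate, toy_kop, size=3)
```

Raising `threshold` makes each accepted row more robust to attribute deletion but increases the number of candidates that must be generated, which is the trade-off described above.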
According to an embodiment, standard post-processing for artificial data-based watermarking techniques may then be performed, such as shuffling the order of attributes and rows.
Watermark Extraction
Figure 7 is a block diagram showing a process for extracting a watermark from a database. A candidate database 710 from which the watermark is to be extracted is supplied as input to an extraction algorithm 720. The candidate database 710 is processed according to the extraction algorithm to determine a watermark which is provided as an output 730. If the extracted watermark bits are deemed unreliable by the process, or if required data is missing, the output may register a failed attempt or a partial extraction of a watermark. This algorithm should output the originally applied watermark if the candidate database 710 is an altered version of a raw database 110 which has had a watermark embedded according to the process mentioned above.
The functional modules for performing the extraction algorithm according to an embodiment are shown in Figure 8. A clustering module 721 performs clustering in substantially the same manner as described before with respect to the embedding process. Since high-utility attributes are less likely to be removed, the clusters chosen in the previous step can be retrieved in the candidate database. The relevant quasi-identifier for a cluster may be automatically obtained or manually provided, as long as the same quasi-identifier is used to perform the same clustering that would have been used to embed the watermark. The output of the clustering algorithm will be a plurality of clusters 722 each containing a group of data records 722-1, ..., 722-N. These are processed by an extraction module 725 in which respective clusters of data records 722-1, ..., 722-N are separately processed 723-1, ..., 723-N to recover respective parity bits for the clusters. That is, the attributes of respective records in a cluster 722-1, ..., 722-N are processed according to a predetermined algorithm which returns a parity bit value. At block 724-1, ..., 724-N a most common bit value is obtained as a cluster bit from among the recovered parity bits. In other words, a cluster bit is the majority vote of the parity bits of each row in the cluster. The recovered cluster bit corresponds to a bit of the watermark having the position in the watermark bit sequence that is associated with the data cluster. In this way, the sequence of bits that make up the watermark 730 may be recovered from the data records (rows) of the database.
As will be further explained below, if some attributes are missing from a data record in one of the clusters, a recovery algorithm may be used to select rows most likely to have been part of the original cluster.
The process of extraction according to an embodiment is shown in the flowchart of Figure 9. In a first step, a cluster of data records (e.g. rows) is obtained. Attribute data in one or more data records in the obtained cluster are processed at block 902 to determine respective parity bits. The processing, in an embodiment, may involve processing the attributes in a record individually by hashing them, for example using keyed hashing with a secret key, and basing the parity bit on the hashed attribute values. Whether a parity bit is validly extracted as a watermark bit may depend on a measure of reliability. At block 903 a most common bit (a cluster bit) is obtained from the parity bits determined from the respective data records in the cluster. The watermark is extracted at block 904 based on the cluster bit or bits.
Figure 10 gives an overview of the clustering process 721 during the watermark extraction phase. It is identical to the algorithm presented in Figure 4, except that, according to an embodiment, it may take into account that some attributes may no longer be present in the candidate database. In particular, if a non-deterministic algorithm was used to select the clusters at the embedding stage, then the quasi-identifier values stored previously are used for the cluster selection at extraction. According to an embodiment, if some attributes from the quasi-identifier have been deleted, all the rows that match the remaining attributes should be extracted. The attribute partition step 1002 is substantially the same as described earlier with respect to step 401 of Figure 4. If possible, the attribute partitioning should be performed on the original database structure 1001 (supplied as an input) rather than on the structure of the candidate database. Either way, the quasi-identifier returned by block 1002 should be the same set of attributes as the one initially selected during the embedding of the watermark. The output is the set of high utility attributes 1003 on which to cluster the candidate database 140 and the other attributes 1004 which are used to extract the relevant data records subject to the watermark extraction algorithm 725.
Cluster selection 1005 is substantially the same as the corresponding step 405. Since the cluster selection should be deterministic (or its result has been recorded), it is easy to reproduce the result at this point and obtain the list of quasi-identifier values associated with each watermark bit position 1009 as a selected cluster list 1006. In this example, we mirror the clustering steps performed in Figure 4, so that the attributes ‘Name’ and ‘Post Code’ are used to identify the quasi-identifiers corresponding to the clusters.
For each watermark bit position, the value of the quasi-identifier for the clusters embedding the watermark bit value has been determined. Accordingly, every data record (row) in the candidate database matching this quasi-identifier value is selected during the row extraction step 1007.
In the unlikely event that some attributes of the quasi-identifier are missing from the candidate database, according to an embodiment, we select every row that matches the remaining attributes. Since the expected value of the parity bit for non-synthetic rows is 0.5, these should not significantly affect the majority vote between all the cluster rows. Depending on the nature of the data, some heuristic can also be used to recover the expected value of the missing attributes. In the example shown in Figure 10, the quasi-identifier of Name=‘Jack’ and Post Code=80000, corresponding to bit position ‘0’, returns the following cluster of data records:
Data Record 1: Jack, M, 20, 80000, [PMF], Married, 30k
Data Record 2: Jack, M, 50, 80000, UK, Single, 90k
Data Record 3: Jack, M, 30, 80000, USA, Married, 40k
These rows or data records are used as inputs in an extraction process shown in Figure 11. The first of these candidate data records is missing the attribute value for the ‘Country’ attribute. However, as will be explained below, this missing value may be replaced with a value determined from a probability mass function (PMF).
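The row extraction step, including the fallback to the remaining quasi-identifier attributes, can be sketched as a simple filter. Representing rows as dictionaries and the function name `select_rows` are assumptions; the last two rows in the example are hypothetical and are included only to show rows being filtered out.

```python
def select_rows(rows, quasi_identifier, available_attributes):
    """Return every row matching the quasi-identifier values on the
    attributes that are still present in the candidate database."""
    usable = {name: value for name, value in quasi_identifier.items()
              if name in available_attributes}
    return [row for row in rows
            if all(row.get(name) == value for name, value in usable.items())]

rows = [
    {"Name": "Jack", "Gender": "M", "Age": 20, "Post Code": 80000,
     "Country": None, "Marital Status": "Married", "Income": "30k"},
    {"Name": "Jack", "Gender": "M", "Age": 50, "Post Code": 80000,
     "Country": "UK", "Marital Status": "Single", "Income": "90k"},
    {"Name": "Jack", "Gender": "M", "Age": 30, "Post Code": 80000,
     "Country": "USA", "Marital Status": "Married", "Income": "40k"},
    # Hypothetical rows outside the cluster:
    {"Name": "Jane", "Gender": "F", "Age": 41, "Post Code": 50000},
    {"Name": "Jack", "Gender": "M", "Age": 65, "Post Code": 50000},
]
quasi_identifier = {"Name": "Jack", "Post Code": 80000}

cluster = select_rows(rows, quasi_identifier, {"Name", "Post Code"})
name_only = select_rows(rows, quasi_identifier, {"Name"})  # 'Post Code' deleted
```

When ‘Post Code’ is no longer available, the selection widens to every row with Name=‘Jack’, so more non-synthetic rows take part in the vote.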
Figure 11 shows in more detail the steps in extracting the parity bits at 724-1, ..., 724-N during extraction process 725. According to an embodiment, the parity bit recovery process largely mirrors the process of generating the parity bit for validating a synthetic data record described in Figure 6. From the clustering step 721 we have a set of candidate data records (rows) for each watermark bit position $j$ from which to extract the candidate watermark bit $w'_j$.
According to an embodiment, however, there is an addition of a recovery step 1101 at the beginning, in which missing attribute values are replaced with an attribute value generated according to a Probability Mass Function (PMF) applied to the missing attribute, the probability mass function having been determined from the database when the watermark was embedded. In the example illustrated in Figure 11, a data record (row of data) being processed is missing the attribute value for the ‘Country’ attribute. Accordingly, for the missing value, a ‘Country’ determined according to a PMF may be used as a substitute. This is then propagated through the following steps and eventually returns an expected resiliency level. From the point of view of extraction, the resiliency level represents a confidence level indicating, for example, how confident one can be that the data record (row) truly has a watermark bit embedded within, e.g. that it is a synthetic record. If this confidence/resiliency level is significant enough, the recovered parity bit is considered as a valid vote for one watermark bit.
In this situation, the parity function may now be considered probabilistic and returns the expected resiliency level (difference between number of 1s and 0s), based on the PMFs of the missing attributes. If this value is above a certain threshold, the extraction is successful for this row and a vote is cast for the expected parity. The majority vote of all rows in one cluster determines the watermark bit for this cluster.
At recovery step 1101, for every (or one or more) attribute present in the original database and absent in the candidate database, the originally computed probability mass function of that attribute is read, truncated to the most likely values. The exact number of considered values may be a user-defined parameter. In some embodiments, the recovery step may be omitted and the missing data merely given a null or predetermined value and the rest of the process proceeding on that basis. Alternatively, data records with missing values may be omitted from the watermark extraction process altogether, although this may impact the ability of the algorithm to reliably return a successfully extracted watermark. In the example shown in Figure 11, the ’Country’ attribute is missing and this is replaced with the corresponding PMF at the recovery stage 1101.
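A truncated probability mass function for one attribute could be computed as in the following sketch. Renormalizing the retained probabilities so that they sum to one is an assumption (the text only says that the PMF is truncated to the most likely values), as are the function name and the sample column.

```python
from collections import Counter

def truncated_pmf(column, top_n):
    """Empirical PMF of a column, keeping only the top_n most likely values.

    Assumption: the kept probabilities are renormalized to sum to 1.
    """
    most_likely = Counter(column).most_common(top_n)
    kept_total = sum(count for _, count in most_likely)
    return {value: count / kept_total for value, count in most_likely}

# Hypothetical 'Country' column from the original database.
country_column = ["UK"] * 5 + ["USA"] * 3 + ["FR"] * 2
country_pmf = truncated_pmf(country_column, top_n=2)
```

Here `top_n` plays the role of the user-defined parameter mentioned above: a smaller value reduces the work in the later expected-order computation at the cost of ignoring unlikely values.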
Optimal Attribute Ordering may be performed at block 1102. Here the process of block 602 at the embedding stage is repeated (or its result is read from a cache) on the original database structure. The output of that step should exactly match the attribute ordering chosen during the embedding process (which should be a deterministic process in case the result is not cached). The attributes are then re-ordered accordingly. As before the re-ordering step is optional and the process may be performed without it. As long as the process matches that used at the embedding stage the watermark should be extractable.
Keyed Hashing is then performed at block 1103 on the re-ordered values. This step is similar to step 603 and, as before, it is possible to omit inclusion of the secret key in the hash at the cost of some security. According to an embodiment, for the non-missing attributes, a hash $h(v_i \,|\, a_i \,|\, sk) \in \mathbb{N}$ is computed for attribute $i$ with name $a_i$ and value $v_i$. For missing attributes, according to an embodiment, the hash value is computed based on a hash of the value determined from the probability mass function of the attribute obtained in the preceding recovery stage 1101.
A bit sequence is obtained at the hash ordering block 1104 based on the relative values of the sequence of hash values obtained. According to an embodiment, for each pair of consecutive attributes vi and vi+1 (looping around so that vm+1 = v1), an expected relative order is computed in the following manner:
• If both vi and vi+1 are present, the expected relative order is computed as already described above in the embedding step:
[Equation: Figure imgf000026_0003]
• If either one or both of vi and vi+1 are missing, we compute the expected value in the bit sequence as follows:
[Equation: Figure imgf000026_0004]
where k is an index that iterates over the domain of attribute i and k’ is an index that iterates over the domain of attribute i+1. If vi is missing, the equation must be used for ei-1 and ei. If vi+1 is also missing, then it should be used for ei+1 also. As will be appreciated, this equation can in fact be used for every value all the time. If vi is present, then P(vi=k) = 1 if k = vi and 0 otherwise, which gives us ei = vi <= vi+1 if they are both present.
An expected parity bit, and optionally a confidence value, is then determined from the bit sequence values at block 1105. The expected bit value is based on a majority vote. The confidence level is based on the difference between the number of 1s and 0s in the bit sequence. The steps 1101 to 1106 are repeated for each data record (row) and the watermark bit is determined as the majority vote of the expected parity bit in all the cluster rows.
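The expected relative order and the per-row majority vote described above can be sketched as follows. The equation images are not reproduced in this text, so the form of the expectation (a probability-weighted sum over candidate hash pairs), the 0.5 threshold used to turn an expectation into a bit, the tie-breaking rule, and all names are assumptions. Each attribute is represented as a mapping from candidate hash value to probability; a present attribute has a single entry with probability 1.

```python
from typing import Dict, List, Tuple

Dist = Dict[int, float]  # candidate hash value -> probability


def expected_order(current: Dist, following: Dist) -> float:
    """Probability that the current hash is <= the following hash,
    assuming the two attributes are independent."""
    return sum(
        p * q
        for h, p in current.items()
        for h_next, q in following.items()
        if h <= h_next
    )


def row_expectations(dists: List[Dist]) -> List[float]:
    """Expected bit for each consecutive pair, looping around at the end."""
    m = len(dists)
    return [expected_order(dists[i], dists[(i + 1) % m]) for i in range(m)]


def parity_and_confidence(expectations: List[float]) -> Tuple[int, int]:
    """Majority vote over the bit sequence and the |#1s - #0s| confidence."""
    bits = [1 if e >= 0.5 else 0 for e in expectations]  # assumed threshold
    ones = sum(bits)
    zeros = len(bits) - ones
    parity = 1 if ones >= zeros else 0  # ties resolved to 1 (assumption)
    return parity, abs(ones - zeros)


# Five attributes; the third is missing and has two candidate hashes.
row = [{10: 1.0}, {40: 1.0}, {45: 0.75, 5: 0.25}, {60: 1.0}, {70: 1.0}]
expectations = row_expectations(row)
parity, confidence = parity_and_confidence(expectations)
```

In this example the missing attribute affects the two pairs it takes part in, giving expectations of 0.75 and 1.0 for those positions, while the remaining positions reduce to ordinary comparisons.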
In an embodiment, a parity bit b and confidence level c are calculated according to:
[Equation: Figure imgf000027_0001]
where ei is an expected parity bit value of data record i in the cluster, and c is the confidence level.
In an embodiment, this is repeated for every selected row. If the confidence level for a row is below a user-defined threshold, then this row’s vote is discarded. Otherwise, the extracted watermark bit w'j is the majority vote between all of the expected parity bits extracted for the cluster rows in a cluster. The algorithm fails if the number of successfully extracted watermark bits is under another user-defined threshold.
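The cluster-level decision can be sketched as below. The function names, the treatment of ties and empty clusters as failed bits, and the use of `ValueError` to signal failure are assumptions; the two thresholds correspond to the user-defined thresholds mentioned above.

```python
from typing import List, Optional, Tuple

RowVote = Tuple[int, int]  # (expected parity bit, confidence level) for one row


def extract_cluster_bit(rows: List[RowVote], min_confidence: int) -> Optional[int]:
    """Majority vote over the rows whose confidence reaches the threshold.

    Returns None when no row qualifies or the vote is tied (assumption).
    """
    kept = [bit for bit, conf in rows if conf >= min_confidence]
    ones = sum(kept)
    zeros = len(kept) - ones
    if ones == zeros:  # also covers the case where every row was discarded
        return None
    return 1 if ones > zeros else 0


def extract_watermark(
    clusters: List[List[RowVote]], min_confidence: int, min_bits: int
) -> List[Optional[int]]:
    """Extract one bit per cluster; fail if too few bits are recovered."""
    bits = [extract_cluster_bit(rows, min_confidence) for rows in clusters]
    recovered = sum(bit is not None for bit in bits)
    if recovered < min_bits:
        raise ValueError(f"only {recovered} watermark bits extracted")
    return bits


# Hypothetical per-row votes for three clusters.
clusters = [
    [(1, 5), (1, 3), (0, 1)],
    [(0, 4), (0, 2), (1, 0)],
    [(1, 0)],
]
watermark = extract_watermark(clusters, min_confidence=2, min_bits=2)
```

With these inputs the third cluster has no row above the confidence threshold, so its bit is reported as unrecovered rather than guessed.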
The watermarking processes described above have at least the following benefits over other solutions:
1. It embeds a single bit of information over many different values
2. It introduces little to no bias to the real data distribution
3. It is strongly resilient against data manipulation
4. It is performant
These properties stem from the following mechanisms respectively:
• The embedded bit is a majority vote for each attribute. Typically, the vote for a given attribute would be obtained as the LSB of some keyed-hash of the attribute. This kind of approach would however introduce potentially strong bias which could be detected and lead to a watermark removal attack. For instance, if an attribute Gender has two possible values ‘Male’ and ‘Female’ and the LSB of the hash is ‘1’ for ‘Male’ and ‘0’ for ‘Female’, then a cluster with an abnormally high number of ‘Male’ could reveal it is embedding a ‘0’ watermark bit.
• Instead, in an embodiment, each attribute’s vote (e.g. bit value in the sequence) is based on the relative order between the value’s hash and the next attribute’s hash. It is straightforward to show that this introduces no bias if attributes have a uniform distribution. In practice, the distribution is often far from uniform. The optimal ordering step therefore alternates between attributes that are close to and far from a uniform distribution. It can be shown that if every other attribute is uniformly random, the system introduces no bias. We also empirically observe that for real data, the additional bias is negligible.
• Using a relative order of consecutive values, according to embodiments, provides a naturally resilient mechanism against random attribute deletions. For instance, in cases where 3 consecutive values are in order (a < b < c or a > b > c), then deletion of b preserves the ordering (a < c or a > c) so the vote of a will be unchanged. Even in cases where the values are not in order (for instance a > b and b < c), since the bias introduced is small, we can use prior knowledge about the real data distribution to compute the expected value of the votes for a and b. Experimentally, this gives us a very high rate of successful watermark bit extraction.
• To maximize the success rate of extraction after arbitrary deletions, according to embodiments, we may sample rows with a high resiliency level. For instance, if we want to embed a 0 bit over 50 attributes, it is safer to select rows containing 30 votes for 0 and 20 votes for 1, rather than rows with 26 votes for 0 and 24 votes for 1. If the randomly generated row does not reach the desired resiliency level, it is discarded. Experimentally, we have observed that the successful sampling rate is around 5% even for high resiliency levels. This makes the scheme performant and resilient at the same time.
Other embodiments
A scenario can be envisaged where a company collects data about its users, under some privacy policy allowing for data sharing with partners. After sending the data to the partners, the company has no control over what happens to it. If a partner leaks or re-sells that data, the company may be held responsible. It is essential that the leaked data can be traced back to the partner actually at fault.
According to an embodiment, our solution may be applied to a cluster of physical or virtual machines deployed in a cloud and a number of methods implemented through the Hadoop framework on those machines. The system is queried after data collection and inserts a proof of identity of each partner in the dataset (for instance a digital signature). Its output can then be shared with the partner.
Since, according to embodiments, the scheme is secure even when the existence of the watermark is public, the partner is now responsible for any leaks of that data and unable to remove his signature from the data itself without destroying its value.
In addition, we can consider a scenario where a company publicly sells datasets on a marketplace to many customers. Later, some of those customers share that dataset in a public place, allowing for piracy of the original data.
According to embodiments, the watermark embedding and extraction may be applied to a cluster of physical or virtual machines deployed in a cloud, with a number of operations implemented on the database through the Hadoop framework on those machines. The system may be connected to an authentication module required to access the data. In an embodiment, it dynamically receives the customer’s identity and secretly embeds it as a watermark in the dataset before sending it to the user.
If the data is leaked, the watermark can be extracted according to the watermark extraction process described herein, to reveal the identity of the responsible party. Since embodiments provide a non-malleable watermark, it could be used as evidence, e.g. if legal action is taken.
The method disclosed in the embodiments of the present invention may be applied in a processing unit 1201 of a device 1200. In a process of implementation, each step of the method may be completed by using an integrated logic circuit of hardware in the processing unit 1201 or instructions in a software form. These instructions may be implemented and controlled by using the processing unit 1201. Configured to execute the method disclosed in the embodiments of the present invention, the foregoing processing unit may include a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; and can implement or execute each disclosed method, step, and logic block diagram in the embodiments of the present invention. The general-purpose processor may be a microprocessor or the processor may be any common processor, and so on. The step with reference to the method disclosed in the embodiments of the present invention may be directly executed and completed by a hardware decoding processor or executed and completed by a combination of hardware and a software module in a decoding processor. The software module may be located in a mature storage medium in the art, such as a random-access memory, a flash memory, a read-only memory, a programmable read-only memory, an electronically erasable programmable memory, or a register. The storage medium is located in the memory 1202, and the processing unit 1201 reads information in the memory 1202, and completes the steps of the method with reference to the hardware. For example, the memory 1202 may store the database, watermark, secret key or partitioning data for the processing unit 1201 to use during embedding or extraction of the watermark.
In the above description the term data row is sometimes used when referring to the attribute fields of a data record in a database. It will be understood that the embodiments are not limited to any particular arrangement of data in the data record.
It is applicable to any configuration of data record having attributes (e.g. data fields).
As already mentioned, embodiments are applicable to the data regardless of whether it is numerical, text or any other form of data (e.g. images or video data). For example, the data may alternatively or additionally be arranged in columns or in a relational or linked database arrangement.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that such implementation goes beyond the scope of the present invention.
It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to the corresponding process in the foregoing method embodiments, and details are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communications connections may be implemented through some interfaces. The indirect couplings or communications connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. A part or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in the form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or a part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes: any medium that can store program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The present inventions can be embodied in other specific apparatus and/or methods. The described embodiments are to be considered in all respects as illustrative and not restrictive. In particular, the scope of the invention is indicated by the appended claims rather than by the description and figures herein. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

1. A method of applying a digital watermark comprising: obtaining at least one cluster of a digital database, wherein a cluster comprises one or more data records of the database to be associated with a bit of a digital watermark; generating a synthetic data record; processing values of attributes of the synthetic data record to obtain a parity bit value; and including the synthetic data record in the cluster based on whether the parity bit value matches the bit value of the digital watermark to be associated with the cluster.
2. A method according to claim 1, wherein processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record.
3. A method according to claim 1 or claim 2, wherein processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values.
4. A method according to claim 3, wherein the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes.
5. A method according to any of claims 2 to 4, wherein the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto.
6. A method according to any preceding claim, wherein the order of the attributes of the synthetic data record are modified based on the statistical properties of the attribute values.
7. A method according to claim 6, wherein the statistical property is an entropy value and the order is modified such that the attributes alternately have a high and low entropy value.
8. A method according to any of claims 3 to 7, further comprising: determining a resilience level based on a difference between the number of ones and zeros in the bit sequence, and wherein including the synthetic data entry is based on the determined resilience level.
9. A method according to any preceding claim, comprising including synthetic data records in the cluster until a predetermined target number of records is reached.
10. A method according to any preceding claim, comprising partitioning a digital database into the at least one cluster based on predetermined attributes with high utility.
11. A method of extracting a digital watermark comprising: obtaining at least one cluster of data records of a digital database, wherein a cluster comprises one or more data records associated with a bit of a digital watermark; processing attribute data of the one or more data records of the cluster to determine respective parity bit values; and extracting a bit of the watermark based on a most common bit value among the determined parity bit values.
12. A method according to claim 11, comprising adding attribute data to a data record of the cluster where the attribute data for an attribute is missing based on a probability mass function (PMF).
13. A method according to claim 11 or claim 12, wherein processing comprises calculating one or more cryptographic hash values based on attribute values of the synthetic data record.
14. A method according to any of claims 11 to 12, wherein processing comprises determining a most common bit value in a bit sequence generated based on the cryptographic hash values.
15. A method according to claim 14, wherein the values of the bit sequence are determined based on whether a respective hash value is smaller or larger than a subsequent hash value according to the order of the attributes.
16. A method according to any of claims 13 to 15, wherein the one or more cryptographic hash values are hashes of respective attribute values having a secret key appended thereto.
17. A method according to any of claims 11 to 16, wherein the order of the attributes of the synthetic data record are modified based on the statistical properties of the attribute values.
18. A method according to claim 17, wherein the statistical properties are entropy values and the order is modified such that the attributes alternately have a high and low entropy value.
19. A method according to any of claims 11 to 18, further comprising: generating a confidence level based on a difference between the number of ones and zeros in the bit sequence used to determine a respective parity bit value; and determining whether to exclude the parity bit value from consideration when extracting the bit of the watermark based on the confidence level.
20. A device for applying a digital watermark comprising: means for obtaining at least one cluster of a digital database, wherein a cluster comprises one or more data records of the database to be associated with a bit of a digital watermark; means for generating a synthetic data record; means for processing values of attributes of the synthetic data record to obtain a parity bit value; and means for including the synthetic data record in the cluster based on whether the parity bit value matches the bit value of the digital watermark to be associated with the cluster.
21. A device for extracting a digital watermark comprising: means for obtaining at least one cluster of data records of a digital database, wherein a cluster comprises one or more data records associated with a bit of a digital watermark; means for processing attribute data of the one or more data records of the cluster to determine respective parity bit values; means for extracting a bit of the watermark based on a most common bit value among the determined parity bit values.
22. A computer program comprising instructions which upon execution cause a processor to carry out a method according to any of claims 1 to 19.
PCT/EP2019/084666 2019-12-11 2019-12-11 Devices and methods for applying and extracting a digital watermark to a database WO2021115589A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/084666 WO2021115589A1 (en) 2019-12-11 2019-12-11 Devices and methods for applying and extracting a digital watermark to a database

Publications (1)

Publication Number Publication Date
WO2021115589A1 true WO2021115589A1 (en) 2021-06-17

Family

ID=69055969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/084666 WO2021115589A1 (en) 2019-12-11 2019-12-11 Devices and methods for applying and extracting a digital watermark to a database

Country Status (1)

Country Link
WO (1) WO2021115589A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150716A (en) * 2023-04-24 2023-05-23 中国科学技术大学 Database watermark embedding method, extraction method, storage medium and electronic device
CN117725565A (en) * 2023-12-04 2024-03-19 国网智能电网研究院有限公司 Data tracing method, device, equipment and medium based on digital watermark

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055554A1 (en) * 2003-05-23 2005-03-10 Radu Sion Method and system for rights assessment over digital data through watermarking
WO2017093736A1 (en) * 2015-12-01 2017-06-08 Privitar Limited Digital watermarking without significant information loss in anonymized datasets
CN107992726A (en) * 2017-11-29 2018-05-04 北京安华金和科技有限公司 A kind of watermark processing and data source tracing method based on the pseudo- row of dummy lines

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19828602

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19828602

Country of ref document: EP

Kind code of ref document: A1