CN113704709A

CN113704709A - Digital watermark data tracing method based on attribute importance index

Info

Publication number: CN113704709A
Application number: CN202110996040.6A
Authority: CN
Inventors: 徐超; 邹云峰; 单超; 朱峰; 范环宇
Original assignee: State Grid Jiangsu Electric Power Co ltd Marketing Service Center; State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Jiangsu Electric Power Co ltd Marketing Service Center; State Grid Jiangsu Electric Power Co Ltd
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2021-11-26

Abstract

The digital watermark data tracing method based on an attribute importance index comprises the following steps: 1, collecting the original data to be distributed, and extracting the prediction attributes and the class label attribute of each piece of original data to form a data table; 2, creating a watermark index table according to the data receivers of the original data and generating a key KEY; 3, forming a non-important attribute set attr; 4, embedding the watermark to obtain a watermarked data set; 5, distributing the watermarked data set according to the data receiver information in the watermark index table, and collecting suspected leaked data that is completely or partially leaked during or after distribution to form a suspected leakage data set; 6, extracting all sub-watermarks from each piece of data in the suspected leakage data set and concatenating them into a complete watermark; and 7, looking up the extracted complete watermark in the watermark index table to find the corresponding data receiver, i.e., the party that leaked the data, thereby completing the data leakage tracing.

Description

Digital watermark data tracing method based on attribute importance index

Technical Field

The invention relates to the field of data tracing, in particular to a digital watermark data tracing method based on attribute importance indexes.

Background

With the rapid development of data transmission and sharing technologies, data is frequently sent out from a system, and the data contains sensitive information of a data owner, so how to prevent an authorized object from performing unauthorized forwarding after acquiring the data becomes a problem to be solved urgently in data security. For example, data owners such as governments and enterprise organizations have a large amount of data, and in order to extract valuable information and knowledge from the data, the data needs to be sent to a plurality of different third-party data analysis organizations for analysis and processing, and it may happen that an untrusted third party forwards received data to another person, so that illegal forwarding of the data is caused, data privacy is revealed, and how to determine a third party who reveals the data is a key for tracing data disclosure.

Digital watermarking is currently a common approach to the data copyright problem, and researchers have proposed a series of watermarking algorithms in recent years. Most existing research focuses on preserving data availability and falls into two broad types: methods based on optimization algorithms and methods based on histogram techniques. Optimization-based methods recast watermark embedding as a constrained optimization problem: the watermark is created with optimization algorithms such as the genetic algorithm and particle swarm optimization, and data availability serves as the constraint during embedding. Histogram-based methods apply the gray-level histogram adjustment used for image watermarks to a database, which keeps the data perturbation small. Other studies focus on watermark security: the watermark is segmented and embedded into multiple groups so as to keep a certain redundancy and preserve the usability of the watermark.

Existing methods fall short in both data availability and watermark security. Research that focuses on watermark security, in particular, does not take the distribution characteristics of the data into account during embedding, so data availability is severely damaged. These methods also assume that the data remains complete during distribution, whereas in practice a leaker may leak only some of the data tuples; this damages the embedded watermark and greatly hinders both watermark extraction and the tracing of the leaker.

Disclosure of Invention

In order to solve the defects in the prior art, the invention aims to provide a digital watermark data tracing method based on an attribute importance index.

The invention adopts the following technical scheme:

the digital watermark data tracing method based on the attribute importance index comprises the following steps:

step 1, collecting the original data to be distributed, and extracting the condition attributes A_i (1 ≤ i ≤ n) and the class label attribute L of each piece of original data to form a data table D, wherein n represents the number of condition attributes of each piece of original data, the class label attribute L corresponds to s classes, and the data table D comprises M pieces of original data;

step 2, creating a watermark index table according to the data receivers of the original data in step 1, wherein the watermark index table comprises the information of each data receiver and the original watermark W_ii (1 ≤ ii ≤ G) to be embedded in the original data, G representing the number of data receivers, and generating the key KEY;

step 3, forming a non-important attribute set attr;

step 4, embedding, according to the non-important attribute set attr and the watermark W_ii (1 ≤ ii ≤ M) to be embedded in each piece of original data in the watermark index table of step 2, the watermark into the corresponding original data to obtain a watermarked data set D_W;

step 5, distributing the D_W obtained in step 4 according to the data receiver information in the watermark index table established in step 2, collecting suspected leaked data that is completely or partially leaked during or after distribution, and integrating it into a suspected leakage data set D_W';

step 6, for each piece of data in the suspected leakage data set D_W', extracting all sub-watermarks and concatenating them into a complete watermark;

step 7, looking up the complete watermark extracted in step 6 in the watermark index table established in step 2 to find the corresponding data receiver, i.e., the party that leaked the data, thereby completing the data leakage tracing.

In step 1, the class label attribute represents the class of the data, and comprises s classes;

conditional attributes refer to characteristics of the data based on which class label attributes of the data can be predicted using conventional prediction means.

In step 2, the original watermark contained in the original data it accepts is the same for the same data receiver.

The key KEY is an arbitrarily specified decimal number.

Step 3 comprises the following steps:

step 301, calculating the information gain ratio GainRatio(A_i, D) of each condition attribute according to the data table established in step 1;

step 302, calculating the Gini coefficient Gini(A_i, D) of each condition attribute according to the data table of step 1;

step 303, computing a weighted average of the information gain ratio GainRatio(A_i, D) obtained in step 301 and the Gini coefficient Gini(A_i, D) obtained in step 302 to obtain the importance index impt_index(A_i, D) of each attribute A_i, sorting the attributes by importance index, and selecting the tt attributes with the smallest importance indexes as the attributes in which the watermark is to be embedded, forming the non-important attribute set attr, wherein 1 ≤ tt ≤ n.

In step 301, let the proportion of original data in the j-th class relative to the whole data table be p_j (j = 1, 2, …, s), where s is the total number of classes. The information gain ratio GainRatio(A_i, D) of condition attribute A_i (1 ≤ i ≤ n) satisfies the following relation:

$$\mathrm{GainRatio}(A_i, D) = \frac{\mathrm{Gain}(A_i, D)}{\mathrm{Split\_info}(A_i)}$$

wherein Gain(A_i, D) is the information gain of condition attribute A_i and Split_info(A_i) is the split information of A_i, which satisfy the following relations:

$$\mathrm{Gain}(A_i, D) = \mathrm{Entropy}(D) - \mathrm{Entropy}(A_i, D)$$

$$\mathrm{Split\_info}(A_i) = -\sum_{m=1}^{r} \frac{|D_m|}{|D|} \log_2 \frac{|D_m|}{|D|}$$

wherein Entropy(D) is the information entropy of data table D and Entropy(A_i, D) is the conditional entropy of the data table partitioned by condition attribute A_i, which respectively satisfy the following relations:

$$\mathrm{Entropy}(D) = -\sum_{j=1}^{s} p_j \log_2 p_j$$

$$\mathrm{Entropy}(A_i, D) = \sum_{m=1}^{r} \frac{|D_m|}{|D|}\, \mathrm{Entropy}(D_m)$$

wherein r represents the number of subsets D_m (m = 1, 2, …, r) into which the data table D is divided according to condition attribute A_i, |D_m| represents the amount of original data in subset D_m, and |D| represents the amount of original data in the data table.
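By way of illustration only, the information gain ratio of step 301 can be sketched in Python as follows. The sketch assumes the standard base-2 entropy, information gain and split information definitions; the function names `entropy` and `gain_ratio` and the choice of partition are hypothetical and not taken from the patent, which does not state how a numeric attribute is partitioned into the r subsets.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 information entropy of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(subsets):
    """Gain ratio of a partition, given the class labels of each subset D_m.

    Assumes every subset is non-empty and that there are at least two
    subsets, so that the split information is non-zero.
    """
    all_labels = [label for subset in subsets for label in subset]
    total = len(all_labels)
    weights = [len(subset) / total for subset in subsets]
    conditional = sum(w * entropy(s) for w, s in zip(weights, subsets))
    gain = entropy(all_labels) - conditional
    split_info = -sum(w * math.log2(w) for w in weights)
    if split_info == 0:
        raise ValueError("a single-subset partition has no split information")
    return gain / split_info


# Assumed partition of the seven rows of Table 1 by X2, using a threshold
# between -5.783 and -5.477 (the threshold itself is an assumption).
low = ["FR", "ES", "FR", "FR"]   # rows with X2 below the threshold
high = ["ES", "ES", "ES"]        # rows with X2 above the threshold
ratio = gain_ratio([low, high])
```

Under this assumed partition the ratio is about 0.529, which rounds to the value 0.53 listed for X2 in the embodiment, although the patent does not say which partition was actually used.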

In step 302, a dichotomy is used to partition the data set by condition attribute A_i (1 ≤ i ≤ n) into subsets Z_i1 and Z_i2. First, the values of condition attribute A_i over all original data are sorted in descending order; then the average of each pair of adjacent attribute values is calculated as a division point, and the data set is divided into two subsets: the data greater than the division point and the data smaller than the division point;

the two data subsets contain M_i1 and M_i2 pieces of original data respectively, and the Gini coefficient of condition attribute A_i satisfies the following relation:

$$\mathrm{Gini}(A_i, D) = \frac{M_{i1}}{M}\,\mathrm{Gini}(Z_{i1}) + \frac{M_{i2}}{M}\,\mathrm{Gini}(Z_{i2}), \qquad \mathrm{Gini}(Z) = 1 - \sum_{j=1}^{s} p_{Z,j}^{2}$$

where s is the total number of classes, p_{Z,j} is the proportion of class j within subset Z, and Gini(Z_i1) and Gini(Z_i2) respectively represent the Gini coefficients of subset Z_i1 and subset Z_i2.
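As a minimal sketch, the dichotomy of step 302 can be written in Python as below. Following the detailed description, the smallest Gini coefficient over all division points is kept. The names `gini` and `gini_index` are hypothetical, the attribute values are assumed to be distinct, and the data are the X1 to X3 columns of Table 1.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_index(values, labels):
    """Smallest weighted Gini over all midpoint splits of a numeric attribute.

    Assumes the attribute values are distinct, so every midpoint separates
    the rows into two non-empty subsets.
    """
    pairs = sorted(zip(values, labels), reverse=True)  # descending by value
    total = len(pairs)
    best = None
    for k in range(1, total):
        split = (pairs[k - 1][0] + pairs[k][0]) / 2    # adjacent-value mean
        upper = [lab for v, lab in pairs if v > split]
        lower = [lab for v, lab in pairs if v < split]
        score = (len(upper) / total) * gini(upper) \
            + (len(lower) / total) * gini(lower)
        if best is None or score < best:
            best = score
    return best


labels = ["ES", "ES", "ES", "FR", "FR", "FR", "ES"]
x1 = [7.071475633, 10.98296717, 7.827108364, 9.985760003,
      13.88542526, 13.46616788, 12.28075786]
x2 = [-6.512899664, -5.15744505, -5.477471938, -8.976570322,
      -6.233852322, -5.783487271, -2.437558361]
x3 = [7.650799805, 3.952060221, 7.816257284, 6.122981616,
      2.229776427, 0.693888916, 3.175933842]
results = [round(gini_index(x, labels), 3) for x in (x1, x2, x3)]
```

On the Table 1 data this yields 0.229, 0.214 and 0.229 for X1, X2 and X3, in agreement with the Gini coefficients listed in the embodiment.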

In step 303, the tt ranges as: tt is more than or equal to 1 and less than or equal to n.

In step 303, the importance index impt_index(A_i, D) satisfies the following relation:

impt_index(A_i, D) = a × GainRatio(A_i, D) + b × Gini(A_i, D), wherein a and b are secret coefficients satisfying 0 < a, b < 1 and a + b = 1.
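The weighted average and the selection of the tt least important attributes may be sketched as follows. The helper names `importance_index` and `select_unimportant` are hypothetical; the input figures are the gain ratios and Gini coefficients listed in the embodiment, with a = 0.5.

```python
def importance_index(gain_ratio, gini, a=0.5):
    """Weighted average a * GainRatio + b * Gini with b = 1 - a."""
    if not 0 < a < 1:
        raise ValueError("the secret coefficient a must satisfy 0 < a < 1")
    return a * gain_ratio + (1.0 - a) * gini


def select_unimportant(scores, tt):
    """Names of the tt attributes with the smallest importance index."""
    if not 1 <= tt <= len(scores):
        raise ValueError("tt must satisfy 1 <= tt <= n")
    return sorted(scores, key=scores.get)[:tt]


# Values reported in the embodiment for Table 1.
gain_ratios = {"X1": 0.476, "X2": 0.53, "X3": 0.543}
ginis = {"X1": 0.229, "X2": 0.214, "X3": 0.229}

scores = {name: importance_index(gain_ratios[name], ginis[name], a=0.5)
          for name in gain_ratios}
attr = select_unimportant(scores, tt=2)
```

With these inputs the scores are 0.3525, 0.372 and 0.386, so X1 and X2 form the non-important attribute set, as in the embodiment.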

Step 4 comprises the following steps:

step 401, splitting the initial watermark W_ii (1 ≤ ii ≤ M) to be embedded in each piece of original data into t sub-watermarks W_iisub[index] (0 ≤ index ≤ t-1);

step 402, for each piece of original data in the data table D, traversing the non-important attribute set attr, taking the integer part integer and the fractional part decimal of each condition attribute value in attr, saving the length of the fractional part as decimal_len, and calculating the embedding position of the sub-watermark W_iisub[index] in the fractional part decimal according to the position hash function;

step 403, embedding the watermark into the condition attribute of the original data by using a watermark embedding algorithm;

step 404, repeating steps 402 and 403 until the corresponding watermark W_ii (1 ≤ ii ≤ M) has been embedded in the condition attributes of all original data in the data table D.

In step 401, the segmentation method of the initial watermark includes:

W_iisub＝{W_ii[b]W_ii[b+1]…W_ii[b+sub_len-1]}

b＝0×sub_len,1×sub_len,…,(t-1)×sub_len

wherein W_iisub is the set of sub-watermarks of the initial watermark W_ii, and the sub-watermark length is sub_len = |W_ii| / t, where |W_ii| is the length of W_ii.
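A minimal Python sketch of the segmentation in step 401 is given below. The function name `split_watermark` is hypothetical, and the sketch assumes that the watermark length is an exact multiple of t, a case the text does not otherwise address.

```python
from typing import List


def split_watermark(watermark: str, t: int) -> List[str]:
    """Split a watermark string into t equal-length sub-watermarks."""
    if t <= 0 or len(watermark) % t != 0:
        raise ValueError("watermark length must be a positive multiple of t")
    sub_len = len(watermark) // t
    # b takes the values 0*sub_len, 1*sub_len, ..., (t-1)*sub_len
    return [watermark[b:b + sub_len] for b in range(0, t * sub_len, sub_len)]


subs = split_watermark("12345678", 2)
```

For the watermark 12345678 of the embodiment and t = 2, this gives the sub-watermarks 1234 and 5678.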

In step 402, the embedding position satisfies the following relation:

position＝H(KEY_ii||H(integer||index))％decimal_len

where H(KEY || H(integer || index)) is the value obtained by applying the position hash function to KEY || H(integer || index), H(integer || index) is the value obtained by applying the position hash function to integer || index, and decimal_len is the length of the fractional part decimal.
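The position calculation may be sketched as follows. The patent does not name the hash function H or the encoding used for concatenation, so SHA-256 and decimal-string concatenation are assumptions made here purely for illustration, as are the names `H` and `embed_position`; the positions this sketch returns will therefore not in general match the values 5 and 8 quoted in the embodiment.

```python
import hashlib


def H(text: str) -> int:
    """Assumed position hash: SHA-256 of the text, read as an integer."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def embed_position(key: str, integer_part: str, index: int,
                   decimal_len: int) -> int:
    """position = H(KEY || H(integer || index)) % decimal_len."""
    if decimal_len <= 0:
        raise ValueError("decimal_len must be positive")
    inner = H(f"{integer_part}{index}")       # H(integer || index)
    return H(f"{key}{inner}") % decimal_len   # outer hash, reduced mod length


position = embed_position("13579", "7", 0, 9)
```

Because only the key, the integer part and the sub-watermark index enter the hash, the data owner can recompute the same position from a leaked tuple without access to the original fractional digits.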

In step 403, the watermark embedding algorithm is:

watermarkedDecimal＝ decimal[0:position]||W_iisub[index]||decimal[position+sub_len:end]；

newValue＝integer||watermarkedDecimal

wherein watermarkedDecimal is the fractional part with the watermark embedded, and newValue is the new condition attribute value formed by concatenating integer and watermarkedDecimal; decimal[0:position] denotes the 1st through (position+1)-th digits of the fractional part decimal, counted from left to right; decimal[position+sub_len:end] denotes the (position+sub_len+1)-th through last digits of the fractional part decimal, counted from left to right; and || denotes string concatenation.
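The embedding of step 403 can be illustrated with the following sketch, which treats attribute values as strings to avoid floating-point rounding and uses Python half-open slicing, both of which are assumptions. The function name `embed_sub_watermark` and the `overwrite` flag are hypothetical. By default the sub-watermark is inserted between the two parts of the fractional digits, which reproduces the X1 value of the worked embodiment; with `overwrite=True` the tail starts at position + sub_len, as the slice in the formula is written.

```python
def embed_sub_watermark(value: str, sub_watermark: str, position: int,
                        overwrite: bool = False) -> str:
    """Embed a sub-watermark into the fractional digits of a decimal string."""
    integer, _, decimal = value.partition(".")
    if not decimal or not 0 <= position <= len(decimal):
        raise ValueError("value needs fractional digits covering the position")
    # Insertion keeps every original digit; overwriting skips sub_len digits.
    tail_start = position + len(sub_watermark) if overwrite else position
    watermarked = decimal[:position] + sub_watermark + decimal[tail_start:]
    return f"{integer}.{watermarked}"


new_value = embed_sub_watermark("7.071475633", "1234", 5)
```

For the tuple with ID 1 of Table 1, inserting 1234 at position 5 of X1 = 7.071475633 gives 7.0714712345633, the value stated in the embodiment.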

In step 6, traversing each piece of data in the data set to be traced, finding the non-important attribute of each piece of data by using the method in step 3, taking the integer part and the decimal part of the non-important attribute value, calculating the embedding position of the watermark, and extracting all the sub-watermarks to connect the sub-watermarks to form the complete watermark.

Compared with the prior art, the invention has the beneficial effects that:

1. Compared with the traditional genetic algorithm and particle swarm algorithm, constructing the watermark from the information gain ratio and the Gini coefficient of the data's condition attributes is faster, without losing the characteristics of the data.

2. The method not only considers the characteristics of single data, but also transversely considers the relative importance of each data in the data table where the data is located, so that the generated watermark has stronger security, uniqueness, secrecy and imperceptibility, the confidentiality and feasibility of tracing the data by using the method are greatly enhanced, and the usability of the data and the security of the watermark are effectively considered.

3. The invention divides the attribute value of a data condition attribute into an integer part and a decimal part and embeds the watermark in the decimal part, keyed on the integer part. This more effectively supports the data owner in tracing the data when original data is leaked, and prevents tracing from failing because an attacker damaged the watermark by leaking only part of the original data.

4. After the watermark data generated by the invention is distributed to a data receiver, if data classification prediction is needed in the later period, the classification accuracy of the data embedded with the watermark is far higher than that of the watermark data generated by the traditional algorithm.

Drawings

Fig. 1 is a flowchart of a digital watermark data tracing method based on attribute importance index according to the present invention.

Table 1 is a data table of an embodiment of the present invention;

table 2 is a data table after embedding a watermark according to an embodiment of the present invention;

table 3 is a data table revealed by the embodiment of the present invention.

Detailed Description

The present application is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.

Fig. 1 is a flowchart of a digital watermark data tracing method based on attribute importance index, and the method specifically includes the following steps:

step 1, collecting the original data to be distributed, and extracting the condition attributes A_i (1 ≤ i ≤ n) and the class label attribute L of each piece of original data to form a data table D, wherein n represents the total number of condition attributes of each piece of original data, the class label attribute L corresponds to s classes, and the data table D comprises M pieces of original data;

the class label attribute of the data represents the class of the data and comprises s classes in total, i.e., s is the total number of data classes. As shown in table 1, in this embodiment the class label attribute is language; the data in table 1 fall into 2 classes, ES and FR, so s = 2.

for the same data receiver, the initial watermark contained in the received original data is the same; different watermarks may be used for different data recipients, while the KEY is the same;

preferably, the key is a specified arbitrary decimal number;

step 3, forming an unimportant attribute set attr;

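Let the proportion of original data in the j-th class relative to the whole data table be p_j (j = 1, 2, …, s), and calculate the information gain ratio GainRatio(A_i, D) of condition attribute A_i (1 ≤ i ≤ n), which satisfies the following relations:

$$\mathrm{GainRatio}(A_i, D) = \frac{\mathrm{Gain}(A_i, D)}{\mathrm{Split\_info}(A_i)}$$

$$\mathrm{Gain}(A_i, D) = \mathrm{Entropy}(D) - \mathrm{Entropy}(A_i, D)$$

$$\mathrm{Split\_info}(A_i) = -\sum_{m=1}^{r} \frac{|D_m|}{|D|} \log_2 \frac{|D_m|}{|D|}$$

$$\mathrm{Entropy}(D) = -\sum_{j=1}^{s} p_j \log_2 p_j, \qquad \mathrm{Entropy}(A_i, D) = \sum_{m=1}^{r} \frac{|D_m|}{|D|}\, \mathrm{Entropy}(D_m)$$

wherein r represents the number of subsets D_m (m = 1, 2, …, r) into which the data table D is divided according to condition attribute A_i, |D_m| represents the amount of original data in subset D_m, and |D| represents the amount of original data in the data table.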

A dichotomy is used to partition the data set by condition attribute A_i (1 ≤ i ≤ n) into subsets Z_i1 and Z_i2. First, the values of condition attribute A_i over all original data are sorted in descending order; then the average of each pair of adjacent attribute values is calculated, and each average is used as a division point, so that qq averages give qq partitioning cases. Each partitioning case divides the data set into two subsets: the data greater than the division point and the data smaller than the division point. The Gini coefficient is then calculated for each partitioning case, and the minimum over all partitioning cases is selected as the final Gini coefficient Gini(A_i, D) of the condition attribute.

Taking table 1 as an example, the condition attribute X1 has 6 average values, i.e., 6 division points, which divide the data set in 6 different ways; the Gini coefficient is calculated for each partitioning case, and the smallest of the 6 Gini coefficients is selected as the Gini coefficient of the condition attribute.

The two data subsets contain M_i1 and M_i2 pieces of original data respectively, and the Gini coefficient of condition attribute A_i is calculated according to the following relation:

$$\mathrm{Gini}(A_i, D) = \frac{M_{i1}}{M}\,\mathrm{Gini}(Z_{i1}) + \frac{M_{i2}}{M}\,\mathrm{Gini}(Z_{i2}), \qquad \mathrm{Gini}(Z) = 1 - \sum_{j=1}^{s} p_{Z,j}^{2}$$

where p_{Z,j} is the proportion of class j within subset Z.

step 303, computing a weighted average of the information gain ratio GainRatio(A_i, D) obtained in step 301 and the Gini coefficient Gini(A_i, D) obtained in step 302 to obtain the importance index impt_index(A_i, D) of each attribute A_i, sorting the attributes by importance index, and selecting the tt attributes with the smallest importance indexes as the attributes in which the watermark is to be embedded, forming the non-important attribute set attr, wherein 1 ≤ tt ≤ n;

preferably, the range of tt is: 1 ≤ tt ≤ n;

the importance index impt_index(A_i, D) satisfies the following relation:

impt_index(A_i, D) = a × GainRatio(A_i, D) + b × Gini(A_i, D), wherein a and b are secret coefficients satisfying 0 < a, b < 1 and a + b = 1.

The initial watermark is segmented as follows:

W_iisub = {W_ii[b]W_ii[b+1]…W_ii[b+sub_len-1] | b = 0×sub_len, 1×sub_len, …, (t-1)×sub_len}

For example, let sub_len = 4 and t = 3; then b = 0, 4, 8, and

W_iisub = {W_ii[0]W_ii[0+1]W_ii[0+2]W_ii[0+3], W_ii[4]W_ii[4+1]W_ii[4+2]W_ii[4+3], W_ii[8]W_ii[8+1]W_ii[8+2]W_ii[8+3]}

W_iisub is the set of sub-watermarks of the initial watermark W_ii, wherein each sub-watermark is denoted W_iisub[index], 0 ≤ index ≤ t-1, and the sub-watermark length is sub_len = |W_ii| / t, where |W_ii| is the length of W_ii.

For example, for the condition attribute value -6.5128995678664, the integer part integer is -6 and the fractional part decimal is 5128995678664.

The embedding position satisfies the following relation:

position＝H(KEY_ii||H(integer||index))％decimal_len

wherein H(KEY || H(integer || index)) is the value obtained by applying the position hash function to KEY || H(integer || index), and H(integer || index) is the value obtained by applying the position hash function to integer || index; decimal_len is the length of the fractional part decimal;

the fractional part decimal is divided into two parts at the position-th digit, the sub-watermark W_iisub[index] is inserted between the front part and the back part to form the watermarked fractional part watermarkedDecimal, and watermarkedDecimal is then concatenated with the integer part integer to form the new condition attribute value newValue, which completes the embedding;

the watermark embedding algorithm is as follows:

watermarkedDecimal = decimal[0:position] || W_iisub[index] || decimal[position+sub_len:end];

newValue = integer || watermarkedDecimal

wherein decimal[0:position] denotes the 1st through (position+1)-th digits of the fractional part decimal, counted from left to right; decimal[position+sub_len:end] denotes the (position+sub_len+1)-th through last digits of the fractional part decimal, counted from left to right; and || denotes string concatenation;

step 404, repeating steps 402 and 403 until the corresponding watermark W_ii (1 ≤ ii ≤ M) has been embedded in the condition attributes of all original data in the data table D;

step 5, distributing the D_W obtained in step 4 according to the data receiver information in the watermark index table established in step 2, collecting suspected leaked data that is completely or partially leaked during or after distribution, and integrating it into a suspected leakage data set D_W';

traversing each piece of data in the data set to be traced, finding the non-important attribute of each piece of data by using the method in the step 3, taking the integer part and the fractional part of the non-important attribute value, and calculating the embedding position of the watermark; the method used is the same as in step 402, i.e. the watermark embedding location is calculated according to the following location hash function formula:

position＝H(KEY||H(integer||index))％decimal_len

extracting the embedded sub-watermarks from the position bit to the position + sub _ len-1 bit of fractional part decimal, repeating the above process for each non-important attribute of each piece of data, extracting all the sub-watermarks, and finally connecting the sub-watermarks to form a complete watermark;

and 7, searching out a corresponding data receiver, namely an individual revealing data, according to the complete watermark extracted in the step 6 through the watermark index table established in the step 2, and thus finishing the revealing tracing.
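Steps 6 and 7 may be illustrated with the sketch below. The function names `extract_sub_watermark` and `trace_leaker`, the receiver names in the index table, and the second leaked value are hypothetical; embedding positions are passed in directly rather than recomputed with the position hash function, and attribute values are again handled as strings.

```python
from typing import Dict, List, Optional


def extract_sub_watermark(value: str, position: int, sub_len: int) -> str:
    """Read sub_len fractional digits starting at the embedding position."""
    _, _, decimal = value.partition(".")
    return decimal[position:position + sub_len]


def trace_leaker(sub_watermarks: List[str],
                 index_table: Dict[str, str]) -> Optional[str]:
    """Join the sub-watermarks and look the result up in the index table."""
    return index_table.get("".join(sub_watermarks))


# Hypothetical watermark index table: watermark -> data receiver.
index_table = {"12345678": "receiver-A", "87654321": "receiver-B"}

# First value is the watermarked X1 of the embodiment (position 5); the
# second value and its position are made up for the illustration.
leaked = [("7.0714712345633", 5), ("3.1455678926", 3)]
subs = [extract_sub_watermark(value, pos, 4) for value, pos in leaked]
leaker = trace_leaker(subs, index_table)
```

A watermark that is absent from the index table returns `None`, which a caller could treat as a damaged or foreign watermark.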

The data table shown in table 1 has 5 condition attributes and 7 pieces of raw data, and the class label attribute language gives the category of each piece of raw data. Suppose the data owner wants to embed the watermark W = 12345678 with the key KEY = 13579.

The data owner specifies the secret coefficient a = 0.5 and t = 2. The information gain ratios of the condition attributes are calculated as GainRatio(X1, D) = 0.476, GainRatio(X2, D) = 0.53 and GainRatio(X3, D) = 0.543; the Gini coefficients are calculated as Gini(X1, D) = 0.229, Gini(X2, D) = 0.214 and Gini(X3, D) = 0.229; and the importance indexes obtained from the information gain ratios and the Gini coefficients are impt_index(X1, D) = 0.353, impt_index(X2, D) = 0.372 and impt_index(X3, D) = 0.386. Because t = 2, the two attributes with the smallest importance indexes, X1 and X2, are selected as the attributes in which the watermark is to be embedded;

the watermark W is divided into two sub-watermarks, W_iisub[0] = 1234 and W_iisub[1] = 5678;

the sub-watermarks are inserted into the fractional digits of the X1 and X2 attribute values of each tuple. For example, for the tuple with ID = 1, the embedding position of sub-watermark W_iisub[0] in attribute X1 is position = H(13579 || H(7 || 0)) % 9 = 5, and the embedding position of sub-watermark W_iisub[1] in attribute X2 is position = H(13579 || H(-6 || 1)) % 9 = 8;

sub-watermark W_iisub[0] is inserted at the 5th fractional digit of attribute X1 to form the new attribute value 7.0714712345633, and sub-watermark W_iisub[1] is inserted at the 8th fractional digit of attribute X2 to form the new attribute value -6.5128995678664;

this is repeated until the watermarked data table is formed, as shown in table 2;

an attacker leaks three records of the data, as shown in table 3. The data owner calculates the sub-watermark embedding position of attribute X1 of the tuple with ID = 1, position = H(13579 || H(7 || 0)) % 9 = 5, and extracts the sub-watermark W_iisub[0] = 1234; calculates the sub-watermark embedding position of attribute X2, position = H(13579 || H(-6 || 1)) % 9 = 8, and extracts the sub-watermark W_iisub[1] = 5678; and splices the sub-watermarks into the complete watermark W = 12345678, which completes the tracing.

The present applicant has described and illustrated embodiments of the present invention in detail with reference to the accompanying drawings, but it should be understood by those skilled in the art that the above embodiments are merely preferred embodiments of the present invention, and the detailed description is only for the purpose of helping the reader to better understand the spirit of the present invention, and not for limiting the scope of the present invention, and on the contrary, any improvement or modification made based on the spirit of the present invention should fall within the scope of the present invention.

TABLE 1

ID	language	X1	X2	X3
1	ES	7.071475633	-6.512899664	7.650799805
2	ES	10.98296717	-5.15744505	3.952060221
3	ES	7.827108364	-5.477471938	7.816257284
4	FR	9.985760003	-8.976570322	6.122981616
5	FR	13.88542526	-6.233852322	2.229776427
6	FR	13.46616788	-5.783487271	0.693888916
7	ES	12.28075786	-2.437558361	3.175933842

TABLE 2

TABLE 3

Claims

1. The digital watermark data tracing method based on the attribute importance index is characterized by comprising the following steps:

step 3, forming an unimportant attribute set attr;

step 5, distributing the D_W obtained in step 4 according to the data receiver information in the watermark index table established in step 2, collecting suspected leaked data that is completely or partially leaked during or after distribution, and integrating it into a suspected leakage data set D_W';

2. The digital watermark data tracing method based on attribute importance index according to claim 1, wherein:

in the step 1, the class label attribute represents the class of the data, and comprises s classes;

the condition attribute refers to the characteristic of the data, and the class label attribute of the data can be predicted by using a conventional prediction means based on the condition attribute.

3. The digital watermark data tracing method based on attribute importance index according to claim 1, wherein:

in said step 2, the original watermark contained in the original data it accepts is the same for the same data receiver.

4. The digital watermark data tracing method based on attribute importance index according to claim 1 or 3, characterized in that:

the KEY is a specified arbitrary decimal number.

5. The digital watermark data tracing method based on attribute importance index according to claim 1, wherein:

the step 3 comprises the following steps:

6. The method according to claim 5, wherein the method comprises:

in step 301, let the proportion of original data in the j-th class relative to the whole data table be p_j (j = 1, 2, …, s), where s is the total number of data classes; the information gain ratio GainRatio(A_i, D) of condition attribute A_i (1 ≤ i ≤ n) satisfies the following relation:

Gain(A_i,D)＝Entropy(D)-Entropy(A_i,D)

7. The method according to claim 5, wherein the method comprises:

in said step 302, a dichotomy is used to partition the data set by condition attribute A_i (1 ≤ i ≤ n) into subsets Z_i1 and Z_i2; first, the values of condition attribute A_i over all original data are sorted in descending order; then the average of each pair of adjacent attribute values is calculated as a division point, and the data set is divided into two subsets: the data greater than the division point and the data smaller than the division point;

8. The method according to claim 5, wherein the method comprises:

in step 303, the tt range is: tt is more than or equal to 1 and less than or equal to n.

9. The method according to claim 5, wherein the method comprises:

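in the step 303, the importance index impt_index(A_i, D) satisfies the following relation:

impt_index(A_i, D) = a × GainRatio(A_i, D) + b × Gini(A_i, D), wherein a and b are secret coefficients satisfying 0 < a, b < 1 and a + b = 1.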

10. The digital watermark data tracing method based on attribute importance index according to claim 1, wherein:

the step 4 comprises the following steps:

11. The method according to claim 10, wherein the method for tracing the source of the digital watermark data based on the attribute importance index comprises:

in step 401, the segmentation method of the initial watermark includes:

W_iisub＝{W_ii[b]W_ii[b+1]…W_ii[b+sub_len-1]}

b＝0×sub_len,1×sub_len,…,(t-1)×sub_len

12. The method according to claim 10, wherein the method for tracing the source of the digital watermark data based on the attribute importance index comprises:

in the step 402, the embedding position satisfies the following relation:

position＝H(KEY_ii||H(integer||index))％decimal_len

13. The method according to claim 10, wherein the method for tracing the source of the digital watermark data based on the attribute importance index comprises:

in step 403, the watermark embedding algorithm is:

watermarkedDecimal = decimal[0:position] || W_iisub[index] || decimal[position+sub_len:end];

newValue＝integer||watermarkedDecimal

14. The digital watermark data tracing method based on attribute importance index according to claim 1, wherein:

in the step 6, traversing each piece of data in the data set to be traced, finding the non-important attribute of each piece of data by using the method in the step 3, taking the integer part and the fractional part of the non-important attribute value, calculating the embedding position of the watermark, and extracting all the sub-watermarks to connect the sub-watermarks to form the complete watermark.