CN112613045B

CN112613045B - Method and system for embedding data watermark of target data

Info

Publication number: CN112613045B
Application number: CN202011375206.4A
Authority: CN
Inventors: 于鹏飞; 石聪聪; 陈磊
Original assignee: State Grid Smart Grid Research Institute Co ltd
Current assignee: State Grid Smart Grid Research Institute Co ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2023-06-06
Anticipated expiration: 2040-11-30
Also published as: CN112613045A

Abstract

The invention discloses a data watermark embedding method and system for target data. The method comprises: S1, dividing the target data to be embedded with the data watermark into a plurality of content blocks, and embedding the data watermark in each content block; S2, carrying out data item similarity evaluation on the data items embedded with the data watermark by adopting a preset data similarity evaluation model; S3, evaluating the data watermark similarity of each content block based on the data item similarity of all data items constituting the content block, executing S4 when the data watermark similarity of each content block meets a first threshold range, and otherwise adjusting the embedding proportion and/or position of the data watermark in the content block and executing S2; S4, calculating the overall similarity of the target data based on the data watermark similarity of all content blocks forming the target data, and obtaining the target data embedded with the data watermark by adjusting the embedding proportion and/or the position of the data watermark. High concealment and high simulation after the data watermark is embedded are thereby realized.

Description

Method and system for embedding data watermark of target data

Technical Field

The invention relates to the field of data watermarking, in particular to a data watermarking embedding method and system for target data.

Background

With the continuous development of the digital economy, information exchange among different departments, regions and data owners is steadily increasing, and data, mostly in structured form, is circulated, recombined and used ever more frequently across all links. Because the data is used in a dynamic environment, the risk of a data leakage event is considerable. Once a leak occurs, the responsible link needs to be located accurately, so that the security responsibility of the personnel involved can be traced and the security control of weak links can be strengthened in a targeted manner.

Data watermarking is one of the effective technical means for tracing responsibility after a data leak. A data watermark adds extra redundant identification information to the data content itself. The identification information closely imitates the real data content, and the relevant responsibility links are associated with it and recorded, so that once the data is leaked, the leak can be located according to the watermark information added in advance. High simulation and high concealment are the key indexes of an effective data watermark, because they prevent the watermark from being found and destroyed by malicious users. Achieving them requires that the similarity of the target data before and after the data watermark is added reach a threshold at which the change is not easily noticed by a user. How to embed the data watermark in the target data so as to achieve high concealment and high simulation after embedding is therefore a problem to be solved.

Disclosure of Invention

In order to solve the above-mentioned shortcomings existing in the prior art, the present invention provides a data watermark embedding method of target data, including:

S1, dividing target data to be embedded with a data watermark into a plurality of content blocks, and embedding the data watermark in a data entry of each content block;

S2, carrying out data item similarity evaluation on the data items embedded with the data watermarks by adopting a preset data similarity evaluation model;

S3, evaluating the data watermark similarity of the content block based on the data item similarity of all the data items forming the content block, when the data watermark similarity of each content block meets a first threshold range, executing S4, otherwise, adjusting the embedding proportion and/or position of the data watermark in the content block which does not meet the first threshold range, and executing S2;

S4, calculating the similarity of the whole target data based on the data watermark similarity of all content blocks forming the target data, finishing embedding of the data watermark when the similarity of the whole target data meets a second threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermark in one or more content blocks, and executing S2.
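The control flow of steps S1 to S4 can be sketched as a feedback loop. This is a minimal illustration under stated assumptions, not the patented implementation: the function name, the callables `embed_block`, `entry_similarity` and `adjust_ratio`, the `max_rounds` cap, the arithmetic mean used at both aggregation levels, and the choice to re-adjust every block when the overall check fails are all hypothetical.

```python
from statistics import mean
from typing import Callable, List, Tuple

Range = Tuple[float, float]


def in_range(value: float, bounds: Range) -> bool:
    """True when value lies inside the closed threshold range."""
    low, high = bounds
    return low <= value <= high


def embed_with_feedback(
    blocks: List[List[str]],
    embed_block: Callable[[List[str], float], List[str]],   # hypothetical embedder
    entry_similarity: Callable[[str, str], float],          # hypothetical S2 model
    adjust_ratio: Callable[[float, float, Range], float],   # hypothetical adjuster
    block_range: Range,
    overall_range: Range,
    max_rounds: int = 20,                                   # assumed safety cap
):
    ratios = [1.0] * len(blocks)  # S1: start from full embedding in every block
    for _ in range(max_rounds):
        marked = [embed_block(b, r) for b, r in zip(blocks, ratios)]
        # S2 + S3: entry similarities aggregated per block (mean is an assumption)
        deltas = [
            mean(entry_similarity(o, m) for o, m in zip(block, marked_block))
            for block, marked_block in zip(blocks, marked)
        ]
        failing = [i for i, d in enumerate(deltas) if not in_range(d, block_range)]
        if failing:
            for i in failing:
                ratios[i] = adjust_ratio(ratios[i], deltas[i], block_range)
            continue  # back to S2
        # S4: block similarities aggregated over the whole data set
        theta = mean(deltas)
        if in_range(theta, overall_range):
            return marked, ratios, theta
        ratios = [adjust_ratio(r, theta, overall_range) for r in ratios]
    raise RuntimeError("no embedding satisfied both threshold ranges")
```

Passing the embedder, the similarity model and the adjustment rule as callables keeps the loop independent of the field type, which matches the way the text treats them as interchangeable parts.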

Preferably, adjusting the embedding proportion and/or position of the data watermark includes:

when the data item contains a single type field, adjusting the embedding proportion of the data watermark;

when the data entry contains multiple types of fields, the embedding proportion and/or the position of the data watermark are/is adjusted.

Preferably, the adjusting the embedding ratio of the data watermark includes:

when the similarity of the watermarks of the content block data is greater than the maximum value in the first threshold range, reducing the proportion of embedding the data watermarks in the data items of the content block data to a preset proportion;

when the similarity of the watermarks of the content block data is smaller than the minimum value in the first threshold range, increasing the proportion of embedding the data watermarks in the data items of the content block data to a preset proportion;

when the overall similarity of the target data is greater than the maximum value in the second threshold range, reducing the proportion of embedding the data watermark in one or more content blocks to a preset proportion;

and when the overall similarity of the target data is smaller than the minimum value in the second threshold range, increasing the proportion of embedding the data watermark in one or more content blocks to a preset proportion.

Preferably, the adjusting the embedding position of the data watermark includes:

removing original data watermarks in the data items, and embedding data watermarks matched with field types into various fields in the data items according to preset proportions;

and evaluating the similarity of the data items after the data watermarks matched with the field types are embedded, selecting the position of the field with the maximum similarity of the data items as the optimal position for embedding the data watermarks, and embedding the data watermarks at the optimal position.

Preferably, the field types in the data entry include any one or more of the following:

a numeric field, a text field, and a natural language field.

Preferably, embedding a data watermark in the data entry includes:

embedding a numeric data watermark in a numeric field when the numeric field is included in the data entry;

embedding a character text type data watermark in a text field when the text field is included in the data item;

when the data entry comprises a natural language field, embedding a natural language type data watermark in the natural language field.

Preferably, the data item similarity evaluation for the data item embedded with the data watermark by adopting a preset data similarity evaluation model includes:

when a numerical data watermark is embedded in a numerical value field of the data item, deconstructing and word segmentation is carried out on numerical values before and after the data watermark is embedded, and the data item similarity is evaluated through a Euclidean distance vector data similarity evaluation model;

when a text field of the data item is embedded with a text-type character data watermark, deconstructing ASCII code values before and after the data watermark is embedded, and evaluating the similarity of the data item through a cosine vector data similarity evaluation model;

and when the natural language field of the data item is embedded with the natural language type data watermark, performing deconstructing word segmentation on the natural language field before and after the data watermark is embedded by using a space vector model, and performing data item similarity assessment on a deconstructing word segmentation result through a cosine vector data similarity assessment model.

Preferably, the content block data watermark similarity assessment is performed as follows:

wherein: δ represents the watermark similarity of the content block data; N represents the total number of data entries in the content block; C_i represents the data entry similarity of the i-th data entry.

Preferably, the similarity of the whole target data is evaluated as follows:

wherein: θ represents the similarity of the whole target data; M represents the total number of content blocks in the target data; δ_i represents the content block data watermark similarity of the i-th content block.

Based on the same inventive concept, the invention also provides a data watermark embedding system of target data, comprising:

the embedding module is used for dividing target data to be embedded with the data watermark into a plurality of content blocks, and embedding the data watermark into each content block;

the data item similarity evaluation module is used for evaluating the similarity of the data items embedded with the data watermark by adopting a preset data similarity evaluation model;

the content block similarity evaluation module is used for evaluating the data watermark similarity of the content block based on the data item similarity of all the data items forming the content block, executing the overall similarity evaluation module when the data watermark similarity of each content block meets a first threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermark in the content block which does not meet the first threshold range, and executing the data item similarity evaluation module;

and the overall similarity evaluation module is used for calculating the overall similarity of the target data based on the data watermark similarity of all content blocks forming the target data, finishing the embedding of the data watermark when the overall similarity of the target data meets a second threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermark in one or more content blocks, and executing the data item similarity evaluation module.

Preferably, the data item similarity evaluation module is specifically configured to:

Compared with the prior art, the invention has the beneficial effects that:

According to the technical scheme provided by the invention: S1, target data to be embedded with the data watermark is divided into a plurality of content blocks, and the data watermark is embedded in a data item of each content block; S2, data item similarity evaluation is carried out on the data items embedded with the data watermarks by adopting a preset data similarity evaluation model; S3, the data watermark similarity of the content block is evaluated based on the data item similarity of all the data items forming the content block, S4 is executed when the data watermark similarity of each content block meets a first threshold range, and otherwise the embedding proportion and/or position of the data watermark in the content block which does not meet the first threshold range is adjusted and S2 is executed; S4, the similarity of the whole target data is calculated based on the data watermark similarity of all content blocks forming the target data, embedding of the data watermark is finished when the similarity of the whole target data meets a second threshold range, and otherwise the embedding proportion and/or the position of the data watermark in one or more content blocks is adjusted and S2 is executed. According to the data watermark similarity evaluation results of the data items, the content blocks and the whole data, the data watermarks of the embedded content blocks are dynamically adjusted, so that high concealment and high simulation after the data watermarks are embedded are finally realized.

Drawings

FIG. 1 is a flow chart of a method for embedding a data watermark into target data;

FIG. 2 is a schematic diagram of a data watermark embedding system of target data according to an embodiment of the present invention.

Detailed Description

For a better understanding of the present invention, reference is made to the following description, drawings and examples.

Example 1: as shown in FIG. 1, to meet the needs left open by the prior art, the present invention provides a data watermark embedding method of target data, including:

Wherein, adjust embedding proportion and/or position of data watermark, include:

According to the data watermark similarity evaluation results of the data items, the content blocks and the whole data, the data watermark of the embedded target data is dynamically adjusted, so that the similarity of the content blocks after the data watermark is embedded and the similarity of the whole target data respectively meet the set threshold range, and finally high concealment and high simulation after the data watermark is embedded are realized.

In this embodiment, S1 divides target data to be embedded with a data watermark into a plurality of content blocks, and embeds the data watermark in a data entry of each content block, including:

for each data item constituting the content block, a data watermark of the corresponding type is selected according to the field type in the data item and embedded. To maximize the information capacity of the data watermark embedding, the initial proportion of data watermark embedding in the target data is 100%.

The method specifically comprises the following steps:

S2, carrying out data item similarity evaluation on the data items embedded with the data watermarks by adopting a preset data similarity evaluation model, namely selecting a proper data similarity evaluation model according to different data watermark embedding algorithms to carry out similarity evaluation on the data watermark items, wherein the data item similarity evaluation comprises the following steps:

In this embodiment, similarity requires that, for a given type of data, the data type characteristics do not change after the data watermark is embedded; if the data type characteristics change, the similarity evaluation result of the data watermark entry is 0.

For example, data of the mobile phone number type has 11 digits, of which the first 3 digits represent the network identification number, digits 4 to 7 represent the region code, and digits 8 to 11 represent the user number; after the data watermark is embedded, the value must still conform to the data type characteristics of a mobile phone number.
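The rule that a changed data type forces a similarity of 0 can be sketched as a gate placed in front of any similarity model. This is an illustrative sketch only: the names `PHONE_PATTERN` and `gated_similarity` are hypothetical, and the check is limited to "exactly 11 decimal digits", which is narrower than a full validation of network, region and user segments.

```python
import re
from typing import Callable

# Assumption: the only type feature checked here is "exactly 11 decimal digits".
PHONE_PATTERN = re.compile(r"[0-9]{11}")


def gated_similarity(
    original: str,
    marked: str,
    score: Callable[[str, str], float],  # hypothetical similarity model
) -> float:
    """Return 0 when the watermarked value no longer looks like a phone number."""
    if not PHONE_PATTERN.fullmatch(marked):
        return 0.0
    return score(original, marked)
```

A value such as `1381234567X` would be scored 0 regardless of the model, while a value that still has 11 digits is passed on to the model unchanged.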

(1) After the numerical data watermark is embedded, the numerical values before and after embedding are deconstructed and segmented, and similarity is evaluated through a Euclidean distance vector data similarity evaluation model; the evaluation result is D.

For example, let the values before and after the data watermark is embedded be P and P'. Through deconstruction and segmentation, each digit becomes an independent unit, namely P = {N1, N2, …, N11} and P' = {N'1, N'2, …, N'11}; these are then substituted into the Euclidean data similarity evaluation model to calculate the similarity.
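The digit-wise Euclidean evaluation can be sketched as follows. The mapping from distance to a similarity score is not given in the text, so the form 1 / (1 + distance) is an assumption chosen only because it yields 1 for identical values and decreases as the distance grows; the function names are hypothetical.

```python
import math
from typing import List


def digit_vector(value: str) -> List[int]:
    """Split a numeric string into one integer per digit."""
    return [int(ch) for ch in value]


def euclidean_similarity(before: str, after: str) -> float:
    """Digit-wise Euclidean distance mapped to (0, 1]; the mapping is assumed."""
    p, q = digit_vector(before), digit_vector(after)
    if len(p) != len(q):
        return 0.0  # a changed length is treated as a changed data type
    distance = math.dist(p, q)
    return 1.0 / (1.0 + distance)
```

With this mapping, changing a single digit by one gives a distance of 1 and a score of 0.5, so a threshold range would need to be calibrated to the mapping actually used.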

(2) After the character text type data watermark is embedded, the ASCII code values before and after embedding are deconstructed, and similarity is evaluated through a cosine vector data similarity evaluation model; the evaluation result is C.

For example, let the values of WeChat account type data before and after the data watermark is embedded be P and P'. By deconstructing the ASCII code values, each character becomes an independent unit, namely P = {N1, N2, …, Nn} and P' = {N'1, N'2, …, N'n}; these are then substituted into the cosine data similarity evaluation model to calculate the similarity.
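A sketch of the character-level cosine evaluation is given below. It assumes that both strings have the same length, so that the two code-point vectors are comparable position by position, and returns 0 otherwise; that handling and the function names are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ascii_similarity(before: str, after: str) -> float:
    """Compare two strings through the code values of their characters."""
    if len(before) != len(after):
        return 0.0  # assumed handling of a length change
    return cosine_similarity([ord(c) for c in before], [ord(c) for c in after])
```

Because ASCII code values are all positive and of similar magnitude, the cosine of two such vectors stays very close to 1 even when several characters differ, so the threshold range for text fields would have to be correspondingly tight.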

(3) After the data watermark of the natural language type is embedded, deconstructing and word segmentation is carried out on the data watermark before and after the data watermark is embedded by using a space vector model, and data similarity evaluation is carried out on the deconstructing and word segmentation result through a cosine vector data similarity evaluation model.

Natural language data in the power business has distinct professional characteristics, for example address data such as maintenance addresses and expansion addresses, power technical terms such as operation terms and electrical quantity terms, and resident names. These can form a characteristic lexicon of the natural language data of the electric power business.

The natural language data of the power business before and after the data watermark is added is subjected to word segmentation, the resulting vector expressions are O = {O1, O2, …, On} and O' = {O'1, O'2, …, O'n}, and these are substituted into the cosine data similarity evaluation model to calculate the similarity.
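The vector space comparison of segmented text can be sketched with term-frequency vectors. Whitespace splitting stands in for the domain-specific word segmentation and lexicon described above, which are not specified in detail; that substitution and the function name are assumptions.

```python
import math
from collections import Counter


def term_vector_similarity(before: str, after: str) -> float:
    """Cosine similarity of term-frequency vectors built from two texts."""
    # Assumption: whitespace tokenization replaces lexicon-based segmentation.
    a, b = Counter(before.split()), Counter(after.split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0
```

For two six-word addresses that differ in one word, five terms are shared and the score is 5/6, which shows how a single replaced token affects the entry similarity of a natural language field.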

S3, evaluating the similarity of the data watermarks of the content blocks based on the similarity of the data items of all the data items forming the content blocks, when the similarity of the data watermarks of all the content blocks meets a first threshold range, executing S4, otherwise, adjusting the embedding proportion and/or the embedding position of the data watermarks in the content blocks which do not meet the first threshold range, and executing S2, wherein the method comprises the following steps:

A secondary similarity, namely the content block data watermark similarity, is calculated from the data item similarity of all the data items constituting the content block. The size of the content block is set by the user according to the specific service scenario; for example, for convenience of reference, the size of the content block may be set to 20 rows, 50 rows, or 100 rows.
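Dividing the rows of a data set into fixed-size content blocks can be sketched as below. The function name is hypothetical, and letting the last block be shorter than the others is an assumption, since the text does not say how a remainder is handled.

```python
from typing import List, Sequence, TypeVar

Row = TypeVar("Row")


def split_into_blocks(rows: Sequence[Row], block_size: int) -> List[List[Row]]:
    """Cut rows into consecutive blocks of block_size; the last may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(rows[i:i + block_size]) for i in range(0, len(rows), block_size)]
```

With a block size of 20, a table of 45 rows would yield blocks of 20, 20 and 5 rows.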

Taking a content block of N rows as an example, the similarity of each data watermark entry is evaluated according to the method provided in S2 and the result is denoted C; the secondary similarity δ of the content block after the data watermark is embedded is then calculated from the N entry similarities C_i.
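As a point of reference, one simple aggregate that is consistent with the symbols δ, N and C_i is the arithmetic mean of the entry similarities. This is an assumption made for illustration; the text does not confirm that the mean is the aggregate actually used.

```latex
\delta = \frac{1}{N} \sum_{i=1}^{N} C_i
```

Under this assumption, a block of N = 4 entries with similarities 1.0, 1.0, 0.5 and 0.5 would have δ = 0.75.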

Whether the data watermark similarity of each content block meets the first threshold range is then judged: when it does, S4 is executed; otherwise, the embedding proportion and/or position of the data watermark in the content block that does not meet the first threshold range is adjusted and S2 is executed.

in this embodiment, a specific description is given of a method adopted when the similarity of content blocks does not satisfy a threshold range:

Method I: dynamically adjusting the proportion of data watermark added, as follows:

Specifically, when the secondary similarity of a data content block with all data watermarks embedded exceeds the maximum value in the first threshold range, the secondary similarity before and after embedding the data watermark can be brought into range by reducing the embedding proportion of the data watermark; for example, the embedding proportion of the data watermark may be set to 50%, 30% or 20%.

When the second-level similarity of a certain data content block after embedding the data watermark is smaller than the minimum value in the first threshold range, the embedding capacity of the data watermark can be increased as much as possible by increasing the embedding proportion of the data watermark, for example, the embedding proportion of the data watermark can be set to be 20%, 30% or 50% and the like.
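The proportion rule of Method I can be sketched as a step along a ladder of preset proportions. The direction of each adjustment is taken literally from the text (reduce above the maximum, increase below the minimum); the ladder itself, built from the example values 20%, 30% and 50% plus the initial 100%, the one-step-at-a-time behaviour, and the function name are assumptions.

```python
from typing import Sequence, Tuple

# Example proportions from the text, plus the initial 100%; the ladder is assumed.
PRESET_RATIOS = (0.2, 0.3, 0.5, 1.0)


def adjust_ratio(
    current: float,
    similarity: float,
    bounds: Tuple[float, float],
    presets: Sequence[float] = PRESET_RATIOS,
) -> float:
    """Move one preset down above the range, one preset up below it."""
    low, high = bounds
    if similarity > high:
        lower = [r for r in presets if r < current]
        return max(lower) if lower else current
    if similarity < low:
        higher = [r for r in presets if r > current]
        return min(higher) if higher else current
    return current  # already inside the threshold range
```

Whether a higher score means "closer to the original" depends on the evaluation model in use, so the two branches may need to be swapped for a model whose score behaves the other way.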

Method II: when the data items constituting the content block contain a plurality of field types, the position of the data watermark can be adjusted dynamically, as follows:

In this embodiment, adjusting the position at which the data watermark is added specifically includes: when a data item in a data content block contains numerical, text and natural language fields, the data watermark is added to the numerical field, the text field or the natural language field at a fixed embedding proportion; the item similarity after embedding the numerical, text or natural language data watermark is calculated according to the method provided in S2; the position with the maximum item similarity is selected as the optimal watermark position; the data watermark originally added to the item is deleted; and the secondary similarity is calculated from the item similarity, so that the data watermark embedding capacity is improved as much as possible on the premise that the secondary similarity after the data watermark is embedded meets the threshold.
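Selecting the field with the maximum entry similarity can be sketched as follows. The dictionaries of per-field embedders and scorers, and the function name, are hypothetical; the sketch starts from the unwatermarked entry, which corresponds to the state after the original watermark has been removed.

```python
from typing import Callable, Dict, Optional, Tuple

Embedder = Callable[[str], str]        # hypothetical per-field watermark embedder
Scorer = Callable[[str, str], float]   # hypothetical per-field similarity model


def place_watermark(
    entry: Dict[str, str],
    embedders: Dict[str, Embedder],
    scorers: Dict[str, Scorer],
) -> Tuple[Dict[str, str], Optional[str]]:
    """Try each field in turn and keep the watermark where similarity is highest."""
    best_field: Optional[str] = None
    best_value: Optional[str] = None
    best_score = float("-inf")
    for field, original in entry.items():
        if field not in embedders or field not in scorers:
            continue  # no watermark type available for this field
        candidate = embedders[field](original)
        score = scorers[field](original, candidate)
        if score > best_score:
            best_field, best_value, best_score = field, candidate, score
    result = dict(entry)  # the input entry is left unmodified
    if best_field is not None:
        result[best_field] = best_value
    return result, best_field
```

Only one field per entry carries the watermark in this sketch; an implementation that embeds in several fields at once would need a different selection rule.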

S4, calculating the similarity of the whole target data based on the data watermark similarity of all content blocks forming the target data, completing the embedding of the data watermark when the similarity of the whole target data meets a second threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermark in one or more content blocks, and executing S2, wherein the method comprises the following steps:

When the secondary similarity after embedding the data watermark meets the threshold range and the embedding capacity of the data watermark has been raised as far as possible, the similarity of the whole target data after the data watermark is embedded, namely the three-level similarity, is calculated from the secondary similarity of all the content blocks. When the three-level similarity meets the second threshold range, the embedding of the data watermark is completed; otherwise, the embedding proportion and/or the position of the data watermark in one or more content blocks is adjusted and S2 is executed.

When the three-level similarity does not meet the second threshold range, the dynamic adjustment can be performed by the following way of adjusting the proportion:

I.e. when the overall similarity of the target data does not meet the second threshold range, the proportion of embedded data watermarks in one or more content blocks needs to be adjusted to a preset proportion.

When the three-level similarity does not meet the second threshold range and the data items forming the content block contain fields of various types in the content block to be adjusted, the positions of the data watermarks embedded in the data items can be adjusted to enable the three-level similarity to meet the second threshold range, and therefore the embedding process of the data watermarks is completed.

In this embodiment, taking target data divided into M content blocks as an example, the data watermark is embedded in each content block, the similarity of the data watermark entries is evaluated, the similarity of each content block is evaluated based on the similarity of its data watermark entries and the result is denoted δ, and the three-level similarity θ of the target data after all data watermarks are embedded is calculated from the M content block similarities δ_i.
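The two aggregation levels can be sketched together. The arithmetic mean is used at both levels purely as an assumption, since the aggregate is not confirmed by the text, and the function names are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def block_similarity(entry_scores: Sequence[float]) -> float:
    """Secondary similarity of one content block (assumed: mean of entry scores)."""
    return fmean(entry_scores)


def overall_similarity(blocks: List[Sequence[float]]) -> float:
    """Three-level similarity of the data set (assumed: mean of block scores)."""
    return fmean([block_similarity(scores) for scores in blocks])
```

Note that a mean of block means weights every block equally, so it differs from the mean over all entries whenever the blocks have different sizes.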

If θ exceeds the maximum value of the set second threshold range, the embedding proportion and/or position of the data watermark is adjusted so as to increase the δ values and thereby the θ value, ultimately increasing the similarity of the whole data after the data watermark is embedded.

If θ is far from the minimum value of the set second threshold range, the embedding proportion and/or position of the data watermark is adjusted and the embedding proportion of the data watermark is raised, so that the embedding capacity of the data watermark is improved as much as possible on the premise of ensuring the three-level similarity after the data watermark is embedded.

To achieve high concealment and high simulation after the data watermark is embedded into the target data, the embodiment of the invention selects a suitable watermarking proportion and distribution strategy according to the similarity evaluation results of the different data watermark algorithms.

Example 2: based on the same inventive concept, the invention also provides a data watermark embedding system of target data, as shown in fig. 2, comprising:

On the one hand, the system evaluates, through a data similarity evaluation model, the similarity of the data watermark items, of the content blocks embedded with the data watermark, and of the whole data embedded with the data watermark; on the other hand, it dynamically adjusts the proportion and distribution position of the added watermark according to the evaluation results so as to meet the similarity threshold set by the user for the circulated data, thereby ensuring the concealment and high simulation of the data watermark embedding as a whole.

In an embodiment, the system further comprises an adjustment module for adjusting the embedding ratio and/or the position of the data watermark.

The adjustment module includes:

the first adjusting unit is used for adjusting the embedding proportion of the data watermark when the data entry contains a single type field;

and the second adjusting unit is used for adjusting the embedding proportion and/or the position of the data watermark when the data entry contains multiple types of fields.

The adjustment module further includes: the proportion adjusting unit is specifically used for:

The adjustment module further includes: the position adjustment unit is specifically used for:

In an embodiment, the field types in the data entry include any one or more of:

a numeric field, a text field, and a natural language field.

In an embodiment, the embedding module is specifically configured to:

In an embodiment, the data entry similarity evaluation module is specifically configured to:

In an embodiment, the content block data watermark similarity assessment is performed as follows:

In an embodiment, the overall similarity of the target data is evaluated as follows:

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is illustrative of the present invention and is not to be construed as limiting it; all modifications, equivalents, improvements and the like are intended to be included within the scope of the present invention as defined by the appended claims.

Claims

1. A method of data watermark embedding for target data, comprising:

S2, adopting a preset data similarity evaluation model to evaluate the similarity of the data items before embedding the data watermark and the data items after embedding the data watermark;

S3, evaluating the similarity of the data watermarks of the content blocks based on the data items of all the data items forming the content blocks before embedding the data watermarks and the data items of all the data items forming the content blocks after embedding the data watermarks, when the similarity of the data watermarks of all the content blocks meets a first threshold range, executing S4, otherwise, adjusting the embedding proportion and/or the embedding position of the data watermarks in the content blocks which do not meet the first threshold range, and executing S2;

S4, calculating the similarity of the whole target data based on the whole data of all content blocks forming the target data before embedding the data watermark and the whole data of all content blocks forming the target data after embedding the data watermark, and finishing the embedding of the data watermark when the similarity of the whole target data meets a second threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermark in one or more content blocks, and executing S2.

2. The method of claim 1, wherein adjusting the embedding ratio and/or location of the data watermark comprises:

3. The method of claim 2, wherein said adjusting the embedding ratio of the data watermark comprises:

4. The method of claim 2, wherein said adjusting the embedding location of the data watermark comprises:

5. The method of any of claims 2 or 4, wherein the field types in the data entry include any one or more of:

a numeric field, a text field, and a natural language field.

6. The method of claim 5, wherein embedding a data watermark in the data entry comprises:

7. The method of claim 1, wherein the performing data entry similarity evaluation on the data entry embedded with the data watermark using a preset data similarity evaluation model comprises:

8. The method of claim 1, wherein the content block data watermark similarity assessment is performed as follows:

9. The method of claim 1, wherein the similarity of the target data overall is evaluated as follows:

10. A data watermark embedding system for target data, comprising:

the embedding module is used for dividing target data to be embedded with the data watermark into a plurality of content blocks, and embedding the data watermark into a data item of each content block;

the data item similarity evaluation module is used for evaluating the similarity of the data items before the data watermark is embedded and the data items after the data watermark is embedded;

the content block similarity evaluation module is used for evaluating the similarity of the content block data watermarks of all data items which form the content block before embedding the data watermarks and all data items which form the content block after embedding the data watermarks, executing the overall similarity evaluation module when the similarity of the content block data watermarks meets a first threshold range, otherwise, adjusting the embedding proportion and/or the position of the data watermarks in the content block which does not meet the first threshold range, and executing the data item similarity evaluation module;

and the overall similarity evaluation module is used for evaluating overall similarity of the data overall based on all content blocks forming the target data before embedding the data watermark and the data overall based on all content blocks forming the target data after embedding the data watermark, completing embedding of the data watermark when the similarity of the data overall meets a second threshold range, otherwise, adjusting the embedding proportion and/or position of the data watermark in one or more content blocks, and executing the data item similarity evaluation module.

11. The system of claim 10, wherein the data item similarity evaluation module is specifically configured to: