CN118153007A

CN118153007A - Text-oriented data database watermark embedding method, system and storage medium

Info

Publication number: CN118153007A
Application number: CN202410576968.2A
Authority: CN
Inventors: 张亮; 曹晓光; 李娇娇; 刘涛; 郝春辉; 李艾功; 吴志刚; 徐建忠
Original assignee: Hangzhou Shiping Information & Technology Co ltd
Current assignee: Hangzhou Shiping Information & Technology Co ltd
Priority date: 2024-05-10
Filing date: 2024-05-10
Publication date: 2024-06-07
Anticipated expiration: 2044-05-10
Also published as: CN118153007B

Abstract

A text-oriented database watermark embedding method, system and storage medium. The watermark embedding method includes: converting arbitrary non-relational data and relational data into unified intermediate format data and grouping it; dividing the data to be embedded into sentences, replacing the target word of each sentence with a mark to obtain a marked sentence, splicing the original sentence onto the marked sentence, and predicting a synonym set; substituting each synonym into the sentence, sorting the synonyms in the synonym set by sentence score, and taking the first-ranked synonym as the candidate word; and replacing the target word with the candidate word to obtain a new sentence and performing a reversibility test, and, if the test passes, taking the candidate word as the replacement word and the position of the target word as a watermark embedding position. The method eliminates the structural differences between data stored in relational and non-relational databases, reduces the impact on the original data, and enhances the robustness of the database watermark.

Description

Text-oriented data database watermark embedding method, system and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a text-oriented data database watermark embedding method, a text-oriented data database watermark embedding system and a storage medium.

Background

To protect database copyright, database watermarking techniques can be used to stop infringement promptly when a database is leaked. The core idea of database watermarking is to embed a meaningful digital signal into the database in an imperceptible way. When infringement occurs, the embedded digital signal can be extracted to prove ownership of the data. In recent years, several watermarking algorithms have emerged for text data in relational databases.

For example, "Li W, Li N, Yan J, et al. Secure and High-Quality Watermarking Algorithms for Relational Database Based on Semantic[J]. IEEE Transactions on Knowledge and Data Engineering, 2022." proposes a relational database text watermarking algorithm that generates synonyms from Word2vec word vectors. The specific steps are as follows: (1) A Word2vec model is trained on all text attributes of the current database to obtain a word vector model for that language scenario. (2) The attribute columns to be embedded are split to obtain a plurality of attribute columns. (3) For each attribute in the database, its identifier is determined by hashing the tuple's primary key, the user key, and the attribute key. (4) The tuples in which the watermark is to be embedded are determined from the preset watermark embedding interval, and the positions at which watermark bits are to be embedded are determined from the watermark length. (5) All replaceable words of the attribute are obtained. The criterion is whether a word has synonyms; if it does, it is considered replaceable. (6) The word to be replaced is selected using the identifier of the attribute and the number of replaceable words of the attribute. (7) The similarity between the word to be replaced and each synonym is calculated with Word2vec, and the word with the highest similarity, provided that it exceeds 0.8, is selected as the replacement word. (8) If no replacement word exists, this is recorded. If a replacement word exists and the watermark bit to be embedded is 1, synonym replacement is performed on the original data and recorded. If the watermark bit to be embedded is 0, no replacement is made and only a record is kept.

This method has two disadvantages: on the one hand, the synonyms obtained by Word2vec are not necessarily the words closest in meaning to the target word; on the other hand, the algorithm does not consider the context of the word in the sentence, so the semantics of the original sentence may change after replacement.

As another example, "Gort M L P, Olliaro M, Cortesi A, et al. Semantic-driven watermarking of relational textual databases[J]. Expert Systems with Applications, 2021, 167: 114013." generates synonyms from the manually built lexicon WordNet and designs a relational database text watermarking algorithm. The specific steps are as follows: (1) A virtual primary key (VPK) is generated for each tuple from the user key and the tuple's primary key. (2) The virtual primary key and a preset embedding interval determine which tuples are to carry the watermark; if the resulting value is 0, embedding is performed. Each sentence of the tuple attribute is then processed. (3) A key value is generated from the user key, the virtual primary key of the tuple, and the sentence content. The key value is combined with the length of the sentence to select a word in the sentence as the target word for replacement. (4) A synonym set of the target word is obtained through WordNet; a synonym whose own synonym set is the same as that of the target word is taken as a candidate word for replacement. (5) The key value obtained in step (3) is combined with the length of the watermark to obtain the watermark value to be embedded. The candidate word and the target word are then sorted; if the watermark value to be embedded is 1, the first word is selected for replacement, otherwise the second word is selected. (6) If the similarity between the sentence after synonym replacement and the original sentence does not meet the preset threshold, the replacement is rolled back; otherwise, embedding is complete.

This method also has disadvantages: on the one hand, the generated synonyms depend on the quality of the manually built lexicon; on the other hand, the embedding position is selected directly by the secret key, and the word at that position is not necessarily the most suitable one to replace in the text, which can cause semantic loss. Moreover, the existing methods have shortcomings in robustness and other respects. In addition, most of the prior art targets only relational databases and ignores the need for, and significance of, embedding watermarks in non-relational databases.

Disclosure of Invention

The invention aims to solve the problems in the prior art, and provides a method, a system and a storage medium for embedding a database watermark for text data, which can eliminate the structural difference of the stored data of a relational database and a non-relational database, simultaneously reduce the influence on the original data to the greatest extent, ensure the usability of the data and enhance the robustness of the database watermark.

In order to achieve the above purpose, the present invention has the following technical scheme:

in a first aspect, an embodiment of the present invention provides a method for embedding a watermark in a database for text-oriented data, including:

converting any non-relational data and relational data in the database into uniform intermediate format data;

grouping the intermediate format data, and determining the data of each group and the watermark value to be embedded in each group;

dividing the data to be embedded with the watermark values by taking sentences as units, replacing target words of the sentences with marks to obtain marked sentences, splicing the original sentences into the marked sentences, and predicting the sentences through a BERT model to obtain a synonym set;

replacing the synonyms in the synonym set into sentences, calculating to obtain sentence scores, sorting the synonyms in the synonym set according to the sentence scores, and taking the synonyms with the first rank as candidate words of the target words;

Replacing the target word with the candidate word to obtain a new sentence, carrying out reversible test, detecting whether the target word can be obtained again through the new sentence and the candidate word, if so, taking the corresponding candidate word as a replacement word, and taking the position of the corresponding target word as a watermark embedding position; otherwise, taking the next word as a target word, and continuing to search in the same way;

and finding out the positions of the corresponding replacement words and the target words in all data of all groups to finish the embedding of the watermark.

As a preferred solution, in the step of converting any non-relational data and relational data in the database into unified intermediate format data:

The non-relational data comprises documents, graphs, wide columns and key value data;

In a database of document types, for the case of key value nesting, converting nested key names into a single key value pair format; for the label nesting situation, converting the nested label name into a single key value pair format;

In the graph type database, the relational data in the graph type database is ignored, only the node data in the graph is processed, and for each node, the extracted node attribute value is converted into a key value pair format;

In a wide-column type database, adding a row key value into respective data in the form of a key value pair of 'id: row key value', wherein the key name of each attribute of each piece of data is formed in the form of a 'column family_column name_timestamp', the value corresponding to the key name is the value of the attribute, and the timestamp adopts the latest timestamp of the corresponding data;

for the database of the key value pair type, processing conversion is not performed;

For the relational data, no conversion is performed; each piece of data in the relational database is regarded as being in key-value pair format, with the attribute as the key name and the attribute value as the key value; foreign keys are ignored.

The intermediate format data consists of the values of the key-value pairs, with the values separated by a delimiter; watermark embedding is carried out after a virtual primary key has been generated for each piece of data; after watermark embedding is completed, the position of the data in the original database is located according to the key-value pair information, and the original data is updated with the watermark embedding result.

As a preferable mode, in the step of converting any non-relational data and relational data in the database into unified intermediate format data, a virtual primary key is designed for the unified intermediate format data, and the design mode of the virtual primary key includes the following steps:

Inputting a key SK of a database proprietor;

for each item of non-embedded data $C$ of each piece of data, calculating a keyed hash value $h = \mathrm{Hash}(SK, C)$, thereby obtaining a plurality of hash values $h_1, h_2, \dots, h_n$ corresponding to the data, where $h_n$ represents the $n$-th hash value of the data and $n$ is the total number of hash values of the corresponding data;

taking the maximum hash value of the corresponding data as the virtual primary key of the corresponding data:

$VPK = \max(h_1, h_2, \dots, h_n)$.

In a preferred embodiment, in the step of grouping the intermediate format data and determining the data of each group and the watermark value to be embedded in each group, the grouping is determined by the virtual primary key of the intermediate format data, and the specific calculation formula is $i = VPK \bmod l$, where $i$ is the group number and also indicates that the $i$-th bit of the watermark is embedded in that group, $l$ is the length of the watermark bit sequence, and every piece of data in a group is embedded with the same watermark bit at the corresponding position.

As a preferred solution, in the step of finding the corresponding replacement words and the positions of the target words in all data of all groups to complete watermark embedding:

the position at which the watermark is to be embedded is selected according to the formula $j = VPK \bmod L$;

Wherein j is the serial number of all the embeddable watermark positions P, and L is the number of embeddable watermark positions;

The embeddable positions P are determined according to the actual situation: if a watermark is to be embedded in every sentence, P is the set of all embeddable positions in the sentence; if a watermark only needs to be embedded once per piece of data, P is the set of all embeddable positions in that piece of data; when a watermark is embedded in every sentence, the target word and the candidate word are sorted according to an ordering rule, and if the watermark bit is 0, the target word of the original sentence is replaced with the first word; if the watermark bit is 1, the target word of the original sentence is replaced with the second word.

As a preferable scheme, in the step of substituting the synonyms in the synonym set into the sentence and calculating sentence scores, the sentence score of each sentence obtained by substituting a synonym is calculated by a BERTScore model;

in the step of sorting the synonyms in the synonym set according to the sentence scores and taking the first-ranked synonym as the candidate word of the target word, if the sentence score of the first-ranked word is smaller than a set threshold, the target word is skipped and the next word is processed; otherwise, the first-ranked word is used as the candidate word to replace the target word.

As a preferred scheme, after the step of finding the corresponding replacement words and the positions of the target words in all data of all groups to complete watermark embedding, the watermark is extracted by checking where the word at the embedding position of the data ranks in the ordering of the target word and the candidate word: watermark bit 0 is extracted if it ranks first, and watermark bit 1 is extracted if it ranks second; after watermark bits have been extracted from all data in the same group, the final watermark bit of the group is determined by majority voting; this continues until all watermark bits are extracted.

In a second aspect, another embodiment of the present invention provides a text-oriented data database watermark embedding system, including:

The data format conversion module is used for converting any non-relational data and relational data in the database into unified intermediate format data;

The grouping module is used for grouping the data in the intermediate format and determining the data of each group and the watermark value to be embedded in each group;

The synonym generation module is used for dividing the data with the watermark value to be embedded into the sentences, replacing the target words of the sentences with marks to obtain marked sentences, splicing the original sentences into the marked sentences, and predicting the sentences through the BERT model to obtain a synonym set;

The similarity detection module is used for replacing the synonyms in the synonym set into sentences, calculating to obtain sentence scores, sorting the synonyms in the synonym set according to the sentence scores, and taking the synonyms with the first rank as candidate words of the target words;

The reversible test module is used for replacing the target word by using the candidate word to obtain a new sentence and carrying out reversible test, detecting whether the target word can be obtained again through the new sentence and the candidate word, if so, taking the corresponding candidate word as a replacement word and taking the position where the corresponding target word is located as a watermark embedding position; otherwise, taking the next word as a target word, and continuing to search in the same way;

And the replacement module is used for finding out the corresponding replacement words and the positions of the target words in all data of all groups to complete the embedding of the watermark.

In a third aspect, another embodiment of the present invention further proposes a computer readable storage medium storing a computer program, which when executed by a processor implements the text-oriented data database watermark embedding method according to the first aspect.

Compared with the prior art, the first aspect of the invention has at least the following beneficial effects:

By converting any non-relational data and relational data in the database into unified intermediate format data, the structural difference of the relational and non-relational database storage data is eliminated, so that the method is independent of any type of database data, only depends on the converted intermediate format data, and meets the requirement of embedding watermarks in various databases. Meanwhile, in order to ensure the transparency of the watermark, the watermark is embedded in a synonym replacement mode, and the invention predicts a high-quality synonym set based on the BERT model, and reduces the distortion influence on the original data caused by the synonym replacement and watermark embedding. And after the synonym is replaced, carrying out reversible test on the new sentence, detecting whether the target word can be obtained through the new sentence and the candidate word, if so, taking the corresponding candidate word as a replacement word, and taking the position where the corresponding target word is located as a watermark embedding position, thereby ensuring that the correct watermark is extracted. The invention provides a watermark embedding method aiming at text data in relational and non-relational databases based on synonym substitution, and watermark extraction is the inverse process of watermark embedding, so that the watermark technology can be widely applied to databases of different types, the usability of the data is ensured, and the robustness of the database watermark is enhanced.

Furthermore, the invention designs a virtual primary key for the intermediate format data generated from the relational and non-relational databases, and the generated virtual primary key serves two functions: first, it marks each piece of data so that the data can be grouped when the watermark is embedded; second, it is used to select the watermark embedding position. By taking the maximum hash value of the non-embedded attribute values as the virtual primary key, grouping the data according to the virtual primary key, embedding the same watermark in each group, and combining this with the reversibility test, the embedding and extraction processes are guaranteed to be mutually reversible.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention, and that other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a data format conversion process according to an embodiment of the present invention;

FIG. 2 is a flow chart of watermark embedding and extraction in accordance with an embodiment of the present invention;

FIG. 3 is a diagram of an embodiment of the present invention for embedding watermarks in a particular sentence;

FIG. 4 is a graph showing the comparison of experimental results of deleting the entire embedded data according to the embodiment of the present invention with other prior methods;

FIG. 5 is a graph comparing experimental results after inserting new data with other prior art methods;

FIG. 6 is a graph comparing the results of alternative attack experiments with other prior art methods.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, one of ordinary skill in the art may also obtain other embodiments without undue burden.

The invention provides a watermark embedding method, based on synonym substitution, for text data in relational and non-relational databases. First, to ensure the transparency of the watermark, the watermark is embedded by synonym substitution. To reduce the distortion of the original data caused by the embedded watermark, synonyms are generated based on the BERT model ^[1] and the text generation evaluation model BERTScore ^[2]. Second, to eliminate the structural differences between data stored in relational and non-relational databases, a data format preprocessing process is designed that uniformly converts the data into an intermediate format. The maximum hash value of the non-embedded attribute values is then taken as a virtual primary key, the data is grouped according to the virtual primary key, and the same watermark is embedded in each group. Finally, to ensure that embedding and extraction are mutually reversible, a reversibility detection process is designed so that the original word and the embedding position can be recovered from the new sentence after synonym replacement.

[1]Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[2]Zhang T, Kishore V, Wu F, et al. BERTScore: Evaluating Text Generation with BERT[C]//International Conference on Learning Representations.2020.

Specifically, the embodiment of the invention provides a text-oriented data database watermark embedding method, which comprises the following steps:

S1, converting any non-relational data and relational data in a database into uniform intermediate format data;

S2, grouping the intermediate format data, and determining the data of each group and the watermark value to be embedded in each group;

S3, dividing the data to be embedded with the watermark values by taking sentences as units, replacing target words of the sentences with marks to obtain marked sentences, splicing the original sentences into the marked sentences, and predicting through a BERT model to obtain a synonym set;

S4, replacing the synonyms in the synonym set into sentences, calculating to obtain sentence scores, sorting the synonyms in the synonym set according to the sentence scores, and taking the synonyms with the first rank as candidate words of the target words;

S5, replacing the target word with the candidate word to obtain a new sentence, carrying out reversible test, detecting whether the target word can be obtained again through the new sentence and the candidate word, if so, taking the corresponding candidate word as a replacement word, and taking the position of the corresponding target word as a watermark embedding position; otherwise, taking the next word as a target word, and continuing to search in the same way;

and S6, finding out the corresponding replacement words and the positions of the target words in all data of all groups to finish watermark embedding.

As shown in fig. 1, the non-relational data includes document, graph, wide column and key value pair data, and the processing flows are as follows:

1) For a non-relational database, the main idea is to convert its data into key-value pair format.

1.1) For document-type databases, most data is in JSON or XML format, and the embodiment of the invention is primarily directed to these two formats. Pointers to other documents that may exist in a document are ignored in preprocessing. For nested keys, the nested key names are joined with "_" and converted into a single key-value pair format; as in FIG. 1, { people: { name: alice, age: 10 } } is converted to { people_name: alice, people_age: 10 }. For XML, the tag name is taken as the key name and the content wrapped by the tag is taken as the value. For nested tags, the nested tag names are likewise joined with "_" and converted into a single key-value pair format; as in FIG. 1, <people><name>alice</name><age>10</age></people> is converted to { people_name: alice, people_age: 10 }.

1.2) For graph-type databases, the embodiment of the invention ignores the relationship data and processes only the node data in the graph. For each node, its attributes are extracted as a separate piece of data. For example, for the data (Person {born: XXX, name: XXX})-[:ACTED {roles: [XXX]}]->(:Movie {title: XXX}) in FIG. 1, only the data of the nodes, namely Person and Movie, is considered, and the relationship data, namely ACTED, is not processed. The attribute values of the two nodes are converted into the key-value pair formats { born: XXX, name: XXX } and { title: XXX }.

1.3) In a wide-column database, each attribute value of each piece of data is determined by a row key, column family, column name, and timestamp. The embodiment of the invention adds the row key to the respective data as a key-value pair of the form "id: row key value". The key name of each attribute of each piece of data takes the form "column family_column name_timestamp", and the value corresponding to the key name is the value of the attribute; the timestamp is the latest timestamp of the data, that is, only the latest version of the data is processed. The data in FIG. 1 is finally converted to { id: 001, personal_name_t1: alice, personal_age_t1: 10, office_address_t1: China, office_phone_t2: 055345 }, where t1 and t2 are the timestamps of the respective data.

1.4) For key-value databases, the data is already in key-value pair format, so no conversion is performed.

2) For relational databases, no conversion is performed, because each piece of data can be understood as key-value pairs in which the attribute is the key name and the attribute value is the key value. Foreign keys are ignored and not processed.

Finally, all key-value pair data can be uniformly converted into the intermediate format shown in FIG. 1. The intermediate format data in the embodiment of the invention consists of the values of the key-value pairs, with the values separated by a delimiter; commas are used as the delimiter in the figure. A virtual primary key is then generated for each piece of data, after which the watermark is embedded. After watermark embedding is completed, the position of the data in the original database is located according to the key-value pair information, and the original data is updated with the watermark embedding result.
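The conversion rules above can be sketched in a few lines. This is only an illustrative sketch: the function names `flatten_document`, `wide_column_to_pairs` and `to_intermediate`, the nested-dictionary layout assumed for wide-column cells, and the use of integer timestamps are assumptions made for the example and are not taken from the embodiment.

```python
from typing import Any, Dict


def flatten_document(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested keys into single key-value pairs joined by '_'."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_document(value, name))
        else:
            flat[name] = value
    return flat


def wide_column_to_pairs(
    row_key: str, cells: Dict[str, Dict[str, Dict[int, Any]]]
) -> Dict[str, Any]:
    """Convert one wide-column row to key-value pairs.

    `cells` is assumed to map column family -> column name -> {timestamp: value}.
    Only the newest version of each cell is kept.
    """
    pairs: Dict[str, Any] = {"id": row_key}
    for family, columns in cells.items():
        for column, versions in columns.items():
            latest = max(versions)  # latest timestamp of this cell
            pairs[f"{family}_{column}_{latest}"] = versions[latest]
    return pairs


def to_intermediate(pairs: Dict[str, Any], delimiter: str = ",") -> str:
    """Join the values of the key-value pairs into the intermediate format."""
    return delimiter.join(str(value) for value in pairs.values())
```

For the document example of FIG. 1, `flatten_document({"people": {"name": "alice", "age": 10}})` yields the pairs `people_name: alice` and `people_age: 10`, which `to_intermediate` joins into `alice,10`.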

Further, the embodiment of the invention designs a virtual primary key for the intermediate format data generated from the relational and non-relational databases. The generated virtual primary key serves two functions in the embodiment of the present invention: it marks each piece of data so that the data can be grouped when the watermark is embedded, and it is used to select the watermark embedding position. The virtual primary key is designed as follows:

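Inputting a key SK of the database owner;

for each item of non-embedded data $C$ of each piece of data, calculating a hash value from $SK$ and $C$, and taking the maximum of the resulting hash values $h_1, h_2, \dots, h_n$ as the virtual primary key: $VPK = \max(h_1, h_2, \dots, h_n)$;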

Here, non-embedded data is the opposite of embedded data. Embedded data refers to data in which a watermark may be embedded; the embodiment of the invention defines it as text data containing more than 10 words. All other data, such as non-text data and shorter text data, is non-embedded data.
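A minimal sketch of the virtual primary key computation follows. The source does not name the hash function, so HMAC-SHA256 is used here purely as an assumed stand-in for the keyed hash; the helper names `keyed_hash`, `is_embeddable` and `virtual_primary_key` are likewise hypothetical, and words are counted by splitting on whitespace.

```python
import hashlib
import hmac
from typing import Any, Iterable


def keyed_hash(secret_key: bytes, item: str) -> int:
    """Keyed hash of one data item (HMAC-SHA256 is an assumption)."""
    digest = hmac.new(secret_key, item.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def is_embeddable(value: Any) -> bool:
    """Embedded data: text with more than 10 whitespace-separated words."""
    return isinstance(value, str) and len(value.split()) > 10


def virtual_primary_key(secret_key: bytes, values: Iterable[Any]) -> int:
    """Maximum keyed hash over the non-embedded items of one piece of data."""
    hashes = [
        keyed_hash(secret_key, str(value))
        for value in values
        if not is_embeddable(value)
    ]
    if not hashes:
        raise ValueError("no non-embedded data to derive a virtual primary key")
    return max(hashes)
```

Because only non-embedded items enter the hash, replacing words in the long text fields leaves the virtual primary key unchanged, which is what allows the same key to be recomputed at extraction time.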

Further, in the embodiment of the present invention, when the intermediate format data is grouped, the grouping is determined by the virtual primary key of the intermediate format data, and the specific calculation formula is $i = VPK \bmod l$, where $i$ is the group number and also indicates that the $i$-th bit of the watermark is embedded in that group, and $l$ is the length of the watermark bit sequence. Every piece of data in a group is embedded with the same watermark bit at the corresponding position, and this repeated embedding further improves the robustness of the watermark.
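The grouping step can be illustrated as follows, under the assumption that the group number is the virtual primary key reduced modulo the watermark length. The names `group_index` and `group_records` and the dictionary of record identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_index(vpk: int, watermark_length: int) -> int:
    """Group number, assumed here to be VPK mod l."""
    return vpk % watermark_length


def group_records(
    vpks: Dict[str, int], watermark_bits: Sequence[int]
) -> Dict[int, List[str]]:
    """Assign each record id to a group; group i carries watermark bit i."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for record_id, vpk in vpks.items():
        groups[group_index(vpk, len(watermark_bits))].append(record_id)
    return dict(groups)
```

With a 4-bit watermark, records whose keys are 10 and 14 both fall into group 2 and therefore both carry the bit at index 2, which is the redundancy the majority vote relies on at extraction.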

In one possible implementation, step S3 processes each item of embedded data of each piece of data. The data to be embedded is divided into sentences, which serve as the basic unit of processing. The embodiment of the invention embeds a watermark in each sentence, ignoring sentences with fewer than 10 words. For a sentence $S = \{w_1, w_2, \dots, w_n\}$, the target word $w_t$ is replaced with a <mask> tag to obtain the marked sentence $S' = \{w_1, \dots, \text{<mask>}, \dots, w_n\}$, where $w_1$, $w_2$ and $w_n$ represent the 1st, 2nd and $n$-th words in the sentence and $n$ is the number of words contained in the sentence; the original sentence is then spliced onto the marked sentence. Finally, a synonym set $C_t = \{c_1, c_2, \dots, c_{sn}\}$ is obtained through BERT model prediction, where $c_1$, $c_2$ and $c_{sn}$ represent the 1st, 2nd and $sn$-th synonyms of the target word $w_t$ obtained through BERT model prediction, and $sn$ is the number of synonyms obtained.
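The construction of the model input can be sketched without committing to a particular BERT implementation. The sketch below only builds the spliced text; the literal tokens `[MASK]` and `[SEP]`, the order "original sentence first, marked sentence second", and the names `is_eligible` and `build_masked_input` are assumptions for illustration, and the prediction call itself is deliberately left out.

```python
from typing import List, Sequence


def is_eligible(words: Sequence[str], min_words: int = 10) -> bool:
    """Sentences with fewer than `min_words` words are ignored."""
    return len(words) >= min_words


def build_masked_input(
    words: Sequence[str],
    target_index: int,
    mask_token: str = "[MASK]",
    sep_token: str = "[SEP]",
) -> str:
    """Splice the original sentence onto its marked copy."""
    if not 0 <= target_index < len(words):
        raise IndexError("target_index is outside the sentence")
    masked: List[str] = list(words)
    masked[target_index] = mask_token  # replace the target word with the mark
    return f"{' '.join(words)} {sep_token} {' '.join(masked)}"
```

Giving the model the unmasked sentence alongside the marked one lets the prediction at the mark position depend on the original word, which is what steers it toward synonyms rather than arbitrary words that fit the context.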

Further, step S4 substitutes the synonyms in the synonym set into the sentence to obtain $S_1$, $S_2$, etc., where $c_1$ and $c_2$ are synonyms in the synonym set and $S_1$ and $S_2$ are the new sentences obtained by replacing the target word $w_t$ of the original sentence $S$ with $c_1$ and $c_2$ respectively. The sentence scores between $S$ and each of $S_1$, $S_2$, etc. are then obtained through the BERTScore model. The synonyms in the synonym set are sorted by sentence score, and the first-ranked word is the candidate word for replacing the target word. If the sentence score of that word is smaller than a set threshold, the target word is skipped and the next word is processed; otherwise, the word is used as the candidate word to replace the target word.
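The ranking and threshold check of step S4 can be sketched with the scoring model passed in as a function. Here `score` stands for whatever sentence-similarity model is used (BERTScore in the embodiment); the names `substitute` and `pick_candidate` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def substitute(words: Sequence[str], index: int, replacement: str) -> List[str]:
    """Return a copy of the sentence with one word replaced."""
    new_words = list(words)
    new_words[index] = replacement
    return new_words


def pick_candidate(
    words: Sequence[str],
    target_index: int,
    synonyms: Sequence[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Rank synonyms by sentence score and return the top one.

    Returns None when there are no synonyms or the best score is below
    the threshold, meaning the target word should be skipped.
    """
    original = " ".join(words)
    scored = [
        (score(original, " ".join(substitute(words, target_index, syn))), syn)
        for syn in synonyms
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] < threshold:
        return None
    return scored[0][1]
```

Keeping the scorer as a parameter means the same selection logic can be exercised with a lookup table in tests and with a real model in production.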

Further, step S5 uses the candidate word to replace the target word, and performs a reversible test on the obtained new sentence, so as to ensure that the same result can be obtained through the replaced word when the watermark is extracted, thereby ensuring that the correct watermark is extracted.
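One way to read the reversible test is as a round trip: the candidate proposed for the target word must, when the same procedure is run on the new sentence, lead back to the original target word. The sketch below assumes that reading; `propose` is a hypothetical callable wrapping steps S3 and S4, and `is_reversible` is an illustrative name.

```python
from typing import Callable, Optional, Sequence

# propose(words, index) returns the candidate word for words[index], or None.
Proposer = Callable[[Sequence[str], int], Optional[str]]


def is_reversible(words: Sequence[str], target_index: int, propose: Proposer) -> bool:
    """Check that the replacement can be traced back to the target word."""
    target = words[target_index]
    candidate = propose(words, target_index)
    if candidate is None or candidate == target:
        return False
    new_words = list(words)
    new_words[target_index] = candidate
    # Run the same procedure on the new sentence; it must return the target.
    return propose(new_words, target_index) == target
```

If the round trip fails, the extractor would derive a different word pair from the watermarked sentence than the embedder used, so the position is rejected and the next word is tried.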

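In one possible embodiment, step S6 selects the position at which the watermark is to be embedded according to the formula $j = VPK \bmod L$, where $j$ is the index among all embeddable watermark positions $P$ and $L$ is the number of embeddable watermark positions;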

The embeddable positions P are determined according to the actual situation: if a watermark is to be embedded in every sentence, P is the set of all embeddable positions in the sentence; if a watermark only needs to be embedded once per piece of data, P is the set of all embeddable positions in that piece of data. The embodiment of the invention embeds a watermark in every sentence. When a watermark is embedded in every sentence, the target word and the candidate word are sorted alphabetically (the specific ordering rule can be changed; for Chinese, for example, the words can be sorted by Unicode code point or by hash value). If the watermark bit is 0, the target word of the original sentence is replaced with the first word; if the watermark bit is 1, the target word of the original sentence is replaced with the second word.
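The position selection and bit embedding can be sketched as below, assuming the position index is the virtual primary key reduced modulo the number of embeddable positions and using Python's default string ordering as the alphabetical rule. The names `select_position` and `embed_bit` are hypothetical.

```python
from typing import List, Sequence


def select_position(vpk: int, positions: Sequence[int]) -> int:
    """Pick one embeddable position, assumed here to be index VPK mod L."""
    if not positions:
        raise ValueError("no embeddable positions")
    return positions[vpk % len(positions)]


def embed_bit(
    words: Sequence[str], position: int, candidate: str, bit: int
) -> List[str]:
    """Write the first sorted word for bit 0 and the second for bit 1."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    pair = sorted([words[position], candidate])  # alphabetical ordering rule
    new_words = list(words)
    new_words[position] = pair[bit]
    return new_words
```

Note that when the target word already sorts into the slot the bit calls for, the sentence is left unchanged, so on average only about half of the selected positions are actually modified.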

In one possible implementation, a majority voting process is implemented after step S6, the majority voting process being used only for watermark extraction. The majority voting process comprises the following specific steps:

After all the embeddable positions are obtained, the position of watermark embedding is still selected by the same formula, and the target word and the candidate word are again sorted alphabetically. The rank of the word found at this position of the data is then checked: watermark bit 0 is extracted if it is ranked first, and watermark bit 1 is extracted if it is ranked second. Because the data are grouped at embedding time and the same watermark is embedded throughout a group, each group yields multiple watermark bits at extraction time. After watermark bits have been extracted from all data in the group, the finally extracted watermark bit of the group is determined by majority voting. This continues until all watermark bits are extracted.
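A minimal sketch of the per-record extraction and the group-level majority vote follows. The names `extract_bit` and `majority_bit` are hypothetical, and resolving a tie to 0 is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter


def extract_bit(word: str, partner: str) -> int:
    """Return 0 if `word` ranks first in the ordered pair, 1 if it ranks second."""
    return sorted([word, partner]).index(word)


def majority_bit(bits: list[int]) -> int:
    """Majority vote over the bits extracted from one group (ties give 0)."""
    counts = Counter(bits)
    return 1 if counts[1] > counts[0] else 0


# Three records of one group; one was altered so that its bit reads as 1.
group_bits = [extract_bit("sum", "total"),
              extract_bit("sum", "total"),
              extract_bit("total", "sum")]
print(group_bits, majority_bit(group_bits))
```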

The embodiment of the invention also provides a text-oriented data database watermark embedding system, which comprises:

The embodiment of the invention uses a meaningful binary image as the watermark. After the picture is read in, it can be converted directly into a binary bit sequence consisting of 0s and 1s, which is then embedded as the watermark.
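The conversion between a binary image and a bit sequence can be sketched as below, assuming the image has already been loaded as rows of 0/1 pixel values. Row-major flattening and the helper names `image_to_bits` and `bits_to_image` are assumptions made for illustration.

```python
def image_to_bits(pixels: list[list[int]]) -> list[int]:
    """Flatten a binary image (rows of 0/1 pixels) into a bit sequence, row by row."""
    return [1 if p else 0 for row in pixels for p in row]


def bits_to_image(bits: list[int], width: int) -> list[list[int]]:
    """Rebuild the image rows from an extracted bit sequence."""
    if width <= 0 or len(bits) % width != 0:
        raise ValueError("bit count must be a multiple of the image width")
    return [bits[i:i + width] for i in range(0, len(bits), width)]


logo = [[1, 0, 1],
        [0, 1, 0]]
bits = image_to_bits(logo)
print(bits)
print(bits_to_image(bits, width=3) == logo)
```

Rebuilding the image from the extracted bits gives a visual check of the recovered watermark.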

The flow of watermark embedding and extraction is shown in FIG. 2. The specific steps of watermark embedding are as follows:

(1) Input the data in the database into the data format conversion module, which generates a virtual primary key and converts the data into a unified data format.

(2) Group the data produced by the data format conversion module in the grouping module, determining the data of each group and the watermark value to be embedded in each group.

(3) For each piece of data, input the data to be embedded into the synonym generation module sentence by sentence.

(4) Take each word of the sentence in turn as the target word and obtain its synonym set in the synonym generation module.

(5) Input the obtained synonym set into the similarity detection module for sorting, and take the top-ranked word as the candidate for the target word.

(6) Input the candidate word into the reversible test module. If the test passes, the position is marked as a watermark-embeddable position; if it does not pass, the next word is taken as the target word. Steps (4) to (6) are repeated in this way. After all the words have been processed, the next sentence is processed, repeating steps (3) to (6) until all embeddable positions of the piece of data are obtained, and then the replacement module is entered.

(7) The replacement module selects an embedding position by means of the virtual primary key and performs synonym replacement according to the watermark bit to be embedded, completing the embedding of the watermark.

(8) Repeat steps (3) to (7) until all data of all groups have been processed.
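The search for embeddable positions in steps (4) to (6) can be sketched for a single sentence as follows. The callback `best_candidate` is a hypothetical stand-in for the synonym generation and similarity detection modules, the reversible test uses the assumed criterion that the replaced sentence leads back to the original word, and the example sentence and lookup table are made up.

```python
from typing import Callable, Optional

BestCandidate = Callable[[list[str], int], Optional[str]]


def find_embeddable_positions(tokens: list[str],
                              best_candidate: BestCandidate) -> list[tuple[int, str]]:
    """Return (position, candidate) pairs that pass the reversible test."""
    positions = []
    for idx, word in enumerate(tokens):
        candidate = best_candidate(tokens, idx)      # steps (4) and (5)
        if candidate is None:
            continue                                 # no usable synonym: skip
        replaced = tokens[:idx] + [candidate] + tokens[idx + 1:]
        if best_candidate(replaced, idx) == word:    # step (6), assumed criterion
            positions.append((idx, candidate))
    return positions


# Toy lookup table standing in for the BERT/BERTScore pipeline (assumption).
toy_table = {"owe": "owed", "owed": "owe", "total": "sum", "sum": "total"}
toy_best = lambda tokens, idx: toy_table.get(tokens[idx])
print(find_embeddable_positions(["you", "owe", "the", "total"], toy_best))
```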

Watermark extraction is the reverse of watermark embedding, except that it does not involve the replacement module. After all embeddable positions are obtained in step (6), the majority voting module is entered and the watermark is extracted. The majority voting module checks the rank of the word at the selected position of the data: watermark bit 0 is extracted if the word is ranked first, and watermark bit 1 is extracted if it is ranked second. After watermark bits have been extracted from all data in the same group, the finally extracted watermark bit of the group is determined by majority voting. This continues until all watermark bits are extracted.

The flow of embedding the watermark in a specific sentence according to the embodiment of the invention is shown in FIG. 3. Each target word in the sentence is input into the synonym generation module to find its synonym set. In the figure, the target word "total" yields the synonyms "totals", "sum", "whole", etc. This synonym set is then input to the similarity detection module for ranking, and the highest-scoring word is "sum". In the reversible test module, "sum" replaces the target word "total" in the original sentence, and the new sentence is input to the synonym generation module and the similarity detection module again. It passes the test, so this embedding position is recorded; had it failed, the next word would be processed. When the search for embedding positions is complete, two embedding positions have been found in this sentence: the first is "owe", which can be replaced with "owed", and the second is "total", which can be replaced with "sum". Finally, the second position is selected through the virtual primary key. Because the watermark bit to be embedded is 0, "total" in the original sentence is replaced with "sum", which ranks first in the ordering, and the embedding is complete.

The embodiment of the invention eliminates the structural differences between data stored in relational and non-relational databases, and meets the need to embed watermarks in various databases. Meanwhile, synonyms are generated with the deep learning model BERT and the generated-text evaluation model BERTScore, which yields high-quality synonyms and reduces the distortion caused by embedding the watermark through synonym replacement.

Table 1. Comparison of synonym generation quality based on SWORDS

The quality of the synonyms generated by the embodiment of the invention is evaluated with the Stanford Word Substitution Benchmark (SWORDS) ^[3]. SWORDS is a benchmark for replacing a target word with an appropriate synonym in context. Its synonyms were collected by having people choose among several candidate synonyms for a given target word and context. SWORDS contains target words, their synonyms, and a score for each synonym, which can be used to judge the quality of the synonyms a model generates. The two F-scores in the header of Table 1 are each the harmonic mean of precision and recall. Precision and recall are computed from three quantities: K, the number of synonyms generated by the algorithm; the number of those K generated words that are accepted by SWORDS; and the number of acceptable words in SWORDS, where "accepted" means that the word is in the SWORDS synonym set. The two F-scores differ in the acceptance criterion: for one, an acceptable word must have a SWORDS score greater than 50%; for the other, the score need only be greater than 0%. In the table header, "lenient" means that words produced by the algorithm but not in SWORDS are not filtered out, and "strict" means that they are filtered. The first three rows of Table 1 are taken from SWORDS and are the results of testing the manually collected synonym sets in SWORDS. The synonyms generated by the embodiment of the invention are of higher quality than those generated with WordNet and Word2vec.
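The precision, recall and harmonic mean described above can be sketched as follows. This is a simplified reading, not the benchmark's official scoring code: precision is taken as accepted generated words over K, recall as accepted generated words over the number of acceptable words, and the function name and example words are hypothetical.

```python
def precision_recall_f(generated: list[str], accepted: set[str]) -> tuple[float, float, float]:
    """Precision, recall and their harmonic mean for one target word."""
    k = len(generated)                               # K generated synonyms
    hits = sum(1 for w in generated if w in accepted)
    precision = hits / k if k else 0.0
    recall = hits / len(accepted) if accepted else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up example: 4 generated words, 3 acceptable words, 2 in common.
p, r, f = precision_recall_f(["sum", "whole", "totals", "amount"],
                             {"sum", "whole", "entire"})
print(p, r, f)
```

Changing the acceptance threshold (score greater than 50% versus greater than 0%) only changes which words go into `accepted`.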

The method of the embodiment of the invention takes the maximum hash value of the non-embedded attribute values as the virtual primary key, groups the data according to the virtual primary key, embeds the same watermark in each group, and performs a reversible test. This ensures that embedding and extraction are mutually reversible, which further improves the robustness of the watermark.
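A minimal sketch of the virtual primary key and the grouping is given below. The text states only that the key is the maximum hash of the non-embedded attribute values and that grouping depends on this key and the watermark length; the use of HMAC-SHA256 with the owner's secret key, the modulo grouping rule, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac


def virtual_primary_key(non_embedded_values: list[str], secret_key: bytes) -> int:
    """Largest keyed hash over the attribute values that carry no watermark."""
    if not non_embedded_values:
        raise ValueError("at least one non-embedded attribute value is required")
    digests = [
        int.from_bytes(
            hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest(),
            "big",
        )
        for value in non_embedded_values
    ]
    return max(digests)


def group_index(vpk: int, watermark_length: int) -> int:
    """Assumed grouping rule: the group, and thus the watermark bit, is vpk mod l."""
    return vpk % watermark_length


vpk = virtual_primary_key(["user42", "5", "2012-10-01"], b"owner-secret")
print(group_index(vpk, 64))
```

Taking the maximum makes the key independent of attribute order, so reordering the non-embedded attributes does not move a record to another group.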

The Amazon Fine Food Reviews dataset ^[4] contains more than 500,000 food reviews, including user information, food ratings, review text, and the like.

[3]Lee M, Donahue C, Jia R, et al. Swords: A benchmark for lexical substitution with improved data coverage and quality[J]. arXiv preprint arXiv:2106.04102, 2021.

[4]McAuley J J, Leskovec J. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews[C]//Proceedings of the 22nd international conference on World Wide Web. 2013: 897-908.

Watermark embedding is carried out on 30,000 pieces of data from the Amazon Fine Food Reviews dataset, and the robustness of the method provided by the embodiment of the invention is compared with the methods proposed by Li W et al. and Gort M L P et al. The algorithms of Li W et al. and Gort M L P et al. also cover embedding in numeric data; only the effect of embedding a watermark in text-type data is compared here.

The comparative experiments are as follows:

(1) Deletion attack: whole watermarked records are deleted, with the deletion ratio ranging from 10% to 90% in steps of 10%. The experimental results are shown in FIG. 4. They show that the embodiment of the invention outperforms the existing text database watermarking algorithms, and that the watermark extraction accuracy is essentially 100% when the deletion ratio is below 80%.

(2) Insertion attack: the impact on watermark extraction of inserting new data is tested. The inserted data are new records from the same dataset, with the insertion ratio ranging from 10% to 150%. The experimental results are shown in FIG. 5. The watermark extraction accuracy of the embodiment of the invention remains 100% throughout, an advantage over the compared algorithms.

(3) Synonym substitution attack: Word2vec is used to generate synonyms that replace words in the watermarked data. The substitution strength is one word per sentence of the text, and the substitution ratio ranges from 10% to 90%. The experimental results are shown in FIG. 6. They show that the embodiment of the invention achieves essentially 100% extraction accuracy and can resist such attacks.
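Why grouping with majority voting tolerates record deletion can be illustrated with a toy simulation. It does not reproduce the experiments or figures above: it assumes every surviving record yields its group's correct bit, so a bit is lost only when an entire group is deleted. All names and parameters are hypothetical.

```python
import random
from collections import Counter


def extraction_accuracy(watermark: list[int], records_per_group: int,
                        delete_ratio: float, seed: int = 0) -> float:
    """Fraction of watermark bits recovered after randomly deleting records."""
    rng = random.Random(seed)
    # Each record carries (group index, embedded bit); group i holds bit i.
    records = [(g, bit) for g, bit in enumerate(watermark)
               for _ in range(records_per_group)]
    kept = rng.sample(records, k=round(len(records) * (1 - delete_ratio)))
    votes: dict[int, Counter] = {}
    for g, bit in kept:
        votes.setdefault(g, Counter())[bit] += 1
    # A bit counts as recovered when its group survives and the vote matches.
    recovered = sum(1 for g, bit in enumerate(watermark)
                    if g in votes and votes[g].most_common(1)[0][0] == bit)
    return recovered / len(watermark)


mark = [1, 0, 1, 1, 0, 0, 1, 0]
for ratio in (0.0, 0.5, 0.9):
    print(ratio, extraction_accuracy(mark, records_per_group=50, delete_ratio=ratio))
```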

Another embodiment of the present invention also proposes an electronic device, including:

a memory storing at least one instruction; and a processor that executes the instructions stored in the memory to implement the text-oriented data database watermark embedding method.

Another embodiment of the present invention also proposes a computer readable storage medium storing a computer program which, when executed by a processor, implements the text-oriented data database watermark embedding method.

For example, the instructions stored in the memory may be partitioned into one or more modules/units, which are stored in a computer-readable storage medium and executed by the processor to perform the text-oriented data database watermark embedding method of the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specified functions, the instruction segments describing the execution of the computer program in a server.

The electronic equipment can be a smart phone, a notebook computer, a palm computer, a cloud server and other computing equipment. The electronic device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that the electronic device may also include more or fewer components, or may combine certain components, or different components, e.g., the electronic device may also include input and output devices, network access devices, buses, etc.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may be an internal storage unit of the server, such as a hard disk or memory of the server. The memory may also be an external storage device of the server, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card provided on the server. Further, the memory may include both an internal storage unit and an external storage device of the server. The memory is used to store the computer-readable instructions and other programs and data required by the server. The memory may also be used to temporarily store data that has been output or is to be output.

It should be noted that, because the content of information interaction and execution process between the above module units is based on the same concept as the method embodiment, specific functions and technical effects thereof may be referred to in the method embodiment section, and details thereof are not repeated herein.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiments by instructing related hardware through a computer program. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device/terminal apparatus, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. A method for embedding a watermark into a database for text-oriented data, comprising:

2. The text-based data oriented database watermark embedding method according to claim 1, wherein in said step of converting any non-relational data and relational data in the database into unified intermediate format data:

3. The text-based data oriented database watermark embedding method according to claim 2, wherein in said step of converting any non-relational data and relational data in the database into unified intermediate format data:

4. A method of embedding a watermark in a text-oriented data database according to claim 3, wherein in said step of converting any non-relational data and relational data in the database into unified intermediate format data, a virtual primary key is designed for the unified intermediate format data, the manner of designing the virtual primary key comprising the steps of:

Inputting a key SK of a database proprietor;

5. The method of claim 4, wherein in the step of grouping the intermediate format data to determine each group of data and each group of watermark values to be embedded, the grouping is determined by the virtual primary key of the intermediate format data, and the specific calculation formula is [formula not reproduced], where i is the group number, the i-th group corresponds to the watermark value of the i-th bit of the embedded watermark, l is the length of the watermark bit sequence, and each group is embedded only with the watermark bit at the same corresponding position.

6. The method for embedding a watermark in a text-oriented database according to claim 4, wherein in said step of finding positions of corresponding replacement words and target words in all data of all groups to complete watermark embedding:

Selecting the location to be embedded with a watermark according to the formula [formula not reproduced];

7. The method for embedding a watermark in a text-oriented database according to claim 1, wherein in the step of replacing synonyms in a set of synonyms with sentences to calculate a sentence score, a BERTScore model is used to calculate a sentence score for replacing the synonyms with sentences;

8. The method for embedding a watermark in a text-oriented data database according to claim 1, wherein after said step of finding the positions of the corresponding replacement words and target words in all data of all groups to complete embedding the watermark, the rank of the target word at this position of the data is checked, watermark bit 0 being extracted if it is ranked first and watermark bit 1 being extracted if it is ranked second; after watermark bits have been extracted from all data in the same group, the finally extracted watermark bit of the group is determined by majority voting; and this continues until all watermark bits are extracted.

9. A text-oriented data database watermark embedding system, comprising:

10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of text-oriented data database watermark embedding as claimed in any one of claims 1 to 8.