CN111581952B

CN111581952B - Large-scale replaceable word library construction method for natural language information hiding

Info

Publication number: CN111581952B
Application number: CN202010428651.6A
Authority: CN
Inventors: 向凌云; 冯章成; 傅明; 郭国庆; 杨双辉; 刘宇航
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2023-10-03
Anticipated expiration: 2040-05-20
Also published as: CN111581952A

Abstract

The invention discloses a large-scale replaceable word stock construction method for natural language information hiding, which comprises the following steps: step 1, for each word in the dictionary, calculating it to represent it as a low-dimensional dense word vector; step 2, calculating the similarity between words according to the word vector distance between the words, and obtaining a similar word list of each word; step 3, carrying out the association relation and the expression of the similarity degree among all similar words; and 4, constructing candidate alternative word groups according to the association relation and the similarity degree between the similar words. According to the invention, a large-scale candidate replaceable word library is successfully constructed, the improvement of the embedding capacity of the natural language information hiding method is realized, the quality of the ciphertext-containing book is improved through filtering of the candidate replaceable words, and the improvement of the security of secret information is realized.

Description

Large-scale replaceable word library construction method for natural language information hiding

Technical Field

The invention belongs to the field of information security, and particularly relates to a method for constructing a word library capable of replacing words for hiding natural language information.

Background

With the development of global informatization, language characters become indispensable interaction tools and information carriers in life and work, and more people conduct information transmission activities such as office work, study and conversation through networks. According to the latest release of CNNIC (computer network interface) on the 44 th China Internet development status statistical report, the Chinese netizen scale reaches 8.54 hundred million by 2019 in 6 months, the network news user scale reaches 6.86 hundred million, and the instant messaging user scale reaches 8.25 hundred million. These data indicate that there is plentiful text data being transmitted, published and shared every moment in our country. Therefore, the text data is very suitable for being used as a carrier for hiding information to realize hidden communication, and the safe transmission and storage of secret information are protected. On the other hand, due to the openness and sharing property of the network, the text data is easy to be attacked by modification, copying, piracy and the like, so that the watermark information is very necessary to be embedded in the text data by using an information hiding technology, and copyright protection, leakage tracking and the like of important text data are realized.

Natural language information hiding with text content as carrier is a technology to hide secret information in imperceptible mode in public text carrier, and can reach the aim of hidden communication, copyright protection, etc. The core of the natural language information hiding technology is to keep the readability and the semantic invariance of the original text content, so that secret information is often embedded by utilizing the ways of replacing semantic equivalent words, converting semantic equivalent sentence patterns and the like, and the replaced or converted text can better keep the local or global semantics of the original text. Because no support of complex natural language processing technology is needed, existing natural language information hiding related achievements are mainly focused on a method for hiding information by using synonym substitution.

And after the synonym is encoded into different values by the natural language information hiding method based on synonym replacement, selecting the synonym with a specified encoding value to replace the original word according to the secret information to be embedded so as to realize the embedding of the secret information. Since synonyms have similar meanings, in theory, the substitution of such synonyms does not affect the meaning of the original text, and the embedded secret information has better concealment. Many related information hiding methods have been proposed by researchers who have conducted intensive studies in terms of improving embedding capacity, embedding efficiency, anti-steganalysis detection capability, and the like. However, because the number of synonyms is limited, and one word can only appear in one synonym phrase, otherwise, the embedding failure of secret information is easy to cause, so that the existing natural language information hiding method based on synonym replacement generally has the problem of low embedding capacity, and the practicability of the method is greatly reduced.

The synonym replacement-based method is limited to mutual replacement among synonyms, but in natural language text, more than synonyms can be replaced with each other to hide information. In many cases, the replacement of words with similar context, such as co-located words, anti-ambiguous words, etc., with each other does not affect the readability, the value of use, and the quality of the text of the natural text. Words representing different colors often have similar contexts when used, and in an example sentence "She wears a green coat", when green is replaced with a word red of another color or the like, the influence on the sentence is small. No word sense item of green is included in the dictionary, and a color word such as red is not included as a near sense (synonym). Therefore, in the existing natural language information hiding method based on word replacement, such word is not utilized for replacement. However, after the green and red are represented by the distributed word vectors, the similarity of the green and red is up to 0.9235 by calculating the distance between the vectors, in which case the green and red with high similarity can be regarded as the replaceable word for information hiding, thereby expanding the range of the replaceable word and improving the embedding capacity of the information hiding method.

Based on the analysis, the invention provides a large-scale replaceable word stock construction method, which improves the embedding capacity of an information hiding method based on synonym replacement from two aspects: the number of word groups of the replaceable words and the number of words which are mutually replaceable words. When the constructed word library of the replaceable word is applied to a natural language information hiding method based on synonym replacement, the embedding capacity is greatly improved, and meanwhile, the generated ciphertext-containing book has higher text quality and anti-steganography analysis detection capability.

Disclosure of Invention

The invention is realized by adopting the following technical scheme:

a large-scale replaceable word stock construction method for natural language information hiding comprises the following steps: step 1, for each word in the dictionary, calculating it to represent it as a low-dimensional dense word vector; step 2, calculating the similarity between words according to the word vector distance between the words, and obtaining a similar word list of each word; step 3, carrying out the association relation and the expression of the similarity degree among all similar words; and 4, constructing candidate alternative word groups according to the association relation and the similarity degree between the similar words.

The word stock construction method comprises the following steps: the step 1 comprises the following steps: 1.1 preparing a dictionary d= { w ₁ ，w ₂ ，L，w _N W is the word in the dictionary, N is the total number of words in the dictionary; 1.2 training a preset corpus by using a continuous neural network language model Skip-gram to obtain word vector representation of each word for any word w in a dictionary D _i The word vector is denoted as E (w _i )。

The word stock construction method comprises the following steps: the step 2 comprises the following steps: 2.1 measuring similarity between two words by cosine formula and word vector, calculating word w _i Sum word w _j Similarity between the two, the formula is as follows:

2.2 setting a similarity threshold δ to determine whether two words are similar words, if S (w _i ，w _j ) > delta, then determine w _i And w is equal to _j Are similar words.

The word stock construction method, wherein step 2 further comprises:

2.3 obtaining w from the threshold delta _i Is a list of similar words of (a)Wherein sw is _ij Is w _i The j-th similar word is equal to w _i The similarity value of (2) satisfies: s (w) _i ，sw _ij ) > delta, and is called sw _ij And w is equal to _i Are similar; n is n _i Is w _i To obtain the number of similar words of all wordsList of similar words, the set of these lists is denoted as SList, expressed as follows:

the word stock construction method comprises the following steps:

3.1, converting a similar word list SList of all words in the dictionary D into an undirected graph G (D, E), wherein the graph takes the words in the dictionary D as vertexes, if the two words are similar, edges are connected between the corresponding vertexes, the set of the edges is E, and the weight of the edges is the similarity between the words corresponding to the vertexes where the edges are positioned;

and 3.2, dividing G (D, E) into a plurality of maximum connected subgraphs according to the association relation among the words, wherein certain similarity relations exist between the words corresponding to the vertexes in each maximum connected subgraph, and the maximum connected subgraphs are dissimilar to any word in other maximum connected subgraphs.

The word stock construction method comprises the following step:

4.1 for each maximum connected subgraph in G (D, E), calculating all maximum cliques in the subgraph and extracting all maximum cliques therefrom;

4.2 maximum cliques that are completely independent of vertices, then directly determined as candidate alternative groups;

4.3 selecting the maximum clique with the largest mean value of the edge weights as the candidate alternative phrase for the plurality of maximum cliques with shared vertexes.

4.4 deleting words in the determined candidate alternative phrase from the remaining maximum cliques except the maximum cliques determined as candidate alternative phrases, namely deleting vertices shared with the selected maximum cliques, repeating the steps 4.1-4.3 until no maximum cliques remain or the number of vertices of the maximum cliques after deleting the shared vertices is less than 2.

The word stock construction method comprises the following step:

4.1 enumerating all maximum groups in one maximum connected subgraph in G (D, E) according to a Bron-Kerbosch algorithm, and setting a set formed by all the maximum groups as CS; candidate set of replaceable phrasesIs combined into

4.2 traversing each maximum group in the CS, obtaining all the maximum groups, calculating the average value of the edge weights of each maximum group, and sequencing the maximum groups from large to small according to the size of the average edge weight to obtain the sequenced maximum groups as follows: MC (methyl cellulose) ₀ ，K，MC _k I.e. Avg (MC) ₀ )≥，K，≥Avg(MC _k ) Wherein Avg (MC) _k ) Representing the maximum cluster MC _k Average value of weight of middle edge;

4.3 let i=1; candidate maximum clique set mcs= { MC ₀ }；

4.4 setting MC _i The vertexes of the two layers are vw in sequence ₀ ，...，vw _l ，j＝0；

4.4.1 for vw _j Traversing the peaks of all candidate maximum groups in the MCS in turn, if vw _j Appears in the MCS, then MC _i If there is a shared vertex with a certain maximum clique in the MCS, the MC is ignored _i Jumping to the step 4.5; otherwise j=j+1;

4.4.2 if j is less than or equal to l, repeating the step 4.4.1; otherwise j=l+1, explaining MC _i If no shared vertex exists with any candidate maximum clique of MCS, MC is generated _i Added to the MCS, i.e. MCS=MCS+ { MC _i }；

4.5 i=i+1, if i is less than or equal to k, jumping to step 4.4; otherwise, jumping to the step 4.6;

4.6 Each maximum group in the MCS is the calculated candidate alternative phrase, namely CG=CG+MCS;

4.7 deleting the selected maximum clique from all maximum cliques set CS, i.e. cs=cs-MCS;

4.8 updating the vertex of each maximum clique in the CS, namely deleting the vertex appearing in the MCS from each maximum clique in the CS, so that any maximum clique in the CS after vertex updating does not have a shared vertex with any maximum clique in the MCS;

4.9 deleting the maximum group with the number of top points less than or equal to 1, and updating CS;

4.10 repeating steps 4.2-4.9 until

And 4.11 outputting CG to obtain all candidate alternative phrases.

The word stock construction method further comprises the following steps: step 5, identifying candidate replaceable words appearing in the original carrier text as positions to be embedded according to the candidate replaceable word lexicon, positioning candidate replaceable word groups where the words are located, and calculating the replaceable degree of each candidate replaceable word in the group in the current context for the context of each position to be embedded; and filtering the words in the candidate alternative word groups according to the obtained alternative degree through the set alternative degree threshold value to obtain the final alternative words.

The word stock construction method comprises the following steps: 5.1 let the ith word m appearing in text _i As candidate replaceable words, the context window size is 2k, and the words in the context window are sequentially noted as: cont (m) _i )＝{m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+k M is then _i And context word m _j (j∈[i-k，i+k]But j+.0) is expressed as:

wherein p (m) _i ，m _j ) Is the word m _i And m _j Co-occurrence frequency in context window, p (m _i ) Is m _i Frequency of occurrence, p (m _j ) Is m _j Frequency of occurrence in the corpus; p (m) _i ，m _j )＝p(m _i |m _j )×p(m _j )＝p(m _j |m _i )×p(m _i ) If m is _i And m _j The occurrences in the text are independent of each other, then p (m _i ，m _j )＝p(m _i )×p(m _j ) PMI (m) _i ，m _j ) =0; if m is _i And m _j The occurrences in the text are interrelatedLinked, then p (m _i ，m _j )＞p(m _i )×p(m _j ) And the stronger the correlation PMI (m _i ，m _j ) The greater the value of (2);

5.2 definition m _i Context Cont (m _i ) Degree of replaceability SPMI (m _i ，Cont(m _i ) Is) m is _i And each context word m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+k The sum of PMI values of (2) is represented by the following formula:

let m be _i The candidate alternative phrase is CG (m _i )＝{cm ₀ ，cm ₁ ，....，cm _r-1 Current context is Cont (m _i )＝{m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+k Sequentially calculating candidate alternative phrases CG (m) _i ) Each candidate alternative word cm _q (l=0, 1,., r-1) degree of interchangeability SPMI (cm _q ，Cont(m _i ))；

If SPMI (cm) _q ，Cont(m _i ) E) is less than or equal to epsilon, filtering the candidate replaceable word cm _q The method comprises the steps of carrying out a first treatment on the surface of the Otherwise remain as alternative words.

Drawings

FIG. 1: a frame diagram of the present invention;

fig. 2: schematic diagram of the structure of the graph of the similar words;

fig. 3: extremely large connected subgraph examples;

fig. 4: FIG. 3 illustrates all of the biggests contained in the graph;

fig. 5: FIG. 4 is a diagram of a method for deleting the remaining biggest clusters of shared vertices after extracting candidate alternative phrases;

fig. 6: the vector of words represents a flow chart;

fig. 7: a flow chart of similar word list acquisition;

fig. 8: a flow chart represented based on inter-word similarity relationships of the graph structure;

fig. 9: a flow chart of candidate alternative phrase acquisition based on maximum group enumeration;

fig. 10: a flow chart of exchangeable metric based on PMI and context;

fig. 11: a flow chart of candidate alternative word filtering based on the degree of alternatives.

Detailed Description

Specific embodiments of the present invention will be described in detail below with reference to fig. 1-11.

A flow chart of the present invention is shown in fig. 1. The invention relates to a large-scale replaceable word stock construction method for natural language information hiding, which mainly comprises the following processing flow: global candidate alternative word stock construction and local alternative word stock construction.

The first part can be divided into four steps: the first step, for each word in the dictionary, represents it as a low-dimensional dense word vector by a distributed word vector representation model; secondly, calculating the similarity between words according to the word vector distance between the words, and obtaining a similar word list of each word; thirdly, converting the similar words and the corresponding similarity into a graph structure to represent the association relation and the similarity between all the similar words; and fourthly, enumerating the maximum groups in each maximum connected subgraph by using a maximum group algorithm, selecting a specific maximum group according to the number of top points of the maximum groups, the sharing condition of the top points and the average value of the edge weights, enabling the maximum groups to be disjoint and the average value of the number of the top points and the average value of the edge weights to be maximized, wherein words corresponding to the top points of the selected maximum groups are candidate replaceable words, and the candidate replaceable words in the same group have high similarity to each other to form a candidate replaceable word group.

The second part can be divided into two steps: firstly, recognizing candidate replaceable words appearing in an original carrier text as positions to be embedded according to a candidate replaceable word lexicon, positioning candidate replaceable word groups where the words are located, and calculating the replaceable degree of each candidate replaceable word in the group in the current context by using a PMI method for the context where each position to be embedded is located; and secondly, filtering words in the candidate alternative word groups according to the obtained alternative degree through a set alternative degree threshold value to obtain final alternative words.

The specific description is as follows:

1. global candidate alternative word library construction

1.1 vector representation of words

The invention prepares a dictionary (vocabulary) D= { w in advance ₁ ，w ₂ ，L，w _N W is the word in the dictionary and N is the total number of words in the dictionary. The dictionary may be obtained by direct importation into an existing dictionary or may be formed by extracting word lists from a large corpus. For words in the dictionary, training a pre-prepared corpus by using a continuous neural network language model Skip-gram to obtain word vector representation of each word. For any word w in dictionary D _i The word vector is denoted as E (w _i )。

1.2 similar words List acquisition

In a training corpus, if two different words have very similar contexts, two similar word vectors will be obtained by Skip-gram model. In this case, not only synonyms such as "intelligent" and "smart" etc. having very similar contexts, but also related words such as "green" and "red" etc. having similar contexts will be expressed as similar word vectors. Therefore, the similar words with high similar contexts can be found through the distance between word vectors, and the similar words are used as alternative words in the natural language information hiding method, and information is embedded through replacement of the similar words, so that the replaceable words are expanded from synonyms to wider similar words.

The invention utilizes cosine formula and word vector to measure the similarity between two words, and carries out enhancement normalization (augmented normalization) operation to make the similarity value range of words be [0.5,1 ]]Specifically calculate word w _i Sum word w _j The formula of the similarity is as follows:

the greater the similarity value between two words, the higher the similarity that the two words are, the higher the likelihood that they can be replaced with each other in the information hiding algorithm. Therefore, a similarity threshold delta is set to determine whether the two words are similar words, if S (w _i ，w _j ) > delta, then determine w _i And w is equal to _j Are similar words. In the present invention, δ=0.6.

For each word w in dictionary D _i It is calculated according to equation (1) and calculated for all words in the dictionary (including w _i Itself) and obtaining w from the threshold delta _i Is a list of similar words of (a)Wherein sw is _ij Is w _i The j-th similar word is equal to w _i The similarity value of (2) satisfies: s (w) _i ，sw _ij ) > delta, and is called sw _ij And w is equal to _i Are similar; n is n _i Is w _i Is a number of similar words of (a). Finally, the invention will obtain a list of similar words for all words, and the set of these lists is denoted SList, which can be expressed as follows:

1.3 representation of inter-word similarity based on graph Structure

w _i Any similar term sw to it _ij And sw _ik All have a similarity greater than delta, but sw _ij And sw _ik The similarity between the two is not necessarily larger than delta, and the dissimilar sw _ij And sw _ik Are not suitable for mutual replacement to embed information. In order to load as much secret information as possible for each similar word, the more similar words that can be replaced with each other, the better the words that will be encoded with the same words that can be replaced with each other will be, the longer the code words will be, and thus the more information each similar word can embed. The similar words which can be replaced with each other form a replaceable phrase, and each similar word in the phrase is respectively encodedForming a unique code word, and when the embedded secret information is consistent with the code value of the word to be replaced in the phrase, not changing the word at all; when the secret information is inconsistent, selecting similar words with the same coding value as the embedded secret information value to replace the original words, and realizing the embedding of the secret information. Since each word and its mutually exchangeable similar words have unique code words, if not unique, a problem of extraction error of secret information will occur, and thus each similar word can only appear in one exchangeable phrase. While in the similar word list set SList, a word may appear in multiple similar word lists. In order to meet the requirements of the information hiding task, according to the similar word list set SList, individual sets of words similar to each other need to be obtained, and at the same time, the sets should be as large as possible, and no intersection exists between the sets. In the invention, a set of word compositions similar to each other is defined as candidate alternative word groups, each word in the word groups is a candidate alternative word, and the words are similar to each other.

In order to intuitively express the similarity relation among all words in the Chinese dictionary, so as to extract candidate alternative word groups more effectively, the invention adopts a graph structure to express the similarity relation among the words, thereby converting a similar word list set Chinese of all words in the Chinese dictionary D into an undirected graph G (D, E), wherein the graph takes the words in the Chinese dictionary D as vertexes, if the two words are similar, edges are connected between the corresponding vertexes, the set of the edges is E, and the weight of the edges is the similarity between the words corresponding to the vertexes where the edges are positioned.

In the undirected graph G (D, E), similar words are connected by edges, and dissimilar words are connected by edges. Because the dictionary contains words with various word senses, related similar words are in the same connected subgraph, word sets with far-reaching meanings and no relation are independent from each other, and word sets are connected without edges. Thus, undirected graph G (D, E) is a non-connected graph. By utilizing the connected ideas of the graph and according to the association relation among words, G (D, E) is divided into a plurality of maximum connected subgraphs, and each maximum connected subgraph is one connected component of the non-connected graph G (D, E). The corresponding words of the vertexes in each maximum connected subgraph have direct or indirect similarity relation, and are dissimilar to any word in other maximum connected subgraphs. As shown in fig. 2, each word in the maximum connected sub-graph a is dissimilar to other words, i.e., no edge connected to the vertex outside the sub-graph a exists at the vertex in the sub-graph a, and thus, the sub-graph a is not connected to other maximum connected sub-graphs.

The undirected graph G (D, E) has a large number of maximum connected subgraphs, the number of vertexes in each maximum connected subgraph is uncertain, and the variation range is large. The invention acquires a list of 22585 English words in the experimental process, relates to 151762 words (the same word repeatedly appears in a plurality of similar word lists), and the formed undirected graph comprises 6051 maximum connected subgraphs, and the distribution situation of the top points contained in the subgraphs is shown in table 1. The number of the vertexes of the maximum connected subgraphs of the rule mode maximum exceeds 10000, the number of the maximum connected subgraphs with the number of the vertexes less than 100 is 6000, and the number of the subgraphs with the number of the vertexes exceeding 1000 is 6.

TABLE 1 distribution of the number of vertices of maximum connected subgraphs

1.4 candidate alternative phrase acquisition based on maximum clique enumeration

Obviously, each maximum connected subgraph in G (D, E) is not a complete subgraph. If the words O and P are similar words and P and Q are similar words, then O and Q may or may not be similar, i.e., there may or may not be edges between vertices O and P, P and Q, but there may or may not be edges between O and Q. Whereas substitution between dissimilar words may cause a change in the semantics of the original text, which is not suitable for hiding the information. Therefore, in order to obtain the candidate alternative phrase, a complete sub-graph needs to be found in the extremely large connected sub-graph, as shown in fig. 2. Any two vertexes in the complete subgraph are connected by edges, namely, words corresponding to the vertexes in the complete subgraph are similar, and words corresponding to the vertexes in the complete subgraph can form a candidate replaceable phrase.

From table 1, we can see that the number of vertices of one extremely large connected subgraph in G (D, E) is generally relatively large, so there will be multiple complete subgraphs. In order to increase the hiding capacity when applied to word replacement based information hiding methods, we need to find the best complete subgraph from this. The hiding capacity of the word replacement-based information hiding method is affected by two main factors: 1) The number of available alternative word groups, namely the number of positions in which the carrier text can embed information; 2) The average number of the replaceable words in the replaceable word group, namely the average coding length of one replaceable word. Taking these two factors into consideration, the invention enumerates the maximum groups in the maximum connected subgraph, and sequentially selects the best maximum groups as candidate alternative phrases according to the average weight of the edges (i.e. the average similarity among words) and the sharing condition of the vertexes among the maximum groups, so as to obtain as many candidate alternative phrases as possible, and the phrases contain as many words as possible and as similar as possible.

The clique is a complete sub-graph of an undirected graph. A biggest cluster is a cluster in the undirected graph that is not contained by any other cluster, i.e., it is not a proper subset of any other cluster. One maximum connected subgraph of G (D, E) typically contains multiple maximum clusters of different sizes. The method enumerates all maximum groups in a connected graph, and obtains all maximum groups in each maximum connected graph in G (D, E) through a Bron-Kerbosch algorithm. Any of the biggest groups may be selected as candidate alternative phrases, but not every biggest group is the best choice. The number of the vertexes of the maximum group is larger than the corresponding candidate replaceable phrase. The maximum connected subgraph as shown in fig. 3 contains 4 maximum clusters as shown in fig. 4. The number of top points of the maximum clusters shown in fig. 4 (a) and (b) is 4, fig. 4 (c) is 3, and fig. 4 (d) is 2. The candidate alternative phrases corresponding to fig. 4 (a) and (b) contain the most words that are identical to each other.

In order to make the candidate alternative phrase contain as many words as possible, the group with the largest number of vertices, i.e. the largest group, should be preferentially selected as the candidate alternative phrase. It can be seen from fig. 4 that there may be multiple maximum cliques in one maximum connected subgraph and share part of the vertices. As in fig. 4 (a) and (b) are both maximum cliques and share vertex 4. The higher the similarity between words in the candidate alternative word groups is, the better the concealment is when the word groups are used for information hiding, so that for the maximum groups with shared vertexes, one maximum group with the maximum average value of the edge weights is selected as the candidate alternative word group; when there are a plurality of maximum clusters with the largest mean value, one of the clusters is selected as a candidate alternative phrase. As the average value of the edge weights in fig. 4 (a) is 0.834 and 0.7038 in fig. 4 (b), the biggest group shown in fig. 4 (a) is selected as a candidate alternative phrase, which includes vertices {1,2,3,4}. And directly determining the maximum groups with completely independent vertexes as candidate alternative phrases.

In order to obtain as many candidate alternative phrases as possible, the vertexes of the remaining maximum groups are updated, and then the proper maximum group in the corresponding maximum groups is selected as the candidate alternative phrase. First, words in the determined candidate alternative phrase are deleted from the remaining maximum cliques, i.e., vertices shared with the selected maximum clique are deleted. The remaining maximum clusters are shown in fig. 5, in which the shared vertices 4 with the maximum clusters in fig. 4 (b), (c), and (d) are deleted. Then, searching a new maximum cluster again from the maximum clusters updated by the vertexes, and directly determining the maximum clusters with the vertexes completely independent as candidate alternative groups. For the largest cliques with shared vertices, one of the largest cliques with the largest average edge weight is selected as the candidate alternative phrase.

And for the maximum groups which are not selected as candidate alternative phrase, continuing to update the vertexes, and selecting the proper maximum groups until no maximum groups remain or the number of vertexes of the maximum groups after deleting the shared vertexes is smaller than 2. The number of candidate alternative phrases obtained from the final maximum connected subgraph according to the example of fig. 3 is 3, respectively: {1,2,3,4},{5,6,7},{8,0}.

The specific flow of the candidate alternative phrase acquisition algorithm based on the maximum group enumeration is as follows:

input: maximum connected subgraph Q in G (D, E)

And (3) outputting: candidate set of alternative phrases

Step 1: all the biggest cliques in the biggest connected subgraph are enumerated according to the Bron-Kerbosch algorithm. Set the set of all the biggest groups as CS; the candidate alternative phrase set is

Step 2: traversing each maximum group in CS, obtaining all the maximum groups, calculating the average value of the edge weights of each maximum group, and sorting the maximum groups from large to small according to the size of the average edge weight to obtain the sorted maximum groups as follows: MC (methyl cellulose) ₀ ，K，MC _k I.e. Avg (MC) ₀ )≥，K，≥Avg(MC _k ) Wherein Avg (MC) _k ) Representing the maximum cluster MC _k Average value of weight of middle edge.

Step 3: let i=1; candidate maximum clique set mcs= { MC ₀ }。

Step 4: set MC _i The vertexes of the two layers are vw in sequence ₀ ，...，vw _l ，j＝0。

4.1: for vw _j And traversing the peaks of all candidate maximum cliques in the MCS in turn. If vw is _j Appears in the MCS, then MC _i If there is a shared vertex with a certain maximum clique in the MCS, the MC is ignored _i Step 5, jumping to the step; otherwise j=j+1.

4.2: if j is less than or equal to l, repeating the step 4.1; otherwise j=l+1, explaining MC _i If no shared vertex exists with any candidate maximum clique of MCS, MC is generated _i Added to the MCS, i.e. MCS=MCS+ { MC _i }。

Step 5: i=i+1. If i is less than or equal to k, jumping to the step 4; otherwise, the process jumps to step 6.

Step 6: each maximum group in the MCS is the candidate alternative phrase that is sought, namely cg=cg+mcs.

Step 7: the selected maximum clique is deleted from all maximum cliques set CS, i.e. cs=cs-MCS.

Step 8: and updating the vertex of each maximum clique in the CS, namely deleting the vertex appearing in the MCS from each maximum clique in the CS, so that any maximum clique in the CS after vertex updating does not share the vertex with any maximum clique in the MCS.

Step 9: and deleting the maximum cliques with the number of top points less than or equal to 1, and updating CS.

Step 10: repeating the steps 2-9 until

Step 11: and outputting CG, namely obtaining all candidate alternative phrases.

And (3) solving a plurality of corresponding candidate alternative phrases by adopting the candidate alternative phrase acquisition algorithm based on the maximum group enumeration for each maximum connected subgraph in G (D, E) to obtain all the candidate alternative phrases. In the experiment, 11000 candidate alternative phrases are obtained, 43405 words are involved, and the length of each phrase is 3.9459 words on average.

2. Construction of local replaceable word stock

Since only the similarity between the global word vectors of each word is considered when the candidate alternative phrases are acquired, the local context information of the word in the specific text is not considered. When word replacement is performed in a specific context with candidate replaceable words, not all candidate replaceable words that are original words fit in the context. Therefore, in order to further improve the concealment and safety of the natural language information concealment method, the invention further considers local context information to filter candidate alternative words so as to construct more suitable alternative word groups.

2.1 PMI and context based alternative degree metrics

The invention measures the degree of replaceability of candidate replaceability words with the original words by calculating the replaceability of the candidate replaceability words in the context. The degree of replaceability of a candidate replacement word in context is affected by the relevance of the word to the context word. The invention uses PMI (Pointwise Mutual Information point mutual information) to calculate the relevance of candidate alternative words to context words.

Let the ith word m appearing in text _i As candidate replaceable words, the context window size is 2k (in the present invention, let k=2), and the words in the context window are sequentially noted as: cont (m) _i )＝{m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+k M is then _i And context word m _j (j∈[i-k，i+k]But j+.0) can be expressed as:

wherein p (m) _i ，m _j ) Is the word m _i And m _j Co-occurrence frequency in context window, p (m _i ) Is m _i Frequency of occurrence, p (m _j ) Is m _j Frequency of occurrence in the corpus. Can be derived from the bayesian theorem, p (m _i ，m _j )＝p(m _i |m _j )×p(m _j )＝p(m _j |m _i )×p(m _i ). If m is _i And m _j The occurrences in the text are independent of each other, then p (m _i ，m _j )＝p(m _i )×p(m _j ) PMI (m) _i ，m _j ) =0. If m is _i And m _j The occurrences in text are interrelated, then p (m _i ，m _j )＞p(m _i )×p(m _j ) And the stronger the correlation PMI (m _i ，m _j ) The greater the value of (2). PMI (m) _i ，m _j ) The larger the value, the more the word m is specified _i And m _j The more collocated the consistency in semantics and expression.

The conjunctive word m _i With its context words m _j Correlation p (m) _i ，m _j ) The present invention defines m _i Context Cont (m _i ) Degree of replaceability SPMI (m _i ，Cont(m _i ) Is) m is _i And each context word m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+k The sum of PMI values of (2) is represented by the following formula:

candidate alternative word m _i The stronger the correlation with the context word, the higher its replaceability; whereas the lower the degree of substitution.

2.2 candidate alternatives filtering based on degree of alternatives

The higher the degree of substitution, the more suitable the candidate alternative will be for the current context to embed secret information; conversely, the less suitable it is for information hiding. Thus, the present invention filters unsuitable words according to the degree of substitution of candidate alternative words, and determines alternative words that are ultimately available for information hiding.

Let m be _i The candidate alternative phrase is CG (m _i )＝{cm ₀ ，cm ₁ ，....，cm _r-1 Current context is Cont (m _i )＝{m _i-k ，…，m _i-1 ，m _i+1 ，…，m _i+ k, then sequentially calculating candidate alternative phrases CG (m _i ) Each candidate alternative word cm _q (q=0, 1,., r-1) degree of interchangeability SPMI (cm _q ，Cont(m _i )). If SPMI (cm) _q ，Cont(m _i ) And then filter the candidate alternative word cm _q The method comprises the steps of carrying out a first treatment on the surface of the Otherwise remain as alternative words. Thus, filtering CG (m _i ) Candidate replaceable words with lower replaceable degrees in the words are obtained to obtain m _i Is a word group that can be replaced. By the same method, the alternative word groups of all candidate alternative words in the original text can be obtained. When ε=0, since the degree of substitution of any candidate replaceable word is always 0 or more, the candidate replaceable word will not be filtered at this time. In experiments, the invention comprehensively considers the balance of embedding capacity and security, and epsilon=11.2 is selected through a plurality of experiments.

The invention provides a method for constructing a word library of alternative words for hiding natural language information through the two parts from the aspects of global alternative word group construction, local alternative word group construction and the like, and successfully constructs a large-scale alternative word library of alternative words, thereby realizing the improvement of the embedding capacity of the natural language information hiding method, improving the quality of ciphertext-containing books through the filtering of alternative words, and realizing the improvement of the security of secret information.

Comparison experiment:

to better verify the advantages of the method of the invention, we apply the alternative word stock constructed according to the invention to the synonym substitution-based steganography system T-Lex (http:// www.imsa.edu/k-eitw/tlex), i.e. replace the original synonym word stock to generate the text containing the secret, and then statistically analyze the embedding capacity and evaluate the security of these text containing the secret using the steganography analysis method NRF (Chen Z, huang L, yang W.detection of discussion-based linguistic steganography by relative frequency analysis [ J ]. Digital investigation,2011,8 (1): 68-77) based on relative word frequency analysis. The T-Lex system is the only hidden writing system based on synonym replacement disclosed at present on the Internet, is published by Winstein in 1998, and adopts an English synonym word stock mainly composed of absolute synonyms.

According to the invention, 1000 texts are randomly selected from the Gutenberg corpus to serve as original texts. In order to better compare the embedded capacity of different alternative word libraries, the method utilizes the synonym word library originally used by the T-Lex to identify the synonyms in the original text, and cuts the original text according to the number of the identified synonyms, so that the texts in the same original text set contain the same number of the synonyms. In the T-lex system, synonyms are alternatives. According to the method, 1000 original texts are divided into 5 text sets in equal proportion, the 5 text sets are named as a cover200, a cover400, a cover600, a cover800 and a cover1000 respectively, and the number of synonyms contained in the texts is 200, 400, 600, 800 and 1000 in sequence.

The number of alternative words contained in the text represents the number of positions in which information can be embedded. And the length of the information that can be embedded at each embedded location depends on the number of replaceable words at that embedded location, i.e., the size of the corresponding replaceable word group. The embedding capacity can be estimated by multiplying the number of embedding locations by the average of the size of the replaceable word phrase for each embedding location. Therefore, the invention adopts the average embedding position number and the average replaceable word group size in each text set as the evaluation index of the embedding capacity. The security of the text containing the secret is measured by its ability to resist detection by steganalysis. The higher the probability that the ciphertext-containing book is misidentified as a normal text by the steganalysis method, namely the higher the false alarm rate, the higher the security of the ciphertext-containing text is.

The results of the related experiments are shown in Table 2. Where CSWS represents a word stock of the alternative word obtained when the threshold value of the degree of substitution epsilon=0, and SWS represents a word stock of the alternative word obtained when the threshold value of the alternative epsilon=11.2. The higher the threshold of the replaceable degree is, the more candidate replaceable words are filtered out, the number of embeddable positions and the average replaceable phrase length are reduced, the embedding capacity is reduced, but the higher the false alarm rate is, the greater the safety is. As can be seen from the table, the number of the embedded positions and the average size of the replaceable word groups can be greatly increased by the replaceable word library constructed by the method, and the embedded capacity of the information hiding method is greatly improved. And secondly, the generated ciphertext-containing book has higher steganography analysis and detection resistance, and the security of secret information is improved.

Table 2 comparative experimental results of embedding capacity and anti-steganalysis test

In conclusion, the invention is a method for effectively constructing the replaceable word stock for hiding the natural language information, and improves the embedding capacity and the safety of the information hiding method.

Claims

1. A large-scale replaceable word stock construction method for natural language information hiding is characterized by comprising the following steps: step 1, for each word in the dictionary, calculating it to represent it as a low-dimensional dense word vector; step 2, calculating the similarity between words according to the word vector distance between the words, and obtaining a similar word list of each word; step 3, carrying out the association relation and the expression of the similarity degree among all similar words; step 4, constructing candidate alternative word groups according to the association relation and the similarity degree between the similar words; step 5, identifying candidate replaceable words appearing in the original carrier text as positions to be embedded according to the candidate replaceable word lexicon, positioning candidate replaceable word groups where the words are located, and calculating the replaceable degree of each candidate replaceable word in the group in the current context for the context of each position to be embedded; filtering words in the candidate alternative word groups according to the obtained alternative degree through a set alternative degree threshold value to obtain final alternative words;

wherein step 2 comprises: 2.1 measuring similarity between two words by cosine formula and word vector, calculating word w _i Sum word w _j Similarity between the two, the formula is as follows:

E(w _i ) Representing word vectors

2.2 setting a similarity threshold δ to determine whether two words are similar words, if S (w _i ,w _j ) > delta, then determine w _i And w is equal to _j Are similar words;

the word stock construction method, wherein step 2 further comprises:

2.3 obtaining w from the threshold delta _i Is a list of similar words of (a)Wherein sw is _ij Is w _i The j-th similar word is equal to w _i The similarity value of (2) satisfies: s (w) _i ,sw _ij ) > delta, and is called sw _ij And w is equal to _i Are similar; n is n _i Is w _i Obtaining a list of similar words of all words, wherein the set of the list is denoted as SList and is expressed as follows:

wherein step 3 comprises:

3.2 dividing G (D, E) into a plurality of maximum connected subgraphs according to the association relation among words, wherein certain similarity relations exist between the words corresponding to the vertexes in each maximum connected subgraph and are dissimilar to any word in other maximum connected subgraphs;

wherein step 4 comprises:

4.3, selecting the maximum group with the largest average value of the edge weights as a candidate alternative phrase for the plurality of maximum groups with shared vertexes;

4.4 deleting words in the determined candidate alternative phrase from the remaining maximum groups except the maximum groups determined as candidate alternative phrases, namely deleting vertexes shared with the selected maximum groups, and repeating the steps 4.1-4.3 until no maximum groups remain or the number of vertexes of the maximum groups after deleting the shared vertexes is smaller than 2;

wherein step 5 comprises:

5.1 let the ith word m appearing in text _i As candidate replaceable words, the context window size is 2k, and the words in the context window are sequentially noted as: cont (m) _i )＝{m _i-k ,…,m _i-1 ,m _i+1 ,…,m _i+k M is then _i And context word m _j (j∈[i-k,i+k]But j+.0) is expressed as:

wherein p (m) _i ,m _j ) Is the word m _i And m _j Co-occurrence frequency in context window, p (m _i ) Is m _i Frequency of occurrence, p (m _j ) Is m _j In a corpusFrequency of occurrence; p (m) _i ,m _j )＝p(m _i |m _j )×p(m _j )＝p(m _j |m _i )×p(m _i ) If m is _i And m _j The occurrences in the text are independent of each other, then p (m _i ,m _j )＝p(m _i )×p(m _j ) PMI (m) _i ,m _j ) =0; if m is _i And m _j The occurrences in text are interrelated, then p (m _i ,m _j )＞p(m _i )×p(m _j ) And the stronger the correlation PMI (m _i ,m _j ) The greater the value of (2);

5.2 definition m _i Context Cont (m _i ) Degree of replaceability SPMI (m _i ,Cont(m _i ) Is) m is _i And each context word m _i-k ,…,m _i-1 ,m _i+1 ,…,m _i+k The sum of PMI values of (2) is represented by the following formula:

let m be _i The candidate alternative phrase is CG (m _i )＝{cm ₀ ,cm ₁ ,....,cm _r-1 Current context is Cont (m _i )＝{m _i-k ,…,m _i-1 ,m _i+1 ,…,m _i+k Sequentially calculating candidate alternative phrases CG (m) _i ) Each candidate alternative word cm _q Q=0, 1,..r-1, the degree of substitution SPMI (cm _q ,Cont(m _i ))；

If SPMI (cm) _q ,Cont(m _i ) E) is less than or equal to epsilon, filtering the candidate replaceable word cm _q The method comprises the steps of carrying out a first treatment on the surface of the Otherwise remain as alternative words.

2. The word stock construction method according to claim 1, wherein: the step 1 comprises the following steps: 1.1 preparing a dictionary d= { w ₁ ,w ₂ ,…,w _N W is the word in the dictionary, N is the total number of words in the dictionary; 1.2 for words in the dictionary, continuous god is utilizedTraining a preset corpus through a web language model Skip-gram to obtain a low-dimensional dense word vector representation of each word, wherein the low-dimensional dense word vector representation is used for any word w in a dictionary D _i The word vector is denoted as E (w _i )。