CN111859407A - Text automatic generation steganography method based on candidate pool self-contraction mechanism - Google Patents
- Publication number
- CN111859407A (Application CN201911004851.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- candidate pool
- words
- steganographic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an automatic text generation steganography method based on a candidate pool self-contraction mechanism. The method can generate a high-quality text carrier containing hidden information from the secret bit stream to be hidden. Unlike traditional text steganography methods, its innovation is a candidate pool self-contraction mechanism based on perplexity (confusion degree) calculation. During the automatic generation of the steganographic text, the neural network is fully used to score words; taking into account that different words have different sensitivities to the context, a self-contraction mechanism based on perplexity calculation is introduced, which improves the concealment of the steganographic system and the readability of the generated steganographic text.
Description
Technical Field
The invention relates to the fields of information hiding, big data, deep learning, natural language processing and the like, in particular to a text automatic generation steganography method based on a candidate pool self-contraction mechanism.
Background
Today's society is in the "big data era," which brings a new kind of capability: analyzing massive data to obtain products and services of great value. In the big data era, more data can be analyzed, even all data related to a specific phenomenon. The core of big data is prediction, applying mathematical algorithms to massive data to predict how likely events are to occur. Text is the communication medium people use most frequently in daily life, and with the arrival of the big data era, massive text data has great value. The problem of data security follows.
The traditional idea of network security management is "strict prevention, blocking, and physical isolation." At present, traditional information security is built mainly around encryption technology and systems. Encryption means that the sender converts a message (plaintext) into meaningless ciphertext through a key and an encryption function, and the receiver restores the received ciphertext to plaintext through a decryption function and the key. Encryption technology has formed a relatively complete system and management practice, but in the big data era it has exposed some shortcomings: ciphertext is easily noticed by an attacker, who then only needs to break the encryption to obtain the transmitted original text. Information hiding, in contrast, disguises the information so that the data sent by the sender resembles ordinary data and does not attract the attacker's attention, which protects the content security of the secret information.
Information hiding, also called data hiding, embeds secret information into a publicly transmitted carrier so that the result is indistinguishable from ordinary data, thereby ensuring the imperceptibility and security of the secret information. The publicly transmitted carrier can be pictures, audio, video, text, and so on. Text is the most widely used information carrier in people's daily life and has a high degree of information coding; however, because of the low redundancy of text, hiding secret information in it is very challenging.
Information hiding based on text signals falls mainly into two categories: one hides secret information by changing the text format (such as word spacing or letter case); the other uses carrier-generation technology to pass the secret information to be hidden through a text generator and produce a meaningful natural text. Format-based methods keep the text highly readable, but the modifiable space is small, and it is difficult to achieve a sufficiently high hiding capacity. Carrier-generation methods offer a higher steganographic capacity, but the generated steganographic text tends to have low readability and is easily identified. How to design a text generator that improves the naturalness of the generated text so that an attacker cannot easily perceive it is therefore an urgent problem in this field.
The concept of Deep Learning (DL) stems from the study of Artificial Neural Networks (ANN). An artificial neural network abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks through different connection modes, called Neural Networks (NN) for short, as shown in Fig. 1. Deep learning, also called deep neural networks, is a method of representation learning on data. In other words, deep learning can simulate the neural structure of the human brain to interpret data such as images, audio, video, and text. Accurate classification or prediction is achieved mainly by constructing machine learning models with many hidden layers and learning features from massive training data. Deep learning emphasizes the depth of the model structure, i.e., the number of layers from the input layer to the output layer: the more layers, the deeper the model. At the same time, it highlights the importance of feature learning: the raw features of a sample are transformed into a new feature space through layer-by-layer feature extraction. Compared with manually designed features, features extracted with deep learning save labor and better describe the intrinsic information of the data.
A Recurrent Neural Network (RNN) is a type of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and links all recurrent units in a chain. Its main characteristic is that the output of a neuron acts directly on that neuron at the next moment; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. This gives the network memory, as shown in Fig. 2. Its depth is embodied in the time domain. Compared with deep feed-forward neural networks, a recurrent neural network can process sequence data of arbitrary length by using neural units with self-feedback, which makes it an attractive deep learning structure.
For a recurrent neural network with only one hidden layer, the state update and output can be written as:
S_t = f(W · S_{t-1} + U · x_t)
o_t = g(V · S_t)
where x_t is the input vector at time t, S_t is the memory (hidden state) of the network at time t, o_t is the output at time t, U is the weight matrix applied to the current input, W is the weight matrix applied to the previous hidden state, V is the output weight matrix, and f and g are non-linear functions, typically tanh, relu, or softmax. W, U, and V are shared across all time steps.
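As a minimal illustration of the recurrence above, a NumPy sketch of one hidden-state update is given below; the weight shapes, the tanh activation, and the toy dimensions are assumptions made for the example, not values from the patent.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W):
    """One simple-RNN step: S_t = tanh(W @ S_{t-1} + U @ x_t)."""
    return np.tanh(W @ s_prev + U @ x_t)

# toy usage: 3-dimensional inputs, 4-dimensional hidden state
rng = np.random.default_rng(0)
U, W = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
s = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a length-5 input sequence
    s = rnn_step(x_t, s, U, W)        # s carries the memory across time steps
```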
Theoretically, the simplest RNN model described by the above equation can handle sequence signals of arbitrary length. However, this model suffers from the vanishing-gradient problem and cannot effectively capture long-term dependencies. To solve this problem, an improved algorithm based on the recurrent neural network, the Long Short-Term Memory model (LSTM), was created. With carefully designed hidden-layer units, this approach can effectively solve the problem. An LSTM unit consists of four parts: a cell unit, an input gate, an output gate, and a forget gate. It can store the input information of past moments in the cell and thus model long time sequences. The key to the LSTM is the state of the cell unit. The cell state, like a conveyor belt, runs through the entire chain with only a few small linear operations acting on it, so information is easily kept unchanged. Adding or removing information is implemented through gate structures. A gate selectively passes information through a sigmoid neural layer and a point-wise multiplication operation. The sigmoid layer outputs a vector whose elements are real numbers between 0 and 1, representing the weight with which the corresponding information is passed: "0" means "let nothing through" and "1" means "let everything through". The LSTM protects and controls information through three gate structures, namely the input gate, the output gate, and the forget gate. The LSTM cell can be described by the standard gate equations:
I_t = σ(W_I · [h_{t-1}, x_t] + b_I)
F_t = σ(W_F · [h_{t-1}, x_t] + b_F)
C_t = F_t ⊙ C_{t-1} + I_t ⊙ tanh(W_C · [h_{t-1}, x_t] + b_C)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
Here, I_t is the input gate, which controls how much new information enters the memory unit. F_t is the forget gate, which controls how much of the previously stored information the memory unit discards. C_t is the memory unit, the sum of the input information modulated by the input gate and the previous memory modulated by the forget gate F_t. The output gate o_t allows the memory unit to affect the current hidden state, outputting or blocking its effect. Treating the LSTM computation as a black box, let f_LSTM(·) denote the transfer function of the LSTM unit at each moment; the output O_t at time t can then be obtained as follows. As the formula shows, the output at time t is computed from the preceding moments, which solves the problem that the RNN cannot capture long-term dependencies:
O_t = f_LSTM(x_t | x_1, x_2, ..., x_{t-1})
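To make the gate interactions concrete, here is a minimal NumPy sketch of a single LSTM step, assuming the concatenated input [h_{t-1}, x_t] and the standard sigmoid/tanh activations; the variable names are illustrative and not taken from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_c, W_o, b_i, b_f, b_c, b_o):
    """One LSTM time step: input, forget and output gates plus the cell update."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    i = sigmoid(W_i @ z + b_i)              # input gate I_t
    f = sigmoid(W_f @ z + b_f)              # forget gate F_t
    c_tilde = np.tanh(W_c @ z + b_c)        # candidate memory
    c = f * c_prev + i * c_tilde            # memory unit C_t
    o = sigmoid(W_o @ z + b_o)              # output gate o_t
    h = o * np.tanh(c)                      # hidden state h_t
    return h, c
```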
The recurrent neural network has good feature extraction, representation, and semantic understanding capabilities for sequence signals in the time domain. Text can be viewed as a sequence of signals: the various characteristics of text can be learned automatically by a recurrent neural network, and a high-quality natural text carrier can be reconstructed from these characteristics. However, existing methods for automatically generating steganographic text with a recurrent neural network ignore the fact that each word has a different sensitivity at each moment, so every word in the generated steganographic text embeds the same number of bits. The invention proposes a self-contraction mechanism that screens the candidate pool through a perplexity calculation. The different scores of each word during the automatic generation of the steganographic text are fully exploited, which markedly improves the quality of the generated steganographic text; the concealment of the steganographic system can thus be further optimized compared with previous methods.
From the above, automatically generating text with a recurrent neural network can overcome the problems of existing methods, improve the quality of the generated text, and improve the concealment of the steganographic system.
Disclosure of Invention
The invention provides an automatic text generation steganography method based on a candidate pool self-contraction mechanism. It belongs to steganography by carrier generation: from the secret bit stream to be hidden, and using the candidate pool self-contraction mechanism, it can automatically generate steganographic text with high readability. The method estimates a statistical language model from crawled text used as training samples. With the trained neural network model, high-quality text can be generated automatically. During text generation, a perplexity calculation is performed on the candidate pool, which is then further screened. Each word is encoded according to the conditional probability distribution of the screened candidate pool, and text generation is controlled by the bit stream. In this way, the model improves the quality of the generated steganographic text and the concealment of the system. To achieve the above object, the method comprises the following steps:
(1) building a text data set as a training set by crawling a large number of common natural texts on the web;
(2) preprocessing the data: English natural text is converted to all lowercase, and only letters and numbers are retained;
(3) modeling the texts in the training set, constructing a recurrent neural network model, and optimizing the performance of the model through the back-propagation algorithm;
(4) testing the loss value of the model, and adjusting the model training parameters according to the loss value;
(5) repeating the steps (3) to (4) until the parameters and the performance of the neural network model are stable;
(6) counting the word frequency distribution of the first word of each sentence in the training set, and selecting the first 200 words with the highest word frequency to form an initial word list;
(7) when a sentence of steganographic text is generated each time, randomly selecting a word from the initial word list as an initial word, and marking the beginning of the generation of the steganographic text;
(8) iteratively calculating, with the tuned recurrent neural network model and the given first word, the conditional probability distribution over the dictionary T of the word at each moment;
(9) at each time t, sorting the words in descending order of the conditional probability given by the iterative formula, calculating the score of the whole sentence when each word is combined with the first t-1 words according to the trained statistical language model, and normalizing the scores;
(10) determining the size of the candidate pool according to a preset embedding rate;
(11) after the size of the candidate pool is determined, performing a perplexity calculation from the conditional probability distribution at time t together with the previous t-1 moments;
(12) screening the candidate pool against a preset perplexity threshold; because the number of words remaining in the pool is not fixed, the screened candidate pool has a self-contracting property;
(13) if the perplexity of every word in the candidate pool at this moment exceeds the preset threshold, selecting the word with the highest score as the output of the current moment;
(14) otherwise, searching the code table according to the bit stream to be embedded, outputting at the current moment the word whose code matches the bit stream to be embedded, and thereby hiding the secret information in the process of generating the steganographic text;
(15) repeating the steps (8) to (14) until a complete steganographic sentence is generated, and completing the process of automatically generating the steganographic text according to the secret information;
(16) after receiving the steganographic text generated by the model, the receiver decodes it and recovers the secret message.
To ensure a high hiding capacity and the naturalness of the steganographic text, the method uses a recurrent neural network to perform text steganography with candidate pool self-contraction, where the candidate pool is screened mainly through a perplexity calculation. High-quality text can be generated automatically from the secret information. The model consists of three main modules, described below: the automatic text generation module, the information hiding module, and the information extraction module. The automatic text generation module models natural text, trains a statistical language model from a large number of samples using the information extraction capability of the neural network, and produces the word actually sent at each moment. The information hiding module performs a perplexity calculation on the conditional probability distribution and screens the candidate pool against a preset value; this screening is self-contracting. The secret bit stream is hidden through variable-length encoding of the words in the candidate pool. The information extraction module simulates the receiving end: it decodes the received natural text with the embedded secret information and recovers the secret information.
Automatic text generation based on RNN
In the automatic text generation process, the method mainly uses the feature extraction and representation capability of the RNN for sequence signals, combining the signals of the first t-1 moments to calculate the probability distribution of the signal at time t, as shown in the following formula:
O_t = f_LSTM(x_t | x_1, x_2, ..., x_{t-1})
Each sentence Sen in the training samples can be regarded as a sequence signal, and the signal at time t is the word at the current moment. Since the machine only recognizes mathematical symbols, each word, which carries rich semantic information, must be converted into a vector; that is, words must be mapped to word vectors. All word vectors have the same dimension and lie in the same semantic space R^n; each vector is a point in this space, and the distance between points expresses the semantic similarity of the corresponding words. Each sentence in the training samples can then be formulated as follows: each sentence Sen can be represented as a matrix S ∈ R^{m×n}, where the t-th row is the word vector of the t-th word in the sentence, m is the length of the sentence, and n is the dimension of the word vectors.
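As a toy illustration of this sentence-as-matrix representation, the sketch below builds S for a four-word sentence; the vocabulary, the embedding table, and the dimension n = 4 are invented for the example.

```python
import numpy as np

# hypothetical toy vocabulary and embedding table (n = 4 dimensions per word)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
embedding = np.random.default_rng(1).normal(size=(len(vocab), 4))

sentence = ["the", "cat", "sat", "down"]
# S has shape (m, n): one row (word vector) per word of the sentence
S = np.stack([embedding[vocab[w]] for w in sentence])
print(S.shape)   # -> (4, 4)
```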
Typically, a recurrent neural network consists of multiple layers, each layer having multiple LSTM units. Stacking LSTM layers increases the depth and enables more accurate extraction of the features of the training samples. The present invention mainly uses a 2-layer LSTM. The LSTM of the upper layer provides its sequential outputs to the LSTM of the lower layer, i.e., one output per input moment. Adjacent hidden layers are connected by a transfer matrix; for example, the transfer between the i-th layer and the (i+1)-th layer can be represented by a weight matrix.
As previously mentioned, the output at time t depends not only on the current input x_t but also on the vectors stored in the units at the previous moments. Thus, the output of the hidden layer at time t can be regarded as a summary of the first t moments, i.e., a fusion of the information of the first t moments. Based on this property, the method adds a softmax layer after all hidden layers of the model to calculate the probability distribution of the (t+1)-th word, i.e., the prediction weights for time t+1, where N denotes the number of words in the dictionary T over which the probability is computed. Each word is scored by its prediction weight.
Here W_P and b_P are a weight matrix and a bias shared across moments; the values in the weight matrix W_P reflect the importance of each feature o_i, and the output vector O has dimension N. At time t, the predicted distribution for the next time t+1 is:
y_{t+1} = softmax(W_P · O_t + b_P)
To update the parameters of the neural network, the performance of the recurrent neural network is optimized through the back-propagation algorithm: the loss function is minimized by iterative optimization of the network so as to obtain the language model that best fits the semantic space. The method defines the loss function of the whole network as the negative logarithm of the statistical probability of each sentence:
Loss = - Σ_t log p(x_t | x_1, x_2, ..., x_{t-1})
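A minimal sketch of the prediction layer and this sentence-level loss follows; o_t stands for the hidden output at time t and W_P, b_P for the shared projection parameters, mirroring the formulas above (the function names are illustrative, not the patent's code).

```python
import numpy as np

def softmax(z):
    z = z - z.max()                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(o_t, W_P, b_P):
    """Prediction weights for the (t+1)-th word given the hidden output o_t."""
    return softmax(W_P @ o_t + b_P)            # shape (N,): one probability per word in T

def sentence_loss(step_distributions, word_ids):
    """Negative log-likelihood of one sentence given the per-step distributions."""
    return -sum(np.log(dist[w]) for dist, w in zip(step_distributions, word_ids))
```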
information hiding algorithm
In the information hiding module, the method performs self-contracting screening of the candidate pool through a perplexity calculation and encodes the screened words based on their conditional probability distribution, thereby forming a mapping from the binary bit stream to the word space. The method relies mainly on the following fact: when the model is well trained, there are multiple plausible choices at each moment. After the conditional probability distribution of all words in the dictionary T is sorted in descending order, a candidate pool is selected according to the embedding rate b, and the number of words in the Candidate Pool (CP) is 2^b.
Because each word has a different influence on the probability distribution of the next word at the current moment, the self-contraction mechanism calculates, for each word in the candidate pool, the perplexity of its conditional probability at time t+1 combined with the outputs of the previous t moments:
ppl_{t+1,j} = ( p_{t+1,j} · Π_{i=1}^{t} p_i )^{-1/N}
where {p_1, ..., p_t} are the occurrence probabilities of the words already generated at the first t moments, p_{t+1,j} is the probability of the j-th word of the preset candidate pool at time t+1, and N is the total number of words up to time t+1. ppl_{t+1,j} is therefore the perplexity of the partial sentence up to time t+1 if the j-th word is selected at that moment. The calculated value is compared with a preset value to finely screen the candidate pool: when the value is lower than the preset value, the candidate word remains in the candidate pool; when it is above the threshold, the word is removed from the candidate pool, i.e.:
CP' = { word_j ∈ CP : ppl_{t+1,j} < ppl }
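The sketch below shows this self-contraction step in Python: the top 2^b words form the candidate pool, the perplexity of each candidate combined with the already-generated words is computed, and words above the preset threshold are dropped. The function names and the exact perplexity form (the N-th root of the inverse joint probability) are this sketch's assumptions.

```python
import numpy as np

def perplexity_with_candidate(prev_word_probs, candidate_prob):
    """Perplexity of the partial sentence if this candidate word is chosen next."""
    probs = list(prev_word_probs) + [candidate_prob]
    n = len(probs)
    return float(np.exp(-np.sum(np.log(probs)) / n))

def shrink_candidate_pool(dist, prev_word_probs, b, ppl_threshold):
    """Self-contraction: take the top 2^b word ids by probability from `dist`,
    then drop every candidate whose perplexity exceeds the preset threshold."""
    top_ids = np.argsort(dist)[::-1][: 2 ** b]           # candidate pool CP
    return [(w, float(dist[w])) for w in top_ids
            if perplexity_with_candidate(prev_word_probs, dist[w]) < ppl_threshold]
```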
After the candidate pool is screened, the conditional probability distribution of each word in the candidate pool needs to be encoded to embed the secret information. Huffman coding is a lossless, variable-length coding scheme. It is based on a probability statistical model and is therefore well suited to encoding the conditional probability distribution of each word. It builds an instantaneous (prefix) code using a tree structure and can fully exploit the conditional probability distribution of each word: words with high occurrence probability receive shorter codes and words with low probability receive longer codes, which guarantees the minimum average code length. Huffman coding is therefore the optimal variable-length coding scheme here. During secret information embedding, the word transmitted at the previous moment is used as the root node of the tree, and the leaf nodes of the tree represent the words in the candidate pool.
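A compact sketch of constructing such a Huffman code over the (screened) candidate pool with Python's heapq module follows; this is the standard textbook construction, not necessarily the patent's exact implementation.

```python
import heapq
from itertools import count

def huffman_code(pool):
    """pool: list of (word, probability) pairs. Returns {word: bit string}."""
    if len(pool) == 1:                       # degenerate pool: one word, one zero bit
        return {pool[0][0]: "0"}
    tiebreak = count()                       # avoids comparing nodes on equal probability
    heap = [(p, next(tiebreak), w) for w, p in pool]
    heapq.heapify(heap)
    codes = {w: "" for w, _ in pool}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        for w in _leaves(left):
            codes[w] = "0" + codes[w]        # prepend a bit as the tree grows upward
        for w in _leaves(right):
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    return codes

def _leaves(node):
    """Yield all words stored under a (possibly nested) tree node."""
    if isinstance(node, tuple):
        for child in node:
            yield from _leaves(child)
    else:
        yield node
```

Words with high probability end up with short codewords, so the average number of bits embedded per word stays close to the entropy of the screened pool.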
In the information hiding process based on Huffman coding, the secret information to be embedded is read in sequence, and the output at the current moment is selected according to the Huffman code corresponding to a word in the candidate pool. To avoid two identical bit sequences producing identical sentences, the method constructs a start word list: the frequency of the first word of each sentence in the text data set is counted, and after sorting in descending order the 200 most frequent words are selected to form the start word list. During generation, a word from the start word list is chosen at random as the beginning of the generated steganographic sentence, which ensures that the generated text does not repeat.
The model details and the embedding of the secret information are shown in Fig. 3. The details of the information hiding method are given in Algorithm 1. With this method, steganographic text with high readability can be generated from the input secret bit stream. These texts have strong concealment and are not easily recognized by attackers.
Information extraction algorithm
Information extraction means that, after receiving a publicly transmitted sentence, the receiver must correctly decode the secret information contained in it. The processes of information hiding and information extraction are essentially the same: the same RNN model is used to compute the conditional probability distribution of each word at each moment, and the perplexity calculation and screening are performed on the candidate pool. Notably, both sides must construct the same candidate pool and encode the words in it with the same coding method.
After receiving the text, the receiver inputs the first word of each sentence into the RNN as the starting word for decoding, and the RNN in turn calculates the distribution probability of the words at the subsequent moments. At each moment, after obtaining the probability distribution of the current word, the receiver first sorts all words in the dictionary T in descending order and selects the top 2^b words to form a candidate pool. Then the joint perplexity of each word in the candidate pool at time t+1 with the previous t moments is calculated, and the candidate pool is finely screened against the preset value. A Huffman tree is constructed for the resulting new candidate pool according to the same rule, and the words in the pool are encoded. Finally, according to the code corresponding to the word actually received at the current moment, the bits hidden in that word are decoded. In this way, the bit stream hidden in the original text can be extracted without distortion.
Algorithm 1 information hiding algorithm
Input:
Secret bit stream: B = {0, 0, 1, 0, 1}
Embedding rate: b
Start word list: Start_word = {word_1, word_2, ..., word_200}
Perplexity preset value: ppl
Output:
Generated steganographic texts: Text = {S_1, S_2, ..., S_N}
1. preparing data and training an RNN model;
2. while the bit stream B is not exhausted:
3. if not the end of the current sentence:
4. calculating the probability distribution of the next word by using the trained RNN according to the generated word;
5. sorting the prediction probabilities of all the words in a descending order and calculating the score of the whole sentence when the word is combined with the first t-1 words according to a trained statistical language model;
6. selecting the top 2^b words to form a candidate pool;
7. performing the perplexity calculation on the candidate pool together with the output before the current moment to obtain ppl_{t+1,j};
8. if ppl_{t+1,j} is less than the preset value ppl:
9. keeping the corresponding word;
10. otherwise:
11. removing it from the candidate pool;
12. constructing a Huffman tree for the conditional probability distribution of each word in the screened candidate pool;
13. reading the bit stream from B and searching the Huffman tree until a leaf node is found, and outputting the corresponding word at the current moment;
14. otherwise:
15. randomly selecting a start word from the start word list Start_word as the beginning of the next sentence;
16. If not the end of the current sentence:
17. selecting the word with the highest probability outside the candidate pool as the output of the current moment;
18. selecting the word with the highest probability at each moment as output until the end of the sentence;
19. returning the generated sentence.
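Putting the pieces together, here is a simplified Python sketch of the embedding loop of Algorithm 1. It reuses the shrink_candidate_pool and huffman_code helpers sketched above, and lm_step is a hypothetical stand-in for the trained RNN language model (assumed to return a NumPy probability vector over dictionary ids plus an id-to-word dict); none of these names come from the patent.

```python
def embed_bits(bits, lm_step, start_word, b, ppl_threshold, end_token="<eos>"):
    """Generate one steganographic sentence while consuming bits from `bits`.

    `bits` is a list of 0/1 integers, consumed in place; it is assumed to be padded
    so that a full Huffman codeword can always be matched against it.
    """
    sentence, prev_probs = [start_word], []
    while bits:
        dist, id_to_word = lm_step(sentence)                   # conditional distribution
        pool = shrink_candidate_pool(dist, prev_probs, b, ppl_threshold)
        if len(pool) < 2:                                      # pool contracted too far:
            chosen = int(pool[0][0]) if pool else int(dist.argmax())   # hide no bits
        else:
            codes = huffman_code(pool)
            # the unique codeword that prefixes the remaining bit stream picks the word
            chosen = next(w for w, c in codes.items()
                          if bits[: len(c)] == [int(x) for x in c])
            del bits[: len(codes[chosen])]
        sentence.append(id_to_word[chosen])
        prev_probs.append(float(dist[chosen]))
        if id_to_word[chosen] == end_token:
            break
    return " ".join(sentence)
```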
Algorithm 2 information extraction algorithm
Input:
Generated sentences: Text = {S_1, S_2, ..., S_N}
Embedding rate: b
Perplexity preset value: ppl
Output:
Secret bit stream: B = {0, 0, 1, 0, 1}
1. For each sentence S in the text, the following is performed:
2. inputting a first word of the sentence S in the trained RNN model as the beginning of decoding;
3. for each word word_i in the sentence S, carrying out the following steps:
4. calculating the probability distribution of the next word using the trained RNN based on the previous word;
5. sorting the prediction probabilities of all words in descending order and selecting the top 2^b words to construct a candidate pool;
6. performing the perplexity calculation on the candidate pool together with the output before the current moment to obtain ppl_{t+1,j};
7. if ppl_{t+1,j} is less than the preset value ppl:
8. keeping the corresponding word;
9. otherwise:
10. removing it from the candidate pool;
11. encoding each word in the screened candidate pool by using a Huffman code;
12. if word_i is in the candidate pool, then:
13. finding and outputting the Huffman code matching the word word_i actually received at this moment;
14. extracting a corresponding bitstream and appending it to B;
15. otherwise:
16. the information extraction process is finished;
17. returning the extracted secret bit stream B
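A corresponding sketch of the extraction side (Algorithm 2) is given below, again assuming the hypothetical lm_step interface and reusing the shrink_candidate_pool and huffman_code helpers from the sketches above; the receiver rebuilds the same candidate pools and codes to recover the bits.

```python
def extract_bits(sentence_words, lm_step, b, ppl_threshold):
    """Recover the hidden bits from one received steganographic sentence."""
    bits, prev_probs = [], []
    context = [sentence_words[0]]                      # the start word carries no bits
    for word in sentence_words[1:]:
        dist, id_to_word = lm_step(context)            # same model state as the sender
        word_to_id = {w: i for i, w in id_to_word.items()}
        pool = shrink_candidate_pool(dist, prev_probs, b, ppl_threshold)
        if len(pool) >= 2:                             # only such pools carry bits
            codes = huffman_code(pool)
            wid = word_to_id[word]
            if wid in codes:                           # the word was selected by the bit stream
                bits.extend(int(x) for x in codes[wid])
            else:                                      # word fell outside the pool: stop decoding
                break
        context.append(word)
        prev_probs.append(float(dist[word_to_id[word]]))
    return bits
```

Because the sender and receiver run the same model, the same pool contraction, and the same Huffman construction at every step, the recovered bit stream matches the embedded one, which is what steps 12-16 of Algorithm 2 express.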
Drawings
FIG. 1 is a block diagram of a neural network of the present invention
FIG. 2 is a schematic diagram of a recurrent neural network used in the present invention
Fig. 3 is a detailed diagram of the model and the information hiding algorithm proposed in the present invention.
Claims (2)
1. A text automatic generation steganography method based on a candidate pool self-contraction mechanism, comprising the following steps:
(1) building a text data set as a training set by crawling a large number of common natural texts on the web;
(2) preprocessing the data: English natural text is converted to all lowercase, and only letters and numbers are retained;
(3) modeling the texts in the training set, constructing a recurrent neural network model, and optimizing the performance of the model through the back-propagation algorithm;
(4) testing the loss value of the model, and adjusting the model training parameters according to the loss value;
(5) repeating the steps (3) to (4) until the parameters and the performance of the neural network model are stable;
(6) counting the word frequency distribution of the first word of each sentence in the training set, and selecting the first 200 words with the highest word frequency to form an initial word list;
(7) When a sentence of steganographic text is generated each time, randomly selecting a word from the initial word list as an initial word, and marking the beginning of the generation of the steganographic text;
(8) iteratively calculating the conditional probability distribution of the words at each moment through the adjusted recurrent neural network model and the selected first word;
(9) at each time t, sorting the words in descending order of the conditional probability given by the iterative formula, calculating the score of the whole sentence when each word is combined with the first t-1 words according to the trained statistical language model, and normalizing the scores;
(10) determining the size of the candidate pool according to a preset embedding rate;
(11) after the size of the candidate pool is determined, performing a perplexity calculation from the conditional probability distribution at time t together with the previous t-1 moments;
(12) screening the candidate pool against a preset perplexity threshold; because the number of words remaining in the pool is not fixed, the screened candidate pool has a self-contracting property;
(13) if the perplexity of every word in the candidate pool at this moment exceeds the preset threshold, selecting the word with the highest score as the output of the current moment;
(14) otherwise, searching the coding table according to the bit stream to be embedded, outputting at the current moment the word whose code matches the bit stream to be embedded, and thereby hiding the secret information in the process of generating the steganographic text;
(15) Repeating the steps (8) to (14) until a complete steganographic sentence is generated, and completing the process of automatically generating the steganographic text according to the secret information;
(16) after receiving the steganographic text generated by the model, the receiver decodes it and recovers the secret message.
2. The text automatic generation steganography method based on the candidate pool self-contraction mechanism according to claim 1, wherein the candidate pool self-contraction mechanism is implemented as described in steps (11), (12), and (13), which improves the concealment of the steganographic system and the readability of the generated steganographic text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004851.2A CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004851.2A CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111859407A true CN111859407A (en) | 2020-10-30 |
Family
ID=72970670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911004851.2A Pending CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111859407A (en) |
- 2019-10-16: Application CN201911004851.2A filed in CN; patent CN111859407A (en), status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202010A (en) * | 2016-07-12 | 2016-12-07 | 重庆兆光科技股份有限公司 | The method and apparatus building Law Text syntax tree based on deep neural network |
CN108923922A (en) * | 2018-07-26 | 2018-11-30 | 北京工商大学 | A kind of text steganography method based on generation confrontation network |
CN109711121A (en) * | 2018-12-27 | 2019-05-03 | 清华大学 | Text steganography method and device based on Markov model and Huffman encoding |
CN109815496A (en) * | 2019-01-22 | 2019-05-28 | 清华大学 | Based on capacity adaptive shortening mechanism carrier production text steganography method and device |
Non-Patent Citations (1)
Title |
---|
张培晶 (Zhang Peijing); 宋蕾 (Song Lei): "Review of research on LDA-based topic modeling methods for microblog text" (基于LDA的微博文本主题建模方法研究述评) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762523A (en) * | 2021-01-26 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Text generation method and device, storage medium and electronic equipment |
CN117131203A (en) * | 2023-08-14 | 2023-11-28 | 湖北大学 | Knowledge graph-based text generation steganography method, related method and device |
CN117131203B (en) * | 2023-08-14 | 2024-03-22 | 湖北大学 | Knowledge graph-based text generation steganography method, related method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201030 |