CN111859407A - Text automatic generation steganography method based on candidate pool self-contraction mechanism - Google Patents
- Publication number
- CN111859407A (Application CN201911004851.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- candidate pool
- words
- steganographic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an automatic text generation steganography method based on a candidate pool self-contraction mechanism. The method can generate a high-quality text carrier containing hidden information from the secret bit stream to be hidden. Unlike traditional text steganography methods, its innovation is a candidate pool self-contraction mechanism based on perplexity (confusion degree) calculation. During the automatic generation of the steganographic text, the neural network is fully used to score words; taking into account that different words have different sensitivities to the context, a self-contraction mechanism based on perplexity calculation is introduced, which improves the concealment of the steganographic system and the readability of the generated steganographic text.
Description
Technical Field
The invention relates to the fields of information hiding, big data, deep learning, natural language processing and the like, in particular to a text automatic generation steganography method based on a candidate pool self-contraction mechanism.
Background
Today's society is in the "big data era," which brings a new kind of capability: analyzing massive data to obtain products and services of great value. In the big data era, more data can be analyzed, even all data related to a specific phenomenon. The core of big data is prediction, applying mathematical algorithms to massive data to predict how likely events are to occur. Text is the communication medium people use most frequently in daily life, and with the arrival of the big data era, massive text data has great value. The problem of data security follows.
The traditional idea of network security management is "strict prevention, blocking, and physical isolation." At present, traditional information security is built mainly around encryption technology and systems. Encryption means that the sender converts a message (plaintext) into meaningless ciphertext through a key and an encryption function, and the receiver restores the received ciphertext to plaintext through a decryption function and the key. Encryption technology has formed a relatively complete system and management practice, but in the big data era it has exposed some shortcomings: ciphertext is easily noticed by an attacker, who then only needs to break the encryption to obtain the transmitted original text. Information hiding, in contrast, disguises the information so that the data sent by the sender resembles ordinary data and does not attract the attacker's attention, which protects the content security of the secret information.
Information hiding, also called data hiding, embeds secret information into a publicly transmitted carrier so that the result is indistinguishable from ordinary data, thereby ensuring the imperceptibility and security of the secret information. The publicly transmitted carrier can be pictures, audio, video, text, and so on. Text is the most widely used information carrier in people's daily life and has a high degree of information coding; however, because of the low redundancy of text, hiding secret information in it is very challenging.
Information hiding based on text signals falls mainly into two categories: one hides secret information by changing the text format (such as word spacing or letter case); the other uses carrier-generation technology to pass the secret information to be hidden through a text generator and produce a meaningful natural text. Format-based methods keep the text highly readable, but the modifiable space is small, and it is difficult to achieve a sufficiently high hiding capacity. Carrier-generation methods offer a higher steganographic capacity, but the generated steganographic text tends to have low readability and is easily identified. How to design a text generator that improves the naturalness of the generated text so that an attacker cannot easily perceive it is therefore an urgent problem in this field.
The concept of Deep Learning (DL) stems from the study of Artificial Neural Networks (ANN). An artificial neural network abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks through different connection modes, called Neural Networks (NN) for short, as shown in Fig. 1. Deep learning, also called deep neural networks, is a method of representation learning on data. In other words, deep learning can simulate the neural structure of the human brain to interpret data such as images, audio, video, and text. Accurate classification or prediction is achieved mainly by constructing machine learning models with many hidden layers and learning features from massive training data. Deep learning emphasizes the depth of the model structure, i.e., the number of layers from the input layer to the output layer: the more layers, the deeper the model. At the same time, it highlights the importance of feature learning: the raw features of a sample are transformed into a new feature space through layer-by-layer feature extraction. Compared with manually designed features, features extracted with deep learning save labor and better describe the intrinsic information of the data.
A Recurrent Neural Network (RNN) is a type of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and links all recurrent units in a chain. Its main characteristic is that the output of a neuron acts directly on that neuron at the next moment; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. This gives the network memory, as shown in Fig. 2. Its depth is embodied in the time domain. Compared with deep feed-forward neural networks, a recurrent neural network can process sequence data of arbitrary length by using neural units with self-feedback, which makes it an attractive deep learning structure.
For a recurrent neural network with only one hidden layer, the state update and output can be written as:
S_t = f(W · S_{t-1} + U · x_t)
o_t = g(V · S_t)
where x_t is the input vector at time t, S_t is the memory (hidden state) of the network at time t, o_t is the output at time t, U is the weight matrix applied to the current input, W is the weight matrix applied to the previous hidden state, V is the output weight matrix, and f and g are non-linear functions, typically tanh, relu, or softmax. W, U, and V are shared across all time steps.
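As a minimal illustration of the recurrence above, a NumPy sketch of one hidden-state update is given below; the weight shapes, the tanh activation, and the toy dimensions are assumptions made for the example, not values from the patent.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W):
    """One simple-RNN step: S_t = tanh(W @ S_{t-1} + U @ x_t)."""
    return np.tanh(W @ s_prev + U @ x_t)

# toy usage: 3-dimensional inputs, 4-dimensional hidden state
rng = np.random.default_rng(0)
U, W = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
s = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a length-5 input sequence
    s = rnn_step(x_t, s, U, W)        # s carries the memory across time steps
```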
Theoretically, the simplest RNN model described by the above equation can handle sequence signals of arbitrary length. However, this model suffers from the vanishing-gradient problem and cannot effectively capture long-term dependencies. To solve this problem, an improved algorithm based on the recurrent neural network, the Long Short-Term Memory model (LSTM), was created. With carefully designed hidden-layer units, this approach can effectively solve the problem. An LSTM unit consists of four parts: a cell unit, an input gate, an output gate, and a forget gate. It can store the input information of past moments in the cell and thus model long time sequences. The key to the LSTM is the state of the cell unit. The cell state, like a conveyor belt, runs through the entire chain with only a few small linear operations acting on it, so information is easily kept unchanged. Adding or removing information is implemented through gate structures. A gate selectively passes information through a sigmoid neural layer and a point-wise multiplication operation. The sigmoid layer outputs a vector whose elements are real numbers between 0 and 1, representing the weight with which the corresponding information is passed: "0" means "let nothing through" and "1" means "let everything through". The LSTM protects and controls information through three gate structures, namely the input gate, the output gate, and the forget gate. The LSTM cell can be described by the standard gate equations:
I_t = σ(W_I · [h_{t-1}, x_t] + b_I)
F_t = σ(W_F · [h_{t-1}, x_t] + b_F)
C_t = F_t ⊙ C_{t-1} + I_t ⊙ tanh(W_C · [h_{t-1}, x_t] + b_C)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
Here, I_t is the input gate, which controls how much new information enters the memory unit. F_t is the forget gate, which controls how much of the previously stored information the memory unit discards. C_t is the memory unit, the sum of the input information modulated by the input gate and the previous memory modulated by the forget gate F_t. The output gate o_t allows the memory unit to affect the current hidden state, outputting or blocking its effect. Treating the LSTM computation as a black box, let f_LSTM(·) denote the transfer function of the LSTM unit at each moment; the output O_t at time t can then be obtained as follows. As the formula shows, the output at time t is computed from the preceding moments, which solves the problem that the RNN cannot capture long-term dependencies:
O_t = f_LSTM(x_t | x_1, x_2, ..., x_{t-1})
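To make the gate interactions concrete, here is a minimal NumPy sketch of a single LSTM step, assuming the concatenated input [h_{t-1}, x_t] and the standard sigmoid/tanh activations; the variable names are illustrative and not taken from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_c, W_o, b_i, b_f, b_c, b_o):
    """One LSTM time step: input, forget and output gates plus the cell update."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    i = sigmoid(W_i @ z + b_i)              # input gate I_t
    f = sigmoid(W_f @ z + b_f)              # forget gate F_t
    c_tilde = np.tanh(W_c @ z + b_c)        # candidate memory
    c = f * c_prev + i * c_tilde            # memory unit C_t
    o = sigmoid(W_o @ z + b_o)              # output gate o_t
    h = o * np.tanh(c)                      # hidden state h_t
    return h, c
```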
The recurrent neural network has good feature extraction, representation, and semantic understanding capabilities for sequence signals in the time domain. Text can be viewed as a sequence of signals: the various characteristics of text can be learned automatically by a recurrent neural network, and a high-quality natural text carrier can be reconstructed from these characteristics. However, existing methods for automatically generating steganographic text with a recurrent neural network ignore the fact that each word has a different sensitivity at each moment, so every word in the generated steganographic text embeds the same number of bits. The invention proposes a self-contraction mechanism that screens the candidate pool through a perplexity calculation. The different scores of each word during the automatic generation of the steganographic text are fully exploited, which markedly improves the quality of the generated steganographic text; the concealment of the steganographic system can thus be further optimized compared with previous methods.
From the above, automatically generating text with a recurrent neural network can overcome the problems of existing methods, improve the quality of the generated text, and improve the concealment of the steganographic system.
Disclosure of Invention
The invention provides an automatic text generation steganography method based on a candidate pool self-contraction mechanism. It belongs to steganography by carrier generation: from the secret bit stream to be hidden, and using the candidate pool self-contraction mechanism, it can automatically generate steganographic text with high readability. The method estimates a statistical language model from crawled text used as training samples. With the trained neural network model, high-quality text can be generated automatically. During text generation, a perplexity calculation is performed on the candidate pool, which is then further screened. Each word is encoded according to the conditional probability distribution of the screened candidate pool, and text generation is controlled by the bit stream. In this way, the model improves the quality of the generated steganographic text and the concealment of the system. To achieve the above object, the method comprises the following steps:
(1) building a text data set as a training set by crawling a large number of common natural texts on the web;
(2) preprocessing the data: English natural text is converted to all lowercase, and only letters and numbers are retained;
(3) modeling the texts in the training set, constructing a recurrent neural network model, and optimizing the performance of the model through the back-propagation algorithm;
(4) testing the loss value of the model, and adjusting the model training parameters according to the loss value;
(5) repeating the steps (3) to (4) until the parameters and the performance of the neural network model are stable;
(6) counting the word frequency distribution of the first word of each sentence in the training set, and selecting the first 200 words with the highest word frequency to form an initial word list;
(7) when a sentence of steganographic text is generated each time, randomly selecting a word from the initial word list as an initial word, and marking the beginning of the generation of the steganographic text;
(8) iteratively calculating, with the tuned recurrent neural network model and the given first word, the conditional probability distribution over the dictionary T of the word at each moment;
(9) at each time t, sorting the words in descending order of the conditional probability given by the iterative formula, calculating the score of the whole sentence when each word is combined with the first t-1 words according to the trained statistical language model, and normalizing the scores;
(10) determining the size of the candidate pool according to a preset embedding rate;
(11) after the size of the candidate pool is determined, performing a perplexity calculation from the conditional probability distribution at time t together with the previous t-1 moments;
(12) screening the candidate pool against a preset perplexity threshold; because the number of words remaining in the pool is not fixed, the screened candidate pool has a self-contracting property;
(13) if the perplexity of every word in the candidate pool at this moment exceeds the preset threshold, selecting the word with the highest score as the output of the current moment;
(14) otherwise, searching the code table according to the bit stream to be embedded, outputting at the current moment the word whose code matches the bit stream to be embedded, and thereby hiding the secret information in the process of generating the steganographic text;
(15) repeating the steps (8) to (14) until a complete steganographic sentence is generated, and completing the process of automatically generating the steganographic text according to the secret information;
(16) after receiving the steganographic text generated by the model, the receiver decodes it and recovers the secret message.
To ensure a high hiding capacity and the naturalness of the steganographic text, the method uses a recurrent neural network to perform text steganography with candidate pool self-contraction, where the candidate pool is screened mainly through a perplexity calculation. High-quality text can be generated automatically from the secret information. The model consists of three main modules, described below: the automatic text generation module, the information hiding module, and the information extraction module. The automatic text generation module models natural text, trains a statistical language model from a large number of samples using the information extraction capability of the neural network, and produces the word actually sent at each moment. The information hiding module performs a perplexity calculation on the conditional probability distribution and screens the candidate pool against a preset value; this screening is self-contracting. The secret bit stream is hidden through variable-length encoding of the words in the candidate pool. The information extraction module simulates the receiving end: it decodes the received natural text with the embedded secret information and recovers the secret information.
Automatic text generation based on RNN
In the automatic text generation process, the method mainly uses the feature extraction and representation capability of the RNN for sequence signals, combining the signals of the first t-1 moments to calculate the probability distribution of the signal at time t, as shown in the following formula:
O_t = f_LSTM(x_t | x_1, x_2, ..., x_{t-1})
Each sentence Sen in the training samples can be regarded as a sequence signal, and the signal at time t is the word at the current moment. Since the machine only recognizes mathematical symbols, each word, which carries rich semantic information, must be converted into a vector; that is, words must be mapped to word vectors. All word vectors have the same dimension and lie in the same semantic space R^n; each vector is a point in this space, and the distance between points expresses the semantic similarity of the corresponding words. Each sentence in the training samples can then be formulated as follows: each sentence Sen can be represented as a matrix S ∈ R^{m×n}, where the t-th row is the word vector of the t-th word in the sentence, m is the length of the sentence, and n is the dimension of the word vectors.
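As a toy illustration of this sentence-as-matrix representation, the sketch below builds S for a four-word sentence; the vocabulary, the embedding table, and the dimension n = 4 are invented for the example.

```python
import numpy as np

# hypothetical toy vocabulary and embedding table (n = 4 dimensions per word)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
embedding = np.random.default_rng(1).normal(size=(len(vocab), 4))

sentence = ["the", "cat", "sat", "down"]
# S has shape (m, n): one row (word vector) per word of the sentence
S = np.stack([embedding[vocab[w]] for w in sentence])
print(S.shape)   # -> (4, 4)
```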
Typically, a recurrent neural network consists of multiple layers, each layer having multiple LSTM units. Stacking LSTM layers increases the depth and enables more accurate extraction of the features of the training samples. The present invention mainly uses a 2-layer LSTM. The LSTM of the upper layer provides its sequential outputs to the LSTM of the lower layer, i.e., one output per input moment. Adjacent hidden layers are connected by a transfer matrix; for example, the transfer between the i-th layer and the (i+1)-th layer can be represented by a weight matrix.
As previously mentioned, the output at time t depends not only on the current input x_t but also on the vectors stored in the units at the previous moments. Thus, the output of the hidden layer at time t can be regarded as a summary of the first t moments, i.e., a fusion of the information of the first t moments. Based on this property, the method adds a softmax layer after all hidden layers of the model to calculate the probability distribution of the (t+1)-th word, i.e., the prediction weights for time t+1, where N denotes the number of words in the dictionary T over which the probability is computed. Each word is scored by its prediction weight.
Here W_P and b_P are a weight matrix and a bias shared across moments; the values in the weight matrix W_P reflect the importance of each feature o_i, and the output vector O has dimension N. At time t, the predicted distribution for the next time t+1 is:
y_{t+1} = softmax(W_P · O_t + b_P)
To update the parameters of the neural network, the performance of the recurrent neural network is optimized through the back-propagation algorithm: the loss function is minimized by iterative optimization of the network so as to obtain the language model that best fits the semantic space. The method defines the loss function of the whole network as the negative logarithm of the statistical probability of each sentence:
Loss = - Σ_t log p(x_t | x_1, x_2, ..., x_{t-1})
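A minimal sketch of the prediction layer and this sentence-level loss follows; o_t stands for the hidden output at time t and W_P, b_P for the shared projection parameters, mirroring the formulas above (the function names are illustrative, not the patent's code).

```python
import numpy as np

def softmax(z):
    z = z - z.max()                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(o_t, W_P, b_P):
    """Prediction weights for the (t+1)-th word given the hidden output o_t."""
    return softmax(W_P @ o_t + b_P)            # shape (N,): one probability per word in T

def sentence_loss(step_distributions, word_ids):
    """Negative log-likelihood of one sentence given the per-step distributions."""
    return -sum(np.log(dist[w]) for dist, w in zip(step_distributions, word_ids))
```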
information hiding algorithm
In the information hiding module, the method performs self-contracting screening of the candidate pool through a perplexity calculation and encodes the screened words based on their conditional probability distribution, thereby forming a mapping from the binary bit stream to the word space. The method relies mainly on the following fact: when the model is well trained, there are multiple plausible choices at each moment. After the conditional probability distribution of all words in the dictionary T is sorted in descending order, a candidate pool is selected according to the embedding rate b, and the number of words in the Candidate Pool (CP) is 2^b.
Because each word has a different influence on the probability distribution of the next word at the current moment, the self-contraction mechanism calculates, for each word in the candidate pool, the perplexity of its conditional probability at time t+1 combined with the outputs of the previous t moments:
ppl_{t+1,j} = ( p_{t+1,j} · Π_{i=1}^{t} p_i )^{-1/N}
where {p_1, ..., p_t} are the occurrence probabilities of the words already generated at the first t moments, p_{t+1,j} is the probability of the j-th word of the preset candidate pool at time t+1, and N is the total number of words up to time t+1. ppl_{t+1,j} is therefore the perplexity of the partial sentence up to time t+1 if the j-th word is selected at that moment. The calculated value is compared with a preset value to finely screen the candidate pool: when the value is lower than the preset value, the candidate word remains in the candidate pool; when it is above the threshold, the word is removed from the candidate pool, i.e.:
CP' = { word_j ∈ CP : ppl_{t+1,j} < ppl }
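The sketch below shows this self-contraction step in Python: the top 2^b words form the candidate pool, the perplexity of each candidate combined with the already-generated words is computed, and words above the preset threshold are dropped. The function names and the exact perplexity form (the N-th root of the inverse joint probability) are this sketch's assumptions.

```python
import numpy as np

def perplexity_with_candidate(prev_word_probs, candidate_prob):
    """Perplexity of the partial sentence if this candidate word is chosen next."""
    probs = list(prev_word_probs) + [candidate_prob]
    n = len(probs)
    return float(np.exp(-np.sum(np.log(probs)) / n))

def shrink_candidate_pool(dist, prev_word_probs, b, ppl_threshold):
    """Self-contraction: take the top 2^b word ids by probability from `dist`,
    then drop every candidate whose perplexity exceeds the preset threshold."""
    top_ids = np.argsort(dist)[::-1][: 2 ** b]           # candidate pool CP
    return [(w, float(dist[w])) for w in top_ids
            if perplexity_with_candidate(prev_word_probs, dist[w]) < ppl_threshold]
```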
After the candidate pool is screened, the conditional probability distribution of each word in the candidate pool needs to be encoded to embed the secret information. Huffman coding is a lossless, variable-length coding scheme. It is based on a probability statistical model and is therefore well suited to encoding the conditional probability distribution of each word. It builds an instantaneous (prefix) code using a tree structure and can fully exploit the conditional probability distribution of each word: words with high occurrence probability receive shorter codes and words with low probability receive longer codes, which guarantees the minimum average code length. Huffman coding is therefore the optimal variable-length coding scheme here. During secret information embedding, the word transmitted at the previous moment is used as the root node of the tree, and the leaf nodes of the tree represent the words in the candidate pool.
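A compact sketch of constructing such a Huffman code over the (screened) candidate pool with Python's heapq module follows; this is the standard textbook construction, not necessarily the patent's exact implementation.

```python
import heapq
from itertools import count

def huffman_code(pool):
    """pool: list of (word, probability) pairs. Returns {word: bit string}."""
    if len(pool) == 1:                       # degenerate pool: one word, one zero bit
        return {pool[0][0]: "0"}
    tiebreak = count()                       # avoids comparing nodes on equal probability
    heap = [(p, next(tiebreak), w) for w, p in pool]
    heapq.heapify(heap)
    codes = {w: "" for w, _ in pool}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        for w in _leaves(left):
            codes[w] = "0" + codes[w]        # prepend a bit as the tree grows upward
        for w in _leaves(right):
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    return codes

def _leaves(node):
    """Yield all words stored under a (possibly nested) tree node."""
    if isinstance(node, tuple):
        for child in node:
            yield from _leaves(child)
    else:
        yield node
```

Words with high probability end up with short codewords, so the average number of bits embedded per word stays close to the entropy of the screened pool.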
In the information hiding process based on Huffman coding, the secret information to be embedded is read in sequence, and the output at the current moment is selected according to the Huffman code corresponding to a word in the candidate pool. To avoid two identical bit sequences producing identical sentences, the method constructs a start word list: the frequency of the first word of each sentence in the text data set is counted, and after sorting in descending order the 200 most frequent words are selected to form the start word list. During generation, a word from the start word list is chosen at random as the beginning of the generated steganographic sentence, which ensures that the generated text does not repeat.
The model details and the embedding of the secret information are shown in Fig. 3. The details of the information hiding method are given in Algorithm 1. With this method, steganographic text with high readability can be generated from the input secret bit stream. These texts have strong concealment and are not easily recognized by attackers.
Information extraction algorithm
Information extraction means that, after receiving a publicly transmitted sentence, the receiver must correctly decode the secret information contained in it. The processes of information hiding and information extraction are essentially the same: the same RNN model is used to compute the conditional probability distribution of each word at each moment, and the perplexity calculation and screening are performed on the candidate pool. Notably, both sides must construct the same candidate pool and encode the words in it with the same coding method.
After receiving the text, the receiver inputs the first word of each sentence into the RNN as the starting word for decoding, and the RNN in turn calculates the distribution probability of the words at the subsequent moments. At each moment, after obtaining the probability distribution of the current word, the receiver first sorts all words in the dictionary T in descending order and selects the top 2^b words to form a candidate pool. Then the joint perplexity of each word in the candidate pool at time t+1 with the previous t moments is calculated, and the candidate pool is finely screened against the preset value. A Huffman tree is constructed for the resulting new candidate pool according to the same rule, and the words in the pool are encoded. Finally, according to the code corresponding to the word actually received at the current moment, the bits hidden in that word are decoded. In this way, the bit stream hidden in the original text can be extracted without distortion.
Algorithm 1 information hiding algorithm
Input:
Secret bit stream: B = {0, 0, 1, 0, 1}
Embedding rate: b
Start word list: Start_word = {word_1, word_2, ..., word_200}
Perplexity preset value: ppl
Output:
Generated steganographic texts: Text = {S_1, S_2, ..., S_N}
1. preparing data and training an RNN model;
2. while the bit stream B is not exhausted:
3. if not the end of the current sentence:
4. calculating the probability distribution of the next word by using the trained RNN according to the generated word;
5. sorting the prediction probabilities of all the words in a descending order and calculating the score of the whole sentence when the word is combined with the first t-1 words according to a trained statistical language model;
6. selecting the top 2^b words to form a candidate pool;
7. performing the perplexity calculation on the candidate pool together with the output before the current moment to obtain ppl_{t+1,j};
8. if ppl_{t+1,j} is less than the preset value ppl:
9. keeping the corresponding word;
10. otherwise:
11. removing it from the candidate pool;
12. constructing a Huffman tree for the conditional probability distribution of each word in the screened candidate pool;
13. reading the bit stream from B and searching the Huffman tree until a leaf node is found, and outputting the corresponding word at the current moment;
14. otherwise:
15. randomly selecting a start word from the start word list Start_word as the beginning of the next sentence;
16. If not the end of the current sentence:
17. selecting the word with the highest probability outside the candidate pool as the output of the current moment;
18. selecting the word with the highest probability at each moment as output until the end of the sentence;
19. returning the generated sentence.
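Putting the pieces together, here is a simplified Python sketch of the embedding loop of Algorithm 1. It reuses the shrink_candidate_pool and huffman_code helpers sketched above, and lm_step is a hypothetical stand-in for the trained RNN language model (assumed to return a NumPy probability vector over dictionary ids plus an id-to-word dict); none of these names come from the patent.

```python
def embed_bits(bits, lm_step, start_word, b, ppl_threshold, end_token="<eos>"):
    """Generate one steganographic sentence while consuming bits from `bits`.

    `bits` is a list of 0/1 integers, consumed in place; it is assumed to be padded
    so that a full Huffman codeword can always be matched against it.
    """
    sentence, prev_probs = [start_word], []
    while bits:
        dist, id_to_word = lm_step(sentence)                   # conditional distribution
        pool = shrink_candidate_pool(dist, prev_probs, b, ppl_threshold)
        if len(pool) < 2:                                      # pool contracted too far:
            chosen = int(pool[0][0]) if pool else int(dist.argmax())   # hide no bits
        else:
            codes = huffman_code(pool)
            # the unique codeword that prefixes the remaining bit stream picks the word
            chosen = next(w for w, c in codes.items()
                          if bits[: len(c)] == [int(x) for x in c])
            del bits[: len(codes[chosen])]
        sentence.append(id_to_word[chosen])
        prev_probs.append(float(dist[chosen]))
        if id_to_word[chosen] == end_token:
            break
    return " ".join(sentence)
```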
Algorithm 2 information extraction algorithm
Input:
Generated sentences: Text = {S_1, S_2, ..., S_N}
Embedding rate: b
Perplexity preset value: ppl
Output:
Secret bit stream: B = {0, 0, 1, 0, 1}
1. For each sentence S in the text, the following is performed:
2. inputting a first word of the sentence S in the trained RNN model as the beginning of decoding;
3. for each word word_i in the sentence S, carrying out the following steps:
4. calculating the probability distribution of the next word using the trained RNN based on the previous word;
5. sorting the prediction probabilities of all words in descending order and selecting the top 2^b words to construct a candidate pool;
6. performing the perplexity calculation on the candidate pool together with the output before the current moment to obtain ppl_{t+1,j};
7. if ppl_{t+1,j} is less than the preset value ppl:
8. keeping the corresponding word;
9. otherwise:
10. removing it from the candidate pool;
11. encoding each word in the screened candidate pool by using a Huffman code;
12. if word_i is in the candidate pool, then:
13. finding and outputting the Huffman code matching the word word_i actually received at this moment;
14. extracting a corresponding bitstream and appending it to B;
15. otherwise:
16. the information extraction process is finished;
17. returning the extracted secret bit stream B
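A corresponding sketch of the extraction side (Algorithm 2) is given below, again assuming the hypothetical lm_step interface and reusing the shrink_candidate_pool and huffman_code helpers from the sketches above; the receiver rebuilds the same candidate pools and codes to recover the bits.

```python
def extract_bits(sentence_words, lm_step, b, ppl_threshold):
    """Recover the hidden bits from one received steganographic sentence."""
    bits, prev_probs = [], []
    context = [sentence_words[0]]                      # the start word carries no bits
    for word in sentence_words[1:]:
        dist, id_to_word = lm_step(context)            # same model state as the sender
        word_to_id = {w: i for i, w in id_to_word.items()}
        pool = shrink_candidate_pool(dist, prev_probs, b, ppl_threshold)
        if len(pool) >= 2:                             # only such pools carry bits
            codes = huffman_code(pool)
            wid = word_to_id[word]
            if wid in codes:                           # the word was selected by the bit stream
                bits.extend(int(x) for x in codes[wid])
            else:                                      # word fell outside the pool: stop decoding
                break
        context.append(word)
        prev_probs.append(float(dist[word_to_id[word]]))
    return bits
```

Because the sender and receiver run the same model, the same pool contraction, and the same Huffman construction at every step, the recovered bit stream matches the embedded one, which is what steps 12-16 of Algorithm 2 express.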
Drawings
FIG. 1 is a block diagram of a neural network of the present invention
FIG. 2 is a schematic diagram of a recurrent neural network used in the present invention
Fig. 3 is a detailed diagram of the model and the information hiding algorithm proposed in the present invention.
Claims (2)
1. A text automatic generation steganography method based on a candidate pool self-contraction mechanism, comprising the following steps:
(1) building a text data set as a training set by crawling a large number of common natural texts on the web;
(2) preprocessing the data: English natural text is converted to all lowercase, and only letters and numbers are retained;
(3) modeling the texts in the training set, constructing a recurrent neural network model, and optimizing the performance of the model through the back-propagation algorithm;
(4) testing the loss value of the model, and adjusting the model training parameters according to the loss value;
(5) repeating the steps (3) to (4) until the parameters and the performance of the neural network model are stable;
(6) counting the word frequency distribution of the first word of each sentence in the training set, and selecting the first 200 words with the highest word frequency to form an initial word list;
(7) When a sentence of steganographic text is generated each time, randomly selecting a word from the initial word list as an initial word, and marking the beginning of the generation of the steganographic text;
(8) iteratively calculating the conditional probability distribution of the words at each moment through the adjusted recurrent neural network model and the selected first word;
(9) at each time t, sorting the words in descending order of the conditional probability given by the iterative formula, calculating the score of the whole sentence when each word is combined with the first t-1 words according to the trained statistical language model, and normalizing the scores;
(10) determining the size of the candidate pool according to a preset embedding rate;
(11) after the size of the candidate pool is determined, performing a perplexity calculation from the conditional probability distribution at time t together with the previous t-1 moments;
(12) screening the candidate pool against a preset perplexity threshold; because the number of words remaining in the pool is not fixed, the screened candidate pool has a self-contracting property;
(13) if the perplexity of every word in the candidate pool at this moment exceeds the preset threshold, selecting the word with the highest score as the output of the current moment;
(14) otherwise, searching the coding table according to the bit stream to be embedded, outputting at the current moment the word whose code matches the bit stream to be embedded, and thereby hiding the secret information in the process of generating the steganographic text;
(15) Repeating the steps (8) to (14) until a complete steganographic sentence is generated, and completing the process of automatically generating the steganographic text according to the secret information;
(16) after receiving the steganographic text generated by the model, the receiver decodes it and recovers the secret message.
2. The text automatic generation steganography method based on the candidate pool self-contraction mechanism according to claim 1, wherein the candidate pool self-contraction mechanism is implemented as described in steps (11), (12), and (13), which improves the concealment of the steganographic system and the readability of the generated steganographic text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004851.2A CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911004851.2A CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111859407A true CN111859407A (en) | 2020-10-30 |
Family
ID=72970670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911004851.2A Pending CN111859407A (en) | 2019-10-16 | 2019-10-16 | Text automatic generation steganography method based on candidate pool self-contraction mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111859407A (en) |
- 2019-10-16: Application CN201911004851.2A filed in CN; patent CN111859407A (en), status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202010A (en) * | 2016-07-12 | 2016-12-07 | 重庆兆光科技股份有限公司 | The method and apparatus building Law Text syntax tree based on deep neural network |
CN108923922A (en) * | 2018-07-26 | 2018-11-30 | 北京工商大学 | A kind of text steganography method based on generation confrontation network |
CN109711121A (en) * | 2018-12-27 | 2019-05-03 | 清华大学 | Text steganography method and device based on Markov model and Huffman encoding |
CN109815496A (en) * | 2019-01-22 | 2019-05-28 | 清华大学 | Based on capacity adaptive shortening mechanism carrier production text steganography method and device |
Non-Patent Citations (1)
Title |
---|
张培晶 (Zhang Peijing); 宋蕾 (Song Lei): "Review of research on LDA-based topic modeling methods for microblog text" (基于LDA的微博文本主题建模方法研究述评) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762523A (en) * | 2021-01-26 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Text generation method and device, storage medium and electronic equipment |
CN117131203A (en) * | 2023-08-14 | 2023-11-28 | 湖北大学 | Knowledge graph-based text generation steganography method, related method and device |
CN117131203B (en) * | 2023-08-14 | 2024-03-22 | 湖北大学 | Knowledge graph-based text generation steganography method, related method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201030 |