CN115952528B - Multi-scale combined text steganography method and system - Google Patents
- Publication number: CN115952528B
- Application number: CN202310240044.0A
- Authority: CN (China)
- Prior art keywords
- text
- steganography
- word
- joint
- replacement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a multi-scale joint text steganography method and system. The method comprises the following steps: acquiring a text sequence and secret information; inputting the text sequence into a pre-constructed generation-replacement joint model and obtaining the generation probability distribution of each word; performing a steganography operation on the text sequence according to the secret information and the generation probability distribution to obtain a first steganographic text and a steganography record; determining the non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the pre-constructed generation-replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word; performing a steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution to obtain a second steganographic text; and generating a joint steganographic text from the first steganographic text and the second steganographic text. The method and system can solve the technical problems of low steganographic-text quality and low embedding rate in traditional text steganography algorithms.
Description
Technical Field
The invention relates to a multi-scale joint text steganography method and system, and belongs to the technical field of information hiding.
Background
Text steganography is a method of embedding secret information in text for secure transmission, and is mainly used to realize covert communication. The most important difference between text steganography and cryptography is that text steganography conceals the very existence of the hidden information rather than merely its content. Text steganography therefore has unique advantages in protecting information security. However, traditional text steganography algorithms suffer from problems such as low steganographic-text quality and low embedding rate.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, provides a multi-scale joint text steganography method and system, and solves the technical problems of low steganographic-text quality and low embedding rate in traditional text steganography algorithms.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a multi-scale joint text steganography method, including:
acquiring a text sequence and secret information;
inputting the text sequence into a pre-constructed generation-replacement joint model, and obtaining the generation probability distribution of each word;
performing a steganography operation on the text sequence according to the secret information and the generation probability distribution to obtain a first steganographic text and a steganography record;
determining the non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the pre-constructed generation-replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
performing a steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution to obtain a second steganographic text;
generating a joint steganographic text from the first steganographic text and the second steganographic text.
Optionally, the construction process of the generation-replacement joint model includes:
acquiring a preset number of text data;
preprocessing text data, and constructing a sample set based on the preprocessed text data;
dividing a sample set into a training set and a verification set according to a preset proportion;
constructing the generation-replacement joint model based on PyTorch, wherein the generation-replacement joint model comprises a generation model and a replacement model;
performing iterative training on the generation-replacement joint model using the training set; after the iterative training, verifying the trained generation-replacement joint model using the verification set; and after verification, retaining and outputting the generation-replacement joint model with the minimum loss.
Optionally, the preprocessing includes:
dividing the text data, reserving words in the division result and generating word sequences;
taking the first n-1 words of the word sequence as a sample and the last n-1 words of the word sequence as a label, where n is the total number of words in the word sequence;
if the length of a sample or label is smaller than a preset length threshold N, padding the tail of that sample or label with a filler symbol until its length equals the preset length threshold N;
if the length of a sample or label is larger than the preset length threshold N, truncating the tail of that sample or label until its length equals the preset length threshold N.
Optionally, iteratively training the generation-replacement joint model using the training set includes:
inputting the samples in the training set into the generation-replacement joint model, and obtaining the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
feeding the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with its label, into a cross-entropy loss function to compute the losses loss_g and loss_r, and taking their sum as the total loss: loss = loss_g + loss_r;
back-propagating the loss to obtain the parameter gradients of the generation-replacement joint model, and performing parameter optimization with an Adam optimizer;
bringing the parameter-optimized generation-replacement joint model back into the iterative-training step and iterating until the loss converges, then outputting the trained generation-replacement joint model.
Optionally, obtaining the generation probability distribution output by the generation model includes:
extracting the time-sequence relation feature vectors of the words in the sample one by one using an LSTM, and forming the time-sequence relation feature matrix M;
calculating the relation weight of each word in the sample on the time-sequence features through a multi-head self-attention mechanism, and expressing it as the attention matrix A:

  head_i = softmax( (M W_i^Q)(M W_i^K)^T / sqrt(d) ) (M W_i^V),  i = 1, ..., H
  A = sigma( Concat(head_1, ..., head_H) W^O )

where head_i is the output feature vector of attention head i, H is the total number of attention heads, W_i^Q, W_i^K and W_i^V are the parameter matrices of attention head i corresponding to the query, key and value vectors, W^O is the attention parameter matrix, d is the dimension of the time-sequence relation feature vector, Concat is the concatenation operation, and sigma is a sigmoid function;
multiplying the time-sequence relation feature matrix M by the attention matrix A to obtain the time feature matrix T of each time step:

  T = M A
Mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain a word embedding vector of each word;
constructing a graph structure G = (V, E), and taking the word embedding vectors of all words in the sample as the node set V of the graph structure, i.e. V = {v_1, v_2, ..., v_m}, where m is the number of words in the sample;
extracting the spatial relations of all words in the sample by a sliding-window algorithm to build the edge set E of the graph structure, i.e. E = {e_1, e_2, ..., e_k}, where k is the number of edges;
extracting the spatial relation feature vector of each node from the graph structure G using a GAT, calculating the spatial features through a multi-head self-attention mechanism, and expressing them as the attention coefficients alpha_ij:

  alpha_ij = exp( LeakyReLU( a^T [W h_i || W h_j] ) ) / sum_{k in N_i} exp( LeakyReLU( a^T [W h_i || W h_k] ) )

where alpha_ij is the attention coefficient from node i to node j, N_i is the neighborhood of node i, h_i, h_j and h_k are the spatial relation feature vectors of nodes i, j and k, W is the linear transformation weight matrix of each node, a is a weight vector, LeakyReLU is the activation function, and || denotes splicing two vectors;
multiplying the attention coefficients alpha_ij by the spatial relation feature vectors of the nodes, and updating the spatial relation feature vectors of the nodes through the multi-head self-attention mechanism to generate the spatial feature matrix S:

  h_i' = sigma( sum_{j in N_i} alpha_ij W h_j ),  S = [h_1'; h_2'; ...; h_m']
performing feature fusion on the time feature matrix T and the spatial feature matrix S through the first fully connected layer and an activation function to obtain the fusion feature matrix F:

  F = sigma( [T || S] W^(1) )

where W^(1) is the parameter matrix of the first fully connected layer;
performing prediction on the fusion feature matrix F through the second fully connected layer and an activation function, and outputting the generation probability distribution P_g:

  P_g = softmax( F W^(2) + b^(1) )

where W^(2) is the parameter matrix of the second fully connected layer and b^(1) is the first bias parameter.
Optionally, obtaining the replacement probability distribution output by the replacement model includes:
randomly selecting a plurality of words in the sample and replacing them with a symbol representing a mask, to obtain a masked sample;
mapping the masked sample to a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors e of the words:

  e = Embed( X_mask )

where X_mask is the masked sample and Embed is the embedding vector layer;
performing prediction on the feature mapping vectors e through the third fully connected layer and an activation function, and outputting the set of replacement probability distributions P_r:

  P_r = sigma( e W^(3) + b^(2) )

where W^(3) is the parameter matrix of the third fully connected layer, b^(2) is the second bias parameter, and sigma is a sigmoid function;
taking the probability distributions in the set P_r at the positions of the masked words as the output.
Optionally, performing the steganography operation on the text sequence according to the secret information and the generation probability distribution includes:
for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
taking out the first 2^k generation probabilities after sorting as the generation candidate pool, where k is the preset maximum number of bits embedded per word;
calculating the ratio r of the first generation probability p_1 to the second generation probability p_2 in the generation candidate pool:

  r = p_1 / p_2

if the ratio is greater than a preset ratio threshold t, outputting the word corresponding to the first generation probability as the word in the text sequence, and recording that the word in the text sequence is not steganographic;
if the ratio is less than or equal to the preset ratio threshold t, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree;
converting the secret information into a binary bit stream, and initializing the value s = 1;
when a code in the coding set is identical to the first s bits of the binary bit stream, outputting the word corresponding to that code's generation probability as the word in the text sequence, and recording the word in the text sequence as steganographic; when no code in the coding set is identical to the first s bits of the binary bit stream, letting s = s + 1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
Optionally, performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution includes:
for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the first 2^k replacement probabilities after sorting as the replacement candidate pool, where k is the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree;
converting the secret information into a binary bit stream, and initializing the value s = 1;
when a code in the coding set is identical to the first s bits of the binary bit stream, outputting the word corresponding to that code's replacement probability as the non-steganographic word's output, and recording the non-steganographic word as steganographic; when no code in the coding set is identical to the first s bits of the binary bit stream, letting s = s + 1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
In a second aspect, the present invention provides a multi-scale joint text steganography system comprising:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation-replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for performing a steganography operation on the text sequence according to the secret information and the generation probability distribution, and obtaining a first steganographic text and a steganography record;
the replacement module is used for determining the non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the pre-constructed generation-replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
In a third aspect, the present invention provides a secret information extraction method based on the above-mentioned multi-scale joint text steganography method, including:
acquiring a joint steganography text;
inputting the joint steganographic text into the pre-constructed generation-replacement joint model, and obtaining the generation probability distribution of each word;
performing an extraction operation according to the generation probability distribution and the joint steganographic text to obtain a first extracted text and an extraction record;
determining the unextracted words in the joint steganographic text according to the extraction record, inputting the joint steganographic text into the pre-constructed generation-replacement joint model, and obtaining the replacement probability distribution of each unextracted word;
extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
generating secret information according to the first extracted text and the second extracted text;
wherein performing the extraction operation according to the generation probability distribution and the joint steganographic text includes the following steps:
for each word in the joint steganographic text, arranging the generation probability distribution in descending order of generation probability;
taking out the first 2^k generation probabilities after sorting as the generation candidate pool, where k is the preset maximum number of bits embedded per word;
calculating the ratio r of the first generation probability p_1 to the second generation probability p_2 in the generation candidate pool:

  r = p_1 / p_2

if the ratio is greater than the preset ratio threshold t, taking the word corresponding to the first generation probability as the output for the word in the joint steganographic text, and recording the word in the joint steganographic text as not extracted;
if the ratio is less than or equal to the preset ratio threshold t, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree;
when the coding set contains a code whose corresponding word is identical to the word in the joint steganographic text, outputting that code as the extraction result for the word in the joint steganographic text, and recording the word in the joint steganographic text as extracted;
performing the extraction operation according to the replacement probability distribution and the joint steganographic text includes the following steps:
for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the first 2^k replacement probabilities after sorting as the replacement candidate pool, where k is the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree;
when the coding set contains a code whose corresponding word is identical to the word in the joint steganographic text, outputting that code as the extraction result for the unextracted word, and recording the unextracted word as extracted.
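Extraction works because both sides derive the same candidate pool and the same Huffman tree from the shared generation-replacement joint model: the code assigned to the word actually observed in the joint steganographic text is precisely the embedded bit fragment. A minimal sketch with a hypothetical coding set (the `codes` mapping is an illustrative stand-in, not from the patent):

```python
def extract_bits(received_index, codes):
    # The receiver rebuilds the identical candidate pool and Huffman codes,
    # so looking up the received word's code recovers the secret bits.
    return codes[received_index]

# Hypothetical Huffman coding set for a 3-word candidate pool.
codes = {0: "0", 1: "10", 2: "11"}
```

If the second candidate word (index 1) appears in the steganographic text, the recovered fragment is "10".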
Compared with the prior art, the invention has the following beneficial effects:
in the multi-scale joint text steganography method and system, a generation-replacement joint model combining a generation model and a replacement model is constructed, which guarantees the feature consistency of the generation model and the replacement model; applying the generation-replacement joint model to the steganography process exploits text redundancy to the maximum extent, improves the embedding rate, and ensures the quality of the steganographic text.
Drawings
Fig. 1 is a flowchart of a multi-scale joint text steganography method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a secret information extraction method of a multi-scale joint text steganography method according to a third embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
as shown in fig. 1, an embodiment of the present invention provides a multi-scale joint text steganography method, including the following steps:
s1, acquiring a text sequence and secret information;
s2, inputting the text sequence into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
s3, performing steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
s4, determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining replacement probability distribution of each non-steganographic word;
s5, performing steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
s6, generating a joint steganography text according to the first steganography text and the second steganography text.
1. The construction process of the generation-replacement joint model includes the following steps:
11. acquiring a preset number of text data; for example, 300,000 pieces of text data are obtained from the OPUS dataset website;
12. preprocessing text data, and constructing a sample set based on the preprocessed text data;
wherein the preprocessing comprises the following steps:
121. dividing the text data, reserving words in the division result and generating word sequences;
122. taking the first n-1 words of the word sequence as a sample and the last n-1 words of the word sequence as a label, where n is the total number of words in the word sequence;
123. if the length of a sample or label is smaller than a preset length threshold N, padding the tail of that sample or label with a filler symbol until its length equals the preset length threshold N;
124. if the length of a sample or label is larger than the preset length threshold N, truncating words at the tail of that sample or label until its length equals the preset length threshold N;
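Steps 121-124 amount to the standard language-model pairing of a word sequence with its one-step-shifted copy, padded or truncated to a fixed length. A minimal sketch in Python (the names `preprocess` and `<pad>` are illustrative, not from the patent):

```python
PAD = "<pad>"

def preprocess(text, N):
    words = text.split()                      # step 121: split, keep the words
    sample, label = words[:-1], words[1:]     # step 122: first/last n-1 words
    def fit(seq):
        if len(seq) < N:                      # step 123: pad the tail
            return seq + [PAD] * (N - len(seq))
        return seq[:N]                        # step 124: truncate the tail
    return fit(sample), fit(label)
```

For example, `preprocess("a b c", 4)` yields the sample `['a', 'b', '<pad>', '<pad>']` and the label `['b', 'c', '<pad>', '<pad>']`.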
13. dividing the sample set into a training set and a verification set according to a preset proportion; the proportion is usually set to 8:2; in addition, all words appearing in the training set are counted and their word frequencies are calculated, the words whose frequency meets a preset word-frequency threshold are added to a dictionary, and the length of the dictionary is the length of the probability distribution output by the model; the words in the dictionary correspond one-to-one to the probabilities in the probability distribution;
14. constructing the generation-replacement joint model based on PyTorch, wherein the generation-replacement joint model comprises a generation model and a replacement model; the generation model is used to output the generation probability distribution, and the replacement model is used to output the replacement probability distribution;
15. performing iterative training on the generation-replacement joint model using the training set; after the iterative training, verifying the trained generation-replacement joint model using the verification set; and after verification, retaining and outputting the generation-replacement joint model with the minimum loss.
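The dictionary construction described in step 13 can be sketched as a frequency filter over the training-set words (the names `build_dictionary` and `min_freq` are illustrative assumptions):

```python
from collections import Counter

def build_dictionary(training_words, min_freq):
    # Count word frequencies over the training set and keep the words whose
    # frequency meets the preset threshold; the dictionary length equals the
    # length of the probability distribution the model outputs.
    counts = Counter(training_words)
    return [w for w, c in counts.items() if c >= min_freq]
```

The length of the returned list is then the length of the model's output probability distribution, with each dictionary word mapped to one probability.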
Iteratively training the generation-replacement joint model using the training set includes:
151. inputting the samples in the training set into the generation-replacement joint model, and obtaining the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
152. feeding the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with its label, into a cross-entropy loss function to compute the losses loss_g and loss_r, and taking their sum as the total loss: loss = loss_g + loss_r;
153. back-propagating the loss to obtain the parameter gradients of the generation-replacement joint model, and performing parameter optimization with an Adam optimizer;
154. bringing the parameter-optimized generation-replacement joint model back into the iterative-training step (i.e. returning to step 151) and iterating until the loss converges, then outputting the trained generation-replacement joint model.
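The loss of step 152 is simply the sum of two cross-entropy terms, one per sub-model. A toy sketch on plain Python lists (in the actual PyTorch model these would be two `nn.CrossEntropyLoss` terms optimized by Adam; the names here are illustrative):

```python
import math

def cross_entropy(probs, label_idx):
    # -log p(label): the loss for one predicted probability distribution.
    return -math.log(probs[label_idx])

def joint_loss(gen_probs, gen_label, rep_probs, rep_label):
    # step 152: the total loss is the generation-model loss loss_g plus
    # the replacement-model loss loss_r.
    return cross_entropy(gen_probs, gen_label) + cross_entropy(rep_probs, rep_label)
```

A perfectly confident replacement prediction contributes zero, so the total reduces to the generation term alone.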
In step 151, obtaining the generation probability distribution output by the generation model includes:
(1.1) extracting the time-sequence relation feature vectors of the words in the sample one by one using an LSTM, and forming the time-sequence relation feature matrix M;
(1.2) calculating the relation weight of each word in the sample on the time-sequence features through a multi-head self-attention mechanism, and expressing it as the attention matrix A:

  head_i = softmax( (M W_i^Q)(M W_i^K)^T / sqrt(d) ) (M W_i^V),  i = 1, ..., H
  A = sigma( Concat(head_1, ..., head_H) W^O )

where head_i is the output feature vector of attention head i, H is the total number of attention heads, W_i^Q, W_i^K and W_i^V are the parameter matrices of attention head i corresponding to the query, key and value vectors, W^O is the attention parameter matrix, d is the dimension of the time-sequence relation feature vector, Concat is the concatenation operation, and sigma is a sigmoid function;
(1.3) multiplying the time-sequence relation feature matrix M by the attention matrix A to obtain the time feature matrix T of each time step:

  T = M A
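The per-head weight in step (1.2) is the usual scaled dot-product form softmax(Q K^T / sqrt(d)). A dependency-free sketch for one head on nested lists (illustrative names; the real model would use PyTorch tensors):

```python
import math

def attention_matrix(Q, K, d):
    # Raw scores: dot products of each query row with each key row, scaled
    # by sqrt(d) to keep magnitudes stable.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    out = []
    for row in scores:
        m = max(row)                         # shift for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])    # softmax over key positions
    return out
```

Each output row is a probability distribution over the input positions, so a query attends most to the key it aligns with best.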
(2.1) mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain a word embedding vector of each word;
(2.2) constructing a graph structure G = (V, E), and taking the word embedding vectors of all words in the sample as the node set V of the graph structure, i.e. V = {v_1, v_2, ..., v_m}, where m is the number of words in the sample;
(2.3) extracting the spatial relations of all words in the sample by a sliding-window algorithm to build the edge set E of the graph structure, i.e. E = {e_1, e_2, ..., e_k}, where k is the number of edges;
(2.4) extracting the spatial relation feature vector of each node from the graph structure G using a GAT, calculating the spatial features through a multi-head self-attention mechanism, and expressing them as the attention coefficients alpha_ij:

  alpha_ij = exp( LeakyReLU( a^T [W h_i || W h_j] ) ) / sum_{k in N_i} exp( LeakyReLU( a^T [W h_i || W h_k] ) )

where alpha_ij is the attention coefficient from node i to node j, N_i is the neighborhood of node i, h_i, h_j and h_k are the spatial relation feature vectors of nodes i, j and k, W is the linear transformation weight matrix of each node, a is a weight vector, LeakyReLU is the activation function, and || denotes splicing two vectors;
(2.5) multiplying the attention coefficients alpha_ij by the spatial relation feature vectors of the nodes, and updating the spatial relation feature vectors of the nodes through the multi-head self-attention mechanism to generate the spatial feature matrix S:

  h_i' = sigma( sum_{j in N_i} alpha_ij W h_j ),  S = [h_1'; h_2'; ...; h_m']
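The coefficient of step (2.4) is a softmax over LeakyReLU-scored neighbor pairs, as in the standard GAT formulation. A toy single-head sketch, assuming the linear transformation W has already been applied to the node vectors (all names illustrative):

```python
import math

def gat_coefficients(h_i, neighbors, a, leaky_slope=0.2):
    # alpha_ij = softmax_j( LeakyReLU( a . [h_i || h_j] ) ) over node i's
    # neighborhood; h_i || h_j is the concatenation of the two vectors.
    def leaky_relu(x):
        return x if x > 0 else leaky_slope * x
    scores = [leaky_relu(sum(ai * xi for ai, xi in zip(a, h_i + h_j)))
              for h_j in neighbors]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Neighbors whose concatenated features score higher receive proportionally more of the node's attention when the feature vectors are updated.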
(3.1) performing feature fusion on the time feature matrix T and the spatial feature matrix S through the first fully connected layer and an activation function to obtain the fusion feature matrix F:

  F = sigma( [T || S] W^(1) )

where W^(1) is the parameter matrix of the first fully connected layer;
(3.2) performing prediction on the fusion feature matrix F through the second fully connected layer and an activation function, and outputting the generation probability distribution P_g:

  P_g = softmax( F W^(2) + b^(1) )

where W^(2) is the parameter matrix of the second fully connected layer and b^(1) is the first bias parameter.
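Steps (3.1)-(3.2) concatenate the two feature sets and pass them through two dense layers. A single-time-step sketch on plain lists, assuming a ReLU for the fusion activation and a softmax for the output (both activations are assumptions here; the patent text only says "activation function"):

```python
import math

def fuse_and_predict(t_row, s_row, W1, W2, b):
    # (3.1): concatenate the time and spatial features of one time step and
    # pass them through the first dense layer (ReLU assumed).
    x = t_row + s_row
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in W1]
    # (3.2): second dense layer plus bias, then softmax over the dictionary.
    logits = [sum(h * w for h, w in zip(hidden, col)) + bj
              for col, bj in zip(W2, b)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The output row sums to one, matching the requirement that each dictionary word gets exactly one probability.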
In step 151, obtaining the replacement probability distribution output by the replacement model includes:
(1.1) randomly selecting a plurality of words in the sample and replacing them with a symbol representing a mask, to obtain a masked sample;
(1.2) mapping the masked sample to a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors e of the words:

  e = Embed( X_mask )

where X_mask is the masked sample and Embed is the embedding vector layer;
(1.3) performing prediction on the feature mapping vectors e through the third fully connected layer and an activation function, and outputting the set of replacement probability distributions P_r:

  P_r = sigma( e W^(3) + b^(2) )

where W^(3) is the parameter matrix of the third fully connected layer, b^(2) is the second bias parameter, and sigma is a sigmoid function;
(1.4) taking the probability distributions in the set P_r at the positions of the masked words as the output.
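The masking of step (1.1) can be sketched as follows; the replacement model then predicts a distribution over the dictionary only at the masked positions. The names `mask_words` and `[MASK]` (BERT's conventional mask token) are illustrative:

```python
import random

MASK = "[MASK]"

def mask_words(words, k, seed=0):
    # step (1.1): randomly pick k positions and replace the words there with
    # a mask symbol; return the masked sample and the chosen positions.
    rng = random.Random(seed)
    positions = rng.sample(range(len(words)), k)
    masked = [MASK if i in positions else w for i, w in enumerate(words)]
    return masked, positions
```

Keeping the positions lets step (1.4) select exactly the masked slots of the model's output as the replacement probability distributions.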
2. Performing the steganography operation on the text sequence according to the secret information and the generation probability distribution includes:
2.1. for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
2.2. taking out the first 2^k generation probabilities after sorting as the generation candidate pool, where k is the preset maximum number of bits embedded per word;
2.3. calculating the ratio r of the first generation probability p_1 to the second generation probability p_2 in the generation candidate pool:

  r = p_1 / p_2

2.4. if the ratio is greater than a preset ratio threshold t, outputting the word corresponding to the first generation probability as the word in the text sequence, and recording that the word in the text sequence is not steganographic;
2.5. if the ratio is less than or equal to the preset ratio threshold t, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree;
2.6. converting the secret information into a binary bit stream, and initializing the value s = 1;
2.7. when a code in the coding set is identical to the first s bits of the binary bit stream, outputting the word corresponding to that code's generation probability as the word in the text sequence, and recording the word in the text sequence as steganographic; when no code in the coding set is identical to the first s bits of the binary bit stream, letting s = s + 1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
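Steps 2.5-2.7 can be sketched end to end: build a Huffman code over the candidate pool, then emit the candidate whose code matches the leading bits of the secret stream, shortest match first, exactly as the growing counter s prescribes. All names are illustrative:

```python
import heapq
import itertools

def huffman_codes(probs):
    # Build a Huffman tree over the candidate-pool probabilities and return
    # the binary code assigned to each candidate index.
    counter = itertools.count()   # tie-breaker so equal probabilities compare
    heap = [(p, next(counter), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {i: "0" + c for i, c in c1.items()}
        merged.update({i: "1" + c for i, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

def embed(bitstream, codes):
    # steps 2.6-2.7: s starts at 1 and grows until some code equals the first
    # s bits of the stream; that candidate's word is emitted and the matched
    # bits are consumed.
    for s in range(1, len(bitstream) + 1):
        for idx, code in codes.items():
            if code == bitstream[:s]:
                return idx, bitstream[s:]
    return None, bitstream
```

Because Huffman codes are prefix-free, at most one candidate can match any given stream, and higher-probability candidates carry shorter codes, so more likely words absorb fewer secret bits.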
3. Performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution includes:
3.1. for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
3.2. taking out the first 2^k replacement probabilities after sorting as the replacement candidate pool, where k is the preset maximum number of bits embedded per word;
3.3. constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree;
3.4. converting the secret information into a binary bit stream, and initializing the value s = 1;
3.5. when a code in the coding set is identical to the first s bits of the binary bit stream, outputting the word corresponding to that code's replacement probability as the non-steganographic word's output, and recording the non-steganographic word as steganographic; when no code in the coding set is identical to the first s bits of the binary bit stream, letting s = s + 1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
Embodiment two:
the embodiment of the invention provides a multi-scale joint text steganography system, which comprises the following components:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for carrying out steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
the replacement module is used for determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
Embodiment three:
as shown in fig. 2, on the basis of embodiment one, this embodiment of the invention provides a secret information extraction method for the multi-scale joint text steganography method, comprising:
s11, acquiring a joint steganography text;
s12, inputting the joint steganography text into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
s13, carrying out extraction operation according to the generated probability distribution and the joint steganography text, and obtaining a first extraction text and an extraction record;
s14, determining unextracted words in the joint steganography text according to the extraction records, inputting the joint steganography text into the pre-constructed generation and replacement joint model, and obtaining the replacement probability distribution of each unextracted word;
s15, extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
s16, generating secret information according to the first extracted text and the second extracted text;
wherein performing the extraction operation according to the generation probability distribution and the joint steganography text comprises the following steps:
(1) for each word in the joint steganography text, arranging the generation probability distribution in descending order of generation probability;
(2) taking out the first k generation probabilities after sorting as the generation candidate pool, k being the preset maximum number of bits embedded per word;
(3) calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool;
(4) if the ratio is greater than the preset ratio threshold, taking the word corresponding to the first generation probability as the output of the word in the joint steganography text, and recording that word as not extracted;
(5) if the ratio is less than or equal to the preset ratio threshold, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
(6) when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the word in the joint steganography text, and recording the word as extracted;
wherein performing the extraction operation according to the replacement probability distribution and the joint steganography text comprises the following steps:
(1) for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
(2) taking out the first k replacement probabilities after sorting as the replacement candidate pool, k being the preset maximum number of bits embedded per word;
(3) constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
(4) when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the unextracted word, and recording the unextracted word as extracted.
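The generation-side extraction above can be sketched as follows. It mirrors the embedding side: the receiver re-derives the same candidate pool, and either skips the word (probability ratio above the threshold, so it carried no bits) or reads its Huffman code back as secret bits. Function names and the dictionary input format are illustrative assumptions, not part of the patent.

```python
import heapq

def huffman_codes(pool):
    """Huffman code for [(word, prob), ...]; returns {word: bitstring}."""
    heap = [(p, i, {w: ""}) for i, (w, p) in enumerate(pool)]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, nxt, merged))
        nxt += 1
    return heap[0][2]

def extract_bits(distribution, stego_word, k, ratio_threshold):
    """Recover the bits hidden in one steganographic word.
    Re-derives the candidate pool the sender used; if the top-two
    probability ratio exceeds the threshold the word was output
    directly and carried no bits, otherwise its Huffman code is the
    embedded bit string. Returns (bits, extracted_flag)."""
    pool = sorted(distribution.items(), key=lambda x: -x[1])[:k]
    if pool[0][1] / pool[1][1] > ratio_threshold:
        return "", False          # word marked as not extracted
    codes = huffman_codes(pool)
    return codes[stego_word], True
```

The same Huffman construction must be deterministic on both sides (identical sorting and tie-breaking), otherwise the codes, and hence the recovered bits, would differ between sender and receiver.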
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. A method of multi-scale joint text steganography, comprising:
acquiring a text sequence and secret information;
inputting the text sequence into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
performing steganography operation on the text sequence according to the secret information and the generated probability distribution to obtain a first steganography text and a steganography record;
determining non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the pre-constructed generation and replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
performing steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
a joint steganographic text is generated from the first steganographic text and the second steganographic text.
2. The method of claim 1, wherein constructing the generation and replacement joint model comprises:
acquiring a preset number of text data;
preprocessing text data, and constructing a sample set based on the preprocessed text data;
dividing a sample set into a training set and a verification set according to a preset proportion;
constructing the generation and replacement joint model based on PyTorch, the generation and replacement joint model comprising a generation model and a replacement model;
performing iterative training on the generation and replacement joint model using the training set; after iterative training, verifying the iteratively trained generation and replacement joint model using the verification set; and after verification, retaining and outputting the generation and replacement joint model with the minimum loss.
3. A multi-scale joint text steganography method as recited in claim 2, wherein the preprocessing comprises:
segmenting the text data, retaining the words in the segmentation result, and generating word sequences;
taking the first n-1 words of a word sequence as a sample and the last n-1 words of the word sequence as a label, n being the total number of words in the word sequence;
if the length of the sample or label is smaller than a preset length threshold N, filling the tail of the corresponding sample or label with a padding symbol until its length equals the preset length threshold N;
if the length of the sample or label is larger than the preset length threshold N, truncating the tail of the corresponding sample or label until its length equals the preset length threshold N.
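The shift-by-one sample/label split and the pad-or-truncate rule can be sketched in a few lines. This is a minimal illustration; the generic `<pad>` filler symbol is an assumption, since the actual padding token is not specified in the text.

```python
def make_sample_and_label(words, N, pad="<pad>"):
    """Claim-3-style preprocessing sketch: for a tokenized word
    sequence of length n, the sample is the first n-1 words and the
    label the last n-1 words (next-word prediction), each padded or
    truncated at the tail to the preset length N."""
    sample, label = words[:-1], words[1:]

    def fit(seq):
        seq = seq[:N]                        # truncate tail if too long
        return seq + [pad] * (N - len(seq))  # pad tail if too short

    return fit(sample), fit(label)
```

Padding both sample and label to the same fixed length N lets sequences of different lengths be batched together during training.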
4. A multi-scale joint text steganography method as recited in claim 2, wherein iteratively training the generation and replacement joint model using the training set comprises:
inputting samples in the training set into the generation and replacement joint model, and acquiring the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
feeding the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with the label, into a cross-entropy loss function to calculate the losses L_g and L_r, and taking the sum of L_g and L_r as the total loss L;
back-propagating the loss L to obtain the parameter gradients of the generation and replacement joint model, and performing parameter optimization using the Adam optimizer.
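One training iteration of this joint loss can be sketched as below. The sketch assumes a model interface that returns both distributions from one forward pass; that interface, the tensor shapes, and the flattening for cross-entropy are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

def joint_train_step(model, optimizer, sample, label, mask_label):
    """One claim-4-style iteration (sketch): the joint model returns
    the generation and replacement distributions, each is scored
    against its label with cross-entropy, the two losses are summed
    into L = L_g + L_r, and the optimizer (Adam) updates parameters."""
    ce = nn.CrossEntropyLoss()
    gen_logits, rep_logits = model(sample)           # (B, T, V) each
    loss_g = ce(gen_logits.flatten(0, 1), label.flatten())
    loss_r = ce(rep_logits.flatten(0, 1), mask_label.flatten())
    loss = loss_g + loss_r                           # total loss L
    optimizer.zero_grad()
    loss.backward()                                  # parameter gradients
    optimizer.step()                                 # Adam update
    return loss.item()
```

Summing the two cross-entropy terms trains the generation branch and the replacement branch jointly, so both heads share whatever parameters the joint model has in common.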
5. The method of claim 4, wherein obtaining the generation probability distribution output by the generation model comprises:
extracting the timing relation feature vectors of the words in the sample one by one using an LSTM and forming a timing relation feature matrix F;
calculating the relation weight of each word in the sample on the timing features through a multi-head self-attention mechanism and expressing it as an attention matrix A (formula reconstructed from the surrounding definitions):
A = σ((head_1 ‖ head_2 ‖ … ‖ head_H) W^O), head_i = softmax((F W_i^Q)(F W_i^K)^T / √d)(F W_i^V)
where head_i is the output feature vector of attention head i, H is the total number of attention heads, W_i^Q, W_i^K and W_i^V are the parameter matrices of attention head i corresponding to the query, key and value vectors, W^O is the attention parameter matrix, d is the dimension of the timing relation feature vector, ‖ is the concatenation operation, and σ is the sigmoid function;
multiplying the timing relation feature matrix F by the attention matrix A to obtain the temporal feature matrix T of each time step: T = F ⊙ A;
Mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain a word embedding vector of each word;
constructing a graph structure G = (V, E), and taking the word embedding vectors of all words in the sample as the node set V of the graph structure G, |V| = n, n being the number of words in the sample;
extracting the spatial relationships of all words in the sample by a sliding window algorithm to build the edge set E of the graph structure G, |E| = m, m being the number of edges;
extracting the spatial relation feature vector of each node from the graph structure G using a GAT, calculating the spatial features through a multi-head self-attention mechanism and expressing them as attention coefficients α_ij (formula reconstructed from the surrounding definitions):
α_ij = exp(LeakyReLU(a^T [W h_i ‖ W h_j])) / Σ_{k∈N_i} exp(LeakyReLU(a^T [W h_i ‖ W h_k]))
where α_ij is the attention coefficient from node i to node j, N_i is the neighborhood of node i, h_i and h_j are the spatial relation feature vectors of node i and node j, W is the linear transformation weight matrix of each node, a is the weight vector, LeakyReLU is the activation function, and ‖ splices two vectors;
multiplying the attention coefficients α_ij by the spatial relation feature vectors of the nodes, and updating the spatial relation feature vector of each node through the multi-head self-attention mechanism to generate the spatial feature matrix S;
fusing the temporal feature matrix T and the spatial feature matrix S through the first fully connected layer and an activation function to obtain the fusion feature matrix U:
U = σ(W_1 [T ‖ S])
where W_1 is the parameter matrix of the first fully connected layer;
passing the fusion feature matrix U through the second fully connected layer and the activation function for prediction, outputting the generation probability distribution P: P = σ(W_2 U), where W_2 is the parameter matrix of the second fully connected layer;
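A compressed sketch of this generation branch is given below. It is not the patented model: the graph-attention (GAT) spatial branch is replaced by a plain word-embedding stand-in for brevity, softmax is used for the final distribution, and all layer sizes are illustrative assumptions. It only shows the shape of the pipeline: LSTM timing features, self-attention reweighting, feature fusion through a first fully connected layer, and a second fully connected layer emitting the distribution.

```python
import torch
import torch.nn as nn

class GenerationHead(nn.Module):
    """Simplified sketch of the claim-5 generation branch: an LSTM
    extracts timing-relation features, multi-head self-attention
    reweights them into a temporal feature matrix T, a word-embedding
    branch stands in for the GAT spatial feature matrix S, and two
    fully connected layers fuse T ‖ S and emit the generation
    probability distribution."""

    def __init__(self, vocab, dim=64, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)   # first FC layer: T ‖ S -> U
        self.out = nn.Linear(dim, vocab)      # second FC layer: U -> P

    def forward(self, x):
        e = self.emb(x)                       # word embedding vectors
        f, _ = self.lstm(e)                   # timing-relation features F
        t, _ = self.attn(f, f, f)             # temporal feature matrix T
        s = e                                 # stand-in spatial matrix S
        u = torch.relu(self.fuse(torch.cat([t, s], dim=-1)))
        return torch.softmax(self.out(u), dim=-1)  # distribution P
```

The fusion layer is what lets the per-time-step (LSTM) view and the per-word (graph) view contribute jointly to each next-word distribution.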
6. The method of claim 4, wherein obtaining the replacement probability distribution output by the replacement model comprises:
randomly selecting a plurality of words from the sample and replacing them with a symbol representing a mask, obtaining a masked sample X_m;
mapping the masked sample X_m into a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors E of the words:
E = Embed(X_m)
where X_m is the masked sample and Embed is the embedding vector layer;
passing the feature mapping vectors E through the third fully connected layer and the activation function for prediction, outputting the set of replacement probability distributions Q:
Q = σ(W_3 E + b_2)
where W_3 is the parameter matrix of the third fully connected layer, b_2 is the second bias parameter, and σ is the sigmoid function.
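The masked-prediction replacement branch can be sketched as below. Treat every detail as an assumption: the BERT encoder stack is reduced to a bare embedding layer, softmax replaces the claim's sigmoid so each row forms a probability distribution, and the mask rate and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ReplacementHead(nn.Module):
    """Sketch of claim 6: randomly mask words, map the masked sample
    into a semantic space via an embedding layer (stand-in for BERT's
    embedding stack), and let a fully connected layer predict a
    replacement distribution for every position."""

    def __init__(self, vocab, dim=64, mask_id=0, p_mask=0.15):
        super().__init__()
        self.mask_id, self.p_mask = mask_id, p_mask
        self.emb = nn.Embedding(vocab, dim)
        self.fc = nn.Linear(dim, vocab)       # third fully connected layer

    def mask(self, x):
        m = torch.rand_like(x, dtype=torch.float) < self.p_mask
        return torch.where(m, torch.full_like(x, self.mask_id), x), m

    def forward(self, x):
        xm, m = self.mask(x)                  # sample with mask symbols
        h = self.emb(xm)                      # feature mapping vectors E
        return torch.softmax(self.fc(h), dim=-1), m
```

At embedding time the mask positions are where candidate substitutes are scored, which is what yields the per-word replacement candidate pool used in claim 8.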
7. A multi-scale joint text steganography method as recited in claim 1, wherein performing the steganography operation on the text sequence according to the secret information and the generation probability distribution comprises:
for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
taking out the first k generation probabilities after sorting as the generation candidate pool, k being the preset maximum number of bits embedded per word;
calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool;
if the ratio is greater than the preset ratio threshold, taking the word corresponding to the first generation probability as the output of the word in the text sequence, and recording that word as non-steganographic;
if the ratio is less than or equal to the preset ratio threshold, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is identical to the first s bits of the binary bit stream, taking the word corresponding to that code's generation probability as the output of the word in the text sequence, and recording the word as steganographic; when no code in the code set is identical to the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
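The ratio test that decides whether a word carries bits at all can be isolated in a tiny helper. This is an illustrative sketch; the function name and list-based input are assumptions.

```python
def should_embed(pool_probs, ratio_threshold):
    """Claim-7-style adaptive gate (sketch): compute the ratio of the
    first to the second probability in the sorted candidate pool.
    Above the threshold the model is too confident for a substitution
    to go unnoticed, so the top word is output directly and marked
    non-steganographic; at or below it, bits are Huffman-embedded."""
    ratio = pool_probs[0] / pool_probs[1]
    return ratio <= ratio_threshold   # True -> embed bits at this word
```

The gate trades embedding rate for text quality: a lower threshold skips more high-confidence positions, producing fewer but less detectable embedded bits, which is why the skipped positions are then revisited by the replacement model.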
8. A multi-scale joint text steganography method as recited in claim 1, wherein performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution comprises:
for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the first k replacement probabilities after sorting as the replacement candidate pool, k being the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is identical to the first s bits of the binary bit stream, taking the word corresponding to that code's replacement probability as the output of the non-steganographic word, and recording the non-steganographic word as steganographic; when no code in the code set is identical to the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
9. A multi-scale joint text steganography system, comprising:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for carrying out steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
the replacement module is used for determining non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the pre-constructed generation and replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
10. A secret information extraction method based on a multi-scale joint text steganography method according to any one of claims 1-8, characterized by comprising:
acquiring a joint steganography text;
inputting the joint steganography text into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
carrying out extraction operation according to the generated probability distribution and the joint steganography text to obtain a first extraction text and an extraction record;
determining unextracted words in the joint steganography text according to the extraction records, inputting the joint steganography text into the pre-constructed generation and replacement joint model, and obtaining the replacement probability distribution of each unextracted word;
extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
generating secret information according to the first extracted text and the second extracted text;
wherein performing the extraction operation according to the generation probability distribution and the joint steganography text comprises the following steps:
for each word in the joint steganography text, arranging the generation probability distribution in descending order of generation probability;
taking out the first k generation probabilities after sorting as the generation candidate pool, k being the preset maximum number of bits embedded per word;
calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool;
if the ratio is greater than the preset ratio threshold, taking the word corresponding to the first generation probability as the output of the word in the joint steganography text, and recording that word as not extracted;
if the ratio is less than or equal to the preset ratio threshold, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the word in the joint steganography text, and recording the word as extracted;
wherein performing the extraction operation according to the replacement probability distribution and the joint steganography text comprises the following steps:
for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the first k replacement probabilities after sorting as the replacement candidate pool, k being the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the unextracted word, and recording the unextracted word as extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310240044.0A CN115952528B (en) | 2023-03-14 | 2023-03-14 | Multi-scale combined text steganography method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115952528A CN115952528A (en) | 2023-04-11 |
CN115952528B true CN115952528B (en) | 2023-05-16 |
Family
ID=85893115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310240044.0A Active CN115952528B (en) | 2023-03-14 | 2023-03-14 | Multi-scale combined text steganography method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115952528B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116595587B (en) * | 2023-07-14 | 2023-09-22 | 江西通友科技有限公司 | Document steganography method and document management method based on secret service |
CN117648681B (en) * | 2024-01-30 | 2024-04-05 | 北京点聚信息技术有限公司 | OFD format electronic document hidden information extraction and embedding method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105959104B (en) * | 2016-04-25 | 2019-05-17 | 深圳大学 | Steganalysis method based on Hamming distance distribution |
CN109815496A (en) * | 2019-01-22 | 2019-05-28 | 清华大学 | Based on capacity adaptive shortening mechanism carrier production text steganography method and device |
CN110533570A (en) * | 2019-08-27 | 2019-12-03 | 南京工程学院 | A kind of general steganography method based on deep learning |
CN113987129A (en) * | 2021-11-08 | 2022-01-28 | 重庆邮电大学 | Digital media protection text steganography method based on variational automatic encoder |
CN115169293A (en) * | 2022-09-02 | 2022-10-11 | 南京信息工程大学 | Text steganalysis method, system, device and storage medium |
- 2023-03-14 CN CN202310240044.0A patent/CN115952528B/en active Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||