CN115952528B - Multi-scale combined text steganography method and system - Google Patents

Multi-scale combined text steganography method and system

Info

Publication number
CN115952528B
Authority
CN
China
Prior art keywords: text, steganography, word, joint, replacement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310240044.0A
Other languages
Chinese (zh)
Other versions
CN115952528A (en)
Inventor
付章杰
丁长浩
卢俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202310240044.0A
Publication of CN115952528A
Application granted
Publication of CN115952528B
Legal status: Active

Abstract

The invention discloses a multi-scale joint text steganography method and system. The method comprises the following steps: acquiring a text sequence and secret information; inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word; performing a steganography operation on the text sequence according to the secret information and the generation probability distributions to obtain a first steganography text and a steganography record; determining the non-steganographic words in the text sequence according to the steganography record, inputting the text sequence into the generation and replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word; performing a steganography operation on the non-steganographic words according to the secret information and the replacement probability distributions to obtain a second steganography text; and generating a joint steganography text from the first steganography text and the second steganography text. The method and system address the technical problems of low steganographic text quality and low embedding rate in traditional text steganography algorithms.

Description

Multi-scale combined text steganography method and system
Technical Field
The invention relates to a multi-scale combined text steganography method and system, and belongs to the technical field of information hiding.
Background
Text steganography is a method of embedding secret information in text for secure transmission, and is mainly used to realize covert communication. The most important difference between text steganography and cryptography is that steganography conceals the very existence of the hidden information rather than only its content, so text steganography has unique advantages in protecting information security. However, traditional text steganography algorithms suffer from problems such as low steganographic text quality and low embedding rate.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, provides a multi-scale combined text steganography method and a system, and solves the technical problems of low steganography text quality and low embedding rate in the traditional text steganography algorithm.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a multi-scale joint text steganography method, including:
acquiring a text sequence and secret information;
inputting the text sequence into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
performing steganography operation on the text sequence according to the secret information and the generated probability distribution to obtain a first steganography text and a steganography record;
determining non-steganographic words in a text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
performing steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
a joint steganographic text is generated from the first steganographic text and the second steganographic text.
Optionally, the construction process of the generation and replacement joint model includes:
acquiring a preset number of text data;
preprocessing text data, and constructing a sample set based on the preprocessed text data;
dividing a sample set into a training set and a verification set according to a preset proportion;
constructing the generation and replacement joint model based on PyTorch, the generation and replacement joint model comprising a generation model and a replacement model;
and performing iterative training on the generated and replaced joint model by using a training set, after iterative training, verifying the generated and replaced joint model after iterative training by using a verification set, and after verification, keeping and outputting the generated and replaced joint model with the minimum loss.
Optionally, the preprocessing includes:
dividing the text data, reserving words in the division result and generating word sequences;
taking the first n-1 positions of the word sequence as the sample and the last n-1 positions of the word sequence as the label, n being the total length of the word sequence;
if the length of the sample or the label is smaller than a preset length threshold N, padding symbols are appended to the tail of the corresponding sample or label until its length equals the preset length threshold N;
if the length of the sample or the label is larger than the preset length threshold N, the tail of the corresponding sample or label is truncated so that its length equals the preset length threshold N.
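As a minimal illustration of the sample/label construction and the padding/truncation just described, the following Python sketch processes one word sequence; the padding token and the value of N used here are illustrative assumptions, not values fixed by the method:

```python
# Sketch of the sample/label construction with tail padding/truncation.
# The "<pad>" token and N=6 are illustrative choices only.
def build_sample_and_label(words, N, pad_token="<pad>"):
    """words: list of tokens obtained from one piece of text."""
    sample = words[:-1]          # first n-1 positions of the word sequence
    label = words[1:]            # last n-1 positions of the word sequence
    def fit(seq):
        if len(seq) < N:                     # pad the tail up to length N
            return seq + [pad_token] * (N - len(seq))
        return seq[:N]                       # truncate the tail down to length N
    return fit(sample), fit(label)

# Example: a 5-word sequence with N = 6
sample, label = build_sample_and_label(["the", "cat", "sat", "on", "mat"], N=6)
# sample == ['the', 'cat', 'sat', 'on', '<pad>', '<pad>']
# label  == ['cat', 'sat', 'on', 'mat', '<pad>', '<pad>']
```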
Optionally, the iterative training of the generation and replacement joint model using the training set includes:
inputting the samples of the training set into the generation and replacement joint model, and acquiring the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
taking the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with the label, as inputs of a cross-entropy loss function to calculate the losses $L_1$ and $L_2$, and summing $L_1$ and $L_2$ to obtain the loss $L$;
back-propagating the loss $L$ to obtain the parameter gradients of the generation and replacement joint model, and performing parameter optimization with an Adam optimizer;
substituting the parameter-optimized generation and replacement joint model back into the iterative training step until the loss $L$ converges, and outputting the trained generation and replacement joint model.
Optionally, the generation probability distribution output by the generation model is obtained as follows:
extracting the temporal relation feature vector of each word in the sample one by one with an LSTM and forming a temporal relation feature matrix $H$;
calculating, through a multi-head self-attention mechanism, the relation weight of each word in the sample over the temporal features and expressing it as an attention matrix $A$:
$$A=\sigma\left(\mathrm{Concat}(head_1,\dots,head_m)W^A\right),\qquad head_i=\mathrm{softmax}\!\left(\frac{(HW_i^Q)(HW_i^K)^\top}{\sqrt{d}}\right)HW_i^V$$
where $head_i$ is the output feature vector of attention head $i$, $m$ is the total number of attention heads, $W_i^Q$, $W_i^K$ and $W_i^V$ are the parameter matrices of attention head $i$ corresponding to the query, key and value vectors, $W^A$ is the attention parameter matrix, $d$ is the dimension of the temporal relation feature vector, $\mathrm{Concat}(\cdot)$ is the concatenation operation, and $\sigma(\cdot)$ is a sigmoid function;
multiplying the temporal relation feature matrix $H$ with the attention matrix $A$ to obtain the temporal feature matrix $T$ of each time step:
$$T=A\odot H$$
where $\odot$ denotes element-wise multiplication;
mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain the word embedding vector of each word;
constructing a graph structure $G$ and taking the word embedding vectors of all words in the sample as the node set $V$ of the graph structure $G$, $V=\{v_1,\dots,v_p\}$, where $p$ is the number of words in the sample;
extracting the spatial relations of all words in the sample by a sliding-window algorithm to build the edge set $E$ of the graph structure $G$, $E=\{e_1,\dots,e_q\}$, where $q$ is the number of edges;
extracting the spatial relation feature vector of each node from the graph structure $G$ with a GAT, computing the spatial features through a multi-head self-attention mechanism and expressing them as attention coefficients $\alpha_{ij}$:
$$\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_j\right]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_k\right]\right)\right)}$$
where $\alpha_{ij}$ is the attention coefficient from node $i$ to node $j$, $\mathcal{N}_i$ is the neighbour set of node $i$, $h_i$, $h_j$ and $h_k$ are the spatial relation feature vectors of nodes $i$, $j$ and $k$, $W$ is the linear-transformation weight matrix of each node, $a$ is a weight vector, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, and $\Vert$ denotes splicing two vectors;
multiplying the attention coefficients $\alpha_{ij}$ by the spatial relation feature vectors of the nodes and updating the spatial relation feature vector of each node through the multi-head self-attention mechanism to generate the spatial feature matrix $S$:
$$s_i=\Big\Vert_{k=1}^{m}\,\sigma\Big(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}W^{k}h_j\Big)$$
where $W^{k}$ is the weight matrix corresponding to attention head $k$;
fusing the temporal feature matrix $T$ and the spatial feature matrix $S$ through the first fully connected layer and an activation function to obtain the fused feature matrix $F$:
$$F=\sigma\left(W_1\left[T\,\Vert\,S\right]\right)$$
where $W_1$ is the parameter matrix of the first fully connected layer;
passing the fused feature matrix $F$ through the second fully connected layer and an activation function for prediction, and outputting the generation probability distribution $P_g$:
$$P_g=\sigma\left(W_2F+b_1\right)$$
where $W_2$ is the parameter matrix of the second fully connected layer and $b_1$ is the first bias parameter.
Optionally, the replacement probability distribution output by the replacement model is obtained as follows:
randomly selecting a plurality of words in the sample and replacing them with a symbol representing a mask, obtaining a masked sample;
mapping the masked sample to a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors $E_m$ of the words:
$$E_m=\mathrm{Embed}(X_m)$$
where $X_m$ is the masked sample and $\mathrm{Embed}(\cdot)$ is the embedding vector layer;
passing the feature mapping vectors $E_m$ through the third fully connected layer and an activation function for prediction, and outputting the replacement probability distribution set $P_r$:
$$P_r=\sigma\left(W_3E_m+b_2\right)$$
where $W_3$ is the parameter matrix of the third fully connected layer, $b_2$ is the second bias parameter, and $\sigma(\cdot)$ is a sigmoid function;
taking the probability distributions of the masked words in the replacement probability distribution set $P_r$ as the output.
Optionally, performing the steganography operation on the text sequence according to the secret information and the generation probability distribution includes:
for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
taking out the top $k$ generation probabilities after sorting as the generation candidate pool, where $k$ is the preset maximum number of bits embedded per word;
calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool:
if the ratio is greater than a preset ratio threshold $\delta$, taking the word corresponding to the first generation probability as the output of the word in the text sequence, and recording the word in the text sequence as not steganographic;
if the ratio is less than or equal to the preset ratio threshold $\delta$, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the generation probability of that code as the output of the word in the text sequence, and recording the word in the text sequence as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
Optionally, performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution includes:
for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the top $k$ replacement probabilities after sorting as the replacement candidate pool, where $k$ is the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the replacement probability of that code as the output of the non-steganographic word, and recording the non-steganographic word as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
In a second aspect, the present invention provides a multi-scale joint text steganography system comprising:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for carrying out steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
the replacement module is used for determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
In a third aspect, the present invention provides a secret information extraction method based on the above-mentioned multi-scale joint text steganography method, including:
acquiring a joint steganography text;
inputting the joint steganography text into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
carrying out extraction operation according to the generated probability distribution and the joint steganography text to obtain a first extraction text and an extraction record;
determining unextracted words in the joint steganography text according to the extraction records, inputting the joint steganography text into a pre-constructed generated replacement joint model, and obtaining replacement probability distribution of each unextracted word;
extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
generating secret information according to the first extracted text and the second extracted text;
wherein the extraction operation according to the generation probability distribution and the joint steganography text comprises the following steps:
for each word in the joint steganography text, arranging the generation probability distribution in descending order of generation probability;
taking out the top $k$ generation probabilities after sorting as the generation candidate pool, where $k$ is the preset maximum number of bits embedded per word;
calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool:
if the ratio is greater than a preset ratio threshold $\delta$, taking the word corresponding to the first generation probability as the output of the word in the joint steganography text, and recording the word in the joint steganography text as not extracted;
if the ratio is less than or equal to the preset ratio threshold $\delta$, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
when the word corresponding to a code in the code set is the same as the word in the joint steganography text, taking that code as the output for the word in the joint steganography text, and recording the word in the joint steganography text as extracted;
wherein the extraction operation according to the replacement probability distribution and the joint steganography text comprises the following steps:
for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the top $k$ replacement probabilities after sorting as the replacement candidate pool, where $k$ is the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
when the word corresponding to a code in the code set is the same as the word in the joint steganography text, taking that code as the output of the unextracted word, and recording the unextracted word as extracted.
Compared with the prior art, the invention has the beneficial effects that:
According to the multi-scale joint text steganography method and system, a generation and replacement joint model comprising both a generation model and a replacement model is constructed, which guarantees feature consistency between the generation model and the replacement model; applying the generation and replacement joint model to the steganography process exploits text redundancy to the greatest extent, improves the embedding rate, and ensures the quality of the steganographic text.
Drawings
Fig. 1 is a flowchart of a multi-scale joint text steganography method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a secret information extraction method of a multi-scale joint text steganography method according to a third embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
as shown in fig. 1, an embodiment of the present invention provides a multi-scale joint text steganography method, including the following steps:
s1, acquiring a text sequence and secret information;
s2, inputting the text sequence into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
s3, performing steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
s4, determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining replacement probability distribution of each non-steganographic word;
s5, performing steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
s6, generating a joint steganography text according to the first steganography text and the second steganography text.
1. The construction process of the generation and replacement joint model comprises the following steps:
11. acquiring a preset number of pieces of text data; for example, 300,000 pieces of text data are obtained from the OPUS dataset website;
12. preprocessing text data, and constructing a sample set based on the preprocessed text data;
wherein the preprocessing comprises the following steps:
121. dividing the text data, reserving words in the division result and generating word sequences;
122. taking the first n-1 positions of the word sequence as the sample and the last n-1 positions of the word sequence as the label, n being the total length of the word sequence;
123. if the length of the sample or the label is smaller than a preset length threshold N, padding symbols are appended to the tail of the corresponding sample or label until its length equals the preset length threshold N;
124. if the length of the sample or the label is larger than the preset length threshold N, words at the tail of the corresponding sample or label are truncated so that its length equals the preset length threshold N;
13. dividing the sample set into a training set and a verification set according to a preset ratio, usually set to 8:2; all words appearing in the training set are counted and their word frequencies calculated, words whose frequency meets a preset word-frequency threshold are added to a dictionary, and the dictionary length is the length of the model output probability distribution; the words in the dictionary correspond one-to-one with the probabilities in the probability distribution;
14. constructing the generation and replacement joint model based on PyTorch, the generation and replacement joint model comprising a generation model for outputting the generation probability distribution and a replacement model for outputting the replacement probability distribution;
15. iterative training is carried out on the generation and replacement joint model using the training set; after iterative training, the iteratively trained generation and replacement joint model is verified using the verification set; after verification, the generation and replacement joint model with the minimum loss $L$ is retained and output.
Wherein iteratively training the generation and replacement joint model using the training set comprises:
151. inputting the samples of the training set into the generation and replacement joint model, and acquiring the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
152. taking the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with the label, as inputs of a cross-entropy loss function to calculate the losses $L_1$ and $L_2$, and summing $L_1$ and $L_2$ to obtain the loss $L$;
153. back-propagating the loss $L$ to obtain the parameter gradients of the generation and replacement joint model, and performing parameter optimization with an Adam optimizer;
154. substituting the parameter-optimized generation and replacement joint model back into the iterative training step (namely returning to step 151) until the loss $L$ converges, and outputting the trained generation and replacement joint model.
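For concreteness, the following is a minimal, self-contained PyTorch sketch of the training procedure in steps 151-154. The toy GenReplaceModel, its layer sizes and the random stand-in batches are illustrative assumptions; only the structure (two cross-entropy losses summed into $L$, back-propagation, Adam update until $L$ converges) follows the description above:

```python
import torch
import torch.nn as nn

class GenReplaceModel(nn.Module):
    """Toy stand-in for the generation and replacement joint model:
    one shared embedding, two output heads (generation / replacement)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gen_head = nn.Linear(dim, vocab_size)
        self.rep_head = nn.Linear(dim, vocab_size)
    def forward(self, x):
        h = self.embed(x)
        return self.gen_head(h), self.rep_head(h)

vocab_size, N = 1000, 16
model = GenReplaceModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

samples = torch.randint(0, vocab_size, (8, N))   # stand-in batch of padded samples
labels = torch.randint(0, vocab_size, (8, N))    # stand-in batch of labels

gen_logits, rep_logits = model(samples)          # the two predicted distributions
loss_gen = criterion(gen_logits.view(-1, vocab_size), labels.view(-1))  # L1
loss_rep = criterion(rep_logits.view(-1, vocab_size), labels.view(-1))  # L2
loss = loss_gen + loss_rep                       # L = L1 + L2
optimizer.zero_grad()
loss.backward()                                  # gradients of the joint-model parameters
optimizer.step()                                 # Adam update; repeat until L converges
```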
In step 151, the generation probability distribution output by the generation model is obtained as follows:
(1.1) extracting the temporal relation feature vector of each word in the sample one by one with an LSTM and forming a temporal relation feature matrix $H$;
(1.2) calculating, through a multi-head self-attention mechanism, the relation weight of each word in the sample over the temporal features and expressing it as an attention matrix $A$:
$$A=\sigma\left(\mathrm{Concat}(head_1,\dots,head_m)W^A\right),\qquad head_i=\mathrm{softmax}\!\left(\frac{(HW_i^Q)(HW_i^K)^\top}{\sqrt{d}}\right)HW_i^V$$
where $head_i$ is the output feature vector of attention head $i$, $m$ is the total number of attention heads, $W_i^Q$, $W_i^K$ and $W_i^V$ are the parameter matrices of attention head $i$ corresponding to the query, key and value vectors, $W^A$ is the attention parameter matrix, $d$ is the dimension of the temporal relation feature vector, $\mathrm{Concat}(\cdot)$ is the concatenation operation, and $\sigma(\cdot)$ is a sigmoid function;
(1.3) multiplying the temporal relation feature matrix $H$ with the attention matrix $A$ to obtain the temporal feature matrix $T$ of each time step:
$$T=A\odot H$$
where $\odot$ denotes element-wise multiplication;
(2.1) mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain the word embedding vector of each word;
(2.2) constructing a graph structure $G$ and taking the word embedding vectors of all words in the sample as the node set $V$ of the graph structure $G$, $V=\{v_1,\dots,v_p\}$, where $p$ is the number of words in the sample;
(2.3) extracting the spatial relations of all words in the sample by a sliding-window algorithm to build the edge set $E$ of the graph structure $G$, $E=\{e_1,\dots,e_q\}$, where $q$ is the number of edges;
(2.4) extracting the spatial relation feature vector of each node from the graph structure $G$ with a GAT, computing the spatial features through a multi-head self-attention mechanism and expressing them as attention coefficients $\alpha_{ij}$:
$$\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_j\right]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_k\right]\right)\right)}$$
where $\alpha_{ij}$ is the attention coefficient from node $i$ to node $j$, $\mathcal{N}_i$ is the neighbour set of node $i$, $h_i$, $h_j$ and $h_k$ are the spatial relation feature vectors of nodes $i$, $j$ and $k$, $W$ is the linear-transformation weight matrix of each node, $a$ is a weight vector, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, and $\Vert$ denotes splicing two vectors;
(2.5) multiplying the attention coefficients $\alpha_{ij}$ by the spatial relation feature vectors of the nodes and updating the spatial relation feature vector of each node through the multi-head self-attention mechanism to generate the spatial feature matrix $S$:
$$s_i=\Big\Vert_{k=1}^{m}\,\sigma\Big(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}W^{k}h_j\Big)$$
where $W^{k}$ is the weight matrix corresponding to attention head $k$;
(3.1) fusing the temporal feature matrix $T$ and the spatial feature matrix $S$ through the first fully connected layer and an activation function to obtain the fused feature matrix $F$:
$$F=\sigma\left(W_1\left[T\,\Vert\,S\right]\right)$$
where $W_1$ is the parameter matrix of the first fully connected layer;
(3.2) passing the fused feature matrix $F$ through the second fully connected layer and an activation function for prediction, and outputting the generation probability distribution $P_g$:
$$P_g=\sigma\left(W_2F+b_1\right)$$
where $W_2$ is the parameter matrix of the second fully connected layer and $b_1$ is the first bias parameter.
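The following PyTorch sketch condenses steps (1.1)-(3.2) into one forward pass: LSTM temporal features, multi-head self-attention weighting, a GAT over a sliding-window word graph, feature fusion, and a per-position probability output. The layer sizes, the window width and the use of PyTorch Geometric's GATConv are illustrative assumptions, not choices prescribed by the method:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv       # assumption: PyTorch Geometric is available

class GenerationModel(nn.Module):
    def __init__(self, vocab_size, dim=128, heads=4, window=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)                  # temporal features H
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # attention weighting
        self.gat = GATConv(dim, dim, heads=heads, concat=False)          # spatial features S
        self.window = window
        self.fuse = nn.Linear(2 * dim, dim)                              # first fully connected layer
        self.out = nn.Linear(dim, vocab_size)                            # second fully connected layer

    def build_edges(self, n):
        # sliding-window spatial relations: connect words within `window` positions
        src, dst = [], []
        for i in range(n):
            for j in range(max(0, i - self.window), min(n, i + self.window + 1)):
                if i != j:
                    src.append(i); dst.append(j)
        return torch.tensor([src, dst], dtype=torch.long)

    def forward(self, tokens):                 # tokens: (seq_len,) word ids of one sample
        e = self.embed(tokens).unsqueeze(0)    # (1, seq_len, dim) word embedding vectors
        h, _ = self.lstm(e)                    # temporal relation feature matrix
        t, _ = self.attn(h, h, h)              # attention-weighted temporal feature matrix T
        s = self.gat(e.squeeze(0), self.build_edges(tokens.size(0)))     # spatial feature matrix S
        f = torch.sigmoid(self.fuse(torch.cat([t.squeeze(0), s], dim=-1)))  # fused features F
        return torch.softmax(self.out(f), dim=-1)   # generation probability distribution per position

probs = GenerationModel(vocab_size=1000)(torch.randint(0, 1000, (12,)))
# probs[i] is the generation probability distribution for position i
```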
In step 151, the replacement probability distribution output by the replacement model is obtained as follows:
(1.1) randomly selecting a plurality of words in the sample and replacing them with a symbol representing a mask, obtaining a masked sample;
(1.2) mapping the masked sample to a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors $E_m$ of the words:
$$E_m=\mathrm{Embed}(X_m)$$
where $X_m$ is the masked sample and $\mathrm{Embed}(\cdot)$ is the embedding vector layer;
(1.3) passing the feature mapping vectors $E_m$ through the third fully connected layer and an activation function for prediction, and outputting the replacement probability distribution set $P_r$:
$$P_r=\sigma\left(W_3E_m+b_2\right)$$
where $W_3$ is the parameter matrix of the third fully connected layer, $b_2$ is the second bias parameter, and $\sigma(\cdot)$ is a sigmoid function;
(1.4) taking the probability distributions of the masked words in the replacement probability distribution set $P_r$ as the output.
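A corresponding sketch of the replacement branch, using the Hugging Face transformers library to provide the BERT embedding vector layer; the model name, the masked position and the use of the BERT vocabulary size are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel   # assumption: transformers is available

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

class ReplacementModel(nn.Module):
    """Predicts replacement probability distributions for masked positions."""
    def __init__(self, bert, vocab_size):
        super().__init__()
        self.embeddings = bert.embeddings            # BERT embedding vector layer
        self.fc = nn.Linear(bert.config.hidden_size, vocab_size)  # third fully connected layer

    def forward(self, input_ids):
        e = self.embeddings(input_ids=input_ids)     # feature mapping vectors E_m
        return torch.sigmoid(self.fc(e))             # replacement probability distribution set

text = "the cat sat on the mat"
ids = tokenizer(text, return_tensors="pt")["input_ids"]
ids[0, 3] = tokenizer.mask_token_id                  # one randomly chosen word replaced by [MASK]
model = ReplacementModel(bert, tokenizer.vocab_size)
probs = model(ids)
masked_dist = probs[0, 3]        # the distribution kept as output for the masked position
```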
2. Performing the steganography operation on the text sequence according to the secret information and the generation probability distribution includes:
2.1, for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
2.2, taking out the top $k$ generation probabilities after sorting as the generation candidate pool, where $k$ is the preset maximum number of bits embedded per word;
2.3, calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool:
2.4, if the ratio is greater than a preset ratio threshold $\delta$, taking the word corresponding to the first generation probability as the output of the word in the text sequence, and recording the word in the text sequence as not steganographic;
2.5, if the ratio is less than or equal to the preset ratio threshold $\delta$, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
2.6, converting the secret information into a binary bit stream, and initializing a value s=1;
2.7, when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the generation probability of that code as the output of the word in the text sequence, and recording the word in the text sequence as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
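The embedding procedure in steps 2.1-2.7 can be illustrated with the following Python sketch, which builds a Huffman code over a candidate pool and matches a prefix of the bit stream; the example probabilities and the threshold value are assumptions for illustration only:

```python
import heapq, itertools

def huffman_codes(probs):
    """Build a Huffman code for a candidate pool given as {word: probability}."""
    counter = itertools.count()        # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(counter), {w: ""}) for w, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]                  # {word: binary code}

def embed_step(candidate_pool, bitstream, threshold=5.0):
    """One embedding step of section 2: returns (chosen word, bits consumed, embedded?)."""
    ranked = sorted(candidate_pool.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[1][1] == 0 or ranked[0][1] / ranked[1][1] > threshold:
        return ranked[0][0], 0, False  # ratio too large: output the top word, embed nothing
    codes = {code: w for w, code in huffman_codes(dict(ranked)).items()}
    for s in range(1, len(bitstream) + 1):
        if bitstream[:s] in codes:     # code equal to the first s bits of the stream
            return codes[bitstream[:s]], s, True
    return ranked[0][0], 0, False      # bit stream exhausted without a match

pool = {"day": 0.4, "night": 0.3, "time": 0.2, "week": 0.1}   # illustrative probabilities
word, used, embedded = embed_step(pool, "1011", threshold=5.0)
# `word` is output at the current position; the first `used` bits of the stream are now embedded
```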
3. Performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution includes:
3.1, for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
3.2, taking out the top $k$ replacement probabilities after sorting as the replacement candidate pool, where $k$ is the preset maximum number of bits embedded per word;
3.3, constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
3.4, converting the secret information into a binary bit stream, and initializing a value s=1;
3.5, when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the replacement probability of that code as the output of the non-steganographic word, and recording the non-steganographic word as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
Embodiment two:
the embodiment of the invention provides a multi-scale joint text steganography system, which comprises the following components:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for carrying out steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
the replacement module is used for determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
Embodiment three:
as shown in fig. 2, based on embodiment one, the present invention provides a secret information extraction method for the multi-scale joint text steganography method, comprising:
s11, acquiring a joint steganography text;
s12, inputting the joint steganography text into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
s13, carrying out extraction operation according to the generated probability distribution and the joint steganography text, and obtaining a first extraction text and an extraction record;
s14, determining unextracted words in the joint steganography text according to the extraction records, inputting the joint steganography text into a pre-constructed generated replacement joint model, and obtaining replacement probability distribution of each unextracted word;
s15, extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
s16, generating secret information according to the first extracted text and the second extracted text;
wherein the extraction operation according to the generation probability distribution and the joint steganography text comprises the following steps:
(1) for each word in the joint steganography text, arranging the generation probability distribution in descending order of generation probability;
(2) taking out the top $k$ generation probabilities after sorting as the generation candidate pool, where $k$ is the preset maximum number of bits embedded per word;
(3) calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool:
(4) if the ratio is greater than a preset ratio threshold $\delta$, taking the word corresponding to the first generation probability as the output of the word in the joint steganography text, and recording the word in the joint steganography text as not extracted;
(5) if the ratio is less than or equal to the preset ratio threshold $\delta$, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
(6) when the word corresponding to a code in the code set is the same as the word in the joint steganography text, taking that code as the output for the word in the joint steganography text, and recording the word in the joint steganography text as extracted;
wherein the extraction operation according to the replacement probability distribution and the joint steganography text comprises the following steps:
(1) for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
(2) taking out the top $k$ replacement probabilities after sorting as the replacement candidate pool, where $k$ is the preset maximum number of bits embedded per word;
(3) constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
(4) when the word corresponding to a code in the code set is the same as the word in the joint steganography text, taking that code as the output of the unextracted word, and recording the unextracted word as extracted.
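Extraction mirrors embedding: the receiver rebuilds the same candidate pool and Huffman code from the model outputs and reads the hidden bits off the code of the word it observes. The sketch below reuses the huffman_codes helper and the example pool from the embedding sketch in embodiment one, both of which are illustrative assumptions:

```python
def extract_step(candidate_pool, observed_word, threshold=5.0):
    """One extraction step of embodiment three: returns the recovered bits ('' if none)."""
    ranked = sorted(candidate_pool.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[1][1] == 0 or ranked[0][1] / ranked[1][1] > threshold:
        return ""                            # this position carried no secret bits
    codes = huffman_codes(dict(ranked))      # {word: code}, same tree as the sender built
    return codes.get(observed_word, "")      # the code of the observed word is the hidden bits

# Continuing the embedding example: the receiver sees "night" at this position
bits = extract_step(pool, "night")           # -> "10", the bits that were embedded there
```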
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A method of multi-scale joint text steganography, comprising:
acquiring a text sequence and secret information;
inputting the text sequence into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
performing steganography operation on the text sequence according to the secret information and the generated probability distribution to obtain a first steganography text and a steganography record;
determining non-steganographic words in a text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
performing steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
a joint steganographic text is generated from the first steganographic text and the second steganographic text.
2. The method of claim 1, wherein the construction process of the generation and replacement joint model comprises:
acquiring a preset number of text data;
preprocessing text data, and constructing a sample set based on the preprocessed text data;
dividing a sample set into a training set and a verification set according to a preset proportion;
constructing the generation and replacement joint model based on PyTorch, the generation and replacement joint model comprising a generation model and a replacement model;
and performing iterative training on the generated and replaced joint model by using a training set, after iterative training, verifying the generated and replaced joint model after iterative training by using a verification set, and after verification, keeping and outputting the generated and replaced joint model with the minimum loss.
3. A multi-scale joint text steganography method as recited in claim 2, wherein the preprocessing comprises:
dividing the text data, reserving words in the division result and generating word sequences;
taking the first n-1 positions of the word sequence as the sample and the last n-1 positions of the word sequence as the label, n being the total length of the word sequence;
if the length of the sample or the label is smaller than a preset length threshold N, padding symbols are appended to the tail of the corresponding sample or label until its length equals the preset length threshold N;
if the length of the sample or the label is larger than the preset length threshold N, the tail of the corresponding sample or label is truncated so that its length equals the preset length threshold N.
4. A multi-scale joint text steganography method as recited in claim 2, wherein iteratively training the generation and replacement joint model using the training set comprises:
inputting the samples of the training set into the generation and replacement joint model, and acquiring the generation probability distribution output by the generation model and the replacement probability distribution output by the replacement model;
taking the generation probability distribution prediction result and the replacement probability distribution prediction result, each together with the label, as inputs of a cross-entropy loss function to calculate the losses $L_1$ and $L_2$, and summing $L_1$ and $L_2$ to obtain the loss $L$;
back-propagating the loss $L$ to obtain the parameter gradients of the generation and replacement joint model, and performing parameter optimization with an Adam optimizer;
substituting the parameter-optimized generation and replacement joint model back into the iterative training step until the loss $L$ converges, and outputting the trained generation and replacement joint model.
5. The method of claim 4, wherein the generation probability distribution output by the generation model is obtained by:
extracting the temporal relation feature vector of each word in the sample one by one with an LSTM and forming a temporal relation feature matrix $H$;
calculating, through a multi-head self-attention mechanism, the relation weight of each word in the sample over the temporal features and expressing it as an attention matrix $A$:
$$A=\sigma\left(\mathrm{Concat}(head_1,\dots,head_m)W^A\right),\qquad head_i=\mathrm{softmax}\!\left(\frac{(HW_i^Q)(HW_i^K)^\top}{\sqrt{d}}\right)HW_i^V$$
where $head_i$ is the output feature vector of attention head $i$, $m$ is the total number of attention heads, $W_i^Q$, $W_i^K$ and $W_i^V$ are the parameter matrices of attention head $i$ corresponding to the query, key and value vectors, $W^A$ is the attention parameter matrix, $d$ is the dimension of the temporal relation feature vector, $\mathrm{Concat}(\cdot)$ is the concatenation operation, and $\sigma(\cdot)$ is a sigmoid function;
multiplying the temporal relation feature matrix $H$ with the attention matrix $A$ to obtain the temporal feature matrix $T$ of each time step:
$$T=A\odot H$$
where $\odot$ denotes element-wise multiplication;
mapping each word in the sample to a high-dimensional semantic space through a word embedding layer to obtain the word embedding vector of each word;
constructing a graph structure $G$ and taking the word embedding vectors of all words in the sample as the node set $V$ of the graph structure $G$, $V=\{v_1,\dots,v_p\}$, where $p$ is the number of words in the sample;
extracting the spatial relations of all words in the sample by a sliding-window algorithm to build the edge set $E$ of the graph structure $G$, $E=\{e_1,\dots,e_q\}$, where $q$ is the number of edges;
extracting the spatial relation feature vector of each node from the graph structure $G$ with a GAT, computing the spatial features through a multi-head self-attention mechanism and expressing them as attention coefficients $\alpha_{ij}$:
$$\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_j\right]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^\top\left[Wh_i\,\Vert\,Wh_k\right]\right)\right)}$$
where $\alpha_{ij}$ is the attention coefficient from node $i$ to node $j$, $\mathcal{N}_i$ is the neighbour set of node $i$, $h_i$, $h_j$ and $h_k$ are the spatial relation feature vectors of nodes $i$, $j$ and $k$, $W$ is the linear-transformation weight matrix of each node, $a$ is a weight vector, $\mathrm{LeakyReLU}(\cdot)$ is the activation function, and $\Vert$ denotes splicing two vectors;
multiplying the attention coefficients $\alpha_{ij}$ by the spatial relation feature vectors of the nodes and updating the spatial relation feature vector of each node through the multi-head self-attention mechanism to generate the spatial feature matrix $S$:
$$s_i=\Big\Vert_{k=1}^{m}\,\sigma\Big(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}W^{k}h_j\Big)$$
where $W^{k}$ is the weight matrix corresponding to attention head $k$;
fusing the temporal feature matrix $T$ and the spatial feature matrix $S$ through the first fully connected layer and an activation function to obtain the fused feature matrix $F$:
$$F=\sigma\left(W_1\left[T\,\Vert\,S\right]\right)$$
where $W_1$ is the parameter matrix of the first fully connected layer;
passing the fused feature matrix $F$ through the second fully connected layer and an activation function for prediction, and outputting the generation probability distribution $P_g$:
$$P_g=\sigma\left(W_2F+b_1\right)$$
where $W_2$ is the parameter matrix of the second fully connected layer and $b_1$ is the first bias parameter.
6. The method of claim 4, wherein the replacement probability distribution output by the replacement model is obtained by:
randomly selecting a plurality of words in the sample and replacing them with a symbol representing a mask, obtaining a masked sample;
mapping the masked sample to a high-dimensional semantic space through the embedding vector layer of BERT to obtain the feature mapping vectors $E_m$ of the words:
$$E_m=\mathrm{Embed}(X_m)$$
where $X_m$ is the masked sample and $\mathrm{Embed}(\cdot)$ is the embedding vector layer;
passing the feature mapping vectors $E_m$ through the third fully connected layer and an activation function for prediction, and outputting the replacement probability distribution set $P_r$:
$$P_r=\sigma\left(W_3E_m+b_2\right)$$
where $W_3$ is the parameter matrix of the third fully connected layer, $b_2$ is the second bias parameter, and $\sigma(\cdot)$ is a sigmoid function;
taking the probability distributions of the masked words in the replacement probability distribution set $P_r$ as the output.
7. A multi-scale joint text steganography method as recited in claim 1, wherein performing the steganography operation on the text sequence according to the secret information and the generation probability distribution comprises:
for each word in the text sequence, arranging the generation probability distribution in descending order of generation probability;
taking out the top $k$ generation probabilities after sorting as the generation candidate pool, where $k$ is the preset maximum number of bits embedded per word;
calculating the ratio of the first generation probability to the second generation probability in the generation candidate pool:
if the ratio is greater than a preset ratio threshold $\delta$, taking the word corresponding to the first generation probability as the output of the word in the text sequence, and recording the word in the text sequence as not steganographic;
if the ratio is less than or equal to the preset ratio threshold $\delta$, constructing a Huffman tree from the generation probabilities in the generation candidate pool, and obtaining the code of each generation probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the generation probability of that code as the output of the word in the text sequence, and recording the word in the text sequence as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
8. A multi-scale joint text steganography method as recited in claim 1, wherein performing the steganography operation on the non-steganographic words according to the secret information and the replacement probability distribution comprises:
for each non-steganographic word, arranging the replacement probability distribution in descending order of replacement probability;
taking out the top $k$ replacement probabilities after sorting as the replacement candidate pool, where $k$ is the preset maximum number of bits embedded per word;
constructing a Huffman tree from the replacement probabilities in the replacement candidate pool, and obtaining the code of each replacement probability from the Huffman tree as a code set;
converting the secret information into a binary bit stream and initializing a value s=1;
when a code in the code set is the same as the first s bits of the binary bit stream, taking the word corresponding to the replacement probability of that code as the output of the non-steganographic word, and recording the non-steganographic word as steganographic; when no code in the code set is the same as the first s bits of the binary bit stream, letting s=s+1 and repeating the current step until s is greater than the total number of bits of the binary bit stream.
9. A multi-scale joint text steganography system, comprising:
the information acquisition module is used for acquiring the text sequence and the secret information;
the generation module is used for inputting the text sequence into a pre-constructed generation and replacement joint model and obtaining the generation probability distribution of each word;
the first steganography module is used for carrying out steganography operation on the text sequence according to the secret information and the generated probability distribution, and obtaining a first steganography text and a steganography record;
the replacement module is used for determining non-steganographic words in the text sequence according to the steganographic records, inputting the text sequence into a pre-constructed generated replacement joint model, and obtaining the replacement probability distribution of each non-steganographic word;
the second steganography module is used for carrying out steganography operation on the non-steganography words according to the secret information and the replacement probability distribution, and obtaining a second steganography text;
and the joint steganography module is used for generating joint steganography text according to the first steganography text and the second steganography text.
10. A secret information extraction method based on a multi-scale joint text steganography method according to any one of claims 1-8, characterized by comprising:
acquiring a joint steganography text;
inputting the joint steganography text into a pre-constructed generation and replacement joint model, and obtaining the generation probability distribution of each word;
carrying out the extraction operation according to the generation probability distribution and the joint steganography text to obtain a first extracted text and an extraction record;
determining unextracted words in the joint steganography text according to the extraction record, inputting the joint steganography text into the pre-constructed generation and replacement joint model, and obtaining the replacement probability distribution of each unextracted word;
extracting according to the replacement probability distribution and the joint steganography text to obtain a second extracted text;
generating secret information according to the first extracted text and the second extracted text;
the extraction operation according to the generation probability distribution and the joint steganography text comprises the following steps:
for each word in the joint steganography text, arranging the generation probability distribution in descending order of generation probability;
taking the first 2^k generation probabilities after the arrangement as the generation candidate pool, where k is the preset maximum number of bits embedded per word;
calculating the ratio of the first and second generation probabilities in the generation candidate pool:
if the ratio is greater than the preset ratio threshold, taking the word corresponding to the first generation probability as the output of the word in the joint steganography text, and recording the word in the joint steganography text as not extracted;
if the ratio is less than or equal to the preset ratio threshold, constructing a Huffman tree according to the generation probabilities in the generation candidate pool, and obtaining the code set of the generation probabilities according to the Huffman tree; when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the word in the joint steganography text, and recording the word in the joint steganography text as extracted;
the extraction operation according to the replacement probability distribution and the joint steganography text comprises the following steps:
for each unextracted word, arranging the replacement probability distribution in descending order of replacement probability;
taking the first 2^k replacement probabilities after the arrangement as the replacement candidate pool, where k is the preset maximum number of bits embedded per word;
constructing a Huffman tree according to the replacement probabilities in the replacement candidate pool, and obtaining the code set of the replacement probabilities according to the Huffman tree;
when the code set contains a code whose corresponding word is identical to the word in the joint steganography text, taking that code as the output for the unextracted word, and recording the unextracted word as extracted.
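By way of illustration only, the per-position recovery described in claim 10 can be sketched in Python as follows. The code_set is assumed to be built exactly as on the embedding side (for example with the hypothetical build_code_set sketched earlier), alpha stands in for the preset ratio threshold, and all names are assumptions rather than part of the patent.

# Illustrative sketch only: recover the secret bits hidden at one position of
# the joint steganography text by mirroring the embedding side.
def extract_bits_at_position(code_set, prob_dist, stego_word, alpha):
    """code_set: dict word -> Huffman code for this position.
    Returns the hidden bit string, or None when the position carries no secret
    bits (generation probability ratio above the threshold)."""
    ranked = sorted(prob_dist.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] / ranked[1] > alpha:
        return None                    # high-confidence word: nothing was embedded here
    return code_set.get(stego_word)    # Huffman code of the observed word = hidden bits

The replacement-based pass over the remaining unextracted words works the same way, only without the ratio check, and the recovered bit strings are concatenated in order to rebuild the secret information.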
CN202310240044.0A 2023-03-14 2023-03-14 Multi-scale combined text steganography method and system Active CN115952528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310240044.0A CN115952528B (en) 2023-03-14 2023-03-14 Multi-scale combined text steganography method and system

Publications (2)

Publication Number Publication Date
CN115952528A (en) 2023-04-11
CN115952528B (en) 2023-05-16

Family

ID=85893115

Country Status (1)

Country Link
CN (1) CN115952528B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595587B (en) * 2023-07-14 2023-09-22 江西通友科技有限公司 Document steganography method and document management method based on secret service
CN117648681B (en) * 2024-01-30 2024-04-05 北京点聚信息技术有限公司 OFD format electronic document hidden information extraction and embedding method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105959104B (en) * 2016-04-25 2019-05-17 深圳大学 Steganalysis method based on Hamming distance distribution
CN109815496A (en) * 2019-01-22 2019-05-28 清华大学 Based on capacity adaptive shortening mechanism carrier production text steganography method and device
CN110533570A (en) * 2019-08-27 2019-12-03 南京工程学院 A kind of general steganography method based on deep learning
CN113987129A (en) * 2021-11-08 2022-01-28 重庆邮电大学 Digital media protection text steganography method based on variational automatic encoder
CN115169293A (en) * 2022-09-02 2022-10-11 南京信息工程大学 Text steganalysis method, system, device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant