CN112395868A

CN112395868A - Rapid and safe natural language information hiding method based on word replacement

Info

Publication number: CN112395868A
Application number: CN202011283016.XA
Authority: CN
Inventors: 冯章成; 向凌云; 傅明; 章登勇
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-02-23

Abstract

The invention discloses a rapid and safe natural language information hiding method based on word replacement, comprising step 1, preprocessing; step 2, distortion value measurement; and step 3, secret information embedding. The invention brings the local distortion of the embedded text close to the global optimum, shortens the total secret information embedding time, and thereby improves both the speed of the embedding process and the security of the embedded text.

Description

Rapid and safe natural language information hiding method based on word replacement

Technical Field

The invention relates to the field of information security, in particular to a quick and safe natural language information hiding method based on word replacement.

Background

With the development of global informatization, information is transmitted ever more frequently using text as a carrier, and people attach increasing importance to the security of that transmission. Natural language information hiding, which hides secret information in a carrier text in an imperceptible manner, therefore urgently needs to be developed and extended; the technology can provide copyright protection, covert communication and the like for important text data.

Natural language information hiding mainly draws on linguistics, statistics and related fields. Using natural language processing and related techniques, it either automatically generates content resembling natural text or modifies existing normal text at the syntactic, lexical or semantic level, and embeds information while keeping the local text, the global semantics and the statistical characteristics unchanged, the grammar correct and the syntactic structure reasonable. Information hiding based on synonym replacement is a mainstream class of steganography that hides secret information by replacing synonyms in the carrier text. However, existing technical schemes introduce high distortion and are easy to perceive. The invention therefore provides a rapid and safe natural language information hiding method based on word replacement, which preserves the information hiding effect while reducing distortion.

Disclosure of Invention

To achieve the purpose of the invention, the following technical scheme is adopted:

A rapid and safe natural language information hiding method based on word replacement comprises step 1, preprocessing; step 2, distortion value measurement; and step 3, secret information embedding, wherein step 1 comprises the following steps:

1.1, preparing a synonym thesaurus;

1.2 randomly selecting an original carrier text T;

1.3 determining the secret information $M$ of length $n_1$ to be embedded; the secret information is a piece of text;

1.4 traversing all words in the carrier text T according to the synonym thesaurus to obtain all synonyms in the carrier text T, assuming there are $n_2$ of them, arranging them in order of appearance, and recording the resulting sequence as the synonym sequence to be embedded $X = \{x_1, x_2, \ldots, x_{n_2}\}$, where $x_i$ is the $i$-th synonym and $1 \le i \le n_2$;

Step 2 of the information hiding method comprises:

2.1 searching the carrier text T for the $n_2$ synonyms according to the entries of the synonym thesaurus; let the $i$-th synonym appearing in the text be $x_i$, let its synonym group of length $p$ be $\mathrm{CG}(x_i) = \{cx_{i,0}, cx_{i,1}, \ldots, cx_{i,p-1}\}$, and let the context window size be $2k$; the words of text T inside the context window of $x_i$ are recorded in order as $\mathrm{Cont}(x_i) = \{w_{i,i-k}, \ldots, w_{i,i-1}, w_{i,i+1}, \ldots, w_{i,i+k}\}$, and the sentence formed in order by $x_i$ and the words in its context window is recorded as $\mathrm{Sen}(x_i, \mathrm{Cont}(x_i)) = \{w_{i,i-k}, \ldots, w_{i,i-1}, x_i, w_{i,i+1}, \ldots, w_{i,i+k}\}$;

2.2 replacing $x_i$ in turn with each synonym $cx_{i,l}$ ($0 \le l \le p-1$) of its synonym group, which gives the candidate stego sentence $\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)) = \{w_{i,i-k}, \ldots, w_{i,i-1}, cx_{i,l}, w_{i,i+1}, \ldots, w_{i,i+k}\}$, and computing the sentence vectors $E(\mathrm{Sen}(x_i, \mathrm{Cont}(x_i)))$ and $E(\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)))$ of the original sentence and the stego sentence;

2.3 calculating the sentence vector distance between the original sentence $\mathrm{Sen}(x_i, \mathrm{Cont}(x_i))$ and the stego sentence $\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i))$ to obtain the distortion value $\rho(cx_{i,l}, x_i)$ caused by replacing $x_i$ with the synonym $cx_{i,l}$ in the current sentence; the distortion function for the sentence vector distance is:

$$\rho(cx_{i,l}, x_i) = 1 - \mathrm{nolmal}(\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)), \mathrm{Sen}(x_i, \mathrm{Cont}(x_i)))$$

where $\mathrm{nolmal}(u, v)$ is computed from $E(u)$ and $E(v)$, the sentence vectors of the original sentence and the stego sentence respectively;

2.4 repeating steps 2.2 and 2.3 to calculate the distortion value $\rho(cx_{i,l}, x_i)$ caused by replacing $x_i$ with each synonym $cx_{i,l}$ ($0 \le l \le p-1$) of its synonym group; at each position to be embedded, selecting the minimum non-zero distortion value $\rho_i$ and the synonym $cx_{i,l}$ that produces it; arranging the minimum distortion values $\rho_i$ of all positions in order to obtain the distortion value sequence $\{\rho_1, \rho_2, \ldots, \rho_{n_2}\}$; and pairing the original word $x_i$ of each position with the synonym $cx_{i,l}$ that produces the corresponding distortion value, the pairs being arranged in order and recorded as the binary synonym library $\{(x_i, cx_{i,l})\}$, $1 \le i \le n_2$;

2.5 formulating the synonym coding rule: for each binary synonym pair $(x_i, cx_{i,l})$ in the binary synonym library, where $1 \le i \le n_2$, each synonym is encoded by the formulas $A = \sum \mathrm{ASCII}(\{a \mid a \in x_i\})$, $B = \sum \mathrm{ASCII}(\{b \mid b \in cx_{i,l}\})$, $F(X) = \min\{A, B\}$, where $a$ and $b$ are the letters that make up the words $x_i$ and $cx_{i,l}$, $A$ and $B$ are the sums of the ASCII codes of all letters $a$ and $b$ respectively, and $F(X)$ is the smaller of $A$ and $B$; if $F(X) = A$, $x_i$ is coded as 0 and $cx_{i,l}$ as 1; if $F(X) = B$, $cx_{i,l}$ is coded as 0 and $x_i$ as 1.
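The coding rule of step 2.5 can be sketched in a few lines of Python. The function names are hypothetical, and the tie-break for A = B (alphabetical order) is an assumption, since the text does not say how equal sums are resolved.

```python
def ascii_sum(word):
    """Sum of the ASCII codes of all letters in the word (A or B in step 2.5)."""
    return sum(ord(ch) for ch in word)


def encode_pair(x, cx):
    """Return {word: bit} for the binary synonym pair (x, cx).

    The word with the smaller ASCII sum is coded 0 and the other 1.
    ASSUMPTION: when the sums are equal, the alphabetically smaller word
    is coded 0, so that the result does not depend on argument order.
    """
    a, b = ascii_sum(x), ascii_sum(cx)
    if a < b or (a == b and x <= cx):
        return {x: 0, cx: 1}
    return {cx: 0, x: 1}


codes = encode_pair("big", "large")
```

Because the rule depends only on the two words, sender and receiver derive the same bits as long as both recover the same pair.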

Step 3 of the information hiding method comprises:

3.1 according to the binary synonym library and the synonym coding rule, binarizing the synonym sequence to be embedded $X$ and recording the result as $X' = \{x'_1, x'_2, \ldots, x'_{n_2}\}$, where $x'_i \in \{0, 1\}$ ($1 \le i \le n_2$); and converting the secret information $M$ to binary, recorded as the secret information bit stream $M'$;

3.2 dividing the secret information into N sections according to the length of the secret information and the length of the synonym sequence to be embedded of the carrier text T;

3.3 from the overall distortion value sequence $\{\rho_1, \ldots, \rho_{n_2}\}$, obtaining the distortion value sequence segment $\{\rho_i, \ldots, \rho_j\}$ ($i \ge 0$, $j \le n_2$) corresponding to each synonym sequence segment to be embedded $\{x_i, \ldots, x_j\}$ ($i \ge 0$, $j \le n_2$); then performing STC coding on the quantized synonym sequence segment $\{x'_i, \ldots, x'_j\}$, the quantized secret information segment $\{m'_i, \ldots, m'_j\}$ and the distortion value sequence segment $\{\rho_i, \ldots, \rho_j\}$ to obtain the STC code sequence $\{y'_i, \ldots, y'_j\}$ ($i \ge 0$, $j \le n_2$) of each segment;

3.4 splicing the STC code sequences of all segments into the complete embedded-text binary vector sequence $Y' = (y'_1, y'_2, \ldots, y'_{n_2})$, $y'_i \in \{0, 1\}$ ($1 \le i \le n_2$), and, according to $Y'$ and $X'$, selectively replacing the words at the corresponding positions of the original carrier text with their partners in the binary synonym library to obtain the steganographic text T' embedded with the secret information $M$; the replacement rule is: if $x'_i \oplus y'_i$ is 1, the word at the corresponding position of the original carrier text is replaced with its partner in the binary synonym library; if it is 0, no word replacement is performed.

The information hiding method further comprises a secret information extraction step 4:

4.1 input: the steganographic text T', the secret information length $n_1$, the number of segments $N$ and the synonym thesaurus;

4.2: traversing all words in T' according to the synonym thesaurus to obtain the synonym sequence of length $n_2$ appearing in it; obtaining the binary synonym library in the same way as in the information hiding algorithm; then quantizing the synonym sequence according to the synonym coding rule to obtain $Y' = \{y'_1, y'_2, \ldots, y'_{n_2}\}$, where $y'_i \in \{0, 1\}$ ($1 \le i \le n_2$);

4.3: according to the number of segments $N$ and the secret information length $n_1$, obtaining the length of each segment of the bit stream, and from it each quantized synonym sequence segment $Y'_i$ and the length $s_1$ of each secret information segment;

4.4: performing STC decoding on each quantized synonym sequence segment $Y'_i$ with the secret information segment length $s_1$ to obtain the secret information bit stream $M'_i$ of each segment, and splicing the segments in order into the complete secret information bit stream $M'$;

4.5: restoring the binary secret information bit stream $M'$ to the secret information $M$ and outputting it.

Drawings

FIG. 1 is a schematic diagram of the rapid and safe natural language information hiding method based on word replacement according to the present invention;

FIG. 2 is a schematic diagram of the SBERT model structure;

FIG. 3 is a schematic diagram of a segment STC embedding model;

FIG. 4 is a diagram of a parity check matrix structure;

FIG. 5 is a graph comparing distortion levels of steganographic text based on segmented STC versus unsegmented STC;

FIG. 6 is a graph comparing the embedding time of steganographic text based on segmented STC versus unsegmented STC.

Detailed Description

The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings 1-6:

As shown in FIG. 1 to FIG. 6, the rapid and safe natural language information hiding method based on word replacement of the present invention includes: step 1, preprocessing; step 2, distortion value measurement; and step 3, secret information embedding.

Step 1. preprocessing

1.1, preparing a synonym word bank (English) in advance;

1.2 randomly selecting an original carrier text T (English); for example, a passage of news or commentary may be taken from a website, or a section of a scientific paper may be selected;

1.3 determining the secret information $M$ of length $n_1$ to be embedded, chosen at random; the secret information is a piece of text (English);

1.4 traversing all words in the carrier text T according to the synonym thesaurus to obtain all synonyms in the carrier text T, assuming there are $n_2$ of them; the synonym sequence to be embedded is recorded as $X = \{x_1, x_2, \ldots, x_{n_2}\}$.
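Step 1.4 can be illustrated with a minimal Python sketch. The toy thesaurus, the regular-expression tokenizer and the function names are assumptions made for the illustration; the text does not specify which thesaurus or tokenizer is used.

```python
import re

# Hypothetical toy thesaurus: each group lists mutually substitutable words.
SYNONYM_GROUPS = [
    ["big", "large", "huge"],
    ["quick", "fast", "rapid"],
]


def build_lookup(groups):
    """Map every thesaurus word to the synonym group that contains it."""
    return {word: group for group in groups for word in group}


def extract_synonym_sequence(text, lookup):
    """Return the token list and the token positions of x_1 .. x_n2.

    Tokens are words or sentence-ending symbols; a token counts as a synonym
    when its lower-cased form appears in the thesaurus lookup.
    """
    tokens = re.findall(r"[A-Za-z']+|[.!?]", text)
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in lookup]
    return tokens, positions


lookup = build_lookup(SYNONYM_GROUPS)
tokens, positions = extract_synonym_sequence("The big dog is fast. A huge cat!", lookup)
X = [tokens[i] for i in positions]  # synonym sequence to be embedded
```

Here `len(X)` plays the role of $n_2$, and `lookup[x.lower()]` gives the synonym group of each element.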

Step 2. distortion value measurement

2.1 searching the carrier text T for the $n_2$ synonyms according to the entries of the synonym thesaurus; let the $i$-th synonym appearing in the text be $x_i$ and let its synonym group of length $p$ be $\mathrm{CG}(x_i) = \{cx_{i,0}, cx_{i,1}, \ldots, cx_{i,p-1}\}$, with a context window of size $2k$ determined from $num(T_{word})$ and $num(T_{symbol})$, which denote respectively the total number of words in the carrier text T and the total number of sentence-ending symbols (e.g. period, question mark) of all sentences in the carrier text T; the words of text T inside the context window of $x_i$ are recorded in order as $\mathrm{Cont}(x_i) = \{w_{i,i-k}, \ldots, w_{i,i-1}, w_{i,i+1}, \ldots, w_{i,i+k}\}$, and the sentence formed in order by $x_i$ and the words in its context window is recorded as $\mathrm{Sen}(x_i, \mathrm{Cont}(x_i)) = \{w_{i,i-k}, \ldots, w_{i,i-1}, x_i, w_{i,i+1}, \ldots, w_{i,i+k}\}$; the value of $2k$ can thus be chosen for each carrier text to give a suitable sentence length.

2.2 replacing $x_i$ in turn with each synonym $cx_{i,l}$ ($0 \le l \le p-1$) of its synonym group, which gives the stego sentence $\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)) = \{w_{i,i-k}, \ldots, w_{i,i-1}, cx_{i,l}, w_{i,i+1}, \ldots, w_{i,i+k}\}$; the original sentence $\mathrm{Sen}(x_i, \mathrm{Cont}(x_i))$ and the stego sentence $\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i))$ are each input to a sentence vector generation model (the SBERT model, whose structure is shown in FIG. 2), which outputs their sentence vectors $E(\mathrm{Sen}(x_i, \mathrm{Cont}(x_i)))$ and $E(\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)))$.
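The construction of the original and stego sentences in steps 2.1 and 2.2 can be sketched as follows. The function name is hypothetical, and clipping the window at the start and end of the text is an assumption; the text does not say how positions near the text boundary are handled. The SBERT encoding itself is left out.

```python
def window_sentence(tokens, pos, k, replacement=None):
    """Build Sen(., Cont(x_i)) for the synonym at token index pos.

    The window covers k tokens on each side of pos (2k context words in
    total). ASSUMPTION: the window is clipped at the text boundaries.
    If replacement is given, it is substituted for the synonym, which
    yields the stego sentence Sen(cx_il, Cont(x_i)).
    """
    lo = max(0, pos - k)
    hi = min(len(tokens), pos + k + 1)
    words = tokens[lo:hi]  # slicing copies, so tokens is left unchanged
    if replacement is not None:
        words[pos - lo] = replacement
    return " ".join(words)


tokens = "the quick brown fox jumps over the lazy dog".split()
original = window_sentence(tokens, 1, 2)         # window around "quick"
stego = window_sentence(tokens, 1, 2, "rapid")   # same window, synonym swapped
```

Both strings would then be passed to the sentence vector model to obtain the two vectors compared in step 2.3.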

2.3 computing, by means of a semantic similarity function (distortion function), the sentence vector distance between the two sentences (the original sentence $\mathrm{Sen}(x_i, \mathrm{Cont}(x_i))$ and the stego sentence $\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i))$) to obtain the distortion value $\rho(cx_{i,l}, x_i)$ caused by replacing $x_i$ with the synonym $cx_{i,l}$ in the current sentence; the distortion function for the sentence vector distance is:

$$\rho(cx_{i,l}, x_i) = 1 - \mathrm{nolmal}(\mathrm{Sen}(cx_{i,l}, \mathrm{Cont}(x_i)), \mathrm{Sen}(x_i, \mathrm{Cont}(x_i)))$$

where $\mathrm{nolmal}(u, v)$ is computed from the sentence vectors $E(u)$ and $E(v)$ of the two sentences.

2.4 according to the synonym sequence to be embedded $X = \{x_1, x_2, \ldots, x_{n_2}\}$ of the carrier text T obtained in preprocessing and the synonym thesaurus, traversing each $x_i$ ($1 \le i \le n_2$) of the synonym sequence to be embedded together with its synonym group $\mathrm{CG}(x_i) = \{cx_{i,0}, cx_{i,1}, \ldots, cx_{i,p-1}\}$; repeating steps 2.2 and 2.3 to calculate the distortion value $\rho(cx_{i,l}, x_i)$ caused by replacing $x_i$ with each synonym $cx_{i,l}$ ($0 \le l \le p-1$) of the group; at each position to be embedded, selecting the minimum non-zero distortion value $\rho_i$ and the synonym $cx_{i,l}$ that produces it; and arranging the minimum distortion values $\rho_i$ of all positions in order to obtain the distortion value sequence $\{\rho_1, \rho_2, \ldots, \rho_{n_2}\}$.

2.5 formulating the synonym coding rule: for each binary synonym pair $(x_i, cx_{i,l})$ in the binary synonym library (where $1 \le i \le n_2$), each synonym is encoded by the formulas $A = \sum \mathrm{ASCII}(\{a \mid a \in x_i\})$, $B = \sum \mathrm{ASCII}(\{b \mid b \in cx_{i,l}\})$, $F(X) = \min\{A, B\}$, where $a$ and $b$ are the letters that make up the English words $x_i$ and $cx_{i,l}$, $A$ and $B$ are the sums of the ASCII codes of all letters $a$ and $b$ respectively, and $F(X)$ is the smaller of $A$ and $B$; if $F(X) = A$, $x_i$ is coded as 0 and $cx_{i,l}$ as 1; if $F(X) = B$, $cx_{i,l}$ is coded as 0 and $x_i$ as 1.
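Steps 2.3 and 2.4 can be sketched in Python once sentence vectors are available. The definition of nolmal is not reproduced in this text, so the sketch assumes it is the cosine similarity rescaled to [0, 1], that is (1 + cos) / 2; this is an assumption, not the patent's stated formula. The vectors are plain lists standing in for SBERT outputs, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distortion(u, v):
    """rho = 1 - nolmal(u, v), with nolmal ASSUMED to be (1 + cos) / 2."""
    return 1.0 - (1.0 + cosine(u, v)) / 2.0


def pick_min_nonzero(original_vec, candidate_vecs, eps=1e-12):
    """Return (rho_i, l) for the smallest distortion above eps, or None.

    A candidate identical to the original word gives distortion 0 and is
    skipped, matching "minimum distortion value excluding zero" in step 2.4.
    """
    best = None
    for l, vec in enumerate(candidate_vecs):
        rho = distortion(original_vec, vec)
        if rho > eps and (best is None or rho < best[0]):
            best = (rho, l)
    return best


# Toy vectors: candidate 0 equals the original, candidate 2 is the closest other.
best = pick_min_nonzero([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The returned index identifies the synonym paired with $x_i$ in the binary synonym library, and the returned value is the entry $\rho_i$ of the distortion value sequence.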

Step 3. secret information embedding

3.1 according to the binary synonym library and the synonym coding rule, binarizing the synonym sequence to be embedded $X$ (that is, if $x_i$ is coded as 0 under the synonym coding rule of step 2.5, $x'_i$ is recorded as 0; if $x_i$ is coded as 1, $x'_i$ is recorded as 1) and recording the result as $X' = \{x'_1, x'_2, \ldots, x'_{n_2}\}$, where $x'_i \in \{0, 1\}$ ($1 \le i \le n_2$); and converting the secret information $M$ to binary, recorded as the secret information bit stream $M'$;

3.2 dividing the secret information into $N$ segments ($N$ is an integer with $2 < N < n_1$) according to the length of the secret information and the length of the synonym sequence to be embedded of the carrier text T; that is, the secret information bit stream $M'$ and the quantized synonym sequence to be embedded $X'$ are each divided into the same number of short segments, and if the length of the secret information or of the synonym sequence to be embedded is not divisible by $N$, the remainder is placed in the $N$-th segment;
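The segmentation of step 3.2 can be sketched as a small helper. The function name is hypothetical; the sketch assumes equal-length segments with any remainder appended to the last one, which is how the step reads.

```python
def split_into_segments(seq, n_segments):
    """Split seq into n_segments pieces of equal length.

    If len(seq) is not divisible by n_segments, the leftover elements are
    placed in the last (N-th) segment, as described in step 3.2.
    """
    base = len(seq) // n_segments
    if base == 0:
        raise ValueError("more segments than elements")
    segments = [seq[i * base:(i + 1) * base] for i in range(n_segments - 1)]
    segments.append(seq[(n_segments - 1) * base:])
    return segments


x_bits = [1, 0, 1, 1, 0, 1, 0]      # quantized synonym sequence X'
m_bits = [0, 1, 1, 0, 1, 1]         # secret information bit stream M'
x_segments = split_into_segments(x_bits, 3)
m_segments = split_into_segments(m_bits, 3)
```

The same `n_segments` is applied to both sequences so that segment number j of the secret bits is embedded into segment number j of the synonym bits.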

3.3 from the overall distortion value sequence $\{\rho_1, \ldots, \rho_{n_2}\}$, obtaining the distortion value sequence segment $\{\rho_i, \ldots, \rho_j\}$ ($i \ge 0$, $j \le n_2$) corresponding to each synonym sequence segment to be embedded $\{x_i, \ldots, x_j\}$ ($i \ge 0$, $j \le n_2$); then performing STC coding on the quantized synonym sequence segment $\{x'_i, \ldots, x'_j\}$, the quantized secret information segment $\{m'_i, \ldots, m'_j\}$ and the distortion value sequence segment $\{\rho_i, \ldots, \rho_j\}$. Embedding is completed by the embedding function $\mathrm{Emb}()$, which selects the minimum-distortion vector from $C(M')$, the feasible set of embedded-text binary vectors $Y'$. Secret information extraction is likewise completed by the formula $\mathrm{Ext}(Y') = HY'$, where $\mathrm{Ext}()$ denotes the extracted secret information and $H$ is the parity check matrix. $H$ is formed by placing $f$ sub-matrices of height $h$ and width $w$ along the main diagonal, with adjacent sub-matrices staggered by one row, so that $H$ has height $f$ and width $f \times w$; the specific structure is shown in FIG. 4. STC coding uses the Viterbi algorithm to find the minimum-distortion path from the vector $X'$ to the vector $Y'$ and thereby determine the embedding positions. If conditions allow, the segments can be distributed to different CPU cores for computation;
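The parity check matrix and the relation Ext(Y') = HY' in step 3.3 can be illustrated on one short segment. The sketch below is not the Viterbi-based STC encoder: it replaces the Viterbi search with an exhaustive search over all candidate vectors, which is only feasible for very short segments. The sub-matrix values, the truncation of blocks that extend past row f, and the function names are assumptions made for the illustration.

```python
from itertools import product


def build_parity_check(sub, f):
    """Place f copies of the h x w sub-matrix along the diagonal of H.

    Each copy starts one row lower than the previous one; rows that would
    fall below row f are dropped, so H has height f and width f * w.
    """
    h, w = len(sub), len(sub[0])
    H = [[0] * (f * w) for _ in range(f)]
    for j in range(f):
        for r in range(h):
            if j + r < f:
                for c in range(w):
                    H[j + r][j * w + c] = sub[r][c]
    return H


def extract(H, y):
    """Ext(Y') = H Y', computed over GF(2)."""
    return [sum(hv * yv for hv, yv in zip(row, y)) % 2 for row in H]


def embed_bruteforce(H, x, m, rho):
    """Find the y with H y = m that minimises the summed rho of changed bits.

    Stand-in for the Viterbi search: tries all 2^len(x) vectors.
    """
    best, best_cost = None, float("inf")
    for y in product((0, 1), repeat=len(x)):
        if extract(H, y) == m:
            cost = sum(r for r, xb, yb in zip(rho, x, y) if xb != yb)
            if cost < best_cost:
                best, best_cost = list(y), cost
    return best, best_cost


H = build_parity_check([[1, 0], [1, 1]], f=2)   # h = 2, w = 2
y, cost = embed_bruteforce(H, [1, 0, 1, 1], [0, 1], [0.3, 0.1, 0.2, 0.4])
```

Only the first bit changes in this example, and the receiver recovers the two message bits with `extract(H, y)` without knowing the distortion values.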

3.4 splicing the STC code sequences $\{y'_i, \ldots, y'_j\}$ ($i \ge 0$, $j \le n_2$) of all segments into the complete embedded-text binary vector sequence $Y' = (y'_1, y'_2, \ldots, y'_{n_2})$, $y'_i \in \{0, 1\}$ ($1 \le i \le n_2$), and, according to $Y'$ and the quantized synonym sequence to be embedded $X'$, selectively replacing the original words at the corresponding positions of the original carrier text with their partners in the binary synonym library: if $x'_i \oplus y'_i$ is 1 the word is replaced, and if it is 0 no word replacement is performed; the steganographic text T' embedded with the secret information $M$ is finally obtained.
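The selective replacement of step 3.4 can be sketched as below. The comparison quantity is missing from this text, so the sketch assumes the natural reading that a word is swapped for its partner exactly when the target bit y'_i differs from the current bit x'_i. The function and argument names are hypothetical, and capitalization of replaced words is ignored.

```python
def apply_replacements(tokens, positions, pairs, x_bits, y_bits):
    """Return the stego token list.

    positions[i] is the token index of synonym x_i, pairs[i] is its binary
    synonym pair (x_i, cx_il), x_bits[i] is its current code bit and
    y_bits[i] is the bit required by the STC code sequence.
    ASSUMPTION: replace when the two bits differ (x'_i XOR y'_i == 1).
    """
    out = list(tokens)
    for pos, (orig, alt), xb, yb in zip(positions, pairs, x_bits, y_bits):
        if xb ^ yb:
            out[pos] = alt
    return out


tokens = ["The", "big", "dog", "is", "fast"]
stego = apply_replacements(
    tokens,
    positions=[1, 4],
    pairs=[("big", "large"), ("fast", "quick")],
    x_bits=[0, 0],
    y_bits=[1, 0],
)
```

Under the ASCII-sum coding rule "big" and "fast" both code to 0, so only the first position, whose target bit is 1, is replaced.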

The invention also comprises a secret information extraction step 4, as follows:

Step 4. secret information extraction

4.2: traversing all words in T' according to the synonym thesaurus to obtain the synonym sequence of length $n_2$ appearing in it; then applying step 2.4, with this synonym sequence taking the place of the synonym sequence to be embedded $X$, to obtain the binary synonym library; then quantizing the synonym sequence according to the synonym coding rule to obtain $Y' = \{y'_1, y'_2, \ldots, y'_{n_2}\}$, where $y'_i \in \{0, 1\}$ ($1 \le i \le n_2$);

4.3: according to the number of segments $N$ and the secret information length $n_1$, obtaining each quantized synonym sequence segment $Y'_i$ and the length $s_1$ of each secret information segment;

4.4: performing STC decoding on each quantized synonym sequence segment $Y'_i$ with the secret information segment length $s_1$ to obtain the secret information bit stream $M'_i$ of each segment, and splicing the segments in order into the complete secret information bit stream $M'$;

4.5: restoring the binary secret information bit stream $M'$ to the secret information $M$ and outputting it.
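The conversion between the text secret information and its bit stream (steps 3.1 and 4.5), together with the splicing of decoded segments in step 4.4, can be sketched as follows. The text does not state the character encoding, so 8-bit ASCII is an assumption, and the function names are hypothetical.

```python
def text_to_bits(text):
    """Encode text as a flat list of bits, 8 bits per character (ASSUMED ASCII)."""
    return [int(b) for byte in text.encode("ascii") for b in format(byte, "08b")]


def bits_to_text(bits):
    """Inverse of text_to_bits; len(bits) must be a multiple of 8."""
    values = [
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, len(bits), 8)
    ]
    return bytes(values).decode("ascii")


def splice(segments):
    """Concatenate the per-segment bit streams M'_i in order into M'."""
    return [bit for segment in segments for bit in segment]


bits = text_to_bits("Hi")
recovered = bits_to_text(splice([bits[:5], bits[5:11], bits[11:]]))
```

The segment boundaries used on the receiving side must match those used for embedding, which is why the number of segments and the secret information length are inputs to the extraction step.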

The invention brings the local distortion of the embedded text close to the global optimum, shortens the total secret information embedding time, and thereby improves both the speed of the steganographic process and the security of the embedded text.

Comparative experiment:

To verify that the proposed method is fast and safe, a comparison experiment was carried out between the segmented STC information hiding method designed by the invention (denoted P-Seg-STC) and the unsegmented STC information hiding method (denoted P-STC), embedding secret information of the same length. The parameters of the segmented and unsegmented STC codes were set identically: the relative payload of the carrier text is 0.3 and the constraint height is 2. For the segmented method the number of segments is $3n_2/20$, where $n_2$ is the length of the synonym sequence to be embedded, and 2 bits of secret information are embedded in each synonym sequence segment.
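As a consistency check on these settings, the stated segment count and bits per segment reproduce the stated payload. The check assumes the relative payload is the ratio of secret bits $n_1$ to synonym positions $n_2$, which the text does not define explicitly.

```latex
\[
n_1 = 2N = 2 \cdot \frac{3 n_2}{20} = 0.3\, n_2
\quad\Longrightarrow\quad
\frac{n_1}{n_2} = 0.3,
\qquad
\frac{n_2}{N} = \frac{20}{3} \approx 6.67 .
\]
```

Each segment therefore carries 2 secret bits in roughly 6 to 7 synonym positions.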

200 texts were randomly selected from the Gutenberg corpus as carrier texts, and the corresponding steganographic texts were generated with each method. The distortion and the embedding time of the segmented and unsegmented STC information hiding methods were compared, and the data are presented as line graphs in FIG. 5 and FIG. 6 respectively.

As can be seen from FIG. 5 and FIG. 6, for the same carrier text and secret information of the same length, the steganographic text generated with segmented STC coding achieves distortion nearly identical to, or even lower than, that of unsegmented STC coding. The embedding time required to generate the steganographic text with segmented STC coding is markedly shorter than with unsegmented STC coding, which shows that segmented STC coding can greatly shorten the embedding time of information hiding based on synonym replacement and improve its embedding speed.

Claims

1. A natural language information hiding method based on word replacement comprises the steps of 1, preprocessing, 2, distortion value measurement and 3, secret information embedding, and is characterized in that the step 1 comprises the following steps:

1.1, preparing a synonym thesaurus;

1.2 randomly selecting an original carrier text T;

1.3 determining the secret information $M$ of length $n_1$ to be embedded; the secret information is a piece of text;

wherein $1 \le i \le n_2$.