CN117079669A - Feature vector extraction method for LSB audio steganography with low embedding rate - Google Patents

Feature vector extraction method for LSB audio steganography with low embedding rate

Info

Publication number: CN117079669A
Application number: CN202311336594.9A
Authority: CN (China)
Prior art keywords: audio, steganography, lsb, feature vector, time
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 林萍萍, 王忠臣, 章云鹏
Assignees: Shandong Future Network Research Institute Industrial Internet Innovation Application Base Of Zijinshan Laboratory; Boshang Shandong Network Technology Co., Ltd.
Application filed by Shandong Future Network Research Institute Industrial Internet Innovation Application Base Of Zijinshan Laboratory and Boshang Shandong Network Technology Co., Ltd.; priority to CN202311336594.9A.

Classifications

    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Neural networks; learning methods
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination


Abstract

The invention discloses a feature vector extraction method for LSB audio steganography with a low embedding rate, comprising the following steps: feature vectors are extracted based on conditional probability. The noise of adjacent speech sampling points within a short time is strongly correlated, and the conditional probability P(X_{m+1} = b | X_m = a) represents the likelihood that the system transitions to another state at time m+1 given that it is in a known state at time m, where the sample value at time m is a and the sample value at time m+1 is b. According to the invention, the correlation of noise signals at different positions is characterized by a first-order discrete random variable conditional distribution law; a probability matrix of the audio digital coding sequence is constructed and converted into a high-dimensional feature vector. This feature vector effectively captures the local micro-differences introduced before and after steganography. Used as the input for training a CNN model, it effectively improves the recognition accuracy of LSB audio steganography with a low embedding rate, and provides a probability-distribution modeling method for the noise sequences introduced by LSB steganography.

Description

Feature vector extraction method for LSB audio steganography with low embedding rate
Technical Field
The invention relates to the technical field of LSB audio steganography detection, in particular to a feature vector extraction method for LSB audio steganography with low embedding rate.
Background
1. LSB audio steganography detection technology
With the rapid development of computers and the internet, people communicate and exchange information rapidly over networks, and digital multimedia content such as audio and video has become a main carrier of communication. Owing to the digitization of the audio carrier and the redundancy of its information encoding, hidden information can be embedded into an audio file without being perceptible to human hearing, realizing covert transmission of data. LSB audio steganography embeds the hidden information in the Least Significant Bits (LSBs) of an audio file, because modifying the least significant bits has little impact on audio quality.
The LSB audio steganography detection technology is a countermeasure technology of the LSB audio steganography technology, and is used for analyzing suspicious audio to judge whether an audio file is steganographically by the LSB.
2. CNN (Convolutional Neural Network)
In recent years, convolutional neural networks (CNNs) have been widely used in fields such as image and speech recognition. Detection based on convolutional neural networks is currently the most advanced LSB audio steganography detection method: compared with traditional manual feature extraction methods such as chi-square detection and SPA detection, it automates the extraction of diverse features; compared with Bagging detection, which extracts only partial features, it achieves comprehensive high-dimensional feature abstraction.
3. Feature vector
The acquisition of a digital speech signal first requires sampling the continuous speech signal at regular intervals, then rounding and quantizing the sample values, and finally binary coding. However, because ideal acquisition conditions do not exist, noise is necessarily introduced into the digital speech signal during acquisition.
The speech signal acquisition sequence x(n) can be expressed as x(n) = s(n) + w(n), where s(n) denotes the interference-free speech sample value at time n and w(n) is the noise signal value at time n. If LSB steganography is regarded as artificial noise, this provides a new way of extracting distribution features in our steganalysis.
For an LSB-steganographed speech signal, the acquisition sequence can be updated to x'(n) = s(n) + w'(n), where w'(n) is the composite noise after LSB audio steganography. That is, if LSB steganography is regarded as a kind of noise, the noise signal value changes from w(n) to w'(n). This transition necessarily changes some of the original distribution characteristics of the pre-steganography noise. If this distribution characteristic can be quantified, an effective high-dimensional feature vector can be extracted and used as a training sample for a CNN steganography detection model. In CNN model training, a high-quality feature-vector training set is the key to improving training quality and prediction accuracy.
There are two ways to quantify the noise sequence feature variation introduced by LSB steganography at present, namely noise sequence estimation based on local correlation and noise sequence estimation of wavelet signal reconstruction, respectively.
1) Noise sequence estimation based on local correlation
Because of the short-term correlation of speech, adjacent speech signal samples can be considered equal within tolerance, i.e. s(n) ≈ s(n+1). Noise randomly affects the speech sample values; accordingly, it can be assumed that the noise is superimposed on only one of x(n) and x(n+1), so that the difference between the two is a noise value. The difference between adjacent speech sample values can therefore be used as the noise, as shown in the following formula:

w(n) ≈ x(n+1) − x(n)
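As a concrete illustration, this differential estimator can be sketched in a few lines of Python; the carrier values and the perturbation pattern below are hypothetical, chosen only to show the mechanics:

```python
def estimate_noise_by_difference(samples):
    """Estimate the noise sequence as the difference of adjacent samples:
    w(n) ~ x(n+1) - x(n).  Under short-term stationarity the clean
    components of neighbouring samples cancel, so the difference is
    dominated by the noise superimposed on one of the two samples."""
    return [nxt - cur for cur, nxt in zip(samples, samples[1:])]

# A slowly varying carrier with LSB-style perturbations on odd positions.
carrier = [100, 100, 101, 101, 102, 102]
noisy = [c + (i % 2) for i, c in enumerate(carrier)]
noise_seq = estimate_noise_by_difference(noisy)
```

The resulting differential sequence exposes the alternating single-LSB perturbations even though they are invisible in the raw amplitudes.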
2) Noise sequence estimation for wavelet signal reconstruction
Wavelet denoising reconstructs the signal by selecting appropriate wavelet coefficients in the wavelet space, so as to eliminate noise. Studies have proposed denoising with an improved wavelet transform (SGWT). The main reason is that the output of a conventional wavelet filter is not an integer but a floating-point number, so when the wavelet-transformed data is compressed and quantized there is a large error and serious audio distortion. The filter constructed by the lifting wavelet transform, also called the second-generation wavelet transform, outputs integers, avoiding the floating-point problem. If a two-level wavelet decomposition is performed on the speech signal, high-frequency and low-frequency sub-signals are obtained at each level. The low-frequency part reflects the average of the signal and the high-frequency part reflects the signal's detail differences, so the noise of the signal is mainly represented in the high-frequency part. In the high-frequency part, the wavelet coefficients are thresholded; the processing methods are hard thresholding and soft thresholding. A hard threshold can make the denoised signal oscillate near singular points, so soft thresholding is generally adopted for the first high-frequency band: when the absolute value of a wavelet coefficient is smaller than the threshold, the coefficient is set to zero; when it is larger, the difference between the coefficient and the threshold replaces the original coefficient. The wavelet signal is then reconstructed to obtain an estimate ŝ(n) of the original signal, and the noise sequence is taken as:

w(n) = x(n) − ŝ(n)
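A minimal sketch of this pipeline, assuming a one-level integer Haar lifting step as a stand-in for the second-generation wavelet transform (the patent does not name a specific wavelet, and the threshold value below is likewise illustrative):

```python
def haar_lifting_forward(x):
    """One level of the integer (lifting / second-generation) Haar transform.
    All outputs are integers, which is the property the text cites."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for o, e in zip(odd, even)]            # high-frequency part
    approx = [e + (d >> 1) for e, d in zip(even, detail)]  # low-frequency part
    return approx, detail

def haar_lifting_inverse(approx, detail):
    """Exact integer reconstruction (undoes haar_lifting_forward)."""
    even = [a - (d >> 1) for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out

def soft_threshold(coeffs, t):
    """Soft thresholding: zero small coefficients, shrink large ones toward 0."""
    return [0 if abs(c) <= t else (c - t if c > 0 else c + t) for c in coeffs]

# Denoise, reconstruct, and take the residual as the noise-sequence estimate.
x = [10, 13, 9, 12, 11, 14, 8, 15]
approx, detail = haar_lifting_forward(x)
x_hat = haar_lifting_inverse(approx, soft_threshold(detail, 2))
noise_estimate = [a - b for a, b in zip(x, x_hat)]
```

The lifting scheme guarantees perfect integer round-trip reconstruction when no thresholding is applied; thresholding only the detail (high-frequency) coefficients matches the observation that audio noise concentrates in the high-frequency part.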
Noise sequence estimation based on local correlation, in short, takes the difference between adjacent samples: since speech is short-term stationary, adjacent samples can be considered equal, i.e. s(n) ≈ s(n+1). The embedded secret information is superimposed on certain sample values, making them unequal to their neighbors. The difference x(n+1) − x(n) can therefore be regarded as noise, and a differential sequence of sample values can be constructed to quantify this feature. This quantization approach is simple in concept and highly operable. The other approach quantizes the noise introduced by LSB steganography based on the wavelet transform, using the noise sequence of the wavelet signal reconstruction; since the noise of an audio signal is mainly represented in the high-frequency part, only the high-frequency part is thresholded. Finally, the inverse wavelet transform yields an estimate ŝ(n) of the original signal x(n), and the differential sequence obtained by differencing the two is the feature vector quantifying the feature change. However, the sensitivity to steganography of the feature vectors extracted by both methods is positively correlated with the embedding rate of the secret information: the higher the embedding rate, the better the detection performance. Their detection of LSB audio steganography at low embedding rates is poor. LSB steganography targets the least significant bit, which changes the sample amplitude only very weakly, and a too-low embedding rate makes it difficult for the feature vector's characterization ability to take effect.
Since the secret information embedded by steganography disturbs the overall content only very slightly, more extensive research and higher-level mathematical analysis are required to find features, in some imperceptible dimension, that distinguish a normal audio carrier from a steganographic audio carrier.
There is currently no effective solution to the above problems.
Disclosure of Invention
Aiming at the technical problems in the related art, the invention provides a feature vector extraction method aiming at LSB audio steganography with low embedding rate, which can overcome the defects in the prior art.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
a feature vector extraction method for LSB audio steganography with low embedding rate comprises the following steps:
S1, extracting feature vectors based on conditional probability: the noise of adjacent speech sampling points within a short time is strongly correlated, and the conditional probability P(X_{m+1} = b | X_m = a) indicates the likelihood that the system transitions to another state at time m+1, given that it is in a certain state at time m, where the sample value at time m is a and the sample value at time m+1 is b;
The specific steps for extracting the feature vector based on the conditional probability in the S1 are as follows:
s11, acquiring a digital coding sequence of a voice fragment;
s12, determining an audio sampling discrete value range;
S13, counting the number of occurrences and the proportion of each discrete point of the audio sampling discrete value domain in the speech segment;
S14, evaluating the correlation of the sampling points using the conditional probability distribution law P(X_{m+1} = b | X_m = a);
S15, substituting the probability values calculated in step S14 into the probability matrix;
s16, converting the probability matrix into a target feature vector;
s2, a complete LSB steganography detection model training process;
s21, setting a data set;
S22, training of the CNN model: extracting a high-dimensional feature vector based on the conditional probability distribution model and training the CNN network model with the extracted vector as the input value; the specific steps are as follows:
S221, extracting feature vectors Xi and X'i of the same dimension for equal numbers of original audio and steganographic audio, respectively;
S222, training the CNN with {Xi, X'i} as input vectors and the corresponding class labels as return values.
Further, in step S14, P(X_{m+1} = b | X_m = a) is calculated by classical probability.
Further, the specific steps of setting the data set in S21 are as follows:
S211, randomly selecting uncompressed speech fragments from a public data set, isochronously segmenting the original speech into a number of small fragments of equal duration, acquiring a certain number of speech fragments as the data set and backing it up; the original data set serves as the normal, non-steganographic data set, and LSB audio steganography is performed on the backup;
s212, performing steganography operation on the backed-up data set by using an LSB audio steganography algorithm to obtain the same number of normal audio and steganography audio, wherein half of the normal audio and steganography audio are used for training, and the rest half of the normal audio and steganography audio are used for testing.
Further, the duration of each audio clip in step S211 is 5 s.
Further, the sampling frequency of the sample in step S211 is 16kHz.
Further, the embedding rate of LSB steganography in step S212 is 5%.
Further, the specific steps of S222 are as follows:
S2221, preprocessing the input values using a hyper-parametric convolution kernel;
S2222, performing the superposition operation of convolution groups on the preprocessed data output by step S2221, realizing layer-by-layer dimension reduction and extraction of high-level semantics of the data;
S2223, inputting the data output by step S2222 into the classifier: the result is processed through a fully connected layer and input into the softmax layer, and finally the recognition result is output as a probability.
The invention has the beneficial effects that: according to the invention, the correlation of noise signals at different positions is represented by using a first-order discrete random variable conditional distribution law, a probability matrix of an audio digital coding sequence is constructed and is converted into a high-dimensional feature vector, the feature vector can effectively capture local micro differences introduced before and after steganography, the feature vector is used as an input value for training a CNN model, the recognition accuracy of LSB audio steganography with low embedding rate can be effectively improved, and a probability distribution modeling means is provided for noise sequences introduced by LSB steganography.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a complete LSB steganography detection process for a feature vector extraction method for low-embedding-rate LSB audio steganography according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
The explanation and meaning of the nouns referred to in the invention are as follows:
LSB: Least Significant Bit
CNN: Convolutional Neural Network
SGWT: Second Generation Wavelet Transform (lifting wavelet transform)
ReLU: Rectified Linear Unit (linear rectification function)
pool: pooling layer
conv: convolution
channel: the number of channels, also called the width of the neural network
We propose, for the first time, a method of characterizing the correlation of adjacent audio signals using the discrete random variable conditional distribution law for LSB audio steganography with a low embedding rate. Since LSB steganography at a low embedding rate introduces few modifications, the degree of interference with the original audio content after data hiding is very low. Thus the means that attempt to capture specific noise content of the audio by quantifying noise distribution characteristics are unsuitable for this low-embedding-rate detection problem. Considering the short-term stationarity of speech, adjacent speech sampling points are necessarily strongly correlated. The invention uses a first-order discrete random variable conditional distribution law to characterize the correlation of noise signals at different positions, constructs a probability matrix of the audio digital coding sequence, and converts it into a high-dimensional feature vector that effectively captures the local micro-differences introduced before and after steganography. Using it as the input for training a CNN model effectively improves the recognition accuracy of LSB audio steganography with a low embedding rate.
1. Feature vector extraction based on conditional probability
Modern speech coding is basically audio digital coding: a continuously varying analog signal is converted into a digital code through the three steps of sampling, quantization and coding. We can regard such a digital code sequence as a random sequence that is discrete in both time and state.
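As an illustration of the three-step digitization, the following sketch samples a hypothetical 440 Hz tone at 16 kHz and quantizes it to 16-bit codes; the signal frequency and duration are assumptions, not values from the patent:

```python
import math

def sample_and_quantize(duration_s=0.01, fs=16000, bits=16, freq=440.0):
    """Simulate the three digitization steps: sample, quantize, encode."""
    n_samples = int(duration_s * fs)
    full_scale = 2 ** (bits - 1) - 1
    codes = []
    for n in range(n_samples):
        analog = math.sin(2 * math.pi * freq * n / fs)   # 1) sampling
        q = round(analog * full_scale)                   # 2) rounding/quantization
        codes.append(q & 0xFFFF)                         # 3) 16-bit binary code
    return codes

codes = sample_and_quantize()
```

Each element of `codes` is one state of the resulting discrete-time, discrete-state random sequence that the conditional-probability analysis below operates on.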
The speech sampling environment is stable within a certain short time range, so the noise of adjacent speech sampling points within a short time is strongly correlated. The conditional probability P(X_{m+1} = b | X_m = a) indicates the likelihood that the system transitions to another state at time m+1, given that it is in a known state at time m.
In view of this, the feature vector extraction based on conditional probability is specifically as follows:
1. acquiring a digital coding sequence of a voice fragment;
2. determining an audio sampling discrete value range;
3. counting the number of occurrences and the proportion of each discrete point (sampling value) of the value domain in the speech segment;
4. evaluating the correlation of the sampling points by using a conditional probability distribution law (formula 2-1);
5. bringing the probability value calculated in the step 4 into a probability matrix;
6. the probability matrix is converted into a target feature vector.
The following we describe this process by way of an example:
we use the sequence of letters to simulate the quantized speech sample sequence, assuming the digital coding sequence of the speech segment S isFirst of all its state space is +.>I.e. the audio discrete sample value range, the actual audio sample discrete value range is much larger than this. The number and duty cycle of occurrences of each discrete point in the statistical value domain in the speech segment:
Using equation 2-1, if the sample value at time m is a and the sample value at time m+1 is b, the probability is:

P(X_{m+1} = b | X_m = a) = P(X_m = a, X_{m+1} = b) / P(X_m = a)    (2-1)
where P(X_m = a, X_{m+1} = b) is the probability that the sample value at time m is a and the sample value at time m+1 is b, calculated by classical probability. We need to count, for current state a, how many times the next state is each of a, b, c, d, e, f, g:
a→a: 0 times; a→b: 1 time; a→c: 1 time; a→d: 2 times; a→e: 0 times; a→f: 1 time; a→g: 0 times;
so that P(a|a) = 0, P(b|a) = 1/5, P(c|a) = 1/5, P(d|a) = 2/5, P(e|a) = 0, P(f|a) = 1/5, P(g|a) = 0.
In this way, we evaluate the correlation between all the sampling points of the audio segment and construct a probability matrix, as shown in the figure:
This 7×7 matrix is converted into a feature vector:
thus we succeeded in extracting a 49-dimensional feature vector from a speech segment. We find that the dimension of the extracted feature vector is exactly the square of the discrete points of the speech encoded discrete value range.
2. Complete LSB steganography detection model training process
1. Data set arrangement
Because of the popularity of speech research and the openness of audio data, we can conveniently acquire, through public data sets on the network, an audio data set suitable for network model training. The audio coding standard used in the data set is G.711 speech coding, stored in PCM format. The data set is set up as follows:
1) Randomly select uncompressed speech fragments from the public data set and isochronously slice the original audio into 20000 small fragments. The duration of each audio clip is 5 seconds, and the sampling frequency is 16 kHz. After acquiring the 20000 five-second clips, back them up; use the original data as the normal, non-steganographic data set and perform LSB audio steganography on the backup;
2) Perform the steganography operation on the backed-up data set using an LSB audio steganography algorithm with an embedding rate of 5%, obtaining 20000 pairs of normal audio and steganographic audio in total. One half is used for training and the other for testing. In the training phase, 4000 pairs are set aside for post-training verification and the remaining 16000 pairs are used to train the neural network.
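A minimal sketch of the LSB embedding step at a 5% rate, with hypothetical cover samples and a fixed random seed for reproducibility (the patent does not specify how embedding positions are chosen; random positions are assumed here):

```python
import random

def lsb_embed(samples, bits, rate=0.05, seed=1234):
    """Embed one secret bit into the LSB of a fraction `rate` of the samples."""
    rng = random.Random(seed)
    stego = list(samples)
    n_embed = int(len(samples) * rate)
    positions = rng.sample(range(len(samples)), n_embed)
    for pos, bit in zip(positions, bits):
        stego[pos] = (stego[pos] & ~1) | bit   # overwrite the least significant bit
    return stego

cover = [100, 101, 102, 103] * 500            # 2000 hypothetical PCM samples
srng = random.Random(7)
secret = [srng.randint(0, 1) for _ in range(100)]
stego = lsb_embed(cover, secret)              # 5% of samples carry one bit each
```

Note how weak the disturbance is: at most 100 of the 2000 samples change, and each by at most one quantization step, which is exactly why low-embedding-rate detection is hard.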
2. Training of the CNN model
The invention is described in further detail below with reference to fig. 1 of the accompanying drawings. As shown in fig. 1, a high-dimensional feature vector is extracted based on a conditional probability distribution model, and the extracted vector is used as an input value to perform CNN network model training, and the complete flow comprises the following steps:
1. Since the G.711 coded discrete value domain has 130 discrete points, the conditional-probability method above constructs a 130×130 probability matrix, which is converted into a feature vector of dimension 16900. To facilitate the dimension-reduction calculation in CNN model training, 50 values at the head and 50 at the tail of the vector are removed; i.e. we extract 16800-dimensional feature vectors Xi and X'i for each of the 16000 original audio and steganographic audio clips, respectively.
2. Train the CNN with {Xi, X'i} as input vectors and the corresponding class labels as return values;
2.1. Preprocess the input values using hyper-parametric convolution kernels;
2.2. Perform the superposition operation of 7 convolution groups (G1-G7) on the preprocessed data output by step 2.1 to realize layer-by-layer dimension reduction of the data and extraction of high-level semantics. Each convolution group is described in detail below.
The convolution groups G1 and G2 each use three convolution layers with different kernel sizes, channel numbers and strides: a convolution kernel with channel number 1, a convolution kernel with channel number 8, and a convolution kernel with channel number 1 and stride 2.
Convolution groups G3, G4, G5 and G6 use the same layer structure. First a convolution kernel with channel number 1 is applied and its output is passed through the activation function ReLU; after activation, the data is input to a convolution kernel with channel number 2, whose output again passes through ReLU; the result is then fed into a pooling layer for further dimension reduction. The pooling layer is a max-pooling layer with stride 2.
The convolution group G7 adopts a global average pooling strategy: a single kernel reduces the dimensionality of the data from the previous layer to 1 in one step, summarizing the feature distributions learned by all previous layers.
2.3. The data output by step 2.2 is input into the classifier: first a fully connected layer is applied, the processed result is input into a softmax layer, and finally the recognition result is output as a probability.
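The classifier head of step 2.3, a fully connected layer followed by softmax, can be sketched as follows; the pooled-feature size and the random weights are hypothetical placeholders, not values from the patent:

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one linear output per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    """Convert class scores into a probability distribution."""
    m = max(z)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(0)
pooled = [rng.random() for _ in range(8)]                       # stand-in G7 output
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(2)]  # 2 classes
b = [0.0, 0.0]
probs = softmax(dense(pooled, W, b))    # [P(normal), P(steganographic)]
```

Softmax guarantees the two outputs are non-negative and sum to 1, which is what allows the recognition result to be read directly as a probability.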
In summary, by means of the technical scheme, the correlation of noise signals at different positions is represented by using a first-order discrete random variable conditional distribution law, a probability matrix of an audio digital coding sequence is constructed and is converted into a high-dimensional feature vector, the feature vector can effectively capture local micro-differences introduced before and after steganography, the feature vector is used as an input value for training a CNN model, the recognition accuracy of LSB audio steganography with low embedding rate can be effectively improved, and a probability distribution modeling means is provided for noise sequences introduced by LSB steganography.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (7)

1. The feature vector extraction method for LSB audio steganography with low embedding rate is characterized by comprising the following steps:
S1, extracting feature vectors based on conditional probability: the noise of adjacent speech sampling points within a short time is strongly correlated, and the conditional probability P(X_{m+1} = b | X_m = a) indicates the likelihood that the system transitions to another state at time m+1, given that it is in a certain state at time m, where the sample value at time m is a and the sample value at time m+1 is b;
The specific steps for extracting the feature vector based on the conditional probability in the S1 are as follows:
s11, acquiring a digital coding sequence of a voice fragment;
s12, determining an audio sampling discrete value range;
S13, counting the number of occurrences and the proportion of each discrete point of the audio sampling discrete value domain in the speech segment;
S14, evaluating the correlation of the sampling points using the conditional probability distribution law P(X_{m+1} = b | X_m = a);
S15, substituting the probability values calculated in step S14 into the probability matrix;
s16, converting the probability matrix into a target feature vector;
s2, a complete LSB steganography detection model training process;
s21, setting a data set;
S22, training a CNN model: extracting a high-dimensional feature vector based on the conditional probability distribution model and training the CNN model with the extracted vector as the input value; the specific steps are as follows:
s221, extracting feature vectors Xi and Xi with the same dimension for the same number of original audio and hidden audio respectively
S222 toCNNs are trained as input vectors and return values.
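The conditional-probability feature extraction of steps S11-S16 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the bin count `n_bins`, the min-max quantization, and all function names are assumptions introduced here.

```python
import numpy as np

def conditional_probability_features(samples, n_bins=16):
    """Sketch of S11-S16: estimate the first-order conditional
    distribution law P(X_{m+1}=j | X_m=i) over a discretized sample
    value domain and flatten the probability matrix into a feature
    vector. n_bins is an illustrative choice."""
    samples = np.asarray(samples, dtype=float)
    # S12: map the audio sampling range onto a small discrete value domain
    lo, hi = samples.min(), samples.max()
    states = ((samples - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    states = np.clip(states, 0, n_bins - 1)
    # S13/S14: count transitions i -> j between adjacent sampling points
    counts = np.zeros((n_bins, n_bins))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    # classical (frequency-count) estimate of P(X_{m+1}=j | X_m=i)
    row_sums = counts.sum(axis=1, keepdims=True)
    prob_matrix = np.divide(counts, row_sums,
                            out=np.zeros_like(counts), where=row_sums > 0)
    # S16: flatten the probability matrix into the target feature vector
    return prob_matrix.ravel()
```

For a clean tone the mass concentrates near the matrix diagonal (adjacent samples stay in nearby bins); LSB flipping perturbs exactly these local transition probabilities, which is what the feature vector is meant to expose.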
2. The feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 1, characterized in that in step S14 the conditional probability P(X_{m+1} = j | X_m = i) is calculated by the classical (frequency-count) probability model.
3. The feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 1, wherein the setting of the data set in S21 specifically includes the steps of:
s211, randomly selecting uncompressed voice fragments from a public data set, isochronously segmenting an original voice fragment into a plurality of small fragments, wherein the duration time of each audio is the same, acquiring a certain number of voice fragments as data sets and backing up, and performing LSB audio steganography on the backed up data sets by using the original data sets as normal non-steganography data sets;
s212, performing steganography operation on the backed-up data set by using an LSB audio steganography algorithm to obtain the same number of normal audio and steganography audio, wherein half of the normal audio and steganography audio are used for training, and the rest half of the normal audio and steganography audio are used for testing.
4. A feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 3, characterized in that the duration of each audio clip in step S211 is 5 s.
5. A feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 3, characterized in that the sampling frequency of the samples in step S211 is 16 kHz.
6. The feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 3, wherein the embedding rate of LSB steganography in step S212 is 5%.
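The LSB embedding of claims 4-6 (5 s clips at 16 kHz, i.e. 80,000 samples per clip, with a 5% embedding rate, about 4,000 carrier positions) can be sketched as below. The random position selection, the `seed` parameter, and the function name are illustrative assumptions, not details fixed by the claims.

```python
import numpy as np

def lsb_embed(samples, message_bits, embed_rate=0.05, seed=0):
    """Sketch of the steganography step in S212: overwrite the least
    significant bit of a randomly chosen fraction (the embedding rate,
    5% in claim 6) of 16-bit samples with message bits."""
    stego = np.array(samples, dtype=np.int16, copy=True)
    rng = np.random.default_rng(seed)
    # number of carrier positions allowed by the embedding rate
    n_embed = min(int(len(stego) * embed_rate), len(message_bits))
    positions = rng.choice(len(stego), size=n_embed, replace=False)
    for pos, bit in zip(positions, message_bits[:n_embed]):
        # clear the LSB, then set it to the message bit
        stego[pos] = (stego[pos] & ~1) | bit
    return stego
```

Each modified sample changes by at most 1 quantization level, which is why low-rate LSB steganography is inaudible and must be detected statistically rather than perceptually.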
7. The feature vector extraction method for low-embedding-rate LSB audio steganography according to claim 1, wherein the specific steps of S222 are as follows:
s2221 uses the super-parameter convolution to check the input value for preprocessing;
s2222 performs superposition operation of a convolution group on the preprocessed data output in the step S2221, so as to realize extraction of layer-by-layer degradation and high-level semantics of the data;
s2223 inputs the data output in the step S2222 into the classifier, the result after processing is input into the softmax layer through a full connection layer, and finally the recognition result is output in a probability mode.
CN202311336594.9A 2023-10-17 2023-10-17 Feature vector extraction method for LSB audio steganography with low embedding rate Pending CN117079669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311336594.9A CN117079669A (en) 2023-10-17 2023-10-17 Feature vector extraction method for LSB audio steganography with low embedding rate

Publications (1)

Publication Number Publication Date
CN117079669A true CN117079669A (en) 2023-11-17

Family

ID=88713764

Country Status (1)

Country Link
CN (1) CN117079669A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080193031A1 (en) * 2007-02-09 2008-08-14 New Jersey Institute Of Technology Method and apparatus for a natural image model based approach to image/splicing/tampering detection
CN102063907A (en) * 2010-10-12 2011-05-18 武汉大学 Steganalysis method for audio spread-spectrum steganography
CN102930495A (en) * 2012-10-16 2013-02-13 中国科学院信息工程研究所 Steganography evaluation based steganalysis method
CN105894437A (en) * 2016-03-31 2016-08-24 柳州城市职业学院 Steganography algorithm based on Markov chain
CN107610711A (en) * 2017-08-29 2018-01-19 中国民航大学 G.723.1 voice messaging steganalysis method based on quantization index modulation QIM
CN108073570A (en) * 2018-01-04 2018-05-25 焦点科技股份有限公司 A kind of Word sense disambiguation method based on hidden Markov model
CN108462708A (en) * 2018-03-16 2018-08-28 西安电子科技大学 A kind of modeling of the behavior sequence based on HDP-HMM and detection method
US20190313114A1 (en) * 2018-04-06 2019-10-10 Qatar University System of video steganalysis and a method of using the same
US20190356476A1 (en) * 2017-01-31 2019-11-21 Agency For Science, Technology And Research Method and apparatus for generatng a cover image for steganography
CN110968845A (en) * 2019-11-19 2020-04-07 天津大学 Detection method for LSB steganography based on convolutional neural network generation
KR102103306B1 (en) * 2020-01-28 2020-04-23 국방과학연구소 Steganography Discrimination Apparatus and Method
US20200356827A1 (en) * 2019-05-10 2020-11-12 Samsung Electronics Co., Ltd. Efficient cnn-based solution for video frame interpolation
CN115295018A (en) * 2022-08-04 2022-11-04 浙江农林大学暨阳学院 Bayesian network-based pitch period modulation information hiding detection method
CN115861407A (en) * 2023-02-28 2023-03-28 山东未来网络研究院(紫金山实验室工业互联网创新应用基地) Safe distance detection method and system based on deep learning
CN116049467A (en) * 2023-01-17 2023-05-02 华中科技大学 Non-supervision image retrieval method and system based on label visual joint perception
CN116110565A (en) * 2022-08-19 2023-05-12 常州大学 Method for auxiliary detection of crowd depression state based on multi-modal deep neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王忠臣 (Wang Zhongchen): "Research on detection techniques for low-embedding-rate LSB audio steganography", CNKI Outstanding Master's Theses Full-text Database, no. 1, pages 5-57 *

Similar Documents

Publication Publication Date Title
CN112464837B (en) Shallow sea underwater acoustic communication signal modulation identification method and system based on small data samples
CN114564991B (en) Electroencephalogram signal classification method based on transducer guided convolutional neural network
CN110248190B (en) Multilayer residual coefficient image coding method based on compressed sensing
CN109785847B (en) Audio compression algorithm based on dynamic residual error network
CN110490816B (en) Underwater heterogeneous information data noise reduction method
CN110968845B (en) Detection method for LSB steganography based on convolutional neural network generation
CN113990330A (en) Method and device for embedding and identifying audio watermark based on deep network
CN116939226A (en) Low-code-rate image compression-oriented generated residual error repairing method and device
CN117743768B (en) Signal denoising method and system based on denoising generation countermeasure network and diffusion model
Zhu et al. A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression
Fleig et al. Edge-aware autoencoder design for real-time mixture-of-experts image compression
CN116434759B (en) Speaker identification method based on SRS-CL network
CN117079669A (en) Feature vector extraction method for LSB audio steganography with low embedding rate
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
Movaghar et al. A new approach for digital image watermarking to predict optimal blocks using artificial neural networks
Amirtharajan et al. Info Hide–A Cluster Cover Approach
Hu et al. Adaptive Image Zooming based on Bilinear Interpolation and VQ Approximation
CN112927700B (en) Blind audio watermark embedding and extracting method and system
CN116029887A (en) Image high-capacity robust watermarking method based on wavelet neural network
CN115547344A (en) Training method of voiceprint recognition feature extraction model and voiceprint recognition system
CN116012662A (en) Feature encoding and decoding method, and method, device and medium for training encoder and decoder
Naik et al. Joint Encryption and Compression scheme for a multimodal telebiometric system
Yang et al. A robust blind audio watermarking scheme based on singular value decomposition and neural networks
CN115457985B (en) Visual audio steganography method based on convolutional neural network
Wu et al. A Fast Audio Digital Watermark Method Based on Counter-propagation Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination