Audio steganalysis method based on convolutional neural network and domain-adversarial learning
Technical Field
The invention relates to the technical field of audio steganography, and in particular to an audio steganalysis method based on convolutional neural networks and domain-adversarial learning.
Background
Current audio steganalysis models based on deep learning achieve high detection performance under laboratory conditions. However, in real-world network big-data forensics environments, audio carrier data are diverse and heterogeneous; directly applying a laboratory-trained steganalysis model to such data causes a sharp drop in detection accuracy.
The cover source mismatch (Cover Source Mismatch, CSM) problem in audio steganalysis arises when the training-set audio data and the test-set audio data come from different sources (such as recording equipment, speaker gender, language, etc.). CSM is essentially a domain adaptation (Domain Adaptation) problem in transfer learning, which can be defined as follows: given a labeled source domain D_s = {(x_i^s, y_i^s)}_{i=1}^n and an unlabeled target domain D_t = {x_j^t}_{j=1}^m, assume that the two domains share the same feature space, the same label space, and the same conditional probability distribution, but have different marginal distributions. The goal of domain-adaptive learning is then to use the labeled data D_s to learn a classifier f: x_t → y_t that predicts the labels of the target domain D_t with minimal prediction risk.
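In symbols, this is the standard domain-adaptation formulation (added here for clarity; the notation follows the definition above):

```latex
\mathcal{D}_s=\{(x_i^{s},y_i^{s})\}_{i=1}^{n},\qquad
\mathcal{D}_t=\{x_j^{t}\}_{j=1}^{m},\qquad
P_s(y\mid x)=P_t(y\mid x),\quad P_s(x)\neq P_t(x),
\qquad
f^{*}=\arg\min_{f:\,x^{t}\mapsto y^{t}}\;
\mathbb{E}_{(x^{t},y^{t})}\!\left[\mathcal{L}\!\left(f(x^{t}),y^{t}\right)\right].
```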
However, there is currently no solution specifically directed at the CSM problem in audio steganalysis.
Disclosure of Invention
In view of the above problems, the invention aims to provide an audio steganalysis method based on convolutional neural networks and domain-adversarial learning, which can effectively mitigate the performance degradation that the CSM phenomenon causes in audio steganalysis models and improve the practical feasibility of audio steganalysis in complex Internet big-data forensics scenarios.
In order to achieve the above purpose, the technical scheme of the invention is as follows. The audio steganalysis method based on a convolutional neural network and domain-adversarial learning uses a network framework comprising a feature extraction subnetwork G_f(·; θ_f), a steganalysis subnetwork G_y(·; θ_y), and a carrier-source discrimination subnetwork G_d(·; θ_d), where θ_f, θ_y, and θ_d denote the network parameters of the respective subnetworks. The method comprises:
S1, input the source-domain data D_s = {(x_i^s, y_i^s)}_{i=1}^n, the target-domain data D_t = {x_j^t}_{j=1}^m, the adversarial training factor λ, and the learning rate η;
S2, output the steganalysis feature vector F through the feature extraction subnetwork;
S3, pass the steganalysis feature vector F through the steganalysis subnetwork to obtain the binary steganalysis prediction probability ŷ; compute the cross-entropy loss L_y between ŷ and the original steganographic label y, and update the network parameters θ_y accordingly by back-propagating the error with a gradient descent algorithm, where y ∈ {0, 1}: y = 0 denotes an original (cover) carrier and y = 1 denotes a stego carrier;
S4, pass the steganalysis feature vector F through the carrier-source discrimination subnetwork to obtain the carrier-source prediction probability d̂; compute the cross-entropy loss L_d between d̂ and the domain label d, and update the network parameters θ_d accordingly by back-propagating the error, where d ∈ {0, 1}: d = 0 denotes the source domain and d = 1 denotes the target domain.
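The two cross-entropy losses in S3 and S4 have the same binary form; a minimal NumPy sketch (the function and the probability values are illustrative, not from the patent):

```python
import numpy as np

def binary_cross_entropy(p, label):
    """Cross-entropy between a predicted probability p and a 0/1 label."""
    eps = 1e-12  # clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# L_y: steganalysis loss, label y = 1 (stego carrier), confident correct prediction
L_y = binary_cross_entropy(0.9, 1)
# L_d: carrier-source loss, label d = 0 (source domain), confident wrong prediction
L_d = binary_cross_entropy(0.9, 0)
print(L_y, L_d)  # the wrong confident prediction is penalized far more
```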
Further, the feature extraction subnetwork in S2 includes an audio preprocessing layer followed by 4 cascaded convolution groups, namely the 1st, 2nd, 3rd, and 4th convolution groups.
Further, the audio preprocessing layer consists of 4 1×5 convolution kernels D1 to D4, and initial weights are respectively:
D1=[1,-1,0,0,0], D2=[1,-2,1,0,0], D3=[1,-3,3,-1,0], D4=[1,-4,6,-4,1];
the 1 st convolution group includes a 1×1 first convolution layer, a 1×5 second convolution layer, and a 1×1 third convolution layer;
the 2nd, 3rd, and 4th convolution groups each comprise a 1×5 convolutional layer, a 1×1 convolutional layer, and an average pooling layer, wherein the average pooling layer of the 4th convolution group is a global average pooling layer;
the steganalysis feature vector is a 256-dimensional vector.
Furthermore, the audio preprocessing layer adopts a differential filtering design.
Further, the steganalysis subnetwork comprises a fully connected part and a steganographic label prediction layer, the fully connected part being a cascade of two fully connected layers containing 128 and 64 neurons, respectively.
Further, the carrier-source discrimination subnetwork comprises a gradient reversal layer, a domain discrimination layer, and a domain label prediction layer. The gradient reversal layer acts as an identity mapping on its input in the forward propagation stage and reverses (and scales) the error gradient in the back-propagation stage, expressed respectively as
Forward: F(x) = x
Backward: dF(x)/dx = −λI
wherein F(x) represents the equivalent function of the gradient reversal layer and I is the identity matrix.
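As an illustration (the class and method names are our own, not the patent's wording), a gradient reversal layer — identity forward, gradient multiplied by −λ backward — can be sketched in NumPy:

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; multiplies the incoming gradient by
    -lambda in the backward pass (Forward: F(x) = x, Backward: dF/dx = -lambda*I)."""
    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x  # identity mapping, no change to the features

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, lambda-scaled gradient

grl = GradientReversalLayer(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
out = grl.forward(x)                      # features pass through unchanged
g = grl.backward(np.array([0.2, 0.2, 0.2]))
print(g)  # sign flipped and scaled by lambda
```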
Further, the update of the network parameters θ_y in S3 and of θ_d in S4 is optimized according to the following formulas,
E(θ_f, θ_y, θ_d) = (1/n) Σ_{i=1..n} L_y(ŷ_i^s, y_i^s) − λ [ (1/n) Σ_{i=1..n} L_d(d̂_i^s, 0) + (1/m) Σ_{j=1..m} L_d(d̂_j^t, 1) ]
(θ̂_f, θ̂_y) = argmin_{θ_f, θ_y} E(θ_f, θ_y, θ̂_d),    θ̂_d = argmax_{θ_d} E(θ̂_f, θ̂_y, θ_d)
wherein θ̂_f, θ̂_y, and θ̂_d denote the network parameters determined by each subnetwork, n is the number of training samples of the source-domain data, and m is the number of training samples of the target-domain data.
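A scalar toy sketch of one gradient step under this minimax objective (η, λ, and all gradient values are illustrative numbers, not from the patent): θ_y and θ_d descend on their own losses, while θ_f receives the steganalysis gradient minus the λ-scaled domain gradient.

```python
# One toy SGD step: theta_y and theta_d descend on their losses;
# theta_f combines the steganalysis gradient with the reversed domain gradient.
eta, lam = 0.1, 0.5          # learning rate and adversarial factor (illustrative)
g_y_f, g_d_f = 0.8, 0.4      # dL_y/dtheta_f and dL_d/dtheta_f (toy values)
g_y, g_d = 0.6, 0.3          # dL_y/dtheta_y and dL_d/dtheta_d (toy values)

theta_f, theta_y, theta_d = 0.0, 0.0, 0.0
theta_y -= eta * g_y                     # steganalysis branch: plain descent
theta_d -= eta * g_d                     # domain branch: descent on L_d
theta_f -= eta * (g_y_f - lam * g_d_f)   # feature extractor: domain term reversed
print(theta_f, theta_y, theta_d)
```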
Compared with the prior art, the invention has the advantages that:
By combining a convolutional neural network with domain-adversarial learning in a universal audio steganalysis model, domain-independent steganalysis features can be obtained. This effectively mitigates the performance degradation of audio steganalysis models caused by the cover source mismatch problem, and provides a feasible approach for applying audio steganalysis technology in complex Internet big-data forensics scenarios.
Detailed Description
The following detailed description of embodiments of the invention is merely exemplary in nature and is not intended to limit the invention to the precise forms disclosed.
The invention provides an audio steganalysis method based on a convolutional neural network and domain-adversarial learning. The network framework corresponding to the method comprises a feature extraction subnetwork G_f(·; θ_f), a steganalysis subnetwork G_y(·; θ_y), and a carrier-source discrimination subnetwork G_d(·; θ_d), where θ_f, θ_y, and θ_d denote the network parameters of the respective subnetworks. The method comprises:
S1, input the source-domain data D_s = {(x_i^s, y_i^s)}_{i=1}^n, the target-domain data D_t = {x_j^t}_{j=1}^m, the adversarial training factor λ, and the learning rate η;
S2, output the steganalysis feature vector F through the feature extraction subnetwork;
S3, pass the steganalysis feature vector F through the steganalysis subnetwork to obtain the binary steganalysis prediction probability ŷ; compute the cross-entropy loss L_y between ŷ and the original steganographic label y, and update the network parameters θ_y accordingly by back-propagating the error with a gradient descent algorithm, where y ∈ {0, 1}: y = 0 denotes an original (cover) carrier and y = 1 denotes a stego carrier;
S4, pass the steganalysis feature vector F through the carrier-source discrimination subnetwork to obtain the carrier-source prediction probability d̂; compute the cross-entropy loss L_d between d̂ and the domain label d, and update the network parameters θ_d accordingly by back-propagating the error, where d ∈ {0, 1}: d = 0 denotes the source domain and d = 1 denotes the target domain.
The feature extraction subnetwork adaptively extracts features. To mitigate the steganalysis performance degradation caused by the CSM problem, the output feature vector F must be steganographically discriminative (i.e., yield correct steganalysis results when fed into the steganalysis subnetwork) and must also have a certain degree of domain independence (i.e., keep the feature-space distributions of audio carrier data from different sources consistent). By continuously learning the differences in data distribution between original audio samples and steganographic audio samples, the feature extraction subnetwork improves the ability of the learned feature F to correctly detect steganographic audio. At the same time, in the back-propagation stage, the gradient reversal layer reverses the error gradient used to update the network parameters θ_f of G_f, thereby reducing the correlation between the extracted feature F and the domain (source) of the audio carrier data.
For the network architecture, the detailed architecture parameters of the individual subnetwork modules are shown in the following table. Examples of the parameter notation in the table: 64×(1×5), ReLU denotes a convolutional layer with a 1×5 kernel and 64 output channels, activated by ReLU; FC-256 denotes a fully connected layer with 256 neurons.
The function of the feature extraction subnetwork G_f is to adaptively extract steganalysis features from the input audio data. In CNN steganalysis models, a well-designed preprocessing layer can improve the steganalysis performance of the network. Therefore, the feature extraction subnetwork begins with an audio preprocessing layer based on a differential filtering design, consisting of 4 1×5 convolution kernels D1–D4 with initial weights:
D1 = [1, -1, 0, 0, 0]
D2 = [1, -2, 1, 0, 0]
D3 = [1, -3, 3, -1, 0]
D4 = [1, -4, 6, -4, 1]
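The four kernels are 1st- to 4th-order backward-difference filters (rows of signed binomial coefficients). A NumPy sketch of the preprocessing convolution — the valid-mode correlation and the test signal are illustrative, as the patent does not state the padding scheme:

```python
import numpy as np

# Initial weights of the differential filters (signed binomial rows)
D = np.array([
    [1, -1, 0, 0, 0],   # 1st-order difference
    [1, -2, 1, 0, 0],   # 2nd-order difference
    [1, -3, 3, -1, 0],  # 3rd-order difference
    [1, -4, 6, -4, 1],  # 4th-order difference
], dtype=float)

def preprocess(audio):
    """Correlate the audio signal with each 1x5 difference kernel (valid mode)."""
    return np.stack([np.correlate(audio, k, mode="valid") for k in D])

# A slowly varying (linear) signal is suppressed by the higher-order filters,
# while high-frequency residuals -- where steganographic noise lives -- pass through.
ramp = np.arange(16, dtype=float)
out = preprocess(ramp)
print(out[1])  # the 2nd-order difference of a linear ramp is all zeros
```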
The audio preprocessing layer is followed by 4 cascaded convolution-group modules. The convolutional layers in the 1st convolution group undergo no nonlinear activation, and pooling is omitted, so as to capture the weak signal introduced by steganography more effectively. The 2nd to 4th convolution groups each comprise a 1×5 convolutional layer, a 1×1 convolutional layer, and an average pooling layer, wherein the final average pooling layer of the 4th convolution group is replaced by a global average pooling (Global Average Pooling) layer to fuse global features. The subnetwork finally outputs the 256-dimensional steganalysis feature vector F.
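Global average pooling reduces each channel's feature map to a single value, so 256 output channels yield the 256-dimensional vector F. A minimal NumPy sketch (the feature-map length of 50 is an illustrative shape, not specified by the patent):

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each channel over its temporal axis: (C, T) -> (C,)."""
    return feature_maps.mean(axis=-1)

# 256 channels of length-50 feature maps -> 256-dim steganalysis feature vector F
feature_maps = np.random.default_rng(0).normal(size=(256, 50))
F = global_average_pooling(feature_maps)
print(F.shape)  # (256,)
```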
The steganalysis classification subnetwork G_y takes the feature vector F as input; its main structure is a two-layer cascade of fully connected layers (containing 128 and 64 neurons, respectively) followed by the steganographic label prediction layer.
The structure of the carrier-source discrimination subnetwork G_d is similar to that of the steganalysis classification subnetwork, its main body also consisting of fully connected layers. The difference is that the output feature F of the feature extraction subnetwork G_f and the domain discrimination layer of G_d are connected through a gradient reversal layer (Gradient Reversal Layer, GRL).
Regarding the formulas Forward: F(x) = x and Backward: dF(x)/dx = −λI, the smaller λ is, the less weight the domain label carries, and the more domain information the feature vector F extracted by G_f is allowed to retain. When λ = 0, the influence of the domain label is ignored entirely, i.e., transfer is not considered; the classifier then depends most strongly on the source-domain data. Setting a reasonable λ is therefore important: when the two domains differ significantly, λ may suitably be larger.
In the training process of the method, the source-domain audio data D_s carry complete steganographic label information, while the target-domain audio data D_t carry no steganographic labels. The training of the whole network can be divided into two parts: 1) a supervised steganalysis network formed by cascading G_f and G_y; 2) a carrier-source discrimination branch formed by cascading G_f and G_d. The training purpose of the whole network is as follows: train G_y to sharpen the discriminability of the feature F in the steganographic space; train G_d to distinguish audio data from different sources and extract domain information; and, at the same time, use the gradient reversal between G_f and G_d to eliminate the domain-related information in the feature F extracted by G_f. The training purpose of the whole network is equivalent to solving the following optimization problem:
E(θ_f, θ_y, θ_d) = (1/n) Σ_{i=1..n} L_y(ŷ_i^s, y_i^s) − λ [ (1/n) Σ_{i=1..n} L_d(d̂_i^s, 0) + (1/m) Σ_{j=1..m} L_d(d̂_j^t, 1) ]
(θ̂_f, θ̂_y) = argmin_{θ_f, θ_y} E(θ_f, θ_y, θ̂_d),    θ̂_d = argmax_{θ_d} E(θ̂_f, θ̂_y, θ_d)
wherein θ̂_f, θ̂_y, and θ̂_d denote the network parameters determined by each subnetwork, n is the number of training samples of the source-domain data, and m is the number of training samples of the target-domain data.
To achieve the above objective, the whole network is trained iteratively: in each iteration, θ_y and θ_d are updated by gradient descent on their respective cross-entropy losses, while θ_f is updated with the steganalysis gradient together with the λ-scaled, sign-reversed domain gradient delivered by the GRL.
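As a sketch of this alternating training procedure, the following toy NumPy loop uses scalar stand-ins for the three subnetworks; the data, names, and hyperparameters are illustrative and are not the patent's architecture or its table of steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
eta, lam = 0.05, 0.3            # learning rate and adversarial factor (illustrative)
w_f, w_y, w_d = 0.1, 0.1, 0.1   # scalar stand-ins for theta_f, theta_y, theta_d

# Toy data: labeled source domain (d = 0), unlabeled target domain (d = 1),
# with shifted marginals to mimic a cover source mismatch.
x_s = rng.normal(0.0, 1.0, size=64)
y_s = (x_s + 0.1 * rng.normal(size=64) > 0).astype(float)  # synthetic cover/stego labels
x_t = rng.normal(0.5, 1.0, size=64)

for epoch in range(200):
    # --- supervised steganalysis branch: G_f -> G_y on source data only ---
    f_s = w_f * x_s
    p_y = sigmoid(w_y * f_s)
    err_y = p_y - y_s                              # dL_y / dlogit
    grad_w_y = np.mean(err_y * f_s)
    grad_f_stego = np.mean(err_y * w_y * x_s)
    # --- carrier-source branch: G_f -> GRL -> G_d on both domains ---
    x_all = np.concatenate([x_s, x_t])
    d_all = np.concatenate([np.zeros(64), np.ones(64)])
    f_all = w_f * x_all
    p_d = sigmoid(w_d * f_all)
    err_d = p_d - d_all
    grad_w_d = np.mean(err_d * f_all)
    grad_f_domain = np.mean(err_d * w_d * x_all)   # gradient before the GRL
    # --- updates: G_y and G_d descend on their losses, GRL flips the domain term ---
    w_y -= eta * grad_w_y
    w_d -= eta * grad_w_d
    w_f -= eta * (grad_f_stego - lam * grad_f_domain)

print(w_f, w_y, w_d)
```

The key line is the last update: without the `-lam *` factor, the feature extractor would help the domain discriminator; with it, θ_f is pushed toward domain-indistinguishable features.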
The method effectively mitigates the performance degradation of audio steganalysis models caused by the cover source mismatch problem, and provides a feasible approach for applying audio steganalysis technology in complex Internet big-data forensics scenarios.
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.