Disclosure of Invention
In view of the above problems, the present invention aims to provide a general audio steganalysis method based on a convolutional neural network and multi-task learning, which can effectively improve the detection performance of an audio steganalysis algorithm and the capability of detecting unknown steganography algorithms.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a general audio steganalysis method based on a convolutional neural network and multi-task learning, characterized in that the network framework corresponding to the method comprises a feature extraction sub-network, a binary classification sub-network and a multi-classification sub-network, and the method comprises:
S1, inputting audio data;
S2, outputting a steganalysis feature vector F through the feature extraction sub-network;
S3, judging whether the audio data is a steganographic carrier through the binary classification sub-network; if so, executing S4-S8 in sequence, and if not, outputting the audio data as normal audio;
S4, the steganalysis feature vector F is fed through the binary classification sub-network to obtain a binary steganography prediction probability vector ŷ = [ŷ_0, ŷ_1]; the cross-entropy loss L_m between ŷ and the One-hot encoded binary steganography label vector y = [y_0, y_1] is calculated as L_m = -Σ_{i=0}^{1} y_i log(ŷ_i), where y_i ∈ {0, 1} and i ∈ [0, 1] is the category index; the parameters of the binary classification sub-network are then updated by back-propagating the error with a gradient descent algorithm;
S5, the steganalysis feature vector F is fed through the multi-classification sub-network to obtain a prediction probability vector m̂ = [m̂_0, m̂_1, …, m̂_{M-1}] over steganography algorithm types; the cross-entropy loss L_a between m̂ and the One-hot encoded steganography category label m = [m_0, m_1, …, m_{M-1}] is calculated as L_a = -Σ_{k=0}^{M-1} m_k log(m̂_k), where M is the number of different steganography algorithms contained in the training set data; the parameters of the multi-classification sub-network are updated accordingly by back-propagating the error with a gradient descent algorithm;
S6, training the whole network according to the combined loss L = L_m + λL_a, where λ is the auxiliary-task weight factor;
S7, calculating the confidence value C(m) of the prediction probability through the multi-classification sub-network;
S8, judging whether the confidence value C(m) is larger than a set empirical threshold CT; if so, outputting the result as an unknown steganography algorithm, and if not, outputting the steganography algorithm type.
Further, the feature extraction sub-network in S2 includes an audio preprocessing layer and 5 cascaded convolution groups after the audio preprocessing layer, i.e. a 1 st convolution group, a 2 nd convolution group, a 3 rd convolution group, a 4 th convolution group, and a 5 th convolution group.
Further, the audio preprocessing layer consists of 4 convolution kernels of size 1×5, D1 to D4, with initial weights respectively:
D1=[1,-1,0,0,0], D2=[1,-2,1,0,0], D3=[1,-3,3,-1,0], D4=[1,-4,6,-4,1];
the 1 st convolution group includes a 1×1 first convolution layer, a 1×5 second convolution layer, and a 1×1 third convolution layer;
the 2nd convolution group, the 3rd convolution group, the 4th convolution group and the 5th convolution group each comprise a 1×5 convolution layer, a 1×1 convolution layer and a mean pooling layer, wherein the mean pooling layer of the 5th convolution group is a global mean pooling layer;
the steganalysis feature vector is a 256-dimensional vector.
Furthermore, the audio preprocessing layer adopts a differential filtering design.
Further, the first convolution layer in the 1 st convolution group is activated using a truncated linear unit TLU.
Further, the binary classification sub-network includes a fully connected layer having 128 neurons and a binary steganography label prediction layer.
Further, the multi-classification sub-network comprises two cascaded fully connected layers, containing 128 and 64 neurons respectively, and a steganography class label prediction layer.
Further, the confidence value C(m) in S8 is calculated as the information entropy of the prediction probability distribution, C(m) = -Σ_{k=0}^{M-1} p(m_k) log p(m_k), and the empirical threshold is set as CT = 0.5 × C(m)_max, where C(m)_max = log M.
Compared with the prior art, the invention has the advantages that:
By providing the general audio steganalysis model and analysis method based on a convolutional neural network and multi-task learning, the detection performance on various audio steganography algorithms is effectively improved; moreover, the method improves the capability of detecting unknown steganography algorithms, facilitating the application of audio steganalysis technology in complex Internet big-data forensics scenarios.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention provides a general audio steganalysis method based on convolutional neural network and multitask learning, which comprises the following steps of,
S1, inputting audio data;
S2, outputting a steganalysis feature vector F through the feature extraction sub-network;
S3, judging whether the audio data is a steganographic carrier through the binary classification sub-network; if so, executing S4-S8 in sequence, and if not, outputting the audio data as normal audio;
S4, the steganalysis feature vector F is fed through the binary classification sub-network to obtain a binary steganography prediction probability vector ŷ = [ŷ_0, ŷ_1]; the cross-entropy loss L_m between ŷ and the One-hot encoded binary steganography label vector y = [y_0, y_1] is calculated as L_m = -Σ_{i=0}^{1} y_i log(ŷ_i), where y_i ∈ {0, 1} and i ∈ [0, 1] is the category index; the parameters of the binary classification sub-network are then updated by back-propagating the error with a gradient descent algorithm;
S5, the steganalysis feature vector F is fed through the multi-classification sub-network to obtain a prediction probability vector m̂ = [m̂_0, m̂_1, …, m̂_{M-1}] over steganography algorithm types; the cross-entropy loss L_a between m̂ and the One-hot encoded steganography category label m = [m_0, m_1, …, m_{M-1}] is calculated as L_a = -Σ_{k=0}^{M-1} m_k log(m̂_k), where M is the number of different steganography algorithms contained in the training set data; the parameters of the multi-classification sub-network are updated accordingly by back-propagating the error with a gradient descent algorithm;
S6, training the whole network according to the combined loss L = L_m + λL_a, where λ is the auxiliary-task weight factor;
S7, calculating the confidence value C(m) of the prediction probability through the multi-classification sub-network;
S8, judging whether the confidence value C(m) is larger than a set empirical threshold CT; if so, outputting the result as an unknown steganography algorithm, and if not, outputting the steganography algorithm type.
Two related steganalysis tasks are constructed in this application: a binary classification task that distinguishes normal audio (Cover) from stego audio (Stego), and a multi-classification task that distinguishes the steganography algorithm type of stego audio. Of the two tasks, the binary classification task of distinguishing normal audio from stego audio is the key objective of this work and can be regarded as the main task; the multi-classification task accordingly serves as the auxiliary task.
In particular, the function of the feature extraction sub-network is to adaptively extract steganalysis features from the input audio data. A well-designed preprocessing layer in a CNN (Convolutional Neural Network) steganalysis model can improve the steganalysis performance of the network; therefore, an audio preprocessing layer based on a differential filtering design is placed at the beginning of the feature extraction sub-network. It consists of 4 convolution kernels of size 1×5, D1 to D4, whose initial weights are respectively: D1 = [1, -1, 0, 0, 0], D2 = [1, -2, 1, 0, 0], D3 = [1, -3, 3, -1, 0], D4 = [1, -4, 6, -4, 1];
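As a minimal sketch, the four difference filters above could be instantiated as fixed initial weights of a 1×5 convolution and applied to a raw audio signal; the function name and array layout here are illustrative, not from the patent, and D3 is taken as the standard third-order binomial difference [1, -3, 3, -1, 0]:

```python
import numpy as np

# Initial weights of the four 1x5 difference filters (1st- to 4th-order
# binomial differences, zero-padded to length 5).
D = np.array([
    [1, -1,  0,  0, 0],   # D1: first-order difference
    [1, -2,  1,  0, 0],   # D2: second-order difference
    [1, -3,  3, -1, 0],   # D3: third-order difference
    [1, -4,  6, -4, 1],   # D4: fourth-order difference
], dtype=np.float64)

def preprocess(audio):
    """Apply the four difference filters to a 1-D audio signal,
    producing a 4-channel residual representation ('valid' length).
    The kernel is reversed so np.convolve computes cross-correlation,
    matching the sliding-window convention of CNN convolution layers."""
    return np.stack([np.convolve(audio, d[::-1], mode="valid") for d in D])
```

On a linear ramp signal, the second- and higher-order filters output zero, which illustrates why such residual filters suppress smooth audio content and emphasize the weak high-frequency traces steganography leaves behind.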
the audio preprocessing layer is followed by 5 concatenated convolutions, namely, the 1 st convolutions, the 2 nd convolutions, the 3 rd convolutions, the 4 th convolutions, and the 5 th convolutions.
The 1st convolution group includes a 1×1 first convolution layer, a 1×5 second convolution layer, and a 1×1 third convolution layer. The first convolution layer is activated using a truncated linear unit (TLU). Compared with the linear rectification unit (ReLU) commonly used in deep-learning speech recognition tasks, the TLU suppresses activation in the large positive region while retaining some activation capacity in the negative region; compared with the other commonly used tanh activation unit, the TLU has a wider activation interval within which the gradient remains constant, which reduces the risk of vanishing gradients during training. In addition, the other convolution layers of the 1st convolution group perform no activation, and pooling is omitted, so as to capture the weak traces introduced by steganography more effectively.
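A minimal sketch of the truncated linear unit described above; the truncation threshold T is an assumed hyperparameter, as this passage does not specify its value:

```python
import numpy as np

def tlu(x, T=3.0):
    """Truncated linear unit: identity on [-T, T], clipped outside.
    Unlike ReLU it keeps activation capacity in the negative region;
    unlike tanh it has a constant unit gradient inside [-T, T]."""
    return np.clip(x, -T, T)
```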
The 2nd, 3rd, 4th and 5th convolution groups each comprise a 1×5 convolution layer, a 1×1 convolution layer and a mean pooling layer, where the final mean pooling layer of the 5th convolution group is replaced by a global average pooling layer so as to fuse global features.
The feature extraction sub-network further comprises a feature output layer, wherein the feature output layer is composed of a full-connection layer with 256 neurons, and finally outputs 256-dimensional steganalysis feature vector F.
The detailed parameters of each subnetwork are shown in the following table:
Examples of the meaning of the parameters in the table: 64×(1×5), ReLU denotes a convolution layer with a 1×5 kernel and 64 output channels whose output is activated with ReLU; FC-256 denotes a fully connected layer with 256 neurons.
The binary classification sub-network follows the feature output layer and consists of a fully connected layer containing 128 neurons and a binary steganography label prediction layer. The feature vector F is passed through this sub-network to output a binary steganography prediction probability vector ŷ = [ŷ_0, ŷ_1]. The cross-entropy loss L_m between ŷ and the One-hot encoded binary steganography label vector y = [y_0, y_1] (y_i ∈ {0, 1}, where i is the category index; a value of 1 at index i means the data belongs to class i) is calculated as L_m = -Σ_{i=0}^{1} y_i log(ŷ_i). Finally, the network parameters are updated through back-propagation of the error and a gradient descent algorithm.
The multi-classification sub-network is structured as two cascaded fully connected layers containing 128 and 64 neurons respectively, followed by a steganography class label prediction layer. The feature vector F is passed through this sub-network to output a prediction probability vector m̂ = [m̂_0, m̂_1, …, m̂_{M-1}] over steganography algorithm types. The cross-entropy loss L_a between m̂ and the One-hot encoded steganography class label m = [m_0, m_1, …, m_{M-1}] (where M is the number of different steganography algorithms contained in the training set data) is calculated as L_a = -Σ_{k=0}^{M-1} m_k log(m̂_k). Finally, the network parameters are updated through back-propagation of the error and a gradient descent algorithm.
The optimization problem to be solved by the whole network is the combined loss of the main-task loss and the auxiliary-task loss, L = L_m + λL_a, where λ is the auxiliary-task weight factor. λ determines how strongly the auxiliary task influences the main task: the larger λ is, the more guidance the auxiliary task provides to main-task training, but the more interference it introduces as well, so choosing a reasonable λ is also important.
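The two cross-entropy losses and their combination can be sketched as follows; the value lam=0.1 is only an illustrative default, since the patent leaves the choice of λ open:

```python
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross entropy between a one-hot label vector and predicted
    probabilities; eps guards against log(0)."""
    return -float(np.sum(one_hot * np.log(probs + eps)))

def combined_loss(y, y_hat, m, m_hat, lam=0.1):
    """Combined multi-task loss L = L_m + lambda * L_a.

    y, y_hat : one-hot binary label / predicted probs (main task)
    m, m_hat : one-hot steganography-class label / predicted probs
               (auxiliary task)
    lam      : auxiliary-task weight factor (assumed example value)
    """
    L_m = cross_entropy(y, y_hat)   # binary steganography loss
    L_a = cross_entropy(m, m_hat)   # algorithm-type loss
    return L_m + lam * L_a
```

With lam = 0 the auxiliary task is switched off entirely and training reduces to plain cover/stego classification, which makes the role of λ as an interpolation knob between the two tasks easy to see.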
The multi-classification sub-network finally also includes a Softmax activation layer, whose output can be regarded as the predicted probability p(m_k) of each steganography algorithm type. The more concentrated the prediction probability distribution, the more credible the network's prediction result. Since information entropy reflects the concentration of a distribution, the confidence value of the prediction probability is calculated and output as the entropy C(m) = -Σ_{k=0}^{M-1} p(m_k) log p(m_k).
From information theory, the confidence value C(m) attains its maximum value log M when the output probabilities are uniformly distributed (i.e. the predicted probability of every algorithm type is 1/M), that is, C(m)_max = log M. The empirical confidence threshold is therefore set as CT = 0.5 × C(m)_max.
When the confidence value C(m) is greater than CT, the prediction probability distribution is considered close to uniform and the network has little confidence in any single algorithm type, so the algorithm can be considered not to be contained in the training data, i.e. it is an unknown steganography algorithm. Conversely, when C(m) is smaller than CT, the prediction probability distribution is concentrated, and the type with the highest prediction probability is selected as the output type of the data.
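The entropy-based decision rule above can be sketched in a few lines; the function names are illustrative, and the ratio 0.5 is the empirical factor stated in the text:

```python
import math

def confidence(probs):
    """Information entropy of the softmax output: C(m) = -sum p*log(p)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def classify(probs, ratio=0.5):
    """Return 'unknown' if the entropy exceeds CT = ratio * log(M)
    (near-uniform prediction, algorithm not seen in training);
    otherwise return the index of the most probable known algorithm."""
    M = len(probs)
    CT = ratio * math.log(M)          # empirical threshold CT = 0.5 * C(m)_max
    if confidence(probs) > CT:
        return "unknown"
    return max(range(M), key=lambda k: probs[k])
```

A uniform output like [0.25, 0.25, 0.25, 0.25] has entropy log 4 = C(m)_max > CT and is flagged unknown, while a peaked output such as [0.97, 0.01, 0.01, 0.01] falls below CT and yields the index of its dominant class.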
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.