CN113990330A - Method and device for embedding and identifying audio watermark based on deep network - Google Patents


Info

Publication number
CN113990330A
Authority
CN
China
Prior art keywords
network
audio
watermark
decoder
embedding
Prior art date
Legal status
Pending
Application number
CN202111250867.9A
Other languages
Chinese (zh)
Inventor
李平
蒋升
Current Assignee
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd
Priority to CN202111250867.9A
Publication of CN113990330A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/018 - Audio watermarking, i.e. embedding inaudible data in the audio signal
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method and a device for embedding and identifying audio watermarks based on a deep network, belonging to the field of audio digital watermarking. The embedding method comprises the following steps: S1: framing the original audio, and performing windowing and a short-time Fourier transform on each frame; S2: using the frequency-domain features extracted by the short-time Fourier transform as the input of a U-net network, and embedding the watermark information into the U-net network for encoding; S3: performing an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio. The invention gives the watermark strong robustness under different types of noise scenes.

Description

Method and device for embedding and identifying audio watermark based on deep network
Technical Field
The invention belongs to the field of audio digital watermarks, and particularly relates to a method and a device for embedding and identifying an audio watermark based on a deep network.
Background
The digital watermarking technology is an information hiding technology. In an audio digital watermarking algorithm, a digital watermark is embedded into an audio file (such as wav, mp3 or avi) by a watermark embedding algorithm without significantly affecting the original sound quality of the audio file, or at least without the change being perceptible to the human ear. Conversely, the audio digital watermark can be extracted intact from the host audio file by a watermark extraction algorithm; the embedded and extracted information is referred to as the audio digital watermark.
Digital watermarking is not a new concept and has long been studied as a pure signal-processing problem. Classical methods hide information in encoded images and audio files and include LSB encoding, phase encoding and spread-spectrum watermarking, among others. Recently another class of watermarking methods has appeared that represents the information in the media through machine learning: a deep neural network performs the encoding and decoding to embed and recover the watermark.
In the prior art there is an image steganography algorithm in which the transmitted information is itself another image; this is a lossy transmission that can carry a large amount of information, and it shows the potential of machine learning for watermarking and steganography in the big-data era of artificial intelligence. There is also a method for steganography and watermarking in images that deliberately distorts the carrier with noise introduced after encoding, so that the trained encoded information becomes robust to the introduced noise; in this way the watermark can, for example, be made robust to lossy JPEG compression.
If these methods can be applied to audio, the characteristics of audio signals can be exploited to create even more robust watermarks in the audio domain.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a method and a device for embedding and identifying an audio watermark based on a deep network, which can enable the watermark to have stronger robustness under different types of noise scenes.
In order to achieve the above object, the present invention provides a method for embedding an audio watermark based on a deep network, comprising the following steps:
S1: framing the original audio, and performing windowing and a short-time Fourier transform on each frame;
S2: using the frequency-domain features extracted by the short-time Fourier transform as the input of a U-net network, and embedding the watermark information into the U-net network for encoding;
S3: performing an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio.
Further, in step S2, the up-sampling stage and the down-sampling stage of the U-net network use the same number of convolution levels, and the down-sampling layers are connected to the up-sampling layers by skip connections.
Furthermore, the down-sampling part of the U-net consists of four encoder blocks, each composed of a 2-dimensional convolutional network with batch normalization; after the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256; the up-sampling is then performed by four further blocks, each likewise composed of a 2-dimensional convolutional network with batch normalization.
The invention also provides a method for identifying an audio watermark based on a deep network, comprising the following step:
S4: decoding the watermarked audio through a 2-dimensional convolutional network to obtain the watermark information, the decoding process using a 2-dimensional convolutional network with batch normalization.
Further, in step S4, a decoder is used for the decoding; the decoder outputs 32 prediction probability values between 0 and 1, and the watermark information is obtained through a decoder loss function.
The invention also provides a device for embedding an audio watermark based on a deep network, comprising a preprocessing module, an encoder and a processing module, wherein
the preprocessing module is used for framing the original audio and performing windowing and a short-time Fourier transform on each frame;
the encoder is used for taking the frequency-domain features extracted by the short-time Fourier transform as the input of a U-net network and embedding the watermark information into the U-net network for encoding;
and the processing module is used for performing an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio.
Further, the up-sampling stage and the down-sampling stage of the U-net network used in the encoder adopt the same number of convolution levels, and the down-sampling layers are connected to the up-sampling layers by skip connections.
Furthermore, the down-sampling part of the U-net consists of four encoder blocks, each composed of a 2-dimensional convolutional network with batch normalization; after the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256; the up-sampling is then performed by four further blocks, each likewise composed of a 2-dimensional convolutional network with batch normalization.
The invention also provides a device for identifying an audio watermark based on a deep network, comprising a decoder, wherein
the decoder is used for decoding the watermarked audio through a 2-dimensional convolutional network to obtain the watermark information, the decoding process using a 2-dimensional convolutional network with batch normalization.
Furthermore, the decoder outputs 32 prediction probability values between 0 and 1, and the watermark information is obtained through a decoder loss function.
Compared with prior-art algorithms, the method and device for embedding and identifying an audio watermark based on a deep network according to the invention can encode information and accurately extract it under various noise types, and have good robustness to noise.
Drawings
Fig. 1 is a flowchart of a method for embedding an audio watermark based on a deep network according to this embodiment.
Fig. 2 is a spectrogram of the signal obtained in step S1 after the short-time Fourier transform according to this embodiment.
Fig. 3 is a schematic diagram of a U-net network architecture framework adopted in step S2 according to an embodiment.
Fig. 4 is a schematic diagram of a network structure of a decoder according to an embodiment.
Fig. 5 is a flowchart of a method for identifying an audio watermark based on a deep network according to this embodiment.
Fig. 6 is a schematic diagram of an apparatus for embedding an audio watermark based on a deep network according to this embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the scheme of the present invention.
As shown in fig. 1, an embodiment of the present invention is a method for embedding an audio watermark based on a deep network. It is a robust audio watermarking method based on U-net deep learning, where robustness means that the watermark can still be extracted after the audio has been played over the air and re-recorded.
The method for embedding the watermark specifically comprises the following steps:
s1: the original audio is framed and each frame is windowed and short-time fourier transformed.
The formula for performing the short-time fourier transform is as follows:
S(m,w)=DTFT{x(n)w(n-m*N/2)}
In the formula, S(m, w) is a two-dimensional function of time and frequency; x(n) is the speech signal; w(n) is the window function, shifted along the time axis in hops of N/2; m is the index of the window shift; N is 2048 points in the present invention; and DTFT denotes the discrete-time Fourier transform.
In the invention, the original audio has a sampling rate of 16000 Hz, a single 16-bit channel and a duration of 2 s, i.e. 32000 sampling points. It is framed and windowed with a window length of 2048 and a window shift of 1024, and the short-time Fourier transform (STFT) for non-stationary signals is then applied, yielding 1025 frequency bins and 32 time frames, each with an amplitude and a phase.
The invention uses only the frequency-domain amplitudes of bins 10-512, which correspond to roughly 87-4000 Hz; the other parts are kept unchanged. The resulting spectrogram of the signal after the short-time Fourier transform is shown in figure 2, where the abscissa is the time axis and the ordinate is the frequency axis.
The frequency-domain amplitudes of bins 10-512 are selected for the following reason: the channel plays an important role when audio is played or recorded in the air with hardware products (such as a mobile phone or a computer). This parameter range is chosen because low frequencies may be lost in the air, and higher frequencies are lost when the hardware bandwidth is limited.
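For concreteness, the following sketch reproduces step S1 and the band selection in PyTorch. The Hann window, the helper names and the use of torch.stft are choices made here for illustration, not details fixed by the invention.

```python
import torch

SR = 16000             # sampling rate (Hz)
N_FFT = 2048           # window length N
HOP = 1024             # window shift N/2
BAND = slice(10, 513)  # frequency bins 10-512, roughly 87-4000 Hz at 16 kHz

def analyze(audio: torch.Tensor) -> torch.Tensor:
    """audio: mono waveform of 32000 samples (2 s at 16 kHz).
    Returns a complex spectrogram of shape (1025, 32)."""
    window = torch.hann_window(N_FFT)  # window type assumed, not specified
    return torch.stft(audio, n_fft=N_FFT, hop_length=HOP,
                      window=window, return_complex=True)

wave = torch.randn(32000)        # stand-in for 2 s of 16 kHz mono audio
spec = analyze(wave)             # 1025 frequency bins x 32 time frames
band_mag = spec[BAND].abs()      # (503, 32) magnitudes fed to the U-net;
                                 # phase and remaining bins stay untouched
```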
S2: The 87 Hz-4000 Hz frequency-domain features extracted by the short-time Fourier transform are used as the input of the U-net network, and the watermark information is embedded into the U-net network for encoding.
The encoder used for the encoding is the U-net structure used by Jansson et al. (Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, "Singing voice separation with deep U-Net convolutional networks", 2017). The U-net is a U-shaped symmetric structure divided into a down-sampling stage and an up-sampling stage; the network contains only convolutional and pooling layers and no fully connected layer. The up-sampling and down-sampling stages use the same number of convolution levels, and skip connections link the down-sampling layers to the up-sampling layers, so that the features extracted by the down-sampling layers are passed directly to the up-sampling layers and the fine-grained features of the U-net network become more accurate.
The U-net network framework adopted in the invention is shown in figure 3. The down-sampling part consists of four encoder blocks, each composed of a 2-dimensional convolutional network with stride 2, a 5 x 5 convolution kernel, batch normalization and a ReLU activation function. After the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256, and the watermark information is inserted at this bottleneck as an 8 x 2 x 32 block. To recover the original input size, the up-sampling is likewise performed by four blocks, each composed of a 2-dimensional convolutional network with stride 2, a 5 x 5 convolution kernel, batch normalization and a ReLU activation function. A final 2-dimensional convolution with stride 1 and a 5 x 5 kernel reduces the number of channels to 1.
To this end, the 1 x 32 watermark information is stacked and repeated into an 8 x 2 x 32 block and concatenated to the U-net bottleneck.
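As a concrete illustration, the following PyTorch sketch follows the shapes described above: stride-2 encoder blocks with 5 x 5 kernels, batch normalization and ReLU; an 8 x 2 x 256 bottleneck into which an 8 x 2 x 32 watermark block is concatenated; skip connections between mirrored layers; and a final stride-1 convolution reducing the channels to 1. The channel widths, the 128 x 32 input map (chosen so that four halvings yield the 8 x 2 bottleneck) and the use of transposed convolutions for up-sampling are assumptions made here, since the invention does not fix them.

```python
import torch
import torch.nn as nn

def down_block(c_in, c_out):
    # stride-2 5x5 convolution + batch normalization + ReLU, as described
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(c_out), nn.ReLU())

def up_block(c_in, c_out):
    # transposed 5x5 convolution that exactly doubles both spatial dims
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=5, stride=2,
                           padding=2, output_padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU())

class WatermarkUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.d1, self.d2 = down_block(1, 32), down_block(32, 64)
        self.d3, self.d4 = down_block(64, 128), down_block(128, 256)
        self.u1 = up_block(256 + 32, 128)  # +32 watermark channels at bottleneck
        self.u2 = up_block(128 + 128, 64)  # skip connection from d3
        self.u3 = up_block(64 + 64, 32)    # skip connection from d2
        self.u4 = up_block(32 + 32, 16)    # skip connection from d1
        self.final = nn.Conv2d(16, 1, kernel_size=5, stride=1, padding=2)

    def forward(self, x, bits):
        # x: (B, 1, 128, 32) magnitude map; bits: (B, 32) float 0/1 watermark
        e1 = self.d1(x)                    # (B, 32, 64, 16)
        e2 = self.d2(e1)                   # (B, 64, 32, 8)
        e3 = self.d3(e2)                   # (B, 128, 16, 4)
        e4 = self.d4(e3)                   # (B, 256, 8, 2) bottleneck
        wm = bits.view(-1, 32, 1, 1).expand(-1, -1, 8, 2)  # 1x32 -> 8x2x32
        u = self.u1(torch.cat([e4, wm], dim=1))
        u = self.u2(torch.cat([u, e3], dim=1))
        u = self.u3(torch.cat([u, e2], dim=1))
        u = self.u4(torch.cat([u, e1], dim=1))
        return self.final(u)               # (B, 1, 128, 32), channels -> 1

net = WatermarkUNet()
out = net(torch.randn(4, 1, 128, 32), torch.randint(0, 2, (4, 32)).float())
```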
S3: An inverse short-time Fourier transform is performed on the encoded frequency-domain features to obtain the watermarked audio.
The formula of the inverse short-time Fourier transform is as follows:
y(n) = [ Σ_m S(m,n) w(n-m*N/2) ] / [ Σ_m w²(n-m*N/2) ]
In the formula, y(n) is the signal reconstructed from the short-time Fourier transform; S(m, n) is the function obtained by applying the inverse discrete Fourier transform to the time-frequency spectrum S(m, w); w(n) is the window function; and N is 2048 points.
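Continuing the S1 sketch under the same assumptions, step S3 writes the encoded magnitudes back into bins 10-512, reuses the original phases and untouched bins, and inverts the transform. Here encoded_mag is only a placeholder for the U-net output on that band.

```python
# encoded_mag: (503, 32) magnitudes after watermark encoding; the identity
# assignment below is a placeholder so the sketch runs end to end.
encoded_mag = band_mag

spec_out = spec.clone()
spec_out[BAND] = torch.polar(encoded_mag, spec[BAND].angle())  # keep phase
watermarked = torch.istft(spec_out, n_fft=N_FFT, hop_length=HOP,
                          window=torch.hann_window(N_FFT), length=32000)
```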
As shown in fig. 5, an embodiment of the present invention is a method for identifying an audio watermark based on a deep network, which includes the following step:
S4: The watermarked audio is decoded to obtain the watermark information.
A decoder is used for the decoding. The decoder is a multi-label classifier that takes the spectrogram of the speech signal as input and outputs 32 prediction probability values between 0 and 1; the bit corresponding to each predicted value is a binary bit encoded into the audio watermark, i.e. 0 or 1.
The network structure of the decoder is shown in fig. 4. Specifically, the decoding is performed by 6 decoder blocks, each of which is a 2-dimensional convolutional network with batch normalization and a ReLU activation function. Different strides are tried during training, and the stride that minimizes the loss function is kept. The feature map after the last convolution has size 32 x 1, and the last layer is a fully connected layer with 32 output neurons and a sigmoid function. Each output is a value between 0 and 1, and the decoder loss function is the cross entropy between the output of the decoder network and the binary watermark message of the encoder. The watermark information is finally obtained from these outputs.
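The following PyTorch sketch mirrors this description: six Conv2d + BatchNorm + ReLU blocks, then a fully connected layer with 32 outputs and a sigmoid. The invention treats the per-block strides as tunable (the values minimizing the training loss are kept); the fixed strides, the channel widths and the 128 x 32 input map assumed below are illustrative only.

```python
import torch
import torch.nn as nn

class WatermarkDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 64, 128, 128]
        strides = [2, 2, 2, 1, (2, 1), 1]   # assumed; tuned during training
        layers = []
        for c_in, c_out, s in zip(chans[:-1], chans[1:], strides):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=5, stride=s, padding=2),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.fc = nn.LazyLinear(32)          # fully connected, 32 output neurons

    def forward(self, spec_mag):
        # spec_mag: (B, 1, 128, 32) magnitude spectrogram of watermarked audio
        h = self.conv(spec_mag).flatten(1)
        return torch.sigmoid(self.fc(h))     # 32 probabilities in (0, 1)

decoder = WatermarkDecoder()
probs = decoder(torch.randn(4, 1, 128, 32))  # (4, 32) predicted bit probabilities
```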
Wherein, the formula of the ReLU function is as follows:
y(x)=max(0,x)
the use of the ReLU function can overcome the problem of gradient disappearance during training and speed up the training.
The formula for the loss function is as follows:
Loss = - Σ_{i=1..M} Σ_{c=0,1} y_ic log(pre_ic)
In the formula, Loss denotes the value of the loss function; y_ic is the label (0 or 1) of a binary watermark bit; and pre_ic is the predicted probability that the watermarked speech signal belongs to class c. The information embedded in the audio watermark is a bit, i.e. 0 or 1, so in this patent there are two classes and the value of c is 0 or 1. M is the number of watermark labels; 10,000 different watermark labels are adopted in the invention.
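This is the standard binary cross-entropy over the 32 predicted bit probabilities. Assuming the decoder sketch above, a minimal PyTorch equivalent is:

```python
import torch
import torch.nn.functional as F

probs = decoder(torch.randn(4, 1, 128, 32))  # (4, 32) predicted probabilities
bits = torch.randint(0, 2, (4, 32)).float()  # ground-truth watermark bits
loss = F.binary_cross_entropy(probs, bits)   # the cross entropy defined above
```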
By the above method, information can be encoded and accurately extracted under various noise types, and the watermark model has good robustness to noise.
As shown in fig. 6, an embodiment of the present invention is an apparatus for embedding an audio watermark based on a deep network, and the apparatus includes a preprocessing module 1, an encoder 2, and a processing module 3.
The preprocessing module 1 is used to frame the original audio and perform windowing and a short-time Fourier transform on each frame.
The formula for performing the short-time fourier transform is as follows:
S(m,w)=DTFT{x(n)w(n-m*N/2)}
In the formula, S(m, w) is a two-dimensional function of time and frequency; x(n) is the speech signal; w(n) is the window function, shifted along the time axis in hops of N/2; m is the index of the window shift; N is 2048 points in the present invention; and DTFT denotes the discrete-time Fourier transform.
In the invention, the original audio has a sampling rate of 16000 Hz, a single 16-bit channel and a duration of 2 s, i.e. 32000 sampling points. It is framed and windowed with a window length of 2048 and a window shift of 1024, and the short-time Fourier transform (STFT) for non-stationary signals is then applied, yielding 1025 frequency bins and 32 time frames, each with an amplitude and a phase.
The invention uses only the frequency-domain amplitudes of bins 10-512, which correspond to roughly 87-4000 Hz; the other parts are kept unchanged. The resulting spectrogram of the signal after the short-time Fourier transform is shown in figure 2, where the abscissa is the time axis and the ordinate is the frequency axis.
The encoder 2 is used for taking the 87 Hz-4000 Hz frequency-domain features extracted by the short-time Fourier transform as the input of the U-net network and embedding the watermark information into the U-net network for encoding.
The encoder adopts the U-net structure used by Jansson et al. The U-net is a U-shaped symmetric structure divided into a down-sampling stage and an up-sampling stage; the network contains only convolutional and pooling layers and no fully connected layer. The up-sampling and down-sampling stages use the same number of convolution levels, and skip connections link the down-sampling layers to the up-sampling layers, so that the features extracted by the down-sampling layers are passed directly to the up-sampling layers and the fine-grained features of the U-net network become more accurate.
The U-net network framework adopted in the invention is shown in figure 3. The down-sampling part consists of four encoder blocks, each composed of a 2-dimensional convolutional network with stride 2, a 5 x 5 convolution kernel, batch normalization and a ReLU activation function. After the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256, and the watermark information is inserted at this bottleneck as an 8 x 2 x 32 block. To recover the original input size, the up-sampling is likewise performed by four blocks, each composed of a 2-dimensional convolutional network with stride 2, a 5 x 5 convolution kernel, batch normalization and a ReLU activation function. A final 2-dimensional convolution with stride 1 and a 5 x 5 kernel reduces the number of channels to 1.
In this way, the encoder 2 of the invention stacks and repeats the 1 x 32 watermark information into an 8 x 2 x 32 block that is concatenated to the U-net bottleneck.
The processing module 3 is configured to perform an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio.
The formula of the short-time inverse Fourier transform is as follows:
y(n) = [ Σ_m S(m,n) w(n-m*N/2) ] / [ Σ_m w²(n-m*N/2) ]
In the formula, y(n) is the signal reconstructed from the short-time Fourier transform; S(m, n) is the function obtained by applying the inverse discrete Fourier transform to the time-frequency spectrum S(m, w); w(n) is the window function; and N is 2048 points.
One embodiment of the present invention is an apparatus for identifying an audio watermark based on a deep network, which includes a decoder 4.
The decoder 4 is configured to decode the watermarked audio to obtain the watermark information.
A decoder is used for the decoding. The decoder is a multi-label classifier that takes the spectrogram of the speech signal as input and outputs 32 prediction probability values between 0 and 1, together with the bit corresponding to each predicted value.
The network structure of the decoder is as shown in fig. 4. The decoding is performed by 6 decoder blocks, each of which is a 2-dimensional convolutional network with batch normalization and a ReLU activation function. Different strides are tried during training, and the stride that minimizes the loss function is kept. The feature map after the last convolution has size 32 x 1, and the last layer is a fully connected layer with 32 output neurons and a sigmoid function. Each output is a value between 0 and 1, and the decoder loss function is the cross entropy between the output of the decoder network and the binary watermark message of the encoder. The watermark information is finally obtained from these outputs.
Wherein, the formula of the ReLU function is as follows:
y(x)=max(0,x)
the use of the ReLU function can overcome the problem of gradient disappearance during training and speed up the training.
The formula for the loss function is as follows:
Loss = - Σ_{i=1..M} Σ_{c=0,1} y_ic log(pre_ic)
In the formula, Loss denotes the value of the loss function; y_ic is the label (0 or 1) of a binary watermark bit; pre_ic is the predicted probability that the watermarked speech signal belongs to class c; and M is the number of watermark labels, with 10,000 different watermark labels adopted in the invention.
The invention has been explained in detail through specific examples; the above description of the embodiments is only intended to help understand the core idea of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention fall within the scope of the present invention.

Claims (10)

1. A method for embedding an audio watermark based on a deep network, characterized by comprising the following steps:
S1: framing the original audio, and performing windowing and a short-time Fourier transform on each frame;
S2: using the frequency-domain features extracted by the short-time Fourier transform as the input of a U-net network, and embedding the watermark information into the U-net network for encoding;
S3: performing an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio.
2. The method for embedding an audio watermark according to claim 1, wherein in step S2 the up-sampling stage and the down-sampling stage of the U-net network use the same number of convolution levels, and the down-sampling layers and the up-sampling layers are connected by skip connections.
3. The method for embedding an audio watermark based on a deep network according to claim 2, wherein the down-sampling part of the U-net consists of four encoder blocks, each composed of a 2-dimensional convolutional network with batch normalization; after the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256; and the up-sampling is then performed by four further blocks, each likewise composed of a 2-dimensional convolutional network with batch normalization.
4. A method for identifying an audio watermark based on a deep network, characterized by comprising the following step:
S4: decoding the watermarked audio through a 2-dimensional convolutional network to obtain the watermark information, the decoding process using a 2-dimensional convolutional network with batch normalization.
5. The method for identifying an audio watermark based on a deep network according to claim 4, wherein in step S4 a decoder is used for the decoding; the decoder outputs 32 prediction probability values between 0 and 1, and the watermark information is obtained through a decoder loss function.
6. A device for embedding an audio watermark based on a deep network, characterized by comprising a preprocessing module, an encoder and a processing module, wherein
the preprocessing module is used for framing the original audio and performing windowing and a short-time Fourier transform on each frame;
the encoder is used for taking the frequency-domain features extracted by the short-time Fourier transform as the input of a U-net network and embedding the watermark information into the U-net network for encoding;
and the processing module is used for performing an inverse short-time Fourier transform on the encoded frequency-domain features to obtain the watermarked audio.
7. The device according to claim 6, wherein the up-sampling stage and the down-sampling stage of the U-net network used in the encoder adopt the same number of convolution levels, and the down-sampling layers and the up-sampling layers are connected by skip connections.
8. The device for embedding an audio watermark based on a deep network according to claim 7, wherein the down-sampling part of the U-net consists of four encoder blocks, each composed of a 2-dimensional convolutional network with batch normalization; after the frequency-domain features are mapped by the 4 encoder blocks, the resulting feature map is reduced to 8 x 2 x 256; and the up-sampling is then performed by four further blocks, each likewise composed of a 2-dimensional convolutional network with batch normalization.
9. A device for identifying an audio watermark based on a deep network, characterized by comprising a decoder, wherein
the decoder is used for decoding the watermarked audio through a 2-dimensional convolutional network to obtain the watermark information, the decoding process using a 2-dimensional convolutional network with batch normalization.
10. The device for identifying an audio watermark based on a deep network according to claim 9, wherein the decoder outputs 32 prediction probability values between 0 and 1, and the watermark information is obtained through a decoder loss function.
CN202111250867.9A 2021-10-26 2021-10-26 Method and device for embedding and identifying audio watermark based on deep network Pending CN113990330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250867.9A CN113990330A (en) 2021-10-26 2021-10-26 Method and device for embedding and identifying audio watermark based on deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111250867.9A CN113990330A (en) 2021-10-26 2021-10-26 Method and device for embedding and identifying audio watermark based on deep network

Publications (1)

Publication Number Publication Date
CN113990330A true CN113990330A (en) 2022-01-28

Family

ID=79741966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250867.9A Pending CN113990330A (en) 2021-10-26 2021-10-26 Method and device for embedding and identifying audio watermark based on deep network

Country Status (1)

Country Link
CN (1) CN113990330A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936962A (en) * 2022-06-23 2022-08-23 晋城市大锐金马工程设计咨询有限公司 One-to-one full text watermark encryption adding technology based on document
CN115018734A (en) * 2022-07-15 2022-09-06 北京百度网讯科技有限公司 Video restoration method and training method and device of video restoration model
CN115018734B (en) * 2022-07-15 2023-10-13 北京百度网讯科技有限公司 Video restoration method and training method and device of video restoration model
CN117935820A (en) * 2024-03-21 2024-04-26 北京和人广智科技有限公司 Watermark batch embedding method for audio library

Similar Documents

Publication Publication Date Title
CN113990330A (en) Method and device for embedding and identifying audio watermark based on deep network
CN111091841B (en) Identity authentication audio watermarking algorithm based on deep learning
US6219634B1 (en) Efficient watermark method and apparatus for digital signals
Ahani et al. A sparse representation-based wavelet domain speech steganography method
CN1311581A (en) Method and device for computerized voice data hidden
CN109996073B (en) Image compression method, system, readable storage medium and computer equipment
Shirali-Shahreza et al. High capacity error free wavelet domain speech steganography
Kumsawat A genetic algorithm optimization technique for multiwavelet-based digital audio watermarking
Mosleh et al. A robust intelligent audio watermarking scheme using support vector machine
Ye et al. Heard more than heard: An audio steganography method based on gan
JP2005513543A (en) QIM digital watermarking of multimedia signals
Baziyad et al. Maximizing embedding capacity for speech steganography: a segment-growing approach
CN111292756B (en) Compression-resistant audio silent watermark embedding and extracting method and system
Bao et al. MP3-resistant music steganography based on dynamic range transform
CN114999502B (en) Adaptive word framing based voice content watermark generation and embedding method and voice content integrity authentication and tampering positioning method
Irawati et al. QR-based watermarking in audio subband using DCT
Wei et al. Lightweight AAC Audio Steganalysis Model Based on ResNeXt
Wei et al. Controlling bitrate steganography on AAC audio
KR20030016381A (en) Watermarking
Peng et al. Optimal audio watermarking scheme using genetic optimization
Zhang et al. A CNN based visual audio steganography model
CN110047495B (en) High-capacity audio watermarking algorithm based on 2-level singular value decomposition
Chowdhury A Robust Audio Watermarking In Cepstrum Domain Composed Of Sample's Relation Dependent Embedding And Computationally Simple Extraction Phase
Wang et al. Audio zero watermarking for MP3 based on low frequency energy
Tegendal Watermarking in audio using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination