CN115831131B - Audio watermark embedding and extracting method based on deep learning - Google Patents

Audio watermark embedding and extracting method based on deep learning

Info

Publication number
CN115831131B
CN115831131B (application CN202310056386.7A)
Authority
CN
China
Prior art keywords
audio
watermark
decoder
encoder
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310056386.7A
Other languages
Chinese (zh)
Other versions
CN115831131A (en)
Inventor
张卫明
刘畅
张杰
方涵
马泽华
陈可江
俞能海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310056386.7A priority Critical patent/CN115831131B/en
Publication of CN115831131A publication Critical patent/CN115831131A/en
Application granted granted Critical
Publication of CN115831131B publication Critical patent/CN115831131B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a deep-learning-based audio watermark embedding and extraction method. The method first embeds watermark information into carrier audio using an encoder to obtain watermarked audio; before the watermarked audio is input to the decoder, a distortion layer is inserted between the encoder and the decoder to enhance robustness to the audio re-recording process; the watermarked audio, after passing through the distortion layer, is input to the decoder, which extracts the watermark information. With this method, after the watermark is embedded into the target audio, the watermark information can still be extracted even after the audio undergoes distortions such as noise addition, filtering, compression, resampling, re-quantization, and re-recording, thereby achieving the purposes of audio leakage tracing and copyright protection.

Description

Audio watermark embedding and extracting method based on deep learning
Technical Field
The invention relates to the technical field of digital watermarking, in particular to a method for embedding and extracting an audio watermark based on deep learning.
Background
Digital watermarking has been widely studied for many years as an effective means of tracing leakage sources and protecting copyright. The two most important properties an audio watermark should satisfy are fidelity and robustness: fidelity ensures normal use of the watermarked audio, while robustness ensures that the embedded watermark can be extracted without loss even if the audio undergoes distortion (MPEG encoding, noise addition, audio re-recording, etc.). Most conventional audio watermarking methods focus on robustness against digital distortions in electronic channels, since most audio copying occurs in digital channels. However, with the miniaturization of recording devices, Audio Re-recording (AR) has become a more convenient and efficient way to copy audio. For much important confidential audio (litigation recordings, forensic audio) and pirated paid audio (online-classroom audio, pirated films), re-recording effectively preserves the audio content while significantly destroying the embedded watermark signal, so an attacker can easily and covertly steal audio content by re-recording it, leaving little evidence; fig. 1 is a schematic diagram of a re-recording attack on leaked information in the prior art. Maintaining robustness in such complex scenes is therefore one of the greatest challenges for audio watermarking, and ensuring robustness against re-recording has become a pressing requirement for audio watermarking at the current stage.
At present, research on audio watermarking is mainly based on traditional mathematical algorithms that attempt to find features invariant before and after distortion in which to embed the watermark. Most of the features used lie in a transform domain; for example, the transform-domain features of audio are obtained with frequency-domain transforms such as the Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT), and the Fast Fourier Transform (FFT). However, because the re-recording process is itself complex, quantitatively and finely analyzing its distortion and finding robust, invariant features is very difficult, and none of the prior-art algorithms resists re-recording distortion well.
Disclosure of Invention
The invention aims to provide a deep-learning-based audio watermark embedding and extraction method, with which watermark information can still be extracted from target audio after the watermark is embedded and the audio undergoes distortions such as noise addition, filtering, compression, resampling, re-quantization, and re-recording, thereby achieving the purposes of audio leakage tracing and copyright protection.
The aim of the invention is achieved through the following technical scheme:
A deep-learning-based audio watermark embedding and extraction method, the method comprising:
step 1, embedding watermark information into carrier audio by using an encoder to obtain watermarked audio;
step 2, before the watermarked audio is input to the decoder, inserting a distortion layer between the encoder and the decoder to enhance robustness to the audio re-recording process;
and step 3, inputting the watermarked audio, after it has passed through the distortion layer, into the decoder, and extracting the watermark information therein by the decoder.
According to the technical scheme provided by the invention, after the watermark is embedded into the target audio, the watermark information can still be extracted even after the audio undergoes distortions such as noise addition, filtering, compression, resampling, re-quantization, and re-recording, thereby achieving the purposes of audio leakage tracing and copyright protection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a re-recording attack on leaked information in the prior art.
Fig. 2 is a schematic flow chart of a method for audio watermark embedding and extraction based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention, and this is not limiting to the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, they are carried out under conditions conventional in the art or suggested by the manufacturer. Reagents or apparatus used without a noted manufacturer are conventional products available commercially.
Fig. 2 is a schematic flow chart of a method for audio watermark embedding and extraction based on deep learning according to an embodiment of the present invention, where the method includes:
step 1, embedding watermark information into carrier audio by using an encoder to obtain watermarked audio;
in this step, in step 1, use is made of
Figure SMS_1
To represent a length of +.>
Figure SMS_2
Mono original carrier audio of (a); the original carrier audio is first of all +.>
Figure SMS_3
Transferring to frequency domain to obtain corresponding approximation coefficients +.>
Figure SMS_4
And detail coefficient->
Figure SMS_5
The method comprises the following steps:
Figure SMS_6
wherein the approximation coefficients
Figure SMS_7
And detail coefficient->
Figure SMS_8
Is the length of the original carrier audio +.>
Figure SMS_9
Half of (i.e.)>
Figure SMS_10
Inspired by the traditional audio watermarking, watermark information is obtained
Figure SMS_11
Embedded in the original carrier audio->
Figure SMS_12
In the low frequency of (2), i.e. using approximation coefficients +.>
Figure SMS_13
As carrier of watermark information while preserving detail coefficients +.>
Figure SMS_14
For subsequent audio reconstruction;
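The decomposition above can be illustrated with the PyWavelets library; the following is a minimal sketch, not part of the patent text, and the wavelet basis is not specified in this excerpt, so the Haar wavelet used here is an assumption:

```python
import numpy as np
import pywt

# Original mono carrier audio x_o of length N (random signal as a stand-in).
N = 16000
x_o = np.random.randn(N)

# One-level DWT: approximation coefficients a (low frequency) carry the
# watermark; detail coefficients d are kept for reconstruction.
a, d = pywt.dwt(x_o, 'haar')   # 'haar' is an assumption
assert len(a) == N // 2 and len(d) == N // 2

# Perfect reconstruction via the inverse DWT (IDWT).
x_rec = pywt.idwt(a, d, 'haar')
print(np.allclose(x_o, x_rec))  # True up to numerical precision
```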
the encoder is used for watermark information
Figure SMS_15
Embedded in->
Figure SMS_16
In, as shown in fig. 2, the encoder En generates a residual R and marks it further to +.>
Figure SMS_17
Thereby generating approximation coefficients +.>
Figure SMS_18
The method comprises the following steps:
Figure SMS_19
wherein
Figure SMS_20
Is an intensity factor, default set to 1; en () represents encoder processing.
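A minimal sketch of this residual-style embedding follows; the encoder architecture is not disclosed in this excerpt, so the small 1-D convolutional network, the watermark tiling, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Residual-generating encoder En(a, m); the real architecture is not
    detailed in this excerpt, a small 1-D conv stack is an assumption."""
    def __init__(self, wm_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=5, padding=2),
        )
        self.wm_len = wm_len

    def forward(self, a: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # a: (B, L) approximation coefficients; m: (B, wm_len) watermark bits.
        # Tile the watermark along the time axis so it can be stacked with
        # the coefficients as a second channel.
        L = a.shape[-1]
        m_tiled = m.repeat(1, L // self.wm_len + 1)[:, :L]
        inp = torch.stack([a, m_tiled], dim=1)        # (B, 2, L)
        return self.net(inp).squeeze(1)               # residual R: (B, L)

B, L, wm_len = 4, 8000, 100
a = torch.randn(B, L)
m = torch.randint(0, 2, (B, wm_len)).float() * 2 - 1  # binary watermark in {-1, 1}
en = Encoder(wm_len)
lam = 1.0                                             # intensity factor, default 1
a_w = a + lam * en(a, m)                              # watermarked coefficients
```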
In addition, to meet the fidelity requirement, the watermarked approximation coefficients $a_w$ should stay as consistent with $a$ as possible. A basic loss $\mathcal{L}_e$ is therefore introduced in the training of the encoder, using the mean square error (MSE) as $\mathcal{L}_e$:

$\mathcal{L}_e = \mathrm{MSE}(a, a_w) = \frac{1}{N/2} \sum_{i=1}^{N/2} (a_i - a_{w,i})^2$

where $i$ represents an index number, $a_i$ represents the $i$-th approximation coefficient, and $a_{w,i}$ represents the $i$-th watermarked approximation coefficient.

To further improve fidelity and minimize the domain difference between $a$ and $a_w$, an additional discriminator D is introduced to create adversarial training with the encoder. The adversarial loss $\mathcal{L}_{adv}$ drives the encoder to embed the watermark information so well that the discriminator cannot distinguish $a$ from $a_w$, thereby minimizing the domain gap between them:

$\mathcal{L}_{adv} = -\log\big(1 - D(a_w)\big)$

where $D(\cdot)$ represents the discriminator processing.
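The two training losses can be sketched as follows; the discriminator architecture and the exact GAN formulation are assumptions, chosen to be consistent with the loss written above:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(): predicted probability that the input coefficients carry a
    watermark; the concrete architecture is an illustrative assumption."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(8, 1), nn.Sigmoid(),
        )

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        return self.net(a.unsqueeze(1)).squeeze(1)  # (B, L) -> (B,)

disc = Discriminator()
mse = nn.MSELoss()

def encoder_losses(a: torch.Tensor, a_w: torch.Tensor, eps: float = 1e-6):
    l_e = mse(a_w, a)                               # basic fidelity loss L_e
    # Adversarial loss for the encoder: pushing D(a_w) towards 0
    # ("not watermarked") makes a_w indistinguishable from a.
    l_adv = -torch.log(1.0 - disc(a_w) + eps).mean()
    return l_e, l_adv

a = torch.randn(4, 8000)
a_w = a + 0.01 * torch.randn(4, 8000)               # stand-in for encoder output
l_e, l_adv = encoder_losses(a, a_w)
```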
Step 2, before the watermarked audio is input to the decoder, inserting a distortion layer between the encoder and the decoder to enhance robustness to the audio re-recording process;
in this step, it is necessary to make the distortion layer tiny, which can prevent gradient interruption in the end-to-end learning process, whereas the transcription process is a complex non-differential process, so the inserted distortion layer is set as a differential audio re-recording operation DAR, including ambient reverberation, band-pass filtering, and gaussian noise; in a specific implementation, in order to realize robustness to the dubbing, the dubbing process is analyzed from the influence of sound propagation in the air and the processing of a microphone and a loudspeaker; from the analysis, this example models the transcription distortion finely by several differential operations (ambient reverberation, band-pass filtering, and gaussian noise) and uses these operations as a distortion layer with the proposed framework;
since DAR is a process running in the time domain, it cannot be directly applied to approximation coefficients containing watermarks
Figure SMS_37
The approximation coefficients containing the watermark are therefore scaled using inverse DWT, IDWT (Inverse Discrete Wavelet Transform)
Figure SMS_38
And the corresponding detail coefficients->
Figure SMS_39
Transform back to watermarked audio->
Figure SMS_40
The method comprises the following steps:
Figure SMS_41
the environmental reverberation specifically includes:
the impulse response is the response of the environment upon receipt of a brief input signal describing the acoustic properties of the environment, in particular the spatial reverberation behavior, the impulse response reproducing the reverberation in the environment by convolution, collecting different base impulse responses from different microphones, room environments and loudspeakers to form a set
Figure SMS_42
Given target audio +.>
Figure SMS_43
From the collection->
Figure SMS_44
A basic impulse response is selected randomly>
Figure SMS_45
And by means of the collection->
Figure SMS_46
Audio of the object->
Figure SMS_47
Convolving->
Figure SMS_48
Operate to simulate ambient reverberation ER (), i.e.:
Figure SMS_49
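A sketch of this operation using SciPy's FFT-based convolution follows; the synthetic decaying impulse response stands in for a measured one (real sets are recorded with actual rooms, speakers, and microphones) and is purely illustrative:

```python
import numpy as np
from scipy.signal import fftconvolve

def ambient_reverb(x_t: np.ndarray, ir_set: list) -> np.ndarray:
    """ER(): convolve the target audio with a randomly chosen basic impulse
    response from the collected set, simulating room reverberation."""
    h = ir_set[np.random.randint(len(ir_set))]
    return fftconvolve(x_t, h, mode='full')[: len(x_t)]

# A synthetic exponentially decaying impulse response as a stand-in.
fs = 16000
t = np.arange(int(0.3 * fs)) / fs
h_demo = np.exp(-8.0 * t) * np.random.randn(len(t)) * 0.1
h_demo[0] = 1.0  # direct sound path
x_reverb = ambient_reverb(np.random.randn(fs), [h_demo])
```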
the band-pass filtering specifically comprises the following steps:
because of the limited frequency band of human hearing, the widely used normal frequency range is 500Hz to 2000 Hz, based on which the conventional speaker does not play audio in too high or too low a frequency band, while the microphone also processes the played audio, typically cuts off the frequency band outside the normal range, to reduce noise, i.e. a basic de-noising process, thus, to simulate distortion caused by the inherent characteristics of the speaker and microphone, the audio containing the watermark
Figure SMS_50
Applying frequency band-pass filtering
Figure SMS_51
Operation, given target Audio->
Figure SMS_52
Execute +.>
Figure SMS_53
Figure SMS_54
wherein
Figure SMS_55
and />
Figure SMS_56
Representing low-pass filtering and high-pass filtering, respectively; />
Figure SMS_57
and />
Figure SMS_58
Representation->
Figure SMS_59
and />
Figure SMS_60
A corresponding threshold;
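A sketch with a Butterworth band-pass filter follows; the filter family and order are assumptions, since the text only specifies the high-pass/low-pass composition and the cutoff thresholds:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_pass(x_t: np.ndarray, f_low: float, f_high: float, fs: float) -> np.ndarray:
    """BP(): pass only [f_low, f_high], emulating the limited pass-band of
    speakers and microphones. Butterworth of order 4 is an assumption."""
    sos = butter(4, [f_low, f_high], btype='bandpass', fs=fs, output='sos')
    return sosfilt(sos, x_t)

fs = 16000
x_t = np.random.randn(fs)
# 500 Hz - 2000 Hz is the "normal" range quoted in the description above.
x_bp = band_pass(x_t, 500.0, 2000.0, fs)
```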
the Gaussian noise is specifically:
in addition to the two components, random noise caused by uncertainty factors in the reproduction process is simulated by introducing Gaussian noise, which is an additive noise widely used in current automatic speech recognition schemes to enhance robustness to random environmental noise, in particular by directly superimposing Gaussian noise
Figure SMS_61
In the target audio +.>
Figure SMS_62
Upper implementation of Add Gaussian noise operation->
Figure SMS_63
The method comprises the following steps:
Figure SMS_64
wherein ,
Figure SMS_65
representing gaussian noise; />
Figure SMS_66
Mean value 0, variance ++>
Figure SMS_67
Is a gaussian distribution of (c).
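A short sketch of the operation; the noise level sigma used here is an arbitrary illustrative value:

```python
import numpy as np

def add_gaussian_noise(x_t: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """GN(): superimpose zero-mean Gaussian noise of variance sigma^2."""
    n = np.random.normal(loc=0.0, scale=sigma, size=x_t.shape)
    return x_t + n

x_noisy = add_gaussian_noise(np.random.randn(16000))
```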
And step 3, inputting the watermarked audio, after it has passed through the distortion layer, into the decoder, and extracting the watermark information therein by the decoder.
In this step, the processing procedure of the distortion layer DAR is as follows:

$\mathrm{DAR}(x) = \mathrm{GN}\big(\mathrm{BP}(\mathrm{ER}(x))\big)$

For the watermarked audio $x_w$, the watermarked audio $x'_w$ after being subjected to the distortion layer DAR is finally obtained:

$x'_w = \mathrm{DAR}(x_w)$

The discrete wavelet transform DWT is used to obtain the approximation coefficients $a'_w$ and detail coefficients $d'_w$ corresponding to $x'_w$, and the approximation coefficients $a'_w$ are input into the decoder De, which extracts the watermark $m'$:

$(a'_w, d'_w) = \mathrm{DWT}(x'_w)$

$m' = \mathrm{De}(a'_w)$

where $\mathrm{De}(\cdot)$ represents the decoder processing.
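Composing the three operations gives the full distortion-layer pass. This NumPy sketch reuses the functions and the impulse response h_demo from the previous sketches (assumed in scope); it is non-differentiable as written, whereas training would use differentiable (e.g. PyTorch) counterparts of each operation:

```python
import numpy as np
import pywt

def dar(x_w, ir_set, f_low, f_high, fs, sigma):
    """DAR(x) = GN(BP(ER(x))): the differentiable audio re-recording layer,
    illustrated here with the NumPy versions of the three operations."""
    return add_gaussian_noise(
        band_pass(ambient_reverb(x_w, ir_set), f_low, f_high, fs), sigma)

fs = 16000
x_w = np.random.randn(fs)                  # stand-in for watermarked audio
x_w_dist = dar(x_w, [h_demo], 500.0, 2000.0, fs, 0.01)

# Decoder side: DWT the distorted audio and feed the approximation
# coefficients to the trained decoder De (the decoder itself is not shown).
a_w_dist, d_w_dist = pywt.dwt(x_w_dist, 'haar')
# m_prime = decoder(torch.from_numpy(a_w_dist).float().unsqueeze(0))
```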
In a specific implementation, a watermark loss $\mathcal{L}_m$ is further introduced, i.e. the mean square error loss MSE (Mean Square Error) between the watermark information $m$ and the watermark $m'$ extracted by the decoder:

$\mathcal{L}_m = \mathrm{MSE}(m, m')$

A binary watermark taking values in $\{-1, 1\}$ rather than $\{0, 1\}$ is employed, which is more advantageous for watermark embedding and extraction by the model. In this case, for watermarked audio, the distribution of the watermark $m'$ extracted by the decoder should be as close as possible to $-1$ and $1$; for unwatermarked audio, the distribution of the extracted watermark should be close to 0, which facilitates the MSE-based constraint.
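A sketch of how the {-1, 1} convention can be used at extraction time; the sign threshold and the bit-accuracy (ACC) computation are assumptions consistent with the description above:

```python
import numpy as np

def recover_bits(m_prime: np.ndarray) -> np.ndarray:
    """Map decoder outputs (trained towards -1/+1) back to binary bits via a
    simple sign threshold; values near 0 would indicate unwatermarked audio."""
    return (np.sign(m_prime) > 0).astype(np.int8)

def bit_accuracy(m_true: np.ndarray, m_prime: np.ndarray) -> float:
    """Fraction of correctly recovered watermark bits (the ACC metric
    reported in the experiments below)."""
    return float(np.mean(recover_bits(m_prime) == (m_true > 0)))

m_true = np.random.choice([-1.0, 1.0], size=100)
m_pred = m_true + np.random.normal(0, 0.3, size=100)  # simulated decoder output
print(f"ACC = {bit_accuracy(m_true, m_pred):.2%}")
```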
It is noted that what is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art.
In order to illustrate the effect of the scheme of the embodiment of the present invention, the following detailed description is made through experiments:
1) Fidelity testing
First, the fidelity of the method described herein is compared with the existing baseline methods. As shown in table 1, the method achieves an SNR of 25.86 dB, superior to both baselines.
Table 1. Quantitative comparison with baseline methods

Metric     This method   Baseline 1   Baseline 2
SNR (dB)   25.86         25.81        24.94
ACC (%)    99.18         77.09        56.0
2) Robustness test for audio re-recording
The robustness to audio re-recording was compared in this experiment, with quantitative results provided in table 2; the extraction accuracy significantly exceeds the baseline methods (by over 20% and 40%, respectively). In addition to the default distance (5 cm), a controlled comparison with the baseline method was further made under different conditions. As shown in table 2, the method of this embodiment performs better over long distances; as the distance increases, the robustness to re-recording decreases correspondingly but remains acceptable (all above 90%).
Table 2. Robustness comparison (ACC, %) for re-recording at different distances

Distance (cm)   5       20      50      100
This method     99.18   98.55   93.40   92.68
Baseline 1      77.09   82.64   74.76   66.02
3) Robustness testing against other common distortions
To compare robustness more comprehensively, further evaluations were made under other common distortions in digital transmission, namely Gaussian noise at different signal-to-noise ratios (20 dB, 30 dB, 40 dB, 50 dB), MP3 compression (64 kbps, 128 kbps), band-pass filtering (1 kHz high-pass, 4 kHz low-pass), resampling, clipping, amplitude modification, re-quantization, and median filtering. As shown in table 3, the method of the present application is robust against all of these types of distortion.
Table 3. Robustness to other common distortions (ACC, default/enhanced)

[Table 3 appears as an image in the original publication; its per-distortion values are not recoverable here.]
The experimental results show that the method of the embodiment of the invention can automatically embed the audio watermark and robustly extract it under various distortions, achieving higher extraction accuracy than existing methods.
In summary, after watermark information is embedded in audio, the method of the embodiment of the invention achieves robust extraction of the watermark under common audio-processing distortions, watermark attacks, and Audio Re-recording (AR) distortion, thereby achieving the purposes of leakage tracing and copyright protection.
In addition, it will be understood by those skilled in the art that all or part of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, and the corresponding program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or an optical disk, etc.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims. The information disclosed in the background section herein is only for enhancement of understanding of the general background of the invention and is not to be taken as an admission or any form of suggestion that this information forms the prior art already known to those of ordinary skill in the art.

Claims (3)

1. A method for deep learning-based audio watermark embedding and extraction, the method comprising:
step 1, embedding watermark information into carrier audio by using an encoder to obtain watermarked audio;
in step 1, $x_o$ is used to represent a mono original carrier audio of length $N$; the original carrier audio $x_o$ is first transferred to the frequency domain by a differentiable discrete wavelet transform DWT to obtain the corresponding approximation coefficients $a$ and detail coefficients $d$:

$(a, d) = \mathrm{DWT}(x_o)$

wherein the approximation coefficients $a$ and the detail coefficients $d$ each have a length equal to half the length $N$ of the original carrier audio, i.e. $N/2$; the watermark information $m$ is embedded in the low frequencies of the original carrier audio $x_o$, i.e. the approximation coefficients $a$ are used as the carrier of the watermark information while the detail coefficients $d$ are preserved for subsequent audio reconstruction;
the encoder is used to embed the watermark information $m$ into $a$, specifically: the encoder generates a residual $R = \mathrm{En}(a, m)$ and adds it to $a$, thereby generating the watermarked approximation coefficients $a_w$:

$a_w = a + \lambda \cdot \mathrm{En}(a, m)$

wherein $\lambda$ is an intensity factor, set to 1 by default, and $\mathrm{En}(\cdot)$ represents the encoder processing;
in order to keep the watermarked approximation coefficients $a_w$ as consistent with $a$ as possible, a basic loss $\mathcal{L}_e$ is introduced in the training of the encoder, using the mean square error as $\mathcal{L}_e$:

$\mathcal{L}_e = \mathrm{MSE}(a, a_w) = \frac{1}{N/2} \sum_{i=1}^{N/2} (a_i - a_{w,i})^2$

wherein $i$ represents an index number, $a_i$ represents the $i$-th approximation coefficient, and $a_{w,i}$ represents the $i$-th watermarked approximation coefficient;
to minimize the domain difference between $a$ and $a_w$, an additional discriminator is introduced to create adversarial training with the encoder, with an adversarial loss $\mathcal{L}_{adv}$ that drives the encoder to embed the watermark information so that the discriminator cannot distinguish $a$ from $a_w$, thereby minimizing the domain gap between them:

$\mathcal{L}_{adv} = -\log\big(1 - D(a_w)\big)$

wherein $D(\cdot)$ represents the discriminator processing;
step 2, before the watermarked audio is input to the decoder, inserting a distortion layer between the encoder and the decoder to enhance robustness to the audio re-recording process;
in step 2, the inserted distortion layer is a differentiable audio re-recording operation DAR, comprising ambient reverberation, band-pass filtering, and Gaussian noise;
since DAR is a process running in the time domain, it cannot be applied directly to the watermarked approximation coefficients $a_w$; therefore the inverse DWT, i.e. IDWT, transforms the watermarked approximation coefficients $a_w$ and the corresponding detail coefficients $d$ back into the watermarked audio $x_w$:

$x_w = \mathrm{IDWT}(a_w, d)$
the ambient reverberation is specifically: the impulse response reproduces reverberation in the environment by convolution; different basic impulse responses are collected from different microphones, room environments, and loudspeakers to form a set $\mathcal{I}$; given target audio $x_t$, a basic impulse response $h$ is randomly selected from the set $\mathcal{I}$ and convolved with the target audio $x_t$ to simulate the ambient reverberation operation $\mathrm{ER}(\cdot)$, i.e.:

$\mathrm{ER}(x_t) = x_t * h, \quad h \in \mathcal{I}$
the band-pass filtering is specifically: to simulate the distortion caused by the inherent characteristics of loudspeakers and microphones, band-pass filtering $\mathrm{BP}(\cdot)$ is applied to the watermarked audio; given target audio $x_t$, the following operation is executed:

$\mathrm{BP}(x_t) = \mathrm{LP}\big(\mathrm{HP}(x_t, f_h), f_l\big)$

wherein $\mathrm{LP}$ and $\mathrm{HP}$ represent low-pass filtering and high-pass filtering, respectively, and $f_l$ and $f_h$ represent the thresholds corresponding to $\mathrm{LP}$ and $\mathrm{HP}$;
the Gaussian noise is specifically: random noise caused by uncertainty factors during re-recording is simulated by introducing Gaussian noise, specifically by directly superimposing Gaussian noise $n$ on the target audio $x_t$ to implement the add-Gaussian-noise operation $\mathrm{GN}(\cdot)$:

$\mathrm{GN}(x_t) = x_t + n, \quad n \sim \mathcal{N}(0, \sigma^2)$

wherein $n$ represents Gaussian noise, drawn from a Gaussian distribution with mean 0 and variance $\sigma^2$;
and step 3, inputting the watermarked audio, after it has passed through the distortion layer, into the decoder, and extracting the watermark information therein by the decoder.
2. The method for deep learning-based audio watermark embedding and extraction of claim 1, wherein in step 3 the processing procedure of the distortion layer DAR is as follows:

$\mathrm{DAR}(x) = \mathrm{GN}\big(\mathrm{BP}(\mathrm{ER}(x))\big)$

for the watermarked audio $x_w$, the watermarked audio $x'_w$ after being subjected to the distortion layer DAR is finally obtained:

$x'_w = \mathrm{DAR}(x_w)$

the discrete wavelet transform DWT is used to obtain the approximation coefficients $a'_w$ and detail coefficients $d'_w$ corresponding to $x'_w$, and the approximation coefficients $a'_w$ are input into the decoder De, which extracts the watermark $m'$:

$(a'_w, d'_w) = \mathrm{DWT}(x'_w)$

$m' = \mathrm{De}(a'_w)$

wherein $\mathrm{De}(\cdot)$ represents the decoder processing.
3. The method for deep learning-based audio watermark embedding and extraction of claim 2, wherein in step 3 a watermark loss $\mathcal{L}_m$ is further introduced, i.e. the mean square error MSE between the watermark information $m$ and the watermark $m'$ extracted by the decoder, namely:

$\mathcal{L}_m = \mathrm{MSE}(m, m')$

a binary watermark taking values in $\{-1, 1\}$ rather than $\{0, 1\}$ is employed; in this case, for watermarked audio, the distribution of the watermark $m'$ extracted by the decoder should be as close as possible to $-1$ and $1$, and for unwatermarked audio, the distribution of the extracted watermark should be close to 0.
CN202310056386.7A 2023-01-15 2023-01-15 Audio watermark embedding and extracting method based on deep learning Active CN115831131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310056386.7A CN115831131B (en) 2023-01-15 2023-01-15 Audio watermark embedding and extracting method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310056386.7A CN115831131B (en) 2023-01-15 2023-01-15 Audio watermark embedding and extracting method based on deep learning

Publications (2)

Publication Number Publication Date
CN115831131A CN115831131A (en) 2023-03-21
CN115831131B (en) 2023-06-16

Family

ID=85520711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310056386.7A Active CN115831131B (en) 2023-01-15 2023-01-15 Audio watermark embedding and extracting method based on deep learning

Country Status (1)

Country Link
CN (1) CN115831131B (en)

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19852805A1 (en) * 1998-11-15 2000-05-18 Florian M Koenig Telephone with improved speech understanding, several microphones, and special speech signal processing
AU2001231109A1 (en) * 2000-01-24 2001-07-31 Businger, Peter A. Transform domain allocation for multimedia watermarking
CN101115124B (en) * 2006-07-26 2012-04-18 日电(中国)有限公司 Method and apparatus for identifying media program based on audio watermark
CN101290772B (en) * 2008-03-27 2011-06-01 上海交通大学 Embedding and extracting method for audio zero water mark based on vector quantization of coefficient of mixed domain
JP2011197619A (en) * 2010-03-18 2011-10-06 Yasuo Sano Perception improving means for acoustic signal using electronic watermark
EP2431970A1 (en) * 2010-09-21 2012-03-21 Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. Watermark generator, watermark decoder, method for providing a watermarked signal based on discrete valued data and method for providing discrete valued data in dependence on a watermarked signal
CN102074237B (en) * 2010-11-30 2012-07-04 辽宁师范大学 Digital audio watermarking method based on invariant characteristic of histogram
CN102074238A (en) * 2010-12-13 2011-05-25 山东科技大学 Linear interference cancellation-based speech secrete communication method
CN108962267B (en) * 2018-07-09 2019-11-15 成都信息工程大学 A kind of encryption voice content authentication method based on Hash feature
CN111292756B (en) * 2020-01-19 2023-05-26 成都潜在人工智能科技有限公司 Compression-resistant audio silent watermark embedding and extracting method and system
CN113808557A (en) * 2020-06-12 2021-12-17 比亚迪股份有限公司 Vehicle-mounted audio processing system, method and device

Also Published As

Publication number Publication date
CN115831131A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
Hua et al. Twenty years of digital audio watermarking—a comprehensive review
JP3856652B2 (en) Hidden data embedding method and apparatus
Kirovski et al. Blind pattern matching attack on watermarking systems
Dhar et al. A new audio watermarking system using discrete fourier transform for copyright protection
JP2002519916A (en) Apparatus and method for embedding information into analog signal using duplicate modulation
WO2002091376A1 (en) Generation and detection of a watermark robust against resampling of an audio signal
Dhar et al. Digital watermarking scheme based on fast Fourier transformation for audio copyright protection
US20180144755A1 (en) Method and apparatus for inserting watermark to audio signal and detecting watermark from audio signal
Cvejic et al. Robust audio watermarking in wavelet domain using frequency hopping and patchwork method
JP4504681B2 (en) Method and device for embedding auxiliary data in an information signal
Zhang et al. An audio digital watermarking algorithm transmitted via air channel in double DCT domain
Liu et al. Dear: A deep-learning-based audio re-recording resilient watermarking
GB2431838A (en) Audio processing
CN115831131B (en) Audio watermark embedding and extracting method based on deep learning
US20020184503A1 (en) Watermarking
KR20030016381A (en) Watermarking
Subhashini et al. Robust audio watermarking for monitoring and information embedding
WO2001026110A1 (en) Embedding and detecting watermarks in one-dimensional information signals
Arnold et al. Quality evaluation of watermarked audio tracks
Acevedo Audio watermarking: properties, techniques and evaluation
JP2000209097A (en) Signal processor, signal processing method, signal recorder, signal reproducing device and recording medium
Dhar et al. An efficient audio watermarking algorithm in frequency domain for copyright protection
CN114743555A (en) Method and device for realizing audio watermarking
Cai et al. A WAV Format Audio Digital Watermarking Algorithm Based on HAS
KR100821349B1 (en) Method for generating digital watermark and detecting digital watermark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant