CN115831131A

CN115831131A - Deep learning-based audio watermark embedding and extracting method

Info

Publication number: CN115831131A
Application number: CN202310056386.7A
Authority: CN
Inventors: 张卫明; 刘畅; 张�杰; 方涵; 马泽华; 陈可江; 俞能海
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-01-15
Filing date: 2023-01-15
Publication date: 2023-03-21
Anticipated expiration: 2043-01-15
Also published as: CN115831131B

Abstract

The invention discloses an audio watermark embedding and extracting method based on deep learning, which comprises the steps of firstly, embedding watermark information into carrier audio by utilizing an encoder to obtain audio containing watermarks; before the audio containing the watermark is input into a decoder, a distortion layer is inserted between the encoder and the decoder for enhancing the robustness of the audio dubbing process; the audio containing the watermark after the distortion layer is input to a decoder, and watermark information in the audio is extracted by the decoder. The method can extract the watermark information after the watermark is embedded into the target audio frequency and the target audio frequency is subjected to distortion such as noise adding, filtering, compression, resampling, requantization, dubbing and the like, thereby realizing the purposes of audio frequency divulgence and tracing and copyright protection.

Description

Deep learning-based audio watermark embedding and extracting method

Technical Field

The invention relates to the technical field of digital watermarks, in particular to an audio watermark embedding and extracting method based on deep learning.

Background

Digital watermarking has been widely studied for many years as an effective method of tracking the source of leakage and copyright protection. The two most important attributes that an audio watermark should satisfy are fidelity, which ensures normal use of watermarked audio, and robustness, which ensures that the embedded watermark can be extracted without loss even if the audio is subject to distortion (MPEG encoding, noise addition, audio re-recording, etc.). Most conventional audio watermarking methods focus on the robustness of digital distortion in the electronic channel, since most audio reproduction occurs in the digital channel. However, with the miniaturization of recording devices, audio Re-recording (AR) has become a more convenient and efficient way to copy Audio, when Audio is used as a carrier for information transmission, for many important confidential Audio information (litigation recording, forensic Audio) and paid Audio piracy (network classroom Audio, movie piracy), since the dubbing can effectively retain Audio content and significantly destroy embedded watermark signals, by means of the dubbing, an attacker can easily and covertly perform Audio content information stealing and cannot leave evidence easily, as shown in fig. 1, which is a schematic diagram of leaked information in the prior art for dubbing operation, how to maintain robustness under a complex scene is one of the greatest challenges for Audio watermarking, and it is ensured that robustness to dubbing becomes a critical task for Audio watermarking at present.

At present, the field of audio watermark research still mainly uses a traditional mathematical algorithm to try to find features which are unchanged before and after distortion to embed the watermark, and most of the used features are in a transform domain, for example, the transform domain features of audio are obtained by using audio frequency domain conversion methods such as Discrete Cosine Transform (DCT), discrete Wavelet Transform (DWT), fast Fourier Transform (FFT) and the like. However, due to the complexity of the dubbing process itself, it is a very difficult task to quantitatively and finely analyze the distortion and find robust and invariant features in the process, and thus none of the prior art algorithms is very resistant to dubbing distortion.

Disclosure of Invention

The invention aims to provide an audio watermark embedding and extracting method based on deep learning, which can extract watermark information in a target audio after the watermark is embedded into the target audio and the target audio is subjected to distortion such as noise adding, filtering, compression, resampling, requantization, dubbing and the like, thereby realizing the purposes of audio leakage tracing and copyright protection.

The purpose of the invention is realized by the following technical scheme:

a method of deep learning based audio watermark embedding extraction, the method comprising:

step 1, embedding watermark information into carrier audio by using an encoder to obtain audio containing a watermark;

step 2, before the audio containing the watermark is input into a decoder, a distortion layer is inserted between the encoder and the decoder for enhancing the robustness of the audio copying process;

and 3, inputting the audio containing the watermark after the distortion layer into a decoder, and extracting the watermark information by the decoder.

According to the technical scheme provided by the invention, after the watermark is embedded into the target audio, the watermark information in the target audio can still be extracted after the target audio is subjected to distortion such as noise addition, filtering, compression, resampling, requantization, dubbing and the like, so that the aims of audio leakage tracing and copyright protection are fulfilled.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

Fig. 1 is a schematic diagram of a dubbing operation on leaked information in the prior art.

Fig. 2 is a flowchart of a method for embedding and extracting an audio watermark based on deep learning according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all of the embodiments, and this does not limit the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer. The reagents or instruments used in the examples of the present invention are not specified by manufacturers, and are all conventional products available by commercial purchase.

Fig. 2 is a schematic flow chart of a method for embedding and extracting an audio watermark based on deep learning according to an embodiment of the present invention, where the method includes:

step 1, embedding watermark information into a carrier audio by using an encoder to obtain an audio containing a watermark;

in this step, in step 1, with

To express a length of

The mono original carrier audio; first by differentiable Discrete wavelet transform (Discrete)Wavelet Transform, DWT) to convert the original carrier audio

Transferring to frequency domain to obtain corresponding approximate coefficient

And detail coefficient

Namely:

wherein the approximation coefficient

And detail coefficient

Is the original carrier audio

Is one half, i.e.

；

Inspired by traditional audio watermarking, watermarking information

Embedding into original carrier audio

In low frequencies, i.e. using approximation coefficients

As carriers of watermark information while preserving detail coefficients

For subsequent audio reconstruction;

the encoder is used for encoding watermark information

Is embedded into

As shown in fig. 2, the encoder En generates a residual R and marks it further to

Thereby generating approximate coefficients containing the watermark

Namely:

wherein

Is an intensity factor, set to 1 by default; en (.) denotes encoder processing.

In addition, to meet fidelity requirements, approximate coefficients containing watermarks are used

As far as possible identical to the original

Keeping the same introduces a fundamental loss in the encoder training

By mean square error

As

Namely:

wherein i represents an index number;

represents the ith approximation coefficient;

representing the ith watermarked approximation coefficient;

to further improve fidelity and minimize

And

the domain gap between them, an additional discriminator D is introduced for forming a countertrain with the encoder, to counter the loss

For better embedding of watermark information by the encoder, making it indistinguishable by the discriminator

And

thereby minimizing

And

the domain gap between, namely:

；

where D (.) represents discriminator processing.

in this step, it is essential to make the distortion layer differentiable, which can prevent the gradient interruption in the end-to-end learning process, however, the dubbing process is a complicated non-differential process, so the inserted distortion layer is set to the differential audio re-recording operation DAR, including ambient reverberation, band-pass filtering and gaussian noise; in a specific implementation, in order to realize robustness to dubbing, a dubbing process is analyzed from the influence of sound propagation in the air and the processing of a microphone and a loudspeaker; based on the analysis, the present example elaborately models transcription distortion through several differential operations (ambient reverberation, band-pass filtering and gaussian noise) and uses these as distortion layers with the proposed framework;

DAR cannot be applied directly to approximate coefficients containing watermarks, since it is a process running in the time domain

Therefore, inverse DWT (Inverse Discrete Wavelet Transform) is used to watermark the approximate coefficients

And corresponding detail coefficients

Transforming back watermarked audio

Namely:

wherein the ambient reverberation is specifically:

the impulse response is the reaction of the environment upon receipt of a brief input signal, describing the acoustic properties of the environment, in particular the spatial reverberation behavior, and reproduces the reverberation in the environment by convolution, collecting different basic impulse responses from different microphones, room environments and loudspeakers to form a set

Given a target audio

From the set

In which a basic impulse response is randomly selected

And by aggregation

For target audio

Performing convolution

Operate to simulate the ambient reverberation ER (), namely:

the band-pass filtering specifically comprises:

since the frequency band of human hearing is limited, the widely used normal frequency range is 500Hz to 2000 Hz, based on which the commonly used speaker does not play audio with too high or too low frequency band, while the microphone also processes the played audio, usually by cutting off the frequency band outside the normal range to reduce noise, i.e. a basic noise removal process, so that the audio with watermark is processed in order to simulate the distortion caused by the inherent characteristics of the speaker and microphone

Using band-pass filtering

Operation, given target Audio

Is carried out as follows

：

wherein

And

respectively representing low-pass filtering and high-pass filtering;

and

to represent

And

a corresponding threshold value;

the gaussian noise is specifically:

in addition to the above two components, the random noise caused by uncertain factors in the dubbing process is simulated by introducing gaussian noise, which is an additive noise and widely used in the current automatic speech recognition scheme to enhance the robustness to the random environmental noise, specifically by directly superposing the gaussian noise

At the target audio

To implement additive Gaussian noise operation

Namely:

wherein ,

representing gaussian noise;

represents a mean of 0 and a variance of

A gaussian distribution of (a).

In this step, the processing procedure of the distortion layer DAR is as follows:

for audio with watermark

Finally, the audio containing the watermark after being subjected to the DAR processing of the distortion layer is obtained

Namely:

obtaining the same by Discrete Wavelet Transform (DWT)

Corresponding approximation coefficient

And detail coefficient

And combining the approximation coefficients

Input to a decoder De, the watermark being extracted by the decoder De

Namely:

where De (.) denotes decoder processing.

In a specific implementation, watermark loss is further introduced

I.e. watermark information

And watermark extracted by decoder

Mean Square Error loss MSE (Mean Square Error) between, i.e.:

when using binary watermarking

Rather than to

This is more advantageous for model watermark embedding and extraction, in this case for watermarked audio

Watermark extracted by decoder

Should be as close as possible to-1 and 1; for audio without watermarks, watermarks extracted by the decoder

The distribution should be close to 0, which helps the MSE based constraint work.

It is noted that those skilled in the art will recognize that the embodiments of the present invention described in detail herein are not limited to the embodiments of the present invention described in detail herein.

To illustrate the effects of the embodiments of the present invention, the following experiments are described in detail:

1) Fidelity test

First, the fidelity of the method described in the present application was compared to the existing baseline method, and as shown in table 1, the method achieved an SNR of 25.86, which is superior to the existing baseline method.

Table 1 quantitative comparison with baseline method

Index (I)	Method for producing a composite material	Base line 1	Base line 2
				SNR(dB)	25.86	25.81	24.94
ACC(%)	99.18	77.09	56.0

2) Robustness testing for audio dubbing

The robustness of the audio re-recording was compared in this experiment and quantitative results are provided in table 2, with the method being significantly better than the baseline method (over 20% and 40%, respectively) at comparable fidelity. In addition to the default distance (5 cm), further control comparisons were made with the baseline method under different conditions. As shown in table 2, the method described in the embodiment of the present application performs better over a wide range of distances, and as the distance increases, the robustness to dubbing decreases accordingly, but is still acceptable (both above 90%).

TABLE 2 comparison of robustness of dubbing at different distances

Distance (cm)	5	20	50	100
					Method for producing a composite material	99.18	98.55	93.40	92.68
Base line 1	77.09	82.64	74.76	66.02

3) Robustness testing for other common distortions

To compare robustness more fully, further evaluations were made at other common distortions in the digital transmission process, namely gaussian noise at different signal-to-noise ratios (20 dB, 30 dB, 40 dB, 50 dB), MP3 compression (64 kbps, 128 kbps), bandpass (1 kHz high pass, 4 kHz), resampling, clipping, amplitude modification, re-quantization and median filtering. As shown in table 3, the method employed in the present application is robust against all types of distortions.

TABLE 3 robustness to other common distortions, ACC Default/enhanced version

The above experimental results show that: the method of the embodiment of the invention can automatically realize the embedding of the audio watermark and the robust extraction under various distortions, and can achieve higher extraction accuracy rate compared with the prior method.

In summary, after the watermark information is embedded in the Audio, the method according to the embodiment of the present invention can implement robust extraction on the watermark in a common Audio processing distortion scene, a watermark attack scene, and recording (AR) distortion, thereby implementing the purposes of divulgence and tracing to the source and copyright protection.

In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by using a program to instruct the relevant hardware to implement, and the corresponding program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Claims

1. A method for deep learning-based audio watermark embedding and extraction is characterized in that the method comprises the following steps:

2. The method for embedding and extracting audio watermark based on deep learning of claim 1, wherein in step 1, the method uses

To express a length of

The mono original carrier audio; first by differentiable discrete wavelet transform

Audio of original carrier

And detail coefficient

Namely:

wherein the approximation coefficient

And detail coefficient

Is the original carrier audio

Is one half, i.e.

；

Watermarking information

Embedding into original carrier audio

In low frequencies, i.e. using approximation coefficients

As carriers of watermark information while preserving detail coefficients

For subsequent audio reconstruction;

the encoder is used for encoding watermark information

Is embedded into

In particular: the encoder generates a residual and marks it further to

Thereby generating approximate coefficients containing the watermark

Namely:

wherein

Is an intensity factor, set to 1 by default; en () denotes encoder processing.

3. The method for deep learning based audio watermark embedding extraction as claimed in claim 2, wherein, in step 1,

in order to approximate the coefficients with the watermark

As far as possible identical to the original

Keeping the same introduces a fundamental loss in the encoder training

By mean square error

As

Namely:

wherein i represents an index number;

represents the ith approximation coefficient;

representing the ith watermarked approximation coefficient;

to minimize

And

the field gap between them, an additional discriminator is introduced for forming a countertraining with the encoder, to counter the loss

And

thereby minimizing

And

the domain gap between, namely:

；

where D (.) represents discriminator processing.

4. The method for deep learning based audio watermark embedding extraction as claimed in claim 3, wherein in step 2, the inserted distortion layer is a difference audio re-recording operation DAR comprising ambient reverberation, band-pass filtering and Gaussian noise;

Therefore, the approximate coefficients of the watermark will be contained using the inverse DWT, i.e. IDWT

And the corresponding detailsNumber of

Transforming back watermarked audio

Namely:

wherein the ambient reverberation is specifically:

the impulse response reproduces reverberation in the environment by convolution, collecting different base impulse responses from different microphones, room environments and loudspeakers to form a set

Given a target audio

From the set

In which a basic impulse response is randomly selected

And by aggregation

For target audio

Performing convolution

Operate to simulate the ambient reverberation ER (), namely:

the band-pass filtering specifically comprises:

for audio containing watermarks, in order to simulate distortions caused by the inherent characteristics of the loudspeaker and microphone

Using band-pass filtering

Operation, given target Audio

Is performed as follows

：

wherein

And

respectively representing low-pass filtering and high-pass filtering;

and

to represent

And

a corresponding threshold value;

the gaussian noise is specifically:

simulating random noise caused by uncertain factors in the copying process by introducing Gaussian noise, specifically by directly superposing Gaussian noiseNoise(s)

At the target audio

To implement additive Gaussian noise operation

Namely:

wherein ,

representing gaussian noise;

represents a mean of 0 and a variance of

A gaussian distribution of (a).

5. The method for deep learning based audio watermark embedding and extraction according to claim 4, wherein in step 3, the processing procedure of the distortion layer DAR is as follows:

for audio with watermark

Namely:

obtaining the same by discrete wavelet transform DWT

Corresponding approximation coefficient

And detail coefficient

And combining the approximation coefficients

Input to a decoder De, the watermark being extracted by the decoder De

Namely:

where De (.) denotes decoder processing.

6. The method for deep learning based audio watermark embedding and extraction according to claim 5, wherein in step 3, watermark loss is further introduced

I.e. watermark information

And watermark extracted by decoder

Mean square error loss MSE between, i.e.:

when using binary watermarking

Rather than to

In this case, for watermarked audio

Watermarks extracted by decoders

Should be close to 0.