WO2011160966A1 - Audio watermarking - Google Patents

Audio watermarking

Info

Publication number
WO2011160966A1
Authority
WO
WIPO (PCT)
Prior art keywords
time frame
magnitude
audio signal
frequency
signal
Prior art date
Application number
PCT/EP2011/059688
Other languages
French (fr)
Inventor
Jian Wang
Ron Healy
Joseph Timoney
Original Assignee
National University Of Ireland, Maynooth
Priority date
Filing date
Publication date
Application filed by National University Of Ireland, Maynooth filed Critical National University Of Ireland, Maynooth
Publication of WO2011160966A1 publication Critical patent/WO2011160966A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018Audio watermarking, i.e. embedding inaudible data in the audio signal


Abstract

A method of providing a digital watermark in an audio signal comprises selecting a key frequency value determining how watermark information is to be embedded into a first time frame of the audio signal. A plurality of discrete frequency component values of the audio signal is provided for the first time frame. At least two frequency components for the time frame are selected as a function of the key frequency value. The two frequency components are tested to determine if they meet a given mutual criterion for the signal in the first time frame. If the components do not meet the criterion, the magnitude of at least one of the two frequency components is adjusted in the first time frame.

Description

Audio Watermarking
Field of the Invention
The present invention relates to steganography for digital audio files.
Background
Steganography comprises concealing a message, image, or file within another message, image, or file. Digital watermarking of audio and/or video is a form of steganography, in that audio or video can be used to 'hide' the presence of other information. In recent years, digital watermarking of audio/video files has been considered in an attempt to protect, track, identify or authenticate media such as photographs, music and/or movies. Moulin, P., & Koetter, R., "Data-Hiding Codes", Proc. Of the IEEE, Vol. 93, No. 12, Dec. 2005 provides an overview of techniques for hiding data in cover signals.
It is an object of the present invention to provide an improved method and apparatus for adding watermark information to an audio file in a relatively processor-efficient manner and without unduly affecting the quality of the audio signal.
Summary of the Invention
According to the present invention there is provided a method of providing a digital watermark in an audio signal according to claim 1.
According to a second aspect there is provided a method of extracting a digital watermark from an audio signal according to claim 13.
In further aspects there are provided a corresponding encoder, a decoder, a computer program product and an audio signal watermarked according to the invention.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a flow diagram for adding watermark information to an audio file in accordance with an embodiment of the invention;
Figure 2 is a flow diagram illustrating the operation of CSPE employed in embodiments of the present invention;
Figure 3 is an exemplary output after CSPE transformation of an audio signal;
Figure 4 is a flow diagram illustrating the extraction of watermark information from an audio signal;
Figure 5 is a flow diagram for adding watermark information to an audio file in accordance with a second embodiment of the invention;
Figure 6 shows FFT components illustrating a Type I click introduced into an audio signal during processing according to the second embodiment; and
Figure 7 shows FFT components illustrating a Type II click introduced into an audio signal during processing according to the second embodiment.
Description of the Preferred Embodiments
Referring now to Figure 1, a key value is first chosen as the basis for determining which components within an audio signal are to be chosen to hide a watermark message within the audio signal, step 10. In some cases, the key value can simply be a frequency value. However, as this value is needed in order to extract the watermark message from a signal, in alternative implementations, the key value(s) can be used as a private key which is mapped to a frequency and ultimately to the chosen frequency components. As such, the key value can be used to add security as required according to the environment where the hidden message is to be used. Also, as an indication of the windowing scheme used to embed information in the signal will be required to decode the watermark information, this can also be used as a private key, if required.
In implementations of the invention, the key value is mapped to identify frequency components from a window within the signal onto which watermark information is to be written. This mapping may depend on various factors, such as the type or content of audio used as host/cover for the message. For example, human speech generally includes lower frequency components - and fewer of them - than a modern Rock or Pop song, so hiding data in a recording of speech would naturally limit the choice of frequency components. However, even in an audio file with such a limited range, there could still be thousands of components to choose from.
In the embodiments described below, a relatively simple scheme is described where the key value is mapped to a given frequency value, e.g. approximately 1 kHz, and the adjacent distinct frequency components identified within the signal above and below this frequency value are chosen for receiving watermark information.
So for a given time frame within the signal, distinct frequency components may occur at 990Hz and 1015Hz, whereas for another frame in the signal, the components may occur at 950Hz and 1020Hz. Various criteria may be set for determining the distinctiveness of the components, e.g. they may need to comprise more than a given threshold percentage of the overall energy of the signal for the time window, or they may need to contrast by more than a given amount with the surrounding signal components, or they may simply need to have an amplitude above a given threshold.
It should also be appreciated that frequency components immediately adjacent the frequency derived from the key value need not be chosen. In other implementations, the second or third next adjacent frequency components could be chosen, or indeed this might change from frame to frame according to a keying scheme.
In any case, the signal S(t) intended as the cover or host audio is segmented into frames or windows of uniform length, step 12, for example, 20 ms, and, in the preferred embodiments, the frame is analyzed using Complex Spectral Phase Estimation (CSPE) to identify the presence, magnitude and phase of its frequency components, step 14. CSPE is disclosed more fully in K. M. Short and R. A. Garcia, 'Signal Analysis using the Complex Spectral Phase Evolution (CSPE) Method', Audio Engineering Society 120th Convention, May 2006, Paris, France and provides one computationally efficient method of accurately estimating the frequency and phase of components that exist in an audio signal.
Douglas Nelson, "Cross Spectral Methods for Processing Speech", Journal of the Acoustic Society of America, vol . 1 10, No.5, pt.1 , Nov.2001 , pp.2575-2592 discloses a related cross-spectrogram technique.
Referring now to Figure 2, in CSPE, an FFT analysis is performed twice: firstly on a frame S0 of the signal of interest and the second time upon a frame S1 for the same signal but shifted in time by one sample. Then, by multiplying the sample-shifted FFT spectrum with the complex conjugate of the initial FFT spectrum, step 20, a frequency dependent function is formed from which the magnitude and phase angle of the frequency components it contains can be detected. Referring to Figure 3, for a 1 second window of a signal containing components with frequency values (in Hz) of 17, 293.5, 313.9, 204.6, 153.7, 378 and 423 and for a sampling frequency of 1024 Hz, CSPE produces a graph with a staircase-like appearance where the flat parts of the graph indicated by the arrows are the frequencies of the components.
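The cross-spectral step can be sketched in a few lines of numpy. The function below is an illustrative reading of the CSPE analysis described above, not the patented implementation: it forms the product of the one-sample-shifted FFT with the conjugate of the initial FFT and converts the resulting per-bin phase angle into a frequency estimate in Hz, which is what produces the flat 'staircase' regions of Figure 3 near strong components. The signal in the usage example is assumed for illustration and only loosely matches the components listed for Figure 3.

```python
import numpy as np

def cspe_frequencies(frame0, frame1, fs):
    """Illustrative sketch of CSPE frequency estimation (steps 14/20).

    frame0 is an N-sample window of the host signal and frame1 is the
    same window shifted in time by one sample. Returns a per-bin refined
    frequency estimate in Hz; near a strong component the estimate is
    roughly constant over adjacent bins, giving the staircase of Figure 3.
    """
    F0 = np.fft.rfft(frame0)                 # initial FFT spectrum
    F1 = np.fft.rfft(frame1)                 # sample-shifted FFT spectrum
    cross = F1 * np.conj(F0)                 # shifted spectrum x conjugate of initial spectrum
    return fs * np.angle(cross) / (2 * np.pi)

# Example roughly in the spirit of Figure 3: a 1 s window at a 1024 Hz sampling rate.
fs = 1024
t = np.arange(fs + 1) / fs
x = np.cos(2 * np.pi * 293.5 * t) + 0.7 * np.cos(2 * np.pi * 423.0 * t)
estimates = cspe_frequencies(x[:fs], x[1:fs + 1], fs)
```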
Referring back to Figure 1, the key value, if not already, is transformed to a frequency value and this is used to identify a frequency component in the CSPE frequency domain for the chosen window and calculate its magnitude within the CSPE frequency domain, step 16. The magnitude of this first identified component can then be modified by comparison with a value for a second component from within the signal window, in order to represent a single bit '1' or '0' from a watermark message. Alternatively, the value of both the first and the second component can be modified, or just the second component can be modified by comparison with the first identified component to represent the watermark information.
In the first embodiment, identified component(s) are dynamically chosen dependent on the key value and the signal in the window under analysis. In the embodiment, the components chosen for modification are the nearest distinct components beyond a calculated threshold (say 10 Hz) above (CompB) and below (CompA) the frequency derived from the key value.
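As an illustration of this selection step, the small helper below is one assumed way it might be coded: from a list of already-identified distinct components, the nearest component more than the separation threshold below the key frequency becomes CompA and the nearest above it becomes CompB. The function name, tuple layout and the 10 Hz default are illustrative assumptions taken from the example threshold mentioned above.

```python
def pick_components(components, key_freq, min_sep=10.0):
    """Illustrative selection of CompA/CompB around the key frequency.

    components: list of (frequency_hz, amplitude, phase) tuples already
    judged 'distinct' by whatever criteria are in use. Returns the nearest
    component more than min_sep Hz below the key frequency (CompA) and the
    nearest more than min_sep Hz above it (CompB), or None where absent.
    """
    below = [c for c in components if c[0] < key_freq - min_sep]
    above = [c for c in components if c[0] > key_freq + min_sep]
    comp_a = max(below, key=lambda c: c[0]) if below else None
    comp_b = min(above, key=lambda c: c[0]) if above else None
    return comp_a, comp_b
```

For instance, with distinct components at 950 Hz and 1020 Hz and a 1 kHz key frequency, this returns the 950 Hz component as CompA and the 1020 Hz component as CompB.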
When modifying the amplitude of an identified frequency component, care must be taken to ensure that a user perceptible artefact is not introduced into the signal and that the modification does not have a negative impact on the timbre of the original signal.
The first embodiment is based on a set of rules that lead to the modification of only one of the identified components (CompA or CompB) in approximately half the frames. This is achieved with the rule:
If watermark bit = 1, let Amp(CompA) > Amp(CompB) + margin
If watermark bit = 0, let Amp(CompB) > Amp(CompA) + margin
where Amp is the amplitude of the identified component.
Thus, the magnitude of both components (compA and compB) in any given frame is compared before deciding if any modification would be required in order to satisfy these criteria, depending on the watermark bit to be embedded and the magnitudes of the two components in that particular frame. If they are already in the correct relationship relative to the watermark bit selected at step 18, no modification is required at step 22. If, however, they are not in the correct relationship, at least one of them should be modified at step 22.
Let us assume that the magnitude of CompA is lower than that of CompB, in a frame in which it needs to be of a higher magnitude to represent a '1' bit. In order to increase the magnitude of a particular frequency component in the particular window of the cover signal S(t), a component is added at a defined magnitude and matched to the phase of the component it is being combined with, as follows:
S(t) = S(t) + (rAmp - lAmp + threshold)cos(2π(f_CompA)t + lp)
where rAmp and lAmp are the amplitudes of CompB and CompA respectively;
f_CompA is the frequency of CompA; and
lp is the phase of CompA. Alternatively, and possibly more desirably, to reduce the magnitude of a component of S(t) so that it satisfies the requirements for embedding a '1' bit in a window of the signal, the magnitude of CompB is reduced, by adding in a component that is 180° out of phase with the original component in the signal, as follows:
S(t) = S(t) + (rAmp - lAmp + threshold)cos(2π(f_CompB)t + rp + π)
where f_CompB and rp are the frequency and phase of CompB respectively.
In each of the above cases, the threshold value is set to provide a 25% difference in magnitude between CompA and CompB; however, this value can be varied dynamically from window to window or signal to signal as required.
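A compact sketch of this embedding step, under the same notation, is given below. The frame, sampling rate and the CSPE-estimated (amplitude, frequency, phase) triples for CompA and CompB are assumed to be available; taking the margin as 25% of the larger amplitude is an assumed reading of the "25% difference" above, not a statement of the patented values.

```python
import numpy as np

def embed_bit(frame, fs, bit, comp_a, comp_b, margin=None):
    """Illustrative first-embodiment embedding (steps 18/22).

    comp_a, comp_b: (amplitude, frequency_hz, phase) for CompA and CompB.
    If the rule for the bit is already satisfied the frame is returned
    unchanged; otherwise the 'louder' component is reduced by adding a
    180-degree out-of-phase partial, as in the equations above.
    """
    l_amp, f_a, lp = comp_a
    r_amp, f_b, rp = comp_b
    if margin is None:
        margin = 0.25 * max(l_amp, r_amp)   # assumed form of the 25% threshold
    t = np.arange(len(frame)) / fs
    if bit == 1 and not (l_amp > r_amp + margin):
        # reduce CompB so that Amp(CompA) exceeds Amp(CompB) by the margin
        frame = frame + (r_amp - l_amp + margin) * np.cos(2 * np.pi * f_b * t + rp + np.pi)
    elif bit == 0 and not (r_amp > l_amp + margin):
        # reduce CompA so that Amp(CompB) exceeds Amp(CompA) by the margin
        frame = frame + (l_amp - r_amp + margin) * np.cos(2 * np.pi * f_a * t + lp + np.pi)
    return frame
```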
It is possible that more than 1 bit would be written within a given window of a signal and if it is determined at step 24, that more information is to be embedded, the process loops around to choose the next pair of components as determined by the key value and the applicable keying algorithm.
Once watermarking for a given window is complete, and if more signal remains, step 26, the process continues to the next window to be processed.
Referring now to Figure 4, in order to decode an embedded watermarked message from an audio signal, a decoder must be provided with the key value, an indication of the windowing used as the basis for embedding the watermark message as well as the rules that define a '1' bit and a '0' bit within the audio signal. The candidate audio signal is then segmented into frames using the same windowing as was used for embedding, step 40. In the embodiment, the system uses CSPE to calculate the magnitudes of the frequency components for the window. The two components above and below the frequency determined by the key value are then identified, step 42. These two components then have their magnitude compared and a '1' or a '0' bit is determined according to the rules used in their embedding, step 44. From this comparison, the watermarked bit sequence can be recreated from the sequence of windows in the signal.
In some cases, a relatively short watermark message can be repeatedly written into successive windows of the signal and in decoding, each impression of the message can be correlated with the others to correct for any errors in decoding any portion of the message from the signal.
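In code, the per-frame decision of step 44 and this redundancy idea can be sketched as follows. A simple majority vote per message bit is used here as one assumed form of the correlation across repeated impressions; the function names are illustrative.

```python
def decode_bit(comp_a_amp, comp_b_amp):
    """Step 44 under the first-embodiment rule: CompA louder means '1'."""
    return 1 if comp_a_amp > comp_b_amp else 0

def recover_message(frame_bits, msg_len):
    """Rebuild a short, repeatedly embedded message by majority vote
    across its repetitions (an assumed form of the correlation step)."""
    votes = [[0, 0] for _ in range(msg_len)]
    for i, b in enumerate(frame_bits):
        votes[i % msg_len][b] += 1
    return [0 if zeros >= ones else 1 for zeros, ones in votes]
```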
Referring now to Figure 5, in a second embodiment of the invention, the bin index values for the Fourier components corresponding to CSPE-derived frequency components identified through using the key value are used as criteria for determining whether or not the signal in a window needs to be adjusted to accommodate watermark information. Thus, 1st and 2nd components are identified from a key value as in the first embodiment and their index value k is calculated as k = f*N/Fs, where f is the identified component's frequency, Fs is the sampling frequency used to produce the Fourier transform (Figure 2) and N is the transform window length.
In one scheme based on this principle, if the watermark bit is 0, the bin index values (k) of both identified CSPE frequency components should be either both odd or both even. If the embed bit is 1, then one bin location should be odd and one even.
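The parity criterion can be written directly from the k = f*N/Fs definition above. Rounding to the nearest integer bin is an assumption, since the text does not state how a non-integer index is mapped to a bin.

```python
def parity_satisfied(f1, f2, n, fs, bit):
    """Second-embodiment criterion: for a 0 bit the two bin indices must
    share parity (both odd or both even); for a 1 bit they must differ."""
    k1 = round(f1 * n / fs)
    k2 = round(f2 * n / fs)
    same_parity = (k1 % 2) == (k2 % 2)
    return same_parity if bit == 0 else not same_parity
```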
If the bin index values for the identified frequency components selected at step 56 do not satisfy the criteria for the watermark message bit to be embedded in the window at that point, then the magnitude of one or other of the frequency components is reduced, step 58. In one implementation, the first reduction is 25% of the component's magnitude. CSPE is then run again on the adjusted signal for the window, and possibly new 1st and 2nd frequency components are selected. If the bin index values for these components satisfy the embedding criteria, the process proceeds, and if not, the magnitude of one or other of these components is again reduced, step 58, before repeating the process. It has been found that this loop has not had to be repeated more than 3 times before the criteria for embedding a bit have been met.
It is appreciated that this loop adds a processing overhead to the embedding phase but, since embedding is a one-off process and not time-critical, it is a satisfactory compromise for improved accuracy.
In a steganographic audio watermarking system, audible artefacts are unacceptable as they allow listeners to deduce that there might be a watermark present, or simply affect the quality of the recording. One possible result of the watermarking schemes described above could be unexpected audible artefacts comprising 'pops' or 'clicks'.
In embodiments of the present invention, the signal, whether modified according to the first or second embodiment, is analyzed for two types of artefact, Type I and Type II, as follows:
If a modified frequency component in the adjusted signal has a magnitude greater than 10 times the original component's magnitude, it is identified as a Type I click.
If the Fourier transformed spectrum of the watermarked signal has bins with magnitudes that are different from the corresponding bins in the original spectrum, peaks are picked from those differing bins in each spectrum. If a peak exists in the spectrum of the watermarked signal with a magnitude greater than 3 times the magnitude of the original corresponding peak, and if this peak is also greater than the magnitude of the neighbouring peaks, then this is identified as a Type II click.
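These two tests translate fairly directly into code. The sketch below assumes the magnitude spectra of the original and watermarked frames are available and uses a simple local-maximum comparison for the Type II peak test; the function names and the tolerance used to find changed bins are illustrative assumptions.

```python
import numpy as np

def is_type1_click(adjusted_mag, original_mag):
    """Type I: the modified component's magnitude exceeds 10x its original value."""
    return adjusted_mag > 10 * original_mag

def has_type2_click(wm_mag, orig_mag):
    """Type II: among bins whose magnitude changed, a watermarked-spectrum peak
    more than 3x its original counterpart and larger than its neighbours."""
    wm = np.asarray(wm_mag, dtype=float)
    orig = np.asarray(orig_mag, dtype=float)
    changed = np.nonzero(~np.isclose(wm, orig))[0]   # bins that differ between spectra
    for k in changed:
        if 0 < k < len(wm) - 1:
            if wm[k] > 3 * orig[k] and wm[k] > wm[k - 1] and wm[k] > wm[k + 1]:
                return True
    return False
```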
Figure 6 shows a Type I click where a selected component's magnitude in a finally adjusted signal satisfying the criteria for embedding is apparently much larger than its value in the original signal and noticeably larger than neighboring components' magnitudes. On the other hand, the spectrum shown in Figure 7 shows a Type II click.
It is thought that these clicks result from CSPE identifying a 'ghost frequency component' not actually present in the original signal. When the magnitude of such a ghost component is adjusted when embedding watermark information, a real component can then be added into the signal.
Type II clicks occur only relatively rarely by comparison to Type I. Since they occur so occasionally, in embodiments of the present invention the solution to this artefact is to return the adjusted component to its original state, step 60. This of course introduces an inaccuracy in the embedded information; however, as the event is so rare, it can be compensated for by building redundancy into the watermark message, either by repeatedly embedding a short watermark message into the signal, allowing a true version to be built up by a decoder, or simply by using CRC codes or equivalent within the watermark message information.
For Type I clicks, the magnitude of the selected component is reduced again, step 58 and the process is repeated. The solid line in Figure 6 represents the original signal, the dash-dot line represents the signal after it has been modified once (denoted as 'intermediate signal'), while the dashed line represents the final signal, in which the selected bins satisfy the condition for embedding.
In the embodiments described above, two components are selected in encoding step 16 and decoding step 42 and their mutual values are determined to enable the embedding/extraction of watermark information in/from an audio signal. However, it should be appreciated that the invention could equally be implemented by selecting more than two components and using their mutual values to determine how to embed information in the audio signal.
It will be appreciated that the watermark information embedded and extracted in/from the audio signal can be used for any number of applications. For example, by embedding an ISRC code for a song in a recording of the song, its broadcast on radio stations can be detected by listening for such watermark information, including, but not limited to, where such stations broadcast through the Internet. This in turn can be used to assist musicians to properly recover royalties for the broadcast of their works from the responsible agencies around the world.
In addition or alternatively, public key information for stakeholders in a piece of audio material, including the author, performer, publisher, distributor, retailer etc., can be included in the audio material as a way of verifying/authenticating the audio material and especially for the purposes of combating illegal distribution of a piece of music. Thus, it will be seen that an audio signal watermarked according to the present invention may simultaneously include many threads of information which can be used for different applications relating to that audio material. The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.

Claims

Claims:
1. A method of providing a digital watermark in an audio signal comprising: a) selecting a key frequency value determining how watermark information is to be embedded into a first time frame of said audio signal;
b) providing a plurality of discrete frequency component values of said audio signal for said first time frame;
c) selecting at least two frequency components for said time frame as a function of at least said key frequency value;
d) determining if said at least two frequency components meet a given mutual criterion for said signal in said first time frame; and
e) responsive to said components not meeting said criterion, adjusting the magnitude of at least one of said at least two frequency components in said first time frame.
2. A method according to claim 1 wherein step b) further comprises:
b1 ) providing a first Fourier transform for said first time frame within said audio signal, said transform including a number of indexed bins representing the Fourier components of said first time frame;
b2) providing a second Fourier transform for a second time frame shifted in time relative to said first time frame and overlapping in time with said first time frame; and
b3) convolving said first and second transform components to provide said plurality of frequency component values for said first time frame.
3. A method according to claim 1 comprising performing CSPE analysis on said first time frame to provide said frequency component values.
4. A method according to claim 1 wherein step d) further comprises:
d1) determining the respective bin index values of said selected frequency components; and
d2) testing said bin index values for meeting a given criterion for said signal in said first time frame, and wherein step e) comprises reducing the magnitude of one of said selected frequency components.
5. A method according to claim 4 further comprising the step of:
f) repeating steps b) to e) with said adjusted audio signal until said audio signal does not need to be adjusted to accommodate digital watermark information.
6. A method according to claim 4 in which step d2) comprises testing each of the bin index values for being even valued and, responsive to said test values, reducing the magnitude of one of said selected frequency components to accommodate a bit of a given value in said time frame.
7. A method according to claim 6 wherein said test values are tested for having mutually different values.
8. A method according to claim 1 wherein said adjusting comprises reducing the magnitude of one of said selected frequency components and further comprising:
g) analysing the frequency component values for said adjusted signal; and
h) responsive to said adjusted frequency component's magnitude exceeding its magnitude prior to said adjustment by a first threshold, further reducing the magnitude of said adjusted frequency component in said first time frame.
9. A method according to claim 8 wherein said first threshold is an order of magnitude.
10. A method according to claim 1 wherein said adjusting comprises reducing the magnitude of one of said selected frequency components and further comprising:
g) analysing frequency component values for said adjusted signal for said first time frame;
h) identifying any frequency component having a magnitude exceeding its magnitude prior to said adjustment by a second threshold; and
i) responsive to any identified component's magnitude being greater than the magnitudes for respective adjacent frequency components of said identified component, restoring the original magnitude of said adjusted frequency component in said first time frame of said audio signal.
11. A method according to claim 10 wherein said second threshold is 3 times the original value.
12. A method according to claim 1 comprising: providing watermark information to be included in an audio signal; and adjusting successive time frames of said audio signal according to claim 1 to accommodate one or more bits of said watermark information in said successive time frames.
13. A method of extracting a digital watermark from an audio signal comprising: a) selecting a key frequency value determining how watermark information was embedded into a first time frame of said audio signal;
b) providing a plurality of discrete frequency component values of said audio signal for said first time frame;
c) selecting at least two frequency components for said time frame as a function of at least said key frequency value; and
d) determining if said at least two frequency components meet a given mutual criterion for said signal in said first time frame, to determine a value for said digital watermark in said first time frame.
14. A computer program product comprising computer readable code stored on a computer readable medium which when executed on a computing device is arranged to process an audio signal according to any one of claims 1 to 13.
15. An audio signal including a digital watermark embedded according to any one of claims 1 to 12.
PCT/EP2011/059688 2010-06-21 2011-06-10 Audio watermarking WO2011160966A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US35692410P 2010-06-21 2010-06-21
US61/356,924 2010-06-21

Publications (1)

Publication Number Publication Date
WO2011160966A1 (en) 2011-12-29

Family

ID=44350600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/059688 WO2011160966A1 (en) 2010-06-21 2011-06-10 Audio watermarking

Country Status (1)

Country Link
WO (1) WO2011160966A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462765A (en) * 2020-04-02 2020-07-28 宁波大学 One-dimensional convolution kernel-based adaptive audio complexity characterization method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DOUGLAS NELSON: "Cross Spectral Methods for Processing Speech", JOURNAL OF THE ACOUSTIC SOCIETY OF AMERICA, vol. 110, no. 5, November 2001 (2001-11-01), pages 2575 - 2592, XP012002610, DOI: doi:10.1121/1.1402616
JIAN WANG ET AL: "Digital Audio Watermarking by Magnitude Modification of Frequency Components Using the CSPE Algorithm", 18 August 2009 (2009-08-18), pages 1 - 7, XP055004535, Retrieved from the Internet <URL:http://eprints.nuim.ie/1635/2/HealyDAWCSPE.pdf> [retrieved on 20110810] *
JIAN WANG ET AL: "Perceptually Transparent Audio Watermarking of Real Audio Signals Based On The CSPE Algorithm", 8 June 2010 (2010-06-08), pages 1 - 6, XP055004541, Retrieved from the Internet <URL:http://eprints.nuim.ie/1970/1/Perceptually_Transparent_Audio_Watermarking_of_Real_Audio_Signals_Based_On_The_CSPE_Algorithm[1].pdf> [retrieved on 20110810] *
K. M. SHORT, R. A. GARCIA: "Signal Analysis using the Complex Spectral Phase Evolution (CSPE) Method", AUDIO ENGINEERING SOCIETY 120TH CONVENTION, May 2006 (2006-05-01)
MOULIN, P., KOETTER, R.: "Data-Hiding Codes", PROC. OF THE IEEE, vol. 93, no. 12, December 2005 (2005-12-01)

Similar Documents

Publication Publication Date Title
Li et al. Localized audio watermarking technique robust against time-scale modification
Kirovski et al. Spread-spectrum watermarking of audio signals
US6952774B1 (en) Audio watermarking with dual watermarks
Xiang et al. Histogram-based audio watermarking against time-scale modification and cropping attacks
US8116514B2 (en) Water mark embedding and extraction
US7206649B2 (en) Audio watermarking with dual watermarks
US6442283B1 (en) Multimedia data embedding
Kang et al. Geometric invariant audio watermarking based on an LCM feature
US6738744B2 (en) Watermark detection via cardinality-scaled correlation
Wang et al. Centroid-based semi-fragile audio watermarking in hybrid domain
Dhar et al. A new audio watermarking system using discrete fourier transform for copyright protection
Xiang Audio watermarking robust against D/A and A/D conversions
Dhar et al. A new DCT-based watermarking method for copyright protection of digital audio
Dhar et al. Digital watermarking scheme based on fast Fourier transformation for audio copyright protection
Nikmehr et al. A new approach to audio watermarking using discrete wavelet and cosine transforms
Kirovski et al. Spread-spectrum audio watermarking: requirements, applications, and limitations
Hu et al. Frame-synchronized blind speech watermarking via improved adaptive mean modulation and perceptual-based additive modulation in DWT domain
Kirovski et al. Audio watermark robustness to desynchronization via beat detection
Li et al. An audio watermarking technique that is robust against random cropping
Bibhu et al. Secret key watermarking in WAV audio file in perceptual domain
Megías et al. A robust audio watermarking scheme based on MPEG 1 layer 3 compression
Huang et al. A reversible acoustic steganography for integrity verification
Cichowski et al. Analysis of impact of audio modifications on the robustness of watermark for non-blind architecture
WO2011160966A1 (en) Audio watermarking
Lin et al. Audio watermarking techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11729940

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11729940

Country of ref document: EP

Kind code of ref document: A1