WO2015012680A2 - A method for speech watermarking in speaker verification - Google Patents


Info

Publication number: WO2015012680A2
Authority: WIPO (PCT)
Prior art keywords: speaker, speech, speech signal, watermarking, voice
Application number: PCT/MY2014/000138
Other languages: French (fr)
Other versions: WO2015012680A3 (en)
Inventors: Syed Abdul Rahman AL-HADDAD SYED MOHAMED, M. Iqbal Saripan, Shyamala C. DORAISAMY, Abd. Rahman RAMLI, Mohammad Ali NEMATOLLAHI
Original Assignee: Universiti Putra Malaysia
Priority claimed from MYPI2013701280A (MY180944)
Application filed by Universiti Putra Malaysia

Classifications

    • G10L 19/018 — Audio watermarking, i.e. embedding inaudible data in the audio signal
    • G10L 17/20 — Speaker identification or verification: pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 25/78 — Detection of presence or absence of voice signals



Abstract

The present invention relates to a method for speech watermarking in speaker verification, comprising the steps of: embedding watermark data into speech signal at a transmitter; and extracting watermark data from the speech signal at a receiver; characterised by the steps of: selecting frames having least speaker-specific information from the speech signal to carry watermark data; detecting voice activity to detect presence or absence of speaker's voice in the speech signal; and embedding watermark data into the selected frames of the speech signal according to the presence or absence of the speaker's voice.

Description

A METHOD FOR SPEECH WATERMARKING IN SPEAKER VERIFICATION
Background of the Invention
Field of the Invention
This invention relates to a method for speech watermarking to provide a secure communication system in speaker verification, and more particularly to a method for speech watermarking by taking into account speaker-specific information and characteristics of speech features.
Description of Related Arts
Speaker verification is a process of verifying a speaker's identity from a speech signal to provide secure access in a communication system, particularly in a distance communication system involving critical subject matter such as telephone banking and air traffic control. In order to establish a secure communication system, a speaker verification process must be employed before any further action is taken. Conventional speaker verification techniques are exposed to two vulnerable points: firstly, the speech could be manipulated while it is recorded, before being transmitted; and secondly, while the speech signal passes through the communication channel.
There are various techniques for performing speaker verification, and one of the most well-known is speech watermarking. Speech watermarking improves the security of conventional speaker verification by embedding a watermark inside the speech signal at the transmitter side and extracting it at the receiver side. Apart from the security issues, selecting proper features is another concern for conventional speaker verification because of their discriminant ability, reliability and robustness. Speaker recognition based on speech features has several common problems, such as long-term effects due to physiological changes, the emotional state of the speaker, illness, time of day, fatigue, and auditory accommodation. This is because the speaker-specific features have a different concentration in each frame of the speech signal. Other problems of feature-based speaker verification are the time and cost of training, the amount of data required for training, the level of security to be achieved, and whether to develop a text-dependent or text-independent system. Furthermore, noise in the speech signal is a major contributor to the mismatch between the training and testing phases, which can degrade speaker verification performance. Many researchers have tried to combat the effects of undesired features, as well as developing speaker modelling techniques, to improve accuracy.
One of the prior arts, US Patent 6,892,175 B1, discloses a method for encoding a watermark in a digital message such as a speech signal. The cited patent generates a spread spectrum signal, wherein the spread spectrum signal is representative of the digital information, and embeds the spread spectrum signal in the speech signal. A drawback of the cited patent is that the spread spectrum signal of the watermark is embedded in all frames of the speech signal. As a speech signal has less bandwidth than an audio signal, it can carry fewer watermark bits, which leads to lower watermark capacity. Furthermore, implementing speech watermarking in all frames of the speech signal may degrade the accuracy of the speaker verification while consuming more time.
The paper by Marcos Faundez-Zanuy et al. discloses a speech watermarking method which combines the spread spectrum approach with simplified frequency masking. However, that paper also does not consider the speaker-specific features when embedding the watermark data. Therefore, many challenges and opportunities in the robustness, accuracy and efficiency of speech watermarking methods are yet to be explored, particularly in distance speaker verification. Accordingly, it can be seen from the prior arts that there exists a need for a speech watermarking method that is more secure while efficiently considering the speaker-specific features of the speech signal in the speaker verification process. The speech watermarking method should be robust under unintentional attacks (e.g. background noise, compression, amplitude scaling) and fragile under intentional attacks (e.g. copying, cutting or removing). The speech watermarking method must also provide enough capacity to transmit verification data through the speech signal. There is also a trade-off between capacity, inaudibility and robustness that should be considered when designing a speech watermarking method.
References
• Marcos Faundez-Zanuy et al., Pattern Recognition, Elsevier, vol. 40, pp. 3027-3034, February 2007.
• Faundez-Zanuy, Marcos, Jose J. Lucena-Molina, and Martin Hagmüller. "Speech Watermarking: An Approach for the Forensic Analysis of Digital Telephonic Recordings." Journal of Forensic Sciences 55.4 (2010): 1080-1087.
Summary of Invention
It is an objective of the present invention to provide a robust, efficient and accurate speech watermarking method for speaker verification.
It is also an objective of the present invention to provide a speech watermarking method that uses the frames having the least speaker-specific features.
It is yet another objective of the present invention to provide a speech watermarking method that selects the frames with the least speaker-specific features to carry the watermark data.
It is a further objective of the present invention to provide an efficient speech watermarking method for a genuine distance speaker verification technique.
Accordingly, these objectives may be achieved by following the teachings of the present invention. The present invention relates to a method for speech watermarking in speaker verification, comprising the steps of: embedding watermark data into speech signal at a transmitter; and extracting watermark data from the speech signal at a receiver; characterised by the steps of: selecting frames having least speaker-specific information from the speech signal to carry watermark data; detecting voice activity to detect presence or absence of speaker's voice in the speech signal; and embedding watermark data into the selected frames of the speech signal according to the presence or absence of the speaker's voice.
Brief Description of the Drawings
The features of the invention will be more readily understood and appreciated from the following detailed description when read in conjunction with the accompanying drawings of the preferred embodiment of the present invention, in which: Fig. 1 is a flow chart of a method for embedding speech watermarking in speaker verification of the present invention.
Fig. 2 is a schematic diagram of a method for speech watermarking in speaker verification of the present invention.
Fig. 3 is a flow chart of frame selection in the method of the speech watermarking in the present invention.
Fig. 4 shows a step of detecting voice activity for separating voice and non-voice frames.
Fig. 5 is a schematic diagram for a method of embedding the speech watermarking in speaker verification in the present invention.
Fig. 6 is a schematic diagram for a method of extracting speech watermarking in speaker verification in the present invention.
Detailed Description of the Invention
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting but merely as a basis for claims. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modification, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words "include," "including," and "includes" mean including, but not limited to. Further, the words "a" or "an" mean "at least one" and the word "plurality" means one or more, unless otherwise mentioned. Where the abbreviations or technical terms are used, these indicate the commonly accepted meanings as known in the technical field. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures. The present invention will now be described with reference to Figs. 1-6.
The present invention provides a method for speech watermarking in speaker verification, comprising the steps of:
embedding watermark data into speech signal at a transmitter; and extracting watermark data from the speech signal at a receiver; characterised by the steps of:
selecting frames having least speaker-specific information from the speech signal to carry watermark data;
detecting voice activity to detect presence or absence of speaker's voice in the speech signal; and
embedding watermark data into the selected frames of the speech signal according to the presence or absence of the speaker's voice.
Referring to Fig. 1, the method for speech watermarking in speaker verification of the present invention comprises embedding watermark data into the speech signal. The watermark embedding process is employed at the transmitter side, whereby only the watermarked speech signal is available at the receiver. The watermarked speech signal is then transmitted over a communication channel to the receiver, where it goes through the watermark extraction method shown in Fig. 2 before being further processed.
As shown in Fig. 3, the speech signal first undergoes frame selection to prioritize the frames of the speech signal that will carry the watermark data. This is because the speaker-specific information is not uniformly distributed across all frames of the speech signal. In a preferred embodiment of the method for speech watermarking in speaker verification, the speaker-specific information depends on system noise, fundamental frequencies, and the system features and source features of the speaker-specific information. In a preferred embodiment, the system features relate to the structure of the speaker's vocal folds, while the source features depend on the manner and vibration of the speaker's vocal cords.
Fig. 3 shows a preferred embodiment of the step of selecting frames of the speech signal. In a preferred embodiment, the fundamental frequency for the frame selection is estimated using Linear Predictive Coding (LPC). LPC is applied to each frame of the speech signal to calculate the predicted number of dominant frequencies and the prediction error, which is used for extracting the glottal closure instant (GCI) in the next step. In a preferred embodiment, the most speaker-discriminant frequencies are located in the low frequencies below 600 Hz and the high frequencies above 3500 Hz. Some frequencies are located in the mid-frequency region of 500 Hz to 3500 Hz, which is most important for phonetic speech verification. In the preferred embodiment, phonetic speaker verification shows that stops, fricatives, nasals, diphthongs and vowels carry important speaker-specific information, in ascending order. Said frequencies are then weighted for comparison between the frames of the speech signal. In the preferred embodiment of the present invention, higher-order spectral analysis (HOS), which is also used in applications such as speech enhancement, channel selection and blind source separation, is applied to each frame to detect the Gaussianity of the speech signal. In the preferred embodiment, the variance, skewness and kurtosis are used to select the noisiest frames in the speech signal. The noisiest frames are preferred because noise is known to be the main source of mismatch between the enrolment (training) and testing sets in speaker verification systems; in addition, noise does not carry much speaker-specific information. In a preferred embodiment of the method for speech watermarking in speaker verification, the frames with the least speaker-specific information are selected to carry the watermark data. Therefore, the embedded watermark cannot change the noisy frames severely, and the watermark will be imperceptible and inaudible.
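As a rough illustration of the higher-order-statistics frame ranking described above, the sketch below scores each frame by how Gaussian (noise-like) it looks using the sample skewness and excess kurtosis. The frame length, the scoring rule and all names are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def frame_noise_scores(signal, frame_len=160):
    """Rank frames by higher-order statistics (skewness, excess kurtosis).

    Frames that look most noise-like (closest to Gaussian) carry the
    least speaker-specific information and are preferred watermark
    carriers in the method's frame-selection step.
    """
    n_frames = len(signal) // frame_len
    scores = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        mu, sigma = frame.mean(), frame.std() + 1e-12
        z = (frame - mu) / sigma
        skew = np.mean(z ** 3)
        kurt = np.mean(z ** 4) - 3.0   # excess kurtosis; ~0 for Gaussian noise
        # Small |skewness| + |excess kurtosis| => close to Gaussian noise
        scores.append(abs(skew) + abs(kurt))
    return np.argsort(scores)          # most noise-like frames first

# Usage: five tonal (voiced-like) frames followed by five noise frames
rng = np.random.default_rng(0)
sig = np.concatenate([np.sin(np.arange(800) * 0.3),   # frames 0-4: tonal
                      rng.standard_normal(800)])      # frames 5-9: noise
order = frame_noise_scores(sig)
```

A pure sine frame has an excess kurtosis of about -1.5, so the noise frames (score near 0) are ranked ahead of the tonal ones.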
When the frames with the least speaker-specific features have been selected, voice activity detection is applied to the selected frames to detect the presence or absence of the speaker's voice in the speech signal. The step of detecting voice activity in the speech signal categorizes the selected frames into voice and non-voice frames. In a preferred embodiment, the Magnitude Sum Function (MSF), the pitch period and the Zero Crossing Rate (ZCR) are utilized to determine the voiced and non-voiced frames. Fig. 4 shows a preferred embodiment of the voice activity detection for separating voice and non-voice frames. The ZCR counts the number of times the speech signal crosses the x-axis. In a preferred embodiment, a non-voice frame has a higher ZCR than a voice frame due to its high-frequency character. On the other hand, the MSF reflects the energy of the speech signal; the preferred embodiment shows that a voice frame has more energy than a non-voice frame due to its lower frequency content. Fig. 4 also shows that the pitch period in a voice frame is higher than in a non-voice frame.
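The ZCR/MSF voice activity test above can be sketched as follows. The thresholds and helper names are illustrative assumptions, and the pitch-period test the patent also uses is omitted here for brevity.

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate: fraction of consecutive samples changing sign."""
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def msf(frame):
    """Magnitude Sum Function: sum of absolute sample values (energy proxy)."""
    return np.sum(np.abs(frame))

def is_voiced(frame, zcr_thresh=0.25, msf_thresh=None):
    """Voiced frames tend to have low ZCR and high MSF; non-voiced frames
    the opposite. Both thresholds are invented for illustration."""
    if msf_thresh is None:
        msf_thresh = 0.1 * len(frame)
    return zcr(frame) < zcr_thresh and msf(frame) > msf_thresh

# Usage: a 120 Hz tone at 8 kHz sampling looks voiced; faint white noise does not
t = np.arange(400)
voiced_like = np.sin(2 * np.pi * 120 * t / 8000)
unvoiced_like = 0.05 * np.random.default_rng(1).standard_normal(400)
```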
In a preferred embodiment of the method for speech watermarking in speaker verification, the step of embedding watermark data comprises modifying the probability distribution function of the Linear Predictive Coding (LPC) coefficients. However, it may be difficult to modify each LPC coefficient individually, because the LPCs vary during the embedding and extraction process even without a speech manipulation attack. In another preferred embodiment, constants may be applied to shape the probability density function of the LPCs in the method for embedding the speech watermarking. This is done by multiplying all LPCs by one constant and adding another constant to all LPCs. These constants change the variance and the mean of the LPCs. Therefore, all LPCs of the speech frames are embedded with the watermark to increase robustness, instead of embedding the watermark in just one LPC. Fig. 5 shows a preferred embodiment of a schematic diagram for a method of embedding the speech watermarking in speaker verification. In the preferred embodiment, the schematic diagram depicts how the probability density function is shaped by constants named alpha and beta. Fig. 6 shows a preferred embodiment of a schematic diagram for a method of extracting the speech watermarking in speaker verification. The preferred embodiment in Fig. 6 shows how the watermark may be detected by using the mean and the standard deviation.
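A minimal sketch of the alpha/beta pdf-shaping idea, assuming the LPC coefficients have already been computed elsewhere (e.g. by Levinson-Durbin analysis). The target means and standard deviation are invented for illustration, since the patent gives no values.

```python
import numpy as np

# Hypothetical target statistics for watermark bits 0 and 1; the method
# shapes the LPC pdf with two constants (alpha, beta) but specifies no values.
TARGET_MEAN = {0: -0.2, 1: 0.2}
TARGET_STD = 0.5

def embed_bit(lpc, bit):
    """Scale (alpha) and shift (beta) every LPC coefficient so that the
    coefficient set's mean and std match the targets for the given bit."""
    alpha = TARGET_STD / (lpc.std() + 1e-12)
    beta = TARGET_MEAN[bit] - alpha * lpc.mean()
    return alpha * lpc + beta

def extract_bit(lpc_wm):
    """Decide the bit from the mean of the shaped distribution."""
    midpoint = 0.5 * (TARGET_MEAN[0] + TARGET_MEAN[1])
    return 1 if lpc_wm.mean() > midpoint else 0

# Usage: a hypothetical 12th-order LPC vector round-trips both bit values
rng = np.random.default_rng(3)
lpc = rng.standard_normal(12)
```

Because every coefficient is shifted and scaled together, the watermark is spread over all LPCs of the frame, matching the robustness rationale above.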
In a preferred embodiment of the method for speech watermarking in speaker verification, the step of extracting watermark data from the speech signal comprises the steps of:
performing synchronization of a decoder to the speech signal;
detecting voice activity to detect presence or absence of speaker's voice in the speech signal; and
extracting watermark data from the speech signal according to the presence or absence of the speaker's voice.
In the preferred embodiment of the present invention as shown in Fig. 2, the step of extracting watermark data from the speech signal is performed on the receiver side of the communication system. When the receiver receives the watermarked speech signal, the watermarked frames must be distinguished from the non-watermarked frames. Therefore, synchronization is performed to align the received speech signals. The step of performing synchronization may also improve the timing and robustness between the transmitter and the receiver. Besides that, through the step of synchronization, other information such as metadata, parity, a cyclic redundancy check (CRC) and watermark information may also be sent from the transmitter to the receiver.
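One common way to realize such a synchronization step is to search for a known pilot sequence by cross-correlation. This is a hedged sketch of that general idea, not the patent's specific scheme; the pilot length and signal layout are assumptions.

```python
import numpy as np

def synchronize(received, sync_seq):
    """Locate a known synchronization sequence in the received signal by
    sliding cross-correlation; the peak index gives the alignment offset
    from which frame segmentation can start."""
    corr = np.correlate(received, sync_seq, mode="valid")
    return int(np.argmax(corr))

# Usage: a hypothetical +/-1 pilot of 64 samples buried between noise bursts
rng = np.random.default_rng(2)
pilot = rng.choice([-1.0, 1.0], size=64)
stream = np.concatenate([0.1 * rng.standard_normal(300), pilot,
                         0.1 * rng.standard_normal(300)])
offset = synchronize(stream, pilot)
```

The correlation peak (64 at perfect alignment) stands well above the noise-region sidelobes, so the offset is recovered exactly.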
In the step of performing synchronization, first, synchronization is performed for timing between the transmitter and the receiver. Second, based on the synchronization information, the watermarked speech signal is segmented into frames. Third, Voice Activity Detection (VAD) is applied to each frame to distinguish voice from non-voice speech. Fourth, based on the VAD decision, the type of watermarking method is determined and the LPCs are extracted from the frame. Finally, the watermark is detected based on the shape of the probability density function of the LPCs used in the method of embedding the speech watermarking. The method of the present invention, by considering speaker-specific information and embedding watermark data according to the speech characteristics of voice and non-voice frames, provides a solution to security issues in speaker verification as well as improving the efficacy and accuracy of the speaker verification. Therefore, the frames with the least speaker-specific information are selected to carry the watermark data, to preserve the performance of the speaker-specific features in the speaker verification. The method of the present invention may stand alone as a method for speech watermarking, and may also be used in conventional speaker verification to solve security problems over channels without any degradation in performance, accuracy or efficiency.
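The extraction steps above can be sketched as a small receiver loop. The VAD and bit detector below are deliberately simplified stand-ins (an energy test and a mean-sign test) for the MSF/ZCR/pitch VAD and the LPC pdf-shaping detector of the method; all names and the frame length are assumptions.

```python
import numpy as np

FRAME = 100   # hypothetical frame length in samples

def vad(frame):
    """Toy VAD: an energy test standing in for the MSF/ZCR/pitch decision."""
    return np.mean(np.abs(frame)) > 0.2

def detect_bit(frame, voiced):
    """Toy detector: assumes the embedder shifted each carrier frame's
    sample mean positive for bit 1 and negative for bit 0 (a stand-in
    for detection from the shaped LPC distribution)."""
    return 1 if frame.mean() > 0 else 0

def receiver(signal, carrier_frames):
    """Steps 2-5 of the extraction: segment the (already synchronized)
    signal into frames, apply VAD to each carrier frame, then run the
    detector selected by the VAD decision."""
    bits = []
    for i in carrier_frames:
        frame = signal[i * FRAME:(i + 1) * FRAME]
        voiced = vad(frame)
        bits.append(detect_bit(frame, voiced))
    return bits

# Usage: three carrier frames with means +0.3, -0.3, +0.3 encode bits 1, 0, 1
sig = np.zeros(5 * FRAME)
for idx, mean in zip([1, 2, 4], [0.3, -0.3, 0.3]):
    sig[idx * FRAME:(idx + 1) * FRAME] = mean
bits = receiver(sig, [1, 2, 4])
```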
Although the present invention has been described with reference to specific embodiments, also shown in the appended figures, it will be apparent for those skilled in the art that many variations and modifications can be done within the scope of the invention as described in the specification and defined in the following claims.

Claims

1. A method for speech watermarking in speaker verification, comprising the steps of:
embedding watermark data into speech signal at a transmitter; and extracting watermark data from the speech signal at a receiver; characterised by the steps of:
selecting frames having least speaker-specific information from the speech signal to carry watermark data;
detecting voice activity to detect presence or absence of speaker's voice in the speech signal; and
embedding watermark data into the selected frames of the speech signal according to the presence or absence of the speaker's voice.
2. A method for speech watermarking in speaker verification according to claim 1, wherein the speaker-specific information depends on system noise, fundamental frequencies, system features and source features of the speaker-specific information.
3. A method for speech watermarking in speaker verification according to claim 1, wherein the step of extracting watermark data from the speech signal comprises the steps of:
performing synchronization of a decoder to the speech signal;
detecting voice activity to detect presence or absence of the speaker's voice in the speech signal; and
extracting the watermark data from the speech signal according to the presence or absence of the speaker's voice.
PCT/MY2014/000138 2013-07-22 2014-05-29 A method for speech watermarking in speaker verification WO2015012680A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2013701280A MY180944A (en) 2012-09-14 2013-07-22 A method for speech watermarking in speaker verification
MYPI2013701280 2013-07-22

Publications (2)

Publication Number Publication Date
WO2015012680A2 true WO2015012680A2 (en) 2015-01-29
WO2015012680A3 WO2015012680A3 (en) 2015-03-26

Family

ID=51542420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000138 WO2015012680A2 (en) 2013-07-22 2014-05-29 A method for speech watermarking in speaker verification

Country Status (1)

Country Link
WO (1) WO2015012680A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531176B (en) * 2016-10-27 2019-09-24 天津大学 The digital watermarking algorithm of audio signal tampering detection and recovery

Citations (1)

Publication number Priority date Publication date Assignee Title
US6892175B1 (en) 2000-11-02 2005-05-10 International Business Machines Corporation Spread spectrum signaling for speech watermarking

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP2002514318A (en) * 1997-01-31 2002-05-14 T-Netix, Inc. System and method for detecting recorded speech

Non-Patent Citations (2)

Title
FAUNDEZ-ZANUY, MARCOS; LUCENA-MOLINA, JOSE J.; HAGMULLER, MARTIN: "Speech Watermarking: An Approach for the Forensic Analysis of Digital Telephonic Recordings", JOURNAL OF FORENSIC SCIENCES, vol. 55, no. 4, 2010, pages 1080 - 1087, XP055159377, DOI: 10.1111/j.1556-4029.2010.01395.x
FAUNDEZ-ZANUY, MARCOS ET AL., PATTERN RECOGNITION JOURNAL, vol. 40, February 2007, ELSEVIER, pages 3027 - 3034

Cited By (11)

Publication number Priority date Publication date Assignee Title
GB2552722A (en) * 2016-08-03 2018-02-07 Cirrus Logic Int Semiconductor Ltd Speaker recognition
WO2018025024A1 (en) * 2016-08-03 2018-02-08 Cirrus Logic International Semiconductor Limited Speaker recognition
GB2567339A (en) * 2016-08-03 2019-04-10 Cirrus Logic Int Semiconductor Ltd Speaker recognition
US10726849B2 (en) 2016-08-03 2020-07-28 Cirrus Logic, Inc. Speaker recognition with assessment of audio frame contribution
US10950245B2 (en) 2016-08-03 2021-03-16 Cirrus Logic, Inc. Generating prompts for user vocalisation for biometric speaker recognition
GB2567339B (en) * 2016-08-03 2022-04-06 Cirrus Logic Int Semiconductor Ltd Speaker recognition
US11735191B2 (en) 2016-08-03 2023-08-22 Cirrus Logic, Inc. Speaker recognition with assessment of audio frame contribution
US11269976B2 (en) 2019-03-20 2022-03-08 Saudi Arabian Oil Company Apparatus and method for watermarking a call signal
CN113113021A (en) * 2021-04-13 2021-07-13 效生软件科技(上海)有限公司 Voice biological recognition authentication real-time detection method and system
CN114999502A (en) * 2022-05-19 2022-09-02 贵州财经大学 Adaptive word framing based voice content watermark generation and embedding method and voice content integrity authentication and tampering positioning method
CN114999502B (en) * 2022-05-19 2023-01-06 贵州财经大学 Adaptive word framing based voice content watermark generation and embedding method and voice content integrity authentication and tampering positioning method

Also Published As

Publication number Publication date
WO2015012680A3 (en) 2015-03-26

Similar Documents

Publication Publication Date Title
Mowlaee et al. Advances in phase-aware signal processing in speech communication
WO2015012680A2 (en) A method for speech watermarking in speaker verification
Mak et al. A study of voice activity detection techniques for NIST speaker recognition evaluations
Cooke A glimpsing model of speech perception in noise
Hu et al. Pitch‐based gender identification with two‐stage classification
EP2224433B1 (en) An apparatus for processing an audio signal and method thereof
Nematollahi et al. An overview of digital speech watermarking
RU2680352C1 (en) Encoding mode determining method and device, the audio signals encoding method and device and the audio signals decoding method and device
Hu et al. Segregation of unvoiced speech from nonspeech interference
KR20130031849A (en) A bandwidth extender
Narayanan et al. The role of binary mask patterns in automatic speech recognition in background noise
Kakouros et al. Evaluation of spectral tilt measures for sentence prominence under different noise conditions
CN102376306B (en) Method and device for acquiring level of speech frame
Wang et al. Detection of speech tampering using sparse representations and spectral manipulations based information hiding
Celik et al. Pitch and duration modification for speech watermarking
Wang et al. Tampering Detection Scheme for Speech Signals using Formant Enhancement based Watermarking.
Wang et al. Formant enhancement based speech watermarking for tampering detection
Ijitona et al. Improved silence-unvoiced-voiced (SUV) segmentation for dysarthric speech signals using linear prediction error variance
Nematollahi et al. Semifragile speech watermarking based on least significant bit replacement of line spectral frequencies
Srinivasan et al. A model for multitalker speech perception
Joglekar et al. DeepComboSAD: Spectro-Temporal Correlation Based Speech Activity Detection for Naturalistic Audio Streams
Nishimura Reversible audio data hiding based on variable error-expansion of linear prediction for segmental audio and G. 711 speech
Wang et al. Speech Watermarking Based on Source-filter Model of Speech Production.
El-Maleh Classification-based Techniques for Digital Coding of Speech-plus-noise
Mawalim et al. Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14766550

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14766550

Country of ref document: EP

Kind code of ref document: A2