AU2019100372A4 - A robust speaker recognition system based on dynamic time wrapping - Google Patents

A robust speaker recognition system based on dynamic time wrapping Download PDF

Info

Publication number
AU2019100372A4
Authority
AU
Australia
Prior art keywords
data
training
test
speaker
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019100372A
Inventor
Ruiyang Jin
Zhenxian Lu
Haiwei Wang
Xinlu Yao
Fengchun Zhang
Qianqian ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lu Zhenxian Miss
Zhang Qianqian Miss
Original Assignee
Lu Zhenxian Miss
Zhang Qianqian Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lu Zhenxian Miss, Zhang Qianqian Miss filed Critical Lu Zhenxian Miss
Priority to AU2019100372A priority Critical patent/AU2019100372A4/en
Application granted granted Critical
Publication of AU2019100372A4 publication Critical patent/AU2019100372A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a robust speaker recognition system based on dynamic time warping. This invention lies in the field of voice recognition, specifically speaker recognition, and illustrates the basic principles and key technologies involved, including MFCC and DTW. The invention consists of the following steps. To begin with, we collect some data and divide them into a training set and a test set. Secondly, in the training procedure, the training data are preprocessed and converted into MFCCs, then stored in a database. Thirdly, in the test procedure, we process the test data in the same way to get their MFCCs and compare them with those in the database to get the result. With some improvements to the traditional endpoint detection method, this invention is more robust against noise when extracting effective speech segments and achieves 100% accuracy on our dataset, with 311.0524 seconds spent on the recognition of each sample on average. Besides, an implementation in MATLAB is given. Generally speaking, this invention is a reliable tool for use in police offices, banking systems, and other places that need speaker recognition.
[Fig.1: overall procedure, from training data and test data through the training and test procedures to the recognized speaker (Speaker1 to Speaker6). Fig.2: training flow: judge the effective speech segment, calculate the MFCC, save the results to the database. Fig.3: test flow: judge the effective speech segment, calculate the MFCC, compare with the database using DTW, get the result.]

Description

A robust speaker recognition system based on dynamic time wrapping
Field of the Invention
This invention is in the field of voice recognition and performs identification of speakers powered by dynamic time warping (DTW).
Background
Nowadays voice is increasingly used to convey messages instead of text due to its convenience, and it can be found in QQ, WeChat, Siri and a sea of other software. Along with the popularity of voice come security problems, since attackers can now access more and more private information by imitating users' voices. Thus, a new method is needed to authenticate the speakers of voice signals and protect people's information and property from being stolen. It must be accurate as well as fast and automatic, guaranteeing correct recognition while ensuring that the large quantity of voice signals on the Internet does not get stuck in authentication.
As a branch of voice recognition, speaker recognition is a newly developed multidisciplinary technology. Speaker recognition focuses on extracting features from the voice signal and identifying who the speaker is, and it is well suited to speaker authentication. Research on speaker recognition has been conducted continuously since the field came into being, and today speaker recognition is widely used in many settings, such as police departments, courts, telephone banking, and user tracking. Companies like AT&T, Motorola and Visa have all applied this technology in their own products.
The study of speaker recognition began in the 1930s, and the early work mainly focused on human hearing experiments and exploring the possibilities of voice recognition. Later, the study of automatic speaker recognition thrived with the development of electronic and computer technology. Methods and techniques for speaker recognition have developed rapidly in recent decades. The recognition models evolved from single-template to multi-template models, and from template models to vector quantization models, Gaussian mixture models, hidden Markov models, and artificial neural networks. The recognition setting has changed from a few speakers in a noise-free environment to a large number of speakers in complex noisy environments.
In this invention, we focus on speaker recognition based on DTW (dynamic time warping), an algorithm mainly used in template matching. We use MATLAB to implement our invention since it provides well-rounded support for voice signal processing.
Summary of the invention
In order to achieve the goal of reliable speaker authentication and protect people's private information from attackers, this invention proposes a speaker recognition method based on the DTW algorithm. People can achieve fast and accurate speaker recognition with this invention in situations like talking with others online or authenticating users in banking systems. This invention is able to reach 100% accuracy on our dataset.
The overall procedure can be divided into data collection, training and test. In general, it takes three steps to train the invention: judging the effective speech segment, calculating the MFCC, and saving the result to the database. Then, in the test phase, input test samples also go through the first two steps to be converted into MFCCs, and the results are compared with the MFCCs in the database using DTW to get the final answer and check the correctness of the invention. As soon as the test completes, we can put the invention into use, and the steps are identical to those in the test phase.
Before actually recognizing any input voice signal, this invention requires some training data to be collected and preprocessed. In our implementation, we collect some voice samples, rename them in a specific format, divide them into a training set and a test set, and later use their MFCCs rather than the raw signals.
Upon receiving an input voice signal, the invention extracts its effective speech segment by endpoint detection based on dual-threshold comparison. After the input signal is divided into frames, the invention calculates the short-term average energy of each frame. Then it finds an energy threshold to roughly decide the effective speech segment, and repeats the process based on that result to make it more precise. Next, the short-term zero-crossing rate of every frame is calculated to decide another threshold, and the input signal can be truncated according to those two thresholds. The definitions of the short-term average energy $E_n$ and the short-term zero-crossing rate $Z_n$ are shown below:

\[ E_n = \sum_{m=-\infty}^{+\infty} \left[ x(m)\,w(n-m) \right]^2 = \sum_{m=n-(N-1)}^{n} \left[ x(m)\,w(n-m) \right]^2 \qquad (1) \]

\[ Z_n = \sum_{m=-\infty}^{+\infty} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m) = \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right| * w(n) \qquad (2) \]

where n is the frame index, N is the frame length, and

\[ w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3) \]

\[ \operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \qquad (4) \]
In addition, the invention improves the traditional endpoint detection method by applying a new algorithm to decide the thresholds, and is thus more precise than traditional endpoint detection when judging the effective speech segments of voice signals containing unusual noise.
Then the invention calculates the MFCC of the effective speech segment obtained in the last step to better represent the voice signal. The calculation of the MFCC contains several sub-steps, including framing, windowing, mel-frequency filtering, and cepstrum analysis.
To recognize the speaker, the resulting MFCC needs to be compared with the MFCCs of the training data in the database. Since different MFCCs have different shapes, DTW is used to measure their similarity instead of simply calculating their L2 distance. The invention then finds the MFCCs that are most similar to the input one, checks which speakers they belong to, and takes the speaker that occurs most frequently as the result.
Description of the drawings
Figure 1 is the overall procedure for using the invention.
Figure 2 is the process of training.
Figure 3 is the test process, which is also the actual recognition process.
Figure 4 is the process of extraction of effective speech segment using endpoint detection.
Figure 5 is the steps of calculation of MFCC.
Figure 6 is an example of waveform plots of voice signals.
Figure 7 is an example of short-term average energy plots of voice signals.
Figure 8 is an example of short-term zero-crossing rate plots of voice signals.
Figure 9 is an example of the result of endpoint detection based on dual-threshold comparison, where vertical lines are the thresholds of the different stages.
Figure 10 is an example of MFCC plots of voice signals.
Figure 11 is a diagram showing how the DTW algorithm aligns two sequences of different lengths.
Description of preferred embodiment
This invention provides a reliable speaker recognition function and is well suited for situations in which authentication by voice is needed, such as logging into bank accounts and online chatting.
The overall procedure for using this invention is shown in Figure 1. This section first describes the data collection process, then illustrates the processes of judging effective speech segments and calculating the MFCC in detail, since they appear in both procedures. Finally, the process of getting the final answer is demonstrated, and some miscellaneous details are mentioned.
Data collection and training
Essentially, this invention performs text-dependent speaker identification. That is, it tries to find the speaker of the input voice signal among a group of candidates, and it achieves this goal by comparing features of the input voice signal with those of the signals in the database, which should represent the same words or sentences spoken by different people.
Therefore, we first collected 600 voice samples from 6 speakers, the content being a sentence pronounced as "zhu da jia xin nian kuai le" in Chinese. Three speakers are male and three are female, all aged between 18 and 20, and every speaker is assigned a unique ID to represent them in the following programs. The voice samples were recorded with computer microphones in a quiet environment, and speakers recorded their voice in their normal tone. The naming format of the sample files is "samplex_y.mp3", in which x is the ID of the speaker and y is the index of the sample file. Those samples are converted to single-channel and divided into two parts: a training set and a test set. For fairness, the training set includes 80 samples of each speaker, namely 480 equally distributed samples in total, and the test set contains all other samples.
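As an illustration of this naming convention and split, the following MATLAB sketch enumerates the sample files and partitions them into training and test sets. The folder name, the file-extension handling, and the assumption that the first 80 samples of each speaker form the training set are ours, not taken from the original implementation.

    numSpeakers = 6;           % speakers 1..6
    samplesPerSpeaker = 100;   % 600 samples in total
    numTrain = 80;             % 80 training samples per speaker (480 in total)

    trainFiles = {};
    testFiles  = {};
    for x = 1:numSpeakers
        for y = 1:samplesPerSpeaker
            f = sprintf('data/sample%d_%d.mp3', x, y);   % "samplex_y.mp3" naming format
            if y <= numTrain
                trainFiles{end+1} = f; %#ok<SAGROW>
            else
                testFiles{end+1} = f;  %#ok<SAGROW>
            end
        end
    end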
In the training phase, we simply calculate and record the MFCCs of the samples in the training set without further processing, and the MFCCs of samples from the same speaker are stored in the same row of a big cell array to label them. Those recorded MFCCs form the database of the invention and will later be compared with the MFCCs of the test samples. A .mat file is used to store those MFCCs and represents the database in our experiment. Since the MFCC reflects an individual's speech identity and has better performance and robustness than other features in speaker recognition, we choose it to represent voice signals in our invention. The detailed calculation is the same as for test samples and is described below.
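A minimal sketch of building that database is given below. The helper functions extractEffectiveSegment and computeMFCC are hypothetical names standing in for the endpoint detection and MFCC steps described in the following subsections, and the file layout follows the sketch above.

    numSpeakers = 6;
    numTrain = 80;
    database = cell(numSpeakers, numTrain);            % one row per speaker, as described above
    for x = 1:numSpeakers
        for y = 1:numTrain
            [sig, fs] = audioread(sprintf('data/sample%d_%d.mp3', x, y));
            sig = mean(sig, 2);                        % convert to single channel
            seg = extractEffectiveSegment(sig, fs);    % endpoint detection (Steps 1-4 below)
            database{x, y} = computeMFCC(seg, fs);     % MFCC matrix of this sample
        end
    end
    save('database.mat', 'database');                  % the .mat file acting as the database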
Judging effective speech segment
Training and test samples all need to go through the steps of judging effective speech segments and calculating the MFCC. In our implementation, we convert participants' samples into digital information using the 'audioread' function in MATLAB (shown in Fig.5). Such digital information also contains, although participants tried their best to avoid it, useless and confusing information from noise. To guarantee the simplicity and efficiency of the MFCC calculation, it is important to refine that information and use the effective speech segments instead. This invention uses endpoint detection based on dual-threshold comparison to perform this task, with some improvements over the traditional method. The extraction of the effective speech segment consists of four steps, which are shown below.
Step 1: Framing
In our experiment, we convert those samples to single-channel, and use 50 as the frame length and 10 as the frameshift to frame them. The frameshift is simply the length of the overlapping part of two adjacent frames. After framing, an original signal array x becomes an n*50 matrix frame, in which n is the frame number, frame(i, j) equals x(40*i + j), and j ranges from 1 to 50. The frame number n is calculated as follows:

\[ n = \frac{\mathit{length} - \mathit{fralen}}{\mathit{fralen} - \mathit{frashift}} \]

where length is the length of the original signal array x, fralen is the frame length (50 here), and frashift is the frameshift (10 here).
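A minimal MATLAB sketch of this framing step follows; the function name is ours, and the indexing is shifted to MATLAB's 1-based convention, so frame i covers x(hop*(i-1)+1 : hop*(i-1)+fralen) with hop = fralen - frashift = 40.

    function frames = frameSignal(x, fralen, frashift)
        % Split signal x into overlapping frames of length fralen (50) with
        % frameshift frashift (10), i.e. a hop of fralen - frashift = 40 samples.
        x = x(:);                                   % ensure a column vector
        hop = fralen - frashift;
        n = floor((length(x) - fralen) / hop);      % frame count, as in the formula above
        frames = zeros(n, fralen);
        for i = 1:n
            frames(i, :) = x(hop*(i-1) + (1:fralen));
        end
    end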
Step 2: Calculating short-term average energy
Short-term average energy is a major standard for judging effective speech segments, and its definition is shown in formula (1). In our experiment, we take the sum of the squares of the elements in a single frame as its short-term average energy, and the short-term average energy of the whole sample can be calculated in the same way. Because the amplitude of environmental noise is usually smaller than that of the speaker's voice (shown in Fig.6), short-term energy is useful for distinguishing voice from noise. According to the energy of the voice samples, a threshold can be built to roughly remove the environmental noise, and we can get our first version of the effective speech segments.
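A sketch of this step, computing one energy value per frame from the frames matrix produced in Step 1 (the function name is ours):

    function En = shortTermEnergy(frames)
        % Sum of squared samples in each frame (one row of the frames matrix).
        En = sum(frames .^ 2, 2);
    end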
Step 3: Calculating short-term zero-crossing rate
Short-term zero-crossing rate is another key parameter in speech analysis. It represents how often the signal value crosses zero within each frame, and its definition is shown in formula (2). It is the second standard for extracting effective speech segments in our detection method. Nevertheless, the zero-crossing rate of environmental noise can sometimes be higher than that of the speaker's voice, so it cannot be used to judge effective speech segments on its own or to decide the first threshold, and it must cooperate with short-term average energy or other parameters.
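A frame-wise sketch of formula (2), counting sign changes between adjacent samples in each frame (the function name is ours; zeros are treated as positive, matching the sign function above):

    function Zn = shortTermZCR(frames)
        % Count sign changes between adjacent samples within each frame.
        s = sign(frames);
        s(s == 0) = 1;                         % treat zero samples as positive
        Zn = sum(abs(diff(s, 1, 2)), 2) / 2;   % each crossing contributes |1 - (-1)| / 2 = 1
    end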
Step 4: Acquiring effective speech segments
The three steps above are the preparation for acquiring effective speech segments. Our method is based on traditional dual-threshold comparison, with the following improvements to the way the thresholds are decided. Firstly, we use one-tenth of the highest speech energy as the first threshold. To get a more precise result, we temporarily treat segments outside the first threshold as noise, calculate the sum of the first threshold and the noise's average energy, and take one-tenth of that sum as the second threshold, which makes it a dynamic value connected with both noise and voice. After that, the average of the short-term zero-crossing rate of the noise and one-tenth of the largest zero-crossing rate is computed as the third threshold. Because these three thresholds are dynamic, they can better eliminate the confusion caused by environmental noise, to some extent. But the method still has a shortcoming: when the average power of the environment is larger than the average power of the speaker's voice, it is hard to distinguish the effective speech segments from the whole signal.
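A sketch of these three dynamic thresholds, using the energy and zero-crossing vectors from Steps 2 and 3; the function name and the exact way the third threshold combines the two quantities are our reading of the description above, not a verbatim copy of the implementation:

    function [T1, T2, T3] = dynamicThresholds(En, Zn)
        T1 = max(En) / 10;                         % one tenth of the peak short-term energy
        noise = En < T1;                           % frames below T1 are treated as noise for now
        T2 = (T1 + mean(En(noise))) / 10;          % tied to both the voice and the noise level
        T3 = (mean(Zn(noise)) + max(Zn) / 10) / 2; % combines the noise ZCR and the peak ZCR
    end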
Calculation of MFCC
The mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum on a nonlinear mel scale of the sound frequency. The MFCC is a feature of the voice signal based on human hearing perception, whose frequency resolution becomes much coarser above about 1 kHz, and it can thus reflect an individual's speech identity. Calculation of the MFCC has seven steps (shown in Fig.5). The efficiency of this step is important because it affects the behavior of the following phase. After the MFCC is obtained, the original data can be discarded; the MFCCs of the training data are stored in the database for recognition use, while those of the test data are passed to the next step to find their speaker.
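The following is a minimal, self-contained MATLAB sketch of a standard MFCC computation following the usual chain of sub-steps (pre-emphasis, framing, windowing, FFT power spectrum, mel filtering, log, DCT). The frame sizes, filter count and coefficient count are illustrative defaults rather than the values used in the original implementation, and hamming and dct require the Signal Processing Toolbox.

    function C = computeMFCC(x, fs)
        numFilters = 26;  numCoeffs = 13;
        frameLen = round(0.025 * fs);             % 25 ms frames
        hop      = round(0.010 * fs);             % 10 ms hop

        x = x(:) - mean(x(:));
        x = filter([1 -0.97], 1, x);              % pre-emphasis

        nFrames = floor((length(x) - frameLen) / hop) + 1;
        win  = hamming(frameLen);
        nfft = 2^nextpow2(frameLen);

        % Triangular mel filterbank spanning 0 .. fs/2
        mel  = @(f) 2595 * log10(1 + f / 700);
        imel = @(m) 700 * (10.^(m / 2595) - 1);
        binPts = floor((nfft + 1) * imel(linspace(mel(0), mel(fs/2), numFilters + 2)) / fs) + 1;
        H = zeros(numFilters, nfft/2 + 1);
        for k = 1:numFilters
            for b = binPts(k):binPts(k+1)
                H(k, b) = (b - binPts(k)) / max(1, binPts(k+1) - binPts(k));
            end
            for b = binPts(k+1):binPts(k+2)
                H(k, b) = (binPts(k+2) - b) / max(1, binPts(k+2) - binPts(k+1));
            end
        end

        C = zeros(nFrames, numCoeffs);
        for i = 1:nFrames
            seg = x((i-1)*hop + (1:frameLen)) .* win;   % framing and windowing
            P = abs(fft(seg, nfft)).^2;                 % power spectrum
            E = log(H * P(1:nfft/2 + 1) + eps);         % log mel-filter energies
            c = dct(E);                                 % cepstrum via the DCT
            C(i, :) = c(1:numCoeffs)';                  % keep the lowest coefficients
        end
    end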
Getting the result
The MFCC of the input test sample calculated in the step above should now be compared with those in the database to find the most similar ones. Since different people have different speaking manners, the same words spoken by different people may not be aligned in time; therefore, the MFCCs in the database are matrices of the same width but different lengths, which makes it improper to measure their similarity by calculating their L2 distances directly. Thus, the DTW algorithm is applied here to solve this problem.
The DTW algorithm stretches time sequences of different lengths to align their features. Given two vectors x and y as time sequences, whose sizes are m and n, a matrix dist of size m*n is built, each element being the distance between the corresponding elements of the original vectors. For instance, if the L2 distance is used to calculate the distances between elements, then dist[i,j] is just the L2 distance between x[i] and y[j]. The best alignment between the two vectors is a path in the matrix from dist[1,1] to dist[m,n] whose total distance is minimal, and DTW finds it by dynamic programming. A new matrix global of size m*n is built, each element storing the minimal total distance from dist[1,1] to its position. After the previous elements are filled, global[i,j] can be calculated by adding dist[i,j] to the smallest one among global[i-1,j], global[i,j-1] and global[i-1,j-1]. Eventually, global[m,n] is the final distance of the best alignment, which is the distance between x and y calculated by DTW.
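A direct MATLAB sketch of this dynamic programming scheme follows; the function name is ours, and the variable global_ stands in for the matrix called global above, since global is a reserved word in MATLAB.

    function d = dtwDistance(X, Y)
        % X and Y are MFCC matrices with one frame per row and equal numbers of columns.
        m = size(X, 1);  n = size(Y, 1);
        dist = zeros(m, n);
        for i = 1:m
            for j = 1:n
                dist(i, j) = norm(X(i, :) - Y(j, :));      % L2 distance between frames
            end
        end
        global_ = inf(m, n);
        global_(1, 1) = dist(1, 1);
        for i = 1:m
            for j = 1:n
                if i == 1 && j == 1, continue; end
                best = inf;
                if i > 1,          best = min(best, global_(i-1, j));   end
                if j > 1,          best = min(best, global_(i, j-1));   end
                if i > 1 && j > 1, best = min(best, global_(i-1, j-1)); end
                global_(i, j) = dist(i, j) + best;          % minimal cumulative distance
            end
        end
        d = global_(m, n);                                  % distance of the best alignment
    end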
With the DTW algorithm, the distances between the input sample and all 480 samples in the database are measured, and the 80 samples with the smallest distances are chosen as the most similar ones. Then the invention finds their positions in the database and determines which speaker they belong to. In the end, the speaker ID that occurs most frequently is chosen as the speaker of the input voice signal and is returned as the result of the speaker recognition.
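A sketch of this matching and voting step, built on the database cell array and the dtwDistance sketch above; the 80 nearest neighbours follow the description, while the function and variable names are ours:

    function speakerID = recognizeSpeaker(testMFCC, database, k)
        if nargin < 3, k = 80; end                  % number of nearest samples to vote with
        [numSpeakers, numTrain] = size(database);
        d  = zeros(numSpeakers * numTrain, 1);
        id = zeros(numSpeakers * numTrain, 1);
        idx = 0;
        for x = 1:numSpeakers
            for y = 1:numTrain
                idx = idx + 1;
                d(idx)  = dtwDistance(testMFCC, database{x, y});
                id(idx) = x;                        % the row index encodes the speaker ID
            end
        end
        [~, order] = sort(d);                       % smallest DTW distances first
        speakerID = mode(id(order(1:k)));           % majority vote among the k nearest samples
    end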
Putting it into use
Once it passes the test, the invention is safe to use. Our experiment achieved 100% accuracy on all test samples, with 311.0524 seconds spent on the recognition of each sample on average. To use this invention, simply enter the path of the file to recognize, and the invention will return the ID of the corresponding speaker as the answer after executing the same steps as in the test phase.
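Put together, an end-to-end call might look like the sketch below, reusing the hypothetical helpers introduced earlier (extractEffectiveSegment, computeMFCC, recognizeSpeaker) and an illustrative file path:

    load('database.mat', 'database');                 % the trained MFCC database
    [sig, fs] = audioread('data/unknown_sample.mp3'); % the file to recognize
    sig = mean(sig, 2);                               % convert to single channel
    seg = extractEffectiveSegment(sig, fs);           % endpoint detection (Steps 1-4)
    testMFCC = computeMFCC(seg, fs);                  % feature extraction
    speakerID = recognizeSpeaker(testMFCC, database)  % prints the recognized speaker's ID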
The invention still has some room for improvement. For example, it cannot perfectly deal with noises such as the sound of breathing, and it is not fast enough to recognize voice signals in large quantities. However, it is good enough for accurate speaker recognition on a small scale.

Claims (2)

1. A robust speaker recognition system based on dynamic time wrapping, characterized in that an enhanced endpoint detection method based on dual-threshold comparison is used to extract the effective speech segment, in which the short-term average energy and the short-term zero-crossing rate of frames are used to judge the effectiveness of voice signal frames.
2. The robust speaker recognition system according to claim 1, wherein the algorithm used to match voice signals is Dynamic Time Warping (DTW); this algorithm measures the similarity between two temporal voice signals of different lengths and matches the same features at different positions, and is thus better than the L2 distance.
AU2019100372A 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping Ceased AU2019100372A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019100372A AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019100372A AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Publications (1)

Publication Number Publication Date
AU2019100372A4 true AU2019100372A4 (en) 2019-05-16

Family

ID=66443161

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019100372A Ceased AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Country Status (1)

Country Link
AU (1) AU2019100372A4 (en)

Similar Documents

Publication Publication Date Title
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
Zhang et al. Voicelive: A phoneme localization based liveness detection for voice authentication on smartphones
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
Singh et al. Statistical Analysis of Lower and Raised Pitch Voice Signal and Its Efficiency Calculation.
Vestman et al. Voice mimicry attacks assisted by automatic speaker verification
Korshunov et al. Cross-database evaluation of audio-based spoofing detection systems
Algabri et al. Automatic speaker recognition for mobile forensic applications
CN106409298A (en) Identification method of sound rerecording attack
Gupta et al. Gender-based speaker recognition from speech signals using GMM model
Nandyal et al. MFCC based text-dependent speaker identification using BPNN
Stefanus et al. GMM based automatic speaker verification system development for forensics in Bahasa Indonesia
Singh et al. Linear Prediction Residual based Short-term Cepstral Features for Replay Attacks Detection.
Singh et al. Combining evidences from Hilbert envelope and residual phase for detecting replay attacks
Mandalapu et al. Multilingual voice impersonation dataset and evaluation
AU2019100372A4 (en) A robust speaker recognition system based on dynamic time wrapping
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Lapidot et al. Speech Database and Protocol Validation Using Waveform Entropy.
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Nguyen et al. Vietnamese speaker authentication using deep models
Al-Hassani et al. Design a text-prompt speaker recognition system using LPC-derived features
Rahman et al. Blocking black area method for speech segmentation
Vielhauer et al. Fusion strategies for speech and handwriting modalities in HCI
Shome et al. Effect of End Point Detection on Fixed Phrase Speaker Verification

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry