AU2019100372A4 - A robust speaker recognition system based on dynamic time wrapping - Google Patents

A robust speaker recognition system based on dynamic time wrapping Download PDF

Info

Publication number
AU2019100372A4
Authority
AU
Australia
Prior art keywords
data
training
test
speaker
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019100372A
Inventor
Ruiyang Jin
Zhenxian Lu
Haiwei Wang
Xinlu Yao
Fengchun Zhang
Qianqian ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lu Zhenxian Miss
Zhang Qianqian Miss
Original Assignee
Lu Zhenxian Miss
Zhang Qianqian Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lu Zhenxian Miss, Zhang Qianqian Miss filed Critical Lu Zhenxian Miss
Priority to AU2019100372A priority Critical patent/AU2019100372A4/en
Application granted granted Critical
Publication of AU2019100372A4 publication Critical patent/AU2019100372A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a robust speaker recognition system based on dynamic time warping. This invention lies in the field of voice recognition, specifically speaker recognition, and illustrates the basic principles and key technologies involved, including MFCC and DTW. The invention consists of the following steps. To begin with, we collect some data and divide them into a training set and a test set. Secondly, in the training procedure, the training data are preprocessed and converted into MFCCs, then stored in a database. Thirdly, in the test procedure, we process the test data in the same way to get their MFCCs and compare them with those in the database to get the result. With some improvements to the traditional endpoint detection method, this invention is more robust against noise when extracting effective speech segments and achieves 100% accuracy on our dataset, with 311.0524 seconds spent on the recognition of each sample on average. Besides, an implementation in MATLAB is given. Generally speaking, this invention is a reliable tool for use in police offices, banking systems, and other places that need speaker recognition.
[Fig.1: overall procedure, from training data and test data through the training and test procedures to the recognized speaker (Speaker1 to Speaker6). Fig.2: training flow: judge the effective speech segment, calculate the MFCC, save the results to the database. Fig.3: test flow: judge the effective speech segment, calculate the MFCC, compare with the database using DTW, get the result.]

Description

A robust speaker recognition system based on dynamic time wrapping
Field of the Invention
This invention is in the field of voice recognition and performs identification of speakers powered by dynamic time warping (DTW).
Background
Nowadays voice is increasingly used to convey messages instead of text due to its convenience, and it can be found in QQ, WeChat, Siri and a sea of other software. Along with the popularity of voice come security problems, since attackers can now access more and more private information by imitating users' voices. Thus, a new method is needed to authenticate the speakers of voice signals and protect people's information and property from being stolen. It must be accurate as well as fast and automatic, guaranteeing correct recognition while ensuring that the large quantity of voice signals on the Internet does not get stuck in authentication.
As a branch of voice recognition, speaker recognition is a newly developed multidisciplinary technology. Speaker recognition focuses on extracting features from the voice signal and identifying who the speaker is, and it is well suited to speaker authentication. Research on speaker recognition has been conducted continuously since the field came into being, and today speaker recognition is widely used in many settings, such as police departments, courts, telephone banking, and user tracking. Companies like AT&T, Motorola and Visa have all applied this technology in their own products.
The study of speaker recognition began in the 1930s, and the early work mainly focused on human hearing experiments and exploring the possibilities of voice recognition. Later, the study of automatic speaker recognition thrived with the development of electronic and computer technology. Methods and techniques for speaker recognition have developed rapidly in recent decades. The recognition models evolved from single-template to multi-template models, and from template models to vector quantization models, Gaussian mixture models, hidden Markov models, and artificial neural networks. The recognition setting has changed from a few speakers in a noise-free environment to a large number of speakers in complex noisy environments.
In this invention, we focus on speaker recognition based on DTW (dynamic time warping), an algorithm mainly used in template matching. We use MATLAB to implement our invention since it provides well-rounded support for voice signal processing.
Summary of the invention
In order to achieve the goal of reliable speaker authentication and protect people's private information from attackers, this invention proposes a speaker recognition method based on the DTW algorithm. People can achieve fast and accurate speaker recognition with this invention in situations like talking with others online or authenticating users in banking systems. This invention is able to reach 100% accuracy on our dataset.
The overall procedure can be divided into data collection, training and test. In general, it takes three steps to train the invention: judging the effective speech segment, calculating the MFCC, and saving the result to the database. Then, in the test phase, input test samples also go through the first two steps to be converted into MFCCs, and the results are compared with the MFCCs in the database using DTW to get the final answer and check the correctness of the invention. As soon as the test completes, we can put the invention into use, and the steps are identical to those in the test phase.
Before actually recognizing any input voice signal, this invention requires some training data to be collected and preprocessed. In our implementation, we collect some voice samples, rename them in a specific format, divide them into a training set and a test set, and later use their MFCCs rather than the raw signals.
Upon receiving an input voice signal, the invention extracts its effective speech segment by endpoint detection based on dual-threshold comparison. After the input signal is divided into frames, the invention calculates the short-term average energy of each frame. Then it finds an energy threshold to roughly decide the effective speech segment, and repeats the process based on that result to make it more precise. Next, the short-term zero-crossing rate of every frame is calculated to decide another threshold, and the input signal can be truncated according to those two thresholds. The definitions of the short-term average energy $E_n$ and the short-term zero-crossing rate $Z_n$ are shown below:

\[ E_n = \sum_{m=-\infty}^{+\infty} \left[ x(m)\,w(n-m) \right]^2 = \sum_{m=n-(N-1)}^{n} \left[ x(m)\,w(n-m) \right]^2 \qquad (1) \]

\[ Z_n = \sum_{m=-\infty}^{+\infty} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m) = \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right| * w(n) \qquad (2) \]

where n is the frame index, N is the frame length, and

\[ w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3) \]

\[ \operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \qquad (4) \]
In addition, the invention improves the traditional endpoint detection method by applying a new algorithm to decide the thresholds, and is thus more precise than traditional endpoint detection when judging the effective speech segments of voice signals containing unusual noise.
Then the invention calculates the MFCC of the effective speech segment obtained in the last step to better represent the voice signal. The calculation of the MFCC contains several sub-steps, including framing, windowing, mel-frequency filtering, and cepstrum analysis.
To recognize the speaker, the resulting MFCC needs to be compared with the MFCCs of the training data in the database. Since different MFCCs have different shapes, DTW is used to measure their similarity instead of simply calculating their L2 distance. The invention then finds the MFCCs that are most similar to the input one, checks which speakers they belong to, and takes the speaker that occurs most frequently as the result.
Description of the drawings
Figure 1 is the overall procedure for using the invention.
Figure 2 is the process of training.
Figure 3 is the test process, which is also the actual recognition process.
Figure 4 is the process of extraction of effective speech segment using endpoint detection.
Figure 5 is the steps of calculation of MFCC.
Figure 6 is an example of waveform plots of voice signals.
Figure 7 is an example of short-term average energy plots of voice signals.
Figure 8 is an example of short-term zero-crossing rate plots of voice signals.
Figure 9 is an example of the result of endpoint detection based on dual-threshold comparison, where vertical lines are the thresholds of the different stages.
Figure 10 is an example of MFCC plots of voice signals.
Figure 11 is a diagram showing how the DTW algorithm aligns two sequences of different lengths.
Description of preferred embodiment
This invention provides a reliable speaker recognition function and is well suited for situations in which authentication by voice is needed, such as logging into bank accounts and online chatting.
The overall procedure for using this invention is shown in Figure 1. This section first describes the data collection process, then illustrates the processes of judging effective speech segments and calculating the MFCC in detail, since they appear in both procedures. Finally, the process of getting the final answer is demonstrated, and some miscellaneous details are mentioned.
Data collection and training
Essentially, this invention performs text-dependent speaker identification. That is, it tries to find the speaker of the input voice signal among a group of candidates, and it achieves this goal by comparing features of the input voice signal with those of the signals in the database, which should represent the same words or sentences spoken by different people.
Therefore, we first collected 600 voice samples from 6 speakers, the content being a sentence pronounced as "zhu da jia xin nian kuai le" in Chinese. Three speakers are male and three are female, all aged between 18 and 20, and every speaker is assigned a unique ID to represent them in the following programs. The voice samples were recorded with computer microphones in a quiet environment, and speakers recorded their voice in their normal tone. The naming format of the sample files is "samplex_y.mp3", in which x is the ID of the speaker and y is the index of the sample file. Those samples are converted to single-channel and divided into two parts: a training set and a test set. For fairness, the training set includes 80 samples of each speaker, namely 480 equally distributed samples in total, and the test set contains all other samples.
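As an illustration of this naming convention and split, the following MATLAB sketch enumerates the sample files and partitions them into training and test sets. The folder name, the file-extension handling, and the assumption that the first 80 samples of each speaker form the training set are ours, not taken from the original implementation.

    numSpeakers = 6;           % speakers 1..6
    samplesPerSpeaker = 100;   % 600 samples in total
    numTrain = 80;             % 80 training samples per speaker (480 in total)

    trainFiles = {};
    testFiles  = {};
    for x = 1:numSpeakers
        for y = 1:samplesPerSpeaker
            f = sprintf('data/sample%d_%d.mp3', x, y);   % "samplex_y.mp3" naming format
            if y <= numTrain
                trainFiles{end+1} = f; %#ok<SAGROW>
            else
                testFiles{end+1} = f;  %#ok<SAGROW>
            end
        end
    end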
In the training phase, we simply calculate and record the MFCCs of the samples in the training set without further processing, and the MFCCs of samples from the same speaker are stored in the same row of a big cell array to label them. Those recorded MFCCs form the database of the invention and will later be compared with the MFCCs of the test samples. A .mat file is used to store those MFCCs and represents the database in our experiment. Since the MFCC reflects an individual's speech identity and has better performance and robustness than other features in speaker recognition, we choose it to represent voice signals in our invention. The detailed calculation is the same as for test samples and is described below.
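A minimal sketch of building that database is given below. The helper functions extractEffectiveSegment and computeMFCC are hypothetical names standing in for the endpoint detection and MFCC steps described in the following subsections, and the file layout follows the sketch above.

    numSpeakers = 6;
    numTrain = 80;
    database = cell(numSpeakers, numTrain);            % one row per speaker, as described above
    for x = 1:numSpeakers
        for y = 1:numTrain
            [sig, fs] = audioread(sprintf('data/sample%d_%d.mp3', x, y));
            sig = mean(sig, 2);                        % convert to single channel
            seg = extractEffectiveSegment(sig, fs);    % endpoint detection (Steps 1-4 below)
            database{x, y} = computeMFCC(seg, fs);     % MFCC matrix of this sample
        end
    end
    save('database.mat', 'database');                  % the .mat file acting as the database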
Judging effective speech segment
Training and test samples all need to go through the steps of judging effective speech segments and calculating the MFCC. In our implementation, we convert participants' samples into digital information using the 'audioread' function in MATLAB (shown in Fig.5). Such digital information also contains, although participants tried their best to avoid it, useless and confusing information from noise. To guarantee the simplicity and efficiency of the MFCC calculation, it is important to refine that information and use the effective speech segments instead. This invention uses endpoint detection based on dual-threshold comparison to perform this task, with some improvements over the traditional method. The extraction of the effective speech segment consists of four steps, which are shown below.
Step 1: Framing
In our experiment, we convert those samples to single-channel, and use 50 as the frame length and 10 as the frameshift to frame them. The frameshift is simply the length of the overlapping part of two adjacent frames. After framing, an original signal array x becomes an n*50 matrix frame, in which n is the frame number, frame(i, j) equals x(40*i + j), and j ranges from 1 to 50. The frame number n is calculated as follows:

\[ n = \frac{\mathit{length} - \mathit{fralen}}{\mathit{fralen} - \mathit{frashift}} \]

where length is the length of the original signal array x, fralen is the frame length (50 here), and frashift is the frameshift (10 here).
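A minimal MATLAB sketch of this framing step follows; the function name is ours, and the indexing is shifted to MATLAB's 1-based convention, so frame i covers x(hop*(i-1)+1 : hop*(i-1)+fralen) with hop = fralen - frashift = 40.

    function frames = frameSignal(x, fralen, frashift)
        % Split signal x into overlapping frames of length fralen (50) with
        % frameshift frashift (10), i.e. a hop of fralen - frashift = 40 samples.
        x = x(:);                                   % ensure a column vector
        hop = fralen - frashift;
        n = floor((length(x) - fralen) / hop);      % frame count, as in the formula above
        frames = zeros(n, fralen);
        for i = 1:n
            frames(i, :) = x(hop*(i-1) + (1:fralen));
        end
    end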
Step 2: Calculating short-term average energy
Short-term average energy is a major standard for judging effective speech segments, and its definition is shown in formula (1). In our experiment, we take the sum of the squares of the elements in a single frame as its short-term average energy, and the short-term average energy of the whole sample can be calculated in the same way. Because the amplitude of environmental noise is usually smaller than that of the speaker's voice (shown in Fig.6), short-term energy is useful for distinguishing voice from noise. According to the energy of the voice samples, a threshold can be built to roughly remove the environmental noise, and we can get our first version of the effective speech segments.
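A sketch of this step, computing one energy value per frame from the frames matrix produced in Step 1 (the function name is ours):

    function En = shortTermEnergy(frames)
        % Sum of squared samples in each frame (one row of the frames matrix).
        En = sum(frames .^ 2, 2);
    end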
Step 3: Calculating short-term zero-crossing rate
Short-term zero-crossing rate is another key parameter in speech analysis. It represents how often the signal value crosses zero within each frame, and its definition is shown in formula (2). It is the second standard for extracting effective speech segments in our detection method. Nevertheless, the zero-crossing rate of environmental noise can sometimes be higher than that of the speaker's voice, so it cannot be used to judge effective speech segments on its own or to decide the first threshold, and it must cooperate with short-term average energy or other parameters.
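A frame-wise sketch of formula (2), counting sign changes between adjacent samples in each frame (the function name is ours; zeros are treated as positive, matching the sign function above):

    function Zn = shortTermZCR(frames)
        % Count sign changes between adjacent samples within each frame.
        s = sign(frames);
        s(s == 0) = 1;                         % treat zero samples as positive
        Zn = sum(abs(diff(s, 1, 2)), 2) / 2;   % each crossing contributes |1 - (-1)| / 2 = 1
    end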
Step 4: Acquiring effective speech segments
The three steps above are the preparation for acquiring effective speech segments. Our method is based on traditional dual-threshold comparison, with the following improvements to the way the thresholds are decided. Firstly, we use one-tenth of the highest speech energy as the first threshold. To get a more precise result, we temporarily treat segments outside the first threshold as noise, calculate the sum of the first threshold and the noise's average energy, and take one-tenth of that sum as the second threshold, which makes it a dynamic value connected with both noise and voice. After that, the average of the short-term zero-crossing rate of the noise and one-tenth of the largest zero-crossing rate is computed as the third threshold. Because these three thresholds are dynamic, they can better eliminate the confusion caused by environmental noise, to some extent. But the method still has a shortcoming: when the average power of the environment is larger than the average power of the speaker's voice, it is hard to distinguish the effective speech segments from the whole signal.
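A sketch of these three dynamic thresholds, using the energy and zero-crossing vectors from Steps 2 and 3; the function name and the exact way the third threshold combines the two quantities are our reading of the description above, not a verbatim copy of the implementation:

    function [T1, T2, T3] = dynamicThresholds(En, Zn)
        T1 = max(En) / 10;                         % one tenth of the peak short-term energy
        noise = En < T1;                           % frames below T1 are treated as noise for now
        T2 = (T1 + mean(En(noise))) / 10;          % tied to both the voice and the noise level
        T3 = (mean(Zn(noise)) + max(Zn) / 10) / 2; % combines the noise ZCR and the peak ZCR
    end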
Calculation of MFCC
The mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum on a nonlinear mel scale of the sound frequency. The MFCC is a feature of the voice signal based on human hearing perception, whose frequency resolution becomes much coarser above about 1 kHz, and it can thus reflect an individual's speech identity. Calculation of the MFCC has seven steps (shown in Fig.5). The efficiency of this step is important because it affects the behavior of the following phase. After the MFCC is obtained, the original data can be discarded; the MFCCs of the training data are stored in the database for recognition use, while those of the test data are passed to the next step to find their speaker.
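The following is a minimal, self-contained MATLAB sketch of a standard MFCC computation following the usual chain of sub-steps (pre-emphasis, framing, windowing, FFT power spectrum, mel filtering, log, DCT). The frame sizes, filter count and coefficient count are illustrative defaults rather than the values used in the original implementation, and hamming and dct require the Signal Processing Toolbox.

    function C = computeMFCC(x, fs)
        numFilters = 26;  numCoeffs = 13;
        frameLen = round(0.025 * fs);             % 25 ms frames
        hop      = round(0.010 * fs);             % 10 ms hop

        x = x(:) - mean(x(:));
        x = filter([1 -0.97], 1, x);              % pre-emphasis

        nFrames = floor((length(x) - frameLen) / hop) + 1;
        win  = hamming(frameLen);
        nfft = 2^nextpow2(frameLen);

        % Triangular mel filterbank spanning 0 .. fs/2
        mel  = @(f) 2595 * log10(1 + f / 700);
        imel = @(m) 700 * (10.^(m / 2595) - 1);
        binPts = floor((nfft + 1) * imel(linspace(mel(0), mel(fs/2), numFilters + 2)) / fs) + 1;
        H = zeros(numFilters, nfft/2 + 1);
        for k = 1:numFilters
            for b = binPts(k):binPts(k+1)
                H(k, b) = (b - binPts(k)) / max(1, binPts(k+1) - binPts(k));
            end
            for b = binPts(k+1):binPts(k+2)
                H(k, b) = (binPts(k+2) - b) / max(1, binPts(k+2) - binPts(k+1));
            end
        end

        C = zeros(nFrames, numCoeffs);
        for i = 1:nFrames
            seg = x((i-1)*hop + (1:frameLen)) .* win;   % framing and windowing
            P = abs(fft(seg, nfft)).^2;                 % power spectrum
            E = log(H * P(1:nfft/2 + 1) + eps);         % log mel-filter energies
            c = dct(E);                                 % cepstrum via the DCT
            C(i, :) = c(1:numCoeffs)';                  % keep the lowest coefficients
        end
    end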
Getting the result
The MFCC of the input test sample calculated in the step above should now be compared with those in the database to find the most similar ones. Since different people have different speaking manners, the same words spoken by different people may not be aligned in time; therefore, the MFCCs in the database are matrices of the same width but different lengths, which makes it improper to measure their similarity by calculating their L2 distances directly. Thus, the DTW algorithm is applied here to solve this problem.
The DTW algorithm stretches time sequences of different lengths to align their features. Given two vectors x and y as time sequences, whose sizes are m and n, a matrix dist of size m*n is built, each element being the distance between the corresponding elements of the original vectors. For instance, if the L2 distance is used to calculate the distances between elements, then dist[i,j] is just the L2 distance between x[i] and y[j]. The best alignment between the two vectors is a path in the matrix from dist[1,1] to dist[m,n] whose total distance is minimal, and DTW finds it by dynamic programming. A new matrix global of size m*n is built, each element storing the minimal total distance from dist[1,1] to its position. After the previous elements are filled, global[i,j] can be calculated by adding dist[i,j] to the smallest one among global[i-1,j], global[i,j-1] and global[i-1,j-1]. Eventually, global[m,n] is the final distance of the best alignment, which is the distance between x and y calculated by DTW.
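A direct MATLAB sketch of this dynamic programming scheme follows; the function name is ours, and the variable global_ stands in for the matrix called global above, since global is a reserved word in MATLAB.

    function d = dtwDistance(X, Y)
        % X and Y are MFCC matrices with one frame per row and equal numbers of columns.
        m = size(X, 1);  n = size(Y, 1);
        dist = zeros(m, n);
        for i = 1:m
            for j = 1:n
                dist(i, j) = norm(X(i, :) - Y(j, :));      % L2 distance between frames
            end
        end
        global_ = inf(m, n);
        global_(1, 1) = dist(1, 1);
        for i = 1:m
            for j = 1:n
                if i == 1 && j == 1, continue; end
                best = inf;
                if i > 1,          best = min(best, global_(i-1, j));   end
                if j > 1,          best = min(best, global_(i, j-1));   end
                if i > 1 && j > 1, best = min(best, global_(i-1, j-1)); end
                global_(i, j) = dist(i, j) + best;          % minimal cumulative distance
            end
        end
        d = global_(m, n);                                  % distance of the best alignment
    end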
With the DTW algorithm, the distances between the input sample and all 480 samples in the database are measured, and the 80 samples with the smallest distances are chosen as the most similar ones. Then the invention finds their positions in the database and determines which speaker they belong to. In the end, the speaker ID that occurs most frequently is chosen as the speaker of the input voice signal and is returned as the result of the speaker recognition.
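A sketch of this matching and voting step, built on the database cell array and the dtwDistance sketch above; the 80 nearest neighbours follow the description, while the function and variable names are ours:

    function speakerID = recognizeSpeaker(testMFCC, database, k)
        if nargin < 3, k = 80; end                  % number of nearest samples to vote with
        [numSpeakers, numTrain] = size(database);
        d  = zeros(numSpeakers * numTrain, 1);
        id = zeros(numSpeakers * numTrain, 1);
        idx = 0;
        for x = 1:numSpeakers
            for y = 1:numTrain
                idx = idx + 1;
                d(idx)  = dtwDistance(testMFCC, database{x, y});
                id(idx) = x;                        % the row index encodes the speaker ID
            end
        end
        [~, order] = sort(d);                       % smallest DTW distances first
        speakerID = mode(id(order(1:k)));           % majority vote among the k nearest samples
    end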
Putting it into use
Once it passes the test, the invention is safe to use. Our experiment achieved 100% accuracy on all test samples, with 311.0524 seconds spent on the recognition of each sample on average. To use this invention, simply enter the path of the file to recognize, and the invention will return the ID of the corresponding speaker as the answer after executing the same steps as in the test phase.
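Put together, an end-to-end call might look like the sketch below, reusing the hypothetical helpers introduced earlier (extractEffectiveSegment, computeMFCC, recognizeSpeaker) and an illustrative file path:

    load('database.mat', 'database');                 % the trained MFCC database
    [sig, fs] = audioread('data/unknown_sample.mp3'); % the file to recognize
    sig = mean(sig, 2);                               % convert to single channel
    seg = extractEffectiveSegment(sig, fs);           % endpoint detection (Steps 1-4)
    testMFCC = computeMFCC(seg, fs);                  % feature extraction
    speakerID = recognizeSpeaker(testMFCC, database)  % prints the recognized speaker's ID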
The invention still has some room for improvement. For example, it cannot perfectly deal with noises such as the sound of breathing, and it is not fast enough to recognize voice signals in large quantities. However, it is good enough for accurate speaker recognition on a small scale.

Claims (2)

1. A robust speaker recognition system based on dynamic time wrapping, characterized in that an enhanced endpoint detection method based on dual-threshold comparison is used to extract the effective speech segment, in which the short-term average energy and the short-term zero-crossing rate of frames are used to judge the effectiveness of voice signal frames.
2. The robust speaker recognition system according to claim 1, wherein the algorithm used to match voice signals is Dynamic Time Warping (DTW); this algorithm measures the similarity between two temporal voice signals of different lengths and matches the same features at different positions, and is thus better than the L2 distance.
AU2019100372A 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping Ceased AU2019100372A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019100372A AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019100372A AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Publications (1)

Publication Number Publication Date
AU2019100372A4 true AU2019100372A4 (en) 2019-05-16

Family

ID=66443161

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019100372A Ceased AU2019100372A4 (en) 2019-04-05 2019-04-05 A robust speaker recognition system based on dynamic time wrapping

Country Status (1)

Country Link
AU (1) AU2019100372A4 (en)

Similar Documents

Publication Publication Date Title
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
Zhang et al. Voicelive: A phoneme localization based liveness detection for voice authentication on smartphones
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
Singh et al. Statistical Analysis of Lower and Raised Pitch Voice Signal and Its Efficiency Calculation.
Vestman et al. Voice mimicry attacks assisted by automatic speaker verification
Korshunov et al. Cross-database evaluation of audio-based spoofing detection systems
Algabri et al. Automatic speaker recognition for mobile forensic applications
CN106409298A (en) Identification method of sound rerecording attack
Gupta et al. Gender-based speaker recognition from speech signals using GMM model
Nandyal et al. MFCC based text-dependent speaker identification using BPNN
Stefanus et al. GMM based automatic speaker verification system development for forensics in Bahasa Indonesia
Singh et al. Linear Prediction Residual based Short-term Cepstral Features for Replay Attacks Detection.
Singh et al. Combining evidences from Hilbert envelope and residual phase for detecting replay attacks
Mandalapu et al. Multilingual voice impersonation dataset and evaluation
AU2019100372A4 (en) A robust speaker recognition system based on dynamic time wrapping
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Lapidot et al. Speech Database and Protocol Validation Using Waveform Entropy.
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Nguyen et al. Vietnamese speaker authentication using deep models
Al-Hassani et al. Design a text-prompt speaker recognition system using LPC-derived features
Rahman et al. Blocking black area method for speech segmentation
Vielhauer et al. Fusion strategies for speech and handwriting modalities in HCI
Shome et al. Effect of End Point Detection on Fixed Phrase Speaker Verification

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry