CN110211594B - Speaker identification method based on twin network model and KNN algorithm - Google Patents

Speaker identification method based on twin network model and KNN algorithm

Info

Publication number
CN110211594B
CN110211594B
Authority
CN
China
Prior art keywords
speaker
network model
voice
knn algorithm
twin network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910494606.8A
Other languages
Chinese (zh)
Other versions
CN110211594A (en)
Inventor
Zhang Li (张莉)
Li Wenjun (李文钧)
Li Zhu (李竹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910494606.8A priority Critical patent/CN110211594B/en
Publication of CN110211594A publication Critical patent/CN110211594A/en
Application granted granted Critical
Publication of CN110211594B publication Critical patent/CN110211594B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a speaker identification method based on a twin network model and a KNN algorithm, comprising the following steps. Step S1: use a microphone to collect speaker voice information as a data set and train an RNN network model. Step S2: construct a twin network model from the trained RNN and identify the speaker in combination with the KNN algorithm. Under this scheme, the network is trained on the speaker data set in the database so that each voice signal fed into the twin network is output as a feature vector characterizing its speaker; the cosine distance measures the separation between the output feature vectors, and the KNN algorithm judges whether two utterances belong to the same speaker. As a result, speakers can be identified from only a small number of samples, the network does not need retraining as the number of speakers grows, the neural network's demand for data samples is reduced, and the real-time performance and accuracy of speaker identification are effectively improved.

Description

Speaker identification method based on twin network model and KNN algorithm
Technical Field
The invention belongs to the technical field of human-computer interaction, in particular to speaker recognition, and specifically relates to a speaker recognition method based on a twin network model and a KNN algorithm.
Background
In the field of human-computer interaction, with the rapid development of artificial intelligence, pattern recognition and related technologies, interaction between people and computers has become ever closer. Traditional contact-based interaction can no longer meet users' needs, and research into new interaction modes that match natural human communication habits has become a hotspot in recent years. Speaker recognition, as one of the main channels of human-computer interaction, is gradually becoming an important research topic in the field.
Existing speaker recognition methods fall mainly into three categories: voice feature extraction with template matching, statistical voice models, and deep learning. Traditional work has focused on feature extraction and template matching: recognized voiceprint samples are trained in advance, and the voiceprint to be identified is matched against them. This approach is simple to operate, but its accuracy is limited and it requires a large number of data samples. Statistical-model methods frame the recognition task as computing the probability of variables; they achieve high recognition accuracy but need large amounts of data for verification. Deep-learning methods use a neural network to capture hidden features of the speaker so as to represent the speaker better, but they too require large amounts of data, and the network must be retrained every time the data set is updated, which hinders the enrolment of new speakers.
Disclosure of Invention
Aiming at the prior art's need for large numbers of voice samples, the invention provides a speaker recognition method based on a twin network model and a KNN algorithm. Speaker voice information is collected through a microphone device, and a strategy combining a twin RNN (recurrent neural network) with the KNN (k-nearest-neighbor) algorithm handles updates of speaker information, so that speakers can be identified quickly and efficiently from a small amount of data. The specific technical scheme is as follows:
a speaker recognition method based on twin network model and KNN classification comprises the following steps:
step S1: using a microphone to collect voice information of a speaker as a data set to train an RNN network model;
step S2: constructing a twin network model by using the trained RNN and identifying the speaker by combining a KNN algorithm;
wherein the step S1 further includes:
step S11: acquiring a large number of voice data sets and carrying out data preprocessing;
step S12: storing the preprocessed voice data set in a voice database;
step S13: acquiring a voice signal data set from a voice database, and acquiring a feature vector of a voice signal by using a feature extraction method based on the rule that the voice signal changes along with time;
step S14: training the RNN network model with the back-propagation-through-time (BPTT) algorithm on the voice signal feature vectors v extracted in step S13, obtaining the optimal parameters θ and the initial model;
the step S13 further includes:
step S131: let X be the set of samples of a speech segment over a period of time t; framing it with a frame length of 25 ms yields the discrete speech signal X = {x_1, x_2, …, x_t};
Step S132: taking the input X = {x_1, x_2, …, x_t} from step S131, extract features from the discrete signal with MFCC, obtaining the 40-dimensional speech feature vector sequence V = {v_1, v_2, …, v_t};
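A minimal sketch of steps S131 and S132, assuming the librosa library and a 16 kHz recording; the 25 ms frame length follows step S131, while the 10 ms hop and the helper name are illustrative assumptions:

```python
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=40):
    """Frame a speech signal and extract the 40-dimensional MFCC sequence V."""
    y, sr = librosa.load(wav_path, sr=sr)      # the raw speech signal X
    frame_len = int(0.025 * sr)                # 25 ms frames (step S131)
    hop_len = int(0.010 * sr)                  # 10 ms hop (assumed)
    # V = {v_1, v_2, ..., v_t}: one 40-dimensional vector per frame (step S132)
    V = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame_len, hop_length=hop_len)
    return V.T                                 # shape (t, 40)
```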
The step S14 further includes:
step S141: each speech signal is correlated in time; the input at time t contains v_t, and the RNN network model memorizes the previous speech state s_{t-1}. The hidden layer h_t at each instant depends on both the input at the current time and the state at the previous time, and is formulated as follows:
h_t = U·v_t + W·s_{t-1}
step S142: the state s_t at the current time t is a function of the hidden layer at that instant, s_t = f(h_t), where the activation function f is the tanh function, which fits the speech signal well; substituting the hidden-layer value gives:
s_t = tanh(U·v_t + W·s_{t-1})
step S143: the output vector f_t at the current time t is f_t = g(V·s_t); processing the whole segment finally yields the output vector F of a section of speech;
step S144: let the output vector be F = {f_1, f_2, …, f_t} and let θ = {W, U, V} denote the parameters shared by the RNN network model; the loss function of the RNN network model is obtained by summing, over the total time, the differences between the output values f_i and the true values y_i:
L(θ) = Σ_{i=1..t} ℓ(f_i, y_i)
step S145: the obtained loss function L(θ) is differentiated with respect to W, U and V by the back-propagation algorithm, thereby obtaining the optimal parameters θ and the initial model;
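A minimal NumPy sketch of the forward pass of steps S141 to S144 under the shared parameters θ = {W, U, V}; the output nonlinearity g and the per-step loss ℓ are not fixed by the text, so softmax and cross-entropy are assumptions here, and the derivatives of step S145 would in practice be obtained with an autograd framework rather than by hand:

```python
import numpy as np

def rnn_forward(V_seq, U, W, Vo, s0=None):
    """V_seq: (t, 40) feature sequence; returns the outputs F = {f_1 ... f_t}."""
    s = np.zeros(W.shape[0]) if s0 is None else s0   # previous state s_{t-1}
    F = []
    for v in V_seq:
        h = U @ v + W @ s                  # h_t = U.v_t + W.s_{t-1} (step S141)
        s = np.tanh(h)                     # s_t = tanh(h_t)          (step S142)
        z = Vo @ s                         # V.s_t; Vo named to avoid clashing with V_seq
        f = np.exp(z - z.max()); f /= f.sum()   # f_t = g(V.s_t), g = softmax (assumed)
        F.append(f)
    return np.stack(F)                     # F = {f_1, f_2, ..., f_t} (step S143)

def loss(F, Y):
    """L(theta) = sum_i l(f_i, y_i); per-step cross-entropy is assumed (step S144)."""
    return -np.sum(Y * np.log(F + 1e-12))  # Y: one-hot targets, same shape as F
```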
the step S2 further includes:
step S21: using the trained RNN network model to construct a twin network model whose branches share the same network parameters, and respectively inputting several different voice signals X_0, …, X_n; the set of predicted output vectors for the speech signals is FS = {F_0, …, F_n};
Step S22: calculating the cosine distances between the different output feature vectors according to the output set FS obtained in the previous step, and applying the KNN algorithm to determine whether the voices belong to the same person;
the step S22 further includes:
step S221: obtaining the output vectors F_0, F_1, …, F_n from the voice signals passed through the twin network model and using these vectors as the weight representation for classification, wherein F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested;
step S222: judging whether two signals belong to the same speaker by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
step S223: feeding the cosine distances between the different voice signals calculated in step S222 to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
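A minimal sketch of steps S221 to S223, assuming each utterance has already been reduced to a single embedding vector (for instance the last output f_t); the helper names and the choice k = 3 are illustrative, not taken from the patent:

```python
import numpy as np

def cosine_score(a, b):
    """cos(F_0, F_i): depends only on the angle between the vectors (step S222)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_identify(F0, refs, labels, k=3):
    """Return the speaker label whose samples lie closest to F_0 (step S223)."""
    scored = sorted(zip(refs, labels),
                    key=lambda ref: cosine_score(F0, ref[0]),
                    reverse=True)              # high cosine = small distance
    top = [lab for _, lab in scored[:k]]
    return max(set(top), key=top.count)        # majority vote among the k nearest
```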
Compared with the prior art, the invention has the following beneficial effects:
1. The invention designs a speaker identification method in which only one or a small number of training samples, themselves subject to variability, are provided for each speaker category. Rather than directly training an output classification model, the method trains the model's similarity function, so that speakers can be identified accurately and quickly from a small number of samples.
2. The invention decomposes a continuous voice signal into discrete voice-signal vectors, whereas traditional speaker recognition requires input voice signals of equal length.
3. The proposed network extends the two-channel input of the traditional twin network to multi-channel input, so that speakers can be identified more quickly.
4. The invention designs a method that scores the similarity between utterances of the same speaker and of different speakers and uses the KNN algorithm to judge whether they belong to the same speaker.
Drawings
FIG. 1 is a block flow diagram of a twin network model and KNN classification based speaker recognition method according to the present invention;
FIG. 2 is a detailed flowchart of the speech feature extraction of a twin network model and KNN algorithm based speaker recognition method according to the present invention;
FIG. 3 is a deep circular neural network structure of a speaker recognition method based on a twin network model and a KNN algorithm according to the present invention;
FIG. 4 shows a twin network structure constructed in the speaker recognition method based on the twin network model and KNN algorithm according to the present invention;
FIG. 5 is a detailed flowchart of a twin network and KNN algorithm in a speaker recognition method based on a twin network model and a KNN algorithm according to the present invention;
Detailed Description
The technical solution provided by the present invention will be further explained with reference to the accompanying drawings.
In real life the set of speakers keeps changing as people arrive and leave. To recognize speakers by voice, each newly added person's voice signal must be enrolled, yet existing models must be retrained from scratch whenever new voices are added, which makes updating difficult. The invention therefore uses a twin network model to judge the similarity between a newly added voice signal and the voice to be tested.
The present invention provides a speaker recognition system based on a twin network model and KNN, as shown in FIG. 1. Overall, the invention comprises two major steps. Step S1: using a microphone to collect the speaker's voice information as a data set and train an RNN network model. Step S2: constructing a twin network model with the trained RNN and identifying the speaker in combination with the KNN algorithm.
Referring to FIG. 2, the large collection of acquired voice data is preprocessed: the collected voice signals undergo pre-emphasis, framing and Fourier transformation to obtain the 40-dimensional voice feature vectors v, as sketched below.
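A one-line sketch of the pre-emphasis stage named above; the coefficient 0.97 is a common default and an assumption, not a value fixed by the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Boost high frequencies before framing: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```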
Referring to the RNN network model shown in FIG. 3, the 40-dimensional speech feature vectors thus obtained are fed into the RNN network model for training, yielding the initial model.
Step S141: each speech signal is correlated in time; the input at time t contains v_t, and the RNN network model memorizes the previous speech state s_{t-1}. The hidden layer h_t at each instant depends on both the input at the current time and the state at the previous time, and is formulated as follows:
h_t = U·v_t + W·s_{t-1}
Step S142: the state s_t at the current time t is a function of the hidden layer at that instant, s_t = f(h_t), where the activation function f is the tanh function, which fits the speech signal well; substituting the hidden-layer value gives:
s_t = tanh(U·v_t + W·s_{t-1})
Step S143: the output vector f_t at the current time t is f_t = g(V·s_t); processing the whole segment finally yields the output vector F of a section of speech.
Step S144: let the output vector be F = {f_1, f_2, …, f_t} and let θ = {W, U, V} denote the parameters shared by the RNN network model; the loss function of the RNN network model is obtained by summing, over the total time, the differences between the output values f_i and the true values y_i:
L(θ) = Σ_{i=1..t} ℓ(f_i, y_i)
Step S145: the obtained loss function L(θ) is differentiated with respect to W, U and V by the back-propagation algorithm, thereby obtaining the optimal parameters θ and the initial model.
With the twin network structure shown in FIG. 4, the two-channel input of the twin network is extended to multi-channel input: n sections of voice signal are input at each test, namely one section of voice to be tested and n-1 sections of reference voice samples, as shown in FIG. 5. The recurrent neural network (RNN) of FIG. 3 extracts the feature vector of the voice signal under test, the distances to the feature vectors of the reference voice signals are measured, and the KNN nearest-neighbor algorithm then assigns the label under test to the class at the closest spatial distance, thereby realizing speaker recognition.
Step S221: the voice signals passed through the twin network model yield the output vectors F_0, F_1, …, F_n, which are used as the weight representation for classification, where F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested.
Step S222: whether two signals belong to the same speaker is judged by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
Step S223: the cosine distances between the different voice signals calculated in step S222 are fed to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
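Putting the stages together, a hedged end-to-end sketch of the inference pipeline of FIGS. 4 and 5, reusing the helpers sketched earlier (extract_features, rnn_forward and knn_identify are assumed names, and the last per-frame output is taken as the utterance embedding):

```python
def identify_speaker(probe_wav, reference_wavs, labels, params):
    """One probe branch plus n-1 reference branches, all sharing the weights."""
    U, W, Vo = params                            # shared across every branch
    embed = lambda path: rnn_forward(extract_features(path), U, W, Vo)[-1]
    F0 = embed(probe_wav)                        # voice signal under test
    refs = [embed(p) for p in reference_wavs]    # reference voice samples
    return knn_identify(F0, refs, labels)        # nearest speaker wins
```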

Claims (3)

1. A speaker recognition method based on a twin network model and a KNN algorithm is characterized by comprising the following steps:
step S1: using a microphone to collect voice information of a speaker as a data set to train an RNN network model;
step S2: constructing a twin network model by using the trained RNN and identifying the speaker by combining a KNN algorithm;
wherein, the step S1 is as follows:
step S11: acquiring a large number of voice data sets and carrying out data preprocessing;
step S12: storing the preprocessed voice data set in a voice database;
step S13: acquiring a voice signal data set from a voice database, and acquiring a feature vector v of a voice signal by using a feature extraction method based on the rule that the voice signal changes along with time;
step S14: training the RNN network model with the back-propagation-through-time (BPTT) algorithm on the voice signal feature vectors v extracted in step S13, obtaining the optimal parameters θ and the initial model;
the step S2 is specifically as follows:
step S21: constructing a twin network model by using the trained RNN network model, and respectively inputting several different voice signals X_0, …, X_n; the set of predicted output vectors for the speech signals is FS = {F_0, …, F_n};
Step S22: calculating the cosine distances between the different output feature vectors according to the output set FS obtained in the previous step, and applying the KNN algorithm to determine whether the voices belong to the same person.
2. The twin network model and KNN algorithm based speaker recognition method of claim 1,
the step S13 is specifically as follows:
step S131: let X be the set of samples of a speech segment over a period of time t; framing it with a frame length of 25 ms yields the discrete speech signal X = {x_1, x_2, …, x_t};
Step S132: taking the input X = {x_1, x_2, …, x_t} from step S131, extract features from the discrete voice signal with MFCC, obtaining the 40-dimensional speech feature vector sequence V = {v_1, v_2, …, v_t}.
3. The twin network model and KNN algorithm based speaker recognition method of claim 1,
the step S22 is specifically as follows:
step S221: obtaining the output vectors F_0, F_1, …, F_n from the voice signals passed through the twin RNN network model and using these vectors as the weight representation for classification, wherein F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested;
step S222: judging whether two signals belong to the same speaker by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
step S223: feeding the cosine distances between the different voice signals calculated in step S222 to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
CN201910494606.8A 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm Active CN110211594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910494606.8A CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910494606.8A CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Publications (2)

Publication Number Publication Date
CN110211594A CN110211594A (en) 2019-09-06
CN110211594B (en) 2021-05-04

Family

ID=67791537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910494606.8A Active CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Country Status (1)

Country Link
CN (1) CN110211594B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569908B (en) * 2019-09-10 2022-05-13 思必驰科技股份有限公司 Speaker counting method and system
CN110767239A (en) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Voiceprint recognition method, device and equipment based on deep learning
CN111126563B (en) * 2019-11-25 2023-09-29 中国科学院计算技术研究所 Target identification method and system based on space-time data of twin network
CN111048097B (en) * 2019-12-19 2022-11-29 中国人民解放军空军研究院通信与导航研究所 Twin network voiceprint recognition method based on 3D convolution
CN111785287B (en) 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN112270931B (en) * 2020-10-22 2022-10-21 江西师范大学 Method for carrying out deceptive voice detection based on twin convolutional neural network
CN113903043B (en) * 2021-12-11 2022-05-06 绵阳职业技术学院 Method for identifying printed Chinese character font based on twin metric model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170445A (en) * 2017-05-10 2017-09-15 重庆大学 The parkinsonism detection means preferably differentiated is cooperateed with based on voice mixing information characteristics
US20190035431A1 (en) * 2017-07-28 2019-01-31 Adobe Systems Incorporated Apparatus, systems, and methods for integrating digital media content
CN108492294A (en) * 2018-03-23 2018-09-04 北京邮电大学 A kind of appraisal procedure and device of image color harmony degree
CN109065032A (en) * 2018-07-16 2018-12-21 杭州电子科技大学 A kind of external corpus audio recognition method based on depth convolutional neural networks
CN109543009A (en) * 2018-10-17 2019-03-29 龙马智芯(珠海横琴)科技有限公司 Text similarity assessment system and text similarity appraisal procedure
CN109243467A (en) * 2018-11-14 2019-01-18 龙马智声(珠海)科技有限公司 Sound-groove model construction method, method for recognizing sound-groove and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Siamese neural network based gait recognition for human identification";Cheng Zhang;《ICASSP》;20161231;全文 *
"sketch-a-net that beats humans";YU Q;《British machine vision conference》;20151231;全文 *
"基于时序特征的草图识别方法";丁美玉;《计算机科学》;20181130;第45卷(第11A期);全文 *
"基于深度学习的足球球员跟踪算法研究";马月洁;《中国传媒大学学报自然科学版》;20180630;第25卷(第3期);全文 *

Also Published As

Publication number Publication date
CN110211594A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Basu et al. A review on emotion recognition using speech
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN108074576A (en) Inquest the speaker role's separation method and system under scene
JP2002014692A (en) Device and method for generating acoustic model
US11837252B2 (en) Speech emotion recognition method and system based on fused population information
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
Pardede et al. Convolutional neural network and feature transformation for distant speech recognition
Markov et al. Never-ending learning system for on-line speaker diarization
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN111968628B (en) Signal accuracy adjusting system and method for voice instruction capture
Kaur et al. An efficient speaker recognition using quantum neural network
Chinmayi et al. Emotion Classification Using Deep Learning
Espi et al. Spectrogram patch based acoustic event detection and classification in speech overlapping conditions
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system
JPH064097A (en) Speaker recognizing method
Jati et al. An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks.
Anindya et al. Development of Indonesian speech recognition with deep neural network for robotic command
Sawakare et al. Speech recognition techniques: a review
CN114898776A (en) Voice emotion recognition method of multi-scale feature combined multi-task CNN decision tree
Utomo et al. Spoken word and speaker recognition using MFCC and multiple recurrent neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant