CN110211594A - Speaker identification method based on twin network model and KNN algorithm - Google Patents
Speaker identification method based on twin network model and KNN algorithm
- Publication number
- CN110211594A CN110211594A CN201910494606.8A CN201910494606A CN110211594A CN 110211594 A CN110211594 A CN 110211594A CN 201910494606 A CN201910494606 A CN 201910494606A CN 110211594 A CN110211594 A CN 110211594A
- Authority
- CN
- China
- Prior art keywords
- speaker
- network model
- voice
- voice signal
- KNN algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
The invention discloses a speaker identification method based on a twin (Siamese) network model and the KNN algorithm. Step S1: use speakers' voice collected by a microphone as a data set to train an RNN model. Step S2: build a twin network from the trained RNN and identify speakers in combination with the KNN algorithm. With this technical solution, the speaker data set in the database is used for training, and the output produced for each voice signal fed into the twin network represents that speaker's features; the distance between different output feature vectors is measured by cosine distance, and the KNN algorithm judges whether they belong to the same speaker. As a result, a small number of samples suffices to identify a speaker, the network need not be retrained as the number of speakers grows, the neural network's demand for data samples is reduced, and both the real-time performance and the accuracy of speaker identification are effectively improved.
Description
Technical field
The invention belongs to the field of human-computer interaction, in particular to speaker recognition technology; specifically, it designs a speaker identification method based on a twin network model and the KNN algorithm.
Background art
In the field of human-computer interaction, with the rapid development of technologies such as artificial intelligence and pattern recognition, the interaction between people and computers has become ever closer. Traditional contact-based interaction no longer satisfies people's needs, and the study of novel interaction modes that match people's communication habits has become a research hotspot in recent years. Speaker recognition, as one of the main channels of human-computer interaction, has gradually become an important research topic in the interaction field.
Existing speaker recognition methods mainly comprise speech feature extraction with template matching, statistical speech models, and deep learning. Conventional approaches focus on speech feature extraction and template matching: a voiceprint template is trained in advance and the voiceprint to be identified is matched against it. This method is easy to operate, but its recognition accuracy is low and it needs a large number of data samples. Methods based on statistical speech models define the recognition task as computing the probability of a variable; their accuracy is high, but they require a large amount of data for verification. Methods based on deep learning use a neural network to capture the speaker's hidden internal features, which can represent the speaker well, but they not only need massive data but also require retraining the neural network every time the data set is updated, which is unfavorable for adding new data.
Summary of the invention
To address the prior art's need for a large number of speech samples, the object of the present invention is to provide a speaker identification method based on a twin network model and the KNN algorithm. A speaker's voice is collected by a microphone, and an update strategy for speaker information is designed by combining a twin RNN network with the KNN algorithm, so that a small amount of data suffices for speaker identification and recognition becomes faster and more efficient. The specific technical solution is as follows:
A speaker identification method based on a twin network model and KNN classification, comprising the following steps:
Step S1: use a speaker's voice collected by a microphone as a data set to train an RNN model;
Step S2: build a twin network from the trained RNN and identify the speaker in combination with the KNN algorithm;
Wherein, step S1 further comprises:
Step S11: collect a large voice data set and perform data preprocessing;
Step S12: store the preprocessed voice data set in a speech database;
Step S13: obtain the voice signal data set from the speech database and, based on the way a voice signal changes over time, extract the feature vectors of the voice signal with a feature extraction method;
Step S14: train the RNN model with the speech feature vectors v extracted in step S13 using back-propagation through time (BPTT) to obtain the optimal parameters Θ and the initial model;
Step S13 further comprises:
Step S131: let x be the set of one speech segment over a period t; split it into frames with a frame length of 25 ms to obtain the discrete voice signals x_1, x_2, …, x_t over time t;
Step S132: take the input X = {x_1, x_2, …, x_t} from step S131 and extract features from the discrete signals with MFCC, yielding the 40-dimensional speech feature vectors V = {v_1, v_2, …, v_t};
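The framing and feature extraction of steps S131-S132 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate, the 10 ms hop, and the truncated log-magnitude spectrum used as a stand-in for the 40-dimensional MFCC features are all assumptions (a real MFCC pipeline would additionally apply a mel filterbank and a DCT).

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames (25 ms frames, 10 ms hop assumed)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def spectral_features(frames, n_dims=40):
    """Hypothetical 40-dim per-frame feature: log magnitude spectrum, truncated.
    Stands in for the MFCC extraction named in step S132."""
    spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    return np.log(spec[:, :n_dims] + 1e-8)

x = np.random.randn(16000)        # 1 s of audio at an assumed 16 kHz
frames = frame_signal(x)          # shape (98, 400): 400 samples per 25 ms frame
V = spectral_features(frames)     # shape (98, 40): one 40-dim vector per frame
```

Each row of `V` then plays the role of one feature vector v_t in the sequence fed to the RNN.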
Step S14 further comprises:
Step S141: each voice signal is correlated over time, so the input at time t of a segment contains v_t; the RNN model remembers the voice state s_{t-1} of the previous moment, and the hidden layer h_t at each moment depends on the current input and the previous state. The formula is as follows:
h_t = U v_t + W s_{t-1}
Step S142: for the current time t, the state s_t depends on the hidden layer at that moment, i.e. s_t = f(h_t). The activation function f here is tanh, which fits voice signals well; substituting the hidden-layer value of that moment gives:
s_t = tanh(U v_t + W s_{t-1})
Step S143: the output vector f_t at the current time t is f_t = g(V s_t); the outputs over the whole segment form the output vector F;
Step S144: let the output vector be F = {f_1, f_2, …, f_t} and the shared parameters of the RNN model be Θ = {W, U, V}; the loss function L(Θ) is obtained by summing, over the whole time span, the difference between the output values and the true values;
Step S145: differentiate the resulting loss function L(Θ) with respect to W, U and V respectively using the back-propagation algorithm to obtain the optimal parameters Θ and the initial model;
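The forward recurrence of steps S141-S143 can be sketched numerically as below. The layer sizes and the random parameter initialisation are illustrative assumptions (the patent does not state dimensions), g is taken as the identity, and the output matrix is named `V_out` to avoid clashing with the feature sequence V:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 40, 64, 32, 10   # assumed dimensions and sequence length

# Shared parameters Theta = {W, U, V} of the recurrence (V here called V_out)
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
V_out = rng.normal(scale=0.1, size=(d_out, d_h))

def rnn_forward(feats):
    """Steps S141-S143: h_t = U v_t + W s_{t-1}; s_t = tanh(h_t); f_t = V_out s_t."""
    s = np.zeros(d_h)                  # initial state s_0
    outputs = []
    for v_t in feats:
        h_t = U @ v_t + W @ s          # hidden pre-activation (S141)
        s = np.tanh(h_t)               # state update with tanh (S142)
        outputs.append(V_out @ s)      # per-step output f_t (S143, g = identity)
    return np.stack(outputs)           # F = {f_1, ..., f_t}

feats = rng.normal(size=(T, d_in))     # one sequence of 40-dim feature vectors
F = rnn_forward(feats)                 # shape (10, 32)
```

Training (steps S144-S145) would then differentiate a loss over F with respect to the shared {W, U, V} by back-propagation through time, which this sketch omits.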
Step S2 further comprises:
Step S21: build the twin network from the trained RNN model with shared network parameters; input multiple different voice signals X_0, …, X_n respectively and predict the output vector result set FS = {F_0, …, F_n} of the voice signals;
Step S22: from the output vector set FS obtained in the previous step, calculate the cosine distance between the different output feature vectors and apply the KNN algorithm to determine whether the voices belong to the same person;
Step S22 further comprises:
Step S221: pass the voice signals through the twin network model to obtain the output vectors F_0, F_1, …, F_n, expressed with the classified vectors, where F_1 = {f_11, f_12, …, f_1t} and F_n = {f_n1, f_n2, …, f_nt} denote voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the voice signal of the speaker to be identified;
Step S222: score the similarity with cosine distance to judge whether two signals belong to the same speaker; the similarity of two speakers is reflected not in the lengths of the two vectors but only in the angle between them. The formula is as follows:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ ‖F_i‖)
Step S223: with the cosine distances between the different voice signals calculated in step S222, use the KNN algorithm to find the point nearest to F_0, which is taken as the same speaker.
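Steps S222-S223 can be sketched as follows, using toy 2-D embeddings and the 1-nearest-neighbour special case of KNN; the speaker labels are hypothetical placeholders:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-only similarity of step S222; independent of vector length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_speaker(F0, refs, labels):
    """Step S223 with k=1: assign F0 the label of the closest reference vector."""
    sims = [cosine_similarity(F0, Fi) for Fi in refs]
    return labels[int(np.argmax(sims))]

refs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # toy reference embeddings
labels = ["speaker_A", "speaker_B"]                    # hypothetical labels
print(nearest_speaker(np.array([0.9, 0.1]), refs, labels))  # speaker_A
```

Because cosine similarity ignores vector length, embeddings of voice segments of different durations remain directly comparable, which matches the patent's claim of arbitrary-length input.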
Compared with the prior art, the invention has the following beneficial effects:
1. The invention designs a speaker identification method in which each speaker class needs only one or a few training samples, and samples can be added or removed freely. Instead of directly training an output classification model, it trains a similarity function, so a small number of samples is enough to identify a speaker accurately and quickly.
2. The invention decomposes a continuous voice signal into discrete speech signal vectors. Traditional speaker recognition requires input voice signals of equal length, whereas the present invention accepts voice signals of arbitrary length, which simplifies its use.
3. The network proposed in the invention extends the two-channel input of the traditional twin network to multi-channel input, so speakers can be identified more quickly.
4. The invention judges the similarity between different speakers with the KNN algorithm, based on the similarity within the same speaker and between different speakers, to decide whether two signals belong to the same speaker.
Description of the drawings
Fig. 1 is a framework flow chart of the speaker identification method based on a twin network model and KNN classification provided by the invention;
Fig. 2 is a detailed flow chart of speech feature extraction in the method;
Fig. 3 is the structure of the deep recurrent neural network used in the method;
Fig. 4 is the structure of the twin network constructed in the method;
Fig. 5 is a detailed flow chart of the twin network and the KNN algorithm in the method;
Specific embodiments
The technical solution provided by the invention is further described below with reference to the drawings.
In real life, as people join and leave, we want to identify speakers from their voice alone; whenever a person is added, their voice signal must be added too, and each new voice would require retraining the existing model, which is unfavorable for updating. The twin-network-based model proposed by the invention only requires the newly added voice signal to be fed into the network; the similarity between it and the voice to be identified, obtained by the twin network, is then used for discrimination.
The present invention provides a system for speaker identification based on a twin network model and KNN classification, as shown in Fig. 1. Overall, the invention comprises two major steps. Step S1: use a speaker's voice collected by a microphone as a data set to train an RNN model. Step S2: build a twin network from the trained RNN and identify the speaker in combination with the KNN algorithm.
As shown in Fig. 2, the collected large voice data set is preprocessed: the obtained voice signals undergo pre-emphasis, framing and Fourier transformation to obtain the 40-dimensional speech feature vectors v;
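The pre-emphasis mentioned above is a standard first-order high-pass filter; a minimal sketch follows, where the coefficient 0.97 is a common default assumed here (the patent does not state a value):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1], boosting high
    frequencies before framing; alpha = 0.97 is an assumed typical value."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

x = np.array([1.0, 1.0, 1.0])
print(pre_emphasis(x))   # [1.0, 0.03, 0.03]
```

A constant (low-frequency) signal is strongly attenuated while rapid changes pass through, which flattens the spectral tilt of speech before the Fourier transform.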
As shown in Fig. 3, the obtained 40-dimensional speech feature vectors are input into the RNN model for training to obtain the initial model.
Step S141: each voice signal is correlated over time, so the input at time t of a segment contains v_t; the RNN model remembers the voice state s_{t-1} of the previous moment, and the hidden layer h_t at each moment depends on the current input and the previous state. The formula is as follows:
h_t = U v_t + W s_{t-1}
Step S142: for the current time t, the state s_t depends on the hidden layer at that moment, i.e. s_t = f(h_t). The activation function f here is tanh, which fits voice signals well; substituting the hidden-layer value of that moment gives:
s_t = tanh(U v_t + W s_{t-1})
Step S143: the output vector f_t at the current time t is f_t = g(V s_t); the outputs over the whole segment form the output vector F;
Step S144: let the output vector be F = {f_1, f_2, …, f_t} and the shared parameters of the RNN model be Θ = {W, U, V}; the loss function L(Θ) is obtained by summing, over the whole time span, the difference between the output values and the true values;
Step S145: differentiate the resulting loss function L(Θ) with respect to W, U and V respectively using the back-propagation algorithm to obtain the optimal parameters Θ and the initial model;
With the twin network structure shown in Fig. 4, the dual input of the twin network is extended to multi-input: at each test, n voice signal segments are input, of which one is the segment to be identified and n-1 are reference voice samples, as shown in Fig. 5. The recurrent neural network (RNN) of Fig. 3 extracts the feature vector of the voice signal to be identified; the spatial distances to the feature vectors of the reference voice signals are then measured, and the KNN nearest-neighbour algorithm assigns the signal to be identified the label of the nearest class in that space, thereby realizing speaker identification.
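The key property of the multi-input twin network described above is that every branch applies the same parameters, so all n inputs map into one comparable embedding space. A minimal sketch, in which a single shared linear layer with a tanh stands in for the shared RNN branch (an assumption; the patent's branch is the full RNN of Fig. 3):

```python
import numpy as np

rng = np.random.default_rng(1)
W_enc = rng.normal(scale=0.1, size=(32, 40))   # ONE shared weight matrix used
                                               # by every branch of the network

def encode(x):
    """Shared-branch encoder: mean over frames, then a shared projection.
    Because all branches reuse W_enc, embeddings are directly comparable."""
    return np.tanh(W_enc @ x.mean(axis=0))

# X_0 (probe) plus three reference segments, each a (frames, 40) feature matrix
signals = [rng.normal(size=(50, 40)) for _ in range(4)]
FS = [encode(x) for x in signals]               # result set {F_0, ..., F_n}
```

Adding a new speaker then only means encoding one more reference segment with the existing shared weights; no retraining is required, which is the update property the patent emphasises.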
Step S221: pass the voice signals through the twin network model to obtain the output vectors F_0, F_1, …, F_n, expressed with the classified vectors, where F_1 = {f_11, f_12, …, f_1t} and F_n = {f_n1, f_n2, …, f_nt} denote voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the voice signal of the speaker to be identified;
Step S222: score the similarity with cosine distance to judge whether two signals belong to the same speaker; the similarity of two speakers is reflected not in the lengths of the two vectors but only in the angle between them. The formula is as follows:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ ‖F_i‖)
Step S223: with the cosine distances between the different voice signals calculated in step S222, use the KNN algorithm to find the point nearest to F_0, which is taken as the same speaker.
Claims (4)
1. A speaker identification method based on a twin network model and the KNN algorithm, characterized by comprising the following steps:
Step S1: use a speaker's voice collected by a microphone as a data set to train an RNN model;
Step S2: build a twin network from the trained RNN and identify the speaker in combination with the KNN algorithm;
Wherein, step S1 is specifically as follows:
Step S11: collect a large voice data set and perform data preprocessing;
Step S12: store the preprocessed voice data set in a speech database;
Step S13: obtain the voice signal data set from the speech database and, based on the way a voice signal changes over time, extract the feature vectors v of the voice signal with a feature extraction method;
Step S14: train the RNN model with the speech feature vectors v extracted in step S13 using back-propagation through time (BPTT) to obtain the optimal parameters Θ and the initial model;
Step S2 is specifically as follows:
Step S21: build the twin network model from the trained RNN with shared parameters; input multiple different voice signals X_0, …, X_n respectively and predict the output vector result set FS = {F_0, …, F_n} of the voice signals;
Step S22: from the output vector set FS obtained in the previous step, calculate the cosine distance between the different output feature vectors and apply the KNN algorithm to determine whether the voices belong to the same person.
2. The speaker identification method based on a twin network model and the KNN algorithm according to claim 1, characterized in that step S13 is specifically as follows:
Step S131: let x be the set of one speech segment over a period t; split it into frames with a frame length of 25 ms to obtain the discrete voice signals X = x_1, x_2, …, x_t over time t;
Step S132: take the input X = {x_1, x_2, …, x_t} from step S131 and extract features from the discrete signals with MFCC, yielding the 40-dimensional speech feature vectors V = {v_1, v_2, …, v_t}.
3. The speaker identification method based on a twin network model and the KNN algorithm according to claim 1, characterized in that step S14 is specifically as follows:
Step S141: each voice signal is correlated over time, so the input at time t of a segment contains the feature vector v_t; the RNN model remembers the voice state s_{t-1} of the previous moment, and the hidden layer h_t at each moment depends on the current input and the previous state. The formula is as follows:
h_t = U v_t + W s_{t-1};
Step S142: for the current time t, the state s_t depends on the hidden layer at that moment, i.e. s_t = f(h_t). The activation function f here is tanh, which fits voice signals well; substituting the hidden-layer value of that moment gives:
s_t = tanh(U v_t + W s_{t-1});
Step S143: the output vector f_t at the current time t is f_t = g(V s_t); the outputs over the whole segment form the output vector F;
Step S144: let the output vector be F = {f_1, f_2, …, f_t} and the shared parameters of the RNN model be Θ = {W, U, V}; the loss function L(Θ) is obtained by summing, over the whole time span, the difference between the output values and the true values;
Step S145: differentiate the resulting loss function L(Θ) with respect to W, U and V respectively using the back-propagation algorithm to obtain the optimal parameters Θ and the initial model.
4. The speaker identification method based on a twin network model and the KNN algorithm according to claim 1, characterized in that step S22 is specifically as follows:
Step S221: pass the voice signals through the twin RNN network model to obtain the output vectors F_0, F_1, …, F_n, expressed with the classified vectors, where F_1 = {f_11, f_12, …, f_1t} and F_n = {f_n1, f_n2, …, f_nt} denote voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the voice signal of the speaker to be identified;
Step S222: score the similarity with cosine distance to judge whether two signals belong to the same speaker; the similarity of two speakers is reflected not in the lengths of the two vectors but only in the angle between them. The formula is as follows:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ ‖F_i‖);
Step S223: with the cosine distances between the different voice signals calculated in step S222, use the KNN algorithm to find the point nearest to F_0, which is taken as the same speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910494606.8A CN110211594B (en) | 2019-06-06 | 2019-06-06 | Speaker identification method based on twin network model and KNN algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910494606.8A CN110211594B (en) | 2019-06-06 | 2019-06-06 | Speaker identification method based on twin network model and KNN algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110211594A true CN110211594A (en) | 2019-09-06 |
CN110211594B CN110211594B (en) | 2021-05-04 |
Family
ID=67791537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910494606.8A Active CN110211594B (en) | 2019-06-06 | 2019-06-06 | Speaker identification method based on twin network model and KNN algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110211594B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569908A (en) * | 2019-09-10 | 2019-12-13 | 苏州思必驰信息科技有限公司 | Speaker counting method and system |
CN110767239A (en) * | 2019-09-20 | 2020-02-07 | 平安科技(深圳)有限公司 | Voiceprint recognition method, device and equipment based on deep learning |
CN111048097A (en) * | 2019-12-19 | 2020-04-21 | 中国人民解放军空军研究院通信与导航研究所 | Twin network voiceprint recognition method based on 3D convolution |
CN111126563A (en) * | 2019-11-25 | 2020-05-08 | 中国科学院计算技术研究所 | Twin network-based space-time data target identification method and system |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN112270931A (en) * | 2020-10-22 | 2021-01-26 | 江西师范大学 | Method for carrying out deceptive voice detection based on twin convolutional neural network |
CN113903043A (en) * | 2021-12-11 | 2022-01-07 | 绵阳职业技术学院 | Method for identifying printed Chinese character font based on twin metric model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107170445A (en) * | 2017-05-10 | 2017-09-15 | 重庆大学 | The parkinsonism detection means preferably differentiated is cooperateed with based on voice mixing information characteristics |
CN108492294A (en) * | 2018-03-23 | 2018-09-04 | 北京邮电大学 | A kind of appraisal procedure and device of image color harmony degree |
CN109065032A (en) * | 2018-07-16 | 2018-12-21 | 杭州电子科技大学 | A kind of external corpus audio recognition method based on depth convolutional neural networks |
CN109243467A (en) * | 2018-11-14 | 2019-01-18 | 龙马智声(珠海)科技有限公司 | Sound-groove model construction method, method for recognizing sound-groove and system |
US20190035431A1 (en) * | 2017-07-28 | 2019-01-31 | Adobe Systems Incorporated | Apparatus, systems, and methods for integrating digital media content |
CN109543009A (en) * | 2018-10-17 | 2019-03-29 | 龙马智芯(珠海横琴)科技有限公司 | Text similarity assessment system and text similarity appraisal procedure |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107170445A (en) * | 2017-05-10 | 2017-09-15 | 重庆大学 | The parkinsonism detection means preferably differentiated is cooperateed with based on voice mixing information characteristics |
US20190035431A1 (en) * | 2017-07-28 | 2019-01-31 | Adobe Systems Incorporated | Apparatus, systems, and methods for integrating digital media content |
CN108492294A (en) * | 2018-03-23 | 2018-09-04 | 北京邮电大学 | A kind of appraisal procedure and device of image color harmony degree |
CN109065032A (en) * | 2018-07-16 | 2018-12-21 | 杭州电子科技大学 | A kind of external corpus audio recognition method based on depth convolutional neural networks |
CN109543009A (en) * | 2018-10-17 | 2019-03-29 | 龙马智芯(珠海横琴)科技有限公司 | Text similarity assessment system and text similarity appraisal procedure |
CN109243467A (en) * | 2018-11-14 | 2019-01-18 | 龙马智声(珠海)科技有限公司 | Sound-groove model construction method, method for recognizing sound-groove and system |
Non-Patent Citations (4)
Title |
---|
CHENG ZHANG: "Siamese neural network based gait recognition for human identification", ICASSP *
YU Q: "Sketch-a-net that beats humans", British Machine Vision Conference *
DING Meiyu (丁美玉): "Sketch recognition method based on temporal features", Computer Science (计算机科学) *
MA Yuejie (马月洁): "Research on football player tracking algorithm based on deep learning", Journal of Communication University of China (Natural Science Edition) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569908A (en) * | 2019-09-10 | 2019-12-13 | 苏州思必驰信息科技有限公司 | Speaker counting method and system |
CN110569908B (en) * | 2019-09-10 | 2022-05-13 | 思必驰科技股份有限公司 | Speaker counting method and system |
CN110767239A (en) * | 2019-09-20 | 2020-02-07 | 平安科技(深圳)有限公司 | Voiceprint recognition method, device and equipment based on deep learning |
CN111126563A (en) * | 2019-11-25 | 2020-05-08 | 中国科学院计算技术研究所 | Twin network-based space-time data target identification method and system |
CN111126563B (en) * | 2019-11-25 | 2023-09-29 | 中国科学院计算技术研究所 | Target identification method and system based on space-time data of twin network |
CN111048097A (en) * | 2019-12-19 | 2020-04-21 | 中国人民解放军空军研究院通信与导航研究所 | Twin network voiceprint recognition method based on 3D convolution |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
WO2022007766A1 (en) * | 2020-07-06 | 2022-01-13 | 北京世纪好未来教育科技有限公司 | Speaker recognition method and apparatus, electronic device, and storage medium |
CN111785287B (en) * | 2020-07-06 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
US11676609B2 (en) | 2020-07-06 | 2023-06-13 | Beijing Century Tal Education Technology Co. Ltd. | Speaker recognition method, electronic device, and storage medium |
CN112270931A (en) * | 2020-10-22 | 2021-01-26 | 江西师范大学 | Method for carrying out deceptive voice detection based on twin convolutional neural network |
CN113903043A (en) * | 2021-12-11 | 2022-01-07 | 绵阳职业技术学院 | Method for identifying printed Chinese character font based on twin metric model |
CN113903043B (en) * | 2021-12-11 | 2022-05-06 | 绵阳职业技术学院 | Method for identifying printed Chinese character font based on twin metric model |
Also Published As
Publication number | Publication date |
---|---|
CN110211594B (en) | 2021-05-04 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |