CN105096955A - Speaker rapid identification method and system based on growing and clustering algorithm of models - Google Patents


Info

Publication number
CN105096955A
Authority
CN
China
Legal status: Granted
Application number
CN201510563935.5A
Other languages
Chinese (zh)
Other versions
CN105096955B (en)
Inventor
张晶 (Zhang Jing)
陈晓梅 (Chen Xiaomei)
郑党 (Zheng Dang)
Current Assignee
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies
Priority to CN201510563935.5A
Publication of CN105096955A
Application granted
Publication of CN105096955B
Status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rapid speaker identification method and system based on a model-growing clustering algorithm. The method comprises a model-training process and a model-identification process. Model training comprises acquiring voiceprint signals from multiple persons, including the target speakers; preprocessing every voiceprint signal and extracting voiceprint feature parameters to form multiple models; and adaptively classifying all models with the model-growing clustering algorithm. Model identification comprises acquiring the speech signal of a speaker; preprocessing it and extracting its voiceprint feature parameters; computing the likelihood of the extracted features against each class representative; selecting the class with the maximum likelihood; computing likelihood scores against all models in the selected class; and taking the model with the highest score as the identification result. With this scheme the features to be identified need not be matched against every model, so matching time is short and real-time performance is good, and the method adapts well to large model libraries.

Description

Rapid speaker identification method and system based on model-growing clustering
Technical field
The present invention relates to the field of voiceprint recognition, and more particularly to a rapid speaker identification method and system based on model-growing clustering.
Background technology
In embedded operating systems, identifying a speaker's identity by voice usually requires preprocessing the input voiceprint, transmitting the data to a server, generating a voiceprint model, matching against stored models, and finally outputting and displaying the result. Here the voiceprint model is a Gaussian mixture model (GMM), trained with the EM algorithm; a GMM is compactly written as the triple λ = (ω, μ, Σ). A GMM describes a speaker's speech as a weighted combination of several Gaussian components, and the expectation-maximization (EM) algorithm iteratively updates the parameters toward a local maximum of the likelihood, yielding a statistical model of the speech. The book "Speaker Identification Models and Methods" by Wu Chaohui and Yang Yingchun describes the GMM and EM algorithms in detail. Traditional recognition methods must match the speech features to be identified against every model in the model library; once the library grows large, matching takes longer and longer, recognition slows down, the system may even be overwhelmed, and real-time operation cannot be guaranteed.
Summary of the invention
The present invention aims to solve the above technical problems at least to some extent.
The primary object of the present invention is to overcome the long matching time and poor real-time performance of the prior art described above by providing a rapid speaker identification method based on model-growing clustering whose matching time is short and whose real-time performance is good.
A further object of the present invention is to provide a rapid speaker identification system based on model-growing clustering with the same short matching time and good real-time performance.
To solve the above technical problems, the technical scheme of the present invention is as follows:
A rapid speaker identification method based on model-growing clustering comprises model training and model identification.
Model training comprises the following steps:
S1: Acquire the voiceprint signals of multiple persons, including the target speakers;
S2: Preprocess each voiceprint signal; preprocessing comprises, in order, pre-emphasis, framing, windowing, and endpoint detection;
S3: Extract voiceprint feature parameters from each voiceprint signal to form multiple models;
S4: Adaptively classify all models with the model-growing clustering algorithm; adaptive classification comprises class-representative initialization, class-representative authorization, and class-representative election.
Model Identification comprises the following steps:
S5: Acquire the speech signal of a speaker; this is the speech signal to be identified;
S6: Preprocess the speech signal to be identified and extract its voiceprint feature parameters;
S7: Compute the likelihood of the extracted feature parameters against each class representative, select the class with the maximum likelihood, then compute likelihood scores against all models in the selected class; the model with the highest score is the identification result.
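The two-stage matching of step S7 can be sketched as follows. This is a minimal illustration, not the patent's implementation: avg_log_likelihood uses a single diagonal-covariance Gaussian as a stand-in for GMM scoring, and all names and data are hypothetical.

```python
import numpy as np

def avg_log_likelihood(features, model):
    # Average per-frame log-likelihood under a single diagonal-covariance
    # Gaussian (a simplified stand-in for the patent's GMM likelihood).
    mean, var = model
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (features - mean) ** 2 / var)
    return float(ll.sum(axis=1).mean())

def two_stage_identify(features, class_reps, class_members):
    # Stage 1 (class selection): score the features against every class
    # representative and keep the class with the maximum likelihood.
    best_class = max(class_reps,
                     key=lambda c: avg_log_likelihood(features, class_reps[c]))
    # Stage 2 (model selection): score only the models inside that class.
    models = class_members[best_class]
    best_model = max(models,
                     key=lambda m: avg_log_likelihood(features, models[m]))
    return best_class, best_model
```

Only the class representatives plus the members of one class are ever scored, which is the source of the claimed speed-up over exhaustive matching.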
In a preferred scheme, the preprocessing of each voiceprint signal in step S2 comprises the following steps:
S2.1: Pre-emphasis. The voiceprint signal is passed through a filter that shifts emphasis to the appropriate frequency range.
The transfer function is H(z) = 1 - 0.9375z^-1,
and the resulting signal is s~(n) = s(n) - 0.9375s(n-1);
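A minimal sketch of the pre-emphasis step, assuming the signal is held as a NumPy array; the coefficient 0.9375 is the one given above.

```python
import numpy as np

def pre_emphasis(s, alpha=0.9375):
    # s~(n) = s(n) - alpha * s(n-1), i.e. the filter H(z) = 1 - alpha * z^-1.
    # The first sample is passed through unchanged (s(-1) taken as 0).
    out = np.asarray(s, dtype=float).copy()
    out[1:] -= alpha * out[:-1]
    return out
```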
S2.2: Framing. The voiceprint signal is divided into frames at intervals of 10-20 ms, one frame being the basic unit. A voiceprint signal changes from instant to instant but is quasi-stationary over 10-20 ms, so the signal within such a relatively stable interval can be treated as one basic unit: a frame.
S2.3: Windowing. To avoid the truncation error that a rectangular window introduces into the LPC (linear prediction coefficient) analysis, each frame is weighted with the Hamming window function w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
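Framing and Hamming windowing can be sketched together. The frame length and hop below are illustrative (at a 16 kHz sampling rate, 10 to 20 ms corresponds to 160 to 320 samples); the patent does not fix a hop size.

```python
import numpy as np

def frame_and_window(signal, frame_len, hop):
    # Split the signal into frames of frame_len samples taken every hop
    # samples, then weight each frame with the Hamming window
    # w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1)), 0 <= n <= N-1.
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame_len - 1))
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window
```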
S2.4: Endpoint detection. Endpoints are detected from the short-time energy coefficient and the short-time zero-crossing-rate coefficient of the voiceprint signal, defined as follows:
Short-time energy coefficient: e(i) = Σ_{n=1..N} |x_i(n)|,
Short-time zero-crossing-rate coefficient: ZCR(i) = Σ_{n=1..N-1} |x_i(n) - x_i(n+1)|.
The purpose of endpoint detection is to detect the presence of voiceprint within a segment that also contains silence, i.e. to determine the start and end points of the voiceprint. Effective endpoint detection not only minimizes processing time but also removes the noise of the silent segments, giving the recognition system good performance.
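A sketch of endpoint detection built on the two coefficients above; the thresholds and the rule of keeping the first and last active frames are illustrative assumptions, not the patent's exact decision logic.

```python
import numpy as np

def short_time_energy(frame):
    # e(i) = sum_n |x_i(n)|
    return float(np.abs(frame).sum())

def zero_crossing_rate(frame):
    # ZCR(i) = sum_n |x_i(n) - x_i(n+1)|  (the patent's difference form)
    return float(np.abs(np.diff(frame)).sum())

def detect_endpoints(frames, e_thresh, z_thresh):
    # A frame counts as "active" when either coefficient exceeds its
    # threshold; the endpoints are the first and last active frames.
    active = [i for i, f in enumerate(frames)
              if short_time_energy(f) > e_thresh
              or zero_crossing_rate(f) > z_thresh]
    return (active[0], active[-1]) if active else None
```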
In a preferred scheme, the feature parameters in step S3 are MFCC (mel-frequency cepstral coefficient) parameters, and extracting the voiceprint feature parameters from each voiceprint signal comprises the following steps:
S3.1: Apply the fast Fourier transform (FFT) to the voiceprint signal to obtain its energy spectrum;
S3.2: Multiply the energy spectrum by a bank of N triangular band-pass filters and take the logarithm of each filter's output to obtain N log energies E_k. The N filters are evenly spaced on the mel-frequency scale, where the mel frequency mel(f) is related to the ordinary frequency f by:
mel(f) = 2595*log10(1 + f/700);
S3.3: Apply the discrete cosine transform (DCT) to the N log energies E_k to obtain the mel-scale cepstrum of order L, i.e. L cepstral coefficients, with L = 12. The DCT formula is:
C_m = Σ_{k=1..N} cos[m(k - 0.5)π/N]·E_k, m = 1, 2, ..., L;
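The mel mapping and the DCT of the filter-bank log energies can be sketched as follows; the filter bank itself is omitted and the log energies are taken as given.

```python
import numpy as np

def mel(f):
    # mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_cepstra(log_energies, L=12):
    # C_m = sum_{k=1..N} cos(m * (k - 0.5) * pi / N) * E_k,  m = 1..L
    # (the DCT of the N filter-bank log energies, keeping L coefficients).
    E = np.asarray(log_energies, dtype=float)
    N = len(E)
    k = np.arange(1, N + 1)
    return np.array([np.sum(np.cos(m * (k - 0.5) * np.pi / N) * E)
                     for m in range(1, L + 1)])
```

Note that a constant filter-bank output yields all-zero cepstra, since each DCT basis vector sums to zero over a full period.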
S3.4: Extract the log energy of each voiceprint frame, defined as 10 times the base-10 logarithm of the sum of the squared samples in the frame. The energy of a frame is also an important voiceprint feature, so adding it gives each frame a basic 13-dimensional feature vector: 1 log energy plus 12 cepstral coefficients;
S3.5: Extract the delta cepstral coefficients (delta cepstrum) of the voiceprint signal. The delta cepstrum is the slope of the cepstral coefficients with respect to time, i.e. their dynamic change over time. Although 13 feature parameters have already been obtained, the delta cepstra are added in voiceprint recognition to capture this temporal variation. The formula is:
ΔC_m(t) = [Σ_{τ=-M..M} τ·C_m(t+τ)] / [Σ_{τ=-M..M} τ²] = [Σ_{τ=1..M} τ·(C_m(t+τ) - C_m(t-τ))] / [2·Σ_{τ=1..M} τ²], m = 1, 2, ..., L,
where M is 2 or 3, t is the frame index, and C_m(t) is the m-th cepstral coefficient of frame t.
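A sketch of the delta-cepstrum computation; padding the edges by repeating the first and last frames is an illustrative choice the patent does not specify.

```python
import numpy as np

def delta_cepstra(C, M=2):
    # dC_m(t) = sum_{tau=1..M} tau * (C_m(t+tau) - C_m(t-tau))
    #           / (2 * sum_{tau=1..M} tau^2)
    # C has shape (T, L); edge frames are padded by repetition so the
    # slope is defined at t = 0 and t = T-1 as well.
    C = np.asarray(C, dtype=float)
    T = len(C)
    padded = np.concatenate([np.repeat(C[:1], M, axis=0), C,
                             np.repeat(C[-1:], M, axis=0)])
    denom = 2.0 * sum(tau * tau for tau in range(1, M + 1))
    num = np.zeros_like(C)
    for tau in range(1, M + 1):
        num += tau * (padded[M + tau: M + tau + T]
                      - padded[M - tau: M - tau + T])
    return num / denom
```

On a linear ramp of cepstra the interior deltas come out as exactly the slope, which is a quick sanity check of the formula.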
In a preferred scheme, the adaptive classification of all models with the model-growing clustering algorithm in step S4 comprises the following steps:
S4.1: Class-representative initialization:
Randomly select one model from all models as the first initial class representative R0;
Compute the approximate entropy D of each remaining model with respect to R0 in turn until D > θ; that model is appointed the second initial class representative R1, giving the representative set A0 = {R0, R1}, where θ is a preset threshold;
Compute the approximate entropy of the remaining models with respect to both R0 and R1; a model whose values both exceed θ is appointed the third initial representative R2, and so on until k representatives are obtained, where k is the preset number of classes, i.e. A0 = {R0, R1, ..., Rk-1}; class-representative initialization is then complete.
The choice of initial representatives directly affects the efficiency of the clustering algorithm. The initial representatives of the present invention satisfy two conditions: each is produced, directly or indirectly, from the model set, and the pairwise approximate entropy between any two initial representatives must exceed the preset threshold θ.
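The initialization above can be sketched as a greedy selection; `dissimilarity` is a placeholder for the approximate-entropy measure between models, which the patent does not define in code.

```python
def init_class_representatives(models, dissimilarity, theta, k):
    # Greedy initialization (step S4.1): the first model is R0; each
    # further model is appointed a new representative only if its
    # dissimilarity to EVERY existing representative exceeds theta,
    # stopping once k representatives have been found.
    reps = [models[0]]
    for m in models[1:]:
        if len(reps) == k:
            break
        if all(dissimilarity(m, r) > theta for r in reps):
            reps.append(m)
    return reps
```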
S4.2: Class-representative authorization:
Because the clustering induced by the initial representatives often violates the constraints on class membership, the representatives must be authorized, i.e. existing representatives revoked or new ones created.
For each class ω whose member count γ exceeds γ_max, compute the density value of every member model and sort the members in descending order of density; the member with the highest density is directly appointed the new class representative, after which γ_new additional representatives are generated by the initialization method of step S4.1, where γ_new satisfies:
1 ≤ γ_new ≤ γ/γ_max.
Authorize all class representatives in turn and reclassify the models, repeating until no class representative is updated;
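One authorization pass can be sketched as follows; the reclassification of members and the spawning of up to γ/γ_max extra representatives via S4.1 are noted but omitted, and `density` is a placeholder for the patent's model-density value.

```python
def authorize_pass(classes, density, gamma_max):
    # One authorization pass (step S4.2): a class whose member count
    # exceeds gamma_max has its densest member promoted to class
    # representative; other classes keep their current representative.
    updated = {}
    for rep, members in classes.items():
        if len(members) > gamma_max:
            updated[max(members, key=density)] = members
        else:
            updated[rep] = members
    return updated
```

In the full algorithm this pass is repeated, with reclassification in between, until no representative changes.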
S4.3: Class-representative election:
After clustering, all models are divided into k classes. The features of the models in each class are then retrained into a class GMM (Gaussian mixture model) that serves as that class's representative. Because this GMM is elected from all models in the class, it represents the class more accurately.
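Step S4.3 amounts to pooling the features of every model in a class and retraining a GMM on the pooled data. A sketch using scikit-learn's GaussianMixture, assuming that library is available; the component count is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available

def elect_class_representative(member_features, n_components=4):
    # Pool the training features of every model in the class and fit a
    # fresh GMM on the pooled data; this retrained GMM is the elected
    # class representative.
    pooled = np.vstack(member_features)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=0).fit(pooled)
```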
A rapid speaker identification system based on model-growing clustering comprises a client, a network connection module, and a server, the client and server being connected through the network connection module;
The client comprises:
Voiceprint acquisition module: acquires the voiceprint signals of multiple persons, including the target speakers, and outputs them to the preprocessing module;
The server comprises:
Preprocessing module: comprises a pre-emphasis unit, a framing unit, a windowing unit, and an endpoint-detection unit connected in sequence, which perform pre-emphasis, framing, windowing, and endpoint detection on each voiceprint signal delivered to the server through the network connection module;
Voiceprint feature extraction module: extracts voiceprint feature parameters from each voiceprint signal to form multiple models;
Adaptive classification module: adaptively classifies all models with the model-growing clustering algorithm; adaptive classification comprises class-representative initialization, authorization, and election;
Voiceprint identification module: computes the likelihood of the feature parameters of the speech signal to be identified against each class representative, selects the class with the maximum likelihood, then computes likelihood scores against all models in the selected class; the model with the highest score is the identification result.
In a preferred scheme, the server can receive identification requests from multiple clients simultaneously; it creates one new thread per identification request and responds to each user's request over the wireless network.
In a preferred scheme, the client is an Android client.
Compared with the prior art, the beneficial effects of the technical scheme of the present invention are as follows. The invention discloses a rapid speaker identification method based on model-growing clustering. Model training comprises acquiring the voiceprint signals of multiple persons including the target speakers; preprocessing each voiceprint signal and extracting voiceprint feature parameters to form multiple models; and adaptively classifying all models with the model-growing clustering algorithm. Model identification comprises acquiring the speech signal of a speaker, preprocessing it and extracting its voiceprint feature parameters, computing the likelihood of the features against each class representative, selecting the class with the maximum likelihood, then computing likelihood scores against all models in the selected class; the model with the highest score is the identification result. The method need not match the speech features to be identified against every model, so matching time is short, real-time performance is good, and it adapts well to large model libraries.
The present invention also discloses a rapid speaker identification system based on model-growing clustering. The system is the hardware foundation on which the method runs; together, the method and system achieve fast, real-time speaker identification.
Brief description of the drawings
Fig. 1 is the flowchart of the rapid speaker identification method based on model-growing clustering.
Fig. 2 is the flowchart of adaptive classification.
Fig. 3 is the schematic diagram of the rapid speaker identification system based on model-growing clustering.
Fig. 4 is the functional schematic of the rapid speaker identification system based on model-growing clustering.
Detailed description of the embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent.
To better illustrate the embodiments, some parts of the drawings are omitted, enlarged, or reduced and do not represent the dimensions of the actual product; it will be understood by those skilled in the art that some well-known structures in the drawings, and their descriptions, may be omitted.
The technical scheme of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a rapid speaker identification method based on model-growing clustering comprises model training and model identification;
Model training comprises the following steps:
S1: Acquire the voiceprint signals, i.e. speech signals, of multiple persons including the target speakers;
S2: Preprocess and denoise each voiceprint signal; preprocessing comprises, in order, pre-emphasis, framing, windowing, and endpoint detection;
In a specific implementation, the preprocessing of each voiceprint signal in step S2 comprises the following steps:
S2.1: Pre-emphasis. The voiceprint signal is passed through a filter that shifts emphasis to the appropriate frequency range.
The transfer function is H(z) = 1 - 0.9375z^-1,
and the resulting signal is s~(n) = s(n) - 0.9375s(n-1);
S2.2: Framing. The voiceprint signal is divided into frames at intervals of 10-20 ms, one frame being the basic unit. A voiceprint signal changes from instant to instant but is quasi-stationary over 10-20 ms, so the signal within such a relatively stable interval can be treated as one basic unit: a frame.
S2.3: Windowing. To avoid the truncation error that a rectangular window introduces into the LPC analysis, each frame is weighted with the Hamming window function w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
S2.4: Endpoint detection. Endpoints are detected from the short-time energy coefficient and the short-time zero-crossing-rate coefficient of the voiceprint signal, defined as follows:
Short-time energy coefficient: e(i) = Σ_{n=1..N} |x_i(n)|,
Short-time zero-crossing-rate coefficient: ZCR(i) = Σ_{n=1..N-1} |x_i(n) - x_i(n+1)|.
The purpose of endpoint detection is to detect the presence of voiceprint within a segment that also contains silence, i.e. to determine the start and end points of the voiceprint. Effective endpoint detection not only minimizes processing time but also removes the noise of the silent segments, giving the recognition system good performance.
S3: Extract voiceprint feature parameters from each voiceprint signal to form multiple models;
In a specific implementation, the feature parameters in step S3 are MFCC parameters, and extracting the voiceprint feature parameters from each voiceprint signal comprises the following steps:
S3.1: Apply the fast Fourier transform (FFT) to the voiceprint signal to obtain its energy spectrum;
S3.2: Multiply the energy spectrum by a bank of N triangular band-pass filters evenly spaced on the mel-frequency scale and take the logarithm of each filter's output to obtain N log energies E_k, the mel frequency being mel(f) = 2595*log10(1 + f/700);
S3.3: Apply the discrete cosine transform to the N log energies E_k to obtain the mel-scale cepstrum of order L, i.e. L cepstral coefficients, with L = 12. The DCT formula is:
C_m = Σ_{k=1..N} cos[m(k - 0.5)π/N]·E_k, m = 1, 2, ..., L;
S3.4: Extract the log energy of each voiceprint frame, defined as 10 times the base-10 logarithm of the sum of the squared samples in the frame. The energy of a frame is also an important voiceprint feature, so adding it gives each frame a basic 13-dimensional feature vector: 1 log energy plus 12 cepstral coefficients;
S3.5: Extract the delta cepstral coefficients of the voiceprint signal. The delta cepstrum is the slope of the cepstral coefficients with respect to time, i.e. their dynamic change over time. Although 13 feature parameters have already been obtained, the delta cepstra are added in voiceprint recognition to capture this temporal variation. The formula is:
ΔC_m(t) = [Σ_{τ=-M..M} τ·C_m(t+τ)] / [Σ_{τ=-M..M} τ²] = [Σ_{τ=1..M} τ·(C_m(t+τ) - C_m(t-τ))] / [2·Σ_{τ=1..M} τ²], m = 1, 2, ..., L,
where M is 2 or 3, t is the frame index, and C_m(t) is the m-th cepstral coefficient of frame t.
S4: Adaptively classify all models with the model-growing clustering algorithm; adaptive classification comprises class-representative initialization, class-representative authorization, and class-representative election;
As shown in Fig. 2, in a specific implementation, the adaptive classification of all models with the model-growing clustering algorithm in step S4 comprises the following steps:
S4.1: Class-representative initialization:
Randomly select one model from the model bank as the first initial class representative R0;
Compute the approximate entropy D of each remaining model with respect to R0 in turn until D > θ; that model is appointed the second initial class representative R1, giving the representative set A0 = {R0, R1}, where θ is a preset threshold;
Compute the approximate entropy of the remaining models with respect to both R0 and R1; a model whose values both exceed θ is appointed the third initial representative R2, and so on until k representatives are obtained, where k is the preset number of classes, i.e. A0 = {R0, R1, ..., Rk-1}; class-representative initialization is then complete, after which the models are classified;
The choice of initial representatives directly affects the efficiency of the clustering algorithm. The initial representatives of the present invention satisfy two conditions: each is produced, directly or indirectly, from the model set, and the pairwise approximate entropy between any two initial representatives must exceed the preset threshold θ.
S4.2: Class-representative authorization:
Because the clustering induced by the initial representatives often violates the constraints on class membership, the representatives must be authorized, i.e. existing representatives revoked or new ones created.
For each class ω whose member count γ exceeds γ_max, compute the density value of every member model and sort the members in descending order of density; the member with the highest density is directly appointed the new class representative, after which γ_new additional representatives are generated by the initialization method of step S4.1, where γ_new satisfies:
1 ≤ γ_new ≤ γ/γ_max.
Authorize all class representatives in turn and reclassify the models, repeating until no class representative is updated;
S4.3: Class-representative election:
After clustering, all models are divided into k classes. The features of the models in each class are then retrained into a class GMM that serves as that class's representative and is saved to the database. Because this GMM is elected from all models in the class, it represents the class more accurately.
Model Identification comprises the following steps:
S5: Acquire the speech signal of a speaker; this is the speech signal to be identified;
S6: Preprocess and denoise the speech signal to be identified and extract its voiceprint feature parameters;
S7: Compute the likelihood of the extracted feature parameters against each class representative, select the class with the maximum likelihood, then compute likelihood scores against all models in the selected class; the model with the highest score is the identification result, which is finally output.
The present embodiment provides a rapid speaker identification method based on model-growing clustering. Model training comprises acquiring the voiceprint signals of multiple persons including the target speakers; preprocessing each voiceprint signal and extracting voiceprint feature parameters to form multiple models; and adaptively classifying all models with the model-growing clustering algorithm. Model identification comprises acquiring the speech signal of a speaker, preprocessing it and extracting its voiceprint feature parameters, computing the likelihood of the features against each class representative, selecting the class with the maximum likelihood, then computing likelihood scores against all models in the selected class; the model with the highest score is the identification result. The method need not match the speech features to be identified against every model, so matching time is short, real-time performance is good, and it adapts well to large model libraries.
Embodiment 2
As shown in Fig. 3, a rapid speaker identification system based on model-growing clustering comprises a client, a network connection module, and a server, the client and server being connected through the network connection module;
The client comprises:
Voiceprint acquisition module: acquires the voiceprint signals of multiple persons, including the target speakers, and outputs them to the preprocessing module;
The server comprises:
Preprocessing module: comprises a pre-emphasis unit, a framing unit, a windowing unit, and an endpoint-detection unit connected in sequence, which perform pre-emphasis, framing, windowing, and endpoint detection on each voiceprint signal delivered to the server through the network connection module;
Voiceprint feature extraction module: extracts voiceprint feature parameters from each voiceprint signal to form multiple models;
Adaptive classification module: adaptively classifies all models with the model-growing clustering algorithm; adaptive classification comprises class-representative initialization, authorization, and election;
Voiceprint identification module: computes the likelihood of the feature parameters of the speech signal to be identified against each class representative, selects the class with the maximum likelihood, then computes likelihood scores against all models in the selected class; the model with the highest score is the identification result.
As shown in Fig. 4, in a specific implementation, the server receives identification requests from multiple clients and users simultaneously; it creates one new thread per identification request and responds to each user's request over the wireless network.
In a specific implementation, the client is an Android client, and the voice acquisition module is implemented with android.media.AudioRecord of the Android system, which yields the PCM speech data.
In the present invention, the client acquires the speech signal and the server performs the signal-processing logic; the two exchange data over the HTTP protocol. The client performs no mathematical processing, so the system places no special hardware requirements on it; the server's data-processing capacity far exceeds the client's, so model training, classification, clustering, and matching are all handled by the server, keeping the client responsive.
After the user selects a function in the client module and sets the parameters, speech is acquired and sent to the server in a network request. The network connection module selects the transport protocol, sets the data format, and handles request and response timeouts. On receiving a request, the server parses out the speech data, preprocesses it, and then performs the operation corresponding to the selected function, which is one of three: model training, model clustering, or model identification; finally the result is returned for display on the client panel.
The present embodiment provides a rapid speaker identification system based on model-growing clustering. The system is the hardware foundation on which the method runs; together, the method and system achieve fast, real-time speaker identification.
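The one-thread-per-request server described in this embodiment can be sketched as follows. This is a raw-TCP sketch using Python's ThreadingTCPServer; the patent's system exchanges data over HTTP, but the threading pattern is the same, and identify() is a stub standing in for the two-stage matching.

```python
import socket
import socketserver
import threading

def identify(payload):
    # Placeholder for step S7: a real server would extract features from
    # the payload and score them against the class representatives and
    # then the models of the selected class.
    return b"speaker-0"

class RecognitionHandler(socketserver.BaseRequestHandler):
    # ThreadingTCPServer runs handle() on a fresh thread per connection,
    # mirroring the one-thread-per-identification-request design.
    def handle(self):
        data = self.request.recv(4096)
        self.request.sendall(identify(data))

def start_server(host="127.0.0.1", port=0):
    # port=0 asks the OS for a free port; the chosen address is available
    # afterwards as server.server_address.
    server = socketserver.ThreadingTCPServer((host, port), RecognitionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```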
Obviously, the above embodiments are merely examples given for clarity of description and are not a limitation on the embodiments of the present invention. Those of ordinary skill in the art can make other variations in different forms on the basis of the above description; an exhaustive enumeration of all embodiments is neither necessary nor possible here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

1. A rapid speaker identification method based on model-growing clustering, characterized in that it comprises model training and model identification;
Model training comprises the following steps:
S1: Acquire the voiceprint signals of multiple persons, including the target speakers;
S2: Preprocess each voiceprint signal; preprocessing comprises, in order, pre-emphasis, framing, windowing, and endpoint detection;
S3: Extract voiceprint feature parameters from each voiceprint signal to form multiple models;
S4: Adaptively classify all models with the model-growing clustering algorithm; adaptive classification comprises class-representative initialization, class-representative authorization, and class-representative election;
Model Identification comprises the following steps:
S5: the voice signal gathering speaker;
S6: pre-service is carried out to voice signal to be identified and extracts vocal print characteristic parameter;
S7: the likelihood score of characteristic parameter to all kinds of representative calculating voice signal to be identified, the class belonging to selecting with the maximum principle of likelihood score, and then calculate Likelihood Score with all models in the class selected, the model that score is the highest is recognition result.
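As a non-authoritative illustration, the two-stage lookup of step S7 can be sketched as below. The names `score`, `class_reps` and `class_members` are hypothetical placeholders (not taken from the patent) for the likelihood function, the class representatives, and the per-class lists of (speaker id, model) pairs:

```python
def identify(features, class_reps, class_members, score):
    """Two-stage identification: pick the best class by its representative,
    then score only the models inside that class."""
    # Stage 1: likelihood against each class representative only.
    best_class = max(range(len(class_reps)),
                     key=lambda i: score(features, class_reps[i]))
    # Stage 2: full likelihood scoring restricted to the selected class;
    # the highest-scoring model is the identification result.
    best_id, _ = max(class_members[best_class],
                     key=lambda pair: score(features, pair[1]))
    return best_id
```

Because stage 1 touches only the k class representatives, the features never need to be matched against every model in the library, which is the source of the claimed speed-up.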
2. The speaker rapid identification method based on model growing and clustering according to claim 1, characterized in that in step S2 the preprocessing of each voiceprint signal specifically comprises the following steps:
S2.1: pre-emphasis, with transfer function H(z) = 1 - 0.9375z^(-1), yielding the pre-emphasized signal S~(n) = S(n) - 0.9375·S(n-1);
S2.2: framing, dividing the voiceprint signal into frames at intervals of 10~20 ms, one frame being the basic processing unit;
S2.3: windowing, applying a Hamming window function w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
S2.4: endpoint detection, detecting endpoints from the short-time energy coefficient and the short-time zero-crossing-rate coefficient of the voiceprint signal, computed as:
short-time energy coefficient: e(i) = Σ_{n=1}^{N} |x_i(n)|,
short-time zero-crossing-rate coefficient: ZCR(i) = Σ_{n=1}^{N-1} |x_i(n) - x_i(n+1)|.
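A minimal sketch of the claim-2 preprocessing chain follows. The frame length, the non-overlapping framing, and the NumPy representation are illustrative assumptions; the claim itself fixes only the 0.9375 pre-emphasis coefficient, the Hamming window, and the two endpoint-detection coefficients:

```python
import numpy as np

def preprocess(signal, frame_len=256, alpha=0.9375):
    """Pre-emphasis, framing, Hamming windowing, and the two
    endpoint-detection coefficients from claim 2 (steps S2.1-S2.4)."""
    # S2.1 pre-emphasis: s~(n) = s(n) - 0.9375*s(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # S2.2 framing: split into non-overlapping frames of frame_len samples
    n_frames = len(emphasized) // frame_len
    frames = emphasized[:n_frames * frame_len].reshape(n_frames, frame_len)
    # S2.3 windowing: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    n = np.arange(frame_len)
    frames = frames * (0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1)))
    # S2.4 endpoint-detection coefficients, exactly as stated in the claim
    energy = np.sum(np.abs(frames), axis=1)                # e(i)
    zcr = np.sum(np.abs(np.diff(frames, axis=1)), axis=1)  # ZCR(i)
    return frames, energy, zcr
```

Speech/non-speech endpoints would then be decided by thresholding `energy` and `zcr` per frame; the thresholds are not specified in the claim.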
3. The speaker rapid identification method based on model growing and clustering according to claim 1, characterized in that in step S3 the characteristic parameters are MFCC parameters, and the voiceprint characteristic parameter extraction for each voiceprint signal specifically comprises the following steps:
S3.1: applying a fast Fourier transform to the voiceprint signal to obtain its energy spectrum;
S3.2: multiplying the energy spectrum by a bank of N triangular bandpass filters and computing the log energy E_k output by each filter, the N triangular bandpass filters being evenly spaced on the mel-frequency scale, where mel frequency mel(f) and ordinary frequency f are related by mel(f) = 2595·log10(1 + f/700);
S3.3: applying a discrete cosine transform to the N log energies E_k to obtain the L-order Mel-scale cepstrum, i.e. L cepstral parameters, the transform being:
C_m = Σ_{k=1}^{N} cos[m·(k - 0.5)·π/N]·E_k, m = 1, 2, ..., L;
S3.4: extracting the log energy of each voiceprint signal frame, defined as 10 times the base-10 logarithm of the sum of squares of the signal within the frame;
S3.5: extracting the delta cepstral parameters of the voiceprint signal, which represent the slope of the cepstral parameters with respect to time:
ΔC_m(t) = [Σ_{τ=-M}^{M} τ·C_m(t+τ)] / [Σ_{τ=-M}^{M} τ²] = [Σ_{τ=1}^{M} τ·(C_m(t+τ) - C_m(t-τ))] / [2·Σ_{τ=1}^{M} τ²], m = 1, 2, ..., L,
where M takes the value 2 or 3, t is the frame index, and C_m(t) is the cepstral parameter of frame t.
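Steps S3.1–S3.3 might be sketched as below for a single windowed frame. The sampling rate, filter count N and cepstrum order L are illustrative choices, the filter-bin placement is one common convention rather than the patent's own, and the frame log-energy (S3.4) and delta cepstrum (S3.5) are omitted for brevity:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_filters=24, n_ceps=12):
    """MFCC sketch for one windowed frame: FFT energy spectrum,
    triangular mel filterbank, log energies, DCT (claim 3, S3.1-S3.3)."""
    # S3.1: FFT -> energy spectrum
    spec = np.abs(np.fft.rfft(frame)) ** 2
    # S3.2: N triangular filters evenly spaced on the mel scale,
    # mel(f) = 2595*log10(1 + f/700)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(sr / 2.0), n_filters + 2))
    bins = np.floor(len(frame) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for j in range(1, n_filters + 1):
        lo, ctr, hi = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[j - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    log_e = np.log(fbank @ spec + 1e-10)  # log filter energies E_k
    # S3.3: DCT -> L cepstral parameters,
    # C_m = sum_k cos(m*(k-0.5)*pi/N) * E_k
    m = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, n_filters + 1)[None, :]
    return (np.cos(np.pi * m * (k - 0.5) / n_filters) * log_e).sum(axis=1)
```

The returned vector is one frame's L-dimensional cepstral parameter; per S3.4/S3.5, the frame log-energy and delta coefficients would be appended to it in a full feature extractor.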
4. The speaker rapid identification method based on model growing and clustering according to claim 1, characterized in that in step S4 the adaptive classification of all the models with the model growing and clustering algorithm specifically comprises the following steps:
S4.1: class-representative initialization:
randomly selecting one model from all the models as the first initial class representative R0;
computing in turn the approximate entropy D of each remaining model with respect to R0 until D > θ, and designating that model as the second initial class representative R1, the class-representative set then being A0 = {R0, R1}, where θ is a preset threshold;
computing the approximate entropy of each remaining model with respect to both R0 and R1, and designating a model for which both values exceed θ as the third initial class representative R2; repeating in this way until k class representatives are obtained, k being the preset number of classes, i.e. A0 = {R0, R1, ..., Rk-1}, whereupon the class-representative initialization is complete;
S4.2: class-representative authorization:
for each class ω whose member count γ exceeds γ_max, computing the density values of all member models and sorting them in descending order; the member with the highest density value is directly appointed as a new class representative, and γ_new further class representatives are then generated by the initialization method of step S4.1, the range of γ_new being determined by:
1 ≤ γ_new ≤ γ/γ_max;
authorizing all class representatives in turn and reclassifying, until no class representative is updated;
S4.3: class-representative election:
after clustering, all the models are divided into k classes; the features of each class are then retrained into a class GMM model, which serves as the representative of that class.
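The class-representative initialization of step S4.1 can be sketched as follows. Here `distance` stands in for the claim's model-to-model measure D (whose exact definition the claim leaves open), and for determinism the first representative is taken as the first model rather than a random one:

```python
def init_representatives(models, distance, theta, k):
    """S4.1 sketch: grow up to k initial class representatives.
    A model becomes a new representative only when its measure D to
    every current representative exceeds the threshold theta."""
    reps = [models[0]]  # in the claim this first pick is random
    for m in models[1:]:
        if len(reps) == k:
            break  # k representatives found; initialization complete
        if all(distance(m, r) > theta for r in reps):
            reps.append(m)
    return reps
```

After initialization, each remaining model is assigned to its nearest representative; oversized classes (γ > γ_max) are then split in the authorization step S4.2 by the same growth rule.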
5. A speaker rapid identification system based on model growing and clustering, characterized in that it comprises a client, a network connection module and a server, the client being connected to the server through the network connection module;
The client comprises:
a voiceprint acquisition module for collecting the voiceprint signals of a plurality of persons including the speaker and outputting them, through the network connection module, to the preprocessing module;
The server comprises:
a preprocessing module comprising a pre-emphasis unit, a framing unit, a windowing unit and an endpoint-detection unit connected in sequence, for applying pre-emphasis, framing, windowing and endpoint detection to each voiceprint signal in turn;
a voiceprint feature extraction module for extracting voiceprint characteristic parameters from each voiceprint signal to form a plurality of models;
an adaptive classification module for adaptively classifying all the models with the model growing and clustering algorithm, the adaptive classification comprising class-representative initialization, class-representative authorization and class-representative election;
a voiceprint identification module for computing the likelihood of the characteristic parameters of the voice signal to be identified against each class representative, selecting the class with the maximum likelihood, then computing likelihood scores against all the models in the selected class, and taking the model with the highest score as the identification result.
6. The speaker rapid identification system based on model growing and clustering according to claim 5, characterized in that the server simultaneously receives identification requests from multiple clients, creates a new thread for each identification request, and responds to each user's identification request over the wireless network.
7. The speaker rapid identification system based on model growing and clustering according to claim 5, characterized in that the client is an Android client.
CN201510563935.5A 2015-09-06 2015-09-06 Speaker rapid identification method and system based on model growing and clustering Expired - Fee Related CN105096955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510563935.5A CN105096955B (en) 2015-09-06 2015-09-06 Speaker rapid identification method and system based on model growing and clustering

Publications (2)

Publication Number Publication Date
CN105096955A true CN105096955A (en) 2015-11-25
CN105096955B CN105096955B (en) 2019-02-01

Family

ID=54577238


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971711A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive method for recognizing sound-groove and system
CN107799114A (en) * 2017-04-26 2018-03-13 珠海智牧互联科技有限公司 A kind of pig cough sound recognition methods and system
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN108922543A (en) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 Model library method for building up, audio recognition method, device, equipment and medium
CN108962229A (en) * 2018-07-26 2018-12-07 汕头大学 A kind of target speaker's voice extraction method based on single channel, unsupervised formula
CN109461441A (en) * 2018-09-30 2019-03-12 汕头大学 A kind of Activities for Teaching Intellisense method of adaptive, unsupervised formula
CN109887496A (en) * 2019-01-22 2019-06-14 浙江大学 Orientation confrontation audio generation method and system under a kind of black box scene
CN109961794A (en) * 2019-01-14 2019-07-02 湘潭大学 A kind of layering method for distinguishing speek person of model-based clustering
CN113697321A (en) * 2021-09-16 2021-11-26 安徽世绿环保科技有限公司 Garbage bag coding system for garbage classification station

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5857169A (en) * 1995-08-28 1999-01-05 U.S. Philips Corporation Method and system for pattern recognition based on tree organized probability densities
CN1403953A (en) * 2002-09-06 2003-03-19 浙江大学 Palm acoustic-print verifying system
CN101226742A (en) * 2007-12-05 2008-07-23 浙江大学 Method for recognizing sound-groove based on affection compensation
CN102194455A (en) * 2010-03-17 2011-09-21 博石金(北京)信息技术有限公司 Voiceprint identification method irrelevant to speak content
EP2808866A1 (en) * 2013-05-31 2014-12-03 Nuance Communications, Inc. Method and apparatus for automatic speaker-based speech clustering
CN104732972A (en) * 2015-03-12 2015-06-24 广东外语外贸大学 HMM voiceprint recognition signing-in method and system based on grouping statistics


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiong Huaqiao, "Research on speaker recognition methods based on model clustering", China Master's Theses Full-text Database *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190201

Termination date: 20190906