CN104538036A - Speaker recognition method based on semantic cell mixing model

Info

Publication number: CN104538036A
Application number: CN201510026239.0A
Authority: CN
Inventors: 孙凌云; 何博伟; 尤伟涛; 李彦; 郑楷洪
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2015-01-20
Filing date: 2015-01-20
Publication date: 2015-04-22

Abstract

The invention discloses a speaker recognition method based on a semantic cell mixing model. The method comprises the following steps: (1) establishing a voice library, wherein the voice library comprises multiple voice signals of multiple speakers; (2) preprocessing each voice signal in the voice library and extracting the voice features, thereby obtaining the feature vectors of each speaker; (3) reducing the dimensionality of the feature vectors with a feature selection method based on semantic cells to obtain dimensionality-reduced feature vectors, and training the semantic cell mixing model; (4) constructing an SVM classifier for each speaker by using a kernel function based on the semantic cell mixing model, and training the recognition model of the SVM classifier; and (5) recognizing the unknown speaker by using the recognition model. The method solves the problem that the kernel function of a conventional SVM model is not specifically optimized for a particular speaker. In addition, the feature selection used to choose the voice features for training the classifier is more targeted than conventional methods, and the space needed for storing the model can be reduced.

Description

A speaker recognition method based on a semantic cell mixture model

Technical field

The present invention relates to the fields of signal processing and pattern recognition, and in particular to a speaker recognition method based on a semantic cell mixture model.

Background technology

Speaker recognition, also known as speaker identification, is the process of analyzing the speech signal produced by an unknown speaker (for example by feature extraction) in order to determine automatically whether the speaker belongs to the set of registered speakers and, if so, which speaker it is. Because the shape and size of the vocal tract, larynx and other speech organs differ between individuals, no two individuals have identical voice characteristics (see Kinnunen T, Li H. An overview of text-independent speaker recognition: from features to supervectors. Speech Communication, 2010, 52(1): 12-40). The technology can be used in telephone banking, voice access control, teleshopping and other applications in which the identity of the operator must be verified.

Current speaker recognition methods generally comprise the following two steps: 1. Train a given classifier model on the speaker data set in a corpus. Widely used models include template models, Gaussian mixture models (GMM), hidden Markov models (HMM) and support vector machines (SVM). 2. Feed the speech of the unknown speaker into the recognition system, match it against the models of the known speakers and decide whether the unknown speaker belongs to the set of registered speakers.

Step 1 requires feature extraction from the audio signal. The flow commonly used at present is: 1. apply pre-emphasis, framing and windowing to the sampled speech signal (waveform), which together are called preprocessing; 2. extract features from the preprocessed signal, typically Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and the like. These are vocal-tract-based features; they are robust, descriptive and easy to implement.

Semantic cell (Information Cell) theory was proposed jointly by Tang Yongchuan and Lawry J. (see TANG Y, LAWRY J. Information Cell Mixture Models: The Cognitive Representations of Vague Concepts [C] // Integrated Uncertainty Management and Applications. Heidelberg, Berlin: Springer, 2010: 371-382). It is founded on fuzzy computing and prototype theory. Its main idea is that a concept is not represented by formal rules or mappings but by its prototypes, and that membership in a concept is judged by similarity to those prototypes. The theory has been applied to Mackey-Glass time series prediction and to sunspot prediction, where it performs better than the Kim & Kim and autoregressive model algorithms.

Semantic cells have a transparent cognitive structure that matches the way humans learn concepts. They have a solid basis in cognitive psychology and a strict mathematical definition, and are inherently well suited to describing fuzzy concepts. Speaker recognition is a typical fuzzy-concept problem: according to current research, the voice characteristics of a speaker are a fuzzy concept that is difficult to define by specific rules. Because semantic cells express concepts through prototypes and do not depend on concrete classification rules, they are suitable for speaker recognition.

The application with publication number CN104200814A discloses a speech emotion recognition method based on semantic cells, comprising: building a speech library; preprocessing each speech signal in the library and extracting emotional features; computing the feature vector of each speech signal from the extraction results; using the feature vectors to train a mixture model based on semantic cells as the recognition model of the classifier; and using this recognition model to identify the emotion category of the speech signal to be recognized. That method is a two-layer semantic cell recognition method: mixture models of two layers of semantic cells are built to identify the speaker and the speaker's emotion, thereby establishing a recognition model for speech emotion. Speech emotion recognition with this model is accurate, and while maintaining the same recognition accuracy as the SVM algorithm it still reduces the amount of data that must be stored for the recognition model, so it has advantages in both space complexity and recognition accuracy. Its shortcoming is that it uses principal component analysis (PCA) to reduce the dimensionality of the feature vectors from a purely statistical point of view, which is not well targeted. In addition, it uses the semantic cell mixture model itself as the recognition model of the classifier, and the accuracy of that approach is limited when it is applied to speaker recognition.

Summary of the invention

The invention provides a speaker recognition method based on a semantic cell mixture model. The invention uses an SVM classifier whose kernel function is based on the semantic cell mixture model, and distinguishes speakers by means of the recognition model of the SVM classifier.

A speaker recognition method based on a semantic cell mixture model comprises the following steps:

(1) building a speech library, the speech library comprising multiple speech signals from each of multiple speakers;

(2) preprocessing every speech signal of each speaker in the speech library and extracting speech features to obtain the feature vectors of each speaker;

(3) using a feature selection method based on semantic cells, reducing the dimensionality of each feature vector generated in step (2) to obtain dimensionality-reduced feature vectors, and training the semantic cell mixture model of each speaker;

(4) constructing an SVM classifier for each speaker using a kernel function based on the semantic cell mixture model, and training the recognition model of the SVM classifier;

(5) recognizing the unknown speaker using the recognition model of the SVM classifier.

In step (2), each speech signal is preprocessed to obtain its feature vector; since each speaker has multiple speech signals, preprocessing yields multiple feature vectors for each speaker.

The preprocessing in step (2) comprises pre-emphasis, framing and windowing.

(2-1) perform pre-emphasis filtering with the transfer function H(z) = 1 - 0.97 z^{-1};

(2-2) divide the speech signal into short time segments, each called a frame, each frame being approximately 10-30 ms long;

(2-3) apply a Hamming window function to each speech frame;

(2-4) extract the features of each frame of the current speech signal; the features are the following 9 statistics of the 1st- to 12th-order Mel-frequency cepstral coefficients (MFCC): maximum, minimum, frame position of the maximum, frame position of the minimum, arithmetic mean, linear regression coefficients (slope, intercept), skewness and kurtosis;

(2-5) build the feature vector of the current speech signal from the statistics of the features;

(2-6) normalize the feature vectors using standard scores (z-score) to obtain the candidate feature set.
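The preprocessing of steps (2-1) to (2-3) can be illustrated with a minimal Python sketch. The sampling rate, the 25 ms frame length and the 10 ms hop are assumptions chosen for illustration (the text only gives a frame length of roughly 10-30 ms), and the function names are hypothetical.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Apply y[t] = x[t] - alpha * x[t-1], i.e. H(z) = 1 - alpha * z^-1."""
    if not signal:
        return []
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def hamming(length):
    """Standard Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def frame_and_window(signal, frame_len, hop):
    """Cut overlapping frames (hop < frame_len) and window each one."""
    window = hamming(frame_len)
    return [[signal[start + i] * window[i] for i in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]

# Assumed settings, not taken from the source: 16 kHz audio, 25 ms frames, 10 ms hop.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000   # 400 samples
hop = sample_rate * 10 // 1000         # 160 samples
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
frames = frame_and_window(pre_emphasis(signal), frame_len, hop)
```

Because the hop is shorter than the frame, consecutive frames overlap, which matches the overlapping segmentation described for framing. MFCC extraction itself is not shown here.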

In step (3), the dimensionality of the feature vectors is reduced by the feature selection method based on semantic cells, which is more targeted than commonly used dimensionality reduction methods and therefore reduces the storage space required for the model.

The dimensionality reduction in step (3) proceeds as follows: a predetermined number of features is selected from the feature vectors of each speaker. In each selection round, the candidate features in the feature vectors of the speaker are taken one at a time, each forming an intermediate vector; this intermediate vector, together with all previously selected features (also expressed as intermediate vectors), is used as the training set to train a semantic cell mixture model; the feature whose semantic cell mixture model has the largest coverage is added to the dimensionality-reduced feature vector. This step is repeated until the dimensionality-reduced feature vector contains the predetermined number of features.

When the predetermined number of features is small, model training and recognition are fast; when it is large, accuracy is higher but model training and recognition are slower.

Preferably, the predetermined number is 30% to 50% of the total number of features.

The semantic cell mixture model in step (3) is trained as follows:

(3-1) cluster the intermediate vectors in the training set to obtain multiple cluster centers, which serve as the centers of the semantic cells; a semantic cell mixture model consists of n semantic cells, that is, n cluster centers with different weights;

The number of semantic cells n affects both the recognition result and performance: when n is small, the semantics of a complex concept may be summarized unclearly, but model training and recognition are fast; when n is large, the semantics of a complex concept are summarized better, but model training and recognition are slow.

Preferably, n is 3 to 10.

(3-2) compute the initial parameter values: for each semantic cell, use the distances from each intermediate vector in the training set to the center of the semantic cell to compute the location parameter and scale parameter of the semantic cell, and set the contribution parameter of each semantic cell to the mixture model; the initial values of the location parameter, scale parameter and contribution parameter of the i-th semantic cell are denoted c_{i} (0), σ_{i} (0) and \Pr (L_{i} (0)) respectively, and the parameters of all semantic cells constitute the parameters of the semantic cell mixture model.

The contribution parameters of all semantic cells to the semantic cell mixture model are set equal, namely

\Pr (L_{i} (0)) = 1 / n;

The distance ϵ_{ik} between the k-th intermediate vector and the center of the i-th semantic cell is computed by the following formula:

ϵ_{ik} = d_{i} (X_{k}, P_{i}) = | | X_{k} - P_{i} | |

P_{i} is the center of the i-th semantic cell;

X_{k} is the k-th intermediate vector in the training set, i = 1, 2, ..., n, k = 1, 2, ..., N;

N is the number of intermediate vectors in the training set;

c_{i} (0) = \frac{1}{N} Σ_{k = 1}^{N} ϵ_{ik}

{(σ_{i} (0))}^{2} = \frac{1}{N} Σ_{k = 1}^{N} {(ϵ_{ik} - c_{i} (0))}^{2}

(3-3) after the initial parameter values of the semantic cell mixture model have been obtained, set a threshold and update the semantic cell mixture model iteratively; the objective function of the t-th iteration is:

J_{LP} (t) = Σ_{k = 1}^{N} \ln (Σ_{i = 1}^{n} δ (ϵ_{ik} | c_{i} (t), σ_{i} (t)) \Pr (L_{i} (t)))

δ (ϵ_{ik} | c_{i} (t), σ_{i} (t)) = \frac{f (ϵ_{ik} | c_{i} (t), σ_{i} (t))}{\int_{0}^{+ \infty} f (ϵ_{ik} | c_{i} (t), σ_{i} (t)) d ϵ_{ik}}

Wherein:

f (ϵ_{ik} | c_{i} (t), σ_{i} (t)) = \frac{1}{\sqrt{2 π {(σ_{i} (t))}^{2}}} \exp \frac{{(ϵ_{ik} - c_{i} (t))}^{2}}{- 2 {(σ_{i} (t))}^{2}}

c_{i} (t) = \frac{Σ_{k = 1}^{N} q_{ik} (t - 1) ϵ_{ik}}{Σ_{k = 1}^{N} q_{ik} (t - 1)}

{(σ_{i} (t))}^{2} = \frac{Σ_{k = 1}^{N} q_{ik} (t - 1) {(ϵ_{ik} - c_{i} (t))}^{2}}{Σ_{k = 1}^{N} q_{ik} (t - 1)}

q_{ik} (t - 1) = \frac{δ (ϵ_{ik} | c_{i} (t - 1), σ_{i} (t - 1)) \Pr (L_{i} (t - 1))}{Σ_{i = 1}^{n} δ (ϵ_{ik} | c_{i} (t - 1), σ_{i} (t - 1)) \Pr (L_{i} (t - 1))}

\Pr (L_{i} (t)) = \frac{1}{N} Σ_{k = 1}^{N} q_{ik} (t - 1)

Wherein, t = 1, 2, ... is the iteration number;

n is the number of semantic cells;

N is the number of intermediate vectors in the training set;

q_{ik} (t) is the weight of the distance to the semantic cell center;

c_{i} (t) is the location parameter;

σ_{i} (t) is the scale parameter;

\Pr (L_{i} (t)) is the contribution parameter;

The iteration stops when the absolute value of the difference between the objective function values of two successive iterations, | J_{LP} (t) - J_{LP} (t - 1) |, is smaller than the set threshold.

The larger the threshold, the faster the iteration converges and the shorter the training time, but the resulting recognition model is less accurate and the recognition rate is lower. Conversely, the smaller the threshold, the slower the convergence (the iteration may even fail to converge) and the longer the training time, but the resulting recognition model is more accurate and the recognition rate is higher. A reasonable threshold must therefore be set, and its value can be adjusted according to the needs of the application.

Preferably, the threshold is 0.001 to 0.010.
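The initialization of step (3-2) and the iterative update of step (3-3) can be sketched in Python as follows. This is only an illustration of the formulas above under stated assumptions: the cluster centers are taken as given (the clustering of step (3-1) is not shown), the integral of f over [0, +∞) is evaluated in closed form with the error function, and the lower bound on σ and the `max_iter` cap are safeguards added here that do not appear in the source. All function names and the toy data are hypothetical.

```python
import math

def init_parameters(vectors, centers):
    """Distances eps[i][k] and the initial c_i(0), sigma_i(0), Pr(L_i(0)) = 1/n."""
    n, N = len(centers), len(vectors)
    eps = [[math.dist(x, p) for x in vectors] for p in centers]
    c = [sum(row) / N for row in eps]
    sigma = [math.sqrt(sum((e - ci) ** 2 for e in row) / N)
             for row, ci in zip(eps, c)]
    return eps, c, sigma, [1.0 / n] * n

def delta(e, c, sigma):
    """Gaussian density f renormalised over [0, +inf)."""
    f = math.exp(-((e - c) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    mass = 0.5 * (1.0 + math.erf(c / (sigma * math.sqrt(2))))  # integral of f from 0 to +inf
    return f / mass

def objective(eps, c, sigma, pr):
    """J_LP: sum over k of ln(sum over i of delta * Pr(L_i))."""
    n, N = len(eps), len(eps[0])
    return sum(math.log(sum(delta(eps[i][k], c[i], sigma[i]) * pr[i] for i in range(n)))
               for k in range(N))

def train(eps, c, sigma, pr, threshold=0.005, max_iter=200):
    n, N = len(eps), len(eps[0])
    j_prev = objective(eps, c, sigma, pr)
    for _ in range(max_iter):  # max_iter is an added safeguard
        # q_ik(t-1), computed from the parameters of the previous iteration
        w = [[delta(eps[i][k], c[i], sigma[i]) * pr[i] for k in range(N)] for i in range(n)]
        col = [sum(w[i][k] for i in range(n)) for k in range(N)]
        q = [[w[i][k] / col[k] for k in range(N)] for i in range(n)]
        qs = [sum(row) for row in q]
        c = [sum(q[i][k] * eps[i][k] for k in range(N)) / qs[i] for i in range(n)]
        sigma = [max(math.sqrt(sum(q[i][k] * (eps[i][k] - c[i]) ** 2
                                   for k in range(N)) / qs[i]), 1e-6)  # floor is an assumption
                 for i in range(n)]
        pr = [qs[i] / N for i in range(n)]
        j = objective(eps, c, sigma, pr)
        if abs(j - j_prev) < threshold:
            break
        j_prev = j
    return c, sigma, pr

# Toy data: six 2-D intermediate vectors and two assumed cluster centers.
vectors = [(0.0, 0.4), (0.6, 0.0), (0.0, -0.5), (2.5, 0.5), (2.9, 0.0), (2.5, -0.6)]
centers = [(0.0, 0.0), (2.5, 0.0)]
eps, c0, sigma0, pr0 = init_parameters(vectors, centers)
c, sigma, pr = train(eps, c0, sigma0, pr0)
```

The contribution parameters returned by `train` always sum to 1, because each column of q sums to 1 by construction.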

In step (3), the coverage of the semantic cell mixture model is computed as:

| LP | = Σ_{i = 1}^{n} | L_{i} | \cdot \Pr (L_{i}); | L_{i} | = Σ_{k = 1}^{N} μ_{L_{i}} (X_{k})

| LP | denotes the coverage of the semantic cell mixture model LP;

| L_{i} | denotes the coverage of the training set by the i-th semantic cell L_{i};

N is the number of intermediate vectors in the training set;

X_{k} is the k-th intermediate vector in the training set;

μ_{L_{i}} (X_{k}) denotes the degree of membership of the given feature vector X_{k} in L_{i};
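As a small worked illustration of the coverage formula, the following Python sketch computes | LP | from a table of membership values. The membership values `mu[i][k]` are supplied directly as toy numbers; how they are obtained from the cell parameters is not part of this sketch, and the function name is hypothetical.

```python
def coverage(mu, pr):
    """|LP| = sum_i |L_i| * Pr(L_i), with |L_i| = sum_k mu[i][k].

    mu[i][k] is the membership of intermediate vector X_k in cell L_i,
    pr[i] is the contribution parameter Pr(L_i).
    """
    return sum(sum(mu_i) * pr_i for mu_i, pr_i in zip(mu, pr))

# Toy values: two semantic cells, three intermediate vectors.
mu = [[1.0, 0.5, 0.0],
      [0.2, 0.8, 1.0]]
pr = [0.25, 0.75]
# |L_1| = 1.5 and |L_2| = 2.0, so |LP| = 1.5 * 0.25 + 2.0 * 0.75 = 1.875
lp = coverage(mu, pr)
```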

The kernel function based on the semantic cell mixture model in step (4) is:

K (X, Z) = \exp (- | | Σ_{i = 1}^{n} μ_{L_{i}} (X) \Pr (L_{i}) - Σ_{i = 1}^{n} μ_{L_{i}} (Z) \Pr (L_{i}) | |)

Wherein L_{i} denotes the i-th semantic cell;

μ_{L_{i}} (X) denotes the degree of membership of the given feature vector X in L_{i};

μ_{L_{i}} (Z) denotes the degree of membership of the given feature vector Z in L_{i};

X and Z denote the dimensionality-reduced feature vectors of the two speech signals being compared during the SVM computation;

The SVM classifier of each speaker is constructed with this kernel function;

The dimensionality-reduced feature vectors and the parameters of the corresponding semantic cell mixture model are used as input to train the recognition model of the SVM classifier.

The SVM classifiers are trained as one-vs-rest (OVR) models: during training, examples belonging to the speaker are treated as positive examples and examples not belonging to the speaker as negative examples.
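The kernel above can be sketched in Python as follows. Because the weighted membership sum is a scalar, the norm reduces to an absolute value. The sketch takes the membership values of X and Z in each cell as precomputed inputs, which is an assumption made to keep the example self-contained; the function names are hypothetical. A Gram matrix built this way could be passed to an SVM implementation that accepts precomputed kernels.

```python
import math

def weighted_membership(mu_values, pr):
    """sum_i mu_{L_i}(X) * Pr(L_i) for one feature vector."""
    return sum(m * p for m, p in zip(mu_values, pr))

def cell_kernel(mu_x, mu_z, pr):
    """K(X, Z) = exp(-|sum_i mu_i(X) Pr_i - sum_i mu_i(Z) Pr_i|)."""
    return math.exp(-abs(weighted_membership(mu_x, pr) - weighted_membership(mu_z, pr)))

def gram_matrix(memberships, pr):
    """Kernel value for every pair of vectors, given their membership rows."""
    return [[cell_kernel(a, b, pr) for b in memberships] for a in memberships]

# Toy membership rows for three utterances in a two-cell model.
pr = [0.5, 0.5]
memberships = [[1.0, 0.0], [0.0, 0.0], [0.9, 0.1]]
gram = gram_matrix(memberships, pr)
```

The kernel equals 1 whenever the two weighted memberships coincide and decreases towards 0 as they move apart; in the toy data, the first and third rows have the same weighted membership (0.5), so their kernel value is 1.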

Step (5), recognizing the unknown speaker, proceeds as follows:

(5-1) extract features from the input speech signal of the unknown speaker to generate a feature vector, and select the same features as in the dimensionality-reduced feature vector of step (3) to form the dimensionality-reduced feature vector a;

In (5-1), the generated feature vector may be normalized first and then reduced to the dimensionality-reduced feature vector, or the dimensionality-reduced feature vector a may be obtained first and then normalized.

Preferably, the dimensionality-reduced feature vector a obtained in step (5-1) is then normalized using standard scores; normalizing after dimensionality reduction improves computational performance and saves computing time and storage space.

The normalization uses the same mean μ and standard deviation σ' as the preprocessing of step (2) and normalizes the feature values column by column (that is, feature by feature), namely

x^{'} = \frac{x - μ_{j}}{{σ^{'}}_{j}},

Wherein x and x' are the feature values before and after standardization, and μ_{j} and σ'_{j} are the mean and standard deviation obtained for the feature j corresponding to x when the standard scores were computed in step (2).

(5-2) input the obtained dimensionality-reduced feature vector into the SVM classifier of each speaker and compute the posterior probability P_{j} of the SVM classification, j = 1, ..., W, where W is the number of speakers; the range of P_{j} is [-1, +1], indicating whether the estimate is closer to a negative example (-1) or a positive example (+1).

(5-3) take the speaker with the largest posterior probability as the decision result, specifically:

The index of the recognized speaker is

kk = \{\begin{matrix} \underset{j}{\arg \max} P_{j} & , if (\max P_{j} > T) \\ 0 & , else \end{matrix},

Wherein kk = 0 indicates that the speaker is not a member of the original speaker set, and T is the decision threshold.

Raising the decision threshold increases the precision of the system but lowers its recall (that is, the speakers of more test utterances are classified as outside the set), and vice versa.

Preferably, the decision threshold T is 0.01 to 0.10.
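The recognition stage of steps (5-1) to (5-3) can be illustrated with a short Python sketch. It assumes that the indices of the selected features, the training means and standard deviations, and the per-speaker classifier outputs P_{j} are already available; the function names and all numbers are hypothetical.

```python
def reduce_and_normalise(vector, selected, means, stds):
    """Keep only the selected features and apply x' = (x - mu_j) / sigma'_j
    with the statistics stored from training."""
    return [(vector[j] - means[j]) / stds[j] for j in selected]

def decide(scores, threshold):
    """Return the 1-based index of the speaker with the largest score P_j,
    or 0 if that score does not exceed the decision threshold T."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best + 1 if scores[best] > threshold else 0

# Toy numbers: three features, of which features 1 and 0 were selected.
a = reduce_and_normalise([5.0, 20.0, 7.0], selected=[1, 0],
                         means=[2.0, 20.0, 0.0], stds=[1.0, 10.0, 1.0])

# Toy classifier outputs for W = 3 speakers, each in [-1, +1].
accepted = decide([-0.8, 0.4, 0.05], threshold=0.1)    # speaker 2
rejected = decide([-0.9, 0.05, -0.2], threshold=0.1)   # 0: outside the set
```

Normalizing only the selected features, as done here, corresponds to the second of the two orderings described for step (5-1).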

Compared with the prior art, the invention has the following beneficial effects:

1. The invention uses a speaker recognition method based on semantic cells, which solves the problem that the kernel function of existing SVM models is not specifically optimized for a particular speaker.

2. When choosing the speech features used to train the classifier, the invention uses a feature selection method based on semantic cells, which is more targeted than commonly used methods and therefore reduces the storage space required for the model.

3. The SVM classifier recognition model constructed by the invention has higher accuracy.

Description of the drawings

Fig. 1 is a flow chart of the speaker recognition method based on semantic cells according to the invention.

Fig. 2 is a flow chart of semantic cell mixture model updating and speaker recognition according to the invention.

Embodiment

The invention is further described below with reference to Figs. 1 and 2. The implementation of the invention comprises five steps.

Step (1), building the speech library: each input speech signal must carry an identifier of the speaker, such as a name.

The speech library built in this embodiment contains 138 speakers (106 male, 32 female) with 10 utterances each, for a total of 1380 speech recordings.

The preprocessing in step (2) comprises pre-emphasis, framing and windowing; for the detailed procedure, refer to the patent application with publication number CN104200814A.

(2-1) The power spectrum of a speech signal decreases as frequency increases, and most of its energy is concentrated in the low-frequency range. As a result, the signal-to-noise ratio at the high-frequency end of the speech signal may fall to an unacceptable level. Because the higher-frequency components of a speech signal carry little energy, they rarely have enough amplitude to produce the maximum frequency deviation; the signal amplitudes that produce the maximum frequency deviation are mostly due to the low-frequency components, and the frequency deviation produced by the high-frequency components, whose amplitude is usually small, is much smaller. Artificially emphasizing (boosting) the high-frequency components of the transmitter input modulating signal by pre-emphasis effectively improves the signal-to-noise ratio of the speech signal. Preferably, pre-emphasis filtering is performed with the transfer function H(z) = 1 - 0.97 z^{-1};

(2-2) Dividing speech of a certain length into many frames allows it to be analyzed with the methods used for stationary processes. The invention therefore divides the speech signal into short time segments, each called a frame, each frame being approximately 10-30 ms long. To make the transition between frames smooth and preserve continuity, overlapping segmentation is used, that is, the end of each frame overlaps the beginning of the next frame.

(2-3) To reduce the truncation effect of the speech frames, the slope at both ends of each frame is reduced so that the two ends do not change abruptly but taper smoothly to zero; this is done by multiplying the speech frame by a window function. The invention may use the window function of any finite impulse response (FIR) filter. Framing and windowing divide each speech signal into a number of short speech segments; each short segment is called a frame, and each frame has a number (the frame number) assigned in time order.

The framing of step (2-2) and the windowing of step (2-3) are carried out simultaneously, that is, the speech is divided into frames in the course of windowing. Even in the spectral analysis of strongly periodic voiced sounds, multiplying by a suitable window function suppresses the effect of changes in the relative phase between the pitch period and the analysis interval, so that a stable spectrum is obtained.

Feature extraction on the speech frames is carried out by the following steps:

(2-4) Extract the feature values of each frame of the current speech signal. The speech features are the following 9 statistics (the samples being the data extracted frame by frame) of the 1st- to 12th-order Mel-frequency cepstral coefficients (MFCC):

Maximum, minimum, frame position of the maximum, frame position of the minimum, arithmetic mean, linear regression coefficients (slope, intercept), skewness, kurtosis.

(2-5) Build the feature vector of the current speech signal from the statistics of each acoustic feature.

In step (2-5), the statistics of all speech features of the current speech signal, and of the corresponding first-order difference coefficients, are arranged directly into a row vector, which is the feature vector of the current speech signal. The statistics may be arranged in any order, but the same order must be used for all speech signals. For each speech signal the resulting feature vector has 108 dimensions, that is, 108 features.

(2-6) Feature vector normalization. Because the semantic cell model uses a distance function to measure the degree of membership of any feature vector in a prototype (semantic nucleus), the feature vectors obtained must be normalized column by column so that differences in scale between features do not affect the model computation.
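Steps (2-4) to (2-6) can be illustrated with the Python sketch below. It assumes that the MFCC coefficients have already been extracted as one list of 12 values per frame, and it builds the vector from the 12 MFCC coefficients only, which matches the stated 108 dimensions (12 coefficients times 9 statistics). The moment-based definitions of skewness and kurtosis, the use of the population standard deviation, and the replacement of a zero standard deviation by 1 are assumptions made here; the function names are hypothetical.

```python
import math

def nine_statistics(track):
    """Nine statistics of one MFCC coefficient over the frames (needs >= 2 frames)."""
    n = len(track)
    mean = sum(track) / n
    t_mean = (n - 1) / 2
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y - mean) for t, y in enumerate(track)) / sxx
    intercept = mean - slope * t_mean
    m2 = sum((y - mean) ** 2 for y in track) / n
    m3 = sum((y - mean) ** 3 for y in track) / n
    m4 = sum((y - mean) ** 4 for y in track) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0
    return [max(track), min(track), track.index(max(track)), track.index(min(track)),
            mean, slope, intercept, skew, kurt]

def utterance_vector(mfcc_frames):
    """One list of 12 coefficients per frame -> 12 * 9 = 108 feature values."""
    return [s for track in zip(*mfcc_frames) for s in nine_statistics(list(track))]

def zscore_columns(vectors):
    """Normalise each feature (column); also return the means and standard deviations."""
    n = len(vectors)
    cols = list(zip(*vectors))
    means = [sum(col) / n for col in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0  # 1.0 if constant
            for col, m in zip(cols, means)]
    normalised = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in vectors]
    return normalised, means, stds

# Toy input: 5 frames, coefficient c of frame f has the value f + c.
frames = [[float(f + c) for c in range(12)] for f in range(5)]
vec = utterance_vector(frames)
normalised, means, stds = zscore_columns([[1.0, 10.0], [3.0, 30.0]])
```

The means and standard deviations returned by `zscore_columns` are the values that would later be reused when normalizing the features of an unknown speaker.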

Step (3) uses the feature selection method based on semantic cells to reduce the dimensionality of the feature vectors generated in step (2), as follows. The steps below are carried out once for each speaker, so that after step (3) each speaker has a feature set optimized for that speaker.

(3-1) Set the number of features to select (the dimension of the reduced feature vector) D = 36;

Initialize the number of selected features k = 0 and the set of selected features S = ∅ (the empty set); S holds features expressed in intermediate-vector form. The feature set of the speaker is M = {m_{1}, ..., m_{dd}}. As shown in Table 1, which lists the feature set of each speaker, sp_{1} ~ sp_{h} denote the speech signals of the speaker, and each column is the feature vector of one speech signal; m_{1} ~ m_{dd} denote the features of each speech signal, and each row is the intermediate vector formed by one feature; dd is the number of features of each speech signal, dd = 108; h is the number of speech signals, h = 10.

(3-2) For any feature m_{v} ∈ M, the row in which m_{v} lies forms an intermediate vector; using the features S ∪ {m_{v}} as the training set, train a semantic cell mixture model.

Table 1

The semantic cell mixture model is trained as follows:

(3-2-1) Cluster all intermediate vectors in the training set to obtain multiple cluster centers, which serve as the centers of the semantic cells. A cluster center is called a semantic nucleus; the number of semantic nuclei, denoted n, is generally 3 to 10, and n = 5 in this embodiment;

(3-2-2) Compute the initial parameter values:

Set the initial contribution parameter of each semantic cell to the mixture model to \Pr (L_{i} (0)) = 1 / n;

For each semantic cell, use the distances from each intermediate vector in the training set to the center of the semantic cell to compute the location parameter and scale parameter of the semantic cell; the initial values of the location parameter, scale parameter and contribution parameter of the i-th semantic cell of the mixture model are denoted c_{i} (0), σ_{i} (0) and \Pr (L_{i} (0)) respectively.

ϵ_{ik} = d_{i} (X_{k}, P_{i}) = | | X_{k} - P_{i} | |

P_{i} is the center of the i-th semantic cell, X_{k} is the k-th intermediate vector in the training set, i = 1, 2, ..., n, k = 1, 2, ..., N,

c_{i} (0) = \frac{1}{N} Σ_{k = 1}^{N} ϵ_{ik},

{(σ_{i} (0))}^{2} = \frac{1}{N} Σ_{k = 1}^{N} {(ϵ_{ik} - c_{i} (0))}^{2},

(3-2-3) After the parameters of each semantic cell have been obtained, set a threshold and update the mixture model iteratively; the objective function of the t-th iteration is:

J_{LP} (t) = Σ_{k = 1}^{N} \ln (Σ_{i = 1}^{n} δ (ϵ_{ik} | c_{i} (t), σ_{i} (t)) \Pr (L_{i} (t)))

The iteration stops when the absolute value of the difference between the objective function values of two successive iterations, | J_{LP} (t) - J_{LP} (t - 1) |, is smaller than the set threshold; the threshold in this embodiment is 0.005.

Wherein, t = 1, 2, ... is the iteration number;

N is the number of intermediate vectors in the training set;

n is the number of semantic cells;

δ (ϵ_{ik} | c_{i} (t), σ_{i} (t)) = \frac{f (ϵ_{ik} | c_{i} (t), σ_{i} (t))}{\int_{0}^{+ \infty} f (ϵ_{ik} | c_{i} (t), σ_{i} (t)) d ϵ_{ik}},

Wherein:

f (ϵ_{ik} | c_{i} (t), σ_{i} (t)) = \frac{1}{\sqrt{2 π {(σ_{i} (t))}^{2}}} \exp \frac{{(ϵ_{ik} - c_{i} (t))}^{2}}{- 2 {(σ_{i} (t))}^{2}},

c_{i} (t) = \frac{Σ_{k = 1}^{N} q_{ik} (t - 1) ϵ_{ik}}{Σ_{k = 1}^{N} q_{ik} (t - 1)},

{(σ_{i} (t))}^{2} = \frac{Σ_{k = 1}^{N} q_{ik} (t - 1) {(ϵ_{ik} - c_{i} (t))}^{2}}{Σ_{k = 1}^{N} q_{ik} (t - 1)}

q_{ik} (t - 1) = \frac{δ (ϵ_{ik} | c_{i} (t - 1), σ_{i} (t - 1)) \Pr (L_{i} (t - 1))}{Σ_{i = 1}^{n} δ (ϵ_{ik} | c_{i} (t - 1), σ_{i} (t - 1)) \Pr (L_{i} (t - 1))};

\Pr (L_{i} (t)) = \frac{1}{N} Σ_{k = 1}^{N} q_{ik} (t - 1)

q_{ik} (t) and the similar q terms are the weights of the distances to the semantic cell centers.

(3-3) Compute the coverage of each semantic cell mixture model obtained in the previous step.

After the v-th feature is added, the coverage of the corresponding semantic cell mixture model is computed as follows:

{| LP |}_{v} = Σ_{i = 1}^{n} | L_{i} | \cdot \Pr (L_{i});

Wherein

| L_{i} | = Σ_{k = 1}^{N} μ_{L_{i}} (X_{k})

{| LP |}_{v} denotes the coverage of the corresponding semantic cell mixture model after the v-th feature, expressed in intermediate-vector form, is added;

μ_{L_{i}} (X_{k}) is the degree of membership of the given feature vector X_{k} in the semantic cell L_{i};

| L_{i} | denotes the coverage of the training set by the semantic cell L_{i};

Select the feature m_{k} with the largest coverage, namely

m_{k} = \underset{m_{v} ∈ M}{\arg \max} {| LP |}_{v}

(3-4) Remove feature m_{k} from the set M: M ← M - {m_{k}}, and add it to the set S: S ← S ∪ {m_{k}}, k ← k + 1.

(3-5) If k < D, go back to step (3-2); otherwise step (3) ends.
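The selection loop of steps (3-1) to (3-5) is a greedy forward selection, which can be sketched in Python as follows. The `score` callback stands in for "train a semantic cell mixture model on the training set and return its coverage"; the `toy_score` used in the example is a purely hypothetical placeholder chosen so that the result is easy to check by hand, and is not the coverage measure of the method.

```python
def greedy_select(features, target_count, score):
    """Greedy forward selection.

    features: dict mapping a feature name to its intermediate vector
              (one value per speech signal of the speaker).
    score:    callable taking a training set (list of intermediate vectors)
              and returning a number to maximise.
    """
    remaining = dict(features)   # the set M
    selected = []                # names of the features in S, in selection order
    while len(selected) < target_count and remaining:
        chosen = [features[name] for name in selected]
        best = max(remaining, key=lambda name: score(chosen + [remaining[name]]))
        selected.append(best)    # S <- S U {m_k}
        del remaining[best]      # M <- M - {m_k}
    return selected

def toy_score(training_set):
    """Placeholder score: prefers features that vary little across utterances."""
    return -sum(max(v) - min(v) for v in training_set)

features = {"a": [1, 1, 1], "b": [0, 5, 9], "c": [2, 3, 2]}
picked = greedy_select(features, 2, toy_score)
```

With D = 36 and 108 candidate features, this loop trains one model per remaining candidate in every round, so the cost of the scoring callback dominates the running time.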

Step (4): according to pattern recognition theory, patterns that are not linearly separable in a low-dimensional space may become linearly separable when mapped nonlinearly into a high-dimensional feature space. However, if classification or regression is carried out directly in the high-dimensional space, the form and parameters of the nonlinear mapping function and the dimension of the feature space must be determined, and the greatest obstacle is the "curse of dimensionality" encountered when computing in the high-dimensional feature space. Kernel functions solve these problems effectively.

The kernel function based on the semantic cell mixture model is:

K (X, Z) = \exp (- | | Σ_{i = 1}^{n} μ_{L_{i}} (X) \Pr (L_{i}) - Σ_{i = 1}^{n} μ_{L_{i}} (Z) \Pr (L_{i}) | |),

Wherein δ (ϵ | c_{i}, σ_{i}) is computed in the same way as in step (3-2-3).

According to this kernel function, if X and Z are both very close to the prototypes of the semantic cells, the kernel value is 1; otherwise the kernel value is approximately 0.

The SVM classifier of each speaker is constructed with this kernel function;

The dimensionality-reduced feature vectors and the parameters of the corresponding semantic cell mixture model are used as input to train the recognition model of the SVM classifier. The classifiers are trained as one-vs-rest (OVR) models: during training, examples belonging to the speaker are treated as positive examples and examples not belonging to the speaker as negative examples.

Because SVM classifier models are very widely used in this field and their computation is described in detail in many documents (see, for example, "LIBSVM: a Library for Support Vector Machines" by Chih-Chung Chang and Chih-Jen Lin), it is not described in detail here.

Step (5), recognizing the unknown speaker, proceeds as follows:

(5-1) Extract features from the input speech signal of the unknown speaker to obtain a feature vector, and select the same features as in the dimensionality-reduced feature vector of step (3) to obtain the dimensionality-reduced feature vector a; then compute the standard scores using the same mean μ and standard deviation σ' as in step (2).

(5-2) Input the dimensionality-reduced feature vector a obtained in (5-1) into the SVM classifier of each speaker and compute the posterior probability P_{j} of the SVM classification, j = 1, ..., W (W is the number of speakers); the range of P_{j} is [-1, +1], indicating whether the estimate is closer to a negative example (-1) or a positive example (+1).

The index of the recognized speaker is

kk = \{\begin{matrix} \underset{j}{\arg \max} P_{j} & , if (\max P_{j} > T) \\ 0 & , else \end{matrix},

Wherein kk=0 represents that this speaker does not belong to a member in original speaker set, and T is decision threshold, decision threshold T=0.100 in the present embodiment.
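The decision rule above can be written as a short sketch. Speaker indices are taken to be 1-based so that 0 can denote a speaker outside the original set; the function name `decide_speaker` is hypothetical, and the default threshold is the value used in this embodiment.

```python
def decide_speaker(scores, threshold=0.100):
    """Return the 1-based index of the best-scoring speaker, or 0 if no
    score exceeds the decision threshold T.

    scores: list of P_j values, one per speaker classifier, each in [-1, +1].
    """
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best + 1 if scores[best] > threshold else 0
```

For example, `decide_speaker([-0.8, 0.4, 0.05])` selects the second speaker, while `decide_speaker([-0.9, 0.05, -0.2])` returns 0 because no score exceeds T.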

The method of the invention is compared with two other algorithms, as follows:

(1) The recognition method based on semantic cells. It uses the first-layer semantic cell algorithm of the invention with application number 201410402937.1, and uses principal component analysis (PCA) to reduce the feature vectors to 36 dimensions;

(2) A support vector machine (SVM) with a radial basis function (RBF) kernel, which uses the one-versus-rest (OVR) method to handle the multi-class classification problem;

Table 2

Table 2 shows the experimental results of the three methods; the training set and test set data are split according to the 5-fold cross-validation method.
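A 5-fold split of the data can be sketched as follows. The text does not state how samples are assigned to folds, so the interleaved assignment without shuffling used here is an assumption, and the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are assigned to folds in an interleaved fashion (assumption);
    each fold serves as the test set exactly once.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each method would be trained on the four training folds and evaluated on the remaining fold, with the results combined over the five rounds.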

Claims

1. A speaker recognition method based on a semantic cell mixing model, comprising the following steps:

(3) based on a semantic cell feature selection method, performing dimensionality reduction on each feature vector obtained in step (2) to obtain the corresponding dimensionality-reduced feature vector, and training the semantic cell mixing model of each speaker;

2. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the dimensionality reduction in step (3) is as follows: a predetermined number of features is selected from the feature vectors of each speaker; in each selection round, the features in the feature vectors of the speaker are taken one at a time and combined with the features already selected to form intermediate vectors; all features expressed in intermediate-vector form are used as the training set to train a semantic cell mixing model; the feature giving the semantic cell mixing model the largest coverage rate is added to the dimensionality-reduced feature vector; and this step is repeated until the number of features in the dimensionality-reduced feature vector reaches the predetermined number.
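The greedy selection described in claim 2 can be sketched as below. The callback `coverage_of` is a hypothetical stand-in: it is assumed to train a semantic cell mixing model on the intermediate vectors formed from a given feature subset and to return that model's coverage rate. The function name `greedy_select` is likewise illustrative.

```python
def greedy_select(n_features, target_count, coverage_of):
    """Forward feature selection driven by model coverage.

    n_features:   number of features in the full feature vector.
    target_count: predetermined number of features to keep.
    coverage_of:  hypothetical callback taking a list of feature indices and
                  returning the coverage rate of a model trained on them.
    """
    selected = []
    while len(selected) < min(target_count, n_features):
        candidates = [f for f in range(n_features) if f not in selected]
        # Try each remaining feature together with those already chosen
        # and keep the one whose trained model has the largest coverage.
        best = max(candidates, key=lambda f: coverage_of(selected + [f]))
        selected.append(best)
    return selected
```

Because each round trains one model per remaining candidate, the cost grows with both the number of features and the predetermined number to keep.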

3. The speaker recognition method based on a semantic cell mixing model according to claim 2, characterized in that the steps for training the semantic cell mixing model are as follows:

(3-2) calculating the initial parameter values: for each semantic cell, the location parameter and scale parameter of the semantic cell are calculated from the distances between each intermediate vector in the training set and the center of the semantic cell, and the contribution degree parameter of each semantic cell to the semantic cell mixing model is set; the initial values of the location parameter, scale parameter and contribution degree parameter of the i-th semantic cell are denoted c_i(0), σ_i(0) and Pr(L_i(0)) respectively, and the parameters of all semantic cells together form the parameters of the semantic cell mixing model;

\epsilon_{ik} = d_{i}(X_{k}, P_{i}) = \| X_{k} - P_{i} \|

c_{i}(0) = \frac{1}{N} \sum_{k=1}^{N} \epsilon_{ik}

(\sigma_{i}(0))^{2} = \frac{1}{N} \sum_{k=1}^{N} (\epsilon_{ik} - c_{i}(0))^{2}

d_i denotes the distance from X_k to P_i;

P_i is the center of the i-th semantic cell;

N is the number of intermediate vectors in the training set;

The initial value of the contribution degree parameter of each semantic cell to the mixing model is set to:

\Pr(L_{i}(0)) = 1/n;

J_{LP}(t) = \sum_{k=1}^{N} \ln \left( \sum_{i=1}^{n} \delta(\epsilon_{ik} | c_{i}(t), \sigma_{i}(t)) \Pr(L_{i}(t)) \right)

\delta(\epsilon_{ik} | c_{i}(t), \sigma_{i}(t)) = \frac{f(\epsilon_{ik} | c_{i}(t), \sigma_{i}(t))}{\int_{0}^{+\infty} f(\epsilon_{ik} | c_{i}(t), \sigma_{i}(t)) \, d\epsilon_{ik}}

where:

f(\epsilon_{ik} | c_{i}(t), \sigma_{i}(t)) = \frac{1}{\sqrt{2 \pi (\sigma_{i}(t))^{2}}} \exp \left( - \frac{(\epsilon_{ik} - c_{i}(t))^{2}}{2 (\sigma_{i}(t))^{2}} \right)

c_{i}(t) = \frac{\sum_{k=1}^{N} q_{ik}(t-1) \, \epsilon_{ik}}{\sum_{k=1}^{N} q_{ik}(t-1)}

(\sigma_{i}(t))^{2} = \frac{\sum_{k=1}^{N} q_{ik}(t-1) \, (\epsilon_{ik} - c_{i}(t))^{2}}{\sum_{k=1}^{N} q_{ik}(t-1)}

q_{ik}(t-1) = \frac{\delta(\epsilon_{ik} | c_{i}(t-1), \sigma_{i}(t-1)) \Pr(L_{i}(t-1))}{\sum_{i=1}^{n} \delta(\epsilon_{ik} | c_{i}(t-1), \sigma_{i}(t-1)) \Pr(L_{i}(t-1))}

\Pr(L_{i}(t)) = \frac{1}{N} \sum_{k=1}^{N} q_{ik}(t-1)

t = 1, 2, ... is the iteration number;

n is the number of semantic cells;

q_ik(t) is the weight applied to the distance to the semantic cell center;

c_i(t) is the location parameter;

σ_i(t) is the scale parameter;

Pr(L_i(t)) is the contribution degree parameter;
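The initialization and update formulas of claim 3 can be illustrated with the following sketch. It assumes that the cell centers P_i are already known, that the normalizing integral of δ runs over [0, +∞), and that the scale update uses the newly computed c_i(t); all function names are hypothetical and the code is not the reference implementation of the invention.

```python
import math

def gaussian_pdf(eps, c, sigma):
    """f(eps | c, sigma): normal density with location c and scale sigma."""
    return math.exp(-((eps - c) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def delta(eps, c, sigma):
    """f normalized by its mass on [0, +inf), since distances are non-negative."""
    mass = 0.5 * (1.0 + math.erf(c / (sigma * math.sqrt(2.0))))
    return gaussian_pdf(eps, c, sigma) / mass

def init_params(vectors, centers):
    """Initial values c_i(0), sigma_i(0), Pr(L_i(0)) and the distances eps[i][k]."""
    n, N = len(centers), len(vectors)
    eps = [[math.dist(x, p) for x in vectors] for p in centers]
    c = [sum(row) / N for row in eps]
    sigma = [math.sqrt(sum((e - ci) ** 2 for e in row) / N) for row, ci in zip(eps, c)]
    return eps, c, sigma, [1.0 / n] * n

def log_objective(eps, c, sigma, pr):
    """J_LP: sum over samples of the log of the weighted mixture density."""
    n, N = len(c), len(eps[0])
    return sum(math.log(sum(delta(eps[i][k], c[i], sigma[i]) * pr[i] for i in range(n)))
               for k in range(N))

def em_step(eps, c, sigma, pr):
    """One update of (c_i, sigma_i, Pr(L_i)) from the previous parameters."""
    n, N = len(c), len(eps[0])
    q = [[0.0] * N for _ in range(n)]
    for k in range(N):
        w = [delta(eps[i][k], c[i], sigma[i]) * pr[i] for i in range(n)]
        total = sum(w)
        for i in range(n):
            q[i][k] = w[i] / total  # q_ik(t-1)
    new_c, new_sigma, new_pr = [], [], []
    for i in range(n):
        q_sum = sum(q[i])
        ci = sum(q[i][k] * eps[i][k] for k in range(N)) / q_sum
        var = sum(q[i][k] * (eps[i][k] - ci) ** 2 for k in range(N)) / q_sum
        new_c.append(ci)
        new_sigma.append(math.sqrt(var))
        new_pr.append(q_sum / N)
    return new_c, new_sigma, new_pr
```

In use, `em_step` would be applied repeatedly while `log_objective` is monitored, stopping once a rule such as the threshold of step (3-3) is met. The sketch does not guard against a scale parameter collapsing to zero, which a practical implementation would need to handle.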

4. The speaker recognition method based on a semantic cell mixing model according to claim 3, characterized in that, in step (3-1), n is 3 to 10.

5. The speaker recognition method based on a semantic cell mixing model according to claim 3, characterized in that, in step (3-3), the set threshold is 0.001 to 0.010.

6. The speaker recognition method based on a semantic cell mixing model according to claim 2, characterized in that the coverage rate of the semantic cell mixing model is calculated as:

|LP| = \sum_{i=1}^{n} |L_{i}| \cdot \Pr(L_{i}); \quad |L_{i}| = \sum_{k=1}^{N} \mu_{L_{i}}(X_{k})

|LP| denotes the coverage rate of the semantic cell mixing model;

|L_i| denotes the coverage rate of the i-th semantic cell L_i over the corresponding training set;

μ_{L_i}(X_k) denotes the degree of membership of the feature vector X_k in L_i;

Pr(L_i) denotes the weight parameter of the semantic cell L_i in the semantic cell mixing model.
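The coverage rate formula can be sketched as follows, assuming the membership degrees μ_{L_i}(X_k) have already been computed and are supplied as a nested list; the function name `coverage` is hypothetical.

```python
def coverage(memberships, weights):
    """|LP| = sum_i |L_i| * Pr(L_i), with |L_i| = sum_k mu_{L_i}(X_k).

    memberships: memberships[i][k] is the membership degree of X_k in cell L_i
                 (assumed to be precomputed).
    weights:     weights[i] is Pr(L_i).
    """
    return sum(w * sum(row) for row, w in zip(memberships, weights))
```

A feature subset whose training vectors lie close to the cell prototypes yields larger membership degrees and therefore a larger coverage rate.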

7. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the kernel function based on the semantic cell mixing model in step (4) is:

K(X, Z) = \exp \left( - \left\| \sum_{i=1}^{n} \mu_{L_{i}}(X) \Pr(L_{i}) - \sum_{i=1}^{n} \mu_{L_{i}}(Z) \Pr(L_{i}) \right\| \right)

L_i denotes the i-th semantic cell;

μ_{L_i}(X) denotes the degree of membership of the given feature vector X in L_i;

μ_{L_i}(Z) denotes the degree of membership of the given feature vector Z in L_i;

X and Z denote the dimensionality-reduced feature vectors corresponding to the two voice samples being compared;

The SVM classifier of each speaker is constructed using this kernel function;

The dimensionality-reduced feature vectors and the parameters of the semantic cell mixing model are used as input to train the recognition model of the SVM classifier;

The recognition model of the SVM classifier is of the one-versus-rest type: during training, examples belonging to the speaker are treated as positive examples, and examples not belonging to the speaker are treated as negative examples.

8. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the process of identifying an unknown speaker in step (5) is as follows:

(5-1) features are extracted from the input voice signal of the unknown speaker to generate a feature vector, and the same features as in the dimensionality-reduced feature vector of step (3) are selected to obtain the dimensionality-reduced feature vector a;

(5-2) the dimensionality-reduced feature vector a is input into the SVM classifier corresponding to each speaker, and the posterior probability P_j of the SVM classification is calculated, j = 1, ..., W, where W is the number of speakers and the range of the posterior probability P_j is [-1, +1];

(5-3) the speaker with the largest posterior probability value among all speakers is chosen as the result, specifically: the index of the identified speaker is

kk = \begin{cases} \arg\max_{j} P_{j}, & \text{if } \max_{j} P_{j} > T \\ 0, & \text{otherwise} \end{cases}

9. The speaker recognition method based on a semantic cell mixing model according to claim 8, characterized in that the decision threshold T is 0.01 to 0.10.

10. The speaker recognition method based on a semantic cell mixing model according to claim 8, characterized in that, after the dimensionality-reduced feature vector a is obtained in step (5-1), it is normalized using the standard score.