CN104538036A - Speaker recognition method based on semantic cell mixing model - Google Patents


Info

Publication number
CN104538036A
Authority
CN
China
Prior art keywords
semantic
sigma
speaker
model
mixing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510026239.0A
Other languages
Chinese (zh)
Inventor
孙凌云
何博伟
尤伟涛
李彦
郑楷洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Zhejiang University ZJU
Priority to CN201510026239.0A
Publication of CN104538036A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a speaker recognition method based on a semantic cell mixing model. The method comprises the following steps: (1) establishing a voice library, wherein the voice library comprises multiple voice signals of multiple speakers; (2) preprocessing each voice signal in the voice library and extracting voice features, thereby obtaining the feature vectors of each speaker; (3) performing dimensionality reduction on the feature vectors with a feature selection method based on semantic cells to obtain dimensionality-reduced feature vectors, and training the semantic cell mixing model; (4) constructing an SVM classifier for each speaker using a kernel function based on the semantic cell mixing model, and training the recognition model of the SVM classifier; and (5) recognizing an unknown speaker using the recognition model. The method addresses the problem that the kernel function of a conventional SVM model is not optimized for a specific speaker. In addition, the voice features used to train the classifier are selected in a more targeted way than with conventional common methods, and the space needed to store the model can be reduced.

Description

A speaker recognition method based on a semantic cell mixing model
Technical field
The present invention relates to the fields of signal processing and pattern recognition, and in particular to a speaker recognition method based on a semantic cell mixing model.
Background art
Speaker recognition, also known as speaker identification, is the process of analyzing the voice signal produced by an unknown speaker (for example by feature extraction), automatically determining whether the speaker belongs to the set of registered speakers, and identifying the specific speaker. Because the shape and size of the vocal tract, larynx and other speech organs differ between individuals, no two individuals have identical voice features (see Kinnunen T, Li H. An overview of text-independent speaker recognition: from features to supervectors. Speech Communication, 2010, 52(1): 12-40). The technology can be used in telephone banking, voice access control, teleshopping and other processes in which the operator must be authenticated.
Current speaker recognition methods generally include the following two steps: 1. training a given classifier model on the speaker data in a corpus; widely used models include template models, Gaussian mixture models (GMM), hidden Markov models (HMM) and support vector machines (SVM). 2. feeding the voice of an unknown speaker into the recognition system, matching it against the models of the known speakers, and deciding whether the unknown speaker belongs to the set of registered speakers.
Step 1 requires feature extraction from the audio signal. The procedure in common use is: 1. apply pre-emphasis, framing and windowing to the sampled voice signal (waveform); together these operations are called preprocessing. 2. extract features from the preprocessed signal, typically Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and the like. These are vocal-tract-based features; they are robust, descriptive and easy to implement.
Semantic cell (Information Cell) theory was jointly proposed by Tang Yongchuan and Lawry J. (see TANG Y, LAWRY J. Information Cell Mixture Models: The Cognitive Representations of Vague Concepts [C] // Integrated Uncertainty Management and Applications. Heidelberg, Berlin: Springer, 2010: 371-382). It is founded on fuzzy computing and prototype theory. Its main idea is that a concept is not represented by formal rules or mappings but by its prototypes, and that membership in a concept is judged by similarity to those prototypes. The theory has been applied to predicting the Mackey-Glass time series and to the sunspot problem, where it outperformed algorithms such as Kim & Kim and the autoregressive model.
Semantic cells have a transparent cognitive structure, are consistent with the cognitive process by which humans learn concepts, rest on a solid basis in cognitive psychology and a rigorous mathematical definition, and are inherently suited to describing vague concepts. Speaker recognition is a typical problem in the field of vague concepts: according to current research, the voice characteristics of a speaker are a vague concept that is currently difficult to define by specific rules. Because semantic cells express a concept through prototypes and do not depend on concrete classification rules, they are suitable for speaker recognition.
The application with publication number CN104200814A discloses a speech emotion recognition method based on semantic cells, comprising: building a voice library; preprocessing each voice signal in the voice library and extracting emotional features; computing the feature vector of each voice signal from the extraction results; training a semantic-cell-based mixture model on the feature vectors as the recognition model of the classifier; and using this recognition model to identify the emotion category of the voice signal to be recognized. That method is a two-layer semantic cell recognition method: a mixture model of two layers of semantic cells, one identifying the speaker and the other the speaker's emotion, is used to build the recognition model for speech emotion. The recognition model built in this way achieves high accuracy in speech emotion recognition and, while matching the recognition accuracy of the SVM algorithm, substantially reduces the amount of data needed to store the recognition model, so it has advantages in both space complexity and recognition accuracy. Its shortcoming is that the feature vectors are reduced with principal component analysis (PCA), a purely statistical approach that is not well targeted. Moreover, that invention uses the semantic cell mixing model itself as the recognition model of the classifier, and its accuracy is limited when it is applied to speaker recognition.
Summary of the invention
The invention provides a speaker recognition method based on a semantic cell mixing model. The invention uses an SVM classifier whose kernel function is based on the semantic cell mixing model, and distinguishes speakers through the recognition model of the SVM classifier.
A speaker recognition method based on a semantic cell mixing model comprises the following steps:
(1) build a voice library, the voice library comprising multiple voice signals of multiple speakers;
(2) preprocess each voice signal of each speaker in the voice library and extract voice features to obtain the feature vectors of each speaker;
(3) using a feature selection method based on semantic cells, reduce the dimensionality of each feature vector generated in step (2) to obtain dimensionality-reduced feature vectors, and train the semantic cell mixing model of each speaker;
(4) construct the SVM classifier of each speaker using a kernel function based on the semantic cell mixing model, and train the recognition model of the SVM classifier;
(5) recognize an unknown speaker using the recognition model of the SVM classifier.
In step (2), each voice signal is preprocessed to obtain its feature vector; since each speaker has multiple voice signals, preprocessing yields the feature vectors of each speaker.
The preprocessing in step (2) comprises pre-emphasis, framing and windowing.
(2-1) apply pre-emphasis filtering with the transfer function $H(z) = 1 - 0.97z^{-1}$;
(2-2) divide the voice signal into short time segments, each called a frame, with a frame length of roughly 10-30 ms;
(2-3) window each speech frame with a Hamming window function;
(2-4) extract the features of each frame of the current voice signal; the features are the following 9 statistics of each of the 1st- to 12th-order Mel-frequency cepstral coefficients (MFCC): maximum, minimum, frame position of the maximum, frame position of the minimum, arithmetic mean, linear regression coefficients (slope and intercept), skewness and kurtosis;
(2-5) build the feature vector of the current voice signal from the statistics of the features;
(2-6) normalize the feature vectors with the standard score (z-score) to obtain the set of candidate features.
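The statistics listed in step (2-4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the per-frame MFCC values are already available as plain Python lists (one list per coefficient order), and the function names `track_statistics` and `utterance_vector` are hypothetical. The choice of population moments for skewness and kurtosis is also an assumption, since the text does not say which estimator is used.

```python
def track_statistics(values):
    """Return the nine statistics of one coefficient tracked over frames."""
    n = len(values)  # assumes at least two frames and a non-constant track
    mean = sum(values) / n
    max_v, min_v = max(values), min(values)
    max_pos, min_pos = values.index(max_v), values.index(min_v)  # first occurrence

    # Least-squares line: value ~ slope * frame_index + intercept.
    t_mean = (n - 1) / 2.0
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_tv = sum((t - t_mean) * (v - mean) for t, v in enumerate(values))
    slope = s_tv / s_tt
    intercept = mean - slope * t_mean

    # Population central moments give skewness and (non-excess) kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return [max_v, min_v, max_pos, min_pos, mean,
            slope, intercept, skewness, kurtosis]


def utterance_vector(coefficient_tracks):
    """Concatenate the statistics of every coefficient track into one vector."""
    vector = []
    for track in coefficient_tracks:
        vector.extend(track_statistics(track))
    return vector
```

With 12 coefficient tracks and 9 statistics each, `utterance_vector` returns 12 x 9 = 108 values per voice signal.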
Step (3) reduces the dimensionality of the feature vectors with a feature selection method based on semantic cells. This is more targeted than the dimensionality reduction methods in common use and therefore reduces the space required to store the model.
The dimensionality reduction in step (3) proceeds as follows: a predetermined number of features is selected from the feature vectors of each speaker. In each selection round, the candidate features in the speaker's feature vectors are taken one at a time and each is formed into an intermediate vector; the candidate, together with all features already selected (also expressed as intermediate vectors), serves as the training set for a semantic cell mixing model. The feature whose semantic cell mixing model has the largest coverage is added to the dimensionality-reduced feature vector. This round is repeated until the dimensionality-reduced feature vector contains the predetermined number of features.
When the predetermined number is small, model training and recognition are fast; when it is large, accuracy is higher but model training and recognition are slow.
Preferably, the predetermined number is 30% to 50% of the total number of features.
The semantic cell mixing model in step (3) is trained as follows:
(3-1) cluster the intermediate vectors in the training set to obtain multiple cluster centers, which serve as the centers of the semantic cells; a semantic cell mixing model consists of n semantic cells, that is, n cluster centers with different weights;
The number of semantic cells n affects both the recognition result and performance: when n is small, the semantics of a complex concept may be summarized unclearly, but model training and recognition are fast; when n is large, the semantics of a complex concept are summarized well, but model training and recognition are slow.
Preferably, n is 3 to 10.
(3-2) compute the initial parameter values: for each semantic cell, compute the location parameter and scale parameter of the semantic cell from the distances between each intermediate vector in the training set and the center of that semantic cell, and set the contribution parameter of each semantic cell to the mixing model. The initial values of the location parameter, scale parameter and contribution parameter of the i-th semantic cell are denoted $c_i(0)$, $\sigma_i(0)$ and $\Pr(L_i(0))$ respectively; the parameters of all semantic cells together form the parameters of the semantic cell mixing model.
The contribution parameters of all semantic cells to the semantic cell mixing model are set equal, namely
$\Pr(L_i(0)) = 1/n$;
The distance $\epsilon_{ik}$ between the k-th intermediate vector and the center of the i-th semantic cell is computed as:
$\epsilon_{ik} = d_i(X_k, P_i) = \lVert X_k - P_i \rVert$
where $P_i$ is the center of the i-th semantic cell;
$X_k$ is the k-th intermediate vector in the training set, $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, N$;
N is the number of intermediate vectors in the training set;
$c_i(0) = \frac{1}{N} \sum_{k=1}^{N} \epsilon_{ik}$
$(\sigma_i(0))^2 = \frac{1}{N} \sum_{k=1}^{N} \left(\epsilon_{ik} - c_i(0)\right)^2$
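The initialization formulas above can be sketched as follows. This is a minimal illustration under stated assumptions: the cluster centers are taken as given (the text does not name the clustering algorithm), vectors are plain Python lists, and the names `euclidean` and `initial_parameters` are hypothetical.

```python
import math


def euclidean(x, p):
    """Distance ||x - p|| between a vector and a cell center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))


def initial_parameters(vectors, centers):
    """Return eps[i][k], c_i(0), sigma_i(0) and Pr(L_i(0)) for every cell i."""
    n, big_n = len(centers), len(vectors)
    # eps[i][k]: distance from the k-th intermediate vector to the i-th center.
    eps = [[euclidean(x, p) for x in vectors] for p in centers]
    # Location parameter: mean distance to the center.
    c0 = [sum(row) / big_n for row in eps]
    # Scale parameter: standard deviation of those distances (1/N normalisation).
    sigma0 = [math.sqrt(sum((e - c) ** 2 for e in row) / big_n)
              for row, c in zip(eps, c0)]
    # Equal contribution of every cell at the start.
    prior0 = [1.0 / n] * n
    return eps, c0, sigma0, prior0
```

For example, three vectors at distances 0, 5 and 10 from a single center give a location parameter of 5 and a squared scale parameter of 50/3.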
(3-3) after the initial parameter values of the semantic cell mixing model have been obtained, set a threshold and update the semantic cell mixing model iteratively. The objective function of the t-th iteration is:
$J_{LP}(t) = \sum_{k=1}^{N} \ln\left( \sum_{i=1}^{n} \delta(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) \Pr(L_i(t)) \right)$
$\delta(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) = \dfrac{f(\epsilon_{ik} \mid c_i(t), \sigma_i(t))}{\int_{0}^{+\infty} f(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) \, d\epsilon_{ik}}$
Wherein:
$f(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) = \dfrac{1}{\sqrt{2\pi (\sigma_i(t))^2}} \exp\left( \dfrac{(\epsilon_{ik} - c_i(t))^2}{-2 (\sigma_i(t))^2} \right)$
$c_i(t) = \dfrac{\sum_{k=1}^{N} q_{ik}(t-1) \, \epsilon_{ik}}{\sum_{k=1}^{N} q_{ik}(t-1)}$
$(\sigma_i(t))^2 = \dfrac{\sum_{k=1}^{N} q_{ik}(t-1) \left(\epsilon_{ik} - c_i(t)\right)^2}{\sum_{k=1}^{N} q_{ik}(t-1)}$
$q_{ik}(t-1) = \dfrac{\delta(\epsilon_{ik} \mid c_i(t-1), \sigma_i(t-1)) \Pr(L_i(t-1))}{\sum_{i=1}^{n} \delta(\epsilon_{ik} \mid c_i(t-1), \sigma_i(t-1)) \Pr(L_i(t-1))}$
$\Pr(L_i(t)) = \dfrac{1}{N} \sum_{k=1}^{N} q_{ik}(t-1)$
Wherein, t = 1, 2, ... is the iteration number;
n is the number of semantic cells;
N is the number of intermediate vectors in the training set;
$q_{ik}(t)$ is the weight of the distance to the semantic cell center;
$c_i(t)$ is the location parameter;
$\sigma_i(t)$ is the scale parameter;
$\Pr(L_i(t))$ is the contribution parameter;
The iteration stops when the absolute difference between the objective function values of two successive iterations, $|J_{LP}(t) - J_{LP}(t-1)|$, is less than the set threshold.
The larger the threshold, the sooner the iteration stops and the shorter the training time, but the resulting recognition model is less accurate and the recognition rate is lower. Conversely, the smaller the threshold, the later the iteration stops (and it may fail to converge) and the longer the training time, but the resulting recognition model is more accurate and the recognition rate is higher. A reasonable threshold must therefore be chosen, and its value can be adjusted to the needs of the application.
Preferably, the threshold is 0.001 to 0.010.
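The iterative update above can be sketched as follows. This is a sketch under stated assumptions, not the patent's implementation: distances are passed as `eps[i][k]`, the integral of the Gaussian f over [0, +inf) is evaluated in closed form as Phi(c / sigma), and the constants `SIGMA_FLOOR` and `TINY` as well as the `max_iter` cap are numerical safeguards added for this illustration that do not appear in the source. All function names are hypothetical.

```python
import math

SIGMA_FLOOR = 1e-3   # assumption: keeps the scale parameter from collapsing to 0
TINY = 1e-300        # assumption: avoids log(0) and division by zero


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def delta(eps, c, sigma):
    """Gaussian density in eps, renormalised over [0, +inf)."""
    f = math.exp(-((eps - c) ** 2) / (2.0 * sigma ** 2))
    f /= math.sqrt(2.0 * math.pi * sigma ** 2)
    return f / normal_cdf(c / sigma)  # integral of f over [0, +inf) is Phi(c/sigma)


def objective(eps, c, sigma, prior):
    n, big_n = len(eps), len(eps[0])
    return sum(
        math.log(sum(delta(eps[i][k], c[i], sigma[i]) * prior[i]
                     for i in range(n)) + TINY)
        for k in range(big_n)
    )


def update_once(eps, c, sigma, prior):
    n, big_n = len(eps), len(eps[0])
    # Weights q_ik computed from the parameters of the previous iteration.
    q = [[delta(eps[i][k], c[i], sigma[i]) * prior[i] for k in range(big_n)]
         for i in range(n)]
    for k in range(big_n):
        total = sum(q[i][k] for i in range(n)) + TINY
        for i in range(n):
            q[i][k] /= total
    new_c, new_sigma, new_prior = [], [], []
    for i in range(n):
        weight = sum(q[i]) + TINY
        ci = sum(q[i][k] * eps[i][k] for k in range(big_n)) / weight
        var = sum(q[i][k] * (eps[i][k] - ci) ** 2 for k in range(big_n)) / weight
        new_c.append(ci)
        new_sigma.append(max(math.sqrt(var), SIGMA_FLOOR))
        new_prior.append(sum(q[i]) / big_n)
    return new_c, new_sigma, new_prior


def fit(eps, c, sigma, prior, threshold=0.005, max_iter=200):
    """Iterate until |J(t) - J(t-1)| < threshold or max_iter is reached."""
    j_prev = objective(eps, c, sigma, prior)
    for _ in range(max_iter):
        c, sigma, prior = update_once(eps, c, sigma, prior)
        j_now = objective(eps, c, sigma, prior)
        if abs(j_now - j_prev) < threshold:
            break
        j_prev = j_now
    return c, sigma, prior
```

The default `threshold=0.005` sits inside the preferred range of 0.001 to 0.010; the initial `sigma` values passed to `fit` are assumed to be strictly positive.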
In step (3), the coverage of the semantic cell mixing model is computed as:
$|LP| = \sum_{i=1}^{n} |L_i| \cdot \Pr(L_i)$, with $|L_i| = \sum_{k=1}^{N} \mu_{L_i}(X_k)$
$|LP|$ is the coverage of the semantic cell mixing model LP;
$|L_i|$ is the coverage of the training set by the i-th semantic cell $L_i$;
N is the number of intermediate vectors in the training set;
$X_k$ is the k-th intermediate vector in the training set;
$\mu_{L_i}(X_k)$ is the membership degree of the given feature vector $X_k$ in $L_i$;
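The coverage formula can be sketched as follows. The text here does not spell out how the membership degree is computed, so this sketch assumes the definition used in the information cell literature: the membership of a vector at distance d from the cell center is the mass of the density delta beyond d. That definition, and the names `membership`, `cell_coverage` and `mixture_coverage`, are assumptions of this illustration.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def membership(distance, c, sigma):
    """Assumed membership degree: integral of delta from `distance` to +inf.

    delta is the Gaussian N(c, sigma^2) renormalised over [0, +inf), so the
    result is 1 at distance 0 and decreases towards 0 as the distance grows.
    """
    return (1.0 - normal_cdf((distance - c) / sigma)) / normal_cdf(c / sigma)


def cell_coverage(distances, c, sigma):
    """|L_i|: summed membership of all training vectors in one cell."""
    return sum(membership(d, c, sigma) for d in distances)


def mixture_coverage(eps, c, sigma, prior):
    """|LP|: coverage of each cell weighted by its contribution parameter."""
    return sum(prior[i] * cell_coverage(eps[i], c[i], sigma[i])
               for i in range(len(eps)))
```

Here `eps[i][k]` is the distance from the k-th intermediate vector to the center of the i-th cell, as in the distance formula of step (3-2).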
The kernel function based on the semantic cell mixing model used in step (4) is:
$K(X, Z) = \exp\left( -\left\lVert \sum_{i=1}^{n} \mu_{L_i}(X) \Pr(L_i) - \sum_{i=1}^{n} \mu_{L_i}(Z) \Pr(L_i) \right\rVert \right)$
Wherein $L_i$ is the i-th semantic cell;
$\mu_{L_i}(X)$ is the membership degree of the given feature vector X in $L_i$;
$\mu_{L_i}(Z)$ is the membership degree of the given feature vector Z in $L_i$;
X and Z are the dimensionality-reduced feature vectors of the two voice signals being compared during the SVM computation;
This kernel function is used to construct the SVM classifier of each speaker;
The features of each feature vector that correspond to the dimensionality-reduced feature vector, together with the parameters of the semantic cell mixing model, are used as input to train the recognition model of the SVM classifier.
The SVM classifiers are trained in the one-versus-rest (OVR) fashion: during training, examples belonging to the speaker are treated as positive examples and examples not belonging to the speaker as negative examples.
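The kernel above can be sketched as follows. As in any reading of this formula, the membership degree has to be supplied; this sketch assumes the information-cell definition (mass of the truncated Gaussian density beyond the distance to the cell center), which is not stated in this part of the text. The model layout (a list of `(center, c, sigma, pr)` tuples) and all function names are hypothetical.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def membership(distance, c, sigma):
    # Assumed membership degree: tail mass of the truncated Gaussian density.
    return (1.0 - normal_cdf((distance - c) / sigma)) / normal_cdf(c / sigma)


def euclidean(x, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))


def mixture_membership(x, model):
    """Weighted sum over cells of mu_{L_i}(x) * Pr(L_i) for one vector."""
    return sum(pr * membership(euclidean(x, center), c, sigma)
               for center, c, sigma, pr in model)


def icmm_kernel(x, z, model):
    """K(X, Z) = exp(-|m(X) - m(Z)|); the norm of a scalar is its absolute value."""
    return math.exp(-abs(mixture_membership(x, model) - mixture_membership(z, model)))


def gram_matrix(vectors, model):
    """Kernel values for every pair of training vectors."""
    return [[icmm_kernel(x, z, model) for z in vectors] for x in vectors]
```

A Gram matrix of this kind could be handed to an SVM implementation that accepts precomputed kernels; in a one-versus-rest setup, one matrix would be built per speaker from that speaker's mixing model.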
Step (5), recognizing an unknown speaker, proceeds as follows:
(5-1) extract features from the input voice signal of the unknown speaker to generate a feature vector, and select the same features as in the dimensionality-reduced feature vector of step (3) to form the dimensionality-reduced feature vector a;
In step (5-1), the generated feature vector may be normalized before the dimensionality-reduced feature vector is obtained, or the dimensionality-reduced feature vector a may be obtained first and then normalized.
Normalizing with the standard score after the dimensionality-reduced feature vector a has been obtained improves computational performance and saves computation time and storage space.
The normalization uses the same mean μ and standard deviation σ' as the preprocessing of step (2) and is applied column by column (feature by feature), namely
$x' = \dfrac{x - \mu_j}{\sigma'_j}$,
where x and x' are the feature value before and after standardization, and $\mu_j$ and $\sigma'_j$ are the mean and standard deviation of the feature j corresponding to x, obtained when the standard score was computed in step (2).
(5-2) input the dimensionality-reduced feature vector into the SVM classifier of each speaker and compute the posterior probability $P_j$ of the SVM classification, j = 1, ..., W, where W is the number of speakers; the value range of $P_j$ is [-1, +1], reflecting whether the estimate is closer to a negative example (-1) or a positive example (+1).
(5-3) take the speaker with the largest posterior probability as the decision, specifically:
The decided speaker index is $kk = \begin{cases} \arg\max_j P_j, & \text{if } \max_j P_j > T \\ 0, & \text{otherwise} \end{cases}$ where kk = 0 means that the speaker is not a member of the original speaker set, and T is the decision threshold.
Raising the decision threshold increases the precision of the system but lowers its recall (that is, more of the tested speakers are classified as outside the set); the reverse also holds.
Preferably, the decision threshold T is 0.01 to 0.10.
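The decision rule of step (5-3) can be sketched as follows. It assumes the per-speaker scores $P_j$ have already been computed and are passed as a list ordered by speaker; the function name `decide_speaker` is hypothetical.

```python
def decide_speaker(scores, threshold):
    """Return the 1-based index of the best-scoring speaker, or 0 if rejected.

    `scores[j - 1]` is the score P_j of the j-th speaker's classifier; the
    speaker is accepted only if the best score is strictly above `threshold`.
    """
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best + 1 if scores[best] > threshold else 0
```

For example, with scores `[0.02, 0.40, -0.30]` and T = 0.10 the second speaker is chosen, while scores `[-0.5, 0.05]` with the same T are rejected as outside the set.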
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a speaker recognition method based on semantic cells, which addresses the problem that the kernel function of existing SVM models is not optimized for a specific speaker.
2. When selecting the voice features used to train the classifier, the invention uses a feature selection method based on semantic cells, which is more targeted than the methods in common use and therefore reduces the space required to store the model.
3. The SVM classifier recognition model constructed by the invention has higher accuracy.
Description of the drawings
Fig. 1 is a flowchart of the speaker recognition method based on semantic cells according to the invention.
Fig. 2 is a flowchart of semantic cell mixing model updating and speaker recognition according to the invention.
Detailed description of the embodiments
The invention is further described below with reference to Figs. 1 and 2. The method of the invention comprises five steps.
Step (1), building the voice library: each input voice signal must carry an identifier of the speaker, such as a name.
The voice library built in this embodiment contains 138 speakers (106 male, 32 female) with 10 voice signals each, 1380 voice signals in total.
The preprocessing of step (2) comprises pre-emphasis, framing and windowing; for details, refer to the patent application with publication number CN104200814A.
(2-1) The power spectrum of a voice signal decreases as frequency increases, and most of its energy is concentrated in the low-frequency range. As a result, the signal-to-noise ratio in the high-frequency part of the voice signal may drop to an unacceptable level. However, because the higher-frequency components of a voice signal carry little energy, they rarely have amplitudes large enough to produce the maximum frequency deviation; the signal amplitudes that produce the maximum frequency deviation are mostly caused by the low-frequency components, and the frequency deviation produced by the high-frequency components, whose amplitudes are usually small, is much smaller. Artificially emphasizing (boosting) the high-frequency components of the transmitter's input modulating signal by pre-emphasis therefore effectively improves the signal-to-noise ratio of the voice signal. Preferably, pre-emphasis filtering uses the transfer function $H(z) = 1 - 0.97z^{-1}$;
(2-2) Dividing speech of a given length into many frames allows it to be analyzed with the methods used for stationary processes. The invention therefore divides the voice signal into short time segments, each called a frame, with a frame length of roughly 10-30 ms. To make the transition between frames smooth and preserve continuity, overlapping segmentation is used, that is, the end of each frame overlaps the start of the next frame.
(2-3) To reduce the truncation effect on the speech frames and lessen the slope at both ends of a frame, so that the two ends do not change abruptly but taper smoothly to zero, each speech frame is multiplied by a window function. The invention may use any finite impulse response (FIR) filter window function. Framing and windowing divide each voice signal into several short-time speech segments; each short-time segment is called a frame, and every frame has a number (the frame number) assigned in chronological order.
The framing of step (2-2) and the windowing of step (2-3) are carried out simultaneously, that is, the speech is framed in the course of windowing. Even in the spectral analysis of strongly periodic voiced sounds, multiplying by a suitable window function suppresses the effect of variations in the relative phase between the pitch period and the analysis interval, so that a stable spectrum is obtained.
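The preprocessing chain of steps (2-1) to (2-3) can be sketched as follows. This is a minimal illustration that assumes the signal is a non-empty list of samples and uses the Hamming window named in the summary; the frame length and hop are left as parameters (for instance, 400 and 160 samples would correspond to 25 ms frames with a 10 ms hop at a 16 kHz sampling rate, values chosen here only as an example). The function names are hypothetical.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], the time-domain form of H(z) = 1 - alpha z^-1."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window coefficients for a frame of `length` >= 2 samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_frames(signal, frame_len, hop):
    """Split into overlapping frames (hop < frame_len) and window each frame.

    Framing and windowing happen in the same pass; the list index of a frame
    is its frame number in chronological order.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

A typical call would be `windowed_frames(pre_emphasis(samples), frame_len, hop)`, after which the MFCCs of each frame can be computed.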
Feature extraction from the speech frames is carried out as follows:
(2-4) extract the feature values of each frame of the current voice signal. The voice features are the following 9 statistics (the samples being the data extracted frame by frame) of the 1st- to 12th-order Mel-frequency cepstral coefficients (MFCC):
maximum, minimum, frame position of the maximum, frame position of the minimum, arithmetic mean, linear regression coefficients (slope and intercept), skewness, and kurtosis.
(2-5) build the feature vector of the current voice signal from the statistics of each acoustic feature.
In step (2-5), the statistics of all voice features of the current voice signal, together with those of the corresponding first-order difference coefficients, are arranged directly into a row vector, which is the feature vector of the current voice signal. The statistics may be arranged in any order, but the same order must be used for all voice signals. Each voice signal yields a 108-dimensional feature vector, that is, 108 features.
(2-6) normalize the feature vectors. Because the semantic cell model uses a distance function to measure the membership degree of any feature vector in a prototype (semantic cell nucleus), the feature vectors obtained must be normalized column by column, so that differences in scale between features do not affect the model's results.
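Column-wise z-score normalization, as described in step (2-6), can be sketched as follows. The statistics are fitted once on the training vectors and then reused unchanged for any later vector, which is how the mean and standard deviation are reused when an unknown speaker is recognized. The use of the population standard deviation and the handling of constant columns are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def fit_zscore(rows):
    """Per-column mean and (population) standard deviation of training vectors."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    return means, stds


def apply_zscore(row, means, stds):
    """x' = (x - mean_j) / std_j for every feature j of one vector."""
    # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
    return [(x - m) / s if s > 0 else 0.0
            for x, m, s in zip(row, means, stds)]
```

At recognition time only `apply_zscore` is called, with the `means` and `stds` stored from training.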
Step (3) uses the feature selection method based on semantic cells to reduce the dimensionality of the feature vectors generated in step (2), as follows. These steps are carried out once for each speaker, so that after step (3) each speaker has a feature set optimized for that speaker.
(3-1) Set the number of features to be selected (the dimensionality of the reduced feature vector) to D = 36;
Set the initial number of selected features k = 0 and the initial set of selected features $S = \varnothing$ (the empty set); S holds features expressed as intermediate vectors. The feature set of the speaker is $M = \{m_1, \ldots, m_{dd}\}$. Table 1 shows the feature set of each speaker: $sp_1$ to $sp_h$ denote the speaker's voice signals, and the column of each is the feature vector of that voice signal; $m_1$ to $m_{dd}$ denote the features of each voice signal, and the row of each forms the intermediate vector of that feature; dd is the number of features per voice signal, dd = 108; h is the number of voice signals, h = 10.
(3-2) For each feature $m_v \in M$, form the intermediate vector from the row of $m_v$, and train a semantic cell mixing model using the features in $S \cup \{m_v\}$ as the training set.
Table 1
The semantic cell mixing model is trained as follows:
(3-2-1) cluster all intermediate vectors in the training set to obtain multiple cluster centers, which serve as the centers of the semantic cells. The cluster centers are called semantic cell nuclei; their number, denoted n, is generally 3 to 10, and n = 5 in this embodiment;
(3-2-2) compute the initial parameter values:
Set the initial value of the contribution parameter of each semantic cell to the mixing model to $\Pr(L_i(0)) = 1/n$;
For each semantic cell, compute the location parameter and scale parameter of the semantic cell from the distances between each intermediate vector in the training set and the center of that semantic cell; the initial values of the location parameter, scale parameter and contribution parameter of the i-th semantic cell of the mixing model are denoted $c_i(0)$, $\sigma_i(0)$ and $\Pr(L_i(0))$ respectively.
The distance $\epsilon_{ik}$ between the k-th intermediate vector and the center of the i-th semantic cell is computed as:
$\epsilon_{ik} = d_i(X_k, P_i) = \lVert X_k - P_i \rVert$
where $P_i$ is the center of the i-th semantic cell and $X_k$ is the k-th intermediate vector in the training set, $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, N$,
$c_i(0) = \frac{1}{N} \sum_{k=1}^{N} \epsilon_{ik}$,
$(\sigma_i(0))^2 = \frac{1}{N} \sum_{k=1}^{N} \left(\epsilon_{ik} - c_i(0)\right)^2$,
(3-2-3) after the parameters of each semantic cell have been obtained, set a threshold and update the mixing model iteratively. The objective function of the t-th iteration is:
$J_{LP}(t) = \sum_{k=1}^{N} \ln\left( \sum_{i=1}^{n} \delta(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) \Pr(L_i(t)) \right)$
The iteration stops when the absolute difference between the objective function values of two successive iterations, $|J_{LP}(t) - J_{LP}(t-1)|$, is less than the set threshold; the threshold in this embodiment is 0.005.
Wherein, t = 1, 2, ... is the iteration number;
N is the number of intermediate vectors in the training set;
n is the number of semantic cells;
$\delta(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) = \dfrac{f(\epsilon_{ik} \mid c_i(t), \sigma_i(t))}{\int_{0}^{+\infty} f(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) \, d\epsilon_{ik}}$,
Wherein:
$f(\epsilon_{ik} \mid c_i(t), \sigma_i(t)) = \dfrac{1}{\sqrt{2\pi (\sigma_i(t))^2}} \exp\left( \dfrac{(\epsilon_{ik} - c_i(t))^2}{-2 (\sigma_i(t))^2} \right)$,
$c_i(t) = \dfrac{\sum_{k=1}^{N} q_{ik}(t-1) \, \epsilon_{ik}}{\sum_{k=1}^{N} q_{ik}(t-1)}$,
$(\sigma_i(t))^2 = \dfrac{\sum_{k=1}^{N} q_{ik}(t-1) \left(\epsilon_{ik} - c_i(t)\right)^2}{\sum_{k=1}^{N} q_{ik}(t-1)}$
$q_{ik}(t-1) = \dfrac{\delta(\epsilon_{ik} \mid c_i(t-1), \sigma_i(t-1)) \Pr(L_i(t-1))}{\sum_{i=1}^{n} \delta(\epsilon_{ik} \mid c_i(t-1), \sigma_i(t-1)) \Pr(L_i(t-1))}$;
$\Pr(L_i(t)) = \dfrac{1}{N} \sum_{k=1}^{N} q_{ik}(t-1)$
$q_{ik}(t)$ and the analogous q terms are the weights of the distances to the semantic cell centers.
(3-3) Compute the coverage of each semantic cell mixing model obtained in the previous step.
After the v-th feature has been added, the coverage of the corresponding semantic cell mixing model is computed as:
$|LP|_v = \sum_{i=1}^{n} |L_i| \cdot \Pr(L_i)$, where $|L_i| = \sum_{k=1}^{N} \mu_{L_i}(X_k)$
$|LP|_v$ is the coverage of the semantic cell mixing model after the v-th feature, expressed as an intermediate vector, has been added;
$\mu_{L_i}(X_k)$ is the membership degree of the given feature vector $X_k$ in the semantic cell $L_i$;
$|L_i|$ is the coverage of the training set by the semantic cell $L_i$;
Select the feature $m_k$ with the largest coverage, namely $m_k = \arg\max_{m_v \in M} |LP|_v$.
(3-4) Remove the feature $m_k$ from the set M: $M \leftarrow M - \{m_k\}$, add it to the set S: $S \leftarrow S \cup \{m_k\}$, and set $k \leftarrow k + 1$.
(3-5) If k < D, return to step (3-2); otherwise step (3) ends.
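The selection loop of steps (3-1) to (3-5) is a greedy forward selection and can be sketched as follows. Training the semantic cell mixing model and computing its coverage is abstracted into a caller-supplied function `coverage_of`, a hypothetical stand-in that takes a list of features (those already selected plus one candidate) and returns the coverage; the function name `greedy_select` is likewise hypothetical.

```python
def greedy_select(candidates, target_count, coverage_of):
    """Pick `target_count` features, one per round, by largest coverage.

    `candidates` plays the role of the set M, the returned list that of S,
    and `target_count` that of D.
    """
    selected = []
    remaining = list(candidates)
    while len(selected) < target_count and remaining:
        # Score every remaining feature together with those already chosen.
        best = max(remaining, key=lambda m: coverage_of(selected + [m]))
        remaining.remove(best)      # M <- M - {m_k}
        selected.append(best)       # S <- S U {m_k}
    return selected
```

Each round evaluates every remaining candidate once, so selecting D of dd features trains on the order of D x dd mixing models per speaker (at most 36 x 108 with the values of this embodiment).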
Step (4): according to pattern recognition theory, patterns that are not linearly separable in a low-dimensional space may become linearly separable after a nonlinear mapping into a high-dimensional feature space. However, if classification or regression is carried out directly in the high-dimensional space, the form and parameters of the nonlinear mapping function and the dimensionality of the feature space must be determined, and the greatest obstacle is the "curse of dimensionality" incurred when computing in the high-dimensional feature space. Kernel functions solve these problems effectively.
The kernel function based on the semantic cell mixing model is:
$K(X, Z) = \exp\left( -\left\lVert \sum_{i=1}^{n} \mu_{L_i}(X) \Pr(L_i) - \sum_{i=1}^{n} \mu_{L_i}(Z) \Pr(L_i) \right\rVert \right)$,
where $\delta(\epsilon \mid c_i, \sigma_i)$ is computed in the same way as in step (3-2-3).
According to this kernel function, if X and Z are both very close to the prototypes of the semantic cells, the kernel value is 1; otherwise the kernel value is approximately 0.
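A short worked bound may help in reading the kernel formula as it is written above. It rests on two assumptions that are not stated at this point in the source: every membership degree lies in [0, 1], and the contribution parameters sum to 1 (which the update rule for $\Pr(L_i(t))$ implies). The shorthand m(X) for the weighted membership sum is introduced here only for the derivation.

```latex
m(X) = \sum_{i=1}^{n} \mu_{L_i}(X)\,\Pr(L_i), \qquad 0 \le m(X) \le 1
\\[4pt]
K(X, Z) = \exp\bigl(-\lvert m(X) - m(Z) \rvert\bigr), \qquad
0 \le \lvert m(X) - m(Z) \rvert \le 1
\;\Longrightarrow\; e^{-1} \le K(X, Z) \le 1
```

Under these assumptions K equals 1 exactly when the two weighted memberships coincide, and the smallest value the formula can take as written is $e^{-1} \approx 0.37$, so the "approximately 0" case is best read qualitatively, as "clearly smaller than 1".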
Utilize the SVM classifier of each speaker of this Kernel;
In dimensionality reduction proper vector, the parameter of corresponding proper vector and semantic mixing with cells model is as input, trains the model of cognition of SVM classifier.The sorter model of training is a pair other (OVR) type, and namely in training, the example belonging to this speaker is considered as positive example, and what do not belong to this speaker is considered as counter-example.
Because the use of SVM classifier model in this area is very general, its computing method all have a detailed description (such as can with reference to " the LIBSVM:a Library for Support Vector Machines " of Chih-chung Chang and Chih-Jen Lin) at a lot of document, are not described in detail here.
Step (5): The process of recognizing an unknown speaker is as follows:
(5-1) Features are extracted from the input voice signal of the unknown speaker to obtain a feature vector, and the same features as in the dimensionality-reduced feature vector of step (3) are selected to obtain the dimensionality-reduced feature vector a. The standard scores are then computed with the same mean $\mu$ and standard deviation $\sigma'$ as used in step (2).
(5-2) The dimensionality-reduced feature vector a obtained in (5-1) is input into the SVM classifier corresponding to each speaker, and the posterior probability $P_j$ of the SVM classification is computed, $j = 1, \ldots, W$ (W is the number of speakers). The range of $P_j$ is [-1, +1], reflecting whether the estimate is closer to a negative example (-1) or a positive example (+1).
(5-3) The speaker with the largest posterior probability value is chosen as the decision result, as follows:
The decided speaker index is
$$kk = \begin{cases} \arg\max_j P_j, & \text{if } \max_j P_j > T \\ 0, & \text{otherwise,} \end{cases}$$
where kk = 0 indicates that the speaker is not a member of the original speaker set, and T is the decision threshold; in this embodiment, T = 0.100.
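Steps (5-1) and (5-3) can be illustrated with a short Python sketch. The function names are hypothetical, the per-feature means and standard deviations are assumed to have been stored during training, and the scores passed to `decide_speaker` stand in for the per-speaker SVM outputs $P_j$ of step (5-2), which are not computed here.

```python
def standard_scores(features, means, stds):
    """Normalise each feature with the mean and standard deviation saved at training time."""
    return [(f - m) / s for f, m, s in zip(features, means, stds)]

def decide_speaker(scores, threshold=0.100):
    """Return the 1-based index of the best-scoring speaker, or 0 if no score exceeds T.

    scores[j - 1] is the classifier output P_j in [-1, +1] for speaker j.
    """
    if not scores:
        return 0
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best + 1 if scores[best] > threshold else 0
```

For example, `decide_speaker([-0.4, 0.35, 0.08])` selects speaker 2, while `decide_speaker([-0.9, 0.05, -0.2])` returns 0 because no score exceeds the threshold, so the speaker is treated as outside the enrolled set.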
The method of the invention is compared with two other algorithms, as follows:
(1) A recognition method based on semantic cells. It uses the first-layer semantic cell algorithm from the invention with application number 201410402937.1, and uses principal component analysis (PCA) to reduce the feature vectors to 36 dimensions;
(2) A support vector machine (SVM) with a radial basis function (RBF) kernel. The multi-class classification problem is handled with the one-versus-rest (OVR) method.
Table 2
Table 2 shows the experimental results of the three methods; the training and test sets are split according to 5-fold cross-validation.
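A 5-fold split of the kind mentioned above can be sketched as follows. The text does not say whether the data were shuffled or stratified by speaker, so this sketch simply assigns sample indices to folds in round-robin order; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(num_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(num_samples))
    # Round-robin assignment: fold i holds indices i, i + k, i + 2k, ...
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, and reported accuracy would be the average over the k runs.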

Claims (10)

1. A speaker recognition method based on a semantic cell mixing model, comprising the following steps:
(1) building a voice library, the voice library comprising a plurality of voice signals of a plurality of speakers;
(2) preprocessing each voice signal of each speaker in the voice library and extracting voice features to obtain the feature vectors of each speaker;
(3) based on a semantic cell feature selection method, performing dimensionality reduction on each feature vector obtained in step (2) to obtain the corresponding dimensionality-reduced feature vector, and training the semantic cell mixing model of each speaker;
(4) constructing the SVM classifier of each speaker using the kernel function based on the semantic cell mixing model, and training the recognition model of the SVM classifier;
(5) recognizing an unknown speaker using the recognition model of the SVM classifier.
2. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the dimensionality reduction in step (3) is as follows: a predetermined number of features are selected from the feature vectors of each speaker; at each selection, the features in the feature vectors of each speaker are taken one by one to form intermediate vectors; all features, expressed in intermediate-vector form and combined with those already selected, are used as the training set to train a semantic cell mixing model; the feature that maximizes the coverage of the semantic cell mixing model is added to the dimensionality-reduced feature vector; and this step is repeated until the number of features in the dimensionality-reduced feature vector reaches the predetermined number.
3. The speaker recognition method based on a semantic cell mixing model according to claim 2, characterized in that the semantic cell mixing model is trained as follows:
(3-1) clustering the intermediate vectors in the training set to obtain a plurality of cluster centers, which serve as the centers of the semantic cells; a semantic cell mixing model consists of n semantic cells and comprises n cluster centers with different weights;
(3-2) computing initial parameter values: for each semantic cell, the location parameter and scale parameter of the semantic cell are computed from the distances between each intermediate vector in the training set and the center of the semantic cell, and the contribution parameter of each semantic cell to the semantic cell mixing model is set; the initial values of the location parameter, scale parameter and contribution parameter of the i-th semantic cell are denoted $c_i(0)$, $\sigma_i(0)$ and $\Pr(L_i(0))$ respectively, and the parameters of all semantic cells form the parameters of the semantic cell mixing model;
wherein the distance $\varepsilon_{ik}$ between the k-th intermediate vector and the center of the i-th semantic cell, and the initial values derived from it, are computed according to the following formulas:
$$\varepsilon_{ik} = d_i(X_k, P_i) = \| X_k - P_i \|$$
$$c_i(0) = \frac{1}{N} \sum_{k=1}^{N} \varepsilon_{ik}$$
$$(\sigma_i(0))^2 = \frac{1}{N} \sum_{k=1}^{N} \left( \varepsilon_{ik} - c_i(0) \right)^2$$
$d_i$ denotes the distance from $X_k$ to $P_i$;
$P_i$ is the center of the i-th semantic cell;
$X_k$ is the k-th intermediate vector in the training set, $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, N$;
N is the number of intermediate vectors in the training set;
the initial value of the contribution parameter of each semantic cell to the mixing model is set to:
$$\Pr(L_i(0)) = 1/n;$$
(3-3) after the initial parameter values of the semantic cell mixing model are obtained, a threshold is set and the semantic cell mixing model is updated iteratively; the objective function of the t-th iteration is:
$$J_{LP}(t) = \sum_{k=1}^{N} \ln\left( \sum_{i=1}^{n} \delta(\varepsilon_{ik} \mid c_i(t), \sigma_i(t)) \Pr(L_i(t)) \right)$$
$$\delta(\varepsilon_{ik} \mid c_i(t), \sigma_i(t)) = \frac{f(\varepsilon_{ik} \mid c_i(t), \sigma_i(t))}{\int_{0}^{+\infty} f(\varepsilon_{ik} \mid c_i(t), \sigma_i(t)) \, d\varepsilon_{ik}}$$
wherein:
$$f(\varepsilon_{ik} \mid c_i(t), \sigma_i(t)) = \frac{1}{\sqrt{2\pi (\sigma_i(t))^2}} \exp\left( \frac{(\varepsilon_{ik} - c_i(t))^2}{-2 (\sigma_i(t))^2} \right)$$
$$c_i(t) = \frac{\sum_{k=1}^{N} q_{ik}(t-1) \, \varepsilon_{ik}}{\sum_{k=1}^{N} q_{ik}(t-1)}$$
$$(\sigma_i(t))^2 = \frac{\sum_{k=1}^{N} q_{ik}(t-1) \left( \varepsilon_{ik} - c_i(t) \right)^2}{\sum_{k=1}^{N} q_{ik}(t-1)}$$
$$q_{ik}(t-1) = \frac{\delta(\varepsilon_{ik} \mid c_i(t-1), \sigma_i(t-1)) \Pr(L_i(t-1))}{\sum_{i'=1}^{n} \delta(\varepsilon_{i'k} \mid c_{i'}(t-1), \sigma_{i'}(t-1)) \Pr(L_{i'}(t-1))}$$
$$\Pr(L_i(t)) = \frac{1}{N} \sum_{k=1}^{N} q_{ik}(t-1)$$
$t = 1, 2, \ldots$ is the iteration number;
n is the number of semantic cells;
$q_{ik}(t)$ is the weight of the distance to the semantic cell center;
$c_i(t)$ is the location parameter;
$\sigma_i(t)$ is the scale parameter;
$\Pr(L_i(t))$ is the contribution parameter;
the iteration stops when the absolute value of the difference between the objective function values obtained in two successive iterations, $|J_{LP}(t) - J_{LP}(t-1)|$, is less than the set threshold.
4. The speaker recognition method based on a semantic cell mixing model according to claim 3, characterized in that, in step (3-1), n is 3 to 10.
5. The speaker recognition method based on a semantic cell mixing model according to claim 3, characterized in that, in step (3-3), the set threshold is 0.001 to 0.010.
6. The speaker recognition method based on a semantic cell mixing model according to claim 2, characterized in that the coverage of the semantic cell mixing model is computed as:
$$|LP| = \sum_{i=1}^{n} |L_i| \cdot \Pr(L_i); \qquad |L_i| = \sum_{k=1}^{N} \mu_{L_i}(X_k)$$
$|LP|$ denotes the coverage of the semantic cell mixing model;
$|L_i|$ denotes the coverage of the corresponding training set by the i-th semantic cell $L_i$;
$\mu_{L_i}(X_k)$ denotes the membership of feature vector $X_k$ in $L_i$;
$\Pr(L_i)$ denotes the weight parameter of semantic cell $L_i$ in the semantic cell mixing model.
7. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the kernel function based on the semantic cell mixing model in step (4) is:
$$K(X, Z) = \exp\left( -\left\| \sum_{i=1}^{n} \mu_{L_i}(X) \Pr(L_i) - \sum_{i=1}^{n} \mu_{L_i}(Z) \Pr(L_i) \right\| \right)$$
$L_i$ denotes the i-th semantic cell;
$\mu_{L_i}(X)$ denotes the membership of a given feature vector X in $L_i$;
$\mu_{L_i}(Z)$ denotes the membership of a given feature vector Z in $L_i$;
X and Z denote the dimensionality-reduced feature vectors corresponding to the two voice signals being compared;
this kernel function is used to construct the SVM classifier of each speaker;
the dimensionality-reduced feature vectors and the parameters of the semantic cell mixing model are used as input to train the recognition model of the SVM classifier;
the recognition model of the SVM classifier is of the one-versus-rest type, that is, during training, examples belonging to the speaker are treated as positive examples and examples not belonging to the speaker are treated as negative examples.
8. The speaker recognition method based on a semantic cell mixing model according to claim 1, characterized in that the process of recognizing an unknown speaker in step (5) is as follows:
(5-1) extracting features from the input voice signal of the unknown speaker to generate a feature vector, and selecting the same features as in the dimensionality-reduced feature vector of step (3) to obtain the dimensionality-reduced feature vector a;
(5-2) inputting the obtained dimensionality-reduced feature vector a into the SVM classifier corresponding to each speaker, and computing the posterior probability $P_j$ of the SVM classification, $j = 1, \ldots, W$, where W is the number of speakers and the range of $P_j$ is [-1, +1];
(5-3) choosing the speaker with the largest posterior probability value as the decision result, as follows: the decided speaker index is
$$kk = \begin{cases} \arg\max_j P_j, & \text{if } \max_j P_j > T \\ 0, & \text{otherwise,} \end{cases}$$
where kk = 0 indicates that the speaker is not a member of the original speaker set, and T is the decision threshold.
9. The speaker recognition method based on a semantic cell mixing model according to claim 8, characterized in that the decision threshold T is 0.01 to 0.10.
10. The speaker recognition method based on a semantic cell mixing model according to claim 8, characterized in that, after the dimensionality-reduced feature vector a is obtained in step (5-1), it is normalized using standard scores.
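The training procedure of claim 3 (initial values from the distance statistics, then iterative updates until the objective changes by less than a threshold) can be sketched in Python as below. This is an illustrative reading, not the patent's implementation. It assumes the distances $\varepsilon_{ik}$ have already been computed from the cluster centers, that the normalizing integral in $\delta$ runs over nonnegative distances, and that the variance update uses the freshly updated location parameter. The names `delta`, `objective`, `train_mixture`, `TINY`, `min_sigma` and `max_iter` are hypothetical, and the last three are numerical safeguards that do not appear in the claims.

```python
import math

TINY = 1e-300  # safeguard against division by zero and log(0); not part of the claims

def delta(eps, c, sigma):
    """Gaussian density in eps, renormalised over eps >= 0 (assumed integral range)."""
    f = math.exp(-((eps - c) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
    mass = 0.5 * (1.0 + math.erf(c / (sigma * math.sqrt(2.0))))  # integral of f over [0, +inf)
    return f / mass

def objective(eps, c, sigma, pr):
    """J_LP = sum_k ln( sum_i delta(eps_ik | c_i, sigma_i) * Pr(L_i) )."""
    n, num = len(eps), len(eps[0])
    return sum(
        math.log(sum(delta(eps[i][k], c[i], sigma[i]) * pr[i] for i in range(n)) + TINY)
        for k in range(num)
    )

def train_mixture(eps, threshold=0.001, max_iter=200, min_sigma=1e-6):
    """eps[i][k] is the distance from intermediate vector k to the centre of cell i."""
    n, num = len(eps), len(eps[0])
    # Step (3-2): initial location, scale and contribution parameters.
    c = [sum(row) / num for row in eps]
    sigma = [max(math.sqrt(sum((e - c[i]) ** 2 for e in eps[i]) / num), min_sigma)
             for i in range(n)]
    pr = [1.0 / n] * n
    j_prev = objective(eps, c, sigma, pr)
    for _ in range(max_iter):
        # Weights q_ik computed from the previous iteration's parameters.
        q = [[delta(eps[i][k], c[i], sigma[i]) * pr[i] + TINY for k in range(num)]
             for i in range(n)]
        for k in range(num):
            column_total = sum(q[i][k] for i in range(n))
            for i in range(n):
                q[i][k] /= column_total
        # Step (3-3): update c_i, sigma_i and Pr(L_i).
        for i in range(n):
            q_sum = sum(q[i])
            c[i] = sum(q[i][k] * eps[i][k] for k in range(num)) / q_sum
            var = sum(q[i][k] * (eps[i][k] - c[i]) ** 2 for k in range(num)) / q_sum
            sigma[i] = max(math.sqrt(var), min_sigma)
            pr[i] = q_sum / num
        j_now = objective(eps, c, sigma, pr)
        if abs(j_now - j_prev) < threshold:
            break
        j_prev = j_now
    return c, sigma, pr
```

The default threshold of 0.001 lies within the range given in claim 5. Calling `train_mixture(eps)` with an n-by-N table of distances returns the location parameters, scale parameters and contribution parameters of the n semantic cells; the contribution parameters sum to 1 by construction.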
CN201510026239.0A 2015-01-20 2015-01-20 Speaker recognition method based on semantic cell mixing model Pending CN104538036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510026239.0A CN104538036A (en) 2015-01-20 2015-01-20 Speaker recognition method based on semantic cell mixing model

Publications (1)

Publication Number Publication Date
CN104538036A true CN104538036A (en) 2015-04-22

Family

ID=52853552

Country Status (1)

Country Link
CN (1) CN104538036A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN102663370A (en) * 2012-04-23 2012-09-12 苏州大学 Face identification method and system
CN102968410A (en) * 2012-12-04 2013-03-13 江南大学 Text classification method based on RBF (Radial Basis Function) neural network algorithm and semantic feature selection
CN104200814A (en) * 2014-08-15 2014-12-10 浙江大学 Speech emotion recognition method based on semantic cells

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZENGCHANG QIN, YONGCHUAN TANG: "<Uncertainty Modeling for Data Mining>", 30 December 2013 *
颜程伟: "《硕士学位论文》", 15 July 2012 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845140A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation method and speaker confirmation device used in short voice condition
CN106448681A (en) * 2016-09-12 2017-02-22 南京邮电大学 Super-vector speaker recognition method
CN107292662A (en) * 2017-06-08 2017-10-24 浙江大学 A kind of method for evaluating the innovation vigor that article is obtained from mass-rent environment
CN107292662B (en) * 2017-06-08 2022-08-30 浙江大学 Method for evaluating innovation activity of acquiring articles from crowdsourcing environment
CN108847245A (en) * 2018-08-06 2018-11-20 北京海天瑞声科技股份有限公司 Speech detection method and device
CN110348258A (en) * 2019-07-12 2019-10-18 西安电子科技大学 A kind of RFID answer signal frame synchronization system and method based on machine learning
CN110473552A (en) * 2019-09-04 2019-11-19 平安科技(深圳)有限公司 Speech recognition authentication method and system
CN113066121A (en) * 2019-12-31 2021-07-02 深圳迈瑞生物医疗电子股份有限公司 Image analysis system and method for identifying repeat cells
CN111583914A (en) * 2020-05-12 2020-08-25 安徽中医药大学 Big data voice classification method based on Hadoop platform
CN111583914B (en) * 2020-05-12 2023-03-28 安徽中医药大学 Big data voice classification method based on Hadoop platform
CN112562693A (en) * 2021-02-24 2021-03-26 北京远鉴信息技术有限公司 Speaker determining method and device based on clustering and electronic equipment
CN113160795A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium
CN113160795B (en) * 2021-04-28 2024-03-05 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150422