CN102024455B - Speaker recognition system and method


Info

Publication number: CN102024455B
Application number: CN200910170552.6A
Authority: CN (China)
Prior art keywords: speaker, eigenvector, feature, sigma, model
Legal status: Expired - Fee Related
Other languages: Chinese (zh)
Other versions: CN102024455A
Inventors: 刘昆, 吴伟国
Current and original assignee: Sony Corp
Priority and filing date: 2009-09-10
Application filed by Sony Corp; published as CN102024455A; granted and published as CN102024455B


Abstract

The invention discloses a speaker recognition system and method. The system comprises: a feature extraction unit configured to extract feature vectors from a speaker's speech data; a background model generation unit configured to perform internal clustering on the feature vectors of background speakers' speech data and to generate a universal background model for a general speaker from the result of the internal clustering; a registered-speaker model generation unit configured to adapt the universal background model with the feature vectors of each registered speaker's speech data, generating a registered speaker model for each registered speaker; a metric calculation unit configured to compute metric values of a test speaker's feature vectors on the universal background model generated by the background model generation unit and on each registered speaker model generated by the registered-speaker model generation unit; and a recognition unit configured to identify the test speaker from the metric values computed by the metric calculation unit.

Description

Speaker recognition system and method
Technical field
The present invention relates generally to a speaker recognition system and method. More particularly, it relates to a speaker-dependent recognition system and method based on a universal background model (UBM) and registered speaker models.
Background technology
Biometric identification technologies currently under active study include hand-shape recognition, fingerprint recognition, face recognition, voiceprint recognition, iris recognition and signature recognition. Among these biometrics, fingerprints, irises and facial features are exposed physical traits: under coercion, an offender can forcibly use the victim's own traits to impersonate him. A person's voice, by contrast, is an internal trait; as long as the person does not speak, it cannot be stolen, and voice has therefore been the subject of extensive research and development in the biometric identification field.
Voiceprint recognition, also called speaker recognition, is a biometric identification technology that establishes personal identity from intrinsic physiological or behavioral characteristics. It analyzes a received speech signal, extracts the speaker's characteristics, and automatically determines whether the speaker belongs to an established speaker set and, if so, who the speaker is. When the spoken content is predetermined, the task is called text-dependent voiceprint recognition; when the spoken content is not fixed in advance and any utterance can be recognized, it is called text-independent voiceprint recognition.
The mainstream speaker recognition method today is based on GMM-UBM (Gaussian mixture model - universal background model). A GMM-UBM speaker recognition system has three main parts: UBM training, speaker-specific model adaptation, and the speaker recognition test. Specifically, a universal background model is first trained from the data of hundreds or even thousands of speakers; a speaker-specific Gaussian mixture model is then adapted from the universal background model using the target speaker's data; and the adapted model is used for speaker recognition.
Its advantage is that the speaker-specific model is obtained by adapting the UBM with the speaker's training utterances. Pronunciation characteristics covered by the training utterances are thus modeled from the speaker's own speech, while characteristics not covered remain close to the UBM, which reduces the impact of the different acoustic-space distributions of test and training speech. In addition, during identity verification, the score of the test speech on the UBM can serve as a reference threshold.
A good UBM is trained from a large amount of speech from many background speakers. For a simple recognition system, the trained background model alone can achieve satisfactory accuracy. For a specific application, however, speech samples from the actual channel should be collected before deployment and used, via an adaptive algorithm, to retrain and update the background model for the best recognition performance.
However, in a GMM-UBM speaker recognition system a single UBM represents only the statistically averaged pronunciation characteristics of speakers, and UBM training requires speech from a large number of speakers while also balancing gender ratio, age distribution and so on, all modeled with a GMM.
An important deficiency of GMM modeling is that the physical meaning of the trained GMM is unclear: one cannot tell which features contributed to each Gaussian component. Moreover, because speech from many speakers is needed, GMM training takes a very long time.
Summary of the invention
In view of the foregoing, the present invention proposes a new speaker recognition system and method that clarify the physical meaning of the universal background model and also reduce the training time compared with a GMM.
Specifically, according to one aspect of the present invention, a speaker recognition system is provided, comprising: a feature extraction unit configured to extract feature vectors from a speaker's speech data; a background model generation unit configured to perform internal clustering on the feature vectors of background speakers' speech data and to generate a universal background model for a general speaker from the result of the internal clustering; a registered-speaker model generation unit configured to adapt the universal background model with the feature vectors of each registered speaker's speech data, generating a registered speaker model for each registered speaker; a metric calculation unit configured to compute metric values of a test speaker's feature vectors on the universal background model generated by the background model generation unit and on each registered speaker model generated by the registered-speaker model generation unit; and a recognition unit configured to identify the test speaker from the metric values computed by the metric calculation unit.
According to another aspect of the present invention, a speaker recognition method is provided, comprising: extracting feature vectors from a speaker's speech data; performing internal clustering on the feature vectors of background speakers' speech data and generating a universal background model for a general speaker from the result of the internal clustering; adapting the universal background model with the feature vectors of each registered speaker's speech data to generate a registered speaker model for each registered speaker; computing metric values of a test speaker's feature vectors on the universal background model and on each registered speaker model; and identifying the test speaker from the computed metric values.
According to an embodiment of the present invention, generating the universal background model comprises: performing internal clustering on the feature vectors of the background speakers' speech data to generate a series of feature subclasses; selecting cluster centers from the feature subclasses of all background speakers so as to partition all feature subclasses into feature spaces; and characterizing all feature subclasses contained in each feature space to generate the universal background model for a general speaker.
Preferably, in the internal clustering, a KDTree is constructed from the feature vectors of each background speaker's speech data and clustered according to a nearest-neighbor rule.
According to a specific embodiment of the present invention, the internal clustering comprises: extracting feature vectors from the voiced segments of a background speaker's speech data; building the extracted feature vectors into a KDTree such that, at every layer, the value of the dimension associated with that layer is smaller than the root node's value of that dimension for all nodes in the root node's left subtree, and larger for all nodes in its right subtree; and clustering each root node at some chosen layer of the constructed KDTree, together with its subtrees, into a feature subclass sharing common characteristics.
Preferably, in the internal clustering, the root nodes are screened and only those with the largest numbers of child nodes are retained.
According to an embodiment of the present invention, the cluster centers are selected from the generated feature subclasses of all background speakers using the maximum-distance sample method, the K-means method, the single linkage method, the average linkage method or the centroid method.
According to a preferred embodiment of the present invention, all feature subclasses contained in each feature space are characterized with a Gaussian function: the mean and variance of the feature vectors contained in all subclasses of each feature space are computed, to obtain the normal distribution function of each feature space.
In addition, according to an embodiment of the present invention, generating a registered speaker model comprises: obtaining the feature vectors F of the registered speaker's speech data; for each feature vector F, computing its posterior probability $p_k$ with respect to each feature space k,

$$p_k = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}\,(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right\},\quad k = 1, 2, \ldots, N,$$

where $\mu_k$ is the mean of the feature vectors contained in all subclasses of feature space k, $\Sigma_k$ is their variance, N is the number of feature spaces in the partition, and d is the feature dimension; computing the update factor $\alpha = \frac{1}{\gamma + p_k}$, where $\gamma$ is an empirical value; updating the mean of each feature space as $\mu'_k = \mu_k(1-\alpha) + \alpha F$; and adapting the universal background model with the updated feature-space means, to generate the registered speaker model of this registered speaker.
In addition, according to a preferred embodiment of the present invention, in the metric calculation the feature vectors of the test speaker's speech data are obtained, and the posterior probabilities $P_B$ and $P_R$ of all obtained feature vectors with respect to the universal background model $M_B$ and a registered speaker model $M_R$ are computed as

$$P_B = \frac{1}{m}\sum_{i=1}^{m} p_B^i, \qquad P_R = \frac{1}{m}\sum_{i=1}^{m} p_R^i,$$

$$p_B^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^B\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^B|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^B)^T (\Sigma_k^B)^{-1} (x-\mu_k^B)\right),$$

$$p_R^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^R\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^R|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^R)^T (\Sigma_k^R)^{-1} (x-\mu_k^R)\right),$$

where m is the number of feature vectors obtained from the test speaker's speech data, $p_k^B$ and $p_k^R$ are the posterior probabilities of a feature vector on the k-th feature space of $M_B$ and $M_R$ respectively, and $w_k = \frac{1}{N}$ is the weight of each feature space. In the recognition, the score $P_R - P_B$ of the test speaker's speech data against each registered speaker model is computed, the maximum $P_{max}$ is taken, and the test speaker is identified according to a preset threshold.
As can be seen, in the speaker recognition system and method proposed by the present invention, the universal background model is generated by first constructing a KDTree from each background speaker's features and clustering within each speaker according to a nearest-neighbor rule, then clustering across all background speakers and updating the model parameters to obtain the background model. The physical meaning of this universal background model is therefore explicit. Moreover, the computational complexity of the KDTree is lower than that of a GMM, so the training time is shortened.
In addition, the present invention provides a computer program for implementing the above speaker recognition method.
The present invention further provides a computer program product, in at least the form of a computer-readable medium, on which computer program code for implementing the above speaker recognition method is recorded.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will be more easily understood from the following description of embodiments with reference to the accompanying drawings, in which identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a block diagram of a speaker recognition system according to an embodiment of the present invention;
Fig. 2 is a block diagram of the background model generation unit according to an embodiment of the present invention;
Fig. 3 is a block diagram of the internal clustering unit according to an embodiment of the present invention;
Fig. 4 illustrates the KDTree constructed from the speaker feature vectors of a concrete example of the present invention;
Fig. 5 is a flowchart of the processing of a speaker recognition method according to an embodiment of the present invention;
Fig. 6 is a flowchart of the universal background model training process of a concrete example of the present invention; and
Fig. 7 is a block diagram of an information processing device for implementing the speaker recognition method of the present invention.
Embodiment
Embodiments of the present invention are described below with reference to the accompanying drawings. For clarity, representations and descriptions of components and processing irrelevant to the invention and well known to those of ordinary skill in the art are omitted from the drawings and the description.
The general working process of a speaker recognition system according to an embodiment of the present invention is first described with reference to the drawings, particularly Figs. 1 to 4. As shown in Fig. 1, the speaker recognition system according to this embodiment comprises: a feature extraction unit 101 configured to extract feature vectors from a speaker's speech data; a background model generation unit 103 configured to perform internal clustering on the feature vectors of background speakers' speech data and to generate a universal background model for a general speaker from the clustering result; a registered-speaker model generation unit 105 configured to adapt the universal background model with the feature vectors of each registered speaker's speech data, generating a registered speaker model for each registered speaker; a metric calculation unit 107 configured to compute metric values of a test speaker's feature vectors on the universal background model generated by the background model generation unit 103 and on each registered speaker model generated by the registered-speaker model generation unit 105; and a recognition unit 109 configured to identify the test speaker from the metric values computed by the metric calculation unit 107.
Each module included in the speaker recognition system according to the present invention is described in detail below with reference to Figs. 2 to 4.
In the speaker recognition system of this embodiment, the feature extraction unit 101 first extracts feature vectors from the speakers' speech data. Depending on the situation, the feature extraction unit 101 extracts feature vectors from different speakers' speech data and sends them to different downstream units. When generating the universal background model, it extracts the feature vectors of a large number of speakers' speech data and sends them to the background model generation unit 103. For registered speaker models, it extracts the feature vectors of each speaker to be registered and sends them to the registered-speaker model generation unit 105. At recognition time, it extracts the feature vectors of the test speaker to be identified and sends them to the metric calculation unit 107.
The background model generation unit 103 performs internal clustering on the feature vectors of the background speakers' speech data provided by the feature extraction unit 101 and generates the universal background model for a general speaker from the clustering result. Fig. 2 shows a block diagram of the background model generation unit 103 according to an embodiment of the present invention.
As shown in Fig. 2, the background model generation unit 103 of this embodiment comprises: an internal clustering unit 201 configured to perform internal clustering on the feature vectors of the background speakers' speech data to generate a series of feature subclasses; a feature-subclass space division unit 203 configured to select cluster centers from the feature subclasses of all background speakers generated by the internal clustering unit 201, so as to partition all feature subclasses into feature spaces; and a feature-space characterization unit 205 configured to characterize all feature subclasses contained in each feature space, so as to generate the universal background model for a general speaker.
According to an example of the present invention, the internal clustering unit 201 constructs a KDTree from the feature vectors of each background speaker's speech data and clusters within each speaker according to a nearest-neighbor rule, obtaining a series of feature subclasses. Fig. 3 shows a block diagram of the internal clustering unit 201 according to this embodiment.
As shown in Fig. 3, the internal clustering unit 201 of this embodiment comprises: a voiced-segment extraction unit 301 configured to extract feature vectors from the voiced segments of a background speaker's speech data; a KDTree construction unit 303 configured to build the extracted feature vectors into a KDTree such that, at every layer, the value of the dimension associated with that layer is smaller than the root node's value of that dimension for all nodes in the root node's left subtree, and larger for all nodes in its right subtree; and a feature-subclass generation unit 305 configured to cluster each root node at some chosen layer of the constructed KDTree, together with its subtrees, into a feature subclass sharing common characteristics.
Here, according to an example of the present invention, the features extracted by the voiced-segment extraction unit 301 comprise 18-dimensional MFCC (Mel frequency cepstral coefficient) features, 18-dimensional delta MFCC features and 9-dimensional prosodic features. These features are, of course, given only as examples; in a specific implementation, a subset of them may be chosen, or other feature vectors that characterize a speaker's voice may be selected, according to the concrete situation and requirements.
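For illustration only, the following is a minimal sketch of such per-frame feature extraction, assuming the librosa toolkit (the patent names no toolkit); the 9 prosodic dimensions (e.g. pitch and energy statistics) are application-specific and omitted here.

```python
# A minimal sketch of the per-frame feature extraction described above:
# 18 MFCCs plus their 18 deltas, giving a 36-dimensional vector per frame.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=18)   # shape (18, frames)
    delta = librosa.feature.delta(mfcc)                  # shape (18, frames)
    return np.vstack([mfcc, delta]).T                    # shape (frames, 36)
```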
After receiving the feature vectors extracted by the voiced-segment extraction unit 301, the KDTree construction unit 303 sorts them and builds the KDTree for this speaker.
A KDTree (k-dimensional search tree) is a generalization of the binary search tree to multidimensional retrieval, where k is the dimension of the space. Unlike a binary search tree, each of its nodes represents a point in k-dimensional space, and each layer makes its branching decision according to that layer's discriminator.
In a KDTree, the top layer partitions on one dimension, the second layer on another dimension, and the remaining layers continue to cycle through the dimensions in this way, until the number of points in a node falls below a given maximum and the partitioning stops.
Specifically, when building the KDTree of the feature vectors of one speaker's speech data, the root node is selected first: the values of the first dimension of all feature vectors are compared and sorted, and the feature vector at the median of the sorted order is taken as the root. Then, starting from the first feature vector, each vector in turn is inserted at the position found for it in the KDTree. The rule assigning a feature vector to the left or right subtree is as follows, taking layer i as an example: if the left subtree is non-empty, the i-th dimension values of all nodes in the left subtree are smaller than the i-th dimension value of the root; if the right subtree is non-empty, the i-th dimension values of all nodes in the right subtree are greater than that of the root; and the left and right subtrees are themselves KDTrees.
Following this principle, let a feature vector be F = {f1, f2, f3, ..., fn}, where fi is the i-th dimension feature. At the first layer (the root node), f1 is compared with the first dimension of the root; if f1 is smaller, the vector descends into the root's left subtree and enters the second layer. At the second layer, if the left subtree is not empty, the second dimension f2 is compared with the second dimension of the left-subtree root to decide between its left and right subtrees, and so on through the third and subsequent layers, until an empty subtree is reached and the feature vector is inserted at that position.
Under this partitioning rule, the feature vectors of each speaker's speech data are split along the feature dimensions, and that speaker's KDTree is built. For instance, suppose the feature vector list is [(2,3,9), (5,4,2), (9,6,4), (4,7,0), (8,1,8), (7,2,3)]. The root is chosen first: since 5 and 7 are the medians of the first dimension, either (5,4,2) or (7,2,3) may serve as root; here (7,2,3) is selected. Then, starting from the first feature vector (2,3,9), its first dimension 2 is compared with the root's first dimension 7; since it is smaller, the vector belongs to the left subtree, and since that left subtree is empty, (2,3,9) is inserted as the root of the left subtree of (7,2,3).
For the second feature vector (5,4,2), its first dimension 5 is compared with the root's first dimension 7; since it is smaller, the vector belongs to the left subtree. Because the left subtree is not empty, the second dimension of (5,4,2) is then compared with the second dimension of the left-subtree root (2,3,9). Since 4 is greater than 3, the vector belongs to the right subtree, and since that right subtree is empty, (5,4,2) is inserted as the root of the right subtree of (2,3,9). Continuing in this way yields the final KDTree shown in Fig. 4.
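The insertion rule just described can be sketched as follows; the class and function names are hypothetical, and the root is chosen at the median of the first dimension exactly as in the worked example above.

```python
# A sketch of the KDTree insertion rule: at depth i the comparison uses
# dimension i mod d, smaller values go left, larger (or equal) go right.
class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(root, point, depth=0):
    if root is None:
        return Node(point)
    axis = depth % len(point)            # the discriminator for this layer
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root

def build_kdtree(points):
    points = list(points)
    # Root: the vector at the median of the first dimension, as in the example.
    root_pt = sorted(points, key=lambda p: p[0])[len(points) // 2]
    points.remove(root_pt)
    root = Node(root_pt)
    for p in points:                     # insert remaining vectors in order
        insert(root, p)
    return root

# build_kdtree([(2,3,9), (5,4,2), (9,6,4), (4,7,0), (8,1,8), (7,2,3)])
# reproduces the tree of Fig. 4, rooted at (7,2,3).
```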
After the KDTree construction unit 303 has built the speaker's KDTree, the feature-subclass generation unit 305 divides the speaker's feature vectors into subclasses sharing certain common characteristics, which in turn accelerates the subsequent subclass space division.
Specifically, suppose all root nodes i (i ≤ 2^N) at layer N of this speaker's KDTree are each taken as a separate root; each such root and the child nodes below it form a small KDTree, and all feature vectors of that small KDTree are considered to share common characteristics and are merged into one class, for which the mean, the variance and the number of feature vectors (leaf nodes) are computed. In the end, i feature subclasses are obtained for this speaker.
Preferably, the feature-subclass generation unit 305 screens the root nodes i (i ≤ 2^N) at layer N of the speaker's KDTree, retaining only the leading root nodes with the most child nodes; each retained root together with the child nodes of the layers below it is merged into one class, and the mean, variance and feature-vector (leaf-node) count of each class are computed.
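Building on the hypothetical Node class of the previous sketch, the subclass generation and root-node screening might look as follows; the layer choice and the number of retained roots are assumptions of this sketch.

```python
# A sketch of the per-speaker subclass generation: each root node at a chosen
# layer of the speaker's KDTree, together with its subtree, becomes one
# feature subclass summarized by (mean, variance, vector count).
import numpy as np

def collect(node, out):
    if node is None:
        return
    out.append(node.point)
    collect(node.left, out)
    collect(node.right, out)

def nodes_at_layer(root, layer):
    level = [root]                       # layer 0 is the tree root
    for _ in range(layer):
        level = [c for n in level for c in (n.left, n.right) if c is not None]
    return level

def feature_subclasses(root, layer, keep=None):
    stats = []
    for r in nodes_at_layer(root, layer):
        vecs = []
        collect(r, vecs)
        vecs = np.asarray(vecs, dtype=float)
        stats.append((vecs.mean(axis=0), vecs.var(axis=0), len(vecs)))
    # Optional screening: keep only the roots with the most descendants.
    stats.sort(key=lambda s: s[2], reverse=True)
    return stats[:keep] if keep else stats
```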
Returning now to Fig. 2, after the internal clustering unit 201 has generated the feature subclasses for every speaker, the feature-subclass space division unit 203 divides the feature subclasses of all background speakers spatially: it selects cluster centers among them and thereby partitions all feature subclasses into feature spaces, realizing clustering across the background speakers.
According to an example of the present invention, the feature-subclass space division unit 203 uses the maximum-distance sample method to cluster the categories of all background speakers, which ensures that samples lying far apart can serve as cluster centers.
Specifically, suppose there are M feature-subclass samples Zs = {Z1, Z2, ..., ZM}. First an arbitrary sample, say Z1, is taken as the first cluster center. Then the sample in Zs farthest from Z1 is taken as Z2. Next, for each remaining sample Zi in Zs, the distances to Z1 and Z2 are computed and the smaller of the two is recorded as D_Zi.
After D_Zi has been computed for all remaining samples Zi in Zs, the maximum of the D_Zi is found; if this value exceeds a certain computed value or given threshold, that Zi is taken as a new cluster center. Here the computed value may be taken as α times the distance between Z1 and Z2, with 0.5 ≤ α < 1.
The above processing is then repeated until no further qualifying new cluster center can be found. Finally, each remaining sample is assigned to the class of the center nearest to it.
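A minimal sketch of this maximum-distance sample method, assuming Euclidean distance over subclass representatives (for example, the subclass means); all names are hypothetical.

```python
# Z1 is an arbitrary first center, Z2 the sample farthest from it; a new
# center is added whenever the largest "distance to nearest existing center"
# exceeds alpha * |Z1 - Z2| (0.5 <= alpha < 1, an empirical choice).
import numpy as np

def max_distance_clustering(samples, alpha=0.5):
    samples = np.asarray(samples, dtype=float)
    centers = [samples[0]]                                   # Z1: arbitrary
    d0 = np.linalg.norm(samples - centers[0], axis=1)
    centers.append(samples[np.argmax(d0)])                   # Z2: farthest
    threshold = alpha * np.linalg.norm(centers[1] - centers[0])
    while True:
        # Distance from every sample to its nearest current center.
        dists = np.min(
            [np.linalg.norm(samples - c, axis=1) for c in centers], axis=0)
        i = int(np.argmax(dists))
        if dists[i] <= threshold:
            break                                            # no new center
        centers.append(samples[i])
    # Assign each sample to the class of its nearest center.
    labels = np.argmin(
        [np.linalg.norm(samples - c, axis=1) for c in centers], axis=0)
    return np.asarray(centers), labels
```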
Of course, the spatial division of the feature subclasses is not limited to the maximum-distance sample method described above; depending on the situation, other methods may be chosen to select cluster centers from the background speakers' feature subclasses, such as the K-means method, the single linkage method, the average linkage method or the centroid hierarchical method.
After the feature-subclass space division unit 203 has partitioned the background speakers' feature subclasses into feature spaces, the feature-space characterization unit 205 characterizes all feature subclasses contained in each feature space with a Gaussian function, thereby generating the universal background model for a general speaker. Specifically, the feature-space characterization unit 205 may compute the mean and variance of the feature vectors contained in all subclasses of each feature space to obtain the normal distribution function of that feature space, and thus generate the universal background model.
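A sketch of this characterization step, assuming diagonal covariances (the patent does not state whether full or diagonal covariances are used):

```python
# Pool the vectors of all subclasses assigned to each feature space and fit
# a Gaussian (mean, variance) per space; together these form the UBM.
import numpy as np

def characterize_spaces(vectors, labels, n_spaces):
    means, variances = [], []
    for k in range(n_spaces):
        pooled = vectors[labels == k]        # all vectors of space k
        means.append(pooled.mean(axis=0))
        variances.append(pooled.var(axis=0))
    return np.asarray(means), np.asarray(variances)
```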
Returning next to Fig. 1, the working principles of the registered-speaker model generation unit 105, the metric calculation unit 107 and the recognition unit 109 are described.
After the background model generation unit 103 has generated the universal background model from the background speakers' speech data, the registered-speaker model generation unit 105 adapts the universal background model with the feature vectors of each registered speaker's speech data, generating a registered speaker model for each speaker to be registered.
Specifically, according to a specific embodiment of the present invention, the registered-speaker model generation unit 105 first obtains the feature vectors F of the registered speaker's speech data from the feature extraction unit 101, and then, for each feature vector F, computes its posterior probability $p_k$ with respect to each feature space k according to

$$p_k = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}\,(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right\},\quad k = 1, 2, \ldots, N,$$

where $\mu_k$ is the mean of the feature vectors contained in all subclasses of feature space k, $\Sigma_k$ is their variance, N is the number of feature spaces in the partition (for example 512 or 1024), and d is the feature dimension.

The registered-speaker model generation unit 105 then computes the update factor $\alpha = \frac{1}{\gamma + p_k}$ and updates the mean $\mu_k$ of each feature space as $\mu'_k = \mu_k(1-\alpha) + \alpha F$, where $\gamma$ is an empirical value, for example 10 or 16. Finally, the registered-speaker model generation unit 105 adapts the universal background model with the updated feature-space means $\mu'_k$, thereby generating the registered speaker model for this registered speaker.
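A sketch of this mean-only adaptation, assuming diagonal covariances and a sequential update over the enrollment vectors (the patent does not specify the update order):

```python
# For each enrollment vector F, the posterior p_k under each feature-space
# Gaussian drives the update factor alpha = 1/(gamma + p_k), and the means
# are moved toward F; the adapted means define the registered speaker model.
import numpy as np

def adapt_means(feats, means, variances, gamma=16.0):
    mu = means.copy()
    d = feats.shape[1]
    for F in feats:
        diff = F - mu                                      # shape (N, d)
        log_p = (-0.5 * np.sum(diff**2 / variances, axis=1)
                 - 0.5 * np.sum(np.log(variances), axis=1)
                 - 0.5 * d * np.log(2 * np.pi))
        p = np.exp(log_p)                                  # p_k, k = 1..N
        alpha = 1.0 / (gamma + p)                          # update factor
        mu = mu * (1 - alpha[:, None]) + alpha[:, None] * F
    return mu
```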
In the recognition process, the metric calculation unit 107 first computes the metric values of the test speaker's feature vectors on the universal background model generated by the background model generation unit 103 and on each registered speaker model generated by the registered-speaker model generation unit 105. The recognition unit 109 then identifies the test speaker from the metric values computed by the metric calculation unit 107.
Specifically, according to a specific embodiment of the present invention, the metric calculation unit 107 first obtains the feature vectors of the test speaker's speech data from the feature extraction unit 101 and computes the posterior probabilities $P_B$ and $P_R$ of all obtained feature vectors with respect to the universal background model $M_B$ and a registered speaker model $M_R$ according to

$$P_B = \frac{1}{m}\sum_{i=1}^{m} p_B^i, \qquad P_R = \frac{1}{m}\sum_{i=1}^{m} p_R^i,$$

$$p_B^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^B\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^B|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^B)^T (\Sigma_k^B)^{-1} (x-\mu_k^B)\right),$$

$$p_R^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^R\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^R|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^R)^T (\Sigma_k^R)^{-1} (x-\mu_k^R)\right),$$

where m is the number of feature vectors obtained from the test speaker's speech data, $p_k^B$ and $p_k^R$ are the posterior probabilities of a feature vector on the k-th feature space of $M_B$ and $M_R$ respectively, and $w_k = \frac{1}{N}$ is the weight of each feature space.
Afterwards, the recognition unit 109 uses the posterior probabilities $P_B$ and $P_R$ computed by the metric calculation unit 107 to compute the score $P_R - P_B$ of the test speaker's speech data against each registered speaker model, takes the maximum $P_{max}$, and judges it against a preset threshold Th to identify the test speaker. If $P_{max}$ is greater than the threshold Th, the speaker corresponding to $P_{max}$ is the recognized speaker; otherwise the test speaker is rejected.
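A sketch of this scoring and decision rule under the same diagonal-covariance assumption; the model and function names are hypothetical.

```python
# Average the per-vector scores p^i over the UBM and over each registered
# model, then decide from the largest P_R - P_B against a preset threshold.
import numpy as np

def model_score(feats, means, variances):
    N, d = means.shape
    w = 1.0 / N                                            # w_k = 1/N
    scores = []
    for x in feats:
        diff = x - means                                   # shape (N, d)
        quad = -0.5 * np.sum(diff**2 / variances, axis=1)
        log_norm = 0.5 * (d * np.log(2 * np.pi)
                          + np.sum(np.log(variances), axis=1))
        scores.append(np.sum(np.log(w) - log_norm + quad)) # p^i
    return float(np.mean(scores))                          # P_B or P_R

def identify(feats, ubm, registered, threshold):
    p_b = model_score(feats, *ubm)                         # ubm = (means, vars)
    diffs = {name: model_score(feats, *m) - p_b
             for name, m in registered.items()}
    name, p_max = max(diffs.items(), key=lambda kv: kv[1])
    return name if p_max > threshold else None             # None = reject
```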
It should be noted here that in the embodiment described above, what the metric calculation unit 107 computes are the posterior probabilities of the test speaker's feature vectors on the universal background model generated by the background model generation unit 103 and on each registered speaker model generated by the registered-speaker model generation unit 105, and the recognition unit 109 identifies the test speaker from these posterior probabilities. The present invention is not limited to this, however: other quantities may be selected as needed, such as distance metrics or similarity metrics, with the metric calculation unit 107 computing the corresponding metric of the test speaker's feature vectors on the universal background model and on the registered speaker models, and the recognition unit 109 then identifying the speaker according to a suitably set threshold or according to experience. Such methods can likewise identify the test speaker quickly and conveniently, and those skilled in the art can readily realize them with suitable processing on the basis of the working principles of the metric calculation unit 107 and the recognition unit 109 described above.
For example, when a distance metric is adopted, the metric calculation unit 107 may first obtain the feature vectors of the test speaker's speech data from the feature extraction unit 101 and then compute the distance between each feature vector and each feature-space component of the universal background model, using for instance the common Euclidean distance, Manhattan distance, Minkowski distance, Gaussian divergence, Bhattacharyya (BHA) distance or Kullback-Leibler (KL) distance; the distances to the feature-space components, weighted by the corresponding weights, are accumulated as the distance between this feature vector and the universal background model, from which the distance between all feature vectors of the test speaker's speech data and the universal background model can be computed. On the same principle, the distance between the test speaker's feature vectors and a registered speaker model can be computed. The recognition unit 109 then computes the difference between these two distances, that is, between the distance to the universal background model and the distance to the registered speaker model. After this difference has been computed for all registered speaker models, the recognition unit 109 can identify the test speaker on the basis of the computed differences according to a preset threshold or according to experience.
Likewise, when a similarity metric is adopted, the metric calculation unit 107 may first obtain the feature vectors of the test speaker's speech data from the feature extraction unit 101 and then compute the similarity between each feature vector and each feature-space component of the universal background model, using for instance the cosine similarity, the Pearson coefficient or the adjusted cosine similarity; the similarities to the feature-space components, weighted by the corresponding weights, are accumulated as the similarity between this feature vector and the universal background model, from which the similarity between all feature vectors of the test speaker's speech data and the universal background model can be computed. On the same principle, the similarity between the test speaker's feature vectors and a registered speaker model can be computed. The recognition unit 109 again computes the difference between these two similarities, and after the difference has been computed for all registered speaker models, identifies the test speaker on the basis of the computed differences according to a preset threshold or according to experience.
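As one concrete instance of the similarity alternative, the following is a cosine-similarity sketch with hypothetical names and equal weights $w_k = 1/N$:

```python
# Weighted accumulation of the cosine similarity between each test vector
# and each feature-space mean, averaged over all test vectors.
import numpy as np

def cosine_metric(feats, means, weights=None):
    N = means.shape[0]
    w = weights if weights is not None else np.full(N, 1.0 / N)
    sims = feats @ means.T / (
        np.linalg.norm(feats, axis=1, keepdims=True)
        * np.linalg.norm(means, axis=1))                   # shape (m, N)
    return float(np.mean(sims @ w))
```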
The speaker recognition system according to the embodiment of the present invention has been described above in detail; a speaker recognition method according to an embodiment of the present invention is described below with reference to the drawings. Fig. 5 shows a flowchart of the processing of the speaker recognition method according to this embodiment.
As shown in Fig. 5, the speaker recognition method according to this embodiment of the invention comprises a feature extraction step S501, a universal background model generation step S503, a registered-speaker model generation step S505, a metric calculation step S507 and a test speaker recognition step S509. Since the concrete processing in each of these steps is similar to the processing in the corresponding modules of the speaker recognition system described with reference to Fig. 1 (the feature extraction unit 101, background model generation unit 103, registered-speaker model generation unit 105, metric calculation unit 107 and recognition unit 109), a further detailed description is omitted here.
In addition, Fig. 6 shows in detail a flowchart of the universal background model training process of a concrete example of the present invention. As shown in Fig. 6, the feature vectors of each background speaker's speech data are first extracted in step S601, and each background speaker's KDTree is built in step S603 according to the processing of the KDTree construction unit described above.
Then, in step S605, root nodes at a suitable layer of each background speaker's KDTree are selected for that speaker's internal feature clustering, generating a series of feature subclasses. Next, in step S607, the feature subclasses of all background speakers are clustered with the maximum-distance sample method, so that samples lying far apart can serve as cluster centers, and all feature subclasses are thereby partitioned spatially into feature spaces.
Finally, in step S609, all feature subclasses contained in each feature space are characterized, generating the universal background model for a general speaker.
As before, the clustering and spatial division of the feature subclasses are not limited to the maximum-distance sample method described above; other methods may be chosen to select cluster centers from the background speakers' feature subclasses depending on the situation, and are not detailed here.
As the detailed description of the speaker recognition system and method above shows, in order to ensure that each speaker's characteristics are concentrated in distribution before training, the speaker recognition system and method of the present invention first cluster each background speaker's features internally (a KDTree may be used) and then partition the feature subclasses of all speakers spatially; the resulting optimal spaces constitute the final universal background model. The background model is then adapted with a registered speaker's speech data to obtain the registered speaker's model. In the recognition test, the metric values of the test speaker's feature vectors on the universal background model and on the registered speaker models are computed and scored, and the speaker's identity is finally judged according to experience.
Therefore, compared with the traditional GMM-UBM system, the speaker recognition system and method according to the present invention not only have an explicit physical meaning but also run fast, and can thus achieve good recognition performance.
The basic principle of the present invention has been described above in connection with specific embodiments. It should be noted, however, that all or any of the steps or components of the method and apparatus of the present invention may be realized in hardware, firmware, software or a combination thereof, in any computing device (including processors, storage media, etc.) or network of computing devices, which persons of ordinary skill in the art can accomplish with their basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is to say, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
Where the embodiments of the present invention are realized by software and/or firmware, the program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 7, which, with various programs installed, can perform various functions.
In Fig. 7, a central processing unit (CPU) 701 performs various processing according to programs stored in a read-only memory (ROM) 702 or loaded from a storage section 708 into a random access memory (RAM) 703. Data needed when the CPU 701 performs various processing are also stored in the RAM 703 as required. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704, to which an input/output interface 705 is also connected.
The following components are connected to the input/output interface 705: an input section 706, including a keyboard, a mouse and the like; an output section 707, including a display, such as a cathode-ray tube (CRT) or liquid crystal display (LCD), and a loudspeaker; a storage section 708, including a hard disk and the like; and a communication section 709, including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet.
A drive 710 is also connected to the input/output interface 705 as required. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 710 as required, so that a computer program read from it is installed into the storage section 708 as needed.
Where the above series of processing is realized by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 711 shown in Fig. 7, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 711 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memories (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDiscs (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium may be the ROM 702, a hard disk contained in the storage section 708 or the like, in which the program is stored and which is distributed to the user together with the device containing it.
It should also be pointed out that in the apparatus and method of the present invention, the components or steps may obviously be decomposed and/or recombined; such decompositions and/or recombinations should be regarded as equivalents of the present invention. Moreover, the steps of the above series of processing may naturally be performed in the chronological order of the description, but need not necessarily be; some steps may be performed in parallel or independently of one another.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "comprising" and any other variants thereof in this application are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises it.

Claims (20)

1. A speaker recognition system, comprising:
a feature extraction unit configured to extract feature vectors from a speaker's speech data;
a background model generation unit configured to perform internal clustering on the feature vectors of background speakers' speech data and to generate a universal background model for a general speaker from the result of the internal clustering;
a registered-speaker model generation unit configured to adapt the universal background model with the feature vectors of each registered speaker's speech data, generating a registered speaker model for each registered speaker;
a metric calculation unit configured to compute metric values of a test speaker's feature vectors on the universal background model generated by the background model generation unit and on each registered speaker model generated by the registered-speaker model generation unit; and
a recognition unit configured to identify the test speaker from the metric values computed by the metric calculation unit.
2. The speaker recognition system according to claim 1, wherein the background model generation unit comprises:
an internal clustering unit configured to perform internal clustering on the feature vectors of the background speakers' speech data to generate a series of feature subclasses;
a feature-subclass space division unit configured to select cluster centers from the feature subclasses of all background speakers generated by the internal clustering unit, so as to partition all feature subclasses into feature spaces; and
a feature-space characterization unit configured to characterize all feature subclasses contained in each feature space, so as to generate the universal background model for a general speaker.
3. The speaker recognition system according to claim 2, wherein the internal clustering unit constructs a KDTree from the feature vectors of each background speaker's speech data and performs the internal clustering according to a nearest-neighbor rule.
4. The speaker recognition system according to claim 3, wherein the internal clustering unit comprises:
a voiced-segment extraction unit configured to extract feature vectors from the voiced segments of a background speaker's speech data;
a KDTree construction unit configured to build the extracted feature vectors into a KDTree such that, at every layer, the value of the dimension associated with that layer is smaller than the root node's value of that dimension for all nodes in the root node's left subtree, and larger for all nodes in its right subtree; and
a feature-subclass generation unit configured to cluster each root node at some chosen layer of the constructed KDTree, together with its subtrees, into a feature subclass sharing common characteristics.
5. The speaker recognition system according to claim 4, wherein the feature-subclass generation unit screens the root nodes and retains those with the largest numbers of child nodes.
6. The speaker recognition system according to any one of claims 2 to 5, wherein the feature-subclass space division unit selects the cluster centers from the background speakers' feature subclasses generated by the internal clustering unit using the maximum-distance sample method, the K-means method, the single linkage method, the average linkage method or the centroid method.
7. The speaker recognition system according to any one of claims 2 to 5, wherein the feature-space characterization unit characterizes all feature subclasses contained in each feature space with a Gaussian function.
8. The speaker recognition system according to any one of claims 2 to 5, wherein the feature-space characterization unit computes the mean and variance of the feature vectors contained in all feature subclasses of each feature space, to obtain the normal distribution function of each feature space.
9. The speaker recognition system according to claim 8, wherein the registered-speaker model generation unit
obtains the feature vectors F of a registered speaker's speech data;
for each feature vector F, computes its posterior probability $p_k$ with respect to each feature space k,

$$p_k = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}\,(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right\},\quad k = 1, 2, \ldots, N,$$

where $\mu_k$ is the mean of the feature vectors contained in all subclasses of feature space k, $\Sigma_k$ is their variance, N is the number of feature spaces in the partition, and d is the feature dimension;
computes the update factor $\alpha = \frac{1}{\gamma + p_k}$, where $\gamma$ is an empirical value;
updates the mean of each feature space as $\mu'_k = \mu_k(1-\alpha) + \alpha F$; and
adapts the universal background model with the updated feature-space means, to generate the registered speaker model of this registered speaker.
10. The speaker recognition system according to claim 9, wherein
the metric calculation unit obtains the feature vectors of the test speaker's speech data and computes the posterior probabilities $P_B$ and $P_R$ of all obtained feature vectors with respect to the universal background model $M_B$ and a registered speaker model $M_R$,

$$P_B = \frac{1}{m}\sum_{i=1}^{m} p_B^i, \qquad P_R = \frac{1}{m}\sum_{i=1}^{m} p_R^i,$$

$$p_B^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^B\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^B|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^B)^T (\Sigma_k^B)^{-1} (x-\mu_k^B)\right),$$

$$p_R^i = \sum_{k=1}^{N} \log\left(w_k\, p_k^R\right) = N \log w_k - \sum_{k=1}^{N} \log\left((2\pi)^{d/2}\,|\Sigma_k^R|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\,(x-\mu_k^R)^T (\Sigma_k^R)^{-1} (x-\mu_k^R)\right),$$

where m is the number of feature vectors obtained from the test speaker's speech data, $p_k^B$ and $p_k^R$ are the posterior probabilities of a feature vector on the k-th feature space of $M_B$ and $M_R$ respectively, and $w_k = \frac{1}{N}$ is the weight of each feature space; and
the recognition unit computes the score $P_R - P_B$ of the test speaker's speech data against each registered speaker model, takes the maximum $P_{max}$, and identifies the test speaker according to a preset threshold.
11. A speaker recognition method, comprising:
extracting feature vectors from a speaker's speech data;
performing internal clustering on the feature vectors of background speakers' speech data and generating a universal background model for a general speaker from the result of the internal clustering;
adapting the universal background model with the feature vectors of each registered speaker's speech data to generate a registered speaker model for each registered speaker;
computing metric values of a test speaker's feature vectors on the universal background model and on each registered speaker model; and
identifying the test speaker from the computed metric values.
12. The speaker recognition method according to claim 11, wherein generating the universal background model comprises:
performing internal clustering on the feature vectors of the background speakers' speech data to generate a series of feature subclasses;
selecting cluster centers from the generated feature subclasses of all background speakers, so as to partition all feature subclasses into feature spaces; and
characterizing all feature subclasses contained in each feature space, so as to generate the universal background model for a general speaker.
13. The speaker recognition method according to claim 12, wherein in the internal clustering a KDTree is constructed from the feature vectors of each background speaker's speech data and clustered according to a nearest-neighbor rule.
14. The speaker recognition method according to claim 13, wherein the internal clustering comprises:
extracting feature vectors from the voiced segments of a background speaker's speech data;
building the extracted feature vectors into a KDTree such that, at every layer, the value of the dimension associated with that layer is smaller than the root node's value of that dimension for all nodes in the root node's left subtree, and larger for all nodes in its right subtree; and
clustering each root node at some chosen layer of the constructed KDTree, together with its subtrees, into a feature subclass sharing common characteristics.
15. The speaker recognition method according to claim 14, wherein the root nodes are screened and those with the largest numbers of child nodes are retained.
16. The speaker recognition method according to any one of claims 12 to 15, wherein the maximum-distance method, the K-means method, the minimum-distance method, the group-average-distance method, or the centroid method is used to select cluster centres from the generated feature subsets.
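Of the centre-selection options listed in claim 16, K-means is the most widely used; the following illustrative sketch runs K-means over the means of the feature subsets to obtain the cluster centres (all names and the iteration count are assumptions of this example, and a library routine would do equally well).

```python
# Illustrative K-means centre selection over feature-subset means.
import numpy as np

def select_centres(subsets, n_spaces, iters=20, seed=0):
    means = np.array([s.mean(axis=0) for s in subsets])    # one point per subset
    rng = np.random.default_rng(seed)
    centres = means[rng.choice(len(means), n_spaces, replace=False)]
    for _ in range(iters):
        # assign each subset mean to its nearest centre
        dist = np.linalg.norm(means[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # move each centre to the mean of its assigned points
        for k in range(n_spaces):
            if np.any(labels == k):
                centres[k] = means[labels == k].mean(axis=0)
    return centres, labels     # labels assign each subset to a feature space
```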
17. The speaker recognition method according to any one of claims 12 to 15, wherein all feature subsets contained in each feature space are characterized by Gaussian functions.
18. The speaker recognition method according to any one of claims 12 to 15, wherein the mean and variance of the feature vectors contained in all feature subsets of each feature space are calculated to obtain the normal distribution function of each feature space.
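Continuing the same illustrative assumptions, claim 18's characterization can be sketched as pooling the feature vectors of every subset assigned to a feature space and taking their mean and (diagonal) variance as that space's normal distribution.

```python
# Illustrative per-feature-space Gaussian parameters (mu_k, diagonal Sigma_k).
import numpy as np

def characterize_spaces(subsets, labels, n_spaces):
    """Assumes every feature space received at least one subset."""
    means, variances = [], []
    for k in range(n_spaces):
        pooled = np.vstack([s for s, lab in zip(subsets, labels) if lab == k])
        means.append(pooled.mean(axis=0))      # mu_k of the k-th feature space
        variances.append(pooled.var(axis=0))   # diagonal Sigma_k
    return np.array(means), np.array(variances)
```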
19. The speaker recognition method according to claim 18, wherein generating the registered speaker model comprises:
obtaining the feature vectors F of a registered speaker's speech data;
for each feature vector F, calculating its posterior probability $p_k$ with respect to each feature space $k$,
$$p_k = \frac{1}{(2\pi)^{d/2}\left|\Sigma_k\right|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right\}, \quad k = 1, 2, \ldots, N,$$
where $\mu_k$ is the mean of the feature vectors contained in all feature subsets of each feature space, $\Sigma_k$ is the variance of those feature vectors, $N$ is the number of feature spaces, and $d$ is the feature dimension;
calculating the update factor $\alpha = \frac{1}{\gamma + p_k}$, where $\gamma$ is an empirical value;
updating the mean of each feature space: $\mu'_k = \mu_k(1-\alpha) + \alpha \cdot F$; and
adapting the universal background model with the updated feature-space means to generate the registered speaker model of this registered speaker.
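A minimal sketch of the mean-update loop of claim 19, with diagonal covariances and `gamma` standing in for the empirical value $\gamma$; it illustrates the update rule $\alpha = 1/(\gamma + p_k)$, $\mu'_k = \mu_k(1-\alpha) + \alpha F$ rather than reproducing the patented implementation.

```python
# Illustrative adaptation of the background-model means toward one
# registered speaker's feature vectors.
import numpy as np

def adapt_means(feats, means, variances, gamma=16.0):
    mu = means.copy()
    dim = mu.shape[1]
    for f in feats:
        diff = f - mu
        # posterior p_k of this feature vector on each feature space
        # (a real system would work in the log domain to avoid underflow)
        p_k = np.exp(-0.5 * np.sum(diff**2 / variances, axis=1)) / (
            (2 * np.pi) ** (dim / 2) * np.sqrt(np.prod(variances, axis=1)))
        alpha = 1.0 / (gamma + p_k)                # update factor per space
        mu = mu * (1 - alpha[:, None]) + alpha[:, None] * f
    return mu                                      # registered-speaker means
```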
20. The speaker recognition method according to claim 19, wherein
in the metric calculation, the feature vectors of the test speaker's speech data are obtained, and the posterior probabilities $P_B$ and $P_R$ of all obtained feature vectors with respect to the universal background model $M_B$ and the registered speaker model $M_R$ are calculated respectively:
$$P_B = \frac{1}{m}\sum_{i=1}^{m} p_B^i, \qquad P_R = \frac{1}{m}\sum_{i=1}^{m} p_R^i$$

$$p_B^i = \sum_{k=1}^{N}\log\left(w_k\,p_k^B\right) = N\log w_k - \sum_{k=1}^{N}\log\left((2\pi)^{d/2}\left|\Sigma_k^B\right|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\left(x-\mu_k^B\right)^T\left(\Sigma_k^B\right)^{-1}\left(x-\mu_k^B\right)\right)$$

$$p_R^i = \sum_{k=1}^{N}\log\left(w_k\,p_k^R\right) = N\log w_k - \sum_{k=1}^{N}\log\left((2\pi)^{d/2}\left|\Sigma_k^R\right|^{1/2}\right) + \sum_{k=1}^{N}\left(-\frac{1}{2}\left(x-\mu_k^R\right)^T\left(\Sigma_k^R\right)^{-1}\left(x-\mu_k^R\right)\right)$$
where $m$ is the number of feature vectors obtained from the test speaker's speech data, $p_k^B$ and $p_k^R$ are the posterior probabilities of a feature vector on the $k$-th feature space of the universal background model $M_B$ and of the registered speaker model $M_R$, respectively, and $w_k = \frac{1}{N}$ is the weight of each feature space; and
in the identification, the score $P_R - P_B$ of the test speaker's speech data is calculated against each registered speaker model, the maximum value $P_{max}$ is obtained, and the test speaker is identified according to a set threshold.
CN200910170552.6A 2009-09-10 2009-09-10 Speaker recognition system and method Expired - Fee Related CN102024455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910170552.6A CN102024455B (en) 2009-09-10 2009-09-10 Speaker recognition system and method


Publications (2)

Publication Number Publication Date
CN102024455A (en) 2011-04-20
CN102024455B (en) 2014-09-17

Family

ID=43865670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910170552.6A Expired - Fee Related CN102024455B (en) 2009-09-10 2009-09-10 Speaker recognition system and method

Country Status (1)

Country Link
CN (1) CN102024455B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238190B (en) * 2011-08-01 2013-12-11 安徽科大讯飞信息科技股份有限公司 Identity authentication method and system
CN102270451B (en) * 2011-08-18 2013-05-29 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102737633B (en) * 2012-06-21 2013-12-25 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
CN102968990B (en) * 2012-11-15 2015-04-15 朱东来 Speaker identifying method and system
CN103106900B (en) * 2013-02-28 2016-05-04 用友网络科技股份有限公司 Speech recognition equipment and audio recognition method
CN103226951B (en) * 2013-04-19 2015-05-06 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103219008B (en) * 2013-05-16 2016-04-20 清华大学 Based on the phrase sound method for distinguishing speek person of base state vector weighting
CN104464738B (en) * 2014-10-31 2018-01-02 北京航空航天大学 A kind of method for recognizing sound-groove towards Intelligent mobile equipment
CN104616655B (en) * 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN105702263B (en) * 2016-01-06 2019-08-30 清华大学 Speech playback detection method and device
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN106971732A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of method and system that the Application on Voiceprint Recognition degree of accuracy is lifted based on identification model
CN107274904A (en) * 2016-04-07 2017-10-20 富士通株式会社 Method for distinguishing speek person and Speaker Identification equipment
CN106169295B (en) 2016-07-15 2019-03-01 腾讯科技(深圳)有限公司 Identity vector generation method and device
CN106683664A (en) * 2016-11-22 2017-05-17 中南大学 Voice starting method and system for wireless charging
CN108268948B (en) * 2017-01-03 2022-02-18 富士通株式会社 Data processing apparatus and data processing method
CN107240396B (en) * 2017-06-16 2023-01-17 百度在线网络技术(北京)有限公司 Speaker self-adaptation method, device, equipment and storage medium
CN107680600B (en) * 2017-09-11 2019-03-19 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN108417226A (en) * 2018-01-09 2018-08-17 平安科技(深圳)有限公司 Speech comparison method, terminal and computer readable storage medium
CN108922515A (en) * 2018-05-31 2018-11-30 平安科技(深圳)有限公司 Speech model training method, audio recognition method, device, equipment and medium
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track
CN110211595B (en) * 2019-06-28 2021-08-06 四川长虹电器股份有限公司 Speaker clustering system based on deep learning
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN111341324B (en) * 2020-05-18 2020-08-25 浙江百应科技有限公司 Fasttext model-based recognition error correction and training method


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315771A (en) * 2008-06-04 2008-12-03 哈尔滨工业大学 Compensation method for different speech coding influence in speaker recognition

Also Published As

Publication number Publication date
CN102024455A (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN102024455B (en) Speaker recognition system and method
US11244689B2 (en) System and method for determining voice characteristics
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
US6253179B1 (en) Method and apparatus for multi-environment speaker verification
JP3532346B2 (en) Speaker Verification Method and Apparatus by Mixture Decomposition Identification
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
CN107610707A (en) A kind of method for recognizing sound-groove and device
JP2014502375A (en) Passphrase modeling device and method for speaker verification, and speaker verification system
CN105656887A (en) Artificial intelligence-based voiceprint authentication method and device
CN106098068A (en) A kind of method for recognizing sound-groove and device
CN100363938C (en) Multi-model ID recognition method based on scoring difference weight compromised
CN105096955B (en) A kind of speaker's method for quickly identifying and system based on model growth cluster
US6684186B2 (en) Speaker recognition using a hierarchical speaker model tree
Apsingekar et al. Speaker model clustering for efficient speaker identification in large population applications
CN110110790B (en) Speaker confirmation method adopting unsupervised clustering score normalization
US11837236B2 (en) Speaker recognition based on signal segments weighted by quality
CN108091326A (en) A kind of method for recognizing sound-groove and system based on linear regression
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN114299920A (en) Method and device for training language model for speech recognition and speech recognition method and device
Michalevsky et al. Speaker identification using diffusion maps
WO2002029785A1 (en) Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
Jahangir et al. Automatic speaker identification through robust time domain features and hierarchical classification approach
Panda et al. Study of speaker recognition systems
CN115577357A (en) Android malicious software detection method based on stacking integration technology
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140917

Termination date: 20150910

EXPY Termination of patent right or utility model