A text-independent speaker identification device based on line spectral frequency differences
Technical field
The present invention describes a text-independent speaker recognition system based on linearly transformed line spectral frequency (LSF) parameters and a super-Dirichlet mixture model.
Background technology
With the development of computer technology, identification or verification using human biological characteristics (such as fingerprints, voiceprints, and faces) has great research and application value. Speaker recognition automatically confirms whether a speaker belongs to a set of enrolled speakers, and further confirms the speaker's identity, according to speech parameters in the speech waveform that reflect the physiological and behavioural characteristics of speaking. Speaker recognition comprises two parts: speaker identification and speaker verification. A speaker identification system generally includes three parts: extracting features that can represent the speaker; training, for each speaker, an independent model that fits the statistical law of the selected features; and finally making a decision by comparing the input data with the obtained models.
For the first part, feature extraction, analysing the speech signal based on vocal-tract characteristics is an effective method in current speaker recognition; commonly used features mainly include Mel-frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF). Traditional MFCC vectors express dynamic information by differencing, whereas the present invention adopts a feature super vector represented by line spectral frequency differences to preserve the original neighbourhood information. In addition, the method of the present invention also takes into account the high-frequency information ignored by MFCC, which is useful for machine discrimination of speakers.
Current recognition methods can be divided into three classes: template matching methods, probability model methods, and artificial neural network methods. A probability model adopts a certain probability density function to describe the distribution of a speaker's speech feature space, and takes a group of parameters of this probability density function as the speaker model. The Gaussian mixture model (GMM) has been widely used in text-independent speaker recognition systems because it is simple and efficient. However, the super-Dirichlet mixture model (SDMM) of the present invention can better describe the boundedness and ordering of the extracted features.
According to the identification object, speaker recognition can be divided into two classes: text-dependent and text-independent. Text-dependent speaker recognition requires the speaker to pronounce keywords and key sentences as the training text, and to pronounce the same content at identification time. Text-independent speaker recognition specifies the spoken content neither at training time nor at identification time; the identification object is free speech, so features and methods that can characterise the speaker's information must be found in free speech, which makes it relatively difficult to establish the speaker model. In addition, a text-dependent recognition system is easily spoofed with stolen recordings and is inconvenient to use; what is described in the present invention is a text-independent recognition system.
Summary of the invention
In order to overcome the above defects of the existing technology and to improve text-independent speaker resolution, the present invention provides a text-independent speaker identification device based on linearly transformed line spectral frequency parameters and a super-Dirichlet mixture model.
To achieve the above object, the text-independent speaker identification method proposed by the present invention comprises the following steps:
One. Feature extraction step:
A. Line spectral frequency parameter transformation step: in the linear predictive coding model of speech, the line spectral frequency parameters are converted into line spectral frequency parameter differences by a linear transformation;
B. Line spectral frequency feature super vector generation step: the current frame is combined with its two adjacent frames (the preceding and the following one) to form a feature super vector that expresses dynamic information.
Two. Model training step: for each speaker, a model is trained from a frame sequence of length T; the distribution of the feature super vectors is modelled with the super-Dirichlet mixture model (SDMM); the parameters α of the model are obtained by solving the likelihood equations with a gradient method; finally a series of models is obtained, one model per speaker.
Three. Identification matching step: a speech sample of a speaker in the enrolled set is input into the series of trained probability models; the parameters are transformed and the feature super vector is generated with the method of step one; the likelihood value for each probability model trained in step two is calculated; and the speaker number corresponding to the maximum likelihood value is taken as the identification result.
According to the text-independent speaker identification method of an embodiment of the present invention, the line spectral frequency parameter transformation step described in step A exploits three properties of the line spectral frequency parameters — (1) non-negativity, (2) ordering and (3) boundedness — to transform them into the line spectral frequency difference ΔLSF, which is characterised by: (1) every component lies in the open interval (0, 1); (2) the components sum to 1. The detailed procedure of this step is as follows:
1) The K-dimensional line spectral frequency parameter is written s = [s_1, s_2, ..., s_K]^T, satisfying 0 < s_1 < s_2 < ... < s_K < π;
2) The (K+1)-dimensional line spectral frequency parameter difference ΔLSF after the transformation is Δs = [Δs_1, Δs_2, ..., Δs_{K+1}]^T, where Δs_i = (s_i − s_{i−1}) / π, with the boundary conventions s_0 = 0 and s_{K+1} = π, so that each Δs_i lies in (0, 1) and the Δs_i sum to 1.
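As an illustrative, non-binding sketch of the transformation in step A, the mapping from K ordered LSFs to the (K+1)-dimensional difference vector can be written in a few lines of Python (the function name and the boundary conventions s_0 = 0, s_{K+1} = π are assumptions of this sketch):

```python
from math import pi

def lsf_to_delta(lsf):
    """Map K ordered LSFs (0 < s_1 < ... < s_K < pi) to the (K+1)-dim
    difference vector whose entries lie in (0, 1) and sum to 1."""
    s = [0.0] + list(lsf) + [pi]            # boundary values s_0 = 0, s_{K+1} = pi
    return [(s[i] - s[i - 1]) / pi for i in range(1, len(s))]
```

For example, `lsf_to_delta([0.3, 1.0, 2.0])` yields a 4-dimensional vector whose entries are positive and sum to 1.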
According to the text-independent speaker identification method of an embodiment of the present invention, the line spectral frequency feature super vector generation step described in step B combines the current frame x(t) with its adjacent frames to form a super vector that expresses dynamic information; in the present invention this super vector comprises three subvectors. Assuming that the interval between the current frame and both the previous and the following frame is τ, and considering here only the two neighbourhood frames of the current frame, the previous frame x(t−τ) and the following frame x(t+τ), the generated feature super vector is 3(K+1)-dimensional. The detailed procedure is as follows:
1) The (K+1)-dimensional line spectral frequency parameter difference vector is x(t) = [x_{1,1}, x_{1,2}, ..., x_{1,K+1}]^T;
2) The resulting super vector containing the dynamic information is x_sup(t) = [x(t−τ)^T, x(t)^T, x(t+τ)^T]^T.
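As a minimal sketch of step B (pure Python; the function and argument names are assumptions, not the patent's), the super vector is simply the concatenation of the three neighbouring difference vectors:

```python
def make_supervector(frames, t, tau=1):
    """Concatenate the frames at t - tau, t and t + tau into one
    3(K+1)-dimensional super vector, preserving neighbourhood information."""
    return list(frames[t - tau]) + list(frames[t]) + list(frames[t + tau])
```

Because the three subvectors are stacked rather than differenced, the original neighbourhood information of each frame survives in the super vector.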
According to the text-independent speaker identification method of an embodiment of the present invention, the detailed steps of the model training described in step two are:
1) The feature subvectors x(t), x(t−τ), x(t+τ) in x_sup are mutually independent and each obeys a Dirichlet distribution, so the super vector x_sup obeys the super-Dirichlet probability density

  SDir(x_sup; α) = ∏_{j=1}^{3} Dir(x_j; α_j),  with  Dir(x; α) = [Γ(Σ_{k=1}^{K+1} α_k) / ∏_{k=1}^{K+1} Γ(α_k)] ∏_{k=1}^{K+1} x_k^{α_k − 1};

2) For a time sequence of line spectral frequency parameter difference vectors x(1), ..., x(t), ..., x(T), let X = [x_sup(1), ..., x_sup(T)]; the line spectral frequency differences are modelled with the super-Dirichlet mixture model (SDMM)

  p(x_sup; Θ) = Σ_{m=1}^{M} π_m SDir(x_sup; α_m),

where the weight factors π_m are non-negative and satisfy Σ_{m=1}^{M} π_m = 1;
3) Model parameters are computed: for the m-th mixture component, the parameter vector α_m is divided into 3 subvectors, each corresponding to one subvector of x_sup. All parameters are then obtained by solving the likelihood equations, i.e. by maximising ln p(X; Θ) over the α_m and π_m with the gradient method.
The beneficial effect of the present invention is that, compared with the prior art, the present invention extracts transformed line spectral frequency parameter super vectors as the speaker features, trains the model with a super-Dirichlet mixture distribution, and provides a complete implementation system for the application; test results have verified the efficiency of the present invention, which has strong practicality.
Brief description of the drawings
Fig. 1 is the step flow chart of the method provided by the present invention;
Fig. 2 is the step flow chart of the line spectral frequency parameter transformation;
Fig. 3 is the step flow chart of constructing the feature super vector.
Embodiment
Specific embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.
Fig. 1 is the flow chart of the present invention, in which the dotted lines indicate the flow of the training part and the solid lines indicate the flow of the identification part; the method comprises the following steps:
The first step: feature extraction, performing feature extraction on the speaker's voice sequence to be used for training.
Step S1: convert the line spectral frequency parameters into line spectral frequency parameter differences;
Step S2: generate the line spectral frequency feature super vector;
The second step: training the model.
Step S3: use the super-Dirichlet mixture model to model the distribution of the feature super vectors, and solve for the parameters of the model;
The third step: identification.
Repeat step S1 and step S2 of the first step on the speaker's voice sequence to be identified to generate the feature super vector, and input it into the models trained in step S3.
Step S4: calculate the likelihood value for each probability model, take the maximum likelihood value, and confirm the speaker's number.
Each step is described in detail below:
Step S1 implements the line spectral frequency parameter transformation: the line spectral frequency parameters of the linear predictive coding model of speech are converted into line spectral frequency parameter differences by a linear transformation. Fig. 2 gives the detailed flow of the method, as follows:
1) Input: line spectral frequency parameter s = [s_1, s_2, ..., s_K]^T;
2) In step 11, loop i from 1 to K+1; the difference obtained at each iteration is Δs_i = (s_i − s_{i−1}) / π, with the boundary conventions s_0 = 0 and s_{K+1} = π;
3) Output: line spectral frequency parameter difference Δs = [Δs_1, Δs_2, ..., Δs_{K+1}]^T.
Step S2 generates the line spectral frequency feature super vector by combining the current frame x(t) with its two adjacent frames into one super vector, thereby expressing dynamic information. Assuming that the interval between the current frame and both the previous and the following frame is τ, in the present invention this super vector comprises three subvectors: the current frame x(t), the previous frame x(t−τ) and the following frame x(t+τ); the generated feature super vector is 3(K+1)-dimensional. Fig. 3 gives the detailed flow, with the following steps:
1) Input: (K+1)-dimensional line spectral frequency parameter difference vector x(t) = [x_{1,1}, x_{1,2}, ..., x_{1,K+1}]^T;
2) Output: x_sup(t) = [x(t−τ)^T, x(t)^T, x(t+τ)^T]^T.
Step S3 uses the super-Dirichlet mixture model to model the distribution of the feature super vectors and solves for the model parameters. The detailed steps are:
1) The feature subvectors x(t), x(t−τ), x(t+τ) in x_sup are mutually independent and each obeys a Dirichlet distribution, so the super vector x_sup obeys the super-Dirichlet distribution

  SDir(x_sup; α) = ∏_{j=1}^{3} Dir(x_j; α_j),

where α_1, α_2, α_3 are the parameter subvectors, with components α_{1,k}, α_{2,k}, α_{3,k};
2) For a time sequence of line spectral frequency parameter difference vectors x(1), ..., x(t), ..., x(T), with X = [x_sup(1), ..., x_sup(T)], the probability of the target vector under the super-Dirichlet mixture model (SDMM) containing M components is

  p(x_sup; Θ) = Σ_{m=1}^{M} π_m SDir(x_sup; α_m),

where the weight factor π_m is the non-negative weight of the m-th component, and Σ_{m=1}^{M} π_m = 1;
3) Model parameters are computed: for the m-th mixture component, the parameter vector α_m is divided into 3 subvectors, each corresponding to one subvector of x_sup. All parameters are then obtained by solving the likelihood equations, i.e. by maximising the log-likelihood with the gradient method.
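The gradient solver itself is not reproduced here. As a rough, assumption-laden stand-in (moment matching is not the patent's gradient method — it is offered only as a simple initialiser or sanity check), the Dirichlet parameters of one subvector can be estimated from sample means and one variance:

```python
import random

def dirichlet_moment_fit(samples):
    """Moment-matching estimate of Dirichlet parameters from points on the
    simplex: alpha_k = alpha_0 * mean_k, with the total concentration
    alpha_0 recovered from the variance of the first coordinate.
    Only a stand-in / initialiser for a proper gradient (ML) solver."""
    n = len(samples)
    k = len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(k)]
    var0 = sum((s[0] - mean[0]) ** 2 for s in samples) / (n - 1)
    alpha0 = mean[0] * (1.0 - mean[0]) / var0 - 1.0   # total concentration
    return [alpha0 * m for m in mean]

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample via normalised Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]
```

On synthetic data drawn from a known Dirichlet, the moment estimate recovers the true concentration parameters to within a few percent for a few thousand samples.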
In step S4, at identification time, the speech to be identified is input into the series of models of all speakers obtained by the training of step S3, and the speaker being identified is assigned the number of the model with the largest likelihood value.
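Putting the scoring of step S3 and the decision of step S4 together, a hedged end-to-end sketch follows (the model layout, the function names and the log-sum-exp evaluation are choices of this sketch, not prescribed by the patent):

```python
from math import lgamma, log, exp

def log_dirichlet(x, alpha):
    """Log Dirichlet density at a point x on the simplex."""
    return (lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
            + sum((a - 1.0) * log(xi) for a, xi in zip(alpha, x)))

def sdmm_log_density(subvectors, model):
    """model: list of (weight, alpha_subvectors) mixture components.
    Each component is a super-Dirichlet, i.e. three independent Dirichlets."""
    comps = [log(w) + sum(log_dirichlet(x, a) for x, a in zip(subvectors, alphas))
             for w, alphas in model]
    peak = max(comps)
    return peak + log(sum(exp(c - peak) for c in comps))   # log-sum-exp

def identify(sequence, models):
    """Return the index of the speaker model with the largest total
    log-likelihood over a sequence of super vectors (step S4)."""
    scores = [sum(sdmm_log_density(sv, mdl) for sv in sequence) for mdl in models]
    return scores.index(max(scores))
```

Working in the log domain and using log-sum-exp avoids the numerical underflow that multiplying many per-frame likelihoods would cause.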
The embodiments of the proposed text-independent speaker identification method and of each of its modules have been set forth above with reference to the accompanying drawings. Through the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, but the former is the better embodiment in many cases. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art can be embodied in the form of a computer software product stored on a storage medium, which includes instructions for causing one or more computer devices to perform the method described in each embodiment of the present invention.
According to the idea of the present invention, changes may be made in the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.
The above-described embodiments of the present invention do not limit the protection scope of the invention. Any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.