CN105139855A - Speaker identification method and device with two-stage sparse decomposition - Google Patents
- Publication number: CN105139855A
- Application number: CN201410231798.0A
- Authority
- CN
- China
- Prior art keywords
- dictionary
- speaker
- sparse decomposition
- voice
- sparse
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to a speaker identification method based on two-stage sparse decomposition. The method comprises the steps of: (S1) framing and windowing the discrete-time signal of the input speech; (S2) applying a discrete Fourier transform to each frame and taking the magnitude, the magnitude spectrum being extracted as the feature; (S3) constructing a large dictionary; (S4) performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech by its sparse representation to obtain a subset of target speaker dictionaries; (S5) splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker. The method can distinguish different speakers and offers efficient, accurate, and easy-to-use speaker identification. The invention also discloses a speaker identification device based on two-stage sparse decomposition.
Description
Technical field
The present invention relates to the field of speaker identification in speech processing, and in particular to a speaker identification method and device based on two-stage sparse decomposition.
Background technology
At present, speaker identification is widely applied in fields such as identity verification, network monitoring, telephone monitoring, and information security. After decades of extensive research, typical recognition systems such as the Gaussian mixture model-universal background model (GMM-UBM) method, the Gaussian mixture model-support vector machine (GMM-SVM) method, and joint factor analysis achieve satisfactory results under ideal conditions. In noisy environments, however, their performance degrades sharply, which limits the widespread application of these techniques.
Researchers have proposed two classes of methods to strengthen the noise robustness of speaker identification. The first class extracts noise-robust features, such as linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) coefficients. These methods bring only limited improvement, because no feature can selectively represent speech while excluding noise. The second class applies speech enhancement, such as spectral subtraction or Wiener filtering, to remove noise from the noisy speech before extracting features. Unfortunately, most noise is non-stationary, and some noise even resembles speech, making it difficult to model and estimate. As a result, speech enhancement inevitably introduces considerable distortion, which degrades current speaker identification methods. New techniques are therefore needed to solve this problem.
In the past several years, sparse coding has been widely studied and offers a possible solution for speaker identification in noisy environments. This technique represents a signal with a group of atoms (elementary signals); the set of atoms is called a dictionary. Through sparse coding, all or most of the information in a signal is represented by a linear combination of a small number of atoms. Recently, a sparse coding method called morphological component analysis (MCA) has been successfully applied to speaker identification. With this technique, a dictionary is prepared for each speaker, and all speaker dictionaries are spliced into one large dictionary. During identification, the test speech is sparsely represented over the large dictionary. In theory, a speaker's speech can be represented only by that speaker's dictionary, so the sparse representation can be used directly for classification.
Nearly all such speaker identification methods follow the MCA framework: they first convert the training utterances into GMM mean supervectors or total-variability vectors, then assemble these vectors into a large dictionary over which sparse decomposition and classification are performed. These methods are reported to outperform the traditional GMM-UBM and GMM-SVM methods. However, they still do not compensate for noise, which reduces their recognition rate under noisy conditions.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, how to provide a method that compensates for noise in speaker identification and thereby improves the recognition rate of speaker speech under noisy conditions.
To this end, the present invention proposes a speaker identification method based on two-stage sparse decomposition, comprising the following steps:
S1: framing and windowing the discrete-time signal of the input speech;
S2: applying a discrete Fourier transform to each frame and taking the magnitude, the magnitude spectrum being extracted as the feature;
S3: constructing a large dictionary, wherein the large dictionary comprises a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary;
S4: performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech to obtain a subset of target speaker dictionaries;
S5: splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker.
In particular, the windowing uses a Hamming, Hanning, or rectangular window.
To this end, the invention also proposes a speaker identification device based on two-stage sparse decomposition, comprising:
a framing and windowing module, for framing and windowing the discrete-time signal of the input speech;
a feature extraction module, for applying a discrete Fourier transform to each frame, taking the magnitude, and extracting the magnitude spectrum as the feature;
a dictionary construction module, for building a large dictionary, wherein the large dictionary comprises a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary;
a first-stage sparse decomposition module, for performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech to obtain a subset of target speaker dictionaries;
a second-stage sparse decomposition module, for splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker.
In particular, the windowing uses a Hamming, Hanning, or rectangular window.
In the speaker identification method based on two-stage sparse decomposition disclosed by the invention, the input speech is first framed and windowed, a discrete Fourier transform is applied to each frame, and the magnitude spectrum is taken as the speech feature; a large dictionary composed of a universal background dictionary, the dictionaries of the different speakers, and a noise dictionary is then constructed; the speech to be identified is sparsely represented over the large dictionary, and a coarse classification based on the sparse representation yields a small number of target speaker dictionaries; finally, these dictionaries are spliced into a new large dictionary, the input speech is represented over it, and a final classification based on the sparse representation determines the speaker's identity. The invention can distinguish different speakers and has the beneficial effects of efficiency, accuracy, and ease of use in identifying a speaker's identity. The invention also discloses a speaker identification device based on two-stage sparse decomposition.
Brief description of the drawings
The features and advantages of the present invention can be clearly understood with reference to the accompanying drawings, which are schematic and are not to be construed as limiting the invention in any way. In the drawings:
Fig. 1 is a flow chart of the steps of a speaker identification method based on two-stage sparse decomposition in an embodiment of the present invention;
Fig. 2 is an example flow diagram of a speaker identification method based on two-stage sparse decomposition in an embodiment of the present invention;
Fig. 3 is a structural diagram of a speaker identification device based on two-stage sparse decomposition in an embodiment of the present invention.
Embodiment
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a speaker identification method based on two-stage sparse decomposition, comprising the following steps:
Step S1: framing and windowing the discrete-time signal of the input speech, wherein the window is a Hamming, Hanning, or rectangular window.
Step S2: applying a discrete Fourier transform to each frame and taking the magnitude; the magnitude spectrum is extracted as the feature.
Step S3: constructing a large dictionary, wherein the large dictionary comprises a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary.
Step S4: performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech to obtain a subset of target speaker dictionaries.
Step S5: splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker.
Fig. 2 shows an example flow diagram of a speaker identification method based on two-stage sparse decomposition according to the invention.
Specifically, the purpose of pre-emphasis is to reduce the influence of sharp noise and to boost the high-frequency part of the signal. Pre-emphasis is applied to the given signal y(n) by

z(n) = y(n) − 0.97·y(n−1)   (1)

where the pre-emphasis factor is 0.97. The window can then be a Hamming, Hanning, or rectangular window; research shows that the Hamming window has a better frequency response than the rectangular window and alleviates spectral leakage. Windowing is the pointwise product of z(n) with the window function W(n):

S_p(n) = z(n)·W(n)   (2)

For a Hamming window, the window function is

W(n) = 0.54 − 0.46·cos(2πn/(M − 1)), 0 ≤ n ≤ M − 1   (3)

where n is the time index and M is the window length. A discrete Fourier transform (DFT) is then applied:

Y_p(k) = Σ_{n=0}^{N−1} S_p(n)·e^{−j2πkn/N}, k = 0, 1, …, N − 1   (4)

where S_p(n) is the p-th windowed speech frame, p is the frame index, and N is the number of points of the Fourier transform; the magnitude |Y_p(k)| is taken as the feature.
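The pre-emphasis, windowing, and DFT magnitude steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patent's implementation: the frame length, hop size, and 8 kHz sampling rate are assumed values chosen for the example.

```python
import numpy as np

def extract_features(y, frame_len=256, hop=128, alpha=0.97):
    """Magnitude-spectrum features per frame (steps S1-S2).

    Pre-emphasis z(n) = y(n) - alpha*y(n-1), Hamming windowing
    (a pointwise product in the time domain), then the magnitude
    of the DFT of each windowed frame.
    """
    z = np.append(y[0], y[1:] - alpha * y[:-1])      # pre-emphasis
    w = np.hamming(frame_len)                        # Hamming window of length M
    n_frames = 1 + (len(z) - frame_len) // hop
    frames = np.stack([z[i * hop : i * hop + frame_len] * w
                       for i in range(n_frames)])    # framing + windowing
    return np.abs(np.fft.rfft(frames, axis=1))       # |DFT| as the feature

# One second of synthetic "speech" at an assumed 8 kHz sampling rate
rng = np.random.default_rng(0)
feats = extract_features(rng.standard_normal(8000))
print(feats.shape)  # (61, 129): 61 frames, 129 non-redundant DFT bins
```

Only the non-redundant half of the spectrum is kept (`rfft`), since the input is real-valued.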
Dictionary preparation. For a speaker recognition system with K different speakers, we design a large dictionary with a new structure:

Ψ = [Φ_0, Φ_1, Φ_2, …, Φ_K, Φ_v]   (5)

where Φ_0 is a universal background dictionary containing the features common to all speakers (here we borrow the idea of the UBM in GMM-UBM to model the background); Φ_i (i = 1, …, K) is the dictionary modelling the variability (characteristics) of the i-th speaker; and Φ_v is a noise dictionary modelling the environmental noise. All atoms in Ψ are normalized to unit-norm vectors. K-SVD is used to train the dictionaries, and the dictionary training problem is described as

min_{Φ,X} ‖Y − ΦX‖_F²  subject to ‖x_i‖_0 ≤ T_0 for every i   (6)

where Y = [y_1, y_2, …, y_m] is the training data set, each y_i being the feature vector of a speech frame; Φ is the dictionary; X = [x_1, x_2, …, x_m] is the group of sparse vectors corresponding to Y; and T_0 is the sparsity threshold. The universal background dictionary is trained on a large amount of unlabelled speech from different speakers. Each Φ_i is trained on the speech of the i-th speaker, with Ψ as the initial value.
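The block structure of the large dictionary Ψ = [Φ_0, Φ_1, …, Φ_K, Φ_v] can be sketched as follows. A full K-SVD trainer (which the patent specifies) is too long for an example, so as a loudly labelled simplification this sketch builds each sub-dictionary by sampling training feature frames as atoms and normalizing them to unit norm; the block sizes and random training data are invented for illustration.

```python
import numpy as np

def make_subdictionary(frames, n_atoms, rng):
    """Simplified stand-in for K-SVD training: sample feature frames
    as atoms (columns) and normalize each atom to unit l2 norm."""
    idx = rng.choice(len(frames), size=n_atoms, replace=False)
    atoms = frames[idx].T.astype(float)              # columns are atoms
    return atoms / np.linalg.norm(atoms, axis=0)

def build_big_dictionary(background, speakers, noise, n_atoms, rng):
    """Psi = [Phi_0, Phi_1, ..., Phi_K, Phi_v]: a universal background
    dictionary, one sub-dictionary per speaker, and a noise dictionary."""
    parts = [make_subdictionary(background, n_atoms, rng)]
    parts += [make_subdictionary(s, n_atoms, rng) for s in speakers]
    parts.append(make_subdictionary(noise, n_atoms, rng))
    return np.hstack(parts)

# Illustrative data: 129-dimensional magnitude-spectrum feature frames
rng = np.random.default_rng(1)
background = rng.standard_normal((200, 129))
speakers = [rng.standard_normal((200, 129)) for _ in range(3)]  # K = 3
noise = rng.standard_normal((200, 129))
Psi = build_big_dictionary(background, speakers, noise, n_atoms=32, rng=rng)
print(Psi.shape)  # (129, 160): (K + 2) blocks of 32 unit-norm atoms
```

The key property preserved from the patent is the block layout: column ranges of Ψ map back to the background, to individual speakers, or to noise, which is what the later classification step relies on.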
First-stage sparse decomposition. Representing a frame over Ψ means solving the sparse decomposition problem

min_x ‖x‖_0  subject to Y = Ψx   (7)

This problem is proved to be NP-hard and cannot be solved exactly by searching all possible sparse subsets. If x is sparse or approximately sparse, it is uniquely determined by solving instead

min_x ½‖Y − Ψx‖_2² + λ‖x‖_1   (8)

where λ > 0 is a regularization parameter; this formulation is also known as basis pursuit denoising (BPDN). For each speech frame Y, solving (8) yields the sparse coefficient vector x; this constitutes the first sparse decomposition.

Second-stage sparse decomposition and classification. First, compute

c_i = ‖δ_i(x)‖_1, i = 1, 2, …, K   (9)

where δ_i(·) returns a vector whose only nonzero entries come from the i-th class, i.e., the entries of the sparse coefficient x not belonging to the i-th class are set to zero; the c_i are then sorted. Next, the dictionaries of the Q speakers corresponding to the Q largest c_i are selected and spliced into a reduced dictionary. Solving (8) over this reduced dictionary yields a new sparse coefficient x, and the frame is assigned to the speaker

identity(Y) = argmax_i ‖δ_i(x)‖_1   (10)

which confirms to which speaker the speech frame belongs.
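The two-stage decomposition and classification can be sketched end to end. The patent solves an ℓ1 (BPDN) problem; to stay dependency-free this sketch substitutes a plain Orthogonal Matching Pursuit solver, and the dictionary sizes, sparsity level k, and candidate count Q are assumed toy values.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: a greedy sparse coder used here as a
    stand-in for the l1 (BPDN) solver discussed in the text."""
    r, support = y.astype(float), []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        idx = int(np.argmax(np.abs(D.T @ r)))
        if idx in support:                     # residual already explained
            break
        support.append(idx)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        r = y - D[:, support] @ coef           # orthogonalized residual
    x[support] = coef
    return x

def two_stage_identify(Psi, blocks, y, k, Q):
    """Stage 1: code y over the big dictionary and score each speaker
    block by the l1 mass of its coefficients, c_i = ||delta_i(x)||_1.
    Stage 2: splice the top-Q blocks, code y again, and return the
    speaker whose block carries the largest l1 mass."""
    x1 = omp(Psi, y, k)
    scores = {i: np.abs(x1[cols]).sum() for i, cols in blocks.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:Q]
    spliced = np.concatenate([blocks[i] for i in top])
    x2 = omp(Psi[:, spliced], y, k)
    best, best_mass, offset = top[0], -1.0, 0
    for i in top:
        mass = np.abs(x2[offset:offset + len(blocks[i])]).sum()
        if mass > best_mass:
            best, best_mass = i, mass
        offset += len(blocks[i])
    return best

# Toy setup: 3 speaker blocks of 15 random unit-norm atoms in R^60
rng = np.random.default_rng(2)
Psi = rng.standard_normal((60, 45))
Psi /= np.linalg.norm(Psi, axis=0)
blocks = {i: np.arange(i * 15, (i + 1) * 15) for i in range(3)}
# Test frame built from three atoms of speaker 1's block
y = Psi[:, blocks[1][:3]] @ np.array([1.0, 0.8, -0.6])
print(two_stage_identify(Psi, blocks, y, k=5, Q=2))
```

Because the test frame lies exactly in the span of speaker 1's atoms, its coefficient mass concentrates in that block at both stages, which is the mechanism the classification rule exploits.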
The advantages of the speaker identification method based on two-stage sparse decomposition disclosed by the invention lie in two respects. On the one hand, compensation for time-varying noise is considered: the method designs a noise dictionary that tracks noise changes and can effectively compensate for time-varying noise. On the other hand, the two-stage sparse decomposition further improves identification accuracy. During decomposition, competition among speaker atoms may occur, degrading performance. To solve this problem, we propose a two-stage sparse decomposition method: after the first-stage sparse decomposition, the dictionaries of the speakers whose atoms are used in the large dictionary are spliced together, a second-stage sparse decomposition is performed, and the second sparse representation is used for the final classification. The two-stage sparse decomposition method is clearly superior to current speaker identification methods.
Fig. 3 shows the structure of a speaker identification device based on two-stage sparse decomposition according to the invention.
Specifically, a framing and windowing module 101 frames and windows the discrete-time signal of the input speech, where the window is a Hamming, Hanning, or rectangular window; a feature extraction module 102 applies a discrete Fourier transform to each frame, takes the magnitude, and extracts the magnitude spectrum as the feature; a dictionary construction module 103 builds a large dictionary comprising a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary; a first-stage sparse decomposition module 104 performs a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary and coarsely classifies the input speech to obtain a subset of target speaker dictionaries; and a second-stage sparse decomposition module 105 splices the selected target speaker dictionaries, performs a second-stage sparse decomposition, and uses the resulting sparse representation to determine the finally identified speaker.
The above embodiments serve only to illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also belong to the scope of the invention, and the scope of patent protection of the present invention shall be defined by the claims.
Claims (4)
1. A speaker identification method based on two-stage sparse decomposition, characterized by comprising the following steps:
S1: framing and windowing the discrete-time signal of the input speech;
S2: applying a discrete Fourier transform to each frame and taking the magnitude, the magnitude spectrum being extracted as the feature;
S3: constructing a large dictionary, wherein the large dictionary comprises a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary;
S4: performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech to obtain a subset of target speaker dictionaries;
S5: splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker.
2. The method of claim 1, characterized in that the windowing uses a Hamming, Hanning, or rectangular window.
3. A speaker identification device based on two-stage sparse decomposition, characterized by comprising:
a framing and windowing module, for framing and windowing the discrete-time signal of the input speech;
a feature extraction module, for applying a discrete Fourier transform to each frame, taking the magnitude, and extracting the magnitude spectrum as the feature;
a dictionary construction module, for building a large dictionary, wherein the large dictionary comprises a universal background dictionary, characteristic dictionaries of the different speakers, and a noise dictionary;
a first-stage sparse decomposition module, for performing a first-stage sparse decomposition to obtain the sparse representation of the speech to be identified over the large dictionary, and coarsely classifying the input speech to obtain a subset of target speaker dictionaries;
a second-stage sparse decomposition module, for splicing the selected target speaker dictionaries, performing a second-stage sparse decomposition, and using the resulting sparse representation to determine the finally identified speaker.
4. The device of claim 3, characterized in that the windowing uses a Hamming, Hanning, or rectangular window.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410231798.0A CN105139855A (en) | 2014-05-29 | 2014-05-29 | Speaker identification method with two-stage sparse decomposition and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105139855A true CN105139855A (en) | 2015-12-09 |
Legal Events
Code | Title | Description
---|---|---
C06 / PB01 | Publication |
C10 / SE01 | Entry into substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2015-12-09