CN103730121B - Method and device for recognizing disguised voice - Google Patents

Method and device for recognizing disguised voice

Info

Publication number
CN103730121B
CN103730121B CN201310728591.XA
Authority
CN
China
Prior art keywords
voice
speaker
coefficient
calculate
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310728591.XA
Other languages
Chinese (zh)
Other versions
CN103730121A (en)
Inventor
王泳
黄继武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
National Sun Yat Sen University
Original Assignee
Shenzhen University
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University and National Sun Yat-sen University
Priority to CN201310728591.XA priority Critical patent/CN103730121B/en
Publication of CN103730121A publication Critical patent/CN103730121A/en
Application granted granted Critical
Publication of CN103730121B publication Critical patent/CN103730121B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention discloses a method and device for recognizing disguised voice. The recognition method uses the fundamental-frequency characteristics of speech to estimate the voice-conversion coefficient and improves the Mel-frequency cepstral coefficient (MFCC) extraction algorithm: linear-interpolation stretching with the estimated coefficient is incorporated into the MFCC extraction so that the MFCCs of the converted speech as they were before conversion can be approximately computed. The method is then incorporated into the GMM-UBM (Gaussian mixture model - universal background model) recognition framework to compute the similarity between voices. In addition, the converted voice can be restored to the original voice using the estimated conversion coefficient. Compared with conventional forensic identification methods, the present invention greatly improves recognition performance, with miss and false-alarm rates both lower than those of conventional schemes.

Description

Method and device for recognizing disguised voice
Technical field
The present invention relates to the field of multimedia information security, and more particularly to a method and device for recognizing disguised voice.
Background technology
Voice conversion (Voice Transformation) is one of the most commonly used speech-processing methods. Its function is to turn one voice into another that sounds natural but entirely different. Voice conversion is usually used in music production or to protect a speaker's safety and privacy, but it can also be used by criminals to disguise their voice and avoid identification. Speaker recognition after voice conversion therefore has important practical value.
The general steps of voice conversion are:
1) Frame and window the signal x(n):
F(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad 0 \le k < N \qquad (1)
2) Compute the instantaneous amplitude:
|F(k)| = \left|\sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j\frac{2\pi}{N}kn}\right|, \qquad 0 \le k < N \qquad (2)
3) Compute the instantaneous frequency from the phase relation between this frame and the previous frame:
\omega(k) = (k + \Delta)\cdot\frac{F_s}{N} \qquad (3)
where F_s is the sampling frequency and Δ is the frequency deviation relative to the bin's center frequency.
4) Spectrum stretching. First, linearly interpolate the instantaneous amplitude:
|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2 \qquad (4)

\mu = k'/\alpha - k \qquad (6)
Where no confusion can arise, the interpolated instantaneous amplitude is still denoted |F(k)|.
Then shift the frequency lines:
\omega'(k\alpha) = \omega(k)\cdot\alpha, \qquad 0 \le k < N/2,\ 0 \le k\alpha < N/2 \qquad (7)
Where no confusion can arise, the shifted instantaneous frequency is still denoted ω(k).
5) Compute the instantaneous phase φ(k) from the instantaneous frequency and obtain the FFT coefficients of the converted voice:
F(k) = |F(k)|\, e^{j\phi(k)} \qquad (8)
6) Apply the inverse FFT to F(k) to obtain the converted voice signal.
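To make the stretching concrete, the following minimal Python sketch covers the magnitude part of steps 1)-6) for a single frame: windowing, FFT, linear-interpolation stretching of the magnitudes by a factor α, and inverse FFT. It reuses the original phase instead of tracking instantaneous frequency and phase across frames as in equations (3), (7) and (8), so it is only an illustration under that simplification; the function name and the Hanning window are choices made for the sketch, not taken from the text.

import numpy as np

def stretch_frame_spectrum(x_frame, alpha):
    """Magnitude-only spectrum stretching of one frame
    (cf. equations (1), (2), (4)-(6) and (8))."""
    N = len(x_frame)
    w = np.hanning(N)                        # analysis window w(n)
    F = np.fft.fft(x_frame * w)              # equation (1)
    mag, phase = np.abs(F), np.angle(F)      # instantaneous amplitude |F(k)|, equation (2)

    half = N // 2
    stretched = np.zeros(half)
    for k_prime in range(half):
        k = int(k_prime / alpha)             # source bin index
        if k + 1 >= half:
            break                            # no source data left for this bin
        mu = k_prime / alpha - k             # equation (6), with the true factor alpha
        stretched[k_prime] = mu * mag[k] + (1 - mu) * mag[k + 1]   # equation (4)

    # rebuild a conjugate-symmetric magnitude spectrum and reuse the phase, equation (8)
    new_mag = np.concatenate([stretched, stretched[::-1]])
    return np.real(np.fft.ifft(new_mag * np.exp(1j * phase)))      # step 6): inverse FFT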
The MFCC extraction process is shown in Fig. 1. The specific steps are as follows:
1) Windowing and spectrum computation.
The MFCC here uses a Hamming window of N = 1024 points:
w(n) = 0.53836 - 0.46164\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n < N \qquad (9)
Apply the FFT to the windowed source signal x(n):
F(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad 0 \le k < N \qquad (10)
2) Mel segmentation (triangular filtering) and logarithmic transformation:
The weighting windows are triangular, defined as follows:
H_m(k) = \begin{cases} 0, & k < k_{m-1} \\ \dfrac{k - k_{m-1}}{k_m - k_{m-1}}, & k_{m-1} \le k \le k_m \\ \dfrac{k_{m+1} - k}{k_{m+1} - k_m}, & k_m < k \le k_{m+1} \\ 0, & k > k_{m+1} \end{cases} \qquad (11)
where k_m = f(m)·N/F_s and F_s is the sampling frequency.
Weight the FFT energy spectrum with the triangular windows and then take the logarithm:
Y(m) = \log\!\left[\sum_{k=0}^{N-1} |F(k)|^2\, H_m(k)\right], \qquad 1 \le m \le M \qquad (12)
3) Inverse cosine transform.
Finally, applying the inverse cosine transform yields the Mel cepstral coefficients, i.e. the MFCCs:
\mathrm{MFCC}(n) = \frac{1}{M}\sum_{m=1}^{M} Y(m)\cos\!\left(\frac{n(m-0.5)\pi}{M}\right), \qquad 1 \le m \le M,\ 0 \le n \le N-1 \qquad (13)
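The following Python sketch walks through equations (9)-(13) for a single frame. The Hamming window and the triangular filters follow the formulas above; the mel-scale mapping 2595·log10(1 + f/700) used to place the filter edges is a common convention assumed here, since the text does not spell it out, and the function name is illustrative.

import numpy as np

def mfcc_frame(x_frame, fs, n_mel=24, n_ceps=12):
    """Single-frame MFCC following equations (9)-(13)."""
    N = len(x_frame)
    n = np.arange(N)
    w = 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (N - 1))    # Hamming window, equation (9)
    F = np.fft.fft(x_frame * w)                                # equation (10)
    power = np.abs(F[:N // 2]) ** 2                            # |F(k)|^2, non-redundant half

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)              # assumed mel convention

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # filter edges k_m = f(m) * N / Fs for the triangular windows of equation (11)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mel + 2))
    k_edges = np.clip((edges_hz * N / fs).astype(int), 0, N // 2 - 1)

    Y = np.zeros(n_mel)
    for m in range(1, n_mel + 1):
        lo, mid, hi = k_edges[m - 1], k_edges[m], k_edges[m + 1]
        H = np.zeros(N // 2)                                   # triangular window H_m(k)
        if mid > lo:
            H[lo:mid + 1] = (np.arange(lo, mid + 1) - lo) / (mid - lo)
        if hi > mid:
            H[mid:hi + 1] = (hi - np.arange(mid, hi + 1)) / (hi - mid)
        Y[m - 1] = np.log(np.sum(power * H) + 1e-12)           # equation (12)

    mvec = np.arange(1, n_mel + 1)
    return np.array([np.sum(Y * np.cos(c * (mvec - 0.5) * np.pi / n_mel))
                     for c in range(n_ceps)]) / n_mel          # equation (13)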
GMM-UBM
Speaker identification can be viewed as a test between two hypotheses:
H0: Y is from speaker S;
H1: Y is not from speaker S.
Mathematically, H0 is represented by the speaker model λ_hyp and H1 by the universal background model λ_bkg. The probability is computed as shown in formula (14):
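The body of formula (14) is not reproduced in the text above. As a hedged reconstruction (the conventional GMM-UBM form, consistent with the score Λ(X) and the threshold θ used later in this description, but not necessarily a verbatim copy of the original formula), the decision statistic is the log-likelihood ratio:

\Lambda(X) = \log p(X \mid \lambda_{hyp}) - \log p(X \mid \lambda_{bkg}), \qquad \Lambda(X) \ge \theta \Rightarrow \text{accept } H_0, \quad \Lambda(X) < \theta \Rightarrow \text{accept } H_1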
With the wide application of audio technology, protecting audio products has become a research hotspot in information security, and speech forensics is an important part of it. In judicial, commercial and other applications, identifying the speaker of converted voice has significant practical value. Experimental results show that, for voice that has undergone a large conversion, conventional speaker-recognition schemes suffer very high miss and false-alarm rates and identification fails completely.
Summary of the invention
The primary purpose of the present invention is to propose a method for recognizing disguised voice, so that the method can be used to identify the speaker of a converted audio product; identifying the speaker after voice conversion has very important practical value.
A further object of the present invention is to propose a device for recognizing disguised voice.
To remedy the deficiencies of the prior art, the technical solution adopted by the present invention is as follows:
A method for recognizing disguised voice, the method comprising:
in the training stage, using the expectation-maximization (EM) algorithm to compute the universal background model (UBM) λ_bkg from a background speech corpus;
in the training stage, extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, computing the Gaussian mixture model (GMM) λ_j of speaker j with the maximum a posteriori (MAP) algorithm, and computing the fundamental-frequency mean f_j; building the model V_j = (λ_j, f_j) of speaker j and storing it in the model database;
obtaining the threshold θ in the training stage; the threshold θ is obtained as follows: compute client scores and impostor scores, and use the distributions of these two kinds of scores to select a threshold θ that meets the miss and false-alarm rates required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under another speaker's model;
in the test stage, where the voice Y is a converted voice, extracting the fundamental-frequency mean f_Y of voice Y; computing the conversion coefficient from f_Y/f_j; computing the original MFCC coefficients X of Y before conversion with the modified MFCC extraction algorithm; and obtaining the probability Λ(X) that Y belongs to model V_j with the GMM-UBM-based probability estimation algorithm;
comparing the probability Λ(X) with the threshold θ: if the probability exceeds the threshold θ, voice Y is a segment spoken by j; otherwise voice Y was not spoken by j;
wherein the modified MFCC extraction algorithm is as follows: after the windowing and FFT steps of the MFCC extraction algorithm, the FFT coefficient amplitudes |F(k)| are stretched by linear interpolation to obtain |F(k′)|, the linear-interpolation stretching of the FFT coefficient amplitudes being given by:

|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2

\mu = k'/(1/\alpha') - k

where 1/α′ is the reciprocal of the estimated conversion coefficient and α′ is the estimated conversion coefficient, α′ = f_Y/f_j.
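As an illustration of the test stage just described, the following Python sketch strings the steps together: pitch-mean estimation, conversion-coefficient estimation, modified MFCC extraction and GMM-UBM scoring. The helper names (estimate_pitch_mean, modified_mfcc) and the model interface (a score method returning the average frame log-likelihood, as scikit-learn's GaussianMixture.score does) are assumptions made for the sketch, not names from the patent.

def identify_disguised_voice(y, fs, speaker_model, ubm, theta,
                             estimate_pitch_mean, modified_mfcc):
    """Hedged sketch of the test stage: estimate the conversion coefficient
    from the pitch-mean ratio, compute approximate pre-conversion MFCCs with
    the modified extraction, score them against the speaker GMM and the UBM,
    and compare the log-likelihood ratio with the threshold theta."""
    gmm_j, f_j = speaker_model           # V_j = (lambda_j, f_j)
    f_y = estimate_pitch_mean(y, fs)     # fundamental-frequency mean of the test voice
    alpha_est = f_y / f_j                # estimated conversion coefficient alpha'
    X = modified_mfcc(y, fs, alpha_est)  # approximate MFCCs of Y before conversion
    llr = gmm_j.score(X) - ubm.score(X)  # GMM-UBM log-likelihood ratio Lambda(X)
    return llr > theta, alpha_est, llr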
In a preferred scheme, the fundamental frequency is extracted as follows:
(1) window the signal to obtain the signal of a predetermined length before and after any instant t_mid;
(2) compute the autocorrelation function of this signal and the autocorrelation function of the window function;
(3) divide the two autocorrelation functions; the lag of the maximum is the period T, which gives the fundamental frequency F at instant t_mid.
In a preferred scheme, the fundamental-frequency mean is mean(F), where mean(·) denotes averaging.
In a preferred scheme, when α′ > 1 spectrum compensation is required. Let the Nyquist frequency be F_n. The compensation symmetrically copies the spectrum between 2F_n/(2α′) − F_n/2 and F_n/(2α′) into the range from F_n/(2α′) to F_n/2. The effect of this compensation is to approximately restore the amplitudes of the band from F_n/(2α′) to F_n/2, so that the restored MFCC values are close to the original MFCC values.
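A minimal Python sketch of this compensation, under the reading given above (the band just below the highest bin that still has source data after stretching is mirrored into the empty band above it). Because the band bounds in the original text are partly garbled, this is an assumption rather than a verbatim implementation.

import numpy as np

def compensate_spectrum(mag, alpha):
    """Fill the empty upper band of a stretched magnitude spectrum (alpha > 1)
    by mirroring the adjacent band about the cutoff bin; an assumed reading of
    the compensation step, not a verbatim implementation."""
    half = len(mag)                  # bins 0 .. half-1 cover 0 .. Nyquist
    cutoff = int(half / alpha)       # last bin that still has source data
    width = half - cutoff            # size of the empty upper band
    out = mag.copy()
    if 0 < width <= cutoff:
        out[cutoff:half] = mag[cutoff - width:cutoff][::-1]   # symmetric copy
    return out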
A device for recognizing disguised voice, comprising:
a training module for computing the universal background model (UBM) λ_bkg from a background speech corpus with the expectation-maximization (EM) algorithm; extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j; computing the Gaussian mixture model (GMM) λ_j of speaker j with the maximum a posteriori algorithm; computing the fundamental-frequency mean f_j; building the model V_j = (λ_j, f_j) of speaker j and storing it in the model database; and obtaining the threshold θ in the training stage;
wherein the threshold θ is obtained as follows: compute client scores and impostor scores, and use the distributions of these two kinds of scores to select a threshold θ that meets the miss and false-alarm rates required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under another speaker's model;
a test module for, when the voice Y is a converted voice, extracting its fundamental-frequency mean f_Y; computing the conversion coefficient from f_Y/f_j; computing the original MFCC coefficients X of Y before conversion with the modified MFCC extraction algorithm; and obtaining the probability Λ(X) that Y belongs to model V_j with the GMM-UBM-based probability estimation algorithm;
an identification module for comparing the probability Λ(X) with the threshold θ: if the probability exceeds the threshold θ, voice Y is a segment spoken by j; otherwise voice Y was not spoken by j;
wherein the modified MFCC extraction algorithm in the test module is implemented as follows: after the windowing and FFT steps of the MFCC extraction algorithm, the FFT coefficient amplitudes |F(k)| are stretched by linear interpolation to obtain |F(k′)|, the linear-interpolation stretching of the FFT coefficient amplitudes being given by:

|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2

\mu = k'/(1/\alpha') - k

where 1/α′ is the reciprocal of the estimated conversion coefficient and α′ is the estimated conversion coefficient, α′ = f_Y/f_j.
Compared with the prior art, the present invention has the following benefits: recognition performance is greatly improved over conventional forensic identification methods. The fundamental-frequency mean is used to estimate the conversion coefficient, and the MFCC extraction algorithm is modified so that the MFCC features before voice conversion can be computed directly; the GMM-UBM-based probability calculation then determines whether the test voice was spoken by a given target speaker. Both the miss and false-alarm rates are lower than those of conventional schemes.
Brief description of the drawings
Fig. 1 is a schematic of the Mel-frequency cepstral coefficient extraction process.
Fig. 2 compares the estimated conversion coefficients with the true conversion coefficients (true coefficient α(k) = 2^(k/12), estimated coefficient α′(y) = 2^(y/12)).
Fig. 3 shows the EER curves for the four frequency-domain disguise methods.
Fig. 4 shows the DET curves for the four frequency-domain methods.
Fig. 5 shows the EER curve for TD-PSOLA.
Fig. 6 shows the DET curve for TD-PSOLA.
Detailed description of the invention
As shown in Figs. 3-6, the present invention discloses the following. In the training stage, the EM (expectation maximization) algorithm is used to compute the UBM (universal background model) λ_bkg from a background speech corpus. Also in the training stage, the MFCC coefficients and the fundamental frequency of the test speech S_j of speaker j are extracted, the MAP (maximum a posteriori) algorithm is used to compute the GMM (Gaussian mixture model) λ_j of speaker j, and the fundamental-frequency mean f_j is computed. The model V_j = (λ_j, f_j) of speaker j is built and stored in the model database. The threshold θ is obtained in the training stage. In the test stage, the voice Y is a converted voice; its fundamental-frequency mean f_Y is extracted, the conversion coefficient is computed from f_Y/f_j, and the original MFCC coefficients X of Y before conversion are computed with the modified MFCC extraction algorithm. The probability Λ(X) that Y belongs to model V_j is obtained with the GMM-UBM-based probability estimation algorithm. If the probability exceeds the threshold θ, voice Y is identified as a segment spoken by j; if it does not exceed the threshold, voice Y is identified as not spoken by j.
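A hedged Python sketch of the training stage described above, using scikit-learn's GaussianMixture as the EM implementation. Re-fitting each speaker model from a UBM initialisation is a simplification that stands in for the MAP adaptation named in the text; the function name, the 2048-component setting and the data layout are assumptions made for illustration.

from sklearn.mixture import GaussianMixture

def train_models(background_frames, speaker_frames_by_id, pitch_means, n_comp=2048):
    """Train the UBM with EM, then build V_j = (gmm_j, f_j) for every speaker."""
    ubm = GaussianMixture(n_components=n_comp, covariance_type="diag")
    ubm.fit(background_frames)                   # EM on the background corpus

    models = {}
    for j, frames in speaker_frames_by_id.items():
        gmm_j = GaussianMixture(n_components=n_comp, covariance_type="diag",
                                weights_init=ubm.weights_,
                                means_init=ubm.means_,
                                precisions_init=ubm.precisions_)
        gmm_j.fit(frames)                        # stand-in for MAP adaptation
        models[j] = (gmm_j, pitch_means[j])      # V_j = (lambda_j, f_j)
    return ubm, models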
The method for estimating the conversion coefficient is α′ = f_Y/f_j, where α′ is the estimated conversion coefficient and the fundamental-frequency mean is obtained by averaging the fundamental frequency.
The fundamental frequency is extracted as follows (a minimal sketch is given after the list):
(1) window the signal to obtain the signal of a predetermined length before and after any instant t_mid;
(2) compute the autocorrelation function of this signal and the autocorrelation function of the window function;
(3) divide the two autocorrelation functions; the lag of the maximum is the period T, which gives the fundamental frequency F at instant t_mid.
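A minimal Python sketch of steps (1)-(3). The Hanning window, the caller-supplied window length (standing in for the unspecified predetermined length) and the 50-500 Hz pitch search range are all assumptions made for the sketch.

import numpy as np

def pitch_at(x, fs, t_mid, win_len):
    """Autocorrelation pitch estimate at time t_mid (seconds), following
    steps (1)-(3): window the signal, divide the signal autocorrelation by
    the window autocorrelation, take the lag of the maximum as the period T."""
    center = int(t_mid * fs)
    seg = x[max(0, center - win_len):center + win_len]
    w = np.hanning(len(seg))
    xw = seg * w

    r_x = np.correlate(xw, xw, mode="full")[len(xw) - 1:]   # signal autocorrelation
    r_w = np.correlate(w, w, mode="full")[len(w) - 1:]      # window autocorrelation
    r = r_x / np.maximum(r_w, 1e-12)                        # step (3): divide

    lo, hi = int(fs / 500), int(fs / 50)     # assume pitch between 50 and 500 Hz
    lag = lo + np.argmax(r[lo:hi])           # lag of the maximum gives the period T
    return fs / lag                          # fundamental frequency F = fs / T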
The modified extraction algorithm works as follows: after the windowing and FFT steps of the Mel-frequency cepstral coefficient extraction algorithm, the FFT coefficient amplitudes |F(k)| are stretched by linear interpolation to obtain |F(k′)|. The linear-interpolation stretching of the FFT coefficient amplitudes is given by:

|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2

\mu = k'/(1/\alpha') - k

where the stretching value 1/α′ is the reciprocal of the estimated conversion coefficient.
The matching calculation is the GMM-UBM-based probability calculation. Matching calculation means computing the probability of a speech segment under a given model; this probability reflects how likely it is that the segment was spoken by the speaker that the model represents.
The speech corpus used with the method of the invention and some experimental results are given below.
The corpus is TIMIT, the most commonly used corpus in speech and speaker recognition. It comprises 192 female and 438 male speakers from 8 different regions, 630 speakers in total. Each speaker reads 10 different utterances, for a total of 6300 utterances. All speech is in WAV format with an 8 kHz sampling rate and 16-bit quantization. In this experiment TIMIT is divided into three sub-corpora:
1) Universal background corpus: all the speech segments of 60 male and 60 female speakers are concatenated to train the UBM.
2) Score-normalization corpus: the speech segments of 40 female and 90 male speakers are used for score normalization (TNorm).
3) Development/evaluation corpus: 92 female and 288 male speakers. For speaker j, 5 segments are concatenated to train the 2048-component GMM model λ_j and the fundamental-frequency mean f_j. The remaining 5 segments are concatenated into one segment, and disguises with different conversion coefficients are applied to it. The corpus used to train the speaker models is called the development corpus; the corpus used for disguise is called the evaluation corpus.
Five transformation tools (methods) are considered: the frequency-domain tools Adobe Audition, Audacity, GoldWave and RSITI, and the time-domain tool TD-PSOLA. With the shift strength measured in semitones (12 per octave, as in music), the conversion coefficient is related to the semitone shift k as follows:

\alpha(k) = 2^{k/12}

The experiments only consider voice conversions with −11 ≤ k ≤ 11, because practical audio (voice) tools generally provide conversions in this range.
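For concreteness, the mapping from semitone shift k to conversion coefficient can be tabulated with a small snippet (illustrative only, not part of the patent):

# conversion coefficient alpha(k) = 2**(k/12) for a semitone shift k
for k in (-11, -4, 0, 4, 11):
    print(k, round(2 ** (k / 12), 3))   # e.g. k = 4 -> ~1.260, k = -4 -> ~0.794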
The speech signal is pre-emphasized with the transfer function H(z) = 1 − 0.97z^{−1}.
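A one-line sketch of this pre-emphasis (the function name is illustrative):

import numpy as np

def pre_emphasis(x, a=0.97):
    """Apply H(z) = 1 - 0.97 z^{-1}, i.e. y(n) = x(n) - 0.97 x(n-1)."""
    return np.append(x[0], x[1:] - a * x[:-1])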
The frame length is 1024 samples, and the 24-dimensional MFCC feature consists of 12 MFCC coefficients and 12 ΔMFCC coefficients.
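The Δ coefficients can be appended to the 12 static MFCCs as sketched below; the two-frame regression width is an assumed choice, since the text does not specify how the ΔMFCC coefficients are computed.

import numpy as np

def add_delta(mfcc, width=2):
    """Append first-order delta coefficients to a (frames x 12) MFCC matrix,
    giving the 24-dimensional feature."""
    padded = np.pad(mfcc, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    delta = sum(i * (padded[width + i:len(mfcc) + width + i]
                     - padded[width - i:len(mfcc) + width - i])
                for i in range(1, width + 1)) / denom
    return np.hstack([mfcc, delta])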
An experimental example of estimating the voice-conversion coefficient is given below. The estimates of the conversion coefficient for the same speaker are averaged and compared with the true conversion coefficient; the comparison is shown in Fig. 2.
The recognition performance is given below. Fig. 3 shows the EER (Equal Error Rate, the operating point where the miss rate, the False Reject Rate (FRR), equals the false-alarm rate, the False Alarm Rate (FAR)). The overall EER is listed in Table 1 and Table 2.
Table 1. Overall EER, |k| ≤ 11 (%)
Table 2. Overall EER, |k| ≤ 8 (%)
Fig. 4 shows the DET (Detection Error Tradeoff) curves. It can be seen that the performance of the conventional scheme (baseline) is destroyed completely by the various disguise methods; that is, a conventional speaker-recognition system cannot correctly identify the speaker of disguised voice. The method of the present invention (proposed, with estimated scaling factor) significantly reduces the error probability and in most cases can identify the speaker, reaching a level acceptable for many applications. The performance of the method when the conversion coefficient is exactly correct is also given (this is the best performance the present invention can reach). The charts show that the performance reached by the present invention is very close to this optimum.
The method of the present invention also covers recognition of TD-PSOLA disguise. The results are shown in Figs. 5 and 6. Here the conventional scheme performs slightly better than the proposed method. However, because TD-PSOLA cannot preserve the auditory naturalness of the voice even when the shift strength is small, its range of application is narrow and current practical application software rarely uses this method.

Claims (5)

1. A method for recognizing disguised voice, characterized in that the method comprises:
in the training stage, using the expectation-maximization (EM) algorithm to compute the universal background model (UBM) λ_bkg from a background speech corpus;
in the training stage, extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, computing the Gaussian mixture model (GMM) λ_j of speaker j with the maximum a posteriori (MAP) algorithm, and computing the fundamental-frequency mean f_j; building the model V_j = (λ_j, f_j) of speaker j and storing it in the model database;
obtaining the threshold θ in the training stage, the threshold θ being obtained as follows: compute client scores and impostor scores, and use the distributions of these two kinds of scores to select a threshold θ that meets the miss and false-alarm rates required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under another speaker's model;
in the test stage, where the voice Y is a converted voice, extracting the fundamental-frequency mean f_Y of voice Y; computing the conversion coefficient from f_Y/f_j; computing the original MFCC coefficients X of Y before conversion with the modified MFCC extraction algorithm; and obtaining the probability Λ(X) that Y belongs to model V_j with the GMM-UBM-based probability estimation algorithm;
comparing the probability Λ(X) with the threshold θ: if the probability exceeds the threshold θ, voice Y is a segment spoken by j; otherwise voice Y was not spoken by j;
wherein the modified MFCC extraction algorithm is as follows: after the windowing and FFT steps of the MFCC extraction algorithm, the FFT coefficient amplitudes |F(k)| are stretched by linear interpolation to obtain |F(k′)|, the linear-interpolation stretching of the FFT coefficient amplitudes being given by:

|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2

\mu = k'/(1/\alpha') - k

where 1/α′ is the reciprocal of the estimated conversion coefficient and α′ is the estimated conversion coefficient, α′ = f_Y/f_j.
2. The method for recognizing disguised voice according to claim 1, characterized in that the fundamental frequency is extracted as follows:
(1) window the signal to obtain the signal of a predetermined length before and after any instant t_mid;
(2) compute the autocorrelation function of this signal and the autocorrelation function of the window function;
(3) divide the two autocorrelation functions; the lag of the maximum is the period T, which gives the fundamental frequency F at instant t_mid.
3. The method for recognizing disguised voice according to claim 2, characterized in that the fundamental-frequency mean is mean(F), where mean(·) denotes averaging.
4. The method for recognizing disguised voice according to claim 1, characterized in that when α′ > 1 spectrum compensation is required; letting the Nyquist frequency be F_n, the compensation symmetrically copies the spectrum between 2F_n/(2α′) − F_n/2 and F_n/(2α′) into the range from F_n/(2α′) to F_n/2.
5. A device for recognizing disguised voice, characterized by comprising:
a training module for computing the universal background model (UBM) λ_bkg from a background speech corpus with the expectation-maximization (EM) algorithm, extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, computing the Gaussian mixture model (GMM) λ_j of speaker j with the maximum a posteriori algorithm, computing the fundamental-frequency mean f_j, building the model V_j = (λ_j, f_j) of speaker j, storing it in the model database, and obtaining the threshold θ in the training stage;
wherein the threshold θ is obtained as follows: compute client scores and impostor scores, and use the distributions of these two kinds of scores to select a threshold θ that meets the miss and false-alarm rates required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under another speaker's model;
a test module for, when the voice Y is a converted voice, extracting its fundamental-frequency mean f_Y, computing the conversion coefficient from f_Y/f_j, computing the original MFCC coefficients X of Y before conversion with the modified MFCC extraction algorithm, and obtaining the probability Λ(X) that Y belongs to model V_j with the GMM-UBM-based probability estimation algorithm;
an identification module for comparing the probability Λ(X) with the threshold θ: if the probability exceeds the threshold θ, voice Y is a segment spoken by j; otherwise voice Y was not spoken by j;
wherein the modified MFCC extraction algorithm used in the test module is as follows: after the windowing and FFT steps of the MFCC extraction algorithm, the FFT coefficient amplitudes |F(k)| are stretched by linear interpolation to obtain |F(k′)|, the linear-interpolation stretching of the FFT coefficient amplitudes being given by:

|F(k')| = \mu|F(k)| + (1-\mu)|F(k+1)|, \qquad 0 \le k < N/2,\ 0 \le k' < N/2

\mu = k'/(1/\alpha') - k

where 1/α′ is the reciprocal of the estimated conversion coefficient and α′ is the estimated conversion coefficient, α′ = f_Y/f_j.
CN201310728591.XA 2013-12-24 2013-12-24 Method and device for recognizing disguised voice Expired - Fee Related CN103730121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310728591.XA CN103730121B (en) 2013-12-24 2013-12-24 Method and device for recognizing disguised voice


Publications (2)

Publication Number Publication Date
CN103730121A CN103730121A (en) 2014-04-16
CN103730121B true CN103730121B (en) 2016-08-24

Family

ID=50454168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310728591.XA Expired - Fee Related CN103730121B (en) 2013-12-24 2013-12-24 Method and device for recognizing disguised voice

Country Status (1)

Country Link
CN (1) CN103730121B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104183245A (en) * 2014-09-04 2014-12-03 福建星网视易信息系统有限公司 Method and device for recommending music stars with tones similar to those of singers
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN109215680B (en) * 2018-08-16 2020-06-30 公安部第三研究所 Voice restoration method based on convolutional neural network
CN109741761B (en) * 2019-03-13 2020-09-25 百度在线网络技术(北京)有限公司 Sound processing method and device
CN109920435B (en) * 2019-04-09 2021-04-06 厦门快商通信息咨询有限公司 Voiceprint recognition method and voiceprint recognition device
CN110363406A (en) * 2019-06-27 2019-10-22 上海淇馥信息技术有限公司 Appraisal procedure, device and the electronic equipment of a kind of client intermediary risk
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN111739547B (en) * 2020-07-24 2020-11-24 深圳市声扬科技有限公司 Voice matching method and device, computer equipment and storage medium
CN112967712A (en) * 2021-02-25 2021-06-15 中山大学 Synthetic speech detection method based on autoregressive model coefficient
CN113270112A (en) * 2021-04-29 2021-08-17 中国人民解放军陆军工程大学 Electronic camouflage voice automatic distinguishing and restoring method and system
CN116013323A (en) * 2022-12-27 2023-04-25 浙江大学 Active evidence obtaining method oriented to voice conversion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1914667A (en) * 2004-06-01 2007-02-14 东芝泰格有限公司 Speaker recognizing device, program, and speaker recognizing method
CN1967657A (en) * 2005-11-18 2007-05-23 成都索贝数码科技股份有限公司 Automatic tracking and tonal modification system of speaker in program execution and method thereof
CN101399044A (en) * 2007-09-29 2009-04-01 国际商业机器公司 Voice conversion method and system
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005345599A (en) * 2004-06-01 2005-12-15 Toshiba Tec Corp Speaker-recognizing device, program, and speaker-recognizing method


Also Published As

Publication number Publication date
CN103730121A (en) 2014-04-16

Similar Documents

Publication Publication Date Title
CN103730121B (en) Method and device for recognizing disguised voice
CN106847292B (en) Method for recognizing sound-groove and device
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN102968990B (en) Speaker identifying method and system
CN108694954A (en) A kind of Sex, Age recognition methods, device, equipment and readable storage medium storing program for executing
CN104900229A (en) Method for extracting mixed characteristic parameters of voice signals
CN104464724A (en) Speaker recognition method for deliberately pretended voices
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
WO2013040981A1 (en) Speaker recognition method for combining emotion model based on near neighbour principles
CN103077728B (en) A kind of patient&#39;s weak voice endpoint detection method
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
CN106409298A (en) Identification method of sound rerecording attack
CN102789779A (en) Speech recognition system and recognition method thereof
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN101887722A (en) Rapid voiceprint authentication method
Alam et al. Tandem Features for Text-Dependent Speaker Verification on the RedDots Corpus.
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
CN100570712C (en) Based on anchor model space projection ordinal number quick method for identifying speaker relatively
Mahesha et al. LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies
CN116434759B (en) Speaker identification method based on SRS-CL network
CN105976819A (en) Rnorm score normalization based speaker verification method
Wen et al. Multi-Path GMM-MobileNet Based on Attack Algorithms and Codecs for Synthetic Speech and Deepfake Detection.
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
Dai et al. An improved feature fusion for speaker recognition
Li et al. Voice-based recognition system for non-semantics information by language and gender

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160824

Termination date: 20211224

CF01 Termination of patent right due to non-payment of annual fee