CN102436810A

CN102436810A - Record replay attack detection method and system based on channel mode noise

Info

Publication number: CN102436810A
Application number: CN2011103305987A
Authority: CN
Inventors: 贺前华; 王志锋; 罗海宇; 陈芬
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2011-10-26
Filing date: 2011-10-26
Publication date: 2012-05-02
Also published as: WO2013060079A1

Abstract

The invention relates to the technical fields of intelligent speech signal processing, pattern recognition and artificial intelligence, and in particular to a method and system, based on channel mode noise, for detecting recording replay attacks on a speaker recognition system. The invention discloses a simpler and more efficient method for detecting recording replay attacks in a speaker recognition system. The method comprises the following steps: (1) inputting the speech signal to be recognized; (2) pre-processing the speech signal; (3) extracting the channel mode noise from the pre-processed speech signal; (4) extracting long-term statistical features based on the channel mode noise; and (5) classifying the long-term statistical features according to a channel noise classification decision model. Because channel mode noise is used for recording replay attack detection, the extracted features have low dimensionality, the computational complexity is low, and the recognition error rate is low. The security of the speaker recognition system is therefore greatly improved, and the method and system are easier to use in practice.

Description

Recording replay attack detection method and system based on channel mode noise

Technical field

The present invention relates to the fields of intelligent speech signal processing, pattern recognition and artificial intelligence, and in particular to a method and system, based on channel mode noise, for detecting recording replay attacks in a speaker recognition system.

Background art

With the continuous development of speaker recognition technology, speaker recognition systems have come into very wide use, for example in judicial forensics, e-commerce and the financial sector. At the same time, several security problems restrict their further development and application. Two common attacks on speaker recognition systems are the speaker impersonation attack and the recording replay attack. In an impersonation attack, the attacker imitates the voice of a user enrolled in the system. Speaker recognition experiments on a speech corpus of twins show that existing speaker recognition technology can distinguish even twins' voices, which have similar acoustic characteristics. An impersonation attack therefore requires exceptional imitation skill to make the attacker's voice highly similar to the user's, so such attacks are hard to carry out in practice. In a recording replay attack, the attacker secretly records the voice of an enrolled user in advance with a high-fidelity recording device and then plays the recording back through a high-fidelity amplifier at the system's input. For a text-dependent speaker recognition system, the attack can be mounted by secretly recording the user's speech while the user is accessing the system, or by secretly recording a large amount of the user's speech and splicing syllables together. For a text-independent system, obtaining part of the user's speech is enough. Unlike impersonated speech, replayed speech genuinely originates from the user, so it poses a greater threat to the speaker recognition system. Moreover, high-fidelity recording and playback devices with good performance keep appearing; they are becoming cheaper and smaller, and are easy to carry and hard to notice, which makes recording replay attacks ever easier.

One strategy for preventing recording replay attacks is to have the user read aloud a sentence randomly chosen by the system; during speaker recognition the system also checks whether the user read the sentence as required. This approach requires a sufficiently large speech corpus to be prepared in advance and requires the user to read according to the prompted content. A user who reads according to his or her own pronunciation habits cannot pass the speaker recognition system, and such an unfriendly mode of interaction is not readily accepted by users. The approach also sacrifices the protection that a specific text for a specific user gives the speaker recognition system, which can create other security problems. In practice it can only be used in text-dependent speaker recognition systems, and it must recognize the text of the speech in addition to performing speaker recognition, which reduces the overall efficiency of the system.

Another approach compares utterance similarity. Although the user enters the same password text every time, no two utterances yield identical samples; therefore, if the similarity between the input utterance and a stored utterance exceeds a certain range, the input is deemed a recording replay attack. This approach has obvious defects: (1) it can only be applied to recording replay attack detection in text-dependent speaker recognition systems; (2) every sample with which a user accesses the system must be kept, which requires a large amount of storage; (3) every new access sample must be compared for similarity against all stored samples, which is computationally very expensive; (4) if the replayed recording was not made while the user was accessing the system, for example if it was recorded privately or obtained by syllable splicing, the approach fails; (5) the approach depends heavily on the threshold setting. Speaker recognition is itself a similarity comparison in which high similarity means the same speaker, so the boundary between the similarity threshold for detecting attacks and the threshold for recognizing the speaker is hard to determine.

Summary of the invention

The object of the present invention is to overcome the defects and deficiencies of the prior art by providing a recording replay attack detection method based on channel mode noise which, when used in a speaker recognition system, improves the success rate of recording replay attack detection.

Another object of the present invention is to provide a system implementing the above method.

The objects of the invention are achieved by the following technical solutions:

A recording replay attack detection method based on channel mode noise, characterized in that the method comprises the following steps:

(1) inputting the speech signal to be recognized;

(2) pre-processing the speech signal;

(3) extracting the channel mode noise from the pre-processed speech signal;

(4) extracting long-term statistical features based on the channel mode noise;

(5) classifying the long-term statistical features according to a channel noise classification decision model to obtain the decision result of the recording replay attack detection.

The pre-processing of step (2) comprises pre-emphasis, framing and windowing.

Step (3) comprises the following steps:

(31) applying denoising filtering to the pre-processed speech signal;

(32) performing statistics frame analysis separately on the signals before and after denoising filtering;

(33) extracting the log power spectrum of each of the two signals after statistics frame analysis and subtracting one from the other to obtain the channel mode noise of the input speech signal.

The statistics frame is obtained by applying the discrete Fourier transform to the short-time frames of the speech signal and averaging the components at each frequency across frames.

Step (4) comprises the following steps:

(41) extracting the Legendre polynomial coefficients of orders 0 to 5 of the channel mode noise;

(42) extracting six statistical features of the channel mode noise;

(43) merging the values obtained in the above steps into a 12-dimensional long-term statistical feature vector, which serves as the feature vector for recording replay attack detection.

The six statistical features of step (42) are the minimum, maximum, mean, median and standard deviation of the channel mode noise, and the difference between its maximum and minimum.

The channel noise classification decision model of step (5) is built by the following steps:

(51) inputting training speech signals;

(52) repeating steps (2) to (4) to obtain the long-term statistical features of the channel mode noise of the training signals;

(53) classifying with a support vector machine (SVM) to build the channel noise classification decision model.

A system implementing the above method comprises:

---an input module 100, for inputting training speech signals or the speech signal to be recognized;

---a pre-processing module 200, for pre-processing the speech signal, comprising pre-emphasis, framing and windowing units;

---a channel mode noise extraction module 300, for extracting the channel mode noise from the pre-processed speech signal;

---a long-term statistical feature extraction module 400, for extracting long-term statistical features based on the channel mode noise;

---a channel noise model module 500, for classifying the long-term statistical features of the training speech with an SVM to build the channel noise classification decision model;

---a recognition decision module 600, for classifying the long-term statistical features of the speech signal to be recognized with the channel noise classification decision model to obtain the decision result of the recording replay attack detection;

---an output module 700, for outputting the decision result for the speech signal to be recognized.

The basic principle of the present invention is to detect recording replay attacks by extracting the channel mode noise of the speech signal. In a speaker recognition system, original speech means the user's live speech captured by the system, and replayed speech means the speech of a recording replay attack. Before entering the recording channel of the speaker recognition system, replayed speech has already passed through one additional round of recording and playback. Different recording and playback devices introduce their own channel noise (the microphone, loudspeaker, dither circuit, preamplifier, power amplifier, input and output filters, A/D and D/A conversion, sample-and-hold circuit and so on all introduce corresponding noise). This channel noise is superimposed on the replayed speech, so that replayed speech differs subtly from original speech. The present invention calls the noise introduced by the transducers (microphone, loudspeaker) and the various circuits of recording and playback devices channel mode noise. Original speech contains only the channel mode noise of the system's recording device, whereas replayed speech contains, in addition, the channel mode noise of the covert recording device and of the playback device. Recording replay attacks can therefore be detected by extracting the channel mode noise of the speech to be recognized. The invention extracts the channel mode noise with a denoising filter, extracts long-term statistical features from it, and uses an SVM to build a channel noise model that decides whether the input to the speaker recognition system is a recording replay attack.

Compared with existing recording replay attack detection methods, the present invention has the following advantages and beneficial effects:

(1) It can be applied to text-dependent speaker recognition systems and also to text-independent ones.

(2) The classification of original and replayed speech can be performed either before or after speaker recognition. The channel noise model can therefore be used to build either a front-end or a back-end recording replay attack detector, which makes the detection algorithm more flexible to apply.

(3) Compared with MFCC (Mel-frequency cepstral coefficient) features, the long-term statistical features have a markedly lower dimensionality, so feature extraction during training is markedly more efficient. Nor do the samples with which users access the system need to be stored, which saves a large amount of storage and computation.

Description of drawings

Fig. 1 is a structural diagram of the system of the present invention.

Fig. 2 is a flowchart of channel mode noise extraction and of long-term feature extraction based on the channel mode noise.

Fig. 3 is a flowchart of statistics frame extraction.

Fig. 4 is a comparison chart of the results after connection to a speaker recognition system.

Embodiment

The implementation of the present invention is further described below with reference to the drawings and an embodiment, but is not limited thereto.

The recording replay attack detection method of the present invention can be implemented in an embedded system by the following steps:

Step (1): input the training speech, which comprises original speech signals and replayed speech signals.

Step (2): pre-process the input speech signal by pre-emphasis, framing and windowing. Pre-emphasis high-pass filters the speech signal; the transfer function of the filter is H(z) = 1 - a z^{-1}, with a = 0.975. For framing, the frame length is 512 samples and the frame shift is 256 samples. The window applied to the speech signal is a Hamming window, defined as:

ω_{H} (n) = \begin{cases} 0.54 - 0.46 \cos (\frac{2 πn}{N - 1}), & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}
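The pre-processing of step (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `preprocess` is hypothetical, the sample before the first one is taken as zero, and incomplete trailing frames are dropped because the text does not say how they are padded.

```python
import numpy as np

def preprocess(signal, a=0.975, frame_len=512, frame_shift=256):
    """Pre-emphasize, split into overlapping frames, apply a Hamming window."""
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Pre-emphasis H(z) = 1 - a z^-1, i.e. y[n] = x[n] - a * x[n-1]
    # (assumption: the sample before x[0] is treated as zero).
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    # Framing: keep complete frames only (assumption, padding is unspecified).
    num_frames = 1 + (len(y) - frame_len) // frame_shift
    frames = np.stack([y[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # Hamming window 0.54 - 0.46 cos(2 pi n / (N - 1)) for 0 <= n <= N - 1.
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window  # shape (T, N): T frames of N samples each
```

With the stated frame length of 512 and shift of 256, a 1024-sample input yields three half-overlapping frames.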

Step (3): extract the channel mode noise from the pre-processed speech signal, as shown in Fig. 2. The extraction comprises the following steps:

Step S301: feed the speech pre-processed in step (2) into the channel mode noise extraction module 300;

Step S302: apply denoising filtering to the signal of step S301 with a denoising filter designed as follows:

H (z) = 1 - \frac{Σ_{n = 1}^{N} α^{n} z^{- n}}{Σ_{n = 1}^{N} α^{n}},

where N = 32 and α = 0.94;
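Read as an FIR filter, the transfer function of step S302 has tap 1 at delay 0 and tap -α^n / Σα^n at delay n, for n = 1 to 32. The sketch below builds those taps and applies them; the function names are hypothetical, and truncating the output to the input length is an assumption made for illustration.

```python
import numpy as np

def denoise_filter_taps(order=32, alpha=0.94):
    """FIR taps b[0..order] of H(z) = 1 - sum(alpha^n z^-n) / sum(alpha^n)."""
    powers = alpha ** np.arange(1, order + 1)   # alpha^1 ... alpha^order
    taps = np.empty(order + 1)
    taps[0] = 1.0                               # the leading "1" in H(z)
    taps[1:] = -powers / powers.sum()           # normalized, negated weights
    return taps

def defilter(x, order=32, alpha=0.94):
    """Apply the filter causally; output has the same length as the input
    (assumption: samples before the start of x are treated as zero)."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, denoise_filter_taps(order, alpha))[:len(x)]
```

Because the normalized weights sum to one, the taps sum to zero, so a constant input is mapped to zero once the first 32 samples have passed.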

Step S303: perform statistics frame analysis separately on the signal denoised in step S302 and on the un-denoised signal of step S301. The statistics frame is the mean of the components at each frequency over the short-time frames of the speech signal. Let X = {x_1[n], ..., x_T[n]} denote a speech signal of T frames; the discrete Fourier transform of the i-th frame x_i[n] (1 ≤ i ≤ T, 0 ≤ n ≤ N-1) is:

X_{i} [k] = Σ_{n = 0}^{N - 1} x_{i} [n] e^{- j \frac{2 πkn}{N}}, 0 \leq k \leq N - 1

The statistics frame S[k] is then:

S [k] = \frac{1}{T} Σ_{i = 1}^{T} X_{i} [k]

= \frac{1}{T} Σ_{i = 1}^{T} Σ_{n = 0}^{N - 1} x_{i} [n] e^{- j \frac{2 πkn}{N}}

As shown in Fig. 3, the statistics frame of step S303 is extracted as follows:

Step S3031: apply the discrete Fourier transform to the signals produced by steps S301 and S302;

Step S3032: for each signal, sum the components at the same frequency across all frames of the DFT output of step S3031;

Step S3033: average the spectrum summed in step S3032 to obtain the statistics frame of the input signal.
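The three sub-steps map directly onto a frame-wise FFT followed by a mean over the frame axis. The following is a minimal sketch; the function name `statistics_frame` and the (T, N) array layout are assumptions made for illustration.

```python
import numpy as np

def statistics_frame(frames):
    """S[k] = (1/T) * sum_i X_i[k] for a (T, N) array of time-domain frames."""
    frames = np.asarray(frames, dtype=float)
    spectra = np.fft.fft(frames, axis=1)  # S3031: DFT of every frame, X_i[k]
    return spectra.mean(axis=0)           # S3032 + S3033: sum over frames, divide by T
```

Since the DFT is linear, this average of complex spectra equals the DFT of the sample-wise average frame.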

Step S304: compute the log power spectrum. Extract the log power spectrum of each of the two signals analyzed in step S303, then subtract the log power spectrum of the signal that passed through the denoising filter from that of the signal that did not, which yields the channel mode noise of the input speech signal:

N = \log [\frac{1}{T} Σ_{i = 1}^{T} Σ_{n = 0}^{N - 1} x_{i} [n] e^{- j \frac{2 πkn}{N}}] - \log [\frac{1}{T} Σ_{i = 1}^{T} Σ_{n = 0}^{N - 1} {Defilter (x_{i} [n])} e^{- j \frac{2 πkn}{N}}]

where Defilter(·) is the denoising filter designed in step S302, and N on the left-hand side denotes the channel mode noise as a function of the frequency index k, not the frame length.
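Steps S302 to S304 can be combined into one short routine. This is a sketch under assumptions rather than a definitive implementation: the name `channel_mode_noise` is hypothetical, the filter is applied frame by frame, "log power spectrum" is read here as the logarithm of the squared magnitude of the statistics frame, and `eps` is a hypothetical guard against taking the logarithm of zero.

```python
import numpy as np

def channel_mode_noise(frames, order=32, alpha=0.94, eps=1e-12):
    """Log power spectrum of the unfiltered statistics frame minus that of
    the denoise-filtered one, for a (T, N) array of pre-processed frames."""
    frames = np.asarray(frames, dtype=float)
    n_samples = frames.shape[1]
    # Denoising filter taps of H(z) = 1 - sum(alpha^n z^-n) / sum(alpha^n).
    powers = alpha ** np.arange(1, order + 1)
    taps = np.concatenate(([1.0], -powers / powers.sum()))
    # Filter each frame, truncating to the frame length (assumption).
    filtered = np.stack([np.convolve(f, taps)[:n_samples] for f in frames])

    def log_power(block):
        stat = np.fft.fft(block, axis=1).mean(axis=0)  # statistics frame S[k]
        return np.log(np.abs(stat) ** 2 + eps)         # assumed reading of "log power"

    return log_power(frames) - log_power(filtered)     # one value per frequency bin k
```

For real-valued input the result is symmetric about the middle bin, so only the first N/2 + 1 values carry distinct information.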

Step (4): extract two groups of long-term statistical features from the channel mode noise obtained in the above steps. One group is the Legendre polynomial coefficients of orders 0 to 5; the other is six statistical features of the channel mode noise.

Step S401, extraction of the Legendre polynomial coefficients: fit the extracted channel mode noise with Legendre polynomials and take the coefficients of orders 0 to 5.

The Legendre polynomial expansion has the following form:

f (x) = Σ_{n = 0}^{\infty} L_{n} P_{n} (x)

where P_n(x) is the Legendre polynomial of order n and L_n is the corresponding Legendre polynomial coefficient. After the channel mode noise is extracted, it is expanded in Legendre polynomials to obtain the coefficients L_0 to L_5. Each coefficient captures one aspect of the channel mode noise: L_0 is the DC component of the channel mode noise; L_1 is the slope of the channel mode noise distribution curve; L_2 is the curvature of the curve; L_3 is the S-curvature of the curve; L_4 and L_5 carry finer detail of the curve.
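A least-squares Legendre fit of this kind is available in NumPy as `numpy.polynomial.legendre.legfit`. The sketch below is an illustration under one assumption the text does not spell out: the frequency bins are mapped onto evenly spaced points in [-1, 1], the interval on which Legendre polynomials are orthogonal. The function name `legendre_features` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(noise, degree=5):
    """Least-squares Legendre coefficients L_0 .. L_degree of a noise curve."""
    noise = np.asarray(noise, dtype=float)
    # Assumption: map the frequency index onto evenly spaced points in [-1, 1].
    x = np.linspace(-1.0, 1.0, len(noise))
    return legendre.legfit(x, noise, degree)  # coefficients, lowest order first
```

As a sanity check on the interpretation of the coefficients, a straight-line curve 2 + 3x yields L_0 = 2 (offset), L_1 = 3 (slope) and zeros for the higher orders.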

Step S402: extract the statistical features of the channel mode noise. This group comprises the following six features:

● PN_min: the minimum of the channel mode noise;

● PN_max: the maximum of the channel mode noise;

● PN_mean: the mean of the channel mode noise;

● PN_median: the median of the channel mode noise;

● PN_diff: the difference between the maximum and the minimum;

● PN_stdev: the standard deviation of the channel mode noise.

The two groups of long-term statistical features are merged into a 12-dimensional long-term statistical feature vector, which serves as the feature vector for recording replay attack detection.
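The six statistics and the merge into a 12-dimensional vector can be sketched as follows. The function names are hypothetical, the ordering of the vector (Legendre coefficients first, then the statistics in the order listed) is an assumption, and so is the use of the population standard deviation, since the text does not say which estimator is meant.

```python
import numpy as np

def statistical_features(noise):
    """PN_min, PN_max, PN_mean, PN_median, PN_diff, PN_stdev of a noise curve."""
    noise = np.asarray(noise, dtype=float)
    return np.array([
        noise.min(),
        noise.max(),
        noise.mean(),
        np.median(noise),
        noise.max() - noise.min(),  # PN_diff
        noise.std(),                # population standard deviation (assumption)
    ])

def feature_vector(legendre_coeffs, noise):
    """Concatenate six Legendre coefficients and six statistics into 12 values."""
    coeffs = np.asarray(legendre_coeffs, dtype=float)
    if coeffs.shape != (6,):
        raise ValueError("expected the six coefficients L_0 .. L_5")
    return np.concatenate((coeffs, statistical_features(noise)))
```

One such vector is produced per utterance, which is what keeps the feature dimensionality low compared with frame-level features.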

Step (5): build the SVM channel noise classification decision model, which distinguishes whether the input speech to be recognized is original speech or replayed speech. The SVM channel noise model is trained on positive and negative samples. The positive samples are the long-term statistical features based on channel mode noise obtained from original speech signals through steps (2) to (4); the negative samples are the corresponding features obtained from replayed speech signals through steps (2) to (4).

SVM classification requires that the separating hyperplane not only separate the two classes of samples correctly but also maximize the margin between them. Given a sample set (x_i, y_i), i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {-1, +1}, the hyperplane is normalized so that:

y_i [(w \cdot x_i) + b] - 1 \geq 0, i = 1, ..., n

The margin then equals 2/||w||, so maximizing the margin is equivalent to minimizing ||w||^2. The separating hyperplane that satisfies the above constraint and minimizes

\frac{1}{2} ||w||^2 is called the optimal separating hyperplane, and the training samples for which equality holds in the constraint are called support vectors.

Solving with the Lagrange optimization method, the Lagrange function is:

L (w, b, α) = \frac{1}{2} (w . w) - Σ_{i = 1}^{n} α_{i} {y_{i} [(w . x_{i}) + b] - 1}

This is converted into the Wolfe dual problem: under the constraints

Σ_{i = 1}^{n} y_{i} α_{i} = 0,

and α_i \geq 0, i = 1, ..., n,

maximize the following function with respect to α_i:

Q (α) = Σ_{i = 1}^{n} α_{i} - \frac{1}{2} Σ_{i, j = 1}^{n} α_{i} α_{j} y_{i} y_{j} (x_{i} \cdot x_{j})

Here α_i is the Lagrange multiplier associated with the constraint y_i [(w \cdot x_i) + b] - 1 \geq 0, i = 1, ..., n, of the original problem. Let the optimal solution obtained for the above problem be α_i^* (i = 1, ..., n)

and b^*, and let x be the input data to be classified. The optimal classification function (the output function of the SVM) is then:

f (x) = sgn {(w \cdot x) + b^{*}} = sgn {Σ_{i = 1}^{n} α_{i}^{*} y_{i} (x_{i} \cdot x) + b^{*}}

Real speech samples are noisy and cannot be perfectly linearly separated, so the SVM classifier is used in the linearly non-separable case. A slack variable ξ_i \geq 0 is added to the constraint

y_i [(w \cdot x_i) + b] - 1 \geq 0, i = 1, ..., n

so that the constraint becomes:

y_i [(w \cdot x_i) + b] - 1 + ξ_i \geq 0, i = 1, ..., n

The objective function then becomes:

L (w, b, α) = \frac{1}{2} (w . w) + C (Σ_{i = 1}^{n} ξ_{i})

Converting this into the Wolfe dual problem gives:

under the constraints Σ_{i = 1}^{n} y_{i} α_{i} = 0 and 0 \leq α_i \leq C, i = 1, ..., n, maximize:

Q (α) = Σ_{i = 1}^{n} α_{i} - \frac{1}{2} Σ_{i, j = 1}^{n} α_{i} α_{j} y_{i} y_{j} K (x_{i}, x_{j})

where C is a constant, called the penalty factor, that controls the degree of penalty applied to misclassified samples.

In the linearly non-separable case, the output function of the SVM can therefore be expressed as:

f (x) = sgn (Σ_{i = 1}^{n} α_{i}^{*} y_{i} K (x, x_{i}) + b^{*})

where 0 \leq α_i \leq C, i = 1, ..., n, sgn(·) is the sign function,

and K(x, x_i) is a radial basis inner-product function, which serves as the kernel function of the SVM:

K(x, x_i) = \exp(-λ ||x - x_i||), λ > 0

Different kernel functions may be chosen in practice.
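Once training has produced the multipliers α_i^*, the support vectors and b^*, classifying a feature vector reduces to evaluating the output function. The sketch below does only that evaluation, not the training. It follows the kernel formula as printed, with an unsquared distance; note that many SVM libraries define their RBF kernel with the squared distance instead. The function names, argument layout and the choice of returning +1 when the sum is exactly zero are assumptions.

```python
import numpy as np

def rbf_kernel(x, xi, lam):
    """K(x, x_i) = exp(-lam * ||x - x_i||), following the formula in the text."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-lam * np.linalg.norm(diff))

def svm_decide(x, support_vectors, labels, alphas, b, lam=0.0078125):
    """f(x) = sgn(sum_i alpha_i* y_i K(x, x_i) + b*); returns +1 or -1."""
    total = sum(a * y * rbf_kernel(x, sv, lam)
                for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if total + b >= 0 else -1  # a sum of exactly 0 maps to +1 (assumption)
```

With positive samples taken from original speech, as in step (5), an output of +1 would indicate original speech and -1 replayed speech.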

The penalty factor C and the kernel parameter λ are determined using the SMO (Sequential Minimal Optimization) algorithm together with a grid search, and are then used to train the channel noise model. One parameter setting obtained by such optimization is C = 0.03125, λ = 0.0078125.

Step (6): classification of original and replayed speech. Input the speech signal to be recognized, obtain its long-term statistical features based on channel mode noise through steps (2) to (4), and finally perform recording replay attack detection with the channel noise model built in step (5) and output the decision result.

As shown in Fig. 1, the recording replay attack detection system of the present invention comprises:

---an input module 100, for inputting training speech signals or the speech signal to be recognized;

---a recognition decision module 600, for using the channel noise model module to decide whether the input speech to be recognized is recording replay attack speech;

The recording replay attack detection method based on channel mode noise provided by the present invention was compared with the method based on utterance similarity comparison on the Authentic and Playback Speech Database (APSD). As shown in Table 1, the method based on channel mode noise has a lower error rate.

Table 1

As shown in Fig. 4, the recording replay attack detectors built with the two methods were each connected to an actual speaker recognition system. For data containing replay attack speech, the speaker recognition system without a replay attack detection module has a very high error rate and very poor security. With the replay attack detection module based on channel mode noise loaded, the system's equal error rate (EER) is lowest, at 10.2564%. With the replay attack detection module based on utterance similarity comparison loaded, the system's equal error rate is 29.0598%.

The recording replay attack detection method based on channel mode noise proposed by the present invention is not only simple to implement and computationally efficient but also has a low error rate. It offers higher efficiency when used on embedded recognition devices and other smart devices.

Claims

1. A recording replay attack detection method based on channel mode noise, characterized in that the method comprises the following steps:

(1) inputting the speech signal to be recognized;

(2) pre-processing the speech signal;

(4) extracting long-term statistical features based on the channel mode noise;

2. The recording replay attack detection method according to claim 1, characterized in that the pre-processing in step (2) comprises pre-emphasis, framing and windowing.

3. The recording replay attack detection method according to claim 1, characterized in that step (3) further comprises the following steps:

(31) applying denoising filtering to the pre-processed speech signal;

4. The recording replay attack detection method according to claim 3, characterized in that the statistics frame is obtained by applying the discrete Fourier transform to the short-time frames of the speech signal and averaging the components at each frequency across frames.

5. The recording replay attack detection method according to claim 1, characterized in that step (4) further comprises the following steps:

(41) extracting the Legendre polynomial coefficients of orders 0 to 5 of the channel mode noise;

(42) extracting six statistical features of the channel mode noise;

6. The recording replay attack detection method according to claim 5, characterized in that the six statistical features of step (42) are the minimum, maximum, mean, median and standard deviation of the channel mode noise, and the difference between its maximum and minimum.

7. The recording replay attack detection method according to claim 1, characterized in that the channel noise classification decision model of step (5) is built by the following steps:

(51) inputting training speech signals;

(52) repeating steps (2) to (4) to obtain the long-term statistical features of the channel mode noise of the training signals;

(53) classifying with a support vector machine (SVM) to build the channel noise classification decision model.

8. A recording replay attack detection system based on channel mode noise, characterized by comprising:

---an input module (100), for inputting training speech signals or the speech signal to be recognized;

---a pre-processing module (200), for pre-processing the training speech signals or the speech signal to be recognized, comprising pre-emphasis, framing and windowing units;

---a channel mode noise extraction module (300), for extracting the channel mode noise from the pre-processed training speech signals or speech signal to be recognized;

---a long-term statistical feature extraction module (400), for extracting long-term statistical features, based on the channel mode noise, of the training speech signals or the speech signal to be recognized;

---a channel noise model module (500), for classifying the long-term statistical features of the training speech signals with an SVM to build the channel noise classification decision model;

---a recognition decision module (600), for classifying the long-term statistical features of the speech signal to be recognized with the channel noise classification decision model to obtain the decision result of the recording replay attack detection;

---an output module (700), for outputting the decision result for the speech signal to be recognized.