CN112201255A - Voice signal spectrum characteristic and deep learning voice spoofing attack detection method - Google Patents
- Publication number
- CN112201255A (application number CN202011061172.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- peak
- features
- pow
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L17/02 Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 Speaker identification or verification: training, enrolment or model building
- G10L17/06 Speaker identification or verification: decision making techniques; pattern matching strategies
- H04L63/1416 Network security, detection of malicious traffic: event detection, e.g. attack signature detection
- H04L63/1466 Network security, countermeasures against malicious traffic: active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
Abstract
The invention discloses a voice spoofing attack detection method based on voice signal spectrum characteristics and deep learning. After a microphone of an electronic device receives a voice signal, the voice undergoes signal processing, specific features are then extracted, and finally the labeled features are input into a classifier built on the deep convolutional neural network SE-ResNet for training. The trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the product of a voice attack. The invention can accurately and effectively detect voice spoofing attacks against speaker recognition systems, as represented by replay attacks.
Description
Technical Field
The invention belongs to the technical field of voice authentication and security, and particularly relates to a voice recognition technique based on voice signal spectrum characteristics and a software processing method capable of detecting voice spoofing attacks against speaker recognition systems.
Background
A speaker authentication system is a security authentication system that identifies a speaker by extracting the speaker's voice characteristics and learning and matching the feature patterns. Owing to its low hardware requirements (only a microphone), low cost, simple user operation and support for remote, contactless authentication, it has gradually become a mainstream mode of user authentication and access control, widely applied in devices such as smartphones, smart speakers and smart homes.
However, existing voice authentication systems are generally vulnerable to voice spoofing attacks. A voice spoofing attack deceives a voice authentication system by forging voice similar to that of a target user, thereby fraudulently obtaining the target user's access rights. Common voice spoofing attacks include replay attacks, voice synthesis attacks and voice conversion attacks. In a replay attack, the attacker deceives the voice authentication system by replaying pre-recorded real voice of the target user; in a voice synthesis attack, the attacker synthesizes false target-user voice with the required content by means of artificial intelligence or voice splicing; in a voice conversion attack, the attacker converts another person's voice into the target user's voice. With the development of voice technology and electronic equipment, the barrier to mounting a voice spoofing attack keeps falling and the harm keeps growing. Under these circumstances, an efficient and low-cost voice spoofing attack detection method is therefore necessary.
The key to using spectrum characteristics to detect attacks is to extract features that differ greatly between the spectra of real voice and replay attacks.
Many related studies detect voice spoofing attacks by exploiting the noise and distortion that recording and playback introduce. However, such detection methods generally have low detection accuracy and are difficult to apply once attack methods and devices are upgraded. There are also defense methods that perform liveness detection by having the user wear additional equipment; because extra hardware is required, these methods are costly and offer a poor user experience.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a sound-source authentication detection method based on spectral features and the deep convolutional neural network SE-ResNet, a detection processing method capable of detecting spoofing attacks against voice authentication systems, so that voice spoofing attacks against speaker recognition systems, as represented by replay attacks, can be accurately and effectively detected.
The technical scheme adopted by the invention is as follows:
after a microphone of the electronic device receives a voice signal, signal processing is performed on the voice, then specific features are extracted, and finally the labeled features are input into a deep convolutional neural network classifier for training; the trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the result of a voice attack.
The method specifically comprises the following steps:
1) signal processing:
For the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies the short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is applied to each data frame, with the number of FFT points nfft equal to 4096;
In the second step, the fast Fourier transform results of all data frames are accumulated into a vector of length 4096, and finally the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow;
2) Feature extraction:
Feature extraction is performed on the cumulative power spectrum S_pow, yielding four features: a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients.
3) Attack detection:
A classifier based on the squeeze-and-excitation residual network architecture (SE-ResNet) is established; the network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four features are input into the residual block. The residual block's output passes sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function before entering the scale layer; the residual block's output is also fed directly into the scale layer. After a Reshape operation raises the result back to the input dimensions, the scale layer's output enters the addition block, into which the original four features are simultaneously input; after a weighting operation, the addition block outputs the final probability prediction.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained in the following three-step process: the first step divides S_pow equally into segments of fixed length W; the second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of S_pow; the third step takes the first 50 points of <pow> as the low-frequency feature FV_1, a 50-dimensional vector, which serves as the first class of features;
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution curve is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
In the above step, the energy distribution of the voice is processed and described using the linearity of the cumulative distribution function (CDF).
2.3) Peak feature
The maxima of the cumulative distribution curve are computed, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S. A series of peak statistics is then computed: the total number of peaks N_peak in S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial, yielding the coefficient set P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC) as the fourth class of features.
The SE-ResNet50 structure comprises a first, second, third, fourth and fifth convolution layer group, an average pooling layer, a fully connected layer and a Sigmoid layer, connected in sequence. When the classifier is trained, the squeeze-and-excitation residual network is trained on the four features of each voice signal together with its known label indicating whether the signal is human speech or a replayed voice attack.
In this method, the residual network ResNet serves as the basic framework; shortcut connections are used in the network and a squeeze-excitation structure is added, which alleviates the network degradation problem and improves the model's sensitivity to channel features.
Specifically, the model learns the importance of each feature channel and then increases the weights of the important feature channels accordingly.
The invention selects four classes of features, uses them for sound-source recognition, and provides an extraction algorithm for each. The advanced deep convolutional neural network SE-ResNet is selected as the classifier, and a voice spoofing attack detection method is constructed on the basis of the spectral features and SE-ResNet.
The invention captures voice through the microphone of a smart device to obtain voice signals and, through signal processing, extracts four features that effectively reflect the true spectral difference between real voice and replayed attack voice. Based on the regular differences between real voice and replay attacks in low-frequency peak characteristics and energy distribution, these features are input into the constructed deep convolutional neural network classifier SE-ResNet50 to distinguish real voice from replay attacks.
The invention can accurately and effectively detect the voice deception attack represented by the replay attack aiming at the speaker recognition system.
The invention has the beneficial effects that:
the innovation point of the invention is that aiming at the difference between the replay voice and the real voice in the aspect of spectrum characteristics, 74-dimensional characteristics such as energy power characteristics, low-frequency characteristics and the like are provided, and effective characteristic data are provided for attack detection. In addition, SE-ResNet was established to be used for replay attack detection. In the voice spoofing attack, even if an attacker generates sound which is very similar to the voice of a real user, the sound necessarily causes a certain degree of nonlinear distortion when passing through a microphone and a loudspeaker, the spectral characteristics of the sound are inconsistent with those of the real user, and therefore the method can be used for detecting the voice spoofing attack.
The voice spoofing attack detection method can efficiently detect voice spoofing attacks using only the microphone and voice hardware the voice authentication system already has. It features low cost and high attack detection accuracy, can be used for the security protection of voice authentication systems on smart devices such as mobile phones, and has broad demand and application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a spectrogram (left) of real voice and a spectrogram (right) of replay attack.
Fig. 3 is a flow chart of an actual user issuing an instruction that is received by the smart device (top), and of a replay attack being performed (bottom).
FIG. 4 is a diagram of the SE-ResNet model architecture of the present invention.
Fig. 5 is a graph of the training process and results of the present invention on the ASVspoof2017 and ASVspoof2019 data sets.
Detailed Description
The invention will be further explained with reference to the drawings.
The examples and embodiments of the method according to the invention are as follows:
1) signal processing:
As shown in fig. 1, for the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies the short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 (representing 1024 data points) with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is applied to each data frame, with the number of FFT points nfft equal to 4096;
In the second step, the fast Fourier transform results of all data frames are accumulated into a vector of length 4096, and finally the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow;
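The two-step process above can be sketched as follows. This is a minimal sketch rather than the patented implementation: taking the squared FFT magnitude as each frame's power contribution is an assumption the text does not spell out.

```python
import numpy as np

def cumulative_power_spectrum(voice, win_len=1024, overlap=768, nfft=4096):
    """Sketch of the two-step signal processing: windowed FFT per frame,
    then accumulation across frames (per-frame power = |FFT|^2 is assumed)."""
    window = np.hamming(win_len + 1)[:-1]  # periodic Hamming window of length 1024
    hop = win_len - overlap                # 256-sample hop, i.e. 768-sample overlap
    n_frames = 1 + (len(voice) - win_len) // hop
    s_pow = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + win_len] * window
        s_pow += np.abs(np.fft.fft(frame, n=nfft)) ** 2  # accumulate frame power
    return s_pow[: nfft // 2 + 1]          # first 2049 points: S_pow
```

With these parameters the result is always a 2049-point non-negative spectrum, corresponding to the non-negative frequency bins of the 4096-point FFT.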
2) Feature extraction:
Feature extraction is performed on the cumulative power spectrum S_pow, yielding four features: a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained in the following three-step process:
The first step divides S_pow equally into segments of fixed length W; if the length of S_pow is not divisible by W, the last incomplete segment is discarded. W is taken as 10 in the practice of the invention.
The second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of S_pow;
The third step takes the first 50 points of <pow> as the low-frequency feature FV_1, a 50-dimensional vector, which serves as the first class of features;
In this way the cumulative power spectrum S_pow is smoothed; in this implementation, low-band points below 2 kHz are selected as the low-frequency feature.
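The three steps above can be sketched as follows, assuming non-overlapping segment sums and returning both the intermediate vector <pow> and FV_1:

```python
import numpy as np

def low_frequency_feature(s_pow, W=10):
    """Sketch of step 2.1: non-overlapping segment sums smooth S_pow; the
    first 200 sums form the intermediate vector <pow>, whose first 50
    points are the low-frequency feature FV_1."""
    n_seg = len(s_pow) // W   # drop the trailing incomplete segment
    pow_vec = s_pow[: n_seg * W].reshape(n_seg, W).sum(axis=1)[:200]
    return pow_vec, pow_vec[:50]   # (<pow>, FV_1)
```

For a 2049-point S_pow and W = 10 this produces 204 segment sums, so both the 200-point <pow> vector and the 50-dimensional FV_1 are fully populated.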
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution curve is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
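The energy distribution step can be sketched as follows. The text does not define ρ and q precisely, so as assumptions ρ is taken to be the Pearson correlation between the CDF and its index, and q the leading coefficient of a quadratic fit:

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """Sketch of step 2.2: measure how close the cumulative distribution
    pow_cdf of <pow> is to a straight line (assumed formulation)."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)  # cumulative distribution pow_cdf
    x = np.arange(len(cdf))
    rho = np.corrcoef(x, cdf)[0, 1]             # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                # quadratic curve fitting coefficient
    return np.array([rho, q])                   # FV_2 = [rho, q]
```

A perfectly flat power distribution gives a perfectly linear CDF, so ρ approaches 1 and q approaches 0; concentrated energy bends the CDF and moves both away from those values.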
2.3) Peak feature
The maxima of the cumulative distribution curve are computed, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S. A series of peak statistics is then computed: the total number of peaks N_peak in S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial, yielding the coefficient set P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
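The peak statistics can be sketched as follows. Several assumptions are made for brevity: a peak is a strict local maximum above the threshold, frequencies are expressed as spectrum bin indices, and P_est is a single sixth-order polynomial fit of the whole curve standing in for the per-peak fits described in the text:

```python
import numpy as np

def peak_feature(s_pow, threshold):
    """Sketch of step 2.3: locate above-threshold local maxima, compute
    N_peak, mu_peak, sigma_peak, then append sixth-order fit coefficients."""
    mid = s_pow[1:-1]
    idx = np.where((mid > s_pow[:-2]) & (mid > s_pow[2:]) & (mid > threshold))[0] + 1
    n_peak = len(idx)                                    # total number of peaks
    mu_peak = idx.mean() if n_peak else 0.0              # mean peak frequency (bin)
    sigma_peak = idx.std() if n_peak else 0.0            # std of peak frequencies
    p_est = np.polyfit(np.arange(len(s_pow)), s_pow, 6)  # sixth-order coefficients
    return np.concatenate(([n_peak, mu_peak, sigma_peak], p_est))
```

The returned vector has length 10 under these assumptions: three statistics plus the seven coefficients of the sixth-order polynomial.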
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC); coefficients of order 12 are used, and the 12 LPC-derived coefficients form a vector serving as the fourth class of features.
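The LPCC extraction is not detailed in the text; the sketch below uses the textbook pipeline (autocorrelation, Levinson-Durbin recursion for 12th-order LPC, then the standard LPC-to-cepstrum recursion) as an assumed formulation:

```python
import numpy as np

def lpcc(signal, order=12):
    """Sketch of step 2.4: 12th-order LPC via Levinson-Durbin, followed by
    the standard LPC-to-cepstrum recursion (assumed textbook formulation)."""
    # Autocorrelation lags r[0..order]
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1 :][: order + 1]
    # Levinson-Durbin recursion for prediction coefficients a[1..order]
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1 : 0 : -1])) / e
        a[1:i] = a[1:i] - k * a[i - 1 : 0 : -1]
        a[i] = k
        e *= 1.0 - k * k
    # LPC -> cepstrum: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]   # 12-dimensional LPCC feature vector
```

In practice the recursion would be applied per frame after pre-emphasis and windowing; here it runs on the whole signal to keep the sketch short.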
3) Attack detection:
A classifier based on the squeeze-and-excitation residual network architecture (SE-ResNet) is established; the network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four features are input into the residual block. The residual block's output passes sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function before entering the scale layer; the residual block's output is also fed directly into the scale layer. After a Reshape operation raises the result back to the input dimensions, the scale layer's output enters the addition block, into which the original four features are simultaneously input; after a weighting operation, the addition block outputs the final probability prediction. If the probability is greater than 0.5, the voice is judged to be a replayed attack; if it is less than 0.5, it is judged to be real voice.
The SE-ResNet50 algorithm structure comprises a first convolution layer group, a second convolution layer group, a third convolution layer group, a fourth convolution layer group, a fifth convolution layer group, an average pooling layer, a full connection layer and a Sigmoid layer which are sequentially connected; the squeeze-excited residual network is trained by inputting four features of a speech signal and a corresponding label known whether a human voice utters speech or replays a speech attack when the classifier is trained.
In a specific implementation level, the SE-ResNet architecture is shown in fig. 4, and includes 2 operations, namely squeeze (squeeze) and stimulus (excitation).
In the squeeze operation, the original feature map has dimensions C × H × W, where C is the number of feature channels (here equal to the total of 74 extracted feature dimensions), H the height and W the width; it is compressed into a C × 1 × 1 feature map by global average pooling, as shown in the dashed box of fig. 4. After the H × W plane is compressed to a single value, the corresponding one-dimensional parameter carries a global view of the former H × W plane, giving the convolution kernel a wider sensing area.
In the excitation operation, a fully connected layer is applied to the C × 1 × 1 feature map obtained by the squeeze operation to predict the importance of each feature channel. Finally, normalization is performed with a Sigmoid function, and the normalized weights are applied to the features of each channel through the Scale layer.
After the SE-ResNet architecture is obtained, 50 layers of squeeze-excitation residual blocks are deployed; the whole pipeline is shown in Table 1. Since distinguishing real sound from replayed sound is a binary classification problem, the final output dimension is set to 1, and the output is the probability that each voice under test is real voice.
TABLE 1 SE-ResNet50 flow framework
Fig. 3 is a schematic diagram of a replay attack. Compared with real voice, a replay attack adds two links, microphone recording and loudspeaker playback, which necessarily alter the original signal. The sensitivity of a microphone or loudspeaker depends on how far its diaphragm deflects under sound pressure. Owing to imperfections in the manufacturing process, microphones have limitations that ultimately result in inherent distortion; this nonlinear characteristic adds noise over the lower frequency range. Loudspeakers likewise introduce nonlinear distortion when reproducing sound. Despite great progress in producing high-quality sound, most loudspeakers still exhibit nonlinear behavior, especially in the low-frequency region. There are three main causes of this nonlinearity: (1) changes in the magnetic field caused by voice coil excursion; (2) the nonlinear suspension stiffness of the voice coil; (3) the voice coil's self-inductance varying with its excursion. Although voice spoofing attacks can generate false voice signals in various ways, in an actual attack the attacker must play the false signal to the targeted voice authentication system through a loudspeaker. Protection of a voice authentication system can therefore start from identifying the sound source, realizing detection of spoofing attacks.
The upper left of fig. 2 is the spectrogram of a real speech sample; the other three are spectrograms of the same speech after replay through different loudspeakers. Comparison yields the following observations: real voice fluctuates more noticeably in the low band (quantitatively, more peaks are visible), whereas a replay attack fluctuates less (its peaks are concentrated); the energy distributions of real voice and replay attacks also differ, with replay attacks carrying a higher proportion of energy at 4-5 kHz.
The embodiments were tested with the ASVspoof 2017 and ASVspoof 2019 data sets, the standard data sets for voice spoofing attacks. The "ASVspoof Challenge" is a special competition track of Interspeech, the top international academic conference in the speech field, focusing on spoofing of automatic speaker verification systems.
First, the four classes of features described above are extracted from the training set data and labeled as real voice or replayed voice; the labeled features are then used to train the SE-ResNet neural network. The trained SE-ResNet is then evaluated on the test set. The results are shown in fig. 5: an Equal Error Rate (EER) of 2.38% is achieved on the ASVspoof2017 data set and 0.163% on the ASVspoof2019 PA data set, results that would rank first in both of that year's challenges. The equal error rate is the error rate at which the false acceptance rate equals the false rejection rate; a smaller value indicates a more accurate detection system.
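The equal error rate reported above can be computed with a simple threshold scan; this sketch returns the mean of the false acceptance and false rejection rates at the threshold where the two are closest:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sketch of the EER metric: scan candidate thresholds and return the
    error rate where false acceptance and false rejection are closest
    (labels: 1 = replay attack, 0 = genuine; higher score = more attack-like)."""
    best = (np.inf, 1.0)
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # genuine accepted as attack
        frr = np.mean(scores[labels == 1] < t)   # attack judged genuine
        best = min(best, (abs(far - frr), (far + frr) / 2.0))
    return best[1]
```

Perfectly separable scores yield an EER of 0; a classifier guessing at random tends toward 50%.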
In addition, in cross-validation experiments, training on the ASVspoof2017 training and development sets and testing on the ASVspoof2019 test set achieves an EER of 4.47%; training on the ASVspoof2019 training and development sets and testing on the ASVspoof2017 test set achieves an EER close to 0.
Claims (4)
1. A voice replay attack detection method based on spectral features and a deep convolutional neural network is characterized by comprising the following steps:
after a microphone of the electronic device receives a voice signal, signal processing is performed on the voice, then specific features are extracted, and finally the labeled features are input into a deep convolutional neural network classifier for training; the trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the result of a voice attack.
2. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that: the method specifically comprises the following steps:
1) signal processing:
for original Voice signal VoiceinThe cumulative power spectrum S is obtained in the following two-step processpow:
The first step adopts a short-time Fourier transform, whose process is as follows: the original voice signal Voice_in is windowed using a periodic Hamming window of length 1024 with an overlap length of 768, dividing Voice_in into a plurality of data frames of length 1024; a fast Fourier transform is then performed on each data frame, with the number of Fourier transform points nfft being 4096;
in the second step, the fast Fourier transform results of all data frames are accumulated to obtain a vector of length 4096, and the first 2049 data points of that vector are taken as the cumulative power spectrum S_pow;
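The two-step process above can be sketched in numpy as follows. Two caveats: the claim does not state whether the accumulated quantity is the FFT magnitude or the power, so the squared magnitude is assumed here; and `np.hamming` produces a symmetric rather than a strictly periodic Hamming window, which is close enough for illustration.

```python
import numpy as np

def cumulative_power_spectrum(voice, win_len=1024, overlap=768, nfft=4096):
    """Windowed FFT per frame, accumulated across frames; the first
    nfft//2 + 1 = 2049 bins are returned as S_pow."""
    hop = win_len - overlap                 # 256-sample hop between frames
    window = np.hamming(win_len)            # symmetric Hamming, stand-in for periodic
    n_frames = 1 + (len(voice) - win_len) // hop
    acc = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + win_len] * window
        acc += np.abs(np.fft.fft(frame, nfft)) ** 2   # power assumed (see lead-in)
    return acc[: nfft // 2 + 1]             # first 2049 points

s_pow = cumulative_power_spectrum(np.random.randn(16000))
print(s_pow.shape)   # (2049,)
```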
2) Feature extraction:
feature extraction is performed with the cumulative power spectrum S_pow as input, obtaining four features: low-frequency features, energy distribution features, peak features, and linear prediction cepstral coefficients.
3) Attack detection:
a squeeze-excitation residual network (SE-ResNet) classifier is established; the squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function,
the four features are input into the residual block (residual); the output of the residual block is passed sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function and is then input into a scale layer (scale); the output of the residual block is also input into the scale layer directly; the scale layer performs a Reshape operation to raise its result back to the input dimension and outputs it to the addition block; the original four features are simultaneously input into the addition block, which performs a weighting operation and outputs the final probability prediction result.
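The squeeze-excitation gating described above (global pooling, two small convolution modules, Sigmoid, channel-wise rescaling) can be sketched in numpy as follows. The feature-map shape, the reduction ratio, and the ReLU between the two modules are illustrative assumptions, not details taken from the claim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map: global average
    pooling -> two small linear maps (standing in for the 1x1 convolution
    modules) -> sigmoid gate -> channel-wise rescaling of the input."""
    z = x.mean(axis=(1, 2))                      # squeeze: one value per channel
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))    # excitation with assumed ReLU
    return x * s[:, None, None]                  # scale: broadcast gate over H, W

rng = np.random.default_rng(0)
c, h, w = 8, 4, 4
x = rng.standard_normal((c, h, w))
w1 = rng.standard_normal((c // 4, c))   # reduction ratio 4, an assumption
w2 = rng.standard_normal((c, c // 4))
y = se_block(x, w1, w2)
print(y.shape)   # (8, 4, 4)
```

Because the gate lies in (0, 1), each channel of the output is a damped copy of the corresponding input channel, which is the re-weighting the scale layer performs.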
3. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained by the following three-step process: the first step equally divides the cumulative power spectrum S_pow into segments of fixed length W; the second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; the third step takes the first 50 points of the intermediate vector <pow> as the low-frequency feature FV_1, a 50-dimensional vector serving as the first class of features;
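The three-step process of 2.1 can be sketched as follows. The segment length is an assumption (the claim only calls it W), and a random spectrum stands in for a real S_pow.

```python
import numpy as np

def low_frequency_features(s_pow, w=8):
    """Segment S_pow into chunks of length w, sum each chunk, keep the
    first 200 sums as the smoothed intermediate vector <pow>, then its
    first 50 entries as the 50-dimensional feature FV_1."""
    n = len(s_pow) // w * w                      # drop the ragged tail
    sums = s_pow[:n].reshape(-1, w).sum(axis=1)  # one sum per segment
    pow_vec = sums[:200]                         # smoothed intermediate vector <pow>
    fv1 = pow_vec[:50]                           # low-frequency feature FV_1
    return fv1, pow_vec

fv1, pow_vec = low_frequency_features(np.abs(np.random.randn(2049)))
print(fv1.shape)   # (50,)
```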
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution diagram is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of the cumulative distribution function pow_cdf are then calculated, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
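A minimal sketch of step 2.2, with two stated assumptions: ρ is taken as the correlation of the CDF with its index, and q as the quadratic term of a degree-2 polynomial fit, since the claim does not pin either down.

```python
import numpy as np

def energy_distribution_features(pow_vec):
    """Normalized cumulative distribution of <pow>, its linear
    correlation with the index (rho), and the leading coefficient of a
    quadratic fit (q)."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)   # pow_cdf, rising from ~0 to 1
    idx = np.arange(len(cdf))
    rho = np.corrcoef(idx, cdf)[0, 1]            # linear correlation coefficient
    q = np.polyfit(idx, cdf, 2)[0]               # quadratic fitting coefficient
    return np.array([rho, q])

fv2 = energy_distribution_features(np.abs(np.random.randn(200)) + 0.1)
print(fv2.shape)   # (2,)
```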
2.3) Peak feature
The maxima of the cumulative distribution diagram are calculated, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S; a series of peak statistics is then calculated, including the total number N_peak of peaks in the peak data set S, the mean μ_peak of the frequencies corresponding to all peaks in the peak data set S, and the standard deviation σ_peak of the frequencies corresponding to all peaks in the peak data set S; the shape of each peak is fitted with a sixth-order polynomial to obtain the set P_est of sixth-order polynomial coefficients; finally, the statistics and the sixth-order polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
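Step 2.3 can be sketched as below: local maxima above a threshold are treated as peaks, their positions yield N_peak, μ_peak and σ_peak, and a sixth-order polynomial is fitted to each peak's neighbourhood. The neighbourhood half-width is an assumption the claim does not specify, and the fit is done in a centered coordinate for numerical conditioning.

```python
import numpy as np

def peak_features(spec, threshold, half_width=5):
    """Peaks = local maxima above threshold; returns the peak count,
    mean and standard deviation of peak positions, and the concatenated
    sixth-order polynomial coefficients of each peak's local shape."""
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > threshold and spec[i] >= spec[i - 1] and spec[i] >= spec[i + 1]]
    n_peak = len(peaks)
    mu_peak = float(np.mean(peaks)) if peaks else 0.0
    sigma_peak = float(np.std(peaks)) if peaks else 0.0
    p_est = []
    for i in peaks:
        lo, hi = max(0, i - half_width), min(len(spec), i + half_width + 1)
        x = np.arange(lo, hi) - i                    # centered coordinate
        p_est.extend(np.polyfit(x, spec[lo:hi], 6))  # 7 coefficients per peak
    return n_peak, mu_peak, sigma_peak, np.array(p_est)

# two synthetic bumps; only bumps taller than the threshold count as peaks
spec = np.concatenate([np.hanning(21) * 3, np.zeros(30), np.hanning(21) * 2])
n, mu, sd, p = peak_features(spec, threshold=1.0)
print(n)   # 2
```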
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC) as the fourth class of features.
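Step 2.4 only names LPCC; one common way to compute them, sketched below, is LPC via the Levinson-Durbin recursion on the autocorrelation sequence followed by the standard LPC-to-cepstrum recursion. The prediction order of 12 is an assumption, as the claim does not fix it.

```python
import numpy as np

def lpcc(signal, order=12):
    """LPC via Levinson-Durbin, then LPC-to-cepstrum recursion."""
    # autocorrelation lags 0..order
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:][: order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    c = np.zeros(order + 1)                # LPC -> cepstral coefficients
    for n in range(1, order + 1):
        c[n] = -a[n] - sum((j / n) * c[j] * a[n - j] for j in range(1, n))
    return c[1:]

rng = np.random.default_rng(1)
sig = np.sin(0.3 * np.arange(800)) + 0.05 * rng.standard_normal(800)
coeffs = lpcc(sig)
print(coeffs.shape)   # (12,)
```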
4. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that: the SE-ResNet50 algorithm structure comprises a first convolution layer group, a second convolution layer group, a third convolution layer group, a fourth convolution layer group, a fifth convolution layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence; when the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of a voice signal together with a corresponding label indicating whether the speech is known to be uttered by a human voice or to be a replayed voice attack.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011061172.1A CN112201255B (en) | 2020-09-30 | 2020-09-30 | Voice signal spectrum characteristic and deep learning voice spoofing attack detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112201255A true CN112201255A (en) | 2021-01-08 |
CN112201255B CN112201255B (en) | 2022-10-21 |
Family
ID=74013928
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112201255B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113012684A (en) * | 2021-03-04 | 2021-06-22 | 电子科技大学 | Synthesized voice detection method based on voice segmentation |
CN113192504A (en) * | 2021-04-29 | 2021-07-30 | 浙江大学 | Domain-adaptation-based silent voice attack detection method |
CN113241079A (en) * | 2021-04-29 | 2021-08-10 | 江西师范大学 | Voice spoofing detection method based on residual error neural network |
CN113284513A (en) * | 2021-07-26 | 2021-08-20 | 中国科学院自动化研究所 | Method and device for detecting false voice based on phoneme duration characteristics |
CN113488027A (en) * | 2021-09-08 | 2021-10-08 | 中国科学院自动化研究所 | Hierarchical classification generated audio tracing method, storage medium and computer equipment |
CN113506583A (en) * | 2021-06-28 | 2021-10-15 | 杭州电子科技大学 | Disguised voice detection method using residual error network |
CN113611329A (en) * | 2021-07-02 | 2021-11-05 | 北京三快在线科技有限公司 | Method and device for detecting abnormal voice |
CN116504226A (en) * | 2023-02-27 | 2023-07-28 | 佛山科学技术学院 | Lightweight single-channel voiceprint recognition method and system based on deep learning |
CN117393000A (en) * | 2023-11-09 | 2024-01-12 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108831485A (en) * | 2018-06-11 | 2018-11-16 | 东北师范大学 | Method for distinguishing speek person based on sound spectrograph statistical nature |
CN108875787A (en) * | 2018-05-23 | 2018-11-23 | 北京市商汤科技开发有限公司 | A kind of image-recognizing method and device, computer equipment and storage medium |
CN110473569A (en) * | 2019-09-11 | 2019-11-19 | 苏州思必驰信息科技有限公司 | Detect the optimization method and system of speaker's spoofing attack |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||