CN112201255B - Voice signal spectrum characteristic and deep learning voice spoofing attack detection method - Google Patents


Info

Publication number: CN112201255B
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202011061172.1A
Other languages: Chinese (zh)
Other versions: CN112201255A
Inventors: 徐文渊, 冀晓宇, 王炎, 周瑜, 薛晖, 金子植, 石卓杨, 闫琛
Current assignee: Zhejiang University (ZJU) (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Zhejiang University (ZJU)
Application filed by Zhejiang University (ZJU); priority to CN202011061172.1A
Publications: CN112201255A (application), CN112201255B (grant); application granted


Classifications

    • G10L 17/02 — Speaker identification or verification techniques: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Speaker identification or verification techniques: training, enrolment or model building
    • G10L 17/06 — Speaker identification or verification techniques: decision making techniques; pattern matching strategies
    • H04L 63/1416 — Network security, monitoring for malicious traffic: event detection, e.g. attack signature detection
    • H04L 63/1466 — Network security, countermeasures against malicious traffic: active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks


Abstract

The invention discloses a voice spoofing attack detection method based on voice signal spectrum characteristics and deep learning. After a microphone of the electronic device receives a voice signal, the voice is first subjected to signal processing and specific features are extracted; the labeled features are then input into a classifier, the deep convolutional neural network SE-ResNet, for training. The trained classifier performs voice liveness detection on the voice signal to be tested and outputs whether the voice was uttered by a human or is the result of a voice attack. The invention can accurately and effectively detect voice spoofing attacks, represented by replay attacks, against speaker recognition systems.

Description

Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
Technical Field
The invention belongs to the technical field of voice authentication and security, and particularly relates to a voice recognition technique based on voice signal spectrum characteristics and a software processing method capable of detecting voice spoofing attacks against speaker recognition systems.
Background
A speaker authentication system is a security authentication system that identifies a speaker by extracting the speaker's voice features and learning and matching the feature patterns. Owing to its low hardware requirements (only a microphone is needed), low cost, simple user operation, and ability to perform remote contactless authentication, it has gradually become a mainstream mode of user authentication and access control, and is widely applied in devices such as smartphones, smart speakers, and smart homes.
However, existing voice authentication systems are generally vulnerable to voice spoofing attacks. A voice spoofing attack is an attack that deceives a voice authentication system by forging voice similar to that of a target user, thereby impersonating the target user to fraudulently obtain access rights. Common voice spoofing attacks include replay attacks, voice synthesis attacks, and voice conversion attacks. In a replay attack, the attacker deceives the voice authentication system by replaying real voice of the target user recorded in advance; in a voice synthesis attack, the attacker synthesizes fake target-user voice with the required content by means such as artificial intelligence or voice splicing; in a voice conversion attack, the attacker converts another person's voice into the target user's voice. With the development of voice technology and electronic equipment, the barrier to mounting a voice spoofing attack keeps falling while its effectiveness and harm keep growing. Under such circumstances, an efficient and low-cost detection method for voice spoofing attacks is needed.
The key to detecting attacks with spectral features is to extract features that differ substantially between the spectra of real voice and replay attacks.
Many related studies detect voice spoofing attacks through the noise and distortion the attacks introduce. However, this kind of detection method generally has low detection accuracy and is hard to apply once attack methods and devices are upgraded. There are also defense methods that perform liveness detection by having the user wear additional equipment; because extra hardware is required, such methods are costly and give a poor user experience.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a sound-source authentication detection method based on spectral features and the deep convolutional neural network SE-ResNet, namely a detection processing method capable of detecting spoofing attacks against a voice authentication system, so that voice spoofing attacks against speaker recognition systems, represented by replay attacks, can be accurately and effectively detected.
The technical scheme adopted by the invention is as follows:
after a microphone of the electronic device receives a voice signal, the voice is subjected to signal processing and specific features are extracted; the labeled features are then input into a deep convolutional neural network classifier for training. The trained classifier performs voice liveness detection on the voice signal to be tested and outputs whether the voice was produced by a human or is the result of a voice attack.
The method specifically comprises the following steps:
1) Signal processing:
For the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained via the following two-step processing:

The first step is a short-time Fourier transform (STFT): the original voice signal Voice_in is windowed with a periodic Hamming window of length 1024 and overlap length 768, dividing Voice_in into data frames of length 1024; a fast Fourier transform is then applied to each data frame, with the number of FFT points nfft = 4096;

The second step accumulates the fast Fourier transform results of the data frames to obtain a vector of length 4096; finally, the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow.
2) Characteristic extraction:
using as an accumulated power spectrum S pow And (4) performing feature extraction to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature and a linear prediction cepstrum coefficient.
3) Attack detection:
A classifier based on a squeeze-excitation residual network (the SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks; each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules, and a Sigmoid activation function.

The four features are input into the residual block. The output of the residual block passes sequentially through the global pooling layer, the two convolution modules, and the Sigmoid activation function and is then fed into a scale layer (scale); the output of the residual block is also fed directly into the scale layer. The scale layer's output is reshaped (Reshape) up to the input dimension and passed to the addition block; the original four features are simultaneously fed into the addition block, which performs a weighted operation and outputs the final probability prediction.
Step 2) is specifically as follows:
2.1) Low-frequency feature

Taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing: first, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; second, the values within each segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; third, the first 50 points of the intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features;
2.2) Energy distribution feature

First, the cumulative distribution function pow_cdf of the speech intermediate vector <pow> is computed and a cumulative distribution plot is drawn; then the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are obtained, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
the energy distribution of the voice in the above steps is processed and described by using the linearity characteristic of a cumulative distribution function (cdf).
2.3) Peak feature

The maxima of the cumulative distribution plot are computed, and every maximum larger than a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S, and a series of peak statistics is computed, including the total number of peaks N_peak in the peak data set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain the set of sixth-order polynomial coefficients P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
2.4) Linear prediction cepstral coefficients

The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC) as the fourth class of features.
The SE-ResNet50 algorithm structure comprises a first, second, third, fourth, and fifth convolutional layer group, an average pooling layer, a fully connected layer, and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of each speech signal together with a label indicating whether it is known to be speech uttered by a human voice or a replayed voice attack.
The method takes the residual network ResNet as its basic framework, adds shortcut connections within the network, and adds a squeeze-excitation structure, thereby alleviating the network degradation problem and improving the model's sensitivity to channel features.

Specifically, the model learns the importance of each feature channel and then increases the weights of the important feature channels according to that importance.
The invention selects four classes of features, uses them for recognition of the sound source, and provides an extraction algorithm for each. The advanced deep convolutional neural network SE-ResNet is selected as the classifier, and a detection method for voice spoofing attacks is constructed on the basis of the spectral features and SE-ResNet.

The invention records voice through the microphone of a smart device to obtain the voice signal, and extracts, through signal processing, four classes of features that effectively reflect the spectral differences between real voice and replayed attack voice. Based on the regular differences between real voice and replay attacks in the low-frequency peak features and the energy distribution, the features are input into the constructed deep convolutional neural network classifier SE-ResNet50, which then distinguishes real voice from replay attacks.
The invention can accurately and effectively detect the voice deception attack which is represented by replay attack and aims at the speaker recognition system.
The invention has the beneficial effects that:
the innovation point of the invention is that aiming at the difference of the replay voice and the real voice in the aspect of spectrum characteristics, 74-dimensional characteristics such as energy power characteristics, low-frequency characteristics and the like are provided, and effective characteristic data are provided for attack detection. In addition, SE-ResNet was established to be used for replay attack detection. In the voice spoofing attack, even if an attacker generates sound which is very similar to the voice of a real user, the sound necessarily causes a certain degree of nonlinear distortion when passing through a microphone and a loudspeaker, the spectral characteristics of the sound are inconsistent with those of the real user, and therefore the method can be used for detecting the voice spoofing attack.
The method can efficiently detect voice spoofing attacks using only the existing microphone and voice hardware of the voice authentication system. It features low cost and high attack-detection accuracy, can be used for the security protection of voice authentication systems on smart devices such as mobile phones, and has broad demand and application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 shows a spectrogram of real voice (left) and a spectrogram of a replay attack (right).
Fig. 3 is a flow chart of a real user issuing an instruction that is received by the smart device (top) and of a replay attack being performed (bottom).
FIG. 4 is a diagram of the SE-ResNet model architecture of the present invention.
Fig. 5 is a graph of the training process and results of the present invention on the ASVspoof2017 and ASVspoof2019 data sets.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
A complete embodiment of the method of the invention is as follows:
1) Signal processing:
As shown in fig. 1, for the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained via the following two-step processing:

The first step is a short-time Fourier transform (STFT): the original voice signal Voice_in is windowed with a periodic Hamming window of length 1024 (i.e., 1024 data points) and overlap length 768, dividing Voice_in into data frames of length 1024; a fast Fourier transform is then applied to each data frame, with the number of FFT points nfft = 4096;

The second step accumulates the fast Fourier transform results of the data frames to obtain a vector of length 4096; finally, the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow.
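As a concrete illustration, the two steps above can be sketched in Python with NumPy. Two assumptions are made: NumPy's `np.hamming` window is symmetric rather than strictly periodic, and "accumulating the FFT results" is read here as accumulating the per-frame power |FFT|²; the function name `cumulative_power_spectrum` is ours.

```python
import numpy as np

def cumulative_power_spectrum(voice, frame_len=1024, overlap=768, nfft=4096):
    """Cumulative power spectrum S_pow: Hamming-window the signal into
    overlapping 1024-sample frames, FFT each frame with nfft=4096,
    accumulate over frames, keep the first 2049 points."""
    hop = frame_len - overlap                     # 256-sample hop between frames
    window = np.hamming(frame_len)                # Hamming window (symmetric variant)
    n_frames = 1 + (len(voice) - frame_len) // hop
    s_pow = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + frame_len] * window
        s_pow += np.abs(np.fft.fft(frame, nfft)) ** 2   # accumulate per-frame power
    return s_pow[: nfft // 2 + 1]                 # first 2049 points (non-negative freqs)
```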
2) Feature extraction:
using as an accumulated power spectrum S pow And (4) performing feature extraction to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature and a linear prediction cepstrum coefficient.
Step 2) is specifically as follows:
2.1) Low-frequency feature

Taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing:

First, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; if the length of S_pow is not divisible by W, the leftover final segment is discarded. In this embodiment of the invention, W is taken as 10.

Second, the values within each speech segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow;

Third, the first 50 points of the speech intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features.

In this way the cumulative power spectrum S_pow is smoothed; in this embodiment, the low-frequency band below 2 kHz is selected for the low-frequency feature.
2.2) Energy distribution feature

First, the cumulative distribution function pow_cdf of the speech intermediate vector <pow> is computed and a cumulative distribution plot is drawn; then the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
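The feature can be sketched as below. The exact definitions used for ρ (correlation of the CDF with its index, i.e. how straight the cumulative curve is) and q (the leading coefficient of a quadratic fit to the CDF) are our assumptions, since the patent does not spell them out.

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """FV_2 = [rho, q] computed from the CDF of the intermediate vector <pow>."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)      # normalised cumulative distribution
    x = np.arange(len(cdf))
    rho = np.corrcoef(x, cdf)[0, 1]                 # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                    # quadratic curve fitting coefficient
    return np.array([rho, q])
```

For perfectly uniform energy the CDF is a straight line, so ρ approaches 1 and q approaches 0; deviations from that indicate a skewed energy distribution.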
2.3) Peak feature

The maxima of the cumulative distribution plot are computed, and every maximum larger than a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S, and a series of peak statistics is computed, including the total number of peaks N_peak in the peak data set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain the set of sixth-order polynomial coefficients P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
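A sketch of the peak feature under stated assumptions: peaks are local maxima above the threshold, and each peak's shape is fitted over a small symmetric neighbourhood (the window width and the function name are ours, not from the patent).

```python
import numpy as np

def peak_feature(spectrum, freqs, threshold, poly_order=6):
    """FV_3 = [N_peak, mu_peak, sigma_peak, P_est...]: local maxima above the
    threshold are peaks; stats are taken over their frequencies, and each
    peak's neighbourhood is fitted with a 6th-order polynomial."""
    # local maxima that exceed the preset threshold
    idx = [i for i in range(1, len(spectrum) - 1)
           if spectrum[i] > spectrum[i - 1]
           and spectrum[i] > spectrum[i + 1]
           and spectrum[i] > threshold]
    n_peak = len(idx)
    peak_freqs = freqs[idx] if n_peak else np.array([0.0])
    mu, sigma = peak_freqs.mean(), peak_freqs.std()
    p_est = []
    for i in idx:                                   # fit the shape around each peak
        lo, hi = max(0, i - poly_order), min(len(spectrum), i + poly_order + 1)
        p_est.extend(np.polyfit(np.arange(lo, hi), spectrum[lo:hi], poly_order))
    return np.concatenate([[n_peak, mu, sigma], p_est])
```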
2.4) Linear prediction cepstral coefficients

The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC); here the LPCC are 12th-order coefficients, and the 12 coefficients form a vector serving as the fourth class of features.
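One standard way to compute such coefficients, sketched here as an assumption since the patent gives no algorithm: LPC coefficients via Levinson-Durbin on the autocorrelation, followed by the usual LPC-to-cepstrum recursion.

```python
import numpy as np

def lpcc(signal, order=12):
    """12th-order LPCC: Levinson-Durbin LPC, then LPC-to-cepstrum recursion."""
    # autocorrelation r[0..order]
    r = np.array([signal[: len(signal) - k] @ signal[k:] for k in range(order + 1)])
    # Levinson-Durbin for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err                         # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    # cepstrum of 1/A(z): c[n] = -a[n] - sum_{m=1}^{n-1} (m/n) c[m] a[n-m]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum((m / n) * c[m] * a[n - m] for m in range(1, n))
    return c[1:]                               # the 12-dimensional LPCC vector
```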
3) Attack detection:
A classifier based on a squeeze-excitation residual network (the SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks; each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules, and a Sigmoid activation function.

The four features are input into the residual block. The output of the residual block passes sequentially through the global pooling layer, the two convolution modules, and the Sigmoid activation function and is then fed into the scale layer (scale); the output of the residual block is also fed directly into the scale layer. The scale layer's output is reshaped (Reshape) up to the input dimension and passed to the addition block; the original four features are simultaneously fed into the addition block, which performs a weighted operation and outputs the final probability prediction. If the probability is greater than 0.5, the voice is judged to be replayed attack voice; if it is less than 0.5, it is judged to be real voice.
The SE-ResNet50 algorithm structure comprises a first, second, third, fourth, and fifth convolutional layer group, an average pooling layer, a fully connected layer, and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of each speech signal together with a label indicating whether it is known to be speech uttered by a human voice or a replayed voice attack.
At the implementation level, the SE-ResNet architecture is shown in fig. 4 and comprises two operations: squeeze and excitation.
The original feature map has dimensions C × H × W, where C is the number of feature channels, H the height, and W the width; in this model the number of feature channels equals the total number of extracted features, 74. The squeeze operation, shown as the dashed box in fig. 4, compresses each H × W feature map into a C × 1 × 1 descriptor. After H × W is compressed into a single value, that value carries a global view of the former H × W region, so the convolution kernel's effective receptive field is wider.
In the excitation operation, fully connected layers are applied to the C × 1 × 1 descriptor obtained by the squeeze operation to predict the importance of each feature channel. Finally, normalization is performed with a Sigmoid function, and the normalized weights are applied to the features of each channel through the scale layer.
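The squeeze → excitation → scale sequence can be sketched in plain NumPy. The weight matrices `w1` and `w2` stand in for the two fully connected layers; their shapes and the ReLU in the bottleneck follow the usual SE design and are our assumptions, not taken from the patent.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation applied to a C x H x W feature map."""
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> C-vector
    s = np.maximum(w1 @ z, 0.0)          # excitation: bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # second FC + Sigmoid -> per-channel weights
    return x * s[:, None, None]          # scale: reweight each channel
```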
After the SE-ResNet architecture is obtained, 50 layers of squeeze-excitation residual blocks are deployed; the overall pipeline is shown in Table 1. Since distinguishing real sound from replayed sound is a binary classification problem, the final output dimension is set to 1, and the output is the probability that each voice under test is real voice.
TABLE 1 SE-ResNet50 Process framework
Fig. 3 is a schematic diagram of a replay attack. Compared with real voice, a replay attack passes through two extra links, microphone recording and loudspeaker playback, which necessarily alter the original signal. The sensitivity of a microphone or loudspeaker depends on how far its diaphragm deflects under sound pressure. Due to imperfections in the manufacturing process, microphones have limitations that ultimately result in inherent distortion; this nonlinear characteristic adds noise over the lower frequency range. Loudspeakers likewise introduce nonlinear distortion when reproducing sound. Despite great progress in producing high-quality sound, most loudspeakers still exhibit nonlinear behavior, especially in the low-frequency region. This nonlinearity has three main causes: (1) changes in the magnetic field caused by voice-coil excursion; (2) the nonlinear suspension stiffness of the voice coil; and (3) the self-inductance varying with voice-coil displacement. Although a voice spoofing attack can generate the fake voice signal in many ways, in an actual attack the attacker must play the fake signal to the targeted voice authentication system through a loudspeaker. Protection of the voice authentication system can therefore start from identifying the sound source, realizing detection of spoofing attacks.
The upper left corner of fig. 2 is a spectrogram of real voice; the other three are spectrograms of the same voice after playback through different loudspeakers. Comparing them yields the following observations: real voice fluctuates more visibly in the low band (quantitatively, it has more peaks), while the replay attacks fluctuate less (their peaks are concentrated); the energy distributions of real voice and replay attacks also differ, the replay attacks having a higher proportion of energy at 4-5 kHz.
The embodiments were tested with the ASVspoof 2017 and ASVspoof 2019 data sets, the standard data sets for voice spoofing attacks. The "ASVspoof challenge" is a special competition unit of Interspeech, the top international academic conference in the speech field, focusing on spoofing of automatic speaker verification systems.
First, the four classes of features are extracted from the training set data and each utterance is labeled as real voice or replayed voice; the labeled features are then used to train the neural network SE-ResNet. The trained SE-ResNet is then evaluated on the test set; the results are shown in fig. 5. An equal error rate (EER) of 2.38% was achieved on the ASVspoof 2017 data set and 0.163% on the ASVspoof 2019 PA data set, which would rank first in both of that year's competitions. The equal error rate is the error rate at the operating point where the false acceptance rate equals the false rejection rate; a smaller value indicates a more accurate detection system.
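For reference, the EER metric described above can be computed from a set of scores and labels as sketched below; the threshold sweep and the convention that higher scores mean "real voice" are our assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: error rate at the point where false acceptance and false
    rejection rates meet. scores: higher = more likely real voice;
    labels: 1 = real voice, 0 = replayed voice."""
    eer, gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # replayed voice accepted as real
        frr = np.mean(scores[labels == 1] < t)    # real voice rejected
        if abs(far - frr) < gap:                  # keep the most balanced threshold
            gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```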
In addition, in cross validation, training on the ASVspoof 2017 training and development sets and testing on the ASVspoof 2019 test set achieves an EER of 4.47%; training on the ASVspoof 2019 training and development sets and testing on the ASVspoof 2017 test set achieves an EER close to 0.

Claims (2)

1. A voice replay attack detection method based on spectral features and a deep convolutional neural network is characterized by comprising the following steps:
after a microphone of the electronic device receives a voice signal, performing signal processing on the voice, then extracting specific features, and finally inputting the labeled features into the deep convolutional neural network classifier SE-ResNet50 for training; performing voice liveness detection on the voice signal to be tested with the trained classifier, and outputting whether the voice was uttered by a human voice or is a replayed voice attack;
the method specifically comprises the following steps:
1) Signal processing:
for original Voice signal Voice in The following two steps of treatmentObtaining a cumulative power spectrum S pow
The first step adopts short-time Fourier transform, and the short-time Fourier transform process comprises the following steps: firstly, using a periodic Hamming window with length of 1024 and overlapping length of 768 to process the original Voice signal Voice in Performing windowing to obtain original Voice signal Voice in Dividing into 1024 data frames, and performing fast Fourier transform on each data frame by the number of Fourier transform points n fft 4096;
secondly, accumulating the results of the fast Fourier transform of each data frame to obtain a vector with the length of 4096, and finally taking the first 2049 data points of the vector as an accumulated power spectrum S pow
2) Feature extraction:
performing feature extraction on the cumulative power spectrum S_pow to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients;
the 2) is specifically as follows:
2.1) Low-frequency feature

taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing: first, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; second, the values within each speech segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; third, the first 50 points of the speech intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features;
2.2) Energy-distribution features
first, the cumulative distribution function pow_cdf of the voice intermediate vector <pow> is computed and its cumulative distribution diagram is drawn; then the linear correlation coefficient ρ and the quadratic curve-fitting coefficient q of the cumulative distribution function pow_cdf are calculated, forming the energy-distribution feature FV_2 = [ρ, q] as the second class of features;
2.3) Peak features
the maxima of the cumulative distribution diagram are calculated, and the points whose maxima exceed a preset threshold are taken as peaks; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S_peak, and a series of peak statistics is calculated, comprising the total number of peaks N_peak in the peak data set S_peak, the mean μ_peak of the frequencies corresponding to all peaks in S_peak, and the standard deviation σ_peak of those frequencies; the shape of each peak is fitted with a sixth-order polynomial to obtain the coefficient set P_est of the sixth-order polynomial; finally, the statistics and the sixth-order polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
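A minimal sketch of the peak statistics, assuming "maxima above a threshold" means thresholded local maxima of the curve; the per-peak sixth-order polynomial fit P_est is omitted for brevity, and all names are illustrative:

```python
import math

def peak_feature(curve, freqs, threshold):
    """Find local maxima of `curve` above `threshold` (the peaks), then
    return (N_peak, mu_peak, sigma_peak): the peak count and the mean and
    (population) standard deviation of the peak frequencies."""
    peaks = [(freqs[i], curve[i])
             for i in range(1, len(curve) - 1)
             if curve[i] > curve[i - 1]          # rising into the point
             and curve[i] >= curve[i + 1]        # falling (or flat) after it
             and curve[i] > threshold]           # above the preset threshold
    n_peak = len(peaks)
    if n_peak == 0:
        return 0, 0.0, 0.0
    mu = sum(f for f, _ in peaks) / n_peak
    var = sum((f - mu) ** 2 for f, _ in peaks) / n_peak
    return n_peak, mu, math.sqrt(var)
```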
2.4) Linear prediction cepstral coefficients
the original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC) as the fourth class of features;
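The claim does not detail the LPCC computation; the standard construction is autocorrelation, then the Levinson-Durbin recursion for LPC coefficients, then the LPC-to-cepstrum recursion. A stdlib sketch under that assumption (the order, 12, is also an assumed value, not fixed by the claim):

```python
def lpcc(signal, order=12):
    """LPCC sketch: autocorrelation -> Levinson-Durbin -> cepstrum recursion.
    Uses the predictor convention x[n] ~ sum_k a[k] * x[n-k]."""
    n = len(signal)
    # autocorrelation r[0..order]
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    # Levinson-Durbin recursion for LPC coefficients a[1..order]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1 - k * k)
    # LPC -> cepstrum: c[m] = a[m] + sum_{j<m} (j/m) * c[j] * a[m-j]
    c = [0.0] * (order + 1)
    for m in range(1, order + 1):
        c[m] = a[m] + sum((j / m) * c[j] * a[m - j] for j in range(1, m))
    return c[1:]
```

As a sanity check, a decaying exponential 0.5**n is the impulse response of a first-order AR process, so an order-1 fit recovers a coefficient near 0.5.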
3) Attack detection:
a squeeze-and-excitation residual network classifier is established; the squeeze-and-excitation residual network comprises 50 squeeze-and-excitation residual blocks, each comprising a residual block residual, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function; the four features are input into the residual block residual, whose output is passed sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function and then fed into a scale layer scale, into which the output of the residual block residual is simultaneously input; the output of the scale layer scale is raised to the input dimension by a Reshape operation and passed to the addition block, the original four features are input into the addition block at the same time, and after a weighting operation the addition block outputs the final probability prediction result.
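The dataflow of one squeeze-and-excitation residual block can be sketched numerically: global average pooling per channel (squeeze), two transforms with ReLU and Sigmoid (excitation, standing in for the two convolution modules), channel-wise rescaling (the scale layer), and the addition with the block input. This is a hand-rolled illustration of the SE mechanism, not the patented SE-ResNet50; the weights w1, w2 are stand-ins for learned parameters.

```python
import math

def se_residual(identity, residual_out, w1, w2):
    """One SE residual step on a [channels][positions] feature map:
    squeeze -> excitation -> scale, then add the identity input."""
    c = len(residual_out)
    r = len(w1)                                   # reduced channel dimension
    # squeeze: global average pooling over each channel
    z = [sum(ch) / len(ch) for ch in residual_out]
    # excitation: reduce + ReLU, then expand + Sigmoid per channel
    h = [max(0.0, sum(w1[i][j] * z[j] for j in range(c))) for i in range(r)]
    s = [1.0 / (1.0 + math.exp(-sum(w2[i][j] * h[j] for j in range(r))))
         for i in range(c)]
    # scale layer (channel-wise reweighting) followed by the addition block
    return [[s[i] * v + identity[i][k]
             for k, v in enumerate(residual_out[i])]
            for i in range(c)]
```

With zero weights the Sigmoid gates every channel at 0.5, so the output is half the residual path plus the identity, which makes the rescale-and-add structure easy to verify by hand.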
2. The voice replay attack detection method based on spectral features and a deep convolutional neural network according to claim 1, characterized in that: the deep convolutional neural network classifier SE-ResNet50 comprises a first convolutional layer group, a second convolutional layer group, a third convolutional layer group, a fourth convolutional layer group, a fifth convolutional layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence; when the classifier is trained, the squeeze-and-excitation residual network is trained by inputting the four features of voice signals together with labels indicating whether each voice is known to be produced by a human voice or to be a replayed voice attack.
CN202011061172.1A 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method Active CN112201255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011061172.1A CN112201255B (en) 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Publications (2)

Publication Number Publication Date
CN112201255A CN112201255A (en) 2021-01-08
CN112201255B true CN112201255B (en) 2022-10-21

Family

ID=74013928

Country Status (1)

Country Link
CN (1) CN112201255B (en)

Also Published As

Publication number Publication date
CN112201255A (en) 2021-01-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant