CN112201255A - Voice signal spectrum characteristic and deep learning voice spoofing attack detection method - Google Patents
- Publication number
- CN112201255A (application number CN202011061172.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- peak
- features
- pow
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L17/02 Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 Speaker identification or verification: training, enrolment or model building
- G10L17/06 Speaker identification or verification: decision making techniques; pattern matching strategies
- H04L63/1416 Network security, detection of malicious traffic: event detection, e.g. attack signature detection
- H04L63/1466 Network security, countermeasures against malicious traffic: active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
Abstract
The invention discloses a voice spoofing attack detection method based on voice signal spectrum characteristics and deep learning. After a microphone of an electronic device receives a voice signal, the voice undergoes signal processing, specific features are then extracted, and finally the labeled features are input into a classifier built on the deep convolutional neural network SE-ResNet for training. The trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the product of a voice attack. The invention can accurately and effectively detect voice spoofing attacks against speaker recognition systems, as represented by replay attacks.
Description
Technical Field
The invention belongs to the technical field of voice authentication and security, and particularly relates to a voice recognition technique based on voice signal spectrum characteristics and a software processing method capable of detecting voice spoofing attacks against speaker recognition systems.
Background
A speaker authentication system is a security authentication system that identifies a speaker by extracting the speaker's voice characteristics and learning and matching the feature patterns. Owing to its low hardware requirements (only a microphone), low cost, simple user operation and support for remote, contactless authentication, it has gradually become a mainstream mode of user authentication and access control, widely applied in devices such as smartphones, smart speakers and smart homes.
However, existing voice authentication systems are generally vulnerable to voice spoofing attacks. A voice spoofing attack deceives a voice authentication system by forging voice similar to that of a target user, thereby fraudulently obtaining the target user's access rights. Common voice spoofing attacks include replay attacks, voice synthesis attacks and voice conversion attacks. In a replay attack, the attacker deceives the voice authentication system by replaying pre-recorded real voice of the target user; in a voice synthesis attack, the attacker synthesizes false target-user voice with the required content by means of artificial intelligence or voice splicing; in a voice conversion attack, the attacker converts another person's voice into the target user's voice. With the development of voice technology and electronic equipment, the barrier to mounting a voice spoofing attack keeps falling and the harm keeps growing. Under these circumstances, an efficient and low-cost voice spoofing attack detection method is therefore necessary.
The key to using spectrum characteristics to detect attacks is to extract features that differ greatly between the spectra of real voice and replay attacks.
Many related studies detect voice spoofing attacks by exploiting the noise and distortion that recording and playback introduce. However, such detection methods generally have low detection accuracy and are difficult to apply once attack methods and devices are upgraded. There are also defense methods that perform liveness detection by having the user wear additional equipment; because extra hardware is required, these methods are costly and offer a poor user experience.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a sound-source authentication detection method based on spectral features and the deep convolutional neural network SE-ResNet, a detection processing method capable of detecting spoofing attacks against voice authentication systems, so that voice spoofing attacks against speaker recognition systems, as represented by replay attacks, can be accurately and effectively detected.
The technical scheme adopted by the invention is as follows:
after a microphone of the electronic device receives a voice signal, signal processing is performed on the voice, then specific features are extracted, and finally the labeled features are input into a deep convolutional neural network classifier for training; the trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the result of a voice attack.
The method specifically comprises the following steps:
1) signal processing:
For the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies the short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is applied to each data frame, with the number of FFT points nfft equal to 4096;
In the second step, the fast Fourier transform results of all data frames are accumulated into a vector of length 4096, and finally the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow;
2) Feature extraction:
Feature extraction is performed on the cumulative power spectrum S_pow, yielding four features: a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients.
3) Attack detection:
A classifier based on the squeeze-and-excitation residual network architecture (SE-ResNet) is established; the network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four features are input into the residual block. The residual block's output passes sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function before entering the scale layer; the residual block's output is also fed directly into the scale layer. After a Reshape operation raises the result back to the input dimensions, the scale layer's output enters the addition block, into which the original four features are simultaneously input; after a weighting operation, the addition block outputs the final probability prediction.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained in the following three-step process: the first step divides S_pow equally into segments of fixed length W; the second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of S_pow; the third step takes the first 50 points of <pow> as the low-frequency feature FV_1, a 50-dimensional vector, which serves as the first class of features;
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution curve is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
In the above step, the energy distribution of the voice is processed and described using the linearity of the cumulative distribution function (CDF).
2.3) Peak feature
The maxima of the cumulative distribution curve are computed, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S. A series of peak statistics is then computed: the total number of peaks N_peak in S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial, yielding the coefficient set P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC) as the fourth class of features.
The SE-ResNet50 structure comprises a first, second, third, fourth and fifth convolution layer group, an average pooling layer, a fully connected layer and a Sigmoid layer, connected in sequence. When the classifier is trained, the squeeze-and-excitation residual network is trained on the four features of each voice signal together with its known label indicating whether the signal is human speech or a replayed voice attack.
In this method, the residual network ResNet serves as the basic framework; shortcut connections are used in the network and a squeeze-excitation structure is added, which alleviates the network degradation problem and improves the model's sensitivity to channel features.
Specifically, the model learns the importance of each feature channel and then increases the weights of the important feature channels accordingly.
The invention selects four classes of features, uses them for sound-source recognition, and provides an extraction algorithm for each. The advanced deep convolutional neural network SE-ResNet is selected as the classifier, and a voice spoofing attack detection method is constructed on the basis of the spectral features and SE-ResNet.
The invention captures voice through the microphone of a smart device to obtain voice signals and, through signal processing, extracts four features that effectively reflect the true spectral difference between real voice and replayed attack voice. Based on the regular differences between real voice and replay attacks in low-frequency peak characteristics and energy distribution, these features are input into the constructed deep convolutional neural network classifier SE-ResNet50 to distinguish real voice from replay attacks.
The invention can accurately and effectively detect the voice deception attack represented by the replay attack aiming at the speaker recognition system.
The invention has the beneficial effects that:
the innovation point of the invention is that aiming at the difference between the replay voice and the real voice in the aspect of spectrum characteristics, 74-dimensional characteristics such as energy power characteristics, low-frequency characteristics and the like are provided, and effective characteristic data are provided for attack detection. In addition, SE-ResNet was established to be used for replay attack detection. In the voice spoofing attack, even if an attacker generates sound which is very similar to the voice of a real user, the sound necessarily causes a certain degree of nonlinear distortion when passing through a microphone and a loudspeaker, the spectral characteristics of the sound are inconsistent with those of the real user, and therefore the method can be used for detecting the voice spoofing attack.
The voice spoofing attack detection method can efficiently detect voice spoofing attacks using only the microphone and voice hardware the voice authentication system already has. It features low cost and high attack detection accuracy, can be used for the security protection of voice authentication systems on smart devices such as mobile phones, and has broad demand and application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a spectrogram (left) of real voice and a spectrogram (right) of replay attack.
Fig. 3 is a flow chart of an actual user issuing an instruction that is received by the smart device (top), and of a replay attack being performed (bottom).
FIG. 4 is a diagram of the SE-ResNet model architecture of the present invention.
Fig. 5 is a graph of the training process and results of the present invention on the ASVspoof2017 and ASVspoof2019 data sets.
Detailed Description
The invention will be further explained with reference to the drawings.
The examples and embodiments of the method according to the invention are as follows:
1) signal processing:
As shown in fig. 1, for the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies the short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 (representing 1024 data points) with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is applied to each data frame, with the number of FFT points nfft equal to 4096;
In the second step, the fast Fourier transform results of all data frames are accumulated into a vector of length 4096, and finally the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow;
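The two-step process above can be sketched as follows. This is a minimal sketch rather than the patented implementation: taking the squared FFT magnitude as each frame's power contribution is an assumption the text does not spell out.

```python
import numpy as np

def cumulative_power_spectrum(voice, win_len=1024, overlap=768, nfft=4096):
    """Sketch of the two-step signal processing: windowed FFT per frame,
    then accumulation across frames (per-frame power = |FFT|^2 is assumed)."""
    window = np.hamming(win_len + 1)[:-1]  # periodic Hamming window of length 1024
    hop = win_len - overlap                # 256-sample hop, i.e. 768-sample overlap
    n_frames = 1 + (len(voice) - win_len) // hop
    s_pow = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + win_len] * window
        s_pow += np.abs(np.fft.fft(frame, n=nfft)) ** 2  # accumulate frame power
    return s_pow[: nfft // 2 + 1]          # first 2049 points: S_pow
```

With these parameters the result is always a 2049-point non-negative spectrum, corresponding to the non-negative frequency bins of the 4096-point FFT.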
2) Feature extraction:
Feature extraction is performed on the cumulative power spectrum S_pow, yielding four features: a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained in the following three-step process:
The first step divides S_pow equally into segments of fixed length W; if the length of S_pow is not divisible by W, the last incomplete segment is discarded. W is taken as 10 in the practice of the invention.
The second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of S_pow;
The third step takes the first 50 points of <pow> as the low-frequency feature FV_1, a 50-dimensional vector, which serves as the first class of features;
In this way the cumulative power spectrum S_pow is smoothed; in this implementation, low-band points below 2 kHz are selected as the low-frequency feature.
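The three steps above can be sketched as follows, assuming non-overlapping segment sums and returning both the intermediate vector <pow> and FV_1:

```python
import numpy as np

def low_frequency_feature(s_pow, W=10):
    """Sketch of step 2.1: non-overlapping segment sums smooth S_pow; the
    first 200 sums form the intermediate vector <pow>, whose first 50
    points are the low-frequency feature FV_1."""
    n_seg = len(s_pow) // W   # drop the trailing incomplete segment
    pow_vec = s_pow[: n_seg * W].reshape(n_seg, W).sum(axis=1)[:200]
    return pow_vec, pow_vec[:50]   # (<pow>, FV_1)
```

For a 2049-point S_pow and W = 10 this produces 204 segment sums, so both the 200-point <pow> vector and the 50-dimensional FV_1 are fully populated.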
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution curve is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
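The energy distribution step can be sketched as follows. The text does not define ρ and q precisely, so as assumptions ρ is taken to be the Pearson correlation between the CDF and its index, and q the leading coefficient of a quadratic fit:

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """Sketch of step 2.2: measure how close the cumulative distribution
    pow_cdf of <pow> is to a straight line (assumed formulation)."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)  # cumulative distribution pow_cdf
    x = np.arange(len(cdf))
    rho = np.corrcoef(x, cdf)[0, 1]             # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                # quadratic curve fitting coefficient
    return np.array([rho, q])                   # FV_2 = [rho, q]
```

A perfectly flat power distribution gives a perfectly linear CDF, so ρ approaches 1 and q approaches 0; concentrated energy bends the CDF and moves both away from those values.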
2.3) Peak feature
The maxima of the cumulative distribution curve are computed, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S. A series of peak statistics is then computed: the total number of peaks N_peak in S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial, yielding the coefficient set P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
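The peak statistics can be sketched as follows. Several assumptions are made for brevity: a peak is a strict local maximum above the threshold, frequencies are expressed as spectrum bin indices, and P_est is a single sixth-order polynomial fit of the whole curve standing in for the per-peak fits described in the text:

```python
import numpy as np

def peak_feature(s_pow, threshold):
    """Sketch of step 2.3: locate above-threshold local maxima, compute
    N_peak, mu_peak, sigma_peak, then append sixth-order fit coefficients."""
    mid = s_pow[1:-1]
    idx = np.where((mid > s_pow[:-2]) & (mid > s_pow[2:]) & (mid > threshold))[0] + 1
    n_peak = len(idx)                                    # total number of peaks
    mu_peak = idx.mean() if n_peak else 0.0              # mean peak frequency (bin)
    sigma_peak = idx.std() if n_peak else 0.0            # std of peak frequencies
    p_est = np.polyfit(np.arange(len(s_pow)), s_pow, 6)  # sixth-order coefficients
    return np.concatenate(([n_peak, mu_peak, sigma_peak], p_est))
```

The returned vector has length 10 under these assumptions: three statistics plus the seven coefficients of the sixth-order polynomial.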
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC); coefficients of order 12 are used, and the 12 LPC-derived coefficients form a vector serving as the fourth class of features.
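The LPCC extraction is not detailed in the text; the sketch below uses the textbook pipeline (autocorrelation, Levinson-Durbin recursion for 12th-order LPC, then the standard LPC-to-cepstrum recursion) as an assumed formulation:

```python
import numpy as np

def lpcc(signal, order=12):
    """Sketch of step 2.4: 12th-order LPC via Levinson-Durbin, followed by
    the standard LPC-to-cepstrum recursion (assumed textbook formulation)."""
    # Autocorrelation lags r[0..order]
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1 :][: order + 1]
    # Levinson-Durbin recursion for prediction coefficients a[1..order]
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1 : 0 : -1])) / e
        a[1:i] = a[1:i] - k * a[i - 1 : 0 : -1]
        a[i] = k
        e *= 1.0 - k * k
    # LPC -> cepstrum: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]   # 12-dimensional LPCC feature vector
```

In practice the recursion would be applied per frame after pre-emphasis and windowing; here it runs on the whole signal to keep the sketch short.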
3) Attack detection:
A classifier based on the squeeze-and-excitation residual network architecture (SE-ResNet) is established; the network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four features are input into the residual block. The residual block's output passes sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function before entering the scale layer; the residual block's output is also fed directly into the scale layer. After a Reshape operation raises the result back to the input dimensions, the scale layer's output enters the addition block, into which the original four features are simultaneously input; after a weighting operation, the addition block outputs the final probability prediction. If the probability is greater than 0.5, the voice is judged to be a replayed attack; if it is less than 0.5, it is judged to be real voice.
The SE-ResNet50 algorithm structure comprises a first convolution layer group, a second convolution layer group, a third convolution layer group, a fourth convolution layer group, a fifth convolution layer group, an average pooling layer, a full connection layer and a Sigmoid layer which are sequentially connected; the squeeze-excited residual network is trained by inputting four features of a speech signal and a corresponding label known whether a human voice utters speech or replays a speech attack when the classifier is trained.
In a specific implementation level, the SE-ResNet architecture is shown in fig. 4, and includes 2 operations, namely squeeze (squeeze) and stimulus (excitation).
In the squeeze operation, the original feature map has dimensions C × H × W, where C is the number of feature channels (here equal to the total of 74 extracted feature dimensions), H the height and W the width; it is compressed into a C × 1 × 1 feature map by global average pooling, as shown in the dashed box of fig. 4. After the H × W plane is compressed to a single value, the corresponding one-dimensional parameter carries a global view of the former H × W plane, giving the convolution kernel a wider sensing area.
In the excitation operation, a fully connected layer is applied to the C × 1 × 1 feature map obtained by the squeeze operation to predict the importance of each feature channel. Finally, normalization is performed with a Sigmoid function, and the normalized weights are applied to the features of each channel through the Scale layer.
After the SE-ResNet architecture is obtained, 50 layers of squeeze-excitation residual blocks are deployed; the whole pipeline is shown in Table 1. Since distinguishing real sound from replayed sound is a binary classification problem, the final output dimension is set to 1, and the output is the probability that each voice under test is real voice.
TABLE 1 SE-ResNet50 flow framework
Fig. 3 is a schematic diagram of a replay attack. Compared with real voice, a replay attack adds two links, microphone recording and loudspeaker playback, which necessarily alter the original signal. The sensitivity of a microphone or loudspeaker depends on how far its diaphragm deflects under sound pressure. Owing to imperfections in the manufacturing process, microphones have limitations that ultimately result in inherent distortion; this nonlinear characteristic adds noise over the lower frequency range. Loudspeakers likewise introduce nonlinear distortion when reproducing sound. Despite great progress in producing high-quality sound, most loudspeakers still exhibit nonlinear behavior, especially in the low-frequency region. There are three main causes of this nonlinearity: (1) changes in the magnetic field caused by voice coil excursion; (2) the nonlinear suspension stiffness of the voice coil; (3) the voice coil's self-inductance varying with its excursion. Although voice spoofing attacks can generate false voice signals in various ways, in an actual attack the attacker must play the false signal to the targeted voice authentication system through a loudspeaker. Protection of a voice authentication system can therefore start from identifying the sound source, realizing detection of spoofing attacks.
The upper left of fig. 2 is the spectrogram of a real speech sample; the other three are spectrograms of the same speech after replay through different loudspeakers. Comparison yields the following observations: real voice fluctuates more noticeably in the low band (quantitatively, more peaks are visible), whereas a replay attack fluctuates less (its peaks are concentrated); the energy distributions of real voice and replay attacks also differ, with replay attacks carrying a higher proportion of energy at 4-5 kHz.
The embodiments were tested with the ASVspoof 2017 and ASVspoof 2019 data sets, the standard data sets for voice spoofing attacks. The "ASVspoof Challenge" is a special competition track of Interspeech, the top international academic conference in the speech field, focusing on spoofing of automatic speaker verification systems.
First, the four classes of features described above are extracted from the training set data and labeled as real voice or replayed voice; the labeled features are then used to train the SE-ResNet neural network. The trained SE-ResNet is then evaluated on the test set. The results are shown in fig. 5: an Equal Error Rate (EER) of 2.38% is achieved on the ASVspoof2017 data set and 0.163% on the ASVspoof2019 PA data set, results that would rank first in both of that year's challenges. The equal error rate is the error rate at which the false acceptance rate equals the false rejection rate; a smaller value indicates a more accurate detection system.
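The equal error rate reported above can be computed with a simple threshold scan; this sketch returns the mean of the false acceptance and false rejection rates at the threshold where the two are closest:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sketch of the EER metric: scan candidate thresholds and return the
    error rate where false acceptance and false rejection are closest
    (labels: 1 = replay attack, 0 = genuine; higher score = more attack-like)."""
    best = (np.inf, 1.0)
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # genuine accepted as attack
        frr = np.mean(scores[labels == 1] < t)   # attack judged genuine
        best = min(best, (abs(far - frr), (far + frr) / 2.0))
    return best[1]
```

Perfectly separable scores yield an EER of 0; a classifier guessing at random tends toward 50%.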
In addition, in cross-validation experiments, training on the ASVspoof2017 training and development sets and testing on the ASVspoof2019 test set achieves an EER of 4.47%; training on the ASVspoof2019 training and development sets and testing on the ASVspoof2017 test set achieves an EER close to 0.
Claims (4)
1. A voice replay attack detection method based on spectral features and a deep convolutional neural network is characterized by comprising the following steps:
after a microphone of the electronic device receives a voice signal, signal processing is performed on the voice, then specific features are extracted, and finally the labeled features are input into a deep convolutional neural network classifier for training; the trained classifier performs voice liveness detection on the voice signal under test and outputs whether the voice was uttered by a human or is the result of a voice attack.
2. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that: the method specifically comprises the following steps:
1) signal processing:
for original Voice signal VoiceinThe cumulative power spectrum S is obtained in the following two-step processpow:
The first step adopts a short-time Fourier transform, whose process is as follows: the original voice signal Voice_in is windowed using a periodic Hamming window of length 1024 with an overlap length of 768, dividing Voice_in into a plurality of data frames of length 1024; a fast Fourier transform is then performed on each data frame, with the number of Fourier transform points nfft being 4096;
in the second step, the fast Fourier transform results of all data frames are accumulated to obtain a vector of length 4096, and the first 2049 data points of that vector are taken as the cumulative power spectrum S_pow;
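The two-step process above can be sketched in numpy as follows. Two caveats: the claim does not state whether the accumulated quantity is the FFT magnitude or the power, so the squared magnitude is assumed here; and `np.hamming` produces a symmetric rather than a strictly periodic Hamming window, which is close enough for illustration.

```python
import numpy as np

def cumulative_power_spectrum(voice, win_len=1024, overlap=768, nfft=4096):
    """Windowed FFT per frame, accumulated across frames; the first
    nfft//2 + 1 = 2049 bins are returned as S_pow."""
    hop = win_len - overlap                 # 256-sample hop between frames
    window = np.hamming(win_len)            # symmetric Hamming, stand-in for periodic
    n_frames = 1 + (len(voice) - win_len) // hop
    acc = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + win_len] * window
        acc += np.abs(np.fft.fft(frame, nfft)) ** 2   # power assumed (see lead-in)
    return acc[: nfft // 2 + 1]             # first 2049 points

s_pow = cumulative_power_spectrum(np.random.randn(16000))
print(s_pow.shape)   # (2049,)
```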
2) Feature extraction:
feature extraction is performed with the cumulative power spectrum S_pow as input, obtaining four features: low-frequency features, energy distribution features, peak features, and linear prediction cepstral coefficients.
3) Attack detection:
a squeeze-excitation residual network (SE-ResNet) classifier is established; the squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks, each of which comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function,
the four features are input into the residual block (residual); the output of the residual block is passed sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function and is then input into a scale layer (scale); the output of the residual block is also input into the scale layer directly; the scale layer performs a Reshape operation to raise its result back to the input dimension and outputs it to the addition block; the original four features are simultaneously input into the addition block, which performs a weighting operation and outputs the final probability prediction result.
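The squeeze-excitation gating described above (global pooling, two small convolution modules, Sigmoid, channel-wise rescaling) can be sketched in numpy as follows. The feature-map shape, the reduction ratio, and the ReLU between the two modules are illustrative assumptions, not details taken from the claim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map: global average
    pooling -> two small linear maps (standing in for the 1x1 convolution
    modules) -> sigmoid gate -> channel-wise rescaling of the input."""
    z = x.mean(axis=(1, 2))                      # squeeze: one value per channel
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))    # excitation with assumed ReLU
    return x * s[:, None, None]                  # scale: broadcast gate over H, W

rng = np.random.default_rng(0)
c, h, w = 8, 4, 4
x = rng.standard_normal((c, h, w))
w1 = rng.standard_normal((c // 4, c))   # reduction ratio 4, an assumption
w2 = rng.standard_normal((c, c // 4))
y = se_block(x, w1, w2)
print(y.shape)   # (8, 4, 4)
```

Because the gate lies in (0, 1), each channel of the output is a damped copy of the corresponding input channel, which is the re-weighting the scale layer performs.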
3. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that step 2) is specifically as follows:
2.1) Low frequency characteristics
The cumulative power spectrum S_pow obtained in signal processing is taken as input, and the low-frequency feature FV_1 is obtained by the following three-step process: the first step equally divides the cumulative power spectrum S_pow into segments of fixed length W; the second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; the third step takes the first 50 points of the intermediate vector <pow> as the low-frequency feature FV_1, a 50-dimensional vector serving as the first class of features;
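The three-step process of 2.1 can be sketched as follows. The segment length is an assumption (the claim only calls it W), and a random spectrum stands in for a real S_pow.

```python
import numpy as np

def low_frequency_features(s_pow, w=8):
    """Segment S_pow into chunks of length w, sum each chunk, keep the
    first 200 sums as the smoothed intermediate vector <pow>, then its
    first 50 entries as the 50-dimensional feature FV_1."""
    n = len(s_pow) // w * w                      # drop the ragged tail
    sums = s_pow[:n].reshape(-1, w).sum(axis=1)  # one sum per segment
    pow_vec = sums[:200]                         # smoothed intermediate vector <pow>
    fv1 = pow_vec[:50]                           # low-frequency feature FV_1
    return fv1, pow_vec

fv1, pow_vec = low_frequency_features(np.abs(np.random.randn(2049)))
print(fv1.shape)   # (50,)
```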
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and a cumulative distribution diagram is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of the cumulative distribution function pow_cdf are then calculated, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
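A minimal sketch of step 2.2, with two stated assumptions: ρ is taken as the correlation of the CDF with its index, and q as the quadratic term of a degree-2 polynomial fit, since the claim does not pin either down.

```python
import numpy as np

def energy_distribution_features(pow_vec):
    """Normalized cumulative distribution of <pow>, its linear
    correlation with the index (rho), and the leading coefficient of a
    quadratic fit (q)."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)   # pow_cdf, rising from ~0 to 1
    idx = np.arange(len(cdf))
    rho = np.corrcoef(idx, cdf)[0, 1]            # linear correlation coefficient
    q = np.polyfit(idx, cdf, 2)[0]               # quadratic fitting coefficient
    return np.array([rho, q])

fv2 = energy_distribution_features(np.abs(np.random.randn(200)) + 0.1)
print(fv2.shape)   # (2,)
```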
2.3) Peak feature
The maxima of the cumulative distribution diagram are calculated, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S; a series of peak statistics is then calculated, including the total number N_peak of peaks in the peak data set S, the mean μ_peak of the frequencies corresponding to all peaks in the peak data set S, and the standard deviation σ_peak of the frequencies corresponding to all peaks in the peak data set S; the shape of each peak is fitted with a sixth-order polynomial to obtain the set P_est of sixth-order polynomial coefficients; finally, the statistics and the sixth-order polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
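Step 2.3 can be sketched as below: local maxima above a threshold are treated as peaks, their positions yield N_peak, μ_peak and σ_peak, and a sixth-order polynomial is fitted to each peak's neighbourhood. The neighbourhood half-width is an assumption the claim does not specify, and the fit is done in a centered coordinate for numerical conditioning.

```python
import numpy as np

def peak_features(spec, threshold, half_width=5):
    """Peaks = local maxima above threshold; returns the peak count,
    mean and standard deviation of peak positions, and the concatenated
    sixth-order polynomial coefficients of each peak's local shape."""
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > threshold and spec[i] >= spec[i - 1] and spec[i] >= spec[i + 1]]
    n_peak = len(peaks)
    mu_peak = float(np.mean(peaks)) if peaks else 0.0
    sigma_peak = float(np.std(peaks)) if peaks else 0.0
    p_est = []
    for i in peaks:
        lo, hi = max(0, i - half_width), min(len(spec), i + half_width + 1)
        x = np.arange(lo, hi) - i                    # centered coordinate
        p_est.extend(np.polyfit(x, spec[lo:hi], 6))  # 7 coefficients per peak
    return n_peak, mu_peak, sigma_peak, np.array(p_est)

# two synthetic bumps; only bumps taller than the threshold count as peaks
spec = np.concatenate([np.hanning(21) * 3, np.zeros(30), np.hanning(21) * 2])
n, mu, sd, p = peak_features(spec, threshold=1.0)
print(n)   # 2
```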
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC) as the fourth class of features.
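Step 2.4 only names LPCC; one common way to compute them, sketched below, is LPC via the Levinson-Durbin recursion on the autocorrelation sequence followed by the standard LPC-to-cepstrum recursion. The prediction order of 12 is an assumption, as the claim does not fix it.

```python
import numpy as np

def lpcc(signal, order=12):
    """LPC via Levinson-Durbin, then LPC-to-cepstrum recursion."""
    # autocorrelation lags 0..order
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:][: order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    c = np.zeros(order + 1)                # LPC -> cepstral coefficients
    for n in range(1, order + 1):
        c[n] = -a[n] - sum((j / n) * c[j] * a[n - j] for j in range(1, n))
    return c[1:]

rng = np.random.default_rng(1)
sig = np.sin(0.3 * np.arange(800)) + 0.05 * rng.standard_normal(800)
coeffs = lpcc(sig)
print(coeffs.shape)   # (12,)
```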
4. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that: the SE-ResNet50 algorithm structure comprises a first convolution layer group, a second convolution layer group, a third convolution layer group, a fourth convolution layer group, a fifth convolution layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence; when the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of a voice signal together with a corresponding label indicating whether the speech is known to be uttered by a human voice or to be a replayed voice attack.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011061172.1A CN112201255B (en) | 2020-09-30 | 2020-09-30 | Voice signal spectrum characteristic and deep learning voice spoofing attack detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112201255A true CN112201255A (en) | 2021-01-08 |
CN112201255B CN112201255B (en) | 2022-10-21 |
Family
ID=74013928
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112201255B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113012684A (en) * | 2021-03-04 | 2021-06-22 | 电子科技大学 | Synthesized voice detection method based on voice segmentation |
CN113192504A (en) * | 2021-04-29 | 2021-07-30 | 浙江大学 | Domain-adaptation-based silent voice attack detection method |
CN113241079A (en) * | 2021-04-29 | 2021-08-10 | 江西师范大学 | Voice spoofing detection method based on residual error neural network |
CN113284513A (en) * | 2021-07-26 | 2021-08-20 | 中国科学院自动化研究所 | Method and device for detecting false voice based on phoneme duration characteristics |
CN113488027A (en) * | 2021-09-08 | 2021-10-08 | 中国科学院自动化研究所 | Hierarchical classification generated audio tracing method, storage medium and computer equipment |
CN113506583A (en) * | 2021-06-28 | 2021-10-15 | 杭州电子科技大学 | Disguised voice detection method using residual error network |
CN113611329A (en) * | 2021-07-02 | 2021-11-05 | 北京三快在线科技有限公司 | Method and device for detecting abnormal voice |
CN116504226A (en) * | 2023-02-27 | 2023-07-28 | 佛山科学技术学院 | Lightweight single-channel voiceprint recognition method and system based on deep learning |
CN117393000A (en) * | 2023-11-09 | 2024-01-12 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108831485A (en) * | 2018-06-11 | 2018-11-16 | 东北师范大学 | Method for distinguishing speek person based on sound spectrograph statistical nature |
CN108875787A (en) * | 2018-05-23 | 2018-11-23 | 北京市商汤科技开发有限公司 | A kind of image-recognizing method and device, computer equipment and storage medium |
CN110473569A (en) * | 2019-09-11 | 2019-11-19 | 苏州思必驰信息科技有限公司 | Detect the optimization method and system of speaker's spoofing attack |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||