CN112201255A - Voice signal spectrum characteristic and deep learning voice spoofing attack detection method - Google Patents

Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Info

Publication number
CN112201255A
CN112201255A (application CN202011061172.1A; granted as CN112201255B)
Authority
CN
China
Prior art keywords
voice
peak
features
pow
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011061172.1A
Other languages
Chinese (zh)
Other versions
CN112201255B (en)
Inventor
徐文渊
冀晓宇
王炎
周瑜
薛晖
金子植
石卓杨
闫琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011061172.1A priority Critical patent/CN112201255B/en
Publication of CN112201255A publication Critical patent/CN112201255A/en
Application granted granted Critical
Publication of CN112201255B publication Critical patent/CN112201255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Detecting or protecting against malicious traffic
    • H04L63/1408: Monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/1466: Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a voice spoofing attack detection method based on voice signal spectrum features and deep learning. After a microphone of an electronic device receives a voice signal, the voice undergoes signal processing and specific features are extracted; the labeled features are then input into a classifier, the deep convolutional neural network SE-ResNet, for training. The trained classifier performs voice liveness detection on the voice signal to be tested and outputs whether the voice was uttered by a human or is the result of a voice attack. The invention can accurately and effectively detect voice spoofing attacks against speaker recognition systems, typified by replay attacks.

Description

Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
Technical Field
The invention belongs to the technical field of voice authentication and security, and particularly relates to a voice recognition technology based on voice signal spectrum features and a software processing method capable of detecting voice spoofing attacks against speaker recognition systems.
Background
The speaker authentication system is a security authentication system that identifies a speaker's identity by extracting the speaker's voice features and learning and matching the feature patterns. Owing to its low hardware requirements (only a microphone), low cost, simple user operation and support for remote, contactless authentication, it has gradually become a mainstream mode of user authentication and access control, and is widely applied in devices such as smartphones, smart speakers and smart homes.
However, existing voice authentication systems are generally vulnerable to voice spoofing attacks. A voice spoofing attack is an attack that deceives a voice authentication system by forging voice similar to that of a target user, thereby stealing the target user's access rights. Common voice spoofing attacks include replay attacks, voice synthesis attacks and voice conversion attacks. In a replay attack, the attacker deceives the voice authentication system by replaying real voice of the target user recorded in advance; in a voice synthesis attack, the attacker synthesizes false target-user voice with the required content by means such as artificial intelligence or voice splicing; in a voice conversion attack, the attacker converts another person's voice into the target user's voice. With the development of voice technology and electronic equipment, the barrier to mounting voice spoofing attacks is becoming lower and lower, and the harm greater and greater. Under these circumstances, it is therefore necessary to provide an efficient, low-cost voice spoofing attack detection method.
The key of using the spectrum characteristics to detect the attack is to extract the characteristics with large difference from the spectrums of the real voice and the replay attack.
Many related studies defend against such attacks by detecting the noise and distortion that recording and playback introduce into spoofed voice. However, this kind of detection method generally has low detection accuracy and is difficult to apply once the attack method and equipment are upgraded. There are also defense methods that perform liveness detection by having the user wear additional equipment; because of the extra hardware, these are costly and give a poor user experience.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a sound-source authentication method based on spectral features and the deep convolutional neural network SE-ResNet, and a detection processing method capable of detecting spoofing attacks against a voice authentication system, so that voice spoofing attacks against speaker recognition systems, typified by replay attacks, can be detected accurately and effectively.
The technical scheme adopted by the invention is as follows:
After a microphone of the electronic device receives a voice signal, the voice undergoes signal processing and specific features are extracted; the labeled features are then input into a deep convolutional neural network classifier for training, the trained classifier performs voice liveness detection on the voice signal to be tested, and the result, whether the voice was produced by a human voice or is a voice attack, is output.
The method specifically comprises the following steps:
1) signal processing:
For the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies a short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is performed on each data frame, with the number of FFT points nfft = 4096;
The second step accumulates the fast Fourier transform results of all data frames to obtain a vector of length 4096, and finally takes the first 2049 data points of this vector as the cumulative power spectrum S_pow.
2) Feature extraction:
using as an accumulated power spectrum SpowAnd (4) performing feature extraction to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature and a linear prediction cepstrum coefficient.
3) Attack detection:
A classifier based on the squeeze-excitation residual network (SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks, and each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four classes of features are input into the residual block; the output of the residual block passes in sequence through the global pooling layer, the two convolution modules and the Sigmoid activation function and is fed into a scale layer (scale), into which the residual block output is also fed directly. After a Reshape operation that restores the input dimensionality, the scale layer's output goes to the addition block, into which the original four classes of features are fed at the same time; after a weighting operation, the addition block outputs the final probability prediction.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
Taking the cumulative power spectrum S_pow obtained in signal processing as input, the low-frequency feature FV_1 is obtained by the following three-step process: the first step divides the cumulative power spectrum S_pow equally into segments of fixed length W; the second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of S_pow; the third step takes the first 50 points of <pow> as the low-frequency feature FV_1, a 50-dimensional vector, which serves as the first class of features;
2.2) energy distribution characteristics
First the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and plotted as a cumulative distribution diagram; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed to form the energy distribution feature FV_2 = [ρ, q], which serves as the second class of features;
the energy distribution of the voice in the above steps is processed and described by using the linearity characteristic of a cumulative distribution function (cdf).
2.3) Peak feature
The local maxima of the cumulative distribution diagram are computed, and the points whose maxima exceed a preset threshold are taken as peaks; the value of each peak together with its corresponding frequency in the cumulative power spectrum forms a peak data set S. A series of peak statistics is computed: the total number of peaks N_peak in the set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain a set of polynomial coefficients P_est. Finally, the statistics and the coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est], which serves as the third class of features;
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC) as the fourth class of features.
The SE-ResNet50 structure comprises a first, second, third, fourth and fifth convolutional layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained on the four classes of features of each voice signal together with a label indicating whether the voice is known to be uttered by a human or to be a replayed voice attack.
In the method, the residual network ResNet serves as the basic framework; shortcut connections are added in the network and the squeeze-excitation structure is introduced, which alleviates the network degradation problem and improves the model's sensitivity to channel features.
Specifically, the model acquires the importance of each feature channel through a learning method, and then the weight of the important feature channel is increased according to the importance.
The invention selects four classes of features, applies them to sound-source recognition, and provides extraction algorithms for them. The advanced deep convolutional neural network SE-ResNet is chosen as the classifier, and a voice spoofing attack detection method is built on the spectral features and SE-ResNet.
The invention captures voice through the microphone of a smart device to obtain voice signals and, through signal processing, extracts four features that effectively and faithfully reflect the spectral differences between real voice and replayed attack voice. Based on the regular differences between real voice and replay attacks in low-frequency peak characteristics and energy distribution, these features are input into the constructed deep convolutional neural network classifier SE-ResNet50 to distinguish real voice from replay attacks.
The invention can accurately and effectively detect the voice deception attack represented by the replay attack aiming at the speaker recognition system.
The invention has the beneficial effects that:
the innovation point of the invention is that aiming at the difference between the replay voice and the real voice in the aspect of spectrum characteristics, 74-dimensional characteristics such as energy power characteristics, low-frequency characteristics and the like are provided, and effective characteristic data are provided for attack detection. In addition, SE-ResNet was established to be used for replay attack detection. In the voice spoofing attack, even if an attacker generates sound which is very similar to the voice of a real user, the sound necessarily causes a certain degree of nonlinear distortion when passing through a microphone and a loudspeaker, the spectral characteristics of the sound are inconsistent with those of the real user, and therefore the method can be used for detecting the voice spoofing attack.
The voice spoofing attack detection method can efficiently detect voice spoofing attacks using the microphone and voice hardware the voice authentication system already has. It features low cost and high detection accuracy, can be used for the security protection of voice authentication systems on smart devices such as mobile phones, and has broad demand and application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a spectrogram (left) of real voice and a spectrogram (right) of replay attack.
Fig. 3 is a flow chart of an actual user issuing an instruction that is received by the smart device (top) and of a replay attack (bottom).
FIG. 4 is a diagram of the SE-ResNet model architecture of the present invention.
Fig. 5 is a graph of the training process and results of the present invention on the ASVspoof2017 and ASVspoof2019 data sets.
Detailed Description
The invention will be further explained with reference to the drawings.
The examples and embodiments of the method according to the invention are as follows:
1) signal processing:
As shown in Fig. 1, for the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained in the following two-step process:
The first step applies a short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 (i.e., 1024 data points) with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is performed on each data frame, with the number of FFT points nfft = 4096;
The second step accumulates the fast Fourier transform results of all data frames to obtain a vector of length 4096, and finally takes the first 2049 data points of this vector as the cumulative power spectrum S_pow.
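The two-step process above can be sketched in a few lines of numpy. This is a minimal sketch under the stated parameters (window length 1024, overlap 768, nfft = 4096); the per-frame accumulation of |FFT|² as "power", the symmetric `np.hamming` window (the patent specifies the periodic variant), and all function names are assumptions for illustration:

```python
import numpy as np

def cumulative_power_spectrum(voice, win_len=1024, overlap=768, nfft=4096):
    """Illustrative sketch of the patent's two-step accumulation."""
    hop = win_len - overlap                      # 256-sample frame advance
    window = np.hamming(win_len)                 # Hamming window (patent: periodic variant)
    n_frames = 1 + (len(voice) - win_len) // hop
    s = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + win_len] * window
        s += np.abs(np.fft.fft(frame, n=nfft)) ** 2   # accumulate per-frame power
    return s[: nfft // 2 + 1]                    # first 2049 points (one-sided spectrum)

# toy usage: a 440 Hz tone sampled at 16 kHz
s_pow = cumulative_power_spectrum(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

With nfft = 4096, bin spacing is fs/nfft, so a 440 Hz tone at 16 kHz concentrates its energy near bin 113.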
2) Feature extraction:
Feature extraction is performed with the cumulative power spectrum S_pow as input, yielding four classes of features: low-frequency features, energy distribution features, peak features, and linear prediction cepstral coefficients.
Step 2) is specifically as follows:
2.1) Low frequency characteristics
Taking the cumulative power spectrum S_pow obtained in signal processing as input, the low-frequency feature FV_1 is obtained by the following three-step process:
The first step divides the cumulative power spectrum S_pow equally into segments of fixed length W; if the length of S_pow is not divisible by W, the final incomplete segment is discarded. In the practice of the invention, W is taken to be 10.
The second step sums the values within each segment and takes the first 200 points to form the intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow;
the third step is to take the speech intermediate vector<pow>The first 50 points of (a) as low-frequency features FV1,FV1Is a 50-dimensional vector as a first class of features;
In this way the cumulative power spectrum S_pow is smoothed; in the implementation, the low-frequency points below 2 kHz are selected as the low-frequency feature.
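The three steps above reduce to a short numpy routine. A minimal sketch with W = 10 as stated; the function and variable names are illustrative, not the patent's:

```python
import numpy as np

def low_frequency_feature(s_pow, W=10):
    """Sketch: segment-sum smoothing of S_pow, then FV1 = first 50 points."""
    n = (len(s_pow) // W) * W                    # drop the incomplete final segment
    seg_sums = s_pow[:n].reshape(-1, W).sum(axis=1)
    pow_vec = seg_sums[:200]                     # smoothed intermediate vector <pow>
    fv1 = pow_vec[:50]                           # FV1: 50-dimensional low-frequency feature
    return pow_vec, fv1

# toy usage on a 2049-point cumulative power spectrum
pow_vec, fv1 = low_frequency_feature(np.arange(2049, dtype=float))
```

On the toy ramp input, the first segment sum is 0 + 1 + ... + 9 = 45.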
2.2) energy distribution characteristics
First the cumulative distribution function pow_cdf of the intermediate vector <pow> is computed and plotted as a cumulative distribution diagram; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are then computed to form the energy distribution feature FV_2 = [ρ, q], which serves as the second class of features;
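A minimal sketch of FV_2. The patent does not define ρ and q exactly; here ρ is interpreted as the correlation of pow_cdf with its index (how straight the CDF is) and q as the leading coefficient of a quadratic fit. Both interpretations are assumptions:

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """Sketch: FV2 = [rho, q] from the empirical CDF of <pow>."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)   # empirical cumulative distribution pow_cdf
    x = np.arange(len(cdf), dtype=float)
    rho = np.corrcoef(x, cdf)[0, 1]              # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                 # quadratic curve fitting coefficient
    return np.array([rho, q])                    # FV2 = [rho, q]

# a uniform energy distribution gives a perfectly linear CDF
fv2 = energy_distribution_feature(np.ones(200))
```

For uniform energy, the CDF is a straight line, so ρ is 1 and the quadratic coefficient vanishes.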
2.3) Peak feature
The local maxima of the cumulative distribution diagram are computed, and the points whose maxima exceed a preset threshold are taken as peaks; the value of each peak together with its corresponding frequency in the cumulative power spectrum forms a peak data set S. A series of peak statistics is computed: the total number of peaks N_peak in the set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain a set of polynomial coefficients P_est. Finally, the statistics and the coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est], which serves as the third class of features;
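The peak-statistics step can be sketched as follows. Peaks are taken as local maxima above the preset threshold; as a simplification, one sixth-order polynomial is fitted over all peak points (the patent fits each peak's shape individually), so this P_est is an illustrative stand-in:

```python
import numpy as np

def peak_feature(pow_vec, threshold=0.0, order=6):
    """Sketch: FV3 = [N_peak, mu_peak, sigma_peak, P_est]."""
    idx = [i for i in range(1, len(pow_vec) - 1)
           if pow_vec[i] > pow_vec[i - 1]
           and pow_vec[i] > pow_vec[i + 1]
           and pow_vec[i] > threshold]           # local maxima above the threshold
    n_peak = float(len(idx))
    mu = float(np.mean(idx)) if idx else 0.0     # mean of peak positions (frequencies)
    sigma = float(np.std(idx)) if idx else 0.0   # spread of peak positions
    if len(idx) > order:                         # need enough points for the fit
        p_est = np.polyfit(idx, [pow_vec[i] for i in idx], order)
    else:
        p_est = np.zeros(order + 1)
    return np.concatenate([[n_peak, mu, sigma], p_est])

# toy usage: a rectified sine has a regular train of peaks
fv3 = peak_feature(np.abs(np.sin(np.linspace(0.0, 20.0 * np.pi, 400))))
```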
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC); the coefficients are of order 12, and the vector of 12 LPCC coefficients serves as the fourth class of features.
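A sketch of 12th-order LPCC extraction. The patent does not fix the LPC estimation method; this assumes the autocorrelation method (Levinson-Durbin recursion) followed by the standard LPC-to-cepstrum recursion for a minimum-phase filter 1/A(z), A(z) = 1 + Σ a_n z^(-n). All names are illustrative:

```python
import numpy as np

def lpcc(signal, order=12):
    """Sketch: autocorrelation LPC via Levinson-Durbin, then LPC -> cepstrum."""
    # autocorrelation lags r[0..order]
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1 : len(signal) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):                # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err                           # reflection coefficient
        a[1 : i + 1] = a[1 : i + 1] + k * a[i - 1 :: -1][:i]
        err *= 1.0 - k * k
    # cepstrum recursion: c[n] = -a[n] - (1/n) * sum_{k<n} k * c[k] * a[n-k]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c[1:]                                 # 12-dimensional LPCC vector

# toy usage: an AR(1) process with coefficient 0.6
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
sig = np.zeros(4000)
for t in range(1, 4000):
    sig[t] = 0.6 * sig[t - 1] + noise[t]
coeffs = lpcc(sig)
```

For the AR(1) toy signal, the first cepstral coefficient should come out near 0.6.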
3) Attack detection:
A classifier based on the squeeze-excitation residual network (SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks, and each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four classes of features are input into the residual block; the output of the residual block passes in sequence through the global pooling layer, the two convolution modules and the Sigmoid activation function and is fed into a scale layer (scale), into which the residual block output is also fed directly. After a Reshape operation that restores the input dimensionality, the scale layer's output goes to the addition block, into which the original four classes of features are fed at the same time; after a weighting operation, the addition block outputs the final probability prediction. If the probability value is greater than 0.5, the voice is judged to be a replayed attack voice; if it is less than 0.5, it is judged to be real voice.
The SE-ResNet50 structure comprises a first, second, third, fourth and fifth convolutional layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained on the four classes of features of each voice signal together with a label indicating whether the voice is known to be uttered by a human or to be a replayed voice attack.
In a specific implementation level, the SE-ResNet architecture is shown in fig. 4, and includes 2 operations, namely squeeze (squeeze) and stimulus (excitation).
In the squeeze operation, the original feature map has dimension C × H × W, where C is the number of feature channels (in this model equal to the total number of extracted features, 74), H the height and W the width; it is compressed into a C × 1 × 1 feature map by global average pooling, as shown in the dashed-box portion of Fig. 4. After H × W is compressed into a single dimension, each resulting one-dimensional value carries a global view of the former H × W plane, so the sensing area of the convolution kernel is wider.
In the excitation operation, fully connected layers are applied to the C × 1 × 1 feature map obtained by the squeeze operation to predict the importance of each feature channel. Finally, normalization is performed through a Sigmoid function, and the normalized weights are applied to the features of each channel through the Scale layer.
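The squeeze and excitation operations described above amount, in a forward pass, to a few lines of numpy. This is a minimal sketch assuming the usual SE design of two fully connected layers with a reduction ratio (here 4, on 8 toy channels); the patent does not state the ratio, and all names and shapes are illustrative:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Sketch of an SE forward pass. x: feature map (C, H, W);
    w1: (C//r, C) and w2: (C, C//r) play the role of the two FC/conv modules."""
    z = x.mean(axis=(1, 2))                        # squeeze: global average pooling -> (C,)
    h = np.maximum(w1 @ z + b1, 0.0)               # excitation FC1 + ReLU (C -> C//r)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))       # FC2 + Sigmoid: channel weights in (0, 1)
    return x * s[:, None, None]                    # scale: reweight each feature channel

# toy usage: 8 channels, reduction ratio r = 4
rng = np.random.default_rng(0)
x = rng.random((8, 5, 5))
w1, b1 = rng.standard_normal((2, 8)), np.zeros(2)
w2, b2 = rng.standard_normal((8, 2)), np.zeros(8)
y = se_block(x, w1, b1, w2, b2)
```

Because the Sigmoid weights lie in (0, 1), the block can only attenuate channels of a non-negative input, which is exactly the per-channel reweighting the text describes.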
After the SE-ResNet architecture is obtained, the squeeze-excitation residual blocks are deployed in 50 layers; the overall flow is shown in Table 1. Since distinguishing real sound from reproduced sound is a binary classification problem, the final output dimension is set to 1, and the output is the probability that each voice under test is real voice.
TABLE 1 SE-ResNet50 flow framework
(Table 1 is reproduced as images in the original publication.)
Fig. 3 is a schematic diagram of a replay attack. Compared with real voice, a replay attack involves two extra stages, microphone recording and loudspeaker playback, which necessarily alter the original signal. The sensitivity of the microphone and the loudspeaker depends on how far the diaphragm deflects under the sound pressure. Due to imperfections in the manufacturing process, the microphone has limitations that ultimately result in inherent distortion; this nonlinear characteristic adds noise over the lower frequency range. Loudspeakers likewise introduce nonlinear distortion when reproducing sound. Despite great progress in producing high-quality sound, most loudspeakers still exhibit nonlinear behavior, especially in the low-frequency region. There are three main causes of this nonlinearity: (1) changes in the magnetic field caused by voice-coil excursion; (2) the nonlinear suspension stiffness of the voice coil; (3) the self-inductance varying with voice-coil displacement. Although a voice spoofing attack can generate the false voice signal in various ways, in an actual attack the attacker must play the false voice signal to the targeted voice authentication system through a loudspeaker. Protection of the voice authentication system can therefore start from identifying the sound source (the sounding body) to detect spoofing attacks.
The upper left corner of Fig. 2 is the spectrogram of real speech, and the other three are spectrograms of the same speech after replay through different loudspeakers. Comparing them yields the following observations: real voice fluctuates more noticeably in the low-frequency band (quantitatively, more peaks can be counted), while replay attacks fluctuate less (the peaks are concentrated); the energy distributions of real voice and replay attacks also differ, with replay attacks carrying a higher proportion of energy at 4-5 kHz.
The embodiments were tested on the ASVspoof 2017 and ASVspoof 2019 data sets, the standard data sets for voice spoofing attacks. The "ASVspoof Challenge" is a special competition track of Interspeech, the top international academic conference in the speech field, focusing on spoofing against automatic speaker recognition systems.
First, the four classes of features are extracted from the training-set data and labeled as real voice or replayed voice; the labeled features are then used to train the neural network SE-ResNet, and the trained SE-ResNet is verified on the test set. The verification results are shown in Fig. 5. An equal error rate (EER) of 2.38% was achieved on the ASVspoof 2017 data set and 0.163% on the ASVspoof 2019 PA data set, ranking first in both of that year's competitions. The equal error rate is the error value at which the false acceptance rate equals the false rejection rate; the smaller it is, the higher the accuracy of the detection system.
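The equal error rate quoted above can be computed with a simple threshold sweep. A minimal sketch; the score convention (higher score = more likely real voice, label 1 = genuine, 0 = spoofed) is an assumption:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sketch: sweep thresholds and return the point where the
    false-acceptance and false-rejection rates meet."""
    genuine = scores[labels == 1]
    spoof = scores[labels == 0]
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):                  # candidate decision thresholds
        far = np.mean(spoof >= t)                # spoofed samples wrongly accepted
        frr = np.mean(genuine < t)               # genuine samples wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# perfectly separated scores give an EER of 0
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.05])
labels = np.array([1, 1, 1, 0, 0, 0])
eer = equal_error_rate(scores, labels)
```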
In addition, the embodiment was cross-validated: training on the ASVspoof 2017 training and development sets and testing on the ASVspoof 2019 test set achieves an EER of 4.47%; training on the ASVspoof 2019 training and development sets and testing on the ASVspoof 2017 test set achieves an EER close to 0.

Claims (4)

1. A voice replay attack detection method based on spectral features and a deep convolutional neural network is characterized by comprising the following steps:
After a microphone of the electronic device receives a voice signal, the voice undergoes signal processing and specific features are extracted; the labeled features are then input into a deep convolutional neural network classifier for training, the trained classifier performs voice liveness detection on the voice signal to be tested, and the result, whether the voice was produced by a human voice or is a voice attack, is output.
2. The voice replay attack detection method based on the spectral feature and the deep convolutional neural network as claimed in claim 1, characterized in that: the method specifically comprises the following steps:
1) signal processing:
for original Voice signal VoiceinThe cumulative power spectrum S is obtained in the following two-step processpow
The first step applies a short-time Fourier transform, as follows: first, a periodic Hamming window of length 1024 with overlap length 768 is used to window the original voice signal Voice_in, dividing it into data frames of length 1024; then a fast Fourier transform is performed on each data frame, with the number of FFT points nfft = 4096;
The second step accumulates the fast Fourier transform results of all data frames to obtain a vector of length 4096, and finally takes the first 2049 data points of this vector as the cumulative power spectrum S_pow.
2) Feature extraction:
Feature extraction is performed with the cumulative power spectrum S_pow as input, yielding four classes of features: low-frequency features, energy distribution features, peak features, and linear prediction cepstral coefficients.
3) Attack detection:
A classifier based on the squeeze-excitation residual network (SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks, and each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function.
The four classes of features are input into the residual block; the output of the residual block passes in sequence through the global pooling layer, the two convolution modules and the Sigmoid activation function and is fed into a scale layer (scale), into which the residual block output is also fed directly. After a Reshape operation that restores the input dimensionality, the scale layer's output goes to the addition block, into which the original four classes of features are fed at the same time; after a weighting operation, the addition block outputs the final probability prediction.
3. The voice replay attack detection method based on spectral features and a deep convolutional neural network as claimed in claim 1, characterized in that step 2) is specifically as follows:
2.1) Low frequency characteristics
Taking the accumulated power spectrum S_pow obtained in signal processing as input, the low-frequency feature FV1 is obtained through the following three steps: first, the accumulated power spectrum S_pow is divided equally into speech segments of fixed length W; second, the values within each speech segment are summed and the first 200 points are taken to form the speech intermediate vector &lt;pow&gt;, completing the smoothing of the accumulated power spectrum S_pow; third, the first 50 points of the speech intermediate vector &lt;pow&gt; are taken as the low-frequency feature FV1, a 50-dimensional vector serving as the first class of features;
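The three steps above can be sketched as follows; the segment length `w=10` is an assumed value for illustration, since the claim leaves W unspecified:

```python
import numpy as np

def low_frequency_feature(s_pow, w=10):
    """Segment S_pow, sum each segment, keep <pow>'s first 50 points as FV1."""
    n_seg = len(s_pow) // w
    segments = np.asarray(s_pow[:n_seg * w]).reshape(n_seg, w)
    pow_vec = segments.sum(axis=1)[:200]   # smoothed intermediate vector <pow>
    return pow_vec[:50]                    # FV1: 50-dimensional low-frequency feature
```

With a 2049-point S_pow and w=10, the trailing 9 bins are discarded so the spectrum divides evenly into segments.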
2.2) energy distribution characteristics
First, the cumulative distribution function pow_cdf of the speech intermediate vector &lt;pow&gt; is computed and a cumulative distribution plot is drawn; the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of the cumulative distribution function pow_cdf are then calculated, forming the energy distribution feature FV2 = [ρ, q] as the second class of features;
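A sketch of this feature: ρ is taken as the linear correlation of the CDF against its bin index, and q as the leading coefficient of a quadratic fit to the CDF. The claim does not pin down which quadratic coefficient is meant, so that choice is an assumption:

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """FV2 = [rho, q] from the cumulative distribution of <pow>."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)   # cumulative distribution pow_cdf
    x = np.arange(len(cdf))
    rho = np.corrcoef(x, cdf)[0, 1]              # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                 # leading quadratic fit coefficient
    return np.array([rho, q])
```

For energy spread evenly across frequency the CDF is a straight line, so ρ approaches 1 and q approaches 0; energy concentrated in few bands bends the CDF and moves both values away from that baseline.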
2.3) Peak feature
The local maxima of the cumulative distribution plot are calculated, and each point whose maximum exceeds a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the accumulated power spectrum form the peak data set S. A series of peak statistics is then computed, including the total number of peaks N_peak in the peak data set S, the mean μ_peak of the frequencies corresponding to all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain the coefficient set P_est of the sixth-order polynomial. Finally, the statistics and the coefficient set form the peak feature FV3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
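This step can be sketched with a simple local-maximum scan; the ±5-bin fitting window around each peak (`half_width=5`) is an illustrative assumption, as the claim does not specify how wide a neighbourhood the sixth-order fit covers:

```python
import numpy as np

def peak_feature(s_pow, threshold, half_width=5):
    """FV3 = [N_peak, mu_peak, sigma_peak, P_est...] from peaks above threshold."""
    s_pow = np.asarray(s_pow, dtype=float)
    interior = s_pow[1:-1]
    is_peak = (interior > s_pow[:-2]) & (interior > s_pow[2:]) & (interior > threshold)
    idx = np.flatnonzero(is_peak) + 1            # peak positions (frequency bins)
    n_peak = len(idx)
    mu = idx.mean() if n_peak else 0.0           # mean peak frequency
    sigma = idx.std() if n_peak else 0.0         # std of peak frequencies
    p_est = []
    for p in idx:                                # 6th-order fit of each peak's shape
        lo, hi = max(0, p - half_width), min(len(s_pow), p + half_width + 1)
        x = np.arange(lo, hi)
        p_est.extend(np.polyfit(x, s_pow[lo:hi], 6))
    return np.concatenate(([n_peak, mu, sigma], p_est))
```

Each peak contributes 7 polynomial coefficients, so the feature length varies with the number of detected peaks.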
2.4) Linear prediction of cepstrum coefficients
The original voice signal Voice_in is processed to obtain Linear Prediction Cepstral Coefficients (LPCC) as the fourth class of features.
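One way to compute LPCC, sketched here under assumptions the claim does not fix: LPC coefficients via the autocorrelation method (solving the normal equations directly instead of Levinson-Durbin, for brevity), followed by the standard LPC-to-cepstrum recursion. Order 12 is an assumed choice, and the coefficients are computed over the whole signal, whereas in practice they are usually extracted frame by frame:

```python
import numpy as np

def lpcc(signal, order=12):
    """Linear prediction cepstral coefficients via the autocorrelation method."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1:][:order + 1]
    # Toeplitz normal equations R a = r[1..order] for the LPC coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # LPC -> cepstrum recursion: c_n = a_n + sum_{k<n} (k/n) c_k a_{n-k}
    c = np.zeros(order)
    for m in range(order):
        c[m] = a[m] + sum(((k + 1) / (m + 1)) * c[k] * a[m - 1 - k]
                          for k in range(m))
    return c
```

The recursion converts the prediction polynomial into cepstral coefficients without an explicit FFT of the log spectrum.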
4. The voice replay attack detection method based on spectral features and a deep convolutional neural network as claimed in claim 1, characterized in that the SE-ResNet50 structure comprises a first convolution layer group, a second convolution layer group, a third convolution layer group, a fourth convolution layer group, a fifth convolution layer group, an average pooling layer, a fully connected layer, and a Sigmoid layer connected in sequence; when the classifier is trained, the squeeze-and-excitation residual network is trained on the four features of each speech signal together with the corresponding label indicating whether it is genuine human speech or a replayed speech attack.
CN202011061172.1A 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method Active CN112201255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011061172.1A CN112201255B (en) 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011061172.1A CN112201255B (en) 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Publications (2)

Publication Number Publication Date
CN112201255A true CN112201255A (en) 2021-01-08
CN112201255B CN112201255B (en) 2022-10-21

Family

ID=74013928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011061172.1A Active CN112201255B (en) 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Country Status (1)

Country Link
CN (1) CN112201255B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831485A (en) * 2018-06-11 2018-11-16 东北师范大学 Method for distinguishing speek person based on sound spectrograph statistical nature
CN108875787A (en) * 2018-05-23 2018-11-23 北京市商汤科技开发有限公司 A kind of image-recognizing method and device, computer equipment and storage medium
CN110473569A (en) * 2019-09-11 2019-11-19 苏州思必驰信息科技有限公司 Detect the optimization method and system of speaker's spoofing attack


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012684A (en) * 2021-03-04 2021-06-22 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113192504A (en) * 2021-04-29 2021-07-30 浙江大学 Domain-adaptation-based silent voice attack detection method
CN113241079A (en) * 2021-04-29 2021-08-10 江西师范大学 Voice spoofing detection method based on residual error neural network
CN113506583B (en) * 2021-06-28 2024-01-05 杭州电子科技大学 Camouflage voice detection method using residual error network
CN113506583A (en) * 2021-06-28 2021-10-15 杭州电子科技大学 Disguised voice detection method using residual error network
CN113611329A (en) * 2021-07-02 2021-11-05 北京三快在线科技有限公司 Method and device for detecting abnormal voice
CN113611329B (en) * 2021-07-02 2023-10-24 北京三快在线科技有限公司 Voice abnormality detection method and device
CN113284513A (en) * 2021-07-26 2021-08-20 中国科学院自动化研究所 Method and device for detecting false voice based on phoneme duration characteristics
CN113284513B (en) * 2021-07-26 2021-10-15 中国科学院自动化研究所 Method and device for detecting false voice based on phoneme duration characteristics
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN116504226A (en) * 2023-02-27 2023-07-28 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN116504226B (en) * 2023-02-27 2024-01-02 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN117393000A (en) * 2023-11-09 2024-01-12 南京邮电大学 Synthetic voice detection method based on neural network and feature fusion
CN117393000B (en) * 2023-11-09 2024-04-16 南京邮电大学 Synthetic voice detection method based on neural network and feature fusion

Also Published As

Publication number Publication date
CN112201255B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN112201255B (en) Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN109036382B (en) Audio feature extraction method based on KL divergence
CN102394062B (en) Method and system for automatically identifying voice recording equipment source
US20170061978A1 (en) Real-time method for implementing deep neural network based speech separation
Wu et al. Identification of electronic disguised voices
CN103794207A (en) Dual-mode voice identity recognition method
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
Paul et al. Countermeasure to handle replay attacks in practical speaker verification systems
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
CN107274887A (en) Speaker&#39;s Further Feature Extraction method based on fusion feature MGFCC
Shang et al. Voice liveness detection for voice assistants using ear canal pressure
Adiban et al. Sut system description for anti-spoofing 2017 challenge
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN111782861A (en) Noise detection method and device and storage medium
Rupesh Kumar et al. A novel approach towards generalization of countermeasure for spoofing attack on ASV systems
Gupta et al. Deep convolutional neural network for voice liveness detection
Liu et al. Learnable nonlinear compression for robust speaker verification
Tian et al. Spoofing detection under noisy conditions: a preliminary investigation and an initial database
Ye et al. Detection of replay attack based on normalized constant q cepstral feature
Mardhotillah et al. Speaker recognition for digital forensic audio analysis using support vector machine
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
Mills et al. Replay attack detection based on voice and non-voice sections for speaker verification
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Shi et al. Anti-replay: A fast and lightweight voice replay attack detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant