CN112201255B - Voice signal spectrum characteristic and deep learning voice spoofing attack detection method - Google Patents


Info

Publication number: CN112201255B
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202011061172.1A
Other languages: Chinese (zh)
Other versions: CN112201255A
Inventors: 徐文渊, 冀晓宇, 王炎, 周瑜, 薛晖, 金子植, 石卓杨, 闫琛
Current assignee: Zhejiang University (ZJU) (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Zhejiang University (ZJU)
Application filed by Zhejiang University (ZJU); priority to CN202011061172.1A
Publications: CN112201255A (application), CN112201255B (grant); application granted


Classifications

    • G10L 17/02 — Speaker identification or verification techniques: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Speaker identification or verification techniques: training, enrolment or model building
    • G10L 17/06 — Speaker identification or verification techniques: decision making techniques; pattern matching strategies
    • H04L 63/1416 — Network security, monitoring for malicious traffic: event detection, e.g. attack signature detection
    • H04L 63/1466 — Network security, countermeasures against malicious traffic: active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks


Abstract

The invention discloses a voice spoofing attack detection method based on voice signal spectrum characteristics and deep learning. After a microphone of the electronic device receives a voice signal, the voice is first subjected to signal processing and specific features are extracted; the labeled features are then input into a classifier, the deep convolutional neural network SE-ResNet, for training. The trained classifier performs voice liveness detection on the voice signal to be tested and outputs whether the voice was uttered by a human or is the result of a voice attack. The invention can accurately and effectively detect voice spoofing attacks, represented by replay attacks, against speaker recognition systems.

Description

Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
Technical Field
The invention belongs to the technical field of voice authentication and security, and particularly relates to a voice recognition technique based on voice signal spectrum characteristics and a software processing method capable of detecting voice spoofing attacks against speaker recognition systems.
Background
A speaker authentication system is a security authentication system that identifies a speaker by extracting the speaker's voice features and learning and matching the feature patterns. Owing to its low hardware requirements (only a microphone is needed), low cost, simple user operation, and ability to perform remote contactless authentication, it has gradually become a mainstream mode of user authentication and access control, and is widely applied in devices such as smartphones, smart speakers, and smart homes.
However, existing voice authentication systems are generally vulnerable to voice spoofing attacks. A voice spoofing attack is an attack that deceives a voice authentication system by forging voice similar to that of a target user, thereby impersonating the target user to fraudulently obtain access rights. Common voice spoofing attacks include replay attacks, voice synthesis attacks, and voice conversion attacks. In a replay attack, the attacker deceives the voice authentication system by replaying real voice of the target user recorded in advance; in a voice synthesis attack, the attacker synthesizes fake target-user voice with the required content by means such as artificial intelligence or voice splicing; in a voice conversion attack, the attacker converts another person's voice into the target user's voice. With the development of voice technology and electronic equipment, the barrier to mounting a voice spoofing attack keeps falling while its effectiveness and harm keep growing. Under such circumstances, an efficient and low-cost detection method for voice spoofing attacks is needed.
The key to detecting attacks with spectral features is to extract features that differ substantially between the spectra of real voice and replay attacks.
Many related studies detect voice spoofing attacks through the noise and distortion the attacks introduce. However, this kind of detection method generally has low detection accuracy and is hard to apply once attack methods and devices are upgraded. There are also defense methods that perform liveness detection by having the user wear additional equipment; because extra hardware is required, such methods are costly and give a poor user experience.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a sound-source authentication detection method based on spectral features and the deep convolutional neural network SE-ResNet, namely a detection processing method capable of detecting spoofing attacks against a voice authentication system, so that voice spoofing attacks against speaker recognition systems, represented by replay attacks, can be accurately and effectively detected.
The technical scheme adopted by the invention is as follows:
after a microphone of the electronic device receives a voice signal, the voice is subjected to signal processing and specific features are extracted; the labeled features are then input into a deep convolutional neural network classifier for training. The trained classifier performs voice liveness detection on the voice signal to be tested and outputs whether the voice was produced by a human or is the result of a voice attack.
The method specifically comprises the following steps:
1) Signal processing:
For the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained via the following two-step processing:

The first step is a short-time Fourier transform (STFT): the original voice signal Voice_in is windowed with a periodic Hamming window of length 1024 and overlap length 768, dividing Voice_in into data frames of length 1024; a fast Fourier transform is then applied to each data frame, with the number of FFT points nfft = 4096;

The second step accumulates the fast Fourier transform results of the data frames to obtain a vector of length 4096; finally, the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow.
2) Characteristic extraction:
using as an accumulated power spectrum S pow And (4) performing feature extraction to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature and a linear prediction cepstrum coefficient.
3) Attack detection:
A classifier based on a squeeze-excitation residual network (the SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks; each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules, and a Sigmoid activation function.

The four features are input into the residual block. The output of the residual block passes sequentially through the global pooling layer, the two convolution modules, and the Sigmoid activation function and is then fed into a scale layer (scale); the output of the residual block is also fed directly into the scale layer. The scale layer's output is reshaped (Reshape) up to the input dimension and passed to the addition block; the original four features are simultaneously fed into the addition block, which performs a weighted operation and outputs the final probability prediction.
Step 2) is specifically as follows:
2.1) Low-frequency feature

Taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing: first, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; second, the values within each segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; third, the first 50 points of the intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features;
2.2) Energy distribution feature

First, the cumulative distribution function pow_cdf of the speech intermediate vector <pow> is computed and a cumulative distribution plot is drawn; then the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are obtained, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
the energy distribution of the voice in the above steps is processed and described by using the linearity characteristic of a cumulative distribution function (cdf).
2.3) Peak feature

The maxima of the cumulative distribution plot are computed, and every maximum larger than a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S, and a series of peak statistics is computed, including the total number of peaks N_peak in the peak data set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain the set of sixth-order polynomial coefficients P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
2.4) Linear prediction cepstral coefficients

The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC) as the fourth class of features.
The SE-ResNet50 algorithm structure comprises a first, second, third, fourth, and fifth convolutional layer group, an average pooling layer, a fully connected layer, and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of each speech signal together with a label indicating whether it is known to be speech uttered by a human voice or a replayed voice attack.
The method takes the residual network ResNet as its basic framework, adds shortcut connections within the network, and adds a squeeze-excitation structure, thereby alleviating the network degradation problem and improving the model's sensitivity to channel features.

Specifically, the model learns the importance of each feature channel and then increases the weights of the important feature channels according to that importance.
The invention selects four classes of features, uses them for recognition of the sound source, and provides an extraction algorithm for each. The advanced deep convolutional neural network SE-ResNet is selected as the classifier, and a detection method for voice spoofing attacks is constructed on the basis of the spectral features and SE-ResNet.

The invention records voice through the microphone of a smart device to obtain the voice signal, and extracts, through signal processing, four classes of features that effectively reflect the spectral differences between real voice and replayed attack voice. Based on the regular differences between real voice and replay attacks in the low-frequency peak features and the energy distribution, the features are input into the constructed deep convolutional neural network classifier SE-ResNet50, which then distinguishes real voice from replay attacks.
The invention can accurately and effectively detect the voice deception attack which is represented by replay attack and aims at the speaker recognition system.
The invention has the beneficial effects that:
the innovation point of the invention is that aiming at the difference of the replay voice and the real voice in the aspect of spectrum characteristics, 74-dimensional characteristics such as energy power characteristics, low-frequency characteristics and the like are provided, and effective characteristic data are provided for attack detection. In addition, SE-ResNet was established to be used for replay attack detection. In the voice spoofing attack, even if an attacker generates sound which is very similar to the voice of a real user, the sound necessarily causes a certain degree of nonlinear distortion when passing through a microphone and a loudspeaker, the spectral characteristics of the sound are inconsistent with those of the real user, and therefore the method can be used for detecting the voice spoofing attack.
The method can efficiently detect voice spoofing attacks using only the existing microphone and voice hardware of the voice authentication system. It features low cost and high attack-detection accuracy, can be used for the security protection of voice authentication systems on smart devices such as mobile phones, and has broad demand and application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 shows a spectrogram of real voice (left) and a spectrogram of a replay attack (right).
Fig. 3 is a flow chart of a real user issuing an instruction that is received by the smart device (top) and of a replay attack being performed (bottom).
FIG. 4 is a diagram of the SE-ResNet model architecture of the present invention.
Fig. 5 is a graph of the training process and results of the present invention on the ASVspoof2017 and ASVspoof2019 data sets.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
A complete embodiment of the method of the invention is as follows:
1) Signal processing:
As shown in fig. 1, for the original voice signal Voice_in, the cumulative power spectrum S_pow is obtained via the following two-step processing:

The first step is a short-time Fourier transform (STFT): the original voice signal Voice_in is windowed with a periodic Hamming window of length 1024 (i.e., 1024 data points) and overlap length 768, dividing Voice_in into data frames of length 1024; a fast Fourier transform is then applied to each data frame, with the number of FFT points nfft = 4096;

The second step accumulates the fast Fourier transform results of the data frames to obtain a vector of length 4096; finally, the first 2049 data points of this vector are taken as the cumulative power spectrum S_pow.
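As a concrete illustration, the two steps above can be sketched in Python with NumPy. Two assumptions are made: NumPy's `np.hamming` window is symmetric rather than strictly periodic, and "accumulating the FFT results" is read here as accumulating the per-frame power |FFT|²; the function name `cumulative_power_spectrum` is ours.

```python
import numpy as np

def cumulative_power_spectrum(voice, frame_len=1024, overlap=768, nfft=4096):
    """Cumulative power spectrum S_pow: Hamming-window the signal into
    overlapping 1024-sample frames, FFT each frame with nfft=4096,
    accumulate over frames, keep the first 2049 points."""
    hop = frame_len - overlap                     # 256-sample hop between frames
    window = np.hamming(frame_len)                # Hamming window (symmetric variant)
    n_frames = 1 + (len(voice) - frame_len) // hop
    s_pow = np.zeros(nfft)
    for i in range(n_frames):
        frame = voice[i * hop : i * hop + frame_len] * window
        s_pow += np.abs(np.fft.fft(frame, nfft)) ** 2   # accumulate per-frame power
    return s_pow[: nfft // 2 + 1]                 # first 2049 points (non-negative freqs)
```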
2) Feature extraction:
using as an accumulated power spectrum S pow And (4) performing feature extraction to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature and a linear prediction cepstrum coefficient.
Step 2) is specifically as follows:
2.1) Low-frequency feature

Taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing:

First, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; if the length of S_pow is not divisible by W, the leftover final segment is discarded. In this embodiment of the invention, W is taken as 10.

Second, the values within each speech segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow;

Third, the first 50 points of the speech intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features.

In this way the cumulative power spectrum S_pow is smoothed; in this embodiment, the low-frequency band below 2 kHz is selected for the low-frequency feature.
2.2) Energy distribution feature

First, the cumulative distribution function pow_cdf of the speech intermediate vector <pow> is computed and a cumulative distribution plot is drawn; then the linear correlation coefficient ρ and the quadratic curve fitting coefficient q of pow_cdf are computed, forming the energy distribution feature FV_2 = [ρ, q] as the second class of features;
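The feature can be sketched as below. The exact definitions used for ρ (correlation of the CDF with its index, i.e. how straight the cumulative curve is) and q (the leading coefficient of a quadratic fit to the CDF) are our assumptions, since the patent does not spell them out.

```python
import numpy as np

def energy_distribution_feature(pow_vec):
    """FV_2 = [rho, q] computed from the CDF of the intermediate vector <pow>."""
    cdf = np.cumsum(pow_vec) / np.sum(pow_vec)      # normalised cumulative distribution
    x = np.arange(len(cdf))
    rho = np.corrcoef(x, cdf)[0, 1]                 # linear correlation coefficient
    q = np.polyfit(x, cdf, 2)[0]                    # quadratic curve fitting coefficient
    return np.array([rho, q])
```

For perfectly uniform energy the CDF is a straight line, so ρ approaches 1 and q approaches 0; deviations from that indicate a skewed energy distribution.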
2.3) Peak feature

The maxima of the cumulative distribution plot are computed, and every maximum larger than a preset threshold is taken as a peak; the value of each peak and its corresponding frequency in the cumulative power spectrum form a peak data set S, and a series of peak statistics is computed, including the total number of peaks N_peak in the peak data set S, the mean μ_peak of the frequencies of all peaks in S, and the standard deviation σ_peak of those frequencies. The shape of each peak is fitted with a sixth-order polynomial to obtain the set of sixth-order polynomial coefficients P_est. Finally, the statistics and the polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
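A sketch of the peak feature under stated assumptions: peaks are local maxima above the threshold, and each peak's shape is fitted over a small symmetric neighbourhood (the window width and the function name are ours, not from the patent).

```python
import numpy as np

def peak_feature(spectrum, freqs, threshold, poly_order=6):
    """FV_3 = [N_peak, mu_peak, sigma_peak, P_est...]: local maxima above the
    threshold are peaks; stats are taken over their frequencies, and each
    peak's neighbourhood is fitted with a 6th-order polynomial."""
    # local maxima that exceed the preset threshold
    idx = [i for i in range(1, len(spectrum) - 1)
           if spectrum[i] > spectrum[i - 1]
           and spectrum[i] > spectrum[i + 1]
           and spectrum[i] > threshold]
    n_peak = len(idx)
    peak_freqs = freqs[idx] if n_peak else np.array([0.0])
    mu, sigma = peak_freqs.mean(), peak_freqs.std()
    p_est = []
    for i in idx:                                   # fit the shape around each peak
        lo, hi = max(0, i - poly_order), min(len(spectrum), i + poly_order + 1)
        p_est.extend(np.polyfit(np.arange(lo, hi), spectrum[lo:hi], poly_order))
    return np.concatenate([[n_peak, mu, sigma], p_est])
```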
2.4) Linear prediction cepstral coefficients

The original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC); here the LPCC are 12th-order coefficients, and the 12 coefficients form a vector serving as the fourth class of features.
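One standard way to compute such coefficients, sketched here as an assumption since the patent gives no algorithm: LPC coefficients via Levinson-Durbin on the autocorrelation, followed by the usual LPC-to-cepstrum recursion.

```python
import numpy as np

def lpcc(signal, order=12):
    """12th-order LPCC: Levinson-Durbin LPC, then LPC-to-cepstrum recursion."""
    # autocorrelation r[0..order]
    r = np.array([signal[: len(signal) - k] @ signal[k:] for k in range(order + 1)])
    # Levinson-Durbin for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err                         # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    # cepstrum of 1/A(z): c[n] = -a[n] - sum_{m=1}^{n-1} (m/n) c[m] a[n-m]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum((m / n) * c[m] * a[n - m] for m in range(1, n))
    return c[1:]                               # the 12-dimensional LPCC vector
```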
3) Attack detection:
A classifier based on a squeeze-excitation residual network (the SE-ResNet architecture) is established. The squeeze-excitation residual network comprises 50 squeeze-excitation residual blocks; each squeeze-excitation residual block comprises a residual block (residual), an addition block, a global pooling layer, two convolution modules, and a Sigmoid activation function.

The four features are input into the residual block. The output of the residual block passes sequentially through the global pooling layer, the two convolution modules, and the Sigmoid activation function and is then fed into the scale layer (scale); the output of the residual block is also fed directly into the scale layer. The scale layer's output is reshaped (Reshape) up to the input dimension and passed to the addition block; the original four features are simultaneously fed into the addition block, which performs a weighted operation and outputs the final probability prediction. If the probability is greater than 0.5, the voice is judged to be replayed attack voice; if it is less than 0.5, it is judged to be real voice.
The SE-ResNet50 algorithm structure comprises a first, second, third, fourth, and fifth convolutional layer group, an average pooling layer, a fully connected layer, and a Sigmoid layer connected in sequence. When the classifier is trained, the squeeze-excitation residual network is trained by inputting the four features of each speech signal together with a label indicating whether it is known to be speech uttered by a human voice or a replayed voice attack.
At the implementation level, the SE-ResNet architecture is shown in fig. 4 and comprises two operations: squeeze and excitation.
The original feature map has dimensions C × H × W, where C is the number of feature channels, H the height, and W the width; in this model the number of feature channels equals the total number of extracted features, 74. The squeeze operation, shown as the dashed box in fig. 4, compresses each H × W feature map into a C × 1 × 1 descriptor. After H × W is compressed into a single value, that value carries a global view of the former H × W region, so the convolution kernel's effective receptive field is wider.
In the excitation operation, fully connected layers are applied to the C × 1 × 1 descriptor obtained by the squeeze operation to predict the importance of each feature channel. Finally, normalization is performed with a Sigmoid function, and the normalized weights are applied to the features of each channel through the scale layer.
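The squeeze → excitation → scale sequence can be sketched in plain NumPy. The weight matrices `w1` and `w2` stand in for the two fully connected layers; their shapes and the ReLU in the bottleneck follow the usual SE design and are our assumptions, not taken from the patent.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation applied to a C x H x W feature map."""
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> C-vector
    s = np.maximum(w1 @ z, 0.0)          # excitation: bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # second FC + Sigmoid -> per-channel weights
    return x * s[:, None, None]          # scale: reweight each channel
```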
After the SE-ResNet architecture is obtained, 50 layers of squeeze-excitation residual blocks are deployed; the overall pipeline is shown in Table 1. Since distinguishing real sound from replayed sound is a binary classification problem, the final output dimension is set to 1, and the output is the probability that each voice under test is real voice.
TABLE 1 SE-ResNet50 Process framework
Fig. 3 is a schematic diagram of a replay attack. Compared with real voice, a replay attack passes through two extra links, microphone recording and loudspeaker playback, which necessarily alter the original signal. The sensitivity of a microphone or loudspeaker depends on how far its diaphragm deflects under sound pressure. Due to imperfections in the manufacturing process, microphones have limitations that ultimately result in inherent distortion; this nonlinear characteristic adds noise over the lower frequency range. Loudspeakers likewise introduce nonlinear distortion when reproducing sound. Despite great progress in producing high-quality sound, most loudspeakers still exhibit nonlinear behavior, especially in the low-frequency region. This nonlinearity has three main causes: (1) changes in the magnetic field caused by voice-coil excursion; (2) the nonlinear suspension stiffness of the voice coil; and (3) the self-inductance varying with voice-coil displacement. Although a voice spoofing attack can generate the fake voice signal in many ways, in an actual attack the attacker must play the fake signal to the targeted voice authentication system through a loudspeaker. Protection of the voice authentication system can therefore start from identifying the sound source, realizing detection of spoofing attacks.
The upper left corner of fig. 2 is a spectrogram of real voice; the other three are spectrograms of the same voice after playback through different loudspeakers. Comparing them yields the following observations: real voice fluctuates more visibly in the low band (quantitatively, it has more peaks), while the replay attacks fluctuate less (their peaks are concentrated); the energy distributions of real voice and replay attacks also differ, the replay attacks having a higher proportion of energy at 4-5 kHz.
The embodiments were tested with the ASVspoof 2017 and ASVspoof 2019 data sets, the standard data sets for voice spoofing attacks. The "ASVspoof challenge" is a special competition unit of Interspeech, the top international academic conference in the speech field, focusing on spoofing of automatic speaker verification systems.
First, the four classes of features are extracted from the training set data and each utterance is labeled as real voice or replayed voice; the labeled features are then used to train the neural network SE-ResNet. The trained SE-ResNet is then evaluated on the test set; the results are shown in fig. 5. An equal error rate (EER) of 2.38% was achieved on the ASVspoof 2017 data set and 0.163% on the ASVspoof 2019 PA data set, which would rank first in both of that year's competitions. The equal error rate is the error rate at the operating point where the false acceptance rate equals the false rejection rate; a smaller value indicates a more accurate detection system.
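For reference, the EER metric described above can be computed from a set of scores and labels as sketched below; the threshold sweep and the convention that higher scores mean "real voice" are our assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: error rate at the point where false acceptance and false
    rejection rates meet. scores: higher = more likely real voice;
    labels: 1 = real voice, 0 = replayed voice."""
    eer, gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # replayed voice accepted as real
        frr = np.mean(scores[labels == 1] < t)    # real voice rejected
        if abs(far - frr) < gap:                  # keep the most balanced threshold
            gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```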
In addition, in cross validation, training on the ASVspoof 2017 training and development sets and testing on the ASVspoof 2019 test set achieves an EER of 4.47%; training on the ASVspoof 2019 training and development sets and testing on the ASVspoof 2017 test set achieves an EER close to 0.

Claims (2)

1. A voice replay attack detection method based on spectral features and a deep convolutional neural network is characterized by comprising the following steps:
after a microphone of the electronic device receives a voice signal, performing signal processing on the voice, then extracting specific features, and finally inputting the labeled features into the deep convolutional neural network classifier SE-ResNet50 for training; performing voice liveness detection on the voice signal to be tested with the trained classifier, and outputting whether the voice was uttered by a human voice or is a replayed voice attack;
the method specifically comprises the following steps:
1) Signal processing:
for original Voice signal Voice in The following two steps of treatmentObtaining a cumulative power spectrum S pow
The first step adopts short-time Fourier transform, and the short-time Fourier transform process comprises the following steps: firstly, using a periodic Hamming window with length of 1024 and overlapping length of 768 to process the original Voice signal Voice in Performing windowing to obtain original Voice signal Voice in Dividing into 1024 data frames, and performing fast Fourier transform on each data frame by the number of Fourier transform points n fft 4096;
secondly, accumulating the results of the fast Fourier transform of each data frame to obtain a vector with the length of 4096, and finally taking the first 2049 data points of the vector as an accumulated power spectrum S pow
2) Feature extraction:
performing feature extraction on the cumulative power spectrum S_pow to obtain four features, namely a low-frequency feature, an energy distribution feature, a peak feature, and linear prediction cepstral coefficients;
the 2) is specifically as follows:
2.1) Low-frequency feature

taking the cumulative power spectrum S_pow obtained in the signal processing as input, the low-frequency feature FV_1 is obtained via the following three-step processing: first, the cumulative power spectrum S_pow is divided equally into speech segments of fixed length W; second, the values within each speech segment are summed and the first 200 points are taken to form the speech intermediate vector <pow>, completing the smoothing of the cumulative power spectrum S_pow; third, the first 50 points of the speech intermediate vector <pow> are taken as the low-frequency feature FV_1; FV_1 is a 50-dimensional vector serving as the first class of features;
2.2) Energy-distribution features
first, the cumulative distribution function pow_cdf of the voice intermediate vector <pow> is computed and its cumulative distribution diagram is drawn; then the linear correlation coefficient ρ and the quadratic curve-fitting coefficient q of the cumulative distribution function pow_cdf are calculated, forming the energy-distribution feature FV_2 = [ρ, q] as the second class of features;
2.3) Peak features
the maxima of the cumulative distribution diagram are calculated, and the points whose maxima exceed a preset threshold are taken as peaks; the value of each peak and its corresponding frequency in the cumulative power spectrum form the peak data set S_peak, and a series of peak statistics is calculated, comprising the total number of peaks N_peak in the peak data set S_peak, the mean μ_peak of the frequencies corresponding to all peaks in S_peak, and the standard deviation σ_peak of those frequencies; the shape of each peak is fitted with a sixth-order polynomial to obtain the coefficient set P_est of the sixth-order polynomial; finally, the statistics and the sixth-order polynomial coefficient set form the peak feature FV_3 = [N_peak, μ_peak, σ_peak, P_est] as the third class of features;
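A minimal sketch of the peak statistics, assuming "maxima above a threshold" means thresholded local maxima of the curve; the per-peak sixth-order polynomial fit P_est is omitted for brevity, and all names are illustrative:

```python
import math

def peak_feature(curve, freqs, threshold):
    """Find local maxima of `curve` above `threshold` (the peaks), then
    return (N_peak, mu_peak, sigma_peak): the peak count and the mean and
    (population) standard deviation of the peak frequencies."""
    peaks = [(freqs[i], curve[i])
             for i in range(1, len(curve) - 1)
             if curve[i] > curve[i - 1]          # rising into the point
             and curve[i] >= curve[i + 1]        # falling (or flat) after it
             and curve[i] > threshold]           # above the preset threshold
    n_peak = len(peaks)
    if n_peak == 0:
        return 0, 0.0, 0.0
    mu = sum(f for f, _ in peaks) / n_peak
    var = sum((f - mu) ** 2 for f, _ in peaks) / n_peak
    return n_peak, mu, math.sqrt(var)
```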
2.4) Linear prediction cepstral coefficients
the original voice signal Voice_in is processed to obtain linear prediction cepstral coefficients (LPCC) as the fourth class of features;
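The claim does not detail the LPCC computation; the standard construction is autocorrelation, then the Levinson-Durbin recursion for LPC coefficients, then the LPC-to-cepstrum recursion. A stdlib sketch under that assumption (the order, 12, is also an assumed value, not fixed by the claim):

```python
def lpcc(signal, order=12):
    """LPCC sketch: autocorrelation -> Levinson-Durbin -> cepstrum recursion.
    Uses the predictor convention x[n] ~ sum_k a[k] * x[n-k]."""
    n = len(signal)
    # autocorrelation r[0..order]
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    # Levinson-Durbin recursion for LPC coefficients a[1..order]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1 - k * k)
    # LPC -> cepstrum: c[m] = a[m] + sum_{j<m} (j/m) * c[j] * a[m-j]
    c = [0.0] * (order + 1)
    for m in range(1, order + 1):
        c[m] = a[m] + sum((j / m) * c[j] * a[m - j] for j in range(1, m))
    return c[1:]
```

As a sanity check, a decaying exponential 0.5**n is the impulse response of a first-order AR process, so an order-1 fit recovers a coefficient near 0.5.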
3) Attack detection:
a squeeze-and-excitation residual network classifier is established; the squeeze-and-excitation residual network comprises 50 squeeze-and-excitation residual blocks, each comprising a residual block residual, an addition block, a global pooling layer, two convolution modules and a Sigmoid activation function; the four features are input into the residual block residual, whose output is passed sequentially through the global pooling layer, the two convolution modules and the Sigmoid activation function and then fed into a scale layer scale, into which the output of the residual block residual is simultaneously input; the output of the scale layer scale is raised to the input dimension by a Reshape operation and passed to the addition block, the original four features are input into the addition block at the same time, and after a weighting operation the addition block outputs the final probability prediction result.
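The dataflow of one squeeze-and-excitation residual block can be sketched numerically: global average pooling per channel (squeeze), two transforms with ReLU and Sigmoid (excitation, standing in for the two convolution modules), channel-wise rescaling (the scale layer), and the addition with the block input. This is a hand-rolled illustration of the SE mechanism, not the patented SE-ResNet50; the weights w1, w2 are stand-ins for learned parameters.

```python
import math

def se_residual(identity, residual_out, w1, w2):
    """One SE residual step on a [channels][positions] feature map:
    squeeze -> excitation -> scale, then add the identity input."""
    c = len(residual_out)
    r = len(w1)                                   # reduced channel dimension
    # squeeze: global average pooling over each channel
    z = [sum(ch) / len(ch) for ch in residual_out]
    # excitation: reduce + ReLU, then expand + Sigmoid per channel
    h = [max(0.0, sum(w1[i][j] * z[j] for j in range(c))) for i in range(r)]
    s = [1.0 / (1.0 + math.exp(-sum(w2[i][j] * h[j] for j in range(r))))
         for i in range(c)]
    # scale layer (channel-wise reweighting) followed by the addition block
    return [[s[i] * v + identity[i][k]
             for k, v in enumerate(residual_out[i])]
            for i in range(c)]
```

With zero weights the Sigmoid gates every channel at 0.5, so the output is half the residual path plus the identity, which makes the rescale-and-add structure easy to verify by hand.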
2. The voice replay attack detection method based on spectral features and a deep convolutional neural network according to claim 1, characterized in that: the deep convolutional neural network classifier SE-ResNet50 comprises a first convolutional layer group, a second convolutional layer group, a third convolutional layer group, a fourth convolutional layer group, a fifth convolutional layer group, an average pooling layer, a fully connected layer and a Sigmoid layer connected in sequence; when the classifier is trained, the squeeze-and-excitation residual network is trained by inputting the four features of voice signals together with labels indicating whether each voice is known to be produced by a human voice or to be a replayed voice attack.
CN202011061172.1A 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method Active CN112201255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011061172.1A CN112201255B (en) 2020-09-30 2020-09-30 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Publications (2)

Publication Number Publication Date
CN112201255A CN112201255A (en) 2021-01-08
CN112201255B true CN112201255B (en) 2022-10-21

Family

ID=74013928

Country Status (1)

Country Link
CN (1) CN112201255B (en)

Also Published As

Publication number Publication date
CN112201255A (en) 2021-01-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant