CN112599149A - Detection method and device for replay attack voice - Google Patents

Detection method and device for replay attack voice Download PDF

Info

Publication number
CN112599149A
CN112599149A
Authority
CN
China
Prior art keywords
voice
amplitude
support vector
vector machine
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011455471.3A
Other languages
Chinese (zh)
Other versions
CN112599149B (en)
Inventor
周颖慧
孟子厚
刘亚丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202011455471.3A priority Critical patent/CN112599149B/en
Priority claimed from CN202011455471.3A external-priority patent/CN112599149B/en
Publication of CN112599149A publication Critical patent/CN112599149A/en
Application granted granted Critical
Publication of CN112599149B publication Critical patent/CN112599149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a method and a device for detecting replay attack voice. The method comprises the following steps: acquiring normal voice samples and recorded-and-replayed voice samples; preprocessing the voice samples to obtain feature vectors of acoustic parameters as a sample set; training a support vector machine on the preprocessed sample set to obtain a support vector machine model; and acquiring a voice sample to be tested and inputting it into the support vector machine model to obtain a recognition result. The detection method of the invention is unaffected by device type and text content, which greatly improves the generality of detection.

Description

Detection method and device for replay attack voice
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting replay attack voice.
Background
At present, speaker recognition (SR) is a biometric technology that identifies a speaker by comparing the similarity between voiceprint feature vectors extracted from the original voice and the corresponding voiceprint templates, exploiting differences in the vocal tract structures and pronunciation habits of different speakers. However, the application scenarios of speaker recognition include telephony and other distributed, unattended settings without manual supervision or face-to-face contact, so speaker recognition is more susceptible to malicious interference than other biometric recognition processes. When identity authentication is carried out through speaker recognition, the authentication process has no practical significance if the threat posed by spoofing attacks is not considered. There are four main sources of spoofing attacks against speaker recognition: (1) speaker impersonation attacks; (2) speech synthesis attacks; (3) voice conversion attacks; and (4) replay attacks using recordings. Replay attacks require no specialized knowledge or technology, and with the continuous improvement in the quality of recording equipment and the continuous reduction in its cost, record-and-replay has become an effective and common spoofing attack.
Current replay attack detection methods fall mainly into the following categories. First, channel differences: based on the difference between the original voice channel and the playback voice channel, silent sections or pattern noise in speech segments are taken as the object of study, and channel information is extracted and analyzed to prevent a forger from intruding through replay attack detection based on channel information; such algorithms target specific illegal recording devices, and the diversity of device types means this detection approach lacks universality. Second, coding information: the voice produced by the target speaker is captured by an acquisition device, which converts the analog signal into a digital signal to complete the recording; for a replayed voice signal, the transmission chain includes multiple coding stages, and the covert recording device and the playback device introduce coding distortion and amplification distortion, so the replayed voice carries information about the devices involved; this approach does not account for noise interference, and a change of equipment may require corrections to the algorithm. Third, spectral similarity features: replayed attack voice is recognized by analyzing and judging the degree of spectral similarity; this method is only applicable to text-dependent detection systems, and as the number of verifications grows, so does the storage requirement, ultimately reducing the working efficiency of the system, so it has shortcomings in practical applications. Fourth, multi-modal features: compared with single-modality speaker recognition, combining biometric features from multiple modalities is more effective against spoofing attacks; for example, fusing scores from the three modalities of face, lip movement, and voice in a speaker identity authentication system improves its accuracy and security; however, this approach is not as convenient to operate as single-modality detection techniques.
The existing parameters for replay voice detection are mainly obtained by direct objective measurement of the voice signal, and each has its limitations: some target only specific devices and lack universality; some require text dependence and lack flexibility; multi-modal approaches are inconvenient to operate; and so on.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for detecting replay attack voice that performs feature training with acoustic parameters characterizing voice quality, based on the auditory difference between the original voice and the replayed voice. It can detect replay attacks, is unaffected by device type and text content, and greatly improves the generality of detection.
The technical scheme adopted by the invention is as follows:
A method for detecting replay attack voice comprises the following steps: acquiring normal voice samples and recorded-and-replayed voice samples; preprocessing the voice samples to obtain feature vectors of acoustic parameters as a sample set, wherein the feature vectors of the acoustic parameters comprise: amplitude perturbations, glottal noise features, and a smoothed cepstrum; training a support vector machine on the preprocessed sample set to obtain a support vector machine model; and acquiring a voice sample to be tested and inputting it into the support vector machine model to obtain a recognition result.
According to one embodiment of the present invention, preprocessing the voice samples to obtain feature vectors of acoustic parameters comprises: extracting voiced segments from the voice samples; obtaining the pitch period using an autocorrelation function; framing according to the pitch period; and performing feature extraction on the framed voice signal to obtain the amplitude perturbations and glottal noise features.
According to an embodiment of the present invention, performing feature extraction on the framed speech signal to obtain the amplitude perturbations comprises: obtaining the intra-frame amplitude peaks after framing; and calculating the differences between the amplitude peaks of adjacent frames to obtain the amplitude perturbations, which include: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
According to one embodiment of the invention, the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
According to an embodiment of the present invention, feature extraction is performed on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio, by: acquiring the autocorrelation of each frame's wave; averaging all the autocorrelations to obtain the average autocorrelation; segmenting the time-domain signal wave into pitch periods, then computing the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and obtaining the harmonic-to-noise ratio from the average amplitude sequence and these differences.
According to an embodiment of the present invention, the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
According to one embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter comprises: extracting voiced segments in the voice samples; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and performing inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain the smoothed cepstrum.
According to one embodiment of the invention, training by a support vector machine based on a preprocessed sample set to obtain a support vector machine model comprises: performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
According to an embodiment of the present invention, inputting the speech sample to be tested into the support vector machine model to obtain a recognition result, includes: obtaining the relative position of the voice sample to be tested and the decision hyperplane; and identifying the voice sample to be detected according to the relative position.
The invention also provides a detection device for the replay attack voice, which comprises the following components: the acquisition module is used for acquiring a normal voice sample and a recording playback voice sample; a processing module, configured to pre-process a speech sample to obtain a feature vector of an acoustic parameter as a sample set, where the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and smooth cepstrum; the model training module is used for training on the basis of the preprocessed sample set through the support vector machine to obtain a support vector machine model; and the recognition module is used for acquiring a voice sample to be detected and inputting the voice sample to be detected into the support vector machine model so as to obtain a recognition result.
The invention has the beneficial effects that:
the method comprises the steps of firstly obtaining a normal voice sample and a recording playback voice sample, preprocessing the voice sample to obtain a characteristic vector of an acoustic parameter as a sample set, training the sample set after preprocessing through a support vector machine to obtain a support vector machine model, obtaining a voice sample to be detected, and inputting the voice sample to be detected into the support vector machine model to obtain a recognition result, so that the detection general performance is greatly improved under the condition of not being influenced by the type of equipment and the content of text.
Drawings
FIG. 1 is a flowchart of a method for detecting replay attack speech according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a parameter test of a support vector machine model according to an embodiment of the present invention;
FIGS. 3 and 4 are schematic diagrams of a detection method for replay attack speech according to an embodiment of the present invention;
fig. 5 is a block diagram of a detection apparatus for replaying attack voice according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a detection method of replay attack speech according to an embodiment of the present invention.
As shown in fig. 1, the method for detecting replay attack speech according to an embodiment of the present invention may include the following steps:
and S1, acquiring a normal voice sample and a recorded and played back voice sample.
S2, pre-processing the voice sample to obtain a feature vector of the acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and a smoothed cepstrum.
According to one embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter comprises: extracting voiced segments in the voice samples; obtaining a pitch period by adopting an autocorrelation function; framing according to the pitch period; and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
Specifically, during voiced-region detection, a segment is considered silent when the sound intensity is below the intensity threshold (for example, -25 dB) for longer than a first preset time (for example, 0.1 s), and is considered voiced when the sound intensity is greater than or equal to the intensity threshold (for example, -25 dB) for longer than a second preset time (for example, 0.1 s). The purpose of the duration check is to filter out strong pulses of short duration.
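As a concrete illustration, the following is a minimal sketch of such an intensity-threshold voiced/silent detector. The -25 dB threshold is assumed here to be measured relative to the utterance's peak RMS level (the patent does not specify the reference), and the frame length, function name, and epsilon guards are illustrative choices:

```python
import numpy as np

def detect_voiced_segments(x, sr, frame_ms=10.0, threshold_db=-25.0, min_dur=0.1):
    """Label 1-D signal x (numpy array) frame by frame using short-time RMS
    intensity; a run of frames only counts as a segment once it lasts longer
    than min_dur seconds, which filters out strong but short pulses."""
    hop = int(sr * frame_ms / 1000)
    n_frames = len(x) // hop
    rms = np.array([np.sqrt(np.mean(x[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n_frames)])
    # Intensity in dB relative to the utterance's peak RMS (assumed reference).
    level_db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    voiced = level_db >= threshold_db
    min_frames = int(min_dur / (frame_ms / 1000.0))
    segments, start = [], None
    for i, v in enumerate(np.append(voiced, False)):  # sentinel closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_frames:                  # duration check
                segments.append((start * hop, i * hop))  # sample indices
            start = None
    return segments
```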
Next, the fundamental frequency is calculated, specifically as follows. By the nature of the autocorrelation function, if a speech signal is periodic with period P, its autocorrelation function is also periodic with period P and takes maxima at integer multiples of the signal period. A voiced speech signal is quasi-periodic, so its autocorrelation function has maxima at integer multiples of the pitch period; the distance between two adjacent maximum peaks is therefore the pitch period, and its reciprocal is the pitch frequency. Framing is then performed pitch-synchronously: the frame length is 2 to 7 times the pitch period and lies within 10 to 30 ms. For example, the step size of the fundamental-frequency extraction is set to 0.01 s and the window length to 0.04 s, i.e., the values of 4 fundamental-frequency points are analyzed in one window, and the overlap between two adjacent frames is 0.03 s, i.e., 3/4 of the window length.
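A minimal sketch of this autocorrelation-based pitch estimation with the 0.01 s step and 0.04 s window might look as follows; the plausible pitch range of 75-500 Hz and the function names are assumptions for illustration:

```python
import numpy as np

def pitch_period_autocorr(frame, sr, f0_min=75.0, f0_max=500.0):
    """Estimate the pitch period of one (assumed voiced) analysis window from
    the largest autocorrelation maximum inside the plausible lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return lag / sr  # pitch period in seconds

def f0_track(x, sr, step=0.01, win=0.04):
    """Slide a 0.04 s window in 0.01 s steps (0.03 s overlap, as in the text)
    and return the fundamental frequency for each window."""
    hop, n = int(step * sr), int(win * sr)
    return [1.0 / pitch_period_autocorr(x[i:i + n], sr)
            for i in range(0, len(x) - n + 1, hop)]
```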
Feature extraction is then performed on the framed speech signal to extract the feature vectors of the acoustic parameters reflecting voice quality, comprising: amplitude perturbations, glottal noise features, and a smoothed cepstrum, where the amplitude perturbations include the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient, and the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio.
In an embodiment of the present invention, performing feature extraction on a framed speech signal to obtain an amplitude perturbation includes: obtaining the intra-frame amplitude peak value after framing; and calculating the difference value of the amplitude peak value of each frame to obtain the amplitude perturbation.
For example, the relative value of the amplitude perturbation can be calculated by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

and the three-point amplitude perturbation quotient by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, N represents the number of amplitude peaks, and APQ3 represents the three-point amplitude perturbation quotient.
Specifically, because the amplitude varies slightly between adjacent periods of the voice signal, the intra-frame amplitude peaks can be obtained by detection, the differences between the peak amplitudes are then calculated, and finally the required relative value of the amplitude perturbation and the three-point amplitude perturbation quotient are extracted.
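Under the formulas above, the two perturbation measures can be computed from the sequence of per-period amplitude peaks, for example as in this sketch (the function name and percent scaling follow the reconstruction above):

```python
import numpy as np

def shimmer_features(peaks):
    """Relative amplitude perturbation (SL) and three-point amplitude
    perturbation quotient (APQ3) from the per-period amplitude peaks
    A(1..N), following the formulas above; both returned in percent."""
    A = np.asarray(peaks, dtype=float)
    N = len(A)
    mean_A = A.mean()
    sl = (np.abs(np.diff(A)).sum() / (N - 1)) / mean_A * 100.0
    dev3 = np.abs(A[1:-1] - (A[:-2] + A[1:-1] + A[2:]) / 3.0)
    apq3 = (dev3.sum() / (N - 2)) / mean_A * 100.0
    return sl, apq3
```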
Further, according to an embodiment of the present invention, feature extraction is performed on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio, by: acquiring the autocorrelation of each frame's wave; averaging all the autocorrelations to obtain the average autocorrelation; segmenting the time-domain signal wave into pitch periods, then computing the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and obtaining the harmonic-to-noise ratio from the average amplitude sequence and these differences.
Specifically, the harmonic structure of speech contains rich frequency-structure information; harmonic components characterize the harmonicity perceived subjectively, and harmonic analysis of speech helps to judge the harmonicity and prosody of the sound. The average autocorrelation is obtained by calculating the autocorrelation of each frame's wave and averaging over all frames. The harmonic-to-noise ratio is the ratio of the periodic energy of the signal to its aperiodic energy; it reflects the roughness and hoarseness of the sound, can be used to analyze the noise components in speech and to evaluate voice quality, and is expressed in dB. Its calculation can be defined either in the time domain or in the frequency domain, of which the time-domain calculation performs better; the specific calculation is:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
It should be noted that the glottal noise features further include the noise-to-harmonic ratio, which is the reciprocal of the harmonic-to-noise ratio, that is, the ratio of the aperiodic energy to the periodic energy of the signal. During feature extraction, the noise-to-harmonic ratio may also be extracted.
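A minimal sketch of this time-domain HNR computation, assuming the per-period waveforms have already been truncated or resampled to a common length M:

```python
import numpy as np

def hnr_time_domain(periods):
    """Time-domain harmonic-to-noise ratio in dB from a list of pitch-period
    waveforms f_i, each already truncated/resampled to a common length M."""
    F = np.stack(periods)        # shape (N, M): one row per pitch period
    f_avg = F.mean(axis=0)       # average wave f_A(tau)
    d = F - f_avg                # deviations d_i(tau) of each period
    harmonic = np.sum(f_avg ** 2)
    noise = np.sum(d ** 2) / len(F)
    return 10.0 * np.log10(harmonic / noise)

# The noise-to-harmonic ratio mentioned above is the reciprocal energy ratio,
# i.e. simply -HNR on the dB scale.
```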
Further, according to an embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter includes: extracting voiced segments in the voice samples; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and carrying out inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain a smoothed cepstrum.
Specifically, the smoothed cepstral peak measures the distance between the first cepstral peak and the point of equal quefrency on the regression line through the smoothed cepstrum; the more periodic the speech signal, the more harmonic its spectrum and the more prominent the smoothed cepstral peak. The specific calculation steps are as follows: extract a voiced segment from the voice sample, frame it with a window length of 0.05 s, and compute the cepstrum of each framed signal: first compute the discrete Fourier transform to obtain the spectrum, then take the logarithm of the spectrum and compute the inverse discrete Fourier transform to obtain the cepstrum, and finally smooth the cepstrum to obtain the required smoothed-cepstrum feature.
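The following sketch illustrates this DFT, log, inverse-DFT chain; the Hanning window and the moving-average smoothing width are assumptions, since the patent does not specify the window or the smoothing kernel:

```python
import numpy as np

def smoothed_cepstrum(frame, smooth_len=7):
    """Real cepstrum of one 0.05 s frame (DFT -> log magnitude -> inverse DFT),
    followed by a moving-average smoothing; smooth_len is an assumed width,
    as the patent does not specify the smoothing kernel."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_spec = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_spec)
    kernel = np.ones(smooth_len) / smooth_len
    return np.convolve(cepstrum, kernel, mode="same")
```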
It should be noted that when extracting features, the voiced segment is always extracted first. The difference is that the amplitude perturbation and glottal noise features require calculating the pitch and re-framing according to the pitch period, whereas the smoothed-cepstrum feature requires neither fundamental-frequency calculation nor pitch-synchronous re-framing: the signal is framed directly and the cepstrum is then obtained.
S3, training a support vector machine on the preprocessed sample set to obtain a support vector machine model.
According to one embodiment of the invention, training by a support vector machine based on a preprocessed sample set to obtain a support vector machine model comprises: performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
In particular, the original normal speech samples and the recorded playback speech samples in the training set are used for model training. For each utterance, the feature extraction method above yields a five-dimensional feature vector consisting of the relative value of the amplitude perturbation, the three-point amplitude perturbation quotient, the average autocorrelation, the harmonic-to-noise ratio, and the smoothed cepstrum. The libsvm package of National Taiwan University is then used to train the feature vectors, with a support vector machine (SVM) selected as the classifier; the optimal decision hyperplane is found by computation and divides the data, so that genuine speech and spoofed speech lie on opposite sides of the decision hyperplane. For an input x = (x_1, x_2, x_3, ..., x_N), the optimal SVM classification decision function uses the sign function:

$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}K(x,x_{i})+b\right)$$

where α = (α_1, α_2, α_3, ..., α_N) is the optimal solution of the classification problem, b is the classification threshold, y_i ∈ {-1, 1} are the classification labels, and the kernel function is the radial basis function K(x, z) = exp(-g‖x - z‖^2). To improve the classification accuracy of the SVM classifier, a grid search with five-fold cross-validation on the training data is used to obtain the optimal SVM parameters: penalty factor c = 0.2176 and kernel parameter g = 0.0625. The accuracy of the parameter selection is shown in FIG. 2; the accuracy is 99.9981%.
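As an illustration, the grid search with five-fold cross-validation described above can be reproduced with scikit-learn's SVC, which wraps the same libsvm library named in the text; the search ranges for C and gamma below are assumed for illustration, not taken from the patent:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    """Train an RBF-kernel SVM on the five-dimensional feature vectors
    (SL, APQ3, average autocorrelation, HNR, smoothed cepstrum), tuning the
    penalty factor C and kernel parameter gamma (c and g in the text) by
    grid search with five-fold cross-validation."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": (2.0 ** np.arange(-5, 6)).tolist(),
                    "gamma": (2.0 ** np.arange(-8, 3)).tolist()},
        cv=5,  # five-fold cross-validation
    )
    grid.fit(X, y)  # X: (n_samples, 5); y: +1 genuine speech, -1 replayed speech
    return grid.best_estimator_
```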
S4, acquiring a voice sample to be tested, and inputting it into the support vector machine model to obtain a recognition result.
According to one embodiment of the present invention, inputting a speech sample to be tested into a support vector machine model to obtain a recognition result, including: obtaining the relative position of a voice sample to be tested and a decision hyperplane; and identifying the voice sample to be detected according to the relative position.
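A minimal sketch of this decision step, assuming a model trained as in the previous sketch; the sign of the decision-function value indicates the side of the hyperplane:

```python
import numpy as np

def classify(model, feature_vector):
    """Score a test utterance by its signed position relative to the decision
    hyperplane w.x + b = 0: the positive side is taken as genuine speech and
    the negative side as replayed speech."""
    score = model.decision_function(np.asarray(feature_vector).reshape(1, -1))[0]
    return "genuine" if score > 0 else "replay"
```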
Specifically, the feature vector of the acoustic parameters of the speech sample to be tested is extracted as described above, the position of this feature vector relative to the decision hyperplane is computed, and the original voice and the playback voice are distinguished by which side of the hyperplane the vector falls on. For example, as shown in FIG. 3, wx + b = 0 represents the decision hyperplane; the distance from an arbitrary point to the hyperplane wx + b = 0 determines which side of the hyperplane the point lies on, and the two sides represent the original voice and the playback voice respectively. As shown in FIG. 4, the region above wx + b = 1 represents original voice, and the region below wx + b = -1 represents playback voice. To verify the detection accuracy of the present application, Table 1 compares the replay attack detection results obtained by the method of the present application and by the baseline systems under the same conditions.
TABLE 1
[Table 1, reproduced as an image in the original publication, compares EER, t-DCF, and accuracy for the proposed feature set against the CQCC and LFCC baseline systems.]
As can be seen from Table 1, the auditory-perception-based feature set consisting of the relative value of the amplitude perturbation, the three-point amplitude perturbation quotient, the average autocorrelation, the harmonic-to-noise ratio, and the smoothed cepstrum detects replay attacks better than the baseline features CQCC and LFCC. Performance on the test set is very good: the EER is 0 and the accuracy reaches 99.99%, so the overall performance of the system is far superior to that of the baseline system, and genuine speech and replayed speech can be distinguished well.
Note that EER in Table 1 denotes the equal error rate, i.e., the error rate at which the FAR (False Accept Rate) equals the FRR (False Reject Rate); the lower the value, the better the detection performance, reflecting the overall performance of the system. t-DCF in Table 1 denotes the tandem detection cost function, which evaluates the overall performance of the tandem system of spoofing countermeasures (CMs) and the ASV subsystem; again, the lower the value, the better the detection performance.
In summary, the invention first acquires normal voice samples and recorded-and-replayed voice samples and preprocesses them to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; a voice sample to be tested is then acquired and input into the support vector machine model to obtain a recognition result. The generality of detection is thereby greatly improved, without being affected by device type or text content.
Corresponding to the detection method of the replay attack voice of the above embodiment, the invention also provides a detection device of the replay attack voice.
Fig. 5 is a block diagram of a detection apparatus for replaying attack voice according to an embodiment of the present invention.
As shown in fig. 5, the apparatus for detecting replay attack speech according to an embodiment of the present invention may include: an acquisition module 10, a processing module 20, a model training module 30 and a recognition module 40.
The obtaining module 10 is configured to obtain a normal voice sample and a recorded playback voice sample. The processing module 20 is configured to pre-process the voice sample to obtain a feature vector of the acoustic parameter as a sample set, where the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and a smoothed cepstrum. The model training module 30 is configured to perform training based on the preprocessed sample set by the support vector machine to obtain a support vector machine model. The recognition module 40 is configured to obtain a voice sample to be detected, and input the voice sample to be detected into the support vector machine model to obtain a recognition result.
According to an embodiment of the present invention, the processing module 20 performs preprocessing on the voice sample to obtain a feature vector of an acoustic parameter, specifically, to extract a voiced segment in the voice sample; obtaining a pitch period by adopting an autocorrelation function; framing according to the pitch period; and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
According to an embodiment of the present invention, the processing module 20 performs feature extraction on the framed speech signal to obtain the amplitude perturbations; it is specifically configured to obtain the intra-frame amplitude peaks after framing, and to calculate the differences between the amplitude peaks to obtain the amplitude perturbations, which include: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
According to one embodiment of the invention, the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
According to an embodiment of the present invention, the processing module 20 performs feature extraction on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio; it is specifically configured to acquire the autocorrelation of each frame's wave; to average all the autocorrelations to obtain the average autocorrelation; to segment the time-domain signal wave into pitch periods and compute the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and to obtain the harmonic-to-noise ratio from the average amplitude sequence and these differences.
According to one embodiment of the invention, the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
According to an embodiment of the present invention, the processing module 20 performs preprocessing on the voice sample to obtain a feature vector of an acoustic parameter, specifically, to extract a voiced segment in the voice sample; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and carrying out inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain a smoothed cepstrum.
According to an embodiment of the present invention, the model training module 30 performs training based on the preprocessed sample set through the support vector machine to obtain a support vector machine model, and is specifically configured to perform five-fold cross validation on training data by using a grid search method to optimize parameters of the support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
According to an embodiment of the present invention, the recognition module 40 inputs the voice sample to be detected into the support vector machine model to obtain a recognition result, specifically, to obtain a relative position between the voice sample to be detected and the decision hyperplane; and identifying the voice sample to be detected according to the relative position.
It should be noted that details that are not disclosed in the detection apparatus for a replay attack voice according to the embodiment of the present invention refer to details that are disclosed in the detection method for a replay attack voice according to the embodiment of the present invention, and details are not described here again.
According to the device for detecting replay attack voice of the invention, the acquisition module acquires normal voice samples and recorded-and-replayed voice samples; the processing module preprocesses the voice samples to obtain feature vectors of acoustic parameters as a sample set; the model training module trains a support vector machine on the preprocessed sample set to obtain a support vector machine model; and the recognition module acquires a voice sample to be tested and inputs it into the support vector machine model to obtain a recognition result. The device can therefore perform feature training using acoustic parameters characterizing voice quality, based on the auditory difference between the original voice and the replayed voice; it can detect replay attacks, is unaffected by device type and text content, and greatly improves the universality of detection.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method for detecting replay attack voice according to the above embodiments of the invention can be implemented.
According to the computer device of the embodiment of the invention, when the processor executes the computer program stored on the memory, normal voice samples and recorded-and-replayed voice samples are first acquired and preprocessed to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; and a voice sample to be tested is acquired and input into the support vector machine model to obtain a recognition result, so that the generality of detection is greatly improved without being affected by device type or text content.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
A non-transitory computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program that, when executed by a processor, can implement the detection method of replay attack speech according to the above-described embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the invention, when the processor executes the computer program stored thereon, normal voice samples and recorded-and-replayed voice samples are first acquired and preprocessed to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; and a voice sample to be tested is acquired and input into the support vector machine model to obtain a recognition result, so that the generality of detection can be greatly improved without being affected by device type or text content.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A detection method of replay attack voice is characterized by comprising the following steps:
acquiring a normal voice sample and a recording playback voice sample;
preprocessing a voice sample to obtain a feature vector of an acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter comprises: amplitude perturbations, glottal noise features, and a smoothed cepstrum;
training by a support vector machine based on the preprocessed sample set to obtain a support vector machine model;
and acquiring a voice sample to be detected, and inputting the voice sample to be detected into the support vector machine model to obtain a recognition result.
2. The method for detecting replay attack speech according to claim 1, wherein preprocessing the speech samples to obtain feature vectors of the acoustic parameters comprises:
extracting voiced segments in the voice samples;
obtaining a pitch period by adopting an autocorrelation function;
framing according to the pitch period;
and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
3. The method for detecting replay attack speech according to claim 2, wherein the step of performing feature extraction on the framed speech signal to obtain the amplitude perturbation comprises:
obtaining the intra-frame amplitude peak value after framing;
calculating the differences between the amplitude peaks to obtain the amplitude perturbation, wherein the amplitude perturbation comprises: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
4. The method for detecting replay attack speech according to claim 3, wherein the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
5. The method for detecting replay attack speech according to claim 2, wherein feature extraction is performed on the framed speech signal to obtain a glottal noise feature, wherein the glottal noise feature includes: the average autocorrelation and the harmonic-to-noise ratio, including:
acquiring autocorrelation of each frame wave;
calculating the average value of all the autocorrelation to be used as the average autocorrelation;
calculating an average amplitude sequence of the time domain signal wave amplitude sequence truncated according to the period, and calculating the difference value of each truncated signal and the average wave;
and acquiring a harmonic-to-noise ratio according to the average amplitude sequence of the time domain signal wave amplitude sequence truncated according to the period and the difference value between each truncated signal and the average wave.
6. The method for detecting replay attack speech according to claim 5, wherein the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
7. The method for detecting replay attack speech according to claim 1, wherein preprocessing the speech samples to obtain feature vectors of the acoustic parameters comprises:
extracting voiced segments in the voice samples;
framing the voiced segments;
performing discrete Fourier transform on the framed signal to obtain a frequency spectrum;
taking logarithm of the frequency spectrum, and performing inverse discrete Fourier transform to obtain a cepstrum;
and smoothing the cepstrum to obtain the smoothed cepstrum.
8. The method for detecting replay attack speech according to claim 1, wherein training by a support vector machine based on the preprocessed sample set to obtain a support vector machine model comprises:
performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model;
and obtaining a decision hyperplane according to the support vector machine model.
9. The method for detecting replay attack speech according to claim 8, wherein inputting the speech sample to be detected into the support vector machine model to obtain a recognition result comprises:
obtaining the relative position of the voice sample to be tested and the decision hyperplane;
and identifying the voice sample to be detected according to the relative position.
10. A playback attack voice detection apparatus, comprising:
the acquisition module is used for acquiring a normal voice sample and a recording playback voice sample;
a processing module, configured to pre-process a voice sample to obtain a feature vector of an acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter comprises: amplitude perturbations, glottal noise features, and a smoothed cepstrum;
the model training module is used for training on the basis of the preprocessed sample set through the support vector machine to obtain a support vector machine model;
and the recognition module is used for acquiring a voice sample to be detected and inputting the voice sample to be detected into the support vector machine model so as to obtain a recognition result.
CN202011455471.3A 2020-12-10 Method and device for detecting replay attack voice Active CN112599149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011455471.3A CN112599149B (en) 2020-12-10 Method and device for detecting replay attack voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011455471.3A CN112599149B (en) 2020-12-10 Method and device for detecting replay attack voice

Publications (2)

Publication Number Publication Date
CN112599149A true CN112599149A (en) 2021-04-02
CN112599149B CN112599149B (en) 2024-06-04

Family


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN108154879A (en) * 2017-12-26 2018-06-12 广西师范大学 A kind of unspecified person speech-emotion recognition method based on cepstrum separation signal
CN111640439A (en) * 2020-05-15 2020-09-08 南开大学 Deep learning-based breath sound classification method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN108154879A (en) * 2017-12-26 2018-06-12 广西师范大学 A kind of unspecified person speech-emotion recognition method based on cepstrum separation signal
CN111640439A (en) * 2020-05-15 2020-09-08 南开大学 Deep learning-based breath sound classification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Shanshan: "SVM-based assessment and classification of the severity of pathological voice disorders", Instrument Technique, no. 06, pages 1
CHEN Di; GONG Weiguo; LI Bo: "High-frequency weighted MFCC extraction for noise-robust speaker recognition", Chinese Journal of Scientific Instrument, no. 03, pages 5

Similar Documents

Publication Publication Date Title
Tan et al. rVAD: An unsupervised segment-based robust voice activity detection method
Shiota et al. Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification
Zhang et al. The effect of silence and dual-band fusion in anti-spoofing system
Ajmera et al. Text-independent speaker identification using Radon and discrete cosine transforms based features from speech spectrogram
US8428945B2 (en) Acoustic signal classification system
Shiota et al. Voice Liveness Detection for Speaker Verification based on a Tandem Single/Double-channel Pop Noise Detector.
CN108986824B (en) Playback voice detection method
Patel et al. Cochlear filter and instantaneous frequency based features for spoofed speech detection
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
Paul et al. Countermeasure to handle replay attacks in practical speaker verification systems
EP3430612B1 (en) Apparatus and method for harmonic-percussive-residual sound separation using a structure tensor on spectrograms
Hanilçi et al. Optimizing acoustic features for source cell-phone recognition using speech signals
KR20100036893A (en) Speaker cognition device using voice signal analysis and method thereof
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Alonso-Martin et al. Multidomain voice activity detection during human-robot interaction
Srinivas et al. Combining phase-based features for replay spoof detection system
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
Yarra et al. A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection
Singh et al. Linear Prediction Residual based Short-term Cepstral Features for Replay Attacks Detection.
Srinivas et al. Relative phase shift features for replay spoof detection system
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN112599149B (en) Method and device for detecting replay attack voice
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN112599149A (en) Detection method and device for replay attack voice
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant