CN112599149A - Detection method and device for replay attack voice - Google Patents

Detection method and device for replay attack voice Download PDF

Info

Publication number
CN112599149A
CN112599149A
Authority
CN
China
Prior art keywords
voice
amplitude
support vector
vector machine
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011455471.3A
Other languages
Chinese (zh)
Other versions
CN112599149B (en)
Inventor
周颖慧
孟子厚
刘亚丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202011455471.3A priority Critical patent/CN112599149B/en
Priority claimed from CN202011455471.3A external-priority patent/CN112599149B/en
Publication of CN112599149A publication Critical patent/CN112599149A/en
Application granted granted Critical
Publication of CN112599149B publication Critical patent/CN112599149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a method and a device for detecting replay attack voice. The method comprises the following steps: acquiring normal voice samples and recorded-and-replayed voice samples; preprocessing the voice samples to obtain feature vectors of acoustic parameters as a sample set; training a support vector machine on the preprocessed sample set to obtain a support vector machine model; and acquiring a voice sample to be tested and inputting it into the support vector machine model to obtain a recognition result. The detection method of the invention is unaffected by device type and text content, which greatly improves the generality of detection.

Description

Detection method and device for replay attack voice
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting replay attack voice.
Background
At present, speaker recognition (SR) is a biometric technology that identifies a speaker by comparing the similarity between voiceprint feature vectors extracted from the original voice and the corresponding voiceprint templates, exploiting differences in the vocal tract structures and pronunciation habits of different speakers. However, the application scenarios of speaker recognition include telephony and other distributed, unattended settings without manual supervision or face-to-face contact, so speaker recognition is more susceptible to malicious interference than other biometric recognition processes. When identity authentication is carried out through speaker recognition, the authentication process has no practical significance if the threat posed by spoofing attacks is not considered. There are four main sources of spoofing attacks against speaker recognition: (1) speaker impersonation attacks; (2) speech synthesis attacks; (3) voice conversion attacks; and (4) replay attacks using recordings. Replay attacks require no specialized knowledge or technology, and with the continuous improvement in the quality of recording equipment and the continuous reduction in its cost, record-and-replay has become an effective and common spoofing attack.
Current replay attack detection methods fall mainly into the following categories. First, channel differences: based on the difference between the original voice channel and the playback voice channel, silent sections or pattern noise in speech segments are taken as the object of study, and channel information is extracted and analyzed to prevent a forger from intruding through replay attack detection based on channel information; such algorithms target specific illegal recording devices, and the diversity of device types means this detection approach lacks universality. Second, coding information: the voice produced by the target speaker is captured by an acquisition device, which converts the analog signal into a digital signal to complete the recording; for a replayed voice signal, the transmission chain includes multiple coding stages, and the covert recording device and the playback device introduce coding distortion and amplification distortion, so the replayed voice carries information about the devices involved; this approach does not account for noise interference, and a change of equipment may require corrections to the algorithm. Third, spectral similarity features: replayed attack voice is recognized by analyzing and judging the degree of spectral similarity; this method is only applicable to text-dependent detection systems, and as the number of verifications grows, so does the storage requirement, ultimately reducing the working efficiency of the system, so it has shortcomings in practical applications. Fourth, multi-modal features: compared with single-modality speaker recognition, combining biometric features from multiple modalities is more effective against spoofing attacks; for example, fusing scores from the three modalities of face, lip movement, and voice in a speaker identity authentication system improves its accuracy and security; however, this approach is not as convenient to operate as single-modality detection techniques.
The existing parameters for replay voice detection are mainly obtained by direct objective measurement of the voice signal, and each has its limitations: some target only specific devices and lack universality; some require text dependence and lack flexibility; multi-modal approaches are inconvenient to operate; and so on.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for detecting replay attack voice that performs feature training with acoustic parameters characterizing voice quality, based on the auditory difference between the original voice and the replayed voice. It can detect replay attacks, is unaffected by device type and text content, and greatly improves the generality of detection.
The technical scheme adopted by the invention is as follows:
A method for detecting replay attack voice comprises the following steps: acquiring normal voice samples and recorded-and-replayed voice samples; preprocessing the voice samples to obtain feature vectors of acoustic parameters as a sample set, wherein the feature vectors of the acoustic parameters comprise: amplitude perturbations, glottal noise features, and a smoothed cepstrum; training a support vector machine on the preprocessed sample set to obtain a support vector machine model; and acquiring a voice sample to be tested and inputting it into the support vector machine model to obtain a recognition result.
According to one embodiment of the present invention, preprocessing the voice samples to obtain feature vectors of acoustic parameters comprises: extracting voiced segments from the voice samples; obtaining the pitch period using an autocorrelation function; framing according to the pitch period; and performing feature extraction on the framed voice signal to obtain the amplitude perturbations and glottal noise features.
According to an embodiment of the present invention, performing feature extraction on the framed speech signal to obtain the amplitude perturbations comprises: obtaining the intra-frame amplitude peaks after framing; and calculating the differences between the amplitude peaks of adjacent frames to obtain the amplitude perturbations, which include: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
According to one embodiment of the invention, the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
According to an embodiment of the present invention, feature extraction is performed on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio, by: acquiring the autocorrelation of each frame's wave; averaging all the autocorrelations to obtain the average autocorrelation; segmenting the time-domain signal wave into pitch periods, then computing the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and obtaining the harmonic-to-noise ratio from the average amplitude sequence and these differences.
According to an embodiment of the present invention, the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
According to one embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter comprises: extracting voiced segments in the voice samples; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and performing inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain the smoothed cepstrum.
According to one embodiment of the invention, training by a support vector machine based on a preprocessed sample set to obtain a support vector machine model comprises: performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
According to an embodiment of the present invention, inputting the speech sample to be tested into the support vector machine model to obtain a recognition result, includes: obtaining the relative position of the voice sample to be tested and the decision hyperplane; and identifying the voice sample to be detected according to the relative position.
The invention also provides a detection device for the replay attack voice, which comprises the following components: the acquisition module is used for acquiring a normal voice sample and a recording playback voice sample; a processing module, configured to pre-process a speech sample to obtain a feature vector of an acoustic parameter as a sample set, where the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and smooth cepstrum; the model training module is used for training on the basis of the preprocessed sample set through the support vector machine to obtain a support vector machine model; and the recognition module is used for acquiring a voice sample to be detected and inputting the voice sample to be detected into the support vector machine model so as to obtain a recognition result.
The invention has the beneficial effects that:
the method comprises the steps of firstly obtaining a normal voice sample and a recording playback voice sample, preprocessing the voice sample to obtain a characteristic vector of an acoustic parameter as a sample set, training the sample set after preprocessing through a support vector machine to obtain a support vector machine model, obtaining a voice sample to be detected, and inputting the voice sample to be detected into the support vector machine model to obtain a recognition result, so that the detection general performance is greatly improved under the condition of not being influenced by the type of equipment and the content of text.
Drawings
FIG. 1 is a flowchart of a method for detecting replay attack speech according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a parameter test of a support vector machine model according to an embodiment of the present invention;
FIGS. 3 and 4 are schematic diagrams of a detection method for replay attack speech according to an embodiment of the present invention;
fig. 5 is a block diagram of a detection apparatus for replaying attack voice according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a detection method of replay attack speech according to an embodiment of the present invention.
As shown in fig. 1, the method for detecting replay attack speech according to an embodiment of the present invention may include the following steps:
and S1, acquiring a normal voice sample and a recorded and played back voice sample.
S2, pre-processing the voice sample to obtain a feature vector of the acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and a smoothed cepstrum.
According to one embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter comprises: extracting voiced segments in the voice samples; obtaining a pitch period by adopting an autocorrelation function; framing according to the pitch period; and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
Specifically, during voiced-region detection, a segment is considered silent when the sound intensity is below the intensity threshold (for example, -25 dB) for longer than a first preset time (for example, 0.1 s), and is considered voiced when the sound intensity is greater than or equal to the intensity threshold (for example, -25 dB) for longer than a second preset time (for example, 0.1 s). The purpose of the duration check is to filter out strong pulses of short duration.
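As a concrete illustration, the following is a minimal sketch of such an intensity-threshold voiced/silent detector. The -25 dB threshold is assumed here to be measured relative to the utterance's peak RMS level (the patent does not specify the reference), and the frame length, function name, and epsilon guards are illustrative choices:

```python
import numpy as np

def detect_voiced_segments(x, sr, frame_ms=10.0, threshold_db=-25.0, min_dur=0.1):
    """Label 1-D signal x (numpy array) frame by frame using short-time RMS
    intensity; a run of frames only counts as a segment once it lasts longer
    than min_dur seconds, which filters out strong but short pulses."""
    hop = int(sr * frame_ms / 1000)
    n_frames = len(x) // hop
    rms = np.array([np.sqrt(np.mean(x[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n_frames)])
    # Intensity in dB relative to the utterance's peak RMS (assumed reference).
    level_db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    voiced = level_db >= threshold_db
    min_frames = int(min_dur / (frame_ms / 1000.0))
    segments, start = [], None
    for i, v in enumerate(np.append(voiced, False)):  # sentinel closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_frames:                  # duration check
                segments.append((start * hop, i * hop))  # sample indices
            start = None
    return segments
```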
Next, the fundamental frequency is calculated, specifically as follows. By the nature of the autocorrelation function, if a speech signal is periodic with period P, its autocorrelation function is also periodic with period P and takes maxima at integer multiples of the signal period. A voiced speech signal is quasi-periodic, so its autocorrelation function has maxima at integer multiples of the pitch period; the distance between two adjacent maximum peaks is therefore the pitch period, and its reciprocal is the pitch frequency. Framing is then performed pitch-synchronously: the frame length is 2 to 7 times the pitch period and lies within 10 to 30 ms. For example, the step size of the fundamental-frequency extraction is set to 0.01 s and the window length to 0.04 s, i.e., the values of 4 fundamental-frequency points are analyzed in one window, and the overlap between two adjacent frames is 0.03 s, i.e., 3/4 of the window length.
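A minimal sketch of this autocorrelation-based pitch estimation with the 0.01 s step and 0.04 s window might look as follows; the plausible pitch range of 75-500 Hz and the function names are assumptions for illustration:

```python
import numpy as np

def pitch_period_autocorr(frame, sr, f0_min=75.0, f0_max=500.0):
    """Estimate the pitch period of one (assumed voiced) analysis window from
    the largest autocorrelation maximum inside the plausible lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return lag / sr  # pitch period in seconds

def f0_track(x, sr, step=0.01, win=0.04):
    """Slide a 0.04 s window in 0.01 s steps (0.03 s overlap, as in the text)
    and return the fundamental frequency for each window."""
    hop, n = int(step * sr), int(win * sr)
    return [1.0 / pitch_period_autocorr(x[i:i + n], sr)
            for i in range(0, len(x) - n + 1, hop)]
```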
Feature extraction is then performed on the framed speech signal to extract the feature vectors of the acoustic parameters reflecting voice quality, comprising: amplitude perturbations, glottal noise features, and a smoothed cepstrum, where the amplitude perturbations include the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient, and the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio.
In an embodiment of the present invention, performing feature extraction on a framed speech signal to obtain an amplitude perturbation includes: obtaining the intra-frame amplitude peak value after framing; and calculating the difference value of the amplitude peak value of each frame to obtain the amplitude perturbation.
For example, the relative value of the amplitude perturbation can be calculated by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

and the three-point amplitude perturbation quotient by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, N represents the number of amplitude peaks, and APQ3 represents the three-point amplitude perturbation quotient.
Specifically, because the amplitude varies slightly between adjacent periods of the voice signal, the intra-frame amplitude peaks can be obtained by detection, the differences between the peak amplitudes are then calculated, and finally the required relative value of the amplitude perturbation and the three-point amplitude perturbation quotient are extracted.
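Under the formulas above, the two perturbation measures can be computed from the sequence of per-period amplitude peaks, for example as in this sketch (the function name and percent scaling follow the reconstruction above):

```python
import numpy as np

def shimmer_features(peaks):
    """Relative amplitude perturbation (SL) and three-point amplitude
    perturbation quotient (APQ3) from the per-period amplitude peaks
    A(1..N), following the formulas above; both returned in percent."""
    A = np.asarray(peaks, dtype=float)
    N = len(A)
    mean_A = A.mean()
    sl = (np.abs(np.diff(A)).sum() / (N - 1)) / mean_A * 100.0
    dev3 = np.abs(A[1:-1] - (A[:-2] + A[1:-1] + A[2:]) / 3.0)
    apq3 = (dev3.sum() / (N - 2)) / mean_A * 100.0
    return sl, apq3
```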
Further, according to an embodiment of the present invention, feature extraction is performed on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio, by: acquiring the autocorrelation of each frame's wave; averaging all the autocorrelations to obtain the average autocorrelation; segmenting the time-domain signal wave into pitch periods, then computing the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and obtaining the harmonic-to-noise ratio from the average amplitude sequence and these differences.
Specifically, the harmonic structure of speech contains rich frequency-structure information; harmonic components characterize the harmonicity perceived subjectively, and harmonic analysis of speech helps to judge the harmonicity and prosody of the sound. The average autocorrelation is obtained by calculating the autocorrelation of each frame's wave and averaging over all frames. The harmonic-to-noise ratio is the ratio of the periodic energy of the signal to its aperiodic energy; it reflects the roughness and hoarseness of the sound, can be used to analyze the noise components in speech and to evaluate voice quality, and is expressed in dB. Its calculation can be defined either in the time domain or in the frequency domain, of which the time-domain calculation performs better; the specific calculation is:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
It should be noted that the glottal noise features further include the noise-to-harmonic ratio, which is the reciprocal of the harmonic-to-noise ratio, that is, the ratio of the aperiodic energy to the periodic energy of the signal. During feature extraction, the noise-to-harmonic ratio may also be extracted.
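A minimal sketch of this time-domain HNR computation, assuming the per-period waveforms have already been truncated or resampled to a common length M:

```python
import numpy as np

def hnr_time_domain(periods):
    """Time-domain harmonic-to-noise ratio in dB from a list of pitch-period
    waveforms f_i, each already truncated/resampled to a common length M."""
    F = np.stack(periods)        # shape (N, M): one row per pitch period
    f_avg = F.mean(axis=0)       # average wave f_A(tau)
    d = F - f_avg                # deviations d_i(tau) of each period
    harmonic = np.sum(f_avg ** 2)
    noise = np.sum(d ** 2) / len(F)
    return 10.0 * np.log10(harmonic / noise)

# The noise-to-harmonic ratio mentioned above is the reciprocal energy ratio,
# i.e. simply -HNR on the dB scale.
```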
Further, according to an embodiment of the present invention, preprocessing a speech sample to obtain a feature vector of an acoustic parameter includes: extracting voiced segments in the voice samples; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and carrying out inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain a smoothed cepstrum.
Specifically, the smoothed cepstral peak measures the distance between the first cepstral peak and the point of equal quefrency on the regression line through the smoothed cepstrum; the more periodic the speech signal, the more harmonic its spectrum and the more prominent the smoothed cepstral peak. The specific calculation steps are as follows: extract a voiced segment from the voice sample, frame it with a window length of 0.05 s, and compute the cepstrum of each framed signal: first compute the discrete Fourier transform to obtain the spectrum, then take the logarithm of the spectrum and compute the inverse discrete Fourier transform to obtain the cepstrum, and finally smooth the cepstrum to obtain the required smoothed-cepstrum feature.
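The following sketch illustrates this DFT, log, inverse-DFT chain; the Hanning window and the moving-average smoothing width are assumptions, since the patent does not specify the window or the smoothing kernel:

```python
import numpy as np

def smoothed_cepstrum(frame, smooth_len=7):
    """Real cepstrum of one 0.05 s frame (DFT -> log magnitude -> inverse DFT),
    followed by a moving-average smoothing; smooth_len is an assumed width,
    as the patent does not specify the smoothing kernel."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_spec = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_spec)
    kernel = np.ones(smooth_len) / smooth_len
    return np.convolve(cepstrum, kernel, mode="same")
```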
It should be noted that when extracting features, the voiced segment is always extracted first. The difference is that the amplitude perturbation and glottal noise features require calculating the pitch and re-framing according to the pitch period, whereas the smoothed-cepstrum feature requires neither fundamental-frequency calculation nor pitch-synchronous re-framing: the signal is framed directly and the cepstrum is then obtained.
S3, training a support vector machine on the preprocessed sample set to obtain a support vector machine model.
According to one embodiment of the invention, training by a support vector machine based on a preprocessed sample set to obtain a support vector machine model comprises: performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
In particular, the original normal speech samples and the recorded playback speech samples in the training set are used for model training. For each utterance, the feature extraction method above yields a five-dimensional feature vector consisting of the relative value of the amplitude perturbation, the three-point amplitude perturbation quotient, the average autocorrelation, the harmonic-to-noise ratio, and the smoothed cepstrum. The libsvm package of National Taiwan University is then used to train the feature vectors, with a support vector machine (SVM) selected as the classifier; the optimal decision hyperplane is found by computation and divides the data, so that genuine speech and spoofed speech lie on opposite sides of the decision hyperplane. For an input x = (x_1, x_2, x_3, ..., x_N), the optimal SVM classification decision function uses the sign function:

$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}K(x,x_{i})+b\right)$$

where α = (α_1, α_2, α_3, ..., α_N) is the optimal solution of the classification problem, b is the classification threshold, y_i ∈ {-1, 1} are the classification labels, and the kernel function is the radial basis function K(x, z) = exp(-g‖x - z‖^2). To improve the classification accuracy of the SVM classifier, a grid search with five-fold cross-validation on the training data is used to obtain the optimal SVM parameters: penalty factor c = 0.2176 and kernel parameter g = 0.0625. The accuracy of the parameter selection is shown in FIG. 2; the accuracy is 99.9981%.
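As an illustration, the grid search with five-fold cross-validation described above can be reproduced with scikit-learn's SVC, which wraps the same libsvm library named in the text; the search ranges for C and gamma below are assumed for illustration, not taken from the patent:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    """Train an RBF-kernel SVM on the five-dimensional feature vectors
    (SL, APQ3, average autocorrelation, HNR, smoothed cepstrum), tuning the
    penalty factor C and kernel parameter gamma (c and g in the text) by
    grid search with five-fold cross-validation."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": (2.0 ** np.arange(-5, 6)).tolist(),
                    "gamma": (2.0 ** np.arange(-8, 3)).tolist()},
        cv=5,  # five-fold cross-validation
    )
    grid.fit(X, y)  # X: (n_samples, 5); y: +1 genuine speech, -1 replayed speech
    return grid.best_estimator_
```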
S4, acquiring a voice sample to be tested, and inputting it into the support vector machine model to obtain a recognition result.
According to one embodiment of the present invention, inputting a speech sample to be tested into a support vector machine model to obtain a recognition result, including: obtaining the relative position of a voice sample to be tested and a decision hyperplane; and identifying the voice sample to be detected according to the relative position.
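A minimal sketch of this decision step, assuming a model trained as in the previous sketch; the sign of the decision-function value indicates the side of the hyperplane:

```python
import numpy as np

def classify(model, feature_vector):
    """Score a test utterance by its signed position relative to the decision
    hyperplane w.x + b = 0: the positive side is taken as genuine speech and
    the negative side as replayed speech."""
    score = model.decision_function(np.asarray(feature_vector).reshape(1, -1))[0]
    return "genuine" if score > 0 else "replay"
```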
Specifically, the feature vector of the acoustic parameters of the speech sample to be tested is extracted as described above, the position of this feature vector relative to the decision hyperplane is computed, and the original voice and the playback voice are distinguished by which side of the hyperplane the vector falls on. For example, as shown in FIG. 3, wx + b = 0 represents the decision hyperplane; the distance from an arbitrary point to the hyperplane wx + b = 0 determines which side of the hyperplane the point lies on, and the two sides represent the original voice and the playback voice respectively. As shown in FIG. 4, the region above wx + b = 1 represents original voice, and the region below wx + b = -1 represents playback voice. To verify the detection accuracy of the present application, Table 1 compares the replay attack detection results obtained by the method of the present application and by the baseline systems under the same conditions.
TABLE 1
[Table 1, reproduced as an image in the original publication, compares EER, t-DCF, and accuracy for the proposed feature set against the CQCC and LFCC baseline systems.]
As can be seen from Table 1, the auditory-perception-based feature set consisting of the relative value of the amplitude perturbation, the three-point amplitude perturbation quotient, the average autocorrelation, the harmonic-to-noise ratio, and the smoothed cepstrum detects replay attacks better than the baseline features CQCC and LFCC. Performance on the test set is very good: the EER is 0 and the accuracy reaches 99.99%, so the overall performance of the system is far superior to that of the baseline system, and genuine speech and replayed speech can be distinguished well.
Note that EER in Table 1 denotes the equal error rate, i.e., the error rate at which the FAR (False Accept Rate) equals the FRR (False Reject Rate); the lower the value, the better the detection performance, reflecting the overall performance of the system. t-DCF in Table 1 denotes the tandem detection cost function, which evaluates the overall performance of the tandem system of spoofing countermeasures (CMs) and the ASV subsystem; again, the lower the value, the better the detection performance.
In summary, the invention first acquires normal voice samples and recorded-and-replayed voice samples and preprocesses them to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; a voice sample to be tested is then acquired and input into the support vector machine model to obtain a recognition result. The generality of detection is thereby greatly improved, without being affected by device type or text content.
Corresponding to the detection method of the replay attack voice of the above embodiment, the invention also provides a detection device of the replay attack voice.
Fig. 5 is a block diagram of a detection apparatus for replaying attack voice according to an embodiment of the present invention.
As shown in fig. 5, the apparatus for detecting replay attack speech according to an embodiment of the present invention may include: an acquisition module 10, a processing module 20, a model training module 30 and a recognition module 40.
The obtaining module 10 is configured to obtain a normal voice sample and a recorded playback voice sample. The processing module 20 is configured to pre-process the voice sample to obtain a feature vector of the acoustic parameter as a sample set, where the feature vector of the acoustic parameter includes: amplitude perturbations, glottal noise characteristics, and a smoothed cepstrum. The model training module 30 is configured to perform training based on the preprocessed sample set by the support vector machine to obtain a support vector machine model. The recognition module 40 is configured to obtain a voice sample to be detected, and input the voice sample to be detected into the support vector machine model to obtain a recognition result.
According to an embodiment of the present invention, the processing module 20 performs preprocessing on the voice sample to obtain a feature vector of an acoustic parameter, specifically, to extract a voiced segment in the voice sample; obtaining a pitch period by adopting an autocorrelation function; framing according to the pitch period; and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
According to an embodiment of the present invention, the processing module 20 performs feature extraction on the framed speech signal to obtain the amplitude perturbations; it is specifically configured to obtain the intra-frame amplitude peaks after framing, and to calculate the differences between the amplitude peaks to obtain the amplitude perturbations, which include: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
According to one embodiment of the invention, the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
According to an embodiment of the present invention, the processing module 20 performs feature extraction on the framed speech signal to obtain the glottal noise features, where the glottal noise features include the average autocorrelation and the harmonic-to-noise ratio; it is specifically configured to acquire the autocorrelation of each frame's wave; to average all the autocorrelations to obtain the average autocorrelation; to segment the time-domain signal wave into pitch periods and compute the average amplitude sequence of the period-truncated waves and the difference between each truncated signal and the average wave; and to obtain the harmonic-to-noise ratio from the average amplitude sequence and these differences.
According to one embodiment of the invention, the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
According to an embodiment of the present invention, the processing module 20 performs preprocessing on the voice sample to obtain a feature vector of an acoustic parameter, specifically, to extract a voiced segment in the voice sample; framing the voiced segments; performing discrete Fourier transform on the framed signal to obtain a frequency spectrum; taking logarithm of the frequency spectrum, and carrying out inverse discrete Fourier transform to obtain a cepstrum; and smoothing the cepstrum to obtain a smoothed cepstrum.
According to an embodiment of the present invention, the model training module 30 performs training based on the preprocessed sample set through the support vector machine to obtain a support vector machine model, and is specifically configured to perform five-fold cross validation on training data by using a grid search method to optimize parameters of the support vector machine model; and obtaining a decision hyperplane according to the support vector machine model.
According to an embodiment of the present invention, the recognition module 40 inputs the voice sample to be detected into the support vector machine model to obtain a recognition result, specifically, to obtain a relative position between the voice sample to be detected and the decision hyperplane; and identifying the voice sample to be detected according to the relative position.
It should be noted that details that are not disclosed in the detection apparatus for a replay attack voice according to the embodiment of the present invention refer to details that are disclosed in the detection method for a replay attack voice according to the embodiment of the present invention, and details are not described here again.
According to the device for detecting replay attack voice of the invention, the acquisition module acquires normal voice samples and recorded-and-replayed voice samples; the processing module preprocesses the voice samples to obtain feature vectors of acoustic parameters as a sample set; the model training module trains a support vector machine on the preprocessed sample set to obtain a support vector machine model; and the recognition module acquires a voice sample to be tested and inputs it into the support vector machine model to obtain a recognition result. The device can therefore perform feature training using acoustic parameters characterizing voice quality, based on the auditory difference between the original voice and the replayed voice; it can detect replay attacks, is unaffected by device type and text content, and greatly improves the universality of detection.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method for detecting replay attack voice according to the above embodiments of the invention can be implemented.
According to the computer device of the embodiment of the invention, when the processor executes the computer program stored on the memory, normal voice samples and recorded-and-replayed voice samples are first acquired and preprocessed to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; and a voice sample to be tested is acquired and input into the support vector machine model to obtain a recognition result, so that the generality of detection is greatly improved without being affected by device type or text content.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
A non-transitory computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program that, when executed by a processor, can implement the detection method of replay attack speech according to the above-described embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the invention, when the processor executes the computer program stored thereon, normal voice samples and recorded-and-replayed voice samples are first acquired and preprocessed to obtain feature vectors of acoustic parameters as a sample set; a support vector machine is trained on the preprocessed sample set to obtain a support vector machine model; and a voice sample to be tested is acquired and input into the support vector machine model to obtain a recognition result, so that the generality of detection can be greatly improved without being affected by device type or text content.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A detection method of replay attack voice is characterized by comprising the following steps:
acquiring a normal voice sample and a recording playback voice sample;
preprocessing a voice sample to obtain a feature vector of an acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter comprises: amplitude perturbations, glottal noise features, and a smoothed cepstrum;
training by a support vector machine based on the preprocessed sample set to obtain a support vector machine model;
and acquiring a voice sample to be detected, and inputting the voice sample to be detected into the support vector machine model to obtain a recognition result.
2. The method for detecting replay attack speech according to claim 1, wherein preprocessing the speech samples to obtain feature vectors of the acoustic parameters comprises:
extracting voiced segments in the voice samples;
obtaining a pitch period by adopting an autocorrelation function;
framing according to the pitch period;
and carrying out feature extraction on the voice signal after framing to obtain amplitude perturbation and glottal noise features.
3. The method for detecting replay attack speech according to claim 2, wherein the step of performing feature extraction on the framed speech signal to obtain the amplitude perturbation comprises:
obtaining the intra-frame amplitude peak value after framing;
calculating the differences between the amplitude peaks to obtain the amplitude perturbation, wherein the amplitude perturbation comprises: the relative value of the amplitude perturbation and the three-point amplitude perturbation quotient.
4. The method for detecting replay attack speech according to claim 3, wherein the relative value of the amplitude perturbation is obtained by the following formula:

$$SL = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A(i)-A(i+1)\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

where SL represents the relative value of the amplitude perturbation, A(i) (i = 1, 2, 3, ..., N) represents the peak amplitude parameters, and N represents the number of amplitude peaks;

and the three-point amplitude perturbation quotient is obtained by the following formula:

$$APQ3 = \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A(i)-\frac{A(i-1)+A(i)+A(i+1)}{3}\right|}{\frac{1}{N}\sum_{i=1}^{N}A(i)}\times 100\%$$

wherein APQ3 represents the three-point amplitude perturbation quotient.
5. The method for detecting replay attack speech according to claim 2, wherein feature extraction is performed on the framed speech signal to obtain a glottal noise feature, wherein the glottal noise feature includes: the average autocorrelation and the harmonic-to-noise ratio, including:
acquiring autocorrelation of each frame wave;
calculating the average value of all the autocorrelation to be used as the average autocorrelation;
calculating an average amplitude sequence of the time domain signal wave amplitude sequence truncated according to the period, and calculating the difference value of each truncated signal and the average wave;
and acquiring a harmonic-to-noise ratio according to the average amplitude sequence of the time domain signal wave amplitude sequence truncated according to the period and the difference value between each truncated signal and the average wave.
6. The method for detecting replay attack speech according to claim 5, wherein the harmonic-to-noise ratio is calculated using the following formulas:

$$HNR = 10\log_{10}\frac{\sum_{\tau=1}^{M}f_{A}^{2}(\tau)}{\frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=1}^{M}d_{i}^{2}(\tau)}$$

$$d_{i}(\tau)=f_{i}(\tau)-f_{A}(\tau)$$

$$f_{A}(\tau)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\tau)$$

where HNR represents the harmonic-to-noise ratio, f_i(τ) represents the i-th period of the signal, N represents the number of signal periods, M represents the length of one period in samples at the sampling rate, f_A(τ) represents the average amplitude sequence of the period-truncated time-domain waves obtained by segmenting the signal according to the calculated pitch period, and d_i(τ) represents the difference between each truncated signal and the average wave.
7. The method for detecting replay attack speech according to claim 1, wherein preprocessing the speech samples to obtain feature vectors of the acoustic parameters comprises:
extracting voiced segments in the voice samples;
framing the voiced segments;
performing discrete Fourier transform on the framed signal to obtain a frequency spectrum;
taking logarithm of the frequency spectrum, and performing inverse discrete Fourier transform to obtain a cepstrum;
and smoothing the cepstrum to obtain the smoothed cepstrum.
8. The method for detecting replay attack speech according to claim 1, wherein training by a support vector machine based on the preprocessed sample set to obtain a support vector machine model comprises:
performing five-fold cross validation on the training data by adopting a grid search method so as to optimize parameters of a support vector machine model;
and obtaining a decision hyperplane according to the support vector machine model.
9. The method for detecting replay attack speech according to claim 8, wherein inputting the speech sample to be detected into the support vector machine model to obtain a recognition result comprises:
obtaining the relative position of the voice sample to be tested and the decision hyperplane;
and identifying the voice sample to be detected according to the relative position.
10. A playback attack voice detection apparatus, comprising:
the acquisition module is used for acquiring a normal voice sample and a recording playback voice sample;
a processing module, configured to pre-process a voice sample to obtain a feature vector of an acoustic parameter as a sample set, wherein the feature vector of the acoustic parameter comprises: amplitude perturbations, glottal noise features, and a smoothed cepstrum;
the model training module is used for training on the basis of the preprocessed sample set through the support vector machine to obtain a support vector machine model;
and the recognition module is used for acquiring a voice sample to be detected and inputting the voice sample to be detected into the support vector machine model so as to obtain a recognition result.
CN202011455471.3A 2020-12-10 Method and device for detecting replay attack voice Active CN112599149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011455471.3A CN112599149B (en) 2020-12-10 Method and device for detecting replay attack voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011455471.3A CN112599149B (en) 2020-12-10 Method and device for detecting replay attack voice

Publications (2)

Publication Number Publication Date
CN112599149A true CN112599149A (en) 2021-04-02
CN112599149B CN112599149B (en) 2024-06-04

Family


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN108154879A (en) * 2017-12-26 2018-06-12 广西师范大学 A kind of unspecified person speech-emotion recognition method based on cepstrum separation signal
CN111640439A (en) * 2020-05-15 2020-09-08 南开大学 Deep learning-based breath sound classification method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN108154879A (en) * 2017-12-26 2018-06-12 广西师范大学 A kind of unspecified person speech-emotion recognition method based on cepstrum separation signal
CN111640439A (en) * 2020-05-15 2020-09-08 南开大学 Deep learning-based breath sound classification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Shanshan: "SVM-based assessment and classification of the severity of pathological voice disorders", Instrument Technique, no. 06, pages 1
CHEN Di; GONG Weiguo; LI Bo: "High-frequency weighted MFCC extraction for noise-robust speaker recognition", Chinese Journal of Scientific Instrument, no. 03, pages 5

Similar Documents

Publication Publication Date Title
Tan et al. rVAD: An unsupervised segment-based robust voice activity detection method
Shiota et al. Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification
Zhang et al. The effect of silence and dual-band fusion in anti-spoofing system
Ajmera et al. Text-independent speaker identification using Radon and discrete cosine transforms based features from speech spectrogram
US8428945B2 (en) Acoustic signal classification system
Shiota et al. Voice Liveness Detection for Speaker Verification based on a Tandem Single/Double-channel Pop Noise Detector.
CN108986824B (en) Playback voice detection method
Patel et al. Cochlear filter and instantaneous frequency based features for spoofed speech detection
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
Paul et al. Countermeasure to handle replay attacks in practical speaker verification systems
EP3430612B1 (en) Apparatus and method for harmonic-percussive-residual sound separation using a structure tensor on spectrograms
Hanilçi et al. Optimizing acoustic features for source cell-phone recognition using speech signals
KR20100036893A (en) Speaker cognition device using voice signal analysis and method thereof
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Alonso-Martin et al. Multidomain voice activity detection during human-robot interaction
Srinivas et al. Combining phase-based features for replay spoof detection system
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
Yarra et al. A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection
Singh et al. Linear Prediction Residual based Short-term Cepstral Features for Replay Attacks Detection.
Srinivas et al. Relative phase shift features for replay spoof detection system
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN112599149B (en) Method and device for detecting replay attack voice
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN112599149A (en) Detection method and device for replay attack voice
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant