CN111653289B - Playback voice detection method - Google Patents

Playback voice detection method

Info

Publication number
CN111653289B
CN111653289B
Authority
CN
China
Prior art keywords
speech
training
cepstrum coefficient
features
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010479392.XA
Other languages
Chinese (zh)
Other versions
CN111653289A (en)
Inventor
王让定
胡君
严迪群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huzhou Chuangguan Technology Co ltd
Kong Fanbin
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University
Priority to CN202010479392.XA
Publication of CN111653289A
Application granted
Publication of CN111653289B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a playback voice detection method comprising the following steps: 1) a training stage: 1.1) input training speech samples, which comprise original speech and playback speech; 1.2) extract the cepstral features of the training speech samples; 1.3) train a residual network model on the extracted features to obtain the network model parameters; 2) a testing stage: 2.1) input a test speech sample; 2.2) extract the cepstral features of the test speech sample; 2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample; 2.4) determine whether the test speech sample is playback speech. Compared with the prior art, the invention has the advantage that, based on deep learning, combining the cepstral features of the speech signal with a deep residual network effectively improves the detection performance of the system, and the algorithm has better robustness.

Description

Playback voice detection method
Technical Field
The invention relates to a voice detection technology, in particular to a playback voice detection method.
Background
With the continuous development of voiceprint authentication technology, fraudulent voice attacks against it are becoming more and more serious. A playback voice attack relies on using a recording device to covertly record a legitimate user's voice when the user enters the system, and then playing it back with high-fidelity playback equipment to attack the voiceprint authentication system. Playback voice attacks are easy to mount: the voice samples come from real speech, the operation is simple, the recordings are easy to obtain, and the attacker needs no specialized knowledge.
Early playback voice detection technologies were based on traditional hand-crafted features. However, traditional hand-crafted feature extraction cannot express the high-level semantics of speech well, so most current playback voice detection techniques are based on deep learning, using neural network models to extract deeper representations from shallow features and thereby improve the performance of playback detection algorithms. Prior art 1: Lavrentyeva G, Novoselov S, Malykh E, et al. Audio Replay Attack Detection with Deep Learning Frameworks [C] // Interspeech. 2017: 82-86. The authors propose a lightweight convolutional neural network (LCNN) that applies a max-out operation to each convolutional layer to build Max-Feature-Map (MFM) layers, eliminating the weaker outputs through competitive learning. Prior art 2: Jung J, Shim H, Heo H S, et al. Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge [J]. arXiv preprint arXiv:1904.10134, 2019. Prior art 3: Chettri B, Stoller D, Morfi V, et al. Ensemble models for spoofing detection in automatic speaker verification [J]. arXiv preprint arXiv:1904.04589, 2019.
Prior art 1 suffers from weak robustness: the detection algorithm does not perform well on the evaluation set, mainly because the model's generalization ability is limited, so detection performance drops significantly when facing the unknown attacks in the evaluation set. Prior art 2 and prior art 3 suffer from complex detection algorithms. In prior art 2, the authors re-partition the data set so that the playback attacks in the test set and the training set do not overlap, but the partitioning process is complicated and therefore time-consuming. In prior art 3, the authors propose a multi-feature, multi-task learning scheme that performs auxiliary classification tasks during network training to assist the binary decision, but the detection algorithm is relatively complex.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, a playback voice detection method that improves the detection performance and robustness of the playback voice detection algorithm.
The technical solution adopted by the invention to solve this technical problem is as follows: a playback voice detection method, characterized in that it comprises the following steps:
1) A training stage:
1.1) input training speech samples, the training speech samples comprising original speech and playback speech;
1.2) extract the cepstral features of the training speech samples;
1.3) train a residual network model on the extracted features to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract the cepstral features of the test speech sample;
2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample;
2.4) determine whether the test speech sample is playback speech.
Preferably, in order to keep more detailed information on the frequency spectrum, in step 1.2) and step 2.2), the full-frequency cepstral coefficient features are extracted.
Further, the full-frequency cepstral coefficient features are extracted as follows: 1) the speech signal of the training speech sample or the test speech sample is framed and windowed, and a Fourier transform is applied to each framed signal to obtain its spectral coefficients X_i(k):

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients E_i(k):

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
In order to enable the training models for different features to cooperate with each other and obtain a better fusion result, in step 1.2) and step 2.2) the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features are also extracted. In the training stage, three corresponding residual networks are obtained from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively; in the testing stage, the residual network recognition results obtained from these three features are correspondingly obtained, and the fusion of the three recognition results is used for the overall decision.
According to one aspect of the invention, in step 1.2) and step 2.2), the Mel-frequency cepstral coefficient features are extracted.
According to another aspect of the invention, in step 1.2) and step 2.2), the constant-Q cepstral coefficient features are extracted.
Preferably, the residual network comprises, connected in sequence, a two-dimensional convolutional layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
Preferably, the activation function layer adopts a leaky rectified linear unit (Leaky ReLU).
Preferably, in order to increase the convergence rate of learning, a batch normalization processing layer is further provided between the two-dimensional convolution layer and the activation function layer.
To improve the detection accuracy, in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or playback speech.
Compared with the prior art, the invention has the following advantages: the neural network can extract deeper features and fully represent the detailed information in the speech signal, and combining the cepstral features of the speech signal with a deep residual network in a deep-learning framework effectively improves the detection performance of the system and makes the algorithm more robust; the residual network models distortions in the time and frequency domains well, improving the classification accuracy of the neural network; and extracting the full-frequency cepstral coefficient features requires no filter bank, so more detail in the frequency spectrum is preserved.
Drawings
FIG. 1 is a flowchart of a playback voice detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of MFCC extraction for a playback voice detection method in an embodiment of the present invention;
FIG. 3 is a flow chart of CQCC extraction of the playback voice detection method according to the embodiment of the present invention;
FIG. 4 is a flow chart of BFCC extraction of the playback voice detection method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of the residual network of the playback voice detection method according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it is to be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and simplicity in description, but do not indicate or imply that the devices or elements so referred to must have a particular orientation, be constructed and operated in a particular orientation, and that the directional terms are illustrative only and are not to be construed as limiting since the disclosed embodiments of the invention can be positioned in different orientations, e.g., "upper" and "lower" are not necessarily limited to directions opposite or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Referring to fig. 1, a playback voice detection method includes the steps of:
1) A training stage:
1.1) input training speech samples;
1.2) extract cepstral features;
1.3) train a residual network model to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract cepstral features;
2.3) perform recognition and scoring with the residual network trained in step 1);
2.4) combine the ASV system score with the score from step 2.3) to arrive at a decision.
In the above steps, the feature extraction for the training and test speech samples may be implemented with a filter bank, and the features of a speech sample are determined by the filter bank used. The filter bank can be preconfigured according to actual requirements and is used to extract features from the speech sample, such as the traditional Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features.
1. Mel-frequency cepstral coefficient features: Mel-Frequency Cepstral Coefficients (MFCC) are speech feature parameters commonly used in the field of speaker recognition. They conform to the auditory characteristics of the human ear, which has different sensitivities to sound waves of different frequencies. The Mel-frequency cepstral coefficients are cepstral coefficients extracted in the Mel-scale frequency domain, where the Mel scale reflects the nonlinear frequency perception of the human ear. The relationship between the Mel scale and frequency is:

$$F_{\mathrm{mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

where F_mel is the perceived frequency in Mel and f is the actual frequency in hertz (Hz). Converting the speech signal to this perceptual frequency domain, rather than simply taking its Fourier transform, generally better simulates human auditory processing.
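For illustration, the Mel-scale mapping above can be computed directly as a minimal Python sketch; the helper function names are illustrative and not part of the patent text:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map physical frequency (Hz) to the Mel scale, per F_mel = 2595*log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, used when placing Mel filter-bank center frequencies."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

# Example: 1 kHz corresponds to roughly 1000 Mel.
print(hz_to_mel(1000.0))   # ~1000.0
```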
Referring to fig. 2, a flow chart of a specific extraction process of MFCC is shown, comprising the following steps:
1) The signal x(n) of the speech sample (training or test speech sample) is preprocessed into frames x_i(m), and a short-time Fourier transform (STFT) is applied to each frame obtained after framing to obtain its spectral coefficients:

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}$$

where i denotes the i-th frame after framing and k denotes a frequency bin in the i-th frame, k = 0, 1, ..., N-1;

2) the energy spectrum coefficients of each frame are computed from the spectral coefficients:

$$E_i(k) = \left| X_i(k) \right|^2$$

3) the resulting energy spectrum coefficients are fed to a bank of Mel filters, where the energy in each band is computed. The Mel spectrum coefficients are obtained by multiplying the energy spectrum coefficients by the frequency response H_m(k) of each Mel filter and summing, i.e.:

$$S(i, m) = \sum_{k=0}^{N-1} E_i(k)\, H_m(k), \quad 0 \le m \le M$$

where m refers to the m-th Mel filter and there are M filters in total;

4) a logarithm operation and a DCT (discrete cosine transform) are then applied to the Mel spectrum coefficients to obtain the Mel-frequency cepstral coefficients:

$$\mathrm{MFCC}(i, n) = \sum_{m=0}^{M-1} \log S(i, m)\, \cos\!\left(\frac{\pi n (2m + 1)}{2M}\right)$$
The standard MFCC only reflects the static characteristics of the speech signal; its dynamic characteristics can be obtained from the difference (delta) coefficients of the static features. Preferably, the static MFCC coefficients are combined with the first- and second-order delta coefficients to improve the recognition performance of the system.
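A minimal Python sketch of the MFCC pipeline in steps 1)-4) above is given below; the frame length, hop size, number of Mel filters and number of retained coefficients are assumptions, not values specified by the embodiment:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame/window -> |FFT|^2 -> Mel filter bank -> log -> DCT.
    `signal` is a 1-D numpy array of samples."""
    # 1) Framing and Hamming windowing, then per-frame FFT (X_i(k)).
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n_fft)
    # 2) Energy (power) spectrum E_i(k).
    energy = np.abs(spec) ** 2
    # 3) Triangular Mel filter bank H_m(k), spaced uniformly on the Mel scale.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = energy @ fbank.T                     # S(i, m)
    # 4) Log compression followed by a DCT gives the static MFCCs.
    return dct(np.log(mel_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

First- and second-order deltas can then be appended to these static coefficients to capture the dynamic characteristics mentioned above.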
2. Constant-Q cepstral coefficient features: Constant-Q Cepstral Coefficients (CQCC) are obtained from the speech signal via the constant-Q transform (CQT), a time-frequency transform that uses a set of filters whose ratio Q of center frequency to bandwidth is constant. The frequency axis of the spectrum obtained by the CQT is nonlinear, because the center frequencies follow an exponential distribution and the window length used by the filter bank varies with frequency. When the time-domain speech signal is converted to the frequency domain, the CQT therefore provides higher frequency resolution in the low-frequency region and higher time resolution in the high-frequency region.
Referring to fig. 3, a flowchart of a specific extraction process of the constant Q cepstrum coefficient includes the following steps:
1) The perceptual frequency-domain representation of the speech sample signal x(n) (training or test speech sample) after the CQT is X^{CQ}(k), computed as:

$$X^{CQ}(k) = \frac{1}{N_k} \sum_{n=0}^{N_k - 1} w_{N_k}(n)\, x(n)\, e^{-j 2\pi n f_k / f_s}$$

where k denotes a frequency bin, k = 1, 2, ..., K, f_s is the sampling rate of the speech samples, and f_k is the center frequency of the k-th filter, which follows an exponential distribution defined as:

$$f_k = f_1 \cdot 2^{\frac{k-1}{B}}$$

where B is the number of frequency bins per octave and f_1 is the center frequency of the lowest bin. The bandwidth of the k-th bin is:

$$B_k = f_{k+1} - f_k = f_k \left( 2^{1/B} - 1 \right)$$

The Q factor is the ratio of the center frequency f_k to the bandwidth B_k and is a constant independent of k:

$$Q = \frac{f_k}{B_k} = \left( 2^{1/B} - 1 \right)^{-1}$$

The window function adopts a Hanning window. Since the time resolution gradually decreases as the frequency resolution increases, the window length N_k varies with k, with N_k = Q f_s / f_k, and the window function is defined as:

$$w_{N_k}(n) = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N_k - 1}\right), \quad 0 \le n \le N_k - 1$$
2) The amplitude of the k-th frequency bin of the i-th frame is computed:

$$\left| X_i^{CQ}(k) \right| = \sqrt{\mathrm{Re}\!\left[X_i^{CQ}(k)\right]^2 + \mathrm{Im}\!\left[X_i^{CQ}(k)\right]^2}$$
3) The logarithmic power spectrum coefficients are computed from the amplitudes:

$$\log \left| X_i^{CQ}(k) \right|^2, \quad i = 1, 2, \ldots, T_k,\; k = 1, 2, \ldots, K$$

where T_k denotes the total number of frames of the speech signal in the k-th band and K is the total number of frequency bins; here K is taken to be 420.
4) The logarithmic spectrum coefficients, which lie on a geometrically spaced frequency scale, are uniformly resampled onto a linear frequency scale; the resampled coefficients are denoted $\log\left|\tilde{X}^{CQ}(l)\right|^2$, where l = 0, 1, ..., L-1 indexes the resampled frequency bins.
5) A DCT is applied to the resampled log spectrum coefficients to obtain the constant-Q cepstral coefficients:

$$\mathrm{CQCC}(p) = \sum_{l=0}^{L-1} \log\left|\tilde{X}^{CQ}(l)\right|^2 \cos\!\left[\frac{\pi p \left(l + \tfrac{1}{2}\right)}{L}\right], \quad p = 0, 1, \ldots, L-1$$

where L is the number of frequency bins after resampling.
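A minimal Python sketch of the CQCC pipeline in steps 1)-5) above is given below. It uses librosa's constant-Q transform as a stand-in for the hand-written CQT; n_bins = 420 follows the K = 420 of this embodiment, while bins_per_octave, the resampling factor and the number of retained coefficients are assumptions:

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.signal import resample

def cqcc(signal, sr, bins_per_octave=96, n_bins=420, n_ceps=20):
    """CQCC sketch: CQT -> log power spectrum -> uniform resampling -> DCT."""
    # 1)-2) Constant-Q transform magnitudes |X^CQ_i(k)|, shape (n_bins, n_frames).
    cq = np.abs(librosa.cqt(signal, sr=sr, n_bins=n_bins,
                            bins_per_octave=bins_per_octave))
    # 3) Log power spectrum on the geometric frequency scale.
    log_power = np.log(cq ** 2 + 1e-10)
    # 4) Uniform resampling of the geometrically spaced bins onto a linear scale.
    lin = resample(log_power, 2 * n_bins, axis=0)
    # 5) DCT along the resampled frequency axis; keep the first n_ceps coefficients.
    return dct(lin, type=2, axis=0, norm='ortho')[:n_ceps].T   # (frames, n_ceps)
```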
The two features above are existing feature-extraction methods. In the invention, the features of a speech sample may also be extracted without using a filter bank, as follows:
3. Full-frequency cepstral coefficients: compared with traditional features such as CQCC and MFCC, Full-Frequency Cepstral Coefficients (BFCC) abandon the filter bank used in the traditional features; the logarithm operation and the DCT are applied directly to the spectral coefficients obtained by the Fourier transform. The advantage is that more detail in the frequency spectrum is preserved.
Referring to fig. 4, a flowchart of a specific extraction process of BFCC is shown, which comprises the following steps:
1) A segment of speech is framed and windowed, and a Fourier transform is applied to each framed speech signal to obtain its spectral coefficients:

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points; in this embodiment, N = 512;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients:

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the cepstral coefficients:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
In the BFCC feature, preferably, the log-energy coefficient and the first- and second-order difference coefficients may also be appended to form the final feature vector.
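A minimal Python sketch of this filter-free pipeline follows; N = 512 matches the embodiment, while the frame length, hop size and number of retained coefficients are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def bfcc(signal, frame_len=400, hop=160, n_fft=512, n_ceps=20):
    """BFCC sketch: frame/window -> FFT -> |.| -> log -> DCT, with no filter bank."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n_fft)          # X_i(k), N = 512 as in the embodiment
    amp = np.abs(spec)                         # amplitude spectrum E_i(k)
    static = dct(np.log(amp + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
    # Log-energy and first/second-order deltas may optionally be appended here.
    return static
```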
In the invention, the network model adopts a residual network, whose overall architecture is shown in fig. 5. When different features are used as input, the overall architecture of the network model is unchanged; only the dimensionality of the input features at the input end changes.
The residual network comprises, connected in sequence, a two-dimensional convolutional layer, four identical residual block sequences, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer (classifier). The Dropout rate is set to 0.5, and the network output layer preferably adopts a Softmax layer.
The activation function layer adopts a leaky rectified linear unit (Leaky ReLU), which evolved from the rectified linear unit (ReLU) activation function. The ReLU activation function activates a neuron only when the input exceeds a threshold, setting all negative values to 0 while leaving positive values unchanged. However, ReLU has a significant drawback during training: when the input is negative, learning becomes very slow or the neuron stops working entirely, its weights can no longer be updated, the neuron no longer activates on any data, and its gradient stays 0. To address this drawback of the ReLU function, the invention adopts the Leaky ReLU activation function in the network. Unlike ReLU, which sets all negative values to 0, Leaky ReLU assigns a very small non-zero slope to all negative values, which solves the problem that the neuron does not learn when the ReLU input is negative. Its mathematical expression is:
$$f(x) = \begin{cases} x, & x \ge 0 \\ a x, & x < 0 \end{cases}$$

where a is a small positive constant slope.
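A one-line numerical illustration of this piecewise definition (the slope value a = 0.01 is an assumption):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Leaky ReLU: identity for x >= 0, small non-zero slope a for x < 0."""
    return np.where(x >= 0, x, a * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))   # [-0.02  0.    3.  ]
```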
In addition, the GRU layer not only re-aggregates the frame-level features extracted by the preceding layers into a single utterance-level feature, but is also a simple model, which makes it well suited to building deeper networks. Within the residual network, its two gates also make the whole network more efficient and the computation less time-consuming, and noticeably speed up model convergence during training.
The three extracted cepstral features are fed separately to the input of a residual network. Frame-level time-frequency features are first extracted by the two-dimensional convolutional layer; the output of that layer is then fed to four identical residual block sequences to enable deeper training of the network, and the output of the last residual block is fed, in order, to the Dropout layer, the first fully connected layer, the activation function layer and the GRU layer. After the GRU layer, the utterance-level features are mapped to a new space by the second fully connected layer, which has only two node units and produces the class logits; finally, these logits are fed to the Softmax layer to be converted into a score probability distribution.
Because the gradient may vanish during training when the network has too many layers, a batch normalization (BN) layer is added to the residual network. The BN layer pulls distributions that have drifted out of the normal range back into a standardized range; it is placed between the two-dimensional convolutional layer and the activation function layer, so that the data lie in a region where the activation function is sensitive, increasing the gradients and accelerating the convergence of learning.
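A minimal PyTorch sketch of a network with the described layer order is given below; the channel count, hidden size, kernel sizes and the internal structure of each residual block are assumptions not specified in the text:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: two conv-BN stages with Leaky ReLU and an identity shortcut (structure assumed)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.01),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        return self.act(x + self.body(x))

class ReplayDetector(nn.Module):
    """Sketch of the described stack: Conv2d -> BN -> LeakyReLU -> 4 residual blocks
    -> Dropout(0.5) -> FC -> LeakyReLU -> GRU -> FC (2 units) -> Softmax."""
    def __init__(self, feat_dim=20, ch=32, hidden=64):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.01),
            *[ResBlock(ch) for _ in range(4)],
            nn.Dropout(0.5),
        )
        self.fc1 = nn.Linear(ch * feat_dim, hidden)
        self.act = nn.LeakyReLU(0.01)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, x):                         # x: (batch, 1, frames, feat_dim)
        h = self.front(x)                         # (batch, ch, frames, feat_dim)
        h = h.permute(0, 2, 1, 3).flatten(2)      # frame-level features: (batch, frames, ch*feat_dim)
        h = self.act(self.fc1(h))
        _, h_n = self.gru(h)                      # aggregate frames into one utterance-level vector
        logits = self.fc2(h_n[-1])                # (batch, 2)
        return torch.softmax(logits, dim=-1)      # score probability distribution
```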
In the training phase, the Adam algorithm can be used to optimize the residual network. A learning rate of 10^-4 is used, the batch size is 32, and the training process is stopped after 50 epochs. The loss function is the binary cross-entropy between the predicted value and the target value, and the node outputs of the last fully connected layer are used as the prediction score.
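A minimal training-loop sketch under those stated settings might look as follows; the DataLoader, the label convention (1 = playback) and the device handling are assumptions:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-4, device="cpu"):
    """Training sketch: Adam, learning rate 1e-4, batch size 32 (set in the DataLoader),
    50 epochs, binary cross-entropy on the two-class output."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                      # binary cross-entropy between prediction and target
    for epoch in range(epochs):
        for feats, labels in train_loader:      # feats: (32, 1, frames, feat_dim); labels: 0/1
            feats, labels = feats.to(device), labels.to(device)
            probs = model(feats)[:, 1]          # probability assigned to the "playback" class
            loss = loss_fn(probs, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```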
In the training stage of step 1), the features are extracted from the training speech samples (a training set containing both original speech and playback speech) and fed to the neural network to train the residual network models for original and playback speech. In the testing stage, the features of the test speech sample are extracted and fed to the residual network model trained in the training stage, the test speech is classified according to the score produced by the network output layer, and this result is combined with the score of an ASV (Automatic Speaker Verification) system as the final decision on whether the test speech sample is playback speech.
Through the above training and testing processes, a subsystem is obtained for each of the three cepstral features. The scores of the three subsystems are fused at the score level according to the following formula:

$$S = i \cdot S_{\mathrm{BFCC}} + j \cdot S_{\mathrm{MFCC}} + k \cdot S_{\mathrm{CQCC}}$$

where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the respective subsystems. By making the training models of different features cooperate with each other, a better fusion result is obtained.
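As a sketch, the score-level fusion could be computed as below; the weight values are illustrative placeholders subject only to the stated constraint i + j + k = 1:

```python
def fuse_scores(s_bfcc, s_mfcc, s_cqcc, w=(0.4, 0.3, 0.3)):
    """Weighted fusion S = i*S_BFCC + j*S_MFCC + k*S_CQCC with i + j + k = 1.
    The subsystem scores are assumed to be normalized; the weights are illustrative."""
    i, j, k = w
    assert abs(i + j + k - 1.0) < 1e-9
    return i * s_bfcc + j * s_mfcc + k * s_cqcc

# The fused countermeasure score can then be combined with the ASV system score
# to decide whether the test utterance is playback speech, as described above.
```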

Claims (6)

1. A playback voice detection method, characterized in that it comprises the following steps:
1) A training stage:
1.1) input training speech samples, the training speech samples comprising original speech and playback speech;
1.2) extract the cepstral features of the training speech samples, the cepstral features comprising full-frequency cepstral coefficient features, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features, and obtain three corresponding residual networks from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively;
1.3) train a residual network model on the extracted features to obtain the network model parameters;
2) A testing stage:
2.1) input a test speech sample;
2.2) extract the cepstral features of the test speech sample, the cepstral features comprising full-frequency cepstral coefficient features, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features, and obtain residual network recognition results from the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features, respectively;
2.3) use the residual network trained in step 1) to score the extracted features of the test speech sample: through step 1.2) and step 2.2), subsystems for the three cepstral features are obtained, and the scores of the three subsystems are fused according to the following formula:

$$S = i \cdot S_{\mathrm{BFCC}} + j \cdot S_{\mathrm{MFCC}} + k \cdot S_{\mathrm{CQCC}}$$

where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the full-frequency cepstral coefficient subsystem, the Mel-frequency cepstral coefficient subsystem and the constant-Q cepstral coefficient subsystem, respectively;
2.4) determine whether the test speech sample is playback speech.
2. The playback voice detection method according to claim 1, characterized in that the full-frequency cepstral coefficient features are extracted as follows: 1) the speech signal of the training speech sample or the test speech sample is framed and windowed, and a Fourier transform is applied to each framed speech signal to obtain its spectral coefficients X_i(k):

$$X_i(k) = \sum_{m=0}^{N-1} x_i(m)\, e^{-j 2\pi k m / N}, \quad k = 0, 1, \ldots, N-1$$

where i denotes the i-th frame after framing, k denotes a frequency bin in the i-th frame, j is the imaginary unit, m indexes the samples of the framed speech signal, and N is the number of Fourier transform points;

2) the absolute value is then taken to obtain the corresponding amplitude spectrum coefficients E_i(k):

$$E_i(k) = \left| X_i(k) \right|$$

3) a logarithm operation and a DCT are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame:

$$\mathrm{BFCC}(i) = \mathrm{DCT}\big(\log E_i(k)\big)$$
3. The playback voice detection method according to claim 1 or 2, characterized in that the residual network comprises, connected in sequence, a two-dimensional convolutional layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
4. The playback voice detection method according to claim 3, characterized in that the activation function layer adopts a leaky rectified linear unit (Leaky ReLU).
5. The playback voice detection method according to claim 3, characterized in that: there is also a batch normalization layer between the two-dimensional convolution layer and the activation function layer.
6. The playback voice detection method according to claim 1 or 2, characterized in that: in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
CN202010479392.XA 2020-05-29 2020-05-29 Playback voice detection method Active CN111653289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010479392.XA CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Publications (2)

Publication Number Publication Date
CN111653289A CN111653289A (en) 2020-09-11
CN111653289B true CN111653289B (en) 2022-12-27

Family

ID=72344774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010479392.XA Active CN111653289B (en) 2020-05-29 2020-05-29 Playback voice detection method

Country Status (1)

Country Link
CN (1) CN111653289B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822587B (en) * 2021-01-19 2023-07-14 四川大学 Audio characteristic compression method based on constant Q transformation
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
CN113096692B (en) * 2021-03-19 2024-05-28 招商银行股份有限公司 Voice detection method and device, equipment and storage medium
CN113506583B (en) * 2021-06-28 2024-01-05 杭州电子科技大学 Camouflage voice detection method using residual error network
CN113284486B (en) * 2021-07-26 2021-11-16 中国科学院自动化研究所 Robust voice identification method for environmental countermeasure
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice
CN115022087B (en) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 Voice recognition verification processing method and device
CN117153190B (en) * 2023-10-27 2024-01-19 广东技术师范大学 Playback voice detection method based on attention mechanism combination characteristics

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN109920447B (en) * 2019-01-29 2021-07-13 天津大学 Recording fraud detection method based on adaptive filter amplitude phase characteristic extraction
CN109935233A (en) * 2019-01-29 2019-06-25 天津大学 A kind of recording attack detection method based on amplitude and phase information

Also Published As

Publication number Publication date
CN111653289A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111653289B (en) Playback voice detection method
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN108831443B (en) Mobile recording equipment source identification method based on stacked self-coding network
CN110459241B (en) Method and system for extracting voice features
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN111048097A (en) Twin network voiceprint recognition method based on 3D convolution
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
CN118486297B (en) Response method based on voice emotion recognition and intelligent voice assistant system
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN109300470A (en) Audio mixing separation method and audio mixing separator
CN116778956A (en) Transformer acoustic feature extraction and fault identification method
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
CN118173092A (en) Online customer service platform based on AI voice interaction
CN118098247A (en) Voiceprint recognition method and system based on parallel feature extraction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230821

Address after: Room 502, No. 3 Pulan 1st Street, Chancheng District, Foshan City, Guangdong Province, 528000

Patentee after: Kong Fanbin

Address before: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee before: Huzhou Chuangguan Technology Co.,Ltd.

Effective date of registration: 20230821

Address after: Room 337, Building 3, No. 266, Zhenxing Road, Yuyue Town, Deqing County, Huzhou City, Zhejiang Province, 313000

Patentee after: Huzhou Chuangguan Technology Co.,Ltd.

Address before: 315211, Fenghua Road, Jiangbei District, Zhejiang, Ningbo 818

Patentee before: Ningbo University