Disclosure of Invention
In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a playback voice detection method that improves the detection performance and robustness of playback voice detection algorithms.
The technical solution adopted by the present invention to solve the above technical problem is a playback voice detection method, characterized in that the method comprises the following steps:
1) a training stage:
1.1) inputting training voice samples, wherein the training voice samples comprise original voice and playback voice;
1.2) extracting cepstrum characteristics of a training voice sample;
1.3) training a residual network model on the extracted features to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics of the test voice sample;
2.3) using the residual network trained in step 1) to classify and score the features extracted from the test voice sample;
2.4) judging whether the test voice sample is playback voice.
Preferably, in order to retain more of the detailed information in the spectrum, full-frequency cepstral coefficient features are extracted in step 1.2) and step 2.2).
Further, the extraction method of the full-frequency cepstral coefficient features comprises the following steps: 1) framing and windowing the voice signal of the training voice sample or the test voice sample, and then performing a Fourier transform on each framed voice signal to obtain the spectral coefficients X_i(k) of the voice signal,
where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, k = 0, 1, 2, ..., N-1, j denotes the imaginary unit, m denotes the number of frames after the voice signal is framed, and N denotes the number of Fourier transform points;
2) the absolute value is then taken to obtain the corresponding magnitude spectral coefficients E_i(k);
3) a logarithm operation and a DCT transform are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame.
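In conventional notation (using n for the sample index within frame i, and assuming the standard N-point DFT and a type-II DCT, since only the operations themselves are named above), these three steps correspond to:

```latex
\begin{aligned}
X_i(k) &= \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1, \\
E_i(k) &= \lvert X_i(k) \rvert, \\
\mathrm{BFCC}(i) &= \mathrm{DCT}\bigl(\log E_i(k)\bigr).
\end{aligned}
```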
Further, in order to enable training models based on different features to complement one another and obtain a better fused result, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features are also extracted in step 1.2) and step 2.2). In the training stage, three residual networks are obtained, trained respectively on the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features; in the testing stage, the recognition results of the residual networks corresponding to the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features are obtained, and the three recognition results are fused for a comprehensive decision.
According to one aspect of the invention, in step 1.2) and step 2.2), the extracted features are mel-frequency cepstral coefficients.
According to another aspect of the invention, in step 1.2) and step 2.2), constant-Q cepstral coefficient features are extracted.
Preferably, the residual network comprises, connected in sequence, a two-dimensional convolution layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
Preferably, the activation function layer uses a leaky rectified linear unit (LeakyReLU).
Preferably, in order to increase the convergence rate of learning, a batch normalization layer is further provided between the two-dimensional convolution layer and the activation function layer.
To improve the detection accuracy, in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
Compared with the prior art, the present invention has the following advantages: the neural network can extract deeper features and more completely represent the detailed information in the voice signal, and by combining cepstral features of the voice signal with a deep residual network in a deep-learning framework, the detection performance of the system is effectively improved and the algorithm is more robust; the residual network can model distortions in both the time domain and the frequency domain well, thereby improving the classification accuracy of the neural network; and by extracting full-frequency cepstral coefficient features, no filter bank is needed, so more of the detailed information in the spectrum is retained.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like indicate orientations and positional relationships based on those shown in the drawings; they are used only for convenience in describing the present invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation. These directional terms are used for purposes of illustration and are not to be construed as limiting; for example, because the disclosed embodiments of the present invention may be oriented in different directions, "lower" is not necessarily limited to a direction opposite to or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Referring to fig. 1, a playback voice detection method includes the steps of:
1) a training stage:
1.1) inputting a training voice sample;
1.2) extracting cepstrum characteristics;
1.3) training a residual network model to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics;
2.3) using the residual network trained in step 1) to classify and score;
2.4) combining the ASV system score with the score from step 2.3) to obtain the decision.
In the above steps, feature extraction for the training speech samples and the test speech samples may be implemented with a filter bank; that is, the features of a speech sample are determined according to the filter bank, which may be preconfigured according to actual requirements and is used to extract features from the speech sample, such as the traditional Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features.
1. Mel-frequency cepstral coefficient features: Mel-frequency cepstral coefficients (MFCCs) are speech feature parameters commonly used in the field of speaker recognition. They conform to the auditory characteristics of the human ear, which has different sensitivities to sound waves of different frequencies. Mel-frequency cepstral coefficients are cepstral coefficients extracted in the Mel-scale frequency domain; the Mel scale reflects the nonlinear characteristics of human auditory frequency perception, and its relationship to frequency is expressed by the following formula:
where F_mel is the perceived frequency in Mel and f is the actual frequency in hertz (Hz). Converting the speech signal to the perceptual frequency domain, rather than simply representing it by a Fourier transform, generally simulates human auditory processing better.
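The conventional form of this Mel-scale mapping, as used throughout the speech-processing literature (the constants below are the standard ones), is:

```latex
F_{\mathrm{mel}} = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
```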
Referring to fig. 2, a flow chart of a specific extraction process of MFCC is shown, comprising the following steps:
1) the signal x(n) of the speech sample (training speech sample or test speech sample) is preprocessed and framed into x_i(m); a short-time Fourier transform (STFT) is then applied to each frame obtained after framing to yield the spectral coefficients of that frame;
where i denotes the i-th frame after framing and k denotes the frequency bin within the i-th frame, k = 1, 2, ...;
2) the magnitude spectral coefficients of each frame are computed from the spectral coefficients obtained above:
3) the resulting energy spectral coefficients are fed into a bank of Mel filters, where the energy is calculated; the Mel spectral coefficients are obtained by multiplying the energy spectral coefficients by the frequency response H_m(k) of each Mel filter and summing, i.e.:
where 0 ≤ m ≤ M indexes the m-th Mel filter, with M filters in total;
4) a logarithm operation and a DCT (discrete cosine transform) are then applied to the Mel spectral coefficients to obtain the Mel-frequency cepstral coefficients:
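For reference, the textbook formulas corresponding to steps 1) to 4) (with n indexing the samples within frame i and l indexing the cepstral coefficients; the exact normalization, and whether the magnitude or power spectrum is used, may differ from the original formulas) are:

```latex
\begin{aligned}
X_i(k) &= \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}, \\
E_i(k) &= \lvert X_i(k) \rvert^{2}, \\
S_i(m) &= \sum_{k} E_i(k)\, H_m(k), \qquad 0 \le m \le M, \\
\mathrm{MFCC}_i(l) &= \sum_{m=0}^{M-1} \log S_i(m)\, \cos\!\left(\frac{\pi\, l\, (m + 0.5)}{M}\right).
\end{aligned}
```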
the standard MFCC only reflects the static characteristics of parameters in a voice signal, the dynamic characteristics of the voice signal can be obtained through the differential spectral coefficients of the static characteristics, and the static coefficients of the MFCC characteristics can be preferably combined with the dynamic characteristics of first-order coefficients and second-order coefficients to improve the recognition performance of the system.
2. Constant-Q cepstral coefficient features: constant-Q cepstral coefficients (CQCCs) are obtained from the constant-Q transform (CQT), a time-frequency transform of the speech signal. The CQT uses a set of filters whose ratio of center frequency to bandwidth is a constant Q. The frequency axis of the spectrum obtained by the CQT is nonlinear, because the center frequencies follow an exponential distribution and the window length used by the filter bank varies with frequency. When the time-domain speech signal is converted into the frequency domain, this provides higher frequency resolution in the low-frequency region and higher time resolution in the high-frequency region.
Referring to fig. 3, a flowchart of a specific extraction process of the constant Q cepstrum coefficient includes the following steps:
1) the signal x(n) of the speech sample (training speech sample or test speech sample) is converted by the CQT into the perceptual frequency-domain signal X^CQ(k); X^CQ(k) is calculated as follows:
where k denotes the frequency bin, k = 1, 2, ..., K, f_s is the sampling rate of the speech samples, and f_k is the center frequency of the filter, which follows an exponential distribution and is defined as follows:
where B is the number of frequency bins per octave, and f_1 is the center frequency of the lowest frequency bin, calculated by the following formula:
The Q factor, the ratio of the center frequency f_k to the bandwidth B_k, is a constant independent of k and is defined by the following formula:
The window function is a Hanning window; since the time resolution gradually decreases as the frequency resolution increases, the window length N_k is a function of k and decreases as k increases, and the window function is defined as follows:
2) the magnitude of the frequency value X_i(k) at the k-th frequency bin of the i-th frame is calculated:
3) the logarithmic spectral coefficients are then computed from the magnitudes:
where T_k denotes the total number of frames of the speech signal in the k-th band, k = 1, 2, ...;
4) And uniformly resampling the obtained logarithmic spectrum coefficients, wherein the new frequency representation is related to the original frequency representation as follows:
5) performing DCT on the resampled spectral coefficients, namely:
where p = 0, 1, ..., L-1, and L denotes the number of frequency bins after resampling.
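For reference, the standard constant-Q transform definitions (Brown's formulation, which the variable descriptions above follow; the exact normalization used here may differ) are:

```latex
\begin{aligned}
X^{CQ}(k) &= \frac{1}{N_k} \sum_{n=0}^{N_k - 1} x(n)\, w_{N_k}(n)\, e^{-j 2\pi Q n / N_k}, \\
f_k &= f_1 \cdot 2^{\frac{k-1}{B}}, \\
Q &= \frac{f_k}{B_k} = \frac{1}{2^{1/B} - 1}, \\
N_k &= \left\lceil \frac{Q f_s}{f_k} \right\rceil,
\end{aligned}
```

where w_{N_k}(n) denotes a Hanning window of length N_k.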
The two feature types above are existing feature extraction methods; in the present invention, the following feature extraction method can also be adopted:
in addition, the features of the speech sample may also be extracted without using a filter, specifically as follows:
3. Full-frequency cepstral coefficient features: compared with traditional features such as CQCC and MFCC, the full-frequency cepstral coefficients (BFCC) abandon the filter bank used in the traditional features; that is, the logarithm operation and DCT transform are applied directly to the spectral coefficients obtained by the Fourier transform. The advantage is that more of the detailed information in the spectrum is retained.
Referring to fig. 4, a flow chart of a specific extraction process of BFCC is shown, which includes the following steps:
1) a section of voice is subjected to framing and windowing, and then Fourier transform is carried out on a framed voice signal to obtain a spectral coefficient:
where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, k = 0, 1, 2, ..., N-1, j denotes the imaginary unit, m denotes the number of frames after the voice signal is framed, and N denotes the number of Fourier transform points; in this embodiment, N is 512;
2) the absolute value of the spectral coefficients is then taken to obtain the corresponding magnitude spectral coefficients:
3) further carrying out logarithm operation and DCT transformation to obtain the cepstrum coefficient:
in the BFCC feature, preferably, the logarithmic energy coefficient and the first and second order difference coefficients of the feature may also be added as the final feature vector.
In the present invention, the network model is a residual network whose overall architecture is shown in fig. 5. When different features are used as input, the overall architecture of the network model is unchanged; only the dimension of the input features at the input end changes.
The residual network comprises, connected in sequence, a two-dimensional convolution layer, a sequence of four identical residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer (classifier). The Dropout rate is set to 0.5, and the network output layer preferably uses a Softmax layer.
The activation function layer uses a leaky rectified linear unit (LeakyReLU), which evolved from the rectified linear unit (ReLU) activation function. The ReLU activation function activates a neuron only when the input exceeds a threshold: all negative values are set to 0, while positive values are passed through unchanged. However, it has a significant drawback during training: when the input is negative, learning with ReLU becomes very slow or the neuron stops functioning altogether, so its weights can no longer be updated, the neuron no longer activates for any data, and its gradient remains 0. Therefore, to overcome this drawback of the ReLU function, the present invention uses the LeakyReLU activation function in the network. Unlike the ReLU function, which sets all negative values to 0, the LeakyReLU function assigns a very small non-zero slope to negative values, which solves the problem that the neuron does not learn when the ReLU input is negative. Its mathematical expression is as follows:
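In the standard formulation (the slope α for negative inputs is a small positive constant, commonly 0.01; its exact value is not fixed here), the LeakyReLU is:

```latex
f(x) =
\begin{cases}
x,        & x \ge 0, \\
\alpha x, & x < 0.
\end{cases}
```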
in addition, the GRU layer can not only re-aggregate the frame-level features extracted from the upper layer into a single speech feature, but also has a simple model, and is very suitable for constructing a deeper network. In the residual error network, the two gates of the residual error network can also enable the efficiency of the whole network to be higher, the calculation is more time-saving, and the convergence speed of the model during training is obviously accelerated.
Each of the three extracted cepstral features is fed to the input of its residual network. Frame-level time-frequency features are first extracted by the two-dimensional convolution layer; the output of this layer is then passed through the four identical residual blocks to enable deeper training of the network, and the output of the last residual block is fed in sequence to the Dropout layer, the first fully connected layer, the activation function layer and the GRU layer. After the GRU layer, the utterance-level features are mapped into a new space by the second fully connected layer, converted into new features and fed into an output layer with only two node units to produce the classification logits; finally, the output of the second fully connected layer is fed into the Softmax layer, which converts the logits into a score probability distribution.
Because vanishing gradients may occur during training when the network is too deep, a batch normalization (BN) layer is added to the residual network. The BN layer uses a standardization operation to pull distributions that have drifted out of the normal range back into a standardized range. It is placed between the two-dimensional convolution layer and the activation function layer so that the data falls within the region where the activation function is sensitive, which enlarges the gradients and accelerates the convergence of learning.
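As an illustration of this architecture, the following is a minimal PyTorch sketch. The layer ordering follows the description above (Conv2d, BN, four identical residual blocks, Dropout with rate 0.5, a first fully connected layer, LeakyReLU, a GRU layer, a second fully connected layer with two output nodes, and Softmax); the internal layout of each residual block, the channel counts and the hidden sizes are illustrative assumptions rather than values given in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block; its internal layout is an assumption, since the text
    only states that four identical residual blocks are used."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                   # identity shortcut

class PlaybackDetector(nn.Module):
    """Sketch of the overall network: Conv2d -> BN -> LeakyReLU -> 4 residual
    blocks -> Dropout(0.5) -> FC -> LeakyReLU -> GRU -> FC(2) -> Softmax."""
    def __init__(self, n_ceps=90, channels=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)         # BN between conv and activation
        self.act = nn.LeakyReLU(0.01)
        self.res_blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(channels * n_ceps, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, 2)            # two output node units

    def forward(self, x):                          # x: (batch, 1, time, n_ceps)
        x = self.act(self.bn(self.conv(x)))        # frame-level time-frequency features
        x = self.res_blocks(x)
        x = self.dropout(x)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.act(self.fc1(x))
        _, h = self.gru(x)                         # aggregate frames to utterance level
        logits = self.fc2(h[-1])
        return torch.softmax(logits, dim=-1)       # score probability distribution
```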
In the training stage, the residual network can be optimized with the Adam algorithm. A learning rate of 10^-4 is used, the batch size is 32, and the training process is stopped after 50 epochs. The loss function is the binary cross-entropy between the predicted value and the target value, and the node outputs of the last fully connected layer are used as the prediction scores.
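A corresponding training-loop sketch with the stated hyperparameters (Adam, learning rate 10^-4, batch size 32, 50 epochs, binary cross-entropy loss) might look as follows; `train_loader` and the use of the second Softmax output as the prediction score are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = PlaybackDetector()                                  # network sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate 10^-4
criterion = nn.BCELoss()                                    # binary cross-entropy

# train_loader is assumed to yield (features, labels) batches of size 32
for epoch in range(50):                                     # stop after 50 epochs
    for feats, labels in train_loader:
        optimizer.zero_grad()
        scores = model(feats)[:, 1]                         # probability of "playback"
        loss = criterion(scores, labels.float())
        loss.backward()
        optimizer.step()
```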
In the training stage of step 1), the cepstral features are extracted from the training voice samples (a training set containing both original voice and playback voice) and fed into the neural network to train the residual network models for original voice and playback voice. In the testing stage, the features of the test voice sample are extracted and fed into the residual network model trained in the training stage; the test voice is classified according to the score output by the network output layer, and this result is combined with the score of an ASV (automatic speaker verification) system as the final result for deciding whether the test voice sample is playback voice.
Through the above training and testing processes, one subsystem is obtained for each of the three cepstral features. The scores of the three subsystems are then fused; the fusion is a weighted combination of the three subsystem scores, given by the following formula:
S = i·S_BFCC + j·S_MFCC + k·S_CQCC
where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the respective subsystems. By letting the training models of the different features cooperate in this way, a better fused result is obtained.
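A minimal sketch of this weighted score fusion (the weight values shown are arbitrary examples that satisfy the constraint; the subsystem scores are assumed to be already normalized):

```python
def fuse_scores(s_bfcc: float, s_mfcc: float, s_cqcc: float,
                i: float = 0.4, j: float = 0.3, k: float = 0.3) -> float:
    """Weighted fusion of the three normalized subsystem scores."""
    assert abs(i + j + k - 1.0) < 1e-9   # constraint: i + j + k = 1
    return i * s_bfcc + j * s_mfcc + k * s_cqcc
```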