CN113506583A - Disguised voice detection method using residual error network - Google Patents

Disguised voice detection method using residual error network

Info

Publication number
CN113506583A
Authority
CN
China
Prior art keywords
layer
residual
voice
network
disguised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110718049.0A
Other languages
Chinese (zh)
Other versions
CN113506583B (en)
Inventor
简志华
徐嘉
韦凤瑜
朱雅楠
于佳祺
吴超
游林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110718049.0A priority Critical patent/CN113506583B/en
Publication of CN113506583A publication Critical patent/CN113506583A/en
Application granted granted Critical
Publication of CN113506583B publication Critical patent/CN113506583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

The invention relates to the field of voice recognition, and in particular to a disguised voice detection method using a residual network. The method comprises the following steps. S1: a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope. S2: the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network. S3: after the constant Q modulation envelope features are input into the classification network as pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units. S4: after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer. By applying the constant Q transform and adopting an improved residual network, the method improves the accuracy of disguised voice detection.

Description

Disguised voice detection method using residual error network
Technical Field
The invention relates to the field of voice recognition, and in particular to a disguised voice detection method using a residual network.
Background
Disguised voice detection is the detection of disguised voice (including replayed voice, synthesized voice, converted voice and simulated voice) on the basis of differences in voice characteristics, so as to distinguish genuine voice from disguised voice; it has very important applications in the field of voice biometric recognition. A common disguised voice detection method consists of two parts: feature extraction and a classifier. At present, Mel-Frequency Cepstral Coefficients (MFCC), Constant Q Cepstral Coefficients (CQCC) and the like are used for feature extraction, and conventional machine learning methods such as the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM) and the Support Vector Machine (SVM) are used as classifiers. However, these methods can only detect specific types of disguised voice; when the type of disguise is unknown, the detection performance degrades, and they cannot meet the challenge of ever-changing disguised voice.
Disclosure of Invention
To solve the above problems, the present invention provides a method for detecting disguised voice using a residual network.
In order to achieve this technical purpose, the invention adopts the following technical scheme.
The method for detecting disguised voice using a residual network is characterized by comprising the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the Constant Q Modulation Envelope (CQME);
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
Further, in step S1, the processing of the speech signal x(n) by the feature extraction module comprises the following steps:
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K;
S12, the envelope of each frequency-divided signal x_k(n) is extracted;
S13, the voice envelope is processed nonlinearly;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform;
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope (CQME).
Further, in step S2, the constant Q modulation envelope feature map is an image plotted with frequency as the horizontal coordinate and amplitude as the vertical coordinate.
Further, in step S2, the modulation envelope feature map input to the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3.
Further, in step S2, the ResNet classification network is a residual network of 50 layers.
Further, the residual network contains 5 convolution blocks; the 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units.
Furthermore, each residual unit is formed by 3 convolutional layers: the convolution kernel of the 1st layer is 1 × 1, and this layer is connected to the 3 × 3 convolution of the 2nd layer through the MMN activation function combined with the Dropout technique; the 2nd convolutional layer is connected to the 1 × 1 convolution of the 3rd layer through a ReLU activation function; the output is added to the output of the shortcut connection, processed nonlinearly once more by a ReLU activation function, and passed to the next residual unit.
Further, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
Further, after the 5 convolution blocks, the output is average-pooled.
Compared with the prior art, the invention has the beneficial technical effects that:
(1) When the constant Q modulation envelope feature is extracted, the constant Q transform is applied; its variable time-frequency resolution matches the characteristics of the voice signal, makes the method more suitable for processing voice signals, and makes full use of the information contained in the voice signal.
(2) An improved residual network is adopted, in which the MMN activation function combined with the Dropout technique replaces the bottom-layer ReLU activation function; this solves the problem that accuracy drops when the number of trained layers of a convolutional neural network becomes large, and improves the accuracy of disguised voice detection.
(3) The constant Q modulation envelope features of different disguised voices are extracted and classified with a residual network, whereas traditional disguise detection approaches can only detect a single kind of disguised voice; the method can therefore detect different disguised voices even when the disguise type is unknown, and is convenient to use.
Drawings
FIG. 1 is a schematic diagram of the method for detecting disguised voice using a residual network according to the present invention;
FIG. 2 is a diagram of the residual unit structure of the present invention;
FIG. 3 is a diagram of the Maxout unit structure of the present invention;
FIG. 4 is a diagram of the MMN unit structure of the present invention;
FIG. 5 is a diagram of the MMN structure combined with Dropout of the present invention;
FIG. 6 is a flowchart of the entire method for detecting disguised voice using a residual network according to the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
As shown in FIGS. 1 to 6, the present embodiment proposes a method for detecting disguised voice using a residual network, comprising the following steps.
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the Constant Q Modulation Envelope (CQME);
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
The disguised voice detection system provided by this embodiment consists of two parts. The first part computes the constant Q modulation envelope of any input voice to obtain its feature map, which serves as the input of the classification network. Different types of disguised voice differ in fundamental frequency, noise and so on, so their feature maps differ from that of genuine voice. In the second part, after the feature map is input into the trained ResNet classification network, deep features are extracted and classes are matched through several groups of convolution blocks, and the classification is finally output through a fully connected layer and a Softmax layer. The whole process classifies genuine voice and disguised voice, realizing disguised voice detection.
In step S1, the processing of the speech signal x(n) by the feature extraction module comprises the following steps.
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K.
S12, the envelope of each frequency-divided signal x_k(n) is extracted. Specifically, the Hilbert transform of each band signal is computed to obtain its analytic signal, and the amplitude of the analytic signal of each band is taken, giving the envelopes of the K band signals.
S13, the voice envelope is processed nonlinearly. Specifically, a logarithmic function applies a nonlinear scale transformation to the voice envelope.
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform. Specifically, the constant Q transform parameters are determined and the feature is transformed to the frequency domain by the constant Q transform.
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope (CQME).
An image is drawn with frequency as the horizontal coordinate and amplitude as the vertical coordinate to obtain the constant Q modulation envelope feature map. The feature map is preprocessed and its size is redefined to 224 × 224 × 3.
therefore, the present embodiment adopts a combination of a constant Q modulation envelope characteristic and an improved Residual Network (ResNet) to process a masquerading voice detection scheme with unknown masquerading mode. After the speech signal x (n) is inputted, the system divides its frequency into K signals x with different frequency bandsk(n), wherein K is 1,2, …, K. Extracting an envelope of the signal of each frequency band, performing nonlinear processing, then transforming the envelope characteristic to a frequency domain through Constant Q Transform (CQT), calculating a mean square value of each frequency band, and obtaining a speech characteristic based on a modulation spectrum, namely a Constant Q Transform envelope (CQME). And outputting the extracted features in the form of pictures, and inputting the preprocessed features into the improved 50-layer ResNet classification network. The ResNet classification network in the invention adopts a residual unit formed by connecting 2 1 × 1 convolutional layers, 13 × 3 convolutional layers and 1 short, and combines a multi-layer Maxout (MMN) activation function with a Dropout technology. After a constant Q modulation envelope characteristic diagram is input into a classification network, firstly, 1 7 multiplied by 7 convolutional layer and 3 multiplied by 3 pooling layer are passed through, then, 16 residual error units are used for realizing depth characteristic extraction, then, average pooling is used for reducing data volume, and finally, a full connection layer and a Softmax layer are used for extracting data volumeA classification of the speech is obtained. In the training stage, the training data sets of the constant Q modulation envelope characteristic diagrams of different camouflage voices are input into the classification network, and the cross entropy function is used as a loss function for training.
The feature extraction module extracts the constant Q modulation envelope feature from the input voice. Compared with genuine voice, replayed voice has gone through an additional recording step, which introduces device noise, environmental noise, coding and decoding distortion and so on. During recording the voice signal propagates over more than one path, so compared with genuine voice it shows delay and attenuation, and there are obvious differences in the high-frequency part. Synthesized voice and converted voice only imitate the rough outline of the spectral envelope when they are generated and do not capture the slight changes between frames, so compared with natural voice they are smoother and noisier. Simulated (imitated) voice has the lowest similarity to natural voice: it can deceive only the human ear, and its spectral structure differs obviously from that of genuine voice. The constant Q modulation envelope feature can therefore distinguish different kinds of disguised voice.
The specific content of the processing of the voice signal x (n) by the feature extraction module is as follows:
First, the input speech x(n) is passed through a frequency-dividing filter bank that splits it into K signals x_k(n) occupying different frequency bands. This function is implemented by a Mel filter bank whose lowest frequency is 300 Hz and whose highest frequency is 3400 Hz, matching the frequency range of the speech signal. In the invention K is set to 128, i.e. 128 sub-bands are produced. The Mel filter bank is a set of triangular band-pass filters, each with transfer function h_k(q), 0 ≤ k < K:

h_k(q) = 0,                                  q < f(k-1)
h_k(q) = (q - f(k-1)) / (f(k) - f(k-1)),     f(k-1) ≤ q ≤ f(k)
h_k(q) = (f(k+1) - q) / (f(k+1) - f(k)),     f(k) < q ≤ f(k+1)
h_k(q) = 0,                                  q > f(k+1)                                   (1)

f(k) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + k · (F_mel(f_h) - F_mel(f_l)) / (K + 1) )     (2)

F_mel(f) = 2595 · lg(1 + f / 700)                                                         (3)

F_mel^{-1}(b) = 700 · (10^{b/2595} - 1)                                                   (4)

where K = 128 is the number of filters, f(k) is the center Mel frequency of the k-th triangular filter, f(k-1) is its lower cut-off Mel frequency, f(k+1) is its upper cut-off Mel frequency, and q is an integer value between f(k-1) and f(k+1). N is the number of DFT points, taken as 256; f_s is the sampling frequency, taken as 8 kHz; f_l is the lowest frequency of the filter bank (300 Hz) and f_h its highest frequency (3400 Hz). F_mel(f) in formula (3) converts an ordinary frequency value f to its Mel frequency value, and F_mel^{-1}(b) in formula (4) is its inverse transform, converting a Mel frequency value b back to an ordinary frequency value.
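As a non-authoritative illustration of formulas (1)-(4), the following Python sketch builds such a triangular Mel filter bank for K = 128 bands between 300 Hz and 3400 Hz with a 256-point DFT at 8 kHz; the function names and the use of NumPy are assumptions for illustration only, and with these parameters adjacent filters may share DFT bins.

```python
import numpy as np

def hz_to_mel(f):
    # Formula (3): ordinary frequency -> Mel frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(b):
    # Formula (4): Mel frequency -> ordinary frequency
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)

def mel_filter_bank(K=128, N=256, fs=8000.0, fl=300.0, fh=3400.0):
    """Return a (K, N//2 + 1) matrix of triangular transfer functions h_k(q), formula (1)."""
    # K + 2 points equally spaced on the Mel scale give the filter edges and centers (formula (2))
    mel_points = np.linspace(hz_to_mel(fl), hz_to_mel(fh), K + 2)
    bins = np.floor((N / fs) * mel_to_hz(mel_points)).astype(int)

    H = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for q in range(left, center + 1):      # rising edge of the triangle
            if center > left:
                H[k - 1, q] = (q - left) / (center - left)
        for q in range(center, right + 1):     # falling edge of the triangle
            if right > center:
                H[k - 1, q] = (right - q) / (right - center)
    return H

# Usage: applying H to the DFT of the speech is one way to realize the split into K band signals x_k(n)
H = mel_filter_bank()
print(H.shape)  # (128, 129)
```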
The envelope is then extracted from each frequency-divided signal x_k(n):

s_k(n) = x_k(n) + j · x̂_k(n)                     (5)

m_k(n) = |s_k(n)| = sqrt( s_k(n) · s_k*(n) )      (6)

where j is the imaginary unit, x̂_k(n) is the Hilbert transform of x_k(n), s_k(n) is the analytic signal of x_k(n), s_k*(n) is the conjugate of s_k(n), and m_k(n) is the envelope of x_k(n). The voice envelope is then processed nonlinearly, the nonlinearity being realized by a logarithmic function.
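A minimal SciPy sketch of formulas (5) and (6) for one band signal follows; the small constant added before the logarithm to avoid log(0) is an assumption, not something the patent specifies.

```python
import numpy as np
from scipy.signal import hilbert

def log_envelope(x_k, eps=1e-10):
    """Envelope of a band signal x_k(n) via its analytic signal, then logarithmic compression."""
    s_k = hilbert(x_k)          # analytic signal s_k(n) = x_k(n) + j * Hilbert{x_k(n)}, formula (5)
    m_k = np.abs(s_k)           # envelope m_k(n) = |s_k(n)|, formula (6)
    return np.log10(m_k + eps)  # nonlinear processing by a logarithmic function (eps is an assumed guard)

# Usage on a synthetic amplitude-modulated band signal at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x_k = np.sin(2 * np.pi * 500 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t))
print(log_envelope(x_k).shape)  # (8000,)
```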
The next step is to transform the nonlinearly processed envelope lg(m_k(n)) to the frequency domain by the constant Q transform, expressed as:

X_k^CQ(l) = (1 / N_l) · Σ_{n=0}^{N_l - 1} lg(m_k(n)) · w_{N_l}(n) · e^{-j·2π·Q·n/N_l}     (7)

f_l = f_min · 2^{(l-1)/B},  l = 1, 2, ..., L                                              (8)

Δf_l = f_l · (2^{1/B} - 1)                                                                (9)

Q = f_l / Δf_l = 1 / (2^{1/B} - 1)                                                        (10)

N_l = f_s / Δf_l = Q · f_s / f_l                                                          (11)

where n denotes the n-th time component of the time-domain signal, L is the total number of frequency components of the constant Q transform spectrum, B is the number of frequency components per octave, f_min and f_max are the lowest and highest frequencies of the processed lg(m_k(n)), f_l is the frequency of the l-th component, Δf_l is the filter bandwidth of the l-th component, Q is a constant equal to the ratio of center frequency to filter bandwidth, f_s is the sampling frequency (8 kHz), l is the frequency index of the constant Q transform spectrum, X_k^CQ(l) is the constant-Q-transform value of the l-th frequency component of the k-th frequency band, N_l is the window length, which varies with frequency, and j is the imaginary unit, representing the relationship between signal phases. w_{N_l}(n) is a Hamming window of length N_l, whose expression is:

w_{N_l}(n) = 0.54 - 0.46 · cos( 2π·n / (N_l - 1) ),  0 ≤ n ≤ N_l - 1                      (12)
When a signal is processed by the Fourier transform, the frequency points are equally spaced, which corresponds to a filter bank with uniformly spaced center frequencies and identical bandwidths. The constant Q transform, by contrast, was originally proposed for processing music signals, in which the scale frequencies are not linearly spaced but exponentially (log2) spaced. The constant Q transform can therefore be regarded as a filter bank whose center frequencies are distributed exponentially and whose ratio of center frequency to bandwidth is constant. Such processing is more consistent with the characteristics of the voice signal: the time-frequency resolution is variable, with higher frequency resolution at low frequencies and higher time resolution at high frequencies. Finally, the mean square value of each frequency band is computed to obtain the voice feature:
CQME(k) = (1 / L) · Σ_{l=1}^{L} | X_k^CQ(l) |²,  k = 1, 2, ..., K                         (13)

where X_k^CQ(l) is the constant-Q-transform value of the l-th frequency component of the k-th frequency band. The resulting feature vector is finally drawn as a feature map with frequency as the independent variable, and this feature map is the input to the subsequent classification network.
In this embodiment, the improved residual network module is an improved ResNet classification network, and the feature map input into the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3. The invention adopts a 50-layer residual network as the classification network, which contains 5 convolution blocks. The 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units, where each residual unit is composed of 3 convolutional layers.
The structure of the residual unit in this embodiment is shown in FIG. 2. Here X is the input; the shortcut connection transmits the input X directly to the adder, where it is added to the output F(X) obtained by the convolutions, giving the final output H(X) = F(X) + X. What the residual network needs to train and learn is therefore no longer the output H(X) but the residual F(X) = H(X) - X; in the extreme case where H(X) = X, it is only necessary to train F(X) to approach 0. The principle of the residual unit can be expressed as:

Y = F(X, {ω_i}) + ω_s · X                                                                 (14)

where Y denotes the output of the residual unit, ω_i represents the processing of the 3 convolutional layers, and ω_s represents the processing applied to the shortcut connection so that its dimension matches that of the residual mapping and the two can be added directly; this can be realized with a 1 × 1 convolutional layer. The 1 × 1 convolutional layer performs dimension reduction or dimension expansion, the number of output channels being determined by the number of convolution kernels. The input feature matrix is first reduced to 64 channels and then raised to 256 channels at layer 3; the MMN activation function combined with the Dropout technique is used after the bottom (first) layer, while ReLU is used as the activation function after layers 2 and 3.
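The following PyTorch sketch shows a bottleneck residual unit in the spirit of FIG. 2 and formula (14): a 1 × 1 dimension-reducing convolution, a 3 × 3 convolution, a 1 × 1 dimension-raising convolution and a 1 × 1 shortcut projection. Plain ReLU activations stand in for the MMN-plus-Dropout activation described below, so this is a simplified assumption rather than the exact unit of the invention.

```python
import torch
import torch.nn as nn

class BottleneckResidualUnit(nn.Module):
    """Y = F(X, {w_i}) + w_s * X (formula (14)) with a 1x1 -> 3x3 -> 1x1 bottleneck."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # w_s: 1x1 projection so the shortcut matches the shape of the residual mapping
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                  stride=stride, bias=False)

    def forward(self, x):
        out = self.relu(self.conv1(x))   # placeholder for the MMN + Dropout activation
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        out = out + self.shortcut(x)     # add the shortcut branch
        return self.relu(out)            # final nonlinearity before the next unit

# Usage: one unit of the 2nd convolution block (64 -> 64 -> 256 channels)
x = torch.randn(1, 64, 56, 56)
print(BottleneckResidualUnit(64, 64, 256)(x).shape)  # torch.Size([1, 256, 56, 56])
```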
In the residual unit, the MMN activation function is adopted after the first layer. The MMN activation function is developed on the basis of the Maxout activation function; the Maxout unit structure is shown in FIG. 3. Here x_i is the i-th component of the input vector, w_ij is the weight coefficient connecting the i-th input to the j-th hidden node, b_ij is the offset of the j-th hidden node for the i-th input, and z_j is the weighted sum of the input plus the bias term:

z_j = Σ_i ( w_ij · x_i + b_ij )                                                           (15)

The parameters w_ij and b_ij are optimized through training. If each Maxout unit contains g hidden nodes, the output of the Maxout unit is:

h(x) = max_{1 ≤ j ≤ g} z_j                                                                (16)
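A small PyTorch sketch of formulas (15) and (16): one linear layer produces g candidate pre-activations z_j per output feature and the maximum is kept. The module name, shapes and the choice g = 4 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: z_j as in formula (15), output max_j z_j as in formula (16)."""
    def __init__(self, in_features, out_features, g=4):
        super().__init__()
        self.out_features, self.g = out_features, g
        # one affine map yields the g pieces z_1 ... z_g for every output feature
        self.linear = nn.Linear(in_features, out_features * g)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.g)
        return z.max(dim=-1).values      # keep the largest of the g hidden nodes

# Usage
x = torch.randn(8, 32)
print(Maxout(32, 16)(x).shape)  # torch.Size([8, 16])
```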
the MMN unit is obtained by jointly using a plurality of Maxout units, and the MMN unit structure used in this embodiment is as shown in fig. 4. Wherein the content of the first and second substances,
Figure BDA0003135659920000088
and
Figure BDA0003135659920000089
the two parameters respectively indicate the weighting coefficient and the offset of the jth output node corresponding to the ith input vector in the 1 st weighting operation,
Figure BDA00031356599200000810
represents the output of the jth node weighted 1 st time,
Figure BDA0003135659920000091
the output of the jth node representing the 1 st max operation,
Figure BDA0003135659920000092
is calculated in the same way as the Maxout activation function,
Figure BDA0003135659920000093
and
Figure BDA0003135659920000094
the same way of calculation. The MMN units form an MMN activation function, the Maxout activation function can fit any convex function, the MMN activation function can fit any form of distribution, characteristics can be better represented, and the convergence speed is higher.
In the training stage, after the MMN activation function processing, the feature matrix is connected to the next convolutional layer in combination with the Dropout technique: in each training pass a certain fraction of hidden nodes is ignored at random. In the invention the Dropout rate is 50% and the number of MMN units in the hidden layer is 100, which is equivalent to training many different networks and averaging their results. Dropout is not used in the prediction stage. The structure of the MMN activation function combined with Dropout is shown in FIG. 5.
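As a hedged sketch of the MMN activation combined with Dropout in the training stage, the block below stacks two Maxout stages (re-stating the Maxout module from the previous sketch) with a 50% Dropout layer between them and 100 hidden Maxout units; the exact internal wiring of the patent's MMN layer is only partly specified, so this stacking is an assumption.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Same Maxout unit as in the previous sketch (formulas (15)-(16))."""
    def __init__(self, in_features, out_features, g=4):
        super().__init__()
        self.out_features, self.g = out_features, g
        self.linear = nn.Linear(in_features, out_features * g)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.g)
        return z.max(dim=-1).values

class MMNWithDropout(nn.Module):
    """Multi-layer Maxout (MMN) activation with Dropout applied only during training."""
    def __init__(self, in_features, hidden_units=100, out_features=64, g=4, p_drop=0.5):
        super().__init__()
        self.stage1 = Maxout(in_features, hidden_units, g)    # 1st weighting + max operation
        self.dropout = nn.Dropout(p_drop)                     # 50% of hidden nodes ignored per pass
        self.stage2 = Maxout(hidden_units, out_features, g)   # 2nd weighting + max operation

    def forward(self, x):
        return self.stage2(self.dropout(self.stage1(x)))

# Dropout is active in model.train() mode and disabled at prediction time via model.eval()
mmn = MMNWithDropout(in_features=256)
mmn.eval()
print(mmn(torch.randn(4, 256)).shape)  # torch.Size([4, 64])
```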
the 3 rd layer of convolution block is composed of 4 residual error units, the 4 th layer of convolution block is composed of 6 residual error units, the 5 th layer of convolution block is composed of 3 residual error units, the number of channels of the residual error units in each layer of convolution block is continuously increased, and the 5 th layer of convolution block is shown in table 1.
Figure BDA0003135659920000095
TABLE 1
After the 5 convolution blocks, the output is average-pooled to reduce the data complexity, then passed through a fully connected layer to obtain the probabilities that the voice signal is genuine voice or disguised voice, and the classification is finally obtained through the Softmax layer.
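To show how the pieces fit together, here is a simplified PyTorch sketch of the whole classifier: one 7 × 7 convolution, a 3 × 3 max-pooling layer, 3 + 4 + 6 + 3 = 16 bottleneck residual units, average pooling, a fully connected layer and Softmax. The channel widths follow the standard ResNet-50 layout as an assumption, since Table 1 is not reproduced above, and plain ReLU again stands in for the MMN + Dropout activation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified 1x1 -> 3x3 -> 1x1 residual unit with a projected shortcut (see formula (14))."""
    def __init__(self, cin, mid, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(mid, cout, 1, bias=False))
        self.shortcut = nn.Conv2d(cin, cout, 1, stride=stride, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def stage(n_units, cin, mid, cout, stride):
    # first unit changes resolution/channels, the remaining units keep them
    units = [Bottleneck(cin, mid, cout, stride)]
    units += [Bottleneck(cout, mid, cout) for _ in range(n_units - 1)]
    return nn.Sequential(*units)

class DisguisedVoiceResNet50(nn.Module):
    """3 + 4 + 6 + 3 = 16 residual units; channel widths assume the standard ResNet-50 layout."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),  # convolution block 1: 64 7x7 kernels
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))                  # 3x3 maximum pooling
        self.block2 = stage(3, 64, 64, 256, stride=1)
        self.block3 = stage(4, 256, 128, 512, stride=2)
        self.block4 = stage(6, 512, 256, 1024, stride=2)
        self.block5 = stage(3, 1024, 512, 2048, stride=2)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2048, num_classes))                          # fully connected layer

    def forward(self, x):
        x = self.stem(x)
        x = self.block5(self.block4(self.block3(self.block2(x))))
        return torch.softmax(self.head(x), dim=1)                  # genuine vs disguised probabilities

# Usage on one preprocessed 224 x 224 x 3 feature map
model = DisguisedVoiceResNet50()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```

When training with the cross-entropy loss mentioned above, the Softmax would normally be folded into the loss function and the network would output raw logits; the explicit Softmax here mirrors the description in the text.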

Claims (9)

1. A method for detecting disguised voice using a residual network, characterized by comprising the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope;
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
2. The disguised voice detection method using a residual network as claimed in claim 1, wherein in step S1 the processing of the voice signal x(n) by the feature extraction module comprises the following steps:
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K;
S12, the envelope of each frequency-divided signal x_k(n) is extracted;
S13, the voice envelope is processed nonlinearly;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform, m_k(n) being the envelope of x_k(n);
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope.
3. The method of detecting disguised speech using a residual network as claimed in claim 1, wherein in step S2 the constant Q modulation envelope feature map is an image plotted with frequency as the horizontal coordinate and amplitude as the vertical coordinate.
4. The method of detecting disguised voice using a residual network as claimed in claim 1, wherein in step S2 the modulation envelope feature map input to the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3.
5. The method of detecting a disguised voice using a residual network as claimed in claim 1, wherein in step S2, the ResNet classification network is a 50-layer residual network.
6. The method of detecting disguised speech using a residual network as claimed in claim 5, wherein the residual network comprises 5 convolution blocks; the 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units.
7. The method of detecting disguised speech using a residual network as claimed in claim 6, wherein each residual unit is formed by 3 convolutional layers: the kernel size of the 1st convolutional layer is 1 × 1 and this layer is connected to the 3 × 3 convolution of the 2nd layer through the MMN activation function combined with the Dropout technique; the 2nd convolutional layer is connected to the 1 × 1 convolution of the 3rd layer through a ReLU activation function; and the output is added to the output of the shortcut connection, subjected to nonlinear processing by a ReLU activation function once more, and then passed to the next residual unit.
8. The method of detecting disguised speech using a residual network as claimed in claim 7, wherein, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
9. The method of detecting disguised speech using a residual network as claimed in claim 8, wherein, after the 5 convolution blocks, the output is average-pooled.
CN202110718049.0A 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network Active CN113506583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Publications (2)

Publication Number Publication Date
CN113506583A true CN113506583A (en) 2021-10-15
CN113506583B CN113506583B (en) 2024-01-05

Family

ID=78011168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718049.0A Active CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Country Status (1)

Country Link
CN (1) CN113506583B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767776A (en) * 2019-01-14 2019-05-17 广东技术师范学院 A kind of deception speech detection method based on intensive neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN111653289A (en) * 2020-05-29 2020-09-11 宁波大学 Playback voice detection method
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Also Published As

Publication number Publication date
CN113506583B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
CN107146601B (en) Rear-end i-vector enhancement method for speaker recognition system
JP5554893B2 (en) Speech feature vector conversion method and apparatus
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN111653289B (en) Playback voice detection method
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN112270931B (en) Method for carrying out deceptive voice detection based on twin convolutional neural network
CN112017682B (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN113488058A (en) Voiceprint recognition method based on short voice
CN112259120A (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
Todkar et al. Speaker recognition techniques: A review
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Gaafar et al. An improved method for speech/speaker recognition
Renisha et al. Cascaded Feedforward Neural Networks for speaker identification using Perceptual Wavelet based Cepstral Coefficients
CN113506583B (en) Camouflage voice detection method using residual error network
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant