CN113506583B - Camouflage voice detection method using residual error network - Google Patents

Camouflage voice detection method using residual error network

Info

Publication number
CN113506583B
CN113506583B (application CN202110718049.0A)
Authority
CN
China
Prior art keywords
layer
voice
residual
constant
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110718049.0A
Other languages
Chinese (zh)
Other versions
CN113506583A (en)
Inventor
简志华
徐嘉
韦凤瑜
朱雅楠
于佳祺
吴超
游林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202110718049.0A
Publication of CN113506583A
Application granted
Publication of CN113506583B
Active legal status: Current
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - specially adapted for particular use for comparison or discrimination
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

The invention relates to the field of voice recognition, in particular to a camouflage voice detection method using a residual network. S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope; S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network; S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units; S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification. By using the constant Q transform and an improved residual network, the invention improves the accuracy of camouflage voice detection.

Description

Camouflage voice detection method using residual error network
Technical Field
The invention relates to the field of voice recognition, in particular to a camouflage voice detection method using a residual network.
Background
Camouflage voice detection technology detects camouflage voice (including replayed voice, synthesized voice, converted voice, and imitated voice) from differences in voice characteristics, so as to distinguish genuine voice from camouflage voice; it has very important applications in the field of voice biometric recognition. A common camouflage voice detection method comprises two parts: feature extraction and a classifier. At present, feature extraction mostly relies on traditional features such as Mel frequency cepstral coefficients (MFCCs) and constant Q cepstral coefficients (CQCCs), and common classifiers include the Gaussian mixture model (GMM), the hidden Markov model (HMM), and the support vector machine (SVM). However, these methods can only detect specific types of camouflage voice; when the camouflage type is unknown their detection performance drops, and they cannot adapt to varied camouflage voice attacks.
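For reference, a small illustrative sketch of such a conventional pipeline, MFCC features scored by per-class Gaussian mixture models, is given below; it is not part of the patent, and the feature settings, mixture sizes, and helper names are assumptions.

```python
# Illustrative sketch of the conventional approach described above (assumed
# settings): MFCC features scored by two Gaussian mixture models, one trained
# on genuine speech and one on camouflage speech.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_feats(wav, sr=8000):
    # frame-level MFCCs, shape (frames, 20)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20).T

def train_gmms(genuine_wavs, spoofed_wavs):
    # genuine_wavs / spoofed_wavs: lists of 1-D numpy arrays of training speech
    gmm_real = GaussianMixture(n_components=64).fit(
        np.vstack([mfcc_feats(w) for w in genuine_wavs]))
    gmm_fake = GaussianMixture(n_components=64).fit(
        np.vstack([mfcc_feats(w) for w in spoofed_wavs]))
    return gmm_real, gmm_fake

def llr_score(wav, gmm_real, gmm_fake):
    f = mfcc_feats(wav)
    # average log-likelihood difference; > 0 suggests genuine speech
    return gmm_real.score(f) - gmm_fake.score(f)
```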
Disclosure of Invention
In order to solve the above problems, the present invention provides a camouflage voice detection method using a residual network.
In order to achieve this technical purpose, the invention adopts the following technical scheme.
The camouflage voice detection method using the residual network comprises the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME);
S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification.
Further, in step S1, the feature extraction module processes the speech signal x(n) in the following steps:
S11, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K;
S12, the envelope of each band-limited signal x_k(n) is extracted;
S13, nonlinear processing is applied to the voice envelope;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform;
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME).
Further, in step S2, the constant Q modulation envelope feature map is an image plotted with frequency on the abscissa and amplitude on the ordinate.
Further, in step S2, the modulation envelope feature map input to the ResNet classification network is preprocessed to a size of 224×224×3.
Further, in step S2, the ResNet classification network is a 50-layer residual network.
Further, the residual network contains 5 convolution blocks; the 1st convolution block is a convolution layer composed of 64 7×7 convolution kernels, and the 2nd convolution block is composed of one 3×3 max pooling layer and 3 residual units.
Further, each residual unit is formed by 3 convolution layers; the 1st layer has 1×1 convolution kernels and is connected to the 2nd layer of 3×3 convolutions through an MMN (multi-layer Maxout network) activation function combined with the Dropout technique; the 2nd layer is connected to the 3rd layer of 1×1 convolutions through a ReLU activation function; the output is added to the output of the shortcut connection and, after nonlinear processing by a ReLU activation function, passed to the next residual unit.
Further, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units, and the 5th convolution block of 3 residual units; the number of channels of the residual units increases from one convolution block to the next.
Further, the output of the 5 convolution blocks is average-pooled.
Compared with the prior art, the invention has the following beneficial technical effects:
(1) When extracting the constant Q modulation envelope feature, the constant Q transform is used; its variable time-frequency resolution matches the characteristics of speech signals, making it more suitable for processing speech and making full use of the information contained in the speech signal.
(2) An improved residual network is adopted in which the MMN activation function combined with the Dropout technique replaces the bottom-layer ReLU activation function, which alleviates the drop in accuracy that occurs when a convolutional neural network is trained with more layers and improves the accuracy of camouflage voice detection.
(3) A common constant Q modulation envelope feature is extracted for different kinds of camouflage voice and classified by the residual network, overcoming the drawback that traditional camouflage detection methods can only detect a single type of camouflage voice; detection can therefore be carried out even when the camouflage type is unknown, which makes the method convenient to use.
Drawings
FIG. 1 is a schematic diagram of a camouflage voice detection method utilizing a residual network in accordance with the present invention;
FIG. 2 is a block diagram of a residual unit of the present invention;
FIG. 3 is a block diagram of a Maxout unit of the present invention;
FIG. 4 is a block diagram of an MMN unit of the present invention;
FIG. 5 is a diagram of the structure of the MMN activation function combined with Dropout according to the present invention;
FIG. 6 is a flowchart of the camouflage voice detection method using a residual network in accordance with the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
As shown in FIGS. 1-6, the present embodiment proposes a camouflage voice detection method using a residual network, which comprises the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME);
S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification.
The voice camouflage detection system provided by this embodiment consists of two parts. The first part calculates the constant Q modulation envelope of any input voice and obtains a feature map of the input voice that serves as the input of the classification network. Different types of camouflage voice differ in fundamental frequency, noise and other aspects, so their feature maps differ from those of genuine voice. The second part inputs the feature maps into a trained ResNet classification network, performs deep feature extraction and category matching through several groups of convolution blocks, and finally outputs the classification through a fully connected layer and a Softmax layer. The whole process classifies genuine voice and camouflage voice, realizing camouflage voice detection.
In step S1, the feature extraction module processes the voice signal x(n) in the following steps.
S11, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K.
S12, the envelope of each band-limited signal x_k(n) is extracted. Specifically, the Hilbert transform of each band signal is computed to obtain its analytic signal, and the amplitude of the analytic signal is taken as the envelope, giving the envelopes of the K band signals.
S13, nonlinear processing is applied to the voice envelope. Specifically, the speech envelope is processed nonlinearly on a logarithmic scale.
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform. Specifically, the constant Q transform parameters are determined and the feature is transformed to the frequency domain by the constant Q transform.
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME).
An image is then drawn with frequency on the abscissa and amplitude on the ordinate to obtain the constant Q modulation envelope feature map. The feature map is preprocessed and resized to 224×224×3.
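A small sketch of this step is given below; it is not from the patent, and the plotting and resizing choices are assumptions. It renders a CQME feature vector as an image with frequency on the abscissa and amplitude on the ordinate and resizes it to the 224×224×3 network input.

```python
# Sketch of rendering the CQME feature vector as a picture and resizing it to
# the 224x224x3 input of the classification network (illustrative names).
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image

def cqme_to_image(cqme, out_path="cqme.png", size=(224, 224)):
    fig, ax = plt.subplots(figsize=(3, 3), dpi=100)
    ax.plot(np.arange(len(cqme)), cqme)      # frequency index on the abscissa
    ax.set_xlabel("frequency band")
    ax.set_ylabel("amplitude")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    img = Image.open(out_path).convert("RGB").resize(size)
    return np.asarray(img)                   # shape (224, 224, 3)
```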
Thus, the present embodiment uses the constant Q modulation envelope feature in combination with an improved residual network (ResNet) to handle camouflage voice detection when the camouflage type is unknown. After the voice signal x(n) is input, the system divides it into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K. The envelope of each band signal is extracted and processed nonlinearly, the envelope features are transformed into the frequency domain by the constant Q transform (CQT), and the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME). The extracted feature is output in the form of a picture, preprocessed, and input into an improved 50-layer ResNet classification network. The ResNet classification network of the present invention uses residual units consisting of two 1×1 convolution layers, one 3×3 convolution layer, and one shortcut connection, and uses a multi-layer Maxout network (MMN) activation function combined with the Dropout technique. After the constant Q modulation envelope feature map is input into the classification network, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units; the data volume is reduced by average pooling, and finally the voice classification is obtained through a fully connected layer and a Softmax layer. In the training stage, a training data set of constant Q modulation envelope feature maps of different camouflage voices is input into the classification network, and the cross entropy function is used as the loss function for training.
The feature extraction module extracts the constant Q modulation envelope feature from the input speech. Compared with genuine speech, replayed speech adds a recording-and-playback step that introduces device noise, ambient noise, codec distortion, and so on. During recording the voice signal may also propagate along more than one path, so it shows delay and attenuation relative to genuine speech and differs noticeably in the high-frequency part. Synthesized and converted speech only approximate the rough outline of the spectral envelope and do not capture the small frame-to-frame variations, so compared with natural speech they are smoother and carry noise. Imitated speech has the lowest similarity to natural speech; it can only confuse listeners at the level of the human ear, and its spectral structure differs clearly from that of genuine speech. Different kinds of camouflage speech can therefore be distinguished using the constant Q modulation envelope feature.
The feature extraction module processes the voice signal x(n) as follows:
First, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands. This is realized with a Mel filter bank whose lowest frequency is 300 Hz and whose highest frequency is 3400 Hz, in accordance with the frequency range of the speech signal. In the invention K is taken as 128, i.e. a 128-band frequency division is realized. The Mel filter bank is a group of triangular band-pass filters whose transfer functions are h_k(q), 0 ≤ k < K; each h_k(q) rises linearly from 0 at f(k-1) to its maximum at f(k) and falls linearly back to 0 at f(k+1).
Here K is the number of filters (128), f(k) is the center Mel frequency of the k-th triangular filter, f(k-1) is its lower cutoff Mel frequency, f(k+1) is its upper cutoff Mel frequency, and q is an integer value between f(k-1) and f(k+1). N is the number of DFT points, taken as 256; f_s is the sampling frequency, taken as 8 kHz; f_l is the lowest frequency of the filter bank and f_h the highest, taken as 300 Hz and 3400 Hz. F_mel(f) (equation (3)) converts an ordinary frequency value f into a Mel frequency value, and its inverse converts a Mel frequency value b back into an ordinary frequency value.
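A minimal sketch of such a 128-band Mel filter bank is given below; the triangular-filter normalization and the Mel-scale formula are assumed standard forms rather than values quoted from the patent, and the band signals x_k(n) would be obtained by filtering the input speech with these frequency responses.

```python
# Sketch of the K = 128 band Mel filter bank used for frequency division,
# with N = 256 DFT points, fs = 8 kHz, and a 300-3400 Hz range (assumed
# standard Mel formulas).
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # assumed standard Mel mapping

def inv_mel(b):
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)

def mel_filterbank(K=128, N=256, fs=8000, f_low=300.0, f_high=3400.0):
    centers = inv_mel(np.linspace(mel(f_low), mel(f_high), K + 2))   # Hz
    freqs = np.arange(N // 2 + 1) * fs / N                           # DFT bin frequencies
    H = np.zeros((K, freqs.size))
    for k in range(1, K + 1):
        lo, c, hi = centers[k - 1], centers[k], centers[k + 1]
        rising = (freqs >= lo) & (freqs <= c)
        falling = (freqs > c) & (freqs <= hi)
        H[k - 1, rising] = (freqs[rising] - lo) / (c - lo)           # triangular rise
        H[k - 1, falling] = (hi - freqs[falling]) / (hi - c)         # triangular fall
    return H        # h_k(q) sampled on the DFT bins, shape (K, N//2 + 1)
```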
The envelope of each band-limited signal x_k(n) is then extracted:
s_k(n) = x_k(n) + j·x̂_k(n),  m_k(n) = |s_k(n)| = sqrt(s_k(n)·s_k*(n)),
where j is the imaginary unit, x̂_k(n) is the Hilbert transform of x_k(n), s_k(n) is the analytic signal of x_k(n), s_k*(n) is the conjugate of s_k(n), and m_k(n) is the spectral envelope of x_k(n). The speech envelope is then processed nonlinearly; in the present invention the nonlinearity is realized by a logarithmic function.
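A minimal sketch of steps S12-S13 for one band signal, extracting the Hilbert envelope and applying the logarithmic nonlinearity, is given below (helper names are illustrative).

```python
# Sketch of envelope extraction and logarithmic nonlinearity for one band
# signal x_k(n).
import numpy as np
from scipy.signal import hilbert

def log_envelope(x_k):
    s_k = hilbert(x_k)               # analytic signal s_k(n) = x_k(n) + j*x^_k(n)
    m_k = np.abs(s_k)                # envelope m_k(n) = |s_k(n)|
    return np.log10(m_k + 1e-12)     # lg(m_k(n)); small offset avoids log(0)
```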
The nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform, expressed as:
X_k^CQ(l) = (1/N_l) · Σ_{n=0}^{N_l−1} lg(m_k(n)) · w_{N_l}(n) · e^{−j2πQn/N_l},  l = 1, 2, …, L,
where n denotes the n-th time sample of the time-domain signal, L is the total number of frequency components of the constant Q transform spectrum, f_min and f_max are the lowest and highest frequencies of the processed lg(m_k(n)), f_l is the frequency of the l-th component, Δf_l is the filter bandwidth of the l-th component, Q is a constant equal to the ratio of center frequency to filter bandwidth, f_s is the sampling frequency, taken as 8 kHz, l is the frequency index of the constant Q transform spectrum, X_k^CQ(l) is the constant Q transform value of the l-th frequency component of the k-th frequency band, N_l is a window length that varies with frequency, j is the imaginary unit, and w_{N_l}(n) is a Hamming window of length N_l:
w_{N_l}(n) = 0.54 − 0.46·cos(2πn/(N_l − 1)),  0 ≤ n ≤ N_l − 1.
When a signal is processed by the Fourier transform, the frequency points are equally spaced, which corresponds to a group of filters with equally spaced center frequencies and identical bandwidths. The constant Q transform was originally developed for music signals, whose scale frequencies are not equally spaced but logarithmic with base 2; it can be regarded as a filter bank whose center frequencies are exponentially distributed and whose ratio of center frequency to bandwidth is constant. The constant Q transform is therefore better matched to speech signals: its time-frequency resolution is variable, with higher frequency resolution at low frequencies and higher time resolution at high frequencies. Finally, the mean square value of each frequency component is calculated to obtain the voice feature.
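The following sketch computes the constant Q transform of one log envelope and its mean square value; the bin spacing, window-length rule, and modulation-frequency range are assumptions based on the standard constant Q transform rather than values given in the patent, and the mean square is assumed to be taken over the constant Q components of each band.

```python
# Sketch of steps S14-S15 for one band: constant Q transform of lg(m_k(n))
# followed by the mean square value (assumed parameterization).
import numpy as np

def cqt_mean_square(log_env, fs=8000.0, f_min=10.0, f_max=400.0, bins_per_octave=12):
    n_bins = int(np.ceil(bins_per_octave * np.log2(f_max / f_min)))
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)       # Q = f_l / delta_f_l
    X = np.zeros(n_bins, dtype=complex)
    for l in range(n_bins):
        f_l = f_min * 2.0 ** (l / bins_per_octave)          # exponentially spaced centers
        N_l = min(int(round(Q * fs / f_l)), len(log_env))   # window length varies with frequency
        n = np.arange(N_l)
        w = np.hamming(N_l)                                  # Hamming window of length N_l
        X[l] = np.sum(log_env[:N_l] * w * np.exp(-2j * np.pi * Q * n / N_l)) / N_l
    return np.mean(np.abs(X) ** 2)                           # mean square value for this band
```

Applying this to the K band envelopes yields a K-dimensional CQME vector, which is then plotted as the feature map described below.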
and finally, drawing the obtained eigenvector into an eigenvector by taking the frequency as an independent variable, namely, inputting a subsequent classification network, wherein the eigenvector is a constant Q conversion value of the first frequency component of the kth frequency band.
The improved residual network module of this embodiment is an improved ResNet classification network. The feature map input to the ResNet classification network is first preprocessed to a size of 224×224×3. The invention uses a 50-layer residual network as the classification network, containing 5 convolution blocks. The 1st convolution block is a convolution layer composed of 64 7×7 convolution kernels, and the 2nd convolution block is composed of one 3×3 max pooling layer and 3 residual units, where each residual unit consists of 3 convolution layers.
The structure of the residual unit in this embodiment is shown in FIG. 2. Here X is the input; the shortcut connection passes the input X directly to the adder, where it is added to the output F(X) of the convolution layers, and the final output is H(X) = F(X) + X. What the residual network has to learn is no longer the output H(X) but the residual F(X) = H(X) − X; in the extreme case where H(X) = X, F(X) only needs to be trained to approach 0. The principle of the residual unit can be expressed as:
Y = F(X, {ω_i}) + ω_s·X    (14)
where Y denotes the output of the residual unit, ω_i denotes the processing of the 3 convolution layers, and ω_s denotes the processing of the shortcut connection, which keeps the dimensions of the shortcut connection and the residual mapping consistent so that they can be added directly; this is realized by a 1×1 convolution layer. The function of a 1×1 convolution layer is to reduce or increase the dimension, and the number of convolution kernels determines the dimension after processing. The dimension of the input feature matrix is first reduced to 64 and then raised to 256 in the 3rd layer; MMN is adopted as the activation function in the bottommost layer, combined with the Dropout technique, while ReLU is adopted as the activation function in the 2nd and 3rd layers.
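As an illustration of this structure, the following PyTorch sketch implements a bottleneck residual unit of this form; it is not the patent's reference implementation, and the class name, stride handling, and the placeholder for the MMN activation (sketched further below) are assumptions.

```python
# Sketch of the bottleneck residual unit: 1x1 -> 3x3 -> 1x1 convolutions plus
# a shortcut connection; the bottom-layer activation slot is meant for the MMN
# activation combined with Dropout, with ReLU standing in when none is given.
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, in_ch, mid_ch=64, out_ch=256, stride=1, first_act=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.first_act = first_act if first_act is not None else nn.ReLU(inplace=True)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection keeps shortcut and residual mapping dimensions consistent
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if (in_ch != out_ch or stride != 1) else nn.Identity())

    def forward(self, x):
        out = self.first_act(self.conv1(x))   # 1x1 conv, dimension reduced (e.g. to 64)
        out = self.relu(self.conv2(out))      # 3x3 conv
        out = self.conv3(out)                 # 1x1 conv, dimension raised (e.g. to 256)
        out = out + self.shortcut(x)          # H(X) = F(X) + X
        return self.relu(out)                 # nonlinearity before the next residual unit
```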
In the residual unit, the first layer adopts the MMN activation function, which is developed from the Maxout activation function; the structure of a Maxout unit is shown in FIG. 3. Here x_i is the i-th component of the input vector, w_{ij} is the weight coefficient connecting the i-th input component to the j-th hidden node, b_j is the offset of the j-th hidden node, and z_j is the sum of the weighted input and the offset term:
z_j = Σ_i w_{ij}·x_i + b_j.
The weights and offsets reach their optimal values through training. If each Maxout unit contains g hidden nodes, the output of the Maxout unit is:
h(x) = max_{1≤j≤g} z_j.
Multiple Maxout units are used jointly to form an MMN unit; the structure of the MMN unit used in this embodiment is shown in FIG. 4. Here w^(1)_{ij} and b^(1)_j denote the weight coefficient and offset of the j-th output node for the i-th input component in the 1st weighting operation, z^(1)_j denotes the output of the j-th node of the 1st weighting, and h^(1)_j denotes the output of the j-th node of the 1st max operation; these are calculated in the same way as in the Maxout activation function, and the quantities of the later stages are calculated analogously. The MMN units form the MMN activation function. Whereas the Maxout activation function can fit any convex function, the MMN activation function can fit a distribution of any form, represents the features better, and converges faster.
In the training stage, after the feature matrix has been processed by the MMN activation function it is connected to the next convolution layer in combination with the Dropout technique, i.e. during each training pass a certain proportion of hidden nodes is randomly ignored. In the invention Dropout hides 50% of the nodes and the hidden layer contains 100 MMN units, which is equivalent to training with different networks and averaging the training results; this technique is not used in the prediction stage. The structure of the MMN activation function combined with the Dropout technique is shown in FIG. 5.
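The following PyTorch sketch illustrates one possible realization of the Maxout unit and of the MMN activation combined with Dropout; the channel-wise Maxout form, the two-stage structure, and the 100-unit hidden layer are assumptions, since the exact wiring of FIGS. 3-5 is not fully specified in the text.

```python
# Sketch of a channel-wise Maxout unit and an MMN activation combined with
# Dropout(0.5), as used at the bottom layer of each residual unit (assumed
# structure).
import torch
import torch.nn as nn

class MaxoutConv(nn.Module):
    """g linear pieces per output channel; output is the max over the g pieces."""
    def __init__(self, in_ch, out_ch, g=4):
        super().__init__()
        self.g, self.out_ch = g, out_ch
        self.proj = nn.Conv2d(in_ch, out_ch * g, kernel_size=1)   # z_j = W_j x + b_j

    def forward(self, x):
        z = self.proj(x)                                  # (N, out_ch*g, H, W)
        n, _, h, w = z.shape
        z = z.view(n, self.out_ch, self.g, h, w)
        return z.max(dim=2).values                        # max over the g hidden nodes

class MMN(nn.Module):
    """Multi-layer Maxout activation followed by Dropout (active in training only)."""
    def __init__(self, channels, hidden=100, g=4, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            MaxoutConv(channels, hidden, g),   # 1st weighting + max operation
            MaxoutConv(hidden, channels, g),   # 2nd weighting + max operation
            nn.Dropout(p_drop),                # 50% of nodes randomly ignored in training
        )

    def forward(self, x):
        return self.net(x)
```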
the layer 3 convolution block is composed of 4 residual units, the layer 4 convolution block is composed of 6 residual units, the layer 5 convolution block is composed of 3 residual units, the channel number of the residual units in each layer of convolution block is continuously progressive, and the layer 5 convolution blocks are shown in table 1.
TABLE 1
The output of the 5 convolution blocks is average-pooled to reduce the data complexity, passed through a fully connected layer, and finally through a Softmax layer to obtain the probability that the voice signal is genuine or camouflaged and thus its classification.
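Putting the pieces together, the following sketch assembles the modified 50-layer classification network and a single cross-entropy training step, reusing the ResidualUnit and MMN sketches above; the per-block channel widths and strides follow the standard ResNet-50 layout, which the text implies but does not tabulate, and all names and hyperparameters are illustrative.

```python
# Sketch of the full classification network (7x7 conv, 3x3 max pooling,
# 3+4+6+3 = 16 residual units, average pooling, fully connected layer) and one
# training step with the cross entropy loss (Softmax is folded into the loss).
import torch
import torch.nn as nn

def make_block(n_units, in_ch, mid_ch, out_ch, stride):
    units = [ResidualUnit(in_ch, mid_ch, out_ch, stride, first_act=MMN(mid_ch))]
    units += [ResidualUnit(out_ch, mid_ch, out_ch, 1, first_act=MMN(mid_ch))
              for _ in range(n_units - 1)]
    return nn.Sequential(*units)

classifier = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # conv block 1: 7x7 convolution
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 max pooling
    make_block(3, 64, 64, 256, 1),                            # conv block 2: 3 residual units
    make_block(4, 256, 128, 512, 2),                          # conv block 3: 4 residual units
    make_block(6, 512, 256, 1024, 2),                         # conv block 4: 6 residual units
    make_block(3, 1024, 512, 2048, 2),                        # conv block 5: 3 residual units
    nn.AdaptiveAvgPool2d(1),                                  # average pooling layer
    nn.Flatten(),
    nn.Linear(2048, 2),                                       # fully connected layer: genuine vs. camouflage
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
images = torch.randn(8, 3, 224, 224)          # batch of preprocessed CQME feature maps
labels = torch.randint(0, 2, (8,))            # 0 = genuine, 1 = camouflage (assumed labels)
loss = criterion(classifier(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```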

Claims (4)

1. A camouflage voice detection method using a residual network, characterized by comprising the following steps:
S1, processing a voice signal x(n) with a feature extraction module to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope;
S2, outputting the extracted constant Q modulation envelope feature in the form of a feature map, preprocessing it, and inputting it into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, extracting deep features through one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, passing the output through an average pooling layer and finally through a fully connected layer and a Softmax layer to output the voice classification;
in step S1, the processing of the voice signal x(n) by the feature extraction module comprises the following steps:
S11, passing the input voice x(n) through a frequency-division filter bank to divide it into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K;
S12, extracting the envelope of each band-limited signal x_k(n);
S13, performing nonlinear processing on the voice envelope;
S14, transforming the nonlinearly processed envelope lg(m_k(n)) to the frequency domain by a constant Q transform, m_k(n) being the spectral envelope of x_k(n);
S15, calculating the mean square value of each frequency band to obtain the constant Q modulation envelope, the modulation-spectrum-based voice feature;
in step S2, the ResNet classification network is a 50-layer residual network;
the residual network contains 5 convolution blocks; the 1st convolution block is a convolution layer formed by 64 7×7 convolution kernels, and the 2nd convolution block is formed by one 3×3 max pooling layer and 3 residual units;
each residual unit is formed by 3 convolution layers; the 1st layer has 1×1 convolution kernels and is connected to the 2nd layer of 3×3 convolutions through an MMN (multi-layer Maxout network) activation function combined with the Dropout technique; the 2nd layer is connected to the 3rd layer of 1×1 convolutions through a ReLU activation function; the output is added to the output of the shortcut connection and, after nonlinear processing by a ReLU activation function, passed to the next residual unit;
with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units, and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
2. The camouflage voice detection method using a residual network according to claim 1, characterized in that: in step S2, the constant Q modulation envelope feature map is an image plotted with frequency on the abscissa and amplitude on the ordinate.
3. The camouflage voice detection method using a residual network according to claim 1, characterized in that in step S2 the modulation envelope feature map input to the ResNet classification network is preprocessed to a size of 224×224×3.
4. The camouflage voice detection method using a residual network according to claim 1, characterized in that the output of the 5 convolution blocks is average-pooled.
CN202110718049.0A 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network Active CN113506583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Publications (2)

Publication Number Publication Date
CN113506583A CN113506583A (en) 2021-10-15
CN113506583B true CN113506583B (en) 2024-01-05

Family

ID=78011168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718049.0A Active CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Country Status (1)

Country Link
CN (1) CN113506583B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767776A (en) * 2019-01-14 2019-05-17 广东技术师范学院 A kind of deception speech detection method based on intensive neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN111653289A (en) * 2020-05-29 2020-09-11 宁波大学 Playback voice detection method
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238843B2 (en) * 2018-02-09 2022-02-01 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection


Also Published As

Publication number Publication date
CN113506583A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN110827837B (en) Whale activity audio classification method based on deep learning
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN109285538B (en) Method for identifying mobile phone source in additive noise environment based on constant Q transform domain
CN111653289B (en) Playback voice detection method
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111754988A (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
CN110189766B (en) Voice style transfer method based on neural network
CN113488058A (en) Voiceprint recognition method based on short voice
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
Todkar et al. Speaker recognition techniques: A review
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN116052689A (en) Voiceprint recognition method
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113506583B (en) Camouflage voice detection method using residual error network
CN112133326A (en) Gunshot data amplification and detection method based on antagonistic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant