CN113506583B - Camouflage voice detection method using residual error network - Google Patents

Camouflage voice detection method using residual error network

Info

Publication number
CN113506583B
CN113506583B (application CN202110718049.0A)
Authority
CN
China
Prior art keywords
layer
voice
residual
constant
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110718049.0A
Other languages
Chinese (zh)
Other versions
CN113506583A (en)
Inventor
简志华
徐嘉
韦凤瑜
朱雅楠
于佳祺
吴超
游林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202110718049.0A
Publication of CN113506583A
Application granted
Publication of CN113506583B
Active legal status: Current
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - specially adapted for particular use for comparison or discrimination
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

The invention relates to the field of voice recognition, in particular to a camouflage voice detection method using a residual network. S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope; S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network; S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units; S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification. By using the constant Q transform and an improved residual network, the invention improves the accuracy of camouflage voice detection.

Description

Camouflage voice detection method using residual error network
Technical Field
The invention relates to the field of voice recognition, in particular to a camouflage voice detection method using a residual network.
Background
Camouflage voice detection technology detects camouflage voice (including replayed voice, synthesized voice, converted voice, and imitated voice) from differences in voice characteristics, so as to distinguish genuine voice from camouflage voice; it has very important applications in the field of voice biometric recognition. A common camouflage voice detection method comprises two parts: feature extraction and a classifier. At present, feature extraction mostly relies on traditional features such as Mel frequency cepstral coefficients (MFCCs) and constant Q cepstral coefficients (CQCCs), and common classifiers include the Gaussian mixture model (GMM), the hidden Markov model (HMM), and the support vector machine (SVM). However, these methods can only detect specific types of camouflage voice; when the camouflage type is unknown their detection performance drops, and they cannot adapt to varied camouflage voice attacks.
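For reference, a small illustrative sketch of such a conventional pipeline, MFCC features scored by per-class Gaussian mixture models, is given below; it is not part of the patent, and the feature settings, mixture sizes, and helper names are assumptions.

```python
# Illustrative sketch of the conventional approach described above (assumed
# settings): MFCC features scored by two Gaussian mixture models, one trained
# on genuine speech and one on camouflage speech.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_feats(wav, sr=8000):
    # frame-level MFCCs, shape (frames, 20)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20).T

def train_gmms(genuine_wavs, spoofed_wavs):
    # genuine_wavs / spoofed_wavs: lists of 1-D numpy arrays of training speech
    gmm_real = GaussianMixture(n_components=64).fit(
        np.vstack([mfcc_feats(w) for w in genuine_wavs]))
    gmm_fake = GaussianMixture(n_components=64).fit(
        np.vstack([mfcc_feats(w) for w in spoofed_wavs]))
    return gmm_real, gmm_fake

def llr_score(wav, gmm_real, gmm_fake):
    f = mfcc_feats(wav)
    # average log-likelihood difference; > 0 suggests genuine speech
    return gmm_real.score(f) - gmm_fake.score(f)
```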
Disclosure of Invention
In order to solve the above problems, the present invention provides a camouflage voice detection method using a residual network.
In order to achieve this technical purpose, the invention adopts the following technical scheme.
The camouflage voice detection method using the residual network comprises the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME);
S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification.
Further, in step S1, the feature extraction module processes the speech signal x(n) in the following steps:
S11, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K;
S12, the envelope of each band-limited signal x_k(n) is extracted;
S13, nonlinear processing is applied to the voice envelope;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform;
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME).
Further, in step S2, the constant Q modulation envelope feature map is an image plotted with frequency on the abscissa and amplitude on the ordinate.
Further, in step S2, the modulation envelope feature map input to the ResNet classification network is preprocessed to a size of 224×224×3.
Further, in step S2, the ResNet classification network is a 50-layer residual network.
Further, the residual network contains 5 convolution blocks; the 1st convolution block is a convolution layer composed of 64 7×7 convolution kernels, and the 2nd convolution block is composed of one 3×3 max pooling layer and 3 residual units.
Further, each residual unit is formed by 3 convolution layers; the 1st layer has 1×1 convolution kernels and is connected to the 2nd layer of 3×3 convolutions through an MMN (multi-layer Maxout network) activation function combined with the Dropout technique; the 2nd layer is connected to the 3rd layer of 1×1 convolutions through a ReLU activation function; the output is added to the output of the shortcut connection and, after nonlinear processing by a ReLU activation function, passed to the next residual unit.
Further, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units, and the 5th convolution block of 3 residual units; the number of channels of the residual units increases from one convolution block to the next.
Further, the output of the 5 convolution blocks is average-pooled.
Compared with the prior art, the invention has the following beneficial technical effects:
(1) When extracting the constant Q modulation envelope feature, the constant Q transform is used; its variable time-frequency resolution matches the characteristics of speech signals, making it more suitable for processing speech and making full use of the information contained in the speech signal.
(2) An improved residual network is adopted in which the MMN activation function combined with the Dropout technique replaces the bottom-layer ReLU activation function, which alleviates the drop in accuracy that occurs when a convolutional neural network is trained with more layers and improves the accuracy of camouflage voice detection.
(3) A common constant Q modulation envelope feature is extracted for different kinds of camouflage voice and classified by the residual network, overcoming the drawback that traditional camouflage detection methods can only detect a single type of camouflage voice; detection can therefore be carried out even when the camouflage type is unknown, which makes the method convenient to use.
Drawings
FIG. 1 is a schematic diagram of a camouflage voice detection method utilizing a residual network in accordance with the present invention;
FIG. 2 is a block diagram of a residual unit of the present invention;
FIG. 3 is a block diagram of a Maxout unit of the present invention;
FIG. 4 is a block diagram of an MMN unit of the present invention;
FIG. 5 is a diagram of the structure of the MMN activation function combined with Dropout according to the present invention;
FIG. 6 is a flowchart of the camouflage voice detection method using a residual network in accordance with the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
As shown in FIGS. 1-6, the present embodiment proposes a camouflage voice detection method using a residual network, which comprises the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME);
S2, the extracted constant Q modulation envelope feature is output in the form of a feature map, preprocessed, and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, the output passes through an average pooling layer and finally through a fully connected layer and a Softmax layer to produce the voice classification.
The voice camouflage detection system provided by this embodiment consists of two parts. The first part calculates the constant Q modulation envelope of any input voice and obtains a feature map of the input voice that serves as the input of the classification network. Different types of camouflage voice differ in fundamental frequency, noise and other aspects, so their feature maps differ from those of genuine voice. The second part inputs the feature maps into a trained ResNet classification network, performs deep feature extraction and category matching through several groups of convolution blocks, and finally outputs the classification through a fully connected layer and a Softmax layer. The whole process classifies genuine voice and camouflage voice, realizing camouflage voice detection.
In step S1, the feature extraction module processes the voice signal x(n) in the following steps.
S11, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K.
S12, the envelope of each band-limited signal x_k(n) is extracted. Specifically, the Hilbert transform of each band signal is computed to obtain its analytic signal, and the amplitude of the analytic signal is taken as the envelope, giving the envelopes of the K band signals.
S13, nonlinear processing is applied to the voice envelope. Specifically, the speech envelope is processed nonlinearly on a logarithmic scale.
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform. Specifically, the constant Q transform parameters are determined and the feature is transformed to the frequency domain by the constant Q transform.
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME).
An image is then drawn with frequency on the abscissa and amplitude on the ordinate to obtain the constant Q modulation envelope feature map. The feature map is preprocessed and resized to 224×224×3.
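A small sketch of this step is given below; it is not from the patent, and the plotting and resizing choices are assumptions. It renders a CQME feature vector as an image with frequency on the abscissa and amplitude on the ordinate and resizes it to the 224×224×3 network input.

```python
# Sketch of rendering the CQME feature vector as a picture and resizing it to
# the 224x224x3 input of the classification network (illustrative names).
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image

def cqme_to_image(cqme, out_path="cqme.png", size=(224, 224)):
    fig, ax = plt.subplots(figsize=(3, 3), dpi=100)
    ax.plot(np.arange(len(cqme)), cqme)      # frequency index on the abscissa
    ax.set_xlabel("frequency band")
    ax.set_ylabel("amplitude")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    img = Image.open(out_path).convert("RGB").resize(size)
    return np.asarray(img)                   # shape (224, 224, 3)
```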
Thus, the present embodiment uses the constant Q modulation envelope feature in combination with an improved residual network (ResNet) to handle camouflage voice detection when the camouflage type is unknown. After the voice signal x(n) is input, the system divides it into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K. The envelope of each band signal is extracted and processed nonlinearly, the envelope features are transformed into the frequency domain by the constant Q transform (CQT), and the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based voice feature, the constant Q modulation envelope (CQME). The extracted feature is output in the form of a picture, preprocessed, and input into an improved 50-layer ResNet classification network. The ResNet classification network of the present invention uses residual units consisting of two 1×1 convolution layers, one 3×3 convolution layer, and one shortcut connection, and uses a multi-layer Maxout network (MMN) activation function combined with the Dropout technique. After the constant Q modulation envelope feature map is input into the classification network, deep features are extracted by one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units; the data volume is reduced by average pooling, and finally the voice classification is obtained through a fully connected layer and a Softmax layer. In the training stage, a training data set of constant Q modulation envelope feature maps of different camouflage voices is input into the classification network, and the cross entropy function is used as the loss function for training.
The feature extraction module extracts the constant Q modulation envelope feature from the input speech. Compared with genuine speech, replayed speech adds a recording-and-playback step that introduces device noise, ambient noise, codec distortion, and so on. During recording the voice signal may also propagate along more than one path, so it shows delay and attenuation relative to genuine speech and differs noticeably in the high-frequency part. Synthesized and converted speech only approximate the rough outline of the spectral envelope and do not capture the small frame-to-frame variations, so compared with natural speech they are smoother and carry noise. Imitated speech has the lowest similarity to natural speech; it can only confuse listeners at the level of the human ear, and its spectral structure differs clearly from that of genuine speech. Different kinds of camouflage speech can therefore be distinguished using the constant Q modulation envelope feature.
The feature extraction module processes the voice signal x(n) as follows:
First, the input voice x(n) is passed through a frequency-division filter bank and divided into K signals x_k(n) of different frequency bands. This is realized with a Mel filter bank whose lowest frequency is 300 Hz and whose highest frequency is 3400 Hz, in accordance with the frequency range of the speech signal. In the invention K is taken as 128, i.e. a 128-band frequency division is realized. The Mel filter bank is a group of triangular band-pass filters whose transfer functions are h_k(q), 0 ≤ k < K; each h_k(q) rises linearly from 0 at f(k-1) to its maximum at f(k) and falls linearly back to 0 at f(k+1).
Here K is the number of filters (128), f(k) is the center Mel frequency of the k-th triangular filter, f(k-1) is its lower cutoff Mel frequency, f(k+1) is its upper cutoff Mel frequency, and q is an integer value between f(k-1) and f(k+1). N is the number of DFT points, taken as 256; f_s is the sampling frequency, taken as 8 kHz; f_l is the lowest frequency of the filter bank and f_h the highest, taken as 300 Hz and 3400 Hz. F_mel(f) (equation (3)) converts an ordinary frequency value f into a Mel frequency value, and its inverse converts a Mel frequency value b back into an ordinary frequency value.
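A minimal sketch of such a 128-band Mel filter bank is given below; the triangular-filter normalization and the Mel-scale formula are assumed standard forms rather than values quoted from the patent, and the band signals x_k(n) would be obtained by filtering the input speech with these frequency responses.

```python
# Sketch of the K = 128 band Mel filter bank used for frequency division,
# with N = 256 DFT points, fs = 8 kHz, and a 300-3400 Hz range (assumed
# standard Mel formulas).
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # assumed standard Mel mapping

def inv_mel(b):
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)

def mel_filterbank(K=128, N=256, fs=8000, f_low=300.0, f_high=3400.0):
    centers = inv_mel(np.linspace(mel(f_low), mel(f_high), K + 2))   # Hz
    freqs = np.arange(N // 2 + 1) * fs / N                           # DFT bin frequencies
    H = np.zeros((K, freqs.size))
    for k in range(1, K + 1):
        lo, c, hi = centers[k - 1], centers[k], centers[k + 1]
        rising = (freqs >= lo) & (freqs <= c)
        falling = (freqs > c) & (freqs <= hi)
        H[k - 1, rising] = (freqs[rising] - lo) / (c - lo)           # triangular rise
        H[k - 1, falling] = (hi - freqs[falling]) / (hi - c)         # triangular fall
    return H        # h_k(q) sampled on the DFT bins, shape (K, N//2 + 1)
```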
The envelope of each band-limited signal x_k(n) is then extracted:
s_k(n) = x_k(n) + j·x̂_k(n),  m_k(n) = |s_k(n)| = sqrt(s_k(n)·s_k*(n)),
where j is the imaginary unit, x̂_k(n) is the Hilbert transform of x_k(n), s_k(n) is the analytic signal of x_k(n), s_k*(n) is the conjugate of s_k(n), and m_k(n) is the spectral envelope of x_k(n). The speech envelope is then processed nonlinearly; in the present invention the nonlinearity is realized by a logarithmic function.
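A minimal sketch of steps S12-S13 for one band signal, extracting the Hilbert envelope and applying the logarithmic nonlinearity, is given below (helper names are illustrative).

```python
# Sketch of envelope extraction and logarithmic nonlinearity for one band
# signal x_k(n).
import numpy as np
from scipy.signal import hilbert

def log_envelope(x_k):
    s_k = hilbert(x_k)               # analytic signal s_k(n) = x_k(n) + j*x^_k(n)
    m_k = np.abs(s_k)                # envelope m_k(n) = |s_k(n)|
    return np.log10(m_k + 1e-12)     # lg(m_k(n)); small offset avoids log(0)
```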
The nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by a constant Q transform, expressed as:
X_k^CQ(l) = (1/N_l) · Σ_{n=0}^{N_l−1} lg(m_k(n)) · w_{N_l}(n) · e^{−j2πQn/N_l},  l = 1, 2, …, L,
where n denotes the n-th time sample of the time-domain signal, L is the total number of frequency components of the constant Q transform spectrum, f_min and f_max are the lowest and highest frequencies of the processed lg(m_k(n)), f_l is the frequency of the l-th component, Δf_l is the filter bandwidth of the l-th component, Q is a constant equal to the ratio of center frequency to filter bandwidth, f_s is the sampling frequency, taken as 8 kHz, l is the frequency index of the constant Q transform spectrum, X_k^CQ(l) is the constant Q transform value of the l-th frequency component of the k-th frequency band, N_l is a window length that varies with frequency, j is the imaginary unit, and w_{N_l}(n) is a Hamming window of length N_l:
w_{N_l}(n) = 0.54 − 0.46·cos(2πn/(N_l − 1)),  0 ≤ n ≤ N_l − 1.
When a signal is processed by the Fourier transform, the frequency points are equally spaced, which corresponds to a group of filters with equally spaced center frequencies and identical bandwidths. The constant Q transform was originally developed for music signals, whose scale frequencies are not equally spaced but logarithmic with base 2; it can be regarded as a filter bank whose center frequencies are exponentially distributed and whose ratio of center frequency to bandwidth is constant. The constant Q transform is therefore better matched to speech signals: its time-frequency resolution is variable, with higher frequency resolution at low frequencies and higher time resolution at high frequencies. Finally, the mean square value of each frequency component is calculated to obtain the voice feature.
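The following sketch computes the constant Q transform of one log envelope and its mean square value; the bin spacing, window-length rule, and modulation-frequency range are assumptions based on the standard constant Q transform rather than values given in the patent, and the mean square is assumed to be taken over the constant Q components of each band.

```python
# Sketch of steps S14-S15 for one band: constant Q transform of lg(m_k(n))
# followed by the mean square value (assumed parameterization).
import numpy as np

def cqt_mean_square(log_env, fs=8000.0, f_min=10.0, f_max=400.0, bins_per_octave=12):
    n_bins = int(np.ceil(bins_per_octave * np.log2(f_max / f_min)))
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)       # Q = f_l / delta_f_l
    X = np.zeros(n_bins, dtype=complex)
    for l in range(n_bins):
        f_l = f_min * 2.0 ** (l / bins_per_octave)          # exponentially spaced centers
        N_l = min(int(round(Q * fs / f_l)), len(log_env))   # window length varies with frequency
        n = np.arange(N_l)
        w = np.hamming(N_l)                                  # Hamming window of length N_l
        X[l] = np.sum(log_env[:N_l] * w * np.exp(-2j * np.pi * Q * n / N_l)) / N_l
    return np.mean(np.abs(X) ** 2)                           # mean square value for this band
```

Applying this to the K band envelopes yields a K-dimensional CQME vector, which is then plotted as the feature map described below.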
and finally, drawing the obtained eigenvector into an eigenvector by taking the frequency as an independent variable, namely, inputting a subsequent classification network, wherein the eigenvector is a constant Q conversion value of the first frequency component of the kth frequency band.
The improved residual network module of this embodiment is an improved ResNet classification network. The feature map input to the ResNet classification network is first preprocessed to a size of 224×224×3. The invention uses a 50-layer residual network as the classification network, containing 5 convolution blocks. The 1st convolution block is a convolution layer composed of 64 7×7 convolution kernels, and the 2nd convolution block is composed of one 3×3 max pooling layer and 3 residual units, where each residual unit consists of 3 convolution layers.
The structure of the residual unit in this embodiment is shown in FIG. 2. Here X is the input; the shortcut connection passes the input X directly to the adder, where it is added to the output F(X) of the convolution layers, and the final output is H(X) = F(X) + X. What the residual network has to learn is no longer the output H(X) but the residual F(X) = H(X) − X; in the extreme case where H(X) = X, F(X) only needs to be trained to approach 0. The principle of the residual unit can be expressed as:
Y = F(X, {ω_i}) + ω_s·X    (14)
where Y denotes the output of the residual unit, ω_i denotes the processing of the 3 convolution layers, and ω_s denotes the processing of the shortcut connection, which keeps the dimensions of the shortcut connection and the residual mapping consistent so that they can be added directly; this is realized by a 1×1 convolution layer. The function of a 1×1 convolution layer is to reduce or increase the dimension, and the number of convolution kernels determines the dimension after processing. The dimension of the input feature matrix is first reduced to 64 and then raised to 256 in the 3rd layer; MMN is adopted as the activation function in the bottommost layer, combined with the Dropout technique, while ReLU is adopted as the activation function in the 2nd and 3rd layers.
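As an illustration of this structure, the following PyTorch sketch implements a bottleneck residual unit of this form; it is not the patent's reference implementation, and the class name, stride handling, and the placeholder for the MMN activation (sketched further below) are assumptions.

```python
# Sketch of the bottleneck residual unit: 1x1 -> 3x3 -> 1x1 convolutions plus
# a shortcut connection; the bottom-layer activation slot is meant for the MMN
# activation combined with Dropout, with ReLU standing in when none is given.
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, in_ch, mid_ch=64, out_ch=256, stride=1, first_act=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.first_act = first_act if first_act is not None else nn.ReLU(inplace=True)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection keeps shortcut and residual mapping dimensions consistent
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if (in_ch != out_ch or stride != 1) else nn.Identity())

    def forward(self, x):
        out = self.first_act(self.conv1(x))   # 1x1 conv, dimension reduced (e.g. to 64)
        out = self.relu(self.conv2(out))      # 3x3 conv
        out = self.conv3(out)                 # 1x1 conv, dimension raised (e.g. to 256)
        out = out + self.shortcut(x)          # H(X) = F(X) + X
        return self.relu(out)                 # nonlinearity before the next residual unit
```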
In the residual unit, the first layer adopts the MMN activation function, which is developed from the Maxout activation function; the structure of a Maxout unit is shown in FIG. 3. Here x_i is the i-th component of the input vector, w_{ij} is the weight coefficient connecting the i-th input component to the j-th hidden node, b_j is the offset of the j-th hidden node, and z_j is the sum of the weighted input and the offset term:
z_j = Σ_i w_{ij}·x_i + b_j.
The weights and offsets reach their optimal values through training. If each Maxout unit contains g hidden nodes, the output of the Maxout unit is:
h(x) = max_{1≤j≤g} z_j.
Multiple Maxout units are used jointly to form an MMN unit; the structure of the MMN unit used in this embodiment is shown in FIG. 4. Here w^(1)_{ij} and b^(1)_j denote the weight coefficient and offset of the j-th output node for the i-th input component in the 1st weighting operation, z^(1)_j denotes the output of the j-th node of the 1st weighting, and h^(1)_j denotes the output of the j-th node of the 1st max operation; these are calculated in the same way as in the Maxout activation function, and the quantities of the later stages are calculated analogously. The MMN units form the MMN activation function. Whereas the Maxout activation function can fit any convex function, the MMN activation function can fit a distribution of any form, represents the features better, and converges faster.
In the training stage, after the feature matrix has been processed by the MMN activation function it is connected to the next convolution layer in combination with the Dropout technique, i.e. during each training pass a certain proportion of hidden nodes is randomly ignored. In the invention Dropout hides 50% of the nodes and the hidden layer contains 100 MMN units, which is equivalent to training with different networks and averaging the training results; this technique is not used in the prediction stage. The structure of the MMN activation function combined with the Dropout technique is shown in FIG. 5.
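The following PyTorch sketch illustrates one possible realization of the Maxout unit and of the MMN activation combined with Dropout; the channel-wise Maxout form, the two-stage structure, and the 100-unit hidden layer are assumptions, since the exact wiring of FIGS. 3-5 is not fully specified in the text.

```python
# Sketch of a channel-wise Maxout unit and an MMN activation combined with
# Dropout(0.5), as used at the bottom layer of each residual unit (assumed
# structure).
import torch
import torch.nn as nn

class MaxoutConv(nn.Module):
    """g linear pieces per output channel; output is the max over the g pieces."""
    def __init__(self, in_ch, out_ch, g=4):
        super().__init__()
        self.g, self.out_ch = g, out_ch
        self.proj = nn.Conv2d(in_ch, out_ch * g, kernel_size=1)   # z_j = W_j x + b_j

    def forward(self, x):
        z = self.proj(x)                                  # (N, out_ch*g, H, W)
        n, _, h, w = z.shape
        z = z.view(n, self.out_ch, self.g, h, w)
        return z.max(dim=2).values                        # max over the g hidden nodes

class MMN(nn.Module):
    """Multi-layer Maxout activation followed by Dropout (active in training only)."""
    def __init__(self, channels, hidden=100, g=4, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            MaxoutConv(channels, hidden, g),   # 1st weighting + max operation
            MaxoutConv(hidden, channels, g),   # 2nd weighting + max operation
            nn.Dropout(p_drop),                # 50% of nodes randomly ignored in training
        )

    def forward(self, x):
        return self.net(x)
```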
the layer 3 convolution block is composed of 4 residual units, the layer 4 convolution block is composed of 6 residual units, the layer 5 convolution block is composed of 3 residual units, the channel number of the residual units in each layer of convolution block is continuously progressive, and the layer 5 convolution blocks are shown in table 1.
TABLE 1
The output of the 5 convolution blocks is average-pooled to reduce the data complexity, passed through a fully connected layer, and finally through a Softmax layer to obtain the probability that the voice signal is genuine or camouflaged and thus its classification.
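Putting the pieces together, the following sketch assembles the modified 50-layer classification network and a single cross-entropy training step, reusing the ResidualUnit and MMN sketches above; the per-block channel widths and strides follow the standard ResNet-50 layout, which the text implies but does not tabulate, and all names and hyperparameters are illustrative.

```python
# Sketch of the full classification network (7x7 conv, 3x3 max pooling,
# 3+4+6+3 = 16 residual units, average pooling, fully connected layer) and one
# training step with the cross entropy loss (Softmax is folded into the loss).
import torch
import torch.nn as nn

def make_block(n_units, in_ch, mid_ch, out_ch, stride):
    units = [ResidualUnit(in_ch, mid_ch, out_ch, stride, first_act=MMN(mid_ch))]
    units += [ResidualUnit(out_ch, mid_ch, out_ch, 1, first_act=MMN(mid_ch))
              for _ in range(n_units - 1)]
    return nn.Sequential(*units)

classifier = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # conv block 1: 7x7 convolution
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 max pooling
    make_block(3, 64, 64, 256, 1),                            # conv block 2: 3 residual units
    make_block(4, 256, 128, 512, 2),                          # conv block 3: 4 residual units
    make_block(6, 512, 256, 1024, 2),                         # conv block 4: 6 residual units
    make_block(3, 1024, 512, 2048, 2),                        # conv block 5: 3 residual units
    nn.AdaptiveAvgPool2d(1),                                  # average pooling layer
    nn.Flatten(),
    nn.Linear(2048, 2),                                       # fully connected layer: genuine vs. camouflage
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
images = torch.randn(8, 3, 224, 224)          # batch of preprocessed CQME feature maps
labels = torch.randint(0, 2, (8,))            # 0 = genuine, 1 = camouflage (assumed labels)
loss = criterion(classifier(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```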

Claims (4)

1. A camouflage voice detection method using a residual network, characterized by comprising the following steps:
S1, processing a voice signal x(n) with a feature extraction module to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope;
S2, outputting the extracted constant Q modulation envelope feature in the form of a feature map, preprocessing it, and inputting it into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network as images, extracting deep features through one 7×7 convolution layer, one 3×3 pooling layer, and then 16 residual units;
S4, after the 16 residual units, passing the output through an average pooling layer and finally through a fully connected layer and a Softmax layer to output the voice classification;
in step S1, the processing of the voice signal x(n) by the feature extraction module comprises the following steps:
S11, passing the input voice x(n) through a frequency-division filter bank to divide it into K signals x_k(n) of different frequency bands, where k = 1, 2, …, K;
S12, extracting the envelope of each band-limited signal x_k(n);
S13, performing nonlinear processing on the voice envelope;
S14, transforming the nonlinearly processed envelope lg(m_k(n)) to the frequency domain by a constant Q transform, m_k(n) being the spectral envelope of x_k(n);
S15, calculating the mean square value of each frequency band to obtain the constant Q modulation envelope, the modulation-spectrum-based voice feature;
in step S2, the ResNet classification network is a 50-layer residual network;
the residual network contains 5 convolution blocks; the 1st convolution block is a convolution layer formed by 64 7×7 convolution kernels, and the 2nd convolution block is formed by one 3×3 max pooling layer and 3 residual units;
each residual unit is formed by 3 convolution layers; the 1st layer has 1×1 convolution kernels and is connected to the 2nd layer of 3×3 convolutions through an MMN (multi-layer Maxout network) activation function combined with the Dropout technique; the 2nd layer is connected to the 3rd layer of 1×1 convolutions through a ReLU activation function; the output is added to the output of the shortcut connection and, after nonlinear processing by a ReLU activation function, passed to the next residual unit;
with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units, and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
2. The camouflage voice detection method using a residual network according to claim 1, characterized in that: in step S2, the constant Q modulation envelope feature map is an image plotted with frequency on the abscissa and amplitude on the ordinate.
3. The camouflage voice detection method using a residual network according to claim 1, characterized in that in step S2 the modulation envelope feature map input to the ResNet classification network is preprocessed to a size of 224×224×3.
4. The camouflage voice detection method using a residual network according to claim 1, characterized in that the output of the 5 convolution blocks is average-pooled.
CN202110718049.0A 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network Active CN113506583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Publications (2)

Publication Number Publication Date
CN113506583A CN113506583A (en) 2021-10-15
CN113506583B true CN113506583B (en) 2024-01-05

Family

ID=78011168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718049.0A Active CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Country Status (1)

Country Link
CN (1) CN113506583B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767776A (en) * 2019-01-14 2019-05-17 广东技术师范学院 A kind of deception speech detection method based on intensive neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN111653289A (en) * 2020-05-29 2020-09-11 宁波大学 Playback voice detection method
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238843B2 (en) * 2018-02-09 2022-02-01 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection


Also Published As

Publication number Publication date
CN113506583A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN110827837B (en) Whale activity audio classification method based on deep learning
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN109285538B (en) Method for identifying mobile phone source in additive noise environment based on constant Q transform domain
CN111653289B (en) Playback voice detection method
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111754988A (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
CN110189766B (en) Voice style transfer method based on neural network
CN113488058A (en) Voiceprint recognition method based on short voice
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
Todkar et al. Speaker recognition techniques: A review
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN116052689A (en) Voiceprint recognition method
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113506583B (en) Camouflage voice detection method using residual error network
CN112133326A (en) Gunshot data amplification and detection method based on antagonistic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant