CN113506583A - Disguised voice detection method using residual error network - Google Patents

Disguised voice detection method using residual error network

Info

Publication number
CN113506583A
Authority
CN
China
Prior art keywords
layer
residual
voice
network
disguised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110718049.0A
Other languages
Chinese (zh)
Other versions
CN113506583B (en)
Inventor
简志华
徐嘉
韦凤瑜
朱雅楠
于佳祺
吴超
游林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110718049.0A priority Critical patent/CN113506583B/en
Publication of CN113506583A publication Critical patent/CN113506583A/en
Application granted granted Critical
Publication of CN113506583B publication Critical patent/CN113506583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

The invention relates to the field of voice recognition, and in particular to a disguised voice detection method using a residual network. The method comprises the following steps. S1: a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope. S2: the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network. S3: after the constant Q modulation envelope features are input into the classification network as pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units. S4: after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer. By applying the constant Q transform and adopting an improved residual network, the method improves the accuracy of disguised voice detection.

Description

Disguised voice detection method using residual error network
Technical Field
The invention relates to the field of voice recognition, and in particular to a disguised voice detection method using a residual network.
Background
Disguised voice detection is the detection of disguised voice (including replayed voice, synthesized voice, converted voice and simulated voice) on the basis of differences in voice characteristics, so as to distinguish genuine voice from disguised voice; it has very important applications in the field of voice biometric recognition. A common disguised voice detection method consists of two parts: feature extraction and a classifier. At present, Mel-Frequency Cepstral Coefficients (MFCC), Constant Q Cepstral Coefficients (CQCC) and the like are used for feature extraction, and conventional machine learning methods such as the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM) and the Support Vector Machine (SVM) are used as classifiers. However, these methods can only detect specific types of disguised voice; when the type of disguise is unknown, the detection performance degrades, and they cannot meet the challenge of ever-changing disguised voice.
Disclosure of Invention
To solve the above problems, the present invention provides a method for detecting disguised voice using a residual network.
In order to achieve this technical purpose, the invention adopts the following technical scheme.
The method for detecting disguised voice using a residual network is characterized by comprising the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the Constant Q Modulation Envelope (CQME);
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
Further, in step S1, the processing of the speech signal x(n) by the feature extraction module comprises the following steps:
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K;
S12, the envelope of each frequency-divided signal x_k(n) is extracted;
S13, the voice envelope is processed nonlinearly;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform;
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope (CQME).
Further, in step S2, the constant Q modulation envelope feature map is an image plotted with frequency as the horizontal coordinate and amplitude as the vertical coordinate.
Further, in step S2, the modulation envelope feature map input to the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3.
Further, in step S2, the ResNet classification network is a residual network of 50 layers.
Further, the residual network contains 5 convolution blocks; the 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units.
Furthermore, each residual unit is formed by 3 convolutional layers: the convolution kernel of the 1st layer is 1 × 1, and this layer is connected to the 3 × 3 convolution of the 2nd layer through the MMN activation function combined with the Dropout technique; the 2nd convolutional layer is connected to the 1 × 1 convolution of the 3rd layer through a ReLU activation function; the output is added to the output of the shortcut connection, processed nonlinearly once more by a ReLU activation function, and passed to the next residual unit.
Further, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
Further, after the 5 convolution blocks, the output is average-pooled.
Compared with the prior art, the invention has the beneficial technical effects that:
(1) When the constant Q modulation envelope feature is extracted, the constant Q transform is applied; its variable time-frequency resolution matches the characteristics of the voice signal, makes the method more suitable for processing voice signals, and makes full use of the information contained in the voice signal.
(2) An improved residual network is adopted, in which the MMN activation function combined with the Dropout technique replaces the bottom-layer ReLU activation function; this solves the problem that accuracy drops when the number of trained layers of a convolutional neural network becomes large, and improves the accuracy of disguised voice detection.
(3) The constant Q modulation envelope features of different disguised voices are extracted and classified with a residual network, whereas traditional disguise detection approaches can only detect a single kind of disguised voice; the method can therefore detect different disguised voices even when the disguise type is unknown, and is convenient to use.
Drawings
FIG. 1 is a schematic diagram of the method for detecting disguised voice using a residual network according to the present invention;
FIG. 2 is a diagram of the residual unit structure of the present invention;
FIG. 3 is a diagram of the Maxout unit structure of the present invention;
FIG. 4 is a diagram of the MMN unit structure of the present invention;
FIG. 5 is a diagram of the MMN structure combined with Dropout of the present invention;
FIG. 6 is a flowchart of the entire method for detecting disguised voice using a residual network according to the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
As shown in FIGS. 1 to 6, the present embodiment proposes a method for detecting disguised voice using a residual network, comprising the following steps.
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the Constant Q Modulation Envelope (CQME);
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
The disguised voice detection system provided by this embodiment consists of two parts. The first part computes the constant Q modulation envelope of any input voice to obtain its feature map, which serves as the input of the classification network. Different types of disguised voice differ in fundamental frequency, noise and so on, so their feature maps differ from that of genuine voice. In the second part, after the feature map is input into the trained ResNet classification network, deep features are extracted and classes are matched through several groups of convolution blocks, and the classification is finally output through a fully connected layer and a Softmax layer. The whole process classifies genuine voice and disguised voice, realizing disguised voice detection.
In step S1, the processing of the speech signal x(n) by the feature extraction module comprises the following steps.
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K.
S12, the envelope of each frequency-divided signal x_k(n) is extracted. Specifically, the Hilbert transform of each band signal is computed to obtain its analytic signal, and the amplitude of the analytic signal of each band is taken, giving the envelopes of the K band signals.
S13, the voice envelope is processed nonlinearly. Specifically, a logarithmic function applies a nonlinear scale transformation to the voice envelope.
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform. Specifically, the constant Q transform parameters are determined and the feature is transformed to the frequency domain by the constant Q transform.
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope (CQME).
An image is drawn with frequency as the horizontal coordinate and amplitude as the vertical coordinate to obtain the constant Q modulation envelope feature map. The feature map is preprocessed and its size is redefined to 224 × 224 × 3.
therefore, the present embodiment adopts a combination of a constant Q modulation envelope characteristic and an improved Residual Network (ResNet) to process a masquerading voice detection scheme with unknown masquerading mode. After the speech signal x (n) is inputted, the system divides its frequency into K signals x with different frequency bandsk(n), wherein K is 1,2, …, K. Extracting an envelope of the signal of each frequency band, performing nonlinear processing, then transforming the envelope characteristic to a frequency domain through Constant Q Transform (CQT), calculating a mean square value of each frequency band, and obtaining a speech characteristic based on a modulation spectrum, namely a Constant Q Transform envelope (CQME). And outputting the extracted features in the form of pictures, and inputting the preprocessed features into the improved 50-layer ResNet classification network. The ResNet classification network in the invention adopts a residual unit formed by connecting 2 1 × 1 convolutional layers, 13 × 3 convolutional layers and 1 short, and combines a multi-layer Maxout (MMN) activation function with a Dropout technology. After a constant Q modulation envelope characteristic diagram is input into a classification network, firstly, 1 7 multiplied by 7 convolutional layer and 3 multiplied by 3 pooling layer are passed through, then, 16 residual error units are used for realizing depth characteristic extraction, then, average pooling is used for reducing data volume, and finally, a full connection layer and a Softmax layer are used for extracting data volumeA classification of the speech is obtained. In the training stage, the training data sets of the constant Q modulation envelope characteristic diagrams of different camouflage voices are input into the classification network, and the cross entropy function is used as a loss function for training.
The feature extraction module extracts the constant Q modulation envelope feature from the input voice. Compared with genuine voice, replayed voice has gone through an additional recording step, which introduces device noise, environmental noise, coding and decoding distortion and so on. During recording the voice signal propagates over more than one path, so compared with genuine voice it shows delay and attenuation, and there are obvious differences in the high-frequency part. Synthesized voice and converted voice only imitate the rough outline of the spectral envelope when they are generated and do not capture the slight changes between frames, so compared with natural voice they are smoother and noisier. Simulated (imitated) voice has the lowest similarity to natural voice: it can deceive only the human ear, and its spectral structure differs obviously from that of genuine voice. The constant Q modulation envelope feature can therefore distinguish different kinds of disguised voice.
The specific content of the processing of the voice signal x (n) by the feature extraction module is as follows:
First, the input speech x(n) is passed through a frequency-dividing filter bank that splits it into K signals x_k(n) occupying different frequency bands. This function is implemented by a Mel filter bank whose lowest frequency is 300 Hz and whose highest frequency is 3400 Hz, matching the frequency range of the speech signal. In the invention K is set to 128, i.e. 128 sub-bands are produced. The Mel filter bank is a set of triangular band-pass filters, each with transfer function h_k(q), 0 ≤ k < K:

h_k(q) = 0,                                  q < f(k-1)
h_k(q) = (q - f(k-1)) / (f(k) - f(k-1)),     f(k-1) ≤ q ≤ f(k)
h_k(q) = (f(k+1) - q) / (f(k+1) - f(k)),     f(k) < q ≤ f(k+1)
h_k(q) = 0,                                  q > f(k+1)                                   (1)

f(k) = (N / f_s) · F_mel^{-1}( F_mel(f_l) + k · (F_mel(f_h) - F_mel(f_l)) / (K + 1) )     (2)

F_mel(f) = 2595 · lg(1 + f / 700)                                                         (3)

F_mel^{-1}(b) = 700 · (10^{b/2595} - 1)                                                   (4)

where K = 128 is the number of filters, f(k) is the center Mel frequency of the k-th triangular filter, f(k-1) is its lower cut-off Mel frequency, f(k+1) is its upper cut-off Mel frequency, and q is an integer value between f(k-1) and f(k+1). N is the number of DFT points, taken as 256; f_s is the sampling frequency, taken as 8 kHz; f_l is the lowest frequency of the filter bank (300 Hz) and f_h its highest frequency (3400 Hz). F_mel(f) in formula (3) converts an ordinary frequency value f to its Mel frequency value, and F_mel^{-1}(b) in formula (4) is its inverse transform, converting a Mel frequency value b back to an ordinary frequency value.
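As a non-authoritative illustration of formulas (1)-(4), the following Python sketch builds such a triangular Mel filter bank for K = 128 bands between 300 Hz and 3400 Hz with a 256-point DFT at 8 kHz; the function names and the use of NumPy are assumptions for illustration only, and with these parameters adjacent filters may share DFT bins.

```python
import numpy as np

def hz_to_mel(f):
    # Formula (3): ordinary frequency -> Mel frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(b):
    # Formula (4): Mel frequency -> ordinary frequency
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)

def mel_filter_bank(K=128, N=256, fs=8000.0, fl=300.0, fh=3400.0):
    """Return a (K, N//2 + 1) matrix of triangular transfer functions h_k(q), formula (1)."""
    # K + 2 points equally spaced on the Mel scale give the filter edges and centers (formula (2))
    mel_points = np.linspace(hz_to_mel(fl), hz_to_mel(fh), K + 2)
    bins = np.floor((N / fs) * mel_to_hz(mel_points)).astype(int)

    H = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for q in range(left, center + 1):      # rising edge of the triangle
            if center > left:
                H[k - 1, q] = (q - left) / (center - left)
        for q in range(center, right + 1):     # falling edge of the triangle
            if right > center:
                H[k - 1, q] = (right - q) / (right - center)
    return H

# Usage: applying H to the DFT of the speech is one way to realize the split into K band signals x_k(n)
H = mel_filter_bank()
print(H.shape)  # (128, 129)
```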
The envelope is then extracted from each frequency-divided signal x_k(n):

s_k(n) = x_k(n) + j · x̂_k(n)                     (5)

m_k(n) = |s_k(n)| = sqrt( s_k(n) · s_k*(n) )      (6)

where j is the imaginary unit, x̂_k(n) is the Hilbert transform of x_k(n), s_k(n) is the analytic signal of x_k(n), s_k*(n) is the conjugate of s_k(n), and m_k(n) is the envelope of x_k(n). The voice envelope is then processed nonlinearly, the nonlinearity being realized by a logarithmic function.
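A minimal SciPy sketch of formulas (5) and (6) for one band signal follows; the small constant added before the logarithm to avoid log(0) is an assumption, not something the patent specifies.

```python
import numpy as np
from scipy.signal import hilbert

def log_envelope(x_k, eps=1e-10):
    """Envelope of a band signal x_k(n) via its analytic signal, then logarithmic compression."""
    s_k = hilbert(x_k)          # analytic signal s_k(n) = x_k(n) + j * Hilbert{x_k(n)}, formula (5)
    m_k = np.abs(s_k)           # envelope m_k(n) = |s_k(n)|, formula (6)
    return np.log10(m_k + eps)  # nonlinear processing by a logarithmic function (eps is an assumed guard)

# Usage on a synthetic amplitude-modulated band signal at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x_k = np.sin(2 * np.pi * 500 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t))
print(log_envelope(x_k).shape)  # (8000,)
```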
The next step is to transform the nonlinearly processed envelope lg(m_k(n)) to the frequency domain by the constant Q transform, expressed as:

X_k^CQ(l) = (1 / N_l) · Σ_{n=0}^{N_l - 1} lg(m_k(n)) · w_{N_l}(n) · e^{-j·2π·Q·n/N_l}     (7)

f_l = f_min · 2^{(l-1)/B},  l = 1, 2, ..., L                                              (8)

Δf_l = f_l · (2^{1/B} - 1)                                                                (9)

Q = f_l / Δf_l = 1 / (2^{1/B} - 1)                                                        (10)

N_l = f_s / Δf_l = Q · f_s / f_l                                                          (11)

where n denotes the n-th time component of the time-domain signal, L is the total number of frequency components of the constant Q transform spectrum, B is the number of frequency components per octave, f_min and f_max are the lowest and highest frequencies of the processed lg(m_k(n)), f_l is the frequency of the l-th component, Δf_l is the filter bandwidth of the l-th component, Q is a constant equal to the ratio of center frequency to filter bandwidth, f_s is the sampling frequency (8 kHz), l is the frequency index of the constant Q transform spectrum, X_k^CQ(l) is the constant-Q-transform value of the l-th frequency component of the k-th frequency band, N_l is the window length, which varies with frequency, and j is the imaginary unit, representing the relationship between signal phases. w_{N_l}(n) is a Hamming window of length N_l, whose expression is:

w_{N_l}(n) = 0.54 - 0.46 · cos( 2π·n / (N_l - 1) ),  0 ≤ n ≤ N_l - 1                      (12)
When a signal is processed by the Fourier transform, the frequency points are equally spaced, which corresponds to a filter bank with uniformly spaced center frequencies and identical bandwidths. The constant Q transform, by contrast, was originally proposed for processing music signals, in which the scale frequencies are not linearly spaced but exponentially (log2) spaced. The constant Q transform can therefore be regarded as a filter bank whose center frequencies are distributed exponentially and whose ratio of center frequency to bandwidth is constant. Such processing is more consistent with the characteristics of the voice signal: the time-frequency resolution is variable, with higher frequency resolution at low frequencies and higher time resolution at high frequencies. Finally, the mean square value of each frequency band is computed to obtain the voice feature:
CQME(k) = (1 / L) · Σ_{l=1}^{L} | X_k^CQ(l) |²,  k = 1, 2, ..., K                         (13)

where X_k^CQ(l) is the constant-Q-transform value of the l-th frequency component of the k-th frequency band. The resulting feature vector is finally drawn as a feature map with frequency as the independent variable, and this feature map is the input to the subsequent classification network.
In this embodiment, the improved residual network module is an improved ResNet classification network, and the feature map input into the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3. The invention adopts a 50-layer residual network as the classification network, which contains 5 convolution blocks. The 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units, where each residual unit is composed of 3 convolutional layers.
The structure of the residual unit in this embodiment is shown in FIG. 2. Here X is the input; the shortcut connection transmits the input X directly to the adder, where it is added to the output F(X) obtained by the convolutions, giving the final output H(X) = F(X) + X. What the residual network needs to train and learn is therefore no longer the output H(X) but the residual F(X) = H(X) - X; in the extreme case where H(X) = X, it is only necessary to train F(X) to approach 0. The principle of the residual unit can be expressed as:

Y = F(X, {ω_i}) + ω_s · X                                                                 (14)

where Y denotes the output of the residual unit, ω_i represents the processing of the 3 convolutional layers, and ω_s represents the processing applied to the shortcut connection so that its dimension matches that of the residual mapping and the two can be added directly; this can be realized with a 1 × 1 convolutional layer. The 1 × 1 convolutional layer performs dimension reduction or dimension expansion, the number of output channels being determined by the number of convolution kernels. The input feature matrix is first reduced to 64 channels and then raised to 256 channels at layer 3; the MMN activation function combined with the Dropout technique is used after the bottom (first) layer, while ReLU is used as the activation function after layers 2 and 3.
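The following PyTorch sketch shows a bottleneck residual unit in the spirit of FIG. 2 and formula (14): a 1 × 1 dimension-reducing convolution, a 3 × 3 convolution, a 1 × 1 dimension-raising convolution and a 1 × 1 shortcut projection. Plain ReLU activations stand in for the MMN-plus-Dropout activation described below, so this is a simplified assumption rather than the exact unit of the invention.

```python
import torch
import torch.nn as nn

class BottleneckResidualUnit(nn.Module):
    """Y = F(X, {w_i}) + w_s * X (formula (14)) with a 1x1 -> 3x3 -> 1x1 bottleneck."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # w_s: 1x1 projection so the shortcut matches the shape of the residual mapping
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                  stride=stride, bias=False)

    def forward(self, x):
        out = self.relu(self.conv1(x))   # placeholder for the MMN + Dropout activation
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        out = out + self.shortcut(x)     # add the shortcut branch
        return self.relu(out)            # final nonlinearity before the next unit

# Usage: one unit of the 2nd convolution block (64 -> 64 -> 256 channels)
x = torch.randn(1, 64, 56, 56)
print(BottleneckResidualUnit(64, 64, 256)(x).shape)  # torch.Size([1, 256, 56, 56])
```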
In the residual unit, the MMN activation function is adopted after the first layer. The MMN activation function is developed on the basis of the Maxout activation function; the Maxout unit structure is shown in FIG. 3. Here x_i is the i-th component of the input vector, w_ij is the weight coefficient connecting the i-th input to the j-th hidden node, b_ij is the offset of the j-th hidden node for the i-th input, and z_j is the weighted sum of the input plus the bias term:

z_j = Σ_i ( w_ij · x_i + b_ij )                                                           (15)

The parameters w_ij and b_ij are optimized through training. If each Maxout unit contains g hidden nodes, the output of the Maxout unit is:

h(x) = max_{1 ≤ j ≤ g} z_j                                                                (16)
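A small PyTorch sketch of formulas (15) and (16): one linear layer produces g candidate pre-activations z_j per output feature and the maximum is kept. The module name, shapes and the choice g = 4 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: z_j as in formula (15), output max_j z_j as in formula (16)."""
    def __init__(self, in_features, out_features, g=4):
        super().__init__()
        self.out_features, self.g = out_features, g
        # one affine map yields the g pieces z_1 ... z_g for every output feature
        self.linear = nn.Linear(in_features, out_features * g)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.g)
        return z.max(dim=-1).values      # keep the largest of the g hidden nodes

# Usage
x = torch.randn(8, 32)
print(Maxout(32, 16)(x).shape)  # torch.Size([8, 16])
```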
the MMN unit is obtained by jointly using a plurality of Maxout units, and the MMN unit structure used in this embodiment is as shown in fig. 4. Wherein the content of the first and second substances,
Figure BDA0003135659920000088
and
Figure BDA0003135659920000089
the two parameters respectively indicate the weighting coefficient and the offset of the jth output node corresponding to the ith input vector in the 1 st weighting operation,
Figure BDA00031356599200000810
represents the output of the jth node weighted 1 st time,
Figure BDA0003135659920000091
the output of the jth node representing the 1 st max operation,
Figure BDA0003135659920000092
is calculated in the same way as the Maxout activation function,
Figure BDA0003135659920000093
and
Figure BDA0003135659920000094
the same way of calculation. The MMN units form an MMN activation function, the Maxout activation function can fit any convex function, the MMN activation function can fit any form of distribution, characteristics can be better represented, and the convergence speed is higher.
In the training stage, after the MMN activation function processing, the feature matrix is connected to the next convolutional layer in combination with the Dropout technique: in each training pass a certain fraction of hidden nodes is ignored at random. In the invention the Dropout rate is 50% and the number of MMN units in the hidden layer is 100, which is equivalent to training many different networks and averaging their results. Dropout is not used in the prediction stage. The structure of the MMN activation function combined with Dropout is shown in FIG. 5.
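As a hedged sketch of the MMN activation combined with Dropout in the training stage, the block below stacks two Maxout stages (re-stating the Maxout module from the previous sketch) with a 50% Dropout layer between them and 100 hidden Maxout units; the exact internal wiring of the patent's MMN layer is only partly specified, so this stacking is an assumption.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Same Maxout unit as in the previous sketch (formulas (15)-(16))."""
    def __init__(self, in_features, out_features, g=4):
        super().__init__()
        self.out_features, self.g = out_features, g
        self.linear = nn.Linear(in_features, out_features * g)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.g)
        return z.max(dim=-1).values

class MMNWithDropout(nn.Module):
    """Multi-layer Maxout (MMN) activation with Dropout applied only during training."""
    def __init__(self, in_features, hidden_units=100, out_features=64, g=4, p_drop=0.5):
        super().__init__()
        self.stage1 = Maxout(in_features, hidden_units, g)    # 1st weighting + max operation
        self.dropout = nn.Dropout(p_drop)                     # 50% of hidden nodes ignored per pass
        self.stage2 = Maxout(hidden_units, out_features, g)   # 2nd weighting + max operation

    def forward(self, x):
        return self.stage2(self.dropout(self.stage1(x)))

# Dropout is active in model.train() mode and disabled at prediction time via model.eval()
mmn = MMNWithDropout(in_features=256)
mmn.eval()
print(mmn(torch.randn(4, 256)).shape)  # torch.Size([4, 64])
```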
the 3 rd layer of convolution block is composed of 4 residual error units, the 4 th layer of convolution block is composed of 6 residual error units, the 5 th layer of convolution block is composed of 3 residual error units, the number of channels of the residual error units in each layer of convolution block is continuously increased, and the 5 th layer of convolution block is shown in table 1.
Figure BDA0003135659920000095
TABLE 1
After the 5 convolution blocks, the output is average-pooled to reduce the data complexity, then passed through a fully connected layer to obtain the probabilities that the voice signal is genuine voice or disguised voice, and the classification is finally obtained through the Softmax layer.
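To show how the pieces fit together, here is a simplified PyTorch sketch of the whole classifier: one 7 × 7 convolution, a 3 × 3 max-pooling layer, 3 + 4 + 6 + 3 = 16 bottleneck residual units, average pooling, a fully connected layer and Softmax. The channel widths follow the standard ResNet-50 layout as an assumption, since Table 1 is not reproduced above, and plain ReLU again stands in for the MMN + Dropout activation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified 1x1 -> 3x3 -> 1x1 residual unit with a projected shortcut (see formula (14))."""
    def __init__(self, cin, mid, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(mid, cout, 1, bias=False))
        self.shortcut = nn.Conv2d(cin, cout, 1, stride=stride, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def stage(n_units, cin, mid, cout, stride):
    # first unit changes resolution/channels, the remaining units keep them
    units = [Bottleneck(cin, mid, cout, stride)]
    units += [Bottleneck(cout, mid, cout) for _ in range(n_units - 1)]
    return nn.Sequential(*units)

class DisguisedVoiceResNet50(nn.Module):
    """3 + 4 + 6 + 3 = 16 residual units; channel widths assume the standard ResNet-50 layout."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),  # convolution block 1: 64 7x7 kernels
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))                  # 3x3 maximum pooling
        self.block2 = stage(3, 64, 64, 256, stride=1)
        self.block3 = stage(4, 256, 128, 512, stride=2)
        self.block4 = stage(6, 512, 256, 1024, stride=2)
        self.block5 = stage(3, 1024, 512, 2048, stride=2)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2048, num_classes))                          # fully connected layer

    def forward(self, x):
        x = self.stem(x)
        x = self.block5(self.block4(self.block3(self.block2(x))))
        return torch.softmax(self.head(x), dim=1)                  # genuine vs disguised probabilities

# Usage on one preprocessed 224 x 224 x 3 feature map
model = DisguisedVoiceResNet50()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```

When training with the cross-entropy loss mentioned above, the Softmax would normally be folded into the loss function and the network would output raw logits; the explicit Softmax here mirrors the description in the text.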

Claims (9)

1. A method for detecting disguised voice using a residual network, characterized by comprising the following steps:
S1, a feature extraction module processes the voice signal x(n) to obtain a modulation-spectrum-based voice feature, the constant Q modulation envelope;
S2, the extracted constant Q modulation envelope features are output in the form of a constant Q modulation envelope feature map, which is preprocessed and input into an improved ResNet classification network;
S3, after the constant Q modulation envelope features are input into the classification network in the form of pictures, deep features are extracted first through one 7×7 convolutional layer and one 3×3 pooling layer and then through 16 residual units;
S4, after the 16 residual units, the speech classification is output through an average pooling layer and finally a fully connected layer and a Softmax layer.
2. The disguised voice detection method using a residual network as claimed in claim 1, wherein in step S1 the processing of the voice signal x(n) by the feature extraction module comprises the following steps:
S11, the input voice x(n) is divided by a frequency-dividing filter bank into K signals x_k(n) occupying different frequency bands, where k = 1, 2, ..., K;
S12, the envelope of each frequency-divided signal x_k(n) is extracted;
S13, the voice envelope is processed nonlinearly;
S14, the nonlinearly processed envelope lg(m_k(n)) is transformed to the frequency domain by the constant Q transform, m_k(n) being the envelope of x_k(n);
S15, the mean square value of each frequency band is calculated to obtain the modulation-spectrum-based speech feature, the constant Q modulation envelope.
3. The method of detecting disguised speech using a residual network as claimed in claim 1, wherein in step S2 the constant Q modulation envelope feature map is an image plotted with frequency as the horizontal coordinate and amplitude as the vertical coordinate.
4. The method of detecting disguised voice using a residual network as claimed in claim 1, wherein in step S2 the modulation envelope feature map input to the ResNet classification network is preprocessed so that its size is adjusted to 224 × 224 × 3.
5. The method of detecting a disguised voice using a residual network as claimed in claim 1, wherein in step S2, the ResNet classification network is a 50-layer residual network.
6. The method of detecting disguised speech using a residual network as claimed in claim 5, wherein the residual network comprises 5 convolution blocks; the 1st convolution block is a convolutional layer composed of 64 convolution kernels of size 7 × 7, and the 2nd convolution block is composed of one 3 × 3 maximum pooling layer and 3 residual units.
7. The method of detecting disguised speech using a residual network as claimed in claim 6, wherein each residual unit is formed by 3 convolutional layers: the kernel size of the 1st convolutional layer is 1 × 1 and this layer is connected to the 3 × 3 convolution of the 2nd layer through the MMN activation function combined with the Dropout technique; the 2nd convolutional layer is connected to the 1 × 1 convolution of the 3rd layer through a ReLU activation function; and the output is added to the output of the shortcut connection, subjected to nonlinear processing by a ReLU activation function once more, and then passed to the next residual unit.
8. The method of detecting disguised speech using a residual network as claimed in claim 7, wherein, with the MMN activation function combined with the Dropout technique, the 3rd convolution block is composed of 4 residual units, the 4th convolution block of 6 residual units and the 5th convolution block of 3 residual units, and the number of channels of the residual units increases from one convolution block to the next.
9. The method of detecting disguised speech using a residual network as claimed in claim 8, wherein, after the 5 convolution blocks, the output is average-pooled.
CN202110718049.0A 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network Active CN113506583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718049.0A CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Publications (2)

Publication Number Publication Date
CN113506583A true CN113506583A (en) 2021-10-15
CN113506583B CN113506583B (en) 2024-01-05

Family

ID=78011168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718049.0A Active CN113506583B (en) 2021-06-28 2021-06-28 Camouflage voice detection method using residual error network

Country Status (1)

Country Link
CN (1) CN113506583B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767776A (en) * 2019-01-14 2019-05-17 广东技术师范学院 A kind of deception speech detection method based on intensive neural network
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN111653289A (en) * 2020-05-29 2020-09-11 宁波大学 Playback voice detection method
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method

Also Published As

Publication number Publication date
CN113506583B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
CN107146601B (en) Rear-end i-vector enhancement method for speaker recognition system
JP5554893B2 (en) Speech feature vector conversion method and apparatus
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN111653289B (en) Playback voice detection method
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN112270931B (en) Method for carrying out deceptive voice detection based on twin convolutional neural network
CN112017682B (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN113488058A (en) Voiceprint recognition method based on short voice
CN112259120A (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
Todkar et al. Speaker recognition techniques: A review
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Gaafar et al. An improved method for speech/speaker recognition
Renisha et al. Cascaded Feedforward Neural Networks for speaker identification using Perceptual Wavelet based Cepstral Coefficients
CN113506583B (en) Camouflage voice detection method using residual error network
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant