CN114495950A - Voice deception detection method based on deep residual shrinkage network - Google Patents
Voice deception detection method based on deep residual shrinkage network

- Publication number: CN114495950A
- Application number: CN202210347480.3A
- Authority: CN (China)
- Prior art keywords: features, voice, depth, residual shrinkage, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G10L17/02—Speaker identification or verification techniques: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Speaker identification or verification techniques: Artificial neural networks; Connectionist approaches
- G10L17/20—Speaker identification or verification techniques: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L17/22—Speaker identification or verification techniques: Interactive procedures; Man-machine interfaces
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters: spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters: the cepstrum
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique: using neural networks
Abstract
The invention discloses a voice deception detection method based on a deep residual shrinkage network. The method preprocesses the speech to be detected and transforms the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features. A deep residual shrinkage network then processes the three features separately to obtain three corresponding deep features. Each deep feature is fed into a deep neural network classifier, which computes the detection score corresponding to that feature. Finally, the detection scores corresponding to the three deep features are fused to judge whether the speech to be detected is genuine. The method improves discriminative feature learning in complex acoustic environments and improves system generalization, giving it a wider range of application scenarios.
Description
Technical Field
The application belongs to the technical field of voice detection and deep learning, and particularly relates to a voice deception detection method based on a deep residual shrinkage network.
Background
In recent years, identity authentication based on biometrics has played an increasingly important role in data security and access control. Driven by advances in acquisition and sensing devices, automatic speaker verification has attracted wide attention and is applied to smart-device login, access control, online banking, and similar scenarios. However, various voice forgery technologies threaten the security of automatic speaker verification systems. Four types of forged-voice spoofing attacks have been identified: speech synthesis, voice conversion, impersonation, and replay, all of which can generate fake speech similar to that of a legitimate user. Logical access attacks, dominated by speech synthesis and voice conversion, are perceptually indistinguishable from genuine speech, so separating forged speech from genuine user speech becomes ever more challenging. A growing body of research has demonstrated that automatic speaker verification systems are severely vulnerable to a variety of malicious spoofing attacks.
To counter the threat of spoofing attacks, researchers have continually sought effective anti-spoofing methods. Current voice spoofing detection systems consist mainly of a front-end feature extraction part and a back-end classifier part. Unlike the acoustic features used in speaker verification and general speech processing, spoofing detection requires acoustic features developed specifically for it. After the acoustic features are extracted, a high-performing classifier distinguishes genuine from fake speech. Among traditional machine learning methods, the Gaussian mixture model (GMM) is the most classical classification model; it trains quickly, but its detection accuracy is limited. With the rise of deep learning, various deep neural networks able to learn complex nonlinear features have also been applied to voice spoofing detection. Convolutional neural networks (CNNs) have strong representation learning ability and are widely used for extracting audio features. Recurrent neural networks (RNNs) have memory, thanks to their recurrent units and gating structures, and therefore hold certain advantages in handling time-series problems.
Although the training performance of existing methods has improved, unknown types of attack are encountered in practical applications, and these attacks are usually distributed with statistics different from those of known attacks, creating a large performance gap between training and deployment. This indicates that the generalization ability of spoofing detection systems to unknown attacks still needs improvement. In addition, because noise, reverberation, and channel interference are common in real environments, the performance of many spoofing detection systems degrades sharply when facing a complex acoustic environment.
Disclosure of Invention
The purpose of the application is to provide a voice spoofing detection method based on a deep residual shrinkage network, aimed at voice spoofing detection in complex acoustic environments.
In order to achieve the purpose, the technical scheme of the application is as follows:
a voice spoofing detection method based on a deep residual shrinkage network comprises the following steps:
preprocessing the voice to be detected, and transforming the voice characteristic data after preprocessing to obtain corresponding constant Q cepstrum coefficient characteristics, Mel frequency cepstrum coefficient characteristics and spectrogram characteristics;
respectively processing the constant Q cepstrum coefficient characteristic, the Mel frequency cepstrum coefficient characteristic and the spectrogram characteristic by adopting a depth residual shrinkage network to obtain three corresponding depth characteristics;
inputting the three depth features into a deep neural network classifier respectively, and calculating to obtain detection scores corresponding to the three depth features;
and fusing the detection scores corresponding to the three depth characteristics, and judging whether the voice to be detected is real voice or not.
Further, the deep residual shrinkage network comprises residual shrinkage building units; each residual shrinkage building unit comprises a convolution module, an adaptive threshold learning module, and a soft threshold module. The adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
Further, the deep residual shrinkage network for processing the constant-Q cepstral coefficient features stacks 6 residual shrinkage building units, the network for processing the Mel-frequency cepstral coefficient features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.
Further, the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.
Further, the probability of randomly discarding weights in the Dropout layer is 50%.
Further, the detection scores corresponding to the three deep features are fused according to the formula:

Score_fuse = Σ_i w_i · s_i

where Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th deep feature.
Further, transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features comprises:

applying a constant-Q transform to the preprocessed speech data, computing the power spectrum and taking its logarithm, performing uniform resampling, and finally obtaining the constant-Q cepstral coefficient features through a discrete cosine transform;

applying a short-time Fourier transform (STFT) to the preprocessed speech data, mapping the spectrum to a Mel spectrum through filtering, and finally obtaining the Mel-frequency cepstral coefficient features through a discrete cosine transform;

and applying a short-time Fourier transform to the preprocessed speech data, computing the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
According to the voice deception detection method based on the deep residual shrinkage network, a deep residual shrinkage network is constructed whose residual shrinkage building units contain an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module. Each speech signal thus determines its own threshold according to its acoustic environment, forcing unimportant features to zero, eliminating noise-related information, and learning more discriminative high-level features, which improves discriminative feature learning in complex acoustic environments. To address the poor generalization of existing detection methods, three different acoustic feature extraction algorithms, CQCC, MFCC, and Spectrogram, are used to characterize the speech more comprehensively; the features are fed to the network separately, a weight is generated for each model according to the output performance of its feature, and multi-feature joint detection is performed, improving system generalization and widening the range of application scenarios.
Drawings
FIG. 1 is a flow chart of a voice spoofing detection method based on a deep residual shrinkage network according to the present application;
FIG. 2 is a block diagram of the overall network framework of the present application;
FIG. 3 is a schematic diagram of feature extraction of the present application;
FIG. 4 is a schematic structural diagram of a residual shrinkage building unit according to the present application;
FIG. 5 is a structural diagram of the deep residual shrinkage networks of the present application;
fig. 6 is a schematic structural diagram of the DNN classifier of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a voice spoofing detection method based on a deep residual shrinkage network is provided, comprising:

Step S1, preprocessing the speech to be detected, and transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features.
As shown in fig. 2, this step implements feature extraction. The speech to be detected is first divided into frames, a padding operation is applied to speech data with fewer than 64000 sample points, and the data is finally normalized, completing the preprocessing. Unlike video data, audio data has no inherent concept of a frame; the audio collected in this application arrives as continuous segments for transmission and storage. For a program to process it in batches, the audio is segmented by a specified length (a time period or a number of samples) and organized into a structured data form; this is framing. A speech signal is non-stationary macroscopically but stationary microscopically, i.e., it has short-time stationarity (the signal can be regarded as approximately unchanged within 10-30 ms), so it can be divided into several short segments for processing, each short segment being called a frame.
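For illustration, a minimal preprocessing sketch in Python/NumPy follows. The 64000-sample target length comes from the text above; the zero-padding mode, the normalization choice, and the 25 ms / 10 ms frame geometry are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def preprocess(wave: np.ndarray, target_len: int = 64000) -> np.ndarray:
    # Pad clips shorter than 64000 sample points (zero-padding assumed;
    # the text only says a padding operation is applied)
    if len(wave) < target_len:
        wave = np.pad(wave, (0, target_len - len(wave)))
    # Data normalization (zero mean, unit variance assumed)
    return (wave - wave.mean()) / (wave.std() + 1e-8)

def frame_signal(wave: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Short-time frames of 25 ms with a 10 ms hop at 16 kHz, chosen inside the
    # 10-30 ms short-time stationarity window cited above; assumes wave was
    # already padded by preprocess()
    n = 1 + (len(wave) - frame_len) // hop
    return np.stack([wave[i * hop : i * hop + frame_len] for i in range(n)])
```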
Then referring to fig. 3, the preprocessed speech feature data is transformed, wherein:
constant Q cepstral coefficient feature (CQCC): constant Q transformation is carried out on the preprocessed voice characteristic data, then a power spectrum is calculated and logarithm is taken, then uniform resampling is carried out, and finally constant Q cepstrum series characteristics are obtained through discrete cosine transformation.
For example, constant Q transformation is performed on the preprocessed voice feature data at a sampling frequency of 16kHz, then a power spectrum is calculated and logarithmized, then uniform resampling is performed at a uniform resampling period of 16kHz, and finally a constant Q cepstrum coefficient feature is obtained through discrete cosine transformation.
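A sketch of this CQCC pipeline using librosa and SciPy follows; the number of CQT bins, the bins per octave, the resampling factor, and the number of retained coefficients are illustrative assumptions rather than values given in the patent.

```python
import numpy as np
import librosa
from scipy.fft import dct
from scipy.signal import resample

def cqcc(wave: np.ndarray, sr: int = 16000, n_coeffs: int = 30) -> np.ndarray:
    # Constant-Q transform (84 bins, 12 per octave: illustrative defaults)
    C = librosa.cqt(wave, sr=sr, n_bins=84, bins_per_octave=12)
    log_power = np.log(np.abs(C) ** 2 + 1e-10)        # power spectrum, then log
    # Uniform resampling of the geometrically spaced bins onto a linear scale
    linear = resample(log_power, 2 * log_power.shape[0], axis=0)
    # Discrete cosine transform yields the cepstral coefficients
    return dct(linear, type=2, axis=0, norm='ortho')[:n_coeffs]
```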
Mel-frequency cepstral coefficient features (MFCC): a short-time Fourier transform (STFT) is applied to the preprocessed speech data, the spectrum is mapped to a Mel spectrum through filtering, and finally the Mel-frequency cepstral coefficient features are obtained through a discrete cosine transform.

For example, the short-time Fourier transform is applied to the preprocessed speech data at a sampling frequency of 16 kHz, the spectrum is mapped to a Mel spectrum by a filter bank, and the MFCC features are finally obtained through a discrete cosine transform. The application selects the first 24 coefficients and concatenates the MFCCs with their first and second derivatives to produce the MFCC feature representation, a two-dimensional matrix.
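A corresponding sketch with librosa, matching the 24 coefficients and the first and second derivatives described above (frame parameters are left at library defaults):

```python
import numpy as np
import librosa

def mfcc_features(wave: np.ndarray, sr: int = 16000) -> np.ndarray:
    m = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=24)  # first 24 coefficients
    d1 = librosa.feature.delta(m)                       # first derivative
    d2 = librosa.feature.delta(m, order=2)              # second derivative
    return np.concatenate([m, d1, d2], axis=0)          # (72, n_frames) matrix
```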
Spectrogram features (Spectrogram): a short-time Fourier transform is applied to the preprocessed speech data, the magnitude of each component is computed, and finally the result is converted to a logarithmic scale to obtain the spectrogram features.

For example, a short-time Fourier transform is computed on the preprocessed speech data with a Hamming window (window size 2048, 25% overlap). The magnitude of each component is then computed and converted to a logarithmic scale; the output matrix captures the time-frequency characteristics of the input audio waveform, yielding the spectrogram features.
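A corresponding sketch: a 2048-point Hamming window with 25% overlap implies a hop length of 1536 samples.

```python
import numpy as np
import librosa

def log_spectrogram(wave: np.ndarray) -> np.ndarray:
    # Hamming window, size 2048, 25% overlap -> hop length 2048 * 0.75 = 1536
    S = librosa.stft(wave, n_fft=2048, hop_length=1536, window='hamming')
    magnitude = np.abs(S)                       # magnitude of each component
    return librosa.amplitude_to_db(magnitude)   # convert to logarithmic scale
```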
Step S2, processing the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features respectively with the trained deep residual shrinkage network to obtain three corresponding deep features.

In this embodiment, a deep residual shrinkage network is designed and built, and the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features obtained by the transformation in step S1 are processed to obtain deep features of the speech to be detected.
In a specific embodiment, the residual shrinkage building unit (RSBU) adopted by the deep residual shrinkage network (DRSN), shown in fig. 4, comprises a convolution module, an adaptive threshold learning module, and a soft threshold module. The adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
The adaptive threshold learning module comprises an absolute-value operation (Absolute), global average pooling (GAP), two fully connected layers (FC), a BN layer, a ReLU function, and a sigmoid activation function. The output of the global average pooling has size 1×1; both fully connected layers have 32 neurons, and the BN between them uses BatchNorm1d with its parameter set to 32. A scaling parameter is obtained after the fully connected layers, and a sigmoid function finally constrains it to the range (0, 1), which can be expressed as:

α_c = 1 / (1 + e^(-z_c))

where z_c is the feature of the neuron in the c-th channel and α_c is the corresponding scaling coefficient.

The scaling coefficient α_c is multiplied by the result of the global average pooling to obtain the positive threshold τ_c.
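A PyTorch sketch of the adaptive threshold learning module as described: the 32-unit layer sizes follow the text, while the (batch, channel, height, width) tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class AdaptiveThreshold(nn.Module):
    """Learns one positive threshold per channel from |features| via GAP."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Linear(32, channels),
            nn.Sigmoid(),                        # scaling coefficient in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gap = x.abs().mean(dim=(2, 3))           # GAP of the absolute features
        alpha = self.fc(gap)                     # alpha_c = sigmoid(z_c)
        tau = alpha * gap                        # tau_c = alpha_c * GAP(|x|)
        return tau.unsqueeze(-1).unsqueeze(-1)   # broadcastable over H and W
```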
The soft threshold module of this embodiment is inserted into the residual shrinkage building unit (RSBU) as a nonlinear transformation layer. It compares the speech feature data with the threshold, sets features close to zero to zero, and retains the useful positive and negative features, so that noise attenuation can be carried out flexibly according to the current audio conditions and highly discriminative sound information is highlighted.
The soft threshold function can be expressed as:

y = x - τ, if x > τ
y = 0, if -τ ≤ x ≤ τ
y = x + τ, if x < -τ

where x is the input feature, y is the output feature, and τ is the threshold; different channels have different thresholds τ_c.
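In code, the piecewise function above collapses to a single line; a minimal PyTorch sketch:

```python
import torch

def soft_threshold(x: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    # Equivalent to the piecewise definition: zero inside [-tau, tau],
    # shrink everything else toward zero by tau while keeping its sign
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```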
As shown in fig. 5, the deep residual shrinkage networks of this embodiment process the constant-Q cepstral coefficient (CQCC) features, the Mel-frequency cepstral coefficient (MFCC) features, and the spectrogram features respectively. The network for processing the CQCC features stacks 6 residual shrinkage building units, the network for processing the MFCC features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.

Stacking multiple RSBUs, as in this embodiment, strengthens the ability of the successive nonlinear transformations to learn discriminative features, and using the soft threshold as the shrinkage function eliminates noise-related information. Feeding in the CQCC features, the MFCC features, and the spectrogram features respectively yields the three deep features.
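The pieces combine into a residual shrinkage building unit roughly as sketched below, following the general DRSN design of Zhao et al. (cited under Non-Patent Citations); the convolution layout and channel count are assumptions, and AdaptiveThreshold and soft_threshold are the sketches given above.

```python
import torch
import torch.nn as nn

class RSBU(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.threshold = AdaptiveThreshold(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv(x)                     # convolution module
        tau = self.threshold(f)              # adaptive threshold learning module
        return x + soft_threshold(f, tau)    # shrunken output + identity shortcut

# Branch networks: 6 RSBUs for CQCC, 9 for MFCC, 6 for the spectrogram
cqcc_branch = nn.Sequential(*[RSBU() for _ in range(6)])
mfcc_branch = nn.Sequential(*[RSBU() for _ in range(9)])
spec_branch = nn.Sequential(*[RSBU() for _ in range(6)])
```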
Step S3, feeding the three deep features respectively into a deep neural network classifier, and computing the detection score corresponding to each deep feature.

The deep neural network classifier of this step, shown in fig. 6, generates the detection score of the speech to be detected.
Specifically, the three deep features are each passed into a deep neural network classifier (DNN classifier) comprising a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.

The deep residual shrinkage network adjusts the number of elements in the last dimension and reshapes the deep feature, which is then fed into the first layer of the DNN classifier, a Dropout layer whose probability of randomly discarding weights is 50%. The first fully connected layer is configured differently for each feature: the parameters of the first hidden fully connected layer are (32, 128) for the CQCC-DRSN model, (480, 128) for the MFCC-DRSN model, and (160, 128) for the Spectrogram-DRSN model. The second hidden fully connected layer, preceded by a Leaky-ReLU activation with α = 0.01, has its parameters set to (128, 2), and its logit units produce the classification. A LogSoftmax layer then performs a softmax over all elements of each row and takes the logarithm, converting the logits into detection scores.
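The classifier layout described above can be sketched as follows, with in_dim equal to 32, 480, or 160 depending on the feature branch, as stated in the text:

```python
import torch.nn as nn

def dnn_classifier(in_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Dropout(p=0.5),                  # 50% random weight-drop probability
        nn.Linear(in_dim, 128),             # first hidden fully connected layer
        nn.LeakyReLU(negative_slope=0.01),  # Leaky-ReLU with alpha = 0.01
        nn.Linear(128, 2),                  # second hidden fully connected layer
        nn.LogSoftmax(dim=1),               # softmax over each row, then log
    )
```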
For convenience of expression, the network models formed by combining the deep residual shrinkage network (DRSN) with the DNN classifier are named, according to the three different features, the MFCC-DRSN model, the CQCC-DRSN model, and the Spectrogram-DRSN model.
Step S4, fusing the detection scores corresponding to the three deep features, and judging whether the speech to be detected is genuine.
The combined detection unit performs weighted fusion on detection scores generated by the MFCC-DRSN model, the CQCC-DRSN model and the Spectrogram-DRSN model to obtain a final voice deception detection result.
Considering that in a typical spoofing detection task the number of genuine utterances is much smaller than the number of forged ones, all models are trained by minimizing a weighted cross-entropy loss in which the weights assigned to genuine and forged speech are in a 9:1 ratio, alleviating the imbalance in the training data distribution. The model parameters with the best performance during training are applied to the evaluation dataset, and the detection score of each single-feature DRSN model is obtained after model inference.
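Because the classifier ends in a LogSoftmax layer, the weighted cross-entropy can be realized with NLLLoss; in the sketch below the 9:1 weight ratio follows the text, while the class ordering (genuine first) is an assumption.

```python
import torch
import torch.nn as nn

# Weighted cross-entropy: weights for [genuine, forged] in a 9:1 ratio
criterion = nn.NLLLoss(weight=torch.tensor([9.0, 1.0]))
# usage: loss = criterion(log_probs, labels), where log_probs is the
# LogSoftmax output of the DNN classifier above
```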
A logistic regression model is established with the detection scores of the three single-feature DRSN models as independent variables and the detection result as the dependent variable; its expression takes the standard logistic form

y = 1 / (1 + e^(-(w1·s1 + w2·s2 + w3·s3)))

where w1, w2, w3 represent the weight parameters and s1, s2, s3 the detection scores.
The regression constants obtained from the model are normalized to finally yield the weight of each model. Joint detection is realized by weighted fusion of the score files; the detection scores are fused in the logistic-regression fashion as

Score_fuse = Σ_i w_i · s_i

where Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th deep feature.
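A sketch of the fusion step with scikit-learn; fitting a LogisticRegression and normalizing its coefficients follows the description above, while sum-normalization is one assumed choice of normalization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fusion_weights(scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # scores: (n_utterances, 3) single-model detection scores; labels: 0/1
    lr = LogisticRegression().fit(scores, labels)
    w = lr.coef_.ravel()
    return w / w.sum()          # normalized regression coefficients as weights

def fuse(scores: np.ndarray, w: np.ndarray) -> np.ndarray:
    return scores @ w           # Score_fuse = sum_i w_i * s_i
```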
The joint detection score is analyzed to obtain a decision threshold: if the joint detection score is below the threshold, the speech is judged genuine; otherwise, it is judged spoofed.
The voice deception detection method based on the deep residual shrinkage network has the following advantages:
(1) A deep residual shrinkage network (DRSN) is constructed, and a residual shrinkage building unit (RSBU) is designed that comprises an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module, so that each speech signal determines its own threshold according to its acoustic environment, forcing unimportant features to zero, eliminating noise-related information, and learning more discriminative high-level features, thereby improving discriminative feature learning in complex acoustic environments.
(2) Three different acoustic feature extraction algorithms, CQCC, MFCC, and Spectrogram, are used to characterize the speech more comprehensively. The features are fed to the network separately, a weight is generated for each model according to the output performance of its feature, and multi-feature joint detection is performed, thereby improving system generalization.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (7)
1. A voice spoofing detection method based on a deep residual shrinkage network, characterized in that the method comprises:

preprocessing the speech to be detected, and transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features;

processing the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features respectively with a deep residual shrinkage network to obtain three corresponding deep features;

feeding the three deep features respectively into a deep neural network classifier, and computing the detection score corresponding to each deep feature;

and fusing the detection scores corresponding to the three deep features, and judging whether the speech to be detected is genuine.
2. The method according to claim 1, wherein the deep residual shrinkage network comprises residual shrinkage building units, each comprising a convolution module, an adaptive threshold learning module, and a soft threshold module; the adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
3. The method of claim 2, wherein the deep residual shrinkage network for processing the constant-Q cepstral coefficient features stacks 6 residual shrinkage building units, the network for processing the Mel-frequency cepstral coefficient features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.
4. The method of claim 1, wherein the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.
5. The method of claim 4, wherein the probability of randomly discarding weights in the Dropout layer is 50%.
7. The method according to claim 1, wherein transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features comprises:

applying a constant-Q transform to the preprocessed speech data, computing the power spectrum and taking its logarithm, performing uniform resampling, and finally obtaining the constant-Q cepstral coefficient features through a discrete cosine transform;

applying a short-time Fourier transform (STFT) to the preprocessed speech data, mapping the spectrum to a Mel spectrum through filtering, and finally obtaining the Mel-frequency cepstral coefficient features through a discrete cosine transform;

and applying a short-time Fourier transform to the preprocessed speech data, computing the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210347480.3A | 2022-04-01 | 2022-04-01 | Voice deception detection method based on deep residual shrinkage network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114495950A | 2022-05-13 |

Family

ID=81488846

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210347480.3A (pending) | Voice deception detection method based on deep residual shrinkage network | 2022-04-01 | 2022-04-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114495950A |
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200322377A1 | 2019-04-08 | 2020-10-08 | Pindrop Security, Inc. | Systems and methods for end-to-end architectures for voice spoofing detection |
| CN110674677A | 2019-08-06 | 2020-01-10 | 厦门大学 | Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face |
| US20210233541A1 | 2020-01-27 | 2021-07-29 | Pindrop Security, Inc. | Robust spoofing detection system using deep residual neural networks |
| CN113241079A | 2021-04-29 | 2021-08-10 | 江西师范大学 | Voice spoofing detection method based on residual error neural network |
Non-Patent Citations (1)

| Title |
|---|
| Minghang Zhao et al., "Deep Residual Shrinkage Networks for Fault Diagnosis", IEEE, 31 December 2019, pages 1-10 |
Cited By (8)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115424625A | 2022-08-31 | 2022-12-02 | 江西师范大学 | Speaker confirmation deception detection method based on multi-path network |
| CN115414048A | 2022-08-31 | 2022-12-02 | 长沙理工大学 | Electrocardiosignal denoising method, denoising system, electrocardiosignal denoising device and storage medium |
| CN116153336A | 2023-04-19 | 2023-05-23 | 北京中电慧声科技有限公司 | Synthetic voice detection method based on multi-domain information fusion |
| CN116862530A | 2023-06-25 | 2023-10-10 | 江苏华泽微福科技发展有限公司 | Intelligent after-sale service method and system |
| CN116862530B | 2023-06-25 | 2024-04-05 | 江苏华泽微福科技发展有限公司 | Intelligent after-sale service method and system |
| CN117393000A | 2023-11-09 | 2024-01-12 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
| CN117393000B | 2023-11-09 | 2024-04-16 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
| CN118585889A | 2024-08-06 | 2024-09-03 | 杭州电子科技大学 | Ship type identification method and system based on ship radiation noise data |
Similar Documents

| Publication | Title |
|---|---|
| CN114495950A | Voice deception detection method based on deep residual shrinkage network |
| Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features |
| CN106952649A | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
| JPH02238495A | Time series signal recognizing device |
| CN113488073B | Fake voice detection method and device based on multi-feature fusion |
| CN106991312B | Internet anti-fraud authentication method based on voiceprint recognition |
| CN113436646B | Camouflage voice detection method adopting combined features and random forest |
| CN111613240B | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
| Hassan et al. | Voice spoofing countermeasure for synthetic speech detection |
| Wu et al. | Adversarial sample detection for speaker verification by neural vocoders |
| Li et al. | Multi-task learning of deep neural networks for joint automatic speaker verification and spoofing detection |
| CN114898773A | Synthetic speech detection method based on deep self-attention neural network classifier |
| Lu et al. | Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors |
| Nelus et al. | Privacy-Preserving Siamese Feature Extraction for Gender Recognition versus Speaker Identification |
| You et al. | Device feature extraction based on parallel neural network training for replay spoofing detection |
| CN116386648A | Cross-domain voice fake identifying method and system |
| CN110706712A | Recording playback detection method in home environment |
| Choudhary et al. | Automatic speaker verification using gammatone frequency cepstral coefficients |
| Dua et al. | Audio Deepfake Detection Using Data Augmented Graph Frequency Cepstral Coefficients |
| Babu et al. | Exploration of Bonafide and Spoofed Audio Classification Using Machine Learning Models |
| Moonasar et al. | A committee of neural networks for automatic speaker recognition (ASR) systems |
| CN116230012B | Two-stage abnormal sound detection method based on metadata comparison learning pre-training |
| CN117809694B | Fake voice detection method and system based on time sequence multi-scale feature representation learning |
| Shawkat | Evaluation of Human Voice Biometrics and Frog Bioacoustics Identification Systems Based on Feature Extraction Method and Classifiers |
| Sivaramakrishnan et al. | Classification of Deep Fake Audio Using MFCC Technique |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |