CN114495950A - Voice deception detection method based on deep residual shrinkage network - Google Patents

Voice deception detection method based on deep residual shrinkage network

Info

Publication number
CN114495950A
Authority
CN
China
Prior art keywords
features
voice
depth
residual shrinkage
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210347480.3A
Other languages
Chinese (zh)
Inventor
章坚武
周晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210347480.3A
Publication of CN114495950A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a voice deception detection method based on a deep residual shrinkage network. The speech to be detected is first preprocessed, and the preprocessed speech feature data are transformed to obtain the corresponding constant Q cepstral coefficient features, Mel-frequency cepstral coefficient features and spectrogram features. A deep residual shrinkage network then processes the constant Q cepstral coefficient features, the Mel-frequency cepstral coefficient features and the spectrogram features separately to obtain three corresponding depth features. The three depth features are each fed into a deep neural network classifier, which computes a detection score for each of them. Finally, the detection scores corresponding to the three depth features are fused to judge whether the speech to be detected is genuine. The method improves the ability to learn discriminative features in complex acoustic environments, improves system generalization and broadens the range of application scenarios.

Description

Voice deception detection method based on deep residual shrinkage network
Technical Field
The application belongs to the technical field of voice detection and deep learning, and particularly relates to a voice deception detection method based on a deep residual shrinkage network.
Background
In recent years, identity authentication techniques based on biometric features have played an increasingly important role in data security and access authentication. With the development of acquisition and sensing devices, automatic speaker verification technology has attracted wide attention and is applied to smart-device login, access control, online banking and similar scenarios. However, various voice forgery technologies threaten the security of automatic speaker verification systems. Four types of forged-speech spoofing attacks have been identified to date: speech synthesis, voice conversion, impersonation and replay, all of which can generate fake speech resembling that of a legitimate user. Logical access attacks, dominated by speech synthesis and voice conversion, are perceptually indistinguishable from genuine speech, which makes separating forged speech from real user speech increasingly challenging. A growing body of research has demonstrated that automatic speaker verification systems are severely vulnerable to a variety of malicious spoofing attacks.
To counter the threat of spoofing attacks, researchers have continuously sought effective anti-spoofing methods. Current voice spoofing detection systems consist mainly of a front-end feature extraction stage and a back-end classifier. Unlike the acoustic features used in general speaker verification and speech processing, spoofing detection requires acoustic features developed specifically for that task. After the acoustic features are extracted, a well-performing classifier is used to separate genuine from forged speech. Among traditional machine learning methods, the Gaussian Mixture Model (GMM) is the most classical classification model; it trains quickly but offers limited detection accuracy. With the rise of deep learning, deep neural networks capable of learning complex nonlinear features have also been applied to voice spoofing detection. Convolutional Neural Networks (CNNs) have strong representation learning capability and are widely used to extract audio features. Recurrent Neural Networks (RNNs), whose recurrent units and gating structures give them memory, hold certain advantages in handling time-series problems.
Although the training performance of existing methods has improved, unknown types of attacks are encountered in practical applications, and these attacks usually follow statistical distributions different from those of known attacks, creating a large performance gap between training and deployment. This indicates that the generalization capability of spoofing detection systems to unknown attacks still needs improvement. In addition, because noise, reverberation and channel interference are common in real environments, the performance of many spoofing detection systems degrades sharply in complex acoustic environments.
Disclosure of Invention
The purpose of the application is to provide a voice spoofing detection method based on a deep residual shrinkage network, aimed at voice spoofing detection in complex acoustic environments.
In order to achieve the purpose, the technical scheme of the application is as follows:
a voice spoofing detection method based on a deep residual shrinkage network comprises the following steps:
preprocessing the voice to be detected, and transforming the voice characteristic data after preprocessing to obtain corresponding constant Q cepstrum coefficient characteristics, Mel frequency cepstrum coefficient characteristics and spectrogram characteristics;
respectively processing the constant Q cepstrum coefficient characteristic, the Mel frequency cepstrum coefficient characteristic and the spectrogram characteristic by adopting a depth residual shrinkage network to obtain three corresponding depth characteristics;
inputting the three depth features into a deep neural network classifier respectively, and calculating to obtain detection scores corresponding to the three depth features;
and fusing the detection scores corresponding to the three depth characteristics, and judging whether the voice to be detected is real voice or not.
Further, the deep residual shrinkage network comprises a residual shrinkage building unit, which comprises a convolution module, an adaptive threshold learning module and a soft threshold module. The adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
Further, a depth residual shrinkage network for processing the constant Q cepstrum coefficient features stacks 6 residual shrinkage building units, a depth residual shrinkage network for processing the mel-frequency cepstrum coefficient features stacks 9 residual shrinkage building units, and a depth residual shrinkage network for processing the spectrogram feature stacks 6 residual shrinkage building units.
Further, the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer and a LogSoftmax layer.
Further, the Dropout layer randomly drops weights with a probability of 50%.
Further, the detection scores corresponding to the three depth features are fused, and the formula is as follows:
Score_fuse = Σ_{i=1}^{3} w_i · s_i

Σ_{i=1}^{3} w_i = 1

wherein Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th depth feature.
Further, the transforming the preprocessed voice feature data to obtain corresponding constant Q cepstrum coefficient features, mel-frequency cepstrum coefficient features, and spectrogram features includes:
constant Q transformation is carried out on the preprocessed voice feature data, then a power spectrum is calculated and its logarithm is taken, then uniform resampling is carried out, and finally constant Q cepstral coefficient features are obtained through discrete cosine transformation;
performing short-time Fourier transform (STFT) on the preprocessed voice feature data, mapping the frequency spectrum to a Mel frequency spectrum through filtering, and finally obtaining Mel frequency cepstrum coefficient features through discrete cosine transform;
and performing short-time Fourier transform on the preprocessed voice feature data, calculating the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
According to the voice deception detection method based on the deep residual shrinkage network, a deep residual shrinkage network is constructed whose residual shrinkage building unit contains an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module. Each speech signal thus determines an individual threshold according to its own acoustic environment, unimportant features are forced to zero, noise-related information is eliminated, and more discriminative high-level features are learned, improving the ability to learn discriminative features in complex acoustic environments. To address the poor generalization of existing detection methods, three different acoustic feature extraction algorithms, CQCC, MFCC and Spectrogram, are used to characterize the speech more comprehensively; the features are used separately as network inputs, a weight is generated for each model according to the performance of its output, and multi-feature joint detection is performed, improving system generalization and broadening the range of application scenarios.
Drawings
FIG. 1 is a flow chart of a voice spoofing detection method based on a deep residual shrinkage network according to the present application;
FIG. 2 is a block diagram of the overall network framework of the present application;
FIG. 3 is a schematic diagram of feature extraction of the present application;
FIG. 4 is a schematic structural diagram of a residual shrinkage building unit according to the present application;
FIG. 5 is a diagram of a depth residual shrinkage network structure according to the present application;
FIG. 6 is a schematic structural diagram of the DNN classifier of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a voice spoofing detection method based on a deep residual shrinkage network, including:
and step S1, preprocessing the voice to be detected, and transforming the preprocessed voice feature data to obtain corresponding constant Q cepstrum coefficient features, Mel frequency cepstrum coefficient features and spectrogram features.
As shown in fig. 2, this step implements feature extraction. The speech to be detected is first split into frames, speech data containing fewer than 64000 sample points are padded, and finally the data are normalized to complete preprocessing. Unlike video data, audio data has no inherent concept of frames; the audio collected in this application is, however, transmitted and stored as continuous segments. For a program to process it in batches, the audio is cut into segments of a specified length (a time span or a number of samples) and organized into a structured data format; this is framing. A speech signal is non-stationary macroscopically but stationary microscopically, i.e. it exhibits short-time stationarity (the signal can be regarded as approximately unchanged within 10-30 ms), so the speech signal can be divided into many short segments for processing, each of which is called a frame (CHUNK).
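For concreteness, a minimal preprocessing sketch is given below. It follows the 64000-sample padding target mentioned above; the repeat-padding strategy, the 25 ms / 10 ms frame parameters and the function name preprocess are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np

TARGET_LEN = 64000               # fixed sample count from the description
FRAME_LEN, HOP_LEN = 400, 160    # 25 ms frames with 10 ms hop at 16 kHz (assumed values)

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Pad/truncate to a fixed length, normalize, and split into frames."""
    # Pad short utterances up to 64000 samples (repeat-padding is one common choice)
    if len(waveform) < TARGET_LEN:
        reps = int(np.ceil(TARGET_LEN / len(waveform)))
        waveform = np.tile(waveform, reps)
    waveform = waveform[:TARGET_LEN]

    # Amplitude normalization
    waveform = waveform / (np.max(np.abs(waveform)) + 1e-9)

    # Framing: overlapping short segments, each covering 10-30 ms of signal
    n_frames = 1 + (TARGET_LEN - FRAME_LEN) // HOP_LEN
    idx = np.arange(FRAME_LEN)[None, :] + HOP_LEN * np.arange(n_frames)[:, None]
    return waveform[idx]         # shape: (n_frames, FRAME_LEN)
```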
Then referring to fig. 3, the preprocessed speech feature data is transformed, wherein:
constant Q cepstral coefficient feature (CQCC): constant Q transformation is carried out on the preprocessed voice characteristic data, then a power spectrum is calculated and logarithm is taken, then uniform resampling is carried out, and finally constant Q cepstrum series characteristics are obtained through discrete cosine transformation.
For example, the constant Q transform is applied to the preprocessed voice feature data at a sampling frequency of 16 kHz, then the power spectrum is calculated and its logarithm is taken, uniform resampling is then performed at a rate of 16 kHz, and finally the constant Q cepstral coefficient features are obtained through a discrete cosine transform.
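A simplified CQCC extraction sketch follows, assuming librosa for the constant Q transform. The number of CQT bins, the number of uniform resampling points and the number of retained coefficients are illustrative assumptions, and the canonical CQCC implementation uses spline-based uniform resampling of the frequency axis rather than the plain FFT-based resampling used here.

```python
import librosa
import numpy as np
from scipy.signal import resample
from scipy.fft import dct

def cqcc(waveform, sr=16000, n_bins=96, bins_per_octave=12, n_coeffs=30, n_uniform=128):
    """Simplified CQCC: CQT -> log power spectrum -> uniform resampling -> DCT."""
    C = librosa.cqt(waveform, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave)
    log_power = np.log(np.abs(C) ** 2 + 1e-10)        # log power spectrum
    # Resample the geometrically spaced CQT bins onto a uniform frequency grid
    uniform = resample(log_power, n_uniform, axis=0)
    # DCT along the resampled frequency axis yields the cepstral coefficients
    return dct(uniform, type=2, axis=0, norm='ortho')[:n_coeffs]
```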
Mel-frequency cepstral coefficient feature (MFCC): and performing short-time Fourier transform (STFT) on the preprocessed voice feature data, mapping the frequency spectrum to a Mel frequency spectrum through filtering, and finally performing discrete cosine transform to obtain Mel frequency cepstrum coefficient features.
For example, a short-time Fourier transform (STFT) is applied to the preprocessed speech feature data at a sampling frequency of 16 kHz, the spectrum is mapped to a Mel spectrum by a filter bank, and finally the Mel-frequency cepstral coefficient (MFCC) features are obtained through a discrete cosine transform. The present application selects the first 24 coefficients and concatenates the MFCCs with their first- and second-order derivatives to produce the MFCC feature representation, which is a two-dimensional matrix.
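A brief sketch of this step using librosa is given below; stacking the 24 coefficients with their deltas along the coefficient axis is one plausible reading of "concatenates" and is an assumption here.

```python
import librosa
import numpy as np

def mfcc_with_deltas(waveform, sr=16000, n_mfcc=24):
    """24 MFCCs stacked with their first- and second-order derivatives (a 2-D matrix)."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, delta1, delta2], axis=0)   # shape: (72, n_frames)
```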
Spectrogram feature (Spectrogram): and performing short-time Fourier transform on the preprocessed voice feature data, calculating the size of each component, and finally converting the component into logarithmic scales to obtain the spectrogram features.
For example, a short-time Fourier transform of the preprocessed speech feature data is computed over a Hamming window (window size 2048, 25% overlap). The magnitude of each component is then calculated and converted to a logarithmic scale; the output matrix captures the time-frequency characteristics of the input audio waveform and constitutes the spectrogram features.
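A short sketch of the spectrogram branch, assuming librosa; reading "25% overlap" as an overlap fraction (hop of 1536 samples) and the dB reference are assumptions.

```python
import librosa
import numpy as np

def log_spectrogram(waveform, n_fft=2048, overlap=0.25):
    """Log-magnitude STFT spectrogram over a Hamming window (window 2048, 25% overlap)."""
    hop = int(n_fft * (1.0 - overlap))
    S = librosa.stft(waveform, n_fft=n_fft, hop_length=hop, window="hamming")
    magnitude = np.abs(S)                                    # magnitude of each component
    return librosa.amplitude_to_db(magnitude, ref=np.max)    # convert to a log (dB) scale
```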
And step S2, processing the constant Q cepstrum coefficient characteristic, the Mel frequency cepstrum coefficient characteristic and the spectrogram characteristic respectively by adopting the trained depth residual shrinkage network to obtain three corresponding depth characteristics.
In this embodiment, a depth residual shrinkage network is designed and built, and the constant Q cepstrum coefficient feature, the mel-frequency cepstrum coefficient feature, and the spectrogram feature obtained by the transformation in step S1 are processed to obtain a depth feature related to the detected speech.
In a specific embodiment, a residual shrinkage building unit adopted by the deep residual shrinkage network comprises a convolution module, an adaptive threshold learning module and a soft threshold module.
Fig. 4 shows a Residual Shrinkage Building Unit (RSBU) in a modified Depth Residual Shrinkage Network (DRSN). The residual shrinkage construction unit comprises a convolution module, an adaptive threshold learning module and a soft threshold module, wherein the output of the convolution module is learned by the adaptive threshold learning module to obtain a threshold, and the soft threshold module processes the output of the convolution module and the adaptive threshold learning module to highlight high-discriminant sound information.
The adaptive threshold learning module comprises an absolute-value operation (Absolute), global average pooling (GAP), two fully connected layers (FC) with a BN layer and a ReLU between them, and a sigmoid activation function. The output of the global average pooling is of size 1 × 1; both fully connected layers have 32 units, and the BatchNorm1d layer connected between them has its parameter set to 32. A scaling parameter is obtained after the second fully connected layer, and the sigmoid function finally constrains the scaling parameter to the range (0, 1), which can be expressed as:
α_c = 1 / (1 + e^(−z_c))

wherein z_c is the feature of the neuron in the c-th channel (the output of the second fully connected layer) and α_c is the corresponding scaling coefficient.

The scaling coefficient α_c is multiplied by the result of the global average pooling to obtain the positive threshold τ_c.
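A PyTorch sketch of this submodule is given below, assuming 32-channel 2-D feature maps as in the description; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveThreshold(nn.Module):
    """Channel-wise threshold learning: |x| -> GAP -> FC -> BN -> ReLU -> FC -> sigmoid."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)           # global average pooling to 1x1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),                             # scaling coefficient alpha_c in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); the threshold is derived from the absolute feature map
        avg_abs = self.gap(torch.abs(x)).flatten(1)   # (batch, channels), per-channel mean of |x|
        alpha = self.fc(avg_abs)                      # (batch, channels)
        tau = alpha * avg_abs                         # positive threshold tau_c = alpha_c * mean|x|
        return tau.unsqueeze(-1).unsqueeze(-1)        # broadcastable to (batch, channels, 1, 1)
```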
The soft threshold module of this embodiment acts as a nonlinear transformation layer inserted into the residual shrinkage building unit (RSBU). The speech feature data are compared with the threshold: features close to zero are set to zero while useful positive and negative features are retained, so noise attenuation can be performed flexibly according to the current audio conditions and highly discriminative sound information is highlighted.
The soft threshold function may be expressed as:
y = x − τ,  if x > τ
y = 0,      if −τ ≤ x ≤ τ
y = x + τ,  if x < −τ

where x is the input feature, y is the output feature and τ is the threshold; each channel c has its own threshold τ_c.
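In code, the piecewise function above collapses to a one-liner; a minimal sketch:

```python
import torch

def soft_threshold(x: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Element-wise soft thresholding: values within [-tau, tau] are set to zero,
    larger-magnitude values are shrunk toward zero by tau (the sign is preserved)."""
    return torch.sign(x) * torch.clamp(torch.abs(x) - tau, min=0.0)
```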
As shown in fig. 5, the deep residual shrinkage network of this embodiment processes the constant Q cepstral coefficient (CQCC) features, the Mel-frequency cepstral coefficient (MFCC) features and the Spectrogram features separately. The network for the constant Q cepstral coefficient features stacks 6 residual shrinkage building units, the network for the Mel-frequency cepstral coefficient features stacks 9 residual shrinkage building units, and the network for the spectrogram features stacks 6 residual shrinkage building units.
Stacking multiple RSBUs, as in this embodiment, strengthens the ability of successive nonlinear transformations to learn discriminative features, and using the soft threshold as a shrinkage function eliminates noise-related information. The constant Q cepstral coefficient (CQCC) features, the Mel-frequency cepstral coefficient (MFCC) features and the Spectrogram features are input separately to obtain three depth features.
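For completeness, a sketch of one RSBU and of the three stacked branches follows. It reuses the AdaptiveThreshold module and soft_threshold function from the sketches above; the pre-activation layout, kernel sizes, channel width and single-channel input stem are assumptions, since the exact layer configuration of fig. 4 and fig. 5 is not reproduced here.

```python
import torch
import torch.nn as nn
# AdaptiveThreshold and soft_threshold refer to the sketches defined earlier in this description.

class RSBU(nn.Module):
    """Residual shrinkage building unit: two conv layers, a learned threshold,
    soft thresholding, and an identity shortcut (a sketch, not the exact patent layout)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.threshold = AdaptiveThreshold(channels)

    def forward(self, x):
        out = self.conv(x)
        tau = self.threshold(out)
        return x + soft_threshold(out, tau)        # shrink, then add the identity shortcut

def build_drsn(feature_type: str, channels: int = 32) -> nn.Sequential:
    """Stack 6 / 9 / 6 RSBUs for the CQCC / MFCC / Spectrogram branches respectively."""
    n_units = {"cqcc": 6, "mfcc": 9, "spectrogram": 6}[feature_type]
    stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)   # assumed single-channel input stem
    return nn.Sequential(stem, *[RSBU(channels) for _ in range(n_units)])
```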
And step S3, inputting the three depth features into a deep neural network classifier respectively, and calculating to obtain detection scores corresponding to the three depth features.
The deep neural network classifier of this step refers to fig. 6, and is used to generate a detection score of the speech to be detected.
Specifically, the three depth features are each passed to a deep neural network classifier (DNN classifier), which comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer and a LogSoftmax layer.
The deep residual shrinkage network adjusts the number of elements in the last dimension, reshapes the depth features and feeds them into the hidden layers of the DNN classifier, the first of which is a Dropout layer that randomly drops weights with a probability of 50%. The first fully connected layer is configured differently for each feature type: the parameters of the first hidden fully connected layer of the CQCC-DRSN, MFCC-DRSN and Spectrogram-DRSN models are (32, 128), (480, 128) and (160, 128) respectively. A Leaky-ReLU activation with α = 0.01 follows, and the second hidden fully connected layer, with parameters (128, 2), produces the logits for classification. A LogSoftmax layer then applies a softmax over the elements of each row and takes the logarithm, converting the logits into detection scores.
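A PyTorch sketch consistent with this description follows; the helper name build_dnn_classifier is illustrative.

```python
import torch.nn as nn

def build_dnn_classifier(in_features: int) -> nn.Sequential:
    """DNN classifier head: Dropout(0.5) -> FC(in, 128) -> LeakyReLU(0.01) -> FC(128, 2) -> LogSoftmax.
    in_features is 32, 480 or 160 for the CQCC-, MFCC- and Spectrogram-DRSN branches respectively."""
    return nn.Sequential(
        nn.Dropout(p=0.5),
        nn.Linear(in_features, 128),
        nn.LeakyReLU(negative_slope=0.01),
        nn.Linear(128, 2),             # two logits: genuine vs. spoofed
        nn.LogSoftmax(dim=-1),         # row-wise softmax followed by log -> detection scores
    )
```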
For convenience of expression, the network models formed by combining the deep residual shrinkage network (DRSN) with the DNN classifier are named the MFCC-DRSN model, the CQCC-DRSN model and the Spectrogram-DRSN model according to the three different features.
And step S4, fusing the detection scores corresponding to the three depth features, and judging whether the voice to be detected is real voice.
The combined detection unit performs weighted fusion on detection scores generated by the MFCC-DRSN model, the CQCC-DRSN model and the Spectrogram-DRSN model to obtain a final voice deception detection result.
Considering that in a typical spoofing detection task the amount of genuine speech is much smaller than the amount of forged speech, all models are trained by minimizing a weighted cross-entropy loss in which the weights assigned to genuine and forged speech are in the ratio 9:1, which alleviates the imbalance in the training data distribution. The model parameters with the best performance during training are then applied to the evaluation data set, and after model inference a detection score is obtained for each single-feature DRSN model.
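A minimal sketch of such a weighted loss, paired with the LogSoftmax outputs above; assigning class index 1 to genuine speech is an assumption.

```python
import torch
import torch.nn as nn

# Class weighting to offset the genuine/spoofed imbalance; class index 1 is assumed
# to denote genuine speech (weight 9) and class index 0 spoofed speech (weight 1).
class_weights = torch.tensor([1.0, 9.0])
criterion = nn.NLLLoss(weight=class_weights)   # pairs with LogSoftmax outputs (log-probabilities)

# Usage: log_probs has shape (batch, 2), labels has shape (batch,) with values in {0, 1}
# loss = criterion(log_probs, labels)
```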
A logistic regression model is then built with the detection scores of the three single-feature DRSN models as independent variables and the detection result as the dependent variable, expressed as:
P = 1 / (1 + e^(−(w_1·s_1 + w_2·s_2 + w_3·s_3)))

wherein w_1, w_2 and w_3 are the weight parameters and s_1, s_2 and s_3 are the detection scores.
The regression coefficients obtained from the fitted model are normalized to give the weight of each model. Joint detection is then realized by weighted fusion of the score files; the detection scores are fused using the logistic-regression weights as follows:
Score_fuse = Σ_{i=1}^{3} w_i · s_i

Σ_{i=1}^{3} w_i = 1

wherein Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th depth feature.
A decision threshold is derived from analysis of the joint detection scores: if the joint detection score is smaller than the threshold, the speech is judged to be genuine; otherwise it is judged to be spoofed.
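A sketch of the weight fitting and the fused decision, assuming scikit-learn; taking the absolute value of the regression coefficients before normalization, as well as the function names, are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion_weights(dev_scores: np.ndarray, dev_labels: np.ndarray) -> np.ndarray:
    """dev_scores: (N, 3) detection scores from the CQCC-, MFCC- and Spectrogram-DRSN models
    on a development set; dev_labels: (N,) ground-truth labels. Returns normalized weights."""
    lr = LogisticRegression().fit(dev_scores, dev_labels)
    coeffs = np.abs(lr.coef_.ravel())            # regression coefficients of the three scores
    return coeffs / coeffs.sum()                 # normalize so the weights sum to 1

def joint_decision(scores: np.ndarray, weights: np.ndarray, threshold: float) -> bool:
    """Weighted score fusion; per the description, a fused score below the threshold
    is judged to be genuine speech."""
    score_fuse = float(np.dot(weights, scores))
    return score_fuse < threshold                # True -> genuine, False -> spoofed
```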
The voice deception detection method based on the deep residual shrinkage network has the following advantages:
(1) A deep residual shrinkage network (DRSN) is constructed, with a residual shrinkage building unit (RSBU) comprising an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module. Each speech signal thus determines its own threshold according to its acoustic environment, unimportant features are forced to zero, noise-related information is eliminated, and more discriminative high-level features are learned, improving the ability to learn discriminative features in complex acoustic environments.
(2) Three different acoustic feature extraction algorithms, CQCC, MFCC and Spectrogram, are used to characterize the speech more comprehensively. The features are used separately as network inputs, a weight is generated for each model according to the performance of its output, and multi-feature joint detection is performed, thereby improving system generalization.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A voice spoofing detection method based on a deep residual shrinkage network is characterized in that the voice spoofing detection method based on the deep residual shrinkage network comprises the following steps:
preprocessing the voice to be detected, and transforming the voice characteristic data after preprocessing to obtain corresponding constant Q cepstrum coefficient characteristics, Mel frequency cepstrum coefficient characteristics and spectrogram characteristics;
respectively processing the constant Q cepstrum coefficient characteristic, the Mel frequency cepstrum coefficient characteristic and the spectrogram characteristic by adopting a depth residual shrinkage network to obtain three corresponding depth characteristics;
inputting the three depth features into a deep neural network classifier respectively, and calculating to obtain detection scores corresponding to the three depth features;
and fusing the detection scores corresponding to the three depth features, and judging whether the voice to be detected is real voice.
2. The method according to claim 1, wherein the deep residual shrinking network comprises a residual shrinking construction unit, the residual shrinking construction unit comprises a convolution module, an adaptive threshold learning module and a soft threshold module, the output of the convolution module is learned by the adaptive threshold learning module to obtain a threshold, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight high-discriminant sound information.
3. The method of claim 2, wherein the depth residual shrinkage network for processing constant Q cepstral coefficient features is stacked with 6 residual shrinkage building units, the depth residual shrinkage network for processing mel-frequency cepstral coefficient features is stacked with 9 residual shrinkage building units, and the depth residual shrinkage network for processing spectrogram feature is stacked with 6 residual shrinkage building units.
4. The method according to claim 1, wherein the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.
5. The method according to claim 4, wherein the Dropout layer randomly drops weights with a probability of 50%.
6. The method according to claim 1, wherein the detection scores corresponding to the three depth features are fused, and the formula is as follows:
Score_fuse = Σ_{i=1}^{3} w_i · s_i

Σ_{i=1}^{3} w_i = 1

wherein Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th depth feature.
7. The method according to claim 1, wherein the transforming the preprocessed voice feature data to obtain corresponding constant Q cepstrum coefficient features, mel-frequency cepstrum coefficient features, and spectrogram features comprises:
constant Q transformation is carried out on the preprocessed voice feature data, then a power spectrum is calculated and its logarithm is taken, then uniform resampling is carried out, and finally constant Q cepstral coefficient features are obtained through discrete cosine transformation;
performing short-time Fourier transform (STFT) on the preprocessed voice feature data, mapping the frequency spectrum to a Mel frequency spectrum through filtering, and finally obtaining Mel frequency cepstrum coefficient features through discrete cosine transform;
and performing short-time Fourier transform on the preprocessed voice feature data, calculating the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
CN202210347480.3A 2022-04-01 2022-04-01 Voice deception detection method based on deep residual shrinkage network Pending CN114495950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210347480.3A CN114495950A (en) 2022-04-01 2022-04-01 Voice deception detection method based on deep residual shrinkage network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210347480.3A CN114495950A (en) 2022-04-01 2022-04-01 Voice deception detection method based on deep residual shrinkage network

Publications (1)

Publication Number Publication Date
CN114495950A true CN114495950A (en) 2022-05-13

Family

ID=81488846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210347480.3A Pending CN114495950A (en) 2022-04-01 2022-04-01 Voice deception detection method based on deep residual shrinkage network

Country Status (1)

Country Link
CN (1) CN114495950A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153336A (en) * 2023-04-19 2023-05-23 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion
CN116862530A (en) * 2023-06-25 2023-10-10 江苏华泽微福科技发展有限公司 Intelligent after-sale service method and system
CN116862530B (en) * 2023-06-25 2024-04-05 江苏华泽微福科技发展有限公司 Intelligent after-sale service method and system
CN117393000A (en) * 2023-11-09 2024-01-12 南京邮电大学 Synthetic voice detection method based on neural network and feature fusion
CN117393000B (en) * 2023-11-09 2024-04-16 南京邮电大学 Synthetic voice detection method based on neural network and feature fusion

Similar Documents

Publication Publication Date Title
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
CN114495950A (en) Voice deception detection method based on deep residual shrinkage network
Gomez-Alanis et al. A gated recurrent convolutional neural network for robust spoofing detection
CN110084610B (en) Network transaction fraud detection system based on twin neural network
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
JPH02238495A (en) Time series signal recognizing device
CN113488073B (en) Fake voice detection method and device based on multi-feature fusion
Qian et al. Deep feature engineering for noise robust spoofing detection
CN110120230B (en) Acoustic event detection method and device
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
Hassan et al. Voice spoofing countermeasure for synthetic speech detection
Li et al. Multi-task learning of deep neural networks for joint automatic speaker verification and spoofing detection
CN111613240A (en) Camouflage voice detection method based on attention mechanism and Bi-LSTM
Wu et al. Adversarial sample detection for speaker verification by neural vocoders
CN113436646B (en) Camouflage voice detection method adopting combined features and random forest
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN116229960B (en) Robust detection method, system, medium and equipment for deceptive voice
You et al. Device feature extraction based on parallel neural network training for replay spoofing detection
CN110706712A (en) Recording playback detection method in home environment
Choudhary et al. Automatic speaker verification using gammatone frequency cepstral coefficients
CN111524520A (en) Voiceprint recognition method based on error reverse propagation neural network
CN106373576B (en) Speaker confirmation method and system based on VQ and SVM algorithms
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
Moonasar et al. A committee of neural networks for automatic speaker recognition (ASR) systems
CN116230012B (en) Two-stage abnormal sound detection method based on metadata comparison learning pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination