CN117393000A - Synthetic voice detection method based on neural network and feature fusion - Google Patents
- Publication number: CN117393000A (application CN202311490667.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- score
- neural network
- layer
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a synthetic voice detection method based on a neural network and feature fusion, comprising the following steps: acquiring an audio data set to be detected, and extracting the acoustic features of the audio and the corresponding spectrogram image features from it; inputting the acoustic features and the spectrogram image features into a pre-trained synthetic audio detection model to obtain, respectively, a first authenticity score and a second authenticity score for the audio; weighting and fusing the first and second authenticity scores to obtain an audio authenticity score after feature-information fusion; and comparing the fused authenticity score with a preset threshold to obtain the final audio detection result. The invention fuses acoustic features with spectrogram image information for synthetic voice detection, and achieves better stability and generalization capability.
Description
Technical Field
The invention relates to a synthetic voice detection method based on neural network and feature fusion, and belongs to the technical field of information security and artificial intelligence.
Background
With the maturation of various deep learning-based speech synthesis methods, the most advanced systems can now generate highly realistic speech that deceives the human ear. Because of the availability and ease of use of these tools and the imperfection of the related laws, a technique known as audio deepfaking has emerged; its abuse poses a serious threat to national image, public opinion and the public interest, so developing tools capable of detecting synthesized audio is of great importance. Against this background, synthesized-audio detection has become an important research problem in acoustic signal processing and artificial intelligence; its main task is to predict automatically, by computation, whether a piece of audio was synthesized by an artificial intelligence tool.
In view of the potential harm of audio deepfake technology, much effort has been devoted to detecting synthesized audio. Broadly, synthesized-audio detection methods can be divided into machine learning-based and deep learning-based methods. Machine learning-based detection generally requires manually designed features; although such methods are well interpretable, their performance depends largely on the hand-crafted features and they scale poorly. Deep learning-based synthetic voice detection exploits deep neural networks to automatically extract and learn useful features and to realize a complex mapping between input and output, so it performs well and has attracted wide attention from researchers in recent years. However, most deep learning-based methods target a specific data set and rarely consider performance in cross-language settings; over-training also occurs, with severe over-fitting on specific data sets that reduces the ability of the corresponding methods to generalize to unknown data.
In addition, synthesized-audio detection methods based on machine learning or deep learning often use only the acoustic features of the audio or only the corresponding spectrogram image features, without fully exploiting the rich information the audio contains; as a result, they show shortcomings in detecting synthesized audio, for example in scalability and stability across different languages.
Disclosure of Invention
The invention aims to provide a synthetic voice detection method based on a neural network and feature fusion, to remedy the shortcoming that the prior art uses only the acoustic features of the audio or only the corresponding spectrogram image features, fails to fully exploit the rich information contained in the audio, and is therefore insufficient for detecting synthesized audio.
A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic features of the audio and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain, respectively, a first authenticity score and a second authenticity score for the audio;
weighting and fusing the first and second audio authenticity scores to obtain an audio authenticity score after feature-information fusion;
comparing the obtained authenticity score with a preset threshold to obtain the final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module maps the acoustic features of the input audio to the first authenticity score of the audio, and the image-to-score module maps the image features of the input spectrogram to the second authenticity score of the audio.
Further, the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result; weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold, and, once the detection performance is satisfactory, taking the optimized initial model as the synthesized audio detection model.
Further, the method for extracting the acoustic features of the audio from the audio data set to be detected comprises:
performing pre-emphasis, framing, windowing and a discrete Fourier transform on the audio file to obtain a frequency-domain representation of the audio, and computing the squared magnitude of the complex-valued signal in the frequency-domain representation to obtain the spectrogram of the audio;
applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain the linear frequency cepstral coefficients as the acoustic features.
Further, the method for extracting the spectrogram image features corresponding to the acoustic features of the audio from the audio data set to be detected comprises:
performing pre-emphasis, framing, windowing and a discrete Fourier transform on the audio file to obtain a frequency-domain representation of the audio, and computing the squared magnitude of the complex-valued signal in the frequency-domain representation to obtain the spectrogram of the audio;
converting the spectrogram from the amplitude scale to the decibel scale, and constructing a gray-scale image of a specified pixel size as the spectrogram image features.
Further, the feature-to-score module comprises a maximum feature map unit, a time-delay neural network unit, closely connected time-delay neural network units, a conversion layer, a pooling layer, a feedforward neural network layer and a linear layer. Features are first extracted from the linear frequency cepstral coefficients in a two-dimensional space by the maximum feature map unit;
the number of channels is initialized by the time-delay neural network unit, local features are learned by several consecutive closely connected time-delay neural network units, and multi-stage information is aggregated by a conversion layer; several further closely connected time-delay neural network units then learn long-term dependencies, and another conversion layer aggregates the information; finally, information is aggregated by a pooling layer, and the authenticity score is output by the feedforward neural network layer and a linear layer.
Further, the conversion layer aggregates multi-stage information according to:

d_k = D_k([d_0, d_1, …, d_{k-1}])

where d_0 denotes the input of the closely connected time-delay neural network unit, d_k denotes the output of its k-th layer, [·] denotes the concatenation operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Further, the image-to-score module comprises a two-dimensional convolution layer, residual blocks, a maximum pooling layer, a flattening layer, a Dropout layer and a fully connected layer;
the spectrogram gray-scale image passes through the two-dimensional convolution layer and the residual blocks to fully extract information; the maximum pooling layer reduces the feature-map size, and thereby the feature dimension; the flattening layer then flattens the features, and a fully connected layer reduces the dimension, combined with the Dropout layer to improve the generalization of the module; finally, a fully connected layer outputs the authenticity score from the image angle.
Further, the residual block constructs the information flow according to:
y=F(x,ω)+x
where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
Further, the first audio authenticity score and the second audio authenticity score are weighted and fused according to:

score = f(s_f, s_i; ω) = ω × s_f + (1 − ω) × s_i

where f(·) denotes the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score is the final audio authenticity score. H_0 denotes the null hypothesis that the audio is genuine, and H_1 denotes the alternative hypothesis that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is synthesized.
Further, the authenticity score is compared with the preset threshold according to:

label = 1 if score > threshold, otherwise 0

where threshold denotes the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is genuine, and label denotes the predicted label of the audio.
Compared with the prior art, the invention has the following beneficial effects: the invention processes spectrogram image information with the synthesized audio detection model and constructs an unobstructed information flow between the layers of the network, giving better stability in use; it is oriented to cross-language synthesized-audio detection, has higher universality, and can adapt to complex situations in real scenes;
the invention combines the acoustic characteristics and spectrogram image information to carry out the synthetic voice detection, and has better stability and generalization capability;
the invention combines the maximum characteristic diagram and the close connection time delay neural network, not only can learn the relation among local characteristics, but also can learn the long-term dependence among the characteristics, thereby improving the detection accuracy.
Drawings
FIG. 1 is a schematic diagram of a network architecture of the method of the present invention;
FIG. 2 is a schematic diagram of the maximum feature map unit of the method of the present invention;
FIG. 3 is a schematic representation of the maximum feature map of the method of the present invention;
FIG. 4 is a diagram of a residual block structure in accordance with the present invention;
fig. 5 is a training-testing schematic diagram of the present invention.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
As shown in fig. 1, a synthetic speech detection method based on neural network and feature fusion is disclosed, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic characteristics of the audio and the corresponding spectrogram image characteristics into a pre-trained synthetic audio detection model to respectively obtain an authenticity score first of the audio and an authenticity score second of the audio;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score after feature information fusion with a preset threshold value to obtain a final audio detection result; the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module is used for outputting the acoustic features of the input audio as the authenticity score I of the audio, and the image-to-score module is used for outputting the corresponding image features of the input spectrogram as the authenticity score II of the audio;
aiming at the method, the specific steps comprise:
1) The input audio data require preprocessing to facilitate subsequent computation by the network model, specifically: resampling, so that all audio data become 16 kHz mono audio; silence removal, deleting all silent segments exceeding 0.2 s in the audio data; and trimming or filling all audio to 4 s, where the filling strategy repeats the audio to be filled and intercepts a 4 s segment as the filled audio.
The method for acquiring the audio data set to be detected comprises the following steps:
acquiring a used audio data set containing different languages, different speakers and different utterances;
preprocessing the acquired audio data set containing different languages, different speakers and different utterances to obtain an audio data set to be tested;
the method for preprocessing the acquired audio data set containing different languages, different speakers and different utterances comprises the following steps:
and resampling, silencing, eliminating, trimming and cutting the audio sample data set to unify the data format of the audio.
2) Extract the linear frequency cepstral coefficients of the preprocessed audio as the acoustic-domain features, and extract the spectrogram image corresponding to the audio as the image-domain features. For the linear frequency cepstral coefficients, pre-emphasis, framing and windowing are applied to the audio waveform in turn to avoid spectral leakage; next, a discrete Fourier transform is applied to the frames to obtain a frequency-domain representation X(t, k) of the audio, where t = 1, …, T is the frame index and k = 0, 1, …, K−1 indexes the discrete Fourier transform coefficients; the squared magnitude |X(t, k)|² of the complex-valued signal is computed as the spectrogram of the audio; and a linear filter bank and a discrete cosine transform are applied to the spectrogram to obtain 80-dimensional linear frequency cepstral coefficients. For the spectrogram image, the spectrogram is converted from the amplitude scale to the decibel scale and a gray-scale image of 50×34 pixels is constructed.
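The feature-extraction pipeline just described (pre-emphasis, windowed framing, DFT power spectrum, a linearly spaced triangular filter bank, and a DCT) can be sketched as below. The FFT size, hop length and filter-bank design are assumptions for illustration; the patent specifies only the 80 output dimensions:

```python
import numpy as np

def lfcc(audio, n_fft=512, hop=160, n_filters=80, n_coeffs=80):
    """Linear frequency cepstral coefficients: like MFCC, but with
    linearly spaced (not mel-spaced) triangular filters."""
    # pre-emphasis to flatten the spectrum
    audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # framing + Hamming window
    frames = np.array([audio[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(audio) - n_fft + 1, hop)])
    # power spectrum |X(t, k)|^2
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # linearly spaced triangular filter bank
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    feat = np.log(spec @ fbank.T + 1e-10)
    # DCT-II over the filter axis yields the cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5)
                   * np.arange(n_coeffs)[:, None])
    return feat @ basis.T
```

The decibel-scale spectrogram for the gray-scale image can be obtained analogously from `spec` via `10 * np.log10(spec + 1e-10)` followed by resizing to 50×34.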
3) And (3) passing the linear frequency cepstrum coefficient through a feature-to-score (feature 2 score) module of the synthesized voice detection model to obtain the audio authenticity score of the acoustic feature angle.
The detailed structure of the feature2score module is shown in table 1.
TABLE 1
The structure of the maximum feature map unit is shown in fig. 2, and the maximum feature map itself in fig. 3. The time-delay neural network unit consists of a one-dimensional convolution layer and an activation function; the conversion layer consists of a batch-normalization layer, an activation function and a one-dimensional convolution layer; the feedforward neural network layer consists of a one-dimensional convolution layer, a batch-normalization layer and an activation function. In essence, features are extracted in a two-dimensional space and learned competitively through the maximum feature map: the smaller of the two outputs is discarded and the larger retained, which favors the extraction of useful features. After the input passes through the maximum feature map unit, the temporal relations among the features are learned by the closely connected time-delay neural network, so as to extract a feature vector that can indicate whether the audio is synthesized. The structure of the closely connected time-delay neural network is shown in table 2. Specifically, the number of channels is first initialized by a time-delay network unit; local features are learned by two consecutive closely connected time-delay neural network units; the multi-stage features are then aggregated by a conversion layer based on a feedforward neural network. The information represented by the k-th layer of the closely connected time-delay neural network is given by formula (1):
d_k = D_k([d_0, d_1, …, d_{k-1}])  (1)

where d_0 denotes the input of the closely connected time-delay neural network unit, d_k denotes the output of its k-th layer, [·] denotes the concatenation operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Four consecutive closely connected time-delay neural network units are then used to learn long-term dependencies, followed by a conversion layer to aggregate information; finally, the features are aggregated by a statistics pooling layer into a 256-dimensional feature vector, and a fully connected layer outputs the authenticity score.
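The densely connected pattern of formula (1), in which each unit receives the concatenation of all earlier outputs, can be illustrated with a toy numpy sketch; the random linear maps stand in for the one-dimensional convolutions, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(d0, layer_dims):
    """Each layer k computes d_k = D_k([d_0, ..., d_{k-1}]), where D_k is
    sketched here as a random linear map followed by ReLU."""
    outputs = [d0]
    for out_dim in layer_dims:
        x = np.concatenate(outputs, axis=-1)       # [d_0, d_1, ..., d_{k-1}]
        W = 0.1 * rng.standard_normal((x.shape[-1], out_dim))
        outputs.append(np.maximum(x @ W, 0.0))     # nonlinear transform D_k
    return outputs[-1]

# 50 frames of 16-dim features through three densely connected layers
out = dense_block(np.ones((50, 16)), [32, 32, 32])
```

Because every layer sees all previous outputs, the input width grows with depth (16, then 48, then 80 here), which is what lets the block aggregate multi-stage information.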
4) And (3) enabling the spectrogram gray level image of the audio to pass through an image-to-score (image 2 score) module of the synthesized voice detection model to obtain the audio authenticity score of the image angle.
The detailed structure of the image2score module is shown in table 2.
TABLE 2
The structure of the residual block is shown in fig. 4. The image2score module takes the spectrogram gray-scale image as input, with shape (B, C, H, W), where B is the batch size, C the number of channels, H the height and W the width. Note that after the input passes through the flattening layer, the channel, height and width dimensions are flattened into a single dimension, so the output of the network has shape (B, D), where D is the feature dimension. Effective features are extracted by two-dimensional convolution, while the residual structure builds a smooth information flow in the network so that it can better learn effective features; the residual structure is computed as in formula (2).
y=F(x,ω)+x#(2)
Where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
After feature extraction, the feature vector is flattened and passed through the fully connected layer to output the audio authenticity score from the image angle.
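A toy numpy version of formula (2), with F sketched as two linear maps and a ReLU (the actual module uses two-dimensional convolutions):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x, w) + x: the identity shortcut keeps the information flow
    unobstructed even when F contributes little."""
    f = np.maximum(x @ W1, 0.0) @ W2   # F(x, w)
    return f + x                       # skip connection

x = np.ones((4, 8))
# with zero weights F(x, w) = 0, so the block reduces to the identity
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

The example makes the design choice visible: gradients (and information) can always pass through the `+ x` path, which is why the description emphasizes the smooth information flow.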
5) The authenticity scores of the audio from the two angles are obtained from steps 3) and 4), and the final authenticity score of the audio is computed by weighting, as in formula (4):

score = f(s_f, s_i; ω) = ω × s_f + (1 − ω) × s_i  (4)

where f(·) denotes the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score is the final audio authenticity score. H_0 denotes the null hypothesis that the audio is genuine, and H_1 denotes the alternative hypothesis that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is synthesized.
6) The final authenticity score of the audio is compared with a preset threshold to obtain the label of the audio sample; the predicted label is computed as in formula (5):

label = 1 if score > threshold, otherwise 0  (5)

where threshold denotes the preset threshold, set to 0.5 in the present invention.
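Formulas (4) and (5) amount to a convex combination of the two scores followed by thresholding; a direct sketch (the function names are illustrative):

```python
def fuse_scores(s_f: float, s_i: float, omega: float = 0.5) -> float:
    """Formula (4): score = omega * s_f + (1 - omega) * s_i."""
    return omega * s_f + (1 - omega) * s_i

def predict_label(score: float, threshold: float = 0.5) -> int:
    """Formula (5): 1 (accept H0, genuine) if score > threshold, else 0."""
    return 1 if score > threshold else 0
```

For example, fusing s_f = 0.9 and s_i = 0.3 with ω = 0.5 gives score 0.6, which exceeds the 0.5 threshold and yields label 1 (genuine).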
In the training part of the invention, the loss function used is as follows:
loss = −(y_n × log(δ(z_n)) + (1 − y_n) × log(1 − δ(z_n)))

where z_n denotes the score of the n-th sample being a positive sample, y_n denotes the label of the n-th sample, and δ denotes the sigmoid activation function. All data are computed according to the above rule, and the network model (steps 2) to 6)) is trained by optimizing this loss function.
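This is the standard binary cross-entropy on sigmoid outputs; a numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(z_n, y_n):
    """loss = -(y_n * log(sigmoid(z_n)) + (1 - y_n) * log(1 - sigmoid(z_n)))"""
    p = sigmoid(z_n)
    return -(y_n * np.log(p) + (1 - y_n) * np.log(1 - p))
```

A positive sample scored at z_n = 0 costs log 2 ≈ 0.693, and the loss vanishes as δ(z_n) approaches y_n; in practice a numerically stabilized formulation would be preferred for large |z_n|.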
The specific training and verification process is as follows:
data preprocessing is performed on the training, verification and test sets, and the corresponding acoustic features and spectrogram image features are extracted;
the acoustic features, spectrogram image features and labels of the samples are fed into the synthesized audio detection model in batches of 128 samples;
the first audio authenticity score output by the feature-to-score module and the second audio authenticity score output by the image-to-score module are weighted and fused to obtain the final audio authenticity score;
the loss is computed from the final audio authenticity score and the sample labels, the model is optimized by gradient descent, and training runs for 10 rounds in total;
the final audio authenticity score is compared with the preset threshold, the final detection result is output, and the equal error rate on the training set is computed;
likewise, during training, the model is evaluated by its performance on the verification set, specifically:
the acoustic features, spectrogram image features and labels of the verification-set samples are fed into the synthesized audio detection model in batches of 128 to obtain the final authenticity score of the audio, which is compared with the preset threshold; the equal error rate on the verification set is then computed to measure the detection performance of the model.
In this embodiment, the method and model are applied to the detection of speech synthesized by neural vocoders, and are compared with the currently mainstream synthesized audio detection models, the Gaussian mixture model and the RawNet2 model, on the WaveFake, LJSpeech and JSUT data sets (LJSpeech is an English data set and JSUT is a Japanese data set). The experimental method is as follows: the synthesized audio subsets in WaveFake that take LJSpeech as reference, together with the real LJSpeech reference set, serve as the training data; the synthesized audio in WaveFake that takes JSUT as reference, together with the JSUT data set, serve as the test set. During training, one synthesized-audio subset is reserved from the training data each time as an additional test set, and the rest is used for training; the training data is further divided 8:2 into a training set and a verification set for the experiment. The flow is shown in fig. 5. The overall performance comparison is shown in table 3, where the reserved set denotes the data subset reserved in each training run, TTS denotes a test set of audio synthesized on the LJSpeech data set with different utterances of the same speaker, and the two data subsets under JSUT are test sets with different speakers and a different language. The figures in the table are equal error rates, a performance metric commonly used in evaluation tasks in the audio field such as speech recognition.
Table 3
Table 4 gives the average equal error rates of the invention, the Gaussian mixture model, and RawNet2 in the experiments. The order of the reserved sets in table 4 corresponds one-to-one to the reserved sets in table 3.
Table 4
As can be seen from tables 3 and 4 together, the performance of the present invention is better than that of RawNet2, and essentially better than the Gaussian mixture model, both on the data subset TTS (different utterances of the same speaker) and on the two subsets corresponding to the JSUT data set (different speakers and a different language). The optimal average equal error rate of the invention is 0.004, which is superior to the Gaussian mixture model (0.054) and RawNet2 (0.436). Even when facing audio synthesized from unseen utterances in unseen languages, the proposed method can detect effectively and give reliable results; it therefore has good stability and generalization, and is better suited to the complex situations of real environments.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic characteristics of the audio and the corresponding spectrogram image characteristics into a pre-trained synthesized audio detection model to respectively obtain a first audio authenticity score and a second audio authenticity score;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score with a preset threshold value to obtain a final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module is used for converting the acoustic features of the input audio into the first audio authenticity score, and the image-to-score module is used for converting the image features of the input spectrogram into the second audio authenticity score.
2. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein,
the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result;
weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold value, and taking the optimized initial synthesized audio detection model as the final synthesized audio detection model once the detection result meets the requirement.
3. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting acoustic features of audio from the audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
and applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain the linear frequency cepstral coefficients as the acoustic features.
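As a non-authoritative sketch of the extraction pipeline in this claim (pre-emphasis, framing, windowing, DFT, squared magnitude, linear filter bank, DCT), the following implements linear frequency cepstral coefficients with NumPy; all parameter values (pre-emphasis coefficient, frame length, hop, FFT size, filter and coefficient counts) are assumptions, not specified by the patent:

```python
import numpy as np

def lfcc(signal, frame_len=400, hop=160, n_fft=512, n_filters=20, n_ceps=20):
    """Linear frequency cepstral coefficients of a 1-D waveform."""
    # Pre-emphasis, framing, Hamming windowing
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # DFT and squared magnitude -> power spectrogram
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Linearly spaced triangular filter bank (no mel warping)
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    feat = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the filter axis yields the cepstral coefficients
    n = np.arange(n_filters)[:, None]
    k = np.arange(n_ceps)[None, :]
    dct = np.cos(np.pi / n_filters * (n + 0.5) * k)
    return feat @ dct
```

The only difference from the more common MFCC pipeline is the linearly spaced filter bank, which keeps high-frequency vocoder artifacts at full resolution.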
4. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting spectrogram image features corresponding to acoustic features of audio from an audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
after the spectrogram is converted from the amplitude scale to the decibel scale, a gray scale image with a specified pixel size is constructed as the spectrogram image characteristic.
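A minimal sketch of this claim (power spectrogram, decibel conversion, fixed-size grayscale image) is given below, under stated assumptions: the target pixel size, nearest-neighbour resizing, and min-max normalization are illustrative choices, not mandated by the patent:

```python
import numpy as np

def spectrogram_image(signal, frame_len=400, hop=160, n_fft=512, size=(224, 224)):
    """Grayscale spectrogram image of a 1-D waveform as a uint8 array."""
    # Pre-emphasis, framing, windowing, DFT, squared magnitude
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Amplitude scale -> decibel scale, then normalize to [0, 1]
    db = 10.0 * np.log10(power + 1e-10)
    db = (db - db.min()) / (db.max() - db.min() + 1e-10)
    # Nearest-neighbour resize to the specified pixel size (no external deps)
    rows = np.arange(size[0]) * db.shape[0] // size[0]
    cols = np.arange(size[1]) * db.shape[1] // size[1]
    return (db[rows][:, cols] * 255).astype(np.uint8)
```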
5. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the feature-to-score module comprises a maximum feature map unit, a time-delay neural network unit, densely connected time-delay neural network units, conversion layers, a pooling layer, a feed-forward neural network layer and a linear layer;
firstly, the maximum feature map unit extracts features from the linear frequency cepstral coefficients in two-dimensional space;
the time-delay neural network unit then initializes the number of channels, several consecutive densely connected time-delay neural network units learn local features, and a conversion layer aggregates the multi-stage information; further densely connected time-delay neural network units follow to learn long-term dependencies, with another conversion layer aggregating the information; finally, the pooling layer aggregates the information, and the feed-forward neural network layer and the linear layer output the authenticity score.
6. The method for detecting synthesized speech based on neural network and feature fusion according to claim 5, wherein the formula of the multi-stage information aggregated by the conversion layer is:
d_k = D_k([d_0, d_1, …, d_{k-1}])
wherein d_0 represents the input of the densely connected time-delay neural network unit, d_k represents the output of the k-th densely connected layer, [·] represents the splicing (concatenation) operation, and D_k(·) represents the nonlinear transformation of the k-th layer.
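A minimal numerical sketch of this densely connected computation follows; a random linear map plus ReLU stands in for each learned transform D_k (an assumption for illustration, since the real units are time-delay layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_stack(d0, n_layers=3, width=8):
    """Each layer k sees the concatenation [d_0, ..., d_{k-1}]
    of all earlier outputs, i.e. d_k = D_k([d_0, ..., d_{k-1}])."""
    outputs = [d0]
    for _ in range(n_layers):
        concat = np.concatenate(outputs)          # splicing operation [.]
        w = rng.standard_normal((width, concat.size))
        outputs.append(np.maximum(w @ concat, 0)) # D_k: linear map + ReLU
    return outputs[-1]
```

The key property is that the input to each layer grows with depth, so every layer has direct access to all earlier feature maps.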
7. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the image-to-score module comprises a two-dimensional convolution layer, a residual block, a maximum pooling layer, a flattening layer, a Dropout layer and a full connection layer;
the grayscale spectrogram image first has its information fully extracted by the two-dimensional convolution layer and the residual block; the maximum pooling layer then reduces the feature map size, and hence the feature dimension after flattening; the flattening layer unfolds the features, the fully connected layer reduces the dimension, and the Dropout layer improves the generalization of the module; finally, the authenticity score from the image angle is output through the fully connected layer.
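The shape bookkeeping of this pipeline can be sketched as follows; random weights stand in for learned parameters, the residual stage is omitted, and all layer sizes are assumptions chosen for illustration:

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k-by-k max pooling, shrinking the feature map."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

def image_to_score(gray, rng=np.random.default_rng(0)):
    """Grayscale spectrogram image -> scalar authenticity score."""
    pooled = max_pool2d(gray.astype(float))           # max pooling layer
    flat = pooled.ravel()                             # flattening layer
    w1 = rng.standard_normal((64, flat.size)) * 0.01  # fully connected, reduces dim
    hidden = np.maximum(w1 @ flat, 0)
    w2 = rng.standard_normal(64) * 0.01
    return float(w2 @ hidden)                         # scalar score output
```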
8. The method for detecting synthesized speech based on neural network and feature fusion according to claim 7, wherein the formula of the residual block construction information stream is:
y=F(x,ω)+x
where x represents the input, ω represents the parameters of the current layer, F(x, ω) represents the output of the input after the nonlinear transformation of the current layer, and y represents the output of the current layer.
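The identity shortcut of y = F(x, ω) + x can be sketched in a few lines; here a toy linear-plus-ReLU transform stands in for F (an assumption, since the actual blocks use convolutions):

```python
import numpy as np

def residual_block(x, w):
    """y = F(x, w) + x: the input is added back to the transformed output,
    so gradients can flow through the shortcut even when F saturates."""
    fx = np.maximum(w @ x, 0.0)  # F(x, w): nonlinear transform of the input
    return fx + x                # identity shortcut
```

When F outputs zero (e.g. all weights zero), the block reduces to the identity map, which is what makes deep stacks of such blocks easy to optimize.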
9. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the formula for weighting and fusing the first audio authenticity score and the second audio authenticity score is:
score = f(s_f, s_i; ω) = ω × s_f + (1 - ω) × s_i
where f(·) represents the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score represents the final audio authenticity score; H_0 represents the null hypothesis, stating that the audio is genuine, and H_1 represents the alternative hypothesis, stating that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is forged.
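The fusion and hypothesis test of this claim amount to one weighted sum and one comparison, sketched below; the particular values of ω and threshold are assumptions for illustration:

```python
def fuse_and_decide(s_f, s_i, w=0.5, threshold=0.5):
    """s_f: feature-to-score output; s_i: image-to-score output.
    Returns the fused score and the accepted hypothesis."""
    score = w * s_f + (1 - w) * s_i  # f(s_f, s_i; w) = w*s_f + (1-w)*s_i
    decision = "H0: genuine" if score > threshold else "H1: synthesized"
    return score, decision
```

In practice ω would be tuned on the verification set so that the fused score balances the two modules' error patterns.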
10. The method for detecting synthesized speech based on neural network and feature fusion according to claim 9, wherein the expression for comparing the authenticity score with the preset threshold is:
label = 1 if score > threshold, otherwise label = 0
wherein threshold represents the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is genuine, and label represents the label of the audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311490667.XA CN117393000B (en) | 2023-11-09 | 2023-11-09 | Synthetic voice detection method based on neural network and feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117393000A true CN117393000A (en) | 2024-01-12 |
CN117393000B CN117393000B (en) | 2024-04-16 |
Family
ID=89440827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311490667.XA Active CN117393000B (en) | 2023-11-09 | 2023-11-09 | Synthetic voice detection method based on neural network and feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117393000B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100161327A1 (en) * | 2008-12-18 | 2010-06-24 | Nishant Chandra | System-effected methods for analyzing, predicting, and/or modifying acoustic units of human utterances for use in speech synthesis and recognition |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
US10504504B1 (en) * | 2018-12-07 | 2019-12-10 | Vocalid, Inc. | Image-based approaches to classifying audio data |
CN110992987A (en) * | 2019-10-23 | 2020-04-10 | 大连东软信息学院 | Parallel feature extraction system and method for general specific voice in voice signal |
CN112201255A (en) * | 2020-09-30 | 2021-01-08 | 浙江大学 | Voice signal spectrum characteristic and deep learning voice spoofing attack detection method |
CN114495950A (en) * | 2022-04-01 | 2022-05-13 | 杭州电子科技大学 | Voice deception detection method based on deep residual shrinkage network |
CN116386664A (en) * | 2022-12-07 | 2023-07-04 | 讯飞智元信息科技有限公司 | Voice counterfeiting detection method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN117393000B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN109559736B (en) | Automatic dubbing method for movie actors based on confrontation network | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN113488058A (en) | Voiceprint recognition method based on short voice | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
WO2019232833A1 (en) | Speech differentiating method and device, computer device and storage medium | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
CN112735404A (en) | Ironic detection method, system, terminal device and storage medium | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
Goyani et al. | Performance analysis of lip synchronization using LPC, MFCC and PLP speech parameters | |
Sen et al. | A convolutional neural network based approach to recognize bangla spoken digits from speech signal | |
CN113129908B (en) | End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion | |
Yousfi et al. | Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
CN117393000B (en) | Synthetic voice detection method based on neural network and feature fusion | |
KR100969138B1 (en) | Method For Estimating Noise Mask Using Hidden Markov Model And Apparatus For Performing The Same | |
CN111326161B (en) | Voiceprint determining method and device | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
CN113963718A (en) | Voice session segmentation method based on deep learning | |
Wu et al. | Audio-based expansion learning for aerial target recognition | |
CN110689875A (en) | Language identification method and device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||