CN117393000A - Synthetic voice detection method based on neural network and feature fusion - Google Patents

Synthetic voice detection method based on neural network and feature fusion

Info

Publication number
CN117393000A
Authority
CN
China
Prior art keywords
audio
score
neural network
layer
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311490667.XA
Other languages
Chinese (zh)
Other versions
CN117393000B (en)
Inventor
徐小龙 (Xu Xiaolong)
刘畅 (Liu Chang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311490667.XA priority Critical patent/CN117393000B/en
Publication of CN117393000A publication Critical patent/CN117393000A/en
Application granted granted Critical
Publication of CN117393000B publication Critical patent/CN117393000B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - characterised by the analysis technique using neural networks
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - specially adapted for particular use for comparison or discrimination
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a synthetic voice detection method based on a neural network and feature fusion, which comprises the following steps: acquiring an audio data set to be detected, and extracting the acoustic features of the audio and the corresponding spectrogram image features from it; inputting the acoustic features and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain a first authenticity score and a second authenticity score of the audio, respectively; weighting and fusing the first and second authenticity scores to obtain an audio authenticity score after feature-information fusion; and comparing the fused authenticity score with a preset threshold to obtain the final audio detection result. The invention fuses acoustic features with spectrogram image information to detect synthetic voice and offers better stability and generalization capability.

Description

Synthetic voice detection method based on neural network and feature fusion
Technical Field
The invention relates to a synthetic voice detection method based on neural network and feature fusion, and belongs to the technical field of information security and artificial intelligence.
Background
With the maturation of various deep-learning-based speech synthesis methods, state-of-the-art speech synthesis can already generate highly realistic speech that deceives the human ear. Because these tools are readily available and easy to use, and the related laws remain imperfect, a technique known as audio deepfaking has emerged; its abuse poses a serious threat to national image, public opinion and the public interest, so developing tools capable of detecting synthetic audio is of great importance. Against this background, synthetic audio detection has become an important research problem in acoustic signal processing and artificial intelligence; its main task is to automatically predict, by computation, whether a piece of audio was synthesized by an artificial-intelligence tool.
Given the potential harm of audio deepfake technology, much effort has been devoted to detecting synthesized audio. Synthetic audio detection methods can generally be divided into machine-learning-based and deep-learning-based methods. Machine-learning-based detection usually requires manually designed features; although such methods are interpretable, their performance depends heavily on the hand-crafted features and they scale poorly. Deep-learning-based methods exploit deep neural networks to automatically extract and learn useful features and to model the complex mapping between input and output, so they perform well and have attracted wide attention from researchers in recent years. However, most deep-learning-based methods target a specific dataset and rarely consider performance in cross-language scenarios; they are also prone to over-training, with severe over-fitting on specific datasets that reduces their ability to generalize to unknown data.
In addition, synthetic audio detection methods based on machine learning or deep learning often use only the acoustic features of the audio or only the corresponding spectrogram image features, without fully exploiting the rich information the audio contains, which leaves them with shortcomings in detecting synthetic audio, such as limited cross-language scalability and stability.
Disclosure of Invention
The invention aims to provide a synthetic voice detection method based on a neural network and feature fusion, to overcome the shortcoming that the prior art uses only the acoustic features of the audio or only the corresponding spectrogram image features, fails to fully exploit the rich information the audio contains, and is therefore insufficient for detecting synthetic audio.
A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic features of the audio and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain a first authenticity score and a second authenticity score of the audio, respectively;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score with a preset threshold value to obtain a final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module is used for outputting the acoustic features of the input audio as the authenticity score I of the audio, and the image-to-score module is used for outputting the corresponding image features of the input spectrogram as the authenticity score II of the audio.
Further, the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result; weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold value and, once the detection result meets the requirement, taking the optimized initial synthesized audio detection model as the synthesized audio detection model.
Further, the method for extracting the acoustic characteristics of the audio from the audio data set to be detected comprises the following steps:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
and applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain linear frequency cepstral coefficients as the acoustic features.
Further, the method for extracting spectrogram image features corresponding to acoustic features of the audio from the audio to-be-detected data set comprises the following steps:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
after the spectrogram is converted from the amplitude scale to the decibel scale, a gray scale image with a specified pixel size is constructed as the spectrogram image characteristic.
Further, the feature-to-score module comprises a maximum feature map unit, a time-delay neural network (TDNN) unit, densely connected TDNN units, a transition layer, a pooling layer, a feed-forward neural network layer and a linear layer; firstly, extracting features from the linear frequency cepstral coefficients in a two-dimensional space through the maximum feature map unit;
then initializing the number of channels through the TDNN unit, learning local features through several consecutive densely connected TDNN units, and aggregating the multi-stage information through a transition layer; then connecting several further densely connected TDNN units to learn long-term dependencies and aggregating the information with another transition layer; and finally aggregating information through a pooling layer and outputting the authenticity score through the feed-forward neural network layer and the linear layer.
Further, the multi-stage information aggregated by the transition layer is produced by the dense connections:
d_k = D_k([d_0, d_1, …, d_{k-1}])
where d_0 is the input of the densely connected TDNN block, d_k is the output of the k-th densely connected TDNN unit, [·] denotes the concatenation (splicing) operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Further, the image-to-score module comprises a two-dimensional convolution layer, a residual block, a max-pooling layer, a flattening layer, a Dropout layer and a fully connected layer;
the grayscale spectrogram image first passes through the two-dimensional convolution layer and the residual block to fully extract information; the max-pooling layer then reduces the feature-map size, and with it the feature dimension after flattening; the flattening layer unrolls the features, and a fully connected layer reduces the dimension and is combined with the Dropout layer to improve the generalization of the module; finally, a fully connected layer outputs the authenticity score from the image angle.
Further, the residual block constructs the information flow according to:
y=F(x,ω)+x
where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
Further, the first authenticity score and the second authenticity score of the audio are weighted and fused according to:
score = f(s_f, s_i; ω) = ω×s_f + (1-ω)×s_i
where f(·) is the weighting function, s_f is the authenticity score output by the feature-to-score module, s_i is the authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the decision threshold, score is the final audio authenticity score, H_0 is the null hypothesis that the audio is real, and H_1 is the alternative hypothesis that the audio is synthesized. A score greater than threshold means H_0 is accepted and the audio is judged real; a score less than threshold means H_1 is accepted and the audio is judged fake.
Further, the authenticity score is compared with the preset threshold according to:
label = 1 if score > threshold, otherwise label = 0
where threshold is the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is real, and label is the label of the audio.
Compared with the prior art, the invention has the following beneficial effects: the synthesized audio detection model processes spectrogram image information and builds an unobstructed information flow between the layers of the network, so the method is more stable in use, is oriented to cross-language synthetic audio detection, is more widely applicable, and can adapt to the complex situations of real scenarios;
the invention combines acoustic features with spectrogram image information for synthetic voice detection and has better stability and generalization capability;
the invention combines the maximum feature map with the densely connected time-delay neural network, so it can learn both the relations among local features and the long-term dependencies among features, improving detection accuracy.
Drawings
FIG. 1 is a schematic diagram of a network architecture of the method of the present invention;
FIG. 2 is a schematic diagram of the unit of the maximum feature diagram of the method of the present invention;
FIG. 3 is a schematic diagram of the maximum feature map of the method of the present invention;
FIG. 4 is a diagram of a residual block structure in accordance with the present invention;
fig. 5 is a training-testing schematic diagram of the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives and effects of the invention are easy to understand.
As shown in fig. 1, a synthetic speech detection method based on neural network and feature fusion is disclosed, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic features of the audio and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain a first authenticity score and a second authenticity score of the audio, respectively;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score after feature-information fusion with a preset threshold value to obtain the final audio detection result; the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module maps the acoustic features of the input audio to the first authenticity score of the audio, and the image-to-score module maps the input spectrogram image features to the second authenticity score of the audio;
The specific steps of the method are as follows:
1) For the input audio data, a preprocessing operation is required to facilitate the subsequent computation of the network model; it specifically includes: resampling, so that all audio data become 16 kHz mono audio; silence removal, pruning every stretch of silence longer than 0.2 s in the audio data; and trimming or filling all audio to 4 s, where the filling strategy repeats the audio to be filled and cuts out a 4 s segment as the filled audio.
The audio data set to be detected is acquired as follows:
acquiring an audio data set containing different languages, different speakers and different utterances;
preprocessing the acquired audio data set to obtain the audio data set to be tested;
the preprocessing of the acquired audio data set consists of resampling, silence removal and trimming or padding of the audio samples, so that the data format of the audio is unified.
2) The linear frequency cepstral coefficient (LFCC) acoustic features of the preprocessed audio are extracted as acoustic-domain features, and the spectrogram image corresponding to the audio is extracted as the image-domain feature. For the LFCCs, pre-emphasis, framing and windowing are applied to the audio waveform in sequence to avoid spectral leakage; next, a discrete Fourier transform is applied to the frames to obtain the frequency-domain representation X(t, k) of the audio, where t = 1, …, T is the frame index and k = 0, 1, …, K-1 indexes the discrete Fourier transform coefficients; the squared magnitude |X(t, k)|² of the complex-valued signal is computed as the spectrogram of the audio; finally, a linear filter bank and a discrete cosine transform are applied to the spectrogram to obtain 80-dimensional linear frequency cepstral coefficients. For the spectrogram image, the spectrogram is converted from the amplitude scale to the decibel scale and a grayscale image of 50×34 pixels is constructed.
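A sketch of the two feature-extraction branches follows. The 80-dimensional LFCCs and the 50×34 decibel-scale grayscale image come from the description above; the 25 ms window, 10 ms hop, 512-point FFT and the librosa/scipy/PIL calls are assumptions made for illustration.

import numpy as np
import librosa
from scipy.fftpack import dct
from PIL import Image

def power_spectrogram(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    y = librosa.effects.preemphasis(y)                       # pre-emphasis
    stft = librosa.stft(y, n_fft=512, hop_length=160,
                        win_length=400, window="hann")       # framing + windowing + DFT
    return np.abs(stft) ** 2                                 # |X(t, k)|^2

def lfcc(y: np.ndarray, sr: int = 16000, n_filters: int = 80) -> np.ndarray:
    spec = power_spectrogram(y, sr)                          # (freq bins, frames)
    n_fft = (spec.shape[0] - 1) * 2
    # Linear (not mel-warped) triangular filter bank.
    edges = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[0]))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        if r > c:
            fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    logfb = np.log(fbank @ spec + 1e-10)
    return dct(logfb, type=2, axis=0, norm="ortho")          # 80-dim LFCC per frame

def spectrogram_image(y: np.ndarray, size=(50, 34)) -> np.ndarray:
    db = librosa.power_to_db(power_spectrogram(y))           # amplitude scale -> dB scale
    db = (db - db.min()) / (db.max() - db.min() + 1e-10)     # normalize to [0, 1]
    img = Image.fromarray((db * 255).astype(np.uint8)).resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0         # 50x34 grayscale image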
3) The linear frequency cepstral coefficients are passed through the feature-to-score (feature2score) module of the synthetic voice detection model to obtain the audio authenticity score from the acoustic-feature angle.
The detailed structure of the feature2score module is shown in table 1.
TABLE 1
The structure of the maximum feature map unit is shown in FIG. 2, and the maximum feature map itself in FIG. 3. The time-delay neural network (TDNN) unit consists of a one-dimensional convolution layer and an activation function; the transition layer consists of a batch-normalization layer, an activation function and a one-dimensional convolution layer; the feed-forward neural network layer consists of a one-dimensional convolution layer, a batch-normalization layer and an activation function. In essence, features are extracted in a two-dimensional space and competitive learning is carried out through the maximum feature map: the smaller of two outputs is discarded and the larger is kept, which favours the extraction of useful features. After the input passes through the maximum feature map unit, the densely connected TDNN learns the temporal relations among the features in order to extract a feature vector that can indicate whether the audio is synthesized. The structure of the densely connected TDNN is shown in Table 2. Specifically, a TDNN unit first initializes the number of channels; two consecutive densely connected TDNN units learn local features; the multi-stage features are then aggregated by a transition layer based on a feed-forward neural network. The information represented by the k-th densely connected TDNN layer is given by formula (1):
d_k = D_k([d_0, d_1, …, d_{k-1}])    (1)
where d_0 is the input of the densely connected TDNN block, d_k is the output of the k-th densely connected TDNN unit, [·] denotes the concatenation (splicing) operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Four consecutive densely connected TDNN units then learn long-term dependencies, followed by a transition layer that aggregates the information; finally, the features are aggregated by a statistics pooling layer, a 256-dimensional feature vector is output, and a fully connected layer outputs the authenticity score.
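A structural sketch of the feature2score branch is given below in PyTorch. It follows the order described above (maximum feature map, channel initialization, two densely connected TDNN units for local features, a transition layer, four more units for long-term dependencies, another transition layer, statistics pooling, and a scoring head); the concrete channel widths, kernel sizes and dilations are assumptions, since the exact values belong to Table 1, which is not reproduced here.

import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """Competitive activation: split the channels in half and keep the element-wise max."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)
        return torch.max(a, b)

class DenseTDNNUnit(nn.Module):
    """Densely connected TDNN unit: its output is concatenated to all previous outputs."""
    def __init__(self, in_ch: int, growth: int, dilation: int = 1):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm1d(in_ch), nn.ReLU(),
            nn.Conv1d(in_ch, growth, kernel_size=3,
                      dilation=dilation, padding=dilation))
    def forward(self, x):                              # x: (B, C, T) = [d_0, ..., d_{k-1}]
        return torch.cat([x, self.layer(x)], dim=1)    # [d_0, ..., d_k]

class Feature2Score(nn.Module):
    def __init__(self, n_lfcc: int = 80, growth: int = 64):
        super().__init__()
        self.mfm = nn.Sequential(nn.Conv1d(n_lfcc, 256, 3, padding=1),
                                 MaxFeatureMap())                    # -> 128 channels
        self.init_tdnn = nn.Sequential(nn.Conv1d(128, 128, 1), nn.ReLU())  # channel init
        self.local = nn.Sequential(DenseTDNNUnit(128, growth),
                                   DenseTDNNUnit(128 + growth, growth))    # local features
        self.trans1 = self._transition(128 + 2 * growth, 128)
        self.long_term = nn.Sequential(*[DenseTDNNUnit(128 + i * growth, growth, dilation=2)
                                         for i in range(4)])         # long-term dependence
        self.trans2 = self._transition(128 + 4 * growth, 128)
        self.head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),    # feed-forward layer
                                  nn.Linear(256, 1))                 # linear scoring layer
    @staticmethod
    def _transition(in_ch: int, out_ch: int) -> nn.Module:
        return nn.Sequential(nn.BatchNorm1d(in_ch), nn.ReLU(),
                             nn.Conv1d(in_ch, out_ch, 1))
    def forward(self, lfcc):                           # lfcc: (B, 80, T)
        h = self.trans2(self.long_term(self.trans1(self.local(
            self.init_tdnn(self.mfm(lfcc))))))
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)      # statistics pooling
        return self.head(stats).squeeze(-1)            # acoustic-angle score s_f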
4) The grayscale spectrogram image of the audio is passed through the image-to-score (image2score) module of the synthetic voice detection model to obtain the audio authenticity score from the image angle.
The detailed structure of the image2score module is shown in table 2.
TABLE 2
The structure of the residual block is shown in FIG. 4. The image2score module takes the grayscale spectrogram image as input, with shape (B, C, H, W), where B is the batch size, C the number of channels, H the height and W the width. Note that after the input passes through the flattening layer, the channel, height and width dimensions are flattened into a single vector, so the output shape of the network is (B, D), where D is the feature dimension. Effective features are extracted by two-dimensional convolution while the residual structure builds a smooth information flow inside the network, allowing the network to learn effective features better; the residual structure is computed by formula (2).
y = F(x, ω) + x    (2)
Where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
After feature extraction, the feature vector is flattened and then passed through the fully connected layer to output the audio authenticity score from the image angle.
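A structural sketch of the image2score branch follows (PyTorch). The layer types, the residual connection y = F(x, ω) + x, the flattening and the Dropout-assisted fully connected head come from the description above; the channel widths and the Dropout rate are assumptions, since the exact values belong to Table 2.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: the skip connection realizes y = F(x, w) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
    def forward(self, x):
        return torch.relu(self.body(x) + x)

class Image2Score(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),    # extract information
            ResidualBlock(32),
            nn.MaxPool2d(2))                              # shrink the feature maps
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # unroll C*H*W
            nn.Linear(32 * 25 * 17, 128), nn.ReLU(),      # reduce the dimension
            nn.Dropout(0.5),                              # improve generalization
            nn.Linear(128, 1))                            # image-angle score s_i
    def forward(self, img):                               # img: (B, 1, 50, 34) grayscale
        return self.classifier(self.features(img)).squeeze(-1)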
5) The authenticity scores of the audio from the two different angles obtained in steps 3) and 4) are weighted to compute the final authenticity score of the audio. The weighted computation is given by formula (4):
score = f(s_f, s_i; ω) = ω×s_f + (1-ω)×s_i    (4)
where f(·) is the weighting function, s_f is the authenticity score output by the feature-to-score module, s_i is the authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the decision threshold, score is the final audio authenticity score, H_0 is the null hypothesis that the audio is real, and H_1 is the alternative hypothesis that the audio is synthesized. A score greater than threshold means H_0 is accepted and the audio is judged real; a score less than threshold means H_1 is accepted and the audio is judged fake.
6) The final authenticity score of the audio is compared with the preset threshold to obtain the label of the audio sample; the predicted label is computed by formula (5):
label = 1 if score > threshold, otherwise label = 0    (5)
where threshold is the preset threshold, set to 0.5 in the present invention.
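The fusion and decision of formulas (4) and (5) amount to a weighted mean of the two branch scores followed by a threshold comparison, as in the sketch below; the weighting coefficient ω would in practice be chosen on the verification set, and passing the raw fused score through a sigmoid before comparing with 0.5 is an assumption consistent with the training loss below.

import torch

def fuse_scores(s_f: torch.Tensor, s_i: torch.Tensor, w: float = 0.5) -> torch.Tensor:
    """Formula (4): score = f(s_f, s_i; w) = w*s_f + (1 - w)*s_i."""
    return w * s_f + (1.0 - w) * s_i

def decide(score: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Formula (5): label 1 (real, accept H0) above the threshold, 0 (fake, accept H1) below."""
    return (torch.sigmoid(score) > threshold).long()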
In the training part of the invention, the loss function used is as follows:
loss = -(y_n*log(δ(z_n)) + (1-y_n)*log(1-δ(z_n)))
where z_n is the score indicating that the n-th sample is a positive sample, y_n is the label of the n-th sample, and δ is the sigmoid activation function. All data are processed according to the above rules, and the network model (steps 2) to 6)) is trained by optimizing this loss function.
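The loss above is the binary cross-entropy of the sigmoid of the fused score; in PyTorch this corresponds to BCEWithLogitsLoss, as in the minimal sketch below (the random tensors only stand in for one batch of fused scores and labels).

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()   # -(y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)))

fused_scores = torch.randn(128, requires_grad=True)   # z_n: fused scores for one batch
labels = torch.randint(0, 2, (128,)).float()          # y_n: 1 = real, 0 = synthetic
loss = criterion(fused_scores, labels)
loss.backward()                                        # gradients for gradient descent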
The specific training and verification process is as follows:
carrying out data preprocessing on the training set, the verification set and the test set, and extracting corresponding acoustic characteristics and spectrogram image characteristics;
inputting the acoustic features, spectrogram image features and labels of the samples into the synthesized audio detection model in batches of 128 samples;
weighting and fusing the first authenticity score output by the feature-to-score module and the second authenticity score output by the image-to-score module of the synthesized audio detection model to obtain the final audio authenticity score;
computing the loss from the final audio authenticity scores and the labels of the samples, optimizing the training model by gradient descent, and iterating for 10 epochs in total;
comparing the final audio authenticity score with a preset threshold value, outputting a final detection result, and calculating an equal error rate on a training set;
likewise, while training, the effect of the model is measured by its performance on the verification set, including in particular:
and inputting the acoustic characteristics, spectrogram image characteristics and labels of the samples in the verification set into the synthesized audio detection model by 128 samples in each batch to obtain the final authenticity score of the audio, comparing the final authenticity score with a preset threshold value, calculating the equal error rate on the verification set, and measuring the detection performance of the model.
In this embodiment, the method and the model are applied to the detection of speech synthesized by neural vocoders, and are compared with the currently mainstream synthetic audio detection models, the Gaussian mixture model and the RawNet2 model, on the WaveFake, LJSpeech and JSUT datasets (LJSpeech is an English dataset and JSUT is a Japanese dataset). The experimental setup is as follows: the synthesized audio subsets of WaveFake that take LJSpeech as reference, together with the real reference set LJSpeech, form the training data; the synthesized audio of WaveFake that takes JSUT as reference, together with the JSUT dataset, forms the test set; during training, one sample subset is reserved from the training data each time as an additional test set and the rest is used for training; for the verification set, the training data are split 8:2 into a training set and a verification set for the experiment. The flow is shown in FIG. 5. The overall performance comparison is shown in Table 3, where the reserved set denotes the data subset reserved in each training run, TTS denotes a test set of audio synthesized on the LJSpeech dataset with different utterances of the same speaker, and the two data subsets under JSUT are test sets with different speakers and different languages; the figures in the table are equal error rates, a performance metric commonly used in evaluation tasks in the audio field such as speech recognition.
TABLE 3
Table 4 gives the average equal error rates of the invention, the Gaussian mixture model and RawNet2 in these experiments. The order of the reserved sets in Table 4 corresponds one-to-one to the reserved sets in Table 3.
TABLE 4
As Tables 3 and 4 show, the performance of the invention is better than RawNet2 and essentially better than the Gaussian mixture model, both on the data subset TTS with different utterances of the same speaker and on the two subsets of the JSUT dataset with different speakers and different languages. The best average equal error rate of the invention is 0.004, which is better than the Gaussian mixture model (0.054) and RawNet2 (0.436). Even when facing audio synthesized from previously unseen utterances in different languages, the proposed method can detect it effectively and give reliable results, so the method has good stability and generalization and is better suited to the complex situations of a real environment.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic features of the audio and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain a first authenticity score and a second authenticity score of the audio, respectively;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score with a preset threshold value to obtain a final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module maps the acoustic features of the input audio to the first authenticity score of the audio, and the image-to-score module maps the input spectrogram image features to the second authenticity score of the audio.
2. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein,
the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result;
weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold value and, once the detection result meets the requirement, taking the optimized initial synthesized audio detection model as the synthesized audio detection model.
3. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting acoustic features of audio from the audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
and applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain linear frequency cepstral coefficients as the acoustic features.
4. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting spectrogram image features corresponding to acoustic features of audio from an audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
after the spectrogram is converted from the amplitude scale to the decibel scale, a gray scale image with a specified pixel size is constructed as the spectrogram image characteristic.
5. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the feature-to-score module comprises a maximum feature map unit, a time-delay neural network (TDNN) unit, densely connected TDNN units, a transition layer, a pooling layer, a feed-forward neural network layer and a linear layer;
firstly, extracting features from linear frequency cepstrum coefficients in a two-dimensional space through a maximum feature map unit;
then initializing the number of channels through the TDNN unit, learning local features through several consecutive densely connected TDNN units, and aggregating the multi-stage information through a transition layer; then connecting several further densely connected TDNN units to learn long-term dependencies and aggregating the information with another transition layer; and finally aggregating information through a pooling layer and outputting the authenticity score through the feed-forward neural network layer and the linear layer.
6. The method for detecting synthesized speech based on neural network and feature fusion according to claim 5, wherein the multi-stage information aggregated by the transition layer is produced by the dense connections:
d_k = D_k([d_0, d_1, …, d_{k-1}])
where d_0 is the input of the densely connected TDNN block, d_k is the output of the k-th densely connected TDNN unit, [·] denotes the concatenation (splicing) operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
7. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the image-to-score module comprises a two-dimensional convolution layer, a residual block, a max-pooling layer, a flattening layer, a Dropout layer and a fully connected layer;
the grayscale spectrogram image first passes through the two-dimensional convolution layer and the residual block to fully extract information, the max-pooling layer then reduces the feature-map size and with it the feature dimension after flattening, the flattening layer unrolls the features, and a fully connected layer reduces the dimension and is combined with the Dropout layer to improve the generalization of the module; and finally a fully connected layer outputs the authenticity score from the image angle.
8. The method for detecting synthesized speech based on neural network and feature fusion according to claim 7, wherein the formula of the residual block construction information stream is:
y=F(x,ω)+x
where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
9. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the first authenticity score and the second authenticity score of the audio are weighted and fused according to:
score = f(s_f, s_i; ω) = ω×s_f + (1-ω)×s_i
where f(·) is the weighting function, s_f is the authenticity score output by the feature-to-score module, s_i is the authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the decision threshold, score is the final audio authenticity score, H_0 is the null hypothesis that the audio is real, and H_1 is the alternative hypothesis that the audio is synthesized. A score greater than threshold means H_0 is accepted and the audio is judged real; a score less than threshold means H_1 is accepted and the audio is judged fake.
10. The method for detecting synthesized speech based on neural network and feature fusion according to claim 9, wherein the authenticity score is compared with the preset threshold according to:
label = 1 if score > threshold, otherwise label = 0
where threshold is the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is real, and label is the label of the audio.
CN202311490667.XA 2023-11-09 2023-11-09 Synthetic voice detection method based on neural network and feature fusion Active CN117393000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311490667.XA CN117393000B (en) 2023-11-09 2023-11-09 Synthetic voice detection method based on neural network and feature fusion


Publications (2)

Publication Number Publication Date
CN117393000A true CN117393000A (en) 2024-01-12
CN117393000B CN117393000B (en) 2024-04-16

Family

ID=89440827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311490667.XA Active CN117393000B (en) 2023-11-09 2023-11-09 Synthetic voice detection method based on neural network and feature fusion

Country Status (1)

Country Link
CN (1) CN117393000B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100161327A1 (en) * 2008-12-18 2010-06-24 Nishant Chandra System-effected methods for analyzing, predicting, and/or modifying acoustic units of human utterances for use in speech synthesis and recognition
US20180254046A1 (en) * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN110491391A (en) * 2019-07-02 2019-11-22 厦门大学 A kind of deception speech detection method based on deep neural network
US10504504B1 (en) * 2018-12-07 2019-12-10 Vocalid, Inc. Image-based approaches to classifying audio data
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
CN114495950A (en) * 2022-04-01 2022-05-13 杭州电子科技大学 Voice deception detection method based on deep residual shrinkage network
CN116386664A (en) * 2022-12-07 2023-07-04 讯飞智元信息科技有限公司 Voice counterfeiting detection method, device, system and storage medium


Also Published As

Publication number Publication date
CN117393000B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN113488058A (en) Voiceprint recognition method based on short voice
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Goyani et al. Performance analysis of lip synchronization using LPC, MFCC and PLP speech parameters
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN117393000B (en) Synthetic voice detection method based on neural network and feature fusion
KR100969138B1 (en) Method For Estimating Noise Mask Using Hidden Markov Model And Apparatus For Performing The Same
CN111326161B (en) Voiceprint determining method and device
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN111091816B (en) Data processing system and method based on voice evaluation
CN113963718A (en) Voice session segmentation method based on deep learning
Wu et al. Audio-based expansion learning for aerial target recognition
CN110689875A (en) Language identification method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant