CN117393000A - Synthetic voice detection method based on neural network and feature fusion - Google Patents
- Publication number: CN117393000A (application CN202311490667.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- score
- neural network
- layer
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a synthetic voice detection method based on a neural network and feature fusion, comprising the following steps: acquiring an audio data set to be detected, and extracting the acoustic features of the audio and the corresponding spectrogram image features from it; inputting the acoustic features and the spectrogram image features into a pre-trained synthetic audio detection model to obtain, respectively, a first authenticity score and a second authenticity score for the audio; weighting and fusing the first and second authenticity scores to obtain an audio authenticity score after feature-information fusion; and comparing the fused authenticity score with a preset threshold to obtain the final audio detection result. The invention fuses acoustic features with spectrogram image information for synthetic voice detection, and achieves better stability and generalization capability.
Description
Technical Field
The invention relates to a synthetic voice detection method based on neural network and feature fusion, and belongs to the technical field of information security and artificial intelligence.
Background
With the maturation of various deep learning-based speech synthesis methods, the most advanced systems can now generate highly realistic speech that deceives the human ear. Because of the availability and ease of use of these tools and the imperfection of the related laws, a technique known as audio deepfaking has emerged; its abuse poses a serious threat to national image, public opinion and the public interest, so developing tools capable of detecting synthesized audio is of great importance. Against this background, synthesized-audio detection has become an important research problem in acoustic signal processing and artificial intelligence; its main task is to predict automatically, by computation, whether a piece of audio was synthesized by an artificial intelligence tool.
In view of the potential harm of audio deepfake technology, much effort has been devoted to detecting synthesized audio. Broadly, synthesized-audio detection methods can be divided into machine learning-based and deep learning-based methods. Machine learning-based detection generally requires manually designed features; although such methods are well interpretable, their performance depends largely on the hand-crafted features and they scale poorly. Deep learning-based synthetic voice detection exploits deep neural networks to automatically extract and learn useful features and to realize a complex mapping between input and output, so it performs well and has attracted wide attention from researchers in recent years. However, most deep learning-based methods target a specific data set and rarely consider performance in cross-language settings; over-training also occurs, with severe over-fitting on specific data sets that reduces the ability of the corresponding methods to generalize to unknown data.
In addition, synthesized-audio detection methods based on machine learning or deep learning often use only the acoustic features of the audio or only the corresponding spectrogram image features, without fully exploiting the rich information the audio contains; as a result, they show shortcomings in detecting synthesized audio, for example in scalability and stability across different languages.
Disclosure of Invention
The invention aims to provide a synthetic voice detection method based on a neural network and feature fusion, to remedy the shortcoming that the prior art uses only the acoustic features of the audio or only the corresponding spectrogram image features, fails to fully exploit the rich information contained in the audio, and is therefore insufficient for detecting synthesized audio.
A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic features of the audio and the corresponding spectrogram image features into a pre-trained synthetic audio detection model to obtain, respectively, a first authenticity score and a second authenticity score for the audio;
weighting and fusing the first and second audio authenticity scores to obtain an audio authenticity score after feature-information fusion;
comparing the obtained authenticity score with a preset threshold to obtain the final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module maps the acoustic features of the input audio to the first authenticity score of the audio, and the image-to-score module maps the image features of the input spectrogram to the second authenticity score of the audio.
Further, the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result; weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold, and, once the detection performance is satisfactory, taking the optimized initial model as the synthesized audio detection model.
Further, the method for extracting the acoustic features of the audio from the audio data set to be detected comprises:
performing pre-emphasis, framing, windowing and a discrete Fourier transform on the audio file to obtain a frequency-domain representation of the audio, and computing the squared magnitude of the complex-valued signal in the frequency-domain representation to obtain the spectrogram of the audio;
applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain the linear frequency cepstral coefficients as the acoustic features.
Further, the method for extracting the spectrogram image features corresponding to the acoustic features of the audio from the audio data set to be detected comprises:
performing pre-emphasis, framing, windowing and a discrete Fourier transform on the audio file to obtain a frequency-domain representation of the audio, and computing the squared magnitude of the complex-valued signal in the frequency-domain representation to obtain the spectrogram of the audio;
converting the spectrogram from the amplitude scale to the decibel scale, and constructing a gray-scale image of a specified pixel size as the spectrogram image features.
Further, the feature-to-score module comprises a maximum feature map unit, a time-delay neural network unit, closely connected time-delay neural network units, a conversion layer, a pooling layer, a feedforward neural network layer and a linear layer. Features are first extracted from the linear frequency cepstral coefficients in a two-dimensional space by the maximum feature map unit;
the number of channels is initialized by the time-delay neural network unit, local features are learned by several consecutive closely connected time-delay neural network units, and multi-stage information is aggregated by a conversion layer; several further closely connected time-delay neural network units then learn long-term dependencies, and another conversion layer aggregates the information; finally, information is aggregated by a pooling layer, and the authenticity score is output by the feedforward neural network layer and a linear layer.
Further, the conversion layer aggregates multi-stage information according to:

d_k = D_k([d_0, d_1, …, d_{k-1}])

where d_0 denotes the input of the closely connected time-delay neural network unit, d_k denotes the output of its k-th layer, [·] denotes the concatenation operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Further, the image-to-score module comprises a two-dimensional convolution layer, residual blocks, a maximum pooling layer, a flattening layer, a Dropout layer and a fully connected layer;
the spectrogram gray-scale image passes through the two-dimensional convolution layer and the residual blocks to fully extract information; the maximum pooling layer reduces the feature-map size, and thereby the feature dimension; the flattening layer then flattens the features, and a fully connected layer reduces the dimension, combined with the Dropout layer to improve the generalization of the module; finally, a fully connected layer outputs the authenticity score from the image angle.
Further, the residual block constructs the information flow according to:
y=F(x,ω)+x
where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
Further, the first audio authenticity score and the second audio authenticity score are weighted and fused according to:

score = f(s_f, s_i; ω) = ω × s_f + (1 − ω) × s_i

where f(·) denotes the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score is the final audio authenticity score. H_0 denotes the null hypothesis that the audio is genuine, and H_1 denotes the alternative hypothesis that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is synthesized.
Further, the authenticity score is compared with the preset threshold according to:

label = 1 if score > threshold, otherwise 0

where threshold denotes the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is genuine, and label denotes the predicted label of the audio.
Compared with the prior art, the invention has the following beneficial effects: the invention processes spectrogram image information with the synthesized audio detection model and constructs an unobstructed information flow between the layers of the network, giving better stability in use; it is oriented to cross-language synthesized-audio detection, has higher universality, and can adapt to complex situations in real scenes;
the invention combines the acoustic characteristics and spectrogram image information to carry out the synthetic voice detection, and has better stability and generalization capability;
the invention combines the maximum characteristic diagram and the close connection time delay neural network, not only can learn the relation among local characteristics, but also can learn the long-term dependence among the characteristics, thereby improving the detection accuracy.
Drawings
FIG. 1 is a schematic diagram of a network architecture of the method of the present invention;
FIG. 2 is a schematic diagram of the maximum feature map unit of the method of the present invention;
FIG. 3 is a schematic representation of the maximum feature map of the method of the present invention;
FIG. 4 is a diagram of a residual block structure in accordance with the present invention;
fig. 5 is a training-testing schematic diagram of the present invention.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
As shown in fig. 1, a synthetic speech detection method based on neural network and feature fusion is disclosed, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic characteristics of the audio and the corresponding spectrogram image characteristics into a pre-trained synthetic audio detection model to respectively obtain an authenticity score first of the audio and an authenticity score second of the audio;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score after feature information fusion with a preset threshold value to obtain a final audio detection result; the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module is used for outputting the acoustic features of the input audio as the authenticity score I of the audio, and the image-to-score module is used for outputting the corresponding image features of the input spectrogram as the authenticity score II of the audio;
aiming at the method, the specific steps comprise:
1) The input audio data require preprocessing to facilitate subsequent computation by the network model, specifically: resampling, so that all audio data become 16 kHz mono audio; silence removal, deleting all silent segments exceeding 0.2 s in the audio data; and trimming or filling all audio to 4 s, where the filling strategy repeats the audio to be filled and intercepts a 4 s segment as the filled audio.
The method for acquiring the audio data set to be detected comprises the following steps:
acquiring a used audio data set containing different languages, different speakers and different utterances;
preprocessing the acquired audio data set containing different languages, different speakers and different utterances to obtain an audio data set to be tested;
the method for preprocessing the acquired audio data set containing different languages, different speakers and different utterances comprises the following steps:
and resampling, silencing, eliminating, trimming and cutting the audio sample data set to unify the data format of the audio.
2) Extract the linear frequency cepstral coefficients of the preprocessed audio as the acoustic-domain features, and extract the spectrogram image corresponding to the audio as the image-domain features. For the linear frequency cepstral coefficients, pre-emphasis, framing and windowing are applied to the audio waveform in turn to avoid spectral leakage; next, a discrete Fourier transform is applied to the frames to obtain a frequency-domain representation X(t, k) of the audio, where t = 1, …, T is the frame index and k = 0, 1, …, K−1 indexes the discrete Fourier transform coefficients; the squared magnitude |X(t, k)|² of the complex-valued signal is computed as the spectrogram of the audio; and a linear filter bank and a discrete cosine transform are applied to the spectrogram to obtain 80-dimensional linear frequency cepstral coefficients. For the spectrogram image, the spectrogram is converted from the amplitude scale to the decibel scale and a gray-scale image of 50×34 pixels is constructed.
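The feature-extraction pipeline just described (pre-emphasis, windowed framing, DFT power spectrum, a linearly spaced triangular filter bank, and a DCT) can be sketched as below. The FFT size, hop length and filter-bank design are assumptions for illustration; the patent specifies only the 80 output dimensions:

```python
import numpy as np

def lfcc(audio, n_fft=512, hop=160, n_filters=80, n_coeffs=80):
    """Linear frequency cepstral coefficients: like MFCC, but with
    linearly spaced (not mel-spaced) triangular filters."""
    # pre-emphasis to flatten the spectrum
    audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # framing + Hamming window
    frames = np.array([audio[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(audio) - n_fft + 1, hop)])
    # power spectrum |X(t, k)|^2
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # linearly spaced triangular filter bank
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    feat = np.log(spec @ fbank.T + 1e-10)
    # DCT-II over the filter axis yields the cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5)
                   * np.arange(n_coeffs)[:, None])
    return feat @ basis.T
```

The decibel-scale spectrogram for the gray-scale image can be obtained analogously from `spec` via `10 * np.log10(spec + 1e-10)` followed by resizing to 50×34.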
3) And (3) passing the linear frequency cepstrum coefficient through a feature-to-score (feature 2 score) module of the synthesized voice detection model to obtain the audio authenticity score of the acoustic feature angle.
The detailed structure of the feature2score module is shown in table 1.
TABLE 1
The structure of the maximum feature map unit is shown in fig. 2, and the maximum feature map itself in fig. 3. The time-delay neural network unit consists of a one-dimensional convolution layer and an activation function; the conversion layer consists of a batch-normalization layer, an activation function and a one-dimensional convolution layer; the feedforward neural network layer consists of a one-dimensional convolution layer, a batch-normalization layer and an activation function. In essence, features are extracted in a two-dimensional space and learned competitively through the maximum feature map: the smaller of the two outputs is discarded and the larger retained, which favors the extraction of useful features. After the input passes through the maximum feature map unit, the temporal relations among the features are learned by the closely connected time-delay neural network, so as to extract a feature vector that can indicate whether the audio is synthesized. The structure of the closely connected time-delay neural network is shown in table 2. Specifically, the number of channels is first initialized by a time-delay network unit; local features are learned by two consecutive closely connected time-delay neural network units; the multi-stage features are then aggregated by a conversion layer based on a feedforward neural network. The information represented by the k-th layer of the closely connected time-delay neural network is given by formula (1):
d_k = D_k([d_0, d_1, …, d_{k-1}])  (1)

where d_0 denotes the input of the closely connected time-delay neural network unit, d_k denotes the output of its k-th layer, [·] denotes the concatenation operation, and D_k(·) denotes the nonlinear transformation of the k-th layer.
Four consecutive closely connected time-delay neural network units are then used to learn long-term dependencies, followed by a conversion layer to aggregate information; finally, the features are aggregated by a statistics pooling layer into a 256-dimensional feature vector, and a fully connected layer outputs the authenticity score.
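The densely connected pattern of formula (1), in which each unit receives the concatenation of all earlier outputs, can be illustrated with a toy numpy sketch; the random linear maps stand in for the one-dimensional convolutions, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(d0, layer_dims):
    """Each layer k computes d_k = D_k([d_0, ..., d_{k-1}]), where D_k is
    sketched here as a random linear map followed by ReLU."""
    outputs = [d0]
    for out_dim in layer_dims:
        x = np.concatenate(outputs, axis=-1)       # [d_0, d_1, ..., d_{k-1}]
        W = 0.1 * rng.standard_normal((x.shape[-1], out_dim))
        outputs.append(np.maximum(x @ W, 0.0))     # nonlinear transform D_k
    return outputs[-1]

# 50 frames of 16-dim features through three densely connected layers
out = dense_block(np.ones((50, 16)), [32, 32, 32])
```

Because every layer sees all previous outputs, the input width grows with depth (16, then 48, then 80 here), which is what lets the block aggregate multi-stage information.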
4) And (3) enabling the spectrogram gray level image of the audio to pass through an image-to-score (image 2 score) module of the synthesized voice detection model to obtain the audio authenticity score of the image angle.
The detailed structure of the image2score module is shown in table 2.
TABLE 2
The structure of the residual block is shown in fig. 4. The image2score module takes the spectrogram gray-scale image as input, with shape (B, C, H, W), where B is the batch size, C the number of channels, H the height and W the width. Note that after the input passes through the flattening layer, the channel, height and width dimensions are flattened into a single dimension, so the output of the network has shape (B, D), where D is the feature dimension. Effective features are extracted by two-dimensional convolution, while the residual structure builds a smooth information flow in the network so that it can better learn effective features; the residual structure is computed as in formula (2).
y=F(x,ω)+x#(2)
Where x represents the input, ω represents the parameter of the current layer, F (x, ω) represents the output of the input through the nonlinear transformation of the current layer, and y represents the output of the current layer.
After feature extraction, the feature vector is flattened and passed through the fully connected layer to output the audio authenticity score from the image angle.
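A toy numpy version of formula (2), with F sketched as two linear maps and a ReLU (the actual module uses two-dimensional convolutions):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x, w) + x: the identity shortcut keeps the information flow
    unobstructed even when F contributes little."""
    f = np.maximum(x @ W1, 0.0) @ W2   # F(x, w)
    return f + x                       # skip connection

x = np.ones((4, 8))
# with zero weights F(x, w) = 0, so the block reduces to the identity
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

The example makes the design choice visible: gradients (and information) can always pass through the `+ x` path, which is why the description emphasizes the smooth information flow.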
5) The authenticity scores of the audio from the two angles are obtained from steps 3) and 4), and the final authenticity score of the audio is computed by weighting, as in formula (4):

score = f(s_f, s_i; ω) = ω × s_f + (1 − ω) × s_i  (4)

where f(·) denotes the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score is the final audio authenticity score. H_0 denotes the null hypothesis that the audio is genuine, and H_1 denotes the alternative hypothesis that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is synthesized.
6) The final authenticity score of the audio is compared with a preset threshold to obtain the label of the audio sample; the predicted label is computed as in formula (5):

label = 1 if score > threshold, otherwise 0  (5)

where threshold denotes the preset threshold, set to 0.5 in the present invention.
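Formulas (4) and (5) amount to a convex combination of the two scores followed by thresholding; a direct sketch (the function names are illustrative):

```python
def fuse_scores(s_f: float, s_i: float, omega: float = 0.5) -> float:
    """Formula (4): score = omega * s_f + (1 - omega) * s_i."""
    return omega * s_f + (1 - omega) * s_i

def predict_label(score: float, threshold: float = 0.5) -> int:
    """Formula (5): 1 (accept H0, genuine) if score > threshold, else 0."""
    return 1 if score > threshold else 0
```

For example, fusing s_f = 0.9 and s_i = 0.3 with ω = 0.5 gives score 0.6, which exceeds the 0.5 threshold and yields label 1 (genuine).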
In the training part of the invention, the loss function used is as follows:
loss = −(y_n × log(δ(z_n)) + (1 − y_n) × log(1 − δ(z_n)))

where z_n denotes the score of the n-th sample being a positive sample, y_n denotes the label of the n-th sample, and δ denotes the sigmoid activation function. All data are computed according to the above rule, and the network model (steps 2) to 6)) is trained by optimizing this loss function.
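This is the standard binary cross-entropy on sigmoid outputs; a numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(z_n, y_n):
    """loss = -(y_n * log(sigmoid(z_n)) + (1 - y_n) * log(1 - sigmoid(z_n)))"""
    p = sigmoid(z_n)
    return -(y_n * np.log(p) + (1 - y_n) * np.log(1 - p))
```

A positive sample scored at z_n = 0 costs log 2 ≈ 0.693, and the loss vanishes as δ(z_n) approaches y_n; in practice a numerically stabilized formulation would be preferred for large |z_n|.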
The specific training and verification process is as follows:
data preprocessing is performed on the training, verification and test sets, and the corresponding acoustic features and spectrogram image features are extracted;
the acoustic features, spectrogram image features and labels of the samples are fed into the synthesized audio detection model in batches of 128 samples;
the first audio authenticity score output by the feature-to-score module and the second audio authenticity score output by the image-to-score module are weighted and fused to obtain the final audio authenticity score;
the loss is computed from the final audio authenticity score and the sample labels, the model is optimized by gradient descent, and training runs for 10 rounds in total;
the final audio authenticity score is compared with the preset threshold, the final detection result is output, and the equal error rate on the training set is computed;
likewise, during training, the model is evaluated by its performance on the verification set, specifically:
the acoustic features, spectrogram image features and labels of the verification-set samples are fed into the synthesized audio detection model in batches of 128 to obtain the final authenticity score of the audio, which is compared with the preset threshold; the equal error rate on the verification set is then computed to measure the detection performance of the model.
In this embodiment, the method and model are applied to the detection of speech synthesized by neural vocoders, and are compared with the currently mainstream synthesized audio detection models, the Gaussian mixture model and the RawNet2 model, on the WaveFake, LJSpeech and JSUT data sets (LJSpeech is an English data set and JSUT is a Japanese data set). The experimental method is as follows: the synthesized audio subsets in WaveFake that take LJSpeech as reference, together with the real LJSpeech reference set, serve as the training data; the synthesized audio in WaveFake that takes JSUT as reference, together with the JSUT data set, serve as the test set. During training, one synthesized-audio subset is reserved from the training data each time as an additional test set, and the rest is used for training; the training data is further divided 8:2 into a training set and a verification set for the experiment. The flow is shown in fig. 5. The overall performance comparison is shown in table 3, where the reserved set denotes the data subset reserved in each training run, TTS denotes a test set of audio synthesized on the LJSpeech data set with different utterances of the same speaker, and the two data subsets under JSUT are test sets with different speakers and a different language. The figures in the table are equal error rates, a performance metric commonly used in evaluation tasks in the audio field such as speech recognition.
Table 3
Table 4 gives the average equal error rates of the invention, the Gaussian mixture model, and RawNet2 in the experiments. The order of the reserved sets in table 4 corresponds one-to-one to the reserved sets in table 3.
Table 4
As can be seen from tables 3 and 4 together, the performance of the present invention is better than that of RawNet2, and essentially better than the Gaussian mixture model, both on the data subset TTS (different utterances of the same speaker) and on the two subsets corresponding to the JSUT data set (different speakers and a different language). The optimal average equal error rate of the invention is 0.004, which is superior to the Gaussian mixture model (0.054) and RawNet2 (0.436). Even when facing audio synthesized from unseen utterances in unseen languages, the proposed method can detect effectively and give reliable results; it therefore has good stability and generalization, and is better suited to the complex situations of real environments.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. A synthetic speech detection method based on neural network and feature fusion, the method comprising:
acquiring an audio data set to be detected, and extracting acoustic characteristics of audio and corresponding spectrogram image characteristics from the audio data set to be detected;
inputting the acoustic characteristics of the audio and the corresponding spectrogram image characteristics into a pre-trained synthesized audio detection model to respectively obtain a first audio authenticity score and a second audio authenticity score;
weighting and fusing the first audio authenticity score and the second audio authenticity score to obtain an audio authenticity score after feature information fusion;
comparing the obtained authenticity score with a preset threshold value to obtain a final audio detection result;
the synthesized audio detection model comprises a feature-to-score module and an image-to-score module, wherein the feature-to-score module is used for converting the acoustic features of the input audio into the first audio authenticity score, and the image-to-score module is used for converting the image features of the input spectrogram into the second audio authenticity score.
2. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein,
the training method of the synthesized audio detection model comprises the following steps:
the method comprises the steps of obtaining a real audio data set and a synthesized audio data set as sample sets, and dividing the sample sets into a training set and a verification set according to a preset proportion;
carrying out data preprocessing on the training set, and extracting corresponding acoustic features and spectrogram image features;
training an initial synthesized audio detection model by adopting acoustic features and spectrogram image features of a sample, and outputting a training result;
weighting and fusing training results to obtain an audio authenticity score;
calculating loss through the audio authenticity score and a sample preset label, optimizing and training an initial synthesized audio detection model by adopting a gradient descent method, and observing the performance of the model on a verification set;
and comparing the final audio authenticity score with a preset threshold value, and taking the optimized initial synthesized audio detection model as the final synthesized audio detection model once the detection result meets the requirement.
3. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting acoustic features of audio from the audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
and applying a linear filter bank and a discrete cosine transform to the spectrogram to obtain the linear frequency cepstral coefficients as the acoustic features.
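As a non-authoritative sketch of the extraction pipeline in this claim (pre-emphasis, framing, windowing, DFT, squared magnitude, linear filter bank, DCT), the following implements linear frequency cepstral coefficients with NumPy; all parameter values (pre-emphasis coefficient, frame length, hop, FFT size, filter and coefficient counts) are assumptions, not specified by the patent:

```python
import numpy as np

def lfcc(signal, frame_len=400, hop=160, n_fft=512, n_filters=20, n_ceps=20):
    """Linear frequency cepstral coefficients of a 1-D waveform."""
    # Pre-emphasis, framing, Hamming windowing
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # DFT and squared magnitude -> power spectrogram
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Linearly spaced triangular filter bank (no mel warping)
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    feat = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the filter axis yields the cepstral coefficients
    n = np.arange(n_filters)[:, None]
    k = np.arange(n_ceps)[None, :]
    dct = np.cos(np.pi / n_filters * (n + 0.5) * k)
    return feat @ dct
```

The only difference from the more common MFCC pipeline is the linearly spaced filter bank, which keeps high-frequency vocoder artifacts at full resolution.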
4. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the method for extracting spectrogram image features corresponding to acoustic features of audio from an audio to-be-detected data set comprises:
performing pre-emphasis, framing, windowing and discrete Fourier transformation on an audio file to obtain a frequency domain representation of the audio, and calculating the square amplitude of complex value signals in the frequency domain representation to obtain a spectrogram of the audio;
after the spectrogram is converted from the amplitude scale to the decibel scale, a gray scale image with a specified pixel size is constructed as the spectrogram image characteristic.
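A minimal sketch of this claim (power spectrogram, decibel conversion, fixed-size grayscale image) is given below, under stated assumptions: the target pixel size, nearest-neighbour resizing, and min-max normalization are illustrative choices, not mandated by the patent:

```python
import numpy as np

def spectrogram_image(signal, frame_len=400, hop=160, n_fft=512, size=(224, 224)):
    """Grayscale spectrogram image of a 1-D waveform as a uint8 array."""
    # Pre-emphasis, framing, windowing, DFT, squared magnitude
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Amplitude scale -> decibel scale, then normalize to [0, 1]
    db = 10.0 * np.log10(power + 1e-10)
    db = (db - db.min()) / (db.max() - db.min() + 1e-10)
    # Nearest-neighbour resize to the specified pixel size (no external deps)
    rows = np.arange(size[0]) * db.shape[0] // size[0]
    cols = np.arange(size[1]) * db.shape[1] // size[1]
    return (db[rows][:, cols] * 255).astype(np.uint8)
```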
5. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the feature-to-score module comprises a maximum feature map unit, a time-delay neural network unit, densely connected time-delay neural network units, conversion layers, a pooling layer, a feed-forward neural network layer and a linear layer;
firstly, the maximum feature map unit extracts features from the linear frequency cepstral coefficients in two-dimensional space;
the time-delay neural network unit then initializes the number of channels, several consecutive densely connected time-delay neural network units learn local features, and a conversion layer aggregates the multi-stage information; further densely connected time-delay neural network units follow to learn long-term dependencies, with another conversion layer aggregating the information; finally, the pooling layer aggregates the information, and the feed-forward neural network layer and the linear layer output the authenticity score.
6. The method for detecting synthesized speech based on neural network and feature fusion according to claim 5, wherein the formula of the multi-stage information aggregated by the conversion layer is:
d_k = D_k([d_0, d_1, …, d_{k-1}])
wherein d_0 represents the input of the densely connected time-delay neural network unit, d_k represents the output of the k-th densely connected layer, [·] represents the splicing (concatenation) operation, and D_k(·) represents the nonlinear transformation of the k-th layer.
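A minimal numerical sketch of this densely connected computation follows; a random linear map plus ReLU stands in for each learned transform D_k (an assumption for illustration, since the real units are time-delay layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_stack(d0, n_layers=3, width=8):
    """Each layer k sees the concatenation [d_0, ..., d_{k-1}]
    of all earlier outputs, i.e. d_k = D_k([d_0, ..., d_{k-1}])."""
    outputs = [d0]
    for _ in range(n_layers):
        concat = np.concatenate(outputs)          # splicing operation [.]
        w = rng.standard_normal((width, concat.size))
        outputs.append(np.maximum(w @ concat, 0)) # D_k: linear map + ReLU
    return outputs[-1]
```

The key property is that the input to each layer grows with depth, so every layer has direct access to all earlier feature maps.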
7. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the image-to-score module comprises a two-dimensional convolution layer, a residual block, a maximum pooling layer, a flattening layer, a Dropout layer and a full connection layer;
the grayscale spectrogram image first has its information fully extracted by the two-dimensional convolution layer and the residual block; the maximum pooling layer then reduces the feature map size, and hence the feature dimension after flattening; the flattening layer unfolds the features, the fully connected layer reduces the dimension, and the Dropout layer improves the generalization of the module; finally, the authenticity score from the image angle is output through the fully connected layer.
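The shape bookkeeping of this pipeline can be sketched as follows; random weights stand in for learned parameters, the residual stage is omitted, and all layer sizes are assumptions chosen for illustration:

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k-by-k max pooling, shrinking the feature map."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

def image_to_score(gray, rng=np.random.default_rng(0)):
    """Grayscale spectrogram image -> scalar authenticity score."""
    pooled = max_pool2d(gray.astype(float))           # max pooling layer
    flat = pooled.ravel()                             # flattening layer
    w1 = rng.standard_normal((64, flat.size)) * 0.01  # fully connected, reduces dim
    hidden = np.maximum(w1 @ flat, 0)
    w2 = rng.standard_normal(64) * 0.01
    return float(w2 @ hidden)                         # scalar score output
```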
8. The method for detecting synthesized speech based on neural network and feature fusion according to claim 7, wherein the formula of the residual block construction information stream is:
y=F(x,ω)+x
where x represents the input, ω represents the parameters of the current layer, F(x, ω) represents the output of the input after the nonlinear transformation of the current layer, and y represents the output of the current layer.
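The identity shortcut of y = F(x, ω) + x can be sketched in a few lines; here a toy linear-plus-ReLU transform stands in for F (an assumption, since the actual blocks use convolutions):

```python
import numpy as np

def residual_block(x, w):
    """y = F(x, w) + x: the input is added back to the transformed output,
    so gradients can flow through the shortcut even when F saturates."""
    fx = np.maximum(w @ x, 0.0)  # F(x, w): nonlinear transform of the input
    return fx + x                # identity shortcut
```

When F outputs zero (e.g. all weights zero), the block reduces to the identity map, which is what makes deep stacks of such blocks easy to optimize.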
9. The method for detecting synthesized speech based on neural network and feature fusion according to claim 1, wherein the formula for weighting and fusing the first audio authenticity score and the second audio authenticity score is:
score = f(s_f, s_i; ω) = ω × s_f + (1 - ω) × s_i
where f(·) represents the weighting function, s_f is the audio authenticity score output by the feature-to-score module, s_i is the audio authenticity score output by the image-to-score module, ω is the weighting coefficient, threshold is the preset threshold, and score represents the final audio authenticity score; H_0 represents the null hypothesis, stating that the audio is genuine, and H_1 represents the alternative hypothesis, stating that the audio is synthesized. A score greater than threshold indicates acceptance of H_0, i.e. the audio is genuine; a score less than threshold indicates acceptance of H_1, i.e. the audio is forged.
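The fusion and hypothesis test of this claim amount to one weighted sum and one comparison, sketched below; the particular values of ω and threshold are assumptions for illustration:

```python
def fuse_and_decide(s_f, s_i, w=0.5, threshold=0.5):
    """s_f: feature-to-score output; s_i: image-to-score output.
    Returns the fused score and the accepted hypothesis."""
    score = w * s_f + (1 - w) * s_i  # f(s_f, s_i; w) = w*s_f + (1-w)*s_i
    decision = "H0: genuine" if score > threshold else "H1: synthesized"
    return score, decision
```

In practice ω would be tuned on the verification set so that the fused score balances the two modules' error patterns.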
10. The method for detecting synthesized speech based on neural network and feature fusion according to claim 9, wherein the expression for comparing the authenticity score with the preset threshold is:
label = 1 if score > threshold, otherwise label = 0
wherein threshold represents the preset threshold, 0 indicates that the audio is synthesized, 1 indicates that the audio is genuine, and label represents the label of the audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311490667.XA CN117393000B (en) | 2023-11-09 | 2023-11-09 | Synthetic voice detection method based on neural network and feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117393000A true CN117393000A (en) | 2024-01-12 |
CN117393000B CN117393000B (en) | 2024-04-16 |
Family
ID=89440827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311490667.XA Active CN117393000B (en) | 2023-11-09 | 2023-11-09 | Synthetic voice detection method based on neural network and feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117393000B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100161327A1 (en) * | 2008-12-18 | 2010-06-24 | Nishant Chandra | System-effected methods for analyzing, predicting, and/or modifying acoustic units of human utterances for use in speech synthesis and recognition |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
US10504504B1 (en) * | 2018-12-07 | 2019-12-10 | Vocalid, Inc. | Image-based approaches to classifying audio data |
CN110992987A (en) * | 2019-10-23 | 2020-04-10 | 大连东软信息学院 | Parallel feature extraction system and method for general specific voice in voice signal |
CN112201255A (en) * | 2020-09-30 | 2021-01-08 | 浙江大学 | Voice signal spectrum characteristic and deep learning voice spoofing attack detection method |
CN114495950A (en) * | 2022-04-01 | 2022-05-13 | 杭州电子科技大学 | Voice deception detection method based on deep residual shrinkage network |
CN116386664A (en) * | 2022-12-07 | 2023-07-04 | 讯飞智元信息科技有限公司 | Voice counterfeiting detection method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN117393000B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN109559736B (en) | Automatic dubbing method for movie actors based on confrontation network | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN113488058A (en) | Voiceprint recognition method based on short voice | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
WO2019232833A1 (en) | Speech differentiating method and device, computer device and storage medium | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
CN112735404A (en) | Ironic detection method, system, terminal device and storage medium | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
Goyani et al. | Performance analysis of lip synchronization using LPC, MFCC and PLP speech parameters | |
Sen et al. | A convolutional neural network based approach to recognize bangla spoken digits from speech signal | |
CN113129908B (en) | End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion | |
Yousfi et al. | Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
CN117393000B (en) | Synthetic voice detection method based on neural network and feature fusion | |
KR100969138B1 (en) | Method For Estimating Noise Mask Using Hidden Markov Model And Apparatus For Performing The Same | |
CN111326161B (en) | Voiceprint determining method and device | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
CN113963718A (en) | Voice session segmentation method based on deep learning | |
Wu et al. | Audio-based expansion learning for aerial target recognition | |
CN110689875A (en) | Language identification method and device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||