CN116758943A - Synthetic voice detection method and device, electronic equipment and storage medium - Google Patents

Synthetic voice detection method and device, electronic equipment and storage medium

Info

Publication number
CN116758943A
Authority
CN
China
Prior art keywords
voice
loss function
voice signal
classification model
likelihood value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310820588.4A
Other languages
Chinese (zh)
Inventor
张鹏远
张宇翔
王文超
陈树丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202310820588.4A priority Critical patent/CN116758943A/en
Publication of CN116758943A publication Critical patent/CN116758943A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiment of the application discloses a synthetic voice detection method and device, an electronic device, and a storage medium, relates to the technical field of voice recognition, and can improve the accuracy of detecting synthesized voice. The method comprises the following steps: acquiring a first voice signal in a first set, the first voice signal comprising a real voice signal and a synthesized voice signal; generating a first cross entropy loss function based on the first voice signal and a preset classification model; generating posterior distribution features approximating data outside the first set under the current model parameters of the classification model, and generating a second cross entropy loss function based on the posterior distribution features; generating a total loss function based on the first cross entropy loss function and the second cross entropy loss function; performing gradient back-propagation with the total loss function and updating the parameters of the classification model to obtain an updated classification model; and inputting the acoustic features of the voice signal to be detected into the updated classification model to obtain a detection result of the voice signal to be detected.

Description

Synthetic voice detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and apparatus for detecting synthesized speech, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, high-quality synthesized speech can now be generated for specific persons, which brings great convenience in areas such as virtual humans, human-computer interaction, and content creation. However, speech synthesis technology may also be used maliciously, for example for telecommunication fraud or for spreading malicious speech and false information, which poses a serious threat to social stability and to people's lives and property.
For this reason, synthetic speech detection techniques have been developed that use artificial intelligence to discriminate synthesized speech generated by various speech synthesis algorithms. In recent years, detection systems based on deep learning have become mainstream: the front end extracts time-frequency features of speech (such as spectrogram features and Mel-frequency features) by different methods, and the back end learns high-level representations of these features through a deep neural network and determines whether a speech signal is synthesized. However, owing to the characteristics of the task and the limitations of current deep learning techniques, existing deep-learning-based synthetic speech detection systems depend strongly on the training data set and are easily over-fitted to the synthesis algorithms of known data, so their detection accuracy is low when detecting unseen synthesized speech signals.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for detecting synthesized speech, which can improve the accuracy of detecting synthesized speech.
In a first aspect, an embodiment of the present application provides a method for detecting synthesized speech, including: acquiring a first voice signal in a first set, the first voice signal comprising a real voice signal and a synthesized voice signal; extracting acoustic features of the first voice signal; inputting the acoustic features of the first voice signal into a preset classification model to obtain the likelihood value of real voice and the likelihood value of synthesized voice output by the classification model; generating a first cross entropy loss function according to the likelihood value of the real voice and the likelihood value of the synthesized voice; generating posterior distribution features of data outside the first set under the current model parameters of the classification model, and generating a second cross entropy loss function based on the posterior distribution features; performing a weighted summation operation on the first cross entropy loss function and the second cross entropy loss function to obtain a total loss function; performing gradient back-propagation using the total loss function and updating the parameters of the classification model to obtain an updated classification model; and inputting the acoustic features of the voice signal to be detected into the updated classification model to obtain the detection result of the voice signal to be detected.
Optionally, the generating posterior distribution features of the data outside the first set under the current model parameters of the classification model includes: randomly sampling from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generating the posterior distribution features of the data outside the first set under the current model parameters of the classification model.
Optionally, the generating a second cross entropy loss function based on the posterior distribution feature includes: inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set; and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
Optionally, the extracting the acoustic feature of the first voice signal includes: preprocessing the first voice signal to obtain a preprocessing result; the preprocessing comprises pre-emphasis processing, framing processing and windowing processing; performing time-frequency analysis on the preprocessing result by a short-time Fourier transform method to obtain a time-frequency analysis result; and calculating a logarithmic magnitude spectrum of the obtained time-frequency analysis result, and taking the logarithmic magnitude spectrum as the acoustic characteristic of the first voice signal.
Optionally, the preprocessing the first voice signal to obtain a preprocessing result includes: performing data enhancement processing on the first voice signal to obtain a first processing result; the data enhancement processing comprises noise adding processing and reverberation processing; and preprocessing the first processing result to obtain a preprocessing result.
In a second aspect, an embodiment of the present application provides a synthesized voice detection apparatus, including: a first acquisition module for acquiring a first voice signal in a first set; the first voice signal comprises a real voice signal and a synthetic voice signal; the extraction module is used for extracting the acoustic characteristics of the first voice signal; the second acquisition module is used for inputting the acoustic characteristics of the first voice signals into a preset classification model so as to obtain the likelihood value of the real voice and the likelihood value of the synthesized voice output by the classification model; the first generation module is used for generating a first cross entropy loss function according to the likelihood value of the real voice and the likelihood value of the synthesized voice; a second generation module, configured to generate posterior distribution characteristics of data outside the first set under current model parameters of the classification model, and generate a second cross entropy loss function based on the posterior distribution characteristics; the summation module is used for carrying out weighted summation operation on the first cross entropy loss function and the second cross entropy loss function so as to obtain a total loss function; the updating module is used for performing gradient back-propagation by utilizing the total loss function and carrying out parameter updating on the classification model so as to obtain an updated classification model; and the third acquisition module is used for inputting the acoustic characteristics of the voice signal to be detected into the updated classification model so as to obtain the detection result of the voice signal to be detected.
Optionally, the second generating module is specifically configured to: randomly sampling from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generating posterior distribution features of the data outside the first set under the current model parameters of the classification model.
Optionally, the second generating module is specifically configured to: inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set; and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
Optionally, the extracting module includes: the preprocessing sub-module is used for preprocessing the first voice signal to obtain a preprocessing result; the preprocessing comprises pre-emphasis processing, framing processing and windowing processing; the analysis sub-module is used for performing time-frequency analysis on the preprocessing result by a short-time Fourier transform method so as to obtain a time-frequency analysis result; and the calculating sub-module is used for calculating a logarithmic magnitude spectrum of the obtained time-frequency analysis result and taking the logarithmic magnitude spectrum as the acoustic characteristic of the first voice signal.
Optionally, the preprocessing submodule includes: the enhancement processing unit is used for carrying out data enhancement processing on the first voice signal so as to obtain a first processing result; the data enhancement processing comprises noise adding processing and reverberation processing; and the preprocessing unit is used for preprocessing the first processing result to obtain a preprocessing result.
In a third aspect, embodiments of the present application further provide an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing any one of the synthetic voice detection methods provided by the embodiments of the present application.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement any of the synthetic speech detection methods provided by the embodiments of the present application.
The synthetic speech detection method and apparatus, electronic device and storage medium provided by the embodiments of the application can acquire the first speech signal in the first set; generate a first cross entropy loss function based on the first speech signal and a preset classification model; generate posterior distribution features approximating data outside the first set under the current model parameters of the classification model, and generate a second cross entropy loss function based on the posterior distribution features; generate a total loss function based on the first cross entropy loss function and the second cross entropy loss function; perform gradient back-propagation with the total loss function and update the parameters of the classification model to obtain an updated classification model; and input the acoustic features of the speech signal to be detected into the updated classification model to obtain the detection result of the speech signal to be detected. In this way, the acoustic features of the first speech signal are input into the preset classification model to obtain the likelihood value of real speech and the likelihood value of synthesized speech output by the classification model, from which the first cross entropy loss function can be generated. Posterior distribution features of data outside the first set can then be generated, and a second cross entropy loss function generated from them. A total loss function can be generated from the first and second cross entropy loss functions, so that likelihood values of data outside the first set are introduced into the total loss function. The classification model can be updated by training with the total loss function, which reduces the originally high likelihood value of synthesized speech that the updated classification model outputs when recognizing such speech signals, and thus effectively improves the accuracy of detecting synthesized speech.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for detecting synthesized speech according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for detecting synthesized speech according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In a first aspect, as shown in fig. 1, a method for detecting synthesized speech provided by an embodiment of the present application may include:
s11, acquiring a first voice signal in a first set; the first voice signal comprises a real voice signal and a synthetic voice signal;
in this step, the first set may be a pre-established set, where the set includes both the real speech signal and the synthesized speech signal, and the specific method for generating the synthesized speech signal is not limited in the embodiment of the present application.
S12, extracting acoustic features of the first voice signal;
s13, inputting the acoustic characteristics of the first voice signal into a preset classification model to obtain the likelihood value of the real voice and the likelihood value of the synthesized voice output by the classification model;
s14, generating a first cross entropy loss function according to the likelihood value of the real voice and the likelihood value of the synthesized voice;
In this step, the preset classification model may be a convolutional neural network model; a convolutional neural network (Convolutional Neural Network) is a deep learning model similar to the multilayer perceptron in artificial neural networks. The loss function is an index that measures how well the classification model predicts the required output for a particular input.
After the first speech signal is acquired, acoustic features of the first speech signal (covering both the real speech and the synthesized speech it contains) may be extracted. Illustratively, the acoustic feature may be a speech spectrogram. After the extracted acoustic features are input into the classification model, the likelihood value that the first speech signal is real speech and the likelihood value that it is synthesized speech, as output by the classification model, can be obtained, and the sum of these two likelihood values is 1. Based on the two likelihood values, a first cross entropy loss function may be generated.
Illustratively, the acoustic features described above are fed into a convolutional neural network classification model, and the classification loss is calculated. The neural network used here is a WideResNet-22, where 22 denotes 22 trainable layers, consisting, from shallow to deep, of a head convolution layer, 16 residual blocks with added squeeze-and-excitation modules, and a final fully connected layer. The head convolution layer has a kernel size of 3, a stride of 1, a padding of 1, and 16 output channels. Each residual block contains 2 two-dimensional convolution layers whose parameters are identical except for the stride, with a kernel size of 3, a residual connection, a Leaky ReLU activation function, and batch normalization. All residual blocks are divided into 3 groups; the strides of the three groups are (2, 1, 2) and the numbers of output channels are (32, 64, 128), respectively. The embedded features output by the convolutional neural network are passed through a linear classification layer to obtain the likelihood value of real speech and the likelihood value of synthesized speech, and the classification cross entropy loss function (i.e., the first cross entropy loss function) is finally calculated from these likelihood values. The cross entropy loss function is calculated as $L_{CE} = -\sum_{k=1}^{K} p_k \log q_k$, where $K$ is the total number of classes, here 2 (the real speech class and the synthesized speech class), $p$ is the class label, and $q$ is the corresponding likelihood value.
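For illustration only, a minimal sketch of the first cross entropy loss under the assumption of a PyTorch implementation (the class indices, tensor names and shapes are illustrative and not fixed by the embodiment) could look like:

```python
import torch
import torch.nn.functional as F

def first_ce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Two-class cross entropy L = -sum_k p_k * log(q_k), averaged over the batch.

    logits: (batch, 2) raw scores from the linear classification layer
            (column 0 = real speech, column 1 = synthesized speech, by assumption).
    labels: (batch,) integer class labels, 0 = real, 1 = synthesized.
    """
    log_q = F.log_softmax(logits, dim=-1)          # log of likelihood values that sum to 1
    p = F.one_hot(labels, num_classes=2).float()   # one-hot class labels p_k
    return -(p * log_q).sum(dim=-1).mean()

# toy usage with random scores
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss1 = first_ce_loss(logits, labels)              # equals F.cross_entropy(logits, labels)
```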
S15, generating posterior distribution features approximating the data outside the first set under the current model parameters of the classification model, and generating a second cross entropy loss function based on the posterior distribution features;
In this step, the data outside the first set are speech signals that do not belong to the first set. A second cross entropy loss function may be generated based on the posterior distribution features of the data outside the first set, so that the second cross entropy loss function incorporates likelihood values of the data outside the first set.
S16, carrying out weighted summation operation on the first cross entropy loss function and the second cross entropy loss function to obtain a total loss function;
In this step, specifically, different weights may be assigned to the first cross entropy loss function and the second cross entropy loss function, and the two are added with these weights to obtain the total loss function. Illustratively, the first cross entropy loss function has a weight of 1 and the second cross entropy loss function has a weight of 0.1.
S17, performing gradient back-propagation using the total loss function and updating the parameters of the classification model to obtain an updated classification model;
In this step, the gradient of the loss function indicates the direction in which the weights and biases of the classification model should be adjusted in order to improve performance. Gradient back-propagation provides these gradients to the Adam optimizer, a popular optimization algorithm. During the training of the classification model, the Adam optimizer computes the gradient of the total loss function with respect to the parameters of the classification model and updates those parameters so as to minimize the loss, thereby completing the training of the classification model.
Illustratively, gradient back-propagation is performed through the total loss function, and the parameters of the convolutional neural network and of the linear classification layer are updated with an Adam optimizer to obtain the synthesized speech detection model. The Adam optimizer may be formulated as

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$

$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

where $g_t$ is the partial derivative (gradient) of the loss function with respect to the model parameters $\theta_t$, $m_t$ and $v_t$ are the momentum-form first-order and second-order moment estimates of the gradient, $\hat{m}_t$ and $\hat{v}_t$ are the corresponding bias-corrected estimates, and $\beta_1$ and $\beta_2$ are the exponential decay rates controlling the first-order and second-order moment estimates. The last line is the parameter update formula, and $\eta$ is the learning rate. The optimizer parameters may be: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a weight decay of $10^{-4}$. During training, the learning rate is warmed up to $10^{-3}$ and then decays exponentially, for a total of 100 training epochs.
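For illustration only, a minimal sketch of one training step with these optimizer settings under the assumption of a PyTorch implementation (the placeholder linear model, random features and the stand-in second loss are illustrative only and not part of the embodiment) could look like:

```python
import torch

# Minimal stand-ins so the sketch runs end-to-end; a real system would use the
# WideResNet-22 classifier and the two cross entropy losses described above.
model = torch.nn.Linear(80, 2)                    # placeholder for CNN + linear classification layer
features = torch.randn(4, 80)                     # placeholder batch of acoustic features
labels = torch.tensor([0, 1, 1, 0])               # 0 = real speech, 1 = synthesized speech

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                                      # peak learning rate after warm-up
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
)

optimizer.zero_grad()
loss_in_set = torch.nn.functional.cross_entropy(model(features), labels)
loss_out_of_set = torch.zeros(())                 # stands in for the second cross entropy loss
total_loss = loss_in_set + 0.1 * loss_out_of_set  # weighted summation of the two losses
total_loss.backward()                             # gradient back-propagation
optimizer.step()                                  # Adam parameter update
```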
S18, inputting the acoustic characteristics of the voice signals to be detected into the updated classification model to obtain the detection result of the voice signals to be detected.
In this step, because likelihood values for data outside the first set have been introduced into the updated classification model, the originally high likelihood value of synthesized speech that the model outputs when recognizing such speech signals can be reduced when detecting the speech signal to be detected, which effectively improves the accuracy of detecting synthesized speech. Illustratively, the output of a classification model that has not been trained with the method of this embodiment, for an out-of-set real speech signal, is: a 4% probability of being real speech and a 96% probability of being synthesized speech, which deviates greatly from the real situation, i.e. the accuracy of such a model in detecting synthesized speech is low. After the classification model is trained by the method in the embodiment of the application, an out-of-set speech class is introduced, and the original two-class classification model becomes a three-class model. The classification result of the same out-of-set real speech signal given by the three-class model is: a probability (likelihood value) of 2% of being real speech, 60% of being synthesized speech, and 38% of being out-of-set speech. In this way, the originally high likelihood value of synthesized speech output during speech signal recognition is reduced, and the accuracy of detecting synthesized speech is effectively improved.
According to the synthetic speech detection method provided by the embodiment of the application, the acoustic features of the first speech signal are input into the preset classification model to obtain the likelihood value of real speech and the likelihood value of synthesized speech output by the classification model, from which the first cross entropy loss function can be generated. Posterior distribution features approximating data outside the first set can then be generated, and a second cross entropy loss function generated from them. A total loss function can be generated from the first and second cross entropy loss functions, so that likelihood values of data outside the first set are introduced into the total loss function. The classification model can be updated by training with the total loss function, which reduces the originally high likelihood value of synthesized speech that the updated classification model outputs when recognizing such speech signals, and thus effectively improves the accuracy of detecting synthesized speech.
Optionally, in one embodiment of the present application, the generating the posterior distribution features of the data outside the first set under the current model parameters of the classification model (step S15) may include: randomly sampling from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generating the posterior distribution features of the data outside the first set under the current model parameters of the classification model.
In the embodiment of the application, the Langevin dynamics sampling process has the advantages of fast sampling, high sample quality, a wide range of applicability, and the like. Random sampling can be performed through the Langevin dynamics sampling process, and the posterior distribution features of the data outside the first set can be generated based on the sampling results.
The approximation of the posterior distribution of the out-of-set data can be achieved by a stochastic gradient Langevin dynamics sampling process: an embedded feature buffer is first randomly initialized, and the out-of-set distribution is then approximated by stochastic gradient Langevin dynamics sampling. The Langevin dynamics sampling formula is

$z_{t+1} = z_t + \frac{\alpha}{2} \nabla_{z} \log p_{\theta}(z_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \alpha)$

where $z_t$ is the embedded feature, $t$ is the iteration step index, $\alpha$ is the step size, $p_{\theta}$ is the distribution modeled by the classification network under its current parameters, and $\epsilon$ is Gaussian random noise. The embedded features of the approximated out-of-set data distribution obtained in this way are then fed into the penultimate group of convolution layers of the classification model's convolutional neural network, and the cross entropy loss of the likelihood values output for the approximated out-of-set data features is calculated.
Optionally, in an embodiment of the present application, the generating a second cross entropy loss function based on the posterior distribution feature includes: inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set; and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
In the embodiment of the application, the posterior distribution characteristics obtained in the process are input into a classification model, and the classification model can output a first likelihood value that a first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside a first set. From these three likelihood values, a second cross entropy loss function may be obtained, thus enabling the likelihood values of data outside the first set to be introduced into the second cross entropy loss function.
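For illustration only, a minimal sketch of the second cross entropy loss under the assumption of a PyTorch implementation (the three-class index order and the labeling of the sampled features as the out-of-set class are illustrative assumptions; the embodiment only states that the loss is determined from the three likelihood values) could look like:

```python
import torch
import torch.nn.functional as F

def second_ce_loss(classifier_head, sampled_features: torch.Tensor) -> torch.Tensor:
    """Cross entropy over three classes (0 = real, 1 = synthesized, 2 = out-of-set, by assumption)
    for the posterior-distribution features sampled to approximate data outside the first set."""
    logits = classifier_head(sampled_features)     # (batch, 3) scores -> three likelihood values
    out_of_set_labels = torch.full((sampled_features.shape[0],), 2, dtype=torch.long)
    return F.cross_entropy(logits, out_of_set_labels)

# toy usage with a placeholder three-class linear head and random sampled features
head = torch.nn.Linear(128, 3)
loss2 = second_ce_loss(head, torch.randn(16, 128))
```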
Optionally, in one embodiment of the present application, the extracting the acoustic feature of the first voice signal includes: preprocessing the first voice signal to obtain a preprocessing result; the preprocessing comprises pre-emphasis processing, framing processing and windowing processing; performing time-frequency analysis on the preprocessing result by a short-time Fourier transform method to obtain a time-frequency analysis result; and calculating a logarithmic magnitude spectrum of the obtained time-frequency analysis result, and taking the logarithmic magnitude spectrum as the acoustic characteristic of the first voice signal.
In the embodiment of the application, when the acoustic features of the first speech signal are extracted, pre-emphasis, framing and windowing may be performed on the first speech signal to obtain the corresponding processing result. Illustratively, the pre-emphasis formula may be $y(n) = x(n) - 0.97\,x(n-1)$. The frame length and window length used in the framing and windowing processing may both be 25 ms, and the window function is a Hamming window: $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$, where $N$ is the window length and $n$ is the index of the current sample. Time-frequency analysis is then performed on the processing result by a short-time Fourier transform method to obtain a time-frequency analysis result. The short-time Fourier transform is defined as $X(t, f) = \sum_{\tau} x(\tau)\, h(\tau - t)\, e^{-j 2\pi f \tau}$, where $x(\tau)$ is a single-frame speech signal, $h(\tau - t)$ is the analysis window function, $\tau$ is the offset, $f$ is the frequency, and $t$ is the current analysis instant. Finally, the log-magnitude spectrum of the time-frequency analysis result can be calculated and used as the acoustic feature of the first speech signal.
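For illustration only, a minimal sketch of this feature extraction pipeline under the assumption of a NumPy implementation (the 16 kHz sampling rate and the 10 ms frame shift are illustrative assumptions, while the 25 ms frame length, the 0.97 pre-emphasis coefficient and the Hamming window follow the description above) could look like:

```python
import numpy as np

def log_magnitude_spectrogram(signal: np.ndarray, sample_rate: int = 16000,
                              frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Pre-emphasis, framing, Hamming windowing, STFT, and log-magnitude spectrum."""
    # pre-emphasis: y(n) = x(n) - 0.97 * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)                 # Hamming window of the frame length

    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # windowing followed by a per-frame FFT gives the short-time Fourier transform
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.log(np.abs(spectrum) + 1e-10)        # log-magnitude spectrum as the acoustic feature

# toy usage: one second of random "speech" at 16 kHz
features = log_magnitude_spectrogram(np.random.randn(16000))
```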
Optionally, in an embodiment of the present application, the preprocessing the first voice signal to obtain a preprocessing result may include: performing data enhancement processing on the first voice signal to obtain a first processing result; the data enhancement processing comprises noise adding processing and reverberation processing; and preprocessing the first processing result to obtain a preprocessing result.
In the embodiment of the application, data enhancement processing operations such as noise adding processing, reverberation processing and the like can be performed on the first voice signal to obtain a data enhancement processing result. And then carrying out pre-emphasis processing, framing processing and windowing processing on the data enhancement processing result.
Specifically, the noise adding process may consist of randomly selecting noise from a noise data set, so that the frequency and the signal-to-noise ratio of the selected noise are completely random, adjusting the energy level of the selected noise according to the signal-to-noise-ratio parameter, and then superimposing the energy-adjusted noise on the first speech signal. The reverberation adding process may consist of randomly selecting a reverberation impulse response and a signal-to-noise ratio from a room reverberation data set, adjusting the energy of the reverberation audio according to the signal-to-noise-ratio parameter, and then convolving the reverberation audio with the first speech signal.
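For illustration only, a minimal sketch of these two data enhancement operations under the assumption of a NumPy implementation (the SNR range, clip lengths and the synthetic impulse response are illustrative placeholders for the noise and room-reverberation data sets) could look like:

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale a noise clip to the target signal-to-noise ratio and superimpose it on the speech."""
    noise = np.resize(noise, speech.shape)                       # crude length matching for the sketch
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve the speech with a room impulse response (energy adjustment omitted for brevity)."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]

# toy usage; real noise clips, impulse responses and SNR values would come from the data sets
speech = np.random.randn(16000)
noisy = add_noise(speech, np.random.randn(8000), snr_db=np.random.uniform(5, 20))
augmented = add_reverb(noisy, rir=np.exp(-np.linspace(0, 8, 2000)) * np.random.randn(2000))
```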
In a second aspect, as shown in fig. 2, a synthesized voice detection apparatus 2 provided in an embodiment of the present application may include: a first acquisition module 21, configured to acquire a first speech signal in a first set; the first voice signal comprises a real voice signal and a synthetic voice signal; an extraction module 22 for extracting acoustic features of the first speech signal; a second obtaining module 23, configured to input acoustic features of the first speech signal into a preset classification model, so as to obtain likelihood values of real speech and likelihood values of synthesized speech output by the classification model; a first generation module 24, configured to generate a first cross entropy loss function according to the likelihood value of the real speech and the likelihood value of the synthesized speech; a second generation module 25, configured to generate posterior distribution features approximating the data outside the first set under current model parameters of the classification model, and generate a second cross entropy loss function based on the posterior distribution features; a summation module 26, configured to perform a weighted summation operation on the first cross entropy loss function and the second cross entropy loss function to obtain a total loss function; an updating module 27, configured to perform gradient back-propagation using the total loss function, and update parameters of the classification model to obtain an updated classification model; the third obtaining module 28 is configured to input the acoustic feature of the to-be-detected voice signal into the updated classification model, so as to obtain a detection result of the to-be-detected voice signal.
According to the synthetic speech detection apparatus provided by the embodiment of the application, the acoustic features of the first speech signal are input into the preset classification model to obtain the likelihood value of real speech and the likelihood value of synthesized speech output by the classification model, from which the first cross entropy loss function can be generated. Posterior distribution features approximating data outside the first set can then be generated, and a second cross entropy loss function generated from them. A total loss function can be generated from the first and second cross entropy loss functions, so that likelihood values of data outside the first set are introduced into the total loss function. The classification model can be updated by training with the total loss function, which reduces the originally high likelihood value of synthesized speech that the updated classification model outputs when recognizing such speech signals, and thus effectively improves the accuracy of detecting synthesized speech.
Optionally, in one embodiment of the present application, the second generating module 25 is specifically configured to: randomly sample from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generate the posterior distribution features of the data outside the first set under the current model parameters of the classification model.
Optionally, in one embodiment of the present application, the second generating module 25 is specifically configured to: inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set; and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
Optionally, in one embodiment of the present application, the extracting module 22 includes: the preprocessing sub-module is used for preprocessing the first voice signal to obtain a preprocessing result; the preprocessing comprises pre-emphasis processing, framing processing and windowing processing; the analysis sub-module is used for performing time-frequency analysis on the preprocessing result by a short-time Fourier transform method so as to obtain a time-frequency analysis result; and the calculating sub-module is used for calculating a logarithmic magnitude spectrum of the obtained time-frequency analysis result and taking the logarithmic magnitude spectrum as the acoustic characteristic of the first voice signal.
Optionally, in one embodiment of the present application, the preprocessing submodule includes: the enhancement processing unit is used for carrying out data enhancement processing on the first voice signal so as to obtain a first processing result; the data enhancement processing comprises noise adding processing and reverberation processing; and the preprocessing unit is used for preprocessing the first processing result to obtain a preprocessing result.
In a third aspect, embodiments of the present application further provide an electronic device, which can improve accuracy in detecting synthesized speech.
As shown in fig. 3, an electronic device provided by an embodiment of the present application may include: a shell 51, a processor 52, a memory 53, a circuit board 54 and a power supply circuit 55, wherein the circuit board 54 is arranged in a space enclosed by the shell 51, and the processor 52 and the memory 53 are arranged on the circuit board 54; the power supply circuit 55 is configured to supply power to the respective circuits or devices of the electronic apparatus; the memory 53 is configured to store executable program code; and the processor 52 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 53, so as to perform the synthetic voice detection method provided in any of the foregoing embodiments.
The specific implementation of the above steps by the processor 52 and the further implementation of the steps by the processor 52 through the execution of the executable program code may be referred to the description of the foregoing embodiments, and will not be repeated here.
Such electronic devices exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include: smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, etc.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.
(3) Portable entertainment device: such devices may display and play multimedia content. The device comprises: audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) And (3) a server: the configuration of the server includes a processor, a hard disk, a memory, a system bus, and the like, and the server is similar to a general computer architecture, but is required to provide highly reliable services, and thus has high requirements in terms of processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement any one of the synthetic speech detection methods provided in the foregoing embodiments, so that corresponding technical effects can be achieved, and the foregoing details are not repeated herein.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
For convenience of description, the above apparatus is described as being functionally divided into various units/modules, respectively. Of course, the functions of the various elements/modules may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present application should be included in the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A method for detecting synthesized speech, comprising:
acquiring a first voice signal in a first set; the first voice signal comprises a real voice signal and a synthetic voice signal;
extracting acoustic features of the first speech signal;
inputting the acoustic characteristics of the first voice signal into a preset classification model to obtain the likelihood value of the real voice and the likelihood value of the synthesized voice output by the classification model;
generating a first cross entropy loss function according to the likelihood value of the real voice and the likelihood value of the synthesized voice;
generating posterior distribution characteristics of data outside the first set under current model parameters of the classification model, and generating a second cross entropy loss function based on the posterior distribution characteristics;
performing weighted summation operation on the first cross entropy loss function and the second cross entropy loss function to obtain a total loss function;
carrying out gradient back-propagation by utilizing the total loss function, and carrying out parameter updating on the classification model to obtain an updated classification model;
and inputting the acoustic characteristics of the voice signals to be detected into the updated classification model to obtain the detection result of the voice signals to be detected.
2. The method of claim 1, wherein the generating posterior distribution characteristics approximating the data outside the first set under the current model parameters of the classification model comprises:
randomly sampling from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generating posterior distribution features of the data outside the first set under the current model parameters of the classification model.
3. The method of claim 1, wherein the generating a second cross entropy loss function based on the posterior distribution characteristics comprises:
inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set;
and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
4. The method of claim 1, wherein the extracting acoustic features of the first speech signal comprises:
preprocessing the first voice signal to obtain a preprocessing result; the preprocessing comprises pre-emphasis processing, framing processing and windowing processing;
performing time-frequency analysis on the preprocessing result by a short-time Fourier transform method to obtain a time-frequency analysis result;
and calculating a logarithmic magnitude spectrum of the obtained time-frequency analysis result, and taking the logarithmic magnitude spectrum as the acoustic characteristic of the first voice signal.
5. The method of claim 4, wherein preprocessing the first speech signal to obtain a preprocessing result comprises:
performing data enhancement processing on the first voice signal to obtain a first processing result; the data enhancement processing comprises noise adding processing and reverberation processing;
and preprocessing the first processing result to obtain a preprocessing result.
6. A synthesized speech detection apparatus, comprising:
a first acquisition module for acquiring a first voice signal in a first set; the first voice signal comprises a real voice signal and a synthetic voice signal;
the extraction module is used for extracting the acoustic characteristics of the first voice signal;
the second acquisition module is used for inputting the acoustic characteristics of the first voice signals into a preset classification model so as to obtain the likelihood value of the real voice and the likelihood value of the synthesized voice output by the classification model;
the first generation module is used for generating a first cross entropy loss function according to the likelihood value of the real voice and the likelihood value of the synthesized voice;
a second generation module, configured to generate posterior distribution characteristics of data outside the first set under current model parameters of the classification model, and generate a second cross entropy loss function based on the posterior distribution characteristics;
the summation module is used for carrying out weighted summation operation on the first cross entropy loss function and the second cross entropy loss function so as to obtain a total loss function;
the updating module is used for carrying out gradient back-propagation by utilizing the total loss function and carrying out parameter updating on the classification model so as to obtain an updated classification model;
and the third acquisition module is used for inputting the acoustic characteristics of the voice signal to be detected into the updated classification model so as to obtain the detection result of the voice signal to be detected.
7. The apparatus of claim 6, wherein the second generation module is specifically configured to: randomly sampling from an embedded feature cache through a stochastic gradient Langevin dynamics sampling process and generating posterior distribution features of the data outside the first set under the current model parameters of the classification model.
8. The apparatus of claim 6, wherein the second generation module is specifically configured to:
inputting the posterior distribution characteristics into the classification model to obtain a first likelihood value that the first voice signal is real voice, a second likelihood value that the first voice signal is synthesized voice and a third likelihood value that the first voice signal is voice outside the first set;
and determining a second cross entropy loss function according to the first likelihood value, the second likelihood value and the third likelihood value.
9. An electronic device, the electronic device comprising: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the synthetic speech detection method according to any one of the preceding claims 1-5.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs executable by one or more processors to implement the synthetic speech detection method of any of the preceding claims 1-5.
CN202310820588.4A 2023-07-05 2023-07-05 Synthetic voice detection method and device, electronic equipment and storage medium Pending CN116758943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310820588.4A CN116758943A (en) 2023-07-05 2023-07-05 Synthetic voice detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310820588.4A CN116758943A (en) 2023-07-05 2023-07-05 Synthetic voice detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116758943A true CN116758943A (en) 2023-09-15

Family

ID=87958909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310820588.4A Pending CN116758943A (en) 2023-07-05 2023-07-05 Synthetic voice detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116758943A (en)

Similar Documents

Publication Publication Date Title
CN110706692B (en) Training method and system of child voice recognition model
Zhang et al. Boosting contextual information for deep neural network based voice activity detection
CN105976812B (en) A kind of audio recognition method and its equipment
CN110211575B (en) Voice noise adding method and system for data enhancement
JP6099556B2 (en) Voice identification method and apparatus
CN110600017A (en) Training method of voice processing model, voice recognition method, system and device
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN109346087B (en) Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN112837669B (en) Speech synthesis method, device and server
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN108877783A (en) The method and apparatus for determining the audio types of audio data
CN109346062A (en) Sound end detecting method and device
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN110111798A (en) A kind of method and terminal identifying speaker
Cao et al. Underwater target classification at greater depths using deep neural network with joint multiple‐domain feature
CN111883147B (en) Audio data processing method, device, computer equipment and storage medium
CN110827809A (en) Language identification and classification method based on condition generation type confrontation network
CN116758943A (en) Synthetic voice detection method and device, electronic equipment and storage medium
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN113113048B (en) Speech emotion recognition method and device, computer equipment and medium
CN117795527A (en) Evaluation of output sequences using autoregressive language model neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination