CN116153336B - Synthetic voice detection method based on multi-domain information fusion - Google Patents

Synthetic voice detection method based on multi-domain information fusion

Info

Publication number
CN116153336B
Authority
CN
China
Prior art keywords
training
voice data
classifier
domain
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310415885.0A
Other languages
Chinese (zh)
Other versions
CN116153336A (en)
Inventor
田野
汤跃忠
陈云坤
傅景楠
张晓灿
付泊暘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp, Beijing Zhongdian Huisheng Technology Co ltd
Priority to CN202310415885.0A
Publication of CN116153336A
Application granted
Publication of CN116153336B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a synthetic voice detection method based on multi-domain information fusion, which comprises the following steps: extracting multi-domain acoustic features of a voice signal to be detected; and inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection. The synthetic voice detection model is trained on a training voice data set as follows: the voice data in the training voice data set are decomposed into a voiced segment part, a silent segment part and intrinsic mode components; features are extracted from the voiced segments, the silent segments and the intrinsic mode components respectively and concatenated to form multi-domain acoustic features; the multi-domain acoustic features of the training voice data are taken as the input of a feature fusion device, which is trained; and the trained shallow classifiers and deep classifier output a fused recognition result. According to the embodiments of the application, the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.

Description

Synthetic voice detection method based on multi-domain information fusion
Technical Field
The application relates to the technical field of voice detection, in particular to a synthetic voice detection method based on multi-domain information fusion.
Background
Synthetic voice detection identifies forged, synthesized speech by technical means, thereby discriminating genuine speech from fake speech. At present, speech is mainly synthesized through speech synthesis technology and voice conversion technology: speech synthesis generates speech from text, while voice conversion transforms one person's speech into that of a specific target speaker. In recent years, with the development of artificial intelligence, the naturalness and similarity of synthesized speech have improved rapidly, making it increasingly confusable for detection systems. In addition, speech synthesis techniques evolve quickly and are updated frequently, while most synthetic voice detection approaches are supervised learning methods; ensuring the generalization capability of the detection model in practical applications is therefore a key direction of current research on synthetic voice detection.
Research on synthetic speech detection generally follows two technical routes: in the first, a front end extracts acoustic features and a back end trains a classifier; in the second, an end-to-end classification network takes the speech signal directly as input. In the first route, commonly used acoustic features include Mel-frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC) and constant-Q cepstral coefficients (CQCC), among which LFCC features stand out in the synthetic speech detection task; for the back-end classifier, Gaussian mixture models and various neural network models (such as convolutional neural networks, long short-term memory networks and residual neural networks) are most commonly used. Different features and different classifiers each have their own characteristics, so improving model performance through feature fusion and decision fusion is a feasible technical route; the key is to design an effective fusion strategy that fully exploits the complementarity of features and classifiers while controlling redundant information as much as possible.
Patent CN113488073A provides a fake voice detection method based on multi-feature fusion: multiple acoustic features, such as fundamental frequency, Mel cepstral coefficients, aperiodic components, Mel spectrum, energy spectrum, frequency spectrum, linear prediction coefficients and linear prediction cepstral coefficients, are extracted from the speech signal and fused through feature scaling and a feature balance matrix to obtain fused features. This method offers one approach to feature fusion, but the selected acoustic features are clearly redundant and lack a complementary design.
In addition, in practical applications the acoustic conditions of the training data often deviate from those of the speech collected in the field, which degrades the performance of the detection model in real scenarios.
Disclosure of Invention
The embodiments of the application provide a synthetic voice detection method based on multi-domain information fusion, which comprehensively improves the detection capability and generalization capability of a synthetic voice detection model through multiple means.
The embodiment of the application provides a synthetic voice detection method based on multi-domain information fusion, which comprises the following steps:
acquiring a voice signal to be detected, and extracting multi-domain acoustic characteristics of the voice signal to be detected;
inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection, wherein the synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers, and is trained on a training voice data set in the following manner:
dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain, decomposing the voice data into intrinsic mode components in the time-frequency domain, extracting features from the voiced segments, the silent segments and the intrinsic mode components respectively, and concatenating the extracted features to form multi-domain acoustic features;
taking the multi-domain acoustic features of the voice data of the training voice data set as the input of the feature fusion device, and training it to obtain weight coefficients for the multi-domain acoustic features;
taking the output of the feature fusion device as the input of a deep classifier, training the deep classifier, calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating the training; and,
taking the output of the feature fusion device as the input of each shallow classifier, and training the shallow classifiers;
the trained shallow classifiers and deep classifier are used to output a fused recognition result.
Optionally, the method further comprises:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set.
Optionally, dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain;
decomposing the voice data into intrinsic mode components in the time-frequency domain comprises:
decomposing the speech with the variational mode decomposition (VMD) method in the time-frequency domain to obtain M intrinsic mode components;
and extracting features from the voiced segments, the silent segments and the intrinsic mode components of the voice data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain.
Optionally, the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence.
Optionally, the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network.
Optionally, outputting the fused recognition result based on the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
Optionally, the process of training the synthetic speech detection model further includes:
the method comprises the steps of checking whether the performance index of a currently trained synthetic voice detection model meets requirements according to preset indexes of accuracy, precision and recall, wherein the accuracy is defined as the ratio of the number of correctly recognized samples in a training voice data set to the total number of test samples, the precision is defined as the proportion of the actual positive samples in the samples recognized as positive in the training voice data set, and the recall is defined as the proportion of the actual positive samples recognized as positive in the training voice data set.
The embodiments of the application also provide a synthetic voice detection device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The synthetic voice detection model designed in the embodiments of the application comprises a feature fusion device, a deep classifier and at least two shallow classifiers, so that the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.
The foregoing is only an overview of the technical solutions of the present application, which may be implemented according to the content of the specification. To make the technical means of the present application clearer, and to make the above and other objects, features and advantages of the present application easier to understand, a detailed description of the present application is given below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a basic flow chart of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 2 is an overall flowchart of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 3 is an example of the composition of algorithm modules of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 4 is an example of a detection model training process of the synthetic speech detection method according to the embodiment of the present application;
FIG. 5 is a structural example of a feature fusion device of the synthetic speech detection method according to the embodiment of the present application;
FIG. 6 is an example of a shallow classifier training process of the synthetic speech detection method according to the embodiment of the present application;
FIG. 7 is a decision flow example of a decision fusion device of the synthetic speech detection method according to the embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the application provide a synthetic voice detection method based on multi-domain information fusion, as shown in FIG. 1, comprising the following steps:
A voice signal to be detected is acquired and its multi-domain acoustic features are extracted. In some examples, the multi-domain acoustic features of the voice signal to be detected may be extracted as follows:
In step S101, the speech to be detected is decomposed: it is divided into voiced segments and silent segments in the time domain, and into M intrinsic mode components in the time-frequency domain. In step S102, features are extracted from the speech to be detected: short-time energy and zero-crossing-rate features of the silent segments in the time domain, Mel cepstral coefficient (MFCC) features of the voiced segments in the frequency domain, and linear frequency cepstral coefficient (LFCC) features of the mode components in the time-frequency domain; the extracted features are concatenated to form the multi-domain acoustic features.
The extracted multi-domain acoustic features are then input into the synthetic voice detection model to complete the detection. The synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers. As shown in FIG. 1 and FIG. 2, in step S103 the multi-domain acoustic features of the speech to be detected are input to the pre-trained feature fusion device, which outputs the multi-domain fusion features of the speech to be detected.
In step S104, the multi-domain fusion features of the speech to be detected are input to the pre-trained shallow classifiers and deep classifier, and the recognition results of the shallow classifiers and the deep classifier are output.
The recognition results of the shallow classifiers and the deep classifier are then input to the decision fusion device, which outputs the genuine-or-fake recognition result of the speech to be detected. As shown in FIG. 3, this can be implemented by configuring corresponding operation modules based on the detection method of the present application.
As shown in FIG. 4, the synthetic voice detection model is trained on the training voice data set in the following manner:
In the time domain, the voice data in the training voice data set are divided into voiced segments and silent segments; in the time-frequency domain, the voice data are decomposed into intrinsic mode components; features are extracted from the voiced segments, the silent segments and the intrinsic mode components respectively, and the extracted features are concatenated to form the multi-domain acoustic features.
The multi-domain acoustic features of the training voice data are taken as the input of the feature fusion device, which is trained to obtain weight coefficients for the multi-domain acoustic features.
The output of the feature fusion device is taken as the input of a deep classifier; the deep classifier is trained by calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating until the loss value meets a preset condition.
In addition, the output of the feature fusion device is taken as the input of each shallow classifier to train the shallow classifiers: the multi-domain fusion features of the training voice data set are obtained with the available feature fusion device and used as training input of the shallow classifier models, yielding the available shallow classifiers.
The trained shallow classifiers and deep classifier are used to output the fused recognition result.
The synthetic voice detection model designed in the embodiments of the application comprises a feature fusion device, a deep classifier and at least two shallow classifiers, so that the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.
In some embodiments, the method further comprises:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set. For example, the variation of the training voice can be increased by adjusting the speaking rate and by compressing the format, bit rate, coding scheme or transmission protocol, thereby improving the generalization capability of the model in practical applications.
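A minimal sketch of such augmentation is given below, assuming librosa and soundfile for speed perturbation and an ffmpeg binary on PATH for lossy re-encoding; the rates, codec and bit rates are illustrative choices, not the application's settings.

```python
# Illustrative data augmentation for expanding the training voice data set:
# speed perturbation plus a lossy compression round trip. Rates, codec and
# bit rates are illustrative assumptions.
import subprocess
import librosa
import soundfile as sf

def augment(path, out_prefix, rates=(0.9, 1.1), bitrates=("16k", "32k")):
    y, sr = librosa.load(path, sr=None)
    outputs = []
    # Speaking-rate (tempo) perturbation.
    for r in rates:
        y_stretched = librosa.effects.time_stretch(y, rate=r)
        out = f"{out_prefix}_speed{r}.wav"
        sf.write(out, y_stretched, sr)
        outputs.append(out)
    # Lossy re-encoding to simulate format / bit-rate / transmission artifacts.
    for br in bitrates:
        mp3 = f"{out_prefix}_{br}.mp3"
        subprocess.run(["ffmpeg", "-y", "-i", path, "-b:a", br, mp3], check=True)
        outputs.append(mp3)
    return outputs
```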
In some embodiments, separating the voiced and silent segments of the voice data in the training voice data set in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain.
Decomposing the voice data into intrinsic mode components in the time-frequency domain comprises: decomposing the speech with the variational mode decomposition (VMD) method to obtain M intrinsic mode components. VMD is a time-frequency signal analysis method that assumes all components of the signal are narrowband signals concentrated near their respective center frequencies, and estimates the components' center frequencies and reconstructs the components by solving an optimization problem. The number of mode components M may be determined by comparing the distributions of the center frequencies under different mode counts.
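One way to choose M by comparing center-frequency distributions is sketched below, under the assumption that the vmdpy package provides the VMD implementation: the mode count is increased until two estimated center frequencies nearly coincide, which is commonly read as a sign of over-decomposition. The candidate range and the closeness tolerance are illustrative assumptions.

```python
# Sketch: choose the number of VMD modes M by comparing center-frequency
# distributions for different mode counts. Assumes the vmdpy package; the
# candidate range and closeness tolerance are illustrative assumptions.
import numpy as np
from vmdpy import VMD

def choose_mode_count(signal, k_candidates=(2, 3, 4, 5, 6, 7, 8), min_gap=0.02):
    chosen = k_candidates[0]
    for k in k_candidates:
        _, _, omega = VMD(signal, alpha=2000, tau=0.0, K=k, DC=0, init=1, tol=1e-7)
        centers = np.sort(omega[-1])      # final normalized center frequencies
        if np.any(np.diff(centers) < min_gap):
            break                         # two modes nearly coincide: over-decomposed
        chosen = k
    return chosen
```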
Extracting features from the voiced segments, the silent segments and the intrinsic mode components of the speech data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain; the features are then concatenated to obtain the multi-domain acoustic features.
In some embodiments, as shown in FIG. 5, the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence. The weight coefficients of the multi-domain acoustic features are obtained through training and used to weight the multi-domain acoustic features, yielding the multi-domain fusion features.
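A minimal PyTorch sketch of a fusion module with this layout is given below; treating the concatenated multi-domain features as 64 channels over time frames is an assumption made for illustration, not a configuration stated by the application.

```python
# Sketch of a feature fusion device with the described layout:
# global pooling -> fully connected -> ReLU -> sigmoid, whose output weights the
# multi-domain features. The channel count of 64 is an illustrative assumption.
import torch
import torch.nn as nn

class FeatureFuser(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # global pooling over the time axis
        self.fc = nn.Linear(n_channels, n_channels)
        self.relu = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()              # weight coefficients in (0, 1)

    def forward(self, x):                     # x: (batch, channels, frames)
        w = self.gate(self.relu(self.fc(self.pool(x).squeeze(-1))))
        return x * w.unsqueeze(-1)            # weighted multi-domain fusion features

# Example: a batch of 8 utterances, 64 feature channels, 200 frames.
fused = FeatureFuser(64)(torch.randn(8, 64, 200))
```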
In some embodiments, the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network, such as SE-ResNet34, SE-Res2Net or ECAPA-TDNN.
Multiple rounds of training are performed on the training data set; the performance of the deep classifier obtained in each round is tested on a test voice data set, and when the performance meets the requirement, the corresponding feature fusion device and deep classifier are stored as the available feature fusion device and deep classifier.
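A condensed PyTorch training loop consistent with this procedure might look as follows; the stand-in classifier body, the optimizer settings, and the train_loader, test_loader and evaluate helpers are assumptions of the example, and FeatureFuser refers to the sketch above.

```python
# Sketch: joint training of the feature fusion device and a deep classifier with
# cross-entropy loss over multiple rounds, keeping the checkpoint whose test
# performance is best. Classifier body, optimizer settings and the data/eval
# helpers are illustrative assumptions, not the application's configuration.
import torch
import torch.nn as nn

fuser = FeatureFuser(64)                              # from the sketch above
classifier = nn.Sequential(                           # stand-in deep classifier
    nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(fuser.parameters()) + list(classifier.parameters()), lr=1e-3)

best_acc = 0.0
for epoch in range(50):                               # multiple rounds of training
    for feats, labels in train_loader:                # assumed loader: (batch, 64, frames)
        loss = criterion(classifier(fuser(feats)), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    acc = evaluate(fuser, classifier, test_loader)    # assumed helper: accuracy on test voice set
    if acc > best_acc:                                # performance meets requirement: store models
        best_acc = acc
        torch.save({"fuser": fuser.state_dict(), "classifier": classifier.state_dict()},
                   "detector_best.pt")
```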
The output of the feature fusion device is also taken as the input of each shallow classifier to train the shallow classifiers. As shown in FIG. 6, in some specific examples the shallow classifiers may be a Gaussian mixture model (GMM), a support vector machine (SVM), or the like.
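Assuming the fused features are pooled into one fixed-length vector per utterance, the shallow classifiers could be trained roughly as below with scikit-learn; a per-class GMM pair scored by log-likelihood ratio and a probability-calibrated SVM are common choices, but the component counts and kernel settings here are illustrative.

```python
# Sketch: training shallow classifiers (GMM and SVM) on the fused features.
# X: (n_utterances, feature_dim) pooled fusion features, y: 0 = fake, 1 = genuine.
# Component counts and kernel settings are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_shallow(X, y, n_components=32):
    gmm_real = GaussianMixture(n_components=n_components, covariance_type="diag").fit(X[y == 1])
    gmm_fake = GaussianMixture(n_components=n_components, covariance_type="diag").fit(X[y == 0])
    svm = SVC(kernel="rbf", probability=True).fit(X, y)   # probabilities enable decision fusion
    return gmm_real, gmm_fake, svm

def gmm_predict(gmm_real, gmm_fake, X):
    # Log-likelihood ratio turned into a label and a pseudo-probability pair.
    llr = gmm_real.score_samples(X) - gmm_fake.score_samples(X)
    p_real = 1.0 / (1.0 + np.exp(-llr))
    return (llr > 0).astype(int), np.stack([1 - p_real, p_real], axis=1)
```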
In some embodiments, the process of training the synthetic speech detection model further comprises:
the method comprises the steps of checking whether the performance index of a currently trained synthetic voice detection model meets requirements according to preset indexes of accuracy, precision and recall, wherein the accuracy is defined as the ratio of the number of correctly recognized samples in a training voice data set to the total number of test samples, the precision is defined as the proportion of the actual positive samples in the samples recognized as positive in the training voice data set, and the recall is defined as the proportion of the actual positive samples recognized as positive in the training voice data set.
In some embodiments, outputting the fused recognition result based on the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
As shown in FIG. 7, in some specific examples the multi-domain fusion features of the speech to be detected are respectively input to the at least two shallow classifiers and the deep classifier, and the recognition results of the shallow classifiers and the deep classifier are output.
The recognition results of the shallow classifiers and the deep classifier are input to a decision fusion device to obtain the genuine-or-fake recognition result of the speech to be detected. An exemplary decision rule of the decision fusion device is as follows: the label predicted by the GMM shallow classifier is A with probability pair (a1, a2); the label predicted by the SVM shallow classifier is B with probability pair (b1, b2); the label predicted by the deep classifier is C with probability pair (c1, c2). Each label takes the value true or false, and each probability is normalized between 0 and 1. The final recognition result of the decision fusion device is decided by two aspects together:
first, the mode (majority value) of the labels predicted by the three classifiers is taken as recognition result 1;
then, the probability pairs of the three classifiers are averaged, (P1, P2) = ((a1 + b1 + c1)/3, (a2 + b2 + c2)/3), and taken as recognition result 2;
finally, recognition result 1 and recognition result 2 are combined to obtain the final recognition result. If the maximum value Pmax in recognition result 2 is not lower than a preset threshold (for example 0.6), the fused decision is the label corresponding to the column containing Pmax; if Pmax is lower than the preset threshold of 0.6, the fused decision is recognition result 1.
This is further illustrated as follows:
suppose the first column of recognition result 2 corresponds to the label true and the second column to false. When P1 ≥ 0.6, the final output of the decision fusion device is true; when P2 ≥ 0.6, the final output is false; in these cases recognition result 1 is not considered.
When both P1 and P2 are lower than 0.6, the final output of the decision fusion device is true if recognition result 1 is true, and false if recognition result 1 is false.
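The decision rule above can be written compactly as follows; the 0.6 threshold and the two-column probability layout follow the example, while the function and variable names are illustrative.

```python
# Sketch of the described decision fusion rule: average the classifiers'
# probability pairs; if the largest averaged probability reaches the threshold,
# output its label, otherwise fall back to the majority (mode) of the labels.
from collections import Counter
import numpy as np

def fuse_decisions(labels, probs, columns=("true", "false"), threshold=0.6):
    # labels: e.g. ["true", "false", "true"]; probs: list of (p_true, p_false) pairs.
    majority = Counter(labels).most_common(1)[0][0]     # recognition result 1 (mode)
    mean_probs = np.mean(np.asarray(probs), axis=0)     # recognition result 2
    if mean_probs.max() >= threshold:
        return columns[int(np.argmax(mean_probs))]
    return majority

# Worked example: GMM, SVM and deep classifier outputs; averaged (0.63, 0.37) -> "true".
print(fuse_decisions(["true", "true", "false"], [(0.7, 0.3), (0.8, 0.2), (0.4, 0.6)]))
```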
The method of the embodiments of the application enriches the acoustic scenes of the training data set with data augmentation at the signal level; extracts time-domain, frequency-domain and time-frequency-domain features at the feature level and obtains the multi-domain fusion features by learning multi-domain feature weights with an attention network; and fuses the recognition results of the shallow classifiers and the deep classifier at the decision level. The detection capability and generalization capability of the synthetic voice detection model are thus comprehensively improved through multiple means.
The embodiments of the application also provide a synthetic voice detection device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disk) and comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many variations may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of protection of the claims, and all such variations fall within the protection of the present application.

Claims (7)

1. A synthetic voice detection method based on multi-domain information fusion, characterized by comprising the following steps:
acquiring a voice signal to be detected, and extracting multi-domain acoustic characteristics of the voice signal to be detected;
inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection, wherein the synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers, and is trained on a training voice data set in the following manner:
dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain, decomposing the voice data into intrinsic mode components in the time-frequency domain, extracting features from the voiced segments, the silent segments and the intrinsic mode components respectively, and concatenating the extracted features to form multi-domain acoustic features;
taking the multi-domain acoustic features of the voice data of the training voice data set as the input of the feature fusion device, and training it to obtain weight coefficients for the multi-domain acoustic features;
taking the output of the feature fusion device as the input of a deep classifier, training the deep classifier, calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating the training; and,
taking the output of the feature fusion device as the input of each shallow classifier, and training the shallow classifiers;
the trained shallow classifiers and deep classifier being used to output a fused recognition result;
wherein dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain;
decomposing the voice data into intrinsic mode components in the time-frequency domain comprises:
decomposing the speech with a variational mode decomposition method in the time-frequency domain to obtain M intrinsic mode components;
extracting features from the voiced segments, the silent segments and the intrinsic mode components of the voice data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain;
and outputting the fused recognition result with the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
2. The synthetic voice detection method based on multi-domain information fusion according to claim 1, further comprising:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set.
3. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence.
4. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network.
5. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the process of training the synthetic voice detection model further comprises:
checking whether the performance of the currently trained synthetic voice detection model meets requirements against preset accuracy, precision and recall indexes, wherein accuracy is defined as the ratio of the number of correctly recognized samples to the total number of test samples, precision is defined as the proportion of samples recognized as positive that are actually positive, and recall is defined as the proportion of actually positive samples that are recognized as positive.
6. A synthetic voice detection device, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the synthetic voice detection method based on multi-domain information fusion according to any one of claims 1 to 5.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the synthetic voice detection method based on multi-domain information fusion according to any one of claims 1 to 5.
CN202310415885.0A 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion Active CN116153336B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310415885.0A CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310415885.0A CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Publications (2)

Publication Number, Publication Date
CN116153336A (en), 2023-05-23
CN116153336B (en), 2023-07-21

Family

ID=86360373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310415885.0A Active CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Country Status (1)

Country Link
CN (1) CN116153336B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433403A (en) * 2023-06-14 2023-07-14 国网安徽省电力有限公司营销服务中心 Account tracking-based electric enterprise accounts receivable early warning method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic
CN113450806B (en) * 2021-05-18 2022-08-05 合肥讯飞数码科技有限公司 Training method of voice detection model, and related method, device and equipment
CN113488073B (en) * 2021-07-06 2023-11-24 浙江工业大学 Fake voice detection method and device based on multi-feature fusion
CN113284513B (en) * 2021-07-26 2021-10-15 中国科学院自动化研究所 Method and device for detecting false voice based on phoneme duration characteristics
CN114566170A (en) * 2022-03-01 2022-05-31 北京邮电大学 Lightweight voice spoofing detection algorithm based on class-one classification
CN114495990A (en) * 2022-03-07 2022-05-13 浙江工业大学 Speech emotion recognition method based on feature fusion
CN114596879B (en) * 2022-03-25 2022-12-30 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
CN114495950A (en) * 2022-04-01 2022-05-13 杭州电子科技大学 Voice deception detection method based on deep residual shrinkage network

Also Published As

Publication number Publication date
CN116153336A (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant