CN116153336B - Synthetic voice detection method based on multi-domain information fusion - Google Patents

Synthetic voice detection method based on multi-domain information fusion

Info

Publication number
CN116153336B
Authority
CN
China
Prior art keywords
training
voice data
classifier
domain
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310415885.0A
Other languages
Chinese (zh)
Other versions
CN116153336A (en)
Inventor
田野
汤跃忠
陈云坤
傅景楠
张晓灿
付泊暘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp, Beijing Zhongdian Huisheng Technology Co ltd
Priority to CN202310415885.0A
Publication of CN116153336A
Application granted
Publication of CN116153336B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a synthetic voice detection method based on multi-domain information fusion, which comprises the following steps: extracting multi-domain acoustic features of a voice signal to be detected; and inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection. The synthetic voice detection model is trained on a training voice data set as follows: the voice data in the training voice data set are decomposed into a voiced segment part, a silent segment part and intrinsic mode components; features are extracted from the voiced segments, the silent segments and the intrinsic mode components respectively and concatenated to form multi-domain acoustic features; the multi-domain acoustic features of the training voice data are taken as the input of a feature fusion device, which is trained; and the trained shallow classifiers and deep classifier output a fused recognition result. According to the embodiments of the application, the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.

Description

Synthetic voice detection method based on multi-domain information fusion
Technical Field
The application relates to the technical field of voice detection, in particular to a synthetic voice detection method based on multi-domain information fusion.
Background
Synthetic voice detection identifies forged, synthesized speech by technical means, thereby discriminating genuine speech from fake speech. At present, speech is mainly synthesized through speech synthesis technology and voice conversion technology: speech synthesis generates speech from text, while voice conversion transforms one person's speech into that of a specific target speaker. In recent years, with the development of artificial intelligence, the naturalness and similarity of synthesized speech have improved rapidly, making it increasingly confusable for detection systems. In addition, speech synthesis techniques evolve quickly and are updated frequently, while most synthetic voice detection approaches are supervised learning methods; ensuring the generalization capability of the detection model in practical applications is therefore a key direction of current research on synthetic voice detection.
Research on synthetic speech detection generally follows two technical routes: in the first, a front end extracts acoustic features and a back end trains a classifier; in the second, an end-to-end classification network takes the speech signal directly as input. In the first route, commonly used acoustic features include Mel-frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC) and constant-Q cepstral coefficients (CQCC), among which LFCC features stand out in the synthetic speech detection task; for the back-end classifier, Gaussian mixture models and various neural network models (such as convolutional neural networks, long short-term memory networks and residual neural networks) are most commonly used. Different features and different classifiers each have their own characteristics, so improving model performance through feature fusion and decision fusion is a feasible technical route; the key is to design an effective fusion strategy that fully exploits the complementarity of features and classifiers while controlling redundant information as much as possible.
Patent CN113488073A provides a fake voice detection method based on multi-feature fusion: multiple acoustic features, such as fundamental frequency, Mel cepstral coefficients, aperiodic components, Mel spectrum, energy spectrum, frequency spectrum, linear prediction coefficients and linear prediction cepstral coefficients, are extracted from the speech signal and fused through feature scaling and a feature balance matrix to obtain fused features. This method offers one approach to feature fusion, but the selected acoustic features are clearly redundant and lack a complementary design.
In addition, in practical applications the acoustic conditions of the training data often deviate from those of the speech collected in the field, which degrades the performance of the detection model in real scenarios.
Disclosure of Invention
The embodiments of the application provide a synthetic voice detection method based on multi-domain information fusion, which comprehensively improves the detection capability and generalization capability of a synthetic voice detection model through multiple means.
The embodiment of the application provides a synthetic voice detection method based on multi-domain information fusion, which comprises the following steps:
acquiring a voice signal to be detected, and extracting multi-domain acoustic characteristics of the voice signal to be detected;
inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection, wherein the synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers, and is trained on a training voice data set in the following manner:
dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain, decomposing the voice data into intrinsic mode components in the time-frequency domain, extracting features from the voiced segments, the silent segments and the intrinsic mode components respectively, and concatenating the extracted features to form multi-domain acoustic features;
taking the multi-domain acoustic features of the voice data of the training voice data set as the input of the feature fusion device, and training it to obtain weight coefficients for the multi-domain acoustic features;
taking the output of the feature fusion device as the input of a deep classifier, training the deep classifier, calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating the training; and,
taking the output of the feature fusion device as the input of each shallow classifier, and training the shallow classifiers;
the trained shallow classifiers and deep classifier are used to output a fused recognition result.
Optionally, the method further comprises:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set.
Optionally, dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain;
decomposing the voice data into intrinsic mode components in the time-frequency domain comprises:
decomposing the speech with the variational mode decomposition (VMD) method in the time-frequency domain to obtain M intrinsic mode components;
and extracting features from the voiced segments, the silent segments and the intrinsic mode components of the voice data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain.
Optionally, the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence.
Optionally, the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network.
Optionally, outputting the fused recognition result based on the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
Optionally, the process of training the synthetic speech detection model further includes:
the method comprises the steps of checking whether the performance index of a currently trained synthetic voice detection model meets requirements according to preset indexes of accuracy, precision and recall, wherein the accuracy is defined as the ratio of the number of correctly recognized samples in a training voice data set to the total number of test samples, the precision is defined as the proportion of the actual positive samples in the samples recognized as positive in the training voice data set, and the recall is defined as the proportion of the actual positive samples recognized as positive in the training voice data set.
The embodiments of the application also provide a synthetic voice detection device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The synthetic voice detection model designed in the embodiments of the application comprises a feature fusion device, a deep classifier and at least two shallow classifiers, so that the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.
The foregoing is only an overview of the technical solutions of the present application, which may be implemented according to the content of the specification. To make the technical means of the present application clearer, and to make the above and other objects, features and advantages of the present application easier to understand, a detailed description of the present application is given below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a basic flow chart of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 2 is an overall flowchart of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 3 is an example of the composition of algorithm modules of a method for detecting synthesized speech according to an embodiment of the present application;
FIG. 4 is an example of a detection model training process of the synthetic speech detection method according to the embodiment of the present application;
FIG. 5 is a structural example of a feature fusion device of the synthetic speech detection method according to the embodiment of the present application;
FIG. 6 is an example of a shallow classifier training process of the synthetic speech detection method according to the embodiment of the present application;
FIG. 7 is a decision flow example of a decision fusion device of the synthetic speech detection method according to the embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the application provide a synthetic voice detection method based on multi-domain information fusion, as shown in FIG. 1, comprising the following steps:
A voice signal to be detected is acquired and its multi-domain acoustic features are extracted. In some examples, the multi-domain acoustic features of the voice signal to be detected may be extracted as follows:
In step S101, the speech to be detected is decomposed: it is divided into voiced segments and silent segments in the time domain, and into M intrinsic mode components in the time-frequency domain. In step S102, features are extracted from the speech to be detected: short-time energy and zero-crossing-rate features of the silent segments in the time domain, Mel cepstral coefficient (MFCC) features of the voiced segments in the frequency domain, and linear frequency cepstral coefficient (LFCC) features of the mode components in the time-frequency domain; the extracted features are concatenated to form the multi-domain acoustic features.
The extracted multi-domain acoustic features are then input into the synthetic voice detection model to complete the detection. The synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers. As shown in FIG. 1 and FIG. 2, in step S103 the multi-domain acoustic features of the speech to be detected are input to the pre-trained feature fusion device, which outputs the multi-domain fusion features of the speech to be detected.
In step S104, the multi-domain fusion features of the speech to be detected are input to the pre-trained shallow classifiers and deep classifier, and the recognition results of the shallow classifiers and the deep classifier are output.
The recognition results of the shallow classifiers and the deep classifier are then input to the decision fusion device, which outputs the genuine-or-fake recognition result of the speech to be detected. As shown in FIG. 3, this can be implemented by configuring corresponding operation modules based on the detection method of the present application.
As shown in FIG. 4, the synthetic voice detection model is trained on the training voice data set in the following manner:
In the time domain, the voice data in the training voice data set are divided into voiced segments and silent segments; in the time-frequency domain, the voice data are decomposed into intrinsic mode components; features are extracted from the voiced segments, the silent segments and the intrinsic mode components respectively, and the extracted features are concatenated to form the multi-domain acoustic features.
The multi-domain acoustic features of the training voice data are taken as the input of the feature fusion device, which is trained to obtain weight coefficients for the multi-domain acoustic features.
The output of the feature fusion device is taken as the input of a deep classifier; the deep classifier is trained by calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating until the loss value meets a preset condition.
In addition, the output of the feature fusion device is taken as the input of each shallow classifier to train the shallow classifiers: the multi-domain fusion features of the training voice data set are obtained with the available feature fusion device and used as training input of the shallow classifier models, yielding the available shallow classifiers.
The trained shallow classifiers and deep classifier are used to output the fused recognition result.
The synthetic voice detection model designed in the embodiments of the application comprises a feature fusion device, a deep classifier and at least two shallow classifiers, so that the detection capability and generalization capability of the synthetic voice detection model are comprehensively improved through multiple means.
In some embodiments, the method further comprises:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set. For example, the variation of the training voice can be increased by adjusting the speaking rate and by compressing the format, bit rate, coding scheme or transmission protocol, thereby improving the generalization capability of the model in practical applications.
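A minimal sketch of such augmentation is given below, assuming librosa and soundfile for speed perturbation and an ffmpeg binary on PATH for lossy re-encoding; the rates, codec and bit rates are illustrative choices, not the application's settings.

```python
# Illustrative data augmentation for expanding the training voice data set:
# speed perturbation plus a lossy compression round trip. Rates, codec and
# bit rates are illustrative assumptions.
import subprocess
import librosa
import soundfile as sf

def augment(path, out_prefix, rates=(0.9, 1.1), bitrates=("16k", "32k")):
    y, sr = librosa.load(path, sr=None)
    outputs = []
    # Speaking-rate (tempo) perturbation.
    for r in rates:
        y_stretched = librosa.effects.time_stretch(y, rate=r)
        out = f"{out_prefix}_speed{r}.wav"
        sf.write(out, y_stretched, sr)
        outputs.append(out)
    # Lossy re-encoding to simulate format / bit-rate / transmission artifacts.
    for br in bitrates:
        mp3 = f"{out_prefix}_{br}.mp3"
        subprocess.run(["ffmpeg", "-y", "-i", path, "-b:a", br, mp3], check=True)
        outputs.append(mp3)
    return outputs
```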
In some embodiments, separating the voiced and silent segments of the voice data in the training voice data set in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain.
Decomposing the voice data into intrinsic mode components in the time-frequency domain comprises: decomposing the speech with the variational mode decomposition (VMD) method to obtain M intrinsic mode components. VMD is a time-frequency signal analysis method that assumes all components of the signal are narrowband signals concentrated near their respective center frequencies, and estimates the components' center frequencies and reconstructs the components by solving an optimization problem. The number of mode components M may be determined by comparing the distributions of the center frequencies under different mode counts.
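One way to choose M by comparing center-frequency distributions is sketched below, under the assumption that the vmdpy package provides the VMD implementation: the mode count is increased until two estimated center frequencies nearly coincide, which is commonly read as a sign of over-decomposition. The candidate range and the closeness tolerance are illustrative assumptions.

```python
# Sketch: choose the number of VMD modes M by comparing center-frequency
# distributions for different mode counts. Assumes the vmdpy package; the
# candidate range and closeness tolerance are illustrative assumptions.
import numpy as np
from vmdpy import VMD

def choose_mode_count(signal, k_candidates=(2, 3, 4, 5, 6, 7, 8), min_gap=0.02):
    chosen = k_candidates[0]
    for k in k_candidates:
        _, _, omega = VMD(signal, alpha=2000, tau=0.0, K=k, DC=0, init=1, tol=1e-7)
        centers = np.sort(omega[-1])      # final normalized center frequencies
        if np.any(np.diff(centers) < min_gap):
            break                         # two modes nearly coincide: over-decomposed
        chosen = k
    return chosen
```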
Extracting features from the voiced segments, the silent segments and the intrinsic mode components of the speech data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain; the features are then concatenated to obtain the multi-domain acoustic features.
In some embodiments, as shown in FIG. 5, the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence. The weight coefficients of the multi-domain acoustic features are obtained through training and used to weight the multi-domain acoustic features, yielding the multi-domain fusion features.
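A minimal PyTorch sketch of a fusion module with this layout is given below; treating the concatenated multi-domain features as 64 channels over time frames is an assumption made for illustration, not a configuration stated by the application.

```python
# Sketch of a feature fusion device with the described layout:
# global pooling -> fully connected -> ReLU -> sigmoid, whose output weights the
# multi-domain features. The channel count of 64 is an illustrative assumption.
import torch
import torch.nn as nn

class FeatureFuser(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # global pooling over the time axis
        self.fc = nn.Linear(n_channels, n_channels)
        self.relu = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()              # weight coefficients in (0, 1)

    def forward(self, x):                     # x: (batch, channels, frames)
        w = self.gate(self.relu(self.fc(self.pool(x).squeeze(-1))))
        return x * w.unsqueeze(-1)            # weighted multi-domain fusion features

# Example: a batch of 8 utterances, 64 feature channels, 200 frames.
fused = FeatureFuser(64)(torch.randn(8, 64, 200))
```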
In some embodiments, the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network, such as SE-ResNet34, SE-Res2Net or ECAPA-TDNN.
Multiple rounds of training are performed on the training data set; the performance of the deep classifier obtained in each round is tested on a test voice data set, and when the performance meets the requirement, the corresponding feature fusion device and deep classifier are stored as the available feature fusion device and deep classifier.
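A condensed PyTorch training loop consistent with this procedure might look as follows; the stand-in classifier body, the optimizer settings, and the train_loader, test_loader and evaluate helpers are assumptions of the example, and FeatureFuser refers to the sketch above.

```python
# Sketch: joint training of the feature fusion device and a deep classifier with
# cross-entropy loss over multiple rounds, keeping the checkpoint whose test
# performance is best. Classifier body, optimizer settings and the data/eval
# helpers are illustrative assumptions, not the application's configuration.
import torch
import torch.nn as nn

fuser = FeatureFuser(64)                              # from the sketch above
classifier = nn.Sequential(                           # stand-in deep classifier
    nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(fuser.parameters()) + list(classifier.parameters()), lr=1e-3)

best_acc = 0.0
for epoch in range(50):                               # multiple rounds of training
    for feats, labels in train_loader:                # assumed loader: (batch, 64, frames)
        loss = criterion(classifier(fuser(feats)), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    acc = evaluate(fuser, classifier, test_loader)    # assumed helper: accuracy on test voice set
    if acc > best_acc:                                # performance meets requirement: store models
        best_acc = acc
        torch.save({"fuser": fuser.state_dict(), "classifier": classifier.state_dict()},
                   "detector_best.pt")
```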
The output of the feature fusion device is also taken as the input of each shallow classifier to train the shallow classifiers. As shown in FIG. 6, in some specific examples the shallow classifiers may be a Gaussian mixture model (GMM), a support vector machine (SVM), or the like.
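Assuming the fused features are pooled into one fixed-length vector per utterance, the shallow classifiers could be trained roughly as below with scikit-learn; a per-class GMM pair scored by log-likelihood ratio and a probability-calibrated SVM are common choices, but the component counts and kernel settings here are illustrative.

```python
# Sketch: training shallow classifiers (GMM and SVM) on the fused features.
# X: (n_utterances, feature_dim) pooled fusion features, y: 0 = fake, 1 = genuine.
# Component counts and kernel settings are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_shallow(X, y, n_components=32):
    gmm_real = GaussianMixture(n_components=n_components, covariance_type="diag").fit(X[y == 1])
    gmm_fake = GaussianMixture(n_components=n_components, covariance_type="diag").fit(X[y == 0])
    svm = SVC(kernel="rbf", probability=True).fit(X, y)   # probabilities enable decision fusion
    return gmm_real, gmm_fake, svm

def gmm_predict(gmm_real, gmm_fake, X):
    # Log-likelihood ratio turned into a label and a pseudo-probability pair.
    llr = gmm_real.score_samples(X) - gmm_fake.score_samples(X)
    p_real = 1.0 / (1.0 + np.exp(-llr))
    return (llr > 0).astype(int), np.stack([1 - p_real, p_real], axis=1)
```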
In some embodiments, the process of training the synthetic speech detection model further comprises:
the method comprises the steps of checking whether the performance index of a currently trained synthetic voice detection model meets requirements according to preset indexes of accuracy, precision and recall, wherein the accuracy is defined as the ratio of the number of correctly recognized samples in a training voice data set to the total number of test samples, the precision is defined as the proportion of the actual positive samples in the samples recognized as positive in the training voice data set, and the recall is defined as the proportion of the actual positive samples recognized as positive in the training voice data set.
In some embodiments, outputting the fused recognition result based on the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
As shown in FIG. 7, in some specific examples the multi-domain fusion features of the speech to be detected are respectively input to the at least two shallow classifiers and the deep classifier, and the recognition results of the shallow classifiers and the deep classifier are output.
The recognition results of the shallow classifiers and the deep classifier are input to a decision fusion device to obtain the genuine-or-fake recognition result of the speech to be detected. An exemplary decision rule of the decision fusion device is as follows: the label predicted by the GMM shallow classifier is A with probability pair (a1, a2); the label predicted by the SVM shallow classifier is B with probability pair (b1, b2); the label predicted by the deep classifier is C with probability pair (c1, c2). Each label takes the value true or false, and each probability is normalized between 0 and 1. The final recognition result of the decision fusion device is decided by two aspects together:
first, the mode (majority value) of the labels predicted by the three classifiers is taken as recognition result 1;
then, the probability pairs of the three classifiers are averaged, (P1, P2) = ((a1 + b1 + c1)/3, (a2 + b2 + c2)/3), and taken as recognition result 2;
finally, recognition result 1 and recognition result 2 are combined to obtain the final recognition result. If the maximum value Pmax in recognition result 2 is not lower than a preset threshold (for example 0.6), the fused decision is the label corresponding to the column containing Pmax; if Pmax is lower than the preset threshold of 0.6, the fused decision is recognition result 1.
This is further illustrated as follows:
suppose the first column of recognition result 2 corresponds to the label true and the second column to false. When P1 ≥ 0.6, the final output of the decision fusion device is true; when P2 ≥ 0.6, the final output is false; in these cases recognition result 1 is not considered.
When both P1 and P2 are lower than 0.6, the final output of the decision fusion device is true if recognition result 1 is true, and false if recognition result 1 is false.
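The decision rule above can be written compactly as follows; the 0.6 threshold and the two-column probability layout follow the example, while the function and variable names are illustrative.

```python
# Sketch of the described decision fusion rule: average the classifiers'
# probability pairs; if the largest averaged probability reaches the threshold,
# output its label, otherwise fall back to the majority (mode) of the labels.
from collections import Counter
import numpy as np

def fuse_decisions(labels, probs, columns=("true", "false"), threshold=0.6):
    # labels: e.g. ["true", "false", "true"]; probs: list of (p_true, p_false) pairs.
    majority = Counter(labels).most_common(1)[0][0]     # recognition result 1 (mode)
    mean_probs = np.mean(np.asarray(probs), axis=0)     # recognition result 2
    if mean_probs.max() >= threshold:
        return columns[int(np.argmax(mean_probs))]
    return majority

# Worked example: GMM, SVM and deep classifier outputs; averaged (0.63, 0.37) -> "true".
print(fuse_decisions(["true", "true", "false"], [(0.7, 0.3), (0.8, 0.2), (0.4, 0.6)]))
```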
The method of the embodiments of the application enriches the acoustic scenes of the training data set with data augmentation at the signal level; extracts time-domain, frequency-domain and time-frequency-domain features at the feature level and obtains the multi-domain fusion features by learning multi-domain feature weights with an attention network; and fuses the recognition results of the shallow classifiers and the deep classifier at the decision level. The detection capability and generalization capability of the synthetic voice detection model are thus comprehensively improved through multiple means.
The embodiments of the application also provide a synthetic voice detection device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
The embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above synthetic voice detection method based on multi-domain information fusion.
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disk) and comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many variations may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of protection of the claims, and all such variations fall within the protection of the present application.

Claims (7)

1. A synthetic voice detection method based on multi-domain information fusion, characterized by comprising the following steps:
acquiring a voice signal to be detected, and extracting multi-domain acoustic characteristics of the voice signal to be detected;
inputting the extracted multi-domain acoustic features into a synthetic voice detection model to complete detection, wherein the synthetic voice detection model comprises a feature fusion device, a deep classifier and at least two shallow classifiers, and is trained on a training voice data set in the following manner:
dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain, decomposing the voice data into intrinsic mode components in the time-frequency domain, extracting features from the voiced segments, the silent segments and the intrinsic mode components respectively, and concatenating the extracted features to form multi-domain acoustic features;
taking the multi-domain acoustic features of the voice data of the training voice data set as the input of the feature fusion device, and training it to obtain weight coefficients for the multi-domain acoustic features;
taking the output of the feature fusion device as the input of a deep classifier, training the deep classifier, calculating a loss value with a preset cross-entropy loss function, adjusting the parameters of the feature fusion device and the deep classifier according to the loss value, and iterating the training; and,
taking the output of the feature fusion device as the input of each shallow classifier, and training the shallow classifiers;
the trained shallow classifiers and deep classifier being used to output a fused recognition result;
wherein dividing the voice data in the training voice data set into a voiced segment part and a silent segment part in the time domain comprises:
dividing the voice data into voiced segments and silent segments in the time domain;
decomposing the voice data into intrinsic mode components in the time-frequency domain comprises:
decomposing the speech with a variational mode decomposition method in the time-frequency domain to obtain M intrinsic mode components;
extracting features from the voiced segments, the silent segments and the intrinsic mode components of the voice data respectively comprises:
extracting short-time energy and zero-crossing-rate features from the silent segments in the time domain;
extracting MFCC features from the voiced segments in the frequency domain;
and extracting LFCC features from the M intrinsic mode components in the time-frequency domain;
and outputting the fused recognition result with the trained shallow classifiers and deep classifier comprises:
obtaining the label predicted by each classifier, and averaging the recognition probabilities of all classifiers;
and determining the final fused recognition result from the mode of the predicted labels and the averaged recognition probabilities, wherein if the maximum of the averaged recognition probabilities is not lower than a preset threshold, the fused recognition result is the label corresponding to the column containing that maximum, and if the maximum is lower than the preset threshold, the fused recognition result is determined from the labels predicted by the individual classifiers.
2. The synthetic voice detection method based on multi-domain information fusion according to claim 1, further comprising:
acquiring an initial training voice data set;
and performing data augmentation on the voice data in the initial training voice data set to expand it and obtain the training voice data set.
3. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the feature fusion device comprises a global pooling layer, a fully connected layer, a ReLU activation layer and a sigmoid layer arranged in sequence.
4. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the deep classifier is a deep residual network, a deep convolutional network or a deep recurrent network.
5. The synthetic voice detection method based on multi-domain information fusion according to claim 1, wherein the process of training the synthetic voice detection model further comprises:
checking whether the performance of the currently trained synthetic voice detection model meets requirements against preset accuracy, precision and recall indexes, wherein accuracy is defined as the ratio of the number of correctly recognized samples to the total number of test samples, precision is defined as the proportion of samples recognized as positive that are actually positive, and recall is defined as the proportion of actually positive samples that are recognized as positive.
6. A synthetic voice detection device, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the synthetic voice detection method based on multi-domain information fusion according to any one of claims 1 to 5.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the synthetic voice detection method based on multi-domain information fusion according to any one of claims 1 to 5.
CN202310415885.0A 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion Active CN116153336B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310415885.0A CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310415885.0A CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Publications (2)

Publication Number, Publication Date
CN116153336A (en), 2023-05-23
CN116153336B (en), 2023-07-21

Family

ID=86360373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310415885.0A Active CN116153336B (en) 2023-04-19 2023-04-19 Synthetic voice detection method based on multi-domain information fusion

Country Status (1)

Country Link
CN (1) CN116153336B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433403A (en) * 2023-06-14 2023-07-14 国网安徽省电力有限公司营销服务中心 Account tracking-based electric enterprise accounts receivable early warning method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic
CN113450806B (en) * 2021-05-18 2022-08-05 合肥讯飞数码科技有限公司 Training method of voice detection model, and related method, device and equipment
CN113488073B (en) * 2021-07-06 2023-11-24 浙江工业大学 Fake voice detection method and device based on multi-feature fusion
CN113284513B (en) * 2021-07-26 2021-10-15 中国科学院自动化研究所 Method and device for detecting false voice based on phoneme duration characteristics
CN114566170A (en) * 2022-03-01 2022-05-31 北京邮电大学 Lightweight voice spoofing detection algorithm based on class-one classification
CN114495990A (en) * 2022-03-07 2022-05-13 浙江工业大学 Speech emotion recognition method based on feature fusion
CN114596879B (en) * 2022-03-25 2022-12-30 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
CN114495950A (en) * 2022-04-01 2022-05-13 杭州电子科技大学 Voice deception detection method based on deep residual shrinkage network

Also Published As

Publication number Publication date
CN116153336A (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant