CN115620748A - Comprehensive training method and device for speech synthesis and authenticity evaluation - Google Patents
Comprehensive training method and device for speech synthesis and authenticity evaluation
- Publication number
- CN115620748A (application CN202211552858.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- loss function
- conversion
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G10L15/063: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Abstract
The invention provides a comprehensive training method and device for speech synthesis and authenticity evaluation. Source speech and target speech are obtained as input corpora; a preset voice converter is trained to perform voice conversion, and a preset inverse voice converter is trained to perform inverse voice conversion; a preset speech authenticity discriminator is trained to perform authenticity discrimination, and a preset speech quality evaluator is trained to perform speech quality evaluation. A voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are fused to construct a target loss function, which is minimized iteratively. The method jointly trains and optimizes the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection, thereby improving the voice conversion effect, making the converted speech detectable and traceable, and hardening speech processing and voiceprint recognition against potential malicious attacks.
Description
Technical Field
The disclosure relates to the technical field of audio processing, and in particular to a comprehensive training method and device for speech synthesis and authenticity evaluation.
Background
With the continuous development of deep synthesis technology, applications such as speech synthesis, video generation, and even digital virtual humans have emerged. Voice conversion is a technology that alters the personal voice characteristics of a source speaker, such as spectrum and prosody, so that they take on the characteristics of a target speaker while the semantic information remains unchanged. Based on voice conversion technology, the voice of a real game player can be converted into that of a game character, or a real voice in social interaction can be converted into that of an entertainment persona or a specific target. Typical conversions include male-to-female, female-to-male, and male-to-male voice conversion.
Currently, voice conversion, speech quality evaluation, and speech authenticity detection are usually handled as independent tasks. The processing flow of a voice conversion task therefore cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllable detection of speech authenticity and the traceability of speech. As a result, converted speech is poorly controllable and traceable in terms of speech quality, conversion effect, and authenticity.
Disclosure of Invention
The embodiments of the disclosure provide at least a comprehensive training method and device for speech synthesis and authenticity evaluation, which jointly train and optimize the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection, thereby improving the voice conversion effect, making the converted speech detectable and traceable, and hardening speech processing and voiceprint recognition against potential malicious attacks.
The embodiments of the disclosure provide a comprehensive training method for speech synthesis and authenticity evaluation, comprising the following steps:
obtaining source speech and target speech as input corpora;
converting the input corpora into corresponding converted speech information through a preset voice converter, and converting the converted speech information into corresponding inverted speech information through a preset inverse voice converter;
determining authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and determining a MOS (Mean Opinion Score) between the inverted speech information and the input corpora through a preset speech quality evaluator;
respectively determining a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator;
and constructing a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function.
In an optional implementation, converting the input corpora into corresponding converted speech information through the preset voice converter, and converting the converted speech information into corresponding inverted speech information through the preset inverse voice converter, specifically includes:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determining the converted source speech and the converted target speech as the converted speech information;
inputting the converted source speech and the target voiceprint embedding vector to the inverse voice converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determining the inverted source speech and the inverted target speech as the inverted speech information.
In an optional implementation, constructing the target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function, specifically includes:
configuring corresponding learnable hyper-parameters to be optimized for the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, respectively;
performing a weighted summation of the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function according to the hyper-parameters to be optimized, to determine the target loss function;
and performing minimization iteration on the target loss function, to realize joint training and optimization of the conversion-inverse-conversion process, the authenticity discrimination process, and the speech quality evaluation process.
In an optional implementation, the voice conversion loss function is determined based on the following steps:
determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, wherein the voice conversion loss function is used to describe the speaker similarity preserved for the source speech and the target speech across the conversion-inverse-conversion process.
In an optional implementation, the authenticity discrimination loss function is determined based on the following steps:
determining a first authenticity score output by the speech authenticity discriminator for the converted source speech and the inverted source speech, and a second authenticity score output by the speech authenticity discriminator for the converted target speech and the inverted target speech;
and performing a normalized exponential (softmax) operation on the first authenticity score and the second authenticity score respectively, and defining the sum of the two scores after that operation as the authenticity discrimination loss function, wherein the authenticity discrimination loss function is used to describe the detectability of speech authenticity.
In an optional implementation, the quality evaluation loss function is determined based on the following steps:
determining, by the speech quality evaluator based on the Perceptual Objective Listening Quality Analysis (POLQA) algorithm, a first MOS score between the source speech and the inverted source speech, and a second MOS score between the target speech and the inverted target speech;
and summing the first MOS score and the second MOS score after negating them, and defining the result as the quality evaluation loss function, wherein the quality evaluation loss function is used to describe the evaluability of speech quality.
The embodiments of the present disclosure further provide a comprehensive training device for speech synthesis and authenticity evaluation, the device comprising:
an obtaining module, configured to obtain source speech and target speech as input corpora;
a conversion-inversion module, configured to convert the input corpora into corresponding converted speech information through a preset voice converter, and to convert the converted speech information into corresponding inverted speech information through a preset inverse voice converter;
an authenticity evaluation module, configured to determine authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and to determine the MOS score between the inverted speech information and the input corpora through a preset speech quality evaluator;
a loss function construction module, configured to respectively determine a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator;
and a training module, configured to construct a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and to perform minimization iteration on the target loss function.
The embodiments of the present disclosure further provide an electronic device comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
The embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
The embodiments of the present disclosure further provide a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
According to the comprehensive training method and device for speech synthesis and authenticity evaluation, source speech and target speech are obtained as input corpora; the input corpora are converted into corresponding converted speech information through a preset voice converter, and the converted speech information is converted into corresponding inverted speech information through a preset inverse voice converter; the authenticity scores corresponding to the converted speech information and the inverted speech information are determined through a preset speech authenticity discriminator, and the MOS score between the inverted speech information and the input corpora is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are respectively determined; and a target loss function is constructed from these three loss functions and minimized iteratively. In this way, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; for those of ordinary skill in the art, other related drawings may be derived from them without creative effort.
FIG. 1 is a flowchart of a comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of another comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a comprehensive training device for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, and C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Research shows that voice conversion, speech quality evaluation, and speech authenticity detection are currently handled as independent tasks. The processing flow of a voice conversion task therefore cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllable detection of speech authenticity and the traceability of speech. As a result, converted speech is poorly controllable and traceable in terms of speech quality, conversion effect, and authenticity.
Based on this research, the present disclosure provides a comprehensive training method and device for speech synthesis and authenticity evaluation: source speech and target speech are obtained as input corpora; the input corpora are converted into corresponding converted speech information through a preset voice converter, and the converted speech information is converted into corresponding inverted speech information through a preset inverse voice converter; the authenticity scores corresponding to the converted speech information and the inverted speech information are determined through a preset speech authenticity discriminator, and the MOS score between the inverted speech information and the input corpora is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are respectively determined; and a target loss function is constructed from these three loss functions and minimized iteratively. The three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are thereby jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
To facilitate understanding of the embodiments, the comprehensive training method for speech synthesis and authenticity evaluation disclosed herein is first described in detail. Its execution subject is generally a computer device with certain computing power, for example: a terminal device, which may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device; or a server or other processing device. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to FIG. 1, which shows a flowchart of the comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure, the method includes steps S101 to S105:
S101, obtaining source speech and target speech as input corpora.
In specific implementation, a source speech corresponding to a source speaker and a target speech corresponding to a target speaker are obtained, and the source speech and the target speech are used as input corpora of speech conversion.
Here, the source speech and the target speech serve as the input corpora of a voice conversion system, which converts the personal voice characteristics of the source speaker, such as spectrum and prosody, into those of the target speaker while keeping the semantic information of the source speech unchanged.
A voice conversion system usually comprises a training stage and an inference stage. In the training stage, the source speech of the source speaker and the target speech of the target speaker are first analyzed and their features extracted; the extracted features are then mapped, and model training is finally performed on the mapped features to obtain a voice conversion model. In the inference stage, the source speech to be converted is analyzed and its features extracted and mapped; the mapped features are then transformed by the voice conversion model obtained in the training stage, and the converted features are finally used for speech synthesis to obtain the converted speech.
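The following minimal, runnable sketch illustrates this two-stage split. Every component here, the frame-based feature extractor, the toy MappingModel, and the placeholder vocoder, is a hypothetical stand-in chosen for illustration, not the converter of this disclosure.

```python
# Minimal sketch of the two-stage voice conversion pipeline; all components
# are hypothetical placeholders, not the embodiment's actual modules.
import numpy as np

def extract_features(wave: np.ndarray) -> np.ndarray:
    # Placeholder "analysis": frame the waveform and take log-magnitude spectra.
    frames = wave[: len(wave) // 128 * 128].reshape(-1, 128)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=-1)))

class MappingModel:
    """Stand-in for the trained feature-mapping model."""
    def __init__(self) -> None:
        self.w = 1.0

    def fit(self, src_feats: np.ndarray, tgt_feats: np.ndarray) -> "MappingModel":
        # Toy "training": scale source features toward the target statistics.
        self.w = float(tgt_feats.mean() / (src_feats.mean() + 1e-8))
        return self

    def convert(self, feats: np.ndarray) -> np.ndarray:
        return feats * self.w

def synthesize(feats: np.ndarray) -> np.ndarray:
    # Placeholder "vocoder": inverse FFT of the (phase-less) feature frames.
    return np.fft.irfft(np.expm1(feats), axis=-1).ravel()

# Training stage: analyze both speakers' speech, then fit the mapping.
src_wave, tgt_wave = np.random.randn(16000), np.random.randn(16000)
model = MappingModel().fit(extract_features(src_wave), extract_features(tgt_wave))

# Inference stage: analyze, map, and synthesize the converted speech.
converted = synthesize(model.convert(extract_features(src_wave)))
```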
Optionally, the source speech and the target speech may be collected from the source speaker and the target speaker respectively through audio collection equipment.
S102, converting the input corpora into corresponding converted speech information through a preset voice converter, and converting the converted speech information into corresponding inverted speech information through a preset inverse voice converter.
In a specific implementation, the source speech and the target speech are input to the preset voice converter as the input corpora and converted into the corresponding converted speech information, and the converted speech information is input to the preset inverse voice converter to obtain the inverted speech information resulting from inversely converting the converted speech information.
Here, the converted speech information includes the converted speech output by the voice converter for the source speech and the converted speech output by the voice converter for the target speech; correspondingly, the inverse voice converter outputs the inverted speech information corresponding to the source speech and the inverted speech information corresponding to the target speech.
The preset voice converter may be VITS_VC, a voice-conversion variant of VITS, a conditional variational autoencoder with adversarial learning for end-to-end speech synthesis. VITS_VC is a highly expressive voice conversion model that combines variational inference, normalizing flows, and adversarial training. Instead of the common cascade in which an acoustic model and a vocoder exchange spectral features, VITS_VC models the latent variables stochastically and uses a stochastic duration predictor to improve the diversity of the converted speech, so that the same input speech can yield voices with different pitches and rhythms. The VITS_VC algorithm adopts a non-autoregressive network structure; compared with traditional autoregressive networks, the generation speed is significantly improved, meeting the requirement of high-rate conversion in practical applications.
Furthermore, the inverse voice converter makes it possible to evaluate the speech similarity between the input corpora and the inverted speech information, thereby realizing the controllability and traceability of the speech.
It should be noted that, after the speech authenticity discriminator detects the converted speech information and the inverted speech information as described in step S103, it can further determine whether the voice conversion task was performed by the system's own voice converter, and the inverse conversion of the converted speech information is then carried out.
As a possible implementation, step S102 may be implemented by the following steps S1021 to S1025:
S1021, determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech.
S1022, inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech.
S1023, determining the converted source speech and the converted target speech as the converted speech information.
S1024, inputting the converted source speech and the target voiceprint embedding vector to the inverse voice converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech.
S1025, determining the inverted source speech and the inverted target speech as the inverted speech information.
In a specific implementation, in order to improve the voice conversion effect under the condition of few samples or short utterances, a pretrained voiceprint extractor is used as the speaker encoder for extracting voiceprint embedding vectors from the input corpora.
The voiceprint extractor converts the input corpora into 512-dimensional speaker embedding representations, which are fed to the voice converter.
Optionally, when VITS_VC is selected as the voice converter, VITS supports multi-speaker voice conversion; when applied to a multi-speaker model, the source voiceprint embedding vector corresponding to each source speaker's speech is added to the corresponding module of VITS.
Further, given the source speech of a source speaker and its corresponding source voiceprint embedding vector, the voice converter outputs through the vocoder the converted source speech obtained by converting the source speech, and likewise the converted target speech obtained by converting the target speech.
Further, given the converted source speech output by the voice converter and the target voiceprint embedding vector corresponding to the target speech, the inverse voice converter outputs through the vocoder the inverted source speech obtained by inversely converting the converted source speech, and likewise the inverted target speech obtained by inversely converting the converted target speech. This round trip is sketched below.
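In the PyTorch sketch that follows, SpeakerEncoder and Converter are toy linear modules standing in for the pretrained voiceprint extractor and the VITS_VC-style converter; only the data flow, including which embedding conditions which stage, mirrors the text of steps S1021 to S1025.

```python
# Hedged sketch of the conversion / inverse-conversion round trip (S1021-S1025);
# the modules are illustrative stand-ins, not the embodiment's networks.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in for the pretrained voiceprint extractor (512-dim embeddings)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        return self.proj(mel).mean(dim=1)                  # (B, 512)

class Converter(nn.Module):
    """Stand-in for the voice converter (a second instance acts as the inverse)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 512):
        super().__init__()
        self.net = nn.Linear(n_mels + emb_dim, n_mels)

    def forward(self, mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # Condition every frame on the speaker embedding.
        cond = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, cond], dim=-1))

enc, vc, ivc = SpeakerEncoder(), Converter(), Converter()
src, tgt = torch.randn(1, 100, 80), torch.randn(1, 120, 80)

e_src, e_tgt = enc(src), enc(tgt)     # S1021: voiceprint embedding vectors
conv_src = vc(src, e_src)             # S1022-S1023: converted source speech
inv_src = ivc(conv_src, e_tgt)        # S1024-S1025: inverted source speech
```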
S103, determining authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and determining the MOS score between the inverted speech information and the input corpora through a preset speech quality evaluator.
In a specific implementation, the speech authenticity discriminator adopts an end-to-end audio authenticity discrimination method based on a graph-convolution attention network. It extracts full-band and sub-band embedding features of the speech and introduces a fused attention mechanism that effectively exploits the information of three attention sub-modules covering the temporal region, the spectral region, and the channel region. The converted speech information and the inverted speech information are input to the discriminator, which outputs the authenticity scores corresponding to each.
The authenticity score output by the discriminator is normalized to the range [0, 1] and represents the likelihood that the audio is genuine: a score close to 0 indicates forged audio, and a score close to 1 indicates genuine audio.
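As a small illustration of such a score normalization, the sketch below maps raw discriminator logits to [0, 1] with a sigmoid; the sigmoid is an assumed choice, since the disclosure does not state which normalization is applied.

```python
# Assumed sigmoid normalization of raw discriminator outputs to [0, 1].
import torch

logits = torch.tensor([-2.3, 0.1, 4.0])   # raw authenticity-discriminator outputs
scores = torch.sigmoid(logits)             # close to 0: forged; close to 1: genuine
```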
Further, the speech quality evaluator may evaluate the MOS score between the inverted speech information and the input corpora through the Perceptual Objective Listening Quality Analysis (POLQA) algorithm.
Here, the most intuitive measure of a voice conversion system's quality is the quality of the converted audio, commonly expressed as a MOS (Mean Opinion Score) value. However, MOS scoring requires human raters from many fields, incurring expensive labor and time overhead. Therefore, in the training process of the embodiments of the present application, the original input corpora before conversion and the inverted speech information after inverse conversion can be obtained, and the POLQA algorithm can be used to evaluate the speech quality.
Specifically, the POLQA algorithm obtains a POLQA score through filtering, time alignment, sampling-rate estimation, and objective perceptual scoring of a reference signal and a degraded signal, and finally maps the POLQA score to a MOS score. POLQA is a full-reference objective evaluation method: given a reference signal (the lossless signal, here the original speech before conversion), it quantifies the degree of impairment of the degraded signal (here the inversely converted speech) and gives an objective speech quality score close to the subjective score.
The maximum MOS value of POLQA is 4.5 in narrow-band mode and 4.75 in super-wideband mode. Preferably, to define the quality evaluation loss function, the POLQA value is negated.
S104, respectively determining the voice conversion loss function corresponding to the conversion-inverse-conversion process, the authenticity discrimination loss function corresponding to the speech authenticity discriminator, and the quality evaluation loss function corresponding to the speech quality evaluator.
In a specific implementation, the three loss functions are determined respectively.
Here, the voice conversion loss function is the loss function corresponding to the voice converter and the inverse voice converter.
The voice conversion loss function describes the similarity between speakers, the authenticity discrimination loss function describes the detectability of speech authenticity, and the quality evaluation loss function describes the evaluability of speech quality.
As a possible implementation, the voice conversion loss function may be determined based on the following steps 1 to 3:
Step 1, determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
Step 2, determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
Step 3, defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, which describes the speaker similarity preserved for the source speech and the target speech across the conversion-inverse-conversion process.
Specifically, the voice conversion loss function may be constructed based on the following formula:

$$L_{MSE} = L_{MSE\_source} + L_{MSE\_target} = \mathrm{MSE}\big(E(s), E(\hat{s})\big) + \mathrm{MSE}\big(E(t), E(\hat{t})\big)$$

where $L_{MSE}$ denotes the voice conversion loss function; $L_{MSE\_source}$ denotes the loss of the source speech over the conversion and inverse-conversion processes; $L_{MSE\_target}$ denotes the loss of the target speech over the conversion and inverse-conversion processes; $\mathrm{MSE}(\cdot)$ denotes the mean square error; $E(\cdot)$ denotes the computation of the voiceprint embedding vector; $s$ and $t$ denote the source speech and the target speech respectively; and $\hat{s}$ and $\hat{t}$ denote the inverted source speech and the inverted target speech after inverse conversion by the inverse voice converter.
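A minimal PyTorch rendering of this loss follows, assuming E(.) is a frozen speaker encoder that has already produced the four voiceprint embeddings; the random tensors below are placeholders for those embeddings.

```python
# Sketch of L_MSE = MSE(E(s), E(s_hat)) + MSE(E(t), E(t_hat)); the embeddings
# are random placeholders for the speaker encoder's 512-dim outputs.
import torch
import torch.nn.functional as F

def voice_conversion_loss(e_src, e_inv_src, e_tgt, e_inv_tgt):
    # Sum of the two mean square errors between original and inverted embeddings.
    return F.mse_loss(e_inv_src, e_src) + F.mse_loss(e_inv_tgt, e_tgt)

l_mse = voice_conversion_loss(torch.randn(1, 512), torch.randn(1, 512),
                              torch.randn(1, 512), torch.randn(1, 512))
```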
As another possible implementation, the authenticity discrimination loss function may be determined based on the following steps 1 and 2:
Step 1, determining a first authenticity score output by the speech authenticity discriminator for the converted source speech and the inverted source speech, and a second authenticity score output by the speech authenticity discriminator for the converted target speech and the inverted target speech.
Step 2, performing a normalized exponential (softmax) operation on the first authenticity score and the second authenticity score respectively, and defining the sum of the two scores after that operation as the authenticity discrimination loss function, which describes the detectability of speech authenticity.
Specifically, the authenticity discrimination loss function may be constructed based on the following formula:

$$L_{SPOOF} = L_{SPOOF\_vc} + L_{SPOOF\_ivc} = \mathrm{softmax}\big(\mathrm{score}(\tilde{s}, \hat{s})\big) + \mathrm{softmax}\big(\mathrm{score}(\tilde{t}, \hat{t})\big)$$

where $L_{SPOOF}$ denotes the authenticity discrimination loss function; $L_{SPOOF\_vc}$ denotes the loss of the speech authenticity discriminator for processing the converted source speech and the inverted source speech; $L_{SPOOF\_ivc}$ denotes the loss of the speech authenticity discriminator for processing the converted target speech and the inverted target speech; $\mathrm{softmax}(\cdot)$ denotes the normalized exponential operation; $\mathrm{score}(\cdot)$ denotes the authenticity score output by the discriminator; $\tilde{s}$ and $\hat{s}$ denote the converted source speech and the inverted source speech respectively; and $\tilde{t}$ and $\hat{t}$ denote the converted target speech and the inverted target speech respectively.
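The sketch below renders this loss under one assumed reading: the discriminator emits two logits (fake, real) per utterance, and the softmax "real" probability of the converted and inverted audio is penalized so that conversions stay detectable. The exact pairing of inputs is an interpretation of the text, not a verified reproduction of the embodiment.

```python
# Hedged sketch of L_SPOOF; the (fake, real) logit layout is an assumption.
import torch

def authenticity_loss(logits_vc: torch.Tensor, logits_ivc: torch.Tensor) -> torch.Tensor:
    # Normalized exponential of each score, keeping the "real" component so the
    # converter is trained toward outputs the discriminator can still detect.
    p_real_vc = torch.softmax(logits_vc, dim=-1)[..., 1]
    p_real_ivc = torch.softmax(logits_ivc, dim=-1)[..., 1]
    return p_real_vc.mean() + p_real_ivc.mean()

l_spoof = authenticity_loss(torch.randn(4, 2), torch.randn(4, 2))
```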
As another possible implementation, the quality evaluation loss function may be determined based on the following steps 1 and 2:
Step 1, determining, by the speech quality evaluator based on the Perceptual Objective Listening Quality Analysis algorithm, a first MOS score between the source speech and the inverted source speech, and a second MOS score between the target speech and the inverted target speech.
Step 2, summing the first MOS score and the second MOS score after negating them, and defining the result as the quality evaluation loss function, which describes the evaluability of speech quality.
Specifically, the quality evaluation loss function may be constructed based on the following formula:

$$L_{POLQA} = L_{POLQA\_source} + L_{POLQA\_target} = -\mathrm{POLQA}\big(s, \hat{s}\big) - \mathrm{POLQA}\big(t, \hat{t}\big)$$

where $L_{POLQA}$ denotes the quality evaluation loss function; $L_{POLQA\_source}$ denotes the loss of the speech quality evaluator for processing the source speech and the inverted source speech; $L_{POLQA\_target}$ denotes the loss of the speech quality evaluator for processing the target speech and the inverted target speech; $\mathrm{POLQA}(\cdot)$ denotes the computation of the full-reference objective evaluation score based on POLQA; $s$ and $\hat{s}$ denote the source speech and the inverted source speech respectively; and $t$ and $\hat{t}$ denote the target speech and the inverted target speech respectively.
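POLQA itself is a licensed ITU-T P.863 metric without an open reference implementation, so the sketch below substitutes a hypothetical differentiable mos() predictor to show how the two negated scores combine into the loss.

```python
# Sketch of L_POLQA = -MOS(s, s_hat) - MOS(t, t_hat); mos() is a hypothetical
# stand-in for the POLQA score, not the ITU-T P.863 algorithm.
import torch

def mos(reference: torch.Tensor, degraded: torch.Tensor) -> torch.Tensor:
    # Placeholder: approaches 4.5 (the narrow-band ceiling) as the signals match.
    return 4.5 - (reference - degraded).pow(2).mean().clamp(max=4.5)

def quality_loss(src, inv_src, tgt, inv_tgt):
    # Negate the two MOS scores so that higher quality means lower loss.
    return -mos(src, inv_src) - mos(tgt, inv_tgt)

l_polqa = quality_loss(torch.randn(16000), torch.randn(16000),
                       torch.randn(16000), torch.randn(16000))
```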
S105, constructing a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function.
In a specific implementation, the voice conversion loss function describes the similarity between speakers, the authenticity discrimination loss function describes the detectability of speech authenticity, and the quality evaluation loss function describes the evaluability of speech quality. These three loss functions, defined from different dimensions, are combined into the target loss function, and the joint optimization of the whole system is realized through iterative training that minimizes it.
According to the comprehensive training method for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
Referring to FIG. 2, which shows a flowchart of another comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure, the method includes steps S201 to S203:
S201, configuring corresponding learnable hyper-parameters to be optimized for the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, respectively.
In a specific implementation, the corresponding hyper-parameters to be optimized are configured for each of the three loss functions.
It should be noted that the hyper-parameter to be optimized corresponding to each loss function may be selected according to actual needs and is not specifically limited here. Preferably, the initial value of each hyper-parameter to be optimized may be set to 1.
S202, performing a weighted summation of the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function according to the hyper-parameters to be optimized, to determine the target loss function.
Specifically, the target loss function can be constructed by the following formula:

$$L_{total} = \alpha \cdot L_{MSE} + \beta \cdot L_{SPOOF} + \lambda \cdot L_{POLQA}$$

where $L_{total}$ denotes the target loss function; $L_{MSE}$ denotes the voice conversion loss function; $L_{SPOOF}$ denotes the authenticity discrimination loss function; $L_{POLQA}$ denotes the quality evaluation loss function; and $\alpha$, $\beta$, and $\lambda$ denote the learnable hyper-parameters to be optimized.
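Putting the three losses together, the following is a minimal sketch of the weighted objective and a single optimization step. Treating alpha, beta, and lam as learnable tensors initialized to 1 follows the "hyper-parameters to be optimized" wording above and is otherwise an assumption; in practice the parameters of the converter, inverse converter, discriminator, and evaluator would be included in the same optimizer.

```python
# Sketch of L_total = alpha * L_MSE + beta * L_SPOOF + lambda * L_POLQA
# with one minimization step; the scalar loss values are placeholders.
import torch

alpha = torch.tensor(1.0, requires_grad=True)  # initialized to 1 as suggested above
beta = torch.tensor(1.0, requires_grad=True)
lam = torch.tensor(1.0, requires_grad=True)

def total_loss(l_mse, l_spoof, l_polqa):
    return alpha * l_mse + beta * l_spoof + lam * l_polqa

opt = torch.optim.Adam([alpha, beta, lam], lr=1e-3)  # plus model params in practice
loss = total_loss(torch.tensor(0.2), torch.tensor(0.7), torch.tensor(-4.1))
opt.zero_grad()
loss.backward()
opt.step()
```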
S203, performing minimization iteration on the target loss function, to realize joint training and optimization of the conversion-inverse-conversion process, the authenticity discrimination process, and the speech quality evaluation process.
According to the comprehensive training method for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
It will be understood by those skilled in the art that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a comprehensive training device for speech synthesis and authenticity evaluation corresponding to the above method. Since the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Referring to FIG. 3, FIG. 3 is a schematic diagram of a comprehensive training device for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure. As shown in FIG. 3, the comprehensive training device 300 for speech synthesis and authenticity evaluation includes:
the obtaining module 310 is configured to obtain source speech and target speech as input corpora.
The conversion and inversion module 320 is configured to convert the input corpus into corresponding conversion voice information through a preset voice converter, and convert the conversion voice information into corresponding inversion voice information through a preset voice inverse converter.
The counterfeit discrimination evaluation module 330 is configured to determine the converted speech information and a counterfeit discrimination score corresponding to the inverted speech information by using a preset speech discriminator, and determine an MOS score between the inverted speech information and the input corpus by using a preset speech quality evaluator.
The loss function constructing module 340 is configured to determine a voice conversion loss function corresponding to a voice conversion-inverse conversion process, a voice counterfeit detection loss function corresponding to the voice counterfeit detector, and a quality evaluation loss function corresponding to the voice quality evaluator, respectively.
A training module 350, configured to construct a target loss function according to the speech conversion loss function, the speech discrimination loss function, and the quality evaluation loss function, and perform a minimization iteration on the target loss function.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
According to the comprehensive training device for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
Corresponding to the comprehensive training method for speech synthesis and authenticity evaluation in FIG. 1 and FIG. 2, an embodiment of the present disclosure further provides an electronic device 400. As shown in FIG. 4, a schematic structural diagram of the electronic device 400, it includes:
a processor 41, a memory 42, and a bus 43. The memory 42 is used for storing execution instructions and includes an internal memory 421 and an external memory 422. The internal memory 421 temporarily stores operation data for the processor 41 and data exchanged with external memory 422 such as a hard disk; the processor 41 exchanges data with the external memory 422 through the internal memory 421. When the electronic device 400 runs, the processor 41 communicates with the memory 42 through the bus 43, causing the processor 41 to execute the steps of the comprehensive training method for speech synthesis and authenticity evaluation shown in FIG. 1 and FIG. 2.
The embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the comprehensive training method for speech synthesis and authenticity evaluation described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure further provide a computer program product comprising computer instructions; when the computer instructions are executed by a processor, the steps of the comprehensive training method for speech synthesis and authenticity evaluation described in the above method embodiments may be performed.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiments and is not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for another example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure may be embodied as a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above-described embodiments are merely specific embodiments of the present disclosure, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (10)
1. A comprehensive training method for speech synthesis and anti-spoofing evaluation, characterized by comprising the following steps:
obtaining source speech and target speech as input corpora;
converting the input corpora into corresponding converted speech information by a preset voice converter, and converting the converted speech information into corresponding inverted speech information by a preset voice inverse converter;
determining anti-spoofing scores corresponding to the converted speech information and the inverted speech information by a preset speech anti-spoofing discriminator, and determining a MOS (mean opinion score) score between the inverted speech information and the input corpora by a preset speech quality evaluator;
respectively determining a voice conversion loss function corresponding to the voice conversion-inversion process, an anti-spoofing loss function corresponding to the speech anti-spoofing discriminator, and a quality assessment loss function corresponding to the speech quality evaluator;
and constructing a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and performing minimization iterations on the target loss function.
2. The method according to claim 1, wherein converting the input corpora into corresponding converted speech information by a preset voice converter, and converting the converted speech information into corresponding inverted speech information by a preset voice inverse converter, specifically comprises:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determining the converted source speech and the converted target speech as the converted speech information;
inputting the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determining the inverted source speech and the inverted target speech as the inverted speech information.
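Purely by way of illustration (and not as part of the claims), the conversion-inversion dataflow recited in claim 2 can be sketched as follows. The PyTorch module, feature sizes, and batch shapes are assumptions, since the claim fixes only the dataflow; the sketch mirrors the claim wording on which voiceprint embedding accompanies which input.

```python
import torch
import torch.nn as nn

FEAT, EMB = 80, 32  # assumed mel-frame and voiceprint-embedding sizes

class Converter(nn.Module):
    """Stand-in for the voice converter / voice inverse converter: maps a
    speech feature frame, conditioned on a voiceprint embedding, to a new
    frame. The disclosure does not fix an architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT + EMB, 128), nn.ReLU(), nn.Linear(128, FEAT))

    def forward(self, speech, voiceprint):
        return self.net(torch.cat([speech, voiceprint], dim=-1))

converter, inverse_converter = Converter(), Converter()

src_speech, tgt_speech = torch.randn(4, FEAT), torch.randn(4, FEAT)
src_emb, tgt_emb = torch.randn(4, EMB), torch.randn(4, EMB)  # voiceprint vectors

# Conversion step: each utterance enters the voice converter together with a
# voiceprint embedding, yielding the converted speech information.
conv_src = converter(src_speech, src_emb)
conv_tgt = converter(tgt_speech, tgt_emb)

# Inversion step: the voice inverse converter receives converted speech
# together with the other voiceprint embedding and produces inverted speech.
inv_src = inverse_converter(conv_src, tgt_emb)
inv_tgt = inverse_converter(conv_tgt, src_emb)
```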
3. The method according to claim 1, wherein constructing a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and performing minimization iterations on the target loss function, specifically comprises:
configuring corresponding learning hyper-parameters to be optimized for the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, respectively;
performing, according to the learning hyper-parameters to be optimized, a weighted summation of the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function to determine the target loss function;
and performing iterative minimization of the target loss function to realize joint training optimization of the voice conversion-inversion process, the speech anti-spoofing process, and the speech quality evaluation process.
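For illustration only, a minimal sketch of the aggregation and minimization in claim 3, assuming three scalar losses; the fixed example weights stand in for the per-loss learning hyper-parameters, and the placeholder losses over a dummy parameter exist only so the minimization loop actually runs.

```python
import torch

def target_loss(l_conv, l_spoof, l_mos, lam=(1.0, 0.5, 0.1)):
    # Weighted summation of the three losses; the weights lam stand in for
    # the learning hyper-parameters and their values here are arbitrary.
    return lam[0] * l_conv + lam[1] * l_spoof + lam[2] * l_mos

# Dummy model parameter so the loop runs; real losses come from claims 4-6.
theta = torch.randn(16, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(200):                  # minimization iterations
    l_conv = theta.pow(2).mean()         # placeholder voice conversion loss
    l_spoof = theta.abs().mean()         # placeholder anti-spoofing loss
    l_mos = -theta.mean()                # placeholder (negated-MOS) quality loss
    loss = target_loss(l_conv, l_spoof, l_mos)
    opt.zero_grad()
    loss.backward()
    opt.step()
```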
4. The method according to claim 2, wherein the voice conversion loss function is determined by the following steps:
determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, wherein the voice conversion loss function describes the speaker similarity between the source speech and the target speech.
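A minimal illustrative sketch of the loss in claim 4, assuming the voiceprint embeddings are fixed-size tensors produced by some speaker encoder not shown here:

```python
import torch
import torch.nn.functional as F

def voice_conversion_loss(src_emb, inv_src_emb, tgt_emb, inv_tgt_emb):
    # First and second mean square errors between the original and inverted
    # voiceprint embeddings; their sum measures speaker similarity.
    mse1 = F.mse_loss(inv_src_emb, src_emb)
    mse2 = F.mse_loss(inv_tgt_emb, tgt_emb)
    return mse1 + mse2

# Example with random 32-dimensional embeddings (batch of 4; sizes assumed):
embs = [torch.randn(4, 32) for _ in range(4)]
print(voice_conversion_loss(*embs))
```

Because the loss compares voiceprint embeddings rather than waveforms, minimizing it pulls the inverted speech back toward the original speakers' identities.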
5. The method according to claim 2, wherein the anti-spoofing loss function is determined by the following steps:
determining a first anti-spoofing score output by the speech anti-spoofing discriminator for the converted source speech and the inverted source speech, and a second anti-spoofing score output by the speech anti-spoofing discriminator for the converted target speech and the inverted target speech;
and applying a normalized exponential (softmax) operation to the first anti-spoofing score and the second anti-spoofing score respectively, and defining the sum of the two normalized scores as the anti-spoofing loss function, wherein the anti-spoofing loss function describes the detectability of the converted speech.
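An illustrative sketch of claim 5's loss, assuming the discriminator emits two-class logits and that index 1 denotes the spoofed class; both are assumptions, since the claim specifies only the normalized exponential (softmax) operation and the summation.

```python
import torch

def anti_spoofing_loss(score1, score2):
    # Normalized exponential (softmax) over two-class discriminator outputs;
    # taking index 1 as the "spoofed" class is an assumption of this sketch.
    p1 = torch.softmax(score1, dim=-1)[..., 1]
    p2 = torch.softmax(score2, dim=-1)[..., 1]
    return (p1 + p2).mean()

# Hypothetical discriminator logits for the (converted, inverted) source and
# target utterances, batch of 4:
first_score = torch.randn(4, 2)
second_score = torch.randn(4, 2)
print(anti_spoofing_loss(first_score, second_score))
```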
6. The method according to claim 4, wherein the quality assessment loss function is determined by the following steps:
determining, by the speech quality evaluator and based on a perceptual objective listening quality assessment algorithm, a first MOS score between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second MOS score between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and negating the sum of the first MOS score and the second MOS score to define the quality assessment loss function, wherein the quality assessment loss function describes the assessability of the speech quality.
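An illustrative sketch of claim 6's loss. The MOS estimator here is a hypothetical differentiable stand-in mapped into the usual [1, 5] range; a real implementation would follow a perceptual objective listening quality algorithm, which is not reproduced here.

```python
import torch

def mos_score(ref, deg):
    # Hypothetical differentiable MOS stand-in in [1, 5]; a real system would
    # use a perceptual objective listening-quality estimator.
    return 5.0 - 4.0 * torch.sigmoid((ref - deg).pow(2).mean())

def quality_assessment_loss(src_emb, inv_src_emb, tgt_emb, inv_tgt_emb):
    # Negate the sum of the two MOS scores so that better perceptual quality
    # yields a lower loss.
    mos1 = mos_score(src_emb, inv_src_emb)
    mos2 = mos_score(tgt_emb, inv_tgt_emb)
    return -(mos1 + mos2)

embs = [torch.randn(4, 32) for _ in range(4)]
print(quality_assessment_loss(*embs))
```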
7. A comprehensive training apparatus for speech synthesis and anti-spoofing evaluation, characterized by comprising:
an acquisition module, configured to obtain source speech and target speech as input corpora;
a conversion-inversion module, configured to convert the input corpora into corresponding converted speech information by a preset voice converter, and to convert the converted speech information into corresponding inverted speech information by a preset voice inverse converter;
an anti-spoofing evaluation module, configured to determine anti-spoofing scores corresponding to the converted speech information and the inverted speech information by a preset speech anti-spoofing discriminator, and to determine a MOS score between the inverted speech information and the input corpora by a preset speech quality evaluator;
a loss function construction module, configured to respectively determine a voice conversion loss function corresponding to the voice conversion-inversion process, an anti-spoofing loss function corresponding to the speech anti-spoofing discriminator, and a quality assessment loss function corresponding to the speech quality evaluator;
and a training module, configured to construct a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and to perform minimization iterations on the target loss function.
8. The apparatus according to claim 7, wherein the conversion-inversion module is specifically configured to:
determine a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
input the source voiceprint embedding vector and the source speech to the voice converter, and determine converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determine the converted source speech and the converted target speech as the converted speech information;
input the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determine inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determine the inverted source speech and the inverted target speech as the inverted speech information.
9. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the comprehensive training method for speech synthesis and anti-spoofing evaluation according to any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the comprehensive training method for speech synthesis and anti-spoofing evaluation according to any one of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211552858.XA (CN115620748B) | 2022-12-06 | 2022-12-06 | Comprehensive training method and device for speech synthesis and false identification evaluation |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN115620748A | 2023-01-17 |
| CN115620748B | 2023-03-28 |
Family

ID=84879698

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211552858.XA (CN115620748B, active) | Comprehensive training method and device for speech synthesis and false identification evaluation | 2022-12-06 | 2022-12-06 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN115620748B (en) |
Citations (11)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CA2233179A1 | 1997-05-21 | 1998-11-21 | AT&T Corp. | Unsupervised HMM adaptation based on speech-silence discrimination |
| CN110060701A | 2019-04-04 | 2019-07-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on VAWGAN-AC |
| US20200365166A1 | 2019-05-14 | 2020-11-19 | International Business Machines Corporation | High-quality non-parallel many-to-many voice conversion |
| WO2021137754A1 | 2019-12-31 | 2021-07-08 | National University of Singapore | Feedback-controlled voice conversion |
| WO2021179717A1 | 2020-03-11 | 2021-09-16 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition front-end processing method and apparatus, and terminal device |
| CN113555023A | 2021-09-18 | 2021-10-26 | Institute of Automation, Chinese Academy of Sciences | Method for joint modeling of voice authentication and speaker recognition |
| WO2021229643A1 | 2020-05-11 | 2021-11-18 | Nippon Telegraph and Telephone Corporation | Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program |
| CN114360583A | 2022-01-05 | 2022-04-15 | Xinjiang University | Voice quality evaluation method based on neural network |
| WO2022142115A1 | 2020-12-31 | 2022-07-07 | Ping An Technology (Shenzhen) Co., Ltd. | Adversarial learning-based speaker voice conversion method and related device |
| CN114882897A | 2022-05-13 | 2022-08-09 | Ping An Technology (Shenzhen) Co., Ltd. | Training of voice conversion model, voice conversion method, device and related equipment |
| CN115273804A | 2022-07-29 | 2022-11-01 | Ping An Technology (Shenzhen) Co., Ltd. | Voice conversion method and device based on coding model, electronic equipment and medium |
Non-Patent Citations (3)

| Title |
| --- |
| SONG Peng; WANG Hao; ZHAO Li: "A voice conversion method using model adaptation" |
| ZHANG Laihong; QIU Bo; LIU Hongyu: "A speech quality assessment algorithm based on dynamic distortion measurement of perceptual features" |
| MIAO Xiaokong; SUN Meng; ZHANG Xiongwei; LI Jiakang; ZHANG Xingyu: "Parameter-conversion-based speech deepfakes and assessment of their threat to voiceprint authentication" |
Also Published As

| Publication number | Publication date |
| --- | --- |
| CN115620748B | 2023-03-28 |
Similar Documents

| Publication | Title |
| --- | --- |
| CN107609572B | Multi-modal emotion recognition method and system based on neural network and transfer learning |
| CN108962237A | Hybrid speech recognition method, device and computer-readable storage medium |
| WO2020098256A1 | Speech enhancement method based on fully convolutional neural network, device, and storage medium |
| WO2021159902A1 | Age recognition method, apparatus and device, and computer-readable storage medium |
| CN109887484A | Speech recognition and speech synthesis method and device based on dual learning |
| CN106952649A | Speaker recognition method based on convolutional neural networks and spectrogram |
| EP4198807A1 | Audio processing method and device |
| CN110033756A | Language identification method, device, electronic equipment and storage medium |
| CN107316635B | Voice recognition method and device, storage medium and electronic equipment |
| CN113314119B | Voice recognition intelligent household control method and device |
| CN110120230B | Acoustic event detection method and device |
| CN112712809B | Voice detection method and device, electronic equipment and storage medium |
| CN111508524B | Method and system for identifying voice source equipment |
| CN102945673A | Continuous speech recognition method with dynamically changing speech command range |
| CN111091809B | Regional accent recognition method and device based on depth feature fusion |
| KR20210052036A | Apparatus with convolutional neural network for obtaining multiple intents and method thereof |
| CN112767927A | Method, device, terminal and storage medium for extracting voice features |
| CN112632248A | Question answering method, device, computer equipment and storage medium |
| CN116935889B | Audio category determining method and device, electronic equipment and storage medium |
| CN115620748B | Comprehensive training method and device for speech synthesis and false identification evaluation |
| CN108847251A | Voice deduplication method, device, server and storage medium |
| CN110767238B | Blacklist identification method, device, equipment and storage medium based on address information |
| CN111652164A | Isolated word sign language identification method and system based on global-local feature enhancement |
| Medikonda et al. | Higher order information set based features for text-independent speaker identification |
| Chakravarty et al. | Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection |
Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |