CN115620748A - Comprehensive training method and device for speech synthesis and authenticity evaluation - Google Patents
Comprehensive training method and device for speech synthesis and authenticity evaluation
- Publication number
- CN115620748A (application CN202211552858.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- loss function
- conversion
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G10L15/063: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Abstract
The invention provides a comprehensive training method and device for speech synthesis and authenticity evaluation. Source speech and target speech are obtained as input corpora; a preset voice converter is trained to perform voice conversion, and a preset inverse voice converter is trained to perform inverse voice conversion; a preset speech authenticity discriminator is trained to perform authenticity discrimination, and a preset speech quality evaluator is trained to perform speech quality evaluation. A voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are fused to construct a target loss function, which is minimized iteratively. The method jointly trains and optimizes the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection, thereby improving the voice conversion effect, making the converted speech detectable and traceable, and hardening speech processing and voiceprint recognition against potential malicious attacks.
Description
Technical Field
The disclosure relates to the technical field of audio processing, and in particular to a comprehensive training method and device for speech synthesis and authenticity evaluation.
Background
With the continuous development of deep synthesis technology, applications such as speech synthesis, video generation, and even digital virtual humans have emerged. Voice conversion is a technology that alters the personal voice characteristics of a source speaker, such as spectrum and prosody, so that they take on the characteristics of a target speaker while the semantic information remains unchanged. Based on voice conversion technology, the voice of a real game player can be converted into that of a game character, or a real voice in social interaction can be converted into that of an entertainment persona or a specific target. Typical conversions include male-to-female, female-to-male, and male-to-male voice conversion.
Currently, voice conversion, speech quality evaluation, and speech authenticity detection are usually handled as independent tasks. The processing flow of a voice conversion task therefore cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllable detection of speech authenticity and the traceability of speech. As a result, converted speech is poorly controllable and traceable in terms of speech quality, conversion effect, and authenticity.
Disclosure of Invention
The embodiments of the disclosure provide at least a comprehensive training method and device for speech synthesis and authenticity evaluation, which jointly train and optimize the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection, thereby improving the voice conversion effect, making the converted speech detectable and traceable, and hardening speech processing and voiceprint recognition against potential malicious attacks.
The embodiments of the disclosure provide a comprehensive training method for speech synthesis and authenticity evaluation, comprising the following steps:
obtaining source speech and target speech as input corpora;
converting the input corpora into corresponding converted speech information through a preset voice converter, and converting the converted speech information into corresponding inverted speech information through a preset inverse voice converter;
determining authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and determining a MOS (Mean Opinion Score) between the inverted speech information and the input corpora through a preset speech quality evaluator;
respectively determining a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator;
and constructing a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function.
In an optional implementation, converting the input corpora into corresponding converted speech information through the preset voice converter, and converting the converted speech information into corresponding inverted speech information through the preset inverse voice converter, specifically includes:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determining the converted source speech and the converted target speech as the converted speech information;
inputting the converted source speech and the target voiceprint embedding vector to the inverse voice converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determining the inverted source speech and the inverted target speech as the inverted speech information.
In an optional implementation, constructing the target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function, specifically includes:
configuring corresponding learnable hyper-parameters to be optimized for the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, respectively;
performing a weighted summation of the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function according to the hyper-parameters to be optimized, to determine the target loss function;
and performing minimization iteration on the target loss function, to realize joint training and optimization of the conversion-inverse-conversion process, the authenticity discrimination process, and the speech quality evaluation process.
In an optional implementation, the voice conversion loss function is determined based on the following steps:
determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, wherein the voice conversion loss function is used to describe the speaker similarity preserved for the source speech and the target speech across the conversion-inverse-conversion process.
In an optional implementation, the authenticity discrimination loss function is determined based on the following steps:
determining a first authenticity score output by the speech authenticity discriminator for the converted source speech and the inverted source speech, and a second authenticity score output by the speech authenticity discriminator for the converted target speech and the inverted target speech;
and performing a normalized exponential (softmax) operation on the first authenticity score and the second authenticity score respectively, and defining the sum of the two scores after that operation as the authenticity discrimination loss function, wherein the authenticity discrimination loss function is used to describe the detectability of speech authenticity.
In an optional implementation, the quality evaluation loss function is determined based on the following steps:
determining, by the speech quality evaluator based on the Perceptual Objective Listening Quality Analysis (POLQA) algorithm, a first MOS score between the source speech and the inverted source speech, and a second MOS score between the target speech and the inverted target speech;
and summing the first MOS score and the second MOS score after negating them, and defining the result as the quality evaluation loss function, wherein the quality evaluation loss function is used to describe the evaluability of speech quality.
The embodiments of the present disclosure further provide a comprehensive training device for speech synthesis and authenticity evaluation, the device comprising:
an obtaining module, configured to obtain source speech and target speech as input corpora;
a conversion-inversion module, configured to convert the input corpora into corresponding converted speech information through a preset voice converter, and to convert the converted speech information into corresponding inverted speech information through a preset inverse voice converter;
an authenticity evaluation module, configured to determine authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and to determine the MOS score between the inverted speech information and the input corpora through a preset speech quality evaluator;
a loss function construction module, configured to respectively determine a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator;
and a training module, configured to construct a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and to perform minimization iteration on the target loss function.
The embodiments of the present disclosure further provide an electronic device comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
The embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
The embodiments of the present disclosure further provide a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the above comprehensive training method for speech synthesis and authenticity evaluation, or of any possible implementation thereof.
According to the comprehensive training method and device for speech synthesis and authenticity evaluation, source speech and target speech are obtained as input corpora; the input corpora are converted into corresponding converted speech information through a preset voice converter, and the converted speech information is converted into corresponding inverted speech information through a preset inverse voice converter; the authenticity scores corresponding to the converted speech information and the inverted speech information are determined through a preset speech authenticity discriminator, and the MOS score between the inverted speech information and the input corpora is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are respectively determined; and a target loss function is constructed from these three loss functions and minimized iteratively. In this way, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; for those of ordinary skill in the art, other related drawings may be derived from them without creative effort.
FIG. 1 is a flowchart of a comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of another comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a comprehensive training device for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, and C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Research shows that voice conversion, speech quality evaluation, and speech authenticity detection are currently handled as independent tasks. The processing flow of a voice conversion task therefore cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllable detection of speech authenticity and the traceability of speech. As a result, converted speech is poorly controllable and traceable in terms of speech quality, conversion effect, and authenticity.
Based on this research, the present disclosure provides a comprehensive training method and device for speech synthesis and authenticity evaluation: source speech and target speech are obtained as input corpora; the input corpora are converted into corresponding converted speech information through a preset voice converter, and the converted speech information is converted into corresponding inverted speech information through a preset inverse voice converter; the authenticity scores corresponding to the converted speech information and the inverted speech information are determined through a preset speech authenticity discriminator, and the MOS score between the inverted speech information and the input corpora is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, an authenticity discrimination loss function corresponding to the speech authenticity discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are respectively determined; and a target loss function is constructed from these three loss functions and minimized iteratively. The three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are thereby jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
To facilitate understanding of the embodiments, the comprehensive training method for speech synthesis and authenticity evaluation disclosed herein is first described in detail. Its execution subject is generally a computer device with certain computing power, for example: a terminal device, which may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device; or a server or other processing device. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to FIG. 1, which shows a flowchart of the comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure, the method includes steps S101 to S105:
S101, obtaining source speech and target speech as input corpora.
In specific implementation, a source speech corresponding to a source speaker and a target speech corresponding to a target speaker are obtained, and the source speech and the target speech are used as input corpora of speech conversion.
Here, the source speech and the target speech serve as the input corpora of a voice conversion system, which converts the personal voice characteristics of the source speaker, such as spectrum and prosody, into those of the target speaker while keeping the semantic information of the source speech unchanged.
A voice conversion system usually comprises a training stage and an inference stage. In the training stage, the source speech of the source speaker and the target speech of the target speaker are first analyzed and their features extracted; the extracted features are then mapped, and model training is finally performed on the mapped features to obtain a voice conversion model. In the inference stage, the source speech to be converted is analyzed and its features extracted and mapped; the mapped features are then transformed by the voice conversion model obtained in the training stage, and the converted features are finally used for speech synthesis to obtain the converted speech.
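The following minimal, runnable sketch illustrates this two-stage split. Every component here, the frame-based feature extractor, the toy MappingModel, and the placeholder vocoder, is a hypothetical stand-in chosen for illustration, not the converter of this disclosure.

```python
# Minimal sketch of the two-stage voice conversion pipeline; all components
# are hypothetical placeholders, not the embodiment's actual modules.
import numpy as np

def extract_features(wave: np.ndarray) -> np.ndarray:
    # Placeholder "analysis": frame the waveform and take log-magnitude spectra.
    frames = wave[: len(wave) // 128 * 128].reshape(-1, 128)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=-1)))

class MappingModel:
    """Stand-in for the trained feature-mapping model."""
    def __init__(self) -> None:
        self.w = 1.0

    def fit(self, src_feats: np.ndarray, tgt_feats: np.ndarray) -> "MappingModel":
        # Toy "training": scale source features toward the target statistics.
        self.w = float(tgt_feats.mean() / (src_feats.mean() + 1e-8))
        return self

    def convert(self, feats: np.ndarray) -> np.ndarray:
        return feats * self.w

def synthesize(feats: np.ndarray) -> np.ndarray:
    # Placeholder "vocoder": inverse FFT of the (phase-less) feature frames.
    return np.fft.irfft(np.expm1(feats), axis=-1).ravel()

# Training stage: analyze both speakers' speech, then fit the mapping.
src_wave, tgt_wave = np.random.randn(16000), np.random.randn(16000)
model = MappingModel().fit(extract_features(src_wave), extract_features(tgt_wave))

# Inference stage: analyze, map, and synthesize the converted speech.
converted = synthesize(model.convert(extract_features(src_wave)))
```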
Optionally, the source speech and the target speech may be collected from the source speaker and the target speaker respectively through audio collection equipment.
S102, converting the input corpora into corresponding converted speech information through a preset voice converter, and converting the converted speech information into corresponding inverted speech information through a preset inverse voice converter.
In a specific implementation, the source speech and the target speech are input to the preset voice converter as the input corpora and converted into the corresponding converted speech information, and the converted speech information is input to the preset inverse voice converter to obtain the inverted speech information resulting from inversely converting the converted speech information.
Here, the converted speech information includes the converted speech output by the voice converter for the source speech and the converted speech output by the voice converter for the target speech; correspondingly, the inverse voice converter outputs the inverted speech information corresponding to the source speech and the inverted speech information corresponding to the target speech.
The preset voice converter may be VITS_VC, a voice-conversion variant of VITS, a conditional variational autoencoder with adversarial learning for end-to-end speech synthesis. VITS_VC is a highly expressive voice conversion model that combines variational inference, normalizing flows, and adversarial training. Instead of the common cascade in which an acoustic model and a vocoder exchange spectral features, VITS_VC models the latent variables stochastically and uses a stochastic duration predictor to improve the diversity of the converted speech, so that the same input speech can yield voices with different pitches and rhythms. The VITS_VC algorithm adopts a non-autoregressive network structure; compared with traditional autoregressive networks, the generation speed is significantly improved, meeting the requirement of high-rate conversion in practical applications.
Furthermore, the inverse voice converter makes it possible to evaluate the speech similarity between the input corpora and the inverted speech information, thereby realizing the controllability and traceability of the speech.
It should be noted that, after the speech authenticity discriminator detects the converted speech information and the inverted speech information as described in step S103, it can further determine whether the voice conversion task was performed by the system's own voice converter, and the inverse conversion of the converted speech information is then carried out.
As a possible implementation, step S102 may be implemented by the following steps S1021 to S1025:
S1021, determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech.
S1022, inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech.
S1023, determining the converted source speech and the converted target speech as the converted speech information.
S1024, inputting the converted source speech and the target voiceprint embedding vector to the inverse voice converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech.
S1025, determining the inverted source speech and the inverted target speech as the inverted speech information.
In a specific implementation, in order to improve the voice conversion effect under the condition of few samples or short utterances, a pretrained voiceprint extractor is used as the speaker encoder for extracting voiceprint embedding vectors from the input corpora.
The voiceprint extractor converts the input corpora into 512-dimensional speaker embedding representations, which are fed to the voice converter.
Optionally, when VITS_VC is selected as the voice converter, VITS supports multi-speaker voice conversion; when applied to a multi-speaker model, the source voiceprint embedding vector corresponding to each source speaker's speech is added to the corresponding module of VITS.
Further, given the source speech of a source speaker and its corresponding source voiceprint embedding vector, the voice converter outputs through the vocoder the converted source speech obtained by converting the source speech, and likewise the converted target speech obtained by converting the target speech.
Further, given the converted source speech output by the voice converter and the target voiceprint embedding vector corresponding to the target speech, the inverse voice converter outputs through the vocoder the inverted source speech obtained by inversely converting the converted source speech, and likewise the inverted target speech obtained by inversely converting the converted target speech. This round trip is sketched below.
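In the PyTorch sketch that follows, SpeakerEncoder and Converter are toy linear modules standing in for the pretrained voiceprint extractor and the VITS_VC-style converter; only the data flow, including which embedding conditions which stage, mirrors the text of steps S1021 to S1025.

```python
# Hedged sketch of the conversion / inverse-conversion round trip (S1021-S1025);
# the modules are illustrative stand-ins, not the embodiment's networks.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in for the pretrained voiceprint extractor (512-dim embeddings)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        return self.proj(mel).mean(dim=1)                  # (B, 512)

class Converter(nn.Module):
    """Stand-in for the voice converter (a second instance acts as the inverse)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 512):
        super().__init__()
        self.net = nn.Linear(n_mels + emb_dim, n_mels)

    def forward(self, mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # Condition every frame on the speaker embedding.
        cond = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, cond], dim=-1))

enc, vc, ivc = SpeakerEncoder(), Converter(), Converter()
src, tgt = torch.randn(1, 100, 80), torch.randn(1, 120, 80)

e_src, e_tgt = enc(src), enc(tgt)     # S1021: voiceprint embedding vectors
conv_src = vc(src, e_src)             # S1022-S1023: converted source speech
inv_src = ivc(conv_src, e_tgt)        # S1024-S1025: inverted source speech
```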
S103, determining authenticity scores corresponding to the converted speech information and the inverted speech information through a preset speech authenticity discriminator, and determining the MOS score between the inverted speech information and the input corpora through a preset speech quality evaluator.
In a specific implementation, the speech authenticity discriminator adopts an end-to-end audio authenticity discrimination method based on a graph-convolution attention network. It extracts full-band and sub-band embedding features of the speech and introduces a fused attention mechanism that effectively exploits the information of three attention sub-modules covering the temporal region, the spectral region, and the channel region. The converted speech information and the inverted speech information are input to the discriminator, which outputs the authenticity scores corresponding to each.
The authenticity score output by the discriminator is normalized to the range [0, 1] and represents the likelihood that the audio is genuine: a score close to 0 indicates forged audio, and a score close to 1 indicates genuine audio.
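As a small illustration of such a score normalization, the sketch below maps raw discriminator logits to [0, 1] with a sigmoid; the sigmoid is an assumed choice, since the disclosure does not state which normalization is applied.

```python
# Assumed sigmoid normalization of raw discriminator outputs to [0, 1].
import torch

logits = torch.tensor([-2.3, 0.1, 4.0])   # raw authenticity-discriminator outputs
scores = torch.sigmoid(logits)             # close to 0: forged; close to 1: genuine
```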
Further, the speech quality evaluator may evaluate the MOS score between the inverted speech information and the input corpora through the Perceptual Objective Listening Quality Analysis (POLQA) algorithm.
Here, the most intuitive measure of a voice conversion system's quality is the quality of the converted audio, commonly expressed as a MOS (Mean Opinion Score) value. However, MOS scoring requires human raters from many fields, incurring expensive labor and time overhead. Therefore, in the training process of the embodiments of the present application, the original input corpora before conversion and the inverted speech information after inverse conversion can be obtained, and the POLQA algorithm can be used to evaluate the speech quality.
Specifically, the POLQA algorithm obtains a POLQA score through filtering, time alignment, sampling-rate estimation, and objective perceptual scoring of a reference signal and a degraded signal, and finally maps the POLQA score to a MOS score. POLQA is a full-reference objective evaluation method: given a reference signal (the lossless signal, here the original speech before conversion), it quantifies the degree of impairment of the degraded signal (here the inversely converted speech) and gives an objective speech quality score close to the subjective score.
The maximum MOS value of POLQA is 4.5 in narrow-band mode and 4.75 in super-wideband mode. Preferably, to define the quality evaluation loss function, the POLQA value is negated.
S104, respectively determining the voice conversion loss function corresponding to the conversion-inverse-conversion process, the authenticity discrimination loss function corresponding to the speech authenticity discriminator, and the quality evaluation loss function corresponding to the speech quality evaluator.
In a specific implementation, the three loss functions are determined respectively.
Here, the voice conversion loss function is the loss function corresponding to the voice converter and the inverse voice converter.
The voice conversion loss function describes the similarity between speakers, the authenticity discrimination loss function describes the detectability of speech authenticity, and the quality evaluation loss function describes the evaluability of speech quality.
As a possible implementation, the voice conversion loss function may be determined based on the following steps 1 to 3:
Step 1, determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
Step 2, determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
Step 3, defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, which describes the speaker similarity preserved for the source speech and the target speech across the conversion-inverse-conversion process.
Specifically, the voice conversion loss function may be constructed based on the following formula:

$$L_{MSE} = L_{MSE\_source} + L_{MSE\_target} = \mathrm{MSE}\big(E(s), E(\hat{s})\big) + \mathrm{MSE}\big(E(t), E(\hat{t})\big)$$

where $L_{MSE}$ denotes the voice conversion loss function; $L_{MSE\_source}$ denotes the loss of the source speech over the conversion and inverse-conversion processes; $L_{MSE\_target}$ denotes the loss of the target speech over the conversion and inverse-conversion processes; $\mathrm{MSE}(\cdot)$ denotes the mean square error; $E(\cdot)$ denotes the computation of the voiceprint embedding vector; $s$ and $t$ denote the source speech and the target speech respectively; and $\hat{s}$ and $\hat{t}$ denote the inverted source speech and the inverted target speech after inverse conversion by the inverse voice converter.
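A minimal PyTorch rendering of this loss follows, assuming E(.) is a frozen speaker encoder that has already produced the four voiceprint embeddings; the random tensors below are placeholders for those embeddings.

```python
# Sketch of L_MSE = MSE(E(s), E(s_hat)) + MSE(E(t), E(t_hat)); the embeddings
# are random placeholders for the speaker encoder's 512-dim outputs.
import torch
import torch.nn.functional as F

def voice_conversion_loss(e_src, e_inv_src, e_tgt, e_inv_tgt):
    # Sum of the two mean square errors between original and inverted embeddings.
    return F.mse_loss(e_inv_src, e_src) + F.mse_loss(e_inv_tgt, e_tgt)

l_mse = voice_conversion_loss(torch.randn(1, 512), torch.randn(1, 512),
                              torch.randn(1, 512), torch.randn(1, 512))
```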
As another possible implementation, the authenticity discrimination loss function may be determined based on the following steps 1 and 2:
Step 1, determining a first authenticity score output by the speech authenticity discriminator for the converted source speech and the inverted source speech, and a second authenticity score output by the speech authenticity discriminator for the converted target speech and the inverted target speech.
Step 2, performing a normalized exponential (softmax) operation on the first authenticity score and the second authenticity score respectively, and defining the sum of the two scores after that operation as the authenticity discrimination loss function, which describes the detectability of speech authenticity.
Specifically, the authenticity discrimination loss function may be constructed based on the following formula:

$$L_{SPOOF} = L_{SPOOF\_vc} + L_{SPOOF\_ivc} = \mathrm{softmax}\big(\mathrm{score}(\tilde{s}, \hat{s})\big) + \mathrm{softmax}\big(\mathrm{score}(\tilde{t}, \hat{t})\big)$$

where $L_{SPOOF}$ denotes the authenticity discrimination loss function; $L_{SPOOF\_vc}$ denotes the loss of the speech authenticity discriminator for processing the converted source speech and the inverted source speech; $L_{SPOOF\_ivc}$ denotes the loss of the speech authenticity discriminator for processing the converted target speech and the inverted target speech; $\mathrm{softmax}(\cdot)$ denotes the normalized exponential operation; $\mathrm{score}(\cdot)$ denotes the authenticity score output by the discriminator; $\tilde{s}$ and $\hat{s}$ denote the converted source speech and the inverted source speech respectively; and $\tilde{t}$ and $\hat{t}$ denote the converted target speech and the inverted target speech respectively.
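The sketch below renders this loss under one assumed reading: the discriminator emits two logits (fake, real) per utterance, and the softmax "real" probability of the converted and inverted audio is penalized so that conversions stay detectable. The exact pairing of inputs is an interpretation of the text, not a verified reproduction of the embodiment.

```python
# Hedged sketch of L_SPOOF; the (fake, real) logit layout is an assumption.
import torch

def authenticity_loss(logits_vc: torch.Tensor, logits_ivc: torch.Tensor) -> torch.Tensor:
    # Normalized exponential of each score, keeping the "real" component so the
    # converter is trained toward outputs the discriminator can still detect.
    p_real_vc = torch.softmax(logits_vc, dim=-1)[..., 1]
    p_real_ivc = torch.softmax(logits_ivc, dim=-1)[..., 1]
    return p_real_vc.mean() + p_real_ivc.mean()

l_spoof = authenticity_loss(torch.randn(4, 2), torch.randn(4, 2))
```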
As another possible implementation, the quality evaluation loss function may be determined based on the following steps 1 and 2:
Step 1, determining, by the speech quality evaluator based on the Perceptual Objective Listening Quality Analysis algorithm, a first MOS score between the source speech and the inverted source speech, and a second MOS score between the target speech and the inverted target speech.
Step 2, summing the first MOS score and the second MOS score after negating them, and defining the result as the quality evaluation loss function, which describes the evaluability of speech quality.
Specifically, the quality evaluation loss function may be constructed based on the following formula:

$$L_{POLQA} = L_{POLQA\_source} + L_{POLQA\_target} = -\mathrm{POLQA}\big(s, \hat{s}\big) - \mathrm{POLQA}\big(t, \hat{t}\big)$$

where $L_{POLQA}$ denotes the quality evaluation loss function; $L_{POLQA\_source}$ denotes the loss of the speech quality evaluator for processing the source speech and the inverted source speech; $L_{POLQA\_target}$ denotes the loss of the speech quality evaluator for processing the target speech and the inverted target speech; $\mathrm{POLQA}(\cdot)$ denotes the computation of the full-reference objective evaluation score based on POLQA; $s$ and $\hat{s}$ denote the source speech and the inverted source speech respectively; and $t$ and $\hat{t}$ denote the target speech and the inverted target speech respectively.
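POLQA itself is a licensed ITU-T P.863 metric without an open reference implementation, so the sketch below substitutes a hypothetical differentiable mos() predictor to show how the two negated scores combine into the loss.

```python
# Sketch of L_POLQA = -MOS(s, s_hat) - MOS(t, t_hat); mos() is a hypothetical
# stand-in for the POLQA score, not the ITU-T P.863 algorithm.
import torch

def mos(reference: torch.Tensor, degraded: torch.Tensor) -> torch.Tensor:
    # Placeholder: approaches 4.5 (the narrow-band ceiling) as the signals match.
    return 4.5 - (reference - degraded).pow(2).mean().clamp(max=4.5)

def quality_loss(src, inv_src, tgt, inv_tgt):
    # Negate the two MOS scores so that higher quality means lower loss.
    return -mos(src, inv_src) - mos(tgt, inv_tgt)

l_polqa = quality_loss(torch.randn(16000), torch.randn(16000),
                       torch.randn(16000), torch.randn(16000))
```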
S105, constructing a target loss function from the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, and performing minimization iteration on the target loss function.
In a specific implementation, the voice conversion loss function describes the similarity between speakers, the authenticity discrimination loss function describes the detectability of speech authenticity, and the quality evaluation loss function describes the evaluability of speech quality. These three loss functions, defined from different dimensions, are combined into the target loss function, and the joint optimization of the whole system is realized through iterative training that minimizes it.
According to the comprehensive training method for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
Referring to FIG. 2, which shows a flowchart of another comprehensive training method for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure, the method includes steps S201 to S203:
S201, configuring corresponding learnable hyper-parameters to be optimized for the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function, respectively.
In a specific implementation, the corresponding hyper-parameters to be optimized are configured for each of the three loss functions.
It should be noted that the hyper-parameter to be optimized corresponding to each loss function may be selected according to actual needs and is not specifically limited here. Preferably, the initial value of each hyper-parameter to be optimized may be set to 1.
S202, performing a weighted summation of the voice conversion loss function, the authenticity discrimination loss function, and the quality evaluation loss function according to the hyper-parameters to be optimized, to determine the target loss function.
Specifically, the target loss function can be constructed by the following formula:

$$L_{total} = \alpha \cdot L_{MSE} + \beta \cdot L_{SPOOF} + \lambda \cdot L_{POLQA}$$

where $L_{total}$ denotes the target loss function; $L_{MSE}$ denotes the voice conversion loss function; $L_{SPOOF}$ denotes the authenticity discrimination loss function; $L_{POLQA}$ denotes the quality evaluation loss function; and $\alpha$, $\beta$, and $\lambda$ denote the learnable hyper-parameters to be optimized.
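Putting the three losses together, the following is a minimal sketch of the weighted objective and a single optimization step. Treating alpha, beta, and lam as learnable tensors initialized to 1 follows the "hyper-parameters to be optimized" wording above and is otherwise an assumption; in practice the parameters of the converter, inverse converter, discriminator, and evaluator would be included in the same optimizer.

```python
# Sketch of L_total = alpha * L_MSE + beta * L_SPOOF + lambda * L_POLQA
# with one minimization step; the scalar loss values are placeholders.
import torch

alpha = torch.tensor(1.0, requires_grad=True)  # initialized to 1 as suggested above
beta = torch.tensor(1.0, requires_grad=True)
lam = torch.tensor(1.0, requires_grad=True)

def total_loss(l_mse, l_spoof, l_polqa):
    return alpha * l_mse + beta * l_spoof + lam * l_polqa

opt = torch.optim.Adam([alpha, beta, lam], lr=1e-3)  # plus model params in practice
loss = total_loss(torch.tensor(0.2), torch.tensor(0.7), torch.tensor(-4.1))
opt.zero_grad()
loss.backward()
opt.step()
```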
S203, performing minimization iteration on the target loss function, to realize joint training and optimization of the conversion-inverse-conversion process, the authenticity discrimination process, and the speech quality evaluation process.
According to the comprehensive training method for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
It will be understood by those skilled in the art that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a comprehensive training device for speech synthesis and authenticity evaluation corresponding to the above method. Since the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Referring to FIG. 3, FIG. 3 is a schematic diagram of a comprehensive training device for speech synthesis and authenticity evaluation provided by an embodiment of the present disclosure. As shown in FIG. 3, the comprehensive training device 300 for speech synthesis and authenticity evaluation includes:
the obtaining module 310 is configured to obtain source speech and target speech as input corpora.
The conversion and inversion module 320 is configured to convert the input corpus into corresponding conversion voice information through a preset voice converter, and convert the conversion voice information into corresponding inversion voice information through a preset voice inverse converter.
The counterfeit discrimination evaluation module 330 is configured to determine the converted speech information and a counterfeit discrimination score corresponding to the inverted speech information by using a preset speech discriminator, and determine an MOS score between the inverted speech information and the input corpus by using a preset speech quality evaluator.
The loss function constructing module 340 is configured to determine a voice conversion loss function corresponding to a voice conversion-inverse conversion process, a voice counterfeit detection loss function corresponding to the voice counterfeit detector, and a quality evaluation loss function corresponding to the voice quality evaluator, respectively.
A training module 350, configured to construct a target loss function according to the speech conversion loss function, the speech discrimination loss function, and the quality evaluation loss function, and perform a minimization iteration on the target loss function.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
According to the comprehensive training device for speech synthesis and authenticity evaluation described above, the three tasks of voice conversion, speech quality evaluation, and speech authenticity detection are jointly trained and optimized, which improves the voice conversion effect, makes the converted speech detectable and traceable, and hardens speech processing and voiceprint recognition against potential malicious attacks.
Corresponding to the comprehensive training method for speech synthesis and authenticity evaluation in FIG. 1 and FIG. 2, an embodiment of the present disclosure further provides an electronic device 400. As shown in FIG. 4, a schematic structural diagram of the electronic device 400, it includes:
a processor 41, a memory 42, and a bus 43. The memory 42 is used for storing execution instructions and includes an internal memory 421 and an external memory 422. The internal memory 421 temporarily stores operation data for the processor 41 and data exchanged with external memory 422 such as a hard disk; the processor 41 exchanges data with the external memory 422 through the internal memory 421. When the electronic device 400 runs, the processor 41 communicates with the memory 42 through the bus 43, causing the processor 41 to execute the steps of the comprehensive training method for speech synthesis and authenticity evaluation shown in FIG. 1 and FIG. 2.
The embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the comprehensive training method for speech synthesis and authenticity evaluation described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure further provide a computer program product comprising computer instructions; when the computer instructions are executed by a processor, the steps of the comprehensive training method for speech synthesis and authenticity evaluation described in the above method embodiments may be performed.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiments and is not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for another example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure may be embodied as a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above-described embodiments are merely specific embodiments of the present disclosure, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (10)
1. A comprehensive training method for speech synthesis and anti-spoofing evaluation, characterized by comprising the following steps:
obtaining source speech and target speech as input corpora;
converting the input corpora into corresponding converted speech information by a preset voice converter, and converting the converted speech information into corresponding inverted speech information by a preset voice inverse converter;
determining anti-spoofing scores corresponding to the converted speech information and the inverted speech information by a preset speech anti-spoofing discriminator, and determining a MOS (mean opinion score) score between the inverted speech information and the input corpora by a preset speech quality evaluator;
respectively determining a voice conversion loss function corresponding to the voice conversion-inversion process, an anti-spoofing loss function corresponding to the speech anti-spoofing discriminator, and a quality assessment loss function corresponding to the speech quality evaluator;
and constructing a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and performing minimization iterations on the target loss function.
2. The method according to claim 1, wherein converting the input corpora into corresponding converted speech information by a preset voice converter, and converting the converted speech information into corresponding inverted speech information by a preset voice inverse converter, specifically comprises:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determining the converted source speech and the converted target speech as the converted speech information;
inputting the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determining the inverted source speech and the inverted target speech as the inverted speech information.
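Purely by way of illustration (and not as part of the claims), the conversion-inversion dataflow recited in claim 2 can be sketched as follows. The PyTorch module, feature sizes, and batch shapes are assumptions, since the claim fixes only the dataflow; the sketch mirrors the claim wording on which voiceprint embedding accompanies which input.

```python
import torch
import torch.nn as nn

FEAT, EMB = 80, 32  # assumed mel-frame and voiceprint-embedding sizes

class Converter(nn.Module):
    """Stand-in for the voice converter / voice inverse converter: maps a
    speech feature frame, conditioned on a voiceprint embedding, to a new
    frame. The disclosure does not fix an architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT + EMB, 128), nn.ReLU(), nn.Linear(128, FEAT))

    def forward(self, speech, voiceprint):
        return self.net(torch.cat([speech, voiceprint], dim=-1))

converter, inverse_converter = Converter(), Converter()

src_speech, tgt_speech = torch.randn(4, FEAT), torch.randn(4, FEAT)
src_emb, tgt_emb = torch.randn(4, EMB), torch.randn(4, EMB)  # voiceprint vectors

# Conversion step: each utterance enters the voice converter together with a
# voiceprint embedding, yielding the converted speech information.
conv_src = converter(src_speech, src_emb)
conv_tgt = converter(tgt_speech, tgt_emb)

# Inversion step: the voice inverse converter receives converted speech
# together with the other voiceprint embedding and produces inverted speech.
inv_src = inverse_converter(conv_src, tgt_emb)
inv_tgt = inverse_converter(conv_tgt, src_emb)
```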
3. The method according to claim 1, wherein constructing a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and performing minimization iterations on the target loss function, specifically comprises:
configuring corresponding learning hyper-parameters to be optimized for the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, respectively;
performing, according to the learning hyper-parameters to be optimized, a weighted summation of the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function to determine the target loss function;
and performing iterative minimization of the target loss function to realize joint training optimization of the voice conversion-inversion process, the speech anti-spoofing process, and the speech quality evaluation process.
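For illustration only, a minimal sketch of the aggregation and minimization in claim 3, assuming three scalar losses; the fixed example weights stand in for the per-loss learning hyper-parameters, and the placeholder losses over a dummy parameter exist only so the minimization loop actually runs.

```python
import torch

def target_loss(l_conv, l_spoof, l_mos, lam=(1.0, 0.5, 0.1)):
    # Weighted summation of the three losses; the weights lam stand in for
    # the learning hyper-parameters and their values here are arbitrary.
    return lam[0] * l_conv + lam[1] * l_spoof + lam[2] * l_mos

# Dummy model parameter so the loop runs; real losses come from claims 4-6.
theta = torch.randn(16, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(200):                  # minimization iterations
    l_conv = theta.pow(2).mean()         # placeholder voice conversion loss
    l_spoof = theta.abs().mean()         # placeholder anti-spoofing loss
    l_mos = -theta.mean()                # placeholder (negated-MOS) quality loss
    loss = target_loss(l_conv, l_spoof, l_mos)
    opt.zero_grad()
    loss.backward()
    opt.step()
```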
4. The method according to claim 2, wherein the voice conversion loss function is determined by the following steps:
determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, wherein the voice conversion loss function describes the speaker similarity between the source speech and the target speech.
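A minimal illustrative sketch of the loss in claim 4, assuming the voiceprint embeddings are fixed-size tensors produced by some speaker encoder not shown here:

```python
import torch
import torch.nn.functional as F

def voice_conversion_loss(src_emb, inv_src_emb, tgt_emb, inv_tgt_emb):
    # First and second mean square errors between the original and inverted
    # voiceprint embeddings; their sum measures speaker similarity.
    mse1 = F.mse_loss(inv_src_emb, src_emb)
    mse2 = F.mse_loss(inv_tgt_emb, tgt_emb)
    return mse1 + mse2

# Example with random 32-dimensional embeddings (batch of 4; sizes assumed):
embs = [torch.randn(4, 32) for _ in range(4)]
print(voice_conversion_loss(*embs))
```

Because the loss compares voiceprint embeddings rather than waveforms, minimizing it pulls the inverted speech back toward the original speakers' identities.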
5. The method according to claim 2, wherein the anti-spoofing loss function is determined by the following steps:
determining a first anti-spoofing score output by the speech anti-spoofing discriminator for the converted source speech and the inverted source speech, and a second anti-spoofing score output by the speech anti-spoofing discriminator for the converted target speech and the inverted target speech;
and applying a normalized exponential (softmax) operation to the first anti-spoofing score and the second anti-spoofing score respectively, and defining the sum of the two normalized scores as the anti-spoofing loss function, wherein the anti-spoofing loss function describes the detectability of the converted speech.
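An illustrative sketch of claim 5's loss, assuming the discriminator emits two-class logits and that index 1 denotes the spoofed class; both are assumptions, since the claim specifies only the normalized exponential (softmax) operation and the summation.

```python
import torch

def anti_spoofing_loss(score1, score2):
    # Normalized exponential (softmax) over two-class discriminator outputs;
    # taking index 1 as the "spoofed" class is an assumption of this sketch.
    p1 = torch.softmax(score1, dim=-1)[..., 1]
    p2 = torch.softmax(score2, dim=-1)[..., 1]
    return (p1 + p2).mean()

# Hypothetical discriminator logits for the (converted, inverted) source and
# target utterances, batch of 4:
first_score = torch.randn(4, 2)
second_score = torch.randn(4, 2)
print(anti_spoofing_loss(first_score, second_score))
```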
6. The method according to claim 4, wherein the quality assessment loss function is determined by the following steps:
determining, by the speech quality evaluator and based on a perceptual objective listening quality assessment algorithm, a first MOS score between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second MOS score between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and negating the sum of the first MOS score and the second MOS score to define the quality assessment loss function, wherein the quality assessment loss function describes the assessability of the speech quality.
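An illustrative sketch of claim 6's loss. The MOS estimator here is a hypothetical differentiable stand-in mapped into the usual [1, 5] range; a real implementation would follow a perceptual objective listening quality algorithm, which is not reproduced here.

```python
import torch

def mos_score(ref, deg):
    # Hypothetical differentiable MOS stand-in in [1, 5]; a real system would
    # use a perceptual objective listening-quality estimator.
    return 5.0 - 4.0 * torch.sigmoid((ref - deg).pow(2).mean())

def quality_assessment_loss(src_emb, inv_src_emb, tgt_emb, inv_tgt_emb):
    # Negate the sum of the two MOS scores so that better perceptual quality
    # yields a lower loss.
    mos1 = mos_score(src_emb, inv_src_emb)
    mos2 = mos_score(tgt_emb, inv_tgt_emb)
    return -(mos1 + mos2)

embs = [torch.randn(4, 32) for _ in range(4)]
print(quality_assessment_loss(*embs))
```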
7. A comprehensive training apparatus for speech synthesis and anti-spoofing evaluation, characterized by comprising:
an acquisition module, configured to obtain source speech and target speech as input corpora;
a conversion-inversion module, configured to convert the input corpora into corresponding converted speech information by a preset voice converter, and to convert the converted speech information into corresponding inverted speech information by a preset voice inverse converter;
an anti-spoofing evaluation module, configured to determine anti-spoofing scores corresponding to the converted speech information and the inverted speech information by a preset speech anti-spoofing discriminator, and to determine a MOS score between the inverted speech information and the input corpora by a preset speech quality evaluator;
a loss function construction module, configured to respectively determine a voice conversion loss function corresponding to the voice conversion-inversion process, an anti-spoofing loss function corresponding to the speech anti-spoofing discriminator, and a quality assessment loss function corresponding to the speech quality evaluator;
and a training module, configured to construct a target loss function from the voice conversion loss function, the anti-spoofing loss function, and the quality assessment loss function, and to perform minimization iterations on the target loss function.
8. The apparatus according to claim 7, wherein the conversion-inversion module is specifically configured to:
determine a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
input the source voiceprint embedding vector and the source speech to the voice converter, and determine converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determine the converted source speech and the converted target speech as the converted speech information;
input the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determine inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determine the inverted source speech and the inverted target speech as the inverted speech information.
9. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the comprehensive training method for speech synthesis and anti-spoofing evaluation according to any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the comprehensive training method for speech synthesis and anti-spoofing evaluation according to any one of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211552858.XA (CN115620748B) | 2022-12-06 | 2022-12-06 | Comprehensive training method and device for speech synthesis and false identification evaluation |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN115620748A | 2023-01-17 |
| CN115620748B | 2023-03-28 |
Family

ID=84879698

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211552858.XA (CN115620748B, active) | Comprehensive training method and device for speech synthesis and false identification evaluation | 2022-12-06 | 2022-12-06 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN115620748B (en) |
Citations (11)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CA2233179A1 | 1997-05-21 | 1998-11-21 | AT&T Corp. | Unsupervised HMM adaptation based on speech-silence discrimination |
| CN110060701A | 2019-04-04 | 2019-07-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on VAWGAN-AC |
| US20200365166A1 | 2019-05-14 | 2020-11-19 | International Business Machines Corporation | High-quality non-parallel many-to-many voice conversion |
| WO2021137754A1 | 2019-12-31 | 2021-07-08 | National University of Singapore | Feedback-controlled voice conversion |
| WO2021179717A1 | 2020-03-11 | 2021-09-16 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition front-end processing method and apparatus, and terminal device |
| CN113555023A | 2021-09-18 | 2021-10-26 | Institute of Automation, Chinese Academy of Sciences | Method for joint modeling of voice authentication and speaker recognition |
| WO2021229643A1 | 2020-05-11 | 2021-11-18 | Nippon Telegraph and Telephone Corporation | Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program |
| CN114360583A | 2022-01-05 | 2022-04-15 | Xinjiang University | Voice quality evaluation method based on neural network |
| WO2022142115A1 | 2020-12-31 | 2022-07-07 | Ping An Technology (Shenzhen) Co., Ltd. | Adversarial learning-based speaker voice conversion method and related device |
| CN114882897A | 2022-05-13 | 2022-08-09 | Ping An Technology (Shenzhen) Co., Ltd. | Training of voice conversion model, voice conversion method, device and related equipment |
| CN115273804A | 2022-07-29 | 2022-11-01 | Ping An Technology (Shenzhen) Co., Ltd. | Voice conversion method and device based on coding model, electronic equipment and medium |
Non-Patent Citations (3)

| Title |
| --- |
| SONG Peng; WANG Hao; ZHAO Li: "A voice conversion method using model adaptation" |
| ZHANG Laihong; QIU Bo; LIU Hongyu: "A speech quality assessment algorithm based on dynamic distortion measurement of perceptual features" |
| MIAO Xiaokong; SUN Meng; ZHANG Xiongwei; LI Jiakang; ZHANG Xingyu: "Parameter-conversion-based speech deepfakes and assessment of their threat to voiceprint authentication" |
Also Published As

| Publication number | Publication date |
| --- | --- |
| CN115620748B | 2023-03-28 |
Similar Documents

| Publication | Title |
| --- | --- |
| CN107609572B | Multi-modal emotion recognition method and system based on neural network and transfer learning |
| CN108962237A | Hybrid speech recognition method, device and computer-readable storage medium |
| WO2020098256A1 | Speech enhancement method based on fully convolutional neural network, device, and storage medium |
| WO2021159902A1 | Age recognition method, apparatus and device, and computer-readable storage medium |
| CN109887484A | Speech recognition and speech synthesis method and device based on dual learning |
| CN106952649A | Speaker recognition method based on convolutional neural networks and spectrogram |
| EP4198807A1 | Audio processing method and device |
| CN110033756A | Language identification method, device, electronic equipment and storage medium |
| CN107316635B | Voice recognition method and device, storage medium and electronic equipment |
| CN113314119B | Voice recognition intelligent household control method and device |
| CN110120230B | Acoustic event detection method and device |
| CN112712809B | Voice detection method and device, electronic equipment and storage medium |
| CN111508524B | Method and system for identifying voice source equipment |
| CN102945673A | Continuous speech recognition method with dynamically changing speech command range |
| CN111091809B | Regional accent recognition method and device based on depth feature fusion |
| KR20210052036A | Apparatus with convolutional neural network for obtaining multiple intents and method thereof |
| CN112767927A | Method, device, terminal and storage medium for extracting voice features |
| CN112632248A | Question answering method, device, computer equipment and storage medium |
| CN116935889B | Audio category determining method and device, electronic equipment and storage medium |
| CN115620748B | Comprehensive training method and device for speech synthesis and false identification evaluation |
| CN108847251A | Voice deduplication method, device, server and storage medium |
| CN110767238B | Blacklist identification method, device, equipment and storage medium based on address information |
| CN111652164A | Isolated word sign language identification method and system based on global-local feature enhancement |
| Medikonda et al. | Higher order information set based features for text-independent speaker identification |
| Chakravarty et al. | Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection |
Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |