CN114974218A - Voice conversion model training method and device and voice conversion method and device - Google Patents

Voice conversion model training method and device and voice conversion method and device

Info

Publication number
CN114974218A
CN114974218A
Authority
CN
China
Prior art keywords
speaker
characteristic
calculating
content
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210554179.XA
Other languages
Chinese (zh)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaoying Innovation Technology Co ltd
Original Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaoying Innovation Technology Co ltd filed Critical Hangzhou Xiaoying Innovation Technology Co ltd
Priority to CN202210554179.XA
Publication of CN114974218A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention relates to a voice conversion model training method and device, and a voice conversion method and device, in the field of voice conversion. The model training method comprises the following steps: acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature from the text data; extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature and a first hidden variable from the first spectral feature; inputting the first hidden variable and the first speaker feature into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain predicted speech. The technique of the invention preserves information such as the speaker's intonation well.

Description

Voice conversion model training method and device and voice conversion method and device
Technical Field
The present invention relates to the field of voice conversion, and in particular, to a method and an apparatus for training a voice conversion model, and a method and an apparatus for voice conversion.
Background
Thanks to the development of deep learning and its application in various fields, voice conversion has also benefited greatly. Voice conversion converts the timbre of a voice: the aim is to change only the speaker's timbre while keeping the content, emotion, intonation, speaking rate and so on consistent with the original audio. For example: given two speakers A and B, where A utters a speech segment S, the function of voice conversion is to convert the timbre of S into the voice of B while the remaining content stays unchanged. According to the data set used for training, voice conversion can be divided into: 1. voice conversion based on parallel corpora; 2. voice conversion based on non-parallel corpora. A parallel corpus means that for each sentence S1 in the data set there is another sentence S2 that differs only in the speaker's timbre, while other information such as content, emotion, intonation and speaking rate is the same. Since such parallel corpora are difficult to obtain, current research focuses on voice conversion with non-parallel corpora.
Content encoding: in conventional voice conversion techniques, contrastive predictive coding features are first extracted from the source audio by means of speech recognition. Contrastive predictive coding generally does not carry speaker, timbre or intonation information from the audio; it mostly carries content information.
Speaker encoding: speaker encoding is a technique for extracting speaker information from audio; a speaker vector is generally extracted using deep learning techniques.
Feature decoding: the content encoding and the speaker encoding are fused through a deep learning network, and a loss is computed against the mel spectrum extracted from real speech.
Vocoder: taking the mel spectrum extracted from real speech as input, neural network models such as WaveNet, Parallel WaveNet and HiFi-GAN are used to predict the real speech waveform. At the inference stage, the input is the mel spectrum converted from the source audio rather than the true mel spectrum.
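As an illustration of the feature pipeline described above, the following is a minimal sketch of mel-spectrogram extraction and a mel reconstruction loss, assuming PyTorch and torchaudio; the sampling rate, FFT size, hop length, mel-band count and file name are illustrative assumptions, not values specified in this disclosure.

```python
# Minimal mel-spectrogram feature extraction feeding a neural vocoder.
# All hyperparameters below are illustrative assumptions.
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,   # assumed sampling rate
    n_fft=1024,          # FFT window size
    hop_length=256,      # frame shift
    n_mels=80,           # number of mel bands
)

waveform, sr = torchaudio.load("speech.wav")        # (channels, samples)
mel = mel_transform(waveform)                       # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))     # log compression

def mel_reconstruction_loss(pred_wave: torch.Tensor, target_log_mel: torch.Tensor) -> torch.Tensor:
    # Compare the mel spectrum of a predicted waveform with the ground-truth mel spectrum.
    pred_mel = torch.log(torch.clamp(mel_transform(pred_wave), min=1e-5))
    frames = min(pred_mel.size(-1), target_log_mel.size(-1))
    return torch.nn.functional.l1_loss(pred_mel[..., :frames], target_log_mel[..., :frames])
```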
The existing technical route is as follows: 1. content encoding is obtained with a speech recognition framework; 2. speaker vectors are extracted from a pre-trained model. In the training phase, the outputs of 1 and 2 are decoded to obtain the mel spectrum of the source audio. In the inference phase, the speaker vector in 2 is replaced with the speaker vector of the target speaker. The disadvantages are that the content recognition depends on a speech recognition model, and the converted audio retains only the content information; intonation and the like cannot be converted.
Disclosure of Invention
To address the defects of the prior art, the invention provides a framework that does not need to rely on speech recognition to encode the content and that retains information such as intonation.
In order to solve the above technical problem, the invention adopts the following technical solution:
a speech conversion model training method comprises the following steps: acquiring first voice and text data with the same content as the first voice, and calculating a first content characteristic by using the text data;
extracting the spectral feature of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable through the first spectral feature;
and calculating and outputting a second speaking characteristic by taking the first speaker characteristic as a condition on the input stream model of the first implicit variable and the first speaker characteristic, calculating a loss function by taking the second speaking characteristic and the first content characteristic, extracting the first implicit variable reaching a preset optimization parameter, and inputting the optimized first implicit variable into a decoder to obtain predicted speech.
Preferably, the specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature comprises:
calculating the first hidden variable from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
Preferably, the specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature comprises:
calculating the first speaker feature from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
Preferably, the flow model comprises a plurality of WaveNet residual blocks and is used to construct the mapping between content features and hidden variables:
content features are converted into hidden variables through the flow model, and hidden variables are converted into content features through the flow model.
Preferably, the method of calculating the first content feature from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
The invention also discloses a voice conversion method, which uses a flow model obtained by training according to the above voice conversion model training method, and further comprises the following steps:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
and inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio.
The method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
Preferably, the speaker information comprises the speaker's timbre.
Preferably, the method of obtaining the first audio feature independent of the source-audio speaker information comprises:
inputting the first hidden variable and the second speaker feature into the flow model and, conditioned on the second speaker feature, calculating and outputting the second speaking feature P1.
The invention also provides a speech conversion model training device, comprising: a main controller, used for acquiring the first voice, calculating the first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among the content encoder, the posterior encoder, the flow model unit and the decoder;
the content encoder, used for acquiring text data with the same content as the first voice and calculating the first content feature from the text data;
the posterior encoder, which receives the first spectral feature and calculates the first hidden variable from the first spectral feature;
the speaker encoder, which receives the first spectral feature and calculates the first speaker feature from the first spectral feature;
the flow model unit, used for receiving the first hidden variable and the first speaker feature, calculating and outputting the second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain the predicted speech.
The present invention also provides a voice conversion apparatus, comprising:
a speech conversion model training apparatus comprising:
the content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain the first audio feature independent of the source speaker information;
the speaker encoder, used for receiving the target speaker's voice to be converted, extracting spectral features of the target speaker's voice, outputting the second spectral feature, and calculating the second speaking feature from the second spectral feature;
the flow model unit, which receives the second speaking feature and the first audio feature and outputs the second hidden variable conditioned on the second speaking feature;
and the decoder, which receives the second hidden variable and outputs the target audio.
The invention has the beneficial effects that:
the invention avoids the defects of the prior art, does not need to rely on a frame of speech recognition to code the content, and can also keep information such as tone and intonation. Because the structure of the invention can well decouple the characteristics of the speaker and the non-speaker in the frequency spectrum, the converted audio frequency reserves other information except the tone color of the speaker.
In addition, the invention is a universal voice conversion technology, which can well convert the source audio of any language. The corresponding relation of the frame level is more precise than the corresponding relation of the phoneme level by extracting the frequency spectrum characteristics, so that even if the training set only comprises a Chinese data set, the English, Japanese, Korean, Western, dialect and the like can be well converted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a speech conversion model training method according to embodiment 1;
fig. 2 is a flowchart of a speech conversion method of embodiment 2.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as limiting it.
Example 1:
A speech conversion model training method is provided, comprising the following steps:
acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature C from the text data;
extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature S1 and a first hidden variable Z1 from the first spectral feature;
inputting the first hidden variable Z1 and the first speaker feature S1 into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature P1; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain the predicted speech.
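The following is a condensed, non-authoritative sketch of one training step that wires these components together, assuming PyTorch and assuming the content encoder, posterior encoder, speaker encoder, flow model and decoder are available as modules. The module names, the use of L1 losses and the omission of shape and length alignment are assumptions made for illustration; the patent does not specify the exact loss form.

```python
# One training step of the model described above (hedged sketch).
import torch
import torch.nn.functional as F

def training_step(batch, content_enc, posterior_enc, speaker_enc, flow, decoder, optimizer):
    text, first_spec, first_wave = batch   # phoneme ids, first spectral feature, waveform

    c = content_enc(text)                  # first content feature C
    s1 = speaker_enc(first_spec)           # first speaker feature S1
    z1 = posterior_enc(first_spec)         # first hidden variable Z1

    # Conditioned on the speaker feature, the flow model maps Z1 to the second
    # speaking feature P1, which is compared with the content feature C
    # (L1 loss here, assuming C and P1 were aligned to the same shape).
    p1 = flow(z1, cond=s1)
    content_loss = F.l1_loss(p1, c)

    # The hidden variable is decoded into predicted speech and compared with the real waveform.
    pred_wave = decoder(z1).squeeze(1)
    recon_loss = F.l1_loss(pred_wave, first_wave)

    loss = content_loss + recon_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```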
The specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature S1 and the first hidden variable Z1 from the first spectral feature comprises the following step:
calculating the first hidden variable Z1 from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
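A minimal sketch of such a posterior encoder built from WaveNet-style residual blocks (dilated 1-D convolution with a gated activation and a residual connection) follows, assuming PyTorch; the channel sizes, dilation pattern and the Gaussian re-parameterisation of the hidden variable are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class WaveNetResidualBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=pad, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                           # x: (batch, channels, frames)
        a, b = self.conv(x).chunk(2, dim=1)
        gated = torch.tanh(a) * torch.sigmoid(b)    # gated activation unit
        return x + self.res(gated)                  # residual connection

class PosteriorEncoder(nn.Module):
    """Maps spectral features to a hidden variable Z (sampled from mean / log-variance)."""
    def __init__(self, spec_dim: int = 513, hidden: int = 192, n_blocks: int = 8):
        super().__init__()
        self.pre = nn.Conv1d(spec_dim, hidden, 1)
        self.blocks = nn.ModuleList(
            [WaveNetResidualBlock(hidden, dilation=2 ** (i % 4)) for i in range(n_blocks)])
        self.proj = nn.Conv1d(hidden, 2 * hidden, 1)

    def forward(self, spec):                         # spec: (batch, spec_dim, frames)
        h = self.pre(spec)
        for block in self.blocks:
            h = block(h)
        mean, logvar = self.proj(h).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)   # sample Z1
        return z
```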
The specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature S1 and the first hidden variable Z1 from the first spectral feature comprises the following step:
calculating the first speaker feature S1 from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
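A minimal sketch of a speaker encoder of this kind is shown below: a Transformer encoder over spectral frames, mean-pooled to a fixed-length speaker feature. PyTorch is assumed, and the layer counts and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, spec_dim: int = 513, d_model: int = 192,
                 n_layers: int = 4, n_heads: int = 2, out_dim: int = 256):
        super().__init__()
        self.proj_in = nn.Linear(spec_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, out_dim)

    def forward(self, spec):                     # spec: (batch, frames, spec_dim)
        h = self.encoder(self.proj_in(spec))     # (batch, frames, d_model)
        return self.proj_out(h.mean(dim=1))      # (batch, out_dim) speaker feature S1
```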
The flow model comprises a plurality of WaveNet residual blocks and is used to construct the mapping between content features and hidden variables: content features are converted into hidden variables through the flow model, and hidden variables are converted into content features through the flow model.
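One way to realize such an invertible mapping is a stack of affine coupling layers conditioned on the speaker feature, as in the hedged sketch below (PyTorch assumed). The coupling design, plain convolutions, channel counts and layer count are assumptions for illustration; the patent describes the blocks only as WaveNet residual blocks. The forward direction is used during training (hidden variable to content-like feature) and the inverse direction during conversion.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels: int, speaker_dim: int, hidden: int = 192):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv1d(half + speaker_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, 3, padding=1))

    def _scale_shift(self, xa, cond):
        cond = cond.unsqueeze(-1).expand(-1, -1, xa.size(-1))   # broadcast speaker feature over frames
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        return torch.tanh(log_s), t

    def forward(self, x, cond):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self._scale_shift(xa, cond)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=1)

    def inverse(self, y, cond):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self._scale_shift(ya, cond)
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], dim=1)

class FlowModel(nn.Module):
    def __init__(self, channels: int = 192, speaker_dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(channels, speaker_dim) for _ in range(n_layers)])

    def forward(self, z, cond):                  # hidden variable -> content-like feature
        for layer in self.layers:
            z = layer(z, cond).flip(dims=[1])    # flip channels so both halves get transformed
        return z

    def inverse(self, c, cond):                  # content feature -> hidden variable
        for layer in reversed(self.layers):
            c = layer.inverse(c.flip(dims=[1]), cond)
        return c
```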
As a preferable solution, the method of calculating the first content feature C from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
The first step: pinyin is obtained from the text with a grapheme-to-phoneme tool, and the initials and finals of the pinyin are then split to obtain the phonemes corresponding to the text.
The second step: all phonemes form a phoneme dictionary, the size of the phoneme dictionary is used as the dimension of the embedding layer, and the phonemes of the text are embedded.
The third step: the embedded features are encoded by a CBHG module, which comprises a one-dimensional convolutional filter bank, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
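A compact sketch of this content-encoder path (grapheme-to-phoneme conversion, phoneme embedding, and a simplified CBHG-style encoder with a convolution bank, highway layers and a bidirectional GRU) follows. The pypinyin dependency, the dictionary handling and all layer dimensions are assumptions, not components named in this disclosure, and the initial/final split is abbreviated to whole pinyin syllables.

```python
import torch
import torch.nn as nn
from pypinyin import lazy_pinyin          # assumed grapheme-to-phoneme tool

def text_to_phonemes(text: str) -> list[str]:
    # Pinyin syllables stand in for the initial/final split described above.
    return lazy_pinyin(text)

class Highway(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.t(x))
        return gate * torch.relu(self.h(x)) + (1.0 - gate) * x

class CBHGContentEncoder(nn.Module):
    def __init__(self, n_phonemes: int, dim: int = 192, bank_size: int = 8):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)    # dictionary size sets the embedding table
        self.conv_bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in range(1, bank_size + 1)])
        self.proj = nn.Conv1d(dim * bank_size, dim, 3, padding=1)
        self.highway = nn.Sequential(Highway(dim), Highway(dim))
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                   # (batch, time)
        x = self.embed(phoneme_ids)                   # (batch, time, dim)
        h = x.transpose(1, 2)                         # (batch, dim, time)
        bank = [conv(h)[..., : h.size(-1)] for conv in self.conv_bank]   # trim even-kernel outputs
        h = self.proj(torch.cat(bank, dim=1)).transpose(1, 2)
        h = self.highway(h + x)                       # residual into highway layers
        out, _ = self.gru(h)                          # (batch, time, dim) content feature C
        return out
```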
Preferably, the decoder has the same structure as the HiFi-GAN generator.
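Under that assumption, a minimal HiFi-GAN-style generator used as the decoder might look like the sketch below: transposed-convolution upsampling followed by residual convolution stacks that turn the hidden-variable sequence into a waveform. The upsampling rates and channel widths are illustrative, and the multi-receptive-field fusion of the full HiFi-GAN design is omitted.

```python
import torch
import torch.nn as nn

class ResStack(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5)])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.nn.functional.leaky_relu(x, 0.1))
        return x

class Decoder(nn.Module):
    def __init__(self, in_dim: int = 192, base: int = 256, up_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(in_dim, base, 7, padding=3)
        ups, res = [], []
        ch = base
        for r in up_rates:
            ups.append(nn.ConvTranspose1d(ch, ch // 2, 2 * r, stride=r, padding=r // 2))
            ch //= 2
            res.append(ResStack(ch))
        self.ups = nn.ModuleList(ups)
        self.res = nn.ModuleList(res)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, z):                        # z: (batch, in_dim, frames)
        h = self.pre(z)
        for up, stack in zip(self.ups, self.res):
            h = stack(up(torch.nn.functional.leaky_relu(h, 0.1)))
        return torch.tanh(self.post(h))          # (batch, 1, samples)
```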
Example 2:
a speech conversion method includes a stream model trained according to the speech conversion model training method disclosed in embodiment 1, wherein a first speech is a source speech, and a second speech is a target speech, i.e., first speaker information with the first speech removed is converted into a target speech with target speaker speech information of the second speech.
The method comprises the following steps:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
and inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio.
The method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
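Putting the pieces together, a hedged sketch of this inference path is given below, reusing the hypothetical module names from the earlier sketches; shapes, transposes and call signatures are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def convert(source_text_phonemes, target_spec, content_enc, speaker_enc, flow, decoder):
    # First audio feature P1: content information independent of the source speaker.
    p1 = content_enc(source_text_phonemes)              # (batch, frames, dim)

    # Second speaking feature S2 from the target speaker's spectral features.
    s2 = speaker_enc(target_spec)                       # (batch, speaker_dim)

    # Conditioned on S2, the inverse flow maps P1 to the second hidden variable Z2.
    z2 = flow.inverse(p1.transpose(1, 2), cond=s2)      # (batch, dim, frames)

    # The decoder turns Z2 into the target audio waveform.
    return decoder(z2)                                  # (batch, 1, samples)
```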
Example 3
A speech conversion model training apparatus comprising:
the main controller is used for acquiring the first voice, calculating the first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among the content encoder, the posterior encoder, the flow model unit and the decoder;
the content encoder is used for acquiring the text data with the same content as the first voice and calculating the first content feature from the text data;
the text content of the source speaker's voice is encoded through a deep learning model to obtain content encoding information independent of the source speaker information;
the posterior encoder receives the first spectral feature and calculates the first hidden variable from the first spectral feature;
the speaker encoder receives the first spectral feature and calculates the first speaker feature from the first spectral feature;
the flow model unit is used for receiving the first hidden variable and the first speaker feature, calculating and outputting the second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain the predicted speech.
The trained flow model unit realizes the conversion between content features and hidden variables; that is, a content feature can be input to the flow model unit, and the flow model unit outputs a hidden variable.
Example 4
A speech conversion model training apparatus comprising:
a content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain content encoding information independent of the source speaker information;
a target encoder, used for extracting a target speaker feature vector from the target speaker's voice S2;
a flow model unit, which in this embodiment is a trained flow model unit that realizes the conversion between content features and hidden variables;
and a decoder, which takes the hidden variable as input and outputs the audio of the target speaker.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed.
The units may or may not be physically separate, and components displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention may be essentially or partially contributed to by the prior art, or all or part of the technical solution may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A speech conversion model training method, characterized by comprising the steps of:
acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature from the text data;
extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature and a first hidden variable from the first spectral feature;
inputting the first hidden variable and the first speaker feature into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain predicted speech.
2. The speech conversion model training method according to claim 1, wherein extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature specifically comprises:
calculating the first hidden variable from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
3. The speech conversion model training method according to claim 1, wherein extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature specifically comprises:
calculating the first speaker feature from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
4. The speech conversion model training method according to claim 1, wherein the flow model comprises a plurality of WaveNet residual blocks and is used to construct a mapping between content features and hidden variables,
content features being converted into hidden variables through the flow model, and hidden variables being converted into content features through the flow model.
5. The speech conversion model training method according to claim 1, wherein calculating the first content feature from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
6. A speech conversion method, comprising a flow model trained by the speech conversion model training method according to any one of claims 1 to 5, and further comprising the steps of:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio;
wherein the method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
7. The speech conversion method according to claim 6, wherein the speaker information comprises the speaker's timbre.
8. The speech conversion method according to claim 6, wherein the method of obtaining the first audio feature independent of the source-audio speaker information comprises:
inputting the first hidden variable and the second speaker feature into the flow model and, conditioned on the second speaker feature, calculating and outputting the second speaking feature P1.
9. A speech conversion model training apparatus, characterized by comprising: a main controller, used for acquiring a first voice, calculating a first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among a content encoder, a posterior encoder, a flow model unit and a decoder;
the content encoder, used for acquiring the text data with the same content as the first voice and calculating a first content feature from the text data;
the posterior encoder, which receives the first spectral feature and calculates a first hidden variable from the first spectral feature;
a speaker encoder, which receives the first spectral feature and calculates a first speaker feature from the first spectral feature;
the flow model unit, used for receiving the first hidden variable and the first speaker feature, calculating and outputting a second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain predicted speech.
10. A speech conversion apparatus, characterized by comprising:
a speech conversion model training apparatus comprising:
a content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain a first audio feature independent of the source speaker information;
a speaker encoder, used for receiving the target speaker's voice to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature from the second spectral feature;
a flow model unit, which receives the second speaking feature and the first audio feature and, conditioned on the second speaking feature, outputs a second hidden variable;
and a decoder, which receives the second hidden variable and outputs the target audio.
Application CN202210554179.XA, filed 2022-05-20 (priority date 2022-05-20): Voice conversion model training method and device and voice conversion method and device. Status: Pending. Published as CN114974218A (en).

Priority Applications (1)

Application Number: CN202210554179.XA; Priority Date: 2022-05-20; Filing Date: 2022-05-20; Title: Voice conversion model training method and device and voice conversion method and device

Applications Claiming Priority (1)

Application Number: CN202210554179.XA; Priority Date: 2022-05-20; Filing Date: 2022-05-20; Title: Voice conversion model training method and device and voice conversion method and device

Publications (1)

Publication Number: CN114974218A; Publication Date: 2022-08-30

Family

ID=82984912

Family Applications (1)

Application Number: CN202210554179.XA; Title: Voice conversion model training method and device and voice conversion method and device; Status: Pending; Publication: CN114974218A (en)

Country Status (1)

Country Link
CN (1) CN114974218A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN115631275B (en) * 2022-11-18 2023-03-31 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination