CN113470622B - Conversion method and device capable of converting any voice into multiple voices - Google Patents

Conversion method and device capable of converting any voice into multiple voices

Publication number
CN113470622B
Authority
CN
China
Prior art keywords
network
feature
fundamental frequency
channel
variance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111035937.9A
Other languages
Chinese (zh)
Other versions
CN113470622A (en)
Inventor
曹艳艳
陈佩云
高君效
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202111035937.9A priority Critical patent/CN113470622B/en
Publication of CN113470622A publication Critical patent/CN113470622A/en
Application granted granted Critical
Publication of CN113470622B publication Critical patent/CN113470622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A conversion method and device for converting an arbitrary voice into a plurality of voices. The conversion method comprises the steps of: preparing corpora of a plurality of target speakers as training corpora; extracting the ppg features of each training corpus; obtaining comprehensive characteristics; acquiring the encoding characteristics of the target speakers in the training set to obtain a mean simulation feature γ and a variance simulation feature β; and training a conversion model capable of converting the comprehensive characteristics into Mel features. The mean simulation feature γ and the variance simulation feature β serve as the style input of the conversion model, the comprehensive characteristics serve as its content input, and the Mel spectra of different speakers are decoded, thereby synthesizing different voices. The invention better decouples the spoken-content information and reduces the influence of inaccurate ppg features extracted by the speech recognition model on voice conversion.

Description

Conversion method and device capable of converting any voice into multiple voices
Technical Field
The invention belongs to the technical field of voice synthesis, and particularly relates to a conversion method and a conversion device capable of converting any voice into a plurality of voices.
Background
Voice conversion is a technology that converts source voice data into the voice data of a specified speaker while keeping the spoken content consistent. Traditional voice-changing techniques turn the original audio into a machine-like sound through speech signal processing, for example by adjusting the pitch and the speaking rate, and the conversion mode is limited. Unlike traditional voice changing, voice conversion can control the emotion, prosody and other characteristics of the target voice while guaranteeing that the spoken content remains the same. Voice conversion can be used in scenarios such as virtual anchors, voice reshaping, prosody/emotion conversion and speaking-style conversion.
According to the training data supplied, voice conversion can be divided into parallel-data conversion and non-parallel-data conversion. Parallel-data conversion requires different speakers to provide recordings of the same utterances, which is difficult to satisfy in practice. More and more studies therefore investigate voice conversion on non-parallel data, and the application of deep learning has greatly improved its conversion quality. According to application requirements, voice conversion can also be classified as one-to-one, many-to-many, one-to-many and so on, where one-to-many converts one person's voice into the voices of several people. Deep-learning approaches to voice conversion mainly include: methods based on adversarial learning (CycleGAN, StarGAN and the like), and methods based on a speech recognition system, which use a speech recognition model to extract speaker-independent information, the phonetic posteriorgram (ppg), train a conversion model from ppg to audio features to obtain the target speaker's voice information, and feed the result to a vocoder to obtain the converted audio data.
Adversarial-learning-based methods can achieve good results when converting within the training set, but their drawback is that only the voices of speakers in the training set can be converted. Methods based on a speech recognition model, by contrast, can convert any timbre, but they rely on the accuracy of the speech recognition.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention discloses a conversion method and a conversion device capable of converting any voice into a plurality of voices.
The invention discloses a conversion method capable of converting any voice into a plurality of voices, which comprises a training method and a synthesis method, wherein the training method comprises the following steps:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating fundamental frequency features f0 of the training corpus audio data, taking logarithmic value logf0 to obtain fundamental frequency logarithmic features lf0, calculating voiced and unvoiced sound marks of the audio data, and splicing the fundamental frequency logarithmic features and the unvoiced and voiced sound marks to obtain fundamental frequency-unvoiced and voiced sound splicing features lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
the conversion model adopts a coder-decoder network framework, and comprises a coding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
Preferably, the updating in step 7 is performed by gradient descent with back-propagation.
Preferably, the corpus in step 1 includes corpora of different languages; in step 2, the ppg features of each language are extracted and then spliced, after which step 3 is performed.
Preferably, in step 2, WeNet is used to extract the ppg features of the corpus.
Preferably, the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
Preferably, the synthesis method comprises the following steps:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature;
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring the encoding characteristics of a speaker of a target speaker, and transforming the encoding characteristics of the speaker through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; converting the Mel characteristic input vocoder into audio;
the first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
The invention also discloses a conversion device capable of converting any voice into a plurality of voices, comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker coding extraction module. The ppg feature extraction module is used for extracting ppg features, and the LF0 feature extraction module is used for extracting the fundamental frequency-unvoiced and voiced sound splicing feature. The ppg feature extraction module and the LF0 feature extraction module are connected to a first preprocessing network and a second preprocessing network respectively; the first preprocessing network and the second preprocessing network are further connected to two input ends of an adder, and the adder adds the features received at its input ends;
the output end of the adder is connected with the content input end of the conversion model;
the speaker code extraction module is used for extracting speaker code features, and is connected with an affine layer which is connected with the style input end of the conversion model; the output end of the conversion model is connected with a vocoder;
the first preprocessing network and the second preprocessing network comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
the LF0 feature extraction module extracts fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculates mean and variance, and performs linear mapping according to a formula to obtain mapped features LF 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature.
Preferably, a splicing module is further connected between the ppg feature extraction module and the first preprocessing network.
The invention can better decouple the speaking-content information and reduce the influence of inaccurate ppg features extracted by the speech recognition model on voice conversion. Combining the ppg feature with the added fundamental frequency-unvoiced and voiced sound splicing feature improves the handling of audio detail information during voice conversion; in particular, for cross-lingual conversion, artifacts such as Chinese-accented English are noticeably reduced.
By adding the speaker code to the conversion model, the invention can convert any voice into the voices of multiple speakers; provided the speaker coding model is trained well enough, conversion from arbitrary input speech can be achieved.
Drawings
FIG. 1 is a schematic flow chart of a conversion method according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of a first pre-processing network according to the present invention, and a second pre-processing network may also adopt the structure shown in FIG. 2;
FIG. 3 is a schematic diagram of an embodiment of step 5 and step 6 according to the present invention;
fig. 4 is a schematic diagram of an embodiment of the conversion device according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The method for converting any voice into a plurality of voices, as shown in fig. 1, includes the following steps:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
the target speaker is a conversion target at the time of voice conversion, that is, it is desired to convert an arbitrary audio into an audio having the same characteristics as the voice of the target speaker.
Building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating a fundamental frequency feature f0 of the audio data of the training sample, taking a logarithmic value logf0 to obtain a fundamental frequency logarithmic feature lf0, calculating a voiced and unvoiced label of the audio data, and splicing the fundamental frequency logarithmic feature and the voiced and unvoiced label to obtain a fundamental frequency-unvoiced and voiced splicing feature lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
In step 1, the conversion model adopts an encoder-decoder network framework, in which the encoding network (encoder) encodes the comprehensive characteristics and the decoding network (decoder) decodes the encoder output to obtain the corresponding Mel features. The encoder-decoder network is an existing general framework: the model input is first encoded and then decoded into the target output, and when it is applied, the network layers used by the encoder and the decoder need to be selected and designed specifically.
The decoder network of the invention may comprise convolution layers, activation layers and the like, with each convolution layer followed by an activation layer; an adaptive instance normalization layer (AdaIN for short) is embedded after the activation layer and, as shown in Fig. 3, takes the mean simulation feature γ and the variance simulation feature β as the style input of the conversion model.
An adaptive instance normalization layer is used in style migration, where the adaptive instance normalization layer inputs include content inputs and style inputs, and channel-wise means and standard deviations of the content inputs are matched to channel-wise means and standard deviations of the style inputs.
The calculation method of the adaptive instance normalization layer of the decoder network may be as in equation (4).
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
The mean µ_c and variance σ_c² are calculated in the same way as in formulas (1) and (2).
The global speaker characteristics, namely the mean simulation feature γ and the variance simulation feature β, are embedded into the decoding network through equation (4), thereby enabling conversion to multiple speakers.
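For illustration, here is a minimal PyTorch-style sketch of an adaptive instance normalization operation of the kind described by equation (4); the function and argument names are illustrative assumptions, not identifiers from the patent.

import torch


def adaptive_instance_norm(content, gamma, beta, eps=1e-5):
    """Equation (4): normalize each channel of the content feature map to zero
    mean and unit variance, then rescale with the speaker-derived gamma (scale)
    and beta (shift).

    content: (batch, channels, frames) decoder feature map
    gamma, beta: (batch, channels) produced by the affine layers of step 5
    """
    mu = content.mean(dim=-1, keepdim=True)                   # per-channel mean, eq. (1)
    var = content.var(dim=-1, unbiased=False, keepdim=True)   # per-channel variance, eq. (2)
    normalized = (content - mu) / torch.sqrt(var + eps)       # eq. (3)
    return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)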
In one embodiment, step 2 may use WeNet (a production-oriented speech recognition toolkit open-sourced by the Mobvoi speech team together with the speech laboratory of Northwestern Polytechnical University) to extract the ppg (phonetic posteriorgram) features of the corpus; the ppg features correspond to the output of the encoder layer of the WeNet model.
When conversion among multiple languages is involved, the corresponding ppg features are obtained through the WeNet model of each language, the multi-language ppg features are spliced, and step 3 is then performed.
In step 3, the obtained ppg features are fed into the first preprocessing network prenet1 for processing. In the prenet1 network, an instance normalization layer (IN) is added after each one-dimensional convolution layer and activation layer, and several such blocks are cascaded, as shown in Fig. 2; the result of the prenet1 processing is denoted prenet1_out.
The prenet1 network uses one-dimensional convolutions, and the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
where M_c denotes the c-th channel of the feature map, W is the dimension of each channel, and M_c[n] is the value of the n-th dimension of channel M_c; formulas (1) and (2) give the mean µ_c and variance σ_c² of each channel. The normalized feature map is then obtained as:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
the stable constant epsilon is a small constant value, so that the numerical value after normalization is prevented from being unstable. And sending the normalized characteristic value to a subsequent network model. The audio content information can be better decoupled through the processing of the step.
And 4, calculating a fundamental frequency feature f0 of the audio data of the training sample, taking a logarithmic value logf0 to obtain a fundamental frequency logarithmic feature lf0, calculating a voiced and unvoiced label of the audio data, and splicing the fundamental frequency logarithmic feature and the voiced and unvoiced label to obtain a fundamental frequency-unvoiced and voiced splicing feature lf 0-uv.
The fundamental frequency feature f0 can be calculated with reference to M. Morise, H. Kawahara and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," AES 35th International Conference, CD-ROM Proceedings, Feb. 2009.
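The lf0-uv feature of step 4 can be sketched as follows; this example assumes the pyworld package (a WORLD/DIO implementation) as the F0 estimator and an illustrative frame period, and is not the patent's own implementation.

import numpy as np
import pyworld as pw  # assumed F0 estimator; any reliable F0 extractor would do


def lf0_uv_features(audio, sample_rate, frame_period_ms=10.0):
    """Step 4 sketch: F0 -> log F0 (lf0) and voiced/unvoiced flag (uv),
    concatenated into the lf0-uv splicing feature of shape (frames, 2)."""
    audio = audio.astype(np.float64)
    f0, t = pw.dio(audio, sample_rate, frame_period=frame_period_ms)
    f0 = pw.stonemask(audio, f0, t, sample_rate)   # refine the coarse DIO estimate

    uv = (f0 > 0).astype(np.float32)               # 1 = voiced frame, 0 = unvoiced
    lf0 = np.zeros_like(f0, dtype=np.float32)
    lf0[f0 > 0] = np.log(f0[f0 > 0])               # log F0 on voiced frames only

    return np.stack([lf0, uv], axis=-1)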
The fundamental frequency-unvoiced and voiced sound splicing feature lf0-uv is fed into the second preprocessing network prenet2, whose output is denoted prenet2_out. The number of output channels of prenet2 must match the number of output channels of prenet1, and the down-sampling rate of prenet2 must match the down-sampling rate of WeNet. The outputs of the two preprocessing networks, prenet1_out and prenet2_out, are then added to obtain the comprehensive characteristic.
Preferably, the second preprocessing network prenet2 consists of several one-dimensional convolutional layers and is structurally identical to the first preprocessing network prenet1. Fig. 2 shows a specific embodiment of the first preprocessing network; the second preprocessing network can also be implemented in the manner shown in Fig. 2.
In step 5, the speaker code of each target speaker in the training set is acquired; it is extracted with a dedicated neural network. As shown in Fig. 3, the speaker coding feature is transformed by two affine layers to obtain the mean simulation feature γ and the variance simulation feature β, which simulate the mean and the variance of the style feature, respectively. The models and algorithms for extracting the speaker code, and the affine-layer transformation that produces the mean simulation feature γ and the variance simulation feature β, are prior art.
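A minimal sketch of the affine transformation of step 5, assuming a 256-dimensional speaker embedding as in the embodiment below; two linear layers map the speaker code to the mean simulation feature γ and the variance simulation feature β.

import torch.nn as nn


class StyleAffine(nn.Module):
    """Step 5 sketch: speaker embedding -> (gamma, beta) for the AdaIN layers."""

    def __init__(self, speaker_dim=256, channels=256):
        super().__init__()
        self.to_gamma = nn.Linear(speaker_dim, channels)  # mean simulation feature
        self.to_beta = nn.Linear(speaker_dim, channels)   # variance simulation feature

    def forward(self, speaker_embedding):        # (batch, speaker_dim)
        return self.to_gamma(speaker_embedding), self.to_beta(speaker_embedding)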
Step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function from the converted Mel feature and the original Mel feature; the loss function is typically the difference between the converted Mel feature and the original Mel feature.
Step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
the method for updating each parameter in the first preprocessing network, the second preprocessing network, the affine layer and the conversion model by using the loss function is the prior art, and usually adopts gradient descent and reverse conduction modes for updating.
And 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
When the loss function value no longer decreases, or decreases only marginally, the loss function can be considered to have converged and training is complete. After training, the resulting first preprocessing network, second preprocessing network, affine layer and conversion model are used for subsequent speech synthesis.
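Steps 2 to 7 can be summarized by the following hedged training-step sketch; prenet1, prenet2, style_affine and converter stand for the networks described above (converter being the encoder-decoder conversion model), and the L1 loss is one reasonable choice for the Mel-feature difference.

import torch.nn.functional as F


def train_step(batch, prenet1, prenet2, style_affine, converter, optimizer):
    """One training iteration over a batch of (ppg, lf0-uv, speaker embedding, Mel)."""
    ppg, lf0_uv, speaker_emb, mel_target = batch

    content = prenet1(ppg) + prenet2(lf0_uv)       # step 4: comprehensive feature
    gamma, beta = style_affine(speaker_emb)        # step 5: style parameters
    mel_pred = converter(content, gamma, beta)     # step 6: converted Mel feature

    loss = F.l1_loss(mel_pred, mel_target)         # difference to the original Mel
    optimizer.zero_grad()
    loss.backward()                                # step 7: back-propagation
    optimizer.step()                               # gradient descent update
    return loss.item()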
The synthesis method comprises the following steps:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature (a code sketch of this mapping is given after the synthesis steps);
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring a speaker code of a target speaker, and transforming the speaker code through an affine layer to obtain a mean value simulation feature gamma and a variance simulation feature beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; the mel features are input into the vocoder and converted into audio.
The first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
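The log-F0 mapping of equation (5), used in step S10, can be sketched as follows; the sigma values are treated as standard-deviation-like scales and the voiced-frame masking is an assumption of this sketch rather than something the patent specifies.

import numpy as np


def map_lf0(lf0_source, mu_t, sigma_t):
    """Equation (5) sketch: map the source log-F0 statistics onto the target
    speaker's statistics (mu_t, sigma_t), leaving unvoiced frames untouched."""
    voiced = lf0_source > 0
    mu_s = lf0_source[voiced].mean()
    sigma_s = lf0_source[voiced].std()

    lf0_mapped = lf0_source.copy()
    lf0_mapped[voiced] = (lf0_source[voiced] - mu_s) / sigma_s * sigma_t + mu_t
    return lf0_mapped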
The invention also discloses a conversion device capable of converting any voice into a plurality of voices, comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker code extraction module, which are used respectively to extract the ppg features of step 3, the lf0-uv features of step 4 and the speaker code of step 5.
The ppg characteristic extraction module and the LF0 characteristic extraction module are respectively connected with a first preprocessing network and a second preprocessing network, the first preprocessing network and the second preprocessing network are also connected with two input ends of an adder, and the output end of the adder is connected with the content input end of the conversion model;
the adder outputs the comprehensive features to a conversion model, the speaker code extraction module is connected with an affine layer, and the affine layer is connected with a style input end of the conversion model; and the output end of the conversion model is connected with the vocoder.
The instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
The normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an active layer, wherein the convolution layer is connected with the active layer, and the active layer is connected with an adaptive instance normalization layer.
The adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5.
By adopting the device, the voice conversion method can be realized. When the PPG feature extraction module is used in multiple languages, a splicing module is connected between the PPG feature extraction module and the first preprocessing network, and the PPG features of different languages are spliced and then input into the first preprocessing network.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Prepare the training corpus: multi-speaker Chinese audio data and multi-speaker English audio data, with speaker identifiers labeled.
The ppg features are extracted with WeNet (a production-oriented speech recognition toolkit open-sourced by the Mobvoi speech team together with the speech laboratory of Northwestern Polytechnical University); the ppg features correspond to the output of the encoder layer of the WeNet model.
When conversion among multiple languages is involved, the corresponding ppg features are obtained through the WeNet model of each language and the multi-language ppg features are spliced. The WeNet model can be trained from scratch, or a publicly released pre-trained model can be used for ppg extraction. When the WeNet model performs speech recognition it down-samples the audio features; in this embodiment the down-sampling factor is set to 4.
The spliced Chinese-English mixed ppg features are fed into the prenet1 network. In this embodiment prenet1 uses three one-dimensional convolutional layers, each followed by an activation layer and an instance normalization layer (IN layer). After the three convolution blocks, the result is added to the ppg values obtained in step 2 to give the prenet1 output prenet1_out.
The lf0-uv features are extracted with a frame length consistent with that of the acoustic features fed into the recognition model in step 1. The lf0-uv features are sent into the prenet2 network, which likewise uses three blocks of one-dimensional convolution + ReLU + IN layers, with the number of output channels of its last layer set to the same value as that of prenet1.
The convolution strides of the last two layers are set to [2, 2], so that after down-sampling the output has the same frame dimension as the ppg features of step 2. The output, denoted prenet2_out, is added to prenet1_out.
In this embodiment, the method of "Generalized End-to-End Loss for Speaker Verification" (L. Wan, Q. Wang et al., ICASSP 2018) is adopted to train a deep-learning multi-speaker coding model. The speaker of each audio recording is encoded as a 256-dimensional vector, which is passed through two fully connected layers acting as the affine layers to obtain the mean simulation feature γ and the variance simulation feature β.
The encoder-decoder network in this embodiment follows the paper "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search" (Jaehyeon Kim, Sungwon Kim et al., NeurIPS 2020): the input of the encoder is replaced by the output of step 4, speaker information is added to the decoder, and the AdaIN layer of formula (4) is embedded in the decoder's coupling layers. The decoder output is the Mel spectrogram feature.
In this embodiment, a WaveRNN vocoder model as described in "Efficient Neural Audio Synthesis" (N. Kalchbrenner, E. Elsen, K. Simonyan, S. Dieleman, K. Kavukcuoglu et al., ICML 2018) is used as the vocoder.
Once the encoder-decoder network and the other relevant models have been obtained through the above steps, for any source audio signal it suffices to extract its Chinese and English ppg features and perform model inference as described. The converted lf0 feature is computed from the source audio and the target speaker's audio according to formula (5), and the target speaker code is obtained as in step 5; the source audio can then be converted into the target speaker's voice through the encoder-decoder network and the vocoder model.
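Putting the pieces together, a hedged end-to-end inference sketch for this embodiment might look as follows; every model object passed in (ppg_extractor, prenet1, prenet2, style_affine, converter, vocoder) is assumed to be a trained component as described above, and lf0_uv_features and map_lf0 are the illustrative helpers sketched earlier.

import torch


@torch.no_grad()
def convert(source_audio, sample_rate, target_speaker_emb, target_mu, target_sigma,
            ppg_extractor, prenet1, prenet2, style_affine, converter, vocoder):
    ppg = ppg_extractor(source_audio, sample_rate)                 # step S9, e.g. WeNet encoder output

    lf0_uv = lf0_uv_features(source_audio, sample_rate)            # source lf0-uv feature
    lf0_uv[:, 0] = map_lf0(lf0_uv[:, 0], target_mu, target_sigma)  # steps S10-S11, eq. (5)
    lf0_uv = torch.from_numpy(lf0_uv).T.unsqueeze(0)               # (1, 2, frames)

    content = prenet1(ppg) + prenet2(lf0_uv)       # prenet2 down-samples to the ppg frame rate
    gamma, beta = style_affine(target_speaker_emb)                 # style input
    mel = converter(content, gamma, beta)                          # step S12: converted Mel
    return vocoder(mel)                                            # waveform in the target voice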
The foregoing describes preferred embodiments of the present invention. Where they are not plainly contradictory, the preferred embodiments may be combined with one another in any manner. The specific parameters in the embodiments and examples serve only to illustrate the inventors' verification process clearly and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made on the basis of the description and drawings of the present invention are likewise intended to fall within its scope.

Claims (7)

1. A conversion method for converting an arbitrary speech into a plurality of speeches, comprising a training method and a synthesis method, wherein the training method comprises the steps of:
step 1, preparing corpora of a plurality of target speakers as training corpora, wherein each corpus comprises audio and corresponding speaker information, and extracting original Mel characteristics of the training corpora;
building a first preprocessing network, a second preprocessing network, an affine layer and a conversion model; the number of output channels of the first preprocessing network and the second preprocessing network is the same, and the down-sampling rate of the second preprocessing network is consistent with the down-sampling rate when the ppg characteristics of the training corpus are extracted;
wherein the first and second pre-processing networks comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
step 2, extracting the ppg characteristics of each training corpus;
step 3, feeding the obtained ppg features into the first preprocessing network for processing;
step 4, calculating fundamental frequency features f0 of the training corpus audio data, taking logarithmic value logf0 to obtain fundamental frequency logarithmic features lf0, calculating voiced and unvoiced sound marks of the audio data, and splicing the fundamental frequency logarithmic features and the unvoiced and voiced sound marks to obtain fundamental frequency-unvoiced and voiced sound splicing features lf 0-uv;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
step 5, acquiring the encoding characteristics of the target speaker in the training set, and transforming the encoding characteristics through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
step 6, inputting the mean value simulation characteristic gamma and the variance simulation characteristic beta obtained in the step 5 as the style of a conversion model, inputting the comprehensive characteristics obtained in the step 4 as the content of the conversion model, and generating converted Mel characteristics through the conversion model;
calculating a loss function according to the converted Mel feature and the original Mel feature;
the conversion model adopts a coder-decoder network framework, and comprises a coding network and a decoding network;
the coding network part codes the comprehensive characteristics obtained in the step 4, and the decoding network part decodes the coding result output by the coding network to obtain corresponding Mel characteristics;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
step 7, using the loss function to update the first preprocessing network, the second preprocessing network, the affine layer and the conversion model;
and 8, repeating the steps 2 to 7 until the loss function is converged and the training is finished.
2. The method of claim 1, wherein the updating in step 7 is performed by gradient descent with back-propagation.
3. The method as claimed in claim 1, wherein the corpus in step 1 contains corpora of different languages, and in step 2 the ppg features of each language are extracted and then concatenated, after which step 3 is performed.
4. The method according to claim 1, wherein WeNet is used to extract the ppg features of the corpus in step 2.
5. The method of converting from an arbitrary speech to a plurality of speeches according to claim 1, wherein said synthesizing method comprises the steps of:
s9, extracting the ppg characteristics of the converted audio and sending the ppg characteristics into a first preprocessing network;
s10, extracting fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculating a mean value and a variance, and performing linear mapping according to a formula to obtain a mapped feature lf 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature;
s11, splicing the mapped feature lf 0' with the unvoiced and voiced sound marks of the converted audio to obtain a fundamental frequency-unvoiced and voiced sound splicing feature;
sending the fundamental frequency-unvoiced and voiced sound splicing characteristics into a second preprocessing network for processing;
adding the results of the first and second preprocessing network processing to obtain a comprehensive characteristic;
acquiring the encoding characteristics of a speaker of a target speaker, and transforming the encoding characteristics of the speaker through an affine layer to obtain a mean value simulation characteristic gamma and a variance simulation characteristic beta;
s12, inputting the mean value simulation feature gamma and the variance simulation feature beta obtained in the step S11 as conversion model styles, inputting the comprehensive features obtained in the step S11 as conversion model contents, and generating converted Mel features through a conversion model; converting the Mel characteristic input vocoder into audio;
the first preprocessing network, the second preprocessing network, the affine layer and the conversion model in the steps S9-S12 are obtained after the training of the training method is completed.
6. A conversion device capable of converting any voice into a plurality of voices, characterized by comprising a ppg feature extraction module, an LF0 feature extraction module and a speaker coding extraction module, wherein the ppg feature extraction module is used for extracting ppg features and the LF0 feature extraction module is used for extracting the fundamental frequency-unvoiced and voiced sound splicing feature; the ppg feature extraction module and the LF0 feature extraction module are connected to a first preprocessing network and a second preprocessing network respectively, the first preprocessing network and the second preprocessing network are further connected to two input ends of an adder, and the adder adds the features received at its input ends;
the output end of the adder is connected with the content input end of the conversion model;
the speaker code extraction module is used for extracting speaker code features, and is connected with an affine layer which is connected with the style input end of the conversion model; the output end of the conversion model is connected with a vocoder;
the first preprocessing network and the second preprocessing network comprise an instance normalization layer;
the instance normalization layer is calculated as follows:
µ_c = (1/W) · Σ_{n=1}^{W} M_c[n]    (1)
σ_c² = (1/W) · Σ_{n=1}^{W} (M_c[n] − µ_c)²    (2)
the normalized feature map is:
M̂_c[n] = (M_c[n] − µ_c) / √(σ_c² + ε)    (3)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c denotes the c-th channel of the feature map, W is the dimension of each channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, and M̂_c[n] is the normalized feature value;
the conversion model adopts an encoder-decoder network framework, which comprises an encoding network and a decoding network;
the decoding network part comprises a convolution layer and an activation layer, the convolution layer is connected with the activation layer, and the activation layer is connected with a self-adaptive instance normalization layer;
the adaptive instance normalization layer is calculated according to formula (4):
AdaIN(M_c[n]) = γ_c · (M_c[n] − µ_c) / √(σ_c² + ε) + β_c = γ_c · M̂_c[n] + β_c    (4)
where µ_c is the mean of the c-th channel of the feature map, σ_c² is the variance of the c-th channel, M_c[n] is the value of the n-th dimension of channel M_c, ε is a small stabilizing constant, M̂_c[n] is the normalized feature value, and γ_c and β_c are the values of the c-th channel of the mean simulation feature γ and the variance simulation feature β obtained in step 5;
the LF0 feature extraction module extracts fundamental frequency logarithmic features of the converted audio and any audio of the target speaker, calculates mean and variance, and performs linear mapping according to a formula to obtain mapped features LF 0':
lf0′ = (lf0_s − µ_s) · σ_t / σ_s + µ_t    (5)
where lf0_s is the fundamental frequency logarithmic feature of the audio to be converted, µ_s is the mean of the fundamental frequency logarithmic feature of the audio to be converted, µ_t is the mean of the target speaker's fundamental frequency logarithmic feature, σ_s is the variance of the fundamental frequency logarithmic feature of the audio to be converted, and σ_t is the variance of the target speaker's fundamental frequency logarithmic feature.
7. The apparatus according to claim 6, wherein a concatenation module is further connected between the PPG feature extraction module and the first preprocessing network, and the concatenation module concatenates PPG features of different languages and inputs the result to the first preprocessing network.
CN202111035937.9A 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices Active CN113470622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111035937.9A CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111035937.9A CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Publications (2)

Publication Number Publication Date
CN113470622A CN113470622A (en) 2021-10-01
CN113470622B true CN113470622B (en) 2021-11-19

Family

ID=77867524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111035937.9A Active CN113470622B (en) 2021-09-06 2021-09-06 Conversion method and device capable of converting any voice into multiple voices

Country Status (1)

Country Link
CN (1) CN113470622B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782052A (en) * 2021-11-15 2021-12-10 北京远鉴信息技术有限公司 Tone conversion method, device, electronic equipment and storage medium
CN114333865A (en) * 2021-12-22 2022-04-12 广州市百果园网络科技有限公司 Model training and tone conversion method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN106504741B (en) * 2016-09-18 2019-10-25 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of phonetics transfer method based on deep neural network phoneme information
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN109346107B (en) * 2018-10-10 2022-09-30 中山大学 LSTM-based method for inversely solving pronunciation of independent speaker
KR20200094493A (en) * 2019-01-30 2020-08-07 김남형 Operating Method for Voice-Conversion Application with Phonetic-Posteriorgram Extractor , TTS and Vocoder
CN110223705B (en) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 Voice conversion method, device, equipment and readable storage medium
CN111462769B (en) * 2020-03-30 2023-10-27 深圳市达旦数生科技有限公司 End-to-end accent conversion method
CN111862939A (en) * 2020-05-25 2020-10-30 北京捷通华声科技股份有限公司 Prosodic phrase marking method and device
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Modularized Neural Network with Language-Specific Output Layers for Cross-Lingual Voice Conversion; Yi Zhou et al.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU); 2020-02-20; full text *
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech; Yi Ren et al.; arxiv.org/abs/2006.04558v6; 2021-03-04; full text *
Generalized End-to-End Loss for Speaker Verification; Li Wan et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-09-13; full text *
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram; Shengkui Zhao et al.; ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2021-05-13; full text *
Research on Voice Conversion Based on Deep Learning; 赖家豪; China Master's Theses Full-text Database; 2018-12-31; full text *

Also Published As

Publication number Publication date
CN113470622A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
JP7436709B2 (en) Speech recognition using unspoken text and speech synthesis
US20230043916A1 (en) Text-to-speech processing using input voice characteristic data
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
US20070213987A1 (en) Codebook-less speech conversion method and system
JP2024023421A (en) Two-level speech prosody transfer
JP7228998B2 (en) speech synthesizer and program
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
Moon et al. Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer
CN112509550A (en) Speech synthesis model training method, speech synthesis device and electronic equipment
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Dua et al. Spectral warping and data augmentation for low resource language ASR system under mismatched conditions
Nosek et al. Cross-lingual neural network speech synthesis based on multiple embeddings
Zhao et al. Research on voice cloning with a few samples
Bae et al. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech
WO2010104040A1 (en) Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program
Venkatagiri Speech recognition technology applications in communication disorders
CN115359778A (en) Confrontation and meta-learning method based on speaker emotion voice synthesis model
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
Chen et al. Diffusion transformer for adaptive text-to-speech
JP2021085943A (en) Voice synthesis device and program
CN113628609A (en) Automatic audio content generation
KR102426020B1 (en) Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant