WO2022156413A1 - Voice style migration method, apparatus, readable medium and electronic device - Google Patents

Voice style migration method, apparatus, readable medium and electronic device

Info

Publication number
WO2022156413A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
audio
acoustic feature
sequence
acoustic
Prior art date
Application number
PCT/CN2021/136525
Other languages
English (en)
French (fr)
Inventor
伍林
吴鹏飞
潘俊杰
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022156413A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used

Definitions

  • the present disclosure relates to the technical field of electronic information processing, and in particular, to a voice style migration method, apparatus, readable medium and electronic device.
  • E-books are usually divided into different styles according to the content in them, such as: sci-fi, suspense, etc.
  • When a reader records the corresponding audio, the recording also follows the style of the e-book, so that the style of the audio matches the style of the e-book.
  • the present disclosure provides a method for migrating a speech style, the method comprising:
  • the initial acoustic feature sequence includes an acoustic feature corresponding to each of the phonemes, and the acoustic feature is used to indicate the prosody feature of the phoneme;
  • the initial acoustic feature sequence is processed to obtain a target acoustic feature sequence, and the target acoustic feature sequence includes the processed acoustic features corresponding to each of the phonemes;
  • the speech synthesis model is obtained by training according to the corpus that conforms to the second timbre.
  • the present disclosure provides a voice style migration device, the device comprising:
  • an acquisition module configured to acquire a target text and a first audio corresponding to the target text, where the first audio conforms to the first timbre and has a target style
  • a first extraction module configured to extract a phoneme sequence corresponding to the target text, where the phoneme sequence includes at least one phoneme
  • a second extraction module configured to extract an initial acoustic feature sequence corresponding to the first audio, where the initial acoustic feature sequence includes an acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate the prosodic feature of the phoneme;
  • a processing module configured to process the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence, where the target acoustic feature sequence includes the processed acoustic features corresponding to each of the phonemes;
  • a synthesis module configured to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output from the speech synthesis model, where the second audio conforms to the second timbre and has the target style, and the speech synthesis model is obtained by training according to the corpus that conforms to the second timbre.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, comprising:
  • a processing device is configured to execute the computer program in the storage device to implement the steps of the method in the first aspect of the present disclosure.
  • the present disclosure provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the method for migrating a speech style according to any one of the first aspect.
  • the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the method for migrating a speech style according to any one of the first aspect.
  • FIG. 1 is a flowchart of a method for migrating a voice style according to an exemplary embodiment
  • FIG. 2 is a flow chart of another method for migrating voice styles according to an exemplary embodiment
  • FIG. 3 is a flowchart of another method for migrating voice style according to an exemplary embodiment
  • FIG. 4 is a process flow diagram of a speech synthesis model according to an exemplary embodiment
  • FIG. 5 is a block diagram of a speech synthesis model according to an exemplary embodiment
  • FIG. 6 is a flowchart of training a speech synthesis model according to an exemplary embodiment
  • FIG. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment
  • FIG. 8 is a block diagram of an apparatus for migrating a voice style according to an exemplary embodiment
  • FIG. 9 is a block diagram of another apparatus for migrating voice style according to an exemplary embodiment.
  • FIG. 10 is a block diagram of another apparatus for migrating voice style according to an exemplary embodiment
  • Fig. 11 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • The audio corresponding to an e-book is often recorded by only one reader, which makes it difficult to meet the diverse needs of users. If an existing speech synthesis method is used to simulate audio of other readers reading the e-book, the style of the simulated audio does not match the style of the e-book, because speech synthesis relies on an original sound library, and the original sound library usually has no style.
  • the present invention provides a solution for realizing style transfer in the process of speech synthesis.
  • Fig. 1 is a flowchart of a method for migrating a voice style according to an exemplary embodiment. As shown in Fig. 1 , the method may include steps 101-105.
  • Step 101 Obtain the target text and the first audio corresponding to the target text, where the first audio conforms to the first timbre and has the target style.
  • the target text can be an e-book, a chapter, a segment, or a sentence in an e-book, or other types of text, such as news, official account articles, blogs, and so on.
  • The first audio conforms to the first timbre and has the target style. That is, the first speaker has the first timbre, and the first audio is the audio recorded when the first speaker reads the target text in the target style, where the target style can be, for example, romance, urban, ancient style, suspense, science fiction, military, or sports.
  • Step 102 Extract a phoneme sequence corresponding to the target text, where the phoneme sequence includes at least one phoneme.
  • the target text may be input into a pre-trained recognition model to obtain a phoneme sequence corresponding to the target text output by the recognition model.
  • the phoneme corresponding to each word in the target text may also be searched in a pre-established dictionary, and then the phoneme corresponding to each word may be formed into a phoneme sequence corresponding to the target text.
  • a phoneme can be understood as a phonetic unit divided according to the pronunciation of each word, and can also be understood as a vowel and a consonant in the corresponding pinyin of each word.
  • The phoneme sequence includes the phoneme corresponding to each word in the target text (a word may correspond to one or more phonemes). Take a target text meaning "the sun is out" (pinyin: tai yang chu lai le) as an example.
  • The phonemes corresponding to each word can be looked up in the dictionary in turn, which gives the phoneme sequence "taiyangchulaile".
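  • The dictionary lookup described above can be sketched in Python as follows. This is only an illustration: the `LEXICON` contents, the Chinese characters chosen for "the sun is out", the split of each syllable into two units, and the function name `text_to_phonemes` are all assumptions, not part of the disclosure.

```python
# Hypothetical pronunciation dictionary: each character maps to the pinyin
# units used as phonemes here. A real lexicon would be far larger.
LEXICON = {
    "太": ["t", "ai"],
    "阳": ["y", "ang"],
    "出": ["ch", "u"],
    "来": ["l", "ai"],
    "了": ["l", "e"],
}


def text_to_phonemes(text, lexicon):
    """Look up each character in turn and concatenate the phonemes found."""
    phonemes = []
    for char in text:
        if char not in lexicon:
            raise KeyError(f"no dictionary entry for {char!r}")
        phonemes.extend(lexicon[char])
    return phonemes


phoneme_sequence = text_to_phonemes("太阳出来了", LEXICON)
print("".join(phoneme_sequence))  # taiyangchulaile
```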
  • Step 103 Extract an initial acoustic feature sequence corresponding to the first audio, where the initial acoustic feature sequence includes an acoustic feature corresponding to each phoneme, and the acoustic feature is used to indicate the prosodic feature of the phoneme.
  • The first audio may be processed by means of signal processing to obtain an initial acoustic feature sequence including the acoustic feature corresponding to each phoneme.
  • The acoustic features may include at least one of pitch, volume (energy), or speech rate (duration), and may also include noise level, tone, or loudness, and the like.
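  • For example, HTS (HMM-based Speech Synthesis System) can be used to process the first audio to obtain the speech rate corresponding to each phoneme.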
  • Audio processing tools such as sox, librosa, or straight can also be used to process the first audio to obtain the pitch and volume corresponding to each phoneme.
  • Step 104 Process the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence, where the target acoustic feature sequence includes the processed acoustic features corresponding to each phoneme.
  • Step 105: Input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain the second audio output by the speech synthesis model. The second audio conforms to the second timbre and has the target style, and the speech synthesis model is trained on a corpus that conforms to the second timbre.
  • For different timbres, the corresponding acoustic features have different value ranges and variation amplitudes; in other words, the acoustic features of different timbres obey different probability distributions. The acoustic features in the initial acoustic feature sequence therefore obey the probability distribution of the first timbre and are unlikely to satisfy the probability distribution of the second timbre, which means it is difficult to synthesize audio that matches the second timbre directly from the initial acoustic feature sequence.
  • Therefore, the acoustic feature corresponding to each phoneme in the initial acoustic feature sequence can be processed according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence, which includes the processed acoustic feature corresponding to each phoneme.
  • The acoustic statistical feature of the second timbre can be understood as a statistical feature computed in advance from a large amount of audio conforming to the second timbre, and it reflects the probability distribution obeyed by the acoustic features corresponding to the second timbre.
  • the acoustic statistical features may include one or more of the statistical features of speech rate (eg, mean and variance), statistical features of pitch, or statistical features of volume of the second timbre.
  • The processing can be understood as standardization, so that the processed acoustic features satisfy the probability distribution obeyed by the acoustic features corresponding to the second timbre.
  • the phoneme sequence and the target acoustic feature sequence can be input into the pre-trained speech synthesis model, and the speech synthesis model outputs the second audio that conforms to the second timbre and has the target style.
  • The speech synthesis model can be pre-trained and can be understood as a TTS (Text To Speech) model, which generates the second audio according to the phoneme sequence and the target acoustic feature sequence.
  • the speech synthesis model may be obtained by training based on the Tacotron model, the Deepvoice 3 model, the Tacotron 2 model, the Wavenet model, etc., which is not specifically limited in the present disclosure.
  • The speech synthesis model is trained on a corpus that matches the second timbre. This corpus can be understood as audio recorded while the second speaker reads arbitrary text.
  • the arbitrary text may be different from the target text, and the second speaker can read the arbitrary text in any style, that is, it does not need to read in the target style. That is to say, the speech synthesis model can be trained by using the audio read by the existing second speaker. In this way, the speech synthesis model is trained by the second speaker reading the corpus of other texts.
  • When the speech synthesis model synthesizes the second audio, both the semantics of the target text and the target acoustic feature sequence determined from the first audio are taken into account, so the second audio matches the second timbre and has the target style, thereby realizing style transfer. There is no need to spend large amounts of time and labor recording audio in the same style with multiple timbres, and users can be offered a variety of choices that meet their diverse needs.
  • For example, the target text can be an e-book obtained from an e-book reading APP (application). If the e-book is located in the "Ancient Style" column of the APP, the target style is "Ancient Style", and the first audio may be a pre-recorded audio, obtained from the APP, of reader A (corresponding to the first timbre) reading the e-book. First, the phoneme sequence corresponding to the target text is extracted, and the initial acoustic feature sequence is determined according to the first audio.
  • The initial acoustic feature sequence is then processed according to the acoustic statistical features of speaker B (corresponding to the second timbre) to obtain the target acoustic feature sequence. Finally, the phoneme sequence and the target acoustic feature sequence are input into the speech synthesis model. The second audio output by the speech synthesis model conforms to the timbre of speaker B and has the "Ancient Style" style; that is, the second audio can be understood as audio that imitates speaker B reading the target text in the "Ancient Style" style.
  • The speech synthesis model is pre-trained on a large amount of audio read aloud by speaker B.
  • In summary, the present disclosure first obtains the target text and the first audio corresponding to the target text, where the first audio has the target style and conforms to the first timbre. It then extracts the phoneme sequence corresponding to the target text, which includes at least one phoneme, and extracts from the first audio an initial acoustic feature sequence that includes an acoustic feature corresponding to each phoneme, where the acoustic feature indicates the prosodic feature of the phoneme.
  • The initial acoustic feature sequence is then processed according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence that includes the processed acoustic feature corresponding to each phoneme. Finally, the phoneme sequence and the target acoustic feature sequence are input into the pre-trained speech synthesis model, which outputs the second audio that conforms to the second timbre and has the target style. The speech synthesis model is trained on a corpus that conforms to the second timbre.
  • The present disclosure uses the target text and the corresponding first audio, which conforms to the first timbre and has the target style, to synthesize the second audio, which conforms to the second timbre and has the target style. The target text can thus be read in the same style with different timbres, which achieves style transfer in the process of speech synthesis.
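  • The flow of steps 102-105 can be sketched as a small Python function that wires the stages together. This is a structural sketch only: the function name `migrate_style`, its parameters, and the stub components passed in at the bottom are hypothetical stand-ins for the real phoneme extractor, feature extractor, standardization, and speech synthesis model.

```python
def migrate_style(target_text, first_audio, stats_b,
                  extract_phonemes, extract_features, standardize, synthesize):
    """Run steps 102-105 with the concrete components passed in as callables.

    All four callables are hypothetical placeholders for real components.
    """
    phonemes = extract_phonemes(target_text)           # step 102
    initial = extract_features(first_audio, phonemes)  # step 103
    target = standardize(initial, stats_b)             # step 104
    return synthesize(phonemes, target)                # step 105


# Stub components so the sketch runs end to end; none of them does real
# audio processing.
result = migrate_style(
    target_text="ab",
    first_audio=[3.0, 5.0],
    stats_b={"mean": 4.0, "std": 1.0},
    extract_phonemes=list,
    extract_features=lambda audio, phonemes: audio[: len(phonemes)],
    standardize=lambda feats, s: [(f - s["mean"]) / s["std"] for f in feats],
    synthesize=lambda phonemes, feats: list(zip(phonemes, feats)),
)
print(result)  # [('a', -1.0), ('b', 1.0)]
```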
  • Fig. 2 is a flow chart of another method for migrating speech styles according to an exemplary embodiment.
  • The acoustic features include at least one of pitch (fundamental frequency), volume, and speech rate, and step 103 may correspondingly be implemented through steps 1031-1033.
  • Step 1031: if the acoustic feature includes the speech rate, determine, according to the phoneme sequence and the first audio, the one or more audio frames corresponding to each phoneme in the first audio, and determine the speech rate corresponding to the phoneme according to the number of audio frames corresponding to the phoneme.
  • For example, HTS can be used to divide the first audio according to the phonemes included in the phoneme sequence to obtain the one or more audio frames corresponding to each phoneme. The speech rate corresponding to a phoneme is then determined from the duration occupied by each audio frame and the number of audio frames corresponding to the phoneme. For example, if after division a phoneme corresponds to 3 audio frames and each audio frame occupies 10 ms, the speech rate (i.e., Duration) corresponding to the phoneme is 30 ms.
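  • A minimal Python sketch of the frame-count calculation, assuming the alignment tool has already produced one phoneme label per audio frame. The names `durations_from_alignment` and `FRAME_MS` and the sample alignment are illustrative assumptions; the sketch also merges consecutive identical labels, so two adjacent occurrences of the same phoneme would need a boundary marker in practice.

```python
from itertools import groupby

FRAME_MS = 10  # assumed duration of one audio frame, as in the example above


def durations_from_alignment(frame_labels, frame_ms=FRAME_MS):
    """Collapse a per-frame phoneme alignment into (phoneme, duration_ms) pairs.

    frame_labels holds one phoneme label per audio frame, in time order.
    Consecutive identical labels are treated as one phoneme occurrence.
    """
    return [
        (phoneme, len(list(frames)) * frame_ms)
        for phoneme, frames in groupby(frame_labels)
    ]


# Hypothetical alignment: "t" spans 3 frames, "ai" spans 5 frames.
alignment = ["t", "t", "t", "ai", "ai", "ai", "ai", "ai"]
print(durations_from_alignment(alignment))  # [('t', 30), ('ai', 50)]
```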
  • Step 1032 if the acoustic feature includes pitch, extract the pitch of each audio frame in the first audio, and determine the pitch corresponding to the phoneme according to the pitch of the audio frame corresponding to each phoneme.
  • Step 1033 if the acoustic feature includes volume, extract the volume of each audio frame in the first audio, and determine the volume corresponding to each phoneme according to the volume of the audio frame corresponding to each phoneme.
  • When the acoustic feature includes pitch, audio processing tools such as sox, librosa, or straight may be used to process the first audio and extract the pitch of each audio frame in the first audio.
  • The mean value (or extreme value, standard deviation, etc.) of the pitches of the audio frames corresponding to each phoneme may then be used as the pitch corresponding to the phoneme.
  • For example, if the pitches of the two audio frames corresponding to a phoneme are 1.2 kHz and 1.6 kHz, their average, 1.4 kHz, can be used as the pitch corresponding to the phoneme.
  • audio processing tools such as sox, librosa, and straight may be used to process the first audio to extract the volume of each audio frame in the first audio.
  • the mean value (or extreme value, standard deviation, etc.) of the volume of the audio frame corresponding to each phoneme may be used as the volume corresponding to the phoneme.
  • For example, if the volumes of the two audio frames corresponding to a phoneme are 30 dB and 80 dB, their average, 55 dB, can be used as the volume corresponding to the phoneme.
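  • The per-phoneme averaging used for pitch and volume can be sketched in Python as follows, assuming frame-level values have already been extracted by an audio tool. The function name `phoneme_level_feature`, the `(start, end)` span representation, and the sample values for the second phoneme are assumptions made for illustration.

```python
from statistics import fmean


def phoneme_level_feature(frame_values, frame_spans):
    """Average a frame-level feature over the frames belonging to each phoneme.

    frame_spans is a list of (start, end) frame indices, end exclusive.
    """
    return [fmean(frame_values[start:end]) for start, end in frame_spans]


# Hypothetical frame-level values for two phonemes of two frames each.
pitch_hz = [1200.0, 1600.0, 900.0, 1100.0]
volume_db = [30.0, 80.0, 50.0, 60.0]
spans = [(0, 2), (2, 4)]

print(phoneme_level_feature(pitch_hz, spans))   # [1400.0, 1000.0]
print(phoneme_level_feature(volume_db, spans))  # [55.0, 55.0]
```

  • Swapping `fmean` for `max`, `min`, or `statistics.pstdev` would give the extreme-value or standard-deviation variants mentioned above.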
  • Fig. 3 is a flow chart of another method for migrating a speech style according to an exemplary embodiment.
  • The acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and step 104 may correspondingly include steps 1041-1043.
  • Step 1041 if the acoustic feature includes the speaking rate, standardize the speaking rate corresponding to each phoneme according to the average speaking rate and the variance of the speaking rate included in the acoustic statistical feature to obtain the processed speaking rate corresponding to the phoneme.
  • Step 1042 if the acoustic feature includes pitch, standardize the pitch corresponding to each phoneme according to the average pitch and pitch variance included in the acoustic statistical feature to obtain the processed pitch corresponding to the phoneme.
  • Step 1043 if the acoustic feature includes volume, standardize the volume corresponding to each phoneme according to the volume average and volume variance included in the acoustic statistical feature to obtain the processed volume corresponding to the phoneme.
  • the acoustic statistical feature may include: the average speech rate (represented as duration_mean) and the speech rate variance (represented as duration_var) of the second timbre.
  • the acoustic statistical features may include: average pitch (represented as pitch_mean) and pitch variance (represented by pitch_var), and in a scenario where the acoustic features include volume, the acoustic statistical features may include: volume average (represented as energy_mean) and volume variance (denoted as energy_var).
  • Taking acoustic features that include pitch, volume, and speech rate as an example, the speech rate, pitch, and volume corresponding to each phoneme can be standardized by formulas to obtain the processed speech rate, pitch, and volume.
  • In the formulas, A denotes the first timbre, and the remaining symbols denote, for the i-th phoneme in the phoneme sequence, the speech rate and the processed speech rate, the pitch and the processed pitch, and the volume and the processed volume.
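  • A minimal Python sketch of the standardization in steps 1041-1043. The exact formula is not reproduced in this text, so the sketch assumes a conventional z-score that subtracts the mean and divides by the square root of the variance; that choice, the function name `standardize`, and the sample statistics are assumptions.

```python
import math


def standardize(values, mean, variance):
    """Standardize each value as (value - mean) / sqrt(variance).

    Assumption: the variance is converted to a standard deviation before
    dividing; the source does not show the exact formula.
    """
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("variance must be positive")
    return [(v - mean) / std for v in values]


# Hypothetical statistics of the second timbre.
stats_b = {"duration_mean": 40.0, "duration_var": 100.0}

# Speech rates (durations in ms) of three phonemes from the first audio.
durations_a = [30.0, 40.0, 60.0]
print(standardize(durations_a, stats_b["duration_mean"], stats_b["duration_var"]))
# [-1.0, 0.0, 2.0]
```

  • The same function would apply to pitch with `pitch_mean` and `pitch_var`, and to volume with `energy_mean` and `energy_var`.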
  • Fig. 4 is a processing flow chart of a speech synthesis model according to an exemplary embodiment. As shown in Fig. 4 , the speech synthesis model can be used to perform the following steps A and B.
  • step A a text feature sequence corresponding to the target text is determined according to the phoneme sequence, and the text feature sequence includes a text feature corresponding to each phoneme in the phoneme sequence.
  • step B the second audio is generated according to the text feature sequence and the target acoustic feature sequence.
  • Specifically, the speech synthesis model may synthesize the second audio by first extracting, according to the phoneme sequence, the text feature sequence (i.e., Text Embedding) corresponding to the target text. The text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence, and a text feature can be understood as a text vector that characterizes the phoneme. For example, if the phoneme sequence includes 100 phonemes and the text vector corresponding to each phoneme is a 1*256-dimensional vector, the text feature sequence may be a 100*256-dimensional vector.
  • the text feature sequence can be combined with the target acoustic feature sequence to generate a second audio.
  • the text feature sequence can be concatenated with the target acoustic feature sequence to obtain a combined sequence, and then the second audio can be generated according to the combined sequence.
  • For example, if the phoneme sequence includes 100 phonemes, the text feature sequence can be a 100*256-dimensional vector and the corresponding target acoustic feature sequence a 100*3-dimensional vector (each phoneme corresponds to the three dimensions of pitch, volume, and speech rate). The combined sequence is then a 100*259-dimensional vector, from which the second audio can be generated.
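  • The concatenation can be sketched in plain Python with nested lists, using the dimensions from the example. The function name `concat_features` and the zero-filled placeholder values are assumptions; a real model would operate on learned embeddings in a tensor library.

```python
def concat_features(text_features, acoustic_features):
    """Concatenate per-phoneme text and acoustic vectors along the feature axis."""
    if len(text_features) != len(acoustic_features):
        raise ValueError("both sequences need one row per phoneme")
    return [t + a for t, a in zip(text_features, acoustic_features)]


num_phonemes, text_dim, acoustic_dim = 100, 256, 3
# Placeholder values; a real model would produce learned embeddings.
text_seq = [[0.0] * text_dim for _ in range(num_phonemes)]
acoustic_seq = [[0.0] * acoustic_dim for _ in range(num_phonemes)]

combined = concat_features(text_seq, acoustic_seq)
print(len(combined), len(combined[0]))  # 100 259
```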
  • For example, the speech synthesis model may be the Tacotron model, which includes an encoder (Encoder), an attention network (Attention), a decoder (Decoder), and a post-processing network (Post-processing).
  • The encoder can include an embedding layer (Character Embedding layer), a pre-net sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit) sub-model.
  • a sequence of phonemes can be fed into the encoder.
  • the phoneme sequence is converted into a word vector through the embedding layer, and then the word vector is input into the Pre-net sub-model to perform nonlinear transformation on the word vector, thereby improving the convergence and generalization capabilities of the speech synthesis model.
  • the Pre-net sub-model obtains a text feature sequence that can characterize the text to be synthesized according to the non-linearly transformed word vector.
  • the target acoustic feature sequence and the text feature sequence output by the encoder can be spliced to obtain a combined sequence, and then the combined sequence is input into the attention network, and the attention network can add an attention weight to each element in the combined sequence.
  • The attention network may be a Location Sensitive Attention network, a GMM (Gaussian Mixture Model) attention network, or a Multi-Head Attention network, which is not specifically limited in the present disclosure.
  • the output of the attention network is then used as the input of the decoder.
  • the decoder may include a preprocessing network sub-model (which may be the same as that included in the encoder), Attention-RNN, Decoder-RNN.
  • the preprocessing network sub-model is used to perform nonlinear transformation on the input.
  • The Attention-RNN is a single-layer unidirectional, zoneout-based LSTM (Long Short-Term Memory) network, which takes the output of the preprocessing network sub-model as input and, after the LSTM unit, passes its output to the Decoder-RNN.
  • The Decoder-RNN is a two-layer unidirectional, zoneout-based LSTM, which outputs Mel spectrum information through the LSTM unit; the Mel spectrum information can include one or more Mel spectrum features.
  • Finally, the Mel spectrum information is input into the post-processing network, which may include a vocoder (e.g., a Wavenet vocoder or a Griffin-Lim vocoder) that converts the Mel spectrum information to obtain the second audio.
  • Fig. 6 is a flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in Fig. 6 , the speech synthesis model is obtained by training in the manner shown in steps 201-204.
  • Step 201 Obtain training text, a training phoneme sequence corresponding to the training text, and training audio, where the training audio matches the second timbre, and the training phoneme sequence includes at least one training phoneme.
  • the training phoneme sequence and training audio can also be multiple.
  • the training phoneme sequence includes training phonemes corresponding to each word in the training text, and the training audio is audio corresponding to the training text and conforming to the second timbre. It should be noted that there is no association between the training text and the target text, that is, the training text may be a different text from the target text.
  • The training audio only needs to be audio of the training text read in the second timbre; no particular style is required. That is, the training audio may have no style (which can be understood as plain), may be in the target style, or may be in a style other than the target style.
  • Step 202 extracting a real acoustic feature sequence of the training audio, where the real acoustic feature sequence includes an acoustic feature corresponding to each training phoneme.
  • Step 203 Process the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, where the training acoustic feature sequence includes the processed acoustic features corresponding to each training phoneme.
  • The training audio can be processed by means of signal processing to obtain a real acoustic feature sequence including the acoustic feature corresponding to each training phoneme, where the acoustic feature indicates the prosodic feature of the training phoneme and may include at least one of pitch, volume, or speech rate, and may also include noise level, tone, or loudness, and the like.
  • the label information of the training phoneme sequence can also be obtained, and the acoustic features corresponding to each training phoneme can be directly obtained from the label information.
  • the acoustic features corresponding to the training phonemes can also be processed according to the acoustic statistical features to obtain the training acoustic feature sequence.
  • The processing can be understood as standardization, so that the processed acoustic features satisfy the probability distribution obeyed by the acoustic features corresponding to the second timbre.
  • The speech rate, pitch, and volume corresponding to each training phoneme can be standardized by formula 2 to obtain the processed speech rate, pitch, and volume corresponding to the training phoneme.
  • In formula 2, B denotes the second timbre, and the remaining symbols denote, for the i-th training phoneme in the training phoneme sequence, the speech rate and the processed speech rate, the pitch and the processed pitch, and the volume and the processed volume.
  • Step 204 input the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • the training phoneme sequence and the training acoustic feature sequence are used as the input of the speech synthesis model, and the speech synthesis model is trained according to the output of the speech synthesis model and the training audio.
  • For example, the difference (or mean square error) between the output of the speech synthesis model and the training audio can be used as the loss function of the speech synthesis model. With the goal of reducing the loss function, the back-propagation algorithm can be used to correct the parameters of the neurons in the speech synthesis model, such as the weights and biases of the neurons.
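  • The idea of reducing a mean-square-error loss by adjusting weights and biases can be illustrated with a toy Python example. This is not the speech synthesis model: it trains a single linear neuron on made-up numbers, and the names `mse` and `train_step`, the learning rate, and the data are all assumptions chosen to keep the sketch small.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def train_step(weight, bias, inputs, targets, lr=0.1):
    """One gradient-descent update of a single linear neuron y = w * x + b."""
    n = len(inputs)
    preds = [weight * x + bias for x in inputs]
    # Gradients of the MSE loss with respect to the weight and the bias.
    grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, inputs)) / n
    grad_b = sum(2 * (p - t) for p, t in zip(preds, targets)) / n
    return weight - lr * grad_w, bias - lr * grad_b


# Toy data standing in for model inputs and training-audio features.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1

weight, bias = 0.0, 0.0
loss_before = mse([weight * x + bias for x in inputs], targets)
for _ in range(200):
    weight, bias = train_step(weight, bias, inputs, targets)
loss_after = mse([weight * x + bias for x in inputs], targets)
print(loss_before, loss_after)
```

  • In a full model the gradients would be obtained by back-propagation through every layer rather than written out by hand.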
  • FIG. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in FIG. 7 , the training of the speech synthesis model further includes step 205 .
  • Step 205 after extracting the real acoustic feature sequence of the training audio, determine the acoustic feature average and acoustic feature variance of the acoustic feature corresponding to each training phoneme, and use the acoustic feature average and acoustic feature variance as acoustic statistical features.
  • the acoustic statistical feature may be determined according to the acoustic feature corresponding to each training phoneme.
  • the acoustic feature average and acoustic feature variance of the acoustic features corresponding to all training phonemes may be determined as acoustic statistical features.
  • For example, the average speech rate and the speech rate variance can be determined from the speech rates corresponding to all the training phonemes, the average pitch and the pitch variance from the pitches corresponding to all the training phonemes, and the average volume and the volume variance from the volumes corresponding to all the training phonemes.
  • The average and variance of the speech rate, the average and variance of the pitch, and the average and variance of the volume are then taken together as the acoustic statistical features.
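  • Step 205 amounts to computing one mean and one variance per feature over all training phonemes. A minimal sketch follows, assuming each phoneme's features are held in a dict with the hypothetical keys "duration", "pitch", and "energy", and assuming the population variance is meant (the text does not say which variance is used).

```python
from statistics import fmean, pvariance


def acoustic_statistics(features):
    """Return the mean and variance of each feature over all training phonemes.

    features: list of dicts, one per training phoneme, with the
    (hypothetical) keys "duration", "pitch", and "energy".
    """
    stats = {}
    for name in ("duration", "pitch", "energy"):
        values = [f[name] for f in features]
        stats[name] = {"mean": fmean(values), "variance": pvariance(values)}
    return stats
```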
  • In summary, the present disclosure first obtains the target text and the first audio corresponding to the target text, where the first audio conforms to the first timbre and has the target style. It then extracts the phoneme sequence corresponding to the target text, which includes at least one phoneme, and extracts the initial acoustic feature sequence corresponding to the first audio, which includes an acoustic feature corresponding to each phoneme; the acoustic feature indicates the prosodic feature of that phoneme. The initial acoustic feature sequence is further processed according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence that includes the processed acoustic feature corresponding to each phoneme. Finally, the phoneme sequence and the target acoustic feature sequence are input into the pre-trained speech synthesis model, which outputs the second audio that conforms to the second timbre and has the target style; the speech synthesis model is trained on a corpus that conforms to the second timbre.
  • The present disclosure thus uses the target text and the corresponding first audio, which conforms to the first timbre and has the target style, to synthesize second audio that conforms to the second timbre and has the target style, so that the target text can be rendered in the same style with different timbres, achieving style transfer during speech synthesis.
  • Fig. 8 is a block diagram of an apparatus for migrating a voice style according to an exemplary embodiment. As shown in Fig. 8, the apparatus 300 includes:
  • an acquisition module 301 configured to acquire the target text and the first audio corresponding to the target text, where the first audio conforms to the first timbre and has the target style;
  • the first extraction module 302 is used to extract the phoneme sequence corresponding to the target text, and the phoneme sequence includes at least one phoneme;
  • the second extraction module 303 is used to extract the initial acoustic feature sequence corresponding to the first audio, where the initial acoustic feature sequence includes the acoustic feature corresponding to each phoneme and the acoustic feature indicates the prosodic feature of that phoneme;
  • the processing module 304 is configured to process the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence, where the target acoustic feature sequence includes the processed acoustic features corresponding to each phoneme;
  • the synthesis module 305 is used to input the phoneme sequence and the target acoustic feature sequence into the pre-trained speech synthesis model to obtain the second audio output by the speech synthesis model, where the second audio conforms to the second timbre and has the target style, and the speech synthesis model is obtained by training on a corpus that conforms to the second timbre.
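  • The division of labour between modules 302 to 305 can be sketched as a simple composition of four stages. The function and parameter names below are hypothetical, and each stage is passed in as a callable because the text does not fix a concrete implementation for any of them.

```python
def transfer_style(target_text, first_audio, *, extract_phonemes,
                   extract_features, standardize, synthesize):
    """Chain the four processing stages described for apparatus 300.

    The acquisition module 301 corresponds to supplying target_text and
    first_audio; the remaining modules are injected as callables.
    """
    phonemes = extract_phonemes(target_text)            # first extraction module 302
    initial = extract_features(first_audio, phonemes)   # second extraction module 303
    target = standardize(initial)                       # processing module 304
    return synthesize(phonemes, target)                 # synthesis module 305
```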
  • Fig. 9 is a block diagram of another voice style transfer apparatus according to an exemplary embodiment.
  • The acoustic features include at least one of pitch, volume, or speech rate.
  • The second extraction module 303 includes:
  • the determination submodule 3031, used to, if the acoustic feature includes the speech rate, determine one or more audio frames corresponding to each phoneme in the first audio according to the phoneme sequence and the first audio, and determine the speech rate corresponding to the phoneme according to the number of audio frames corresponding to the phoneme;
  • the extraction submodule 3032, used to, if the acoustic feature includes pitch, extract the pitch of each audio frame in the first audio and determine the pitch corresponding to each phoneme according to the pitch of the audio frames corresponding to that phoneme; and, if the acoustic feature includes volume, extract the volume of each audio frame in the first audio and determine the volume corresponding to each phoneme according to the volume of the audio frames corresponding to that phoneme.
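  • A minimal sketch of what submodules 3031 and 3032 compute, assuming a phoneme-to-frame alignment is already available as (phoneme, start_frame, end_frame) tuples with an exclusive end. Taking the mean over a phoneme's frames is an assumption made here; the text only says the per-phoneme value is determined "according to" the frame values.

```python
from statistics import fmean


def phoneme_features(alignment, frame_pitch, frame_volume):
    """Aggregate frame-level pitch and volume into per-phoneme features.

    alignment: list of (phoneme, start_frame, end_frame), end exclusive.
    frame_pitch, frame_volume: one value per audio frame.
    """
    feats = []
    for phoneme, start, end in alignment:
        feats.append({
            "phoneme": phoneme,
            # Speech rate expressed as the number of frames the phoneme spans.
            "duration": end - start,
            "pitch": fmean(frame_pitch[start:end]),
            "energy": fmean(frame_volume[start:end]),
        })
    return feats
```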
  • Fig. 10 is a block diagram of another voice style transfer apparatus according to an exemplary embodiment. As shown in Fig. 10, the acoustic features include at least one of pitch, volume, or speech rate.
  • the processing module 304 includes:
  • the first processing submodule 3041, used to, if the acoustic feature includes the speech rate, standardize the speech rate corresponding to each phoneme according to the average speech rate and the speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to the phoneme;
  • the second processing submodule 3042, used to, if the acoustic feature includes pitch, standardize the pitch corresponding to each phoneme according to the average pitch and the pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to the phoneme;
  • the third processing submodule 3043, used to, if the acoustic feature includes volume, standardize the volume corresponding to each phoneme according to the average volume and the volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to the phoneme.
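  • The three processing submodules apply the same operation to different features. The sketch below assumes the standardization is a z-score (subtract the average, divide by the square root of the variance); the dict layout and the zero-variance fallback are illustrative choices rather than details from the disclosure.

```python
import math


def standardize(features, stats):
    """Standardize each phoneme's features with the second timbre's statistics.

    features: list of dicts, one per phoneme.
    stats: {feature_name: {"mean": ..., "variance": ...}}.
    Features without statistics are copied through unchanged.
    """
    processed = []
    for f in features:
        g = dict(f)
        for name, s in stats.items():
            if name in f:
                std = math.sqrt(s["variance"])
                # Guard against a zero variance (assumed fallback value: 0.0).
                g[name] = (f[name] - s["mean"]) / std if std > 0 else 0.0
        processed.append(g)
    return processed
```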
  • the speech synthesis model in the above embodiment can be used to perform the following steps:
  • Step A determine the text feature sequence corresponding to the target text according to the phoneme sequence, and the text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence;
  • Step B, generate the second audio according to the text feature sequence and the target acoustic feature sequence.
  • the speech synthesis model is trained as follows:
  • Step 1) obtain the training text, the training phoneme sequence corresponding to the training text and the training audio, the training audio conforms to the second timbre, and the training phoneme sequence includes at least one training phoneme;
  • Step 2) extract the real acoustic feature sequence of the training audio, and the real acoustic feature sequence includes the acoustic feature corresponding to each training phoneme;
  • Step 3 processing the real acoustic feature sequence according to the acoustic statistical feature to obtain a training acoustic feature sequence, and the training acoustic feature sequence includes the processed acoustic feature corresponding to each training phoneme;
  • Step 4 input the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • the training process of the speech synthesis model further includes:
  • Step 5 after extracting the real acoustic feature sequence of the training audio, determine the acoustic feature average and acoustic feature variance of the acoustic feature corresponding to each training phoneme, and use the acoustic feature average and acoustic feature variance as acoustic statistical features.
  • FIG. 11 shows a schematic structural diagram of an electronic device 400 (which can be understood as the execution body in the above embodiments) suitable for implementing the embodiments of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., in-vehicle navigation terminals), and stationary terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 11 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • The electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the electronic device 400.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • The following devices can be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 407 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 408 including, for example, a magnetic tape and hard disk; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 11 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • The computer program may be downloaded and installed from a network via the communication device 409, or installed from the storage device 408 or from the ROM 402.
  • When the computer program is executed by the processing device 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • Terminal devices and servers can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: obtain the target text and the first audio corresponding to the target text, where the first audio conforms to the first timbre and has the target style; extract the phoneme sequence corresponding to the target text, where the phoneme sequence includes at least one phoneme; extract the initial acoustic feature sequence corresponding to the first audio, where the initial acoustic feature sequence includes the acoustic feature corresponding to each phoneme and the acoustic feature indicates the prosodic feature of that phoneme; process the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain a target acoustic feature sequence, where the target acoustic feature sequence includes the processed acoustic feature corresponding to each phoneme; and input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain the second audio output by the speech synthesis model, where the second audio conforms to the second timbre and has the target style, and the speech synthesis model is obtained by training on a corpus that conforms to the second timbre.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the acquisition module may also be described as "a module for acquiring target text and first audio".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a method for migrating a voice style, including: acquiring a target text and first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style; extracting a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme; extracting an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic feature of that phoneme; processing the initial acoustic feature sequence according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, which includes the processed acoustic feature corresponding to each phoneme; and inputting the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being obtained by training on a corpus that conforms to the second timbre.
  • Example 2 provides the method of Example 1, wherein the acoustic features include at least one of pitch, volume, or speech rate, and extracting the initial acoustic feature sequence corresponding to the first audio includes: if the acoustic feature includes the speech rate, determining one or more audio frames corresponding to each phoneme in the first audio according to the phoneme sequence and the first audio, and determining the speech rate corresponding to the phoneme according to the number of audio frames corresponding to the phoneme; if the acoustic feature includes pitch, extracting the pitch of each audio frame in the first audio, and determining the pitch corresponding to each phoneme according to the pitch of the audio frames corresponding to that phoneme; and if the acoustic feature includes volume, extracting the volume of each audio frame in the first audio, and determining the volume corresponding to each phoneme according to the volume of the audio frames corresponding to that phoneme.
  • Example 3 provides the method of Example 1, wherein the acoustic features include at least one of pitch, volume, or speech rate, and processing the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain the target acoustic feature sequence includes: if the acoustic feature includes the speech rate, standardizing the speech rate corresponding to each phoneme according to the average speech rate and the speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to the phoneme; if the acoustic feature includes pitch, standardizing the pitch corresponding to each phoneme according to the average pitch and the pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to the phoneme; and if the acoustic feature includes volume, standardizing the volume corresponding to each phoneme according to the average volume and the volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to the phoneme.
  • Example 4 provides the method of Example 1, wherein the speech synthesis model is configured to: determine a text feature sequence corresponding to the target text according to the phoneme sequence, where the text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence; and generate the second audio according to the text feature sequence and the target acoustic feature sequence.
  • Example 5 provides the method of any one of Examples 1 to 4, wherein the speech synthesis model is obtained by training in the following manner: acquiring a training text, a training phoneme sequence corresponding to the training text, and training audio, where the training audio conforms to the second timbre and the training phoneme sequence includes at least one training phoneme; extracting a real acoustic feature sequence of the training audio, where the real acoustic feature sequence includes the acoustic feature corresponding to each training phoneme; processing the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, where the training acoustic feature sequence includes the processed acoustic feature corresponding to each training phoneme; and inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • Example 6 provides the method of Example 5, wherein the speech synthesis model is further obtained by training in the following manner: after extracting the real acoustic feature sequence of the training audio, determining the acoustic feature average and the acoustic feature variance of the acoustic features corresponding to the training phonemes, and using the acoustic feature average and the acoustic feature variance as the acoustic statistical features.
  • Example 7 provides an apparatus for migrating a voice style, including: an acquisition module configured to acquire a target text and first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style; a first extraction module used to extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme; a second extraction module used to extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic feature of that phoneme; a processing module used to process the initial acoustic feature sequence according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including the processed acoustic feature corresponding to each phoneme; and a synthesis module used to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being obtained by training on a corpus that conforms to the second timbre.
  • Example 8 provides the apparatus of Example 7, wherein the acoustic features include at least one of pitch, volume, or speech rate, and the second extraction module includes: a determination submodule configured to, if the acoustic feature includes the speech rate, determine one or more audio frames corresponding to each phoneme in the first audio according to the phoneme sequence and the first audio, and determine the speech rate corresponding to the phoneme according to the number of audio frames corresponding to the phoneme; and an extraction submodule used to, if the acoustic feature includes pitch, extract the pitch of each audio frame in the first audio and determine the pitch corresponding to each phoneme according to the pitch of the audio frames corresponding to that phoneme, and, if the acoustic feature includes volume, extract the volume of each audio frame in the first audio and determine the volume corresponding to each phoneme according to the volume of the audio frames corresponding to that phoneme.
  • Example 9 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing device, implements the steps of the method of any one of Examples 1 to 6.
  • Example 10 provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method of any one of Examples 1 to 6.
  • Example 11 provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform one or more of the steps of the method for migrating a voice style of any one of Examples 1 to 6.
  • Example 12 provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform one or more of the steps of the method for migrating a voice style of any one of Examples 1 to 6.

Abstract

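A method and apparatus (300) for migrating a voice style, a readable medium, and an electronic device (400), relating to the technical field of electronic information processing. The method for migrating a voice style includes: acquiring a target text and first audio corresponding to the target text (101), where the first audio conforms to a first timbre and has a target style; extracting a phoneme sequence corresponding to the target text (102); extracting an initial acoustic feature sequence corresponding to the first audio (103), where the initial acoustic feature sequence includes an acoustic feature corresponding to each phoneme and the acoustic feature indicates the prosodic feature of that phoneme; processing the initial acoustic feature sequence according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence (104); and inputting the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain second audio output by the speech synthesis model (105), where the second audio conforms to the second timbre and has the target style, and the speech synthesis model is obtained by training on a corpus that conforms to the second timbre.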

Description

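Method and apparatus for migrating a voice style, readable medium, and electronic device
Cross-reference to related applications
This application is based on and claims priority to CN application No. 202110077658.2, filed on January 20, 2021, the disclosure of which is incorporated herein in its entirety.
Technical field
The present disclosure relates to the technical field of electronic information processing, and in particular to a method and apparatus for migrating a voice style, a readable medium, and an electronic device.
Background
With the continuous development of electronic information technology, people's entertainment has become increasingly rich, and reading e-books has become a mainstream way of reading. So that users can obtain the information in an e-book by listening when reading is inconvenient, or can read and listen at the same time and take in the information through both sight and hearing, corresponding audio is often pre-recorded for the e-book for users to listen to.
E-books are usually divided into different styles according to their content, for example science fiction or suspense. Accordingly, when recording the corresponding audio, the reader also records in the style of the e-book, so that the style of the audio matches the style of the e-book.
Summary
This Summary is provided to introduce, in a simplified form, concepts that are described in detail in the Detailed description below. This Summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.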
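In a first aspect, the present disclosure provides a method for migrating a voice style, the method including:
acquiring a target text and first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style;
extracting a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme;
extracting an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic feature of that phoneme;
processing the initial acoustic feature sequence according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including the processed acoustic feature corresponding to each phoneme;
inputting the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being obtained by training on a corpus that conforms to the second timbre.
In a second aspect, the present disclosure provides an apparatus for migrating a voice style, the apparatus including:
an acquisition module, used to acquire a target text and first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style;
a first extraction module, used to extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme;
a second extraction module, used to extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic feature of that phoneme;
a processing module, used to process the initial acoustic feature sequence according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including the processed acoustic feature corresponding to each phoneme;
a synthesis module, used to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being obtained by training on a corpus that conforms to the second timbre.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, used to execute the computer program in the storage device to implement the steps of the method of the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program including instructions that, when executed by a processor, cause the processor to perform one or more steps of the method for migrating a voice style according to any implementation of the first aspect.
In a sixth aspect, the present disclosure provides a computer program product including instructions that, when executed by a processor, cause the processor to perform one or more steps of the method for migrating a voice style according to any implementation of the first aspect.
Other features and advantages of the present disclosure are described in detail in the Detailed description that follows.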
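Brief description of the drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a method for migrating a voice style according to an exemplary embodiment;
FIG. 2 is a flowchart of another method for migrating a voice style according to an exemplary embodiment;
FIG. 3 is a flowchart of another method for migrating a voice style according to an exemplary embodiment;
FIG. 4 is a processing flowchart of a speech synthesis model according to an exemplary embodiment;
FIG. 5 is a block diagram of a speech synthesis model according to an exemplary embodiment;
FIG. 6 is a flowchart of training a speech synthesis model according to an exemplary embodiment;
FIG. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment;
FIG. 8 is a block diagram of an apparatus for migrating a voice style according to an exemplary embodiment;
FIG. 9 is a block diagram of another apparatus for migrating a voice style according to an exemplary embodiment;
FIG. 10 is a block diagram of another apparatus for migrating a voice style according to an exemplary embodiment;
FIG. 11 is a block diagram of an electronic device according to an exemplary embodiment.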
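Detailed description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that terms such as "first" and "second" in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not intended to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
Note that the modifiers "one" and "a plurality of" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of messages or information exchanged between apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of those messages or information.
Because of recording cost and recording efficiency, the audio for an e-book is usually recorded by a single narrator, which makes it hard to meet users' diverse needs. If an existing speech synthesis method is used to simulate audio of another narrator reading the e-book, the style of the simulated audio does not match the style of the e-book, because speech synthesis relies on an original voice library, and the original voice library usually carries no style.
The present disclosure provides a solution that achieves style transfer during speech synthesis.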
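FIG. 1 is a flowchart of a speech style transfer method according to an exemplary embodiment. As shown in FIG. 1, the method may include steps 101 to 105.
Step 101: acquire a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style.
For example, the target text and the first audio corresponding to the target text are acquired first. The target text may be an e-book, or a chapter, a passage, or a sentence of an e-book, or another type of text such as a news item, an official-account article, or a blog post. The first audio conforms to the first timbre and has the target style. This can be understood as follows: a first narrator has the first timbre, and the first audio is the audio recorded when the first narrator reads the target text in the target style, where the target style may be, for example, romance, urban, ancient style, suspense, science fiction, military, or sports.
Step 102: extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme.
For example, the target text may be input into a pre-trained recognition model to obtain the phoneme sequence corresponding to the target text as output by the recognition model. Alternatively, the phonemes corresponding to each character of the target text may be looked up in a pre-built dictionary, and the phonemes of all characters then assembled into the phoneme sequence corresponding to the target text. A phoneme can be understood as a unit of speech obtained by dividing the pronunciation of each character, or as the vowels and consonants in the pinyin of each character. The phoneme sequence includes the phonemes corresponding to every character in the target text (one character may correspond to one or more phonemes). Take the target text "太阳出来了" as an example: the phonemes for each character are looked up in the dictionary in turn, giving the phoneme sequence "taiyangchulaile".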
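The dictionary-lookup variant of step 102 can be sketched as below. This is only an illustration: the `LEXICON` table, its initial/final split of each pinyin syllable, and the function name `text_to_phonemes` are assumptions made for the example, not part of the disclosure.

```python
# Hypothetical pronunciation dictionary: character -> list of phonemes
# (here, the initial and final of the character's pinyin).
# A real system would use a complete lexicon or a trained model.
LEXICON = {
    "太": ["t", "ai"],
    "阳": ["y", "ang"],
    "出": ["ch", "u"],
    "来": ["l", "ai"],
    "了": ["l", "e"],
}


def text_to_phonemes(text, lexicon=LEXICON):
    """Look up every character in order and concatenate its phonemes."""
    phonemes = []
    for char in text:
        if char not in lexicon:
            raise KeyError(f"no pronunciation entry for {char!r}")
        phonemes.extend(lexicon[char])
    return phonemes


phoneme_sequence = text_to_phonemes("太阳出来了")
print(" ".join(phoneme_sequence))
```

Joining the resulting phonemes without separators gives the string "taiyangchulaile" used in the example above.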
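Step 103: extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme.
For example, the first audio may be processed by signal processing to obtain the initial acoustic feature sequence, which includes the acoustic feature corresponding to each phoneme. The acoustic features may include at least one of pitch (Pitch), volume (Energy), or speech rate (Duration), and may further include noise level, tone, loudness, and the like. Specifically, HTS (HMM-based Speech Synthesis System) may be used to segment the first audio according to the phonemes in the phoneme sequence, so as to obtain the speech rate corresponding to each phoneme. Audio processing tools such as sox, librosa, or straight may also be used to process the first audio to obtain the pitch and volume corresponding to each phoneme.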
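Step 104: process the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme.
Step 105: input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
For example, different narrators have different timbres, and the range and magnitude of variation of the corresponding acoustic features differ; in other words, the acoustic features of different timbres follow different probability distributions. The acoustic feature of each phoneme in the initial acoustic feature sequence therefore follows the distribution of the first timbre and is unlikely to fit the distribution of the second timbre, which means it is difficult to synthesize audio conforming to the second timbre directly from the initial acoustic feature sequence. The acoustic feature of each phoneme in the initial acoustic feature sequence may therefore be processed according to pre-obtained acoustic statistical features of the second timbre to obtain the target acoustic feature sequence, which includes the processed acoustic feature corresponding to each phoneme. The acoustic statistical features of the second timbre can be understood as statistics computed in advance from a large amount of audio conforming to the second timbre, reflecting the probability distribution followed by the acoustic features of the second timbre. The acoustic statistical features may include one or more of speech rate statistics (for example, mean and variance), pitch statistics, or volume statistics of the second timbre. Processing the acoustic feature of each phoneme according to the acoustic statistical features can be understood as standardizing (Standardization) the acoustic feature of each phoneme, so that the processed acoustic feature of each phoneme in the target acoustic feature sequence fits the probability distribution followed by the acoustic features of the second timbre.
The phoneme sequence and the target acoustic feature sequence may then be input into the pre-trained speech synthesis model, whose output is the second audio, conforming to the second timbre and having the target style. The speech synthesis model may be trained in advance and can be understood as a TTS (Text To Speech) model that generates the second audio from the phoneme sequence and the target acoustic feature sequence. Specifically, the speech synthesis model may be trained on the basis of a Tacotron model, a Deepvoice 3 model, a Tacotron 2 model, a Wavenet model, or the like, which is not specifically limited in the present disclosure.
It should be noted that the speech synthesis model is trained on a corpus conforming to the second timbre. Taking a second narrator with the second timbre as an example, the corpus conforming to the second timbre can be understood as audio recorded when the second narrator reads arbitrary text. The arbitrary text may differ from the target text, and the second narrator may read it in any style, that is, not necessarily in the target style. In other words, existing audio read by the second narrator can be used to train the speech synthesis model. Because the model is trained on the second narrator's recordings of other text, synthesizing speech for the target text takes into account both the semantics of the target text and the target acoustic feature sequence determined from the first audio, so that the second audio conforms to the second timbre and has the target style, thereby achieving style transfer. Users can be offered multiple choices, meeting their diverse needs, without spending large amounts of time and labor recording audio of the same style in multiple timbres.
For example, the target text may be an e-book obtained from an e-book reading APP (application) and located in the "ancient style" section of that APP, so the target style is "ancient style". The first audio may be pre-recorded audio, obtained from the same APP, of narrator A (corresponding to the first timbre) reading the e-book. The phoneme sequence corresponding to the target text is extracted first, and the initial acoustic feature sequence is then determined from the first audio. The initial acoustic feature sequence is then processed according to acoustic statistical features obtained in advance from a large amount of audio read by narrator B (corresponding to the second timbre), giving the target acoustic feature sequence. Finally, the phoneme sequence and the target acoustic feature sequence are input into the speech synthesis model. The second audio output by the model conforms to narrator B's timbre and has the "ancient style" style; that is, the second audio can be understood as audio imitating narrator B reading the target text in the "ancient style" style. The speech synthesis model is trained in advance on a large amount of audio read by narrator B.
In summary, the present disclosure first acquires a target text and a first audio that corresponds to the target text, has a target style, and conforms to a first timbre; then extracts the phoneme sequence corresponding to the target text, which includes at least one phoneme; and then extracts the initial acoustic feature sequence corresponding to the first audio, which includes the acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme. The initial acoustic feature sequence is further processed according to the acoustic statistical features of a second timbre to obtain a target acoustic feature sequence that includes the processed acoustic feature corresponding to each phoneme. Finally, the phoneme sequence and the target acoustic feature sequence are input into a pre-trained speech synthesis model, which outputs a second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre. The present disclosure uses the target text and the corresponding first audio, which conforms to the first timbre and has the target style, to synthesize a second audio that conforms to the second timbre and has the target style, so that the target text is rendered in the same style with a different timbre and style transfer is achieved during speech synthesis.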
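FIG. 2 is a flowchart of another speech style transfer method according to an exemplary embodiment. As shown in FIG. 2, the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and step 103 may accordingly be implemented through steps 1031 to 1033.
Step 1031: if the acoustic features include speech rate, determine, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determine the speech rate corresponding to the phoneme from the number of audio frames corresponding to that phoneme.
For example, HTS may be used to segment the first audio according to the phonemes in the phoneme sequence to obtain the one or more audio frames corresponding to each phoneme; the speech rate corresponding to the phoneme is then determined from the duration of each audio frame and the number of audio frames corresponding to the phoneme. For example, if after segmentation a phoneme in the phoneme sequence corresponds to 3 audio frames and each audio frame lasts 10 ms, the speech rate (that is, the Duration) corresponding to that phoneme is 30 ms.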
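As a rough illustration of step 1031, the sketch below derives per-phoneme durations from a frame-level alignment. The input format (one phoneme index per audio frame) and the 10 ms frame length are assumptions for the example; the disclosure obtains the alignment with HTS, which is not reproduced here.

```python
from itertools import groupby


def phoneme_durations(frame_to_phoneme, frame_ms=10.0):
    """Return one (phoneme_index, duration_ms) pair per run of frames.

    frame_to_phoneme: the index of the phoneme each audio frame belongs to,
    in time order (a hypothetical forced-alignment result).
    """
    return [
        (index, len(list(run)) * frame_ms)
        for index, run in groupby(frame_to_phoneme)
    ]


# Phoneme 0 spans 3 frames, phoneme 1 spans 2 frames, phoneme 2 spans 4 frames.
alignment = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(phoneme_durations(alignment))
```

With 10 ms frames, the phoneme covering 3 frames gets a duration of 30 ms, matching the worked example above.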
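Step 1032: if the acoustic features include pitch, extract the pitch of each audio frame in the first audio, and determine the pitch corresponding to each phoneme from the pitches of the audio frames corresponding to that phoneme.
Step 1033: if the acoustic features include volume, extract the volume of each audio frame in the first audio, and determine the volume corresponding to each phoneme from the volumes of the audio frames corresponding to that phoneme.
Further, when the acoustic features include pitch, audio processing tools such as sox, librosa, or straight may be used to process the first audio and extract the pitch of each audio frame. The mean (or the extremum, standard deviation, etc.) of the pitches of the audio frames corresponding to each phoneme may then be taken as the pitch corresponding to that phoneme. For example, if after segmentation a phoneme in the phoneme sequence corresponds to two audio frames whose pitches are 1.2 kHz and 1.6 kHz, the mean of the two, 1.4 kHz, may be taken as the pitch corresponding to that phoneme. When the acoustic features include volume, audio processing tools such as sox, librosa, or straight may be used to process the first audio and extract the volume of each audio frame. The mean (or the extremum, standard deviation, etc.) of the volumes of the audio frames corresponding to each phoneme may then be taken as the volume corresponding to that phoneme. For example, if the volumes of the two audio frames corresponding to the phoneme are 30 dB and 80 dB, the mean of the two, 55 dB, is taken as the volume corresponding to that phoneme.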
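The frame-to-phoneme aggregation in steps 1032 and 1033 might look like the following sketch. It assumes the frame-level pitch and volume values have already been extracted by some audio tool (that extraction is not shown), and the function name and argument layout are illustrative only.

```python
from statistics import mean


def aggregate_per_phoneme(frame_values, frame_counts, reduce=mean):
    """Reduce frame-level values to one value per phoneme.

    frame_values: one value per audio frame, in time order.
    frame_counts: number of consecutive frames belonging to each phoneme.
    reduce: aggregation function; the mean by default, but max, min, or a
            standard deviation function could be passed instead.
    """
    if sum(frame_counts) != len(frame_values):
        raise ValueError("frame counts must cover every frame exactly once")
    per_phoneme, start = [], 0
    for count in frame_counts:
        per_phoneme.append(reduce(frame_values[start:start + count]))
        start += count
    return per_phoneme


# The worked example from the text: one phoneme covering two frames.
pitch_khz = aggregate_per_phoneme([1.2, 1.6], [2])
volume_db = aggregate_per_phoneme([30, 80], [2])
print(pitch_khz, volume_db)
```

Passing `reduce=max` would give the per-phoneme extremum variant mentioned in the text.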
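FIG. 3 is a flowchart of another speech style transfer method according to an exemplary embodiment. As shown in FIG. 3, the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate. Accordingly, step 104 may include steps 1041 to 1043.
Step 1041: if the acoustic features include speech rate, standardize the speech rate corresponding to each phoneme according to the speech rate mean and speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to that phoneme.
Step 1042: if the acoustic features include pitch, standardize the pitch corresponding to each phoneme according to the pitch mean and pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to that phoneme.
Step 1043: if the acoustic features include volume, standardize the volume corresponding to each phoneme according to the volume mean and volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to that phoneme.
For example, when the acoustic features include speech rate, the acoustic statistical features may include the speech rate mean (denoted duration_mean) and speech rate variance (denoted duration_var) of the second timbre; when the acoustic features include pitch, the acoustic statistical features may include the pitch mean (denoted pitch_mean) and pitch variance (denoted pitch_var); when the acoustic features include volume, the acoustic statistical features may include the volume mean (denoted energy_mean) and volume variance (denoted energy_var). Taking the case where the acoustic features include pitch, volume, and speech rate as an example, the speech rate, pitch, and volume corresponding to each phoneme may be standardized by Formula 1 to obtain the processed speech rate, pitch, and volume corresponding to that phoneme.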
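[Formula 1: image PCTCN2021136525-appb-000001, not reproduced in this text]
Here A denotes the first timbre, and the symbols of Formula 1 (images PCTCN2021136525-appb-000002 to PCTCN2021136525-appb-000007, not reproduced in this text) denote, in order: the speech rate corresponding to the i-th phoneme in the phoneme sequence, the processed speech rate corresponding to the i-th phoneme, the pitch corresponding to the i-th phoneme, the processed pitch corresponding to the i-th phoneme, the volume corresponding to the i-th phoneme, and the processed volume corresponding to the i-th phoneme.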
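Because the image of Formula 1 is not available in this text, the block below is only a sketch of a standardization that would be consistent with the surrounding definitions. The symbols d, p, and e (speech rate, pitch, and volume of the i-th phoneme under the first timbre A) and the hats marking processed values are assumed notation, and dividing by the square root of the variance rather than by the variance itself is an assumption; the original formula may differ.

```latex
\hat{d}_i^{A} = \frac{d_i^{A} - \mathrm{duration\_mean}}{\sqrt{\mathrm{duration\_var}}}, \qquad
\hat{p}_i^{A} = \frac{p_i^{A} - \mathrm{pitch\_mean}}{\sqrt{\mathrm{pitch\_var}}}, \qquad
\hat{e}_i^{A} = \frac{e_i^{A} - \mathrm{energy\_mean}}{\sqrt{\mathrm{energy\_var}}}
```

In every term the features come from the first audio (timbre A), while the mean and variance are the statistics of the second timbre, as described for steps 1041 to 1043.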
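FIG. 4 is a processing flowchart of a speech synthesis model according to an exemplary embodiment. As shown in FIG. 4, the speech synthesis model may be used to perform the following steps A and B.
Step A: determine, from the phoneme sequence, a text feature sequence corresponding to the target text, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence.
Step B: generate the second audio from the text feature sequence and the target acoustic feature sequence.
For example, to synthesize the second audio, the speech synthesis model may first extract, from the phoneme sequence, the text feature sequence (that is, the Text Embedding) corresponding to the target text. The text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence, where a text feature can be understood as a text vector that characterizes the phoneme. For example, if the phoneme sequence includes 100 phonemes and the text vector of each phoneme has dimensions 1×256, the text feature sequence may be a 100×256 matrix.
After the text feature sequence is obtained, it may be combined with the target acoustic feature sequence to generate the second audio. For example, the text feature sequence and the target acoustic feature sequence may be concatenated into a combined sequence, and the second audio is then generated from the combined sequence. For example, if the phoneme sequence includes 100 phonemes, the text feature sequence may be a 100×256 matrix and the target acoustic feature sequence is correspondingly a 100×3 matrix (each phoneme has the 3 dimensions pitch, volume, and speech rate), so the combined sequence may be a 100×259 matrix. The second audio may be generated from this 100×259 matrix.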
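The concatenation described above can be sketched with plain Python lists, as below. The zero-filled feature values are placeholders, and `concat_features` is a hypothetical helper; in the disclosure the text features come from the model's encoder.

```python
def concat_features(text_features, acoustic_features):
    """Concatenate, phoneme by phoneme, a text vector with an acoustic vector."""
    if len(text_features) != len(acoustic_features):
        raise ValueError("both sequences need exactly one row per phoneme")
    return [text + acoustic for text, acoustic in zip(text_features, acoustic_features)]


num_phonemes, text_dim = 100, 256
# Placeholder values standing in for real encoder outputs and standardized features.
text_features = [[0.0] * text_dim for _ in range(num_phonemes)]
acoustic_features = [[0.0, 0.0, 0.0] for _ in range(num_phonemes)]  # pitch, volume, speech rate

combined = concat_features(text_features, acoustic_features)
print(len(combined), len(combined[0]))  # 100 259
```

Each row of the combined sequence keeps the 256 text dimensions first and appends the 3 acoustic dimensions, which gives the 100×259 shape from the example.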
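Take the speech synthesis model shown in FIG. 5 as an example. The speech synthesis model is a Tacotron model that includes an encoder (Encoder), an attention network (Attention), a decoder (Decoder), and a post-processing network (Post-processing). The encoder may include an embedding layer (the Character Embedding layer), a pre-processing network (Pre-net) sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit, that is, convolution bank + highway network + bidirectional recurrent neural network) sub-model. The phoneme sequence may be input into the encoder. First, the embedding layer converts the phoneme sequence into word vectors; the word vectors are then input into the Pre-net sub-model, which applies a nonlinear transformation to them to improve the convergence and generalization of the speech synthesis model; finally, the CBHG sub-model obtains, from the nonlinearly transformed word vectors, a text feature sequence that characterizes the target text.
The target acoustic feature sequence may then be concatenated with the text feature sequence output by the encoder to obtain the combined sequence, which is input into the attention network. The attention network may add an attention weight to each element of the combined sequence. Specifically, the attention network may be a location-sensitive attention (Location Sensitive Attention) network, a GMM (Gaussian Mixture Model) attention network, or a Multi-Head Attention network, which is not specifically limited in the present disclosure.
The output of the attention network is then used as the input of the decoder. The decoder may include a pre-processing network sub-model (which may be the same as the pre-processing network sub-model in the encoder), an Attention-RNN, and a Decoder-RNN. The pre-processing network sub-model applies a nonlinear transformation to the input. The Attention-RNN is a single-layer, unidirectional, zoneout-based LSTM (Long Short-Term Memory) that takes the output of the pre-processing network sub-model as input and, after the LSTM units, outputs to the Decoder-RNN. The Decoder-RNN is a two-layer, unidirectional, zoneout-based LSTM that outputs mel-spectrogram information through its LSTM units; the mel-spectrogram information may include one or more mel-spectrogram features. Finally, the mel-spectrogram information is input into the post-processing network, which may include a vocoder (for example, a Wavenet vocoder or a Griffin-Lim vocoder) that converts the mel-spectrogram information to obtain the second audio.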
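FIG. 6 is a flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in FIG. 6, the speech synthesis model is trained through steps 201 to 204.
Step 201: acquire a training text, and a training phoneme sequence and training audio corresponding to the training text, the training audio conforming to the second timbre, the training phoneme sequence including at least one training phoneme.
To train the speech synthesis model, the training text and the corresponding training phoneme sequence, training audio, and so on must be acquired in advance. There may be multiple training texts and, correspondingly, multiple training phoneme sequences and training audios. The training phoneme sequence includes the training phonemes corresponding to each character in the training text, and the training audio is the audio that corresponds to the training text and conforms to the second timbre. It should be noted that the training text need not be related to the target text; that is, the training text may be different from the target text. Likewise, the training audio only needs to be audio that renders the training text in the second timbre, and no particular style is required: the training audio may have no style at all (which can be understood as plain), or it may have the target style or a style other than the target style.
Step 202: extract a real acoustic feature sequence of the training audio, the real acoustic feature sequence including an acoustic feature corresponding to each training phoneme.
Step 203: process the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, the training acoustic feature sequence including a processed acoustic feature corresponding to each training phoneme.
For example, the training audio may be processed by signal processing to obtain the real acoustic feature sequence, which includes the acoustic feature corresponding to each training phoneme. The acoustic feature indicates the prosodic features of the training phoneme and may include at least one of pitch, volume, or speech rate, and may further include noise level, tone, loudness, and the like. Alternatively, annotation information of the training phoneme sequence may be acquired, and the acoustic feature corresponding to each training phoneme read directly from the annotation information. Likewise, the acoustic features corresponding to the training phonemes may be processed according to the acoustic statistical features to obtain the training acoustic feature sequence. This processing can be understood as standardization, so that the processed acoustic feature of each training phoneme in the training acoustic feature sequence fits the probability distribution followed by the acoustic features of the second timbre.
Specifically, taking acoustic statistical features that include duration_mean, duration_var, pitch_mean, pitch_var, energy_mean, and energy_var as an example, the speech rate, pitch, and volume corresponding to each training phoneme may be standardized by Formula 2 to obtain the processed speech rate, pitch, and volume corresponding to that training phoneme.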
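[Formula 2: image PCTCN2021136525-appb-000008, not reproduced in this text]
Here B denotes the second timbre, and the symbols of Formula 2 (images PCTCN2021136525-appb-000009 to PCTCN2021136525-appb-000014, not reproduced in this text) denote, in order: the speech rate corresponding to the i-th training phoneme in the training phoneme sequence, the processed speech rate corresponding to the i-th training phoneme, the pitch corresponding to the i-th training phoneme, the processed pitch corresponding to the i-th training phoneme, the volume corresponding to the i-th training phoneme, and the processed volume corresponding to the i-th training phoneme.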
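Step 204: input the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
Finally, the training phoneme sequence and the training acoustic feature sequence are used as the input of the speech synthesis model, and the speech synthesis model is trained according to its output and the training audio. For example, the difference (or the mean squared error) between the output of the speech synthesis model and the training audio may be used as the loss function of the speech synthesis model, and, with the goal of reducing the loss function, the backpropagation algorithm is used to adjust the parameters of the neurons in the speech synthesis model, such as the weights (Weight) and biases (Bias) of the neurons. The above steps are repeated until the loss function satisfies a preset condition, for example until the loss is smaller than a preset loss threshold.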
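The training loop described above (mean squared error loss, gradient-based parameter updates, stop once the loss falls below a threshold) can be illustrated with the toy sketch below. It deliberately replaces the speech synthesis model with a one-weight, one-bias linear model and hand-written gradients, so it shows only the stopping logic and the role of the weight and bias, not the actual network or backpropagation through it. The learning rate, threshold, and data are arbitrary assumptions.

```python
def mse(predictions, targets):
    """Mean squared error between model output and reference values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def train(inputs, targets, lr=0.05, loss_threshold=1e-6, max_steps=10_000):
    """Fit y = w * x + b by gradient descent until the loss is below the threshold."""
    w, b = 0.0, 0.0
    loss = float("inf")
    n = len(inputs)
    for _ in range(max_steps):
        predictions = [w * x + b for x in inputs]
        loss = mse(predictions, targets)
        if loss < loss_threshold:  # preset stopping condition
            break
        # Gradients of the MSE loss with respect to the weight and the bias.
        grad_w = sum(2 * (p - t) * x for p, t, x in zip(predictions, targets, inputs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(predictions, targets)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Toy data generated by y = 2x + 1.
w, b, final_loss = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 2), round(b, 2), final_loss < 1e-6)
```

In a real system the same loop would typically be written with a deep learning framework, with the loss computed between the model's output and features of the training audio.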
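FIG. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in FIG. 7, training the speech synthesis model further includes step 205.
Step 205: after the real acoustic feature sequence of the training audio is extracted, determine the acoustic feature mean and acoustic feature variance of the acoustic features corresponding to the training phonemes, and use the acoustic feature mean and acoustic feature variance as the acoustic statistical features.
For example, after the real acoustic feature sequence is extracted in step 202, the acoustic statistical features may be determined from the acoustic feature corresponding to each training phoneme. For example, the acoustic feature mean and acoustic feature variance over the acoustic features of all training phonemes may be determined and used as the acoustic statistical features. Specifically, the speech rate mean and speech rate variance may be determined from the speech rates of all training phonemes, the pitch mean and pitch variance from the pitches of all training phonemes, and the volume mean and volume variance from the volumes of all training phonemes. The speech rate mean and variance, the pitch mean and variance, and the volume mean and variance are then used as the acoustic statistical features.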
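A minimal sketch of step 205, followed by the standardization it feeds, is given below. The use of the population variance and the division by the square root of the variance are assumptions for the example (the disclosure only states that a mean and a variance are used), and the duration values are made up.

```python
import math
from statistics import fmean, pvariance


def acoustic_statistics(values):
    """Return (mean, variance) of one acoustic feature over all training phonemes."""
    mean_value = fmean(values)
    return mean_value, pvariance(values, mu=mean_value)


def standardize(values, mean_value, variance):
    """Shift by the mean and scale by the standard deviation (assumed form)."""
    if variance <= 0:
        raise ValueError("variance must be positive to standardize")
    std = math.sqrt(variance)
    return [(value - mean_value) / std for value in values]


# Hypothetical per-phoneme durations (ms) extracted from second-timbre training audio.
durations_ms = [30.0, 50.0, 70.0, 50.0]
duration_mean, duration_var = acoustic_statistics(durations_ms)
training_durations = standardize(durations_ms, duration_mean, duration_var)
print(duration_mean, duration_var, training_durations)
```

The same two functions would be applied separately to the pitch and volume values to obtain pitch_mean, pitch_var, energy_mean, and energy_var.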
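FIG. 8 is a block diagram of a speech style transfer apparatus according to an exemplary embodiment. As shown in FIG. 8, the apparatus 300 includes:
an acquisition module 301, configured to acquire a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style;
a first extraction module 302, configured to extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme;
a second extraction module 303, configured to extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme;
a processing module 304, configured to process the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme;
a synthesis module 305, configured to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
FIG. 9 is a block diagram of another speech style transfer apparatus according to an exemplary embodiment. As shown in FIG. 9, the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and the second extraction module 303 includes:
a determination submodule 3031, configured to, if the acoustic features include speech rate, determine, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determine the speech rate corresponding to the phoneme from the number of audio frames corresponding to that phoneme;
an extraction submodule 3032, configured to, if the acoustic features include pitch, extract the pitch of each audio frame in the first audio and determine the pitch corresponding to each phoneme from the pitches of the audio frames corresponding to that phoneme; and, if the acoustic features include volume, extract the volume of each audio frame in the first audio and determine the volume corresponding to each phoneme from the volumes of the audio frames corresponding to that phoneme.
FIG. 10 is a block diagram of another speech style transfer apparatus according to an exemplary embodiment. As shown in FIG. 10, the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate. The processing module 304 includes:
a first processing submodule 3041, configured to, if the acoustic features include speech rate, standardize the speech rate corresponding to each phoneme according to the speech rate mean and speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to that phoneme;
a second processing submodule 3042, configured to, if the acoustic features include pitch, standardize the pitch corresponding to each phoneme according to the pitch mean and pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to that phoneme;
a third processing submodule 3043, configured to, if the acoustic features include volume, standardize the volume corresponding to each phoneme according to the volume mean and volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to that phoneme.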
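In one application scenario, the speech synthesis model in the above embodiments may be used to perform the following steps:
Step A: determine, from the phoneme sequence, a text feature sequence corresponding to the target text, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence;
Step B: generate the second audio from the text feature sequence and the target acoustic feature sequence.
In another application scenario, the speech synthesis model is trained as follows:
Step 1): acquire a training text, and a training phoneme sequence and training audio corresponding to the training text, the training audio conforming to the second timbre, the training phoneme sequence including at least one training phoneme;
Step 2): extract a real acoustic feature sequence of the training audio, the real acoustic feature sequence including an acoustic feature corresponding to each training phoneme;
Step 3): process the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, the training acoustic feature sequence including a processed acoustic feature corresponding to each training phoneme;
Step 4): input the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
In yet another application scenario, the training process of the speech synthesis model further includes:
Step 5): after the real acoustic feature sequence of the training audio is extracted, determine the acoustic feature mean and acoustic feature variance of the acoustic features corresponding to the training phonemes, and use the acoustic feature mean and acoustic feature variance as the acoustic statistical features.
As for the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.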
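FIG. 11 shows a schematic structural diagram of an electronic device 400 (which may be understood as the execution body in the above embodiments) suitable for implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 11 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 11, the electronic device 400 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage apparatus 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing apparatus 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following apparatuses may be connected to the I/O interface 405: an input apparatus 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output apparatus 407 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage apparatus 408 including, for example, a magnetic tape or hard disk; and a communication apparatus 409. The communication apparatus 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 11 shows the electronic device 400 with various apparatuses, it should be understood that it is not required to implement or include all of the apparatuses shown. More or fewer apparatuses may alternatively be implemented or included.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 409, or installed from the storage apparatus 408, or installed from the ROM 402. When the computer program is executed by the processing apparatus 401, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), or any suitable combination of the above.
In some implementations, the terminal device and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style; extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme; extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme; process the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme; and input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, limit the module itself; for example, the acquisition module may also be described as "a module that acquires the target text and the first audio".
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.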
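According to one or more embodiments of the present disclosure, Example 1 provides a speech style transfer method, including: acquiring a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style; extracting a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme; extracting an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme; processing the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme; and inputting the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and extracting the initial acoustic feature sequence corresponding to the first audio includes: if the acoustic features include speech rate, determining, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determining the speech rate corresponding to the phoneme from the number of audio frames corresponding to that phoneme; if the acoustic features include pitch, extracting the pitch of each audio frame in the first audio, and determining the pitch corresponding to each phoneme from the pitches of the audio frames corresponding to that phoneme; if the acoustic features include volume, extracting the volume of each audio frame in the first audio, and determining the volume corresponding to each phoneme from the volumes of the audio frames corresponding to that phoneme.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and processing the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain the target acoustic feature sequence includes: if the acoustic features include speech rate, standardizing the speech rate corresponding to each phoneme according to the speech rate mean and speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to that phoneme; if the acoustic features include pitch, standardizing the pitch corresponding to each phoneme according to the pitch mean and pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to that phoneme; if the acoustic features include volume, standardizing the volume corresponding to each phoneme according to the volume mean and volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to that phoneme.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, wherein the speech synthesis model is used to: determine, from the phoneme sequence, a text feature sequence corresponding to the target text, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence; and generate the second audio from the text feature sequence and the target acoustic feature sequence.
According to one or more embodiments of the present disclosure, Example 5 provides the method of any of Examples 1 to 4, wherein the speech synthesis model is trained as follows: acquiring a training text, and a training phoneme sequence and training audio corresponding to the training text, the training audio conforming to the second timbre, the training phoneme sequence including at least one training phoneme; extracting a real acoustic feature sequence of the training audio, the real acoustic feature sequence including an acoustic feature corresponding to each training phoneme; processing the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, the training acoustic feature sequence including a processed acoustic feature corresponding to each training phoneme; and inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 5, wherein the speech synthesis model is further trained as follows: after the real acoustic feature sequence of the training audio is extracted, determining the acoustic feature mean and acoustic feature variance of the acoustic features corresponding to the training phonemes, and using the acoustic feature mean and the acoustic feature variance as the acoustic statistical features.
According to one or more embodiments of the present disclosure, Example 7 provides a speech style transfer apparatus, including: an acquisition module, configured to acquire a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style; a first extraction module, configured to extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme; a second extraction module, configured to extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of that phoneme; a processing module, configured to process the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme; and a synthesis module, configured to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
According to one or more embodiments of the present disclosure, Example 8 provides the apparatus of Example 7, wherein the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and the second extraction module includes: a determination submodule, configured to, if the acoustic features include speech rate, determine, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determine the speech rate corresponding to the phoneme from the number of audio frames corresponding to that phoneme; and an extraction submodule, configured to, if the acoustic features include pitch, extract the pitch of each audio frame in the first audio and determine the pitch corresponding to each phoneme from the pitches of the audio frames corresponding to that phoneme, and, if the acoustic features include volume, extract the volume of each audio frame in the first audio and determine the volume corresponding to each phoneme from the volumes of the audio frames corresponding to that phoneme.
According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium storing a computer program which, when executed by a processing apparatus, implements the steps of the method of any of Examples 1 to 6.
According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including: a storage apparatus storing a computer program; and a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method of any of Examples 1 to 6.
According to one or more embodiments of the present disclosure, Example 11 provides a computer program including instructions which, when executed by a processor, cause the processor to perform one or more steps of the speech style transfer method of any of Examples 1 to 6.
According to one or more embodiments of the present disclosure, Example 12 provides a computer program product including instructions which, when executed by a processor, cause the processor to perform one or more steps of the speech style transfer method of any of Examples 1 to 6.
The above description covers only preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, they should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.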

Claims (12)

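  1. A speech style transfer method, including:
    acquiring a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style;
    extracting a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme;
    extracting an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of the phoneme;
    processing the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme;
    inputting the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
  2. The method according to claim 1, wherein the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and extracting the initial acoustic feature sequence corresponding to the first audio includes:
    if the acoustic features include speech rate, determining, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determining the speech rate corresponding to the phoneme from the number of audio frames corresponding to the phoneme;
    if the acoustic features include pitch, extracting the pitch of each audio frame in the first audio, and determining the pitch corresponding to the phoneme from the pitches of the audio frames corresponding to each phoneme;
    if the acoustic features include volume, extracting the volume of each audio frame in the first audio, and determining the volume corresponding to the phoneme from the volumes of the audio frames corresponding to each phoneme.
  3. The method according to claim 1, wherein the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate;
    processing the initial acoustic feature sequence according to the acoustic statistical features of the second timbre to obtain the target acoustic feature sequence includes:
    if the acoustic features include speech rate, standardizing the speech rate corresponding to each phoneme according to the speech rate mean and speech rate variance included in the acoustic statistical features, to obtain the processed speech rate corresponding to the phoneme;
    if the acoustic features include pitch, standardizing the pitch corresponding to each phoneme according to the pitch mean and pitch variance included in the acoustic statistical features, to obtain the processed pitch corresponding to the phoneme;
    if the acoustic features include volume, standardizing the volume corresponding to each phoneme according to the volume mean and volume variance included in the acoustic statistical features, to obtain the processed volume corresponding to the phoneme.
  4. The method according to claim 1, wherein the speech synthesis model is used to:
    determine, from the phoneme sequence, a text feature sequence corresponding to the target text, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence;
    generate the second audio from the text feature sequence and the target acoustic feature sequence.
  5. The method according to any one of claims 1-4, further including:
    acquiring a training text, and a training phoneme sequence and training audio corresponding to the training text, the training audio conforming to the second timbre, the training phoneme sequence including at least one training phoneme;
    extracting a real acoustic feature sequence of the training audio, the real acoustic feature sequence including an acoustic feature corresponding to each training phoneme;
    processing the real acoustic feature sequence according to the acoustic statistical features to obtain a training acoustic feature sequence, the training acoustic feature sequence including a processed acoustic feature corresponding to each training phoneme;
    inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
  6. The method according to claim 5, further including:
    after the real acoustic feature sequence of the training audio is extracted, determining the acoustic feature mean and acoustic feature variance of the acoustic features corresponding to the training phonemes, and using the acoustic feature mean and the acoustic feature variance as the acoustic statistical features.
  7. A speech style transfer apparatus, including:
    an acquisition module, configured to acquire a target text and a first audio corresponding to the target text, the first audio conforming to a first timbre and having a target style;
    a first extraction module, configured to extract a phoneme sequence corresponding to the target text, the phoneme sequence including at least one phoneme;
    a second extraction module, configured to extract an initial acoustic feature sequence corresponding to the first audio, the initial acoustic feature sequence including an acoustic feature corresponding to each phoneme, the acoustic feature indicating the prosodic features of the phoneme;
    a processing module, configured to process the initial acoustic feature sequence according to acoustic statistical features of a second timbre to obtain a target acoustic feature sequence, the target acoustic feature sequence including a processed acoustic feature corresponding to each phoneme;
    a synthesis module, configured to input the phoneme sequence and the target acoustic feature sequence into a pre-trained speech synthesis model to obtain a second audio output by the speech synthesis model, the second audio conforming to the second timbre and having the target style, the speech synthesis model being trained on a corpus conforming to the second timbre.
  8. The apparatus according to claim 7, characterized in that the acoustic features include at least one of pitch (fundamental frequency), volume, or speech rate, and the second extraction module includes:
    a determination submodule, configured to, if the acoustic features include speech rate, determine, from the phoneme sequence and the first audio, the one or more audio frames in the first audio corresponding to each phoneme, and determine the speech rate corresponding to the phoneme from the number of audio frames corresponding to the phoneme;
    an extraction submodule, configured to, if the acoustic features include pitch, extract the pitch of each audio frame in the first audio and determine the pitch corresponding to the phoneme from the pitches of the audio frames corresponding to each phoneme; and, if the acoustic features include volume, extract the volume of each audio frame in the first audio and determine the volume corresponding to the phoneme from the volumes of the audio frames corresponding to each phoneme.
  9. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1-6.
  10. An electronic device, characterized by including:
    a storage apparatus storing a computer program;
    a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method according to any one of claims 1-6.
  11. A computer program including instructions which, when executed by a processor, cause the processor to perform one or more steps of the speech style transfer method according to any one of claims 1-6.
  12. A computer program product including instructions which, when executed by a processor, cause the processor to perform one or more steps of the speech style transfer method according to any one of claims 1-6.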
PCT/CN2021/136525 2021-01-20 2021-12-08 Speech style transfer method, apparatus, readable medium, and electronic device WO2022156413A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110077658.2A CN112927674B (zh) 2021-01-20 2021-01-20 语音风格的迁移方法、装置、可读介质和电子设备
CN202110077658.2 2021-01-20

Publications (1)

Publication Number Publication Date
WO2022156413A1 true WO2022156413A1 (zh) 2022-07-28

Family

ID=76165243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/136525 WO2022156413A1 (zh) 2021-01-20 2021-12-08 语音风格的迁移方法、装置、可读介质和电子设备

Country Status (2)

Country Link
CN (1) CN112927674B (zh)
WO (1) WO2022156413A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927674B (zh) * 2021-01-20 2024-03-12 北京有竹居网络技术有限公司 语音风格的迁移方法、装置、可读介质和电子设备
CN114299910B (zh) * 2021-09-06 2024-03-22 腾讯科技(深圳)有限公司 语音合成模型的训练方法、使用方法、装置、设备及介质
CN114613353B (zh) * 2022-03-25 2023-08-08 马上消费金融股份有限公司 语音合成方法、装置、电子设备及存储介质
WO2024103383A1 (zh) * 2022-11-18 2024-05-23 广州酷狗计算机科技有限公司 音频处理方法、装置、设备、存储介质及程序产品

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786A (zh) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 语音合成方法和装置
US20180254034A1 (en) * 2015-10-20 2018-09-06 Baidu Online Network Technology (Beijing) Co., Ltd Training method for multiple personalized acoustic models, and voice synthesis method and device
US20190096386A1 (en) * 2017-09-28 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating speech synthesis model
CN111292720A (zh) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 语音合成方法、装置、计算机可读介质及电子设备
CN111599343A (zh) * 2020-05-14 2020-08-28 北京字节跳动网络技术有限公司 用于生成音频的方法、装置、设备和介质
CN111667816A (zh) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 模型训练方法、语音合成方法、装置、设备和存储介质
CN111754976A (zh) * 2020-07-21 2020-10-09 中国科学院声学研究所 一种韵律控制语音合成方法、系统及电子装置
CN111785247A (zh) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 语音生成方法、装置、设备和计算机可读介质
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
CN112927674A (zh) * 2021-01-20 2021-06-08 北京有竹居网络技术有限公司 语音风格的迁移方法、装置、可读介质和电子设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223705B (zh) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 语音转换方法、装置、设备及可读存储介质
CN110534089B (zh) * 2019-07-10 2022-04-22 西安交通大学 一种基于音素和韵律结构的中文语音合成方法
CN110600045A (zh) * 2019-08-14 2019-12-20 科大讯飞股份有限公司 声音转换方法及相关产品
CN111583904B (zh) * 2020-05-13 2021-11-19 北京字节跳动网络技术有限公司 语音合成方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
CN112927674B (zh) 2024-03-12
CN112927674A (zh) 2021-06-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21920787

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21920787

Country of ref document: EP

Kind code of ref document: A1