CN113160794B - Voice synthesis method and device based on timbre clone and related equipment


Publication number
CN113160794B
Authority
CN
China
Prior art keywords
phoneme, sequence, model, processing, cloned
Legal status
Active
Application number
CN202110482151.5A
Other languages
Chinese (zh)
Other versions
CN113160794A (en)
Inventor
宋伟
袁鑫
张政臣
吴友政
何晓冬
周伯文
Current Assignee
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Application filed by Jingdong Technology Holding Co Ltd
Priority to CN202110482151.5A
Publication of CN113160794A
Application granted
Publication of CN113160794B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The embodiments of the present disclosure provide a voice synthesis method, apparatus, electronic device and readable medium based on timbre cloning. The method includes: acquiring a text to be cloned for a target user and obtaining a phoneme sequence of the text to be cloned, the phoneme sequence comprising at least one phoneme; processing the phoneme sequence through a first model to obtain a predicted duration sequence comprising the predicted duration of each phoneme; processing the predicted duration of each phoneme and the phoneme sequence through a second model to obtain target prediction features; and synthesizing speech according to the target prediction features. The method, apparatus, electronic device and readable medium for timbre-clone-based voice synthesis provided by the embodiments of the present disclosure can improve the robustness of the model and improve the accuracy and quality of speech synthesis.

Description

Voice synthesis method and device based on timbre clone and related equipment
Technical Field
The present disclosure relates to the field of voice synthesis technology based on timbre cloning, and in particular, to a method and an apparatus for voice synthesis based on timbre cloning, an electronic device, and a computer readable medium.
Background
In the technical field of speech cloning, encoder-decoder models based on an attention mechanism are often used for acoustic feature prediction. In practice, however, the attention mechanism in such structures can cause repeated pronunciation and missing sounds, and can even cause the prediction to fail to terminate.
Therefore, a new method, apparatus, electronic device and computer readable medium for voice synthesis based on timbre cloning are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a computer-readable medium for voice synthesis based on timbre cloning, which can avoid the problems of missing sounds, repeated pronunciation, and failure to end prediction caused by the attention mechanism in prior-art encoders.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the embodiments of the present disclosure, a method for speech synthesis based on timbre cloning is provided, the method including: acquiring a text to be cloned for a target user, and obtaining a phoneme sequence of the text to be cloned, wherein the phoneme sequence comprises at least one phoneme; processing the phoneme sequence through a first model to obtain a predicted duration sequence, wherein the predicted duration sequence comprises the predicted duration of each phoneme; processing the predicted duration of each phoneme and the phoneme sequence through a second model to obtain target prediction features; and performing speech synthesis according to the target prediction features to obtain synthesized speech of the text to be cloned for the target user;
the processing the predicted duration of each phoneme and the phoneme sequence through the second model to obtain the target prediction characteristics comprises the following steps: processing the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long-time memory module which are sequentially connected to obtain an initial characterization vector sequence, wherein the initial characterization vector sequence comprises initial characterization vectors of the phonemes; processing the initial characterization vector sequence according to the predicted duration of each phoneme to obtain a frame-level characterization vector of each phoneme; and processing the frame level representation vector of each phoneme and the user embedded representation of the target user by using a decoder to obtain target prediction characteristics.
In an exemplary embodiment of the present disclosure, processing the initial characterization vector sequence according to the predicted duration of each phoneme to obtain the frame-level characterization vector sequence includes: determining the number of repetitions of each phoneme according to the unit duration of each frame and the predicted duration of each phoneme; and expanding the initial characterization vector of each phoneme in the initial characterization vector sequence according to the number of repetitions of each phoneme to obtain the frame-level characterization vector sequence.
In an exemplary embodiment of the present disclosure, processing the phoneme sequence through the first model to obtain the predicted duration sequence comprises: performing embedded representation on the phoneme sequence to obtain a phoneme embedded representation sequence, wherein the phoneme embedded representation sequence comprises the embedded representation of each phoneme; processing the phoneme embedded representation sequence by using n sequentially connected second one-dimensional convolution modules to obtain a phoneme one-dimensional convolution result, wherein n is an integer greater than 0; processing the phoneme one-dimensional convolution result by using a first fully connected layer to obtain a first fully connected layer output; obtaining position coding information according to the position of each phoneme in the phoneme sequence; performing bitwise addition on the phoneme embedded representation sequence, the position coding information, the first fully connected layer output and the user embedded representation to obtain a bitwise addition result; processing the bitwise addition result through a self-attention structure to obtain an attention structure output; and processing the attention structure output through a second fully connected layer to obtain the predicted duration sequence.
In an exemplary embodiment of the present disclosure, the second one-dimensional convolution module includes a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer, and an anti-overfitting layer, which are connected in sequence.
In an exemplary embodiment of the present disclosure, processing the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long short-term memory module connected in sequence to obtain the initial characterization vector sequence includes: performing embedded representation on the phoneme sequence to obtain a phoneme embedded representation sequence; processing the phoneme embedded representation sequence by using m sequentially connected first one-dimensional convolution modules to obtain a first one-dimensional convolution result, wherein each first one-dimensional convolution module comprises a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer and an anti-overfitting layer connected in sequence; processing the first one-dimensional convolution result by using a bidirectional long short-term memory network to obtain a bidirectional long short-term memory network output; and performing bitwise addition on the embedded representation of each phoneme, the bidirectional long short-term memory network output and the user embedded representation to obtain the initial characterization vector of each phoneme.
In an exemplary embodiment of the present disclosure, processing the frame-level token vector of each phoneme and the user-embedded representation of the target user with a decoder, and obtaining the target prediction feature includes: and taking the frame level characterization vector of each phoneme as the input of the decoder, and updating the output of each preprocessing network according to the bit-wise addition result of the user embedded representation and the output of each preprocessing network in the decoder to obtain the target prediction feature output by the decoder.
In an exemplary embodiment of the present disclosure, the method further comprises: training the first model and the second model according to an original training sample set to obtain a basic model of the first model and a basic model of the second model; and acquiring voice information to be cloned of a target user, and performing transfer learning on the basic model of the first model and the basic model of the second model by using the voice information to be cloned to acquire the trained first model and the trained second model.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice synthesis apparatus based on timbre cloning, the apparatus including: a to-be-cloned text acquisition module configured to acquire a text to be cloned for a target user and obtain a phoneme sequence of the text to be cloned, the phoneme sequence comprising at least one phoneme; a first model processing module configured to process the phoneme sequence through a first model to obtain a predicted duration sequence comprising the predicted duration of each phoneme; a second model processing module configured to process the predicted duration of each phoneme and the phoneme sequence through a second model to obtain target prediction features; and a speech synthesis module configured to perform speech synthesis according to the target prediction features to obtain synthesized speech of the text to be cloned for the target user;
wherein the second model processing module comprises: an encoding unit configured to process the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long short-term memory module which are sequentially connected to obtain an initial characterization vector sequence, wherein the initial characterization vector sequence comprises the initial characterization vector of each phoneme; an initial characterization vector processing unit configured to process the initial characterization vector sequence according to the predicted duration of each phoneme to obtain the frame-level characterization vector of each phoneme; and a decoding unit configured to process the frame-level characterization vector of each phoneme and the user embedded representation of the target user by using a decoder to obtain the target prediction features.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the timbre-clone-based speech synthesis methods described above.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable medium is proposed, on which a computer program is stored, which when executed by a processor, implements the method for voice synthesis based on timbre cloning as described in any one of the above.
According to the method, the device, the electronic equipment and the computer readable medium for synthesizing the voice based on the timbre clone, which are provided by some embodiments of the present disclosure, based on the first model and the second model obtained by training the voice information to be cloned of the target user, in the process of processing the phoneme sequence of the text to be cloned, because the first model is obtained by training alone, a more flexible phoneme duration prediction mode can be provided. In the process of processing the predicted duration of each phoneme and the phoneme sequence through the second model, after an initial characterization vector sequence is obtained by using an encoder, the characterization vector of each phoneme is expanded to a frame level based on the predicted duration corresponding to each phoneme, and the obtained frame level characterization vector is consistent with the length of a target predicted feature (namely, an acoustic feature sequence) obtained by prediction of a decoder, so that the encoder in the second model in the scheme can avoid adopting an attention mechanism, and further the problems of sound loss, repeated pronunciation and failure of prediction ending caused by the attention mechanism in the encoder in the prior art can be avoided, thereby improving the robustness of the model and improving the accuracy and the synthesis quality of speech synthesis.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
Fig. 1 is a system block diagram illustrating a method and apparatus for voice synthesis based on timbre cloning according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method for timbre clone based speech synthesis according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a structure of a first model according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating a first one-dimensional convolution module according to an exemplary embodiment.
FIG. 5 is a schematic diagram illustrating a second model architecture according to an exemplary embodiment.
FIG. 6 is a robustness test result presentation diagram shown in accordance with an exemplary embodiment.
FIG. 7 is a graph illustrating performance displays according to an example embodiment.
FIG. 8 is a block diagram illustrating a timbre cloning based speech synthesis apparatus according to an exemplary embodiment.
Fig. 9 schematically illustrates a block diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The drawings are merely schematic illustrations of the present invention, in which the same reference numerals denote the same or similar parts, and thus, a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and steps nor must they be performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In the related art, the traditional end-to-end model needs to solve the problem of inconsistent lengths between the encoder and the decoder through an attention mechanism; however, in the field of speech synthesis, the attention mechanism can cause some input phonemes to be omitted or repeatedly pronounced, i.e. the skip/repeat problem, so that missing sounds, repeated pronunciation and failure to end prediction can occur. Meanwhile, the duration prediction model is obtained by joint training with the whole acoustic feature prediction model, which reduces the flexibility of the neural network and prevents other data from being fully utilized to improve the flexibility and accuracy of phoneme duration prediction.
The following detailed description of exemplary embodiments of the invention refers to the accompanying drawings.
Fig. 1 is a system block diagram illustrating a method and apparatus for voice synthesis based on timbre cloning according to an exemplary embodiment.
In the system 100 of the method and apparatus for voice synthesis based on timbre cloning, the server 105 may be a server providing various services, such as a background management server (for example only) providing support to a voice synthesis system based on timbre cloning operated by users with the terminal devices 101, 102, 103 via the network 104. The background management server may analyze and/or otherwise process the received voice synthesis request based on the timbre clone, and/or feed back the processing result (e.g., synthesized voice — just an example) to the terminal device.
The server 105 may be a single server, or may be composed of a plurality of servers. For example, part of the server 105 may serve as a timbre-clone-based voice synthesis task submitting system in the present disclosure, for obtaining a task to execute a timbre-clone-based voice synthesis command; and part of the server 105 may serve as the timbre-clone-based voice synthesis system in the present disclosure, for obtaining a text to be cloned for a target user and obtaining a phoneme sequence of the text to be cloned, the phoneme sequence including at least one phoneme; processing the phoneme sequence through a first model to obtain a predicted duration sequence including the predicted duration of each phoneme; processing the predicted duration of each phoneme and the phoneme sequence through a second model to obtain target prediction features; and synthesizing speech according to the target prediction features.
According to the method and the device for synthesizing the voice based on the timbre clone, which are provided by the embodiment of the disclosure, the robustness of the model can be improved, and the accuracy and the synthesis quality of the voice synthesis can be improved.
FIG. 2 is a flow diagram illustrating a method for timbre clone based speech synthesis in accordance with an exemplary embodiment. The method for voice synthesis based on timbre cloning provided by the embodiment of the present disclosure may be executed by any electronic device with computing processing capability, such as the terminal devices 101, 102, and 103 and/or the server 105, and in the following embodiments, the method executed by the server is taken as an example for illustration, but the present disclosure is not limited thereto. The method for synthesizing voice based on timbre cloning provided by the embodiment of the present disclosure may include steps S210 to S240.
As shown in fig. 2, in step S210, a text to be cloned for a target user is obtained, and a phoneme sequence of the text to be cloned is obtained, where the phoneme sequence includes at least one phoneme.
In the embodiment of the disclosure, the text to be cloned is text information which needs to be subjected to speech synthesis. Wherein the phoneme sequence is obtained by converting the text to be cloned into phonemes.
In step S220, the phoneme sequence is processed through the first model to obtain a predicted duration sequence, and the predicted duration sequence includes the predicted duration of each phoneme.
In the embodiment of the present disclosure, the first model may be obtained by training according to the to-be-cloned voice information of the target user. The voice information to be cloned of the target user is voice format data recorded by the target user. The user embedded representation of the target user can be obtained, the phoneme sequence is processed through the first model, and the predicted duration sequence is obtained and comprises the predicted duration of each phoneme. In the embodiment, the duration prediction process of the phoneme is compensated according to the user embedded representation of the user, so that the robustness of model prediction can be improved, and the duration prediction accuracy of the phoneme can be improved. Meanwhile, the embodiment of the disclosure provides a more flexible phoneme duration prediction mode through the first model trained independently.
In step S230, the predicted duration and phoneme sequence of each phoneme are processed by the second model to obtain the target prediction characteristics.
In the embodiment of the present disclosure, the second model may be obtained by training according to the to-be-cloned speech information of the target user. The second model may be an acoustic feature prediction model. An acoustic feature is a physical quantity representing the acoustic characteristics of speech, and is a general term for the acoustic representation of the elements of sound, for example, the energy concentration zones, formant frequency, formant intensity and bandwidth that represent timbre, and the duration, fundamental frequency and average speech power that represent the prosodic characteristics of speech. Since the first model and the second model are trained according to the to-be-cloned voice information of the target user, the target prediction features obtained in this step can predict the acoustic characteristics that the target user would exhibit when speaking the text to be cloned.
In step S240, speech synthesis is performed according to the target prediction feature, so as to obtain a synthesized speech of the text to be cloned for the target user.
In the embodiment of the present disclosure, the voice format information may be synthesized based on the target prediction feature to perform voice synthesis with the target user as a clone object. For example, the text to be cloned is "where you are", the speech synthesis of this step is simulated speech information that mimics the acoustic characteristics of the target user when saying "where you are".
Wherein, the step S230 may further include the following steps S231 to S233.
In step S231, the phoneme sequence is processed by an encoder including a first one-dimensional convolution module and a two-way long-and-short-term memory module connected in sequence, so as to obtain an initial token vector sequence, where the initial token vector sequence includes an initial token vector of each phoneme.
A schematic view of the second model can be seen in fig. 5. As shown in fig. 5, the encoder may include m first one-dimensional convolution modules 510 and a bidirectional long short-term memory network layer 520 (BLSTM Layer), m being an integer greater than 0. The structure of the first one-dimensional convolution module 510 can be shown in fig. 4; as shown in fig. 4, the first one-dimensional convolution module 510 may include a one-dimensional convolution layer (1-D Conv) 401, a batch normalization operation layer (Batch Norm) 402, an activation function layer (e.g., ReLU) 403, and an anti-overfitting layer (Dropout Layer) 404 connected in sequence. The bidirectional long short-term memory network layer 520 may be formed by combining a forward long short-term memory network and a backward long short-term memory network. An encoder network based on the one-dimensional convolution operation and the bidirectional long short-term memory network structure can improve the robustness of the model and thus the prediction accuracy of the target voice features.
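For illustration only, a minimal sketch of one such one-dimensional convolution module (fig. 4) in PyTorch is given below; the channel size, kernel size and dropout rate are assumptions and not values claimed by the embodiments.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """1-D convolution -> batch normalization -> activation (ReLU) -> dropout."""
    def __init__(self, channels: int = 256, kernel_size: int = 5, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):            # x: (batch, channels, sequence_length)
        return self.net(x)
```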
In step S232, the initial token vector sequence is processed according to the predicted duration of each phoneme, and a frame-level token vector of each phoneme is obtained.
In the embodiment of the present disclosure, the token vector of each phoneme can be extended to the frame level through the predicted duration corresponding to the phoneme, so that the input of the encoder and the decoder in the second model can have the same sequence length. This step may be performed by the Length adjuster 530 (Length adjuster) in FIG. 5.
In step S233, the frame-level characterization vectors of the phonemes and the user-embedded representation of the target user are processed by the decoder to obtain the target prediction features.
In the embodiment of the present disclosure, the decoder may have an autoregressive decoder network structure, that is, the output at the previous time step is used as the input at the current time step, the output at the current time step is used as the input at the next time step, and so on. The structure of the decoder can be shown in fig. 5, and may include a preprocessing network 540, a long short-term memory network layer 550, a third fully connected layer 560, and a post-processing network 570. The reduction factor of the output vector of the third fully connected layer 560 may be, for example, 4. The post-processing network may include p sequentially connected third one-dimensional convolution modules 571 and a fourth fully connected layer 572. The third one-dimensional convolution module 571 may have a structure similar to that of the first one-dimensional convolution module 510 and the second one-dimensional convolution module 310, and will not be described again here.
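For illustration only, the autoregressive decoder can be sketched as follows in PyTorch, reusing the ConvBlock sketch above. The prenet design, layer sizes, postnet depth p and the handling of the final frames are assumptions; teacher forcing and stop-token prediction are omitted for brevity, so this is not the exact decoder of the embodiments.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, dim: int = 256, mel_dim: int = 80, reduction: int = 4, p: int = 5):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(mel_dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim), nn.ReLU())
        self.lstm = nn.LSTM(2 * dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, mel_dim * reduction)           # reduction factor, e.g. 4
        self.postnet = nn.Sequential(*[ConvBlock(mel_dim) for _ in range(p)])
        self.reduction = reduction
        self.mel_dim = mel_dim

    def forward(self, frame_level_tokens, speaker_embedding):
        # frame_level_tokens: (batch, T, dim); speaker_embedding: (batch, dim)
        batch, T, _ = frame_level_tokens.shape
        prev = frame_level_tokens.new_zeros(batch, self.mel_dim)  # "go" frame
        outputs, state = [], None
        for t in range(0, T, self.reduction):
            # The prenet output is updated by bitwise addition with the user embedding.
            pre = self.prenet(prev) + speaker_embedding
            x = torch.cat([pre, frame_level_tokens[:, t]], dim=-1).unsqueeze(1)
            h, state = self.lstm(x, state)
            frames = self.proj(h.squeeze(1)).view(batch, self.reduction, self.mel_dim)
            outputs.append(frames)
            prev = frames[:, -1]                                  # autoregressive feedback
        mel = torch.cat(outputs, dim=1)[:, :T]                    # coarse prediction
        residual = self.postnet(mel.transpose(1, 2)).transpose(1, 2)
        return mel + residual                                     # target prediction features
```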
If the decoder predicts one frame at each decoding moment, the decoding steps and the length of the encoder sequence can be ensured to be consistent; it also means that the sequence length of the acoustic feature (i.e. the target prediction feature) obtained by the decoder is always consistent with the length of the encoder.
According to the voice synthesis method based on the timbre cloning, provided by the embodiment of the disclosure, based on the first model and the second model obtained by training the voice information to be cloned of the target user, in the process of processing the phoneme sequence of the text to be cloned, as the first model is obtained by independent training, a more flexible phoneme duration prediction mode can be provided. In the process of processing the predicted duration of each phoneme and the phoneme sequence through the second model, after an initial characterization vector sequence is obtained by using an encoder, the characterization vector of each phoneme is expanded to a frame level based on the predicted duration corresponding to each phoneme, and the obtained frame level characterization vector is consistent with the length of a target predicted feature (namely, an acoustic feature sequence) obtained by prediction of a decoder, so that the encoder in the second model in the scheme can avoid adopting an attention mechanism, and further the problems of sound loss, repeated pronunciation and failure of prediction ending caused by the attention mechanism in the encoder in the prior art can be avoided, thereby improving the robustness of the model and improving the accuracy and the synthesis quality of speech synthesis.
Further, in step S232, the number of repetitions of each phoneme may be determined according to the unit duration of each frame and the predicted duration of each phoneme; and expanding the initial characterization vector of each phoneme in the initial characterization vector sequence according to the repetition times of each phoneme to obtain the frame-level characterization vector sequence.
As shown in fig. 5, the number of repetitions of each phoneme can be determined from the quotient of the predicted duration of the phoneme and the unit duration of each frame; when the quotient is not an integer, it can be rounded to give the number of repetitions. For the phoneme sequence "'x', 'in1', 'd', …" with predicted durations of "40, 30, …" milliseconds (ms) respectively, and assuming that the unit duration of each frame is 10 ms, the number of repetitions of the phoneme "x" is 40 ÷ 10 = 4 since its predicted duration is 40 ms, the number of repetitions of the phoneme "in1" is 30 ÷ 10 = 3 since its predicted duration is 30 ms, and the number of repetitions of the phoneme "d" is likewise 3. Further, when the initial characterization vector of each phoneme in the initial characterization vector sequence is expanded according to the number of repetitions of each phoneme, then for the I phonemes in the phoneme sequence (I being an integer greater than 0), if the number of repetitions of the i-th phoneme is a_i, a_i elements are generated based on the i-th phoneme as elements of the frame-level characterization vector sequence, and the value of each of these a_i elements is the initial characterization vector of the i-th phoneme. Following the previous example, for the 1st phoneme "x" the number of repetitions is a_1 = 4, so the values of the 1st to 4th (a_1 = 4) elements of the frame-level characterization vector sequence are set to the initial characterization vector of the 1st phoneme "x"; see the 1st to 4th elements of the frame-level characterization vector 504 in fig. 5. For the 2nd phoneme "in1", i = 2 and its number of repetitions is a_2 = 3, so the values of the (a_1 + 1)-th = 5th to the (a_1 + a_2)-th = 7th elements are set to the initial characterization vector of the 2nd phoneme "in1"; see the 5th to 7th elements of the frame-level characterization vector 504 in fig. 5. For the 3rd phoneme "d", i = 3 and its number of repetitions is a_3 = 3, so the values of the (a_1 + a_2 + 1)-th = 8th to the (a_1 + a_2 + a_3)-th = 10th elements are set to the initial characterization vector of the 3rd phoneme "d"; see the 8th to 10th elements of the frame-level characterization vector 504 in fig. 5. And so on, the sequence of frame-level characterization vectors 504 is obtained. In the embodiment of the disclosure, the characterization vector of each phoneme is extended to the frame level based on the predicted duration corresponding to the phoneme, so that the obtained frame-level characterization vector has the same sequence length as the decoder; the encoder in the second model of this scheme therefore does not need to adopt an attention mechanism, the repetition (repeat) and sound-loss (skip) problems caused by the attention mechanism are avoided, and the model can still maintain strong robustness even for timbre cloning with a small corpus.
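For illustration only, this length regulation step can be sketched as follows in PyTorch; the 10 ms frame duration is taken from the worked example above, and rounding of non-integer quotients is an assumption stated in the text.

```python
import torch

def length_regulate(initial_tokens, durations_ms, frame_ms: float = 10.0):
    """Expand each phoneme's initial characterization vector to the frame level.

    initial_tokens: (num_phonemes, dim) initial characterization vectors
    durations_ms:   (num_phonemes,) predicted duration of each phoneme in milliseconds
    """
    repeats = torch.round(durations_ms / frame_ms).long()   # a_i = round(duration / frame)
    # Repeat the i-th vector a_i times, e.g. [4, 3, 3] -> 10 frame-level vectors.
    return torch.repeat_interleave(initial_tokens, repeats, dim=0)

# Worked example from the text: durations of 40 ms and 30 ms with 10 ms frames
# give 4 and 3 repetitions respectively, so 3 phonemes expand to 10 frames.
tokens = torch.randn(3, 256)                                 # "x", "in1", "d"
frame_level = length_regulate(tokens, torch.tensor([40.0, 30.0, 30.0]))
assert frame_level.shape == (10, 256)
```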
Further, in step S220, the phoneme sequence may be subjected to embedded representation to obtain a phoneme embedded representation sequence, where the phoneme embedded representation sequence includes the embedded representation of each phoneme; the phoneme embedded representation sequence is processed by n sequentially connected second one-dimensional convolution modules to obtain a phoneme one-dimensional convolution result, where n is an integer greater than 0; the phoneme one-dimensional convolution result is processed by the first fully connected layer to obtain a first fully connected layer output; position coding information is obtained according to the position of each phoneme in the phoneme sequence; the phoneme embedded representation sequence, the position coding information, the first fully connected layer output and the user embedded representation are added bitwise to obtain a bitwise addition result; the bitwise addition result is processed through a self-attention structure to obtain an attention structure output; and the attention structure output is processed through a second fully connected layer to obtain the predicted duration sequence.
As shown in fig. 3, the phoneme embedded representation sequence obtained from the phoneme sequence 301 is shown by reference numeral 302. Further, the second one-dimensional convolution module may include a one-dimensional convolution layer (1-D Conv), a batch normalization operation layer (Batch Norm), an activation function layer (e.g., ReLU), and an anti-overfitting layer (Dropout Layer) connected in sequence; the second one-dimensional convolution module 310 may have a structure similar to that of the first one-dimensional convolution module 510, as shown in fig. 4.
The first fully connected layer 320 may be, for example, a dense layer (Dense Layer). The position coding information 303 of each phoneme in the phoneme sequence is a representation of the position of the phoneme in the phoneme sequence; for example, in the phoneme sequence 301, the position coding information of the 1st phoneme "x" can be obtained by characterizing the position of the 1st phoneme in the phoneme sequence. The phoneme embedded representation sequence 302, the position coding information 303, the output of the first fully connected layer 320 and the user embedded representation 306 are added bitwise to obtain a bitwise addition result, which serves as the input of the self-attention structure 330. The self-attention structure 330 may, for example, include q Transformer structure layers, q being an integer greater than 0. The second fully connected layer 340 may be, for example, a projection layer. The output of the second fully connected layer 340 is the predicted duration sequence 304.
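For illustration only, the first model (the duration predictor of fig. 3) can be sketched as follows in PyTorch, reusing the ConvBlock sketch above. The values of n and q, the sinusoidal positional encoding, the number of attention heads and all layer sizes are assumptions, not the configuration claimed by the embodiments.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(length: int, dim: int) -> torch.Tensor:
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class DurationPredictor(nn.Module):
    def __init__(self, num_phonemes: int, dim: int = 256, n: int = 3, q: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, dim)
        self.convs = nn.Sequential(*[ConvBlock(dim) for _ in range(n)])
        self.dense = nn.Linear(dim, dim)                          # first fully connected layer
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.self_attention = nn.TransformerEncoder(layer, num_layers=q)
        self.projection = nn.Linear(dim, 1)                       # second fully connected layer

    def forward(self, phoneme_ids, speaker_embedding):
        # phoneme_ids: (batch, L); speaker_embedding: (batch, dim)
        emb = self.embedding(phoneme_ids)                         # (batch, L, dim)
        conv_out = self.convs(emb.transpose(1, 2)).transpose(1, 2)
        dense_out = self.dense(conv_out)
        pos = positional_encoding(emb.size(1), emb.size(2)).to(emb.device)
        # Bitwise addition of embeddings, positional encoding, dense output, user embedding.
        x = emb + pos + dense_out + speaker_embedding.unsqueeze(1)
        x = self.self_attention(x)
        return self.projection(x).squeeze(-1)                     # predicted duration per phoneme
```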
Further, as shown in fig. 5, step S231 may include: performing embedded representation on the phoneme sequence to obtain a phoneme embedded representation sequence; processing the phoneme embedded representation sequence by using m sequentially connected first one-dimensional convolution modules 510 to obtain a first one-dimensional convolution result, where each first one-dimensional convolution module comprises a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer and an anti-overfitting layer connected in sequence, as shown in fig. 4; processing the first one-dimensional convolution result by using the bidirectional long short-term memory network layer 520 to obtain a bidirectional long short-term memory network output; and adding the phoneme embedded representation sequence 302, the bidirectional long short-term memory network output and the user embedded representation 306 bitwise to obtain the initial characterization vector of each phoneme.
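For illustration only, the encoder of the second model (step S231) can be sketched as follows in PyTorch, reusing the ConvBlock sketch above; the value of m, the hidden size and the choice of a shared dimension for all added terms are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, num_phonemes: int, dim: int = 256, m: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, dim)
        self.convs = nn.Sequential(*[ConvBlock(dim) for _ in range(m)])
        # Bidirectional LSTM; dim // 2 per direction keeps the output dimension at dim.
        self.blstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, phoneme_ids, speaker_embedding):
        # phoneme_ids: (batch, L); speaker_embedding: (batch, dim)
        emb = self.embedding(phoneme_ids)                        # (batch, L, dim)
        h = self.convs(emb.transpose(1, 2)).transpose(1, 2)      # (batch, L, dim)
        h, _ = self.blstm(h)                                     # (batch, L, dim)
        # Bitwise (element-wise) addition of the phoneme embeddings, the BLSTM output
        # and the user embedded representation gives the initial characterization vectors.
        return emb + h + speaker_embedding.unsqueeze(1)
```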
Further, as shown in fig. 5, in step S233, the frame-level characterization vector 504 of each phoneme may be used as the input of the decoder, and the output of each preprocessing network 540 in the decoder may be updated according to the result of bitwise addition of the user embedded representation 306 and the output of that preprocessing network, so as to obtain the target prediction features 505 output by the decoder.
The decoder in fig. 5 has an autoregressive decoder network structure: the decoder decodes the characterization vectors obtained by the encoder into acoustic features, and the decoding is an autoregressive process, that is, the output at the previous time step is used as the input at the current time step, and the output at the current time step is used as the input at the next time step. In the training process, the first model is first subjected to transfer learning on the training sample set, and the validation-set loss is used to decide when to stop learning. The second model is then subjected to transfer learning on the training sample set, and the validation-set loss is likewise used to decide when to stop transfer learning. In the transfer learning process, the user embedded representation of the user corresponding to each sample can be added for model training, and all parameters of the model are updated. The model parameters obtained from training in the present application can be shown, for example, in the following table.
TABLE 1 model parameters
Further, before performing steps S210 to S240 of the embodiment shown in fig. 2, the method for synthesizing a voice based on a timbre clone of the embodiment of the present disclosure may further perform the following steps in advance: training the first model and the second model according to the original training sample set to obtain a basic model of the first model and a basic model of the second model; and acquiring the voice information to be cloned of the target user, and performing transfer learning on the basic model of the first model and the basic model of the second model by using the voice information to be cloned to obtain the trained first model and second model.
In the embodiment of the present disclosure, the duration information of each phoneme in the training sample set may first be obtained through an alignment model (e.g., an MFA model) and used as label information; the first model is then trained on the training sample set and used as the base model of the first model.
After the duration information of each phoneme is obtained based on the alignment model, an acoustic feature prediction model (i.e., the second model) based on an autoregressive generation network can be trained by combining the training sample set with the duration information. In the acoustic feature prediction model, the input phoneme sequence passes through a series of neural network layers (i.e., the encoder) to obtain the characterization vector corresponding to each phoneme, and the characterization vector of each phoneme is then expanded to the frame level (i.e., the frame-level characterization vector of each phoneme) through the duration information corresponding to the phoneme, so that the encoder and the decoder of the acoustic feature prediction model can have the same sequence length; the attention mechanism used in prior-art encoders can thus be removed, improving the robustness of the model. A basic acoustic feature prediction model (i.e., the second model) is trained as the base model for timbre cloning on multi-speaker training data (i.e., the training sample set).
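For illustration only, the two-stage training described above (base training on a multi-speaker corpus, then transfer learning on the target user's recordings, each stopped by the validation-set loss) can be sketched as follows in PyTorch; the optimizer, patience value, data loaders and loss functions are assumptions and not part of the claimed training procedure.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, valid_loader, loss_fn,
                              lr: float = 1e-3, patience: int = 5):
    """Train until the validation-set loss stops improving (used for both models)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model, batch)       # e.g. duration loss or acoustic-feature loss
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model, b).item() for b in valid_loader) / len(valid_loader)
        if val < best_loss:
            best_loss, best_state, bad_epochs = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
    model.load_state_dict(best_state)
    return model

# Base model on the original multi-speaker training set, then transfer learning
# (updating all parameters) on the target user's speech to be cloned; for example:
# duration_model = train_with_early_stopping(duration_model, base_train, base_valid, duration_loss)
# duration_model = train_with_early_stopping(duration_model, clone_train, clone_valid, duration_loss)
```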
According to the above, the performance of the phoneme duration prediction model can be improved by training the first model (i.e., the phoneme duration prediction model) independently and optimizing its network structure; within the first model, combining a convolution network with a self-attention network structure can improve the stability of the model and the accuracy of phoneme duration prediction.
Compared with the prior art, because the initial characterization vector of each phoneme is expanded into frame-level characterization vectors, the second model of the present application can adopt a simpler network structure in the encoder part, and the repetition and sound-loss problems caused by attention in prior-art encoders are eliminated. Further, the one-dimensional convolution and bidirectional LSTM network structure adopted by the encoder can improve the robustness of the model. Based on the obtained frame-level characterization vectors of the phonemes, the sequence lengths of the encoder and the decoder are guaranteed to be consistent, so that the attention mechanism can be removed and the prior-art technique of connecting the encoder and the decoder through an attention mechanism is no longer needed. Removing the attention mechanism helps to improve the quality and robustness of the model.
The skip encoder structure (skip encoder) in the prior-art second model needs to predict prosodic information of the text, such as prosodic words and prosodic phrases; however, such prosodic information does not correspond to pauses during pronunciation and therefore does not affect the prosodic effect of the model, while requiring an additional model to predict the prosodic information introduces accumulated errors that affect the quality of the finally synthesized speech. On the basis of the prior art, the present application adopts an encoding structure comprising one-dimensional convolution modules and a bidirectional long short-term memory network to replace the prior-art skip encoder; this simplifies the model structure while avoiding the influence of the accumulated errors caused by the skip encoder structure on the final sound quality, thereby improving speech synthesis quality.
The robustness test results of the first model and the second model proposed by the present application are shown in fig. 6. In Table 4, task 1 uses 100 sentences of target data for transfer learning and task 2 uses five sentences of target data for transfer learning; S3/S4/S5 correspond to three different corpus providers (i.e., speakers), that is, three different transfer learning results. Each speaker has 100 sentences of high difficulty. By counting the bad cases in which the model produces wrong pronunciation, unclear pronunciation or incorrect tones, it can be seen that the target timbres obtained by transfer learning in this application have very good robustness. Comparing task 1 and task 2, even though only five sentences of target data are used for transfer learning in task 2, the errors of the obtained target timbre do not increase in the test, and the model robustness remains very good.
Meanwhile, the performance of the present application is shown in fig. 7 (task 1 uses 100 sentences for transfer learning and task 2 uses five sentences for transfer learning; tasks 1-b and 2-b allow model training with additional open-source data, but the present invention does not use any open-source data).
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments may be implemented as a computer program executed by a central processing unit (CPU). When the computer program is executed by the CPU, it performs the functions defined by the methods provided by the present disclosure. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 8 is a block diagram illustrating a timbre cloning based speech synthesis apparatus according to an exemplary embodiment. Referring to fig. 8, a voice synthesis apparatus 80 based on timbre cloning provided by an embodiment of the present disclosure may include: a text to be cloned acquisition module 810, a first model processing module 820, a second model processing module 830 and a speech synthesis module 840.
In the voice synthesis apparatus 80 based on timbre cloning, the text to be cloned acquisition module 810 may be configured to acquire a text to be cloned for a target user and obtain a phoneme sequence of the text to be cloned, the phoneme sequence including at least one phoneme.
The first model processing module 820 may be configured to process the sequence of phonemes through the first model to obtain a predicted duration sequence that includes a predicted duration for each phoneme.
The second model processing module 830 may be configured to process the predicted duration and the sequence of phonemes of each phoneme through the second model to obtain the target prediction characteristic.
The speech synthesis module 840 may be configured to obtain a synthesized speech of the text to be cloned for the target user according to the target predictive feature speech synthesis.
Wherein, the second model processing module 830 may include:
the encoding unit 831 may be configured to process the phoneme sequence through an encoder including a first one-dimensional convolution module and a bidirectional long-and-short-term memory module connected in sequence to obtain an initial token vector sequence, where the initial token vector sequence includes an initial token vector of each phoneme;
an initial token vector processing unit 832, configured to process the initial token vector sequence according to the predicted duration of each phoneme, to obtain a frame-level token vector of each phoneme;
the decoding unit 833 may be configured to process the frame-level token vectors of the phonemes and the user-embedded representation of the target user by using a decoder, so as to obtain a target prediction feature.
According to the voice synthesis device based on the timbre clone provided by the embodiment of the disclosure, based on the first model and the second model obtained by training the voice information to be cloned of the target user, in the process of processing the phoneme sequence of the text to be cloned, as the first model is obtained by independent training, a more flexible phoneme duration prediction mode can be provided. In the process of processing the predicted duration of each phoneme and the phoneme sequence through the second model, after an initial characterization vector sequence is obtained by using an encoder, the characterization vector of each phoneme is expanded to a frame level based on the predicted duration corresponding to each phoneme, and the obtained frame level characterization vector is consistent with the length of a target predicted feature (namely, an acoustic feature sequence) obtained by prediction of a decoder, so that the encoder in the second model in the scheme can avoid adopting an attention mechanism, and further the problems of sound loss, repeated pronunciation and failure of prediction ending caused by the attention mechanism in the encoder in the prior art can be avoided, thereby improving the robustness of the model and improving the accuracy and the synthesis quality of speech synthesis.
In an exemplary embodiment, initial token vector processing unit 832 may include: a repetition number determining subunit configurable to determine a repetition number of each of the phonemes from the unit duration of each of the frames and the predicted duration of each of the phonemes; and the frame-level token vector sequence generation subunit is configured to expand the initial token vector of each phoneme in the initial token vector sequence according to the repetition times of each phoneme to obtain the frame-level token vector sequence.
In an exemplary embodiment, the first model processing module 820 may include: the embedded expression unit can be configured to perform embedded expression on the phoneme sequence to obtain a phoneme embedded expression sequence, wherein the phoneme embedded expression sequence comprises embedded expressions of phonemes; the phoneme one-dimensional convolution result unit can be configured to process the phoneme embedded expression sequence by using n second one-dimensional convolution modules which are sequentially connected to obtain a phoneme one-dimensional convolution result, wherein n is an integer greater than 0; the first full-connection layer unit can be configured to process the phoneme one-dimensional convolution result by using a first full-connection layer to obtain a first full-connection layer output; a position coding information acquisition unit configured to encode information according to a position of each phoneme in the phoneme sequence; a first bitwise addition unit configurable to perform bitwise addition on the phoneme embedded representation sequence, the position coding information, the first fully-connected layer output, and the user embedded representation to obtain a bitwise addition result; a self-attention structure unit configurable to process the bitwise addition result by a self-attention structure to obtain an attention structure output; a second fully-connected layer unit configurable to process the attention structure output through a second fully-connected layer to obtain the predicted duration sequence.
In an exemplary embodiment, the second one-dimensional convolution module may include a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer, and an anti-overfitting layer, which are connected in sequence.
In an exemplary embodiment, the encoding unit 831 may include: the embedded expression unit can be configured to perform embedded expression on the phoneme sequence to obtain a phoneme embedded expression sequence, wherein the phoneme embedded expression sequence comprises embedded expressions of phonemes; the first one-dimensional convolution operation subunit can be configured to process the phoneme embedded expression sequence by utilizing m first one-dimensional convolution modules which are connected in sequence to obtain a first one-dimensional convolution result, wherein each first one-dimensional convolution module comprises a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer and an over-fitting prevention layer which are connected in sequence; the bidirectional long-short time memory network output subunit can be configured to process the first one-dimensional convolution result by using a bidirectional long-short time memory network to obtain bidirectional long-short time memory network output; a second bitwise addition subunit configurable to bitwise add the embedded representation sequence of phonemes, the bi-directional long and short time memory network output and the user embedded representation to obtain an initial token vector for each phoneme.
In an exemplary embodiment, the decoding unit 833 may be configured to take the frame-level token vector of each phoneme as an input to the decoder and update the output of each pre-processing network in the decoder according to the bit-wise addition of the user-embedded representation and the output of each pre-processing network to obtain the target prediction feature output by the decoder.
In an exemplary embodiment, the timbre clone based speech synthesis apparatus 80 may further include: a base model obtaining module configurable to train the first model and the second model according to an original training sample set to obtain a base model of the first model and a base model of the second model; and the model training module can be configured to acquire the voice information to be cloned of the target user, and perform transfer learning on the basic model of the first model and the basic model of the second model by using the voice information to be cloned to acquire the trained first model and the trained second model.
An electronic device 900 according to this embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one memory unit 920, and a bus 930 that couples various system components including the memory unit 920 and the processing unit 910.
Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, the processing unit 910 may perform the steps as shown in fig. 2.
The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read only memory unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 930 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via the input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method for synthesizing voice based on timbre cloning is characterized by comprising the following steps:
acquiring a text to be cloned for a target user, and acquiring a phoneme sequence of the text to be cloned, wherein the phoneme sequence comprises at least one phoneme;
processing the phoneme sequence through a first model to obtain a predicted duration sequence, wherein the predicted duration sequence comprises the predicted duration of each phoneme;
processing the predicted duration of each phoneme and the phoneme sequence through a second model to obtain target prediction characteristics;
performing speech synthesis according to the target prediction characteristics to obtain synthesized speech of the text to be cloned for the target user;
wherein processing the predicted duration of each phoneme and the phoneme sequence through the second model to obtain the target prediction characteristics comprises the following steps:
processing the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long short-term memory module which are sequentially connected, to obtain an initial characterization vector sequence, wherein the initial characterization vector sequence comprises the initial characterization vector of each phoneme;
processing the initial characterization vector sequence according to the predicted duration of each phoneme to obtain a frame-level characterization vector of each phoneme;
and processing the frame-level characterization vector of each phoneme and the user embedded representation of the target user by using a decoder to obtain the target prediction characteristics.
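Purely as an illustration of the overall flow recited in claim 1 (the grapheme-to-phoneme front end g2p, the two models, the vocoder and the user embedding are assumed components with illustrative names):

```python
def synthesize_for_target_user(text, first_model, second_model, vocoder, g2p, user_embedding):
    """Sketch of the claimed pipeline; every callable here is an assumed component."""
    phoneme_sequence = g2p(text)                                          # phoneme sequence of the text to be cloned
    predicted_durations = first_model(phoneme_sequence, user_embedding)   # one predicted duration per phoneme
    target_features = second_model(phoneme_sequence, predicted_durations, user_embedding)
    return vocoder(target_features)                                       # synthesized speech for the target user
```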
2. The method of claim 1, wherein processing the initial characterization vector sequence according to the predicted duration of each phoneme to obtain a frame-level characterization vector sequence comprises:
determining the repetition times of each phoneme according to the unit duration of each frame and the predicted duration of each phoneme;
and expanding the initial characterization vector of each phoneme in the initial characterization vector sequence according to the repetition times of each phoneme to obtain the frame-level characterization vector sequence.
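An illustrative sketch of this expansion step follows (a 12.5 ms frame unit and PyTorch tensors are assumptions):

```python
import torch

def length_regulate(initial_vectors: torch.Tensor, predicted_durations: torch.Tensor,
                    frame_unit: float = 0.0125) -> torch.Tensor:
    """Expand phoneme-level initial characterization vectors to frame level.

    initial_vectors: (seq_len, dim); predicted_durations: (seq_len,) in seconds.
    The repetition count of each phoneme is its predicted duration divided by the frame unit."""
    repeats = torch.clamp(torch.round(predicted_durations / frame_unit).long(), min=1)
    return torch.repeat_interleave(initial_vectors, repeats, dim=0)   # frame-level sequence
```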
3. The method of claim 1, wherein processing the phoneme sequence through the first model to obtain the predicted duration sequence comprises:
performing embedded representation on the phoneme sequence to obtain a phoneme embedded representation sequence, wherein the phoneme embedded representation sequence comprises embedded representation of each phoneme;
processing the phoneme embedded representation sequence by utilizing n second one-dimensional convolution modules which are sequentially connected to obtain a phoneme one-dimensional convolution result, wherein n is an integer greater than 0;
processing the phoneme one-dimensional convolution result by utilizing a first fully connected layer to obtain a first fully connected layer output;
obtaining position coding information according to the position of each phoneme in the phoneme sequence;
performing bitwise addition on the phoneme embedded representation sequence, the position coding information, the first fully connected layer output and the user embedded representation to obtain a bitwise addition result;
processing the bitwise addition result through a self-attention structure to obtain attention structure output;
and processing the attention structure output through a second fully connected layer to obtain the predicted duration sequence.
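For illustration, a sketch of a first model with this structure, reusing the ConvBlock sketched earlier; the sinusoidal position coding, the number of attention heads, the value of n, and all dimensions are assumptions:

```python
import torch
from torch import nn

class DurationPredictor(nn.Module):
    """Embedding -> n ConvBlocks -> fully connected layer -> element-wise sum with position
    coding and user embedding -> self-attention -> fully connected layer -> one duration per phoneme."""
    def __init__(self, num_phonemes: int = 100, dim: int = 256, n: int = 2, max_len: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.convs = nn.ModuleList([ConvBlock(dim) for _ in range(n)])
        self.fc1 = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fc2 = nn.Linear(dim, 1)
        position = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-torch.log(torch.tensor(10000.0)) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(position * div)
        pe[:, 1::2] = torch.cos(position * div)
        self.register_buffer("pos_enc", pe)                 # assumed sinusoidal position coding

    def forward(self, phonemes: torch.Tensor, user_embedding: torch.Tensor) -> torch.Tensor:
        emb = self.embed(phonemes)                          # (batch, seq_len, dim)
        x = emb.transpose(1, 2)
        for conv in self.convs:
            x = conv(x)                                     # phoneme one-dimensional convolution result
        fc_out = self.fc1(x.transpose(1, 2))                # first fully connected layer output
        mixed = emb + self.pos_enc[: emb.size(1)] + fc_out + user_embedding.unsqueeze(1)
        attn_out, _ = self.attn(mixed, mixed, mixed)        # self-attention structure output
        return self.fc2(attn_out).squeeze(-1)               # predicted duration sequence
```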
4. The method of claim 3, wherein the second one-dimensional convolution module comprises a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer and an anti-overfitting layer which are connected in sequence.
5. The method of claim 1, wherein processing the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long short-term memory module which are sequentially connected to obtain an initial characterization vector sequence comprises:
performing embedded representation on the phoneme sequence to obtain a phoneme embedded representation sequence;
processing the phoneme embedded representation sequence by utilizing m first one-dimensional convolution modules which are sequentially connected to obtain a first one-dimensional convolution result, wherein each first one-dimensional convolution module comprises a one-dimensional convolution layer, a batch normalization operation layer, an activation function layer and an anti-overfitting layer which are connected in sequence, and m is an integer greater than 0;
processing the first one-dimensional convolution result by using a bidirectional long short-term memory network to obtain a bidirectional long short-term memory network output;
and performing bitwise addition on the phoneme embedded representation sequence, the bidirectional long short-term memory network output and the user embedded representation to obtain the initial characterization vector of each phoneme.
6. The method of claim 1, wherein processing the frame-level characterization vector of each phoneme and the user embedded representation of the target user with a decoder to obtain the target prediction characteristics comprises:
taking the frame-level characterization vector of each phoneme as the input of the decoder, and updating the output of each preprocessing network in the decoder according to the bitwise addition result of the user embedded representation and the output of that preprocessing network, to obtain the target prediction characteristics output by the decoder.
7. The method of claim 1, further comprising:
training the first model and the second model according to an original training sample set to obtain a basic model of the first model and a basic model of the second model;
and acquiring voice information to be cloned of the target user, and performing transfer learning on the basic model of the first model and the basic model of the second model by using the voice information to be cloned to obtain the trained first model and the trained second model.
8. A voice synthesis apparatus based on timbre cloning, comprising:
the system comprises a to-be-cloned text acquisition module, a to-be-cloned text acquisition module and a to-be-cloned text acquisition module, wherein the to-be-cloned text acquisition module is configured to acquire a to-be-cloned text for a target user and acquire a phoneme sequence of the to-be-cloned text, and the phoneme sequence comprises at least one phoneme;
the first model processing module is configured to process the phoneme sequence through a first model to obtain a predicted duration sequence, and the predicted duration sequence comprises predicted durations of the phonemes;
the second model processing module is configured to process the predicted duration and the phoneme sequence of each phoneme through a second model to obtain a target prediction characteristic;
a speech synthesis module configured to synthesize speech according to the target prediction feature;
wherein the second model processing module comprises:
an encoding unit, configured to process the phoneme sequence through an encoder comprising a first one-dimensional convolution module and a bidirectional long short-term memory module which are sequentially connected, to obtain an initial characterization vector sequence, wherein the initial characterization vector sequence comprises the initial characterization vector of each phoneme;
an initial characterization vector processing unit, configured to process the initial characterization vector sequence according to the predicted duration of each phoneme, so as to obtain the frame-level characterization vector of each phoneme;
and the decoding unit is configured to process the frame-level characterization vectors of the phonemes and the user embedded representation of the target user by using a decoder to obtain target prediction characteristics.
9. An electronic device, comprising:
at least one processor;
storage means for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110482151.5A 2021-04-30 2021-04-30 Voice synthesis method and device based on timbre clone and related equipment Active CN113160794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482151.5A CN113160794B (en) 2021-04-30 2021-04-30 Voice synthesis method and device based on timbre clone and related equipment

Publications (2)

Publication Number Publication Date
CN113160794A CN113160794A (en) 2021-07-23
CN113160794B (en) 2022-12-27

Family

ID=76873080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482151.5A Active CN113160794B (en) 2021-04-30 2021-04-30 Voice synthesis method and device based on timbre clone and related equipment

Country Status (1)

Country Link
CN (1) CN113160794B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781995B (en) * 2021-09-17 2024-04-05 上海喜马拉雅科技有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
CN114566143B (en) * 2022-03-31 2022-10-11 北京帝派智能科技有限公司 Voice synthesis method and voice synthesis system capable of locally modifying content

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011075870A (en) * 2009-09-30 2011-04-14 Oki Electric Industry Co Ltd Voice synthesizing system, and device and program of the same
CN102270449A (en) * 2011-08-10 2011-12-07 歌尔声学股份有限公司 Method and system for synthesising parameter speech
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN110136687A (en) * 2019-05-20 2019-08-16 深圳市数字星河科技有限公司 One kind is based on voice training clone's accent and sound method
CN111583902A (en) * 2020-05-14 2020-08-25 携程计算机技术(上海)有限公司 Speech synthesis system, method, electronic device, and medium
CN111968618A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Speech synthesis method and device
CN112562676A (en) * 2020-11-13 2021-03-26 北京捷通华声科技股份有限公司 Voice decoding method, device, equipment and storage medium
CN112634856A (en) * 2020-12-10 2021-04-09 苏州思必驰信息科技有限公司 Speech synthesis model training method and speech synthesis method
CN112687259A (en) * 2021-03-11 2021-04-20 腾讯科技(深圳)有限公司 Speech synthesis method, device and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7386451B2 (en) * 2003-09-11 2008-06-10 Microsoft Corporation Optimization of an objective measure for estimating mean opinion score of synthesized speech
KR102581346B1 (en) * 2019-05-31 2023-09-22 구글 엘엘씨 Multilingual speech synthesis and cross-language speech replication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Speech Synthesis Based on HMM; Xu Sihao; China Master's Theses Full-text Database (Electronic Journal); 2007-11-15; I136-126 *
Speech Signal Synthesis Based on Linear Prediction; He Yanping; Journal of Northwest Minzu University (Natural Science Edition); 2010-12-15 (No. 04); pp. 46-50 *

Also Published As

Publication number Publication date
CN113160794A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
US20210295858A1 (en) Synthesizing speech from text using neural networks
JP7009564B2 (en) End-to-end text-to-speech conversion
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
Kleijn et al. Wavenet based low rate speech coding
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN111161702B (en) Personalized speech synthesis method and device, electronic equipment and storage medium
CN110444203B (en) Voice recognition method and device and electronic equipment
CN113160794B (en) Voice synthesis method and device based on timbre clone and related equipment
CN110288972B (en) Speech synthesis model training method, speech synthesis method and device
CN112634856A (en) Speech synthesis model training method and speech synthesis method
CN112786005B (en) Information synthesis method, apparatus, electronic device, and computer-readable storage medium
CN115641543B (en) Multi-modal depression emotion recognition method and device
Zheng et al. Forward–backward decoding sequence for regularizing end-to-end tts
CN111930900B (en) Standard pronunciation generating method and related device
Wu et al. Quasi-periodic WaveNet: An autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network
CN114207706A (en) Generating acoustic sequences via neural networks using combined prosodic information
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN113674733A (en) Method and apparatus for speaking time estimation
KR20220066962A (en) Training a neural network to generate structured embeddings
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device
CN114495896A (en) Voice playing method and computer equipment
CN113628630B (en) Information conversion method and device based on coding and decoding network and electronic equipment
US20220129643A1 (en) Method of training real-time simultaneous interpretation model based on external alignment information, and method and system for simultaneous interpretation based on external alignment information
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176
Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
GR01 Patent grant