CN113314097B - Speech synthesis method, speech synthesis model processing device and electronic equipment - Google Patents

Speech synthesis method, speech synthesis model processing device and electronic equipment

Info

Publication number
CN113314097B
Authority
CN
China
Prior art keywords
target
voice
training
speech
tone
Prior art date
Legal status
Active
Application number
CN202110868103.XA
Other languages
Chinese (zh)
Other versions
CN113314097A (en
Inventor
孙晓辉
王宝勋
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110868103.XA priority Critical patent/CN113314097B/en
Publication of CN113314097A publication Critical patent/CN113314097A/en
Application granted granted Critical
Publication of CN113314097B publication Critical patent/CN113314097B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a speech synthesis method, a speech synthesis model processing method and apparatus, and an electronic device. The speech synthesis method comprises the following steps: acquiring a phoneme sequence of a text to be synthesized; performing timbre processing on the phoneme sequence through a speech synthesis model to obtain acoustic features including target timbre information, the speech synthesis model being trained based on a phoneme sequence of a target text and target speech in at least one target language, where the target speech corresponds to the target text and is speech with a target timbre generated from acoustic features extracted from speech samples of different timbres; and performing speech synthesis on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre. With this method, the timbre of the synthesized speech remains consistent before and after switching languages, and the synthesized speech is natural and fluent.

Description

Speech synthesis method, speech synthesis model processing device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech synthesis method, a speech synthesis model processing method, an apparatus, and an electronic device.
Background
With the continuous development of artificial intelligence technology, it has been applied in many fields. Natural language processing (NLP) and speech processing are important directions in artificial intelligence; for example, a speech synthesis model can be used to synthesize speech from a text so that the synthesized speech can be played to a user.
In some application scenarios, a text needs to be synthesized into speech in different languages. In that case, separate speech synthesis models must be trained on texts and on speech uttered in different languages by different speakers. Because different speakers have different timbres, after training, synthesizing a text with the trained models yields synthesized speech that differs in both language and timbre. During speech synthesis, switching from one language to another requires calling a different speech synthesis model, which causes inconsistent timbre and degrades the naturalness and fluency of the speech.
Disclosure of Invention
In view of the above, it is desirable to provide a speech synthesis method, a speech synthesis model processing apparatus, and an electronic device, which can ensure that the timbre of the synthesized speech is consistent before and after the language switching, and the synthesized speech is natural and smooth.
A method of speech synthesis, the method comprising:
acquiring a phoneme sequence of a text to be synthesized;
performing timbre processing on the phoneme sequence through a speech synthesis model to obtain acoustic features including target timbre information, wherein the speech synthesis model is trained based on a phoneme sequence of a target text and target speech in at least one target language, and the target speech corresponds to the target text, has a target timbre, and is generated from acoustic features extracted from speech samples of different timbres;
and performing speech synthesis on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre.
In one embodiment, the speech synthesis model includes an acoustic model; the method further comprises the following steps:
extracting acoustic features from speech samples in at least one target language to obtain training acoustic features;
generating target speech with the target timbre in at least one target language based on the training acoustic features;
when a training phoneme sequence is obtained from a target text corresponding to the target speech, performing timbre processing on the training phoneme sequence through the acoustic model to obtain training acoustic features including target timbre information;
and adjusting parameters of the acoustic model based on a loss value between the training acoustic features and the acoustic features extracted from the target speech.
In one embodiment, the speech synthesis model includes a vocoder; the method further comprises the following steps:
extracting acoustic features from the target speech to obtain target acoustic features;
performing speech synthesis on the target acoustic features through the vocoder to obtain target predicted speech in at least one target language, the target predicted speech having the target timbre;
and adjusting parameters of the vocoder based on a loss value between the target predicted speech and the target speech.
A speech synthesis apparatus, the apparatus comprising:
the acquisition module is used for acquiring a phoneme sequence of a text to be synthesized;
the processing module is used for performing timbre processing on the phoneme sequence through a speech synthesis model to obtain acoustic features including target timbre information, wherein the speech synthesis model is trained based on a phoneme sequence of a target text and target speech in at least one target language, and the target speech corresponds to the target text, has a target timbre, and is generated from acoustic features extracted from speech samples of different timbres;
and the synthesis module is used for performing speech synthesis on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre.
An electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a phoneme sequence of a text to be synthesized;
performing timbre processing on the phoneme sequence through a speech synthesis model to obtain acoustic features including target timbre information, wherein the speech synthesis model is trained based on a phoneme sequence of a target text and target speech in at least one target language, and the target speech corresponds to the target text, has a target timbre, and is generated from acoustic features extracted from speech samples of different timbres;
and performing speech synthesis on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a phoneme sequence of a text to be synthesized;
performing timbre processing on the phoneme sequence through a speech synthesis model to obtain acoustic features including target timbre information, wherein the speech synthesis model is trained based on a phoneme sequence of a target text and target speech in at least one target language, and the target speech corresponds to the target text, has a target timbre, and is generated from acoustic features extracted from speech samples of different timbres;
and performing speech synthesis on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre.
According to the speech synthesis method and apparatus, electronic device, and storage medium described above, acoustic features are first extracted from speech samples of different timbres, target speech with the target timbre in at least one target language is generated based on the extracted acoustic features, and the speech synthesis model is trained on the target text and the corresponding target speech with the target timbre. As a result, there is no need to train separate speech synthesis models with target speech of different languages and different timbres, which improves model training efficiency. In addition, the phoneme sequence of the text to be synthesized is timbre-processed by the trained speech synthesis model to obtain acoustic features containing target timbre information, and speech synthesis is then performed on those acoustic features by the same model to obtain synthesized speech in at least one target language with the target timbre. Therefore, even when switching from one language to another, the timbre of the synthesized speech remains unchanged, and because the speech synthesis model does not need to be changed, the synthesized speech is natural and fluent.
A method of speech synthesis model processing, the method comprising:
extracting acoustic features from speech samples in at least one target language to obtain training acoustic features, the speech samples having different timbres;
generating target speech with a target timbre in at least one target language based on the training acoustic features;
when a training phoneme sequence is obtained from a target text corresponding to the target speech, performing timbre processing on the training phoneme sequence through a speech synthesis model to obtain training acoustic features including target timbre information;
performing speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one target language, the predicted speech having a target timbre corresponding to the target timbre information;
adjusting network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
A speech synthesis model processing apparatus, the apparatus comprising:
the extraction module is used for extracting acoustic features from speech samples in at least one target language to obtain training acoustic features, the speech samples having different timbres;
the generating module is used for generating target speech with a target timbre in at least one target language based on the training acoustic features;
the processing module is used for performing timbre processing on a training phoneme sequence through a speech synthesis model when the training phoneme sequence is obtained from the target text corresponding to the target speech, so as to obtain training acoustic features including target timbre information;
the synthesis module is used for performing speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one target language, the predicted speech having a target timbre corresponding to the target timbre information;
and the adjusting module is used for adjusting the network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech.
An electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
extracting acoustic features from speech samples in at least one target language to obtain training acoustic features, the speech samples having different timbres;
generating target speech with a target timbre in at least one target language based on the training acoustic features;
when a training phoneme sequence is obtained from a target text corresponding to the target speech, performing timbre processing on the training phoneme sequence through a speech synthesis model to obtain training acoustic features including target timbre information;
performing speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one target language, the predicted speech having a target timbre corresponding to the target timbre information;
adjusting network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
extracting acoustic features from speech samples in at least one target language to obtain training acoustic features, the speech samples having different timbres;
generating target speech with a target timbre in at least one target language based on the training acoustic features;
when a training phoneme sequence is obtained from a target text corresponding to the target speech, performing timbre processing on the training phoneme sequence through a speech synthesis model to obtain training acoustic features including target timbre information;
performing speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one target language, the predicted speech having a target timbre corresponding to the target timbre information;
adjusting network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
According to the speech synthesis model processing method and apparatus, electronic device, and storage medium described above, acoustic features are extracted from speech samples of different timbres, target speech with the target timbre in different target languages is generated based on the extracted training acoustic features, and the speech synthesis model is trained on the phoneme sequence of the target text and the corresponding target speech with the target timbre in the different target languages. A single speech synthesis model capable of synthesizing the target timbre in different target languages is thus obtained, so separate speech synthesis models do not need to be trained with target speech of different languages and timbres, which effectively improves model training efficiency.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a speech synthesis method and speech synthesis model processing method;
FIG. 2 is a flow diagram illustrating a method for speech synthesis in one embodiment;
FIG. 3 is a diagram illustrating triggering of speech synthesis of text in a presentation page in one embodiment;
FIG. 4 is a flowchart illustrating an application scenario in which a speech synthesis method is applied to speech synthesis of text in a display page according to an embodiment;
FIG. 5 is a schematic flow chart illustrating a process in which the server synthesizes speech and distributes the synthesized speech of the corresponding target language to each terminal in one embodiment;
FIG. 6 is a flow diagram that illustrates the extraction of acoustic features and the synthesis of speech by a speech synthesis model, according to one embodiment;
FIG. 7 is a block diagram of a vocoder in one embodiment;
FIG. 8 is a schematic flow chart illustrating training of a speech synthesis model in one embodiment;
FIG. 9 is a diagram illustrating how the Mel spectrum features of the target timbre are obtained by the PPGs model and the voice conversion model in one embodiment;
FIG. 10 is a diagram illustrating the training of the PPGs model and the voice conversion model in one embodiment;
FIG. 11 is a schematic flow chart of training a speech synthesis model in another embodiment;
FIG. 12 is a flowchart illustrating a method of processing a speech synthesis model according to one embodiment;
FIG. 13 is a block diagram of the structure of a speech synthesis system in one embodiment;
FIG. 14 is a diagram of constructing a multilingual speech library of the target timbre in one embodiment;
FIG. 15 is a block diagram showing the construction of a speech synthesis system in another embodiment;
FIG. 16 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment;
FIG. 17 is a block diagram showing the construction of a speech synthesis apparatus according to another embodiment;
FIG. 18 is a block diagram showing the construction of a speech synthesis model processing apparatus according to an embodiment;
FIG. 19 is an internal block diagram of an electronic device in one embodiment;
FIG. 20 is an internal structure diagram of an electronic device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction. Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and includes technologies such as text processing, semantic understanding, machine translation, question answering, and knowledge graphs.
The scheme provided by the embodiment of the application relates to the technologies of artificial intelligence, such as voice technology, natural language processing, machine learning and the like, and is specifically explained by the following embodiments:
the speech synthesis method and the speech synthesis model processing method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The speech synthesis method may be executed by the terminal 102 or the server 104, or executed by the terminal 102 and the server 104 cooperatively, and the speech synthesis method executed by the terminal 102 is described as an example: the terminal 102 obtains a text to be synthesized from the local (e.g. a text in a gray dashed box in fig. 1), or obtains the text to be synthesized from the server 104, and obtains a corresponding phonon sequence based on the text; performing tone processing on the tone subsequence through a voice synthesis model to obtain acoustic characteristics including target tone information; the speech synthesis model is obtained by training the terminal 102 or the server 104 based on a phononic sequence of a target text and a target speech of at least one target language, and is deployed at the terminal 102; the target voice corresponds to the target text and is generated according to acoustic features extracted from voice samples with different timbres and has a target timbre; and performing voice synthesis on the acoustic features through a voice synthesis model to obtain at least one synthesized voice of the target language and the target tone.
Further, the speech synthesis method server 104 will be described by taking, as an example, the following: the server 104 respectively extracts acoustic features of the voice samples of at least one target language to obtain training acoustic features; generating target voice of at least one target language and with target tone color based on the training acoustic features; when a training phononic sequence is obtained from a target text corresponding to target voice, performing timbre processing on the training phononic sequence through a voice synthesis model to obtain training acoustic features including target timbre information; performing voice synthesis on the training acoustic features through a voice synthesis model to obtain predicted voice of at least one target language; predicting that the voice has a target tone corresponding to the target tone information; based on the loss value between the predicted speech and the target speech, the network parameters in the speech synthesis model are adjusted, and then the trained speech synthesis model is deployed in the terminal 102, so that the terminal 102 can implement the steps of the speech synthesis method. In addition, the trained speech synthesis model may also be deployed in the server 104, so that the server 104 may also implement the steps of the speech synthesis method.
The terminal 102 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like. The terminal 102 may also be an intelligent in-vehicle device that performs speech synthesis on the phoneme sequence of a text to obtain synthesized speech with the target timbre, thereby implementing voice interaction with the user.
The server 104 may be an independent physical server or a service node in a blockchain system; the service nodes in a blockchain system form a peer-to-peer (P2P) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server 104 may also be a server cluster composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 104 may host the server side of a demand management system, which interacts with the terminal 102.
The terminal 102 and the server 104 may be connected through communication means such as Bluetooth, USB (Universal Serial Bus), or a network, which is not limited herein.
In one embodiment, as shown in fig. 2, a speech synthesis method is provided. The method is performed by an electronic device, which may be the terminal 102 or the server 104 in fig. 1. In the following embodiments, the electronic device is taken to be the terminal 102 as an example; the method includes the following steps:
s202, acquiring a phonon sequence of the text to be synthesized.
The text to be synthesized may refer to a text used for synthesizing speech, and the text to be synthesized may be an article, or a segment of words, a line of words, or several phrases in an article. The article can be in various fields, such as science and technology, sports, leisure and entertainment, gourmet food and literature.
A phone may refer to the pronunciation phone of a character or word in the text to be synthesized, such as the initial consonant and the final sound in a chinese character. Correspondingly, a phonon sequence may refer to a sequence of multiple phonons.
In one embodiment, a terminal responds to a text-to-speech operation to obtain a text to be synthesized (i.e., a text to be synthesized), then performs text analysis on the text to be synthesized, for example, performs word segmentation on the text to be synthesized, and then performs phonetic notation on a plurality of words to be synthesized and/or characters to be synthesized obtained by the word segmentation, so as to obtain a phonon of each word segmentation and/or character to be synthesized, and combines the obtained phonons to obtain a phonon sequence.
Specifically, a speech synthesis system for converting text into speech is installed on the terminal, and the speech synthesis model is deployed in the speech synthesis system. When a text-to-speech operation is detected, the terminal generates a speech synthesis service request and sends it to the speech synthesis system. In response to the request, the terminal uses the speech synthesis system to extract the text to be synthesized from the request, performs word segmentation on the text, converts each resulting word into phonemes, and combines the word-level phonemes to obtain the phoneme sequence of the text. For example, if the text to be synthesized is "你喜欢红魔还是iPhone" ("Do you like the Red Magic or the iPhone"), the corresponding words are "you", "like", "Red Magic", "or", and "iPhone"; when the phonemes corresponding to each word are obtained, they are combined to obtain the phoneme sequence of the text, e.g. n i x i h uan h ong m o h ai sh i Eay Ef Eow En.
For example, as shown in fig. 3, when the user clicks the voice broadcast button in diagram (a) of fig. 3, a speech synthesis service request is generated, and the speech synthesis system on the terminal acquires the text content in the display page (i.e., the text in the dashed box in fig. 3) and then performs word segmentation and phoneme conversion on the text content to obtain the phoneme sequence of the text content. Further, as shown in diagram (b) of fig. 3, when the title content is selected in response to a selection operation, a pop-up box is displayed; in response to a speech synthesis operation triggered on the synthesize-speech button, the speech synthesis system on the terminal acquires the title content in the display page and then performs word segmentation and phoneme conversion on the title content to obtain its phoneme sequence. A phoneme sequence for any designated portion of the text content can be obtained in the same way.
In an embodiment, when the terminal obtains the text to be synthesized, it may translate the text into other languages to obtain one or more texts in different target languages, and then perform text parsing to obtain a phoneme sequence for each target-language text, such as a phoneme sequence for the Chinese text and a phoneme sequence for the English text.
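The front-end conversion from text to phoneme sequence can be illustrated with a short sketch. The segmenter and grapheme-to-phoneme lexicon below are simplified, hypothetical placeholders that roughly mirror the example above; they are not the front end actually used by the system described in this application.

```python
# Minimal sketch: text -> word segmentation -> phoneme sequence.
# The lexicon below is an illustrative placeholder, not the system's real lexicon.

PHONEME_LEXICON = {
    "你": ["n", "i"],
    "喜欢": ["x", "i", "h", "uan"],
    "红魔": ["h", "ong", "m", "o"],
    "还是": ["h", "ai", "sh", "i"],
    "iPhone": ["Eay", "Ef", "Eow", "En"],
}

def segment(text: str) -> list[str]:
    """Greedy longest-match segmentation against the lexicon (toy example)."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in PHONEME_LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:
            i += 1                           # skip characters not in the lexicon
    return words

def text_to_phonemes(text: str) -> list[str]:
    """Segment the text and concatenate the per-word phonemes."""
    phonemes = []
    for word in segment(text):
        phonemes.extend(PHONEME_LEXICON[word])
    return phonemes

if __name__ == "__main__":
    print(text_to_phonemes("你喜欢红魔还是iPhone"))
    # ['n', 'i', 'x', 'i', 'h', 'uan', 'h', 'ong', 'm', 'o', 'h', 'ai', 'sh', 'i', 'Eay', 'Ef', 'Eow', 'En']
```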
S204, timbre processing is performed on the phoneme sequence through the speech synthesis model to obtain acoustic features including target timbre information.
The speech synthesis model is trained based on a phoneme sequence of a target text and target speech in at least one target language; the target speech corresponds to the target text and is speech with a target timbre generated from acoustic features extracted from speech samples of different timbres. Therefore, after the phoneme sequence is input into the speech synthesis model, acoustic features can be obtained for at least one target language, each including target timbre information. The speech synthesis model may include an acoustic model and a vocoder. Target speech having the target timbre refers to target speech that has the timbre characteristics of the target speaker.
The target speech is generated from acoustic features extracted from speech samples of different timbres; for example, acoustic features are extracted from speech samples in different languages recorded by users a to e, and target speech with the target timbre in one or more different target languages is then generated from the extracted acoustic features.
The target language refers to a language in which the user expects the text to be synthesized. During training, the speech synthesis model is trained with target speech in at least one target language and the phoneme sequence of the corresponding target text, so that a speech synthesis model capable of synthesizing speech in the language the user expects is obtained. For example, after training, the user may synthesize the text to be synthesized into English and/or French synthesized speech through the speech synthesis model for playback.
In one embodiment, S204 may specifically include: the terminal converts the phoneme sequence into acoustic features and adds the timbre information of the target speaker (target timbre information for short) during the conversion, thereby obtaining acoustic features containing the target timbre information. The acoustic features may be Mel-spectrum features in the frequency domain.
S206, speech synthesis is performed on the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language with the target timbre.
Here, synthesized speech in at least one target language with the target timbre means synthesized speech in one or more target languages that all share the same target timbre, for example English, Chinese, and French synthesized speech that all carry the timbre of a particular speaker.
In one embodiment, the terminal may inversely transform the acoustic features through the speech synthesis model to obtain synthesized speech in at least one target language, the synthesized speech having the target timbre; the inverse transform may be an inverse Fourier transform.
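The vocoder described in this application is a trained neural network. As a rough signal-level illustration of inversely transforming acoustic features back into a waveform, the sketch below reconstructs audio from a Mel spectrogram with librosa's Griffin-Lim based inversion; this is only an assumed stand-in for the idea, not the patented vocoder, and all parameter values are assumptions.

```python
import librosa
import numpy as np
import soundfile as sf

# Assumed audio/feature settings; the application does not specify these values.
SR = 22050          # sampling rate
N_FFT = 1024        # FFT size
HOP = 256           # hop length between frames
N_MELS = 80         # number of Mel bands

def mel_to_wave(mel: np.ndarray) -> np.ndarray:
    """Invert a (n_mels, frames) Mel spectrogram to a waveform via Griffin-Lim.

    A neural vocoder (e.g. the WaveRNN-style vocoder mentioned later in the
    description) would replace this step in a real system.
    """
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=SR, n_fft=N_FFT, hop_length=HOP)

if __name__ == "__main__":
    # Round-trip demo: waveform -> Mel spectrogram -> approximate waveform.
    wav, _ = librosa.load(librosa.ex("trumpet"), sr=SR)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    sf.write("reconstructed.wav", mel_to_wave(mel), SR)
```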
To understand the above embodiment more intuitively, it is described below with reference to fig. 4: when the user selects the title content in the display page, a pop-up box is shown; the user touches the synthesize-speech button in the pop-up box, and the title content is sent via a speech synthesis service request to the speech synthesis system, on which the speech synthesis model (including the acoustic model and the vocoder) is deployed. When the phoneme sequence of the title content is obtained, it is input into the acoustic model of the speech synthesis model, which converts it into Mel-spectrum features; the Mel-spectrum features are then input to the vocoder, which performs an inverse Fourier transform on them to obtain synthesized speech of the title content with the target timbre, and the speech is broadcast through the terminal's loudspeaker so that the user can hear it. The synthesized speech may be in a commonly used target language or in a target language specified by the user; for example, after the user touches the synthesize-speech button, the terminal may display a language selection box in which one or more target languages can be selected, so as to obtain synthesized speech in the specified target languages. When synthesized speech in more than one target language is obtained, the versions may be played in sequence, or the synthesized speech in the target language the user uses most frequently may be played. Because the speech synthesis model can synthesize speech in at least one target language, switching languages smoothly switches to the synthesized speech in the target language the user needs, and the timbre before and after switching remains unchanged, so the synthesized speech before and after switching is natural and fluent.
As another example, if the terminal is an intelligent in-vehicle device, when it receives a user utterance such as "Please play the song 'Kiss'", it obtains the corresponding response text, e.g. "OK, playing the song for you now", obtains the phoneme sequence of that text, performs timbre processing on the phoneme sequence through the speech synthesis model to obtain acoustic features including target timbre information, performs speech synthesis on the acoustic features through the model to obtain synthesized speech with the target timbre, and plays it, for example in the timbre of a particular broadcaster or celebrity. In addition, while the synthesized speech is being played, music software is launched for the user, and after the synthesized speech finishes playing, the requested song is played for the user.
As a third example, as shown in fig. 5, when the speech synthesis method is applied to a server, the server may obtain the phoneme sequence of the text to be synthesized and process it through the acoustic model in the speech synthesis model to obtain Mel-spectrum features; the Mel-spectrum features are then inversely transformed by the vocoder in the speech synthesis model to obtain synthesized speech in at least one target language, the synthesized speech having the target timbre, such as the timbre of a singer, actor, or broadcaster; the server then sends the synthesized speech in the appropriate target language to each terminal according to the language that terminal requested, for example the synthesized speech in target language 1 is sent to terminal 1, and so on.
In the above embodiment, acoustic features are first extracted from speech samples of different timbres, target speech with the target timbre in at least one target language is generated based on the extracted acoustic features, and the speech synthesis model is trained on the target text and the corresponding target speech with the target timbre in the different target languages, so that separate speech synthesis models do not need to be trained with target speech of different languages and timbres, which improves model training efficiency. In addition, the trained speech synthesis model performs timbre processing on the phoneme sequence of the text to be synthesized to obtain acoustic features containing the target timbre information and then performs speech synthesis on those acoustic features to obtain synthesized speech in at least one target language with the target timbre. Therefore, even when one language is switched to another, the timbre of the synthesized speech remains unchanged, and because the speech synthesis model does not need to be changed, the synthesized speech is natural and fluent.
In an embodiment, as shown in fig. 6, S204 may specifically include:
S602, semantic feature extraction is performed on the phoneme sequence through the speech synthesis model to obtain semantic features.
The semantic features refer to the context information of each word in the text to be synthesized. The phoneme sequence comprises the word-level phonemes obtained by converting each word in the text into phonemes.
In an embodiment, S602 may specifically include: the terminal encodes each word-level phoneme in the phoneme sequence through the encoder in the acoustic model to obtain encoding vectors containing semantic features. The acoustic model adopts a sequence-to-sequence structure, which includes an encoder, a decoder, and an attention network, as shown in fig. 7. For example, suppose the phoneme sequence is [x1, x2, ..., xU]; the terminal inputs the phoneme sequence into the encoder, and the encoder encodes the input sequence [x1, x2, ..., xU] to obtain the encoding vectors [h1, h2, ..., hU] containing semantic features.
S604, timbre processing is performed on the semantic features based on the target timbre information of the target speaker to obtain acoustic features in at least one target language that include the target timbre information.
The target timbre information is the timbre characteristic of the target speaker learned by the speech synthesis model during training.
In one embodiment, S604 may specifically include: the terminal decodes the encoding vectors based on the target timbre information of the target speaker through the decoder in the acoustic model to obtain acoustic features in at least one target language that include the target timbre information.
In an embodiment, decoding the encoding vectors based on the target timbre information of the target speaker may specifically include: the terminal determines, through the attention network in the acoustic model, the degree of attention paid to each word in the encoding vectors, where each word in the text receives a different degree of attention; weights the word encoding vectors corresponding to the words according to the degree of attention to obtain weighted encoding vectors; and decodes the weighted encoding vectors based on the target timbre information of the target speaker.
The attention network computes which elements of the encoding vectors (i.e., which word encoding vectors) the decoder should focus on, and gives larger weights to the elements the decoder attends to, which effectively improves the modeling precision of the model.
For example, continuing the example above, suppose the phoneme sequence is [x1, x2, ..., xU]; the terminal inputs the phoneme sequence into the encoder, and the encoder encodes it to obtain the encoding vectors [h1, h2, ..., hU] containing semantic features. During decoding, the attention network determines the degree of attention on each element of the encoding vectors in real time, and elements with different degrees of attention are weighted differently, so that the decoder can produce the Mel-spectrum features [y1, y2, ..., yT]. Note that the decoder is an autoregressive structure: the starting state is y0, y1 is generated based on y0, y2 is generated based on y1, and so on, until the Mel-spectrum features [y1, y2, ..., yT] are obtained.
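To make the encoder, attention, and autoregressive decoder flow concrete, here is a heavily simplified PyTorch sketch of a sequence-to-sequence acoustic model. The layer types, sizes, and the use of a learned speaker embedding as the "target timbre information" are assumptions made for illustration; the application describes the model only at the level of fig. 7.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Illustrative seq2seq acoustic model: encoder + attention + autoregressive decoder."""

    def __init__(self, n_phonemes=100, emb=128, hidden=256, n_mels=80, n_speakers=10):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.spk_embed = nn.Embedding(n_speakers, hidden)        # assumed carrier of target timbre info
        self.attn_query = nn.Linear(hidden, 2 * hidden)
        self.decoder_cell = nn.GRUCell(2 * hidden + n_mels, hidden)
        self.frame_out = nn.Linear(hidden + hidden, n_mels)      # decoder state + speaker embedding

    def forward(self, phonemes, speaker_id, n_frames):
        # Encoder: phoneme sequence [x1..xU] -> encoding vectors [h1..hU].
        h, _ = self.encoder(self.embed(phonemes))                 # (B, U, 2*hidden)
        spk = self.spk_embed(speaker_id)                          # (B, hidden)

        state = torch.zeros(phonemes.size(0), self.decoder_cell.hidden_size)
        y = torch.zeros(phonemes.size(0), self.frame_out.out_features)   # y0: start frame
        frames = []
        for _ in range(n_frames):
            # Attention: weight the encoding vectors by how much the decoder attends to each.
            scores = torch.bmm(h, self.attn_query(state).unsqueeze(-1)).squeeze(-1)  # (B, U)
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)                  # (B, 2*hidden)
            # Autoregressive decoder: next state from previous frame + weighted context.
            state = self.decoder_cell(torch.cat([context, y], dim=-1), state)
            y = self.frame_out(torch.cat([state, spk], dim=-1))                      # Mel frame yt
            frames.append(y)
        return torch.stack(frames, dim=1)                          # (B, T, n_mels)

if __name__ == "__main__":
    model = ToyAcousticModel()
    phonemes = torch.randint(0, 100, (2, 12))      # batch of 2 phoneme sequences, length 12
    mel = model(phonemes, speaker_id=torch.tensor([0, 0]), n_frames=50)
    print(mel.shape)  # torch.Size([2, 50, 80])
```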
Further, after obtaining the acoustic features, the method further comprises:
s606, performing voice synthesis on the acoustic features through the voice synthesis model to obtain at least one synthesized voice of the target language and the target tone.
The specific step of S606 may refer to S206 in the embodiment of fig. 2.
S608, playing the synthesized speech.
Specifically, the terminal plays the synthesized speech through a speaker. Before playing, the terminal may also denoise the synthesized speech.
In the above embodiment, the speech synthesis model converts the phoneme sequence of the text to be synthesized into acoustic features containing the target timbre information, so that speech synthesis can be performed on the acoustic features to obtain synthesized speech in at least one target language with the target timbre, avoiding inconsistent timbres across synthesized speech of different target languages. In addition, the encoder and decoder in the acoustic model encode and decode the phoneme sequence to obtain the acoustic features, and during decoding the attention network computes which parts of the encoding vectors the decoder should attend to at each step, which improves the accuracy of the acoustic features.
In one embodiment, the terminal may train the speech synthesis model before performing speech synthesis, or may train the speech synthesis model through the server and deploy it to the terminal after training is completed. As shown in fig. 8, the training step of the speech synthesis model may specifically include:
S802, acoustic feature extraction is performed on speech samples in at least one target language to obtain training acoustic features.
A speech sample refers to speech recorded by a speaker; the speech samples are recorded by different speakers and differ in language and timbre. The speakers include the target speaker and other speakers, where the target speaker is the speaker whose timbre the user desires; for example, if the user wants the timbre of actor A for speech synthesis, actor A is the target speaker. The training acoustic features may be Mel-spectrum features containing the target pronunciation information of the target speaker.
In one embodiment, after obtaining a speech sample in at least one target language, the terminal extracts MFCCs (Mel Frequency Cepstrum Coefficients) from the speech sample, determines the training semantic features of the speech sample according to the Mel cepstrum coefficients, and performs timbre processing on the training semantic features to obtain Mel-spectrum features. The training semantic features, namely PPGs (Phonetic PosteriorGrams), represent the posterior probability that each speech frame in the speech sample belongs to each target phoneme.
The step of extracting the Mel cepstrum coefficients from the speech sample may specifically include: the terminal divides the speech sample into frames, performs a Fourier transform to obtain the spectrum of each speech frame, computes the power spectrum from the spectrum of each frame, and takes the logarithm to obtain the log power spectrum; the terminal then passes the log power spectrum through Mel-scale triangular filters and applies a discrete cosine transform to obtain the Mel cepstrum coefficients.
For example, assume that the signal of the speech sample is x(n) and the speech after framing and windowing is x'(n) = x(n)·h(n), where h(n) is the window function. Performing a discrete Fourier transform on the windowed speech x'(n) gives the corresponding spectrum:
X(k) = Σ_{n=0}^{N-1} x'(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1,
where N represents the number of points of the discrete Fourier transform.
When the spectrum of each speech frame is obtained, the terminal computes the corresponding power spectrum, takes its logarithm to obtain the log power spectrum, passes the log power spectrum through the Mel-scale triangular filters, and applies a discrete cosine transform to obtain the Mel cepstrum coefficients:
c(l) = Σ_{m=1}^{M} log s(m)·cos(πl(m - 0.5)/M), l = 1, 2, ..., L,
where s(m) is the output energy of the m-th triangular filter, the discrete cosine transform of the log power spectrum in the Mel bands yields the L-order Mel frequency cepstrum parameters, L is the Mel cepstrum coefficient order (typically 12 to 16), and M is the number of triangular filters.
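For reference, the framing, windowing, DFT, Mel filtering, log, and DCT steps above can be sketched in numpy as follows. Frame length, hop size, FFT size, and the filter count M are assumed values; the description only specifies that L is typically 12 to 16 and that M is the number of triangular filters.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(x, sr=16000, frame_len=400, hop=160, n_fft=512, M=26, L=13):
    """Toy MFCC extraction: frame -> window -> |DFT|^2 -> Mel filterbank -> log -> DCT."""
    # 1. Framing and Hamming windowing: x'(n) = x(n) * h(n).
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])

    # 2. N-point DFT and power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

    # 3. Mel-scale triangular filterbank.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(sr / 2), M + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sr).astype(int)
    fbank = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # 4. Log filterbank energies s(m), then DCT to keep the first L coefficients c(l).
    s = np.log(power @ fbank.T + 1e-10)
    return dct(s, type=2, axis=1, norm="ortho")[:, :L]

if __name__ == "__main__":
    x = np.random.randn(16000)          # one second of dummy audio
    print(mfcc(x).shape)                # (n_frames, L)
```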
After the Mel cepstrum coefficients are obtained, the terminal can process them through a PPGs extraction model to obtain a PPGs representation, and then process the PPGs representation through a voice conversion model to obtain Mel-spectrum features. As shown in fig. 9, MFCCs are extracted from a speech sample of speaker A, the MFCCs are input to the PPGs extraction model to obtain a PPGs representation, and the PPGs representation is then input to the voice conversion model to obtain Mel-spectrum features containing the target pronunciation information.
The PPGs extraction model and the voice conversion model are trained before being applied. Their training is divided into two stages: the PPGs extraction model is trained first, and once it converges, the converged PPGs extraction model is trained together with the voice conversion model, as shown in fig. 10.
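The two-model pipeline (MFCC input, PPGs extraction model, voice conversion model, Mel-spectrum output) can be sketched as follows. The module structures and all dimensions are assumptions made for illustration; the description does not specify the internals of either model.

```python
import torch
import torch.nn as nn

class PPGExtractor(nn.Module):
    """Maps per-frame MFCCs to phonetic posteriorgrams (PPGs): P(phoneme | frame)."""
    def __init__(self, n_mfcc=13, n_phonemes=100, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, mfcc):                       # (B, T, n_mfcc)
        h, _ = self.rnn(mfcc)
        return torch.softmax(self.out(h), dim=-1)  # (B, T, n_phonemes), rows sum to 1

class VoiceConversionModel(nn.Module):
    """Maps PPGs (speaker-independent content) to Mel frames carrying the target timbre."""
    def __init__(self, n_phonemes=100, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_phonemes, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppgs):                       # (B, T, n_phonemes)
        h, _ = self.rnn(ppgs)
        return self.out(h)                         # (B, T, n_mels)

if __name__ == "__main__":
    mfcc = torch.randn(1, 200, 13)                 # MFCCs from any speaker's recording
    mel_target = VoiceConversionModel()(PPGExtractor()(mfcc))
    print(mel_target.shape)                        # torch.Size([1, 200, 80])
```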
In one embodiment, before S802, a speech enhancement (e.g., noise reduction) process may be performed on the speech sample, and then the acoustic feature extraction may be performed on the speech sample after the speech enhancement process.
S804, target speech with the target timbre in at least one target language is generated based on the training acoustic features.
In one embodiment, the terminal may inversely transform the training acoustic features through the vocoder to obtain target speech in at least one target language, all of which has the target timbre. The inverse transform may be an inverse Fourier transform.
S806, when a training phoneme sequence is obtained from the target text corresponding to the target speech, timbre processing is performed on the training phoneme sequence through the speech synthesis model to obtain training acoustic features including target timbre information.
S808, speech synthesis is performed on the training acoustic features through the speech synthesis model to obtain predicted speech with the target timbre in at least one target language.
S810, network parameters in the speech synthesis model are adjusted based on the loss value between the predicted speech and the target speech.
S806 to S810 above constitute the training process of the speech synthesis model. The model can be trained with all target speech as a single training set, or it can be trained in two stages: in stage 1, the speech samples uttered by the target speaker are used as the target speech, and training proceeds until the model converges; in stage 2, the model continues to be trained with the generated target speech having the target timbre (i.e., the timbre of the target speaker).
Next, the training process of the above two phases is described as follows:
and stage 1, training as target voice based on a voice sample sent by the target pronunciation object.
In one embodiment, the target language includes a first language in which the target pronunciation object utters the corresponding speech sample, and thus the training phoneme subsequence includes a first training phoneme subsequence corresponding to the first language. S806 may specifically include: the terminal performs tone processing on the first training tone subsequence through a voice synthesis model to obtain a first training acoustic feature comprising target tone information; s808 may specifically include: the terminal carries out voice synthesis on the first training acoustic feature through a voice synthesis model to obtain voice of a first language type with target tone; s810 may specifically include: the terminal adjusts the network parameters in the speech synthesis model based on a first loss value between the speech of the first language type and the target speech until the speech synthesis model converges.
After the first loss value is obtained, the terminal reversely propagates the first loss value in the voice synthesis model so as to obtain a gradient value of each network parameter in the voice synthesis model, and each network parameter in the voice synthesis model is adjusted based on the gradient value.
Stage 2: continuing to train the speech synthesis model with the generated target speech having the target timbre.
In one embodiment, the target languages include a second language in which speakers of other languages uttered the corresponding speech samples, and the training phoneme sequence includes a second training phoneme sequence corresponding to the second language. S806 may specifically include: when adjusting the network parameters based on the first loss value has brought the model to convergence, the terminal performs timbre processing on the second training phoneme sequence through the speech synthesis model to obtain second training acoustic features including target timbre information. S808 may specifically include: the terminal performs speech synthesis on the second training acoustic features through the speech synthesis model to obtain speech in the second language with the target timbre. S810 may specifically include: the terminal adjusts the network parameters in the speech synthesis model based on a second loss value between the speech in the second language and the target speech until the speech synthesis model converges, thereby completing the second-stage training and obtaining the final speech synthesis model.
After the second loss value is obtained, the terminal back-propagates it through the speech synthesis model to obtain the gradient of each network parameter, and adjusts each network parameter based on its gradient.
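A minimal sketch of the training loop of S806 to S810, including the two-stage schedule, is shown below. The stand-in model, the dummy data, and the L1 loss on Mel features are assumed placeholders; the description only states that a loss value between the predicted speech and the target speech is computed, back-propagated, and used to adjust the network parameters.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Trivial stand-in for the speech synthesis model (see the seq2seq sketch above);
    speaker/timbre conditioning is omitted here to keep the training loop readable."""
    def __init__(self, n_phonemes=100, n_mels=80, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, speaker_id, n_frames):
        h = self.embed(phonemes).mean(dim=1, keepdim=True)   # (B, 1, hidden)
        return self.out(h).expand(-1, n_frames, -1)          # (B, n_frames, n_mels)

def make_dummy_batches(n_batches=4):
    """Placeholder data: random phoneme IDs and random 'target speech' Mel frames."""
    return [(torch.randint(0, 100, (2, 12)), torch.tensor([0, 0]), torch.randn(2, 50, 80))
            for _ in range(n_batches)]

def train_stage(model, optimizer, batches, epochs=3):
    """One training stage: synthesize, compare with the target speech, back-propagate,
    and adjust the network parameters (S806 to S810)."""
    loss_fn = nn.L1Loss()   # assumed loss on Mel features; the text only says "loss value"
    for _ in range(epochs):
        for phonemes, speaker_id, target_mel in batches:
            predicted_mel = model(phonemes, speaker_id, n_frames=target_mel.size(1))
            loss = loss_fn(predicted_mel, target_mel)
            optimizer.zero_grad()
            loss.backward()        # gradient of every network parameter
            optimizer.step()       # adjust parameters based on the gradients

if __name__ == "__main__":
    model = TinyModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Stage 1: the target speaker's own recordings in the first language.
    train_stage(model, optimizer, make_dummy_batches())
    # Stage 2: generated target-timbre speech in the other target languages.
    train_stage(model, optimizer, make_dummy_batches())
```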
In the above embodiment, acoustic features are extracted from speech samples of different timbres, target speech with the target timbre in different target languages is generated based on the extracted training acoustic features, and the speech synthesis model is trained on the phoneme sequence of the target text and the corresponding target speech with the target timbre in the different target languages. A speech synthesis model that synthesizes the target timbre in different target languages is thus obtained, so separate speech synthesis models do not need to be trained with target speech of different languages and timbres, which effectively improves model training efficiency.
In an embodiment, for training of the speech synthesis model, in addition to the joint training in the embodiment of fig. 8, the acoustic model and the vocoder in the speech synthesis model may be trained separately, as shown in fig. 11, the training step may specifically include:
and S1102, respectively extracting acoustic features of the voice samples of at least one target language to obtain training acoustic features.
Wherein the timbre differs between speech samples.
S1104, target speech with the target timbre in at least one target language is generated based on the training acoustic features.
S1106, when a training phoneme sequence is obtained from the target text corresponding to the target speech, timbre processing is performed on the training phoneme sequence through the acoustic model to obtain training acoustic features including target timbre information.
The specific steps of S1102 to S1106 can refer to S802 to S806 in the embodiment of FIG. 8.
S1108, parameters of the acoustic model are adjusted based on the loss value between the training acoustic features and the acoustic features extracted from the target speech.
In one embodiment, after obtaining the loss value, the terminal back-propagates it through the acoustic model to obtain the gradient of each network parameter in the acoustic model, and adjusts each network parameter based on its gradient until the acoustic model converges.
While training the acoustic model, the terminal can also train the vocoder. The specific steps include: the terminal extracts acoustic features from the target speech to obtain target acoustic features; performs speech synthesis on the target acoustic features through the vocoder to obtain target predicted speech in the different target languages, the target predicted speech having the target timbre; and adjusts the parameters of the vocoder based on the loss value between the target predicted speech and the target speech.
The model structure of the vocoder can be a WaveRNN structure, and the WaveRNN structure is a single-layer recurrent neural network.
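As a rough illustration of such a single-layer recurrent vocoder, the following sketch conditions one GRU layer on Mel-spectrum frames and the previous sample and predicts a distribution over quantized waveform samples; all dimensions are assumptions, and the actual WaveRNN (with its split gated units and dual softmax) is more involved.

```python
import torch
import torch.nn as nn

class SingleLayerRNNVocoder(nn.Module):
    """Simplified, illustrative single-layer recurrent vocoder in the spirit of WaveRNN."""
    def __init__(self, mel_dim=80, hidden=896, n_classes=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=mel_dim + 1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, mel_frames, prev_samples):
        # mel_frames: (B, T, mel_dim) upsampled to the sample rate; prev_samples: (B, T, 1)
        x = torch.cat([mel_frames, prev_samples], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # logits over quantized sample values at each time step
```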
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the acoustic model is trained according to the target text and the target voices with the target timbre and different target languages corresponding to the target text; in addition, the acoustic features are extracted from the target voice to obtain target acoustic features, and the vocoder is trained on the basis of the target acoustic features and the target voice to obtain voice synthesis models (including the trained acoustic models and the vocoder) for synthesizing target voices with target timbres and different target languages, so that different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
In one embodiment, as shown in fig. 12, a speech synthesis model processing method is provided. The method is performed by an electronic device, which may be the terminal 102 or the server 104 in fig. 1. In the following embodiments, the electronic device is taken to be the server 104 as an example, and the method includes the following steps:
S1202, respectively carrying out acoustic feature extraction on the voice samples of at least one target language to obtain training acoustic features.
Wherein the timbre differs between speech samples.
In one embodiment, the training acoustic features include Mel spectrum features; S1202 specifically includes: the server respectively extracts Mel cepstrum coefficients from the voice samples of at least one target language; determines training semantic features of the voice samples according to the Mel cepstrum coefficients, wherein the training semantic features are used for expressing the posterior probability that each voice frame in the voice samples belongs to a target phoneme; and performs timbre processing on the training semantic features to obtain the Mel spectrum features.
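A minimal sketch of this feature-extraction chain is given below; `ppg_model` and `conversion_model` stand for the already-trained posterior-extraction and sound conversion models described later, and their `predict()` interface, like the sampling rate, is an assumption for illustration only.

```python
import librosa

def extract_training_mel(wav_path, ppg_model, conversion_model, sr=16000):
    """Sketch: MFCC -> per-frame phoneme posteriors (training semantic features)
    -> timbre processing to Mel spectrum features (all model objects assumed)."""
    wav, _ = librosa.load(wav_path, sr=sr)                   # sampling rate assumed
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T   # Mel cepstrum coefficients, (frames, 13)
    semantic = ppg_model.predict(mfcc)        # posterior probability of each frame per target phoneme
    mel = conversion_model.predict(semantic)  # timbre processing -> Mel spectrum features
    return mel
```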
S1204, generating target voice of at least one target language and with the target tone based on the training acoustic features.
S1206, when the training phononic sequence is obtained from the target text corresponding to the target voice, performing timbre processing on the training phononic sequence through the speech synthesis model to obtain training acoustic features including target timbre information.
S1208, performing voice synthesis on the training acoustic features through the speech synthesis model to obtain predicted voice of at least one target language, wherein the predicted voice has a target tone corresponding to the target timbre information.
S1210, adjusting network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech.
The specific steps of S1202 to S1210 can refer to embodiments S802 to S810 in fig. 8 and embodiments S1102 to S1108 in fig. 11.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the voice synthesis model is trained according to the phononic sequence of the target text and the target voices with the target timbre and different target languages corresponding to the target text, so that the voice synthesis model for synthesizing the target timbre and different target languages can be obtained, and thus different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
Next, the above scheme is described with reference to a structure of a multilingual speech synthesis system, which is as follows:
as shown in fig. 13, the speech synthesis system structure is mainly divided into three layers, namely a data layer, a model layer and a service layer, wherein:
(1) data layer
The main role of the data layer is to obtain a training set for training the speech synthesis model (which comprises the acoustic model and the vocoder). First, a target pronunciation object is required to record high-quality speech of a certain language, such as Chinese, according to a formulated text. After the recording is finished, the recorded speech and the formulated text are respectively proofread and labeled.
In addition, sentences that were recorded incorrectly can be recorded again. The labeling process includes phonetic annotation, prosody annotation and the like. It should be noted that the phonetic annotation must follow the actual pronunciation; for example, for Chinese, the polyphonic character "地" ("ground") is labeled as "di" or "de" depending on how it is actually read. Prosody is labeled according to the pauses in the speech; Chinese is generally divided into four pause levels: sentence boundaries are labeled #4, intonation phrases #3, prosodic phrases #2, and prosodic words #1.
After the target pronunciation object has recorded the speech of a certain language (such as Chinese), speech of any timbre can be converted into the timbre of the target pronunciation object (referred to as the target timbre for short) by the voice conversion system that has been built. The voice conversion system can comprise a PPGs extraction model and a sound conversion model. In addition, a vocoder may be included in the voice conversion system to synthesize speech having the target timbre.
The sound conversion system building process can refer to fig. 10 and mainly includes PPGs extraction model training and sound conversion model training. A speaker-independent speech recognition model is trained as the PPGs (Phonetic PosteriorGrams) extraction model, and the trained PPGs extraction model is used to extract the PPGs representation from the MFCC. The PPGs representation gives the posterior probability that each frame of the input speech belongs to a certain pronunciation phone, and can be regarded as a normalized representation of the semantic space with the timbre information removed. The speaker-independent speech recognition model (i.e., the PPGs extraction model) adopts a TDNN (Time Delay Neural Network) structure, a multi-layer neural network in which the lower layers process narrow context information and the higher layers process wide context information; different layers thus learn information at different time resolutions, so long context dependencies can be learned well through this structure.
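The TDNN idea (narrow context in lower layers, wider context in higher layers) can be sketched with dilated 1-D convolutions; the layer sizes and the number of phoneme classes below are assumptions for illustration, not the parameters of the actual model.

```python
import torch
import torch.nn as nn

class TDNNPPGExtractor(nn.Module):
    """Illustrative TDNN-style PPGs extraction model: increasing dilation widens
    the temporal context with depth; the output is a per-frame phoneme posterior."""
    def __init__(self, mfcc_dim=13, hidden=512, n_phones=218):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(mfcc_dim, hidden, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, n_phones, kernel_size=1),
        )

    def forward(self, mfcc):  # mfcc: (B, frames, mfcc_dim)
        logits = self.layers(mfcc.transpose(1, 2)).transpose(1, 2)
        return torch.softmax(logits, dim=-1)  # PPGs: per-frame phoneme posteriors
```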
The sound conversion model adopts a BLSTM (Bidirectional Long Short-Term Memory) structure; the input of the model is the PPGs representation and the output is Mel spectral features. The PPGs representation is extracted from the MFCC by the trained PPGs extraction model. The output Mel spectral features are compared with the Mel spectral features extracted from the training-set speech, and a loss value is calculated to optimize the sound conversion model. After training, the sound conversion model has learned the mapping from the PPGs representation to the Mel spectral features of the target pronunciation object's timbre, so that, in application, when speech of any timbre is input, its semantic information is preserved while its timbre is converted into the timbre of the target pronunciation object, i.e., the target timbre.
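A correspondingly minimal sketch of the BLSTM sound conversion model, with assumed dimensions, maps the PPGs representation to Mel spectral features; training would compare its output against Mel features of the target speaker's speech.

```python
import torch.nn as nn

class BLSTMVoiceConversion(nn.Module):
    """Illustrative BLSTM sound conversion model (sizes assumed): PPGs in, Mel out."""
    def __init__(self, ppg_dim=218, hidden=256, mel_dim=80):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(hidden * 2, mel_dim)

    def forward(self, ppg):   # ppg: (B, frames, ppg_dim)
        h, _ = self.blstm(ppg)
        return self.proj(h)   # predicted Mel spectral frames with the target timbre
```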
After the sound conversion system is built, a multilingual speech library with the timbre characteristics of the target pronunciation object can be built using the sound conversion system; the specific flow is shown in fig. 14. A large amount of standard English speech is prepared, MFCCs are extracted from the standard English speech, the corresponding PPGs representation is then obtained through the PPGs extraction model, Mel spectral features with the target pronunciation object's timbre are obtained through the sound conversion model, and finally the Mel spectral features are inversely converted into English speech with the target pronunciation object's timbre through a language-independent vocoder (such as a WaveRNN vocoder, i.e., a neural network vocoder).
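Put together, the library-building flow of fig. 14 might look like the following sketch; `voice_conversion_system` and its `convert()` method are assumed placeholders wrapping the PPGs extraction model, the sound conversion model and the vocoder, not the actual implementation.

```python
def build_multilingual_library(foreign_utterances, recorded_chinese, voice_conversion_system):
    """Sketch of assembling the multilingual speech library (all objects assumed)."""
    library = list(recorded_chinese)                      # Chinese recorded by the target speaker
    for wav, text_label in foreign_utterances:            # e.g. standard English / Japanese speech
        converted = voice_conversion_system.convert(wav)  # same content, now with the target timbre
        library.append((converted, text_label))           # store speech with its text label
    return library
```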
The vocoder can be trained with multilingual speech.
Similarly, a large amount of standard Japanese speech can be converted into Japanese speech with the timbre characteristics of the target pronunciation object. The generated English speech and Japanese speech are combined with the Chinese speech recorded by the target pronunciation object to obtain multilingual speech; corresponding label files are added to obtain a multilingual speech library, and the speech in this library together with the corresponding text labels is used as the training set for the speech synthesis model. The above languages are merely examples, not exhaustive; in practical applications, other languages may also be included.
Through the operation, the multilingual speech library with the timbre characteristics of the target pronunciation object is obtained. It should be noted that the target pronunciation object only needs to record a certain language (in the example, chinese language), that is, only needs to record a single language.
(2) Model layer
The model layer mainly converts the text needing speech synthesis into speech with target tone through a speech synthesis system. The speech synthesis system mainly includes a front-end text analysis module and a speech synthesis model, where the speech synthesis model includes an acoustic model, a vocoder, and the like, as shown in fig. 15. The text analysis module can convert the input text into a phone sub-sequence, the acoustic model converts the phone sub-sequence into Mel spectrum characteristics, and the vocoder generates a final voice signal according to the Mel spectrum characteristics.
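The model-layer inference path described above can be sketched as follows; `text_analyzer`, `acoustic_model` and `vocoder` are assumed component objects with illustrative method names, not the actual interfaces of the system.

```python
def synthesize(text, text_analyzer, acoustic_model, vocoder):
    """Sketch of the inference path: text analysis -> phonon sequence ->
    acoustic model -> Mel spectrum features -> vocoder -> speech signal."""
    phonons = text_analyzer.to_phonons(text)   # front-end text analysis
    mel = acoustic_model.infer(phonons)        # Mel features carrying the target timbre
    return vocoder.synthesize(mel)             # final speech signal
```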
For the training of the acoustic model and the vocoder, first the speech of a certain language recorded by the target pronunciation object is used to train the acoustic model, and then the vocoder is trained; this training process is the first stage. In addition, the acoustic model and the vocoder may also be trained jointly. Next, the separate training of the acoustic model and the vocoder in the first stage is described. For the training of the acoustic model: first, the text labels in the training set are analyzed and a phonon sequence is obtained from them; for Chinese, the phonon sequence is a sequence of initials and finals. The phonon sequence is then used as the input of the acoustic model, so that the model continuously learns to predict Mel spectral features from the phonon sequence, such that the learned Mel spectral features are consistent with the Mel spectral features extracted from the training-set speech. The acoustic model adopts a sequence-to-sequence structure, as shown in fig. 7, which mainly comprises an Encoder, a Decoder and an Attention network. The Encoder encodes an input phonon sequence [x1, x2, ..., xn] into [h1, h2, ..., hn]. The Decoder is an autoregressive structure: starting from the initial state y0, it generates y1 from y0, y2 from y1, and so on, producing the output sequence [y1, y2, ..., ym]. The Attention module computes, for each Decoder step, which parts of the encoding result should be focused on, which effectively improves the modeling precision of the model.
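The Encoder-Attention-Decoder structure just described can be sketched compactly as below; all dimensions, the attention variant and the single-frame-per-step decoding are simplifying assumptions for illustration, not the disclosed model.

```python
import torch
import torch.nn as nn

class Seq2SeqAcousticModel(nn.Module):
    """Compact, illustrative sketch: encoder maps [x1..xn] to [h1..hn]; the
    autoregressive decoder starts from y0 and emits [y1..ym], attending at each
    step to the parts of the encoding it should focus on."""
    def __init__(self, n_phones=100, emb=256, hidden=256, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(n_phones, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(hidden * 2, num_heads=1, batch_first=True)
        self.decoder_cell = nn.LSTMCell(mel_dim + hidden * 2, hidden * 2)
        self.out = nn.Linear(hidden * 2, mel_dim)

    def forward(self, phonemes, n_frames):
        h_enc, _ = self.encoder(self.embed(phonemes))       # [h1, ..., hn]
        B, H = phonemes.size(0), h_enc.size(-1)
        y = h_enc.new_zeros(B, self.out.out_features)       # y0: initial frame
        hx = (h_enc.new_zeros(B, H), h_enc.new_zeros(B, H))
        outputs = []
        for _ in range(n_frames):                           # autoregressive decoding
            query = hx[0].unsqueeze(1)
            context, _ = self.attn(query, h_enc, h_enc)     # which encoder parts to focus on
            hx = self.decoder_cell(torch.cat([y, context.squeeze(1)], dim=-1), hx)
            y = self.out(hx[0])                             # y(t) generated from y(t-1)
            outputs.append(y)
        return torch.stack(outputs, dim=1)                  # Mel frames (B, n_frames, mel_dim)
```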
For vocoder training: Mel spectral features extracted from the training-set speech are used as input, and the vocoder learns to generate from them the speech with the target pronunciation object's timbre. The model structure of the vocoder adopts a WaveRNN structure, which is a single-layer recurrent neural network.
After the training of the acoustic model and the vocoder has converged, they are used as initialization models and the second stage of training is performed, i.e., the acoustic model and the vocoder continue to be trained with the multilingual speech library of the target pronunciation object generated by the data layer. It should be noted that during this training, the language label is also used as an input: it is concatenated to the output of the Encoder and then fed into the model. The training method of the second stage is consistent with that of the first stage, except that the data input to the model is different. Through the first-stage and second-stage training, a multilingual acoustic model and vocoder are obtained.
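The language-label detail can be illustrated with a short sketch; the label vocabulary size, embedding dimension and shapes are assumptions, and only the concatenation to the Encoder output is taken from the description above.

```python
import torch
import torch.nn as nn

lang_embed = nn.Embedding(num_embeddings=8, embedding_dim=16)  # e.g. zh / en / ja ... (sizes assumed)

def add_language_label(encoder_out, lang_id):
    # encoder_out: (B, n_phones, enc_dim); lang_id: (B,) integer language labels
    lang = lang_embed(lang_id).unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, lang], dim=-1)  # concatenated features fed to attention/decoder
```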
(3) Service layer
The acoustic models and vocoders of multiple languages obtained by training at the model layer can be deployed to a server, so that the server can receive a voice synthesis service request from a terminal through the network and then provide the voice synthesis service. The terminal sends the text to be synthesized (e.g. "you like hong magic or iPhone") to the server through a voice synthesis service request. After obtaining the text to be synthesized, the server performs text analysis, which includes word segmentation, phonetic notation and the like, and parses the text into a phonon sequence (for example, "nxi h ua w wei h ai i Eay Ef Eow En"); the corresponding Mel spectral features are then obtained through the multilingual acoustic model, and finally the speech with the target pronunciation object's timbre is obtained through the inverse transformation of the vocoder. The synthesized speech is returned to the terminal through the network, so that the user can listen to the synthesized speech with the target pronunciation object's timbre.
It should be understood that, although the steps in the flowcharts of fig. 2, 5, 6, 8, 11 and 12 are shown in an order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 5, 6, 8, 11 and 12 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In one embodiment, as shown in fig. 16, there is provided a speech synthesis apparatus, which may be a part of an electronic device using a software module or a hardware module, or a combination of the two modules, and specifically includes: an acquisition module 1602, a processing module 1604, and a synthesis module 1606, wherein:
an obtaining module 1602, configured to obtain a phonon sequence of a text to be synthesized;
the processing module 1604, configured to perform timbre processing on the phononic sequence through the speech synthesis model to obtain acoustic features including target timbre information; the speech synthesis model is obtained by training based on the phononic sequence of the target text and the target speech of at least one target language; the target voice corresponds to the target text and is generated according to acoustic features extracted from voice samples with different timbres and has a target timbre;
a synthesis module 1606, configured to perform speech synthesis on the acoustic features through the speech synthesis model, so as to obtain a synthesized speech of at least one target language and the target timbre.
In the above embodiment, first, acoustic features are extracted from voice samples of different timbres, and target voices in at least one target language with a target timbre are generated based on the extracted acoustic features, and the voice synthesis models are trained according to the target text and the target voices, corresponding to the target text, with the target timbre and different target languages, so that different voice synthesis models do not need to be trained by the target voices in different target languages with different timbres, and model training efficiency is improved. In addition, the sound color of the sound subsequence of the text to be synthesized is processed by utilizing the trained speech synthesis model to obtain the acoustic characteristic comprising the target sound color information, then the acoustic characteristic is subjected to speech synthesis through the trained speech synthesis model to obtain the synthesized speech of at least one target language of the target sound color, therefore, even if a certain language is switched to another language, the sound color of the synthesized speech is kept unchanged, and the synthesized speech can be natural and smooth because the speech synthesis model does not need to be changed.
In an embodiment, the obtaining module 1602 is further configured to, in response to the voice synthesis service request, extract a text to be synthesized from the voice synthesis service request; performing word segmentation processing on the text to obtain a word to be synthesized; performing phonon conversion on each word segmentation to obtain a word segmentation phonon to be synthesized; and combining the obtained word segmentation phones to obtain a phone sequence of the text.
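The obtaining module's chain of steps can be sketched as follows; `segmenter` and `lexicon` are assumed components with illustrative method names, not the module's actual interfaces.

```python
def text_to_phonon_sequence(text, segmenter, lexicon):
    """Sketch: word segmentation -> per-word phonon conversion -> combined
    phonon sequence of the whole text (all components assumed)."""
    words = segmenter.cut(text)                        # words to be synthesized
    word_phonons = [lexicon.to_phonons(w) for w in words]
    return [p for phonons in word_phonons for p in phonons]  # combined phonon sequence
```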
In an embodiment, the processing module 1604 is further configured to perform semantic feature extraction on the phoneme sequence through a speech synthesis model to obtain semantic features; performing tone processing on the semantic features based on target tone information of the target pronunciation object to obtain acoustic features including the target tone information of at least one target language; the target tone information is the tone characteristic of the target pronunciation object learned by the speech synthesis model in the training process.
In one embodiment, the speech synthesis model includes an acoustic model; the processing module 1604 is further configured to encode each participle phonon in the phonon sequence through an encoder in the acoustic model to obtain an encoding vector containing semantic features; the semantic features are context information of each participle in the text.
In one embodiment, the processing module 1604 is further configured to decode, by a decoder in the acoustic model, the encoding vector based on the target timbre information of the target pronunciation object to obtain acoustic features of at least one target language including the target timbre information.
In one embodiment, the processing module 1604 is further configured to determine a degree of interest of each participle in the coding vector through an attention network in the acoustic model; wherein the attention degree of each participle in the text is different; carrying out weighting processing on the word coding vectors corresponding to the participles in the coding vectors according to the attention degree to obtain weighted coding vectors; the weighted encoding vector is decoded based on target timbre information of the target pronunciation object.
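The attention weighting described for the processing module can be illustrated with a minimal sketch; the score function and shapes are assumptions, and only the idea of weighting word encoding vectors by their degree of attention is taken from the text above.

```python
import torch

def attention_weighting(decoder_state, word_encodings):
    # decoder_state: (B, H); word_encodings: (B, n_words, H)
    scores = torch.bmm(word_encodings, decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, n_words)
    degrees = torch.softmax(scores, dim=-1)            # degree of attention of each word
    weighted = word_encodings * degrees.unsqueeze(-1)  # weighted encoding vectors
    return weighted, degrees
```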
In the above embodiment, the voice synthesis model may be used to convert the sound subsequence of the text to be synthesized into the acoustic feature containing the target tone information, so that the acoustic feature may be subjected to voice synthesis to obtain synthesized voices of different target languages with the target tone, and the problem of inconsistent tone colors of the obtained synthesized voices of different target languages is avoided. In addition, the encoder and the decoder in the acoustic model can be used for encoding and decoding the phonon sequence to obtain the acoustic features, and in the decoding process, the attention network can calculate which parts of the encoding vector should be paid attention to by the encoder at each step, so that the accuracy of the acoustic features can be improved.
In one embodiment, as shown in fig. 17, the apparatus further comprises:
an extraction module 1608, configured to perform acoustic feature extraction on the voice samples of at least one target language respectively to obtain training acoustic features; the timbre is different among the voice samples;
a generating module 1610, configured to generate a target speech with a target tone color in at least one target language based on the training acoustic features;
the processing module 1604 is further configured to, when the training phononic sequence is obtained from the target text corresponding to the target speech, perform timbre processing on the training phononic sequence through the speech synthesis model to obtain training acoustic features including target timbre information;
a synthesis module 1606, configured to perform speech synthesis on the training acoustic features through a speech synthesis model to obtain a predicted speech of at least one target language and having a target timbre;
an adjusting module 1612 is configured to adjust the network parameters in the speech synthesis model based on the loss values between the predicted speech and the target speech.
In one embodiment, the target language includes a first language in which the target pronunciation object utters the corresponding voice sample, and the training phoneme subsequence includes a first training phoneme subsequence corresponding to the first language;
the processing module 1604 is further configured to perform timbre processing on the first training phononic sequence through the speech synthesis model to obtain a first training acoustic feature including target timbre information;
a synthesis module 1606, configured to perform speech synthesis on the first training acoustic feature through the speech synthesis model to obtain a speech of the first language type with the target timbre;
the adjusting module 1612 is further configured to adjust the network parameter in the speech synthesis model based on a first loss value between the speech of the first language and the target speech.
In one embodiment, the target language includes a second language, namely the language of the corresponding voice samples recorded by pronunciation objects of other, different languages, and the training phononic sequence includes a second training phononic sequence corresponding to the second language;
the processing module 1604 is further configured to perform timbre processing on the second training phononic sequence through the speech synthesis model when the network parameter in the speech synthesis model is adjusted based on the first loss value to reach model convergence, so as to obtain a second training acoustic feature including target timbre information;
the synthesis module 1606, configured to perform speech synthesis on the second training acoustic feature through the speech synthesis model to obtain a speech of the second language type with the target timbre;
the adjusting module 1612 is further configured to adjust the network parameter in the speech synthesis model based on a second loss value between the speech of the second language type and the target speech.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the voice synthesis model is trained according to the phononic sequence of the target text and the target voices with the target timbre and different target languages corresponding to the target text, so that the voice synthesis model for synthesizing the target timbre and different target languages can be obtained, and thus different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
In one embodiment, the speech synthesis model includes an acoustic model;
an extraction module 1608, configured to perform acoustic feature extraction on the voice samples of at least one target language respectively to obtain training acoustic features; the timbre is different among the voice samples;
a generating module 1610, configured to generate a target speech with a target tone color in at least one target language based on the training acoustic features;
a processing module 1604, configured to, when a training phononic sequence is obtained from a target text corresponding to a target speech, perform timbre processing on the training phononic sequence through an acoustic model to obtain a training acoustic feature including target timbre information;
an adjusting module 1612, configured to perform parameter adjustment on the acoustic model based on a loss value between the training acoustic feature and the acoustic feature extracted from the target speech.
In one embodiment, a vocoder is included in the speech synthesis model;
the extraction module 1608 is further configured to perform acoustic feature extraction in the target speech to obtain a target acoustic feature;
a synthesizing module 1606, configured to perform speech synthesis on the target acoustic feature through a vocoder to obtain a target predicted speech of at least one target language; the target predicted speech has a target timbre;
the adjusting module 1612 is further configured to perform parameter adjustment on the vocoder based on the loss value between the target predicted speech and the target speech.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the acoustic model is trained according to the target text and the target voices with the target timbre and different target languages corresponding to the target text; in addition, the acoustic features are extracted from the target voice to obtain target acoustic features, and the vocoder is trained on the basis of the target acoustic features and the target voice to obtain voice synthesis models (including the trained acoustic models and the vocoder) for synthesizing target voices with target timbres and different target languages, so that different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
For the specific limitations of the speech synthesis apparatus, reference may be made to the above limitations of the speech synthesis method, which are not described herein again. The respective modules in the above-described speech synthesis apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the electronic device, or can be stored in a memory in the electronic device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, as shown in fig. 18, there is provided a speech synthesis model processing apparatus, which may be a part of an electronic device using a software module or a hardware module, or a combination of the two modules, and specifically includes: an extraction module 1802, a generation module 1804, a processing module 1806, a synthesis module 1808, and an adjustment module 1810, wherein:
an extraction module 1802, configured to perform acoustic feature extraction on at least one voice sample of a target language respectively to obtain training acoustic features; the timbre is different among the voice samples;
a generating module 1804, configured to generate target speech in at least one target language and having a target timbre based on the training acoustic features;
a processing module 1806, configured to, when a training phononic sequence is obtained from a target text corresponding to a target voice, perform timbre processing on the training phononic sequence through a voice synthesis model to obtain a training acoustic feature including target timbre information;
a synthesis module 1808, configured to perform speech synthesis on the training acoustic features through a speech synthesis model to obtain predicted speech of at least one target language, where the predicted speech has a target tone corresponding to the target tone information;
an adjusting module 1810 is configured to adjust a network parameter in the speech synthesis model based on a loss value between the predicted speech and the target speech.
In one embodiment, the training acoustic features include mel-frequency spectral features; the extraction module 1802 is further configured to extract mel cepstrum coefficients from the voice samples of at least one target language respectively; determining training semantic features of the voice samples according to the Mel cepstrum coefficients, wherein the training semantic features are used for expressing the posterior probability that each voice frame in the voice samples belongs to a target phoneme; and performing tone processing on the training semantic features to obtain Mel spectrum features.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the voice synthesis model is trained according to the phononic sequence of the target text and the target voices with the target timbre and different target languages corresponding to the target text, so that the voice synthesis model for synthesizing the target timbre and different target languages can be obtained, and thus different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
In one embodiment, the target language includes a first language in which the target pronunciation object utters the corresponding voice sample, and the training phoneme subsequence includes a first training phoneme subsequence corresponding to the first language;
the processing module is further used for performing tone processing on the first training tone subsequence through the voice synthesis model to obtain a first training acoustic feature comprising target tone information;
the synthesis module is also used for carrying out voice synthesis on the first training acoustic feature through the voice synthesis model to obtain voice of a first language type with target tone;
and the adjusting module is also used for adjusting the network parameters in the speech synthesis model based on the first loss value between the speech of the first language and the target speech.
In one embodiment, the target language includes a second language, namely the language of the corresponding voice samples recorded by pronunciation objects of other, different languages, and the training tone subsequence includes a second training tone subsequence corresponding to the second language;
the processing module is further used for performing tone processing on the second training tone subsequence through the voice synthesis model when the network parameters in the voice synthesis model are adjusted based on the first loss value to achieve model convergence, so that second training acoustic features including target tone information are obtained;
the synthesis module is also used for carrying out voice synthesis on the second training acoustic characteristics through the voice synthesis model to obtain voice of a second language type with the target tone;
and the adjusting module is also used for adjusting the network parameters in the speech synthesis model based on a second loss value between the speech of the second language and the target speech.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the voice synthesis model is trained according to the phononic sequence of the target text and the target voices with the target timbre and different target languages corresponding to the target text, so that the voice synthesis model for synthesizing the target timbre and different target languages can be obtained, and thus different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
In one embodiment, the speech synthesis model includes an acoustic model;
the extraction module is used for respectively extracting acoustic features of the voice samples of at least one target language to obtain training acoustic features; the timbre is different among the voice samples;
the generating module is used for generating target voice with target tone of at least one target language based on the training acoustic features;
the processing module is used for performing tone processing on the training tone subsequence through the acoustic model when the training tone subsequence is obtained from the target text corresponding to the target voice, so as to obtain training acoustic characteristics including target tone information;
and the adjusting module is used for carrying out parameter adjustment on the acoustic model based on the loss value between the training acoustic feature and the acoustic feature extracted from the target voice.
In one embodiment, a vocoder is included in the speech synthesis model;
the extraction module is also used for extracting acoustic features in the target voice to obtain target acoustic features;
the synthesis module is also used for carrying out voice synthesis on the target acoustic characteristics through a vocoder to obtain target prediction voice of at least one target language; the target predicted speech has a target timbre;
and the adjusting module is also used for adjusting parameters of the vocoder based on the loss value between the target prediction voice and the target voice.
In the above embodiment, the acoustic features are extracted from the extracted voice samples with different timbres, the target voices with the target timbre and different target languages are generated based on the extracted training acoustic features, and the acoustic model is trained according to the target text and the target voices with the target timbre and different target languages corresponding to the target text; in addition, the acoustic features are extracted from the target voice to obtain target acoustic features, and the vocoder is trained on the basis of the target acoustic features and the target voice to obtain voice synthesis models (including the trained acoustic models and the vocoder) for synthesizing target voices with target timbres and different target languages, so that different voice synthesis models do not need to be trained by the target voices with different target languages and different timbres, and the model training efficiency is effectively improved.
For the specific definition of the speech synthesis model processing device, reference may be made to the above definition of the speech synthesis model processing method, which is not described herein again. The respective modules in the speech synthesis model processing apparatus described above may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the electronic device, or can be stored in a memory in the electronic device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 19. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the electronic device is used for storing voice data. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a speech synthesis model processing method, and may also implement a speech synthesis method.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 20. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech synthesis method, and may also implement a speech synthesis model processing method. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the configurations shown in fig. 19 and 20 are only block diagrams of some configurations relevant to the present disclosure, and do not constitute a limitation on the electronic devices to which the present disclosure may be applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine some components, or have a different arrangement of components.
In one embodiment, an electronic device is further provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of the electronic device from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (28)

1. A method of speech synthesis, the method comprising:
acquiring a phonon sequence of a text to be synthesized;
performing tone processing on the tone subsequence through a voice synthesis model to obtain acoustic characteristics including target tone information; the speech synthesis model is obtained by training based on a phononic sequence of a target text and target speech of at least one target language; the target voice has a target tone color, corresponds to the target text and is generated based on training acoustic features of voice samples with different tone colors of at least one target language; the training acoustic features comprise Mel spectral features, the Mel spectral features are obtained by a feature extraction step, and the feature extraction step comprises the following steps: respectively extracting Mel cepstrum coefficients from at least one voice sample of the target language; determining training semantic features of the voice samples according to the Mel cepstrum coefficients, wherein the training semantic features are used for expressing the posterior probability that each voice frame in the voice samples belongs to a target phoneme; performing tone processing on the training semantic features to obtain the Mel spectrum features;
and performing voice synthesis on the acoustic features through the voice synthesis model to obtain at least one synthesized voice of the target language and the target tone.
2. The method of claim 1, wherein the performing timbre processing on the phononic sequence through a speech synthesis model to obtain an acoustic feature including target timbre information comprises:
semantic feature extraction is carried out on the phoneme sequences through a voice synthesis model to obtain semantic features;
performing timbre processing on the semantic features based on target timbre information of a target pronunciation object to obtain at least one acoustic feature of the target language and including the target timbre information;
and the target tone information is the tone characteristic of the target pronunciation object learned by the speech synthesis model in the training process.
3. The method of claim 2, wherein the speech synthesis model comprises an acoustic model; the semantic feature extraction of the phoneme sequence through the voice synthesis model to obtain the semantic features comprises the following steps:
coding each word segmentation phoneme in the phoneme sequence through a coder in the acoustic model to obtain a coding vector containing the semantic features; the semantic features are context information of each participle in the text.
4. The method according to claim 3, wherein the timbre processing of the semantic features based on target timbre information of a target pronunciation object to obtain at least one acoustic feature of the target language and including the target timbre information comprises:
and decoding the coding vector based on target tone color information of a target pronunciation object through a decoder in the acoustic model to obtain at least one acoustic feature of the target language and including the target tone color information.
5. The method of claim 4, wherein the decoding the encoded vector based on target timbre information of the target pronunciation object comprises:
determining the attention degree of each participle in the coding vector through an attention network in the acoustic model; wherein the attention degree of each word segmentation in the text is different;
carrying out weighting processing on word coding vectors corresponding to the participles in the coding vectors according to the attention degree to obtain weighted coding vectors;
decoding the weighted encoding vector based on target timbre information of a target pronunciation object.
6. The method of claim 1, wherein obtaining the phononic sequence of the text to be synthesized comprises:
responding to a voice synthesis service request, and extracting a text to be synthesized from the voice synthesis service request;
performing word segmentation processing on the text to obtain a word to be synthesized;
performing phonon conversion on each word segmentation to obtain a word segmentation phonon to be synthesized;
and combining the obtained word segmentation phones to obtain a phone sequence of the text.
7. The method according to any one of claims 1 to 6, characterized in that it comprises:
after generating the target voice based on training acoustic features of voice samples with different timbres of at least one target language, when a training tone subsequence is obtained from a target text corresponding to the target voice, performing timbre processing on the training tone subsequence through the voice synthesis model to obtain training acoustic features including target timbre information;
performing voice synthesis on the training acoustic features through the voice synthesis model to obtain at least one predicted voice of the target language and with the target tone;
adjusting network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
8. The method according to claim 7, wherein the target language comprises a first language used by a target pronunciation object to record the voice sample, and the training tone subsequence comprises a first training tone subsequence corresponding to the first language; performing tone processing on the training phononic sequence through the speech synthesis model to obtain training acoustic features including the target tone information includes:
performing tone processing on the first training tone subsequence through the voice synthesis model to obtain a first training acoustic feature comprising the target tone information;
the step of performing speech synthesis on the training acoustic features through the speech synthesis model to obtain at least one predicted speech of the target language and having the target timbre includes:
performing voice synthesis on the first training acoustic feature through the voice synthesis model to obtain a voice of a first language with the target tone;
the adjusting network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech comprises:
and adjusting the network parameters in the speech synthesis model based on a first loss value between the speech of the first language type and the target speech.
9. The method according to claim 8, wherein the target language comprises a second language used by other different language pronunciation objects to record the voice sample, and the training tone subsequence comprises a second training tone subsequence corresponding to the second language; performing tone processing on the training phononic sequence through the speech synthesis model to obtain training acoustic features including the target tone information includes:
when network parameters in the voice synthesis model are adjusted based on the first loss value to achieve model convergence, performing tone processing on the second training tone subsequence through the voice synthesis model to obtain a second training acoustic feature comprising the target tone information;
the step of performing speech synthesis on the training acoustic features through the speech synthesis model to obtain at least one predicted speech of the target language and having the target timbre includes:
performing voice synthesis on the second training acoustic feature through the voice synthesis model to obtain voice of a second language type with the target tone;
the adjusting network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech comprises:
and adjusting the network parameters in the speech synthesis model based on a second loss value between the speech of the second language type and the target speech.
10. The method of any of claims 1 to 6, wherein an acoustic model is included in the speech synthesis model; the method further comprises the following steps:
respectively extracting acoustic features of at least one voice sample of the target language to obtain training acoustic features;
generating target voice with target tone color of at least one target language based on the training acoustic features;
when a training tone subsequence is obtained from a target text corresponding to the target voice, performing tone processing on the training tone subsequence through the acoustic model to obtain training acoustic features including target tone information;
and performing parameter adjustment on the acoustic model based on the loss value between the training acoustic feature and the acoustic feature extracted from the target voice.
11. The method of any one of claims 1 to 6, wherein a vocoder is included in the speech synthesis model; the method further comprises the following steps:
extracting acoustic features from the target voice to obtain target acoustic features;
performing voice synthesis on the target acoustic features through the vocoder to obtain at least one target prediction voice of the target language; the target predicted speech has the target timbre;
performing parameter adjustment on the vocoder based on a loss value between the target predicted speech and the target speech.
12. A method of processing a speech synthesis model, the method comprising:
respectively extracting Mel cepstrum coefficients from the voice samples of at least one target language; determining training semantic features of the voice samples according to the Mel cepstrum coefficients, wherein the training semantic features are used for expressing the posterior probability that each voice frame in the voice samples belongs to a target phoneme; performing tone processing on the training semantic features to obtain training acoustic features; the training acoustic features comprise Mel spectrum features, and the tone colors of the voice samples are different;
generating target voice of at least one target language and with a target tone color based on the training acoustic features;
when a training phononic sequence is obtained from a target text corresponding to the target voice, performing timbre processing on the training phononic sequence through a voice synthesis model to obtain training acoustic features including target timbre information;
performing voice synthesis on the training acoustic features through the voice synthesis model to obtain at least one predicted voice of the target language; the predicted voice has a target tone corresponding to the target tone information;
adjusting network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
13. The method according to claim 12, wherein the target language comprises a first language used by a target pronunciation object to record the voice sample, and the training phoneme subsequence comprises a first training phoneme subsequence corresponding to the first language; the obtaining of the training acoustic features including the target tone information by performing tone processing on the training phononic sequence through the speech synthesis model includes:
performing tone processing on the first training tone subsequence through a speech synthesis model to obtain a first training acoustic feature comprising target tone information;
the performing speech synthesis on the training acoustic features through the speech synthesis model to obtain at least one predicted speech of the target language includes:
performing voice synthesis on the first training acoustic feature through the voice synthesis model to obtain a voice of a first language with the target tone;
the adjusting network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech comprises:
and adjusting the network parameters in the speech synthesis model based on a first loss value between the speech of the first language type and the target speech.
14. A speech synthesis apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a phonon sequence of a text to be synthesized;
the processing module is used for performing tone processing on the tone subsequence through a voice synthesis model to obtain acoustic characteristics including target tone information; the speech synthesis model is obtained by training based on a phononic sequence of a target text and target speech of at least one target language; the target voice has a target tone color, corresponds to the target text and is generated based on training acoustic features of voice samples with different tone colors of at least one target language; the training acoustic features comprise Mel spectral features, the Mel spectral features are obtained by a feature extraction step, and the feature extraction step comprises the following steps: respectively extracting Mel cepstrum coefficients from at least one voice sample of the target language; determining training semantic features of the voice samples according to the Mel cepstrum coefficients, wherein the training semantic features are used for expressing the posterior probability that each voice frame in the voice samples belongs to a target phoneme; performing tone processing on the training semantic features to obtain the Mel spectrum features;
and the synthesis module is used for carrying out voice synthesis on the acoustic features through the voice synthesis model to obtain at least one synthesized voice of the target language and the target tone.
15. The apparatus according to claim 14, wherein the processing module is further configured to perform semantic feature extraction on the phoneme sequence through a speech synthesis model to obtain semantic features; performing timbre processing on the semantic features based on target timbre information of a target pronunciation object to obtain at least one acoustic feature of the target language and including the target timbre information; and the target tone information is the tone characteristic of the target pronunciation object learned by the speech synthesis model in the training process.
16. The apparatus of claim 15, wherein the speech synthesis model comprises an acoustic model; the processing module is further configured to encode each participle phonon in the phonon sequence through an encoder in the acoustic model to obtain an encoding vector including the semantic feature; the semantic features are context information of each participle in the text.
17. The apparatus according to claim 16, wherein the processing module is further configured to decode, through a decoder in the acoustic model, the encoding vector based on the target timbre information of the target pronunciation object to obtain acoustic features in at least one of the target languages and including the target timbre information.
18. The apparatus of claim 16, wherein the processing module is further configured to determine, through an attention network in the acoustic model, a degree of attention for each word segment in the encoding vector, the degrees of attention of the word segments in the text being different; weight the word encoding vectors corresponding to the word segments in the encoding vector according to the degrees of attention to obtain a weighted encoding vector; and decode the weighted encoding vector based on the target timbre information of the target pronunciation object.
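A sketch of the attention weighting and timbre-conditioned decoding of claims 17 and 18, assuming a simple dot-product attention and a GRU decoder conditioned on a speaker embedding; the patent does not fix these choices, so every component here is illustrative.

```python
# Sketch: attention over word-segment encodings, weighting, then timbre-conditioned decoding.
import torch
import torch.nn as nn


class AttentiveDecoder(nn.Module):
    def __init__(self, enc_dim=256, spk_dim=64, n_speakers=8, n_mels=80):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.query = nn.Parameter(torch.randn(enc_dim))
        self.rnn = nn.GRUCell(enc_dim + spk_dim, 256)
        self.out = nn.Linear(256, n_mels)

    def forward(self, encoding, speaker_id, n_frames):
        # degree of attention for each word segment (differs per position)
        scores = torch.matmul(encoding, self.query)                  # (batch, seq_len)
        attn = torch.softmax(scores, dim=-1)
        # weight the word encoding vectors by their attention degree
        weighted = torch.sum(attn.unsqueeze(-1) * encoding, dim=1)   # (batch, enc_dim)
        spk = self.spk_emb(speaker_id)                               # target timbre information
        state = encoding.new_zeros(encoding.size(0), 256)
        frames = []
        for _ in range(n_frames):                                    # decode with timbre conditioning
            state = self.rnn(torch.cat([weighted, spk], dim=-1), state)
            frames.append(self.out(state))
        return torch.stack(frames, dim=1)                            # (batch, n_frames, n_mels)
```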
19. The apparatus of claim 14, wherein the acquisition module is further configured to extract, in response to a speech synthesis service request, the text to be synthesized from the request; perform word segmentation on the text to obtain word segments to be synthesized; perform phoneme conversion on each word segment to obtain word-segment phonemes to be synthesized; and combine the obtained word-segment phonemes into the phoneme sequence of the text.
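A sketch of the front end in claim 19, assuming a toy whitespace word segmenter, a hand-written lexicon (LEXICON), and a plain dictionary as the service request; a production system would use a real segmenter and grapheme-to-phoneme conversion.

```python
# Sketch: request text -> word segments -> word-segment phonemes -> phoneme sequence.
from typing import Dict, List

# hypothetical word-segment -> phoneme lexicon
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def phoneme_sequence_from_request(request: dict) -> List[str]:
    text = request["text"]                     # text to be synthesized
    word_segments = text.lower().split()       # word segmentation (whitespace stand-in)
    sequence: List[str] = []
    for segment in word_segments:              # phoneme conversion per word segment
        sequence.extend(LEXICON.get(segment, ["<unk>"]))
    return sequence                            # combined phoneme sequence of the text


print(phoneme_sequence_from_request({"text": "hello world"}))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```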
20. The apparatus according to any one of claims 14 to 19, characterized in that:
the processing module is further configured to, after the target speech is generated based on the training acoustic features of the speech samples of different timbres in the at least one target language, perform timbre processing on a training phoneme sequence through the speech synthesis model when the training phoneme sequence is obtained from the target text corresponding to the target speech, to obtain training acoustic features including the target timbre information;
the synthesis module is further configured to perform speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one of the target languages and having the target timbre; and
the apparatus further comprises an adjusting module configured to adjust network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
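A sketch of one parameter-adjustment step consistent with claim 20, assuming the predicted and target speech are compared at the Mel-spectrum level with an L1 loss; the model, optimizer, and loss choice are placeholders, since the patent does not fix them.

```python
# Sketch of a single training step: phoneme sequence -> predicted features -> loss -> update.
import torch
import torch.nn.functional as F


def train_step(model, optimizer, phoneme_ids, speaker_id, target_mel):
    # timbre processing of the training phoneme sequence -> training acoustic features
    predicted_mel = model(phoneme_ids, speaker_id)
    # loss value between the predicted and target speech representations
    loss = F.l1_loss(predicted_mel, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # adjust the network parameters
    return loss.item()
```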
21. The apparatus according to claim 20, wherein the target language comprises a first language used by a target pronunciation object to record the speech samples, and the training phoneme sequence comprises a first training phoneme sequence corresponding to the first language;
the processing module is further configured to perform timbre processing on the first training phoneme sequence through the speech synthesis model to obtain a first training acoustic feature including the target timbre information;
the synthesis module is further configured to perform speech synthesis on the first training acoustic feature through the speech synthesis model to obtain speech in the first language with the target timbre; and
the adjusting module is further configured to adjust the network parameters in the speech synthesis model based on a first loss value between the speech in the first language and the target speech.
22. The apparatus according to claim 21, wherein the target language further comprises a second language used by other pronunciation objects of different languages to record the speech samples, and the training phoneme sequence comprises a second training phoneme sequence corresponding to the second language;
the processing module is further configured to perform timbre processing on the second training phoneme sequence through the speech synthesis model after the network parameters in the speech synthesis model have been adjusted to convergence based on the first loss value, to obtain a second training acoustic feature including the target timbre information;
the synthesis module is further configured to perform speech synthesis on the second training acoustic feature through the speech synthesis model to obtain speech in the second language with the target timbre; and
the adjusting module is further configured to adjust the network parameters in the speech synthesis model based on a second loss value between the speech in the second language and the target speech.
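A sketch of the two-stage schedule implied by claims 21 and 22: train on the first-language data of the target pronunciation object until convergence, then continue on the second-language data whose target speech carries the same timbre. The converged predicate and the data loaders (each yielding phoneme ids, speaker id, and target Mel features) are assumptions, and train_step reuses the sketch above.

```python
# Sketch of two-stage training: first language to convergence, then second language.
def train_two_stage(model, optimizer, first_lang_loader, second_lang_loader,
                    converged, max_epochs=100):
    # stage 1: adjust on the first loss value (first-language speech, target timbre)
    for _ in range(max_epochs):
        losses = [train_step(model, optimizer, *batch) for batch in first_lang_loader]
        if converged(losses):
            break
    # stage 2: after convergence, adjust on the second loss value (second-language speech)
    for _ in range(max_epochs):
        for batch in second_lang_loader:
            train_step(model, optimizer, *batch)
```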
23. The apparatus according to any one of claims 14 to 19, wherein the speech synthesis model comprises an acoustic model, and the apparatus further comprises:
an extraction module configured to extract acoustic features from the speech samples in the at least one target language respectively to obtain the training acoustic features;
a generating module configured to generate, based on the training acoustic features, the target speech with the target timbre in at least one of the target languages;
the processing module being configured to perform timbre processing on the training phoneme sequence through the acoustic model when the training phoneme sequence is obtained from the target text corresponding to the target speech, to obtain training acoustic features including the target timbre information; and
an adjusting module configured to adjust parameters of the acoustic model based on a loss value between the training acoustic features and acoustic features extracted from the target speech.
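A sketch of how the reference side of claim 23 might be obtained: acoustic features are extracted from the target speech and compared against the acoustic model's output. Log-Mel extraction with librosa and truncation-based frame alignment are simplifying assumptions, not the patent's prescription.

```python
# Sketch: extract Mel features from the target speech and compute the acoustic-model loss.
import librosa
import numpy as np
import torch
import torch.nn.functional as F


def target_mel_features(wav_path, sr=16000, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return torch.from_numpy(np.log(mel + 1e-5).T).float()     # (frames, n_mels)


def acoustic_model_loss(predicted_mel, target_wav_path):
    target = target_mel_features(target_wav_path)
    frames = min(predicted_mel.size(0), target.size(0))        # crude length alignment
    return F.l1_loss(predicted_mel[:frames], target[:frames])
```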
24. The apparatus according to any one of claims 14 to 19, wherein the speech synthesis model comprises a vocoder, and the apparatus further comprises:
an extraction module configured to extract acoustic features from the target speech to obtain target acoustic features;
the synthesis module being configured to perform speech synthesis on the target acoustic features through the vocoder to obtain target predicted speech in at least one of the target languages, the target predicted speech having the target timbre; and
an adjusting module configured to adjust parameters of the vocoder based on a loss value between the target predicted speech and the target speech.
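A sketch of the vocoder adjustment in claim 24, assuming a waveform-level L1 loss; the vocoder is a placeholder module standing in for whichever neural vocoder is actually used, so this shows only the training loop shape.

```python
# Sketch: drive the vocoder with target acoustic features and adjust it against the target waveform.
import torch
import torch.nn.functional as F


def vocoder_train_step(vocoder, optimizer, target_mel, target_wav):
    predicted_wav = vocoder(target_mel)            # target predicted speech (waveform)
    length = min(predicted_wav.size(-1), target_wav.size(-1))
    loss = F.l1_loss(predicted_wav[..., :length], target_wav[..., :length])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # parameter adjustment of the vocoder
    return loss.item()
```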
25. A speech synthesis model processing apparatus, characterized in that the apparatus comprises:
an extraction module configured to extract Mel cepstral coefficients from speech samples in at least one target language respectively; determine training semantic features of the speech samples according to the Mel cepstral coefficients, the training semantic features representing the posterior probability that each speech frame in the speech samples belongs to a target phoneme; and perform timbre processing on the training semantic features to obtain training acoustic features, the training acoustic features comprising Mel spectrum features and the speech samples differing in timbre;
a generating module configured to generate, based on the training acoustic features, target speech with a target timbre in at least one of the target languages;
a processing module configured to perform timbre processing on a training phoneme sequence through a speech synthesis model when the training phoneme sequence is obtained from a target text corresponding to the target speech, to obtain training acoustic features including target timbre information;
a synthesis module configured to perform speech synthesis on the training acoustic features through the speech synthesis model to obtain predicted speech in at least one of the target languages, the predicted speech having the target timbre corresponding to the target timbre information; and
an adjusting module configured to adjust network parameters in the speech synthesis model based on a loss value between the predicted speech and the target speech.
26. The apparatus according to claim 25, wherein the target language comprises a first language used by a target pronunciation object to record the speech samples, and the training phoneme sequence comprises a first training phoneme sequence corresponding to the first language;
the performing timbre processing on the training phoneme sequence through the speech synthesis model to obtain the training acoustic features including the target timbre information comprises:
performing timbre processing on the first training phoneme sequence through the speech synthesis model to obtain a first training acoustic feature including the target timbre information;
the performing speech synthesis on the training acoustic features through the speech synthesis model to obtain the predicted speech in at least one of the target languages comprises:
performing speech synthesis on the first training acoustic feature through the speech synthesis model to obtain speech in the first language with the target timbre; and
the adjusting the network parameters in the speech synthesis model based on the loss value between the predicted speech and the target speech comprises:
adjusting the network parameters in the speech synthesis model based on a first loss value between the speech in the first language and the target speech.
27. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 11 or 12 to 13 when executing the computer program.
28. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11 or 12 to 13.
CN202110868103.XA 2021-07-30 2021-07-30 Speech synthesis method, speech synthesis model processing device and electronic equipment Active CN113314097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110868103.XA CN113314097B (en) 2021-07-30 2021-07-30 Speech synthesis method, speech synthesis model processing device and electronic equipment


Publications (2)

Publication Number Publication Date
CN113314097A CN113314097A (en) 2021-08-27
CN113314097B true CN113314097B (en) 2021-11-02

Family

ID=77382148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110868103.XA Active CN113314097B (en) 2021-07-30 2021-07-30 Speech synthesis method, speech synthesis model processing device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113314097B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920979B (en) * 2021-11-11 2023-06-02 腾讯科技(深圳)有限公司 Voice data acquisition method, device, equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005234418A (en) * 2004-02-23 2005-09-02 Advanced Telecommunication Research Institute International Method and computer program for synthesizing f0-contours
CN109859736A (en) * 2019-01-23 2019-06-07 北京光年无限科技有限公司 Phoneme synthesizing method and system
CN111508470A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Training method and device of speech synthesis model
CN111667814A (en) * 2020-05-26 2020-09-15 北京声智科技有限公司 Multi-language voice synthesis method and device
CN112037754A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Method for generating speech synthesis training data and related equipment
CN112365881A (en) * 2020-11-11 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, and training method, device, equipment and medium of corresponding model
CN113077783A (en) * 2021-03-26 2021-07-06 联想(北京)有限公司 Method and device for amplifying Chinese speech corpus, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3739477A4 (en) * 2018-01-11 2021-10-27 Neosapience, Inc. Speech translation method and system using multilingual text-to-speech synthesis model


Also Published As

Publication number Publication date
CN113314097A (en) 2021-08-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40051651
Country of ref document: HK