CN113257218B - Speech synthesis method, device, electronic equipment and storage medium - Google Patents

Speech synthesis method, device, electronic equipment and storage medium

Info

Publication number
CN113257218B
Authority
CN
China
Prior art keywords
emotion
identification information
text
synthesized
voice
Prior art date
Legal status
Active
Application number
CN202110523097.4A
Other languages
Chinese (zh)
Other versions
CN113257218A (en)
Inventor
吴鹏飞
潘俊杰
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110523097.4A priority Critical patent/CN113257218B/en
Publication of CN113257218A publication Critical patent/CN113257218A/en
Priority to PCT/CN2022/091348 priority patent/WO2022237665A1/en
Application granted granted Critical
Publication of CN113257218B publication Critical patent/CN113257218B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

Embodiments of the present disclosure provide a speech synthesis method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a text to be synthesized, personal identification information of a target person, and emotion identification information of a target emotion; and performing speech synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain a target speech, wherein the target speech has the voice characteristics of the target person and the emotion characteristics of the target emotion. With this technical solution, the embodiments of the present disclosure can, with authorization, synthesize speech of different persons with different emotions, thereby meeting people's different needs when listening to audiobooks.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium.
Background
In the field of speech synthesis, emotion transfer is a technique of great practical value, especially for producing audiobooks. When generating an audiobook, if emotion transfer between different utterances of the same authorized speaker can be achieved, the speaker only needs to record part of his or her speech with emotion in order for that speaker's speech to be synthesized with different emotions; if emotion transfer between different authorized speakers can be achieved, the emotion in the speech of speakers who are good at emotional performance can be transferred to speakers who are not, so that speech of different speakers with different emotions can be synthesized, and an audiobook in which the sentences of a novel are read with emotions matching the novel's plot can be generated directly from the existing speech of an authorized speaker.
However, in the prior art, only speech with different emotions of the same authorized speaker can be synthesized; that is, emotion transfer is possible only within a single speaker, and speech with different emotions cannot be synthesized across different authorized speakers. When generating an audiobook, speech of a speaker performed with an emotion matching the novel's plot must first be recorded before an audiobook that reads the corresponding sentences with that emotion can be generated; when an authorized speaker cannot perform with the emotion required by the novel, or such emotional speech cannot be obtained, the audiobook cannot be generated. As a result, the speakers a user can choose for an audiobook are limited, and people's needs cannot be met.
Disclosure of Invention
The embodiment of the disclosure provides a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium, so as to realize the synthesis of voices with different emotions of different speakers.
In a first aspect, an embodiment of the present disclosure provides a method for synthesizing speech, including:
acquiring a text to be synthesized, personal identification information of a target person, and emotion identification information of a target emotion;
And performing voice synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain target voice, wherein the target voice has voice characteristics of the target person and emotion characteristics of the target emotion.
In a second aspect, embodiments of the present disclosure further provide a speech synthesis apparatus, including:
the acquisition module is used for acquiring the text to be synthesized, the personal identification information of the target person and the emotion identification information of the target emotion;
and the synthesis module is used for carrying out voice synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain target voice, wherein the target voice has the voice characteristics of the target person and the emotion characteristics of the target emotion.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech synthesis method as described in embodiments of the present disclosure.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a speech synthesis method according to the disclosed embodiments.
According to the speech synthesis method and apparatus, the electronic device, and the storage medium provided by the embodiments of the disclosure, the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion are acquired, and speech synthesis is performed on the text to be synthesized based on the personal identification information and the emotion identification information to obtain a target speech that has the voice characteristics of the target person and the emotion characteristics of the target emotion. With this technical solution, the embodiments of the disclosure can, with authorization, synthesize speech of different persons with different emotions, so that after authorization, an audiobook in which the sentences of a novel are read with emotions matching the novel's plot can be generated from any existing speech of a speaker, without requiring the authorized speaker to perform with those emotions. This provides more candidate audiobook speakers and meets people's different needs when listening to audiobooks.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another speech synthesis method according to an embodiment of the disclosure;
fig. 3 is a schematic structural diagram of a speech synthesis model according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of a model structure during training of a speech synthesis model according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flow chart of a speech synthesis method according to an embodiment of the disclosure. The method may be performed by a speech synthesis apparatus, which may be implemented in software and/or hardware and may be configured in an electronic device, typically a mobile phone or a tablet computer. The speech synthesis method provided by the embodiment of the disclosure is applicable to scenarios in which speech of different authorized persons with different emotions is synthesized. As shown in fig. 1, the speech synthesis method provided in this embodiment may include:
s101, acquiring a text to be synthesized, personal identification information of a target person and emotion identification information of a target emotion.
The text to be synthesized may be understood as the text whose corresponding speech is to be synthesized, and it is acquired with the user's authorization. The target person may be an authorized speaker of the text to be synthesized, i.e., the person whose voice characteristics the synthesized speech is authorized to carry. The target emotion may be the emotion, such as happiness, neutrality, sadness, or anger, with which the authorized target person is to speak the text to be synthesized (or one or more sentences in it). Accordingly, the personal identification information of the target person may be information that uniquely identifies the speaker of the text to be synthesized, such as the speaker's name, person ID, or person code; the emotion identification information of the target emotion may be information that uniquely identifies the emotion the speaker adopts when reading the text to be synthesized, such as the emotion's name, emotion ID, or emotion code. The personal identification information of the target person and the emotion identification information of the target emotion may be input by the user when speech synthesis of the text to be synthesized is required, or may be preset by the publisher of the text to be synthesized or the provider of the target speech.
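By way of illustration only, the three pieces of information described above can be thought of as a simple request structure, as in the following Python sketch; the field names (text, speaker_id, emotion_id) and the example values are hypothetical and are not taken from the embodiment.

    from dataclasses import dataclass

    @dataclass
    class SynthesisRequest:
        text: str        # the text to be synthesized (a sentence or a whole passage)
        speaker_id: int  # personal identification information of the target person
        emotion_id: int  # emotion identification information of the target emotion

    # Example: request the sentence spoken by speaker 3 with emotion label 1 ("happy").
    request = SynthesisRequest(text="It was a bright spring morning.",
                               speaker_id=3, emotion_id=1)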
In an exemplary scenario, when an authorized user wants to synthesize a piece of speech, the user may input the text to be synthesized corresponding to that speech, and select or input the personal identification information of an authorized speaker for the speech and the emotion identification information of the emotion the speech is authorized to carry; accordingly, the electronic device may take the text input by the user as the text to be synthesized, the personal identification information selected or input by the user as the personal identification information of the target person, and the emotion identification information selected or input by the user as the emotion identification information of the target emotion.
In another exemplary scenario, when a user is reading a certain text to be synthesized (such as an article) and wants to listen to its speech, the user can input or select the personal identification information of an authorized reader of the text to be synthesized and the emotion identification information of the emotion the speech is authorized to carry; accordingly, the electronic device may take the personal identification information selected or input by the user as the personal identification information of the target person, and the emotion identification information selected or input by the user as the emotion identification information of the target emotion.
In yet another exemplary scenario, a novel provider may preset the emotion that each sentence in a novel provided to the user should carry; when a user wants to listen to the novel as speech, the user can set the authorized speaker corresponding to each character in the novel. Accordingly, the electronic device can acquire the authorized speaker corresponding to each character in the novel as set by the user, take in turn the personal identification information of the authorized speaker corresponding to each sentence in the novel text as the personal identification information of the target person for that sentence, and take the emotion identification information of the emotion corresponding to that sentence as the emotion identification information of the target emotion, so as to synthesize the speech corresponding to each sentence in the novel. Alternatively, when an audiobook developer wants to generate an audiobook for a certain novel, the developer can set the speaker of each sentence in the novel and the emotion carried by each sentence; thus, when the electronic device receives the audiobook developer's trigger operation for generating the audiobook, or a user's trigger operation for listening to the audiobook of the novel, it can in turn take the personal identification information of the speaker corresponding to each sentence in the novel text as the personal identification information of the target person for that sentence, and the emotion identification information of the emotion corresponding to that sentence as the emotion identification information of the target emotion, so as to synthesize the speech corresponding to each sentence in the novel.
S102, performing voice synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain target voice, wherein the target voice has voice characteristics of the target person and emotion characteristics of the target emotion.
The target speech may be the speech obtained by performing speech synthesis on the text to be synthesized (or one or more sentences in it), and it has the voice characteristics of the target person and the emotion characteristics of the target emotion; that is, the target speech may be the speech that would be obtained if the user-authorized target person read the text to be synthesized (or one or more sentences in it) with the user-authorized target emotion.
In this embodiment, after the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion are obtained, speech synthesis may be performed on the text to be synthesized according to the personal identification information of the target person and the emotion identification information of the target emotion. For example, the text feature information of the text to be synthesized is first determined, the voice feature information of the target person (such as a voice feature vector) is determined according to the personal identification information of the target person, and the emotion feature information of the target emotion (such as an emotion feature vector) is determined according to the emotion identification information of the target emotion; then the target speech of the text to be synthesized, i.e., speech in which the target person reads the text with the target emotion, is generated based on the text feature information, the voice feature information, and the emotion feature information. Here, the determined voice feature information of the target person may be voice feature information from speech the target person has authorized for use, and the target emotion carried in the generated target speech may be an emotion the target person has authorized.
In one embodiment, the performing speech synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain a target speech includes: determining a voice spectrum sequence of the text to be synthesized based on the personal identification information and the emotion identification information through a voice synthesis model obtained through pre-training; and inputting the voice spectrum sequence into a vocoder to perform voice synthesis on the voice spectrum sequence to obtain target voice.
The speech spectrum sequence of the text to be synthesized may be the spectrum sequence of the target speech to be synthesized, and preferably a mel-spectrum sequence of the target speech, so that the target speech synthesized from the spectrum sequence better matches human auditory perception.
In the above embodiment, the speech spectrum sequence of the text to be synthesized may be generated by a pre-trained speech synthesis model and converted into the target speech by a vocoder. Specifically, the text feature information of the text to be synthesized, the voice feature information of the target person, and the emotion feature information of the target emotion may be determined by the pre-trained speech synthesis model; a speech spectrum sequence of the text to be synthesized is generated from the text feature information, the voice feature information, and the emotion feature information; the speech spectrum sequence is then input into a vocoder, and the target speech of the text to be synthesized is generated by the vocoder.
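The two-stage pipeline described above (the speech synthesis model producing a spectrum sequence, and the vocoder producing the waveform) could be organized as in the following sketch. The acoustic_model and vocoder callables stand in for the pre-trained speech synthesis model and vocoder; their interfaces and tensor shapes are assumptions made for illustration.

    import torch

    def synthesize(phonemes, speaker_id, emotion_id, acoustic_model, vocoder):
        """Hypothetical two-stage synthesis: phonemes -> mel spectrogram -> waveform."""
        with torch.no_grad():
            # Stage 1: the speech synthesis model predicts a mel-spectrum sequence
            # conditioned on the phoneme sequence, the person id and the emotion id.
            mel = acoustic_model(
                phonemes=phonemes,                      # (1, T_phoneme) long tensor
                speaker_id=torch.tensor([speaker_id]),  # personal identification information
                emotion_id=torch.tensor([emotion_id]),  # emotion identification information
            )                                           # (1, T_frame, n_mels)
            # Stage 2: the vocoder converts the spectrum sequence into the target speech.
            waveform = vocoder(mel)                     # (1, n_samples)
        return waveform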
In one embodiment, the text to be synthesized is a novel text, and the obtaining the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion includes: according to the arrangement sequence of each sentence to be synthesized in the text to be synthesized, sequentially determining each sentence to be synthesized as a current sentence, and acquiring the current personal identification information and the current emotion identification information of the current sentence to be synthesized; the speech synthesis is carried out on the text to be synthesized based on the personal identification information and the emotion identification information to obtain target speech, and the method comprises the following steps: and carrying out voice synthesis on the current sentence to be synthesized based on the current personal identification information and the current emotion identification information to obtain the target voice of the current sentence.
The current sentence to be synthesized can be a sentence in a text to be synthesized, which needs to be synthesized at the current moment; correspondingly, the current personal identification information can be the personal identification information of the target person of the current sentence to be synthesized, namely the personal identification information of the speaker authorized to use of the current sentence to be synthesized; the current emotion identification information can be emotion identification information of a target emotion corresponding to the current sentence to be synthesized, namely emotion identification information of emotion carried by the current sentence to be synthesized.
In this embodiment, since a novel contains sentences such as dialogue of multiple characters and narration, the authorized speaker and/or emotion corresponding to different sentences may differ. Therefore, when the text to be synthesized is novel text, the target person and target emotion corresponding to each sentence in the text to be synthesized can be determined sentence by sentence, and speech synthesis can be performed accordingly.
When performing speech synthesis on the novel text, the first sentence of the novel text can first be determined as the current sentence, and the current personal identification information and current emotion identification information of the current sentence are obtained; speech synthesis is performed on the current sentence to be synthesized according to the current personal identification information and current emotion identification information to obtain the target speech of the current sentence; then the next sentence to be synthesized, i.e., the sentence adjacent to and following the current sentence in the novel text, is determined as the current sentence, and the operation of obtaining the current personal identification information and current emotion identification information is performed again, until no next sentence to be synthesized exists. In this way speech synthesis of the novel text can be achieved and the audiobook of the novel text obtained, as sketched below.
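The sentence-by-sentence procedure just described amounts to a simple loop over the novel text. In the sketch below, the per-sentence annotation format (a dict carrying text, speaker_id and emotion_id) and the synthesize_sentence callable are assumptions for illustration.

    def synthesize_audiobook(sentences, synthesize_sentence):
        """Walk the sentences of the novel text in their order of arrangement and
        synthesize each one with its own authorized speaker and its own emotion."""
        clips = []
        for sentence in sentences:
            clip = synthesize_sentence(
                text=sentence["text"],
                speaker_id=sentence["speaker_id"],  # current personal identification information
                emotion_id=sentence["emotion_id"],  # current emotion identification information
            )
            clips.append(clip)
        return clips  # concatenating the clips yields the audiobook of the novel text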
According to the speech synthesis method provided by this embodiment, the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion are obtained, and speech synthesis is performed on the text to be synthesized based on the personal identification information and the emotion identification information, so as to obtain a target speech with the voice characteristics of the target person and the emotion characteristics of the target emotion. With this technical solution, this embodiment can, with authorization, synthesize speech of different persons with different emotions, so that after authorization, an audiobook in which the sentences of a novel are read with emotions matching the novel's plot can be generated from any existing speech of a speaker, without requiring the authorized speaker to perform with those emotions. This provides more candidate audiobook speakers and meets people's different needs when listening to audiobooks.
Fig. 2 is a flow chart of another speech synthesis method according to an embodiment of the disclosure. The aspects of this embodiment may be combined with one or more of the alternatives of the embodiments described above. Optionally, the determining, by the speech synthesis model obtained through pre-training, the speech spectrum sequence of the text to be synthesized based on the personal identification information and the emotion identification information includes: determining a text phoneme sequence of the text to be synthesized; inputting the text phoneme sequence, the personal identification information and the emotion identification information into a pre-trained voice synthesis model, and obtaining a voice frequency spectrum sequence output by the voice synthesis model.
Optionally, after the target voice is obtained, the method further includes: and playing the target voice.
Accordingly, as shown in fig. 2, the speech synthesis method provided in this embodiment may include:
s201, acquiring a text to be synthesized, personal identification information of a target person and emotion identification information of a target emotion.
S202, determining a text phoneme sequence of the text to be synthesized.
A phoneme is the smallest speech unit obtained by dividing speech according to its natural attributes; correspondingly, the text phoneme sequence of the text to be synthesized may be the sequence of the smallest speech units of the text to be synthesized.
Specifically, after the text to be synthesized is obtained, phoneme extraction can be performed on the text to be synthesized to obtain a text phoneme sequence of the text to be synthesized.
In this embodiment, the functional module for extracting the text phoneme sequence of the text to be synthesized may be set up independently of the speech synthesis model. When synthesizing the speech of the text to be synthesized, this functional module first extracts the text phoneme sequence of the text to be synthesized, and the extracted text phoneme sequence is then input into the speech synthesis model for speech synthesis, which further reduces the complexity of the speech synthesis model.
It can be understood that in this embodiment the functional module for extracting the text phoneme sequence of the text to be synthesized may instead be embedded in the speech synthesis model; in that case, when synthesizing the speech of the text to be synthesized, the text to be synthesized is input directly into the speech synthesis model, and the speech synthesis model itself obtains the text phoneme sequence of the text to be synthesized.
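As a purely illustrative sketch of such a phoneme-extraction module, the following uses a tiny hand-written lexicon; a real system would rely on a full pronunciation dictionary or grapheme-to-phoneme model, and the lexicon entries here are toy assumptions.

    # Toy grapheme-to-phoneme module; TOY_LEXICON stands in for a real pronunciation
    # dictionary or a trained G2P model.
    TOY_LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def text_to_phonemes(text):
        """Convert the text to be synthesized into its text phoneme sequence."""
        phonemes = []
        for word in text.lower().split():
            word = word.strip(".,!?;:")
            phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))  # unknown words map to <unk>
        return phonemes

    # text_to_phonemes("Hello, world!") -> ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']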
S203, inputting the text phoneme sequence, the personal identification information and the emotion identification information into a pre-trained voice synthesis model, and obtaining a voice spectrum sequence output by the voice synthesis model.
In this embodiment, the speech synthesis model may be used to determine the speech spectrum sequence of the text to be synthesized according to the text phoneme sequence of the text to be synthesized, the personal identification information of the authorized target person, and the emotion identification information of the authorized target emotion; that is, the input of the speech synthesis model is the text phoneme sequence of the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion, and the output is the speech spectrum sequence of the text to be synthesized.
In one embodiment, the speech synthesis model includes a text encoder, a high-dimensional mapping module, an emotion marking layer, an attention module, and a decoder, wherein an output end of the text encoder and an output end of the emotion marking layer are respectively connected with an input end of the attention module, and an output end of the high-dimensional mapping module and an output end of the attention module are respectively connected with an input end of the decoder.
In the above embodiment, as shown in fig. 3, the speech synthesis model may include a text encoder 30, a high-dimensional mapping module 31, an emotion marking layer 32, an attention module 33, and a decoder 34. An output terminal of the text encoder 30 may be connected to an input terminal of the attention module 33; the text encoder determines the text feature information of the text to be synthesized from its text phoneme sequence, for example a text feature vector, and inputs it into the attention module 33. An output terminal of the high-dimensional mapping module 31 may be connected to an input terminal of the attention module 33 or of the decoder 34; it determines the voice feature vector of the target person from the personal identification information of the authorized target person, for example by mapping the personal identification information to the voice feature vector, and inputs the voice feature vector into the attention module 33 or the decoder 34 (fig. 3 takes inputting the voice feature vector into the decoder 34 as an example). An output terminal of the emotion marking layer 32 may be connected to an input terminal of the attention module 33; it determines the emotion feature vector of the target emotion from the emotion identification information of the authorized target emotion. An output terminal of the attention module 33 may be coupled to an input terminal of the decoder 34, so that the attention module and the decoder together generate the speech spectrum sequence of the text to be synthesized from the text feature vector input by the text encoder 30, the voice feature vector input by the high-dimensional mapping module 31, and the emotion feature vector input by the emotion marking layer 32.
In the above embodiment, the inputting of the text phoneme sequence, the personal identification information, and the emotion identification information into the pre-trained speech synthesis model and the obtaining of the speech spectrum sequence output by the speech synthesis model may include: encoding the text phoneme sequence with the text encoder to obtain the text feature vector of the text to be synthesized; performing high-dimensional mapping on the personal identification information with the high-dimensional mapping module to obtain the person feature vector for the text to be synthesized; determining, with the emotion marking layer, the emotion feature vector corresponding to the emotion identification information as the emotion feature vector of the text to be synthesized; and inputting the text feature vector and the emotion feature vector into the attention module, and inputting the intermediate vector output by the attention module together with the person feature vector into the decoder to obtain the speech spectrum sequence of the text to be synthesized. The intermediate vector may be understood as the vector output by the attention module after it processes the feature vectors it receives.
For example, after the text phoneme sequence of the text to be synthesized, the personal identification information of the authorized target person, and the emotion identification information of the authorized target emotion are obtained, the text phoneme sequence may first be input into the text encoder of the speech synthesis model, which determines the text feature vector of the text to be synthesized; the personal identification information is input into the high-dimensional mapping module of the speech synthesis model, which determines the voice feature vector of the target person; and the emotion identification information is input into the emotion marking layer of the speech synthesis model, which determines the emotion feature vector of the target emotion. The text feature vector, the voice feature vector, and the emotion feature vector are then input into the attention module of the speech synthesis model to obtain the intermediate vector output by the attention module. Finally, the intermediate vector is input into the decoder of the speech synthesis model, and the spectrum sequence output by the decoder is taken as the speech spectrum sequence of the text to be synthesized.
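A minimal PyTorch-style sketch of this forward pass follows. The use of nn.Embedding for the text encoder, the high-dimensional mapping module and the emotion marking layer, the fixed frame queries, and all dimensions are assumptions made for illustration and do not reproduce the structure actually used by the embodiment.

    import torch
    import torch.nn as nn

    class SpeechSynthesisSketch(nn.Module):
        """Illustrative sketch only: the text encoder and emotion marking layer feed an
        attention module; the attention output (intermediate vectors) and the voice
        feature vector from the high-dimensional mapping module feed the decoder."""

        def __init__(self, n_phonemes=100, n_speakers=10, n_emotions=5,
                     d_model=256, n_mels=80, n_frames=200):
            super().__init__()
            self.text_encoder = nn.Embedding(n_phonemes, d_model)      # text feature vectors
            self.high_dim_mapping = nn.Embedding(n_speakers, d_model)  # person id -> voice feature vector
            self.emotion_layer = nn.Embedding(n_emotions, d_model)     # emotion id -> emotion feature vector
            self.attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            self.frame_queries = nn.Parameter(torch.randn(n_frames, d_model))
            self.decoder = nn.Sequential(
                nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels))

        def forward(self, phonemes, speaker_id, emotion_id):
            # phonemes: (B, T) long tensor; speaker_id, emotion_id: (B,) long tensors
            text = self.text_encoder(phonemes)                         # (B, T, d)
            emotion = self.emotion_layer(emotion_id).unsqueeze(1)      # (B, 1, d)
            voice = self.high_dim_mapping(speaker_id).unsqueeze(1)     # (B, 1, d)

            # Combine the text and emotion feature vectors (here by addition) and let
            # fixed frame queries attend over them to obtain the intermediate vectors.
            memory = text + emotion                                    # (B, T, d)
            queries = self.frame_queries.unsqueeze(0).expand(phonemes.size(0), -1, -1)
            intermediate, _ = self.attention(queries, memory, memory)  # (B, F, d)

            # The decoder consumes the intermediate vectors together with the voice vector.
            voice = voice.expand(-1, intermediate.size(1), -1)
            return self.decoder(torch.cat([intermediate, voice], dim=-1))  # (B, F, n_mels)

    # Example: a 12-phoneme sentence, speaker id 3, emotion id 1 -> (1, 200, 80) spectrum.
    model = SpeechSynthesisSketch()
    mel = model(torch.randint(0, 100, (1, 12)), torch.tensor([3]), torch.tensor([1]))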
It can be appreciated that, in the above embodiment, the text feature vector output by the text encoder and the emotion feature vector output by the emotion marking layer may be input into the attention module directly, or they may first be combined into one vector, for example by concatenating or adding the text feature vector output by the text encoder and the emotion feature vector output by the emotion marking layer, with the combined vector then input into the attention module, as shown in fig. 3. In addition, the decoder may synthesize the target speech frame by frame. After obtaining the spectrum corresponding to the speech frame at the current moment, the decoder may, besides outputting that spectrum, also feed it back into the attention module as part of the attention module's input when determining the intermediate vector for the next speech frame at the next moment; accordingly, the attention module may determine the intermediate vector based on the text feature vector, the voice feature vector, the emotion feature vector, and the spectrum of the previous speech frame output by the decoder at the previous moment.
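The frame-by-frame behaviour described above, where the spectrum of the previous speech frame is fed back when computing the next intermediate vector, could look like the following loop. The attention_step and decoder_step single-frame callables are hypothetical and are not part of any particular library.

    import torch

    def autoregressive_decode(text_vecs, voice_vec, emotion_vec,
                              attention_step, decoder_step, n_frames, n_mels=80):
        """Hypothetical frame-by-frame decoding with previous-frame feedback."""
        prev_frame = torch.zeros(1, n_mels)  # all-zero "go" frame for the first step
        frames = []
        for _ in range(n_frames):
            # The intermediate vector depends on the text and emotion feature vectors
            # and on the spectrum of the previous frame output by the decoder.
            intermediate = attention_step(text_vecs, emotion_vec, prev_frame)
            prev_frame = decoder_step(intermediate, voice_vec)  # (1, n_mels)
            frames.append(prev_frame)
        return torch.stack(frames, dim=1)  # (1, n_frames, n_mels)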
In this embodiment, the speech synthesis model can generate speech in which different authorized persons read the text to be synthesized with different emotions; that is, when the target person and/or target emotion selected or set by the user (or provider) differ, the speech synthesis model used in this embodiment can generate different target speech. The model structure of the speech synthesis model during training is shown in fig. 4: the input end of the emotion marking layer 32 may be connected to the attention layer 35, the output end of the emotion marking layer 32 is connected to the emotion classifier 36, the input end of the attention layer 35 is connected to the reference encoder 37, and the output end of the reference encoder 37 is connected to the person classifier 38. The training process of the speech synthesis model may then be as follows:
a. Acquire speech that at least one speaker has authorized for use as speech samples, and acquire the text phoneme sequences corresponding to the authorized speech samples, where the speech samples of each speaker include at least one emotional speech carrying a certain emotion.
b. An audio spectrum sequence of each authorized speech sample is extracted.
c. Input the text phoneme sequence of each authorized speech sample into the text encoder, input the original audio spectrum sequence of that authorized speech sample into the reference encoder, and input the personal identification information of the speaker corresponding to that speech sample into the high-dimensional mapping module; obtain the person identification output by the person classifier, the emotion identification output by the emotion classifier, and the output audio spectrum sequence of the decoder; and train the speech synthesis model in an adversarial training manner, optimizing the speech synthesis model through a back-propagation algorithm.
The back-propagation algorithm may use three optimization loss terms: the reconstruction error (for example, the minimum mean square error) between the output audio spectrum sequence and the original audio spectrum sequence, the cross-entropy loss between the emotion identification output by the emotion classifier and the true emotion of the sample speech, and the error between the person identification output by the person classifier and the true person identification corresponding to the sample speech.
d. Repeat step c until the model converges, for example until the value of the optimization loss function is less than or equal to a preset error threshold, or until the number of iterations reaches a preset threshold.
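A hedged sketch of how the three loss terms from step c might be combined is given below. The weighting coefficients, and the assumption that the adversarial aspect (for example a gradient-reversal layer in front of the person classifier) is handled elsewhere in the model, are illustrative choices rather than details taken from the embodiment.

    import torch.nn.functional as F

    def training_losses(pred_mel, target_mel, emotion_logits, emotion_label,
                        speaker_logits, speaker_label, w_emo=1.0, w_spk=1.0):
        """Hypothetical combination of the three optimization loss terms of step c."""
        # 1. Reconstruction error between the output and original audio spectrum sequences.
        recon = F.mse_loss(pred_mel, target_mel)
        # 2. Cross-entropy between the emotion classifier output and the true emotion.
        emo = F.cross_entropy(emotion_logits, emotion_label)
        # 3. Error between the person classifier output and the true person identity;
        #    adversarial training would push the reference encoder to discard this information.
        spk = F.cross_entropy(speaker_logits, speaker_label)
        return recon + w_emo * emo + w_spk * spk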
S204, inputting the voice spectrum sequence into a vocoder to perform voice synthesis on the voice spectrum sequence to obtain target voice, wherein the target voice has voice characteristics of the target person and emotion characteristics of the target emotion.
In this embodiment, after the speech spectrum sequence of the text to be synthesized output by the speech synthesis model is obtained, the speech spectrum sequence may be input into a vocoder, which converts it into the target speech. Any suitable vocoder may be used, preferably one obtained by pre-training and matched to the speech synthesis model, so as to further improve the synthesis quality of the target speech.
S205, playing the target voice.
In this embodiment, after the target speech is synthesized, it may further be played, provided that the user has authorized playback, so that the user can listen to it. For example, the target speech may be played as soon as the vocoder synthesizes it, such as synthesizing and playing the target speech at the user terminal; or the target speech may be stored after the vocoder synthesizes it and played when a playback request for it is received, for example synthesizing and storing the target speech of the text to be synthesized at a server and sending it to a user terminal when the server receives the user terminal's playback request, so that the target speech is played through the user terminal.
According to the speech synthesis method provided by this embodiment, the text phoneme sequence of the text to be synthesized is obtained, the speech spectrum sequence of the text to be synthesized is generated by the speech synthesis model from the text phoneme sequence, the personal identification information of the target person, and the emotion identification information of the target emotion, the speech spectrum sequence is synthesized into the target speech by the vocoder, and the target speech is played. On the premise of synthesizing, with the user's authorization, speech of different persons with different emotions, this further improves the speech synthesis quality and thus the user's experience when listening to audiobooks.
Fig. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure. The device can be realized by software and/or hardware, can be configured in electronic equipment, typically can be configured in a mobile phone or a tablet computer, and can perform voice synthesis on text by executing a voice synthesis method. As shown in fig. 5, the voice synthesis apparatus provided in this embodiment may include: an acquisition module 501 and a synthesis module 502, wherein,
an obtaining module 501, configured to obtain a text to be synthesized, personal identification information of a target person, and emotion identification information of a target emotion;
the synthesizing module 502 is configured to perform speech synthesis on the text to be synthesized based on the personal identification information and the emotion identification information, so as to obtain a target speech, where the target speech has a speech feature of the target person and an emotion feature of the target emotion.
According to the speech synthesis apparatus provided by this embodiment, the acquisition module acquires the text to be synthesized, the personal identification information of the target person, and the emotion identification information of the target emotion, and the synthesis module performs speech synthesis on the text to be synthesized based on the personal identification information and the emotion identification information, so as to obtain a target speech with the voice characteristics of the target person and the emotion characteristics of the target emotion. With this technical solution, this embodiment can, with authorization, synthesize speech of different persons with different emotions, so that after authorization, an audiobook in which the sentences of a novel are read with emotions matching the novel's plot can be generated from any existing speech of a speaker, without requiring the authorized speaker to perform with those emotions. This provides more candidate audiobook speakers and meets people's different needs when listening to audiobooks.
In the above solution, the synthesizing module 502 may include: the frequency spectrum determining unit is used for determining a voice frequency spectrum sequence of the text to be synthesized based on the personal identification information and the emotion identification information through a voice synthesis model obtained through pre-training; and the voice synthesis unit is used for inputting the voice spectrum sequence into the vocoder so as to perform voice synthesis on the voice spectrum sequence to obtain target voice.
In the above aspect, the spectrum determining unit may include: a phoneme obtaining subunit, configured to determine a text phoneme sequence of the text to be synthesized; and the frequency spectrum determining subunit is used for inputting the text phoneme sequence, the personal identification information and the emotion identification information into a pre-trained voice synthesis model and obtaining a voice frequency spectrum sequence output by the voice synthesis model.
In the above scheme, the speech synthesis model may include a text encoder, a high-dimensional mapping module, an emotion flag layer, an attention module, and a decoder, where an output end of the text encoder and an output end of the emotion flag layer are respectively connected to an input end of the attention module, and an output end of the high-dimensional mapping module and an output end of the attention module are respectively connected to an input end of the decoder.
In the above aspect, the spectrum determining subunit may be configured to: encode the text phoneme sequence with the text encoder to obtain the text feature vector of the text to be synthesized; perform high-dimensional mapping on the personal identification information with the high-dimensional mapping module to obtain the person feature vector for the text to be synthesized; determine, with the emotion marking layer, the emotion feature vector corresponding to the emotion identification information as the emotion feature vector of the text to be synthesized; and input the text feature vector and the emotion feature vector into the attention module, and input the intermediate vector output by the attention module together with the person feature vector into the decoder to obtain the speech spectrum sequence of the text to be synthesized.
Further, the voice synthesis apparatus provided in this embodiment may further include: and the voice playing module is used for playing the target voice after the target voice is obtained.
In the above solution, the text to be synthesized may be novel text, and the obtaining module 501 may be configured to: according to the arrangement sequence of each sentence to be synthesized in the text to be synthesized, sequentially determining each sentence to be synthesized as a current sentence, and acquiring the current personal identification information and the current emotion identification information of the current sentence to be synthesized; the synthesis module 502 may be configured to: and carrying out voice synthesis on the current sentence to be synthesized based on the current personal identification information and the current emotion identification information to obtain the target voice of the current sentence.
The voice synthesis device provided by the embodiment of the disclosure can execute the voice synthesis method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the voice synthesis method. Technical details not described in detail in this embodiment may be found in the speech synthesis method provided in any embodiment of the present disclosure.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., terminal device) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be synthesized, personal identification information of a target person, and emotion identification information of a target emotion; and perform speech synthesis on the text to be synthesized based on the personal identification information and the emotion identification information to obtain a target speech, wherein the target speech has the voice characteristics of the target person and the emotion characteristics of the target emotion.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. In some cases, the name of a unit does not constitute a limitation of the unit itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech synthesis method, comprising:
acquiring a text to be synthesized, character identification information of a target character, and emotion identification information of a target emotion;
and performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice, wherein the target voice has the voice characteristics of the target character and the emotion characteristics of the target emotion.
According to one or more embodiments of the present disclosure, example 2 is the method of example 1, wherein performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice comprises:
determining a speech spectrum sequence of the text to be synthesized based on the character identification information and the emotion identification information through a speech synthesis model obtained by pre-training;
and inputting the speech spectrum sequence into a vocoder to perform voice synthesis on the speech spectrum sequence to obtain the target voice.
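For illustration only (not part of the original disclosure), the two-stage flow of example 2 can be sketched in Python as follows. Both components are stubs with assumed names and shapes: acoustic_model stands in for the pre-trained speech synthesis model, and vocoder stands in for any component that converts a spectrum sequence into a waveform (for example, a Griffin-Lim or neural vocoder).

```python
import numpy as np

# Minimal sketch of the two-stage flow: an acoustic model maps
# (text, speaker id, emotion id) to a spectrum sequence, and a vocoder
# turns that spectrum sequence into a waveform. Both are placeholders.

N_MELS = 80          # number of spectral channels (assumed)
FRAME_LEN = 256      # waveform samples produced per spectral frame (assumed)

def acoustic_model(text: str, speaker_id: int, emotion_id: int) -> np.ndarray:
    """Stub: return a (frames, N_MELS) spectrum sequence for the inputs."""
    n_frames = max(1, len(text)) * 5                     # placeholder frame count
    seed = hash((text, speaker_id, emotion_id)) % 2**32  # deterministic stub output
    rng = np.random.default_rng(seed)
    return rng.random((n_frames, N_MELS)).astype(np.float32)

def vocoder(spectrum: np.ndarray) -> np.ndarray:
    """Stub vocoder: expand each spectral frame into FRAME_LEN samples."""
    return np.zeros(spectrum.shape[0] * FRAME_LEN, dtype=np.float32)

def synthesize(text: str, speaker_id: int, emotion_id: int) -> np.ndarray:
    spectrum = acoustic_model(text, speaker_id, emotion_id)  # stage 1: spectrum sequence
    return vocoder(spectrum)                                 # stage 2: waveform

waveform = synthesize("你好，世界", speaker_id=3, emotion_id=1)
```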
According to one or more embodiments of the present disclosure, example 3 is the method of example 2, wherein determining, through the speech synthesis model obtained by pre-training, a speech spectrum sequence of the text to be synthesized based on the character identification information and the emotion identification information comprises:
determining a text phoneme sequence of the text to be synthesized;
and inputting the text phoneme sequence, the character identification information and the emotion identification information into the pre-trained speech synthesis model, and obtaining the speech spectrum sequence output by the speech synthesis model.
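The phoneme-determination step of example 3 is, in essence, a grapheme-to-phoneme lookup over the text to be synthesized. A minimal sketch follows, assuming a hypothetical toy lexicon; a real front end would cover the full target language.

```python
# Illustrative sketch of determining a text phoneme sequence.
# The tiny lexicon below is hypothetical.

LEXICON = {              # hypothetical character-to-phoneme mapping
    "你": ["n", "i3"],
    "好": ["h", "ao3"],
}

def text_to_phonemes(text: str) -> list[str]:
    phonemes: list[str] = []
    for ch in text:
        phonemes.extend(LEXICON.get(ch, ["<unk>"]))  # fall back for unknown characters
    return phonemes

phoneme_sequence = text_to_phonemes("你好")
# The phoneme sequence, the character id and the emotion id would then be fed
# to the pre-trained speech synthesis model to obtain the speech spectrum sequence.
print(phoneme_sequence)  # ['n', 'i3', 'h', 'ao3']
```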
According to one or more embodiments of the present disclosure, example 4 is the method of example 3, wherein the speech synthesis model comprises a text encoder, a high-dimensional mapping module, an emotion markup layer, an attention module and a decoder; the output of the text encoder and the output of the emotion markup layer are each connected to the input of the attention module, and the output of the high-dimensional mapping module and the output of the attention module are each connected to the input of the decoder.
According to one or more embodiments of the present disclosure, example 5 is the method of example 4, wherein inputting the text phoneme sequence, the character identification information and the emotion identification information into the pre-trained speech synthesis model and obtaining the speech spectrum sequence output by the speech synthesis model comprises:
encoding the text phoneme sequence with the text encoder to obtain a text feature vector of the text to be synthesized;
performing high-dimensional mapping on the character identification information with the high-dimensional mapping module to obtain a character feature vector of the text to be synthesized;
determining, with the emotion markup layer, an emotion feature vector corresponding to the emotion identification information and taking it as the emotion feature vector of the text to be synthesized;
and inputting the text feature vector and the emotion feature vector into the attention module, and inputting the intermediate vector output by the attention module together with the character feature vector into the decoder to obtain the speech spectrum sequence of the text to be synthesized.
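A non-authoritative PyTorch sketch of the topology described in examples 4 and 5 is given below. The layer types, dimensions, and the use of embedding tables for the emotion markup layer and the high-dimensional mapping module are assumptions; the disclosure specifies only which outputs connect to which inputs, not the internals of each module.

```python
import torch
import torch.nn as nn

class EmotionalTTS(nn.Module):
    """Sketch: text encoder + emotion markup layer feed an attention module;
    its intermediate vector plus the character feature vector feed a decoder
    that emits the speech spectrum sequence. All sizes are assumed."""

    def __init__(self, n_phonemes=100, n_speakers=10, n_emotions=8,
                 d_model=256, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Sequential(                      # text encoder
            nn.Embedding(n_phonemes, d_model),
            nn.GRU(d_model, d_model, batch_first=True),
        )
        self.speaker_map = nn.Embedding(n_speakers, d_model)    # high-dimensional mapping module
        self.emotion_layer = nn.Embedding(n_emotions, d_model)  # emotion markup layer
        self.attention = nn.MultiheadAttention(d_model, num_heads=4,
                                               batch_first=True)
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.to_spectrum = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, speaker_id, emotion_id):
        text_feat, _ = self.text_encoder(phoneme_ids)           # (B, T, d) text feature vectors
        emo_feat = self.emotion_layer(emotion_id).unsqueeze(1)   # (B, 1, d) emotion feature vector
        spk_feat = self.speaker_map(speaker_id).unsqueeze(1)     # (B, 1, d) character feature vector
        # attention over the text features, queried by the emotion feature
        mid, _ = self.attention(query=emo_feat, key=text_feat, value=text_feat)
        mid = mid.expand(-1, text_feat.size(1), -1)              # broadcast intermediate vector over time
        spk = spk_feat.expand(-1, text_feat.size(1), -1)
        dec_in = torch.cat([mid, spk], dim=-1)                   # decoder input: attention output + character vector
        dec_out, _ = self.decoder(dec_in)
        return self.to_spectrum(dec_out)                         # (B, T, n_mels) speech spectrum sequence

model = EmotionalTTS()
spectrum = model(torch.randint(0, 100, (1, 12)),                 # phoneme id sequence
                 torch.tensor([3]), torch.tensor([1]))           # character id, emotion id
```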
According to one or more embodiments of the present disclosure, example 6 is the method of any one of examples 1-5, further comprising, after obtaining the target voice:
and playing the target voice.
According to one or more embodiments of the present disclosure, example 7 is the method of any one of examples 1-5, wherein acquiring the text to be synthesized, the character identification information of the target character and the emotion identification information of the target emotion comprises:
determining, according to the arrangement order of the sentences to be synthesized in the text to be synthesized, each sentence to be synthesized in turn as a current sentence to be synthesized, and acquiring current character identification information and current emotion identification information of the current sentence to be synthesized;
and wherein performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain the target voice comprises:
performing voice synthesis on the current sentence to be synthesized based on the current character identification information and the current emotion identification information to obtain the target voice of the current sentence to be synthesized.
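Example 7 amounts to iterating over the sentences of a novel text in their arrangement order, each sentence carrying its own character id and emotion id (for example, narrator versus characters in dialogue). The annotation format and the synthesize stub below are hypothetical; each call would invoke the pipeline sketched earlier.

```python
# Sketch of sentence-by-sentence synthesis of a novel text.

def synthesize(sentence: str, speaker_id: int, emotion_id: int) -> bytes:
    return b""  # stand-in for the full acoustic-model + vocoder pipeline

novel = [
    # (sentence to be synthesized, character id, emotion id) -- hypothetical annotations
    ("夜已经深了。",          0, 0),  # narrator, neutral
    ("“你终于来了！”",        1, 2),  # character A, happy
    ("“对不起，我迟到了。”",  2, 3),  # character B, apologetic
]

target_voices = []
for sentence, speaker_id, emotion_id in novel:  # in arrangement order
    target_voices.append(synthesize(sentence, speaker_id, emotion_id))
```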
According to one or more embodiments of the present disclosure, example 8 provides a speech synthesis apparatus, comprising:
an acquisition module, configured to acquire a text to be synthesized, character identification information of a target character, and emotion identification information of a target emotion;
and a synthesis module, configured to perform voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice, wherein the target voice has the voice characteristics of the target character and the emotion characteristics of the target emotion.
According to one or more embodiments of the present disclosure, optionally, the synthesis module comprises:
a spectrum determining unit, configured to determine a speech spectrum sequence of the text to be synthesized based on the character identification information and the emotion identification information through a speech synthesis model obtained by pre-training;
and a voice synthesis unit, configured to input the speech spectrum sequence into a vocoder to perform voice synthesis on the speech spectrum sequence to obtain the target voice.
According to one or more embodiments of the present disclosure, optionally, the spectrum determining unit comprises:
a phoneme obtaining subunit, configured to determine a text phoneme sequence of the text to be synthesized;
and a spectrum determining subunit, configured to input the text phoneme sequence, the character identification information and the emotion identification information into the pre-trained speech synthesis model and obtain the speech spectrum sequence output by the speech synthesis model.
According to one or more embodiments of the present disclosure, optionally, the speech synthesis model comprises a text encoder, a high-dimensional mapping module, an emotion markup layer, an attention module and a decoder, wherein the output of the text encoder and the output of the emotion markup layer are each connected to the input of the attention module, and the output of the high-dimensional mapping module and the output of the attention module are each connected to the input of the decoder.
According to one or more embodiments of the present disclosure, optionally, the spectrum determining subunit is configured to:
encode the text phoneme sequence with the text encoder to obtain a text feature vector of the text to be synthesized;
perform high-dimensional mapping on the character identification information with the high-dimensional mapping module to obtain a character feature vector of the text to be synthesized;
determine, with the emotion markup layer, an emotion feature vector corresponding to the emotion identification information and take it as the emotion feature vector of the text to be synthesized;
and input the text feature vector and the emotion feature vector into the attention module, and input the intermediate vector output by the attention module together with the character feature vector into the decoder to obtain the speech spectrum sequence of the text to be synthesized.
According to one or more embodiments of the present disclosure, optionally, the voice synthesis apparatus further comprises:
a voice playing module, configured to play the target voice after the target voice is obtained.
According to one or more embodiments of the present disclosure, optionally, the text to be synthesized is a novel text, and the acquisition module is configured to:
determine, according to the arrangement order of the sentences to be synthesized in the text to be synthesized, each sentence to be synthesized in turn as a current sentence to be synthesized, and acquire current character identification information and current emotion identification information of the current sentence to be synthesized;
and the synthesis module is configured to: perform voice synthesis on the current sentence to be synthesized based on the current character identification information and the current emotion identification information to obtain the target voice of the current sentence to be synthesized.
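As with the method examples, the apparatus can be pictured as two cooperating objects. The class and method names below are hypothetical and merely mirror the module decomposition of example 8; they are not the disclosed implementation.

```python
# Hypothetical decomposition: an acquisition module yielding
# (sentence, character id, emotion id) triples in arrangement order, and a
# synthesis module turning each triple into a target voice.

class AcquisitionModule:
    def __init__(self, annotated_sentences):
        # annotated_sentences: list of (sentence, character_id, emotion_id)
        self.annotated_sentences = annotated_sentences

    def __iter__(self):
        yield from self.annotated_sentences  # sentences in their arrangement order

class SynthesisModule:
    def synthesize(self, sentence, character_id, emotion_id):
        return b""  # stand-in for the spectrum determining unit + voice synthesis unit

acquisition = AcquisitionModule([("夜已经深了。", 0, 0)])
synthesis = SynthesisModule()
voices = [synthesis.synthesize(s, c, e) for s, c, e in acquisition]
```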
Example 9 provides an electronic device according to one or more embodiments of the present disclosure, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech synthesis method as described in any of examples 1-7.
According to one or more embodiments of the present disclosure, example 10 provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method according to any one of examples 1-7.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (9)

1. A method of speech synthesis, comprising:
acquiring a text to be synthesized, character identification information of a target character and emotion identification information of a target emotion, wherein the target character is a speaker authorized for use with the text to be synthesized, and the target emotion is the emotion adopted by the authorized target character when the text to be synthesized is played;
performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice, wherein the target voice has the voice characteristics of the target character and the emotion characteristics of the target emotion;
wherein the text to be synthesized is a novel text, and acquiring the text to be synthesized, the character identification information of the target character and the emotion identification information of the target emotion comprises:
determining, according to the arrangement order of the sentences to be synthesized in the text to be synthesized, each sentence to be synthesized in turn as a current sentence to be synthesized, and acquiring current character identification information and current emotion identification information of the current sentence to be synthesized, wherein the current character identification information is the character identification information of the speaker authorized for use with the current sentence to be synthesized;
and wherein performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain the target voice comprises:
performing voice synthesis on the current sentence to be synthesized based on the current character identification information and the current emotion identification information to obtain the target voice of the current sentence to be synthesized, wherein the character identification information of the target character and the emotion identification information of the target emotion are selected or input by a user, and the speaker corresponding to the target emotion is different from the target character.
2. The method of claim 1, wherein performing voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice comprises:
determining a speech spectrum sequence of the text to be synthesized based on the character identification information and the emotion identification information through a speech synthesis model obtained by pre-training;
and inputting the speech spectrum sequence into a vocoder to perform voice synthesis on the speech spectrum sequence to obtain the target voice.
3. The method according to claim 2, wherein determining, through the speech synthesis model obtained by pre-training, a speech spectrum sequence of the text to be synthesized based on the character identification information and the emotion identification information comprises:
determining a text phoneme sequence of the text to be synthesized;
and inputting the text phoneme sequence, the character identification information and the emotion identification information into the pre-trained speech synthesis model, and obtaining the speech spectrum sequence output by the speech synthesis model.
4. The method of claim 3, wherein the speech synthesis model comprises a text encoder, a high-dimensional mapping module, an emotion markup layer, an attention module and a decoder, wherein the output of the text encoder and the output of the emotion markup layer are each connected to the input of the attention module, and the output of the high-dimensional mapping module and the output of the attention module are each connected to the input of the decoder.
5. The method of claim 4, wherein inputting the text phoneme sequence, the character identification information and the emotion identification information into the pre-trained speech synthesis model and obtaining the speech spectrum sequence output by the speech synthesis model comprises:
encoding the text phoneme sequence with the text encoder to obtain a text feature vector of the text to be synthesized;
performing high-dimensional mapping on the character identification information with the high-dimensional mapping module to obtain a character feature vector of the text to be synthesized;
determining, with the emotion markup layer, an emotion feature vector corresponding to the emotion identification information and taking it as the emotion feature vector of the text to be synthesized;
and inputting the text feature vector and the emotion feature vector into the attention module, and inputting the intermediate vector output by the attention module together with the character feature vector into the decoder to obtain the speech spectrum sequence of the text to be synthesized.
6. The method according to any one of claims 1-5, further comprising, after obtaining the target voice:
playing the target voice.
7. A speech synthesis apparatus, comprising:
an acquisition module, configured to acquire a text to be synthesized, character identification information of a target character, and emotion identification information of a target emotion, wherein the target character is a speaker authorized for use with the text to be synthesized, the target emotion is the emotion adopted by the authorized target character when the text to be synthesized is played, and the text to be synthesized is a novel text;
a synthesis module, configured to perform voice synthesis on the text to be synthesized based on the character identification information and the emotion identification information to obtain a target voice, wherein the target voice has the voice characteristics of the target character and the emotion characteristics of the target emotion;
the acquisition module being further configured to determine, according to the arrangement order of the sentences to be synthesized in the text to be synthesized, each sentence to be synthesized in turn as a current sentence to be synthesized, and to acquire current character identification information and current emotion identification information of the current sentence to be synthesized, wherein the current character identification information is the character identification information of the speaker authorized for use with the current sentence to be synthesized;
and the synthesis module being further configured to perform voice synthesis on the current sentence to be synthesized based on the current character identification information and the current emotion identification information to obtain the target voice of the current sentence to be synthesized, wherein the character identification information of the target character and the emotion identification information of the target emotion are selected or input by a user, and the speaker corresponding to the target emotion is different from the target character.
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech synthesis method according to any one of claims 1-6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech synthesis method according to any one of claims 1-6.
CN202110523097.4A 2021-05-13 2021-05-13 Speech synthesis method, device, electronic equipment and storage medium Active CN113257218B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110523097.4A CN113257218B (en) 2021-05-13 2021-05-13 Speech synthesis method, device, electronic equipment and storage medium
PCT/CN2022/091348 WO2022237665A1 (en) 2021-05-13 2022-05-07 Speech synthesis method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110523097.4A CN113257218B (en) 2021-05-13 2021-05-13 Speech synthesis method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113257218A CN113257218A (en) 2021-08-13
CN113257218B true CN113257218B (en) 2024-01-30

Family

ID=77183290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110523097.4A Active CN113257218B (en) 2021-05-13 2021-05-13 Speech synthesis method, device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113257218B (en)
WO (1) WO2022237665A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257218B (en) * 2021-05-13 2024-01-30 北京有竹居网络技术有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN114937104A (en) * 2022-06-24 2022-08-23 北京有竹居网络技术有限公司 Virtual object face information generation method and device and electronic equipment
CN115547296B (en) * 2022-11-29 2023-03-10 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102222122B1 (en) * 2014-01-21 2021-03-03 엘지전자 주식회사 Mobile terminal and method for controlling the same
KR20190104941A (en) * 2019-08-22 2019-09-11 엘지전자 주식회사 Speech synthesis method based on emotion information and apparatus therefor
CN113257218B (en) * 2021-05-13 2024-01-30 北京有竹居网络技术有限公司 Speech synthesis method, device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020190054A1 (en) * 2019-03-19 2020-09-24 휴멜로 주식회사 Speech synthesis apparatus and method therefor
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112289299A (en) * 2020-10-21 2021-01-29 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112349273A (en) * 2020-11-05 2021-02-09 携程计算机技术(上海)有限公司 Speech synthesis method based on speaker, model training method and related equipment

Also Published As

Publication number Publication date
WO2022237665A1 (en) 2022-11-17
CN113257218A (en) 2021-08-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant